We've detected the page fault, trapped to the OS, and located where the page data resides on disk. Now comes the critical operation: actually loading that data into physical memory.
This phase involves orchestrating multiple subsystems: the physical memory allocator, the block I/O layer, the page tables, and the TLB, all coordinated under the kernel's memory-management locks.
This page explores each aspect in depth. You'll understand the full lifecycle of a page from its arrival in RAM to its integration into the process's address space—the moment when the trap can finally return and the instruction can successfully complete.
By the end of this page, you will understand: (1) How physical frames are allocated during page faults, (2) The mechanics of reading page content from disk, (3) How page tables are atomically updated, (4) TLB considerations when establishing new mappings, (5) How these operations are synchronized for correctness.
Before we can load content from disk, we need somewhere to put it—a physical frame. The OS maintains elaborate data structures to track which frames are free, in use, or reclaimable.
The Frame Allocation Challenge:
Frame allocation during page fault handling must be fast (it sits on the critical path of every fault), safe under concurrent faults from other threads, and able to make progress even when free memory is scarce.
Free Frame Sources:
The OS can obtain free frames from several sources, in rough order of preference:

1. Free lists: frames the buddy allocator already has available (the fast path).
2. Clean reclaimable pages: page cache pages whose contents can be dropped and re-read later if needed.
3. Dirty reclaimable pages: pages that must first be written back to swap or to their backing file before the frame can be reused.
```c
// Physical frame allocator (simplified)

// Per-NUMA-node free lists
struct free_area {
    struct list_head free_list;   // List of free page blocks
    unsigned long nr_free;        // Count of free pages
};

// Per-zone free area lists (one per order for the buddy allocator)
struct zone {
    struct free_area free_area[MAX_ORDER]; // Order 0-10 (1, 2, 4, ... 1024 pages)
    unsigned long managed_pages;
    unsigned long watermark[NR_WMARK];     // Min/low/high watermarks
};

// Core page allocation function
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
    struct page *page;
    unsigned int alloc_flags;

    // Step 1: Determine allocation flags from gfp_mask
    alloc_flags = gfp_to_alloc_flags(gfp_mask);

    // Step 2: Try to get a page from the free lists
    page = get_page_from_freelist(gfp_mask, order, alloc_flags);
    if (page)
        return page;  // Got a page immediately

    // Step 3: No free pages - enter the slow path.
    // This may involve reclaiming pages from caches.
    page = __alloc_pages_slowpath(gfp_mask, order);

    return page;  // May be NULL if truly out of memory
}

// Page-fault-specific allocation
struct page *alloc_page_for_fault(struct vm_area_struct *vma, unsigned long address)
{
    gfp_t gfp_flags = GFP_HIGHUSER_MOVABLE;

    // Prefer memory close to the faulting CPU (NUMA)
    int preferred_nid = numa_node_id();

    // The memory policy might specify different behavior
    if (vma->vm_policy)
        preferred_nid = get_policy_node(vma->vm_policy, address);

    // Allocate the page
    struct page *page = __alloc_pages_node(preferred_nid, gfp_flags, 0);
    if (!page) {
        // Emergency: retry without the node preference (any node will do)
        page = __alloc_pages(gfp_flags, 0);
    }

    return page;
}
```

GFP (Get Free Pages) flags control how aggressively the allocator tries to satisfy the request. GFP_KERNEL allows blocking and reclaim. GFP_ATOMIC never blocks (for interrupt context). GFP_HIGHUSER_MOVABLE is typical for user page faults: it allows using high memory, and the page can be migrated during compaction.
When free memory is low, the OS must reclaim pages from current users to satisfy new allocation requests. This reclamation can happen synchronously (during the page fault) or asynchronously (by background kernel threads).
The Reclaim Process:
1. Identify candidates: scan the inactive page lists for pages that haven't been accessed recently.
2. Determine the reclaim action: clean pages can simply be dropped; dirty anonymous pages must be written to swap, and dirty file-backed pages written back to their files.
3. Update page tables: all mappings to the reclaimed page must be removed.
4. Free the frame: the page is now available for new allocation.
The Watermark System:
Linux uses watermarks to trigger proactive reclamation:
| Watermark | State | Action |
|---|---|---|
| High | Plenty of free memory | No action |
| Low | Getting low | Wake kswapd for background reclaim |
| Min | Critical | Direct reclaim in allocating process |
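The allocation-time decision can be pictured as a three-way check against those thresholds. The sketch below is illustrative only: the structure fields, helper names, and threshold values are simplified stand-ins, not the real kernel API (the actual zone_watermark_ok() also accounts for allocation order and reserved pools).

```c
// Illustrative watermark check at allocation time (simplified stand-in,
// not the real kernel API).
#include <stdio.h>

enum wmark { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

struct zone_stub {
    unsigned long nr_free_pages;
    unsigned long watermark[NR_WMARK];
};

// Returns 0 if the allocation can proceed now, -1 if the caller must
// perform direct reclaim first.
int allocate_with_watermarks(struct zone_stub *zone)
{
    if (zone->nr_free_pages > zone->watermark[WMARK_LOW])
        return 0;                       // Plenty of memory: fast path

    if (zone->nr_free_pages > zone->watermark[WMARK_MIN]) {
        // Getting low: allocate now, but wake the background reclaimer
        // so it can refill the free lists before we hit the min mark.
        // wake_kswapd(zone);   (placeholder for the real wakeup)
        return 0;
    }

    // Critical: the faulting task must reclaim pages itself.
    // try_to_free_pages(...);  (direct reclaim, shown in the next listing)
    return -1;
}

int main(void)
{
    struct zone_stub z = { .nr_free_pages = 900,
                           .watermark = { 1000, 2000, 3000 } };
    printf("decision: %s\n",
           allocate_with_watermarks(&z) == 0 ? "allocate now"
                                             : "direct reclaim first");
    return 0;
}
```

With 900 free pages against a min watermark of 1000, the sketch reports that direct reclaim is needed, matching the "Min / Critical" row of the table.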
```c
// Simplified page reclaim logic

// kswapd - background reclaim daemon
int kswapd(void *p)
{
    struct zone *zone = p;
    struct scan_control sc = { .priority = DEF_PRIORITY };

    while (!kthread_should_stop()) {
        if (zone_watermark_ok(zone, 0, zone->watermark[WMARK_HIGH]))
            // Enough free memory: sleep until an allocator wakes us
            wait_event_interruptible(zone->kswapd_wait,
                !zone_watermark_ok(zone, 0, zone->watermark[WMARK_HIGH]));
        else
            // Below the high watermark: reclaim pages
            shrink_zone(zone, &sc);
    }
    return 0;
}

// Direct reclaim - called when an allocation fails
static unsigned long try_to_free_pages(struct zonelist *zonelist, gfp_t gfp_mask)
{
    struct scan_control sc = {
        .nr_to_reclaim = SWAP_CLUSTER_MAX,
        .gfp_mask = gfp_mask,
        .priority = DEF_PRIORITY,
    };
    unsigned long nr_reclaimed = 0;

    do {
        nr_reclaimed += shrink_zones(zonelist, &sc);
        sc.priority--;  // Get more aggressive
    } while (nr_reclaimed < sc.nr_to_reclaim && sc.priority >= 0);

    return nr_reclaimed;
}

// Shrink by scanning the LRU lists
static unsigned long shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
    unsigned long nr_reclaimed = 0;

    // Scan inactive anonymous pages (candidates for swap)
    nr_reclaimed += shrink_inactive_list(LRU_INACTIVE_ANON, lruvec, sc);

    // Scan inactive file pages (candidates for discarding/writeback)
    nr_reclaimed += shrink_inactive_list(LRU_INACTIVE_FILE, lruvec, sc);

    return nr_reclaimed;
}

// For a single page, decide on and execute reclaim
static int shrink_page(struct page *page, struct scan_control *sc)
{
    // Was the page referenced recently? Move it to the active list
    if (page_referenced(page)) {
        activate_page(page);
        return PAGEREF_ACTIVE;
    }

    // Anonymous page - must be swapped out
    if (PageAnon(page)) {
        if (!add_to_swap(page))
            return PAGEREF_KEEP;  // No swap space, can't reclaim

        // Write to swap (may be async)
        swap_writepage(page);
    }

    // File-backed dirty page - write it back
    if (PageDirty(page))
        writepage(page);

    // Remove it from all page tables (reverse mapping)
    try_to_unmap(page);

    // Free the page
    free_page(page);
    return PAGEREF_RECLAIMED;
}
```

When a page fault triggers direct reclaim, the faulting process pays the cost of evicting other pages. This can add significant latency: writing pages to swap, waiting for I/O, and scanning LRU lists. High-performance systems try to keep enough free memory to avoid direct reclaim.
With a frame allocated, we now need to fill it with the page's content. This involves issuing I/O to the storage device—the most time-consuming part of page fault handling.
I/O Paths:
1. Swap Read:
Page Fault → alloc_page() → swap_readpage() → Block Layer → Storage Driver → Wait → Complete
2. File Read:
Page Fault → alloc_page() → readpage() → Filesystem → Block Layer → Storage Driver → Wait → Complete
3. Zero Fill (no I/O):
Page Fault → alloc_page() → clear_page() → Complete
Blocking vs Async I/O:
Most page fault I/O is synchronous: the faulting process blocks until the I/O completes. However, some variations exist: the kernel may issue readahead for neighboring swap slots or file pages so that subsequent faults find their data already in memory.
```c
// Reading a page from swap

int swap_readpage(struct page *page, swp_entry_t entry)
{
    struct swap_info_struct *sis;
    struct bio *bio;
    sector_t sector;

    // Lock the page - prevents concurrent I/O or access
    lock_page(page);

    // Get the swap device info
    sis = swp_swap_info(entry);

    // Calculate the disk sector for this swap slot
    sector = swp_offset(entry);
    sector <<= PAGE_SHIFT - SECTOR_SHIFT;  // Convert page offset to sectors

    // Allocate and set up a bio (block I/O request)
    bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
    bio->bi_iter.bi_sector = sector + sis->start_sector;
    bio_add_page(bio, page, PAGE_SIZE, 0);

    // Submit the I/O request
    bio->bi_end_io = swap_read_endio;  // Completion callback
    bio->bi_private = page;
    submit_bio(bio);

    // Wait for completion (the page fault path waits synchronously)
    wait_on_page_locked(page);

    // Check whether the read succeeded
    if (PageError(page)) {
        ClearPageError(page);
        return -EIO;
    }

    SetPageUptodate(page);  // Mark the page as having valid content
    return 0;
}

// Completion callback for the swap read
static void swap_read_endio(struct bio *bio)
{
    struct page *page = bio->bi_private;

    if (bio->bi_status)
        SetPageError(page);
    else
        SetPageUptodate(page);

    unlock_page(page);  // Wake up waiters
    bio_put(bio);
}

// Reading from a file (filesystem-specific)
int generic_file_read_page(struct file *file, struct page *page)
{
    struct inode *inode = file->f_inode;
    struct address_space *mapping = inode->i_mapping;

    // Read from the filesystem into the page
    return mapping->a_ops->readpage(file, page);
}

// Special case: zero-fill (no I/O needed)
void zero_fill_page(struct page *page)
{
    void *kaddr = kmap_local_page(page);
    memset(kaddr, 0, PAGE_SIZE);
    kunmap_local(kaddr);
    SetPageUptodate(page);
}
```

| Storage Type | Random 4KB Read | Impact on Page Fault |
|---|---|---|
| DDR4 RAM (for reference) | ~60 ns | Negligible |
| NVMe SSD | ~70-150 µs | ~70-150 µs per fault |
| SATA SSD | ~100-300 µs | ~100-300 µs per fault |
| HDD (7200 RPM) | ~5-15 ms | ~5-15 ms per fault |
| Network storage (NFS) | ~1-100 ms | Highly variable |
A modern CPU can execute 3+ billion instructions per second. A single HDD page fault (10ms) means ~30 million instructions lost. Even an NVMe fault (100µs) costs ~300,000 instructions. This is why keeping the working set in memory is so critical for performance.
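To put numbers on that intuition, here is a small standalone back-of-the-envelope calculation (not kernel code; the latencies are the rough figures from the table above) that computes the effective memory access time for a given major-fault rate.

```c
// Effective access time: (1 - p) * t_mem + p * t_fault
// Latencies are rough figures from the table above.
#include <stdio.h>

int main(void)
{
    const double t_mem  = 60e-9;    // ~60 ns DRAM access
    const double t_nvme = 100e-6;   // ~100 us NVMe major fault
    const double t_hdd  = 10e-3;    // ~10 ms HDD major fault
    const double rates[] = { 1e-6, 1e-4 };  // major faults per memory access

    for (int i = 0; i < 2; i++) {
        double p = rates[i];
        double eat_nvme = (1 - p) * t_mem + p * t_nvme;
        double eat_hdd  = (1 - p) * t_mem + p * t_hdd;
        printf("p = %.0e: NVMe-backed EAT %6.0f ns (%5.1fx), "
               "HDD-backed EAT %6.0f ns (%5.1fx)\n",
               p, eat_nvme * 1e9, eat_nvme / t_mem,
               eat_hdd * 1e9, eat_hdd / t_mem);
    }
    return 0;
}
```

Even one HDD fault per ten thousand memory accesses inflates the average access time by more than an order of magnitude, which is exactly why the working set needs to stay resident.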
With the page now in memory (loaded from disk or zero-filled), we must create the mapping in the page table so that the faulting instruction can access it.
The Page Table Entry (PTE) Update:
1. Walk to the PTE: navigate the page table hierarchy to the correct entry.
2. Construct the new PTE value: the physical frame number combined with permission bits derived from the VMA, plus the accessed bit (and, for write faults, the dirty bit).
3. Atomically update the PTE: use atomic operations to prevent races.
4. Handle concurrent modifications: check whether another thread installed a mapping while we were loading the page.
The Atomicity Requirement:
PTE updates must be atomic because the hardware page-table walker and other CPUs can read the entry at any instant; a torn or partially written PTE could expose a translation that points to the wrong frame or carries the wrong permissions.
```c
// Updating the page table entry after loading the page

int finish_fault(struct vm_fault *vmf)
{
    struct mm_struct *mm = vmf->vma->vm_mm;
    struct page *page = vmf->page;
    pte_t *pte = vmf->pte;
    pte_t entry;
    spinlock_t *ptl;

    // Step 1: Construct the new PTE
    entry = mk_pte(page, vmf->vma->vm_page_prot);

    // Apply VMA permissions
    if (vmf->vma->vm_flags & VM_WRITE)
        entry = pte_mkwrite(entry);
    if (vmf->vma->vm_flags & VM_EXEC)
        entry = pte_mkexec(entry);

    // Mark as young (just accessed)
    entry = pte_mkyoung(entry);

    // For write faults, also mark it dirty
    if (vmf->flags & FAULT_FLAG_WRITE)
        entry = pte_mkdirty(entry);

    // Step 2: Lock the page table
    ptl = pte_lockptr(mm, vmf->pmd);
    spin_lock(ptl);

    // Step 3: Check whether the PTE was modified by another thread
    if (!pte_none(*pte) && pte_present(*pte)) {
        // Another thread already handled this fault!
        spin_unlock(ptl);
        put_page(page);  // We don't need this page
        return VM_FAULT_NOPAGE;
    }

    // Step 4: Atomically install the new PTE
    set_pte_at(mm, vmf->address, pte, entry);

    // Step 5: Update memory accounting
    mm->_rss++;  // Increment the resident set size

    spin_unlock(ptl);
    return 0;
}

// Architecture-specific PTE setting (x86-64)
static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t pte)
{
    // On x86-64, PTEs are 64-bit and naturally aligned,
    // so a simple aligned store is atomic
    WRITE_ONCE(*ptep, pte);

    // Memory barrier to ensure the PTE is visible before TLB operations
    smp_wmb();
}

// Allocating intermediate page table levels if needed
pte_t *walk_and_allocate(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;

    pgd = pgd_offset(mm, addr);

    p4d = p4d_alloc(mm, pgd, addr);
    if (!p4d)
        return NULL;

    pud = pud_alloc(mm, p4d, addr);
    if (!pud)
        return NULL;

    pmd = pmd_alloc(mm, pud, addr);
    if (!pmd)
        return NULL;

    return pte_alloc_map(mm, pmd, addr);
}
```

If intermediate page table levels don't exist (e.g., on the first access to a previously unmapped region), the fault handler allocates them. Each level is a page-sized structure. The allocation uses the same mechanisms as regular page allocation, adding to the fault handling time.
After updating the page table, we must consider the TLB (Translation Lookaside Buffer). The TLB caches page table entries for fast translation, and it must reflect our changes.
For New Mappings (page fault):
We're mapping a page that wasn't previously present. The TLB definitely doesn't have a stale entry for this mapping because the PTE was marked not-present, and hardware never caches not-present entries in the TLB: a not-present PTE produces a fault, not a cached translation.
Therefore, no TLB invalidation is needed when installing a new mapping. The next access will cause a TLB miss, the hardware walker will load our new PTE, and the mapping will be cached.
For Modified Mappings (COW, permission change):
When modifying an existing mapping, the TLB may still hold the old translation. The stale entry must be invalidated on the local CPU, and on a multiprocessor every other CPU that may have cached it must be told to do the same (a TLB shootdown, shown below).
```c
// TLB operations in page fault context

// When installing a completely new mapping (not present -> present),
// no TLB invalidation is needed - there was no valid TLB entry
void install_new_pte(struct mm_struct *mm, unsigned long addr,
                     pte_t *pte, pte_t entry)
{
    set_pte_at(mm, addr, pte, entry);
    // The next access will TLB-miss and load our entry - no flush needed
}

// When modifying an existing mapping (e.g., COW, permission upgrade)
void modify_existing_pte(struct mm_struct *mm, unsigned long addr,
                         pte_t *pte, pte_t new_entry)
{
    // Install the new PTE
    set_pte_at(mm, addr, pte, new_entry);

    // Must invalidate stale TLB entries.
    // Single-CPU case:
    //     flush_tlb_page(mm, addr);
    // Multi-CPU case: this is a TLB shootdown - an IPI must be sent
    // to every CPU that might have this mapping cached.
}

// TLB shootdown - invalidate on all relevant CPUs
void flush_tlb_page_all(struct mm_struct *mm, unsigned long addr)
{
    // Get the mask of CPUs that have used this mm
    const cpumask_t *cpus = mm_cpumask(mm);

    // Current CPU - local flush
    local_flush_tlb_page(addr);

    // Other CPUs - send an Inter-Processor Interrupt
    smp_call_function_many(cpus, tlb_flush_func, &addr, 1 /* wait */);
}

// IPI handler on a remote CPU
void tlb_flush_func(void *info)
{
    unsigned long addr = *(unsigned long *)info;
    local_flush_tlb_page(addr);
}

// Local TLB flush of a single entry (x86)
static inline void local_flush_tlb_page(unsigned long addr)
{
    // The INVLPG instruction invalidates a single TLB entry
    asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}

// Flush the entire TLB (x86) - used for larger changes
static inline void local_flush_tlb(void)
{
    // Reloading CR3 flushes the entire TLB (except global entries)
    unsigned long cr3 = read_cr3();
    write_cr3(cr3);
}
```

| Scenario | TLB State Before | TLB Action Needed |
|---|---|---|
| New anonymous page | No entry (faulted) | None - will load on next access |
| Page loaded from swap | No valid entry | None - will load on next access |
| File page loaded from disk | No valid entry | None - will load on next access |
| COW page (write protection) | Had read-only entry | Flush to remove stale entry |
| Permission upgrade | Had restrictive entry | Flush to allow reload with new perms |
Sending IPIs to multiple CPUs and waiting for acknowledgment takes microseconds. When modifying mappings (especially during process exit or munmap), batching TLB invalidations can significantly improve performance. Fortunately, for basic page fault handling (new mappings), no shootdown is needed.
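One common batching strategy is to fall back to a full TLB flush once the number of pages to invalidate crosses a threshold, since many single-entry invalidations plus per-page IPIs quickly become more expensive than one flush. The sketch below reuses the local_flush_tlb helpers from the listing above; the threshold value is illustrative (Linux tunes a comparable ceiling), not a constant of the hardware.

```c
// Sketch of batched invalidation built on the helpers shown above.
// The threshold is illustrative; the right value depends on the CPU.
#define FLUSH_ALL_THRESHOLD 32

void flush_tlb_range_batched(unsigned long start, unsigned long end)
{
    unsigned long npages = (end - start) >> PAGE_SHIFT;

    if (npages > FLUSH_ALL_THRESHOLD) {
        // One full flush is cheaper than many single-entry invalidations
        local_flush_tlb();
        return;
    }

    // Small ranges: invalidate each entry individually and keep the
    // rest of the TLB (and its useful translations) intact
    for (unsigned long addr = start; addr < end; addr += PAGE_SIZE)
        local_flush_tlb_page(addr);
}
```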
Page fault handling involves multiple data structures that could be accessed concurrently. Proper synchronization is essential for correctness and safety.
Key Locks in Page Fault Path:
Page Table Lock (PTL): Protects individual page table pages during modification. Fine-grained—different PTEs can be modified in parallel.
mmap_lock (formerly mmap_sem): Protects the VMA tree structure. Held in read mode during fault handling, write mode during mmap/munmap.
Page Lock: Individual page lock prevents concurrent I/O to the same page. Held while reading from disk.
Swap Map Locks: Protect swap slot reference counts.
Lock Ordering:
To prevent deadlocks, locks are acquired in a consistent order:
mmap_lock → page table lock → page lock → swap locks
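The reason a fixed order works is the standard one for any multi-lock system: if every path acquires locks in the same global order, no cycle of waiters can form. The fragment below is a generic user-space illustration with pthread mutexes standing in for the kernel locks listed above; it is not kernel code, and the names are only placeholders.

```c
// Generic illustration of consistent lock ordering (user-space pthreads;
// the mutex names are stand-ins for the kernel locks, not the real thing).
#include <pthread.h>

static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ptl       = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

// Every path that needs several of these locks takes them in the same
// order and releases them in reverse, so two threads can never each
// hold one lock while waiting on the other's.
void fault_path(void)
{
    pthread_mutex_lock(&mmap_lock);   // outermost: VMA layout is stable
    pthread_mutex_lock(&ptl);         // then the page table page
    pthread_mutex_lock(&page_lock);   // then the page being populated

    // ... install the mapping ...

    pthread_mutex_unlock(&page_lock);
    pthread_mutex_unlock(&ptl);
    pthread_mutex_unlock(&mmap_lock);
}

int main(void)
{
    fault_path();
    return 0;
}
```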
Lock Contention Concerns:
High page fault rates can cause contention: every fault in a process takes mmap_lock in read mode (which still contends with writers such as mmap and munmap), and faults on nearby addresses serialize on the same page table lock.
```c
// Locking in the page fault path

int __do_page_fault(struct mm_struct *mm, unsigned long address,
                    unsigned int flags, struct pt_regs *regs)
{
    struct vm_area_struct *vma;
    int fault;

    // Step 1: Acquire mmap_lock in read mode.
    // This protects the VMA structures from concurrent modification.
    mmap_read_lock(mm);

    // Find the VMA under lock protection
    vma = find_vma(mm, address);
    if (!vma) {
        mmap_read_unlock(mm);
        return VM_FAULT_SIGSEGV;
    }

    // Step 2: Handle the fault (acquires the PTL internally)
    fault = handle_mm_fault(mm, vma, address, flags);

    // Step 3: Release mmap_lock
    mmap_read_unlock(mm);
    return fault;
}

// Fine-grained page table locking
int finish_fault(struct vm_fault *vmf)
{
    spinlock_t *ptl;
    pte_t *pte;

    // Acquire the page table lock for this specific page table page.
    // Different from mmap_lock - very fine-grained.
    pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
                              vmf->address, &ptl);

    // ... update the PTE ...

    pte_unmap_unlock(pte, ptl);
    return 0;
}

// Page lock for I/O synchronization
int read_swap_page(struct page *page, swp_entry_t entry)
{
    // Lock the page - blocks other threads from issuing I/O on it
    lock_page(page);

    // Do the swap read I/O
    int err = swap_readpage(page, entry);

    // The page stays locked until the I/O completes;
    // swap_read_endio() then unlocks it.
    return err;
}

// Lockless fast path (per-VMA / speculative fault handling in recent kernels)
// Try to handle the fault without mmap_lock first
int do_user_addr_fault(struct pt_regs *regs, unsigned long error_code,
                       unsigned long address)
{
    // First, try lockless handling.
    // This works if the VMA is stable and the page tables are populated.
    if (maybe_handle_faults_without_lock(mm, address, flags))
        return 0;

    // Fall back to the locked path
    mmap_read_lock(mm);
    // ... full fault handling ...
    mmap_read_unlock(mm);
    return 0;
}
```

The mmap_lock is a well-known scalability bottleneck. When many threads fault concurrently, they all contend for this lock. Recent kernels have introduced lockless fault handling (such as per-VMA locks) that can bypass mmap_lock in common cases, significantly improving scalability for memory-intensive parallel workloads.
Many things can go wrong during page loading. Robust error handling is essential to prevent data corruption, security vulnerabilities, and system crashes.
Potential errors include: failure to allocate a frame (out of memory), I/O errors while reading from swap or a file, pages that never become up-to-date (corrupted or truncated backing data), and invalid accesses that must ultimately kill the faulting process.
```c
// Error handling in the page fault path

int do_fault_around(struct vm_fault *vmf, pgoff_t start_pgoff)
{
    struct page *page;
    int ret;

    // Try to allocate a page
    page = alloc_page(GFP_HIGHUSER_MOVABLE);
    if (!page) {
        // OOM - couldn't allocate
        return VM_FAULT_OOM;
    }

    // Try to read from the file or swap
    ret = do_read_page(page, vmf);
    if (ret < 0) {
        put_page(page);
        if (ret == -EIO)
            return VM_FAULT_SIGBUS;   // I/O error
        if (ret == -ENOMEM)
            return VM_FAULT_OOM;      // OOM during I/O
        return VM_FAULT_ERROR;        // Generic error
    }

    // Check for data corruption
    if (!PageUptodate(page)) {
        put_page(page);
        return VM_FAULT_SIGBUS;       // Page didn't load correctly
    }

    return 0;  // Success
}

// After do_page_fault returns, a higher level handles the result
void handle_page_fault_result(int fault_result, struct pt_regs *regs,
                              unsigned long address)
{
    if (fault_result & VM_FAULT_OOM) {
        // Out of memory - this is serious
        pagefault_out_of_memory();   // May invoke the OOM killer
        return;
    }

    if (fault_result & VM_FAULT_SIGBUS) {
        // I/O error or similar
        do_sigbus(regs, address);    // Send SIGBUS to the process
        return;
    }

    if (fault_result & VM_FAULT_SIGSEGV) {
        // Invalid access
        do_sigsegv(regs, address);   // Send SIGSEGV to the process
        return;
    }

    // Success - return to user mode and retry the instruction
}

// OOM killer - last resort
void pagefault_out_of_memory(void)
{
    if (current_thread_has_oom_victim())
        return;  // This thread is already being killed

    // Select a process to kill to free memory
    out_of_memory(&oc);
}
```

The error handling philosophy is to be as graceful as possible. An I/O error on one page shouldn't crash the whole system, only that process. OOM situations try reclaim and OOM killing before giving up. Only truly unrecoverable situations (kernel bugs, hardware failures) cause panics.
Let's put together everything we've learned into a complete timeline of loading a page into a frame:
| Step | Operation | Typical Time | Blocking? |
|---|---|---|---|
| 1 | Frame allocation (fast path) | ~100 ns | No |
| 1a | Frame allocation (reclaim) | ~1-100 ms | Yes |
| 2 | Determine page source | ~100 ns | No |
| 3a | Zero-fill page | ~1 µs | No |
| 3b | Read from NVMe swap | ~70-150 µs | Yes |
| 3c | Read from file (cached) | ~1 µs | No |
| 3c | Read from file (disk) | ~100 µs - 10 ms | Yes |
| 4 | Acquire PTL | ~100 ns | Maybe (contention) |
| 5 | Update PTE | ~10 ns | No |
| 6 | Update statistics | ~100 ns | No |
| 7 | Return to user mode | ~500 ns | No |
The fastest page faults are minor faults on anonymous pages—zero-fill takes only microseconds with no disk I/O. This is why modern systems can handle millions of minor faults per second. The killer is major faults (disk I/O), which are 1000x slower at minimum.
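A quick way to see the fast path from user space is to map an anonymous region and touch each page while watching the process's fault counters. The program below is a Linux/POSIX sketch using mmap and getrusage; nothing in it is kernel code.

```c
// Observe zero-fill (minor) faults from user space: the first touch of
// each anonymous page triggers one minor fault and no disk I/O.
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 64UL * 1024 * 1024;          // 64 MiB
    long page = sysconf(_SC_PAGESIZE);
    struct rusage before, after;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    getrusage(RUSAGE_SELF, &before);

    // The first write to each page faults it in (zero-fill, no I/O)
    for (size_t off = 0; off < len; off += (size_t)page)
        buf[off] = 1;

    getrusage(RUSAGE_SELF, &after);

    printf("minor faults: %ld, major faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);

    munmap(buf, len);
    return 0;
}
```

On a typical system this reports roughly one minor fault per touched page (about 16,384 for 64 MiB of 4 KiB pages) and zero major faults, matching the zero-fill row of the timeline.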
Loading a page into a frame is the heart of demand paging, the operation that makes virtual memory's promise possible. The key concepts to consolidate: a frame is allocated (reclaiming other pages if memory is low), the content is read from swap or a file (or simply zero-filled), the PTE is constructed and installed atomically under the page table lock, no TLB invalidation is needed for a brand-new mapping, and the whole path is coordinated by mmap_lock, the page table lock, and the page lock.
What's Next:
With the page now loaded and mapped, the handler is almost done. The final page explores Restart Instruction—how control returns to user mode and the faulting instruction successfully completes, making the entire page fault completely transparent to the application.
You now understand how pages are loaded from secondary storage into physical memory. From frame allocation through disk I/O to PTE updates, you've seen the complete process that fulfills virtual memory's promise. Next, we'll see how the instruction that caused all this activity finally completes successfully.