A page fault is a crisis. The processor attempted to access a virtual address whose page is not present in physical RAM. Without virtual memory, this would be a catastrophic error—the program would crash, data would be lost, the user would be frustrated. But in a demand-paged system, a page fault is something else entirely: it's an opportunity.
The page fault handler transforms what would be a fatal error into a smooth, nearly invisible operation. It loads the missing data, updates the system state, and restarts execution so seamlessly that the running program never knows anything unusual happened. This sleight of hand—making disk-speed operations appear as memory-speed operations—is one of the most elegant achievements in systems software.
But don't let the elegance fool you: page fault handling is also one of the most performance-critical code paths in an operating system. A fault handler that takes an extra microsecond might seem trivial, but multiply that by millions of faults per day, and you've lost hours of compute time. Kernel developers spend enormous effort optimizing this path.
By the end of this page, you will understand the complete page fault handling sequence: fault detection and classification, determining the source of the page, allocating frames, loading data, updating page tables, and resuming execution. You'll also explore the distinction between minor and major faults, error cases, and the critical importance of instruction restart semantics.
Not all page faults are equal. Before the handler can take action, it must classify the fault to determine the appropriate response. This classification is the first and most crucial step in page fault handling.
The Classification Matrix:
Page faults can be categorized along several dimensions:

- Is the faulting address inside a valid mapped region?
- Is the page merely absent, or present but protected?
- Does the access type (read, write, or execute) match the region's permissions?
- Did the fault occur in user mode or kernel mode?

Resolving these questions determines whether the fault leads to page loading, copy-on-write handling, or process termination.
| Condition | Valid Region? | Outcome |
|---|---|---|
| Unmapped address | No | SIGSEGV (Segmentation Fault) |
| Address in file mapping, page not loaded | Yes | Load from file (major fault) |
| Address in anonymous region, first access | Yes | Zero-fill on demand (minor fault) |
| Page was swapped out | Yes | Load from swap (major fault) |
| Write to read-only page (COW) | Yes | Copy page, update mappings (minor fault) |
| Write to truly read-only page | Yes/No | SIGSEGV or SIGBUS |
| Execute from non-executable page | Yes | SIGSEGV (protection violation) |
| Access from user mode to kernel page | Yes | SIGSEGV (protection violation) |
Determining Address Validity:
The handler must consult the process's address space metadata to determine if the faulting address belongs to a valid region. In Linux, this involves:
- Acquiring the address-space lock for reading (mmap_lock)
- Searching the VMA tree for a region containing the faulting address (vm_area_struct)

If no VMA contains the address, or if the access violates the VMA's permissions, the fault is an error. Otherwise, it's a legitimate demand fault.
A minor fault (soft fault) can be resolved without disk I/O—the page data is already in memory somewhere. A major fault (hard fault) requires reading from disk. Major faults are orders of magnitude more expensive. Monitoring the ratio of major to minor faults is a key performance metric in any memory-intensive system.
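To see these counters from inside a program, getrusage(2) reports per-process fault counts; a minimal sketch (Linux/POSIX, error handling trimmed):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        /* ru_minflt: faults resolved without I/O; ru_majflt: faults requiring I/O */
        printf("minor faults: %ld, major faults: %ld\n",
               ru.ru_minflt, ru.ru_majflt);
    }
    return 0;
}
```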
The page fault handling sequence involves multiple components working in tight coordination. Let's trace through the complete flow from fault to resumed execution.
Phase 1: Exception Entry
When the MMU detects an access to a non-present page:

- The CPU aborts the current instruction, leaving architectural state as it was before the instruction began
- The faulting virtual address is recorded (in the CR2 register on x86)
- An error code describing the access (read/write, user/kernel, present/protection) is pushed
- Control vectors through the interrupt descriptor table to the page fault handler
Phase 2: Initial Handler
The low-level assembly handler:

- Saves the registers that the rest of the handler will clobber
- Retrieves the faulting address and error code
- Calls the architecture-independent C-language fault handler
Phase 3: Fault Classification and Resolution
The C-language handler performs the classification we discussed earlier, then takes the appropriate action:
if (address not in valid VMA)
→ send SIGSEGV to process
else if (page is in swap)
→ allocate frame
→ read page from swap partition
→ update PTE with new frame
else if (page is file-backed)
→ check page cache for page
→ if not cached: read from file
→ map page from cache into process
else if (page is anonymous, first access)
→ allocate zeroed frame (or map zero-page read-only)
else if (copy-on-write fault)
→ allocate new frame
→ copy data from shared page
→ update PTE to point to new frame with write permission
Phase 4: Page Table Update and TLB
After obtaining the frame with the correct content:

- Construct a PTE containing the frame number, permission bits, and Present=1
- Install it atomically in the page table, under the page table lock
- Invalidate any stale TLB entry if one could exist
Phase 5: Return to User Space
The handler returns, causing:

- The saved registers and instruction pointer to be restored
- The CPU to drop back to user mode (via iret on x86)
- The faulting instruction to execute again, this time finding a valid translation
The architecture must save enough state to restart the faulting instruction exactly as if nothing happened. This is non-trivial: instructions might have side effects (like auto-increment addressing modes) that must be either undone or remembered. Complex CISC instructions that access multiple memory locations require particularly careful handling.
Once the handler determines that a page must be loaded, it needs to find where the page's data resides. This requires consulting multiple data structures depending on the page's type.
For File-Backed Pages (Code, Data, mmap regions):
The VMA structure contains:

- The virtual address range the mapping covers (start and end)
- A pointer to the backing file, if any
- The offset within the file where the mapping begins
- The region's permissions and flags
Page offset is calculated:

```text
page_offset = (fault_address - vma->start) + vma->file_offset
```
The page cache is checked first—the page might already be in memory (just not mapped for this process). If not, a disk read is scheduled.
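To make the arithmetic concrete, here is a tiny sketch with made-up numbers; the struct is a hypothetical stand-in for the two VMA fields the formula uses, not the kernel's actual layout:

```c
#include <stdio.h>

/* Hypothetical stand-in for the relevant VMA fields */
struct vma_example {
    unsigned long start;       /* first virtual address of the mapping */
    unsigned long file_offset; /* byte offset in the file where the mapping begins */
};

int main(void)
{
    struct vma_example vma = { .start = 0x400000, .file_offset = 0x2000 };
    unsigned long fault_address = 0x403A10;

    /* page_offset = (fault_address - vma->start) + vma->file_offset */
    unsigned long page_offset = (fault_address - vma.start) + vma.file_offset;

    printf("file offset of faulting byte: 0x%lx\n", page_offset);            /* 0x5A10 */
    printf("offset of page to read:       0x%lx\n", page_offset & ~0xFFFUL); /* 0x5000 */
    return 0;
}
```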
For Swapped-Out Pages:
The (non-present) PTE itself contains the swap entry:
```text
swap_entry = pte_to_swap_entry(pte)
device = swap_entry.device
offset = swap_entry.offset
```
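The exact bit layout is architecture-specific; the sketch below shows one plausible way to pack a device number and slot offset into a PTE whose Present bit is clear (the field widths here are illustrative only):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative layout: bit 0 = Present, bits 1-5 = swap device, bits 6+ = slot offset */
#define PTE_PRESENT    0x1ULL
#define SWAP_DEV_SHIFT 1
#define SWAP_DEV_BITS  5
#define SWAP_OFF_SHIFT (SWAP_DEV_SHIFT + SWAP_DEV_BITS)

static uint64_t swap_entry_to_pte(unsigned int dev, uint64_t offset)
{
    /* Present stays 0, so any access faults and the handler can decode the entry */
    return ((uint64_t)dev << SWAP_DEV_SHIFT) | (offset << SWAP_OFF_SHIFT);
}

int main(void)
{
    uint64_t pte = swap_entry_to_pte(2, 0x1A3F);

    printf("present: %llu\n", (unsigned long long)(pte & PTE_PRESENT));
    printf("device:  %llu\n",
           (unsigned long long)((pte >> SWAP_DEV_SHIFT) & ((1u << SWAP_DEV_BITS) - 1)));
    printf("offset:  0x%llx\n", (unsigned long long)(pte >> SWAP_OFF_SHIFT));
    return 0;
}
```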
For Anonymous Zero-Fill Pages:
No lookup needed—the OS simply provides a page of zeros. Optimizations:

- Map every untouched anonymous page to a single shared, read-only zero page; allocate a private frame only on the first write
- Keep a pool of pre-zeroed frames so the fault path doesn't pay the zeroing cost
```c
/* Determine source of page data and handle fault */
int handle_demand_fault(struct vm_area_struct *vma, unsigned long address,
                        unsigned int flags)
{
    struct page *page = NULL;
    pte_t pte;

    /* Calculate page-aligned address and file offset */
    unsigned long page_addr = address & PAGE_MASK;
    pgoff_t pgoff = ((page_addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

    if (vma->vm_file) {
        /* File-backed mapping - check page cache first */
        struct address_space *mapping = vma->vm_file->f_mapping;

        page = find_get_page(mapping, pgoff);
        if (!page) {
            /* Page not in cache - read from file */
            page = read_mapping_page(mapping, pgoff, vma->vm_file);
            if (IS_ERR(page))
                return VM_FAULT_SIGBUS;
        }
        /* Page is now in page cache and in 'page' variable */
    } else {
        /* Anonymous mapping - check if swapped */
        pte = *pte_offset(vma->vm_mm, address);

        if (is_swap_pte(pte)) {
            /* Swapped out - read from swap */
            swp_entry_t entry = pte_to_swp_entry(pte);

            page = alloc_page(GFP_HIGHUSER);
            if (!page)
                return VM_FAULT_OOM;

            int err = swap_readpage(page, entry);
            if (err) {
                put_page(page);
                return VM_FAULT_SIGBUS;
            }
            /* Free the swap slot */
            swap_free(entry);
        } else {
            /* First access - zero fill */
            page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
            if (!page)
                return VM_FAULT_OOM;
        }
    }

    /* Install page in page table */
    return install_page(vma, address, page, flags);
}
```

The page cache is shared across the entire system. When one process faults in a page from a file, the page becomes available in the cache. Other processes mapping the same file can then satisfy faults from the cache without any disk I/O. This is a major optimization for shared libraries—the first process to touch a library page pays the I/O cost; subsequent processes get it 'free.'
Before a page can be installed, the handler must obtain a physical frame to hold it. Frame allocation is a critical component of page fault handling, and it must handle the case where memory is scarce.
The Allocation Process:
1. Request from free pool: the allocator hands out a free frame immediately; this fast path costs well under a microsecond.
2. If no free frames, trigger reclamation: evict clean page cache pages, or write back dirty pages and swap out anonymous pages until frames become available.
3. If reclamation fails, invoke OOM killer: select a victim process and terminate it to release memory.
Frame Allocation Challenges:

- Allocation may need to sleep while reclaiming memory, which is forbidden in atomic contexts
- Reclamation can itself require disk I/O (writing back dirty pages before they can be freed)
- On NUMA systems, the allocator prefers frames local to the faulting CPU
Allocation Flags:
In Linux, alloc_page() takes flags controlling allocation behavior:
```c
/* Common allocation flags */
GFP_KERNEL     /* Normal kernel allocation, can sleep */
GFP_ATOMIC     /* Cannot sleep, for interrupt context */
GFP_HIGHUSER   /* User-space page, preferably from high memory */
__GFP_ZERO     /* Zero the page before returning */
__GFP_NOWARN   /* Don't warn if allocation fails */
__GFP_NORETRY  /* Don't try hard, fail quickly */

/* Page fault typically uses: */
GFP_HIGHUSER | __GFP_ZERO  /* For anonymous pages */
GFP_HIGHUSER               /* For file-backed pages (data comes from file) */
```
Page Zeroing:
For security reasons, frames given to user processes must not contain stale data from other processes. Anonymous pages are always zeroed. This zeroing can be performed:

- Eagerly, on the fault path, at the moment the frame is allocated
- In the background during idle time, building a pool of pre-zeroed frames
Many systems use background zeroing to keep a pool of pre-zeroed pages ready for fast allocation.
A clever optimization for zero-filled pages: instead of allocating a unique zeroed page for each anonymous page fault, map all zero pages to a single, shared zero-filled frame (read-only). Only when the process writes to the page is a private copy allocated. This avoids allocation and zeroing entirely for pages that are read but never written.
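A small user-space experiment makes this optimization visible: reads of a fresh anonymous mapping are satisfied by the shared zero page, and only writes force private frames. A sketch using standard POSIX calls (fault counts come from Linux rusage fields):

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t len = 64UL * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    long before = minor_faults();

    volatile char sum = 0;
    for (size_t i = 0; i < len; i += 4096)
        sum += p[i];          /* reads: typically mapped to the shared zero page */
    long after_read = minor_faults();

    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;             /* writes: each forces a private, zeroed frame */
    long after_write = minor_faults();

    printf("faults during reads: %ld, during writes: %ld\n",
           after_read - before, after_write - after_read);
    munmap(p, len);
    return 0;
}
```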
Once a frame is allocated, the page content must be loaded. The method depends on the page's backing store.
File-Backed Pages:
For pages backed by files (executables, shared libraries, mmap'd files):

1. Compute the file offset from the VMA metadata, as shown earlier
2. Check the page cache; on a hit, map the cached page directly
3. On a miss, allocate a frame, read the page from the file, and insert it into the page cache
The page cache is crucial here—not only does it avoid repeated I/O, but it also enables sharing. Multiple VMAs (even from different processes) can reference the same cached page.
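User space can observe residency in the page cache with mincore(2), which reports which pages of a mapping are currently in memory; a minimal sketch (assumes a 4 KiB page size for brevity):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hosts", O_RDONLY);   /* any readable file works */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    size_t pages = ((size_t)st.st_size + 4095) / 4096;
    unsigned char vec[pages];
    if (mincore(p, st.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;          /* low bit set: page is in memory */
        printf("%zu of %zu pages already resident, before any access\n",
               resident, pages);
    }
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```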
Swap-Backed Pages:
For pages that were evicted to swap:

1. Decode the swap device and slot number from the non-present PTE
2. Allocate a frame and read the page back from the swap device
3. Free the swap slot once the page is safely in memory
| Source | Typical Latency | Can Be Shared? | Post-Load Action |
|---|---|---|---|
| Page Cache (hit) | ~1-10 μs | Yes | Just map into process |
| File (cache miss) | ~100 μs - 10 ms | Yes (added to cache) | Read → cache → map |
| Swap (SSD) | ~50-200 μs | No (private) | Read → free swap slot → map |
| Swap (HDD) | ~5-15 ms | No (private) | Read → free swap slot → map |
| Zero Fill | ~1-5 μs | Shared zero page | Allocate or map zero page |
Asynchronous Read-Ahead:
While servicing a page fault, the kernel often initiates read-ahead for adjacent pages:
```text
Current fault at page N:
1. Load page N (synchronous - must wait)
2. Initiate async load for pages N+1, N+2, ... N+K
3. Return with page N ready
4. Background I/O continues for N+1 through N+K
5. Next faults for N+1..N+K likely find pages already loaded
   (minor faults or no faults)
```
Read-ahead converts potential major faults into minor faults or no faults at all. The kernel uses heuristics to detect sequential access patterns and adjust read-ahead window size.
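Applications can also steer read-ahead explicitly with madvise(2); a sketch (the file name is hypothetical):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);     /* hypothetical input file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Declare sequential access: the kernel widens its read-ahead window */
    madvise(p, st.st_size, MADV_SEQUENTIAL);
    /* Or request asynchronous prefetch of a range needed soon */
    madvise(p, st.st_size, MADV_WILLNEED);

    volatile char sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];                         /* most faults now hit the page cache */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```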
Blocking During Page Load:
While waiting for I/O, the faulting process is blocked. However:

- The scheduler runs other ready processes on the CPU
- Other threads of the same process continue (unless they touch the same missing page)
- The kernel can service page faults from other processes concurrently
The ability to overlap page loading I/O with other computation is what makes demand paging practical. If the system had to stop everything during each page load, performance would be abysmal. Instead, the scheduler ensures that waiting for I/O doesn't waste CPU cycles—other work proceeds in parallel.
After the page is loaded into a frame, the page table must be updated to reflect the new mapping. This step is subtle and requires careful attention to synchronization and consistency.
The PTE Update:
```text
Before (non-present):
┌──────────────────────────────────────────────────────────────┐
│ Present=0 │ Swap Entry ID / File Offset / Zero marker        │
└──────────────────────────────────────────────────────────────┘

After (present):
┌──────────────────────────────────────────────────────────────┐
│ Present=1 │ Frame Number │ R/W │ User │ A=0 │ D=0 │ NX │ ... │
└──────────────────────────────────────────────────────────────┘
```
The update must:

- Write the new PTE in a single atomic store, so other CPUs never observe a half-updated entry
- Set the frame number, permissions, and Present bit together
- Happen under the page table lock to serialize against concurrent faults on the same PTE
```c
/* Install a page into the process's page table */
int install_page(struct vm_area_struct *vma, unsigned long addr,
                 struct page *page, unsigned int flags)
{
    struct mm_struct *mm = vma->vm_mm;
    pte_t *ptep;
    pte_t new_pte;
    spinlock_t *ptl;

    /* Get the PTE pointer with page table lock */
    ptep = pte_offset_map_lock(mm, addr, &ptl);
    if (!ptep)
        return VM_FAULT_OOM;

    /* Check if another CPU already handled this fault */
    if (pte_present(*ptep)) {
        /* Race condition: page already installed */
        put_page(page);
        pte_unmap_unlock(ptep, ptl);
        return 0;  /* Success - just use existing mapping */
    }

    /* Construct the new PTE */
    new_pte = mk_pte(page, vma->vm_page_prot);

    /* If VMA is writable and this is a write fault, set writable */
    if ((flags & FAULT_FLAG_WRITE) && (vma->vm_flags & VM_WRITE))
        new_pte = pte_mkwrite(new_pte);

    /* Make it present and young (accessed) */
    new_pte = pte_mkpresent(new_pte);
    new_pte = pte_mkyoung(new_pte);

    /* If it was a write that caused the fault, mark dirty */
    if (flags & FAULT_FLAG_WRITE)
        new_pte = pte_mkdirty(new_pte);

    /* Atomically install the PTE */
    set_pte_at(mm, addr, ptep, new_pte);

    /* Update memory accounting */
    inc_mm_counter(mm, MM_FILEPAGES);

    /* Record the mapping for reverse lookup */
    page_add_file_rmap(page);

    pte_unmap_unlock(ptep, ptl);
    return 0;
}
```

Race Conditions:
Multiple CPUs might handle faults for the same address simultaneously:

1. Two threads touch the same missing page at nearly the same time
2. Both handlers find the PTE non-present and begin resolving the fault
3. Both load or allocate a page, but only one mapping can win
The solution: hold a lock while examining and updating the PTE. If the PTE becomes present while we were loading, discard our work and use the existing mapping.
TLB Considerations:
After installing the PTE, we might need to invalidate stale TLB entries:

- On the local CPU, a single-page invalidation (invlpg on x86) if a stale translation could be cached
- On other CPUs, a cross-processor "TLB shootdown" delivered via inter-processor interrupt
Modern x86 processors don't cache non-present PTEs as TLB entries, so new mappings typically don't require explicit TLB flush.
The page table lock (ptl) protects against concurrent modification of PTEs. Holding this lock during I/O would be catastrophic—I/O takes milliseconds, and the lock would block all page faults for this process. The solution: hold the lock only during the actual PTE read/write, releasing it during I/O. Reacquire and verify before final installation.
After the page is installed, the processor must resume exactly where it left off. This requires restarting the faulting instruction from the beginning. The mechanics of instruction restart are crucial for correct demand paging operation.
Why Complete Restart?
The faulting instruction didn't complete—it was interrupted mid-execution. We can't resume from the middle because:

- The CPU's mid-instruction micro-state is not visible to software
- Partial side effects (a half-written destination, an auto-incremented register) would leave ambiguous state
- The only portable contract is to restart from a known architectural state
The processor's microarchitecture is designed to leave the architectural state exactly as it was before the instruction began.
Saved State:
When the page fault occurs, the CPU saves:

- The instruction pointer (RIP) of the faulting instruction itself, not the next instruction
- The code segment, flags register (RFLAGS), stack pointer, and stack segment
- An error code describing the access; the faulting address is latched in CR2
When the handler returns via iret, all this state is restored and the instruction runs again.
Worked example: MOV RAX, [RBX], where RBX points to a page that is not present. After the fault is handled, the instruction re-executes and the load succeeds. Timeline:

```text
1. MOV RAX, [RBX] begins execution
   RIP = 0x401000 (address of MOV)
   RBX = 0x7FFF12340000 (target address)

2. CPU attempts to read from 0x7FFF12340000
   MMU finds Present=0 in PTE
   Page fault triggered

3. CPU state saved:
   RIP saved as 0x401000 (MOV instruction address)
   RAX unchanged (load didn't complete)
   Error code: 0x4 (user read, not present)

4. Page fault handler runs:
   Loads the page into a frame, updates the PTE, returns

5. CPU restores state:
   RIP = 0x401000
   Execution resumes at MOV instruction

6. MOV RAX, [RBX] executes again:
   MMU now finds Present=1
   Loads 8 bytes from Frame 42
   RAX updated with loaded value
   RIP advances to next instruction
```
Complex Instruction Challenges:
Some instructions are particularly challenging for restart:
1. Multi-Memory-Access Instructions (CISC):
MOVS - String move (reads source, writes destination)
- Might fault on source read or destination write
- Must restart from beginning, re-reading source
- Auto-increment of SI/DI must be undone
2. Page-Crossing Accesses:
MOV RAX, [address] where the 8-byte load spans two pages
- First page present, second page not
- Fault occurs mid-load
- Partial data discarded, instruction restarted
3. Read-Modify-Write Instructions:
INC [memory] - Read, increment, write back
- Might fault on read or on write
- Must restart entire sequence
Modern processors handle all these cases correctly, saving precise enough state to restart any instruction. This "precise exception" behavior is essential for virtual memory.
Early pipelined processors had 'imprecise exceptions'—by the time a page fault was detected, subsequent instructions might have partially executed. Operating on such machines required complex recovery software or restrictions on virtual memory. Modern out-of-order processors work hard to maintain precise exception behavior despite executing instructions speculatively.
Not all page faults can be resolved by loading a page. The handler must recognize and properly handle numerous error conditions.
Access Violations (Protection Faults):
Present=1 but access violates protection:
- Writing to read-only page: SIGSEGV (or COW handling)
- User access to supervisor page: SIGSEGV
- Executing from non-executable page: SIGSEGV
- These are protection faults, not missing-page faults
Invalid Address Faults:
Address not in any VMA:
- Access to unmapped memory: SIGSEGV
- NULL pointer dereference: SIGSEGV
- Stack overflow beyond guard page: SIGSEGV (or stack expand)
- Access to kernel addresses from user mode: SIGSEGV
Resource Exhaustion:
Cannot allocate resources to resolve fault:
- Out of memory (no frames available): OOM killer or SIGKILL
- Swap full (can't evict to make room): OOM situation
- Page table allocation failure: Process terminated
| Error Condition | Signal | Default Action | Notes |
|---|---|---|---|
| Unmapped address | SIGSEGV | Core dump + terminate | Most common programming error |
| Protection violation | SIGSEGV | Core dump + terminate | Write to read-only, etc. |
| I/O error reading page | SIGBUS | Core dump + terminate | Disk error, network file issue |
| Out of memory | SIGKILL | Immediate terminate | Or OOM killer selects victim |
| COW limit exceeded | SIGKILL | Immediate terminate | rlimit RLIMIT_AS reached |
| Stack guard violation | SIGSEGV | Core dump + terminate | Or stack expand if within limits |
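For debugging, a process can catch SIGSEGV and inspect the faulting address and cause with the POSIX sigaction(2) interface; a minimal sketch:

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    /* si_addr is the faulting address; si_code distinguishes the cause */
    const char *why = (info->si_code == SEGV_MAPERR) ? "address not mapped"
                    : (info->si_code == SEGV_ACCERR) ? "permission violation"
                    : "other";
    fprintf(stderr, "SIGSEGV at %p (%s)\n", info->si_addr, why);
    _exit(1);   /* returning would restart the faulting instruction forever */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    int *bad = (int *)0x10;   /* unmapped address: reports SEGV_MAPERR */
    return *bad;
}
```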
Stack Expansion:
Stack faults near the current stack limit are special-cased:
```text
if (fault_address is below current stack pointer but
    within RLIMIT_STACK limit)
    → expand stack VMA downward
    → allocate zero page
    → continue execution

if (fault_address is beyond RLIMIT_STACK)
    → SIGSEGV (stack overflow)

if (fault_address is on stack guard page)
    → SIGSEGV (stack overflow)
```
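The limit that bounds automatic stack growth is visible from user space through getrlimit(2); a quick sketch:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        /* Faults just below the stack, within rlim_cur, expand the stack;
         * beyond it (or on the guard page) they raise SIGSEGV. */
        printf("stack soft limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
        printf("stack hard limit: %llu bytes\n", (unsigned long long)rl.rlim_max);
    }
    return 0;
}
```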
Kernel Page Faults:
Page faults in kernel mode receive special handling:

- Faults on kernel addresses in the vmalloc region may only need page table synchronization
- Faults while accessing user memory through helpers (such as copy_from_user) are expected and handled via the exception table
- Any other kernel-mode fault indicates a kernel bug
```c
/* Handling kernel-mode page faults (simplified) */
void do_page_fault(struct pt_regs *regs, unsigned long error_code,
                   unsigned long fault_addr)
{
    if (fault_in_kernel_mode(regs)) {
        /* Kernel page fault - could be expected or a bug */

        if (fault_addr >= TASK_SIZE) {
            /* Fault on kernel address - might be vmalloc region */
            if (vmalloc_fault(fault_addr) == 0)
                return;  /* Handled - vmalloc PTE synced */
        }

        /* Check if this is an expected user-space access */
        const struct exception_table_entry *fixup;
        fixup = search_exception_tables(regs->ip);
        if (fixup) {
            /* This fault was expected - use fixup handler */
            regs->ip = fixup->handler;
            return;  /* Will return error to copy_from_user caller */
        }

        /* Unexpected kernel fault - this is a kernel bug! */
        oops_begin();
        printk(KERN_EMERG "BUG: unable to handle page fault at %lx\n",
               fault_addr);
        show_regs(regs);
        oops_end();
        panic("Kernel fault at %lx", fault_addr);
    }

    /* User-mode fault - normal handling follows */
    handle_user_fault(regs, error_code, fault_addr);
}
```

A user-mode page fault can be safely handled (worst case: kill the offending process). A kernel-mode page fault on an unexpected address indicates a bug that might have corrupted kernel data structures. Systems typically panic rather than continue with potentially corrupted state.
Page fault handling is a performance-critical path. Operating system developers employ numerous optimizations to minimize fault overhead.
Fast Path Optimizations: handle the common cases (minor faults, COW, zero-fill) with minimal work before falling back to the general path.

Minimal Lock Contention: take the address-space lock for reading, rely on per-page-table locks, and never hold a lock across I/O.

Assembly Entry Code: hand-tuned entry and exit sequences save only the registers the handler actually needs.

Page Cache Integration: satisfying faults from cached pages turns potential major faults into cheap minor faults.
Monitoring Page Fault Performance:
System administrators and developers use various tools to monitor page fault behavior:
```bash
# Per-process page fault statistics
/proc/<pid>/stat    # Fields 10 (minor faults) and 12 (major faults)

# System-wide page fault monitoring
sar -B 1            # Page statistics every second
vmstat 1            # Virtual memory statistics

# Detailed per-process analysis
perf stat -e page-faults,minor-faults,major-faults <command>

# Trace individual page faults
perf trace -e page-faults <command>
```
Typical Performance Targets:
| Metric | Excellent | Acceptable | Problematic |
|---|---|---|---|
| Minor fault latency | < 5 μs | < 20 μs | > 100 μs |
| Major fault latency | < 1 ms | < 10 ms | > 50 ms |
| Major fault rate | < 10/sec | < 100/sec | > 1000/sec |
| Page fault handler CPU % | < 1% | < 5% | > 10% |
The ideal paging system is invisible to applications. Enough memory, intelligent read-ahead, and efficient handling mean processes run at near-native speed despite using virtual memory. Achieving this requires continuous measurement and optimization of the page fault path.
We've explored the complete lifecycle of page fault handling. Let's consolidate the essential knowledge:

- Classification comes first: a fault is a demand load, a COW event, or an error, and everything else follows from that decision
- Minor faults are resolved from memory; major faults pay disk latency and dominate performance
- The page cache and the shared zero page eliminate redundant I/O and allocation
- PTE installation must be atomic and race-aware, with locks never held across I/O
- Precise exceptions let the faulting instruction restart as if nothing happened
What's Next:
Now that we understand the mechanics of page fault handling, we'll explore two contrasting strategies for when to load pages: pure demand paging (load only on fault) versus prepaging (anticipate and preload). These strategies represent different points on the trade-off spectrum between memory efficiency and fault reduction.
You now understand page fault handling—one of the most important and performance-sensitive code paths in any operating system. This knowledge enables you to reason about system behavior, optimize memory-intensive applications, and debug virtual memory issues.