A page fault is a crisis. The processor attempted to access a virtual address whose page is not present in physical RAM. Without virtual memory, this would be a catastrophic error—the program would crash, data would be lost, the user would be frustrated. But in a demand-paged system, a page fault is something else entirely: it's an opportunity.
The page fault handler transforms what would be a fatal error into a smooth, nearly invisible operation. It loads the missing data, updates the system state, and restarts execution so seamlessly that the running program never knows anything unusual happened. This sleight of hand—making disk-speed operations appear as memory-speed operations—is one of the most elegant achievements in systems software.
But don't let the elegance fool you: page fault handling is also one of the most performance-critical code paths in an operating system. A fault handler that takes an extra microsecond might seem trivial, but multiply that by millions of faults per day, and you've lost hours of compute time. Kernel developers spend enormous effort optimizing this path.
By the end of this page, you will understand the complete page fault handling sequence: fault detection and classification, determining the source of the page, allocating frames, loading data, updating page tables, and resuming execution. You'll also explore the distinction between minor and major faults, error cases, and the critical importance of instruction restart semantics.
Not all page faults are equal. Before the handler can take action, it must classify the fault to determine the appropriate response. This classification is the first and most crucial step in page fault handling.
The Classification Matrix:
Page faults can be categorized along several dimensions:

- Is the faulting address inside a valid mapped region?
- Is the page merely absent, or present but protected?
- Does the access type (read, write, or execute) match the region's permissions?
- Did the fault occur in user mode or kernel mode?

Resolving these questions determines whether the fault leads to page loading, copy-on-write handling, or process termination.
| Condition | Valid Region? | Outcome |
|---|---|---|
| Unmapped address | No | SIGSEGV (Segmentation Fault) |
| Address in file mapping, page not loaded | Yes | Load from file (major fault) |
| Address in anonymous region, first access | Yes | Zero-fill on demand (minor fault) |
| Page was swapped out | Yes | Load from swap (major fault) |
| Write to read-only page (COW) | Yes | Copy page, update mappings (minor fault) |
| Write to truly read-only page | Yes/No | SIGSEGV or SIGBUS |
| Execute from non-executable page | Yes | SIGSEGV (protection violation) |
| Access from user mode to kernel page | Yes | SIGSEGV (protection violation) |
Determining Address Validity:
The handler must consult the process's address space metadata to determine if the faulting address belongs to a valid region. In Linux, this involves:
- Acquiring the address-space lock for reading (mmap_lock)
- Searching the VMA tree for a region containing the faulting address (vm_area_struct)

If no VMA contains the address, or if the access violates the VMA's permissions, the fault is an error. Otherwise, it's a legitimate demand fault.
A minor fault (soft fault) can be resolved without disk I/O—the page data is already in memory somewhere. A major fault (hard fault) requires reading from disk. Major faults are orders of magnitude more expensive. Monitoring the ratio of major to minor faults is a key performance metric in any memory-intensive system.
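To see these counters from inside a program, getrusage(2) reports per-process fault counts; a minimal sketch (Linux/POSIX, error handling trimmed):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        /* ru_minflt: faults resolved without I/O; ru_majflt: faults requiring I/O */
        printf("minor faults: %ld, major faults: %ld\n",
               ru.ru_minflt, ru.ru_majflt);
    }
    return 0;
}
```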
The page fault handling sequence involves multiple components working in tight coordination. Let's trace through the complete flow from fault to resumed execution.
Phase 1: Exception Entry
When the MMU detects an access to a non-present page:

- The CPU aborts the current instruction, leaving architectural state as it was before the instruction began
- The faulting virtual address is recorded (in the CR2 register on x86)
- An error code describing the access (read/write, user/kernel, present/protection) is pushed
- Control vectors through the interrupt descriptor table to the page fault handler
Phase 2: Initial Handler
The low-level assembly handler:

- Saves the registers that the rest of the handler will clobber
- Retrieves the faulting address and error code
- Calls the architecture-independent C-language fault handler
Phase 3: Fault Classification and Resolution
The C-language handler performs the classification we discussed earlier, then takes the appropriate action:
if (address not in valid VMA)
→ send SIGSEGV to process
else if (page is in swap)
→ allocate frame
→ read page from swap partition
→ update PTE with new frame
else if (page is file-backed)
→ check page cache for page
→ if not cached: read from file
→ map page from cache into process
else if (page is anonymous, first access)
→ allocate zeroed frame (or map zero-page read-only)
else if (copy-on-write fault)
→ allocate new frame
→ copy data from shared page
→ update PTE to point to new frame with write permission
Phase 4: Page Table Update and TLB
After obtaining the frame with the correct content:

- Construct a PTE containing the frame number, permission bits, and Present=1
- Install it atomically in the page table, under the page table lock
- Invalidate any stale TLB entry if one could exist
Phase 5: Return to User Space
The handler returns, causing:

- The saved registers and instruction pointer to be restored
- The CPU to drop back to user mode (via iret on x86)
- The faulting instruction to execute again, this time finding a valid translation
The architecture must save enough state to restart the faulting instruction exactly as if nothing happened. This is non-trivial: instructions might have side effects (like auto-increment addressing modes) that must be either undone or remembered. Complex CISC instructions that access multiple memory locations require particularly careful handling.
Once the handler determines that a page must be loaded, it needs to find where the page's data resides. This requires consulting multiple data structures depending on the page's type.
For File-Backed Pages (Code, Data, mmap regions):
The VMA structure contains:

- The virtual address range the mapping covers (start and end)
- A pointer to the backing file, if any
- The offset within the file where the mapping begins
- The region's permissions and flags
Page offset is calculated:

```text
page_offset = (fault_address - vma->start) + vma->file_offset
```
The page cache is checked first—the page might already be in memory (just not mapped for this process). If not, a disk read is scheduled.
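To make the arithmetic concrete, here is a tiny sketch with made-up numbers; the struct is a hypothetical stand-in for the two VMA fields the formula uses, not the kernel's actual layout:

```c
#include <stdio.h>

/* Hypothetical stand-in for the relevant VMA fields */
struct vma_example {
    unsigned long start;       /* first virtual address of the mapping */
    unsigned long file_offset; /* byte offset in the file where the mapping begins */
};

int main(void)
{
    struct vma_example vma = { .start = 0x400000, .file_offset = 0x2000 };
    unsigned long fault_address = 0x403A10;

    /* page_offset = (fault_address - vma->start) + vma->file_offset */
    unsigned long page_offset = (fault_address - vma.start) + vma.file_offset;

    printf("file offset of faulting byte: 0x%lx\n", page_offset);            /* 0x5A10 */
    printf("offset of page to read:       0x%lx\n", page_offset & ~0xFFFUL); /* 0x5000 */
    return 0;
}
```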
For Swapped-Out Pages:
The (non-present) PTE itself contains the swap entry:
```text
swap_entry = pte_to_swap_entry(pte)
device = swap_entry.device
offset = swap_entry.offset
```
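The exact bit layout is architecture-specific; the sketch below shows one plausible way to pack a device number and slot offset into a PTE whose Present bit is clear (the field widths here are illustrative only):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative layout: bit 0 = Present, bits 1-5 = swap device, bits 6+ = slot offset */
#define PTE_PRESENT    0x1ULL
#define SWAP_DEV_SHIFT 1
#define SWAP_DEV_BITS  5
#define SWAP_OFF_SHIFT (SWAP_DEV_SHIFT + SWAP_DEV_BITS)

static uint64_t swap_entry_to_pte(unsigned int dev, uint64_t offset)
{
    /* Present stays 0, so any access faults and the handler can decode the entry */
    return ((uint64_t)dev << SWAP_DEV_SHIFT) | (offset << SWAP_OFF_SHIFT);
}

int main(void)
{
    uint64_t pte = swap_entry_to_pte(2, 0x1A3F);

    printf("present: %llu\n", (unsigned long long)(pte & PTE_PRESENT));
    printf("device:  %llu\n",
           (unsigned long long)((pte >> SWAP_DEV_SHIFT) & ((1u << SWAP_DEV_BITS) - 1)));
    printf("offset:  0x%llx\n", (unsigned long long)(pte >> SWAP_OFF_SHIFT));
    return 0;
}
```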
For Anonymous Zero-Fill Pages:
No lookup needed—the OS simply provides a page of zeros. Optimizations:

- Map every untouched anonymous page to a single shared, read-only zero page; allocate a private frame only on the first write
- Keep a pool of pre-zeroed frames so the fault path doesn't pay the zeroing cost
```c
/* Determine source of page data and handle fault */
int handle_demand_fault(struct vm_area_struct *vma, unsigned long address,
                        unsigned int flags)
{
    struct page *page = NULL;
    pte_t pte;

    /* Calculate page-aligned address and file offset */
    unsigned long page_addr = address & PAGE_MASK;
    pgoff_t pgoff = ((page_addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

    if (vma->vm_file) {
        /* File-backed mapping - check page cache first */
        struct address_space *mapping = vma->vm_file->f_mapping;

        page = find_get_page(mapping, pgoff);
        if (!page) {
            /* Page not in cache - read from file */
            page = read_mapping_page(mapping, pgoff, vma->vm_file);
            if (IS_ERR(page))
                return VM_FAULT_SIGBUS;
        }
        /* Page is now in page cache and in 'page' variable */
    } else {
        /* Anonymous mapping - check if swapped */
        pte = *pte_offset(vma->vm_mm, address);

        if (is_swap_pte(pte)) {
            /* Swapped out - read from swap */
            swp_entry_t entry = pte_to_swp_entry(pte);

            page = alloc_page(GFP_HIGHUSER);
            if (!page)
                return VM_FAULT_OOM;

            int err = swap_readpage(page, entry);
            if (err) {
                put_page(page);
                return VM_FAULT_SIGBUS;
            }
            /* Free the swap slot */
            swap_free(entry);
        } else {
            /* First access - zero fill */
            page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
            if (!page)
                return VM_FAULT_OOM;
        }
    }

    /* Install page in page table */
    return install_page(vma, address, page, flags);
}
```

The page cache is shared across the entire system. When one process faults in a page from a file, the page becomes available in the cache. Other processes mapping the same file can then satisfy faults from the cache without any disk I/O. This is a major optimization for shared libraries—the first process to touch a library page pays the I/O cost; subsequent processes get it 'free.'
Before a page can be installed, the handler must obtain a physical frame to hold it. Frame allocation is a critical component of page fault handling, and it must handle the case where memory is scarce.
The Allocation Process:
1. Request from free pool: the allocator hands out a free frame immediately; this fast path costs well under a microsecond.
2. If no free frames, trigger reclamation: evict clean page cache pages, or write back dirty pages and swap out anonymous pages until frames become available.
3. If reclamation fails, invoke OOM killer: select a victim process and terminate it to release memory.
Frame Allocation Challenges:

- Allocation may need to sleep while reclaiming memory, which is forbidden in atomic contexts
- Reclamation can itself require disk I/O (writing back dirty pages before they can be freed)
- On NUMA systems, the allocator prefers frames local to the faulting CPU
Allocation Flags:
In Linux, alloc_page() takes flags controlling allocation behavior:
```c
/* Common allocation flags */
GFP_KERNEL     /* Normal kernel allocation, can sleep */
GFP_ATOMIC     /* Cannot sleep, for interrupt context */
GFP_HIGHUSER   /* User-space page, preferably from high memory */
__GFP_ZERO     /* Zero the page before returning */
__GFP_NOWARN   /* Don't warn if allocation fails */
__GFP_NORETRY  /* Don't try hard, fail quickly */

/* Page fault typically uses: */
GFP_HIGHUSER | __GFP_ZERO  /* For anonymous pages */
GFP_HIGHUSER               /* For file-backed pages (data comes from file) */
```
Page Zeroing:
For security reasons, frames given to user processes must not contain stale data from other processes. Anonymous pages are always zeroed. This zeroing can be performed:

- Eagerly, on the fault path, at the moment the frame is allocated
- In the background during idle time, building a pool of pre-zeroed frames
Many systems use background zeroing to keep a pool of pre-zeroed pages ready for fast allocation.
A clever optimization for zero-filled pages: instead of allocating a unique zeroed page for each anonymous page fault, map all zero pages to a single, shared zero-filled frame (read-only). Only when the process writes to the page is a private copy allocated. This avoids allocation and zeroing entirely for pages that are read but never written.
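A small user-space experiment makes this optimization visible: reads of a fresh anonymous mapping are satisfied by the shared zero page, and only writes force private frames. A sketch using standard POSIX calls (fault counts come from Linux rusage fields):

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t len = 64UL * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    long before = minor_faults();

    volatile char sum = 0;
    for (size_t i = 0; i < len; i += 4096)
        sum += p[i];          /* reads: typically mapped to the shared zero page */
    long after_read = minor_faults();

    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;             /* writes: each forces a private, zeroed frame */
    long after_write = minor_faults();

    printf("faults during reads: %ld, during writes: %ld\n",
           after_read - before, after_write - after_read);
    munmap(p, len);
    return 0;
}
```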
Once a frame is allocated, the page content must be loaded. The method depends on the page's backing store.
File-Backed Pages:
For pages backed by files (executables, shared libraries, mmap'd files):

1. Compute the file offset from the VMA metadata, as shown earlier
2. Check the page cache; on a hit, map the cached page directly
3. On a miss, allocate a frame, read the page from the file, and insert it into the page cache
The page cache is crucial here—not only does it avoid repeated I/O, but it also enables sharing. Multiple VMAs (even from different processes) can reference the same cached page.
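User space can observe residency in the page cache with mincore(2), which reports which pages of a mapping are currently in memory; a minimal sketch (assumes a 4 KiB page size for brevity):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hosts", O_RDONLY);   /* any readable file works */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    size_t pages = ((size_t)st.st_size + 4095) / 4096;
    unsigned char vec[pages];
    if (mincore(p, st.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;          /* low bit set: page is in memory */
        printf("%zu of %zu pages already resident, before any access\n",
               resident, pages);
    }
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```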
Swap-Backed Pages:
For pages that were evicted to swap:

1. Decode the swap device and slot number from the non-present PTE
2. Allocate a frame and read the page back from the swap device
3. Free the swap slot once the page is safely in memory
| Source | Typical Latency | Can Be Shared? | Post-Load Action |
|---|---|---|---|
| Page Cache (hit) | ~1-10 μs | Yes | Just map into process |
| File (cache miss) | ~100 μs - 10 ms | Yes (added to cache) | Read → cache → map |
| Swap (SSD) | ~50-200 μs | No (private) | Read → free swap slot → map |
| Swap (HDD) | ~5-15 ms | No (private) | Read → free swap slot → map |
| Zero Fill | ~1-5 μs | Shared zero page | Allocate or map zero page |
Asynchronous Read-Ahead:
While servicing a page fault, the kernel often initiates read-ahead for adjacent pages:
```text
Current fault at page N:
1. Load page N (synchronous - must wait)
2. Initiate async load for pages N+1, N+2, ... N+K
3. Return with page N ready
4. Background I/O continues for N+1 through N+K
5. Next faults for N+1..N+K likely find pages already loaded
   (minor faults or no faults)
```
Read-ahead converts potential major faults into minor faults or no faults at all. The kernel uses heuristics to detect sequential access patterns and adjust read-ahead window size.
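Applications can also steer read-ahead explicitly with madvise(2); a sketch (the file name is hypothetical):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);     /* hypothetical input file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Declare sequential access: the kernel widens its read-ahead window */
    madvise(p, st.st_size, MADV_SEQUENTIAL);
    /* Or request asynchronous prefetch of a range needed soon */
    madvise(p, st.st_size, MADV_WILLNEED);

    volatile char sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];                         /* most faults now hit the page cache */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```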
Blocking During Page Load:
While waiting for I/O, the faulting process is blocked. However:

- The scheduler runs other ready processes on the CPU
- Other threads of the same process continue (unless they touch the same missing page)
- The kernel can service page faults from other processes concurrently
The ability to overlap page loading I/O with other computation is what makes demand paging practical. If the system had to stop everything during each page load, performance would be abysmal. Instead, the scheduler ensures that waiting for I/O doesn't waste CPU cycles—other work proceeds in parallel.
After the page is loaded into a frame, the page table must be updated to reflect the new mapping. This step is subtle and requires careful attention to synchronization and consistency.
The PTE Update:
```text
Before (non-present):
┌──────────────────────────────────────────────────────────────┐
│ Present=0 │ Swap Entry ID / File Offset / Zero marker        │
└──────────────────────────────────────────────────────────────┘

After (present):
┌──────────────────────────────────────────────────────────────┐
│ Present=1 │ Frame Number │ R/W │ User │ A=0 │ D=0 │ NX │ ... │
└──────────────────────────────────────────────────────────────┘
```
The update must:

- Write the new PTE in a single atomic store, so other CPUs never observe a half-updated entry
- Set the frame number, permissions, and Present bit together
- Happen under the page table lock to serialize against concurrent faults on the same PTE
```c
/* Install a page into the process's page table */
int install_page(struct vm_area_struct *vma, unsigned long addr,
                 struct page *page, unsigned int flags)
{
    struct mm_struct *mm = vma->vm_mm;
    pte_t *ptep;
    pte_t new_pte;
    spinlock_t *ptl;

    /* Get the PTE pointer with page table lock */
    ptep = pte_offset_map_lock(mm, addr, &ptl);
    if (!ptep)
        return VM_FAULT_OOM;

    /* Check if another CPU already handled this fault */
    if (pte_present(*ptep)) {
        /* Race condition: page already installed */
        put_page(page);
        pte_unmap_unlock(ptep, ptl);
        return 0;  /* Success - just use existing mapping */
    }

    /* Construct the new PTE */
    new_pte = mk_pte(page, vma->vm_page_prot);

    /* If VMA is writable and this is a write fault, set writable */
    if ((flags & FAULT_FLAG_WRITE) && (vma->vm_flags & VM_WRITE))
        new_pte = pte_mkwrite(new_pte);

    /* Make it present and young (accessed) */
    new_pte = pte_mkpresent(new_pte);
    new_pte = pte_mkyoung(new_pte);

    /* If it was a write that caused the fault, mark dirty */
    if (flags & FAULT_FLAG_WRITE)
        new_pte = pte_mkdirty(new_pte);

    /* Atomically install the PTE */
    set_pte_at(mm, addr, ptep, new_pte);

    /* Update memory accounting */
    inc_mm_counter(mm, MM_FILEPAGES);

    /* Record the mapping for reverse lookup */
    page_add_file_rmap(page);

    pte_unmap_unlock(ptep, ptl);
    return 0;
}
```

Race Conditions:
Multiple CPUs might handle faults for the same address simultaneously:

1. Two threads touch the same missing page at nearly the same time
2. Both handlers find the PTE non-present and begin resolving the fault
3. Both load or allocate a page, but only one mapping can win
The solution: hold a lock while examining and updating the PTE. If the PTE becomes present while we were loading, discard our work and use the existing mapping.
TLB Considerations:
After installing the PTE, we might need to invalidate stale TLB entries:

- On the local CPU, a single-page invalidation (invlpg on x86) if a stale translation could be cached
- On other CPUs, a cross-processor "TLB shootdown" delivered via inter-processor interrupt
Modern x86 processors don't cache non-present PTEs as TLB entries, so new mappings typically don't require explicit TLB flush.
The page table lock (ptl) protects against concurrent modification of PTEs. Holding this lock during I/O would be catastrophic—I/O takes milliseconds, and the lock would block all page faults for this process. The solution: hold the lock only during the actual PTE read/write, releasing it during I/O. Reacquire and verify before final installation.
After the page is installed, the processor must resume exactly where it left off. This requires restarting the faulting instruction from the beginning. The mechanics of instruction restart are crucial for correct demand paging operation.
Why Complete Restart?
The faulting instruction didn't complete—it was interrupted mid-execution. We can't resume from the middle because:

- The CPU's mid-instruction micro-state is not visible to software
- Partial side effects (a half-written destination, an auto-incremented register) would leave ambiguous state
- The only portable contract is to restart from a known architectural state
The processor's microarchitecture is designed to leave the architectural state exactly as it was before the instruction began.
Saved State:
When the page fault occurs, the CPU saves:

- The instruction pointer (RIP) of the faulting instruction itself, not the next instruction
- The code segment, flags register (RFLAGS), stack pointer, and stack segment
- An error code describing the access; the faulting address is latched in CR2
When the handler returns via iret, all this state is restored and the instruction runs again.
Worked example: MOV RAX, [RBX], where RBX points to a page that is not present. After the fault is handled, the instruction re-executes and the load succeeds. Timeline:

```text
1. MOV RAX, [RBX] begins execution
   RIP = 0x401000 (address of MOV)
   RBX = 0x7FFF12340000 (target address)

2. CPU attempts to read from 0x7FFF12340000
   MMU finds Present=0 in PTE
   Page fault triggered

3. CPU state saved:
   RIP saved as 0x401000 (MOV instruction address)
   RAX unchanged (load didn't complete)
   Error code: 0x4 (user read, not present)

4. Page fault handler runs:
   Loads the page into a frame, updates the PTE, returns

5. CPU restores state:
   RIP = 0x401000
   Execution resumes at MOV instruction

6. MOV RAX, [RBX] executes again:
   MMU now finds Present=1
   Loads 8 bytes from Frame 42
   RAX updated with loaded value
   RIP advances to next instruction
```
Complex Instruction Challenges:
Some instructions are particularly challenging for restart:
1. Multi-Memory-Access Instructions (CISC):
MOVS - String move (reads source, writes destination)
- Might fault on source read or destination write
- Must restart from beginning, re-reading source
- Auto-increment of SI/DI must be undone
2. Page-Crossing Accesses:
MOV RAX, [address] where the 8-byte load spans two pages
- First page present, second page not
- Fault occurs mid-load
- Partial data discarded, instruction restarted
3. Read-Modify-Write Instructions:
INC [memory] - Read, increment, write back
- Might fault on read or on write
- Must restart entire sequence
Modern processors handle all these cases correctly, saving precise enough state to restart any instruction. This "precise exception" behavior is essential for virtual memory.
Early pipelined processors had 'imprecise exceptions'—by the time a page fault was detected, subsequent instructions might have partially executed. Operating on such machines required complex recovery software or restrictions on virtual memory. Modern out-of-order processors work hard to maintain precise exception behavior despite executing instructions speculatively.
Not all page faults can be resolved by loading a page. The handler must recognize and properly handle numerous error conditions.
Access Violations (Protection Faults):
Present=1 but access violates protection:
- Writing to read-only page: SIGSEGV (or COW handling)
- User access to supervisor page: SIGSEGV
- Executing from non-executable page: SIGSEGV
- These are protection faults, not missing-page faults
Invalid Address Faults:
Address not in any VMA:
- Access to unmapped memory: SIGSEGV
- NULL pointer dereference: SIGSEGV
- Stack overflow beyond guard page: SIGSEGV (or stack expand)
- Access to kernel addresses from user mode: SIGSEGV
Resource Exhaustion:
Cannot allocate resources to resolve fault:
- Out of memory (no frames available): OOM killer or SIGKILL
- Swap full (can't evict to make room): OOM situation
- Page table allocation failure: Process terminated
| Error Condition | Signal | Default Action | Notes |
|---|---|---|---|
| Unmapped address | SIGSEGV | Core dump + terminate | Most common programming error |
| Protection violation | SIGSEGV | Core dump + terminate | Write to read-only, etc. |
| I/O error reading page | SIGBUS | Core dump + terminate | Disk error, network file issue |
| Out of memory | SIGKILL | Immediate terminate | Or OOM killer selects victim |
| COW limit exceeded | SIGKILL | Immediate terminate | rlimit RLIMIT_AS reached |
| Stack guard violation | SIGSEGV | Core dump + terminate | Or stack expand if within limits |
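For debugging, a process can catch SIGSEGV and inspect the faulting address and cause with the POSIX sigaction(2) interface; a minimal sketch:

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    /* si_addr is the faulting address; si_code distinguishes the cause */
    const char *why = (info->si_code == SEGV_MAPERR) ? "address not mapped"
                    : (info->si_code == SEGV_ACCERR) ? "permission violation"
                    : "other";
    fprintf(stderr, "SIGSEGV at %p (%s)\n", info->si_addr, why);
    _exit(1);   /* returning would restart the faulting instruction forever */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    int *bad = (int *)0x10;   /* unmapped address: reports SEGV_MAPERR */
    return *bad;
}
```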
Stack Expansion:
Stack faults near the current stack limit are special-cased:
```text
if (fault_address is below current stack pointer but
    within RLIMIT_STACK limit)
    → expand stack VMA downward
    → allocate zero page
    → continue execution

if (fault_address is beyond RLIMIT_STACK)
    → SIGSEGV (stack overflow)

if (fault_address is on stack guard page)
    → SIGSEGV (stack overflow)
```
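The limit that bounds automatic stack growth is visible from user space through getrlimit(2); a quick sketch:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        /* Faults just below the stack, within rlim_cur, expand the stack;
         * beyond it (or on the guard page) they raise SIGSEGV. */
        printf("stack soft limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
        printf("stack hard limit: %llu bytes\n", (unsigned long long)rl.rlim_max);
    }
    return 0;
}
```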
Kernel Page Faults:
Page faults in kernel mode receive special handling:

- Faults on kernel addresses in the vmalloc region may only need page table synchronization
- Faults while accessing user memory through helpers (such as copy_from_user) are expected and handled via the exception table
- Any other kernel-mode fault indicates a kernel bug
```c
/* Handling kernel-mode page faults (simplified) */
void do_page_fault(struct pt_regs *regs, unsigned long error_code,
                   unsigned long fault_addr)
{
    if (fault_in_kernel_mode(regs)) {
        /* Kernel page fault - could be expected or a bug */

        if (fault_addr >= TASK_SIZE) {
            /* Fault on kernel address - might be vmalloc region */
            if (vmalloc_fault(fault_addr) == 0)
                return;  /* Handled - vmalloc PTE synced */
        }

        /* Check if this is an expected user-space access */
        const struct exception_table_entry *fixup;
        fixup = search_exception_tables(regs->ip);
        if (fixup) {
            /* This fault was expected - use fixup handler */
            regs->ip = fixup->handler;
            return;  /* Will return error to copy_from_user caller */
        }

        /* Unexpected kernel fault - this is a kernel bug! */
        oops_begin();
        printk(KERN_EMERG "BUG: unable to handle page fault at %lx\n",
               fault_addr);
        show_regs(regs);
        oops_end();
        panic("Kernel fault at %lx", fault_addr);
    }

    /* User-mode fault - normal handling follows */
    handle_user_fault(regs, error_code, fault_addr);
}
```

A user-mode page fault can be safely handled (worst case: kill the offending process). A kernel-mode page fault on an unexpected address indicates a bug that might have corrupted kernel data structures. Systems typically panic rather than continue with potentially corrupted state.
Page fault handling is a performance-critical path. Operating system developers employ numerous optimizations to minimize fault overhead.
Fast Path Optimizations: handle the common cases (minor faults, COW, zero-fill) with minimal work before falling back to the general path.

Minimal Lock Contention: take the address-space lock for reading, rely on per-page-table locks, and never hold a lock across I/O.

Assembly Entry Code: hand-tuned entry and exit sequences save only the registers the handler actually needs.

Page Cache Integration: satisfying faults from cached pages turns potential major faults into cheap minor faults.
Monitoring Page Fault Performance:
System administrators and developers use various tools to monitor page fault behavior:
```bash
# Per-process page fault statistics
/proc/<pid>/stat    # Fields 10 (minor faults) and 12 (major faults)

# System-wide page fault monitoring
sar -B 1            # Page statistics every second
vmstat 1            # Virtual memory statistics

# Detailed per-process analysis
perf stat -e page-faults,minor-faults,major-faults <command>

# Trace individual page faults
perf trace -e page-faults <command>
```
Typical Performance Targets:
| Metric | Excellent | Acceptable | Problematic |
|---|---|---|---|
| Minor fault latency | < 5 μs | < 20 μs | > 100 μs |
| Major fault latency | < 1 ms | < 10 ms | > 50 ms |
| Major fault rate | < 10/sec | < 100/sec | > 1000/sec |
| Page fault handler CPU % | < 1% | < 5% | > 10% |
The ideal paging system is invisible to applications. Enough memory, intelligent read-ahead, and efficient handling mean processes run at near-native speed despite using virtual memory. Achieving this requires continuous measurement and optimization of the page fault path.
We've explored the complete lifecycle of page fault handling. Let's consolidate the essential knowledge:

- Classification comes first: a fault is a demand load, a COW event, or an error, and everything else follows from that decision
- Minor faults are resolved from memory; major faults pay disk latency and dominate performance
- The page cache and the shared zero page eliminate redundant I/O and allocation
- PTE installation must be atomic and race-aware, with locks never held across I/O
- Precise exceptions let the faulting instruction restart as if nothing happened
What's Next:
Now that we understand the mechanics of page fault handling, we'll explore two contrasting strategies for when to load pages: pure demand paging (load only on fault) versus prepaging (anticipate and preload). These strategies represent different points on the trade-off spectrum between memory efficiency and fault reduction.
You now understand page fault handling—one of the most important and performance-sensitive code paths in any operating system. This knowledge enables you to reason about system behavior, optimize memory-intensive applications, and debug virtual memory issues.