We've come full circle. The page fault was detected, the trap brought us to the kernel, we located the page on disk, loaded it into a physical frame, and updated the page table. Now comes the culmination of all this effort: restarting the instruction that originally faulted.
This restart must be seamless. The process should have no idea anything unusual happened—from its perspective, the memory access simply worked. The instruction executes, gets the data it expected, and computation continues. Billions of page faults happen across all the world's computers every second, and virtually none are noticed by the applications experiencing them.
This transparency is the ultimate deliverable of virtual memory. This page explores exactly how it's achieved: the hardware mechanisms for returning to user mode, the state restoration that makes the retry possible, and the edge cases that complicate this seemingly simple "just try again" operation.
By the end of this page, you will understand: (1) How the IRET instruction returns from kernel to user mode, (2) Complete state restoration from the trap frame, (3) Why restarting works—the precise exception model, (4) Complex instruction restart challenges, (5) The overall transparency guarantee of page fault handling.
The IRET (Interrupt Return) instruction is the x86 mechanism for returning from an interrupt or exception handler. On x86-64, the instruction is IRETQ (64-bit variant). This single instruction undoes everything the exception entry process did.
What IRET Does:

- Pops the saved RIP, CS, RFLAGS, RSP, and SS from the stack
- Switches the privilege level back to the one encoded in the popped CS (ring 3 for a user-mode fault)
- Resumes execution at the restored RIP, on the restored user stack

The beauty is that IRET is atomic—all these changes happen as an indivisible operation. There's no window where the CPU is half in kernel mode, half in user mode.
```asm
# Returning from page fault handler (x86-64)

.global page_fault_return
page_fault_return:
    # At this point, the C page fault handler has returned
    # Stack contains saved registers from entry

    # Restore general-purpose registers
    popq %rax
    popq %rbx
    popq %rcx
    popq %rdx
    popq %rsi
    popq %rdi
    popq %rbp
    popq %r8
    popq %r9
    popq %r10
    popq %r11
    popq %r12
    popq %r13
    popq %r14
    popq %r15

    # Skip error code pushed by CPU
    addq $8, %rsp

    # Now stack layout is exactly what CPU pushed:
    # [RSP+0]  = RIP (faulting instruction address)
    # [RSP+8]  = CS (code segment, CPL in bits 0:1)
    # [RSP+16] = RFLAGS
    # [RSP+24] = RSP (user stack pointer)
    # [RSP+32] = SS (stack segment)

    # IRETQ pops all of these atomically
    iretq

    # After IRETQ:
    # - CPU is in user mode (ring 3)
    # - RIP points to the faulting instruction
    # - All registers are restored
    # - The instruction executes again
    # - This time, page is mapped, so it succeeds!
```

The saved RIP points to the faulting instruction, not the one after it. This is what makes page faults different from, say, a breakpoint trap. When IRETQ returns, the CPU doesn't continue to the next instruction—it retries the same one. Since we've now mapped the page, the retry succeeds as if nothing happened.
For the instruction to retry successfully, the CPU state must be exactly what it was when the instruction first tried to execute. Let's trace what gets restored:
Hardware-Restored State (by IRETQ):
| Register | Purpose | Why Needed |
|---|---|---|
| RIP | Instruction pointer | Execute same instruction |
| CS | Code segment | Correct privilege, segment |
| RFLAGS | CPU flags | Direction flag, interrupt flag, etc. |
| RSP | Stack pointer | Stack operations work correctly |
| SS | Stack segment | Stack addressing correct |
Software-Restored State (by handler code):
| Registers | Purpose | Why Needed |
|---|---|---|
| RAX-RDX, RSI, RDI | Function arguments, scratch | Instruction operands |
| RBP | Frame pointer | Stack walking |
| R8-R15 | General purpose | Any purpose |
Implicitly Preserved State:
| State | How Preserved | Notes |
|---|---|---|
| Memory contents | Didn't write | Other pages unchanged |
| FPU/SSE state | Saved if used | May be in XSAVE area |
| Segment registers (DS, ES, FS, GS) | Reloaded by OS | Kernel-managed |
```c
// Conceptual illustration of state preservation
#include <stdint.h>
#include <assert.h>

struct saved_state {
    // Hardware-saved (on stack, restored by IRET)
    uint64_t rip;      // Points to MOV from example below
    uint64_t cs;       // User code segment
    uint64_t rflags;   // Flags at time of fault
    uint64_t rsp;      // User stack pointer
    uint64_t ss;       // User stack segment

    // Software-saved (by entry point, restored before IRET)
    uint64_t rax;      // Value in RAX during fault
    uint64_t rbx;      // etc.
    uint64_t rcx;
    uint64_t rdx;
    uint64_t rdi;
    uint64_t rsi;
    uint64_t rbp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
};

/*
 * Example faulting instruction:
 *     MOV RAX, [RBX + RCX * 8]
 *
 * At fault time:
 *     RIP = address of this MOV
 *     RBX = 0x1000 (base address of array)
 *     RCX = 5 (index)
 *     Effective address = 0x1000 + 5*8 = 0x1028
 *
 * Page at 0x1000 is not present → fault
 *
 * After handling:
 *     Page containing 0x1028 is now mapped
 *     RIP still points to MOV
 *     RBX still = 0x1000
 *     RCX still = 5
 *
 * Retry:
 *     MOV calculates same address: 0x1028
 *     Translation succeeds this time
 *     RAX gets the value from memory
 *     Execution continues to next instruction
 */

// The guarantee: no register was changed
void verify_state_preserved(struct saved_state *before,
                            struct saved_state *after) {
    // Everything should be identical
    assert(before->rax == after->rax);
    assert(before->rbx == after->rbx);
    // ... etc for all registers ...

    // RIP points to same instruction
    assert(before->rip == after->rip);

    // The ONLY difference: page is now mapped
    // This is invisible to user code
}
```

Modern processors have extensive floating-point and SIMD state (XMM, YMM, ZMM registers). This state is typically saved lazily—only if the kernel itself uses FPU. If the page fault handler is pure integer code (common), FPU state is never touched and doesn't need restoration. If the kernel does use FPU, it saves state first via XSAVE.
The ability to restart instructions depends on a fundamental CPU property: precise exceptions. This concept is so important that modern processor designs invest significant silicon to ensure it.
What Makes an Exception Precise:
All instructions before the faulting one have completed. Their effects (register writes, memory writes) are fully visible.
The faulting instruction has had no visible effects. Any partial progress is rolled back.
No instructions after the faulting one have effects. Despite speculative execution, nothing is committed past the fault.
Why This Matters:
If exceptions weren't precise, we couldn't restart: the faulting instruction might already have changed a register or memory location, so retrying would apply its effect twice, and instructions after it might have committed results that a retry would recompute. With precise exceptions, re-executing from the saved RIP is always safe.
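The precise-exception contract is even visible from user space on Linux: a SIGSEGV handler can repair the offending mapping and simply return, and the kernel restarts the faulting instruction, which then succeeds. Below is a minimal, Linux-specific sketch of that idea (it assumes 4 KiB pages; `demo` and `fix_mapping` are illustrative names, and calling `mprotect` from a signal handler is a pragmatic Linux idiom rather than strictly portable POSIX):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *region;
static volatile sig_atomic_t faults;

/* On fault: make the page accessible, then return.
 * Returning from the handler restarts the faulting instruction. */
static void fix_mapping(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info; (void)ctx;
    faults++;
    if (mprotect(region, 4096, PROT_READ | PROT_WRITE) != 0)
        _exit(1);   /* can't recover: bail out */
}

int demo(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = fix_mapping;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* An inaccessible page: any access will fault */
    region = mmap(NULL, 4096, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    int *p = (int *)region;
    *p = 42;       /* faults once; handler fixes page; store is retried */
    return *p;     /* the retried store took effect */
}
```

The program works only because the store had no visible effect before the fault was raised: the exact same instruction runs again from a clean state.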
The Implementation Challenge:
Modern CPUs execute instructions out of order and speculatively, often running 50-100 instructions ahead of what has been committed. When a page fault is detected deep in the pipeline, the CPU must: flush every instruction younger than the faulting one, discard all speculative results, roll architectural state back to the precise fault point, and only then report the exception with the faulting instruction's address.
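The rollback can be made concrete with a toy model of in-order commit from a reorder buffer (all names here, such as `rob_entry` and `commit`, are invented for illustration, not any real CPU's microarchitecture). Instructions may finish executing in any order, but their results reach architectural state only at commit, which stops at the first fault:

```c
#include <stdbool.h>

#define ROB_SIZE 8

/* One slot per in-flight instruction */
struct rob_entry {
    bool done;      /* finished executing (possibly out of order) */
    bool faulted;   /* raised an exception */
    int  result;    /* value to write at commit */
};

struct cpu {
    int arch_reg[ROB_SIZE];  /* architectural state (one reg per slot, for simplicity) */
    int committed;           /* how many instructions have committed */
};

/* Commit in program order; stop at the first fault and return its slot.
 * Everything younger than the fault is squashed: its results never
 * become architecturally visible, however "done" it was. */
int commit(struct cpu *cpu, struct rob_entry *rob, int n) {
    for (int i = 0; i < n; i++) {
        if (rob[i].faulted)
            return i;                     /* precise exception point */
        cpu->arch_reg[i] = rob[i].result; /* effect becomes visible */
        cpu->committed++;
    }
    return n;                             /* no fault */
}
```

This is the whole trick: by delaying visibility until in-order commit, the CPU can always present state "as if it stopped just before the fault", no matter how far ahead it speculated.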
| Property | Precise | Imprecise |
|---|---|---|
| Restart possible? | Yes | No |
| State at exception | Exactly as if stopped before fault | May include partial effects |
| Implementation cost | High (rollback logic) | Low |
| Modern CPUs | Standard requirement | Obsolete |
| Page faults | Fully supported | Would break demand paging |
Speculative execution—running instructions before knowing if they'll actually be needed—is what enables high performance but also created the Spectre and Meltdown vulnerabilities. Even though speculative results are discarded on exceptions, they leave traces in caches that can be exploited. The security mitigations for these issues add overhead to page fault handling.
While the restart model is elegant for simple instructions, some complex instructions pose challenges:
1. Multi-Memory-Access Instructions:
An instruction like REP MOVSB (repeated string move) can copy many bytes, accessing memory once per byte. What if it faults in the middle?

Solution: x86 keeps the operation's progress in architectural registers: RCX holds the remaining count, while RSI and RDI hold the current source and destination addresses, all updated after every iteration. If REP MOVSB faults, these registers reflect the work already done, and restarting the instruction continues from where it stopped rather than repeating the entire copy.
2. Instructions with Multiple Destinations:
Some instructions write to multiple locations. What if the second write faults?
Solution: Such instructions are designed to be restartable. They either use temporary internal state or are defined such that partial writes are valid.
3. Stack Operations During Faults:
If PUSH faults while writing to the user stack, there is a seeming chicken-and-egg problem: the CPU must push an exception frame, but pushing is what faulted.

Solution: there is no real circularity, because the exception frame goes on the kernel stack (selected via the TSS), not the user stack that faulted. The truly pathological case, where the kernel stack itself is unusable, is the double-fault scenario; for that, x86-64 provides a dedicated, known-good stack via the Interrupt Stack Table (IST).
```c
// How REP MOVS handles faults

/*
 * Instruction: REP MOVSB
 *
 * Semantics:
 *     while (RCX != 0) {
 *         [RDI] = [RSI];   // Copy one byte
 *         RSI += direction;
 *         RDI += direction;
 *         RCX--;
 *     }
 *
 * Problem: What if fault occurs mid-copy?
 *
 * Solution: RSI, RDI, RCX are updated each iteration.
 * When fault occurs:
 *     - RCX = remaining count
 *     - RSI = next source address
 *     - RDI = next destination address
 *
 * Restart: REP MOVSB resumes from current RSI/RDI/RCX
 * No work is repeated; no work is skipped.
 */

// Example trace:
// Initial: RSI=src, RDI=dst, RCX=1000
// Copy 500 bytes successfully
// Fault on byte 501 (destination page not present)
// At fault: RSI=src+500, RDI=dst+500, RCX=500
// Handle fault: map dst+500 page
// Restart: REP MOVSB with RSI=src+500, RDI=dst+500, RCX=500
// Copies remaining 500 bytes
// Total effect: all 1000 bytes copied

/*
 * Compare: CISC vs RISC
 *
 * CISC (x86):  Has complex instructions like REP MOVS
 *              Need elaborate restart logic
 *              Architecture ensures restartability
 *
 * RISC (ARM, RISC-V): Instructions are simple
 *              Each instruction does one memory access
 *              Restartability is trivial
 *              String copies are loops of simple loads/stores
 */

// Pseudo-code for what the CPU does internally
void handle_fault_in_rep_movs(FaultState *state) {
    // CPU has already updated RSI, RDI, RCX to reflect progress
    // The saved state in the interrupt frame has current values

    // When we IRET:
    // - RIP points to the REP MOVSB instruction
    // - RSI, RDI, RCX reflect progress made
    // - REP MOVSB will continue from where it left off
}
```

RISC architectures sidestep most of these complications by having simple instructions that access memory at most once. A string copy is just a loop of load/store pairs, each trivially restartable. This is one reason RISC architectures are easier to implement with precise exceptions.
The return to user mode involves a privilege transition—from ring 0 to ring 3. This transition is just as carefully controlled as the entry transition.
Security Considerations:
The OS controls everything about the return. User code cannot forge a return to kernel mode.
IRET validates the CS selector. If malicious code somehow corrupted the stack, IRET won't jump to arbitrary addresses with kernel privilege.
RFLAGS is sanitized. Certain dangerous flags (IOPL, VM) are checked and restricted.
What Changes During Return:
| Aspect | Before (Kernel) | After (User) |
|---|---|---|
| CPL | 0 (ring 0) | 3 (ring 3) |
| Accessible memory | All | User pages only |
| Privileged ops | Allowed | Trap |
| Interrupts | May be disabled | Enabled |
| Stack | Kernel stack | User stack |
```c
// Security aspects of returning to user mode

/*
 * IRET performs implicit security checks:
 *
 * 1. CS.RPL check
 *    - RPL (Requested Privilege Level) in CS must match target ring
 *    - Returning to ring 3: CS.RPL must be 3
 *    - Can't return to kernel (ring 0) from user-originated exception
 *
 * 2. Segment validity
 *    - CS must reference a valid code segment
 *    - Segment must be present
 *    - Segment must be executable
 *
 * 3. RFLAGS sanitization
 *    - IOPL (I/O Privilege Level) can only be raised by ring 0
 *    - VM flag (virtual 8086 mode) is restricted
 *    - IF (interrupt flag) behavior varies
 */

// What the kernel ensures before IRET
void prepare_return_to_usermode(struct pt_regs *regs) {
    // Ensure CS has user ring (RPL = 3)
    regs->cs = USER_CS | 3;          // USER_CS selector with RPL=3

    // Ensure SS has user ring
    regs->ss = USER_DS | 3;

    // Sanitize flags
    regs->flags &= FLAG_MASK_USER;   // Clear dangerous bits
    regs->flags |= FLAG_IF;          // Ensure interrupts will be enabled

    // SMAP/SMEP: CPU features that trap kernel access to user memory
    // These are automatically re-enabled on return to user mode
}

// After IRETQ completes:
// - Kernel stack is now unused (until next entry)
// - User stack (restored RSP) is active
// - User cannot access kernel memory
// - User code resumes at saved RIP

/*
 * If an attacker corrupted the stack:
 *
 * Scenario: Try to return to kernel address with ring 0
 *
 *     fake frame: RIP = kernel_function
 *                 CS  = USER_CS | 0    // Try CPL 0
 *
 * Result: CPU rejects this
 *     - CS.RPL (0) != actual CPL requested
 *     - Would need descriptor with DPL 0
 *     - User can't access kernel descriptors
 *     → General Protection Fault (#GP)
 *     → Kernel handles GP, kills malicious process
 */
```

Supervisor Mode Access Prevention (SMAP) and Supervisor Mode Execution Prevention (SMEP) are CPU features that prevent the kernel from accidentally reading/writing/executing user memory. They're temporarily disabled during intentional user memory access (copy_from_user) but automatically re-enabled on return to user mode, adding defense against kernel vulnerabilities.
Let's trace through exactly what happens when the faulting instruction retries:
Cycle-by-Cycle (simplified):
1. IRETQ completes: CPU is now in user mode, RIP points to MOV RAX, [RBX]
2. Instruction fetch: CPU fetches the MOV instruction (same instruction that faulted)
3. Decode: CPU decodes: load memory at address in RBX into RAX
4. Address generation: CPU computes effective address: value of RBX = 0x7FFF1000
5. TLB lookup: CPU checks TLB for 0x7FFF1000... TLB MISS
6. Page table walk: Hardware walker traverses page table... finds PTE with valid=1, frame=0x12345
7. TLB fill: New entry cached in TLB: VA 0x7FFF1000 → PA 0x12345000
8. Physical access: CPU accesses physical address 0x12345000 + offset
9. Data returned: The byte(s) at that location come back from cache/memory
10. Writeback: CPU writes the value into RAX
11. Retire: Instruction completes, RIP advances to next instruction
The process has no visibility into steps 6-7 (the page table walk that finds our newly-installed mapping). From the process's view, the memory access just took a bit longer than usual.
| Aspect | First Attempt | Retry |
|---|---|---|
| PTE valid bit | 0 (not present) | 1 (present) |
| PTE frame number | Undefined/swap entry | Physical frame number |
| Physical frame | Not allocated | Allocated, contains data |
| TLB entry | None (TLB miss) | None initially, then filled |
| Instruction behavior | Fault → trap | Complete normally |
The retry will TLB miss because we deliberately don't pre-fill the TLB with the new mapping. The TLB miss triggers a page table walk, which finds our newly-installed PTE and fills the TLB. Subsequent accesses to this page will TLB hit. This is the normal, expected behavior.
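The miss-then-fill behavior can be sketched with a one-entry software TLB. This is a toy model, not real MMU code: `translate`, `walks`, and the structures are invented for illustration. A fault is a failed walk; once the PTE is installed, the next access misses the TLB, walks, fills, and subsequent accesses hit:

```c
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 16

struct pte { bool valid; uint64_t frame; };
struct tlb_entry { bool valid; uint64_t vpn, frame; };

static struct pte page_table[NPAGES];
static struct tlb_entry tlb;     /* a one-entry "TLB" */
static int walks;                /* counts page-table walks */

/* Translate a virtual page number; returns false on page fault. */
bool translate(uint64_t vpn, uint64_t *frame) {
    if (tlb.valid && tlb.vpn == vpn) {   /* TLB hit: no walk */
        *frame = tlb.frame;
        return true;
    }
    walks++;                             /* TLB miss: hardware walk */
    if (!page_table[vpn].valid)
        return false;                    /* valid bit clear: page fault */
    tlb.valid = true;                    /* TLB fill */
    tlb.vpn = vpn;
    tlb.frame = page_table[vpn].frame;
    *frame = tlb.frame;
    return true;
}
```

The sequence in the table above maps directly onto this model: first attempt walks and faults, the handler sets `page_table[vpn]`, the retry walks again and fills the TLB, and every later access hits.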
While the basic restart mechanism is elegant, several edge cases require special handling:
1. Multiple Faults on Same Instruction:
An instruction might touch two pages, both absent:

MOVSB        ; reads [RSI], writes [RDI], possibly in different unmapped pages

First fault: the source page is missing. Handle, restart. Second fault: the destination page is missing. Handle, restart. Third attempt: both pages are present, and the instruction completes. (x86 has no general memory-to-memory MOV; the string instructions are the usual way a single instruction ends up with two memory operands.)
2. Fault During Instruction Fetch:
The instruction itself might be on a non-present page. The handler can't examine the faulting opcode (it isn't in memory), but it doesn't need to: the faulting address (CR2) equals the saved RIP, and code pages are demand-paged from the executable file exactly like data pages, so the same handling applies.
3. Signals Pending:
If a signal arrived during page fault handling (e.g., SIGTERM), should we deliver the signal first or restart the instruction? Linux delivers pending signals on the return path to user mode, after the fault is resolved but before the instruction retries; if the signal has a userspace handler, the faulting instruction runs again once that handler returns.
```c
// Handling multiple faults per instruction

/*
 * Instruction: MOVS (MOV String)
 * Source: [RSI]
 * Destination: [RDI]
 * May fault on either or both
 *
 * Execution trace:
 * 1. Attempt to read [RSI]
 *    - Source page not present → fault
 *    - Handle: load source page
 *    - Restart MOVS
 *
 * 2. Read [RSI] succeeds
 *    Attempt to write [RDI]
 *    - Dest page not present → fault
 *    - Handle: load dest page
 *    - Restart MOVS
 *
 * 3. Read [RSI] succeeds (maybe TLB hit now)
 *    Write [RDI] succeeds
 *    MOVS completes
 */

// Signal handling interacts with page faults
int resume_user_or_handle_signal(struct pt_regs *regs) {
    // Page fault handling is complete
    // Before returning to user mode, check for pending signals

    if (signal_pending(current)) {
        // A signal arrived during fault handling
        // Don't restart faulting instruction yet
        // Instead, set up signal handler frame
        struct ksignal ksig;
        get_signal(&ksig);
        handle_signal(&ksig, regs);

        // Signal handler will eventually return
        // Then faulting instruction will retry
        // (unless signal was fatal)
    }

    // Return to user mode
    return 0;
}

// Fault during instruction fetch
// The instruction bytes themselves might be on a not-present page
int handle_instruction_fetch_fault(unsigned long address,
                                   struct pt_regs *regs) {
    // Special case: can't examine the opcode (it's not present!)
    // But we don't need to - same handling as a data page fault
    //
    // address == regs->rip (faulting instruction address)
    // VMA should be executable

    struct vm_area_struct *vma = find_vma(current->mm, address);
    if (!(vma->vm_flags & VM_EXEC)) {
        // Code page must be executable
        // If it's not, this is definitely wrong
        return VM_FAULT_SIGSEGV;
    }

    // Handle like a normal file-backed fault
    // (the code lives in the executable file)
    return filemap_fault(vmf);   // conceptual: vmf describes this fault
}
```

A pathological case: what if page fault handling itself requires pages that keep faulting? If swap is full, reclaim can't make progress, and new page allocations fail. This leads to livelock or OOM. The kernel has watchdogs and maximum retry counts to detect these situations and invoke the OOM killer.
The fundamental promise of virtual memory is transparency: applications should not be able to distinguish between memory that's physically present and memory that's demand-paged from disk. Let's examine how completely this promise is kept.
What Applications Cannot Observe:

Functionally, nothing. Register values, memory contents, and control flow are identical whether a page was resident all along or demand-paged in mid-access. No flag, register, or memory location changes to record that a fault occurred.

What Applications CAN Observe:

Timing: a faulting access takes microseconds to milliseconds instead of nanoseconds. The OS also exposes residency deliberately: mincore() reports which pages are resident, getrusage() reports fault counts, and mlock() pins pages to influence residency.

The Functional Guarantee:
| Aspect | Guarantee |
|---|---|
| Correctness | Absolutely guaranteed - computations produce same results |
| Atomicity | Memory operations have same semantics |
| Ordering | Memory ordering rules preserved |
| Timing | NOT guaranteed - variable latency |
| Performance | NOT guaranteed - depends on residency |
```c
// Demonstration of transparency
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

// This function cannot tell if pages are demand-paged
int sum_array(int *array, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += array[i];   // May page fault, but we can't tell
    }
    return sum;            // Correct result regardless
}

// This function CAN detect timing differences
void measure_access_time(void *ptr) {
    struct timespec start, end;
    volatile int value;

    clock_gettime(CLOCK_MONOTONIC, &start);
    value = *(int *)ptr;   // This access
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)value;

    long ns = (end.tv_sec - start.tv_sec) * 1000000000L
            + (end.tv_nsec - start.tv_nsec);

    if (ns > 100000) {          // > 100 microseconds
        printf("Likely page fault: %ld ns\n", ns);
    } else if (ns > 100) {      // > 100 nanoseconds
        printf("Likely cache miss: %ld ns\n", ns);
    } else {
        printf("Cache hit: %ld ns\n", ns);
    }
}

// Using mincore to check residency (breaks transparency)
void check_residency(void *addr, size_t length) {
    size_t page_size = sysconf(_SC_PAGESIZE);
    size_t num_pages = (length + page_size - 1) / page_size;
    unsigned char *vec = malloc(num_pages);

    mincore(addr, length, vec);

    for (size_t i = 0; i < num_pages; i++) {
        printf("Page %zu: %s\n", i,
               (vec[i] & 1) ? "resident" : "not resident");
    }
    free(vec);
}
```

The timing difference between resident and non-resident pages can be exploited as a side channel. An attacker might infer information about other processes' memory usage by probing timing. This is one reason why kernel address space layout randomization (KASLR) and other mitigations exist—though perfect timing isolation remains challenging.
Different architectures implement the return-from-exception mechanism differently, though all achieve the same goal:
x86-64: IRETQ
ARM AArch64: ERET
RISC-V: SRET/MRET
| Aspect | x86-64 (IRETQ) | ARM (ERET) | RISC-V (SRET) |
|---|---|---|---|
| Return PC source | Stack (RIP) | ELR_ELn register | sepc CSR |
| Flags/status source | Stack (RFLAGS) | SPSR_ELn register | sstatus CSR (partial) |
| Stack restore | Stack (RSP, SS) | SP_ELn selected | Not automatic |
| Privilege transition | Via CS.RPL | SPSR.M bits | SPP bit in sstatus |
| Atomicity | Single instruction | Single instruction | Single instruction |
RISC-V takes a minimalist approach—SRET only restores PC and privilege mode. Software must save/restore other registers explicitly. ARM and x86 have more automatic state restoration. All approaches work; they just place complexity in different places (hardware vs software).
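That minimalism can be captured in a few lines of C modeling SRET's architectural effect, following the privileged-spec rules for sstatus: restore the PC from sepc, restore privilege from SPP, and pop the one-deep interrupt-enable stack (SIE from SPIE). The struct and field names here are illustrative, not real CSR encodings:

```c
#include <stdint.h>

enum priv { PRIV_U = 0, PRIV_S = 1 };

/* Just the state SRET touches; everything else is software's job */
struct hart {
    uint64_t pc, sepc;   /* sepc holds the faulting instruction's address */
    enum priv priv;      /* current privilege mode */
    int spp;             /* sstatus.SPP: privilege before the trap */
    int sie, spie;       /* sstatus.SIE / sstatus.SPIE */
};

void sret(struct hart *h) {
    h->pc   = h->sepc;   /* return to (and retry) the faulting instruction */
    h->priv = h->spp;    /* restore pre-trap privilege mode */
    h->spp  = PRIV_U;    /* spec: SPP is reset to U after SRET */
    h->sie  = h->spie;   /* restore interrupt-enable */
    h->spie = 1;         /* spec: SPIE is set to 1 after SRET */
}
```

Everything IRETQ does with the stack frame (general registers, user stack pointer) the RISC-V trap handler must do explicitly in software before executing `sret`.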
The instruction restart is the triumphant conclusion of page fault handling—the moment when all the effort pays off and the application continues unaware that anything unusual happened. Let's consolidate everything we've learned:
The Complete Page Fault Journey:
Instruction attempts memory access
↓
TLB miss → Page table walk
↓
Valid bit = 0 → Page fault exception
↓
Trap to kernel, save state
↓
Handler finds VMA, validates access
↓
Locate page content (swap/file/zero)
↓
Allocate frame, load content
↓
Update PTE (valid=1, frame number)
↓
Restore state, IRET to user mode
↓
Instruction retries, succeeds
↓
Application continues, unaware
This cycle happens millions of times per second across all the world's computers, silently enabling the virtual memory abstraction that makes modern computing possible.
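The whole journey above can be compressed into a few lines of simulation: the valid bit is checked, the fault is handled, and the access is retried, with the caller never observing the fault. All names (`read_byte`, `page_fault_handler`, the arrays) are invented for illustration, and frames map 1:1 to pages for simplicity:

```c
#include <stdbool.h>
#include <string.h>

#define NPAGES 4
#define PAGE_SIZE 8

static char disk[NPAGES][PAGE_SIZE];   /* backing store */
static char ram[NPAGES][PAGE_SIZE];    /* physical frames */
static struct { bool valid; int frame; } pte[NPAGES];
static int fault_count;

/* "Kernel": locate content, load it into a frame, update the PTE */
static void page_fault_handler(int page) {
    fault_count++;
    memcpy(ram[page], disk[page], PAGE_SIZE);  /* load from "disk" */
    pte[page].frame = page;                    /* install frame number */
    pte[page].valid = true;                    /* set valid bit */
}

/* "CPU": translate, fault if invalid, and retry the same access */
char read_byte(int page, int off) {
    for (;;) {                                 /* restart loop */
        if (pte[page].valid)
            return ram[pte[page].frame][off];  /* translation succeeds */
        page_fault_handler(page);              /* trap, handle, retry */
    }
}
```

The caller of `read_byte` gets the right byte either way; only `fault_count` (the analogue of the OS's fault statistics) records that anything happened.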
You have now mastered the complete page fault handling lifecycle—from detection through handling to seamless restart. This knowledge is fundamental to understanding operating system behavior, debugging performance issues, and designing memory-efficient systems. The page fault mechanism is one of the most elegant hardware/software partnerships in computer systems: hardware detects and preserves, software analyzes and remedies, and together they create the illusion of unlimited memory.