We've come full circle. The page fault was detected, the trap brought us to the kernel, we located the page on disk, loaded it into a physical frame, and updated the page table. Now comes the culmination of all this effort: restarting the instruction that originally faulted.
This restart must be seamless. The process should have no idea anything unusual happened—from its perspective, the memory access simply worked. The instruction executes, gets the data it expected, and computation continues. Billions of page faults happen across all the world's computers every second, and virtually none are noticed by the applications experiencing them.
This transparency is the ultimate deliverable of virtual memory. This page explores exactly how it's achieved: the hardware mechanisms for returning to user mode, the state restoration that makes the retry possible, and the edge cases that complicate this seemingly simple "just try again" operation.
By the end of this page, you will understand: (1) How the IRET instruction returns from kernel to user mode, (2) Complete state restoration from the trap frame, (3) Why restarting works—the precise exception model, (4) Complex instruction restart challenges, (5) The overall transparency guarantee of page fault handling.
The IRET (Interrupt Return) instruction is the x86 mechanism for returning from an interrupt or exception handler. On x86-64, the instruction is IRETQ (64-bit variant). This single instruction undoes everything the exception entry process did.
What IRET Does:

- Pops the saved RIP, CS, RFLAGS, RSP, and SS from the stack
- Switches the privilege level back to the one encoded in the popped CS (ring 3 for a user-mode fault)
- Resumes execution at the restored RIP, on the restored user stack

The beauty is that IRET is atomic—all these changes happen as an indivisible operation. There's no window where the CPU is half in kernel mode, half in user mode.
```asm
# Returning from page fault handler (x86-64)

.global page_fault_return
page_fault_return:
    # At this point, the C page fault handler has returned
    # Stack contains saved registers from entry

    # Restore general-purpose registers
    popq %rax
    popq %rbx
    popq %rcx
    popq %rdx
    popq %rsi
    popq %rdi
    popq %rbp
    popq %r8
    popq %r9
    popq %r10
    popq %r11
    popq %r12
    popq %r13
    popq %r14
    popq %r15

    # Skip error code pushed by CPU
    addq $8, %rsp

    # Now stack layout is exactly what CPU pushed:
    # [RSP+0]  = RIP (faulting instruction address)
    # [RSP+8]  = CS (code segment, CPL in bits 0:1)
    # [RSP+16] = RFLAGS
    # [RSP+24] = RSP (user stack pointer)
    # [RSP+32] = SS (stack segment)

    # IRETQ pops all of these atomically
    iretq

    # After IRETQ:
    # - CPU is in user mode (ring 3)
    # - RIP points to the faulting instruction
    # - All registers are restored
    # - The instruction executes again
    # - This time, page is mapped, so it succeeds!
```

The saved RIP points to the faulting instruction, not the one after it. This is what makes page faults different from, say, a breakpoint trap. When IRETQ returns, the CPU doesn't continue to the next instruction—it retries the same one. Since we've now mapped the page, the retry succeeds as if nothing happened.
For the instruction to retry successfully, the CPU state must be exactly what it was when the instruction first tried to execute. Let's trace what gets restored:
Hardware-Restored State (by IRETQ):
| Register | Purpose | Why Needed |
|---|---|---|
| RIP | Instruction pointer | Execute same instruction |
| CS | Code segment | Correct privilege, segment |
| RFLAGS | CPU flags | Direction flag, interrupt flag, etc. |
| RSP | Stack pointer | Stack operations work correctly |
| SS | Stack segment | Stack addressing correct |
Software-Restored State (by handler code):
| Registers | Purpose | Why Needed |
|---|---|---|
| RAX-RDX, RSI, RDI | Function arguments, scratch | Instruction operands |
| RBP | Frame pointer | Stack walking |
| R8-R15 | General purpose | Any purpose |
Implicitly Preserved State:
| State | How Preserved | Notes |
|---|---|---|
| Memory contents | Didn't write | Other pages unchanged |
| FPU/SSE state | Saved if used | May be in XSAVE area |
| Segment registers (DS, ES, FS, GS) | Reloaded by OS | Kernel-managed |
```c
// Conceptual illustration of state preservation
#include <stdint.h>
#include <assert.h>

struct saved_state {
    // Hardware-saved (on stack, restored by IRET)
    uint64_t rip;      // Points to MOV from example below
    uint64_t cs;       // User code segment
    uint64_t rflags;   // Flags at time of fault
    uint64_t rsp;      // User stack pointer
    uint64_t ss;       // User stack segment

    // Software-saved (by entry point, restored before IRET)
    uint64_t rax;      // Value in RAX during fault
    uint64_t rbx;      // etc.
    uint64_t rcx;
    uint64_t rdx;
    uint64_t rdi;
    uint64_t rsi;
    uint64_t rbp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
};

/*
 * Example faulting instruction:
 *     MOV RAX, [RBX + RCX * 8]
 *
 * At fault time:
 *     RIP = address of this MOV
 *     RBX = 0x1000 (base address of array)
 *     RCX = 5 (index)
 *     Effective address = 0x1000 + 5*8 = 0x1028
 *
 * Page at 0x1000 is not present → fault
 *
 * After handling:
 *     Page containing 0x1028 is now mapped
 *     RIP still points to MOV
 *     RBX still = 0x1000
 *     RCX still = 5
 *
 * Retry:
 *     MOV calculates same address: 0x1028
 *     Translation succeeds this time
 *     RAX gets the value from memory
 *     Execution continues to next instruction
 */

// The guarantee: no register was changed
void verify_state_preserved(struct saved_state *before,
                            struct saved_state *after) {
    // Everything should be identical
    assert(before->rax == after->rax);
    assert(before->rbx == after->rbx);
    // ... etc for all registers ...

    // RIP points to same instruction
    assert(before->rip == after->rip);

    // The ONLY difference: page is now mapped
    // This is invisible to user code
}
```

Modern processors have extensive floating-point and SIMD state (XMM, YMM, ZMM registers). This state is typically saved lazily—only if the kernel itself uses FPU. If the page fault handler is pure integer code (common), FPU state is never touched and doesn't need restoration. If the kernel does use FPU, it saves state first via XSAVE.
The ability to restart instructions depends on a fundamental CPU property: precise exceptions. This concept is so important that modern processor designs invest significant silicon to ensure it.
What Makes an Exception Precise:
All instructions before the faulting one have completed. Their effects (register writes, memory writes) are fully visible.
The faulting instruction has had no visible effects. Any partial progress is rolled back.
No instructions after the faulting one have effects. Despite speculative execution, nothing is committed past the fault.
Why This Matters:
If exceptions weren't precise, we couldn't restart: the faulting instruction might already have changed a register or memory location, so retrying would apply its effect twice, and instructions after it might have committed results that a retry would recompute. With precise exceptions, re-executing from the saved RIP is always safe.
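The precise-exception contract is even visible from user space on Linux: a SIGSEGV handler can repair the offending mapping and simply return, and the kernel restarts the faulting instruction, which then succeeds. Below is a minimal, Linux-specific sketch of that idea (it assumes 4 KiB pages; `demo` and `fix_mapping` are illustrative names, and calling `mprotect` from a signal handler is a pragmatic Linux idiom rather than strictly portable POSIX):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *region;
static volatile sig_atomic_t faults;

/* On fault: make the page accessible, then return.
 * Returning from the handler restarts the faulting instruction. */
static void fix_mapping(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info; (void)ctx;
    faults++;
    if (mprotect(region, 4096, PROT_READ | PROT_WRITE) != 0)
        _exit(1);   /* can't recover: bail out */
}

int demo(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = fix_mapping;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* An inaccessible page: any access will fault */
    region = mmap(NULL, 4096, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    int *p = (int *)region;
    *p = 42;       /* faults once; handler fixes page; store is retried */
    return *p;     /* the retried store took effect */
}
```

The program works only because the store had no visible effect before the fault was raised: the exact same instruction runs again from a clean state.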
The Implementation Challenge:
Modern CPUs execute instructions out of order and speculatively, often running 50-100 instructions ahead of what has been committed. When a page fault is detected deep in the pipeline, the CPU must: flush every instruction younger than the faulting one, discard all speculative results, roll architectural state back to the precise fault point, and only then report the exception with the faulting instruction's address.
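The rollback can be made concrete with a toy model of in-order commit from a reorder buffer (all names here, such as `rob_entry` and `commit`, are invented for illustration, not any real CPU's microarchitecture). Instructions may finish executing in any order, but their results reach architectural state only at commit, which stops at the first fault:

```c
#include <stdbool.h>

#define ROB_SIZE 8

/* One slot per in-flight instruction */
struct rob_entry {
    bool done;      /* finished executing (possibly out of order) */
    bool faulted;   /* raised an exception */
    int  result;    /* value to write at commit */
};

struct cpu {
    int arch_reg[ROB_SIZE];  /* architectural state (one reg per slot, for simplicity) */
    int committed;           /* how many instructions have committed */
};

/* Commit in program order; stop at the first fault and return its slot.
 * Everything younger than the fault is squashed: its results never
 * become architecturally visible, however "done" it was. */
int commit(struct cpu *cpu, struct rob_entry *rob, int n) {
    for (int i = 0; i < n; i++) {
        if (rob[i].faulted)
            return i;                     /* precise exception point */
        cpu->arch_reg[i] = rob[i].result; /* effect becomes visible */
        cpu->committed++;
    }
    return n;                             /* no fault */
}
```

This is the whole trick: by delaying visibility until in-order commit, the CPU can always present state "as if it stopped just before the fault", no matter how far ahead it speculated.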
| Property | Precise | Imprecise |
|---|---|---|
| Restart possible? | Yes | No |
| State at exception | Exactly as if stopped before fault | May include partial effects |
| Implementation cost | High (rollback logic) | Low |
| Modern CPUs | Standard requirement | Obsolete |
| Page faults | Fully supported | Would break demand paging |
Speculative execution—running instructions before knowing if they'll actually be needed—is what enables high performance but also created the Spectre and Meltdown vulnerabilities. Even though speculative results are discarded on exceptions, they leave traces in caches that can be exploited. The security mitigations for these issues add overhead to page fault handling.
While the restart model is elegant for simple instructions, some complex instructions pose challenges:
1. Multi-Memory-Access Instructions:
An instruction like REP MOVSB (repeated string move) can copy many bytes, accessing memory once per byte. What if it faults in the middle?

Solution: x86 keeps the operation's progress in architectural registers: RCX holds the remaining count, while RSI and RDI hold the current source and destination addresses, all updated after every iteration. If REP MOVSB faults, these registers reflect the work already done, and restarting the instruction continues from where it stopped rather than repeating the entire copy.
2. Instructions with Multiple Destinations:
Some instructions write to multiple locations. What if the second write faults?
Solution: Such instructions are designed to be restartable. They either use temporary internal state or are defined such that partial writes are valid.
3. Stack Operations During Faults:
If PUSH faults while writing to the user stack, there is a seeming chicken-and-egg problem: the CPU must push an exception frame, but pushing is what faulted.

Solution: there is no real circularity, because the exception frame goes on the kernel stack (selected via the TSS), not the user stack that faulted. The truly pathological case, where the kernel stack itself is unusable, is the double-fault scenario; for that, x86-64 provides a dedicated, known-good stack via the Interrupt Stack Table (IST).
```c
// How REP MOVS handles faults

/*
 * Instruction: REP MOVSB
 *
 * Semantics:
 *     while (RCX != 0) {
 *         [RDI] = [RSI];   // Copy one byte
 *         RSI += direction;
 *         RDI += direction;
 *         RCX--;
 *     }
 *
 * Problem: What if fault occurs mid-copy?
 *
 * Solution: RSI, RDI, RCX are updated each iteration.
 * When fault occurs:
 *     - RCX = remaining count
 *     - RSI = next source address
 *     - RDI = next destination address
 *
 * Restart: REP MOVSB resumes from current RSI/RDI/RCX
 * No work is repeated; no work is skipped.
 */

// Example trace:
// Initial: RSI=src, RDI=dst, RCX=1000
// Copy 500 bytes successfully
// Fault on byte 501 (destination page not present)
// At fault: RSI=src+500, RDI=dst+500, RCX=500
// Handle fault: map dst+500 page
// Restart: REP MOVSB with RSI=src+500, RDI=dst+500, RCX=500
// Copies remaining 500 bytes
// Total effect: all 1000 bytes copied

/*
 * Compare: CISC vs RISC
 *
 * CISC (x86):  Has complex instructions like REP MOVS
 *              Need elaborate restart logic
 *              Architecture ensures restartability
 *
 * RISC (ARM, RISC-V): Instructions are simple
 *              Each instruction does one memory access
 *              Restartability is trivial
 *              String copies are loops of simple loads/stores
 */

// Pseudo-code for what the CPU does internally
void handle_fault_in_rep_movs(FaultState *state) {
    // CPU has already updated RSI, RDI, RCX to reflect progress
    // The saved state in the interrupt frame has current values

    // When we IRET:
    // - RIP points to the REP MOVSB instruction
    // - RSI, RDI, RCX reflect progress made
    // - REP MOVSB will continue from where it left off
}
```

RISC architectures sidestep most of these complications by having simple instructions that access memory at most once. A string copy is just a loop of load/store pairs, each trivially restartable. This is one reason RISC architectures are easier to implement with precise exceptions.
The return to user mode involves a privilege transition—from ring 0 to ring 3. This transition is just as carefully controlled as the entry transition.
Security Considerations:
The OS controls everything about the return. User code cannot forge a return to kernel mode.
IRET validates the CS selector. If malicious code somehow corrupted the stack, IRET won't jump to arbitrary addresses with kernel privilege.
RFLAGS is sanitized. Certain dangerous flags (IOPL, VM) are checked and restricted.
What Changes During Return:
| Aspect | Before (Kernel) | After (User) |
|---|---|---|
| CPL | 0 (ring 0) | 3 (ring 3) |
| Accessible memory | All | User pages only |
| Privileged ops | Allowed | Trap |
| Interrupts | May be disabled | Enabled |
| Stack | Kernel stack | User stack |
```c
// Security aspects of returning to user mode

/*
 * IRET performs implicit security checks:
 *
 * 1. CS.RPL check
 *    - RPL (Requested Privilege Level) in CS must match target ring
 *    - Returning to ring 3: CS.RPL must be 3
 *    - Can't return to kernel (ring 0) from user-originated exception
 *
 * 2. Segment validity
 *    - CS must reference a valid code segment
 *    - Segment must be present
 *    - Segment must be executable
 *
 * 3. RFLAGS sanitization
 *    - IOPL (I/O Privilege Level) can only be raised by ring 0
 *    - VM flag (virtual 8086 mode) is restricted
 *    - IF (interrupt flag) behavior varies
 */

// What the kernel ensures before IRET
void prepare_return_to_usermode(struct pt_regs *regs) {
    // Ensure CS has user ring (RPL = 3)
    regs->cs = USER_CS | 3;          // USER_CS selector with RPL=3

    // Ensure SS has user ring
    regs->ss = USER_DS | 3;

    // Sanitize flags
    regs->flags &= FLAG_MASK_USER;   // Clear dangerous bits
    regs->flags |= FLAG_IF;          // Ensure interrupts will be enabled

    // SMAP/SMEP: CPU features that trap kernel access to user memory
    // These are automatically re-enabled on return to user mode
}

// After IRETQ completes:
// - Kernel stack is now unused (until next entry)
// - User stack (restored RSP) is active
// - User cannot access kernel memory
// - User code resumes at saved RIP

/*
 * If an attacker corrupted the stack:
 *
 * Scenario: Try to return to kernel address with ring 0
 *
 *     fake frame: RIP = kernel_function
 *                 CS  = USER_CS | 0    // Try CPL 0
 *
 * Result: CPU rejects this
 *     - CS.RPL (0) != actual CPL requested
 *     - Would need descriptor with DPL 0
 *     - User can't access kernel descriptors
 *     → General Protection Fault (#GP)
 *     → Kernel handles GP, kills malicious process
 */
```

Supervisor Mode Access Prevention (SMAP) and Supervisor Mode Execution Prevention (SMEP) are CPU features that prevent the kernel from accidentally reading/writing/executing user memory. They're temporarily disabled during intentional user memory access (copy_from_user) but automatically re-enabled on return to user mode, adding defense against kernel vulnerabilities.
Let's trace through exactly what happens when the faulting instruction retries:
Cycle-by-Cycle (simplified):
1. IRETQ completes: CPU is now in user mode, RIP points to MOV RAX, [RBX]
2. Instruction fetch: CPU fetches the MOV instruction (same instruction that faulted)
3. Decode: CPU decodes: load memory at address in RBX into RAX
4. Address generation: CPU computes effective address: value of RBX = 0x7FFF1000
5. TLB lookup: CPU checks TLB for 0x7FFF1000... TLB MISS
6. Page table walk: Hardware walker traverses page table... finds PTE with valid=1, frame=0x12345
7. TLB fill: New entry cached in TLB: VA 0x7FFF1000 → PA 0x12345000
8. Physical access: CPU accesses physical address 0x12345000 + offset
9. Data returned: The byte(s) at that location come back from cache/memory
10. Writeback: CPU writes the value into RAX
11. Retire: Instruction completes, RIP advances to next instruction
The process has no visibility into steps 6-7 (the page table walk that finds our newly-installed mapping). From the process's view, the memory access just took a bit longer than usual.
| Aspect | First Attempt | Retry |
|---|---|---|
| PTE valid bit | 0 (not present) | 1 (present) |
| PTE frame number | Undefined/swap entry | Physical frame number |
| Physical frame | Not allocated | Allocated, contains data |
| TLB entry | None (TLB miss) | None initially, then filled |
| Instruction behavior | Fault → trap | Complete normally |
The retry will TLB miss because we deliberately don't pre-fill the TLB with the new mapping. The TLB miss triggers a page table walk, which finds our newly-installed PTE and fills the TLB. Subsequent accesses to this page will TLB hit. This is the normal, expected behavior.
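The miss-then-fill behavior can be sketched with a one-entry software TLB. This is a toy model, not real MMU code: `translate`, `walks`, and the structures are invented for illustration. A fault is a failed walk; once the PTE is installed, the next access misses the TLB, walks, fills, and subsequent accesses hit:

```c
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 16

struct pte { bool valid; uint64_t frame; };
struct tlb_entry { bool valid; uint64_t vpn, frame; };

static struct pte page_table[NPAGES];
static struct tlb_entry tlb;     /* a one-entry "TLB" */
static int walks;                /* counts page-table walks */

/* Translate a virtual page number; returns false on page fault. */
bool translate(uint64_t vpn, uint64_t *frame) {
    if (tlb.valid && tlb.vpn == vpn) {   /* TLB hit: no walk */
        *frame = tlb.frame;
        return true;
    }
    walks++;                             /* TLB miss: hardware walk */
    if (!page_table[vpn].valid)
        return false;                    /* valid bit clear: page fault */
    tlb.valid = true;                    /* TLB fill */
    tlb.vpn = vpn;
    tlb.frame = page_table[vpn].frame;
    *frame = tlb.frame;
    return true;
}
```

The sequence in the table above maps directly onto this model: first attempt walks and faults, the handler sets `page_table[vpn]`, the retry walks again and fills the TLB, and every later access hits.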
While the basic restart mechanism is elegant, several edge cases require special handling:
1. Multiple Faults on Same Instruction:
An instruction might touch two pages, both absent:

MOVSB        ; reads [RSI], writes [RDI], possibly in different unmapped pages

First fault: the source page is missing. Handle, restart. Second fault: the destination page is missing. Handle, restart. Third attempt: both pages are present, and the instruction completes. (x86 has no general memory-to-memory MOV; the string instructions are the usual way a single instruction ends up with two memory operands.)
2. Fault During Instruction Fetch:
The instruction itself might be on a non-present page. The handler can't examine the faulting opcode (it isn't in memory), but it doesn't need to: the faulting address (CR2) equals the saved RIP, and code pages are demand-paged from the executable file exactly like data pages, so the same handling applies.
3. Signals Pending:
If a signal arrived during page fault handling (e.g., SIGTERM), should we deliver the signal first or restart the instruction? Linux delivers pending signals on the return path to user mode, after the fault is resolved but before the instruction retries; if the signal has a userspace handler, the faulting instruction runs again once that handler returns.
```c
// Handling multiple faults per instruction

/*
 * Instruction: MOVS (MOV String)
 * Source: [RSI]
 * Destination: [RDI]
 * May fault on either or both
 *
 * Execution trace:
 * 1. Attempt to read [RSI]
 *    - Source page not present → fault
 *    - Handle: load source page
 *    - Restart MOVS
 *
 * 2. Read [RSI] succeeds
 *    Attempt to write [RDI]
 *    - Dest page not present → fault
 *    - Handle: load dest page
 *    - Restart MOVS
 *
 * 3. Read [RSI] succeeds (maybe TLB hit now)
 *    Write [RDI] succeeds
 *    MOVS completes
 */

// Signal handling interacts with page faults
int resume_user_or_handle_signal(struct pt_regs *regs) {
    // Page fault handling is complete
    // Before returning to user mode, check for pending signals

    if (signal_pending(current)) {
        // A signal arrived during fault handling
        // Don't restart faulting instruction yet
        // Instead, set up signal handler frame
        struct ksignal ksig;
        get_signal(&ksig);
        handle_signal(&ksig, regs);

        // Signal handler will eventually return
        // Then faulting instruction will retry
        // (unless signal was fatal)
    }

    // Return to user mode
    return 0;
}

// Fault during instruction fetch
// The instruction bytes themselves might be on a not-present page
int handle_instruction_fetch_fault(unsigned long address,
                                   struct pt_regs *regs) {
    // Special case: can't examine the opcode (it's not present!)
    // But we don't need to - same handling as a data page fault
    //
    // address == regs->rip (faulting instruction address)
    // VMA should be executable

    struct vm_area_struct *vma = find_vma(current->mm, address);
    if (!(vma->vm_flags & VM_EXEC)) {
        // Code page must be executable
        // If it's not, this is definitely wrong
        return VM_FAULT_SIGSEGV;
    }

    // Handle like a normal file-backed fault
    // (the code lives in the executable file)
    return filemap_fault(vmf);   // conceptual: vmf describes this fault
}
```

A pathological case: what if page fault handling itself requires pages that keep faulting? If swap is full, reclaim can't make progress, and new page allocations fail. This leads to livelock or OOM. The kernel has watchdogs and maximum retry counts to detect these situations and invoke the OOM killer.
The fundamental promise of virtual memory is transparency: applications should not be able to distinguish between memory that's physically present and memory that's demand-paged from disk. Let's examine how completely this promise is kept.
What Applications Cannot Observe:

Functionally, nothing. Register values, memory contents, and control flow are identical whether a page was resident all along or demand-paged in mid-access. No flag, register, or memory location changes to record that a fault occurred.

What Applications CAN Observe:

Timing: a faulting access takes microseconds to milliseconds instead of nanoseconds. The OS also exposes residency deliberately: mincore() reports which pages are resident, getrusage() reports fault counts, and mlock() pins pages to influence residency.

The Functional Guarantee:
| Aspect | Guarantee |
|---|---|
| Correctness | Absolutely guaranteed - computations produce same results |
| Atomicity | Memory operations have same semantics |
| Ordering | Memory ordering rules preserved |
| Timing | NOT guaranteed - variable latency |
| Performance | NOT guaranteed - depends on residency |
```c
// Demonstration of transparency
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

// This function cannot tell if pages are demand-paged
int sum_array(int *array, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += array[i];   // May page fault, but we can't tell
    }
    return sum;            // Correct result regardless
}

// This function CAN detect timing differences
void measure_access_time(void *ptr) {
    struct timespec start, end;
    volatile int value;

    clock_gettime(CLOCK_MONOTONIC, &start);
    value = *(int *)ptr;   // This access
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)value;

    long ns = (end.tv_sec - start.tv_sec) * 1000000000L
            + (end.tv_nsec - start.tv_nsec);

    if (ns > 100000) {          // > 100 microseconds
        printf("Likely page fault: %ld ns\n", ns);
    } else if (ns > 100) {      // > 100 nanoseconds
        printf("Likely cache miss: %ld ns\n", ns);
    } else {
        printf("Cache hit: %ld ns\n", ns);
    }
}

// Using mincore to check residency (breaks transparency)
void check_residency(void *addr, size_t length) {
    size_t page_size = sysconf(_SC_PAGESIZE);
    size_t num_pages = (length + page_size - 1) / page_size;
    unsigned char *vec = malloc(num_pages);

    mincore(addr, length, vec);

    for (size_t i = 0; i < num_pages; i++) {
        printf("Page %zu: %s\n", i,
               (vec[i] & 1) ? "resident" : "not resident");
    }
    free(vec);
}
```

The timing difference between resident and non-resident pages can be exploited as a side channel. An attacker might infer information about other processes' memory usage by probing timing. This is one reason why kernel address space layout randomization (KASLR) and other mitigations exist—though perfect timing isolation remains challenging.
Different architectures implement the return-from-exception mechanism differently, though all achieve the same goal:
x86-64: IRETQ
ARM AArch64: ERET
RISC-V: SRET/MRET
| Aspect | x86-64 (IRETQ) | ARM (ERET) | RISC-V (SRET) |
|---|---|---|---|
| Return PC source | Stack (RIP) | ELR_ELn register | sepc CSR |
| Flags/status source | Stack (RFLAGS) | SPSR_ELn register | sstatus CSR (partial) |
| Stack restore | Stack (RSP, SS) | SP_ELn selected | Not automatic |
| Privilege transition | Via CS.RPL | SPSR.M bits | SPP bit in sstatus |
| Atomicity | Single instruction | Single instruction | Single instruction |
RISC-V takes a minimalist approach—SRET only restores PC and privilege mode. Software must save/restore other registers explicitly. ARM and x86 have more automatic state restoration. All approaches work; they just place complexity in different places (hardware vs software).
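That minimalism can be captured in a few lines of C modeling SRET's architectural effect, following the privileged-spec rules for sstatus: restore the PC from sepc, restore privilege from SPP, and pop the one-deep interrupt-enable stack (SIE from SPIE). The struct and field names here are illustrative, not real CSR encodings:

```c
#include <stdint.h>

enum priv { PRIV_U = 0, PRIV_S = 1 };

/* Just the state SRET touches; everything else is software's job */
struct hart {
    uint64_t pc, sepc;   /* sepc holds the faulting instruction's address */
    enum priv priv;      /* current privilege mode */
    int spp;             /* sstatus.SPP: privilege before the trap */
    int sie, spie;       /* sstatus.SIE / sstatus.SPIE */
};

void sret(struct hart *h) {
    h->pc   = h->sepc;   /* return to (and retry) the faulting instruction */
    h->priv = h->spp;    /* restore pre-trap privilege mode */
    h->spp  = PRIV_U;    /* spec: SPP is reset to U after SRET */
    h->sie  = h->spie;   /* restore interrupt-enable */
    h->spie = 1;         /* spec: SPIE is set to 1 after SRET */
}
```

Everything IRETQ does with the stack frame (general registers, user stack pointer) the RISC-V trap handler must do explicitly in software before executing `sret`.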
The instruction restart is the triumphant conclusion of page fault handling—the moment when all the effort pays off and the application continues unaware that anything unusual happened. Let's consolidate everything we've learned:
The Complete Page Fault Journey:
Instruction attempts memory access
↓
TLB miss → Page table walk
↓
Valid bit = 0 → Page fault exception
↓
Trap to kernel, save state
↓
Handler finds VMA, validates access
↓
Locate page content (swap/file/zero)
↓
Allocate frame, load content
↓
Update PTE (valid=1, frame number)
↓
Restore state, IRET to user mode
↓
Instruction retries, succeeds
↓
Application continues, unaware
This cycle happens millions of times per second across all the world's computers, silently enabling the virtual memory abstraction that makes modern computing possible.
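The whole journey above can be compressed into a few lines of simulation: the valid bit is checked, the fault is handled, and the access is retried, with the caller never observing the fault. All names (`read_byte`, `page_fault_handler`, the arrays) are invented for illustration, and frames map 1:1 to pages for simplicity:

```c
#include <stdbool.h>
#include <string.h>

#define NPAGES 4
#define PAGE_SIZE 8

static char disk[NPAGES][PAGE_SIZE];   /* backing store */
static char ram[NPAGES][PAGE_SIZE];    /* physical frames */
static struct { bool valid; int frame; } pte[NPAGES];
static int fault_count;

/* "Kernel": locate content, load it into a frame, update the PTE */
static void page_fault_handler(int page) {
    fault_count++;
    memcpy(ram[page], disk[page], PAGE_SIZE);  /* load from "disk" */
    pte[page].frame = page;                    /* install frame number */
    pte[page].valid = true;                    /* set valid bit */
}

/* "CPU": translate, fault if invalid, and retry the same access */
char read_byte(int page, int off) {
    for (;;) {                                 /* restart loop */
        if (pte[page].valid)
            return ram[pte[page].frame][off];  /* translation succeeds */
        page_fault_handler(page);              /* trap, handle, retry */
    }
}
```

The caller of `read_byte` gets the right byte either way; only `fault_count` (the analogue of the OS's fault statistics) records that anything happened.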
You have now mastered the complete page fault handling lifecycle—from detection through handling to seamless restart. This knowledge is fundamental to understanding operating system behavior, debugging performance issues, and designing memory-efficient systems. The page fault mechanism is one of the most elegant hardware/software partnerships in computer systems: hardware detects and preserves, software analyzes and remedies, and together they create the illusion of unlimited memory.