The instant the MMU detects an invalid page access, a remarkable transformation occurs. The CPU, which was happily executing user code at high speed, must immediately and safely transfer control to the operating system kernel. This transfer—called a trap—is one of the most critical mechanisms in computer architecture.
The trap mechanism must satisfy seemingly contradictory requirements:
This page explores the trap mechanism in exhaustive detail. You'll understand exactly what happens in the nanoseconds between fault detection and the first instruction of the page fault handler, and why this mechanism is fundamental to protected, multi-tasking operating systems.
By the end of this page, you will understand: (1) The trap mechanism and how it differs from other control transfers, (2) CPU state that must be saved during a trap, (3) The role of the Interrupt Descriptor Table (IDT) and exception vectors, (4) How the CPU switches to kernel mode and kernel stack, (5) The initial actions taken by the page fault handler entry point.
A trap is a synchronous exception, raised by the instruction the CPU is currently executing, that transfers control to the operating system. It differs from other control flow mechanisms:
Comparisons:
| Mechanism | Trigger | Timing | Return |
|---|---|---|---|
| Function call | Explicit call instruction | Synchronous | Returns to caller |
| Interrupt | External device signal | Asynchronous | Returns to interrupted instruction |
| Trap | Internal condition (syscall, fault) | Synchronous | May return to same instruction |
| Abort | Unrecoverable error | Synchronous | Does not return |
Key characteristics of page fault traps:
Synchronous: The trap occurs as a direct result of the executing instruction, not some external event.
Precise: The processor state when the trap is taken corresponds exactly to having stopped before the faulting instruction completed.
Restartable: The faulting instruction can be re-executed after the fault is handled.
Privileged Transition: Control passes from user mode (ring 3) to kernel mode (ring 0) with elevated privileges.
| Category | Examples | Behavior | Use in Page Faults |
|---|---|---|---|
| Fault | Page fault, Divide by zero | Return to faulting instruction | This is a fault — instruction is restarted after handling |
| Trap | INT 3 (breakpoint), syscall | Return to next instruction | Not used for page faults |
| Abort | Machine check, Double fault | Cannot reliably return | Only if page fault handler itself faults |
Confusingly, 'trap' is used in two different ways: (1) As a general term for any exception that transfers control to the OS, and (2) As a specific exception category where the saved instruction pointer points to the next instruction. Page faults are technically 'faults' (returning to the same instruction) but the mechanism is commonly called 'trapping' to the OS.
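The difference between the categories comes down to which instruction pointer the CPU saves. The following is a minimal sketch of that rule, not a model of any real CPU interface: the enum, the function name `saved_rip`, and the use of 0 as a sentinel for aborts are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative exception categories (names are hypothetical). */
enum ExceptionClass { EXC_FAULT, EXC_TRAP, EXC_ABORT };

/*
 * Returns the instruction pointer that would be saved for an exception
 * raised by the instruction at `rip`, which is `insn_len` bytes long.
 * Aborts have no reliable return address; 0 is used here as a sentinel.
 */
static uint64_t saved_rip(enum ExceptionClass cls, uint64_t rip, uint8_t insn_len)
{
    assert(insn_len >= 1 && insn_len <= 15); /* x86 instructions are 1..15 bytes */
    switch (cls) {
    case EXC_FAULT: return rip;             /* re-execute the same instruction */
    case EXC_TRAP:  return rip + insn_len;  /* continue with the next one */
    case EXC_ABORT: return 0;               /* no meaningful return address */
    }
    return 0;
}
```

For a page fault at `0x401000`, `saved_rip(EXC_FAULT, 0x401000, 3)` yields `0x401000`, so the access is retried once the page is resident. A breakpoint (`INT 3`, one byte) at the same address would resume at `0x401001`.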
When a page fault occurs, the CPU must preserve enough state to later resume the faulting process as if nothing happened. The first part of this preservation is performed entirely by hardware: the handler cannot save the interrupted RIP, RSP, or RFLAGS itself, because merely starting to run the handler already overwrites them.
State Saved Automatically by Hardware:
On x86-64, the CPU pushes the following onto the kernel stack when a page fault occurs:
+------------------+
| SS | Stack Segment (always pushed in 64-bit mode)
+------------------+
| RSP | Stack Pointer (always pushed in 64-bit mode)
+------------------+
| RFLAGS | CPU flags register
+------------------+
| CS | Code Segment
+------------------+
| RIP | Instruction Pointer (address of faulting instruction)
+------------------+
| Error Code | Page fault specific info ← Top of stack
+------------------+
This layout is dictated by the processor architecture and cannot be changed.
State NOT Saved Automatically:
General-purpose registers (RAX, RBX, RCX, etc.) are not saved by hardware. The page fault handler must save them if it needs to preserve them. This is typically done immediately upon handler entry.

    #include <stdint.h>
    #include <stdio.h>

    // x86-64 Exception Stack Frame
    // This structure matches what the CPU pushes on page fault
    struct ExceptionFrame {
        // Error code (pushed by CPU for certain exceptions including page faults)
        uint64_t error_code;

        // These are pushed by CPU for all exceptions
        uint64_t rip;     // Instruction pointer - points to faulting instruction
        uint64_t cs;      // Code segment selector
        uint64_t rflags;  // CPU flags (interrupt flag, direction flag, etc.)
        uint64_t rsp;     // Stack pointer (from user mode)
        uint64_t ss;      // Stack segment selector
    };

    // Page Fault Error Code Bits
    #define PF_PRESENT  (1 << 0)  // 0 = not-present page, 1 = protection violation
    #define PF_WRITE    (1 << 1)  // 0 = read access, 1 = write access
    #define PF_USER     (1 << 2)  // 0 = supervisor mode, 1 = user mode
    #define PF_RESERVED (1 << 3)  // 1 = reserved bit set in page table entry
    #define PF_INSTR    (1 << 4)  // 1 = instruction fetch (NX violation)

    // Example: Decoding the error code
    void decode_page_fault_error(uint64_t error_code) {
        printf("Page Fault Analysis:\n");
        printf("  %s page\n",
               (error_code & PF_PRESENT) ? "Protection violation on present" : "Non-present");
        printf("  %s access\n", (error_code & PF_WRITE) ? "Write" : "Read");
        printf("  %s mode\n", (error_code & PF_USER) ? "User" : "Supervisor");
        if (error_code & PF_RESERVED) printf("  Reserved bit violation\n");
        if (error_code & PF_INSTR) printf("  Instruction fetch (NX violation)\n");
    }

The saved RIP points to the instruction that caused the fault, not the next instruction. This is essential for page fault handling: after the OS loads the page into memory, the CPU will retry the same instruction and this time it will succeed. This 'retry semantics' is what makes page faults transparent to the application.
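The retry semantics described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the "machine" is a four-page array of present bits, each "instruction" touches exactly one page, and every name here (`SimMachine`, `sim_run`, `sim_handle_fault`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PAGES 4

/* Toy machine: a present bit per page, a program counter, a fault counter. */
struct SimMachine {
    bool     present[SIM_PAGES];
    size_t   pc;       /* index of the next instruction to execute */
    unsigned faults;   /* number of page faults taken so far */
};

/* Toy handler: "loads" the page by marking it present. */
static void sim_handle_fault(struct SimMachine *m, size_t page)
{
    m->present[page] = true;
    m->faults++;
}

/*
 * Runs a program in which instruction i touches page program[i].
 * On a fault the pc is NOT advanced, so the same instruction is retried
 * after the handler returns. Returns the number of instructions retired.
 */
static size_t sim_run(struct SimMachine *m, const size_t *program, size_t len)
{
    size_t retired = 0;
    while (m->pc < len) {
        size_t page = program[m->pc];
        assert(page < SIM_PAGES);
        if (!m->present[page]) {
            sim_handle_fault(m, page);
            continue;          /* saved pc unchanged: retry the access */
        }
        m->pc++;               /* access succeeded: instruction retires */
        retired++;
    }
    return retired;
}
```

Running the program `{0, 1, 0, 2}` on a machine with no resident pages takes three faults but still retires exactly four instructions. From the program's point of view nothing unusual happened, which is the transparency property the text describes.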
How does the CPU know where to transfer control when a page fault occurs? The answer is the Interrupt Descriptor Table (IDT)—a table of 256 entries established by the OS at boot time.
IDT Structure:
Each IDT entry (called a 'gate descriptor') contains:
Page Fault Vector:
Page faults are exception number 14 (0x0E). When a page fault occurs, the CPU:
The LIDT Instruction:
The OS tells the CPU where the IDT is located using the LIDT (Load IDT Register) instruction. This instruction is privileged—only kernel code can execute it.

    #include <stdint.h>

    // Kernel code segment selector (example value; depends on the GDT layout)
    #define KERNEL_CS 0x08

    // Assembly entry point, defined elsewhere
    extern void page_fault_handler_entry(void);

    // IDT Gate Descriptor (x86-64 format)
    struct IDTGateDescriptor {
        uint16_t offset_low;   // Handler address, bits 0-15
        uint16_t segment;      // Code segment selector (typically kernel CS)
        uint8_t  ist;          // Interrupt Stack Table index (0 = no IST)
        uint8_t  type_attr;    // Gate type and attributes
        uint16_t offset_mid;   // Handler address, bits 16-31
        uint32_t offset_high;  // Handler address, bits 32-63
        uint32_t reserved;     // Reserved, must be zero
    } __attribute__((packed));

    // Gate types
    #define IDT_INTERRUPT_GATE 0x8E  // Present, DPL=0, 64-bit interrupt gate
    #define IDT_TRAP_GATE      0x8F  // Present, DPL=0, 64-bit trap gate

    // Exception vector numbers
    #define VECTOR_DIVIDE_ERROR       0
    #define VECTOR_DEBUG              1
    #define VECTOR_NMI                2
    #define VECTOR_BREAKPOINT         3
    #define VECTOR_GENERAL_PROTECTION 13
    #define VECTOR_PAGE_FAULT         14  // <-- Page fault handler lives here

    // IDT Register structure
    struct IDTRegister {
        uint16_t limit;  // Size of IDT - 1
        uint64_t base;   // Linear address of IDT
    } __attribute__((packed));

    static struct IDTGateDescriptor idt[256];
    static struct IDTRegister idtr;

    // Set up a single IDT entry
    void set_idt_gate(int vector, void (*handler)(void), uint8_t type) {
        uint64_t addr = (uint64_t)handler;
        idt[vector].offset_low  = addr & 0xFFFF;
        idt[vector].offset_mid  = (addr >> 16) & 0xFFFF;
        idt[vector].offset_high = (addr >> 32) & 0xFFFFFFFF;
        idt[vector].segment     = KERNEL_CS;  // Kernel code segment
        idt[vector].ist         = 0;          // No IST
        idt[vector].type_attr   = type;
        idt[vector].reserved    = 0;
    }

    // Initialize the IDT
    void init_idt(void) {
        // Set up exception handlers
        set_idt_gate(VECTOR_PAGE_FAULT, page_fault_handler_entry, IDT_INTERRUPT_GATE);
        // ... other exception handlers ...

        // Load the IDT
        idtr.limit = sizeof(idt) - 1;
        idtr.base  = (uint64_t)&idt;
        asm volatile("lidt %0" : : "m"(idtr));
    }

Page fault handlers typically use an 'interrupt gate' rather than a 'trap gate'. The difference: interrupt gates automatically disable interrupts (clear the IF flag) upon entry, preventing nested interrupts. This is important because the page fault handler must perform atomic operations on kernel data structures during early handling before it can safely re-enable interrupts.
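To make the vector lookup concrete, here is a small sketch of the arithmetic involved: locating the gate for a vector, checking it against the IDT limit, and reassembling the 64-bit handler address from the three offset fields. The struct mirrors the descriptor layout shown above; the names `Gate64`, `gate_offset`, `vector_in_limit`, and `gate_handler_address` are hypothetical, and the limit check is a simplified model of the hardware's behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GATE_SIZE 16u  /* bytes per gate descriptor in 64-bit mode */

struct Gate64 {
    uint16_t offset_low;   /* handler address bits 0-15  */
    uint16_t segment;
    uint8_t  ist;
    uint8_t  type_attr;
    uint16_t offset_mid;   /* handler address bits 16-31 */
    uint32_t offset_high;  /* handler address bits 32-63 */
    uint32_t reserved;
} __attribute__((packed));

_Static_assert(sizeof(struct Gate64) == GATE_SIZE, "gate must be 16 bytes");

/* Byte offset of a vector's gate from the IDT base address. */
static uint32_t gate_offset(uint8_t vector)
{
    return (uint32_t)vector * GATE_SIZE;
}

/* The limit holds (size - 1); the whole descriptor must lie within it. */
static bool vector_in_limit(uint16_t limit, uint8_t vector)
{
    return gate_offset(vector) + (GATE_SIZE - 1) <= limit;
}

/* Reassemble the handler entry address from the split offset fields. */
static uint64_t gate_handler_address(const struct Gate64 *g)
{
    assert(g != NULL);
    return (uint64_t)g->offset_low
         | ((uint64_t)g->offset_mid  << 16)
         | ((uint64_t)g->offset_high << 32);
}
```

For the page fault vector, `gate_offset(14)` is 224, so the CPU reads the 16 bytes at IDT base + 224. With a full 256-entry table the limit is 4095, which is exactly what `sizeof(idt) - 1` produces in `init_idt`.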
One of the most critical aspects of the trap mechanism is stack switching. When a page fault occurs in user mode, the CPU cannot continue using the user stack—it's untrusted and potentially compromised. The CPU must switch to a kernel stack.
Why Stack Switching is Essential:
How Stack Switching Works (x86-64):
The CPU uses the Task State Segment (TSS) to find the kernel stack. Each CPU core has a TSS that contains:
RSP0: The kernel stack pointer for privilege level 0
RSP1, RSP2: Stack pointers for levels 1 and 2 (rarely used in modern OS)
IST1-IST7: Interrupt Stack Table pointers for special exceptions
When a page fault occurs in user mode, the CPU loads the new stack pointer from RSP0 in the current TSS.
The kernel stack is already set up by the OS when the process was scheduled. Each thread typically has its own kernel stack.

    // Task State Segment (x86-64 format, relevant portions)
    struct TaskStateSegment {
        uint32_t reserved0;

        // Stack pointers for privilege level transitions
        uint64_t rsp0;  // Kernel stack pointer (used on page faults from user mode)
        uint64_t rsp1;  // Not used in modern OS
        uint64_t rsp2;  // Not used in modern OS
        uint64_t reserved1;

        // Interrupt Stack Table pointers (for special exceptions)
        uint64_t ist1;  // E.g., double fault stack
        uint64_t ist2;  // E.g., NMI stack
        uint64_t ist3;
        uint64_t ist4;
        uint64_t ist5;
        uint64_t ist6;
        uint64_t ist7;
        uint64_t reserved2;

        uint16_t reserved3;
        uint16_t io_map_base;
    } __attribute__((packed));

    static struct TaskStateSegment tss[MAX_CPUS];

    // Set up the kernel stack for the current CPU
    void set_kernel_stack(int cpu_id, void *stack_top) {
        tss[cpu_id].rsp0 = (uint64_t)stack_top;
    }

    // During context switch, update the TSS with new thread's kernel stack
    void context_switch_to(Thread *new_thread) {
        int cpu = get_current_cpu();

        // Set up kernel stack for new thread
        // If new thread triggers page fault, CPU will use this stack
        set_kernel_stack(cpu, new_thread->kernel_stack_top);

        // ... rest of context switch ...
    }

    // Each thread has its own kernel stack
    #define KERNEL_STACK_SIZE 16384  // 16 KB typical

    Thread *create_thread(void (*entry)(void)) {
        Thread *t = allocate_thread_struct();

        // Allocate kernel stack for this thread
        t->kernel_stack = allocate_pages(KERNEL_STACK_SIZE / PAGE_SIZE);
        t->kernel_stack_top = t->kernel_stack + KERNEL_STACK_SIZE;

        return t;
    }

Each thread needs its own kernel stack because a thread might be in the middle of a system call when it's preempted. The kernel stack holds the return path back to user space. With per-thread kernel stacks, preemption and resumption work correctly even when threads are executing kernel code.
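The stack choice the CPU makes when delivering an exception can be summarized in a few lines. The sketch below is a simplified model, assuming 64-bit mode: a non-zero IST index in the gate wins, otherwise a privilege change selects RSP0, otherwise the current stack is kept; the chosen pointer is then aligned down to 16 bytes before the frame is pushed. `MiniTss`, `select_stack_top`, and `rsp_at_handler_entry` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reduced TSS holding only the fields relevant to stack selection. */
struct MiniTss {
    uint64_t rsp0;
    uint64_t ist[7];   /* ist[0] corresponds to IST1, ist[6] to IST7 */
};

/* SS, RSP, RFLAGS, CS, RIP plus the page fault error code: 6 x 8 bytes. */
#define FRAME_BYTES_WITH_ERROR_CODE 48u

/* Returns the (16-byte aligned) address from which the CPU starts pushing. */
static uint64_t select_stack_top(const struct MiniTss *tss, uint8_t gate_ist,
                                 bool from_user, uint64_t current_rsp)
{
    assert(gate_ist <= 7);
    uint64_t top;
    if (gate_ist != 0)
        top = tss->ist[gate_ist - 1];   /* dedicated stack requested by the gate */
    else if (from_user)
        top = tss->rsp0;                /* ring 3 -> ring 0: switch to kernel stack */
    else
        top = current_rsp;              /* already in kernel: keep current stack */
    return top & ~(uint64_t)0xF;        /* 64-bit mode aligns RSP before pushing */
}

/* Value of RSP when the first handler instruction runs (page fault case). */
static uint64_t rsp_at_handler_entry(uint64_t stack_top)
{
    return stack_top - FRAME_BYTES_WITH_ERROR_CODE;
}
```

With `rsp0 = 0xFFFF800000010000` and a gate whose IST index is 0, a fault from user mode starts pushing at `rsp0` and enters the handler with RSP at `0xFFFF80000000FFD0`. A fault taken while already in the kernel simply continues on the aligned current stack.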
The transition from user mode to kernel mode involves changing the CPU's privilege level. On x86, this is represented by the Current Privilege Level (CPL), stored in the low two bits of the CS register.
Privilege Levels (Rings):
| Ring | CPL | Usage | Capabilities |
|---|---|---|---|
| Ring 0 | 0 | Kernel | Full access to all instructions and memory |
| Ring 1 | 1 | (Historical) | Device drivers (rarely used today) |
| Ring 2 | 2 | (Historical) | Device drivers (rarely used today) |
| Ring 3 | 3 | User Mode | Restricted access, no privileged instructions |
What Changes During Privilege Transition:
Hardware Enforcement:
The privilege transition is enforced entirely by hardware. User code cannot:
The only way to enter kernel mode is through designated entry points in the IDT.
| Aspect | Before (User Mode) | After (Kernel Mode) |
|---|---|---|
| CPL | 3 (user) | 0 (supervisor) |
| Code Segment | User CS (e.g., 0x2B) | Kernel CS (e.g., 0x08) |
| Stack | User stack in user memory | Kernel stack in kernel memory |
| Privileged Ops | Cause #GP exception | Execute normally |
| Kernel Memory | Access causes #PF | Accessible |
| I/O Ports | Controlled by IOPL | Full access |
| Interrupts | As before | Disabled (interrupt gate) |
The privilege transition mechanism is the foundation of operating system security. No matter how clever user code is, it cannot bypass this transition. Every entry into kernel mode goes through hardware-controlled gates with well-defined semantics. This is why operating systems can safely run untrusted code—the hardware enforces the boundary.
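Because the privilege level lives in the low two bits of the CS selector, a handler can tell where a fault came from by inspecting the CS value saved in the exception frame. A minimal sketch follows, using the example selectors from the table above (0x2B and 0x08, which depend on the GDT layout); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low two bits of a selector: the privilege level (CPL when read from CS). */
static unsigned selector_rpl(uint16_t selector)
{
    return selector & 0x3u;
}

/* Bits 3 and up: index of the descriptor in the descriptor table. */
static unsigned selector_index(uint16_t selector)
{
    return selector >> 3;
}

/* True if the CS saved in an exception frame belongs to ring 3 code. */
static bool frame_from_user_mode(uint64_t saved_cs)
{
    return (saved_cs & 0x3u) == 3u;
}
```

Here `selector_rpl(0x2B)` is 3 and `selector_rpl(0x08)` is 0, matching the "Before" and "After" columns of the table. A handler would typically apply `frame_from_user_mode` to the saved CS to decide whether a bad access should be reported to the process or treated as a kernel bug.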
When the CPU begins executing the page fault handler, it lands at an entry point—typically a small piece of assembly code that completes the state saving and then calls a C handler function.
Entry Point Responsibilities:
Save remaining registers: The CPU only saved the interrupt frame. General-purpose registers must be saved by software.
Set up kernel data segment: Ensure DS, ES are set to kernel data selectors.
Build a canonical stack frame: Create a well-defined structure that C code can use.
Read CR2: Capture the faulting address before it could potentially be overwritten.
Call C handler: Invoke the main handler with pointers to saved state.
On return: Restore registers and use iretq to return.

    # Page Fault Handler Entry Point (x86-64 Assembly)
    # This is the actual entry point registered in the IDT

    .global page_fault_handler_entry
    .type page_fault_handler_entry, @function

    page_fault_handler_entry:
        # At this point, CPU has already:
        # - Switched to kernel stack
        # - Pushed SS, RSP, RFLAGS, CS, RIP, error_code
        # - Cleared IF (interrupts disabled)

        # Save all general-purpose registers (build trap frame)
        pushq %r15
        pushq %r14
        pushq %r13
        pushq %r12
        pushq %r11
        pushq %r10
        pushq %r9
        pushq %r8
        pushq %rbp
        pushq %rdi
        pushq %rsi
        pushq %rdx
        pushq %rcx
        pushq %rbx
        pushq %rax

        # Read CR2 (faulting address) BEFORE it could be overwritten
        # by a nested page fault (shouldn't happen, but be safe)
        movq %cr2, %r12          # Save in callee-saved register

        # Set up kernel data segments
        movw $KERNEL_DS, %ax
        movw %ax, %ds
        movw %ax, %es

        # Call C handler
        # Arguments: rdi = pointer to trap frame, rsi = faulting address
        movq %rsp, %rdi          # First arg: trap frame pointer
        movq %r12, %rsi          # Second arg: faulting address (CR2)
        call page_fault_handler  # C function

        # Handler returns here (if fault was handled successfully)

        # Restore general-purpose registers
        popq %rax
        popq %rbx
        popq %rcx
        popq %rdx
        popq %rsi
        popq %rdi
        popq %rbp
        popq %r8
        popq %r9
        popq %r10
        popq %r11
        popq %r12
        popq %r13
        popq %r14
        popq %r15

        # Skip error code (it was pushed by CPU, we need to remove it)
        addq $8, %rsp

        # Return from interrupt
        # This restores RIP, CS, RFLAGS, RSP, SS from stack
        # and returns to user mode (or kernel mode if fault was there)
        iretq

The faulting address is stored in CR2, but CR2 can be overwritten by subsequent page faults. While nested page faults during handler entry are unusual (the handler code should be resident), defensive programming dictates reading CR2 as early as possible and saving it in a register or on the stack. Some architectures push the faulting address on the stack automatically, avoiding this issue.
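The order of the pushes in the entry stub above fixes the memory layout that the C handler sees through its trap frame pointer: the last register pushed (RAX) sits at the lowest address. As a sketch, a C struct matching that particular stub could look like the following. `TrapFrame` and `bytes_below_iretq_frame` are hypothetical names, and the layout is specific to this stub rather than to any real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct TrapFrame {
    /* Pushed by the entry stub; last pushed = lowest address = first field. */
    uint64_t rax, rbx, rcx, rdx, rsi, rdi, rbp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
    /* Pushed by the CPU before the stub runs. */
    uint64_t error_code;
    uint64_t rip, cs, rflags, rsp, ss;
};

/* Compile-time checks that the struct agrees with the push sequence. */
_Static_assert(offsetof(struct TrapFrame, rax) == 0, "rax is pushed last");
_Static_assert(offsetof(struct TrapFrame, r15) == 14 * 8, "r15 is pushed first");
_Static_assert(offsetof(struct TrapFrame, error_code) == 15 * 8, "15 saved registers");
_Static_assert(sizeof(struct TrapFrame) == 21 * 8, "15 software + 6 hardware slots");

/*
 * Bytes that must be removed from the stack before iretq can run:
 * the 15 saved registers plus the error code.
 */
static size_t bytes_below_iretq_frame(void)
{
    return offsetof(struct TrapFrame, rip);
}
```

`bytes_below_iretq_frame()` evaluates to 128: the fifteen `popq` instructions remove 120 bytes and `addq $8, %rsp` removes the error code, leaving RSP pointing at the saved RIP exactly as `iretq` expects.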
After the entry point completes its register saves, the kernel stack contains a complete trap frame—a snapshot of the CPU's state at the moment of the page fault. This frame is essential for:
Linux's pt_regs Structure:
Linux defines the trap frame as struct pt_regs:

    // Linux x86-64 pt_regs structure (simplified)
    // This represents the complete trap frame on the kernel stack
    struct pt_regs {
        // Saved by software (entry point assembly)
        unsigned long r15;
        unsigned long r14;
        unsigned long r13;
        unsigned long r12;
        unsigned long rbp;
        unsigned long rbx;

        // Saved by software, arguments or scratch
        unsigned long r11;
        unsigned long r10;
        unsigned long r9;
        unsigned long r8;
        unsigned long rax;
        unsigned long rcx;
        unsigned long rdx;
        unsigned long rsi;
        unsigned long rdi;

        // Syscall number on syscall entry; for exceptions that push an
        // error code, the CPU-pushed error code lands in this slot
        unsigned long orig_rax;

        // Saved by hardware (CPU pushes these)
        unsigned long rip;     // Faulting instruction address
        unsigned long cs;      // Code segment (includes CPL)
        unsigned long eflags;  // CPU flags
        unsigned long rsp;     // User stack pointer
        unsigned long ss;      // Stack segment
    };

    // For page faults, error code is accessed separately
    struct page_fault_info {
        unsigned long error_code;  // Page fault specific error code
        unsigned long cr2;         // Faulting virtual address
        struct pt_regs *regs;      // Pointer to saved registers
    };

    // The C page fault handler receives this information
    void page_fault_handler(struct pt_regs *regs, unsigned long address) {
        // get_error_code() stands in for however the entry code exposes the error code
        unsigned long error_code = get_error_code(regs);

        // Now we have everything needed to analyze and handle the fault:
        // - address: the virtual address that faulted
        // - error_code: why it faulted (present?, write?, user?)
        // - regs->rip: what instruction caused the fault
        // - regs->{all registers}: complete CPU state
    }

With the frame in place, iretq restores everything, so the process never knows a fault occurred.
regs->rip points to the exact instruction that faulted. Stack traces can be built from regs->rbp.

What happens if a page fault occurs while handling a page fault? Or worse, what if the page fault handler itself causes another page fault? These scenarios require special handling.
Controllable Nested Faults:
Some page faults during kernel execution are intentional and expected:
Accessing user memory: The kernel might copy data from a user buffer. If that buffer isn't resident, a page fault occurs. This is normal and handled like any other page fault.
Demand-paged kernel modules: Some systems demand-page kernel modules. Page faults can occur accessing module code.
These are handled by re-entering the page fault handler, which works fine as long as the original fault's state is properly preserved.
Problematic Nested Faults:
Double Fault (x86):
When certain exception combinations occur (e.g., couldn't push to stack during exception handling), the CPU generates a double fault (exception 8). The double fault handler uses its own dedicated stack (from the TSS's IST entries) that is guaranteed to be valid.
If the double fault handler faults, a triple fault occurs, which resets the CPU—the equivalent of a forced reboot.
| Scenario | Result | Recovery |
|---|---|---|
| Page fault on user memory access | Normal re-entrant handling | Handled, continues |
| Page fault on kernel stack during trap | Double fault | Use IST stack, likely kernel panic |
| Fault in double fault handler | Triple fault | CPU reset (reboot) |
| NMI during page fault handling | NMI takes priority, then returns | IST stack for NMI ensures safety |
A triple fault is unrecoverable—the CPU has no more fallback positions. This is why operating systems are extremely careful about kernel stack validity, handler code residency, and avoiding recursive fault scenarios. Any driver bug that corrupts the kernel stack can lead straight to triple fault and unexpected reboot.
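The escalation rules can be sketched as a small decision function. This follows the classic x86 grouping of exceptions into benign, contributory (vectors 0, 10, 11, 12, 13), and page fault (vector 14), and it applies only when the second exception is raised while the CPU is still delivering the first; newer processors add further vectors to these groups, so treat the sets below as a simplification. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VEC_DOUBLE_FAULT 8
#define VEC_PAGE_FAULT   14

enum ExcGroup { GROUP_BENIGN, GROUP_CONTRIBUTORY, GROUP_PAGE_FAULT };
enum Outcome  { HANDLE_SERIALLY, DOUBLE_FAULT, TRIPLE_FAULT };

/* Classic grouping: #DE, #TS, #NP, #SS, #GP are contributory. */
static enum ExcGroup exc_group(uint8_t vector)
{
    switch (vector) {
    case 0: case 10: case 11: case 12: case 13:
        return GROUP_CONTRIBUTORY;
    case VEC_PAGE_FAULT:
        return GROUP_PAGE_FAULT;
    default:
        return GROUP_BENIGN;
    }
}

/* Outcome when `second` is raised while `first` is still being delivered. */
static enum Outcome nested_outcome(uint8_t first, uint8_t second)
{
    assert(second != VEC_DOUBLE_FAULT); /* #DF is a result, never the trigger */
    enum ExcGroup g2 = exc_group(second);

    if (first == VEC_DOUBLE_FAULT)      /* failure while delivering #DF */
        return g2 == GROUP_BENIGN ? HANDLE_SERIALLY : TRIPLE_FAULT;

    enum ExcGroup g1 = exc_group(first);
    if (g1 == GROUP_BENIGN || g2 == GROUP_BENIGN)
        return HANDLE_SERIALLY;
    if (g1 == GROUP_CONTRIBUTORY && g2 == GROUP_PAGE_FAULT)
        return HANDLE_SERIALLY;         /* the one non-escalating mixed pair */
    return DOUBLE_FAULT;
}
```

A page fault while pushing the frame for another page fault, `nested_outcome(14, 14)`, escalates to a double fault, which corresponds to the "page fault on kernel stack during trap" row in the table. A page fault taken later, once the handler is already running, is a fresh exception and is handled normally.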
While we've focused on x86-64, other architectures implement similar trap mechanisms with different details.
ARM AArch64:
ARM uses a different terminology and structure:
When a page fault (called a Translation Fault or Permission Fault) occurs:
RISC-V:
RISC-V takes a minimalist approach:
scause: Exception cause code
stval: Faulting address or instruction
sepc: Exception program counter (return address)
stvec: Trap vector base address
Page faults are classified as Load Page Fault (code 13), Store Page Fault (code 15), or Instruction Page Fault (code 12).
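On RISC-V the handler distinguishes the three page fault kinds by decoding scause in software. A minimal sketch, assuming RV64 (where the most significant bit of scause marks an interrupt rather than an exception); the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* RV64: the top bit of scause is set for interrupts, clear for exceptions. */
#define SCAUSE_INTERRUPT_BIT (UINT64_C(1) << 63)

enum RvPageFaultKind {
    RV_PF_NONE,         /* not a page fault */
    RV_PF_INSTRUCTION,  /* exception code 12 */
    RV_PF_LOAD,         /* exception code 13 */
    RV_PF_STORE         /* exception code 15 */
};

/* Classify a raw scause value read by the supervisor trap handler. */
static enum RvPageFaultKind riscv_page_fault_kind(uint64_t scause)
{
    if (scause & SCAUSE_INTERRUPT_BIT)
        return RV_PF_NONE;              /* interrupts share the code space */
    switch (scause) {
    case 12: return RV_PF_INSTRUCTION;
    case 13: return RV_PF_LOAD;
    case 15: return RV_PF_STORE;
    default: return RV_PF_NONE;
    }
}
```

Compared with x86-64, where read versus write and instruction fetch are bits in one error code, RISC-V encodes the access type in the cause number itself. The faulting address would then be read from stval.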
Despite architectural differences, all systems share the core concepts: save state, identify the fault, transfer to handler, eventually restore and return.
| Aspect | x86-64 | ARM AArch64 | RISC-V |
|---|---|---|---|
| Vector table name | IDT | Exception Vector Table | Trap Vector (stvec) |
| Faulting address register | CR2 | FAR_EL1 | stval |
| Cause/type register | Error code on stack | ESR_EL1 | scause |
| Return address | RIP on stack | ELR_EL1 | sepc |
| Saved flags/status | RFLAGS on stack | SPSR_EL1 | sstatus (partial) |
| Privilege levels | Ring 0-3 (CPL) | EL0-EL3 | U/S/M modes |
| Stack switching | TSS → RSP0 | SP_ELn per level | Software managed |
While the specifics differ, the fundamental concepts are universal: the hardware must detect the fault, save sufficient state for later restoration, identify the fault type and address, and transfer control to a predefined handler location. Understanding these concepts on one architecture transfers readily to others.
The trap mechanism is the precisely-engineered bridge between user and kernel space. When a page fault is detected, this mechanism ensures a safe, complete, and reversible transfer of control. Let's consolidate the key concepts:
What's Next:
With control now in the page fault handler and complete state information available, the OS must determine what action to take. The next page explores Find Page on Disk—how the OS determines where the page's content resides and initiates the I/O to retrieve it.
You now understand the trap mechanism that transfers control from a faulting instruction to the kernel's page fault handler. This precisely-engineered hardware/software handoff is the critical path that enables virtual memory, process isolation, and protected operating systems. Next, we'll explore how the OS locates the page content that must be loaded.