The division between User Mode and Kernel Mode would be useless without controlled mechanisms to cross the boundary. Applications need file I/O, network access, and memory allocation—all requiring kernel assistance. Hardware devices need to signal events. Errors need to be caught and handled.
Mode switching is the highly orchestrated process by which the CPU transitions between privilege levels. It's not a simple flag flip—it involves saving the interrupted context, switching to a protected kernel stack, raising the privilege level, and transferring control to a fixed kernel entry point.
Every mode switch is a security boundary crossing. The mechanisms must be airtight—one vulnerability in the mode switching process could give attackers kernel access.
In this page, we'll dissect every type of mode transition, understand the hardware's role, and see how operating systems build upon these primitives.
By the end of this page, you will understand: (1) The three causes of User→Kernel transitions (interrupts, exceptions, system calls), (2) The exact hardware steps in a mode switch, (3) How system call instructions (SYSCALL, SYSENTER, INT, SVC) work, (4) How the kernel returns to User Mode, and (5) Performance implications and optimizations of mode switching.
The CPU can transition from User Mode to Kernel Mode through exactly three mechanisms, each serving a different purpose:
1. Hardware Interrupts (Asynchronous)
External devices signal events that need immediate attention—keyboard presses, network packets, timer ticks, disk completions.
Characteristics: asynchronous—they can arrive at any instruction boundary, independent of what the program is executing—and therefore unpredictable.
2. Exceptions/Traps (Synchronous, Unintentional)
Errors or unusual conditions detected during instruction execution—page faults, division by zero, invalid opcodes.
Characteristics: synchronous—tied to a specific instruction—but unintentional; the program did not ask for the transition.
3. System Calls (Synchronous, Intentional)
Explicit requests for kernel services, such as read() and write().
Characteristics: synchronous and intentional—the program deliberately executes a syscall instruction at a known point, making them fully predictable.
All Three Share Core Mechanics:
Despite their different triggers, all three types of transitions follow similar hardware steps:
| Aspect | Hardware Interrupt | Exception | System Call |
|---|---|---|---|
| Trigger | External device signal | Instruction error/condition | Explicit syscall instruction |
| When it occurs | Any time (async) | During specific instruction | At syscall instruction |
| Predictable? | No | Sometimes | Yes |
| Example | Keyboard IRQ | Page fault, divide by zero | read(), write() |
| CPL change | 3 → 0 | 3 → 0 | 3 → 0 |
| Entry point | From IDT (interrupt number) | From IDT (exception number) | From MSR or IDT |
Notice that all three transitions go User→Kernel. The reverse transition (Kernel→User) always happens through a 'return' mechanism—IRET, SYSRET, or ERET. The kernel explicitly chooses to return to User Mode; there's no way for the hardware to spontaneously drop privileges.
When a mode switch occurs, the CPU executes a precise sequence of steps in hardware. These steps happen atomically—there's no window where security could be compromised.
x86/x64 Interrupt/Exception Entry (via IDT):
```c
// Hardware steps when interrupt/exception occurs (x86-64, IST = 0)

// STEP 1: Determine handler from IDT
idx = interrupt_number;  // e.g., 14 for page fault, 0x80 for legacy syscall
idtEntry = IDT[idx];     // Read from Interrupt Descriptor Table

// STEP 2: Privilege level check
if (idtEntry.DPL < current_CPL && trigger_was_software_int) {
    // Software interrupts (INT n) check DPL
    raise_exception(GENERAL_PROTECTION_FAULT);
}

// STEP 3: Stack switch (if changing privilege)
if (current_CPL > idtEntry.targetCPL) {  // e.g., 3 > 0
    // Save old stack info
    old_SS = SS;
    old_RSP = RSP;

    // Load new stack from TSS
    new_RSP = TSS.RSP0;          // Kernel stack pointer
    new_SS = idtEntry.targetSS;  // Usually kernel data segment

    // Switch stacks
    RSP = new_RSP;
    SS = new_SS;

    // Push old stack info onto new (kernel) stack
    push(old_SS);
    push(old_RSP);
}

// STEP 4: Save execution context onto kernel stack
push(RFLAGS);  // Processor flags (including IF)
push(CS);      // Code segment (includes CPL)
push(RIP);     // Return address (instruction pointer)

if (exception_has_error_code) {
    push(error_code);  // Page faults, GP faults have error codes
}

// STEP 5: Update processor state
RFLAGS.IF = 0;           // Disable interrupts (for most interrupts)
CS = idtEntry.selector;  // Load new code segment (CPL = 0)
RIP = idtEntry.offset;   // Jump to handler

// Now executing kernel code at CPL=0
```

Key Security Properties: the handler address comes from the kernel-controlled IDT, the stack is switched to a kernel stack taken from the TSS, and software interrupts are gated by the DPL check—user code can neither choose where kernel execution begins nor supply the stack it runs on.
ARM Exception Entry (AArch64):
```c
// ARM AArch64 exception entry (e.g., SVC from EL0 to EL1)

// STEP 1: Save processor state
SPSR_EL1 = PSTATE;  // Save current processor state
ELR_EL1 = PC + 4;   // Save return address (next instruction)

// STEP 2: Update PSTATE
PSTATE.EL = 1;  // Switch to EL1 (kernel mode)
PSTATE.SP = 1;  // Use SP_EL1 (kernel stack pointer)
PSTATE.D = 1;   // Mask debug exceptions
PSTATE.A = 1;   // Mask SError (asynchronous) exceptions
PSTATE.I = 1;   // Mask IRQ
PSTATE.F = 1;   // Mask FIQ

// STEP 3: Jump to exception vector
exception_type = determine_type(exception, source_EL);
vector_offset = calculate_offset(exception_type);
PC = VBAR_EL1 + vector_offset;

// Vector offsets:
// From EL0, synchronous (SVC): VBAR_EL1 + 0x400
// From EL0, IRQ:               VBAR_EL1 + 0x480
// From EL0, FIQ:               VBAR_EL1 + 0x500
// From EL0, SError:            VBAR_EL1 + 0x580

// Now executing at EL1 with SP_EL1
```

ARM's exception handling is simpler than x86's: dedicated SPSR/ELR registers per exception level eliminate the need for a TSS, and each EL has its own stack pointer register (SP_EL0, SP_EL1, etc.). The hardware doesn't need to read complex structures to perform the stack switch.
System calls are the most common type of User→Kernel transition. Different architectures and generations have evolved increasingly efficient mechanisms:
x86 Evolution of System Calls:
| Method | Era | Used On | Typical Latency |
|---|---|---|---|
| INT 0x80 | Linux legacy | i386 Linux | ~400 cycles |
| SYSENTER/SYSEXIT | Pentium II+ | Windows, older Linux | ~200 cycles |
| SYSCALL/SYSRET | AMD K6+/x86-64 | Linux x86-64, Windows x64 | ~100 cycles |
SYSCALL (Modern x86-64):
The SYSCALL instruction is the fastest way to enter kernel mode on x86-64. It's specifically designed for the common User→Kernel→User pattern of system calls.
```c
// SYSCALL instruction (x86-64)
// Fastest system call method, skips IDT entirely

// User-mode setup before SYSCALL:
RAX = syscall_number;  // e.g., 0 = read, 1 = write, 60 = exit
RDI = arg1;  // First argument
RSI = arg2;  // Second argument
RDX = arg3;  // Third argument
R10 = arg4;  // Fourth argument (RCX used by SYSCALL)
R8  = arg5;  // Fifth argument
R9  = arg6;  // Sixth argument

// SYSCALL execution (hardware):
// 1. RCX ← RIP (save return address)
// 2. R11 ← RFLAGS (save flags)
// 3. RIP ← IA32_LSTAR MSR (jump to kernel entry)
// 4. CS ← IA32_STAR[47:32] (kernel code segment, CPL=0)
// 5. SS ← IA32_STAR[47:32] + 8 (kernel data segment)
// 6. RFLAGS &= ~IA32_FMASK (mask certain flags, including IF)

// Now in kernel mode at the address in IA32_LSTAR

// SYSRET (returning to user mode):
// 1. RIP ← RCX (return to saved address)
// 2. RFLAGS ← R11 (restore flags)
// 3. CS ← IA32_STAR[63:48] + 16 (user code segment, CPL=3)
// 4. SS ← IA32_STAR[63:48] + 8 (user data segment)

// Return value in RAX
```

ARM SVC (Supervisor Call):
ARM uses the SVC instruction for system calls, which is conceptually simpler—it triggers a synchronous exception to EL1.
```c
// ARM AArch64 System Call

// User-mode setup:
X8 = syscall_number;  // System call number
X0 = arg1;  // First argument
X1 = arg2;  // Second argument
X2 = arg3;  // Third argument
X3 = arg4;  // Fourth argument
X4 = arg5;  // Fifth argument
X5 = arg6;  // Sixth argument

// Execute system call:
SVC #0  // Supervisor Call, immediate ignored on AArch64

// Hardware automatically:
// 1. SPSR_EL1 ← PSTATE
// 2. ELR_EL1 ← PC + 4 (return to instruction after SVC)
// 3. PSTATE.EL ← 1, masks set
// 4. PC ← VBAR_EL1 + 0x400 (sync exception from EL0)

// Kernel reads X8 to determine which syscall
// Kernel execution...

// Return to user mode with ERET:
ERET
// 1. PSTATE ← SPSR_EL1
// 2. PC ← ELR_EL1
// X0 contains return value
```

Early systems used software interrupts (INT) for system calls, but this was slow—the CPU had to read the IDT, check permissions, and perform full interrupt entry. Dedicated syscall instructions (SYSCALL, SVC) use fixed kernel entry points stored in MSRs or system registers, eliminating memory reads and permission checks. The hardware 'knows' this is a system call and optimizes accordingly.
Every kernel entry must eventually return to user mode (or terminate the process). The return mechanism is just as critical as the entry—it must restore user context and lower privilege atomically.
x86-64 Return Mechanisms:
```asm
// Linux x86-64 system call return path (simplified)

entry_SYSCALL_64:
    // ... syscall handling ...

    // Prepare return values
    movq %rax, ORIG_RAX(%rsp)   ; Store syscall return value

    // Check if we can use fast SYSRET path
    testq $(THREAD_FLAGS_SLOW_PATH), THREAD_INFO_FLAGS(%rdi)
    jnz slow_path

fast_path:
    // Fast return via SYSRET
    movq RCX(%rsp), %rcx        ; Restore user RIP
    movq R11(%rsp), %r11        ; Restore user RFLAGS
    movq RAX(%rsp), %rax        ; Return value

    // Switch to user stack BEFORE SYSRET
    movq RSP(%rsp), %rsp        ; Restore user RSP

    sysretq                     ; CPL 0 → CPL 3

slow_path:
    // Need full IRET path for:
    // - Signals pending
    // - Single-step debugging
    // - iret-requiring registers (different SS, etc.)

    // Prepare IRET frame on stack
    // Stack layout: RIP, CS, RFLAGS, RSP, SS
    iretq                       ; Full context restore
```

ARM Exception Return (ERET):
ARM uses a single ERET instruction for all exception returns. It's elegant and consistent:
```asm
// ARM exception return

// Before ERET, kernel must populate:
// - ELR_EL1: Return address for user code
// - SPSR_EL1: Saved processor state (includes target EL)
// - X0: Return value (for syscalls)

// Kernel exit code:
kernel_exit:
    // Restore general-purpose registers from stack
    ldp x0, x1, [sp, #S_X0]
    ldp x2, x3, [sp, #S_X2]
    // ... more register restores ...

    // Restore stack pointer
    ldr x19, [sp, #S_SP]
    msr sp_el0, x19

    // Restore return address and saved state (done earlier)
    // ELR_EL1 and SPSR_EL1 were saved on exception entry

    // Return to user mode
    eret

// ERET atomically:
// 1. PSTATE ← SPSR_EL1 (restores EL0, interrupt masks, etc.)
// 2. PC ← ELR_EL1 (jump to user code)
// Now executing at EL0 (user mode)
```

SYSRET has a subtle security issue: if RCX (return address) is in non-canonical form (invalid x64 address), SYSRET generates a #GP exception while still at CPL=0 but with user RSP/RFLAGS. This can be exploited. Linux checks for non-canonical RCX before SYSRET, switching to IRET if needed. Windows had a vulnerability (CVE-2012-0217) from missing this check.
Hardware interrupts are particularly interesting because they're asynchronous—they can occur at any point, requiring the kernel to handle arbitrary interrupted state.
Interrupt Lifecycle:
Detailed Interrupt Flow (x86):
Device Signals IRQ — Hardware device asserts interrupt line
Interrupt Controller Routes — The APIC (Advanced Programmable Interrupt Controller) determines which CPU receives the interrupt and which vector number is delivered
CPU Checks Interrupt Flag — If RFLAGS.IF = 0, interrupt is held pending
Between-Instruction Window — Interrupt is recognized at instruction boundary
Hardware Context Save — CPU pushes SS, RSP, RFLAGS, CS, RIP to kernel stack
Mode Switch — CPL becomes 0, jump to IDT[vector].handler
Kernel Handler Executes — Reads device registers, processes event
End-Of-Interrupt (EOI) — Kernel signals APIC that interrupt is handled
IRET — Restore context, return to interrupted user code (CPL → 3)
```c
// Simplified keyboard interrupt handler (Linux-style)

// Registered in IDT for vector 33 (IRQ 1 + 32 offset)
irqreturn_t keyboard_interrupt(int irq, void *dev_id) {
    uint8_t scancode;

    // Read scancode from keyboard controller
    // IN instruction (privileged) - only works at CPL=0
    scancode = inb(KEYBOARD_DATA_PORT);  // Port 0x60

    // Process the keypress
    if (scancode & 0x80) {
        // Key release
        handle_key_release(scancode & 0x7F);
    } else {
        // Key press
        handle_key_press(scancode);
        // Wake up processes waiting for input
        wake_up_interruptible(&keyboard_wait_queue);
    }

    // Acknowledge the interrupt (legacy PIC shown; the APIC has its own EOI register)
    // OUT instruction (privileged)
    outb(0x20, PIC_EOI);  // Send EOI to master PIC

    return IRQ_HANDLED;
}

// After this function returns, common interrupt exit code runs IRET
// to return to whatever user process was interrupted
```

Time from IRQ assertion to handler execution is 'interrupt latency.' For real-time systems, this must be bounded and predictable. Mode switch overhead (context save, stack switch) is a significant component. Real-time operating systems minimize this through techniques like interrupt nesting limits and predictable handler execution.
Exceptions are synchronous—they occur as a direct result of instruction execution. The kernel must determine: Is this a recoverable situation, or should the process be terminated?
Categories of Exceptions:
| Vector | Name | Type | Cause | Kernel Response |
|---|---|---|---|---|
| 0 | #DE Divide Error | Fault | DIV/IDIV by zero | SIGFPE, may kill process |
| 6 | #UD Invalid Opcode | Fault | Unknown instruction | SIGILL, terminate |
| 13 | #GP General Protection | Fault | Privilege/segment violation | SIGSEGV, terminate |
| 14 | #PF Page Fault | Fault | Page not present or protected | Handle or SIGSEGV |
| 1 | #DB Debug | Trap | Breakpoint, single-step | SIGTRAP, debugger handles |
| 3 | #BP Breakpoint | Trap | INT 3 instruction | SIGTRAP, debugger handles |
Page Fault: The Most Important Exception
Page faults are special because they're often not errors—they're expected events that the OS handles transparently: a demand-paged page being touched for the first time, a copy-on-write page being written, the stack growing, or a page being brought back from disk.
Only if the fault cannot be resolved does the kernel deliver SIGSEGV.
```c
// Simplified page fault handler (Linux-style)

void do_page_fault(struct pt_regs *regs, unsigned long error_code) {
    unsigned long fault_addr = read_cr2();  // CR2 has faulting address
    struct vm_area_struct *vma;

    // Was this a kernel or user fault?
    if (fault_in_kernel_mode(regs)) {
        if (kernel_exception_fixup(regs)) {
            return;  // Kernel expected this, handled
        }
        // Kernel bug - panic
        kernel_oops();
    }

    // User-mode fault - find VMA
    vma = find_vma(current->mm, fault_addr);

    if (!vma || fault_addr < vma->vm_start) {
        // No mapping exists here
        if (is_stack_growth(vma, fault_addr)) {
            expand_stack(vma, fault_addr);
            return;  // Handled by expanding stack
        }
        goto bad_area;  // Genuine invalid access
    }

    // VMA exists - check access type
    if (error_code & WRITE_FAULT) {
        if (!(vma->vm_flags & VM_WRITE)) {
            goto bad_area;  // Write to read-only
        }
        if (!(vma->vm_flags & VM_SHARED)) {
            // Private mapping: copy-on-write
            do_cow_fault(vma, fault_addr);
            return;
        }
    }

    // Handle the fault (allocate page, read from file, etc.)
    handle_mm_fault(vma, fault_addr, error_code);
    return;

bad_area:
    // Access violation - deliver SIGSEGV
    send_signal(current, SIGSEGV);
}
```

Faults save the address of the faulting instruction—after handling, re-execute it (e.g., page fault loads the page, then retries the access). Traps save the address of the next instruction—used for breakpoints and debugging. Aborts are unrecoverable—the processor state may be corrupted, and the process (or system) cannot continue.
Mode switches have significant performance implications. Understanding their cost helps design efficient applications and systems.
Cost Breakdown of a System Call:
| Component | Cycles | Notes |
|---|---|---|
| SYSCALL instruction | ~20-50 | Hardware mode switch |
| Kernel entry code | ~50-100 | Context save, security checks |
| Syscall dispatch | ~20-50 | Table lookup, validation |
| Actual syscall work | Varies | Depends on operation |
| Kernel exit code | ~50-100 | Context restore, signal check |
| SYSRET instruction | ~20-50 | Hardware mode switch back |
| Total overhead | ~150-350 | Plus actual work |
Why Mode Switches Are Expensive:
Pipeline Flush — Privilege changes may require flushing speculative execution
TLB Considerations — With KPTI, page table switches flush TLB
Register Save/Restore — All caller-saved registers must be preserved
Security Checks — Kernel validates parameters before trusting them
Branch Prediction — Entering kernel may pollute branch predictor state
Cache Effects — Kernel code/data may evict user cache lines
Mitigation Strategies:
```c
// VDSO (Virtual Dynamic Shared Object) example

// Traditional approach - requires mode switch:
int gettimeofday(struct timeval *tv, struct timezone *tz) {
    return syscall(__NR_gettimeofday, tv, tz);
}
// Cost: ~200-400 cycles per call

// VDSO approach - no mode switch:
// The kernel maps a special page into every process containing:
// - Current time (updated by timer interrupt)
// - gettimeofday implementation that reads this shared page

// User code calls what looks like a syscall:
int gettimeofday(struct timeval *tv, struct timezone *tz) {
    // But it's actually a user-space function that reads
    // kernel-maintained data from a shared mapping
    uint64_t ns = vdso_read_clock();
    tv->tv_sec = ns / 1000000000;
    tv->tv_usec = (ns % 1000000000) / 1000;
    return 0;
}
// Cost: ~20-50 cycles per call (10x faster!)

// Available VDSOs vary by kernel/architecture:
// Linux x86-64: clock_gettime, gettimeofday, getcpu, time
```

Use 'perf stat -e syscalls:sys_enter_write,syscalls:sys_exit_write ./program' to measure syscall count and timing. High syscall rates (>10k/sec) may indicate optimization opportunities. The strace -c command provides syscall statistics without performance counters.
Mode switching is the carefully orchestrated process of crossing the User/Kernel boundary—a security-critical operation that must be fast and airtight at the same time.
Module Complete:
You've now completed the CPU Execution Modes module. You understand the three causes of User→Kernel transitions, the hardware steps of a mode switch, the system call instructions on x86 and ARM, the return mechanisms back to User Mode, and the performance costs of crossing the boundary.
This knowledge forms the foundation for understanding process isolation, system call implementation, interrupt handling, and operating system security.
Congratulations! You now have a comprehensive understanding of CPU execution modes—the hardware foundation of operating system security and process isolation. This knowledge is essential for understanding system calls, interrupt handling, kernel development, and security analysis. The next module explores Memory Hierarchy, another fundamental architecture concept that deeply influences OS design.