The division between User Mode and Kernel Mode would be useless without controlled mechanisms to cross the boundary. Applications need file I/O, network access, and memory allocation—all requiring kernel assistance. Hardware devices need to signal events. Errors need to be caught and handled.
Mode switching is the highly orchestrated process by which the CPU transitions between privilege levels. It's not a simple flag flip—it involves saving the interrupted context, switching to a protected kernel stack, raising the privilege level, and transferring control to a fixed kernel entry point.
Every mode switch is a security boundary crossing. The mechanisms must be airtight—one vulnerability in the mode switching process could give attackers kernel access.
In this page, we'll dissect every type of mode transition, understand the hardware's role, and see how operating systems build upon these primitives.
By the end of this page, you will understand: (1) The three causes of User→Kernel transitions (interrupts, exceptions, system calls), (2) The exact hardware steps in a mode switch, (3) How system call instructions (SYSCALL, SYSENTER, INT, SVC) work, (4) How the kernel returns to User Mode, and (5) Performance implications and optimizations of mode switching.
The CPU can transition from User Mode to Kernel Mode through exactly three mechanisms, each serving a different purpose:
1. Hardware Interrupts (Asynchronous)
External devices signal events that need immediate attention—keyboard presses, network packets, timer ticks, disk completions.
Characteristics: asynchronous—they can arrive at any instruction boundary, independent of what the program is executing—and therefore unpredictable.
2. Exceptions/Traps (Synchronous, Unintentional)
Errors or unusual conditions detected during instruction execution—page faults, division by zero, invalid opcodes.
Characteristics: synchronous—tied to a specific instruction—but unintentional; the program did not ask for the transition.
3. System Calls (Synchronous, Intentional)
Explicit requests for kernel services, such as read() and write().
Characteristics: synchronous and intentional—the program deliberately executes a syscall instruction at a known point, making them fully predictable.
All Three Share Core Mechanics:
Despite their different triggers, all three types of transitions follow similar hardware steps:
| Aspect | Hardware Interrupt | Exception | System Call |
|---|---|---|---|
| Trigger | External device signal | Instruction error/condition | Explicit syscall instruction |
| When it occurs | Any time (async) | During specific instruction | At syscall instruction |
| Predictable? | No | Sometimes | Yes |
| Example | Keyboard IRQ | Page fault, divide by zero | read(), write() |
| CPL change | 3 → 0 | 3 → 0 | 3 → 0 |
| Entry point | From IDT (interrupt number) | From IDT (exception number) | From MSR or IDT |
Notice that all three transitions go User→Kernel. The reverse transition (Kernel→User) always happens through a 'return' mechanism—IRET, SYSRET, or ERET. The kernel explicitly chooses to return to User Mode; there's no way for the hardware to spontaneously drop privileges.
When a mode switch occurs, the CPU executes a precise sequence of steps in hardware. These steps happen atomically—there's no window where security could be compromised.
x86/x64 Interrupt/Exception Entry (via IDT):
```c
// Hardware steps when interrupt/exception occurs (x86-64, IST = 0)

// STEP 1: Determine handler from IDT
idx = interrupt_number;  // e.g., 14 for page fault, 0x80 for legacy syscall
idtEntry = IDT[idx];     // Read from Interrupt Descriptor Table

// STEP 2: Privilege level check
if (idtEntry.DPL < current_CPL && trigger_was_software_int) {
    // Software interrupts (INT n) check DPL
    raise_exception(GENERAL_PROTECTION_FAULT);
}

// STEP 3: Stack switch (if changing privilege)
if (current_CPL > idtEntry.targetCPL) {  // e.g., 3 > 0
    // Save old stack info
    old_SS = SS;
    old_RSP = RSP;

    // Load new stack from TSS
    new_RSP = TSS.RSP0;          // Kernel stack pointer
    new_SS = idtEntry.targetSS;  // Usually kernel data segment

    // Switch stacks
    RSP = new_RSP;
    SS = new_SS;

    // Push old stack info onto new (kernel) stack
    push(old_SS);
    push(old_RSP);
}

// STEP 4: Save execution context onto kernel stack
push(RFLAGS);  // Processor flags (including IF)
push(CS);      // Code segment (includes CPL)
push(RIP);     // Return address (instruction pointer)

if (exception_has_error_code) {
    push(error_code);  // Page faults, GP faults have error codes
}

// STEP 5: Update processor state
RFLAGS.IF = 0;           // Disable interrupts (for most interrupts)
CS = idtEntry.selector;  // Load new code segment (CPL = 0)
RIP = idtEntry.offset;   // Jump to handler

// Now executing kernel code at CPL=0
```

Key Security Properties: the handler address comes from the kernel-controlled IDT, the stack is switched to a kernel stack taken from the TSS, and software interrupts are gated by the DPL check—user code can neither choose where kernel execution begins nor supply the stack it runs on.
ARM Exception Entry (AArch64):
```c
// ARM AArch64 exception entry (e.g., SVC from EL0 to EL1)

// STEP 1: Save processor state
SPSR_EL1 = PSTATE;  // Save current processor state
ELR_EL1 = PC + 4;   // Save return address (next instruction)

// STEP 2: Update PSTATE
PSTATE.EL = 1;  // Switch to EL1 (kernel mode)
PSTATE.SP = 1;  // Use SP_EL1 (kernel stack pointer)
PSTATE.D = 1;   // Mask debug exceptions
PSTATE.A = 1;   // Mask SError (asynchronous) exceptions
PSTATE.I = 1;   // Mask IRQ
PSTATE.F = 1;   // Mask FIQ

// STEP 3: Jump to exception vector
exception_type = determine_type(exception, source_EL);
vector_offset = calculate_offset(exception_type);
PC = VBAR_EL1 + vector_offset;

// Vector offsets:
// From EL0, synchronous (SVC): VBAR_EL1 + 0x400
// From EL0, IRQ:               VBAR_EL1 + 0x480
// From EL0, FIQ:               VBAR_EL1 + 0x500
// From EL0, SError:            VBAR_EL1 + 0x580

// Now executing at EL1 with SP_EL1
```

ARM's exception handling is simpler than x86's: dedicated SPSR/ELR registers per exception level eliminate the need for a TSS, and each EL has its own stack pointer register (SP_EL0, SP_EL1, etc.). The hardware doesn't need to read complex structures to perform the stack switch.
System calls are the most common type of User→Kernel transition. Different architectures and generations have evolved increasingly efficient mechanisms:
x86 Evolution of System Calls:
| Method | Era | Used On | Typical Latency |
|---|---|---|---|
| INT 0x80 | Linux legacy | i386 Linux | ~400 cycles |
| SYSENTER/SYSEXIT | Pentium II+ | Windows, older Linux | ~200 cycles |
| SYSCALL/SYSRET | AMD K6+/x86-64 | Linux x86-64, Windows x64 | ~100 cycles |
SYSCALL (Modern x86-64):
The SYSCALL instruction is the fastest way to enter kernel mode on x86-64. It's specifically designed for the common User→Kernel→User pattern of system calls.
```c
// SYSCALL instruction (x86-64)
// Fastest system call method, skips IDT entirely

// User-mode setup before SYSCALL:
RAX = syscall_number;  // e.g., 0 = read, 1 = write, 60 = exit
RDI = arg1;  // First argument
RSI = arg2;  // Second argument
RDX = arg3;  // Third argument
R10 = arg4;  // Fourth argument (RCX used by SYSCALL)
R8  = arg5;  // Fifth argument
R9  = arg6;  // Sixth argument

// SYSCALL execution (hardware):
// 1. RCX ← RIP (save return address)
// 2. R11 ← RFLAGS (save flags)
// 3. RIP ← IA32_LSTAR MSR (jump to kernel entry)
// 4. CS ← IA32_STAR[47:32] (kernel code segment, CPL=0)
// 5. SS ← IA32_STAR[47:32] + 8 (kernel data segment)
// 6. RFLAGS &= ~IA32_FMASK (mask certain flags, including IF)

// Now in kernel mode at the address in IA32_LSTAR

// SYSRET (returning to user mode):
// 1. RIP ← RCX (return to saved address)
// 2. RFLAGS ← R11 (restore flags)
// 3. CS ← IA32_STAR[63:48] + 16 (user code segment, CPL=3)
// 4. SS ← IA32_STAR[63:48] + 8 (user data segment)

// Return value in RAX
```

ARM SVC (Supervisor Call):
ARM uses the SVC instruction for system calls, which is conceptually simpler—it triggers a synchronous exception to EL1.
```c
// ARM AArch64 System Call

// User-mode setup:
X8 = syscall_number;  // System call number
X0 = arg1;  // First argument
X1 = arg2;  // Second argument
X2 = arg3;  // Third argument
X3 = arg4;  // Fourth argument
X4 = arg5;  // Fifth argument
X5 = arg6;  // Sixth argument

// Execute system call:
SVC #0  // Supervisor Call, immediate ignored on AArch64

// Hardware automatically:
// 1. SPSR_EL1 ← PSTATE
// 2. ELR_EL1 ← PC + 4 (return to instruction after SVC)
// 3. PSTATE.EL ← 1, masks set
// 4. PC ← VBAR_EL1 + 0x400 (sync exception from EL0)

// Kernel reads X8 to determine which syscall
// Kernel execution...

// Return to user mode with ERET:
ERET
// 1. PSTATE ← SPSR_EL1
// 2. PC ← ELR_EL1
// X0 contains return value
```

Early systems used software interrupts (INT) for system calls, but this was slow—the CPU had to read the IDT, check permissions, and perform full interrupt entry. Dedicated syscall instructions (SYSCALL, SVC) use fixed kernel entry points stored in MSRs or system registers, eliminating memory reads and permission checks. The hardware 'knows' this is a system call and optimizes accordingly.
Every kernel entry must eventually return to user mode (or terminate the process). The return mechanism is just as critical as the entry—it must restore user context and lower privilege atomically.
x86-64 Return Mechanisms:
```asm
// Linux x86-64 system call return path (simplified)

entry_SYSCALL_64:
    // ... syscall handling ...

    // Prepare return values
    movq %rax, ORIG_RAX(%rsp)   ; Store syscall return value

    // Check if we can use fast SYSRET path
    testq $(THREAD_FLAGS_SLOW_PATH), THREAD_INFO_FLAGS(%rdi)
    jnz slow_path

fast_path:
    // Fast return via SYSRET
    movq RCX(%rsp), %rcx        ; Restore user RIP
    movq R11(%rsp), %r11        ; Restore user RFLAGS
    movq RAX(%rsp), %rax        ; Return value

    // Switch to user stack BEFORE SYSRET
    movq RSP(%rsp), %rsp        ; Restore user RSP

    sysretq                     ; CPL 0 → CPL 3

slow_path:
    // Need full IRET path for:
    // - Signals pending
    // - Single-step debugging
    // - iret-requiring registers (different SS, etc.)

    // Prepare IRET frame on stack
    // Stack layout: RIP, CS, RFLAGS, RSP, SS
    iretq                       ; Full context restore
```

ARM Exception Return (ERET):
ARM uses a single ERET instruction for all exception returns. It's elegant and consistent:
```asm
// ARM exception return

// Before ERET, kernel must populate:
// - ELR_EL1: Return address for user code
// - SPSR_EL1: Saved processor state (includes target EL)
// - X0: Return value (for syscalls)

// Kernel exit code:
kernel_exit:
    // Restore general-purpose registers from stack
    ldp x0, x1, [sp, #S_X0]
    ldp x2, x3, [sp, #S_X2]
    // ... more register restores ...

    // Restore stack pointer
    ldr x19, [sp, #S_SP]
    msr sp_el0, x19

    // Restore return address and saved state (done earlier)
    // ELR_EL1 and SPSR_EL1 were saved on exception entry

    // Return to user mode
    eret

// ERET atomically:
// 1. PSTATE ← SPSR_EL1 (restores EL0, interrupt masks, etc.)
// 2. PC ← ELR_EL1 (jump to user code)
// Now executing at EL0 (user mode)
```

SYSRET has a subtle security issue: if RCX (return address) is in non-canonical form (invalid x64 address), SYSRET generates a #GP exception while still at CPL=0 but with user RSP/RFLAGS. This can be exploited. Linux checks for non-canonical RCX before SYSRET, switching to IRET if needed. Windows had a vulnerability (CVE-2012-0217) from missing this check.
Hardware interrupts are particularly interesting because they're asynchronous—they can occur at any point, requiring the kernel to handle arbitrary interrupted state.
Interrupt Lifecycle:
Detailed Interrupt Flow (x86):
Device Signals IRQ — Hardware device asserts interrupt line
Interrupt Controller Routes — The APIC (Advanced Programmable Interrupt Controller) determines which CPU receives the interrupt and which vector number is delivered
CPU Checks Interrupt Flag — If RFLAGS.IF = 0, interrupt is held pending
Between-Instruction Window — Interrupt is recognized at instruction boundary
Hardware Context Save — CPU pushes SS, RSP, RFLAGS, CS, RIP to kernel stack
Mode Switch — CPL becomes 0, jump to IDT[vector].handler
Kernel Handler Executes — Reads device registers, processes event
End-Of-Interrupt (EOI) — Kernel signals APIC that interrupt is handled
IRET — Restore context, return to interrupted user code (CPL → 3)
```c
// Simplified keyboard interrupt handler (Linux-style)

// Registered in IDT for vector 33 (IRQ 1 + 32 offset)
irqreturn_t keyboard_interrupt(int irq, void *dev_id) {
    uint8_t scancode;

    // Read scancode from keyboard controller
    // IN instruction (privileged) - only works at CPL=0
    scancode = inb(KEYBOARD_DATA_PORT);  // Port 0x60

    // Process the keypress
    if (scancode & 0x80) {
        // Key release
        handle_key_release(scancode & 0x7F);
    } else {
        // Key press
        handle_key_press(scancode);
        // Wake up processes waiting for input
        wake_up_interruptible(&keyboard_wait_queue);
    }

    // Acknowledge the interrupt (legacy PIC shown; the APIC has its own EOI register)
    // OUT instruction (privileged)
    outb(0x20, PIC_EOI);  // Send EOI to master PIC

    return IRQ_HANDLED;
}

// After this function returns, common interrupt exit code runs IRET
// to return to whatever user process was interrupted
```

Time from IRQ assertion to handler execution is 'interrupt latency.' For real-time systems, this must be bounded and predictable. Mode switch overhead (context save, stack switch) is a significant component. Real-time operating systems minimize this through techniques like interrupt nesting limits and predictable handler execution.
Exceptions are synchronous—they occur as a direct result of instruction execution. The kernel must determine: Is this a recoverable situation, or should the process be terminated?
Categories of Exceptions:
| Vector | Name | Type | Cause | Kernel Response |
|---|---|---|---|---|
| 0 | #DE Divide Error | Fault | DIV/IDIV by zero | SIGFPE, may kill process |
| 6 | #UD Invalid Opcode | Fault | Unknown instruction | SIGILL, terminate |
| 13 | #GP General Protection | Fault | Privilege/segment violation | SIGSEGV, terminate |
| 14 | #PF Page Fault | Fault | Page not present or protected | Handle or SIGSEGV |
| 1 | #DB Debug | Trap | Breakpoint, single-step | SIGTRAP, debugger handles |
| 3 | #BP Breakpoint | Trap | INT 3 instruction | SIGTRAP, debugger handles |
Page Fault: The Most Important Exception
Page faults are special because they're often not errors—they're expected events that the OS handles transparently: a demand-paged page being touched for the first time, a copy-on-write page being written, the stack growing, or a page being brought back from disk.
Only if the fault cannot be resolved does the kernel deliver SIGSEGV.
```c
// Simplified page fault handler (Linux-style)

void do_page_fault(struct pt_regs *regs, unsigned long error_code) {
    unsigned long fault_addr = read_cr2();  // CR2 has faulting address
    struct vm_area_struct *vma;

    // Was this a kernel or user fault?
    if (fault_in_kernel_mode(regs)) {
        if (kernel_exception_fixup(regs)) {
            return;  // Kernel expected this, handled
        }
        // Kernel bug - panic
        kernel_oops();
    }

    // User-mode fault - find VMA
    vma = find_vma(current->mm, fault_addr);

    if (!vma || fault_addr < vma->vm_start) {
        // No mapping exists here
        if (is_stack_growth(vma, fault_addr)) {
            expand_stack(vma, fault_addr);
            return;  // Handled by expanding stack
        }
        goto bad_area;  // Genuine invalid access
    }

    // VMA exists - check access type
    if (error_code & WRITE_FAULT) {
        if (!(vma->vm_flags & VM_WRITE)) {
            goto bad_area;  // Write to read-only
        }
        if (!(vma->vm_flags & VM_SHARED)) {
            // Private mapping: copy-on-write
            do_cow_fault(vma, fault_addr);
            return;
        }
    }

    // Handle the fault (allocate page, read from file, etc.)
    handle_mm_fault(vma, fault_addr, error_code);
    return;

bad_area:
    // Access violation - deliver SIGSEGV
    send_signal(current, SIGSEGV);
}
```

Faults save the address of the faulting instruction—after handling, re-execute it (e.g., page fault loads the page, then retries the access). Traps save the address of the next instruction—used for breakpoints and debugging. Aborts are unrecoverable—the processor state may be corrupted, and the process (or system) cannot continue.
Mode switches have significant performance implications. Understanding their cost helps design efficient applications and systems.
Cost Breakdown of a System Call:
| Component | Cycles | Notes |
|---|---|---|
| SYSCALL instruction | ~20-50 | Hardware mode switch |
| Kernel entry code | ~50-100 | Context save, security checks |
| Syscall dispatch | ~20-50 | Table lookup, validation |
| Actual syscall work | Varies | Depends on operation |
| Kernel exit code | ~50-100 | Context restore, signal check |
| SYSRET instruction | ~20-50 | Hardware mode switch back |
| Total overhead | ~150-350 | Plus actual work |
Why Mode Switches Are Expensive:
Pipeline Flush — Privilege changes may require flushing speculative execution
TLB Considerations — With KPTI, page table switches flush TLB
Register Save/Restore — All caller-saved registers must be preserved
Security Checks — Kernel validates parameters before trusting them
Branch Prediction — Entering kernel may pollute branch predictor state
Cache Effects — Kernel code/data may evict user cache lines
Mitigation Strategies:
```c
// VDSO (Virtual Dynamic Shared Object) example

// Traditional approach - requires mode switch:
int gettimeofday(struct timeval *tv, struct timezone *tz) {
    return syscall(__NR_gettimeofday, tv, tz);
}
// Cost: ~200-400 cycles per call

// VDSO approach - no mode switch:
// The kernel maps a special page into every process containing:
// - Current time (updated by timer interrupt)
// - gettimeofday implementation that reads this shared page

// User code calls what looks like a syscall:
int gettimeofday(struct timeval *tv, struct timezone *tz) {
    // But it's actually a user-space function that reads
    // kernel-maintained data from a shared mapping
    uint64_t ns = vdso_read_clock();
    tv->tv_sec = ns / 1000000000;
    tv->tv_usec = (ns % 1000000000) / 1000;
    return 0;
}
// Cost: ~20-50 cycles per call (10x faster!)

// Available VDSOs vary by kernel/architecture:
// Linux x86-64: clock_gettime, gettimeofday, getcpu, time
```

Use 'perf stat -e syscalls:sys_enter_write,syscalls:sys_exit_write ./program' to measure syscall count and timing. High syscall rates (>10k/sec) may indicate optimization opportunities. The strace -c command provides syscall statistics without performance counters.
Mode switching is the carefully orchestrated process of crossing the User/Kernel boundary—a security-critical operation that must be fast and airtight at the same time.
Module Complete:
You've now completed the CPU Execution Modes module. You understand the three causes of User→Kernel transitions, the hardware steps of a mode switch, the system call instructions on x86 and ARM, the return mechanisms back to User Mode, and the performance costs of crossing the boundary.
This knowledge forms the foundation for understanding process isolation, system call implementation, interrupt handling, and operating system security.
Congratulations! You now have a comprehensive understanding of CPU execution modes—the hardware foundation of operating system security and process isolation. This knowledge is essential for understanding system calls, interrupt handling, kernel development, and security analysis. The next module explores Memory Hierarchy, another fundamental architecture concept that deeply influences OS design.