CPU Execution Modes

Level: Beginner

Duration: 60 mins

Topic: CPU Execution Modes

Mode Bit

The Bit That Guards the Kingdom

Deep within the CPU, in a register measured in bits rather than bytes, lies perhaps the most security-critical piece of state in the entire computer: the Mode Bit.

This tiny piece of hardware—often just 1-2 bits—answers a question that must be resolved before every single instruction executes: "Is this code trusted?"

If the Mode Bit indicates Kernel Mode: Execute anything, access anything.
If the Mode Bit indicates User Mode: Enforce restrictions, block privileged operations.

All process isolation, all memory protection, and every other security boundary in the operating system ultimately depends on this bit being correctly managed. Corrupt it, and all protections evaporate. Secure it, and untrusted code cannot harm the system.

The Mode Bit is the foundation upon which all operating system security is built.

What You Will Learn

By the end of this page, you will understand: (1) What the Mode Bit is and where it's stored in the CPU, (2) How different architectures implement privilege tracking, (3) How the Mode Bit is used on every instruction to enforce security, (4) Who can change the Mode Bit and under what conditions, and (5) Historical vulnerabilities related to Mode Bit manipulation.

What Is the Mode Bit?

The Mode Bit is a bit (or small field of bits) in a CPU status register that indicates the current privilege level of the executing code. It is the authoritative source of truth for whether the processor should enforce restrictions on instruction execution and memory access.

Formal Definition:

The Mode Bit is a hardware-maintained indicator of the CPU's current execution privilege level, consulted by the processor's control logic before executing privileged instructions or accessing protected memory. It can only be modified through carefully controlled hardware mechanisms designed to transfer control to trusted code.

Key Characteristics:

Hardware-Resident: The Mode Bit exists in silicon, not in software-accessible RAM
Consulted on Every Instruction: The CPU checks privilege before executing each operation
Protected from User Modification: No unprivileged instruction can directly change the Mode Bit
Atomically Changed: When the Mode Bit changes, other security-relevant state changes simultaneously

Mode Bit Implementation Across Architectures
Architecture	Register	Bit Field	Values
x86 (32-bit)	CS (Code Segment)	CPL (bits 0-1)	0 = Ring 0 (Kernel), 3 = Ring 3 (User)
x86-64	CS (Code Segment)	CPL (bits 0-1)	0 = Ring 0, 3 = Ring 3
ARM (AArch64)	CurrentEL / PSTATE	EL field (2 bits)	0-3 (EL0-EL3)
ARM (AArch32)	CPSR	Mode bits (5 bits)	0x10=User, 0x13=SVC, etc.
RISC-V	Implicit hart state (no CSR exposes the current mode); previous mode saved in mstatus/sstatus	MPP (2 bits) / SPP (1 bit)	0=User, 1=Supervisor, 3=Machine
MIPS	Status Register (CP0)	KSU field (bits 3-4)	00=Kernel, 01=Supervisor, 10=User

More Than One Bit

Despite the name 'Mode Bit,' most architectures use 2 or more bits to encode the privilege level. This allows for intermediate levels (like x86's Ring 1 and Ring 2, or ARM's hypervisor level). However, the conceptual idea remains the same: a small hardware field that encodes 'how trusted is the current code.'

x86/x64 Implementation: Current Privilege Level

On x86 and x64 processors, the Mode Bit is implemented as the Current Privilege Level (CPL), a 2-bit field stored in the Code Segment (CS) register.

The Four Protection Rings:

Ring	CPL Value	Privilege	Typical Use
Ring 0	00	Highest	OS Kernel
Ring 1	01	High	Device Drivers (rarely used)
Ring 2	10	Medium	Device Drivers (rarely used)
Ring 3	11	Lowest	User Applications

Most operating systems use only Ring 0 and Ring 3, ignoring the intermediate rings. This simplifies the design while still providing clear kernel/user separation.

Where CPL Lives:

The CS register contains a Segment Selector, which includes:

Bits 0-1: RPL (Requested Privilege Level); in the selector currently loaded in CS, these bits hold the CPL
Bit 2: TI (Table Indicator: GDT vs LDT)
Bits 3-15: Index into descriptor table

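While code is executing, the lower 2 bits of CS hold the CPL, the privilege level of the code that is currently running. In other segment registers, and in selectors passed as operands, the same bit positions hold the RPL.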

x86_segment_selector.txt
// x86 Segment Selector Format (16 bits)
// Used in CS, DS, SS, ES, FS, GS registers

+------------------------+----+----------+
|     Index (13 bits)    | TI |   RPL    |
|                        |    | (2 bits) |
+------------------------+----+----------+
        Bits 15-3         Bit 2  Bits 1-0

// Example: CS = 0x0033 (typical user-mode code segment)
// Binary: 0000000000110 0 11   (Index | TI | RPL)
//         Index = 6 (bits 15-3)
//         TI    = 0 (bit 2: GDT)
//         RPL   = 3 (bits 1-0: Ring 3 = User Mode)

// Example: CS = 0x0010 (typical kernel-mode code segment)
// Binary: 0000000000010 0 00   (Index | TI | RPL)
//         Index = 2, TI = 0, RPL = 0 (Ring 0 = Kernel Mode)
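The field layout above can be checked mechanically. The following is a minimal Python sketch, not part of any real API: the names `SegmentSelector` and `decode_selector` are made up for illustration, and the two selector values are the examples from the listing.

```python
from typing import NamedTuple


class SegmentSelector(NamedTuple):
    index: int  # bits 15-3: entry number in the descriptor table
    ti: int     # bit 2: 0 = GDT, 1 = LDT
    rpl: int    # bits 1-0: privilege field (this is the CPL when read from CS)


def decode_selector(value: int) -> SegmentSelector:
    """Split a 16-bit segment selector into its three fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a segment selector is 16 bits wide")
    return SegmentSelector(index=value >> 3, ti=(value >> 2) & 1, rpl=value & 0b11)


# Decode the two example selectors from the listing.
for cs in (0x0033, 0x0010):
    print(f"CS={cs:#06x} -> {decode_selector(cs)}")
```

Shifting and masking is all the hardware does here as well: the privilege level is simply the two lowest bits of the selector.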

CPL in Action:

Every instruction execution involves CPL checks:

Privileged Instruction Check: If instruction requires Ring 0, compare CPL:
- If CPL > 0: Generate #GP (General Protection Fault)
- If CPL = 0: Allow execution
Memory Access Check: Compare CPL to page table U/S bit:
- If CPL = 3 and page is Supervisor-only: Generate #PF (Page Fault)
- If CPL < 3: Access allowed (unless SMEP/SMAP restrict it)
Segment Access Check: Compare CPL to segment DPL:
- Code segment: CPL must equal DPL for non-conforming, or CPL ≥ DPL for conforming
- Data segment: both CPL and the selector's RPL must be ≤ DPL (lower number = more privileged)
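The privileged-instruction and segment rules above reduce to a few integer comparisons. The sketch below is a simplified Python model under stated assumptions: the function names are hypothetical, and the code-segment check ignores the selector's RPL and gate-based transfers, which real hardware also considers.

```python
def may_execute_privileged(cpl: int) -> bool:
    """Ring 0-only instructions execute only when CPL is 0."""
    return cpl == 0


def may_load_data_segment(cpl: int, rpl: int, dpl: int) -> bool:
    """Data segment load: both CPL and RPL must be at least as privileged as DPL."""
    return max(cpl, rpl) <= dpl


def may_jump_to_code_segment(cpl: int, dpl: int, conforming: bool) -> bool:
    """Direct far transfer (simplified): non-conforming needs CPL == DPL,
    conforming needs CPL >= DPL (numerically)."""
    return cpl >= dpl if conforming else cpl == dpl


print(may_execute_privileged(3))                  # user mode: check fails, CPU raises #GP
print(may_load_data_segment(cpl=0, rpl=3, dpl=0)) # an RPL of 3 weakens a Ring 0 request
```

Note that privilege numbers run backwards: a numerically larger value means less privilege, which is why the data-segment rule uses `max`.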

SYSCALL and the CPL

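The SYSCALL instruction (x64) performs the following as one indivisible operation: (1) Saves the address of the next instruction (RIP) to RCX, (2) Saves RFLAGS to R11, (3) Loads CS with the kernel code segment selector taken from the STAR MSR (CPL=0), (4) Loads SS with the kernel stack segment selector, (5) Clears the RFLAGS bits selected by the FMASK MSR, (6) Jumps to the kernel entry point (LSTAR MSR). SYSCALL does not switch RSP; the kernel's entry code loads its own stack pointer. The CPL change from 3 to 0 happens inside a single hardware operation, so there is no window in which Ring 3 code could interfere.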
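To make the SYSCALL steps concrete, here is a small Python model of the register updates. It is a sketch under assumptions: the `Cpu` class and the example MSR values are hypothetical, and only the state named in the paragraph above is modeled (no hidden descriptor caches, no stack switch).

```python
from dataclasses import dataclass

SYSCALL_LENGTH = 2  # the SYSCALL opcode (0F 05) is two bytes long


@dataclass
class Cpu:
    rip: int
    rflags: int
    cs: int
    ss: int
    rcx: int = 0
    r11: int = 0

    @property
    def cpl(self) -> int:
        return self.cs & 0b11  # CPL is the low two bits of CS


def syscall(cpu: Cpu, star: int, lstar: int, fmask: int) -> None:
    """Apply the architectural effects of SYSCALL as one step."""
    cpu.rcx = cpu.rip + SYSCALL_LENGTH      # (1) return address
    cpu.r11 = cpu.rflags                    # (2) saved flags
    kernel_cs = (star >> 32) & 0xFFFF       # STAR[47:32] holds the kernel CS selector
    cpu.cs = kernel_cs & 0xFFFC             # (3) low bits forced to 0, so CPL = 0
    cpu.ss = (kernel_cs + 8) & 0xFFFF       # (4) SS is the next descriptor
    cpu.rflags &= ~fmask                    # (5) clear the masked flag bits
    cpu.rip = lstar                         # (6) fixed kernel entry point


# Hypothetical example values: user CS 0x33, kernel CS 0x10, mask clears IF (bit 9).
cpu = Cpu(rip=0x401000, rflags=0x246, cs=0x33, ss=0x2B)
syscall(cpu, star=0x10 << 32, lstar=0xFFFFFFFF81000000, fmask=0x200)
print(cpu.cpl, hex(cpu.rcx), hex(cpu.rip))
```

User code influences none of the destination values: CS, SS, and RIP all come from MSRs that only Ring 0 can write.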

ARM Implementation: Exception Levels

ARM processors use Exception Levels (EL0-EL3) to encode the current privilege, providing a cleaner, more modern design than x86's segment-based approach.

Exception Level Hierarchy:

EL3 (secure firmware, highest privilege) → EL2 (hypervisor) → EL1 (OS kernel) → EL0 (user applications, lowest privilege)

Where the Level is Stored:

In AArch64 (64-bit ARM), the current exception level is stored in CurrentEL, a read-only system register that returns the EL in bits [3:2]. The full processor state is in PSTATE, which includes the EL along with flags and other state.
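Because the level sits in bits [3:2], a raw CurrentEL value has to be shifted before it is read as a number. A minimal Python sketch (the function name `exception_level` is an assumption made for illustration):

```python
def exception_level(current_el: int) -> int:
    """Extract the exception level from a raw CurrentEL register value."""
    return (current_el >> 2) & 0b11


# A kernel running at EL1 reads CurrentEL as 0b0100, not as 1.
print(exception_level(0b0100))
```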

How Exception Levels Work:

Level	Registers Accessible	Memory Access	Purpose
EL0	General + limited SP/LR	TTBR0_EL1 mappings	User apps
EL1	+ System registers for EL1	+ TTBR1_EL1 (kernel)	OS Kernel
EL2	+ EL2 system registers	+ Stage 2 translation	Hypervisor
EL3	All registers	All memory	Secure firmware

Transitioning Between Levels:

ARM uses a clean exception-based model:

Going up (more privileged): Exception occurs (SVC, IRQ, abort)
Going down (less privileged): ERET (Exception Return) instruction

The exception causes the hardware to:

Save the current PC and PSTATE to ELR_ELn and SPSR_ELn
Set the new exception level
Jump to the appropriate exception vector

arm_exception_transition.txt
// ARM Exception Flow: User (EL0) → Kernel (EL1)
 
// User code executes SVC (Supervisor Call) instruction
// Hardware automatically:
1. SPSR_EL1 ← PSTATE     // Save current state
2. ELR_EL1 ← PC + 4      // Save return address
3. PSTATE.EL ← 1         // Set Exception Level to EL1
4. PSTATE.SP ← 1         // Use SP_EL1 (kernel stack)
5. PC ← VBAR_EL1 + 0x400 // Jump to sync exception vector
 
// Kernel runs, handles syscall, then:
ERET instruction:
1. PSTATE ← SPSR_EL1     // Restore saved state (including EL0)
2. PC ← ELR_EL1          // Jump back to user code
// CPU is now in EL0 again
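The same flow can be written as a small state machine. This Python sketch models only the fields named in the listing; the `Core` and `PState` classes and the example addresses are hypothetical, and details such as interrupt masking and the syndrome register are left out.

```python
from dataclasses import dataclass
from typing import Optional

SYNC_FROM_LOWER_EL = 0x400  # vector offset used in the listing above


@dataclass(frozen=True)
class PState:
    el: int      # current exception level
    sp_sel: int  # 0 = SP_EL0, 1 = the stack pointer of the current level


@dataclass
class Core:
    pc: int
    pstate: PState
    vbar_el1: int
    spsr_el1: Optional[PState] = None
    elr_el1: int = 0


def take_svc(core: Core) -> None:
    """Hardware actions when EL0 code executes SVC."""
    core.spsr_el1 = core.pstate               # save current state
    core.elr_el1 = core.pc + 4                # return to the instruction after SVC
    core.pstate = PState(el=1, sp_sel=1)      # enter EL1 on the kernel stack
    core.pc = core.vbar_el1 + SYNC_FROM_LOWER_EL


def eret(core: Core) -> None:
    """Hardware actions for ERET executed at EL1."""
    if core.pstate.el != 1 or core.spsr_el1 is None:
        raise RuntimeError("this sketch only models ERET from EL1 after an exception")
    core.pstate = core.spsr_el1               # restore saved state, including EL0
    core.pc = core.elr_el1


core = Core(pc=0x1000, pstate=PState(el=0, sp_sel=0), vbar_el1=0x80000)
take_svc(core)
print(core.pstate.el, hex(core.pc))
eret(core)
print(core.pstate.el, hex(core.pc))
```

The user program chooses when to execute SVC, but the target address comes entirely from `vbar_el1`, which it cannot write.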

ARM's Cleaner Design

ARM's Exception Level design avoids x86's legacy complexity (segments, far pointers, call gates). Each level has its own stack pointer register (SP_EL0, SP_EL1, SP_EL2, SP_EL3) and exception state registers. The separation is cleaner, but the fundamental concept is identical: hardware-enforced privilege levels with controlled transitions.

Mode Bit and Instruction Execution

The Mode Bit is not just recorded—it's actively used on every instruction. Let's trace exactly how the CPU uses privilege level information during instruction execution.

The Instruction Execution Pipeline:

Modern CPUs execute instructions through a pipeline with stages like Fetch → Decode → Execute → Memory → Writeback. Privilege checks happen at multiple points:

Decode Stage: Check if instruction is privileged
Execute Stage: Some operations check privilege dynamically, based on operand values (for example, which segment selector is being loaded)
Memory Stage: Check if memory access is permitted

instruction_privilege_flow.txt
// Pseudocode: CPU privilege checking logic (x86-flavored, simplified)

function executeInstruction(instr) {
    CPL = getCurrentPrivilegeLevel();  // x86: CS[1:0]; AArch64: CurrentEL
    userMode = (CPL == 3);             // x86 paging treats CPL 0-2 as supervisor

    // === DECODE STAGE ===
    if (instr.isPrivileged) {
        // e.g. LGDT, LIDT, MOV CR*, RDMSR, WRMSR, HLT
        // (CLI, STI, IN, OUT are checked against IOPL rather than requiring CPL 0)
        if (CPL != 0) {
            raiseException(GENERAL_PROTECTION_FAULT, "#GP(0)");
            return;  // Never reaches execute
        }
    }

    // === Address Calculation ===
    if (instr.hasMemoryOperand) {
        linearAddr = calculateEffectiveAddress(instr);

        // === TLB/Page Table Lookup ===
        pte = translateAddress(linearAddr);

        // Check User/Supervisor bit
        if (pte.supervisorOnly && userMode) {
            raiseException(PAGE_FAULT, "U/S violation");
            return;
        }

        // Check read/write permission
        if (instr.isWrite && !pte.writable) {
            if (userMode || CR0.WP) {  // WP: supervisor writes also honor read-only pages
                raiseException(PAGE_FAULT, "R/W violation");
                return;
            }
        }

        // Check no-execute (if instruction fetch)
        if (instr.isFetch && pte.noExecute) {
            raiseException(PAGE_FAULT, "NX violation");
            return;
        }
    }

    // === EXECUTE STAGE ===
    result = performOperation(instr);

    // === WRITEBACK STAGE ===
    commitResult(result);
}

Privilege Checks Are Not Software:

Critically, these checks are performed by the processor itself (hardwired logic, assisted by microcode on some designs), not by software running on the CPU. This means:

Negligible overhead: Checks typically happen in parallel with other pipeline operations
Unforgeable: There's no instruction to bypass the check
Atomic: The check and the action are indivisible

Memory Protection Integration:

Page tables include a Supervisor bit (U/S on x86, AP on ARM) that works with the mode bit:

Mode Bit	Page Bit	Result
Kernel (CPL=0)	Supervisor	Access allowed
Kernel (CPL=0)	User	Access allowed*
User (CPL=3)	User	Access allowed
User (CPL=3)	Supervisor	ACCESS DENIED → Page Fault

*Modern CPUs have SMAP/SMEP to restrict kernel access to user pages as a security measure.
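The table and its footnote can be expressed as one decision function. The following Python sketch is a simplified model of x86 page-level checks; `Page` and `page_fault_reason` are hypothetical names, and real hardware evaluates further conditions (protection keys, implicit supervisor accesses, reserved bits) that are omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    user: bool                # U/S bit: True = user-accessible page
    writable: bool = True     # R/W bit
    no_execute: bool = False  # NX bit


def page_fault_reason(cpl: int, page: Page, *, write: bool = False,
                      fetch: bool = False, wp: bool = True,
                      smep: bool = False, smap: bool = False,
                      ac: bool = False) -> Optional[str]:
    """Return why an access faults, or None if it is allowed."""
    user_mode = cpl == 3
    if user_mode and not page.user:
        return "U/S violation"
    if fetch:
        if page.no_execute:
            return "NX violation"
        if not user_mode and page.user and smep:
            return "SMEP violation"   # kernel may not execute user pages
        return None
    if not user_mode and page.user and smap and not ac:
        return "SMAP violation"       # kernel touched user data without override
    if write and not page.writable and (user_mode or wp):
        return "R/W violation"
    return None


kernel_page, user_page = Page(user=False), Page(user=True)
for cpl, page in [(0, kernel_page), (0, user_page), (3, user_page), (3, kernel_page)]:
    print(cpl, page.user, page_fault_reason(cpl, page))
```

With `smap=True` or `smep=True`, the second table row ("Kernel, User page") stops being an unconditional "allowed", which is exactly the footnote's point.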

The Spectre Lesson

Spectre-class vulnerabilities revealed that while privilege checks are correct, speculative execution might temporarily ignore them, leaving traces in caches. The CPU speculatively executes instructions as if checks pass, rolling back if they fail—but the cache state remains. This side channel can leak kernel data to user code, despite the mode bit protection working correctly at the architectural level.

Who Can Change the Mode Bit?

The Mode Bit can only be changed through controlled hardware mechanisms that simultaneously transfer control to trusted code locations. There is no instruction that simply "sets the mode bit"—this is by design.

Mechanisms That Raise Privilege (User → Kernel):

Transitions to Higher Privilege

•System Call Instructions (INT, SYSCALL, SYSENTER on x86; SVC on ARM; ECALL on RISC-V) — Explicitly request kernel service. Hardware changes CPL to 0 AND jumps to a fixed entry point.
•Hardware Interrupts — Device signals trigger interrupt. Hardware saves state, sets CPL=0, jumps to interrupt handler from IDT.
•Exceptions/Faults — Page fault, divide by zero, invalid opcode. Hardware catches the error, sets CPL=0, jumps to exception handler.
•Software Interrupts (INT n on x86) — Legacy mechanism for system calls. Hardware validates interrupt gate privilege, sets CPL=0, jumps to handler.

Critical Insight: Entry Points Are Fixed

When privilege increases, the CPU doesn't let the calling code choose where to jump. The destination is always determined by:

Interrupt Descriptor Table (IDT) on x86, plus the SYSCALL/SYSENTER entry-point MSRs — kernel sets these up during boot
Exception Vector Table (VBAR) on ARM — kernel configures vector base address
Trap Vector on RISC-V — kernel sets mtvec/stvec registers

This means a User Mode attacker cannot trick the CPU into jumping to attacker-controlled code with Kernel privileges. The hardware always transfers control to kernel-designated entry points.
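The fixed-entry-point idea can be sketched as a lookup table that only privileged code may fill in. This Python model is illustrative only: `InterruptTable`, `Gate`, and the handler names are hypothetical, and requiring CPL 0 in `install` stands in for the real protections (a privileged load instruction plus kernel-only memory).

```python
from dataclasses import dataclass


class GeneralProtectionFault(Exception):
    pass


@dataclass(frozen=True)
class Gate:
    handler: str  # stands in for a kernel handler address
    dpl: int      # least privileged ring allowed to invoke it via INT n


class InterruptTable:
    def __init__(self) -> None:
        self._gates = {}

    def install(self, cpl: int, vector: int, gate: Gate) -> None:
        """Only Ring 0 code can set up entry points."""
        if cpl != 0:
            raise GeneralProtectionFault("table setup requires CPL 0")
        self._gates[vector] = gate

    def software_interrupt(self, cpl: int, vector: int) -> str:
        """The caller supplies a vector number, never a destination address."""
        gate = self._gates.get(vector)
        if gate is None or cpl > gate.dpl:
            raise GeneralProtectionFault(f"vector {vector:#x} not callable at CPL {cpl}")
        return gate.handler


idt = InterruptTable()
idt.install(cpl=0, vector=0x80, gate=Gate("syscall_entry", dpl=3))
idt.install(cpl=0, vector=14, gate=Gate("page_fault_handler", dpl=0))
print(idt.software_interrupt(cpl=3, vector=0x80))
```

User code picks an index, the table picks the address. A gate with DPL 0 cannot be reached through a software interrupt from Ring 3 at all.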

Transitions to Lower Privilege

•Return from Interrupt/Exception (IRET on x86, ERET on ARM, SRET on RISC-V) — Hardware pops saved state including CPL. Kernel controls what privilege level to restore.
•SYSRET/SYSEXIT (x86) — Fast return from system call. Specifically designed for the User→Kernel→User pattern.
•Far Return to an Outer Ring (RETF, x86, legacy) — Segment-based privilege-lowering mechanism, used to return from a procedure that was entered through a call gate.

Asymmetric Control

Notice the asymmetry: User code can REQUEST privilege elevation (via syscall), but cannot CONTROL it. The hardware and kernel together control where elevated code runs. In contrast, Kernel code has full control over returning to User mode—it can return to any address with any privilege level, because the kernel is trusted.
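This asymmetry shows up in how a return frame is interpreted. The Python sketch below models an x86-style IRET under simplifying assumptions: `IretFrame` and `iret` are hypothetical names, and only the privilege rule is modeled (the new CPL comes from the saved CS, and a return to a more privileged ring is refused).

```python
from dataclasses import dataclass
from typing import Tuple


class GeneralProtectionFault(Exception):
    pass


@dataclass(frozen=True)
class IretFrame:
    rip: int
    cs: int
    rflags: int = 0x202
    rsp: int = 0
    ss: int = 0


def iret(current_cpl: int, frame: IretFrame) -> Tuple[int, int]:
    """Return (new_cpl, new_rip) for an IRET executed at current_cpl."""
    target_cpl = frame.cs & 0b11
    if target_cpl < current_cpl:
        raise GeneralProtectionFault("IRET cannot return to a more privileged ring")
    return target_cpl, frame.rip


# Kernel (CPL 0) returns to user code: allowed, privilege drops to 3.
print(iret(0, IretFrame(rip=0x401002, cs=0x33)))
```

The same function also hints at the "Direct Corruption" risk: if a kernel bug lets an attacker change the saved `cs` in a frame from 0x33 to 0x10, the kernel's own IRET resumes attacker-chosen code at CPL 0.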

Mode Bit Security Implications

The Mode Bit is the ultimate security primitive. Exploits often aim to corrupt it or trick the hardware into misinterpreting the privilege level. Understanding historical vulnerabilities illuminates why modern CPUs have additional safeguards.

Categories of Mode Bit Attacks:

Mode Bit Attack Categories
Attack Type	Mechanism	Example/Impact
Direct Corruption	Bug in kernel allows overwriting IRET frame on stack	Attacker controls CPL on return to user
Confused Deputy	Kernel is tricked into performing privileged action on attacker's behalf	TOCTOU attacks, symlink attacks
Speculative Leaks	Speculative execution ignores mode bit temporarily	Meltdown: read kernel memory from user
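Race Conditions	User-controlled memory changes between the kernel's check and its use	Double-fetch vulnerabilities
Return-to-User (ret2usr)	Corrupted kernel control flow jumps to attacker-supplied user-space code while still in kernel mode	Kernel pointer overwrite + user-space shellcode (blocked by SMEP)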
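The double-fetch row is easy to underestimate, so here is a deterministic Python sketch of the pattern. Everything in it is hypothetical: `RacingUserMemory` simulates a second user thread by returning a different value on the second read, and the two functions stand in for kernel code that validates a user-supplied length.

```python
MAX_LEN = 16  # hypothetical size of a kernel buffer


class RacingUserMemory:
    """User memory whose length field changes after the first read."""

    def __init__(self, first: int, later: int) -> None:
        self._first = [first]
        self._later = later

    def read_length(self) -> int:
        return self._first.pop() if self._first else self._later


def vulnerable_copy_length(mem: RacingUserMemory) -> int:
    if mem.read_length() > MAX_LEN:   # first fetch: the check
        raise ValueError("length too large")
    return mem.read_length()          # second fetch: the use (may differ!)


def safe_copy_length(mem: RacingUserMemory) -> int:
    length = mem.read_length()        # single fetch into kernel-private storage
    if length > MAX_LEN:
        raise ValueError("length too large")
    return length


print(vulnerable_copy_length(RacingUserMemory(first=8, later=4096)))
print(safe_copy_length(RacingUserMemory(first=8, later=4096)))
```

The mode bit worked perfectly in both cases; the bug lies in privileged code trusting unprivileged memory twice.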

Case Study: The Meltdown Vulnerability (2018)

Meltdown demonstrated a fundamental weakness in how CPUs optimized around the mode bit:

User code issues a load from a kernel memory address
CPU starts the read before the privilege check completes (out-of-order execution)
Dependent instructions transiently use the loaded value to touch a cache line
Privilege check eventually fails, and the read is architecturally aborted
BUT: The cache state changed before the abort
Attacker uses cache timing to determine what was read

Result: User code could read arbitrary kernel memory despite mode bit protection working correctly at the architectural level.

Mitigation (KPTI/KAISER): Operating systems now use separate page tables for user and kernel mode. While in User Mode, almost all kernel pages are unmapped (only a minimal entry/exit stub remains), so there is nothing useful to speculatively read.
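The following toy model in Python shows why the architectural fault is not enough and why unmapping helps. It is a sketch, not an exploit and not a description of any real microarchitecture: `ToyCpu`, the secret value, and the address are invented, and a set of "cached lines" stands in for the timing measurements a real attack would use.

```python
from typing import Optional

KERNEL_MEMORY = {0xFFFF0000: 0x2A}  # hypothetical kernel address holding a secret byte


class ToyCpu:
    def __init__(self, kernel_mapped_in_user_tables: bool) -> None:
        self.kernel_mapped = kernel_mapped_in_user_tables
        self.cached_lines = set()

    def user_load_and_probe(self, addr: int) -> None:
        """User-mode load from addr followed by a dependent probe-array access."""
        if self.kernel_mapped and addr in KERNEL_MEMORY:
            # Transient execution: the value is forwarded before the check retires,
            # and the dependent access pulls probe line number `value` into the cache.
            self.cached_lines.add(KERNEL_MEMORY[addr])
        # Architecturally, the load always faults for user code.
        raise PermissionError("page fault")

    def is_fast(self, line: int) -> bool:
        """Stands in for timing a reload of probe line `line`."""
        return line in self.cached_lines


def recover_byte(cpu: ToyCpu, addr: int) -> Optional[int]:
    try:
        cpu.user_load_and_probe(addr)
    except PermissionError:
        pass  # the attacker handles or suppresses the fault
    hits = [line for line in range(256) if cpu.is_fast(line)]
    return hits[0] if hits else None


print(recover_byte(ToyCpu(kernel_mapped_in_user_tables=True), 0xFFFF0000))
print(recover_byte(ToyCpu(kernel_mapped_in_user_tables=False), 0xFFFF0000))
```

In the second call, which models KPTI, there is no translation for the kernel address, so the transient load has no data to forward and the side channel stays empty.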

Modern Mode Bit Protections

•SMEP (Supervisor Mode Execution Prevention) — Kernel cannot execute code from User pages, preventing ret2user attacks.
•SMAP (Supervisor Mode Access Prevention) — Kernel cannot read/write User pages without an explicit override (STAC/CLAC toggle EFLAGS.AC), limiting attacks that trick the kernel into dereferencing attacker-controlled user pointers.
•KPTI (Kernel Page Table Isolation) — User mode has minimal kernel mappings, mitigating Meltdown.
•Stack Canaries — Detect stack buffer overflows before a corrupted function return address is used.
•KASLR (Kernel Address Space Layout Randomization) — Randomize kernel location so attackers can't predict addresses.

Defense in Depth

No single mechanism is sufficient. Modern systems layer protections: hardware mode bit + SMEP + SMAP + KPTI + stack canaries + KASLR + CFI (Control Flow Integrity). Each layer catches different attack vectors, so an attacker typically has to defeat several of them at once.

Observing the Mode Bit in Practice

While the Mode Bit is a hardware concept, its effects are visible through various debugging and observability tools. Let's explore how to observe privilege transitions in real systems.

Linux: /proc/stat and Syscall Tracing

The /proc/stat file shows time spent in different modes:

observing_mode_linux.sh
# View CPU time in user vs kernel mode (values are clock ticks, typically 1/100 s)
$ head -1 /proc/stat
cpu  1234567 12345 567890 12345678 12345 67890 1234 0 0 0
#    user    nice  system idle     iowait irq   softirq
#    ^^^^^^        ^^^^^^
#    Time in user  Time in kernel mode
#    mode
# (the last three fields are steal, guest, guest_nice)

# Trace system calls (mode transitions) for a process (output abbreviated)
$ strace ls
execve("/bin/ls", ["ls"], ...) = 0  # User→Kernel→User
openat(AT_FDCWD, ".", ...) = 3       # User→Kernel→User
getdents64(3, ..., 32768) = 480      # User→Kernel→User
write(1, "file1  file2\n", 13) = 13  # User→Kernel→User
close(3) = 0                          # User→Kernel→User
exit_group(0) = ?                     # User→Kernel (never returns)

# Count syscalls (mode transitions)
$ strace -c ls >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 25.00    0.000010           2         4           openat
 25.00    0.000010           3         3           close
 25.00    0.000010           2         4         3 access
 ...
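The first line of /proc/stat is simple to process programmatically. The Python sketch below parses it and computes the split between user and kernel time; the helper names are hypothetical, the field order follows the proc(5) manual page, and grouping irq and softirq with kernel time is a choice made here for illustration.

```python
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line: str) -> dict:
    """Turn the aggregate 'cpu' line of /proc/stat into a field -> ticks mapping."""
    label, *values = line.split()
    if label != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(CPU_FIELDS, map(int, values)))


def mode_shares(ticks: dict) -> tuple:
    """Return (user_share, kernel_share) of non-idle CPU time."""
    user = ticks["user"] + ticks["nice"]
    kernel = ticks["system"] + ticks["irq"] + ticks["softirq"]
    busy = user + kernel
    return user / busy, kernel / busy


def read_cpu_ticks(path: str = "/proc/stat") -> dict:
    """Read live counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


# Use the sample line from the listing so the sketch runs on any OS.
sample = "cpu  1234567 12345 567890 12345678 12345 67890 1234 0 0 0"
user_share, kernel_share = mode_shares(parse_cpu_line(sample))
print(f"user {user_share:.1%}, kernel {kernel_share:.1%}")
```

The counters are cumulative since boot, so a monitoring tool would call `read_cpu_ticks()` twice and work with the difference between the two samples.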

Perf: Hardware Performance Counters

perf can count system call entries through kernel tracepoints, and hardware performance counters can be restricted to user-mode or kernel-mode execution:

perf_privilege_tracking.sh
# Count system call entries (each one is a user→kernel transition)
$ sudo perf stat -e 'syscalls:sys_enter_*' ls

# Sample cycles separately in user and kernel mode
$ sudo perf record -e cycles:u,cycles:k ls  # u=user, k=kernel
$ sudo perf report
# Shows percentage of time in user vs kernel code

# Intel: raw hardware events (encodings are model-specific; check your CPU's event list)
$ sudo perf stat -e cpu/event=0x3c,umask=0x0,name=cpu_clk_unhalted_core/ \
                 -e cpu/event=0x3c,umask=0x1,name=cpu_clk_unhalted_ref/ \
                 ls

Windows: Performance Monitor and ETW

# Performance Monitor counters:
\Processor(_Total)\% User Time
\Processor(_Total)\% Privileged Time

# ETW (Event Tracing for Windows) can capture syscalls:
xperf -on SYSCALL
# Then analyze with Windows Performance Analyzer

Kernel Debugging:

With a kernel debugger attached, you can directly inspect the mode bit:

kernel_debugger_mode.txt
// GDB with QEMU stub (Linux kernel debugging)
(gdb) info registers cs
cs             0x10    16        # CPL=0 (kernel mode)
 
// After returning to user space:
(gdb) info registers cs  
cs             0x33    51        # CPL=3 (user mode)
 
// WinDbg (Windows kernel debugging)
kd> r cs
cs=0010  # Kernel mode
kd> !process 0 0  # List processes
# Attach to user process, then:
kd> r cs
cs=0033  # User mode

Performance Impact of Mode Switches

Use 'perf stat' to measure syscall overhead in your applications. High system (kernel) time percentage often indicates excessive mode switching. Strategies like batching I/O operations (io_uring), memory mapping files, or using buffered I/O can dramatically reduce mode switch overhead.
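Batching is the simplest of these strategies to demonstrate. The Python sketch below counts how many write() system calls are needed to send the same data with and without batching; the helper names are hypothetical, and `os.write` is used because it maps directly onto one system call per invocation.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Write all of data to fd; return how many write() system calls it took."""
    calls = 0
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # one user→kernel→user round trip
        view = view[written:]
        calls += 1
    return calls


def write_unbatched(fd: int, chunks: list) -> int:
    """One system call (at least) per chunk."""
    return sum(write_all(fd, chunk) for chunk in chunks)


def write_batched(fd: int, chunks: list) -> int:
    """Join in user space first, then cross into the kernel once."""
    return write_all(fd, b"".join(chunks))


chunks = [b"log line\n"] * 1000
fd = os.open(os.devnull, os.O_WRONLY)
try:
    print("unbatched syscalls:", write_unbatched(fd, chunks))
    print("batched syscalls:  ", write_batched(fd, chunks))
finally:
    os.close(fd)
```

Buffered I/O libraries apply the same idea automatically: they accumulate small writes in user-space memory and cross into the kernel only when the buffer fills or is flushed.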

Summary: Mode Bit

The Mode Bit is the hardware foundation of operating system security—a small piece of processor state with enormous implications. Let's consolidate our understanding:

Key Takeaways

•The Mode Bit encodes current privilege level — A small field (1-2 bits typically) in a CPU status register that determines what operations are permitted.
•It's checked on every instruction — Hardware compares CPL against instruction requirements and memory permissions before allowing execution.
•Different architectures, same concept — x86 uses CPL in CS, ARM uses Exception Levels, RISC-V uses mode fields—all implement the same principle.
•Only controlled mechanisms change it — Syscalls, interrupts, and exceptions raise privilege; return instructions lower it. There's no 'set privilege' instruction.
•Entry points are kernel-controlled — When privilege increases, the CPU jumps to a fixed, kernel-determined address—never attacker-controlled code.
•The Mode Bit is the ultimate target — Many security exploits aim to corrupt or work around this protection, leading to defense-in-depth approaches.
•Modern mitigations add layers — SMEP, SMAP, KPTI, and other technologies address weaknesses that the basic mode bit architecture doesn't cover.

Looking ahead:

We've seen the Mode Bit determines what's allowed. But what specific operations are forbidden to unprivileged code? The next page examines Privileged Instructions—the specific CPU operations that require Kernel Mode and why each one could be dangerous in untrusted hands.

Page Complete

You now understand the Mode Bit: the hardware-encoded privilege level that gates access to system resources. This simple mechanism—checked on every instruction—is the foundation of all OS security. Next, we'll examine the specific privileged instructions that the Mode Bit protects.

3 / 5

Loading learning content...

Operating SystemsCPU Execution Modes

CPU Execution Modes

LevelBeginner

Duration60 mins

TopicCPU Execution Modes

3 / 5

Mode Bit

The Bit That Guards the Kingdom

Deep within the CPU, in a register measured in bits rather than bytes, lies perhaps the most security-critical piece of state in the entire computer: the Mode Bit.

This tiny piece of hardware—often just 1-2 bits—answers a question that must be resolved before every single instruction executes: "Is this code trusted?"

If the Mode Bit indicates Kernel Mode: Execute anything, access anything.
If the Mode Bit indicates User Mode: Enforce restrictions, block privileged operations.

The Mode Bit is the foundation upon which all operating system security is built.

What You Will Learn

What Is the Mode Bit?

Formal Definition:

The Mode Bit is a hardware-maintained indicator of the CPU's current execution privilege level, consulted by the processor's control logic before executing privileged instructions or accessing protected memory. It can only be modified through carefully controlled hardware mechanisms designed to transfer control to trusted code.

Key Characteristics:

Hardware-Resident: The Mode Bit exists in silicon, not in software-accessible RAM
Consulted on Every Instruction: The CPU checks privilege before executing each operation
Protected from User Modification: No unprivileged instruction can directly change the Mode Bit
Atomically Changed: When the Mode Bit changes, other security-relevant state changes simultaneously

Mode Bit Implementation Across Architectures
Architecture	Register	Bit Field	Values
x86 (32-bit)	CS (Code Segment)	RPL (bits 0-1)	0 = Kernel, 3 = User
x86-64	CS (Code Segment)	CPL (bits 0-1)	0 = Ring 0, 3 = Ring 3
ARM (AArch64)	CurrentEL / PSTATE	EL field (2 bits)	0-3 (EL0-EL3)
ARM (AArch32)	CPSR	Mode bits (5 bits)	0x10=User, 0x13=SVC, etc.
RISC-V	mstatus/sstatus	MPP/SPP field	0=User, 1=Supervisor, 3=Machine
MIPS	Status Register (CP0)	KSU field (bits 3-4)	00=Kernel, 01=Supervisor, 10=User

More Than One Bit

x86/x64 Implementation: Current Privilege Level

On x86 and x64 processors, the Mode Bit is implemented as the Current Privilege Level (CPL), a 2-bit field stored in the Code Segment (CS) register.

The Four Protection Rings:

Ring	CPL Value	Privilege	Typical Use
Ring 0	00	Highest	OS Kernel
Ring 1	01	High	Device Drivers (rarely used)
Ring 2	10	Medium	Device Drivers (rarely used)
Ring 3	11	Lowest	User Applications

Most operating systems use only Ring 0 and Ring 3, ignoring the intermediate rings. This simplifies the design while still providing clear kernel/user separation.

Where CPL Lives:

The CS register contains a Segment Selector, which includes:

Bits 0-1: RPL (Requested Privilege Level)
Bit 2: TI (Table Indicator: GDT vs LDT)
Bits 3-15: Index into descriptor table

The CPU determines the CPL as the lower 2 bits of CS. When code is executing, the CPL is the privilege level of that code.

x86_segment_selector.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// x86 Segment Selector Format (16 bits)
// Used in CS, DS, SS, ES, FS, GS registers
 
+------------------------+----+--------+
|     Index (13 bits)    | TI | RPL    |
|                        |    | (2 bit)|
+------------------------+----+--------+
   Bits 15-3               Bit 2  Bits 1-0
 
// Example: CS = 0x0033 (typical user-mode code segment)
// Binary: 0000 0000 0011 0011
//         ↑↑↑↑ ↑↑↑↑ ↑↑↑↑  ↑↑
//         Index = 6        TI = 0 (GDT)
//                          RPL = 3 (Ring 3 = User Mode)
 
// Example: CS = 0x0010 (typical kernel-mode code segment)
// Binary: 0000 0000 0001 0000
//         Index = 2, TI = 0, RPL = 0 (Ring 0 = Kernel Mode)

CPL in Action:

Every instruction execution involves CPL checks:

Privileged Instruction Check: If instruction requires Ring 0, compare CPL:
- If CPL > 0: Generate #GP (General Protection Fault)
- If CPL = 0: Allow execution
Memory Access Check: Compare CPL to page table U/S bit:
- If CPL = 3 and page is Supervisor-only: Generate Page Fault
- If CPL = 0: Access allowed (unless SMEP/SMAP restrict it)
Segment Access Check: Compare CPL to segment DPL:
- Code segment: CPL must equal DPL for non-conforming, or CPL ≥ DPL for conforming
- Data segment: CPL must be ≤ DPL (lower number = more privileged)

SYSCALL and the CPL

ARM Implementation: Exception Levels

ARM processors use Exception Levels (EL0-EL3) to encode the current privilege, providing a cleaner, more modern design than x86's segment-based approach.

Exception Level Hierarchy:

Converting Mermaid diagram...

Where the Level is Stored:

How Exception Levels Work:

Level	Registers Accessible	Memory Access	Purpose
EL0	General + limited SP/LR	TTBR0_EL1 mappings	User apps
EL1	+ System registers for EL1	+ TTBR1_EL1 (kernel)	OS Kernel
EL2	+ EL2 system registers	+ Stage 2 translation	Hypervisor
EL3	All registers	All memory	Secure firmware

Transitioning Between Levels:

ARM uses a clean exception-based model:

Going up (more privileged): Exception occurs (SVC, IRQ, abort)
Going down (less privileged): ERET (Exception Return) instruction

The exception causes the hardware to:

Save the current PC and PSTATE to ELR_ELn and SPSR_ELn
Set the new exception level
Jump to the appropriate exception vector

arm_exception_transition.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// ARM Exception Flow: User (EL0) → Kernel (EL1)
 
// User code executes SVC (Supervisor Call) instruction
// Hardware automatically:
1. SPSR_EL1 ← PSTATE     // Save current state
2. ELR_EL1 ← PC + 4      // Save return address
3. PSTATE.EL ← 1         // Set Exception Level to EL1
4. PSTATE.SP ← 1         // Use SP_EL1 (kernel stack)
5. PC ← VBAR_EL1 + 0x400 // Jump to sync exception vector
 
// Kernel runs, handles syscall, then:
ERET instruction:
1. PSTATE ← SPSR_EL1     // Restore saved state (including EL0)
2. PC ← ELR_EL1          // Jump back to user code
// CPU is now in EL0 again

ARM's Cleaner Design

Mode Bit and Instruction Execution

The Mode Bit is not just recorded—it's actively used on every instruction. Let's trace exactly how the CPU uses privilege level information during instruction execution.

The Instruction Execution Pipeline:

Modern CPUs execute instructions through a pipeline with stages like Fetch → Decode → Execute → Memory → Writeback. Privilege checks happen at multiple points:

Decode Stage: Check if instruction is privileged
Memory Stage: Check if memory access is permitted
Execute Stage: Some operations check privilege dynamically

instruction_privilege_flow.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
// Pseudocode: CPU privilege checking logic
 
function executeInstruction(instr) {
    CPL = getCurrentPrivilegeLevel();  // Read from CS[1:0] or CurrentEL
    
    // === DECODE STAGE ===
    if (instr.isPrivileged) {
        // List: CLI, STI, IN, OUT, LGDT, MOV CR*, MSR, MRS, HLT, ...
        if (CPL != 0) {
            raiseException(GENERAL_PROTECTION_FAULT, "#GP(0)");
            return;  // Never reaches execute
        }
    }
    
    // === Address Calculation ===
    if (instr.hasMemoryOperand) {
        linearAddr = calculateEffectiveAddress(instr);
        
        // === TLB/Page Table Lookup ===
        pte = translateAddress(linearAddr);
        
        // Check User/Supervisor bit
        if (pte.supervisorOnly && CPL > 0) {
            raiseException(PAGE_FAULT, "U/S violation");
            return;
        }
        
        // Check read/write permission
        if (instr.isWrite && !pte.writable) {
            if (CPL > 0 || CR0.WP) {  // WP: Write Protect in kernel mode
                raiseException(PAGE_FAULT, "R/W violation");
                return;
            }
        }
        
        // Check no-execute (if instruction fetch)
        if (instr.isFetch && pte.noExecute) {
            raiseException(PAGE_FAULT, "NX violation");
            return;
        }
    }
    
    // === EXECUTE STAGE ===
    result = performOperation(instr);
    
    // === WRITEBACK STAGE ===
    commitResult(result);
}

Privilege Checks Are Not Software:

Critically, these checks are implemented in hardware logic gates, not in microcode or software. This means:

Zero overhead: Checks happen in parallel with other pipeline operations
Unforgeable: There's no instruction to bypass the check
Atomic: The check and the action are indivisible

Memory Protection Integration:

Page tables include a Supervisor bit (U/S on x86, AP on ARM) that works with the mode bit:

Mode Bit	Page Bit	Result
Kernel (CPL=0)	Supervisor	Access allowed
Kernel (CPL=0)	User	Access allowed*
User (CPL=3)	User	Access allowed
User (CPL=3)	Supervisor	ACCESS DENIED → Page Fault

*Modern CPUs have SMAP/SMEP to restrict kernel access to user pages as a security measure.

The Spectre Lesson

Who Can Change the Mode Bit?

Mechanisms That Raise Privilege (User → Kernel):

Transitions to Higher Privilege

•System Call Instructions (INT, SYSCALL, SYSENTER on x86; SVC on ARM; ECALL on RISC-V) — Explicitly request kernel service. Hardware changes CPL to 0 AND jumps to a fixed entry point.
•Hardware Interrupts — Device signals trigger interrupt. Hardware saves state, sets CPL=0, jumps to interrupt handler from IDT.
•Exceptions/Faults — Page fault, divide by zero, invalid opcode. Hardware catches the error, sets CPL=0, jumps to exception handler.
•Software Interrupts (INT n on x86) — Legacy mechanism for system calls. Hardware validates interrupt gate privilege, sets CPL=0, jumps to handler.

Critical Insight: Entry Points Are Fixed

When privilege increases, the CPU doesn't let the calling code choose where to jump. The destination is always determined by:

Interrupt Descriptor Table (IDT) on x86 — kernel sets up these entries during boot
Exception Vector Table (VBAR) on ARM — kernel configures vector base address
Trap Vector on RISC-V — kernel sets mtvec/stvec registers

This means a User Mode attacker cannot trick the CPU into jumping to attacker-controlled code with Kernel privileges. The hardware always transfers control to kernel-designated entry points.

Transitions to Lower Privilege

•Return from Interrupt/Exception (IRET on x86, ERET on ARM, SRET on RISC-V) — Hardware pops saved state including CPL. Kernel controls what privilege level to restore.
•SYSRET/SYSEXIT (x86) — Fast return from system call. Specifically designed for the User→Kernel→User pattern.
•Far Return through Call Gate (x86, legacy) — Segment-based privilege lowering mechanism.

Asymmetric Control

Mode Bit Security Implications

Categories of Mode Bit Attacks:

Mode Bit Attack Categories
Attack Type	Mechanism	Example/Impact
Direct Corruption	Bug in kernel allows overwriting IRET frame on stack	Attacker controls CPL on return to user
Confused Deputy	Kernel is tricked into performing privileged action on attacker's behalf	TOCTOU attacks, symlink attacks
Speculative Leaks	Transient execution uses data before the privilege check takes effect	Meltdown: read kernel memory from user
Race Conditions	User memory changes between two kernel reads of the same data	Double-fetch vulnerabilities
Return-to-User (ret2user)	Corrupted kernel pointer or return address redirects kernel execution into attacker-controlled user pages	User-space shellcode runs with CPL=0

Case Study: The Meltdown Vulnerability (2018)

Meltdown demonstrated a fundamental weakness in how CPUs optimized around the mode bit:

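User code issues a load from a kernel memory address
CPU forwards the loaded data to dependent instructions before the privilege check completes (out-of-order execution)
Privilege check fails, so the load faults and its architectural results are discarded
BUT: a dependent memory access, indexed by the secret value, has already changed cache state
Attacker uses cache timing to determine which cache line was loaded, and therefore what was read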

Result: User code could read arbitrary kernel memory despite mode bit protection working correctly at the architectural level.
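The sequence above can be mimicked with a toy simulation. It does not exploit anything and measures no real timing: `ToyCache`, `transient_read`, `recover_byte`, and the secret value `0x4B` are all hypothetical, and a set membership test stands in for a cache-timing measurement. It only shows how a value that is never returned architecturally can still be inferred from a side effect.

```python
SECRET_KERNEL_BYTE = 0x4B   # hypothetical value the attacker wants to learn


class ToyCache:
    """A set of cached line numbers; membership stands in for 'fast access'."""

    def __init__(self):
        self.lines = set()

    def touch(self, line):
        self.lines.add(line)

    def is_fast(self, line):
        return line in self.lines


def transient_read(cache, cpl):
    """Model the faulting load: the dependent access runs before the check."""
    value = SECRET_KERNEL_BYTE   # data forwarded to dependent instructions
    cache.touch(value)           # dependent access indexed by the secret
    if cpl != 0:
        # Architectural outcome: the load faults and nothing is returned.
        return None
    return value


def recover_byte(cache):
    """Model the probe phase: find which of 256 candidate lines is cached."""
    hits = [line for line in range(256) if cache.is_fast(line)]
    return hits[0] if len(hits) == 1 else None


cache = ToyCache()
architectural = transient_read(cache, cpl=3)
leaked = recover_byte(cache)
print(architectural, hex(leaked))   # None 0x4b
```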

Mitigation (KPTI/KAISER): Operating systems now use separate page tables for user and kernel mode. While the CPU runs in User Mode, almost all kernel pages are unmapped (only a minimal entry/exit stub remains), so there is nothing secret to speculatively read.

Modern Mode Bit Protections

•SMEP (Supervisor Mode Execution Prevention) — Kernel cannot execute code from User pages, preventing ret2user attacks.
•SMAP (Supervisor Mode Access Prevention) — Kernel cannot read/write User pages without an explicit override (STAC/CLAC on x86), so a stray or attacker-supplied user pointer faults instead of being silently dereferenced.
•KPTI (Kernel Page Table Isolation) — User mode has minimal kernel mappings, mitigating Meltdown.
•Stack Canaries — Detect corruption of return addresses (and thus IRET frames).
•KASLR (Kernel Address Space Layout Randomization) — Randomize kernel location so attackers can't predict addresses.
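How SMEP and SMAP extend the basic user/supervisor page check can be summarized as a small decision function. This is a simplified sketch under stated assumptions: the name `access_allowed` and its parameters are invented for the example, and it ignores the writable and no-execute bits, CR0.WP, and everything else a real page walk considers.

```python
def access_allowed(cpl, page_is_user, kind, smep=True, smap=True, ac_flag=False):
    """Return True if the access passes the user/supervisor checks.

    kind is "read", "write", or "execute". ac_flag models EFLAGS.AC,
    the explicit override the kernel sets around intended user accesses.
    """
    if cpl == 3:
        # User code may touch user pages only.
        return page_is_user
    # Supervisor-mode access from here on.
    if not page_is_user:
        return True
    if kind == "execute":
        return not smep           # SMEP: no executing user pages
    return (not smap) or ac_flag  # SMAP: no touching user data without override


print(access_allowed(3, page_is_user=False, kind="read"))                # False
print(access_allowed(0, page_is_user=True, kind="execute"))              # False
print(access_allowed(0, page_is_user=True, kind="read"))                 # False
print(access_allowed(0, page_is_user=True, kind="read", ac_flag=True))   # True
```

With `smep=False` and `smap=False` the function reduces to the classic rule in which supervisor code may access every page, which is the behavior ret2user attacks rely on.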

Defense in Depth

Observing the Mode Bit in Practice

While the Mode Bit is a hardware concept, its effects are visible through various debugging and observability tools. Let's explore how to observe privilege transitions in real systems.

Linux: /proc/stat and Syscall Tracing

The /proc/stat file shows time spent in different modes:

observing_mode_linux.sh
# View CPU time in user vs kernel mode (values are clock ticks, typically 1/100 s each)
$ head -1 /proc/stat
cpu  1234567 12345 567890 12345678 12345 67890 1234 0 0 0
#    user    nice  system idle     iowait irq  softirq
#    ^^^^          ^^^^^^                 ^^^  ^^^^^^^
#    user mode     kernel mode            also kernel mode (interrupt handling)
 
# Trace system calls (mode transitions) for a process
$ strace ls
execve("/bin/ls", ["ls"], ...) = 0  # User→Kernel→User
openat(AT_FDCWD, ".", ...) = 3       # User→Kernel→User
getdents64(3, ..., 32768) = 480      # User→Kernel→User
write(1, "file1  file2\n", 13) = 13  # User→Kernel→User
close(3) = 0                          # User→Kernel→User
exit_group(0) = ?                     # User→Kernel (never returns)
 
# Count syscalls (mode transitions)
$ strace -c ls >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 25.00    0.000010           2         4           openat
 25.00    0.000010           3         3           close
 25.00    0.000010           2         4         3 access
 ...
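To turn the raw counters into a user/kernel split, the first line of /proc/stat can be parsed as in the sketch below. The helper names `parse_cpu_line` and `mode_split` are made up for this example, and the sample line is the illustrative one shown above rather than real measurements. Counting `irq` and `softirq` as kernel time and `nice` as user time is a choice made here for illustration.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(p) for p in parts[1:])))


def mode_split(stats):
    """Return (user_share, kernel_share) of non-idle CPU time."""
    user = stats["user"] + stats["nice"]
    kernel = stats["system"] + stats["irq"] + stats["softirq"]
    busy = user + kernel
    return user / busy, kernel / busy


# Sample line from the listing above; on Linux, read the live value with
# open("/proc/stat").readline() instead.
sample = "cpu  1234567 12345 567890 12345678 12345 67890 1234 0 0 0"

stats = parse_cpu_line(sample)
user_share, kernel_share = mode_split(stats)
print(f"user {user_share:.1%}, kernel {kernel_share:.1%}")
```

Because the counters are cumulative since boot, a monitoring tool would normally sample the line twice and apply the same arithmetic to the differences.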

Perf: Hardware Performance Counters

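Modern CPUs have performance counters that can be restricted to a privilege level, and perf can also count the system-call tracepoints that mark each user-to-kernel transition: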

perf_privilege_tracking.sh
# Count system call entries (one user→kernel transition each)
# Quote the glob so the shell does not try to expand it
$ sudo perf stat -e 'syscalls:sys_enter_*' ls
 
# Sample with privilege level annotations
$ sudo perf record -e cycles:u,cycles:k ls  # u=user, k=kernel
$ sudo perf report
# Shows percentage of time in user vs kernel code
 
# Intel: program raw hardware events directly (event 0x3c counts unhalted clock cycles)
$ sudo perf stat -e cpu/event=0x3c,umask=0x0,name=cpu_clk_unhalted_core/ \
                 -e cpu/event=0x3c,umask=0x1,name=cpu_clk_unhalted_ref/ \
                 ls

Windows: Performance Monitor and ETW

# Performance Monitor counters:
\Processor(_Total)\% User Time
\Processor(_Total)\% Privileged Time

# ETW (Event Tracing for Windows) can capture syscalls:
xperf -on SYSCALL
# Then analyze with Windows Performance Analyzer

Kernel Debugging:

With a kernel debugger attached, you can directly inspect the mode bit:

kernel_debugger_mode.txt
// GDB with QEMU stub (Linux kernel debugging)
(gdb) info registers cs
cs             0x10    16        # CPL = 0x10 & 3 = 0 (kernel mode)
 
// After returning to user space:
(gdb) info registers cs
cs             0x33    51        # CPL = 0x33 & 3 = 3 (user mode)
 
// WinDbg (Windows kernel debugging)
kd> r cs
cs=0010  # Kernel mode
kd> !process 0 0  # List processes
# Attach to user process, then:
kd> r cs
cs=0033  # User mode
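The selector values in the debugger output can be decoded by hand: on x86 the low two bits of a segment selector hold the privilege level, bit 2 selects the descriptor table, and the remaining bits are the table index. The helper below is a small sketch of that arithmetic; the function name `decode_selector` is an assumption made for the example.

```python
def decode_selector(selector):
    """Split an x86 segment selector into its three fields."""
    return {
        "index": selector >> 3,                         # descriptor table index
        "table": "LDT" if selector & 0b100 else "GDT",  # table indicator bit
        "rpl": selector & 0b11,                         # for CS, this is the CPL
    }


for cs in (0x10, 0x33):
    fields = decode_selector(cs)
    mode = "kernel" if fields["rpl"] == 0 else "user"
    print(hex(cs), fields, mode)
```

For the values above, 0x10 decodes to privilege level 0 and 0x33 to privilege level 3, matching the kernel-mode and user-mode annotations in the listing.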

Performance Impact of Mode Switches

Summary: Mode Bit

The Mode Bit is the hardware foundation of operating system security—a small piece of processor state with enormous implications. Let's consolidate our understanding:

Key Takeaways

•The Mode Bit encodes current privilege level — A small field (1-2 bits typically) in a CPU status register that determines what operations are permitted.
•It's checked on every instruction — Hardware compares CPL against instruction requirements and memory permissions before allowing execution.
•Different architectures, same concept — x86 uses CPL in CS, ARM uses Exception Levels, RISC-V uses mode fields—all implement the same principle.
•Only controlled mechanisms change it — Syscalls, interrupts, and exceptions raise privilege; return instructions lower it. There's no 'set privilege' instruction.
•Entry points are kernel-controlled — When privilege increases, the CPU jumps to a fixed, kernel-determined address—never attacker-controlled code.
•The Mode Bit is the ultimate target — Many security exploits aim to corrupt or work around this protection, leading to defense-in-depth approaches.
•Modern mitigations add layers — SMEP, SMAP, KPTI, and other technologies address weaknesses that the basic mode bit architecture doesn't cover.

