Protection Domains - Learning Module


Domain Switching

Crossing the Privilege Boundary

If protection domains were static prisons, processes trapped forever in their initial privilege set, operating systems would be far simpler—but also far less useful. The power of modern protection systems lies in controlled domain switching: the ability for a process to transition from one protection domain to another under carefully enforced rules.

Every time you execute a sudo command, make a system call, attach a debugger, or run a setuid binary, your process crosses a domain boundary. These transitions are among the most security-critical operations in computing. A bug in domain switching can grant unlimited power to unprivileged code; an overly restrictive implementation can make legitimate operations impossible.

Understanding domain switching is understanding the precise mechanisms by which privilege is gained, exercised, and relinquished.

What You Will Learn

By the end of this page, you will understand how processes transition between protection domains, the hardware and software mechanisms that enable domain switching, security policies governing transitions, and common vulnerabilities that arise when domain switching is implemented incorrectly.

Why Domain Switching Is Necessary

Protection domains provide isolation, but isolation alone is insufficient for practical computing. Programs must interact with the kernel, with privileged system services, and with each other. Domain switching enables these interactions while maintaining security.

The Fundamental Tension:

User programs need kernel services — Opening files, allocating memory, creating processes all require kernel privileges
The kernel cannot trust user code — If user code could directly invoke kernel functions, it could corrupt system state
Temporary privilege is often needed — Password changes require writing to /etc/shadow, but normal users shouldn't have permanent write access
Debugging requires access to debuggee — A debugger must read/write another process's memory, violating normal isolation

Domain switching resolves these tensions by providing controlled, auditable, revocable transitions between privilege levels.

Common Domain Switching Scenarios

•System Calls — User process requests kernel service: User Domain → Kernel Domain → User Domain
•Setuid Execution — User runs privileged binary: User Domain → Program's Domain → User Domain
•Signal Handling — Kernel invokes user handler: Kernel Domain → User Domain → Kernel Domain
•Exception Handling — Hardware fault triggers kernel: User Domain → Kernel Domain
•Context Switch — Scheduler changes running process: Process A Domain → Kernel Domain → Process B Domain
•Driver Invocation — Kernel calls device driver: Kernel Domain → Driver Domain → Kernel Domain

Domain Switching Is Security-Critical

Every domain switch is an opportunity for privilege escalation attacks. If the transition is not performed correctly—if registers aren't cleared, if the stack isn't switched, if the return address can be manipulated—an attacker may gain unauthorized access to the target domain's privileges.

The Domain Switching Mechanism

Domain switching requires coordination between hardware and software. The basic phases are:

Phase 1: Switch Request

The currently executing code requests or triggers a domain transition. This may be:

Explicit: A system call instruction (syscall, svc, int 0x80)
Implicit: A hardware exception (page fault, divide by zero)
External: An interrupt from a device

Phase 2: Privilege Verification

Before the switch occurs, the system verifies the transition is permitted:

Does the current domain have "switch" rights to the target domain?
Is the entry point valid (not arbitrary code in the target domain)?
Are the access conditions met (correct trap gate, valid syscall number)?

Phase 3: Context Save

The CPU and OS save the current domain's execution context:

Program counter (return address)
Stack pointer
CPU flags and registers
Memory management state (page table base)

Phase 4: Domain Transition

The actual privilege level change occurs:

CPU privilege level changes (ring 3 → ring 0)
Stack switches to new domain's stack
Memory mapping may change
New domain's code begins execution

Phase 5: Context Restore (on return)

When the target domain completes, the original context is restored and execution resumes in the original domain.
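
The five phases can be sketched as a toy user-space model. Everything here (the struct names, the `enter_domain` and `leave_domain` helpers, the two-level setup) is hypothetical and only mirrors the sequence above; on real hardware these steps are carried out by the CPU and the kernel's entry code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: two privilege levels and one saved-context slot. */
enum domain { DOMAIN_KERNEL = 0, DOMAIN_USER = 3 };

struct context {
    uint64_t pc;          /* program counter (return address) */
    uint64_t sp;          /* stack pointer */
    enum domain level;    /* current privilege level */
};

struct cpu_model {
    struct context cur;   /* context currently executing */
    struct context saved; /* context saved at the last switch */
    bool in_switch;       /* true between enter and leave */
};

/* Phases 1-4: request, verify, save, transition. Returns false if refused. */
bool enter_domain(struct cpu_model *cpu, uint64_t entry_pc,
                  uint64_t valid_entry, uint64_t kernel_sp)
{
    /* Phase 2: only the registered entry point is accepted. */
    if (cpu->cur.level != DOMAIN_USER || entry_pc != valid_entry)
        return false;
    /* Phase 3: save the caller's context. */
    cpu->saved = cpu->cur;
    /* Phase 4: change level, stack, and program counter together. */
    cpu->cur.level = DOMAIN_KERNEL;
    cpu->cur.sp = kernel_sp;
    cpu->cur.pc = entry_pc;
    cpu->in_switch = true;
    return true;
}

/* Phase 5: restore the saved context on return. */
bool leave_domain(struct cpu_model *cpu)
{
    if (!cpu->in_switch)
        return false;
    cpu->cur = cpu->saved;
    cpu->in_switch = false;
    return true;
}
```

The point of the model is that a refused request leaves the caller's context untouched, and that the stack pointer and privilege level never change independently of each other.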

Hardware-Assisted Domain Switching

Modern CPUs provide hardware mechanisms to enforce domain boundaries and enable controlled transitions. Without hardware support, software-only protection could be bypassed by malicious code.

x86/x64 Privilege Transitions:

Entering Ring 0 (User → Kernel):

Mechanism	Instruction	Modern Usage
Software Interrupt	`int 0x80`	Legacy Linux syscall
SYSENTER	`sysenter`	32-bit fast syscall
SYSCALL	`syscall`	64-bit fast syscall
Hardware Interrupt	(automatic)	Timer, device, exception
Exception	(automatic)	Page fault, GPF, etc.

Returning to Ring 3 (Kernel → User):

Mechanism	Instruction	Notes
IRET	`iret`	Restores full context
SYSEXIT	`sysexit`	Fast return from sysenter
SYSRET	`sysret`	Fast return from syscall

The SYSCALL/SYSRET Fast Path:

Modern 64-bit systems use syscall/sysret for performance:

; User-space system call invocation
mov rax, 1          ; syscall number (write)
mov rdi, 1          ; fd = stdout
mov rsi, msg        ; buffer address
mov rdx, len        ; buffer length
syscall             ; ENTER KERNEL DOMAIN
; rax now contains return value

When syscall executes:

RCX ← RIP (save user instruction pointer)
R11 ← RFLAGS (save user flags)
RIP ← IA32_LSTAR MSR (jump to kernel entry point)
CS ← kernel code segment (privilege = 0)
SS ← kernel stack segment
RFLAGS masked (interrupts disabled per IA32_FMASK)

Note what does NOT happen automatically:

Stack is not switched (kernel must do this in software)
Other registers are not saved (kernel must preserve)
Arguments are not validated (kernel must check everything)
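
The register effects listed above can be modeled in a few lines of C. This is a sketch for reasoning about the instruction, not an emulator: `struct regs_model`, `struct msr_model`, and `model_syscall` are hypothetical names, and the selector values come from the caller rather than from a real IA32_STAR decode.

```c
#include <stdint.h>

/* Hypothetical register file; only the fields SYSCALL touches are modeled. */
struct regs_model {
    uint64_t rip, rcx, r11, rflags, rsp;
    uint16_t cs, ss;
};

/* Hypothetical stand-ins for the MSR contents the kernel programmed. */
struct msr_model {
    uint64_t lstar;      /* IA32_LSTAR: kernel entry address */
    uint64_t fmask;      /* IA32_FMASK: RFLAGS bits to clear on entry */
    uint16_t kernel_cs;  /* selectors derived from IA32_STAR */
    uint16_t kernel_ss;
};

/* Apply the effects of SYSCALL; next_rip is the instruction after it. */
void model_syscall(struct regs_model *r, const struct msr_model *m,
                   uint64_t next_rip)
{
    r->rcx = next_rip;          /* RCX <- return address */
    r->r11 = r->rflags;         /* R11 <- user RFLAGS */
    r->rip = m->lstar;          /* jump to the kernel entry point */
    r->cs = m->kernel_cs;       /* privilege level becomes 0 */
    r->ss = m->kernel_ss;
    r->rflags &= ~m->fmask;     /* clear the masked flag bits */
    /* RSP is deliberately left alone: the entry code must switch stacks. */
}
```

Note that `rsp` still holds the user value afterwards, which is why the first instructions of a kernel entry path load a kernel stack pointer before touching the stack.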

SYSRET Vulnerability (CVE-2012-0217)

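The sysret instruction has a dangerous quirk on Intel CPUs: if RCX (the return address) holds a non-canonical address, the resulting general protection fault is raised while the CPU is still in Ring 0 but RSP already holds the user-controlled stack pointer. The fault handler then runs with kernel privilege on a stack the attacker chose, which the return paths of several operating systems did not anticipate, and this allowed user-space code to gain Ring 0 execution. It is a reminder that kernel entry and exit code must account for the exact behavior of every CPU instruction it relies on.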
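
A typical defense is to check that the return address is canonical before taking the sysret path, and to fall back to iret otherwise. The check itself is small; the sketch below assumes 48-bit virtual addresses (4-level paging), and `is_canonical_48` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * With 48-bit virtual addresses, an address is canonical when
 * bits 63..47 are all zeros or all ones (sign extension of bit 47).
 */
bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the upper 17 bits */
    return top == 0 || top == 0x1FFFF;
}
```

For example, `0x00007fffffffffff` and `0xffff800000000000` are canonical, while `0x0000800000000000` is not.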

Software Domain Switching

Not all domain switches involve hardware privilege changes. Software-only domain switching occurs when the OS changes a process's effective privileges without a ring transition.

Unix setuid/setgid Mechanism:

The most common software domain switch is executing a setuid binary. When a file has the setuid bit set, executing it changes the process's effective UID to the file's owner.

$ ls -l /usr/bin/passwd
-rwsr-xr-x 1 root root 68208 Jan 1 2024 /usr/bin/passwd
    ^
    setuid bit (s in owner execute position)

When a regular user executes /usr/bin/passwd:

Kernel performs normal execve() processing
Kernel notices setuid bit on the executable
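Process's effective UID and saved set-user-ID change from user's UID to root (0); the real UID stays the user's
Process now operates in root's domain (can write /etc/shadow)
The elevated domain ends when passwd exits; a longer-running setuid program should drop privileges explicitly (changing its eUID back)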

This is entirely software-based—the CPU privilege level doesn't change; user code still runs in Ring 3. But the kernel's access control checks now see root privileges.

setuid_domain_switch.c
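#define _GNU_SOURCE   // exposes getresuid() on glibc
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
 
// Print the real, effective, and saved UIDs of the calling process
void print_ids(const char *label) {
    uid_t ruid, euid, suid;
    if (getresuid(&ruid, &euid, &suid) < 0) {
        perror("getresuid failed");
        return;
    }
    printf("%s: real=%d, effective=%d, saved=%d\n",
           label, (int)ruid, (int)euid, (int)suid);
}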
 
int main() {
    // Assuming this binary is setuid root
    print_ids("Initial");       // real=1000, effective=0, saved=0
    
    // Do privileged work (e.g., write to protected file)
    FILE *f = fopen("/etc/protected", "w");
    if (f) {
        fprintf(f, "Privileged write\n");
        fclose(f);
    }
    
    // Drop privileges - critical for security!
    if (seteuid(getuid()) < 0) {
        perror("seteuid failed");
        return 1;
    }
    print_ids("After drop");    // real=1000, effective=1000, saved=0
    
    // Could restore privileges if needed
    if (seteuid(0) < 0) {       // Can restore since saved UID is still 0
        perror("seteuid restore failed");
    }
    print_ids("After restore"); // real=1000, effective=0, saved=0
    
    // Permanently drop privileges - cannot regain
    if (setuid(getuid()) < 0) {
        perror("setuid failed");
        return 1;
    }
    print_ids("Permanent drop"); // real=1000, effective=1000, saved=1000
    
    return 0;
}

Privilege Dropping Best Practice

Setuid programs should drop privileges as soon as possible, keep them for the minimum necessary time, and drop them permanently when elevated access is no longer needed. The saved UID mechanism allows temporary drops (with restoration) vs. permanent drops.
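
A permanent drop is worth wrapping in a helper that also confirms the result. The sketch below is one way to do it, assuming a setuid-root program on Linux, where setuid() called with effective UID 0 sets the real, effective, and saved IDs together. `drop_privileges_permanently` is a hypothetical name, and supplementary groups are not handled here.

```c
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 on success, -1 if the drop failed or could not be confirmed. */
int drop_privileges_permanently(void)
{
    uid_t ruid = getuid();
    gid_t rgid = getgid();

    /* Drop the group first: after the UID drop we may no longer be allowed to. */
    if (setgid(rgid) != 0)
        return -1;
    if (setuid(ruid) != 0)
        return -1;

    /* Confirm that the effective IDs now match the real IDs. */
    if (geteuid() != ruid || getegid() != rgid)
        return -1;
    return 0;
}
```

A caller should treat a nonzero return as fatal and exit rather than continue with privileges it meant to give up.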

Controlled Entry Points

A critical security requirement for domain switching is that entry into a higher-privilege domain must occur only at controlled entry points. If an attacker could jump to arbitrary code in the kernel, protection would be meaningless.

The Gate Concept:

A gate is a controlled entry point into a protected domain. It specifies:

The address where execution begins (not attacker-controlled)
The target privilege level
Stack behavior (which stack to use)
Any additional checks (is the call allowed from this source domain?)

x86 Gate Types:

Gate Type	Purpose	Usage
Interrupt Gate	Hardware/software interrupts	Timer, syscall via int
Trap Gate	Software exceptions	Debug breakpoints
Call Gate	Controlled privilege calls	Rarely used in modern systems
Task Gate	Task switching	Obsolete in 64-bit mode
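
The gate idea can be illustrated with a small lookup routine. This is a simplified model, not the real descriptor format: `struct gate_model`, `resolve_gate`, and the two handlers are hypothetical, and only the privilege check applied to software-invoked vectors (caller's CPL must be numerically less than or equal to the gate's DPL) is modeled.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 256

typedef void (*handler_fn)(void);

/* Simplified gate: a fixed entry address plus who may invoke it. */
struct gate_model {
    handler_fn handler;  /* entry point chosen by the kernel, not the caller */
    uint8_t dpl;         /* least-privileged ring allowed to invoke the gate */
    uint8_t present;     /* nonzero if the gate is installed */
};

/* Hypothetical handlers used only for illustration. */
void timer_handler(void) {}
void legacy_syscall_handler(void) {}

/* Return the entry point for a software-invoked vector, or NULL if refused. */
handler_fn resolve_gate(const struct gate_model *table, unsigned vector,
                        uint8_t cpl)
{
    if (vector >= NUM_VECTORS)
        return NULL;
    const struct gate_model *g = &table[vector];
    if (!g->present)
        return NULL;
    if (cpl > g->dpl)    /* numerically higher CPL means less privileged */
        return NULL;
    return g->handler;
}
```

With a gate at vector 0x80 whose DPL is 3 and a timer gate whose DPL is 0, Ring 3 code can reach the first but not the second, and in neither case does the caller choose the address that runs.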

Modern Syscall Entry:

For system calls, modern systems skip the gate lookup and take the entry point from Model-Specific Registers (MSRs); interrupts and exceptions still go through gates:

IA32_LSTAR MSR = address of syscall entry point
IA32_STAR MSR  = segment selectors for syscall transitions
IA32_FMASK MSR = flags to mask on syscall entry

Only the kernel can write these MSRs, so user code cannot redirect syscall entry.

Why Controlled Entry Points Matter

•Prevent arbitrary code execution — Attacker cannot jump to 'pop registers; return' gadgets in kernel
•Ensure proper setup — Entry point code validates arguments before use
•Maintain invariants — Kernel assumes certain state on entry; random jumps break assumptions
•Enable auditing — All entries flow through known code; logging is possible
•Support validation — Entry code can check caller identity, arguments, quotas

The System Call Table:

Even with a controlled entry point, the kernel must dispatch to the correct handler. The syscall number provided by user code indexes into a table of handler functions:

// Kernel syscall table (simplified)
const syscall_fn_t sys_call_table[] = {
    [0] = sys_read,
    [1] = sys_write,
    [2] = sys_open,
    [3] = sys_close,
    // ... hundreds more
};

// Entry point code
void syscall_entry(struct pt_regs *regs) {
    long nr = regs->rax;  // syscall number from user
    
    if (nr >= NR_syscalls || nr < 0) {
        regs->rax = -ENOSYS;  // Invalid syscall number
        return;
    }
    
    regs->rax = sys_call_table[nr](regs);  // Dispatch
}

Note the bounds check: without it, a malicious syscall number would index past the end of the table and make the kernel call whatever value it finds there as a function pointer.
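
The same dispatch pattern can be tried in user space. The following is a minimal sketch with hypothetical names (`struct call_regs`, `demo_table`, `demo_syscall_entry`) and two toy handlers; it shows the bounds check and table dispatch only and is not kernel code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical register snapshot: call number, one argument, return slot. */
struct call_regs {
    long nr;
    long arg0;
    long ret;
};

typedef long (*syscall_fn_t)(const struct call_regs *);

/* Toy handlers standing in for real system calls. */
static long demo_getanswer(const struct call_regs *r) { (void)r; return 42; }
static long demo_double(const struct call_regs *r) { return r->arg0 * 2; }

static const syscall_fn_t demo_table[] = { demo_getanswer, demo_double };
#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* Validate the caller-supplied number before it is used as an index. */
void demo_syscall_entry(struct call_regs *r)
{
    if (r->nr < 0 || (size_t)r->nr >= DEMO_NR_SYSCALLS) {
        r->ret = -ENOSYS;   /* unknown call number */
        return;
    }
    r->ret = demo_table[r->nr](r);
}
```

Calling it with `nr = 1` and `arg0 = 21` stores 42 in `ret`; any number outside the table, including a negative one, yields `-ENOSYS` without touching the table.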

Stack Switching

One of the most critical aspects of domain switching is stack management. The stack contains return addresses, local variables, and sensitive data. Using the wrong stack in the wrong domain is a catastrophic security vulnerability.

Why Separate Stacks Are Required:

User stacks are untrusted — Kernel cannot store sensitive data where user code might read it
User stacks may be invalid — User could have munmapped their stack or corrupted the pointer
Stack limits differ — Kernel needs guaranteed stack space; user stacks may be exhausted
Isolation requirements — Kernel data on stack must not leak to user space

The Stack Switch Process:

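When an interrupt or exception transfers control from Ring 3 to Ring 0 on x86-64 (the syscall instruction is different: it leaves RSP untouched, so its entry code loads the kernel stack pointer in software):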

CPU loads new stack pointer from the Task State Segment (TSS)
CPU pushes user SS, RSP, RFLAGS, CS, RIP onto kernel stack
Optionally, CPU pushes error code (for relevant exceptions)
Kernel code continues on kernel stack

On return (IRET):

CPU pops RIP, CS, RFLAGS, RSP, SS from kernel stack
Execution resumes in user mode with user stack

Stack Layout After Domain Switch to Kernel
Kernel Stack	Contents	Purpose
Top → SS (user)	0x2b (user data segment)	Restore user stack segment
RSP (user)	User's stack pointer value	Restore user stack position
RFLAGS	User's CPU flags	Restore interrupt state, etc.
CS (user)	0x33 (user code segment)	Restore user privilege level
RIP	User's instruction pointer	Resume execution here
[Error code]	Exception-specific	Only for some exceptions
← New RSP	Kernel working space	Kernel uses from here down
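
The frame in the table maps naturally onto a C struct, listed from the lowest address (pushed last) to the highest (pushed first). This is a sketch for the case without an error code; `struct iret_frame_model` and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Five 8-byte slots, lowest address first: RIP was pushed last, SS first. */
struct iret_frame_model {
    uint64_t rip;     /* where user execution resumes */
    uint64_t cs;      /* user code segment selector */
    uint64_t rflags;  /* user flags */
    uint64_t rsp;     /* user stack pointer */
    uint64_t ss;      /* user stack segment selector */
};

/* The requested privilege level is the low two bits of a selector. */
static inline unsigned selector_rpl(uint64_t selector)
{
    return (unsigned)(selector & 3u);
}

/* True if returning through this frame would resume in Ring 3. */
int frame_returns_to_user(const struct iret_frame_model *f)
{
    return selector_rpl(f->cs) == 3;
}
```

With the selector values from the table, `0x33 & 3` and `0x2b & 3` both equal 3, which is how the saved CS encodes that the interrupted code was running in Ring 3.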

Per-CPU and Per-Process Kernel Stacks:

Modern kernels maintain multiple stacks for different purposes:

Per-thread kernel stack: Each thread of each process has its own small kernel stack (typically 8-16KB) for syscall handling
Per-CPU interrupt stack: Each CPU has a dedicated interrupt stack for IRQ handling
Per-CPU IST stacks: Interrupt Stack Table entries for critical exceptions (double fault, NMI, machine check)

Why IST Stacks?

Some exceptions (like double fault) can occur even if the kernel stack is corrupted. IST entries provide known-good stacks that are always valid:

// TSS IST entries (per-CPU)
struct tss_struct {
    // ...
    u64 ist[7];  // 7 Interrupt Stack Table entries
    // IST1: Double Fault stack
    // IST2: NMI stack  
    // IST3: Debug stack
    // IST4: Machine Check stack
    // ...
};

Stack Pivot Attacks

If an attacker can control the stack pointer during a domain switch, they may be able to 'pivot' the kernel onto attacker-controlled memory. This is why the kernel never runs on a user-provided stack pointer: interrupts and exceptions receive a kernel stack from the TSS in hardware, and the syscall entry code loads a known-good kernel stack pointer before it touches the stack.

Domain Switching Policies

Beyond the mechanism of domain switching, operating systems must define policies governing when and how transitions are permitted.

Policy Question 1: Who Can Enter Which Domains?

Not all domain transitions are permitted. The access matrix includes domains as objects, with a "switch" or "enter" right:

              │ Kernel Domain │ Debug Domain │ Admin Domain │
──────────────┼───────────────┼──────────────┼──────────────┤
 User Domain  │ Enter(syscall)│ -            │ -            │
──────────────┼───────────────┼──────────────┼──────────────┤
 Admin Domain │ Enter(syscall)│ Enter        │ -            │
──────────────┼───────────────┼──────────────┼──────────────┤
 Kernel Domain│ -             │ Enter        │ Enter        │
──────────────┴───────────────┴──────────────┴──────────────┘

The admin domain and the kernel can enter the debug domain; only the kernel can enter the admin domain; the user and admin domains can both enter the kernel domain (via syscall).
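
The matrix above can be encoded directly as a lookup table. The sketch below uses hypothetical names (`enum dom`, `enter_right`, `can_enter`) and gives the debug domain an empty row, since the matrix lists no rights for it.

```c
#include <stdbool.h>

enum dom { DOM_USER, DOM_ADMIN, DOM_KERNEL, DOM_DEBUG, DOM_COUNT };

/* enter_right[from][to] is true when 'from' holds the Enter right on 'to'. */
static const bool enter_right[DOM_COUNT][DOM_COUNT] = {
    [DOM_USER]   = { [DOM_KERNEL] = true },
    [DOM_ADMIN]  = { [DOM_KERNEL] = true, [DOM_DEBUG] = true },
    [DOM_KERNEL] = { [DOM_DEBUG] = true, [DOM_ADMIN] = true },
    /* DOM_DEBUG: no rights listed, so its row stays all false. */
};

/* Default-deny check: anything not explicitly granted is refused. */
bool can_enter(enum dom from, enum dom to)
{
    if ((unsigned)from >= (unsigned)DOM_COUNT ||
        (unsigned)to >= (unsigned)DOM_COUNT)
        return false;
    return enter_right[from][to];
}
```

Because unlisted entries default to false, adding a new domain grants nothing until a right is written into the table.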

Policy Question 2: What Data Crosses the Boundary?

When domains switch, what happens to register contents, memory mappings, and other state?

Conservative approaches:

Clear all registers on domain entry (prevent information leakage)
Switch page tables completely (prevent cross-domain access)
Flush CPU caches (prevent side-channel attacks)

Performance-oriented approaches:

Preserve argument registers (pass syscall arguments)
Map kernel into all address spaces (no TLB flush on switch)
Keep caches intact (benefit from temporal locality)

KPTI (Kernel Page Table Isolation):

Modern systems use KPTI to mitigate Meltdown-class attacks. The user-mode page tables contain minimal kernel mappings—just enough for the syscall entry point. Upon entry, the kernel switches to a different page table with full kernel mappings:

User mode page table:
├── User space: Fully mapped
└── Kernel space: Only entry trampoline mapped

Kernel mode page table:
├── User space: Fully mapped (for copying data)
└── Kernel space: Fully mapped

Spectre and Domain Boundaries

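Speculative execution attacks can leak data across domain boundaries even with correct access control. In Spectre, an attacker mistrains the branch predictor so that code in the victim domain speculatively executes instructions that leave secret-dependent traces in the cache. In Meltdown, the CPU speculatively reads kernel memory from user mode before the permission check completes. Mitigations include retpolines, IBPB, and IBRS for Spectre-style attacks, and KPTI for Meltdown.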

Domain Switch Vulnerabilities

Domain switching is one of the most security-sensitive operations in an operating system. Historical vulnerabilities illustrate the subtleties involved:

Class 1: Improper Privilege Retention

Failing to drop privileges after temporary elevation:

// VULNERABLE: setuid program
int main(int argc, char *argv[]) {
    open_privileged_resource();  // Needs root
    // BUG: Never dropped privileges!
    execute_user_command(argv[1]);  // Runs as root!
}

Class 2: Race Conditions (TOCTOU)

Time-of-check to time-of-use races during domain switch:

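// VULNERABLE: hypothetical syscall handler (double fetch; error handling omitted)
int sys_submit(struct request __user *ureq) {
    char kbuf[64];
    u32 len;
    get_user(len, &ureq->len);                 // first fetch
    if (len > sizeof(kbuf))                    // CHECK
        return -EINVAL;
    // Another thread rewrites ureq->len here!
    get_user(len, &ureq->len);                 // second fetch
    copy_from_user(kbuf, ureq->data, len);     // USE: len may now exceed kbuf
    return 0;
}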
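
A common way to close a check-to-use window is to copy the caller-controlled data into private memory once and then validate and use only that snapshot. The sketch below shows the pattern in user-space C; `struct request` and `handle_request` are hypothetical, and in a kernel the `memcpy` from the untrusted source would be a `copy_from_user` call.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PAYLOAD 64

/* Hypothetical request layout living in memory the caller can still modify. */
struct request {
    size_t len;
    char payload[MAX_PAYLOAD];
};

/* Returns the number of bytes accepted, or -1 if the request is malformed. */
long handle_request(const struct request *untrusted, char *out, size_t out_size)
{
    struct request snap;

    /* Single fetch: everything below works on the private snapshot. */
    memcpy(&snap, untrusted, sizeof(snap));

    /* Validate the snapshot, not the original. */
    if (snap.len > MAX_PAYLOAD || snap.len > out_size)
        return -1;

    /* Use the same value that was just checked. */
    memcpy(out, snap.payload, snap.len);
    return (long)snap.len;
}
```

Once the snapshot exists, later changes to the caller's memory cannot affect the length that was validated.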

Class 3: Uninitialized Data Leakage

Kernel stack may contain sensitive data from previous operations:

// VULNERABLE: Stack leak
struct response {
    int status;
    char data[64];
};

int sys_getinfo(struct response *user_resp) {
    struct response resp;
    resp.status = get_status();  // data[] not initialized!
    copy_to_user(user_resp, &resp, sizeof(resp));
    // Leaks previous kernel stack contents in resp.data
}
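
The usual fix is to zero the whole structure, padding included, before filling in any field. The following is a minimal user-space sketch of that pattern; `struct response_model` and `build_response` are hypothetical names.

```c
#include <string.h>

struct response_model {
    int status;
    char data[64];
};

/* Fill a response so that no stale bytes survive, including padding. */
void build_response(struct response_model *resp, int status)
{
    memset(resp, 0, sizeof(*resp));  /* clears fields and any padding */
    resp->status = status;
}
```

Zeroing with `memset` rather than assigning fields one by one matters because padding bytes between fields are copied out along with the named fields.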

Historical Domain Switching Vulnerabilities
CVE	Vulnerability	Impact	Root Cause
CVE-2012-0217	SYSRET with non-canonical RCX	Ring 0 code execution	Intel SYSRET faults in Ring 0 on the user stack; OS return paths did not guard against it
CVE-2014-0038	recvmmsg timeout pointer (x32 ABI)	Privilege escalation	User-supplied pointer used without validation
CVE-2016-5195	Dirty COW	Root privilege	Race in copy-on-write handling
CVE-2017-5754	Meltdown	Kernel memory disclosure	Speculative execution past checks
CVE-2018-8897	MOV SS exception delivery	Ring 0 code execution	Interrupt handling during switch

Secure Domain Switching Practices

•Validate all inputs at entry — Never trust data from lower-privilege domains
•Clear sensitive registers on exit — Zero registers that might leak kernel addresses
•Use separate stacks consistently — Never mix user and kernel stack usage
•Initialize all output data — Prevent unintended information disclosure
•Avoid TOCTOU windows — Copy user data to kernel buffers before validation
•Apply mitigations for speculation — Use barriers, KPTI, and microcode updates

Summary: Domain Switching

We've explored the mechanisms, policies, and pitfalls of domain switching. Let's consolidate the key insights:

Key Takeaways

•Domain switching enables controlled privilege transitions — Processes can temporarily acquire higher privileges, use them, and return to lower privilege
•Hardware provides the enforcement mechanism — CPU privilege levels, MSRs for entry points, and TSS for stack switching ensure security
•Software domain switching changes effective identity — setuid/setgid changes privileges without ring transitions
•Controlled entry points prevent arbitrary privileged execution — The kernel specifies exactly where higher-privilege code begins
•Stack switching is critical for isolation — Each domain uses its own stack to prevent data leakage and corruption
•Policies govern what transitions are permitted — Not all domains can switch to all other domains
•Domain switching is a major source of vulnerabilities — TOCTOU races, data leakage, and speculative execution are ongoing threats

What's Next:

We've seen how domains are defined and how switching between them works. Now we'll examine protection rings—the hierarchical domain model implemented in hardware by most processors. Protection rings provide a concrete, efficient implementation of the domain concepts we've discussed.

Page Complete

You now understand how processes transition between protection domains through carefully controlled mechanisms. This knowledge is essential for understanding privilege escalation vulnerabilities and for designing secure system interfaces.