If protection domains were static prisons, processes trapped forever in their initial privilege set, operating systems would be far simpler—but also far less useful. The power of modern protection systems lies in controlled domain switching: the ability for a process to transition from one protection domain to another under carefully enforced rules.
Every time you execute a sudo command, make a system call, attach a debugger, or run a setuid binary, your process crosses a domain boundary. These transitions are among the most security-critical operations in computing. A bug in domain switching can grant unlimited power to unprivileged code; an overly restrictive implementation can make legitimate operations impossible.
Understanding domain switching is understanding the precise mechanisms by which privilege is gained, exercised, and relinquished.
By the end of this page, you will understand how processes transition between protection domains, the hardware and software mechanisms that enable domain switching, security policies governing transitions, and common vulnerabilities that arise when domain switching is implemented incorrectly.
Protection domains provide isolation, but isolation alone is insufficient for practical computing. Programs must interact with the kernel, with privileged system services, and with each other. Domain switching enables these interactions while maintaining security.
The Fundamental Tension:
For example, changing a password requires writing to /etc/shadow, but normal users shouldn't have permanent write access.
Domain switching resolves these tensions by providing controlled, auditable, revocable transitions between privilege levels.
Every domain switch is an opportunity for privilege escalation attacks. If the transition is not performed correctly—if registers aren't cleared, if the stack isn't switched, if the return address can be manipulated—an attacker may gain unauthorized access to the target domain's privileges.
Domain switching requires coordination between hardware and software. The basic phases are:
Phase 1: Switch Request
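The currently executing code requests or triggers a domain transition. This may be explicit (a system call instruction or the execution of a setuid binary) or implicit (a hardware interrupt or an exception such as a page fault).
Phase 2: Privilege Verification
Before the switch occurs, the system verifies the transition is permitted, for example that the caller is entering through a designated entry point it is allowed to use.
Phase 3: Context Save
The CPU and OS save the current domain's execution context, such as the instruction pointer, stack pointer, and flags.
Phase 4: Domain Transition
The actual privilege level change occurs, and execution continues at the target domain's entry point.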
Phase 5: Context Restore (on return)
When the target domain completes, the original context is restored and execution resumes in the original domain.
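The five phases can be modelled in a few lines of ordinary C. This is a minimal sketch, not how any real kernel is structured: `struct context`, `enter_kernel`, `return_to_caller`, and the `KERNEL_ENTRY` and `KERNEL_STACK` constants are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical domains and context; not a real OS structure. */
enum domain { DOMAIN_USER, DOMAIN_KERNEL };

struct context {
    enum domain domain;   /* current protection domain */
    unsigned long ip;     /* instruction pointer */
    unsigned long sp;     /* stack pointer */
};

#define KERNEL_ENTRY 0x1000UL  /* the single permitted entry point (made up) */
#define KERNEL_STACK 0x8000UL  /* stack owned by the kernel domain (made up) */

/* Phase 2: only user -> kernel through the fixed entry point is allowed. */
bool transition_permitted(const struct context *cur, unsigned long target_ip) {
    return cur->domain == DOMAIN_USER && target_ip == KERNEL_ENTRY;
}

/* Phases 1-4: returns true and fills *saved if the switch happened. */
bool enter_kernel(struct context *cur, unsigned long target_ip,
                  struct context *saved) {
    assert(cur != NULL && saved != NULL);
    if (!transition_permitted(cur, target_ip))
        return false;              /* request rejected, state untouched */
    *saved = *cur;                 /* Phase 3: context save */
    cur->domain = DOMAIN_KERNEL;   /* Phase 4: domain transition */
    cur->ip = KERNEL_ENTRY;
    cur->sp = KERNEL_STACK;        /* never keep running on the caller's stack */
    return true;
}

/* Phase 5: restore the caller's context exactly as it was saved. */
void return_to_caller(struct context *cur, const struct context *saved) {
    assert(cur->domain == DOMAIN_KERNEL);
    *cur = *saved;
}
```

A caller would build a `struct context` in `DOMAIN_USER`, call `enter_kernel` with `KERNEL_ENTRY`, and later call `return_to_caller`. A request for any other target address is refused and leaves the context unchanged.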
Modern CPUs provide hardware mechanisms to enforce domain boundaries and enable controlled transitions. Without hardware support, software-only protection could be bypassed by malicious code.
x86/x64 Privilege Transitions:
Entering Ring 0 (User → Kernel):
| Mechanism | Instruction | Modern Usage |
|---|---|---|
| Software Interrupt | int 0x80 | Legacy Linux syscall |
| SYSENTER | sysenter | 32-bit fast syscall |
| SYSCALL | syscall | 64-bit fast syscall |
| Hardware Interrupt | (automatic) | Timer, device, exception |
| Exception | (automatic) | Page fault, GPF, etc. |
Returning to Ring 3 (Kernel → User):
| Mechanism | Instruction | Notes |
|---|---|---|
| IRET | iret | Restores full context |
| SYSEXIT | sysexit | Fast return from sysenter |
| SYSRET | sysret | Fast return from syscall |
The SYSCALL/SYSRET Fast Path:
Modern 64-bit systems use syscall/sysret for performance:
; User-space system call invocation
mov rax, 1 ; syscall number (write)
mov rdi, 1 ; fd = stdout
mov rsi, msg ; buffer address
mov rdx, len ; buffer length
syscall ; ENTER KERNEL DOMAIN
; rax now contains return value
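When syscall executes, the CPU saves the return address in RCX and RFLAGS in R11, loads RIP from the IA32_LSTAR MSR, loads CS and SS from the IA32_STAR MSR, clears the RFLAGS bits selected by the IA32_FMASK MSR, and switches to Ring 0.
Note what does NOT happen automatically: the stack pointer is not switched (RSP still holds the user's value), no registers are saved to memory, and the GS base is not changed. The kernel's entry code must do all of this itself, for example with swapgs and an explicit load of the kernel stack pointer.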
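The sysret instruction has a dangerous quirk on Intel CPUs: if RCX (the return address) holds a non-canonical address, the resulting general protection fault is raised while the CPU is still in Ring 0 but RSP already holds the user-controlled stack pointer. The kernel's fault handler then runs on a stack the attacker chose, which allowed user-space code to gain Ring 0 execution (CVE-2012-0217). Kernels must therefore validate the return address before using sysret. It is a reminder that even CPU instructions can have security-relevant quirks.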
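A common defence is to check that the saved return address is canonical before taking the sysret path, and to fall back to iret otherwise. The sketch below assumes 48-bit virtual addresses (4-level paging). `choose_return_path` and `enum return_path` are hypothetical names for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit virtual addresses (4-level paging). */
#define VADDR_BITS 48
static_assert(VADDR_BITS == 48, "the 0x1FFFF constant below assumes 48 bits");

/* Canonical means bits 63..47 are all copies of bit 47. */
bool is_canonical(uint64_t addr) {
    uint64_t upper = addr >> (VADDR_BITS - 1);  /* the top 17 bits */
    return upper == 0 || upper == 0x1FFFF;
}

/* Hypothetical helper: pick the return instruction for a saved user RIP. */
enum return_path { RETURN_SYSRET, RETURN_IRET };

enum return_path choose_return_path(uint64_t user_rip) {
    /* iret does not fault in Ring 0 on a bad user RIP, so it is the safe fallback. */
    return is_canonical(user_rip) ? RETURN_SYSRET : RETURN_IRET;
}
```

With 5-level paging the width would be 57 bits rather than 48, so a real implementation would derive the constant from the paging mode instead of hard-coding it.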
Not all domain switches involve hardware privilege changes. Software-only domain switching occurs when the OS changes a process's effective privileges without a ring transition.
Unix setuid/setgid Mechanism:
The most common software domain switch is executing a setuid binary. When a file has the setuid bit set, executing it changes the process's effective UID to the file's owner.
$ ls -l /usr/bin/passwd
-rwsr-xr-x 1 root root 68208 Jan 1 2024 /usr/bin/passwd
^
setuid bit (s in owner execute position)
When a regular user executes /usr/bin/passwd:
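The kernel detects the setuid bit during execve() processing and sets the process's effective UID to the file's owner (root), so the program can update root-only files (such as /etc/shadow).
This is entirely software-based: the CPU privilege level doesn't change; user code still runs in Ring 3. But the kernel's access control checks now see root privileges.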
#define _GNU_SOURCE   /* needed for getresuid() on glibc */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

void print_ids(const char *label) {
    uid_t ruid, euid, suid;
    // getresuid() reports the real, effective, and saved UIDs together
    if (getresuid(&ruid, &euid, &suid) < 0) {
        perror("getresuid failed");
        return;
    }
    printf("%s: real=%d, effective=%d, saved=%d\n",
           label, (int)ruid, (int)euid, (int)suid);
}

int main(void) {
    // Assuming this binary is setuid root
    print_ids("Initial"); // real=1000, effective=0, saved=0

    // Do privileged work (e.g., write to protected file)
    FILE *f = fopen("/etc/protected", "w");
    if (f) {
        fprintf(f, "Privileged write\n");
        fclose(f);
    }

    // Drop privileges - critical for security!
    if (seteuid(getuid()) < 0) {
        perror("seteuid failed");
        return 1;
    }
    print_ids("After drop"); // real=1000, effective=1000, saved=0

    // Could restore privileges if needed
    if (seteuid(0) < 0) { // Can restore since saved UID is still 0
        perror("seteuid restore failed");
    }
    print_ids("After restore"); // real=1000, effective=0, saved=0

    // Permanently drop privileges - cannot regain
    if (setuid(getuid()) < 0) {
        perror("setuid failed");
        return 1;
    }
    print_ids("Permanent drop"); // real=1000, effective=1000, saved=1000

    return 0;
}

Setuid programs should drop privileges as soon as possible, keep them for the minimum necessary time, and drop them permanently when elevated access is no longer needed. The saved UID mechanism is what distinguishes a temporary drop (which can be restored) from a permanent one.
A critical security requirement for domain switching is that entry into a higher-privilege domain must occur only at controlled entry points. If an attacker could jump to arbitrary code in the kernel, protection would be meaningless.
The Gate Concept:
A gate is a controlled entry point into a protected domain. It specifies where execution enters the target domain and which callers are allowed to use it.
x86 Gate Types:
| Gate Type | Purpose | Usage |
|---|---|---|
| Interrupt Gate | Hardware/software interrupts | Timer, syscall via int |
| Trap Gate | Software exceptions | Debug breakpoints |
| Call Gate | Controlled privilege calls | Rarely used in modern systems |
| Task Gate | Task switching | Obsolete in 64-bit mode |
Modern Syscall Entry:
For the syscall instruction, modern systems take the entry point from Model-Specific Registers (MSRs) instead of a gate descriptor:
IA32_LSTAR MSR = address of syscall entry point
IA32_STAR MSR = segment selectors for syscall transitions
IA32_FMASK MSR = flags to mask on syscall entry
Only the kernel can write these MSRs, so user code cannot redirect syscall entry.
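To make the role of these MSRs concrete, here is a small user-space model of how the syscall instruction uses them. It is a simplified sketch: `struct cpu` and `do_syscall` are hypothetical, and the loading of CS and SS from IA32_STAR is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified CPU state; field names mirror the registers discussed above. */
struct cpu {
    uint64_t rip, rflags, rcx, r11;
    unsigned cpl;              /* current privilege level (0 or 3) */
    uint64_t lstar, fmask;     /* contents of IA32_LSTAR and IA32_FMASK */
};

#define RFLAGS_IF (1ULL << 9)  /* interrupt enable flag */

/* Model of what the syscall instruction does to this state.
 * next_rip is the address of the instruction following syscall. */
void do_syscall(struct cpu *c, uint64_t next_rip) {
    assert(c->cpl == 3);       /* this sketch only models user-mode callers */
    c->rcx = next_rip;         /* return address saved in RCX */
    c->r11 = c->rflags;        /* caller's flags saved in R11 */
    c->rflags &= ~c->fmask;    /* every bit set in FMASK is cleared */
    c->rip = c->lstar;         /* entry point chosen by the kernel, not the caller */
    c->cpl = 0;
    /* Note: RSP is deliberately untouched, as on real hardware. */
}
```

Because `rip` is taken from `lstar` and never from anything the caller supplies, user code has no way to choose where it lands in the kernel. Putting `RFLAGS_IF` in `fmask` means the entry code starts with interrupts disabled.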
The System Call Table:
Even with a controlled entry point, the kernel must dispatch to the correct handler. The syscall number provided by user code indexes into a table of handler functions:
// Kernel syscall table (simplified)
const syscall_fn_t sys_call_table[] = {
    [0] = sys_read,
    [1] = sys_write,
    [2] = sys_open,
    [3] = sys_close,
    // ... hundreds more
};

// Entry point code
void syscall_entry(struct pt_regs *regs) {
    long nr = regs->rax; // syscall number from user
    if (nr >= NR_syscalls || nr < 0) {
        regs->rax = -ENOSYS; // Invalid syscall number
        return;
    }
    regs->rax = sys_call_table[nr](regs); // Dispatch
}
Note the bounds check—without it, a malicious syscall number could cause out-of-bounds access.
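The same dispatch pattern can be tried out in user space. The following is a minimal sketch with two made-up handlers (`handle_double` and `handle_negate`); it is not kernel code, and it uses one unsigned comparison in place of the two signed ones.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*handler_fn)(long arg);

/* Hypothetical handlers standing in for real system calls. */
long handle_double(long arg) { return arg * 2; }
long handle_negate(long arg) { return -arg; }

static const handler_fn handler_table[] = {
    [0] = handle_double,
    [1] = handle_negate,
};

#define NR_HANDLERS (sizeof(handler_table) / sizeof(handler_table[0]))

/* Dispatch an untrusted number; anything out of range yields -ENOSYS. */
long dispatch(long nr, long arg) {
    /* Casting to unsigned folds the "nr < 0" case into the upper-bound
     * check: a negative value becomes a very large unsigned one. */
    if ((unsigned long)nr >= NR_HANDLERS)
        return -ENOSYS;
    assert(handler_table[nr] != NULL);  /* table has no gaps in this sketch */
    return handler_table[nr](arg);
}
```

For example, `dispatch(0, 21)` calls `handle_double`, while `dispatch(-1, 0)` and `dispatch(2, 0)` are both rejected before the table is touched.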
One of the most critical aspects of domain switching is stack management. The stack contains return addresses, local variables, and sensitive data. Using the wrong stack in the wrong domain is a catastrophic security vulnerability.
Why Separate Stacks Are Required:
The Stack Switch Process:
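When transitioning from Ring 3 to Ring 0 on x86-64 through an interrupt or exception, the CPU loads the kernel stack pointer from the TSS and pushes the user's SS, RSP, RFLAGS, CS, and RIP (plus an error code for some exceptions) onto that kernel stack.
On return (IRET), the CPU pops these values back, restoring the user's stack and privilege level. The kernel stack right after entry looks like this: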
| Kernel Stack | Contents | Purpose |
|---|---|---|
| Top → SS (user) | 0x2b (user data segment) | Restore user stack segment |
| RSP (user) | User's stack pointer value | Restore user stack position |
| RFLAGS | User's CPU flags | Restore interrupt state, etc. |
| CS (user) | 0x33 (user code segment) | Restore user privilege level |
| RIP | User's instruction pointer | Resume execution here |
| [Error code] | Exception-specific | Only for some exceptions |
| ← New RSP | Kernel working space | Kernel uses from here down |
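
The frame in the table maps naturally onto a C structure. This is a sketch under the assumption that no error code was pushed. `struct iret_frame` and `returns_to_user` are illustrative names, and the selector values in the usage note are the ones shown in the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Frame as laid out in memory, lowest address first. RIP was pushed last,
 * so it sits at the current RSP when there is no error code. */
struct iret_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

static_assert(sizeof(struct iret_frame) == 5 * 8, "five 8-byte slots");
static_assert(offsetof(struct iret_frame, ss) == 32, "SS was pushed first");

/* The low two bits of a segment selector hold the requested privilege level. */
unsigned selector_rpl(uint64_t selector) { return (unsigned)(selector & 3); }

/* True if returning through this frame lands in Ring 3. */
bool returns_to_user(const struct iret_frame *f) {
    return selector_rpl(f->cs) == 3 && selector_rpl(f->ss) == 3;
}
```

A frame with `cs = 0x33` and `ss = 0x2b` returns to Ring 3, because both selectors end in binary 11. That is why a kernel must never let user code overwrite the saved CS slot: changing those two bits changes the privilege level that IRET restores.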
Per-CPU and Per-Process Kernel Stacks:
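Modern kernels maintain multiple stacks for different purposes, such as a per-thread kernel stack for system calls and per-CPU Interrupt Stack Table (IST) stacks for critical exceptions.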
Why IST Stacks?
Some exceptions (like double fault) can occur even if the kernel stack is corrupted. IST entries provide known-good stacks that are always valid:
// TSS IST entries (per-CPU)
struct tss_struct {
    // ...
    u64 ist[7]; // 7 Interrupt Stack Table entries
    // IST1: Double Fault stack
    // IST2: NMI stack
    // IST3: Debug stack
    // IST4: Machine Check stack
    // ...
};
If an attacker can control the stack pointer during a domain switch, they may be able to 'pivot' the kernel onto attacker-controlled memory. This is why hardware-enforced stack switching (via TSS) is critical—the kernel doesn't trust any user-provided stack pointer on entry.
Beyond the mechanism of domain switching, operating systems must define policies governing when and how transitions are permitted.
Policy Question 1: Who Can Enter Which Domains?
Not all domain transitions are permitted. The access matrix includes domains as objects, with a "switch" or "enter" right:
              │ Kernel Domain │ Debug Domain │ Admin Domain │
──────────────┼───────────────┼──────────────┼──────────────┤
User Domain   │ Enter(syscall)│ -            │ -            │
──────────────┼───────────────┼──────────────┼──────────────┤
Admin Domain  │ Enter(syscall)│ Enter        │ -            │
──────────────┼───────────────┼──────────────┼──────────────┤
Kernel Domain │ -             │ Enter        │ Enter        │
──────────────┴───────────────┴──────────────┴──────────────┘
Reading the rows: admin and kernel code can enter the debug domain; only the kernel can enter the admin domain; user and admin code can both enter the kernel domain (via syscall).
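The matrix above can be encoded directly as a lookup table. This is a minimal sketch: `enum dom`, `enter_right`, and `can_enter` are hypothetical names, and the debug domain is given no outgoing rights because the matrix has no row for it.

```c
#include <assert.h>
#include <stdbool.h>

enum dom { DOM_USER, DOM_ADMIN, DOM_KERNEL, DOM_DEBUG, DOM_COUNT };

/* enter_right[from][to] is true when "from" holds the Enter right on "to".
 * Entries not listed default to false, which matches "-" in the matrix. */
static const bool enter_right[DOM_COUNT][DOM_COUNT] = {
    [DOM_USER]   = { [DOM_KERNEL] = true },
    [DOM_ADMIN]  = { [DOM_KERNEL] = true, [DOM_DEBUG] = true },
    [DOM_KERNEL] = { [DOM_DEBUG]  = true, [DOM_ADMIN] = true },
};

/* Policy check performed before any domain switch is attempted. */
bool can_enter(enum dom from, enum dom to) {
    assert(from < DOM_COUNT && to < DOM_COUNT);
    return enter_right[from][to];
}
```

Defaulting every unlisted pair to "denied" is the important design choice here: adding a new domain grants it nothing until a right is written down explicitly.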
Policy Question 2: What Data Crosses the Boundary?
When domains switch, what happens to register contents, memory mappings, and other state?
Conservative approaches:
Performance-oriented approaches:
KPTI (Kernel Page Table Isolation):
Modern systems use KPTI to mitigate Meltdown-class attacks. The user-mode page tables contain minimal kernel mappings—just enough for the syscall entry point. Upon entry, the kernel switches to a different page table with full kernel mappings:
User mode page table:
├── User space: Fully mapped
└── Kernel space: Only entry trampoline mapped
Kernel mode page table:
├── User space: Fully mapped (for copying data)
└── Kernel space: Fully mapped
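Speculative execution attacks such as Meltdown and Spectre can leak data across domain boundaries even with proper access control. In Meltdown, the CPU may speculatively access kernel memory from user mode before the permission check completes, which is what KPTI defends against. Spectre-class attacks steer speculation through mistrained branch predictors; mitigations include retpolines, IBPB, and IBRS.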
Domain switching is one of the most security-sensitive operations in an operating system. Historical vulnerabilities illustrate the subtleties involved:
Class 1: Improper Privilege Retention
Failing to drop privileges after temporary elevation:
// VULNERABLE: setuid program
int main(int argc, char *argv[]) {
    open_privileged_resource(); // Needs root
    // BUG: Never dropped privileges!
    execute_user_command(argv[1]); // Runs as root!
}
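One way to close this hole is to drop privileges permanently between the two calls. The sketch below assumes the binary is setuid root, in which case setuid() resets the real, effective, and saved UIDs together. `drop_privileges_permanently` is a hypothetical helper name.

```c
#include <sys/types.h>
#include <unistd.h>

/* Permanently give up elevated IDs. Returns 0 on success, -1 on failure.
 * Assumption: the program is setuid root, so setuid()/setgid() also reset
 * the saved IDs. */
int drop_privileges_permanently(void) {
    uid_t uid = getuid();   /* real UID: the user who ran the program */
    gid_t gid = getgid();   /* real GID */

    /* Group first: once the UID is dropped we may no longer be allowed
     * to change the GID. */
    if (setgid(gid) != 0)
        return -1;
    if (setuid(uid) != 0)
        return -1;

    /* Verify instead of trusting the return codes alone. */
    if (geteuid() != uid || getegid() != gid)
        return -1;
    return 0;
}
```

In the vulnerable program, the fix would be to call this right after `open_privileged_resource()` and to exit if it returns -1, so that `execute_user_command()` never runs with root's identity. Supplementary groups are not handled here; a complete implementation would also reset them.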
Class 2: Race Conditions (TOCTOU)
Time-of-check to time-of-use races during domain switch:
// VULNERABLE: Kernel syscall handler
int sys_read(int fd, char *buf, size_t count) {
    if (!access_ok(VERIFY_WRITE, buf, count)) // CHECK
        return -EFAULT;
    // Another thread remaps 'buf' to kernel memory!
    copy_to_user(buf, kernel_data, count); // USE
}
Class 3: Uninitialized Data Leakage
Kernel stack may contain sensitive data from previous operations:
// VULNERABLE: Stack leak
struct response {
    int status;
    char data[64];
};

int sys_getinfo(struct response *user_resp) {
    struct response resp;
    resp.status = get_status(); // data[] not initialized!
    copy_to_user(user_resp, &resp, sizeof(resp));
    // Leaks previous kernel stack contents in resp.data
}
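The usual fix is to zero the whole structure before filling it in, so that unset fields and padding bytes carry nothing from earlier stack use. Here is a user-space sketch of that pattern. `build_response` is a hypothetical helper, and `get_status` is a stand-in that returns a fixed value.

```c
#include <assert.h>
#include <string.h>

struct response {
    int status;
    char data[64];
};

/* Hypothetical stand-in for the kernel's get_status(). */
int get_status(void) { return 7; }

/* Build the reply in a fully zeroed buffer so no stale bytes can escape. */
void build_response(struct response *out) {
    assert(out != NULL);
    memset(out, 0, sizeof(*out));   /* clears data[] and any padding bytes */
    out->status = get_status();
}
```

Zeroing with memset over `sizeof(*out)` matters more than initializing fields one by one, because a field-by-field initializer is not guaranteed to clear compiler-inserted padding.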
| CVE | Vulnerability | Impact | Root Cause |
|---|---|---|---|
| CVE-2012-0217 | SYSRET with non-canonical RCX | Ring 0 code execution | Intel SYSRET faults in Ring 0; kernels did not validate the return address |
| CVE-2014-0038 | recvmmsg timeout pointer (x32 ABI) | Privilege escalation | User-supplied pointer used without validation |
| CVE-2016-5195 | Dirty COW | Root privilege | Race in copy-on-write handling |
| CVE-2017-5754 | Meltdown | Kernel memory disclosure | Speculative execution past checks |
| CVE-2018-8897 | MOV SS exception delivery | Ring 0 code execution | Interrupt handling during switch |
We've explored the mechanisms, policies, and pitfalls of domain switching. Let's consolidate the key insights:
What's Next:
We've seen how domains are defined and how switching between them works. Now we'll examine protection rings—the hierarchical domain model implemented in hardware by most processors. Protection rings provide a concrete, efficient implementation of the domain concepts we've discussed.
You now understand how processes transition between protection domains through carefully controlled mechanisms. This knowledge is essential for understanding privilege escalation vulnerabilities and for designing secure system interfaces.