We've traced the system call from user code, through the trap instruction, into the kernel, and through parameter handling. Now we complete the journey: returning to user mode.
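This return path might seem simple (just reverse what we did on entry), but it is actually one of the most complex and security-sensitive parts of the kernel. Before returning, the kernel must check for pending signals, honor any rescheduling request, run tracing and audit hooks, restore the user registers, and drop back to user privilege.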
Every one of these steps has security implications. A bug in the return path can leak kernel data, skip security checks, or enable privilege escalation.
By the end of this page, you will understand the complete return path from kernel to user mode, signal delivery during system call return, the opportunity for context switches, register restoration, and the security considerations that make this path critical.
After the kernel completes the requested operation, it must return control to user space. This isn't simply the reverse of entry—additional processing occurs that makes the return path more complex than the entry path.
Key Operations on Return:
// arch/x86/entry/common.c (simplified)
__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
{
    // Check if there's work to do before returning
    unsigned long work = READ_ONCE(current_thread_info()->flags);

    if (unlikely(work & EXIT_TO_USER_MODE_WORK))
        work = exit_to_user_mode_loop(regs, work);

    // Final preparations
    lockdep_hardirqs_on_prepare();
    instrumentation_end();

    // Restore state and return
    // (handled in assembly after this returns)
}

static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                            unsigned long work)
{
    while (work & EXIT_TO_USER_MODE_WORK) {
        // Handle pending signals
        if (work & _TIF_SIGPENDING) {
            do_signal(regs);
        }

        // Handle rescheduling request
        if (work & _TIF_NEED_RESCHED) {
            schedule();  // Potentially switch to another process
        }

        // Handle audit/tracing
        if (work & _TIF_SYSCALL_TRACE) {
            tracehook_report_syscall_exit(regs, 0);
        }

        // More work might have arrived, re-check
        work = READ_ONCE(current_thread_info()->flags);
    }
    return work;
}

The return path is a loop because handling one piece of work can leave more behind. For example, a new signal may become pending while the task is switched out inside schedule(), or an interrupt taken during signal setup may set TIF_NEED_RESCHED. The kernel loops until no work flags remain, and only then returns to user space.
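The re-check at the bottom of the loop can be illustrated with a small user-space model. This is only a sketch: the flag names, `struct toy_task`, and the `toy_*` functions are hypothetical stand-ins and not kernel code. It models a signal that becomes pending while the task is inside the scheduler, which forces a second pass through the loop.

```c
#include <assert.h>

/* Hypothetical flag bits standing in for the kernel's _TIF_* values. */
enum {
    WORK_SIGPENDING   = 1 << 0,
    WORK_NEED_RESCHED = 1 << 1,
    WORK_MASK         = WORK_SIGPENDING | WORK_NEED_RESCHED
};

/* Toy stand-in for the per-thread flag word plus some counters. */
struct toy_task {
    unsigned int flags;
    int signals_handled;
    int reschedules;
    int late_signal;   /* nonzero: one signal arrives "during" the schedule */
};

static void toy_schedule(struct toy_task *t)
{
    t->flags &= ~(unsigned int)WORK_NEED_RESCHED;
    t->reschedules++;
    if (t->late_signal) {          /* new work shows up while we were away */
        t->late_signal = 0;
        t->flags |= WORK_SIGPENDING;
    }
}

static void toy_do_signal(struct toy_task *t)
{
    t->flags &= ~(unsigned int)WORK_SIGPENDING;
    t->signals_handled++;
}

/* Returns how many passes were needed before no work remained. */
static int toy_exit_loop(struct toy_task *t)
{
    int passes = 0;
    unsigned int work = t->flags;

    while (work & WORK_MASK) {
        if (work & WORK_SIGPENDING)
            toy_do_signal(t);
        if (work & WORK_NEED_RESCHED)
            toy_schedule(t);
        work = t->flags;           /* re-read: more work may have arrived */
        passes++;
    }
    assert((t->flags & WORK_MASK) == 0);
    return passes;
}
```

Starting with only the reschedule bit set and `late_signal = 1`, the first pass runs the toy scheduler, which raises the signal bit, and the second pass handles that signal. A single straight-line check would have returned with the signal still pending.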
Signals are the UNIX mechanism for asynchronous notification—interrupting a process to inform it of events like SIGINT (Ctrl+C), SIGCHLD (child exited), or SIGSEGV (segmentation fault).
The system call return path is, by design, where signals are delivered. This is one of the few points where the kernel has complete control over user state and can safely redirect execution.
Why Deliver Signals Here?
// kernel/signal.c (simplified)
void do_signal(struct pt_regs *regs)
{
    struct ksignal ksig;

    // Get next pending signal
    if (get_signal(&ksig)) {
        // Handle the signal
        // For SIGKILL, SIGSTOP: immediate action, no handler
        // For others with handler: set up handler execution
        if (ksig.ka.sa.sa_handler != SIG_DFL) {
            // Redirect execution to signal handler
            handle_signal(&ksig, regs);
            return;
        }
        // Default action (terminate, stop, ignore, etc.)
        // ...
    }

    // No signals or all ignored
    // Check if we need to restart a system call
    restore_saved_sigmask();
}

static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
    // Set up the signal frame on user stack:
    // 1. Save current user registers (including return address)
    // 2. Push signal number and optional siginfo
    // 3. Set up return address to sigreturn trampoline
    // 4. Modify saved RIP to point to signal handler
    // 5. Modify saved RSP to point to new stack frame
    if (setup_rt_frame(ksig, regs) < 0) {
        // Can't deliver signal, force terminate
        force_sigsegv(ksig->sig);
    }

    // Now when we return to user mode:
    // - RIP will be signal handler address
    // - RSP will point to signal frame
    // - Arguments will be signal number, info, context
    // - Return address on stack is sigreturn trampoline
}

The Signal Stack Frame:
When delivering a signal, the kernel constructs a stack frame on the user stack:
User Stack After Signal Setup:
+------------------+
| siginfo_t | Signal details (sender, reason)
+------------------+
| ucontext_t | Saved user context (all registers)
+------------------+
| Return address | Points to sigreturn trampoline
+------------------+ <-- New RSP when handler starts
| (Signal |
| handler | Normal function prologue
| frame...) |
When the handler returns (via ret), it jumps to the sigreturn trampoline, which invokes the rt_sigreturn system call. This call restores the original context from the ucontext_t, allowing execution to resume where it was interrupted.
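The ability to restore arbitrary context via sigreturn is also a powerful exploitation primitive, known as sigreturn-oriented programming (SROP). An attacker who controls stack contents can forge a signal frame and invoke sigreturn to set every register to a chosen value. One mitigation is to embed a secret cookie or canary in the signal frame so the kernel can detect a forged frame; OpenBSD does this, and similar patches have been proposed for Linux.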
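From user space, the frame described above is visible as the three arguments a handler receives when it is installed with SA_SIGINFO. The following is a minimal sketch assuming a POSIX system and a single-threaded caller; `on_signal` and `deliver_once` are illustrative names, not part of any API. It raises a signal against itself and records that the kernel supplied both a `siginfo_t` and a context pointer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t seen_signo;
static volatile sig_atomic_t got_context;

/* Three-argument handler: signal number, siginfo_t, and saved user context. */
static void on_signal(int signo, siginfo_t *info, void *uctx)
{
    seen_signo = signo;
    got_context = (info != NULL && uctx != NULL && info->si_signo == signo);
    /* Returning from here enters the sigreturn trampoline, which asks the
       kernel to restore the interrupted context. */
}

/* Installs the handler, raises signo once, and returns the signal number
   the handler observed, or -1 on failure. */
static int deliver_once(int signo)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old) != 0)
        return -1;

    seen_signo = 0;
    got_context = 0;
    /* In a single-threaded program, raise() returns only after the
       handler has run and returned. */
    if (raise(signo) != 0)
        return -1;

    sigaction(signo, &old, NULL);   /* put the previous disposition back */
    return got_context ? (int)seen_signo : -1;
}
```

The fact that execution continues normally after `raise()` is the sigreturn step at work: the handler's `ret` lands in the trampoline, and the kernel reloads the saved registers.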
The system call return path is a preemption point—a safe place for the kernel to switch to a different process. If another process has higher priority or this process has exhausted its time slice, the scheduler takes control.
When TIF_NEED_RESCHED Is Set:
// The reschedule check in return path
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                            unsigned long work)
{
    while (work & EXIT_TO_USER_MODE_WORK) {
        local_irq_enable_exit_to_user(work);

        if (work & _TIF_NEED_RESCHED) {
            // Call the scheduler - may not return immediately!
            schedule();
            // When schedule() returns, we're executing again
            // but potentially much later (milliseconds, seconds,
            // or even longer after suspend/resume)
        }

        // ... other checks ...

        work = READ_ONCE(current_thread_info()->flags);
    }
    return work;
}

// The schedule() function:
// 1. Saves current process context to its task struct
// 2. Selects next process to run (scheduler algorithm)
// 3. Switches page tables to new process (CR3 on x86)
// 4. Restores next process context
// 5. Returns (but now "we" are the new process)

// From perspective of the original process:
// - schedule() is called at time T
// - Process sleeps
// - At time T+Δ, process is selected again
// - schedule() returns, process continues
// - The delay Δ can be arbitrarily long

With kernel preemption enabled (CONFIG_PREEMPT), context switches can occur not just at system call return but also during kernel execution at various preemption points. The return-to-user path is just one of many places where schedule() might be called.
After all pending work is complete, the kernel restores user registers from the saved state. This must be done precisely—any mistake could leak kernel data or corrupt user state.
; arch/x86/entry/entry_64.S (simplified)
; After syscall_exit_to_user_mode() returns

SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode)
    ; We're on the kernel stack, about to return to user mode

    ; Restore general purpose registers from pt_regs, except RDI.
    ; RAX is reloaded from pt_regs->ax, which holds the return value.
    POP_REGS pop_rdi=0

    ; The stack now holds: user RDI, orig_ax, RIP, CS, RFLAGS, RSP, SS.
    ; Remember it in RDI (scratch) and move to the per-CPU entry stack.
    movq    %rsp, %rdi
    movq    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp

    ; Copy the iret frame to the new stack
    pushq   6*8(%rdi)           ; user SS
    pushq   5*8(%rdi)           ; user RSP
    pushq   4*8(%rdi)           ; user RFLAGS
    pushq   3*8(%rdi)           ; user CS
    pushq   2*8(%rdi)           ; user RIP

    ; Restore RDI (was used as scratch)
    movq    (%rdi), %rdi

    ; Switch GS back to user value
    swapgs

    ; Return to user mode
    ; This pops RIP, CS, RFLAGS, RSP, SS from the stack
    iretq

; For SYSRET path (faster but restrictions apply):
SYM_INNER_LABEL(syscall_return_via_sysret)
    ; Restore registers (most were saved to pt_regs)
    movq    R15(%rsp), %r15
    movq    R14(%rsp), %r14
    ; ... restore other registers ...
    movq    RDI(%rsp), %rdi
    movq    RSI(%rsp), %rsi

    ; Load return address into RCX (SYSRET uses RCX as RIP)
    movq    RIP(%rsp), %rcx

    ; Load flags into R11 (SYSRET uses R11 as RFLAGS)
    movq    EFLAGS(%rsp), %r11

    ; Load user stack pointer
    movq    RSP(%rsp), %rsp

    ; Switch GS
    swapgs

    ; Return!
    ; SYSRET loads: RIP from RCX, RFLAGS from R11, CS/SS from MSRs
    sysretq

Before returning, the kernel should clear any registers that might contain sensitive kernel data and aren't being explicitly restored with user values. Failure to do so could leak kernel addresses (defeating KASLR) or other secrets. Modern kernels explicitly zero such registers.
x86-64 Linux can return to user space via two different instructions, each with different characteristics:
| Property | SYSRET | IRET |
|---|---|---|
| Speed | ~20-30 cycles | ~40-100 cycles |
| State source | RCX (RIP), R11 (RFLAGS) | Stack frame |
| Stack switch | Manual (before instruction) | Automatic |
| RIP restrictions | Must be canonical (< 0x8000_0000_0000) | No restrictions |
| Use case | Normal syscall return | Signals, special cases |
| Segment handling | From MSRs (fixed) | From stack (flexible) |
The SYSRET Vulnerability
SYSRET has a subtle security issue on Intel CPUs: if RCX contains a non-canonical address (e.g., 0x8000_0000_0000_0000), SYSRET raises a #GP fault, and the fault is delivered while the CPU is still in ring 0, before privilege is actually dropped. By that point the kernel has already restored the user's RSP and swapped GS back to the user value.
This means the #GP handler runs in kernel mode but on a user-controlled stack pointer and with user GS. An attacker who controls those values can exploit this for privilege escalation.
Linux's Solution:
// Before using SYSRET, Linux validates the return address that
// SYSRET will load from RCX:
if (regs->ip >= TASK_SIZE_MAX) {
    // Non-canonical or too high, use IRET instead
    // IRET doesn't have this vulnerability
    use_iret();
} else {
    use_sysret();
}
IRET is used instead of SYSRET when: returning to a signal handler (different CS might be used), returning to 32-bit code (compatibility mode), RIP is non-canonical, or any special handling is needed. The kernel tracks which path to use and falls back to IRET when SYSRET isn't safe.
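The canonical-address test behind that fallback is simple bit arithmetic. The sketch below assumes 48-bit virtual addresses (4-level paging); `is_canonical`, `choose_return_path`, and `TOY_TASK_SIZE_MAX` are illustrative names, and the limit value is an assumption chosen to sit one page below the canonical boundary rather than something taken from this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit virtual addresses (4-level paging). */
#define VA_BITS 48

/* Assumed user-space limit: one 4 KiB page below 2^47. */
#define TOY_TASK_SIZE_MAX UINT64_C(0x00007ffffffff000)

enum return_path { RETURN_VIA_SYSRET, RETURN_VIA_IRET };

/* An address is canonical when bits 63..47 are all zeros or all ones. */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> (VA_BITS - 1);          /* bits 63..47 */
    return top == 0 || top == (UINT64_MAX >> (VA_BITS - 1));
}

/* Mirrors the check above: anything suspicious takes the slower IRET path. */
static enum return_path choose_return_path(uint64_t user_rip)
{
    if (!is_canonical(user_rip) || user_rip >= TOY_TASK_SIZE_MAX)
        return RETURN_VIA_IRET;
    return RETURN_VIA_SYSRET;
}
```

Note that a kernel-half address such as 0xffffffff81000000 is canonical but still fails the range check, so it also takes the IRET path in this model.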
Kernel Page Table Isolation (KPTI) was introduced to mitigate the Meltdown vulnerability. With KPTI, the kernel and user space use different page tables, requiring switches on every system call entry and exit.
The Two Page Tables:
User page tables: Contain user mappings plus a minimal kernel mapping (just enough to handle the syscall entry and switch to full kernel tables).
Kernel page tables: Full kernel and user mappings, used during kernel execution.
The Return Path with KPTI:
; KPTI return path (arch/x86/entry/entry_64.S, simplified)

SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode)
    ; KPTI: Switch from kernel page tables to user page tables

    ; Get the user page table address
    ; (simplified: shown as a per-CPU variable; the real kernel
    ; derives it from the current CR3 value)
    movq    PER_CPU_VAR(user_cr3), %rdi

    ; We need to switch CR3, but first finish on the kernel stack
    ; because user page tables don't map the kernel stack!

    ; Build the IRET frame on the special "trampoline" stack
    ; This stack IS mapped in user page tables
    movq    %rsp, %rsi          ; remember the kernel stack (frame source)
    movq    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp

    ; Push IRET frame (this stack is mapped in both page tables)
    pushq   5*8(%rsi)           ; SS
    pushq   4*8(%rsi)           ; RSP
    pushq   3*8(%rsi)           ; RFLAGS
    pushq   2*8(%rsi)           ; CS
    pushq   1*8(%rsi)           ; RIP

    ; Restore user registers from saved state
    ; (they were copied to the trampoline area)
    ; ...

    ; Now switch page tables
    ; After this, kernel memory is unmapped!
    movq    %rdi, %cr3

    ; We're now running on user page tables
    ; Only the trampoline code/stack is accessible

    ; Switch GS back to user
    swapgs

    ; Return to user mode
    ; IRET pops the frame we built and jumps to user code
    iretq

; The trampoline is tiny (~one page) and mapped at a
; consistent address in both page table sets.
; It contains just enough code to do the final switch.

Performance Impact:
KPTI adds overhead to every system call:
| Operation | Overhead |
|---|---|
| CR3 write | ~50-100 cycles |
| TLB flush (without PCID) | ~200-500 cycles |
| TLB flush (with PCID) | ~50 cycles |
With PCID (Process Context ID), the CPU can keep separate TLB entries for user and kernel address spaces, dramatically reducing the flush cost. Modern CPUs with PCID see only modest (~1-5%) overhead from KPTI.
PCID tags TLB entries with an identifier. By using different PCIDs for kernel and user page tables, the CPU doesn't need to flush the entire TLB on CR3 switch—just use a different PCID. This makes KPTI nearly free on modern CPUs with good PCID support.
When a signal interrupts a system call, the kernel must decide: should the call fail with EINTR, or should it automatically restart when the signal handler returns?
The Problem:
// User code
ssize_t n = read(fd, buf, 1000); // Blocks waiting for data
// Signal arrives! Handler runs. Then what?
// Option 1: read() returns -1 with errno=EINTR (caller must retry)
// Option 2: read() automatically restarts (caller doesn't notice)
Linux's Solution:
The kernel tracks the original system call number and restart behavior in the saved register state:
// Signal handling decides if restart is appropriate
void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
    bool restart = false;
    int retval = regs->ax;  // Current return value

    // Check if the system call should be restarted
    if (retval == -ERESTARTSYS) {
        // Restart only if the handler was installed with SA_RESTART
        if (ksig->ka.sa.sa_flags & SA_RESTART) {
            restart = true;
        } else {
            regs->ax = -EINTR;  // Return EINTR to user
        }
    } else if (retval == -ERESTARTNOINTR) {
        // Always restart, unconditionally
        restart = true;
    } else if (retval == -ERESTARTNOHAND) {
        // Restart only if no handler runs; here one does, so fail
        regs->ax = -EINTR;
    }

    if (restart) {
        // Restore original RAX (system call number)
        regs->ax = regs->orig_ax;
        // Move RIP back to the syscall instruction
        regs->ip -= 2;  // sizeof(syscall instruction)
        // When we return, syscall will re-execute!
    }

    // Now set up signal handler...
}

// The different restart codes:
// -ERESTARTSYS    : Restart if SA_RESTART set
// -ERESTARTNOINTR : Always restart (used by futex)
// -ERESTARTNOHAND : Restart only if no handler runs
// -EINTR          : Don't restart (return error to user)

| Return Code | SA_RESTART Set | SA_RESTART Not Set |
|---|---|---|
| -ERESTARTSYS | Restart syscall | Return -EINTR |
| -ERESTARTNOINTR | Restart syscall | Restart syscall |
| -ERESTARTNOHAND | Return -EINTR | Return -EINTR |
| -EINTR | Return -EINTR | Return -EINTR |

The table describes the case where a signal handler runs. -ERESTARTNOHAND restarts the call only when no handler runs, regardless of SA_RESTART.
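The restart rules can be condensed into one pure function. This is a sketch rather than kernel code: `decide_restart` and `struct toy_outcome` are hypothetical names, and the numeric values assigned to the restart codes are assumptions made so the example compiles on its own. The `handler_runs` parameter covers the case where the signal is ignored or takes a non-fatal default action, so that no user handler executes.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed numeric values, defined locally so the sketch is self-contained. */
#define TOY_EINTR           4
#define TOY_ERESTARTSYS     512
#define TOY_ERESTARTNOINTR  513
#define TOY_ERESTARTNOHAND  514

struct toy_outcome {
    bool restart;    /* true: re-execute the syscall instruction */
    long user_ret;   /* value user space sees; meaningful when !restart */
};

static struct toy_outcome decide_restart(long retval, bool handler_runs,
                                         bool sa_restart)
{
    struct toy_outcome out = { false, retval };

    switch (retval) {
    case -TOY_ERESTARTNOINTR:
        out.restart = true;                          /* always restart */
        break;
    case -TOY_ERESTARTSYS:
        out.restart = !handler_runs || sa_restart;   /* handler needs SA_RESTART */
        break;
    case -TOY_ERESTARTNOHAND:
        out.restart = !handler_runs;                 /* only if no handler runs */
        break;
    default:
        return out;          /* ordinary result, including a plain -EINTR */
    }

    if (!out.restart)
        out.user_ret = -TOY_EINTR;   /* restart codes never reach user space */
    return out;
}
```

Keeping the decision separate from the register manipulation makes each case easy to check against the rules one at a time.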
When installing a signal handler with sigaction(), setting the SA_RESTART flag causes most blocking system calls to automatically restart after the handler returns. This simplifies application code by avoiding the need to manually retry on EINTR. However, some calls (like select/poll with timeout) are never restarted because their time-sensitive nature makes restart semantics unclear.
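The effect of SA_RESTART can be observed from user space with a blocking read() on an empty pipe. The sketch below assumes a POSIX system with setitimer(); `on_alarm`, `interrupted_read`, and `struct read_result` are illustrative names. A one-shot 50 ms timer delivers SIGALRM while read() is blocked, the handler writes one byte into the pipe, and the caller counts how many times it saw EINTR before the read succeeded.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static int wake_fd = -1;   /* write end of the pipe, used by the handler */

static void on_alarm(int signo)
{
    const char byte = 'x';
    ssize_t ignored;

    (void)signo;
    ignored = write(wake_fd, &byte, 1);   /* write() is async-signal-safe */
    (void)ignored;
}

struct read_result {
    int eintr_seen;   /* how many times read() failed with EINTR */
    ssize_t bytes;    /* final return value of read() */
};

static struct read_result interrupted_read(int use_sa_restart)
{
    struct read_result r = { 0, -1 };
    struct sigaction sa, old;
    struct itimerval timer;
    int fds[2];
    char buf;

    if (pipe(fds) != 0)
        return r;
    wake_fd = fds[1];

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = use_sa_restart ? SA_RESTART : 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, &old);

    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_usec = 50000;       /* one shot, 50 ms from now */
    setitimer(ITIMER_REAL, &timer, NULL);

    for (;;) {                            /* manual retry loop */
        r.bytes = read(fds[0], &buf, 1);
        if (r.bytes >= 0 || errno != EINTR)
            break;
        r.eintr_seen++;
    }

    sigaction(SIGALRM, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    return r;
}
```

With SA_RESTART the kernel re-executes read() after the handler returns, so the caller never sees EINTR. Without it, the caller normally sees EINTR once and succeeds on the retry; if the timer happens to fire before read() is entered, the count is zero, which is why the manual retry loop is still the robust pattern.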
The return path is also where system call tracing and auditing hooks execute. This enables debugging tools (strace) and security monitoring (auditd). Seccomp filters, by contrast, run at system call entry, before the call executes.
Ptrace Exit Tracing:
When a process is being traced (e.g., by strace), the kernel stops at system call exit to report the return value:
// Called on return path if TIF_SYSCALL_TRACE is set
static void syscall_exit_trace(struct pt_regs *regs)
{
    // Report to tracer (e.g., strace)
    if (test_thread_flag(TIF_SYSCALL_TRACE)) {
        tracehook_report_syscall_exit(regs, 0);
        // This might:
        // 1. Stop the process (PTRACE_SYSCALL)
        // 2. Notify tracer of return value
        // 3. Allow tracer to modify return value
    }

    // Audit logging (if enabled)
    if (unlikely(audit_context())) {
        audit_syscall_exit(regs);
        // Logs: syscall number, return value, arguments
        // Used for security auditing and compliance
    }

    // Seccomp is not consulted here: its filters run at
    // syscall entry, before the call executes
}

// strace output showing entry and exit:
// open("/etc/passwd", O_RDONLY) = 3
// ^-- entry params               ^-- exit return value
//
// The tracer sees both the entry (arguments) and exit (result)

Tracer's View:
Tools like strace use ptrace to intercept system calls:
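The tracer issues PTRACE_SYSCALL, which causes the tracee to stop at syscall entry. It then issues PTRACE_SYSCALL again, which runs the tracee until it stops at syscall exit. With both stops observed, the tracer can print a line such as read(3, "hello", 5) = 5.

Audit Logging: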
type=SYSCALL msg=audit(1234567:1): arch=c000003e syscall=2
success=yes exit=3 a0=7ffd5a3c1100 a1=0 a2=0 a3=0
items=1 ppid=1234 pid=5678 uid=1000 gid=1000
comm="cat" exe="/bin/cat"
This audit record shows open() (syscall 2) returning fd 3, with the process details and arguments logged for security review.
System call tracing adds significant overhead—each traced call requires stopping the process, context switching to the tracer, and back. strace can slow a program by 10-100x. For production debugging, consider eBPF-based tools like bpftrace which have much lower overhead.
The return path is security-critical. Any vulnerability here could allow kernel data to leak, security checks to be skipped, or privilege to be escalated.
// Security measures in the return path
void prepare_exit_to_usermode(struct pt_regs *regs)
{
    // 1. Validate return RIP is in user range
    if (regs->ip >= TASK_SIZE_MAX) {
        // Force SIGSEGV - return address is bad
        force_sig(SIGSEGV);
        return;
    }

    // 2. Sanitize RFLAGS
    // User cannot set: IOPL, VM, VIF, VIP, or reserved bits
    regs->flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM |
                     X86_EFLAGS_VIF | X86_EFLAGS_VIP);
    // Ensure IF is set (interrupts enabled in user mode)
    regs->flags |= X86_EFLAGS_IF;

    // 3. Validate segment selectors
    // (though these should always be correct from entry save)
    regs->cs = __USER_CS;
    regs->ss = __USER_DS;

    // 4. Clear any potentially sensitive data from scratch registers
    // Modern kernels clear registers that might hold kernel addresses
}

// SYSCALL masks RFLAGS on entry using the IA32_FMASK MSR, so the kernel
// starts with known flag values. SYSRET loads RFLAGS from R11, which is
// why the saved flags are sanitized before returning.

The return path has been the source of numerous vulnerabilities: SYSRET non-canonical (CVE-2012-0217), SWAPGS speculation (CVE-2019-1125), FSGSBASE leaks, signal frame injection, and more. This code is among the most audited in the kernel, yet its complexity continues to yield bugs.
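The flag sanitizing step is plain bit masking and can be tried in isolation. The sketch below defines the relevant RFLAGS bit positions locally (IF is bit 9, IOPL is bits 12-13, VM is bit 17, VIF is bit 19, VIP is bit 20); `sanitize_user_flags` and the `FLAG_*` macros are illustrative names for this example, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* RFLAGS bit positions as defined by the x86 architecture. */
#define FLAG_IF    (UINT64_C(1) << 9)    /* interrupt enable */
#define FLAG_IOPL  (UINT64_C(3) << 12)   /* I/O privilege level, 2 bits */
#define FLAG_VM    (UINT64_C(1) << 17)   /* virtual-8086 mode */
#define FLAG_VIF   (UINT64_C(1) << 19)   /* virtual interrupt flag */
#define FLAG_VIP   (UINT64_C(1) << 20)   /* virtual interrupt pending */

/* Strip privileged bits and force interrupts on, as in step 2 above. */
static uint64_t sanitize_user_flags(uint64_t flags)
{
    flags &= ~(FLAG_IOPL | FLAG_VM | FLAG_VIF | FLAG_VIP);
    flags |= FLAG_IF;
    return flags;
}
```

For instance, a saved value of 0x3246 (IOPL 3 plus ordinary status bits) comes out as 0x246: the status bits survive and the privilege bits are gone.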
We've traced the complete journey of a system call from user space into the kernel and back. The return path, far from being a simple reversal, is a complex sequence involving signal handling, scheduling decisions, security validations, and privilege transitions.
Module Complete:
Congratulations! You've now mastered the complete system call mechanism—from the initial trap instruction, through kernel entry, parameter handling, and return. This knowledge forms the foundation for understanding how all operating system services are accessed.
The next module explores the types of system calls—process control, file management, device management, information maintenance, and communication—showing how the mechanism we've studied is used to implement the full range of OS services.
You have completed the System Call Mechanism module. You now understand the complete round-trip: user code → trap instruction → kernel entry → system call dispatch → handler execution → return preparation → signal/scheduling checks → privilege drop → user continuation. This is the fundamental interface between applications and the operating system kernel.