System Call Mechanism - Learning Module

Return to User Mode

Completing the Round Trip

We've traced the system call from user code, through the trap instruction, into the kernel, and through parameter handling. Now we complete the journey: returning to user mode.

This return path might seem simple—just reverse what we did on entry—but it's actually one of the most complex and security-sensitive parts of the kernel. Before returning, the kernel must:

Set the return value in the correct register
Check for and deliver pending signals
Decide if a context switch is needed
Restore user registers precisely
Transition back to Ring 3 safely

Every one of these steps has security implications. A bug in the return path can leak kernel data, skip security checks, or enable privilege escalation.

What You Will Learn

By the end of this page, you will understand the complete return path from kernel to user mode, signal delivery during system call return, the opportunity for context switches, register restoration, and the security considerations that make this path critical.

The Return Path Overview

After the kernel completes the requested operation, it must return control to user space. This isn't simply the reverse of entry—additional processing occurs that makes the return path more complex than the entry path.

Key Operations on Return:

System Call Return Sequence

•Set return value — Place the result (or error code) in the designated register (RAX on x86-64).
•Check for pending work — Has a signal arrived? Is preemption needed? Did a timer expire?
•Deliver signals — If signals are pending, divert to signal handler instead of returning normally.
•Context switch check — If TIF_NEED_RESCHED is set, yield to the scheduler before returning.
•Restore user registers — Pop saved registers from kernel stack, restoring user state.
•Switch to user stack — Load user's stack pointer that was saved on entry.
•Return to user mode — Execute SYSRET/IRET to transition privilege and resume user execution.
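
The first step in the list, placing a result or an error code in the return register, can be sketched from the user-space side. On Linux the kernel reports failure by returning a negated errno value in the range -4095..-1, and the C library wrapper turns that into -1 plus `errno`. The helper names below (`raw_is_error`, `decode_syscall_return`, `MAX_ERRNO_SKETCH`) are made up for this illustration and are not libc or kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Linux reserves the top 4095 values of the return register for errors. */
#define MAX_ERRNO_SKETCH 4095L

/* True if a raw return-register value encodes -errno rather than a result. */
static bool raw_is_error(long raw)
{
    return raw < 0 && raw >= -MAX_ERRNO_SKETCH;
}

/* Mimic what a libc wrapper does with the raw value the kernel left in
 * the return register: on error, store the code in *err and return -1. */
static long decode_syscall_return(long raw, int *err)
{
    assert(err != NULL);
    if (raw_is_error(raw)) {
        *err = (int)-raw;
        return -1;
    }
    *err = 0;
    return raw;
}
```

For example, a raw value of 3 decodes to the result 3 (say, a file descriptor), while a raw value of `-EINTR` decodes to -1 with the error code `EINTR`.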

Linux System Call Return Path (Simplified)
C
// kernel/entry/common.c (simplified; details vary by kernel version)
 
__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
{
    unsigned long work = READ_ONCE(current_thread_info()->flags);
    
    // One-shot exit work: report the result to a tracer.
    // This runs once, before the loop, so it cannot repeat.
    if (unlikely(work & _TIF_SYSCALL_TRACE))
        tracehook_report_syscall_exit(regs, 0);
    
    // Repeatable work: rescheduling and signals
    work = READ_ONCE(current_thread_info()->flags);
    if (unlikely(work & EXIT_TO_USER_MODE_WORK))
        work = exit_to_user_mode_loop(regs, work);
    
    // Final preparations
    lockdep_hardirqs_on_prepare();
    instrumentation_end();
    
    // Register restore and the actual return are
    // handled in assembly after this function returns
}
 
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                            unsigned long work)
{
    while (work & EXIT_TO_USER_MODE_WORK) {
        // Handle rescheduling request
        if (work & _TIF_NEED_RESCHED) {
            schedule();  // Potentially switch to another process
        }
        
        // Handle pending signals
        if (work & _TIF_SIGPENDING) {
            do_signal(regs);
        }
        
        // More work might have arrived, re-check
        work = READ_ONCE(current_thread_info()->flags);
    }
    
    return work;
}

The Exit Loop

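The return path is a loop because new work can arrive while existing work is being handled. For example, an interrupt that fires while a signal frame is being set up may set TIF_NEED_RESCHED, or a second signal may become pending while the task is scheduled out. The kernel re-reads the flags after each pass and returns to user space only when no work bits remain set.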
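
The looping behaviour can be imitated in user space with a toy model. This is only a sketch: `struct fake_thread`, the `WORK_*` bits, and the `fake_*` helpers are invented stand-ins for the kernel's thread flags, `schedule()`, and `do_signal()`, and the rule that signals arrive while the task is scheduled out is an assumption chosen to force several passes.

```c
#include <assert.h>

/* Hypothetical work bits, standing in for the kernel's _TIF_* flags. */
enum {
    WORK_SIGPENDING   = 1 << 0,
    WORK_NEED_RESCHED = 1 << 1,
    WORK_MASK         = WORK_SIGPENDING | WORK_NEED_RESCHED
};

struct fake_thread {
    unsigned flags;       /* pending work bits */
    int signals_queued;   /* signals that will show up while we are off-CPU */
    int passes;           /* loop iterations, recorded for illustration */
};

/* Stand-in for schedule(): clears the request, and pretends that queued
 * signals were sent to us while another task was running. */
static void fake_schedule(struct fake_thread *t)
{
    t->flags &= ~(unsigned)WORK_NEED_RESCHED;
    if (t->signals_queued > 0)
        t->flags |= WORK_SIGPENDING;
}

/* Stand-in for do_signal(): handles one signal per call. */
static void fake_do_signal(struct fake_thread *t)
{
    if (t->signals_queued > 0)
        t->signals_queued--;
    if (t->signals_queued == 0)
        t->flags &= ~(unsigned)WORK_SIGPENDING;
}

/* Keep handling work until a fresh read of the flags shows none left. */
static unsigned exit_work_loop(struct fake_thread *t)
{
    assert(t != 0);
    unsigned work = t->flags;
    while (work & WORK_MASK) {
        if (work & WORK_NEED_RESCHED)
            fake_schedule(t);
        if (work & WORK_SIGPENDING)
            fake_do_signal(t);
        t->passes++;
        work = t->flags;   /* re-read: handling work may have created more */
    }
    return work;
}
```

Starting with only `WORK_NEED_RESCHED` set and two queued signals, the first pass reschedules and discovers a pending signal, and two further passes drain the signals, so a single check would not have been enough.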

Signal Delivery on System Call Return

Signals are the UNIX mechanism for asynchronous notification—interrupting a process to inform it of events like SIGINT (Ctrl+C), SIGCHLD (child exited), or SIGSEGV (segmentation fault).

Signals are delivered, by design, on the way back to user mode: at system call return and also on return from interrupts and exceptions. These are among the few points where the kernel has complete control over the saved user state and can safely redirect execution.

Why Deliver Signals Here?

Clean state: All user registers are saved; we can modify them before restore.
Definite check point: Every system call returns here, so signals are delivered promptly.
Safe manipulation: We're already transitioning privilege levels, making handler setup simpler.
Atomicity: Signal delivery can be coordinated with the system call result.

Signal Delivery During Return
C
// arch/x86/kernel/signal.c (simplified)
 
void do_signal(struct pt_regs *regs)
{
    struct ksignal ksig;
    
    // Dequeue the next pending signal. get_signal() carries out
    // default actions itself (terminate, stop, ignore; SIGKILL and
    // SIGSTOP never reach a handler) and returns true only when a
    // user-space handler must run.
    if (get_signal(&ksig)) {
        // Redirect execution to the signal handler
        handle_signal(&ksig, regs);
        return;
    }
    
    // No handler to run: restart an interrupted system call if its
    // return code asks for that, then restore any signal mask that
    // was temporarily replaced (e.g., by pselect())
    restore_saved_sigmask();
}
 
static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
    // Set up the signal frame on user stack:
    // 1. Save current user registers (including return address)
    // 2. Push signal number and optional siginfo
    // 3. Set up return address to sigreturn trampoline
    // 4. Modify saved RIP to point to signal handler
    // 5. Modify saved RSP to point to new stack frame
    
    if (setup_rt_frame(ksig, regs) < 0) {
        // Can't deliver signal, force terminate
        force_sigsegv(ksig->sig);
    }
    
    // Now when we return to user mode:
    // - RIP will be signal handler address
    // - RSP will point to signal frame
    // - Arguments will be signal number, info, context
    // - Return address on stack is sigreturn trampoline
}

The Signal Stack Frame:

When delivering a signal, the kernel constructs a stack frame on the user stack:

User Stack After Signal Setup:
    +------------------+
    |   siginfo_t      | Signal details (sender, reason)
    +------------------+
    |   ucontext_t     | Saved user context (all registers)
    +------------------+
    |   Return address | Points to sigreturn trampoline
    +------------------+  <-- New RSP when handler starts
    |   (Signal        |
    |    handler       | Normal function prologue
    |    frame...)     |

When the handler returns (via ret), it jumps to the sigreturn trampoline, which invokes the rt_sigreturn system call. This call restores the original context from the ucontext_t, allowing execution to resume where it was interrupted.
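
The round trip described above (frame setup, handler, sigreturn, resumption) can be observed from user space. The sketch below installs a three-argument handler with `sigaction()` and `SA_SIGINFO`, sends the process a signal with `raise()`, and checks that execution continues after the handler returns. The names `on_usr1`, `signal_round_trip`, `seen_signo`, and `got_context` are invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t seen_signo;    /* signal number seen by the handler */
static volatile sig_atomic_t got_context;   /* 1 if a ucontext pointer was passed */

/* With SA_SIGINFO the handler receives the signal number, a siginfo_t,
 * and a pointer to the saved user context (a ucontext_t, passed as void *). */
static void on_usr1(int signo, siginfo_t *info, void *uctx)
{
    (void)info;
    seen_signo = signo;
    got_context = (uctx != NULL);
    /* Returning from here goes through the sigreturn trampoline, which
     * asks the kernel to restore the interrupted context. */
}

/* Install the handler, deliver SIGUSR1 to ourselves, and report whether
 * execution resumed normally afterwards. Returns 0 on success. */
static int signal_round_trip(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_usr1;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    seen_signo = 0;
    got_context = 0;
    if (raise(SIGUSR1) != 0)
        return -1;

    /* Reaching this line means the saved context was restored. */
    return seen_signo == SIGUSR1 ? 0 : -1;
}
```

Calling `signal_round_trip()` from `main()` should return 0: the handler ran on the frame the kernel built, and the code after `raise()` continued as if nothing had happened.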

SIGRETURN Attacks (SROP)

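The ability to restore an arbitrary context via sigreturn is the basis of a powerful exploitation technique, sigreturn-oriented programming (SROP). An attacker who controls stack contents can forge a signal frame and invoke sigreturn to load registers with chosen values. Signal-frame cookies or canaries that detect forged frames have been proposed as a mitigation and are used by some systems; whether a given kernel has them depends on its version and configuration.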

The Rescheduling Decision

The system call return path is a preemption point—a safe place for the kernel to switch to a different process. If another process has higher priority or this process has exhausted its time slice, the scheduler takes control.

When TIF_NEED_RESCHED Is Set:

Timer interrupt: Time slice expired
Higher priority task woke up: I/O completed, signal received
Explicit yield: Process called sched_yield()
Priority changes: Nice value changed, RT priority modified

Rescheduling Check on Return
C
// The reschedule check in return path
 
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                            unsigned long work)
{
    while (work & EXIT_TO_USER_MODE_WORK) {
        local_irq_enable_exit_to_user(work);
        
        if (work & _TIF_NEED_RESCHED) {
            // Call the scheduler - may not return immediately!
            schedule();
            
            // When schedule() returns, we're executing again
            // but potentially much later (milliseconds, seconds,
            // or even longer after suspend/resume)
        }
        
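        // ... other checks ...
        
        // Interrupts are disabled again before the flags are re-read,
        // so no new work can slip in unnoticed
        local_irq_disable_exit_to_user();
        work = READ_ONCE(current_thread_info()->flags);
    }
    
    return work;
}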
 
// The schedule() function:
// 1. Saves current process context to its task struct
// 2. Selects next process to run (scheduler algorithm)
// 3. Switches page tables to new process (CR3 on x86)
// 4. Restores next process context
// 5. Returns (but now "we" are the new process)
 
// From perspective of the original process:
// - schedule() is called at time T
// - Process sleeps
// - At time T+Δ, process is selected again
// - schedule() returns, process continues
// - The delay Δ can be arbitrarily long

What Happens During schedule()

•Save current state: All registers are already saved in pt_regs; scheduler saves kernel stack pointer.
•Select next task: The scheduling class in charge (CFS, the Completely Fair Scheduler, in kernels before 6.6 and EEVDF since, or the RT scheduler) picks the next runnable process.
•Switch context: Page tables, stack pointer, and other per-process state are switched.
•Resume next task: The new process continues from where it previously called schedule().
•Later resumption: When our original process is selected again, schedule() returns and we continue.
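
A process can see how often it has been switched out by reading its own resource usage. The sketch below uses `getrusage()`, whose `ru_nvcsw` and `ru_nivcsw` fields count voluntary and involuntary context switches. The helper names and `struct switch_counts` are invented for this example, and how many switches `sched_yield()` actually causes depends on what else is runnable, so no particular count is assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>

struct switch_counts {
    long voluntary;     /* task gave up the CPU (blocked or yielded) */
    long involuntary;   /* task was preempted */
};

/* Read this process's context-switch counters. Returns 0 on success. */
static int read_switch_counts(struct switch_counts *out)
{
    struct rusage ru;

    assert(out != 0);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->voluntary = ru.ru_nvcsw;
    out->involuntary = ru.ru_nivcsw;
    return 0;
}

/* Call sched_yield() n times and report how far the counters moved. */
static int yield_and_measure(int n, struct switch_counts *delta)
{
    struct switch_counts before, after;

    assert(delta != 0);
    if (read_switch_counts(&before) != 0)
        return -1;
    for (int i = 0; i < n; i++)
        sched_yield();
    if (read_switch_counts(&after) != 0)
        return -1;
    delta->voluntary = after.voluntary - before.voluntary;
    delta->involuntary = after.involuntary - before.involuntary;
    return 0;
}
```

On an idle machine the deltas may stay at or near zero because there is nothing else to run; on a busy one they grow, and each increment corresponds to a period during which another task held the CPU.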

Preemption in Kernel Mode

With kernel preemption enabled (CONFIG_PREEMPT), context switches can occur not just at system call return but also during kernel execution at various preemption points. The return-to-user path is just one of many places where schedule() might be called.

After all pending work is complete, the kernel restores user registers from the saved state. This must be done precisely—any mistake could leak kernel data or corrupt user state.

x86-64 Register Restoration

Assembly

# arch/x86/entry/entry_64.S (simplified)
# Reached after syscall_exit_to_user_mode() returns
 
SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode)
    # Restore the general-purpose registers from pt_regs, including
    # RAX (the system call return value). RDI stays on the stack for
    # now because it is needed as a scratch register.
    POP_REGS pop_rdi=0
    
    # Remember the current stack position, then move to the
    # per-CPU entry stack
    movq    %rsp, %rdi
    movq    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
    
    # Copy the IRET frame to the new stack
    pushq   6*8(%rdi)           # user SS
    pushq   5*8(%rdi)           # user RSP
    pushq   4*8(%rdi)           # user RFLAGS
    pushq   3*8(%rdi)           # user CS
    pushq   2*8(%rdi)           # user RIP
    
    # Copy the saved user RDI across, then restore it
    pushq   (%rdi)
    popq    %rdi
    
    # Switch GS back to user value
    swapgs
    
    # Return to user mode
    # This pops RIP, CS, RFLAGS, RSP, SS from stack
    iretq
    
# For SYSRET path (faster but restrictions apply):
SYM_INNER_LABEL(syscall_return_via_sysret)
    # Restore registers (most were saved to pt_regs)
    movq    R15(%rsp), %r15
    movq    R14(%rsp), %r14
    # ... restore other registers ...
    movq    RDI(%rsp), %rdi
    movq    RSI(%rsp), %rsi
    
    # Load return address into RCX (SYSRET uses RCX as RIP)
    movq    RIP(%rsp), %rcx
    
    # Load flags into R11 (SYSRET uses R11 as RFLAGS)
    movq    EFLAGS(%rsp), %r11
    
    # Load user stack pointer (must be the last access through RSP)
    movq    RSP(%rsp), %rsp
    
    # Switch GS
    swapgs
    
    # Return!
    # SYSRET loads RIP from RCX, RFLAGS from R11, and CS/SS
    # selectors derived from the IA32_STAR MSR
    sysretq

Before returning, the kernel should clear any registers that might contain sensitive kernel data and aren't being explicitly restored with user values. Failure to do so could leak kernel addresses (defeating KASLR) or other secrets. Modern kernels explicitly zero such registers.

SYSRET vs. IRET: Choosing the Return Path

x86-64 Linux can return to user space via two different instructions, each with different characteristics:

SYSRET vs. IRET Comparison
Property	SYSRET	IRET
Speed	~20-30 cycles	~40-100 cycles
State source	RCX (RIP), R11 (RFLAGS)	Stack frame
Stack switch	Manual (before instruction)	Automatic
RIP restrictions	Must be canonical (< 0x8000_0000_0000)	No restrictions
Use case	Normal syscall return	Signals, special cases
Segment handling	From MSRs (fixed)	From stack (flexible)

The SYSRET Vulnerability

SYSRET has a subtle security issue on Intel CPUs: if RCX contains a non-canonical address (e.g., 0x8000_0000_0000_0000), SYSRET raises a #GP fault while the CPU is still in Ring 0, before privilege has been dropped. By that point the kernel has already loaded the user's stack pointer and executed SWAPGS.

This means the #GP handler runs in kernel mode but on a user-controlled RSP and with the user GS base. An attacker who controls those values can exploit this for privilege escalation.

Linux's Solution:

// Before using SYSRET, Linux validates RCX:
if (regs->ip >= TASK_SIZE_MAX) {
    // Non-canonical or too high, use IRET instead
    // IRET doesn't have this vulnerability
    use_iret();
} else {
    use_sysret();
}
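
The check can be made concrete with a small, self-contained sketch. It assumes 48-bit virtual addresses (4-level paging), where an address is canonical only if bits 63..47 are all equal, and it treats the lower half as user space. `struct saved_regs`, `is_canonical`, `is_user_address`, and `sysret_is_safe` are hypothetical names for this illustration, and the real kernel tests additional conditions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VA_BITS 48   /* assumption: 4-level paging */

/* Canonical means bits 63..47 are all zeros or all ones. */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> (VA_BITS - 1);                        /* bits 63..47 */
    uint64_t all_ones = (UINT64_C(1) << (64 - VA_BITS + 1)) - 1; /* 17 one bits */
    return top == 0 || top == all_ones;
}

/* Lower canonical half: bits 63..47 all zero. */
static bool is_user_address(uint64_t addr)
{
    return (addr >> (VA_BITS - 1)) == 0;
}

/* Hypothetical subset of the saved register frame. */
struct saved_regs {
    uint64_t ip, cx, flags, r11;
};

/* SYSRET takes RIP from RCX and RFLAGS from R11, so it can only be used
 * when those pairs match and the target is a canonical user address. */
static bool sysret_is_safe(const struct saved_regs *r)
{
    assert(r != 0);
    if (r->cx != r->ip || r->r11 != r->flags)
        return false;
    return is_user_address(r->ip);
}
```

A frame whose RIP and RCX both hold an ordinary user address passes; a frame pointing at 0x0000_8000_0000_0000, the first non-canonical address, is rejected and would take the IRET path instead.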

When IRET Is Required

IRET is used instead of SYSRET whenever the saved state cannot be expressed through SYSRET: when RCX and RIP (or R11 and RFLAGS) in the saved frame no longer match, as happens after signal-handler setup or when a tracer has modified registers; when returning to 32-bit code (compatibility mode); or when RIP is non-canonical. The kernel checks these conditions on each return and falls back to IRET when SYSRET isn't safe.

KPTI and Page Table Switching

Kernel Page Table Isolation (KPTI) was introduced to mitigate the Meltdown vulnerability. With KPTI, the kernel and user space use different page tables, requiring switches on every system call entry and exit.

The Two Page Tables:

User page tables: Contains user mappings plus a minimal kernel mapping (just enough to handle the syscall entry and switch to full kernel tables).
Kernel page tables: Full kernel and user mappings, used during kernel execution.

The Return Path with KPTI:

Page Table Switch on Return (KPTI)

Assembly

# KPTI return path (arch/x86/entry/entry_64.S, simplified)
 
SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode)
    # Restore general-purpose registers; RDI is kept as scratch
    POP_REGS pop_rdi=0
    
    # The task's kernel stack is NOT mapped in the user page tables,
    # so move to the per-CPU entry ("trampoline") stack first.
    # This stack IS mapped in both sets of page tables.
    movq    %rsp, %rdi
    movq    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
    
    # Copy the IRET frame to the trampoline stack
    pushq   6*8(%rdi)           # SS
    pushq   5*8(%rdi)           # RSP
    pushq   4*8(%rdi)           # RFLAGS
    pushq   3*8(%rdi)           # CS
    pushq   2*8(%rdi)           # RIP
    
    # Copy the saved user RDI as well
    pushq   (%rdi)
    
    # KPTI: switch from kernel page tables to user page tables.
    # The user CR3 is derived from the kernel CR3 by setting the bit
    # that selects the user copy of the top-level table (the real
    # code uses a macro that also handles PCID).
    # After this, most kernel memory is unmapped!
    movq    %cr3, %rdi
    orq     $PTI_USER_PGTABLE_MASK, %rdi
    movq    %rdi, %cr3
    
    # We're now running on user page tables
    # Only the trampoline code/stack is accessible
    popq    %rdi
    
    # Switch GS back to user
    swapgs
    
    # Return to user mode
    # IRET pops the frame we built and jumps to user code
    iretq
    
# The trampoline is tiny (~one page) and mapped at a
# consistent address in both page table sets.
# It contains just enough code to do the final switch.

Performance Impact:

KPTI adds overhead to every system call:

Operation	Overhead
CR3 write	~50-100 cycles
TLB flush (without PCID)	~200-500 cycles
TLB flush (with PCID)	~50 cycles

With PCID (Process Context ID), the CPU can keep separate TLB entries for user and kernel address spaces, dramatically reducing the flush cost. Modern CPUs with PCID see only modest (~1-5%) overhead from KPTI.

PCID and TLB Management

PCID tags TLB entries with an identifier. By using different PCIDs for kernel and user page tables, the CPU doesn't need to flush the entire TLB on a CR3 switch; entries tagged with the other PCID simply stay dormant until that PCID is loaded again. This brings the cost of KPTI down to the modest overhead quoted above on CPUs with good PCID support.
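
How a PCID travels with the page-table pointer can be sketched by composing a CR3 value by hand. With PCIDs enabled on x86-64, bits 11:0 of CR3 hold the PCID, bits 51:12 hold the physical address of the top-level page table, and setting bit 63 of the value written asks the CPU not to invalidate the TLB entries for that PCID. The helper names are invented for this example, and the PCID numbers in the usage note are arbitrary rather than the values Linux assigns.

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PCID_MASK  UINT64_C(0x0000000000000FFF)  /* bits 11:0  */
#define CR3_ADDR_MASK  UINT64_C(0x000FFFFFFFFFF000)  /* bits 51:12 */
#define CR3_NOFLUSH    (UINT64_C(1) << 63)           /* keep TLB entries */

/* Compose the value that would be written to CR3. This only does the bit
 * arithmetic; it does not touch any real control register. */
static uint64_t make_cr3(uint64_t pgd_phys, unsigned pcid, int noflush)
{
    assert((pgd_phys & ~CR3_ADDR_MASK) == 0);   /* page-aligned, at most 52 bits */
    assert(pcid <= CR3_PCID_MASK);
    return pgd_phys | pcid | (noflush ? CR3_NOFLUSH : 0);
}

/* Extract the PCID field from a CR3 value. */
static unsigned cr3_pcid(uint64_t cr3)
{
    return (unsigned)(cr3 & CR3_PCID_MASK);
}

/* Extract the page-table base address from a CR3 value. */
static uint64_t cr3_pgd(uint64_t cr3)
{
    return cr3 & CR3_ADDR_MASK;
}
```

For instance, `make_cr3(0x1000, 1, 1)` and `make_cr3(0x2000, 2, 1)` describe two address spaces whose TLB entries can coexist, because each write selects a different tag and asks for no flush.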

System Call Restart Handling

When a signal interrupts a system call, the kernel must decide: should the call fail with EINTR, or should it automatically restart when the signal handler returns?

The Problem:

// User code
ssize_t n = read(fd, buf, 1000);  // Blocks waiting for data

// Signal arrives! Handler runs. Then what?
// Option 1: read() returns -1 with errno=EINTR (caller must retry)
// Option 2: read() automatically restarts (caller doesn't notice)

Linux's Solution:

The kernel tracks the original system call number and restart behavior in the saved register state:

System Call Restart Mechanism
C
// Signal handling decides if restart is appropriate
 
void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
    bool restart = false;
    long retval = (long)regs->ax;  // Current return value
 
    // This logic applies only when returning from a system call
    // (the real kernel checks syscall_get_nr(current, regs) != -1)
 
    if (retval == -ERESTARTSYS) {
        // Restart only if the handler was installed with SA_RESTART
        if (ksig->ka.sa.sa_flags & SA_RESTART) {
            restart = true;
        } else {
            regs->ax = -EINTR;  // Return EINTR to user
        }
    } else if (retval == -ERESTARTNOINTR) {
        // Always restart, unconditionally
        restart = true;
    } else if (retval == -ERESTARTNOHAND) {
        // A handler is about to run, so do not restart
        regs->ax = -EINTR;
    }
 
    if (restart) {
        // Restore original RAX (system call number)
        regs->ax = regs->orig_ax;
        // Move RIP back to the syscall instruction
        regs->ip -= 2;  // SYSCALL is a 2-byte instruction (0F 05)
        // When we return, syscall will re-execute!
    }
 
    // Now set up signal handler...
}
 
// The different restart codes:
// -ERESTARTSYS     : Restart if no handler runs, or if the handler has SA_RESTART
// -ERESTARTNOINTR  : Always restart (used by futex)
// -ERESTARTNOHAND  : Restart only if no handler runs
// -EINTR           : Don't restart (return error to user)

System Call Restart Behaviors
Return Code	Handler Runs, SA_RESTART Set	Handler Runs, SA_RESTART Not Set	No Handler Runs
-ERESTARTSYS	Restart syscall	Return -EINTR	Restart syscall
-ERESTARTNOINTR	Restart syscall	Restart syscall	Restart syscall
-ERESTARTNOHAND	Return -EINTR	Return -EINTR	Restart syscall
-EINTR	Return -EINTR	Return -EINTR	Return -EINTR

SA_RESTART Flag

When installing a signal handler with sigaction(), setting the SA_RESTART flag causes most blocking system calls to automatically restart after the handler returns. This simplifies application code by avoiding the need to manually retry on EINTR. However, some calls (like select/poll with timeout) are never restarted because their time-sensitive nature makes restart semantics unclear.
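
The difference is easy to observe from user space. In the sketch below a `read()` blocks on an empty pipe, a one-shot timer delivers SIGALRM about 100 ms later, and the handler writes one byte into the pipe. Without SA_RESTART the interrupted `read()` is expected to fail with EINTR; with SA_RESTART it is restarted and then finds the byte. The names `on_alarm`, `interrupted_read`, and `wake_fd` are invented for this example, and the outcome assumes the timer fires after `read()` has started blocking.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static int wake_fd = -1;   /* write end of the pipe, used by the handler */

static void on_alarm(int signo)
{
    int saved_errno = errno;
    char c = 'x';
    ssize_t ignored;

    (void)signo;
    ignored = write(wake_fd, &c, 1);   /* write() is async-signal-safe */
    (void)ignored;
    errno = saved_errno;
}

/* Block in read() on an empty pipe and let SIGALRM interrupt it.
 * Returns what read() returned and stores the resulting errno in *err. */
static ssize_t interrupted_read(int use_sa_restart, int *err)
{
    int fds[2];
    char buf;
    struct sigaction sa;
    struct itimerval tv;
    ssize_t n;

    assert(err != NULL);
    if (pipe(fds) != 0) {
        *err = errno;
        return -2;
    }
    wake_fd = fds[1];

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = use_sa_restart ? SA_RESTART : 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    memset(&tv, 0, sizeof tv);
    tv.it_value.tv_usec = 100000;      /* one-shot timer, 100 ms */
    setitimer(ITIMER_REAL, &tv, NULL);

    errno = 0;
    n = read(fds[0], &buf, 1);         /* blocks until the signal arrives */
    *err = errno;

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling `interrupted_read(0, &e)` should return -1 with `e == EINTR`, while `interrupted_read(1, &e)` should return 1 with `e == 0`, because the kernel rewound the instruction pointer and re-executed the call after the handler finished.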

Tracing and Auditing on Return

The return path is also where system call tracing and auditing hooks report results. This enables debugging tools (strace) and security monitoring (auditd). Seccomp filters, by contrast, run at system call entry, where they decide whether the call may proceed at all.

Ptrace Exit Tracing:

When a process is being traced (e.g., by strace), the kernel stops at system call exit to report the return value:

System Call Exit Tracing
C
// Called on return path if TIF_SYSCALL_TRACE is set
 
static void syscall_exit_trace(struct pt_regs *regs)
{
    // Report to tracer (e.g., strace)
    if (test_thread_flag(TIF_SYSCALL_TRACE)) {
        tracehook_report_syscall_exit(regs, 0);
        // This might:
        // 1. Stop the process (PTRACE_SYSCALL)
        // 2. Notify tracer of return value
        // 3. Allow tracer to modify return value
    }
    
    // Audit logging (if enabled)
    if (unlikely(audit_context())) {
        audit_syscall_exit(regs);
        // Logs: syscall number, return value, arguments
        // Used for security auditing and compliance
    }
    
    // Note: seccomp filters are not consulted here. They run at
    // system call entry and decide whether the call may proceed.
}
 
// strace output showing entry and exit:
// open("/etc/passwd", O_RDONLY) = 3
//      ^-- entry params             ^-- exit return value
// 
// The tracer sees both the entry (arguments) and exit (result)

Tracer's View:

Tools like strace use ptrace to intercept system calls:

PTRACE_SYSCALL causes the tracee to stop at syscall entry
Tracer reads arguments from registers
Tracer continues tracee with PTRACE_SYSCALL again
Tracee stops at syscall exit
Tracer reads return value from RAX
Tracer logs: read(3, "hello", 5) = 5
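
The stop-and-continue sequence above can be sketched as a tiny tracer. The parent below forks a child that requests tracing, then resumes it repeatedly with PTRACE_SYSCALL and counts the stops that mark a system call exit. Reading the arguments and the return value (for example with PTRACE_GETREGS on x86-64) is left out to keep the sketch architecture-neutral. `count_syscall_exits` is a name invented here, and the code assumes a Linux system where ptrace is permitted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Trace a short-lived child and count its syscall-exit stops.
 * Returns the count, or -1 if tracing could not be set up. */
static int count_syscall_exits(void)
{
    int status, exits = 0, in_syscall = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);
        raise(SIGSTOP);          /* wait here until the tracer is ready */
        (void)getppid();         /* a system call for the tracer to observe */
        _exit(0);
    }

    /* Wait for the child's initial stop. */
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    /* Mark syscall stops with bit 7 so they differ from ordinary SIGTRAPs. */
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) != 0) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    for (;;) {
        /* Resume until the next syscall entry or exit. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) != 0)
            break;
        if (waitpid(pid, &status, 0) < 0)
            break;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            if (in_syscall)
                exits++;         /* second stop of the pair: syscall exit */
            in_syscall = !in_syscall;
        }
    }
    return exits;
}
```

Called from `main()`, the function returns at least 1 on a system that allows ptrace, since the child's `getppid()` produces one entry stop and one exit stop; the final exit call only produces an entry stop because the child never returns from it.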

Audit Logging:

type=SYSCALL msg=audit(1234567:1): arch=c000003e syscall=2 
  success=yes exit=3 a0=7ffd5a3c1100 a1=0 a2=0 a3=0 
  items=1 ppid=1234 pid=5678 uid=1000 gid=1000 
  comm="cat" exe="/bin/cat"

This audit record shows open() (syscall 2) returning fd 3, with the process details and arguments logged for security review.

Tracing Performance Impact

System call tracing adds significant overhead—each traced call requires stopping the process, context switching to the tracer, and back. strace can slow a program by 10-100x. For production debugging, consider eBPF-based tools like bpftrace which have much lower overhead.

Security Considerations of the Return Path

The return path is security-critical. Any vulnerability here could allow:

Information disclosure: Leaking kernel data through unsanitized registers
Privilege escalation: Returning to user mode with elevated privileges
Control flow hijacking: Returning to an attacker-controlled address

Security Requirements for Return Path

•Register Sanitization — Registers not explicitly set must be cleared. Failing to zero R8-R15 could leak kernel pointers, defeating KASLR.
•RFLAGS Validation — RFLAGS restored from user state must be masked. The user cannot be allowed to enable IOPL or other privileged flags.
•Return Address Validation — The return RIP must be in user space. SYSRET's non-canonical check vulnerability shows how this can go wrong.
•Stack Pointer Safety — RSP must point to user memory. Returning with RSP pointing to kernel memory would be catastrophic.
•Segment Register Safety — CS and SS must be user-mode selectors. Returning with kernel selectors but CPL=3 could enable attacks.

Return Path Security Checks
C
// Security measures in the return path
// (illustrative sketch, not verbatim kernel code)
 
void prepare_exit_to_usermode(struct pt_regs *regs)
{
    // 1. Validate return RIP is in user range
    if (regs->ip >= TASK_SIZE_MAX) {
        // Return address is bad: force SIGSEGV on the current task
        force_sig(SIGSEGV);
        return;
    }
    
    // 2. Sanitize RFLAGS  
    // User cannot set: IOPL, VM, VIF, VIP, or reserved bits
    regs->flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM | 
                     X86_EFLAGS_VIF | X86_EFLAGS_VIP);
    // Ensure IF is set (interrupts enabled in user mode)
    regs->flags |= X86_EFLAGS_IF;
    
    // 3. Force user-mode segment selectors
    // (these should already be correct from the entry save;
    // 64-bit user code is assumed here)
    regs->cs = __USER_CS;
    regs->ss = __USER_DS;
    
    // 4. Clear any potentially sensitive data from scratch registers
    // Modern kernels clear registers that might hold kernel addresses
}
 
// On SYSCALL entry, the CPU clears the RFLAGS bits selected by the
// IA32_FMASK MSR. On SYSRET, RFLAGS is loaded from R11, so the
// saved flags must already be safe before the return instruction.
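
The flag masking in step 2 can be tried out as ordinary user-space arithmetic. The sketch below uses the architectural bit positions of the x86 flags register (IF is bit 9, IOPL is bits 12 and 13, VM is bit 17, VIF is bit 19, VIP is bit 20). `sanitize_user_rflags` and the `FLAG_*` macros are names made up for this example rather than kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_IF    (UINT64_C(1) << 9)    /* interrupt enable */
#define FLAG_IOPL  (UINT64_C(3) << 12)   /* I/O privilege level, 2 bits */
#define FLAG_VM    (UINT64_C(1) << 17)   /* virtual-8086 mode */
#define FLAG_VIF   (UINT64_C(1) << 19)   /* virtual interrupt flag */
#define FLAG_VIP   (UINT64_C(1) << 20)   /* virtual interrupt pending */

/* Strip flag bits that user code must not control and force IF on,
 * mirroring the masking shown in the sketch above. */
static uint64_t sanitize_user_rflags(uint64_t flags)
{
    flags &= ~(FLAG_IOPL | FLAG_VM | FLAG_VIF | FLAG_VIP);
    flags |= FLAG_IF;
    assert((flags & FLAG_IOPL) == 0);    /* postcondition: IOPL is 0 */
    return flags;
}
```

A value that requests IOPL 3 with the carry flag set (0x3001) comes back as 0x201: the privileged bits are gone, carry is preserved, and interrupts are enabled.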

Historical Vulnerabilities

The return path has been the source of numerous vulnerabilities: SYSRET non-canonical (CVE-2012-0217), SWAPGS speculation (CVE-2019-1125), FSGSBASE leaks, signal frame injection, and more. This code is among the most audited in the kernel, yet its complexity continues to yield bugs.

Summary: Completing the System Call

We've traced the complete journey of a system call from user space into the kernel and back. The return path, far from being a simple reversal, is a complex sequence involving signal handling, scheduling decisions, security validations, and privilege transitions.

Key Takeaways

•Return path handles pending work — Signals, rescheduling, and tracing are processed before returning to user mode.
•Signals are delivered at return — The saved user state can be modified to redirect execution to signal handlers.
•Context switches may occur — If TIF_NEED_RESCHED is set, schedule() runs, and another process may execute before we return.
•SYSRET is faster but has restrictions — Non-canonical RIP addresses require falling back to the slower IRET path.
•KPTI requires page table switches — Meltdown mitigation adds CR3 switches on both entry and exit.
•Security is paramount — Every register, flag, and address must be validated or sanitized before returning to user mode.

Module Complete:

Congratulations! You've now mastered the complete system call mechanism—from the initial trap instruction, through kernel entry, parameter handling, and return. This knowledge forms the foundation for understanding how all operating system services are accessed.

The next module explores the types of system calls—process control, file management, device management, information maintenance, and communication—showing how the mechanism we've studied is used to implement the full range of OS services.

Module Complete

You have completed the System Call Mechanism module. You now understand the complete round-trip: user code → trap instruction → kernel entry → system call dispatch → handler execution → return preparation → signal/scheduling checks → privilege drop → user continuation. This is the fundamental interface between applications and the operating system kernel.

Return to User Mode

Completing the Round Trip

We've traced the system call from user code, through the trap instruction, into the kernel, and through parameter handling. Now we complete the journey: returning to user mode.

This return path might seem simple—just reverse what we did on entry—but it's actually one of the most complex and security-sensitive parts of the kernel. Before returning, the kernel must:

Set the return value in the correct register
Check for and deliver pending signals
Decide if a context switch is needed
Restore user registers precisely
Transition back to Ring 3 safely

Every one of these steps has security implications. A bug in the return path can leak kernel data, skip security checks, or enable privilege escalation.

What You Will Learn

The Return Path Overview

Key Operations on Return:

System Call Return Sequence

•Set return value — Place the result (or error code) in the designated register (RAX on x86-64).
•Check for pending work — Has a signal arrived? Is preemption needed? Did a timer expire?
•Deliver signals — If signals are pending, divert to signal handler instead of returning normally.
•Context switch check — If TIF_NEED_RESCHED is set, yield to the scheduler before returning.
•Restore user registers — Pop saved registers from kernel stack, restoring user state.
•Switch to user stack — Load user's stack pointer that was saved on entry.
•Return to user mode — Execute SYSRET/IRET to transition privilege and resume user execution.

Linux System Call Return Path (Simplified)
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// arch/x86/entry/common.c (simplified)
 
__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
{
    // Check if there's work to do before returning
    unsigned long work = READ_ONCE(current_thread_info()->flags);
    
    if (unlikely(work & EXIT_TO_USER_MODE_WORK))
        work = exit_to_user_mode_loop(regs, work);
    
    // Final preparations
    lockdep_hardirqs_on_prepare();
    instrumentation_end();
    
    // Restore state and return
    // (handled in assembly after this returns)
}
 
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                            unsigned long work)
{
    while (work & EXIT_TO_USER_MODE_WORK) {
        // Handle pending signals
        if (work & _TIF_SIGPENDING) {
            do_signal(regs);
        }
        
        // Handle rescheduling request
        if (work & _TIF_NEED_RESCHED) {
            schedule();  // Potentially switch to another process
        }
        
        // Handle audit/tracing/seccomp
        if (work & _TIF_SYSCALL_TRACE) {
            tracehook_report_syscall_exit(regs, 0);
        }
        
        // More work might have arrived, re-check
        work = READ_ONCE(current_thread_info()->flags);
    }
    
    return work;
}

The Exit Loop

Signal Delivery on System Call Return

Signals are the UNIX mechanism for asynchronous notification—interrupting a process to inform it of events like SIGINT (Ctrl+C), SIGCHLD (child exited), or SIGSEGV (segmentation fault).

The system call return path is, by design, where signals are delivered. This is one of the few points where the kernel has complete control over user state and can safely redirect execution.

Why Deliver Signals Here?

Clean state: All user registers are saved; we can modify them before restore.
Definite check point: Every system call returns here, so signals are delivered promptly.
Safe manipulation: We're already transitioning privilege levels, making handler setup simpler.
Atomicity: Signal delivery can be coordinated with the system call result.

Signal Delivery During Return
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
// kernel/signal.c (simplified)
 
void do_signal(struct pt_regs *regs)
{
    struct ksignal ksig;
    
    // Get next pending signal
    if (get_signal(&ksig)) {
        // Handle the signal
        
        // For SIGKILL, SIGSTOP: immediate action, no handler
        // For others with handler: set up handler execution
        
        if (ksig.ka.sa.sa_handler != SIG_DFL) {
            // Redirect execution to signal handler
            handle_signal(&ksig, regs);
            return;
        }
        
        // Default action (terminate, stop, ignore, etc.)
        // ...
    }
    
    // No signals or all ignored
    // Check if we need to restart a system call
    restore_saved_sigmask();
}
 
static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
    // Set up the signal frame on user stack:
    // 1. Save current user registers (including return address)
    // 2. Push signal number and optional siginfo
    // 3. Set up return address to sigreturn trampoline
    // 4. Modify saved RIP to point to signal handler
    // 5. Modify saved RSP to point to new stack frame
    
    if (setup_rt_frame(ksig, regs) < 0) {
        // Can't deliver signal, force terminate
        force_sigsegv(ksig->sig);
    }
    
    // Now when we return to user mode:
    // - RIP will be signal handler address
    // - RSP will point to signal frame
    // - Arguments will be signal number, info, context
    // - Return address on stack is sigreturn trampoline
}

The Signal Stack Frame:

When delivering a signal, the kernel constructs a stack frame on the user stack:

User Stack After Signal Setup:
    +------------------+
    |   siginfo_t      | Signal details (sender, reason)
    +------------------+
    |   ucontext_t     | Saved user context (all registers)
    +------------------+
    |   Return address | Points to sigreturn trampoline
    +------------------+  <-- New RSP when handler starts
    |   (Signal        |
    |    handler       | Normal function prologue
    |    frame...)     |

SIGRETURN Attacks (SROP)

The Rescheduling Decision

When TIF_NEED_RESCHED Is Set:

Timer interrupt: Time slice expired
Higher priority task woke up: I/O completed, signal received
Explicit yield: Process called sched_yield()
Priority changes: Nice value changed, RT priority modified

Rescheduling Check on Return
C
// The reschedule check in return path
 
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                            unsigned long work)
{
    while (work & EXIT_TO_USER_MODE_WORK) {
        local_irq_enable_exit_to_user(work);
        
        if (work & _TIF_NEED_RESCHED) {
            // Call the scheduler - may not return immediately!
            schedule();
            
            // When schedule() returns, we're executing again
            // but potentially much later (milliseconds, seconds,
            // or even longer after suspend/resume)
        }
        
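        // ... other checks ...
        
        // Re-read the flags with interrupts disabled: new work may
        // have arrived while the work above was being handled
        local_irq_disable_exit_to_user();
        work = READ_ONCE(current_thread_info()->flags);
    }
    
    // Hand the final flags back to the caller
    return work;
}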
 
// The schedule() function:
// 1. Saves current process context to its task struct
// 2. Selects next process to run (scheduler algorithm)
// 3. Switches page tables to new process (CR3 on x86)
// 4. Restores next process context
// 5. Returns (but now "we" are the new process)
 
// From perspective of the original process:
// - schedule() is called at time T
// - Process sleeps
// - At time T+Δ, process is selected again
// - schedule() returns, process continues
// - The delay Δ can be arbitrarily long

What Happens During schedule()

•Save current state: All registers are already saved in pt_regs; scheduler saves kernel stack pointer.
•Select next task: The active scheduling class (the fair scheduler, which is CFS before Linux 6.6 and EEVDF since, or the RT scheduler) picks the next runnable task.
•Switch context: Page tables, stack pointer, and other per-process state are switched.
•Resume next task: The new process continues from where it previously called schedule().
•Later resumption: When our original process is selected again, schedule() returns and we continue.
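
The reason the return path loops, rather than checking each flag once, can be shown with a small user-space model. This is only a sketch: the flag values, `struct task_model`, and the `model_*` helpers are hypothetical and do not match the kernel's real `TIF_*` definitions. The model assumes a signal arrives while the task is switched out, so the flags must be re-read after `schedule()` returns.

```c
#include <stdint.h>

/* Hypothetical work bits for this sketch; the kernel's TIF_* values differ. */
enum {
    WORK_NEED_RESCHED = 1 << 0,
    WORK_SIGPENDING   = 1 << 1,
    WORK_ALL          = WORK_NEED_RESCHED | WORK_SIGPENDING
};

struct task_model {
    uint32_t flags;        /* pending work bits */
    int late_signal;       /* nonzero: a signal arrives while we are switched out */
    unsigned reschedules;  /* how many times the scheduler ran */
    unsigned signals;      /* how many times signal delivery ran */
};

/* Stand-in for schedule(): clears the request and, in this model, lets a
 * signal become pending while the task is off the CPU. */
static void model_schedule(struct task_model *t)
{
    t->flags &= ~(uint32_t)WORK_NEED_RESCHED;
    t->reschedules++;
    if (t->late_signal) {
        t->flags |= WORK_SIGPENDING;
        t->late_signal = 0;
    }
}

/* Stand-in for signal delivery. */
static void model_deliver_signal(struct task_model *t)
{
    t->flags &= ~(uint32_t)WORK_SIGPENDING;
    t->signals++;
}

/* Process pending work until none remains. Returns the number of passes. */
unsigned run_exit_work(struct task_model *t)
{
    unsigned passes = 0;
    uint32_t work = t->flags;          /* snapshot, as in the kernel loop */

    while (work & WORK_ALL) {
        if (work & WORK_NEED_RESCHED)
            model_schedule(t);
        if (work & WORK_SIGPENDING)
            model_deliver_signal(t);
        work = t->flags;               /* re-read: new work may have appeared */
        passes++;
    }
    return passes;
}
```

With `flags = WORK_NEED_RESCHED` and `late_signal = 1`, the first pass only reschedules, because the snapshot did not yet contain the signal bit. The second pass picks up the newly pending signal, which is the behavior the re-read exists to guarantee.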

Preemption in Kernel Mode

After all pending work is complete, the kernel restores user registers from the saved state. This must be done precisely—any mistake could leak kernel data or corrupt user state.

x86-64 Register Restoration

Assembly

/* arch/x86/entry/entry_64.S (simplified) */
/* After syscall_exit_to_user_mode() returns */

SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
    /* We're on the kernel stack, about to return to user mode */

    /* Restore general-purpose registers from pt_regs, except RDI */
    POP_REGS pop_rdi=0

    /* RSP now points at: user RDI, orig_ax, RIP, CS, RFLAGS, RSP, SS */
    /* Keep the old stack pointer in RDI (scratch) and switch to the */
    /* per-CPU entry stack */
    movq    %rsp, %rdi
    movq    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp

    /* Copy the iret frame to the new stack */
    pushq   6*8(%rdi)           /* user SS */
    pushq   5*8(%rdi)           /* user RSP */
    pushq   4*8(%rdi)           /* user RFLAGS */
    pushq   3*8(%rdi)           /* user CS */
    pushq   2*8(%rdi)           /* user RIP */

    /* Copy the saved user RDI as well, then restore it */
    /* (with KPTI, the CR3 switch sits between these two instructions) */
    pushq   (%rdi)
    popq    %rdi

    /* RAX needs no special handling: the return value was stored in */
    /* pt_regs->ax and restored by POP_REGS above */

    /* Switch GS back to user value */
    swapgs

    /* Return to user mode */
    /* This pops RIP, CS, RFLAGS, RSP, SS from the stack */
    iretq

/* For the SYSRET path (faster, but restrictions apply): */
SYM_INNER_LABEL(syscall_return_via_sysret, SYM_L_GLOBAL)
    /* Restore registers (most were saved to pt_regs) */
    movq    R15(%rsp), %r15
    movq    R14(%rsp), %r14
    /* ... restore other registers ... */
    movq    RDI(%rsp), %rdi
    movq    RSI(%rsp), %rsi

    /* Load return address into RCX (SYSRET uses RCX as RIP) */
    movq    RIP(%rsp), %rcx

    /* Load flags into R11 (SYSRET uses R11 as RFLAGS) */
    movq    EFLAGS(%rsp), %r11

    /* Load user stack pointer (must be last: RSP was the base register) */
    movq    RSP(%rsp), %rsp

    /* Switch GS */
    swapgs

    /* Return! */
    /* SYSRET loads: RIP from RCX, RFLAGS from R11, CS/SS from MSRs */
    sysretq

SYSRET vs. IRET: Choosing the Return Path

x86-64 Linux can return to user space via two different instructions, each with different characteristics:

SYSRET vs. IRET Comparison
Property	SYSRET	IRET
Speed	~20-30 cycles	~40-100 cycles
State source	RCX (RIP), R11 (RFLAGS)	Stack frame
Stack switch	Manual (before instruction)	Automatic
RIP restrictions	Must be canonical (< 0x8000_0000_0000)	No restrictions
Use case	Normal syscall return	Signals, special cases
Segment handling	From MSRs (fixed)	From stack (flexible)

The SYSRET Vulnerability

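On Intel CPUs, executing SYSRET with a non-canonical address in RCX raises a general protection fault (#GP) while the CPU is still in Ring 0, after the kernel has already loaded the user stack pointer and executed SWAPGS. This means the #GP handler runs in kernel mode but with user GS and a user-controlled RSP. An attacker who controls those values can exploit this for privilege escalation.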

Linux's Solution:

// Before using SYSRET, Linux validates RCX:
if (regs->ip >= TASK_SIZE_MAX) {
    // Non-canonical or too high, use IRET instead
    // IRET doesn't have this vulnerability
    use_iret();
} else {
    use_sysret();
}
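
The check above can be made concrete with a small sketch. It assumes 48-bit virtual addresses (4-level paging) and a user-space limit of 2^47 minus one page; `is_canonical`, `choose_return_path`, and `USER_ADDR_LIMIT` are hypothetical names for this illustration, not kernel symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit virtual addresses, so bits 63..47 must all be equal. */
bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;            /* the 17 uppermost bits */
    return top == 0 || top == 0x1FFFF;    /* all zeros or all ones */
}

/* Assumed user-space limit for this sketch: 2^47 minus one 4 KiB page. */
#define USER_ADDR_LIMIT ((UINT64_C(1) << 47) - 4096)

enum return_path { RETURN_VIA_SYSRET, RETURN_VIA_IRET };

/* Any return address at or above the limit takes the IRET path. That covers
 * every non-canonical address, since those all lie above the user half. */
enum return_path choose_return_path(uint64_t user_rip)
{
    return user_rip < USER_ADDR_LIMIT ? RETURN_VIA_SYSRET : RETURN_VIA_IRET;
}
```

Note that a single comparison against the user limit is enough: there is no need to test canonicality separately, because the first non-canonical address (0x0000800000000000) is already above the limit.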

When IRET Is Required

KPTI and Page Table Switching

The Two Page Tables:

User page tables: Contain user mappings plus a minimal kernel mapping (just enough to handle the syscall entry and switch to the full kernel tables).
Kernel page tables: Full kernel and user mappings, used during kernel execution.

The Return Path with KPTI:

Page Table Switch on Return (KPTI)

Assembly

/* KPTI return path (arch/x86/entry/entry_64.S, simplified) */

SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
    /* KPTI: switch from kernel page tables to user page tables */

    /* Restore general-purpose registers except RDI, which is */
    /* needed as a scratch register below */
    POP_REGS pop_rdi=0

    /* The user page tables don't map the kernel stack, so the */
    /* final steps must run on the per-CPU entry ("trampoline") */
    /* stack, which IS mapped in both sets of page tables */
    movq    %rsp, %rdi
    movq    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp

    /* Copy the IRET frame and the saved user RDI to that stack */
    pushq   6*8(%rdi)           /* SS */
    pushq   5*8(%rdi)           /* RSP */
    pushq   4*8(%rdi)           /* RFLAGS */
    pushq   3*8(%rdi)           /* CS */
    pushq   2*8(%rdi)           /* RIP */
    pushq   (%rdi)              /* RDI */

    /* Now switch page tables. The user CR3 value is derived from */
    /* the current kernel CR3 rather than loaded from a variable. */
    /* After this, most kernel memory is unmapped! */
    SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

    /* We're now running on user page tables */
    /* Only the entry code and trampoline stack are accessible */

    /* Restore user RDI, then switch GS back to user */
    popq    %rdi
    swapgs

    /* Return to user mode */
    /* IRET pops the frame we built and jumps to user code */
    iretq

/* The entry/exit code and its stack are small and mapped at a */
/* consistent address in both page table sets. */
/* They contain just enough to do the final switch. */

Performance Impact:

KPTI adds overhead to every system call:

Operation	Overhead
CR3 write	~50-100 cycles
TLB flush (without PCID)	~200-500 cycles
TLB flush (with PCID)	~50 cycles
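
The difference between the two TLB rows comes down to bits in the CR3 value written on return. The sketch below models how a user CR3 value might be derived from the kernel one. The bit positions are assumptions modelled on Linux's PTI implementation (user top-level table one page above the kernel's, a dedicated user PCID bit, and the CR3 no-flush bit), and the sketch always sets no-flush when PCID is available, which is a simplification; `user_cr3_from_kernel` is a hypothetical name.

```c
#include <stdint.h>

/* Assumed bit positions for this sketch (modelled on x86-64 Linux with PTI). */
#define USER_PGTABLE_BIT 12   /* selects the user top-level table (kernel PGD + 4 KiB) */
#define USER_PCID_BIT    11   /* marks the PCID as belonging to the user address space */
#define CR3_NOFLUSH_BIT  63   /* asks the CPU to keep TLB entries tagged with this PCID */

/* Derive the CR3 value to load before returning to user mode.
 * kernel_cr3: the CR3 value in use while running in the kernel.
 * have_pcid:  nonzero if the CPU supports process-context identifiers. */
uint64_t user_cr3_from_kernel(uint64_t kernel_cr3, int have_pcid)
{
    uint64_t cr3 = kernel_cr3 | (UINT64_C(1) << USER_PGTABLE_BIT);

    if (have_pcid) {
        /* Separate PCID for user mode, and no flush on the switch:
         * this is what makes the "with PCID" case cheaper. */
        cr3 |= UINT64_C(1) << USER_PCID_BIT;
        cr3 |= UINT64_C(1) << CR3_NOFLUSH_BIT;
    }
    return cr3;
}
```

Without PCID, every CR3 write discards non-global TLB entries, so both the kernel and the user side start cold after each transition. With PCID, entries are tagged per address space and can survive the switch.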

PCID and TLB Management

System Call Restart Handling

When a signal interrupts a system call, the kernel must decide: should the call fail with EINTR, or should it automatically restart when the signal handler returns?

The Problem:

// User code
ssize_t n = read(fd, buf, 1000);  // Blocks waiting for data

// Signal arrives! Handler runs. Then what?
// Option 1: read() returns -1 with errno=EINTR (caller must retry)
// Option 2: read() automatically restarts (caller doesn't notice)
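
When Option 1 applies, the caller has to handle the retry itself. A common idiom is a wrapper like the sketch below; `read_retry` is a hypothetical helper name, not a libc function.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* read() wrapper that hides EINTR from the caller by retrying.
 * Any other error, end of file (0), or a successful read is returned as-is. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);

    return n;
}
```

The loop only repeats when the call failed specifically because a signal interrupted it, so real errors and short reads still reach the caller unchanged.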

Linux's Solution:

The kernel tracks the original system call number and restart behavior in the saved register state:

System Call Restart Mechanism
C
// Signal handling decides if restart is appropriate
 
void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
    bool restart = false;
    long retval = regs->ax;  // Current return value
 
    // Only consider a restart if we came from a system call
    // (orig_ax holds the syscall number, or -1 otherwise)
    if ((long)regs->orig_ax >= 0) {
        if (retval == -ERESTARTSYS) {
            // Restart only if the handler was installed with SA_RESTART
            if (ksig->ka.sa.sa_flags & SA_RESTART) {
                restart = true;
            } else {
                regs->ax = -EINTR;  // Return EINTR to user
            }
        } else if (retval == -ERESTARTNOINTR) {
            // Always restart, unconditionally
            restart = true;
        } else if (retval == -ERESTARTNOHAND) {
            // Restarts only when no handler runs; a handler is
            // about to run here, so the call fails with EINTR
            regs->ax = -EINTR;
        }
    }
 
    if (restart) {
        // Restore original RAX (system call number)
        regs->ax = regs->orig_ax;
        // Move RIP back to the syscall instruction
        regs->ip -= 2;  // sizeof(syscall instruction)
        // When the handler returns, syscall will re-execute!
    }
 
    // Now set up signal handler...
}
 
// The different restart codes:
// -ERESTARTSYS     : Restart if SA_RESTART set
// -ERESTARTNOINTR  : Always restart (used by futex)
// -ERESTARTNOHAND  : Restart only if no handler runs
// -EINTR           : Don't restart (return error to user)

System Call Restart Behaviors
Return Code	Handler Runs, SA_RESTART Set	Handler Runs, SA_RESTART Not Set	No Handler Runs
-ERESTARTSYS	Restart syscall	Return -EINTR	Restart syscall
-ERESTARTNOINTR	Restart syscall	Restart syscall	Restart syscall
-ERESTARTNOHAND	Return -EINTR	Return -EINTR	Restart syscall
-EINTR	Return -EINTR	Return -EINTR	Return -EINTR

SA_RESTART Flag
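
From user space, the flag is chosen when the handler is installed with `sigaction()`. The following is a minimal sketch; `install_handler`, `restarts_syscalls`, and `noop_handler` are hypothetical helper names used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* A handler that does nothing; its only effect is to interrupt a blocking call. */
void noop_handler(int signo)
{
    (void)signo;
}

/* Install `handler` for `signo`. If `restart` is nonzero the handler is
 * registered with SA_RESTART, so interrupted calls that support restarting
 * resume transparently. Otherwise they fail with EINTR.
 * Returns 0 on success, -1 on error. */
int install_handler(int signo, void (*handler)(int), int restart)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = restart ? SA_RESTART : 0;
    return sigaction(signo, &sa, NULL);
}

/* Report whether the current disposition of `signo` has SA_RESTART set.
 * Returns 1 or 0, or -1 on error. */
int restarts_syscalls(int signo)
{
    struct sigaction cur;

    if (sigaction(signo, NULL, &cur) != 0)
        return -1;
    return (cur.sa_flags & SA_RESTART) != 0;
}
```

Even with `SA_RESTART` set, some calls still report `EINTR` after a handler runs, so code that must be robust typically keeps a retry loop as well.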

Tracing and Auditing on Return

The return path is also where system call tracing and auditing hooks execute. This enables debugging tools (strace) and security monitoring (auditd). Seccomp filters, by contrast, are evaluated at system call entry, before the call runs.

Ptrace Exit Tracing:

When a process is being traced (e.g., by strace), the kernel stops at system call exit to report the return value:

System Call Exit Tracing
C
// Called on return path if TIF_SYSCALL_TRACE is set
 
static void syscall_exit_trace(struct pt_regs *regs)
{
    // Report to tracer (e.g., strace)
    if (test_thread_flag(TIF_SYSCALL_TRACE)) {
        tracehook_report_syscall_exit(regs, 0);
        // This might:
        // 1. Stop the process (PTRACE_SYSCALL)
        // 2. Notify tracer of return value
        // 3. Allow tracer to modify return value
    }
    
    // Audit logging (if enabled)
    if (unlikely(audit_context())) {
        audit_syscall_exit(regs);
        // Logs: syscall number, return value, arguments
        // Used for security auditing and compliance
    }
    
    // Note: seccomp has no hook here. Its filters run at syscall
    // entry and decide whether the call may proceed at all.
}
 
// strace output showing entry and exit:
// open("/etc/passwd", O_RDONLY) = 3
//      ^-- entry params             ^-- exit return value
// 
// The tracer sees both the entry (arguments) and exit (result)

Tracer's View:

Tools like strace use ptrace to intercept system calls:

PTRACE_SYSCALL causes the tracee to stop at syscall entry
Tracer reads arguments from registers
Tracer continues tracee with PTRACE_SYSCALL again
Tracee stops at syscall exit
Tracer reads return value from RAX
Tracer logs: read(3, "hello", 5) = 5
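
The stop-and-continue sequence above can be sketched as a tiny tracer. This is an illustration under assumptions (Linux, ptrace permitted for child processes); `count_syscall_stops` is a hypothetical name. To stay architecture-independent it only counts syscall stops instead of decoding registers.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, trace it with PTRACE_SYSCALL, and count its syscall stops
 * (entry stops plus exit stops). Returns the count, or -1 on error. */
int count_syscall_stops(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Tracee: ask to be traced, then stop so the parent can take over. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);
        raise(SIGSTOP);
        (void)getppid();       /* a system call for the tracer to observe */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;             /* child exited before its initial stop */

    /* Make syscall stops distinguishable: they report SIGTRAP | 0x80. */
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) != 0) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    int stops = 0;
    for (;;) {
        /* Resume until the next syscall entry or exit. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) != 0)
            return -1;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;             /* tracee is gone */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;           /* one entry stop or one exit stop */
    }
    return stops;
}
```

A real tracer would read the registers at each stop (arguments at entry, the return value at exit) and would track which of the two stops it is looking at. The exact count returned here depends on the C library, since `raise()` and `_exit()` make system calls of their own.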

Audit Logging:

type=SYSCALL msg=audit(1234567:1): arch=c000003e syscall=2 
  success=yes exit=3 a0=7ffd5a3c1100 a1=0 a2=0 a3=0 
  items=1 ppid=1234 pid=5678 uid=1000 gid=1000 
  comm="cat" exe="/bin/cat"

This audit record shows open() (syscall 2) returning fd 3, with the process details and arguments logged for security review.

Tracing Performance Impact

Security Considerations of the Return Path

The return path is security-critical. Any vulnerability here could allow:

Information disclosure: Leaking kernel data through unsanitized registers
Privilege escalation: Returning to user mode with elevated privileges
Control flow hijacking: Returning to an attacker-controlled address

Security Requirements for Return Path

•Register Sanitization — Registers not explicitly set must be cleared. Failing to zero R8-R15 could leak kernel pointers, defeating KASLR.
•RFLAGS Validation — RFLAGS restored from user state must be masked. The user cannot be allowed to enable IOPL or other privileged flags.
•Return Address Validation — The return RIP must be in user space. SYSRET's non-canonical check vulnerability shows how this can go wrong.
•Stack Pointer Safety — RSP must point to user memory. Returning with RSP pointing to kernel memory would be catastrophic.
•Segment Register Safety — CS and SS must be user-mode selectors. Returning with kernel selectors but CPL=3 could enable attacks.

Return Path Security Checks
C
// Security measures in the return path
 
void prepare_exit_to_usermode(struct pt_regs *regs)
{
    // 1. Validate return RIP is in user range
    if (regs->ip >= TASK_SIZE_MAX) {
        // Force SIGSEGV - return address is bad
        force_sig(SIGSEGV);
        return;
    }
    
    // 2. Sanitize RFLAGS  
    // User cannot set: IOPL, VM, VIF, VIP, or reserved bits
    regs->flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM | 
                     X86_EFLAGS_VIF | X86_EFLAGS_VIP);
    // Ensure IF is set (interrupts enabled in user mode)
    regs->flags |= X86_EFLAGS_IF;
    
    // 3. Validate segment selectors
    // (though these should always be correct from entry save)
    regs->cs = __USER_CS;
    regs->ss = __USER_DS;
    
    // 4. Clear any potentially sensitive data from scratch registers
    // Modern kernels clear registers that might hold kernel addresses
}
 
// Note: the IA32_FMASK MSR masks RFLAGS on SYSCALL entry, not on return.
// On SYSRET the CPU loads RFLAGS from R11 and itself clears RF and VM.
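
The flag masking in step 2 can be tried out in isolation. The sketch below uses the architectural RFLAGS bit positions; the `FLAG_*` macro names and `sanitize_user_rflags` are hypothetical, chosen for this example rather than taken from kernel headers.

```c
#include <stdint.h>

/* Architectural RFLAGS bit positions on x86. */
#define FLAG_IF   (UINT64_C(1) << 9)    /* interrupt enable */
#define FLAG_IOPL (UINT64_C(3) << 12)   /* I/O privilege level (two bits) */
#define FLAG_VM   (UINT64_C(1) << 17)   /* virtual-8086 mode */
#define FLAG_VIF  (UINT64_C(1) << 19)   /* virtual interrupt flag */
#define FLAG_VIP  (UINT64_C(1) << 20)   /* virtual interrupt pending */

/* Return a copy of `flags` that is safe to load when entering user mode:
 * privileged bits are cleared and interrupts are forced on. */
uint64_t sanitize_user_rflags(uint64_t flags)
{
    flags &= ~(FLAG_IOPL | FLAG_VM | FLAG_VIF | FLAG_VIP);
    flags |= FLAG_IF;
    return flags;
}
```

For example, a saved value of 0x3246 (IOPL set to 3) becomes 0x246: the arithmetic flags survive, while the I/O privilege level is dropped to 0.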

Historical Vulnerabilities

Summary: Completing the System Call

Key Takeaways

•Return path handles pending work — Signals, rescheduling, and tracing are processed before returning to user mode.
•Signals are delivered at return — The saved user state can be modified to redirect execution to signal handlers.
•Context switches may occur — If TIF_NEED_RESCHED is set, schedule() runs, and another process may execute before we return.
•SYSRET is faster but has restrictions — Non-canonical RIP addresses require falling back to the slower IRET path.
•KPTI requires page table switches — Meltdown mitigation adds CR3 switches on both entry and exit.
•Security is paramount — Every register, flag, and address must be validated or sanitized before returning to user mode.

Module Complete:
