We've seen how the kernel saves the complete state of a running process—registers pushed to the stack, stack pointer stored in the thread structure, FPU state preserved. The process is effectively frozen in time, its execution suspended between one instruction and the next.
Now comes the mirror operation: restoring context. The scheduler has selected a process to run next. Its state was saved earlier, perhaps milliseconds ago, perhaps minutes. The kernel must now recreate that exact execution environment so the process can resume as if nothing happened.
This is not merely reversing the save process. Restoring context must be done with extreme precision—wrong register values mean crashes, wrong instruction pointers mean security vulnerabilities, wrong memory spaces mean processes accessing each other's data. This page explores every step of the restoration process.
By the end of this page, you will understand the complete sequence of operations that restore a process to execution: switching kernel stacks, loading page tables, restoring CPU registers, returning to user mode, and the elegant assembly tricks that make this all happen atomically and correctly.
The most critical moment in a context switch is the stack pointer swap. This single operation—loading a new value into RSP—transitions the kernel from operating in the context of the old process to operating in the context of the new process.
Why the Stack Pointer Is Everything:
Recall from the previous page that we don't directly save/restore most registers to/from a structure. Instead:

- Callee-saved registers are pushed onto the outgoing process's kernel stack.
- Only the stack pointer itself is stored in the task's thread_struct (thread.sp).
- Everything else is recovered later simply by popping from whichever stack RSP points to.
The stack pointer is the handle to all other register state. Change RSP, and suddenly every subsequent pop instruction retrieves a different process's saved values.
```asm
/**
 * __switch_to_asm - The assembly heart of context switch
 *
 * This is the actual code that switches between two processes.
 * Called from schedule() -> context_switch() -> switch_to()
 *
 * Arguments:
 *   %rdi = prev task_struct pointer
 *   %rsi = next task_struct pointer
 *
 * This function "returns" in the context of 'next', not 'prev'!
 */
ENTRY(__switch_to_asm)
	/*
	 * Save callee-saved registers to prev's kernel stack.
	 * These are registers that the C calling convention requires
	 * us to preserve across function calls.
	 */
	pushq	%rbp
	pushq	%rbx
	pushq	%r12
	pushq	%r13
	pushq	%r14
	pushq	%r15

	/*
	 * Save current stack pointer to prev->thread.sp
	 *
	 * THREAD_SP is the offset of the 'sp' field in thread_struct.
	 * After this, prev->thread.sp points to where we saved registers.
	 */
	movq	%rsp, THREAD_SP(%rdi)

	/*============================================================
	 * THE CRITICAL MOMENT: Stack pointer switch
	 *
	 * After this instruction, we are "in" the next process!
	 * RSP now points to next's kernel stack, which has next's
	 * saved registers at the top.
	 *============================================================*/
	movq	THREAD_SP(%rsi), %rsp

	/*
	 * Restore callee-saved registers from next's kernel stack.
	 * These are the values that were pushed when 'next' was
	 * switched OUT previously.
	 */
	popq	%r15
	popq	%r14
	popq	%r13
	popq	%r12
	popq	%rbx
	popq	%rbp

	/*
	 * Jump to __switch_to (C function) for remaining work.
	 * __switch_to handles:
	 *   - FPU state restoration
	 *   - TLS segment base updates
	 *   - Debug register restoration
	 *   - TSS updates
	 *
	 * When __switch_to returns, it returns to wherever next was
	 * when it called __switch_to_asm - because the return address
	 * now on the stack is next's!
	 */
	jmp	__switch_to
ENDPROC(__switch_to_asm)

/*
 * Annotated timeline:
 *
 * Time T1: Process A running, calls schedule()
 * Time T2: schedule() calls __switch_to_asm(A, B)
 * Time T3: A's registers saved, RSP saved to A->thread.sp
 * Time T4: RSP loaded from B->thread.sp   <-- THE SWITCH
 * Time T5: B's registers restored from B's stack
 * Time T6: We're now in B's context, return to where B was
 *
 * A is now frozen; B continues from where it was suspended.
 */
```

When __switch_to_asm eventually returns (via 'ret'), it doesn't return to the caller's code—it returns to wherever the NEW process was when IT called __switch_to_asm. The return address on the stack belongs to the NEW process. This is how control transfers: we literally 'return' into a different execution context.
After the assembly stub switches stacks, control transfers to the C function __switch_to(). This function handles architecture-specific details that are easier to manage in C than assembly.
What __switch_to() Does:
This function runs in the context of the NEW process (because RSP was already switched), but it does the final cleanup and setup work:
```c
/**
 * __switch_to() - Complete the context switch in C
 *
 * Called after __switch_to_asm has switched stack pointers.
 * We're now running on next's kernel stack.
 *
 * @prev: process we're switching FROM
 * @next: process we're switching TO (now current!)
 *
 * Returns: prev (for switch_to() macro bookkeeping)
 */
struct task_struct *__switch_to(struct task_struct *prev,
				struct task_struct *next)
{
	struct thread_struct *next_thread = &next->thread;
	struct thread_struct *prev_thread = &prev->thread;

	/*
	 * 1. Update the current task pointer
	 *
	 * From this point on, the 'current' macro returns 'next'.
	 * This is essential for all kernel code to work correctly.
	 */
	this_cpu_write(current_task, next);

	/*
	 * 2. Update TSS with new kernel stack pointer
	 *
	 * When next returns to user space and later takes an
	 * interrupt, the CPU needs to load the correct kernel stack
	 * from TSS.sp0. It must point to next's kernel stack top.
	 */
	update_task_stack(next);

	/*
	 * 3. Switch the FPU / SIMD state
	 *
	 * Modern kernels do "eager" FPU switching: always save
	 * prev's FPU state and restore next's FPU state here.
	 */
	switch_fpu_finish(next);

	/*
	 * 4. Update Thread-Local Storage segment bases
	 *
	 * FS base (user-space TLS) and GS base (kernel per-CPU)
	 * may differ between processes. Update MSRs as needed.
	 */
	if (prev_thread->fsbase != next_thread->fsbase) {
		wrmsrl(MSR_FS_BASE, next_thread->fsbase);
	}

	if (prev_thread->gsbase != next_thread->gsbase) {
		wrmsrl(MSR_KERNEL_GS_BASE, next_thread->gsbase);
	}

	/*
	 * 5. Load debugging registers if process uses them
	 *
	 * Hardware breakpoints are per-process. If next has
	 * breakpoints set, load them into DR0-DR7.
	 */
	if (unlikely(test_tsk_thread_flag(next, TIF_DEBUG))) {
		load_debug_registers(next_thread);
	}

	/*
	 * 6. Handle I/O permission bitmap if needed
	 *
	 * Some processes have direct I/O port access rights.
	 * Update the TSS I/O bitmap pointer for those processes.
	 */
	if (prev_thread->io_bitmap_max != next_thread->io_bitmap_max) {
		update_io_bitmap(next);
	}

	/*
	 * 7. Architecture-specific accounting
	 */
	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next));

	/*
	 * Return prev for the switch_to() macro.
	 * This allows the caller (which is now next!) to know
	 * which process it switched from.
	 */
	return prev;
}
```

If switching between processes (not threads of the same process), the kernel must activate the new process's address space. This happens in switch_mm(), called from context_switch() before the actual register switch.
The CR3 Switch:
Loading a new value into CR3 tells the CPU to use a different page table hierarchy. After this:

- Every user-space virtual address resolves through the new process's page tables.
- Stale TLB entries are flushed, or kept but tagged with a PCID on hardware that supports it.
- Kernel addresses keep working, because the kernel half of the address space is mapped identically in every page table.
```c
/**
 * context_switch() - The high-level context switch orchestrator
 *
 * Coordinates memory management switch and CPU register switch.
 */
static struct task_struct *context_switch(struct rq *rq,
					  struct task_struct *prev,
					  struct task_struct *next)
{
	struct mm_struct *mm = next->mm;
	struct mm_struct *oldmm = prev->active_mm;

	/*
	 * Handle memory management context switch
	 */
	if (!mm) {
		/*
		 * next is a kernel thread (no user address space).
		 * Kernel threads borrow the previous process's mm
		 * to avoid unnecessary page table switches.
		 */
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);	/* Reference count */
		enter_lazy_tlb(oldmm, next);
	} else {
		/*
		 * next is a user process with its own address space.
		 * Actually switch the page tables.
		 */
		switch_mm_irqs_off(oldmm, mm, next);
	}

	/*
	 * Now switch the CPU register context.
	 * After this, we're executing in next's context.
	 */
	prev = switch_to(prev, next);

	return prev;
}

/**
 * switch_mm_irqs_off() - Switch memory management context
 *
 * This loads the new page tables into CR3.
 */
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
			struct task_struct *tsk)
{
	unsigned long new_cr3;

	/* Same address space? Skip CR3 load. */
	if (likely(prev == next)) {
		return;
	}

	/* Build CR3 value: page table base + PCID if supported */
	if (static_cpu_has(X86_FEATURE_PCID)) {
		new_cr3 = build_cr3(next->pgd, next->context.ctx_id);
	} else {
		new_cr3 = __pa(next->pgd);
	}

	/*
	 * Load new CR3 - this is the actual address space switch!
	 *
	 * Effects:
	 *   - New page table hierarchy active
	 *   - All user-space addresses now resolve using next's mappings
	 *   - TLB flushed (or entries tagged with PCID)
	 */
	native_write_cr3(new_cr3);

	/* Update per-CPU tracking */
	this_cpu_write(cpu_tlbstate.loaded_mm, next);
}
```

Process Context IDentifiers (PCID) allow the CPU to cache TLB entries from multiple address spaces, tagged by a 12-bit ID. When switching with PCID, old TLB entries aren't flushed—they remain valid but tagged with the old PCID. When switching back to that process later, its TLB entries may still be cached. This dramatically reduces context switch overhead for processes that switch frequently.
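To make the PCID mechanics concrete, here is a small sketch of how a CR3 value could be assembled when PCID is enabled. The helper and constants are illustrative (the kernel's own build_cr3() differs in detail), but the bit layout follows the architecture: page-table base in the upper bits, the 12-bit PCID in bits 0-11, and bit 63 as the "don't flush this PCID" hint.

```c
#include <stdint.h>

#define CR3_PCID_MASK   0xFFFULL        /* bits 0-11: Process Context ID */
#define CR3_NOFLUSH_BIT (1ULL << 63)    /* set: keep TLB entries tagged with this PCID */

/* Illustrative helper, not the kernel's build_cr3(). */
static inline uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid, int noflush)
{
	uint64_t cr3 = pgd_phys & ~CR3_PCID_MASK;  /* page-table base (4 KiB aligned) */

	cr3 |= pcid & CR3_PCID_MASK;               /* tag translations with this ID */
	if (noflush)
		cr3 |= CR3_NOFLUSH_BIT;            /* reuse cached entries for this PCID */
	return cr3;
}
```

With noflush set, the CR3 write becomes cheap on the common "switch back to a recently run process" case, because the CPU does not discard that process's cached translations.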
Kernel Threads and Lazy TLB:
Kernel threads have no user-space address space (mm == NULL). They only access kernel memory, which is mapped identically in all page tables (upper half of the address space). So kernel threads don't need their own page tables—they "borrow" whatever page tables were active when they started running.
This optimization is called "lazy TLB": we don't switch page tables when entering a kernel thread, avoiding a costly CR3 load. The kernel thread can access kernel memory just fine since it's mapped the same way in all address spaces.
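The borrow eventually has to be returned. Here is a simplified sketch of the other half of the arrangement, loosely modeled on what happens after the switch completes (details and locking omitted): once we have switched away from a kernel thread, the mm it borrowed gets its reference dropped.

```c
/*
 * Simplified sketch of releasing a borrowed mm after switching away
 * from a kernel thread. Not the kernel's exact code path.
 */
static void release_borrowed_mm(struct task_struct *prev)
{
	if (!prev->mm && prev->active_mm) {        /* prev was a kernel thread */
		struct mm_struct *borrowed = prev->active_mm;

		prev->active_mm = NULL;            /* it no longer borrows anything */
		mmdrop(borrowed);                  /* drop the reference taken when borrowing */
	}
}
```

The reference counting matters: the borrowed mm must not be freed while a kernel thread's page tables still point at it, even if the owning process has already exited.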
After switching stacks, the new process's saved registers are at the top of the kernel stack. Restoring them is conceptually simple—pop each register—but the devil is in the details.
Two Levels of Register State:
Callee-saved registers (RBX, RBP, R12-R15): Pushed/popped by __switch_to_asm. These are the registers that C functions must preserve.
Full register set (all GPRs plus RIP, RFLAGS, etc.): Stored in pt_regs on the kernel stack. These are restored when returning from the system call or interrupt that originally triggered the context switch.
The key insight: after __switch_to_asm returns, control goes back through the kernel call stack until eventually reaching the interrupt/syscall return path, which does the full pt_regs restoration.
```asm
/**
 * Returning to user space after context switch
 *
 * When the new process was originally switched out, it was
 * in the middle of handling an interrupt or syscall. Its
 * pt_regs is still on the kernel stack. We unwind back to
 * that point and restore everything.
 */

/*
 * Path: schedule() returns -> interrupt handler returns ->
 *       syscall handler returns -> here
 */

/**
 * POP_REGS - Restore all general-purpose registers from pt_regs
 */
.macro POP_REGS
	popq	%r15
	popq	%r14
	popq	%r13
	popq	%r12
	popq	%rbp
	popq	%rbx
	popq	%r11
	popq	%r10
	popq	%r9
	popq	%r8
	popq	%rax
	popq	%rcx
	popq	%rdx
	popq	%rsi
	popq	%rdi
	/* Skip orig_ax */
	addq	$8, %rsp
.endm

/**
 * syscall_return_slowpath - Return from syscall to user space
 *
 * Called when returning from a syscall (or after schedule()).
 * Restores all user-space register state.
 */
ENTRY(syscall_return_slowpath)
	/* Check for pending work (signals, reschedule needed) */
	testl	$(_TIF_ALLWORK_MASK), TASK_FLAGS(%r12)
	jnz	work_pending

	/* Disable interrupts for the final return sequence */
	cli

	/* Restore all registers from pt_regs on stack */
	POP_REGS

	/*
	 * At this point:
	 *   - All GPRs have been restored to user values
	 *   - RSP points to the CPU-pushed interrupt frame:
	 *     [RIP, CS, RFLAGS, RSP, SS]
	 */

	/*
	 * SYSRETQ or IRETQ to return to user space
	 *
	 * SYSRETQ is faster but has restrictions:
	 *   - Can only return to ring 3
	 *   - RCX must contain return RIP
	 *   - R11 must contain return RFLAGS
	 *
	 * IRETQ is general-purpose:
	 *   - Pops RIP, CS, RFLAGS, RSP, SS from stack
	 *   - Works for any return (kernel->user, kernel->kernel)
	 */

	/* Use IRETQ for the general case: */
	iretq
ENDPROC(syscall_return_slowpath)

/*
 * The exact moment of restoration:
 *
 * IRETQ atomically:
 *   1. Pops RIP    -> jumps to user code
 *   2. Pops CS     -> changes code segment (ring 3)
 *   3. Pops RFLAGS -> restores condition codes & IF flag
 *   4. Pops RSP    -> switches to user stack
 *   5. Pops SS     -> changes stack segment
 *
 * After IRETQ, we are in user space, at exactly the
 * instruction where the process was interrupted.
 */
```

Floating-point and SIMD registers are restored during __switch_to() (for eager switching) or on the first FPU instruction (for lazy switching). Modern kernels use eager switching for security.
The XRSTOR Instruction:
The XRSTOR family of instructions restores processor state from a memory buffer. The state includes x87 FPU, SSE, AVX, and other extended state components.
```c
/**
 * switch_fpu_finish() - Restore FPU state for incoming process
 *
 * Called from __switch_to() during context switch.
 */
void switch_fpu_finish(struct task_struct *next)
{
	struct fpu *next_fpu = &next->thread.fpu;

	/*
	 * Check if next has any FPU state to restore.
	 * A newly forked process might not have used the FPU yet.
	 */
	if (!fpu_state_valid(next_fpu)) {
		/* Initialize to clean FPU state */
		fpu_init_state(next_fpu);
	}

	/*
	 * Restore the FPU state from memory to registers.
	 *
	 * XRSTOR uses a state component bitmap to select which
	 * parts of the state to restore. The kernel typically
	 * restores everything that was saved.
	 */

	/* The actual restoration - XRSTOR instruction wrapper */
	copy_kernel_to_xregs(&next_fpu->state, next_fpu->state_mask);
}

/**
 * copy_kernel_to_xregs() - Execute XRSTOR to load FPU state
 */
static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
{
	u32 lo = (u32)mask;
	u32 hi = (u32)(mask >> 32);

	/*
	 * XRSTOR loads processor state from memory.
	 *
	 * EDX:EAX specify the state component bitmap:
	 *   - Bit 0: x87 FPU
	 *   - Bit 1: SSE (XMM registers)
	 *   - Bit 2: AVX (YMM registers)
	 *   - Bit 5: AVX-512 opmask
	 *   - Bits 6-7: AVX-512 ZMM
	 *   - etc.
	 */
	asm volatile(
		"xrstor %[xstate]"
		: /* no outputs */
		: [xstate] "m" (*xstate), "a" (lo), "d" (hi)
		: "memory"
	);
}

/*
 * State layout restored by XRSTOR:
 *
 * Offset 0: FXSAVE legacy area (x87 + SSE)
 *   - x87 control/status words
 *   - ST0-ST7 (x87 stack)
 *   - XMM0-XMM15 (SSE registers)
 *
 * Offset 576+: Extended state (when XSAVE used)
 *   - YMM0-YMM15 high halves (AVX)
 *   - ZMM0-ZMM31 (AVX-512)
 *   - Opmask registers (AVX-512)
 *   - etc.
 */
```

The XSAVE/XRSTOR instructions use a bitmask to select which state components to save/restore. This allows the kernel to restore only the components that were actually in use, saving time. For example, if a process never used AVX-512, those huge ZMM registers don't need to be restored.
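As a rough illustration of that selectivity (an illustrative sketch, not a kernel helper): XRSTOR loads a component from the buffer only when its bit is set both in the requested mask in EDX:EAX and in the buffer's XSTATE_BV header field; components that are requested but marked absent are reset to their initial state instead.

```c
#include <stdint.h>

#define XFEATURE_MASK_FP   (1ULL << 0)   /* x87 state        */
#define XFEATURE_MASK_SSE  (1ULL << 1)   /* XMM registers    */
#define XFEATURE_MASK_YMM  (1ULL << 2)   /* AVX upper halves */

/*
 * Illustrative only: which components an XRSTOR with request mask
 * 'rfbm' would actually load from the save buffer, given the
 * XSTATE_BV bitmap recorded in that buffer's header.
 */
static inline uint64_t xrstor_components_loaded(uint64_t rfbm, uint64_t xstate_bv)
{
	return rfbm & xstate_bv;     /* loaded from memory */
}

static inline uint64_t xrstor_components_initialized(uint64_t rfbm, uint64_t xstate_bv)
{
	return rfbm & ~xstate_bv;    /* requested but absent: reset to init state */
}
```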
The last step is the actual mode transition from kernel (ring 0) to user (ring 3). This is accomplished by special instructions that atomically load multiple registers and change privilege level.
IRETQ: The Universal Return:
The IRET (Interrupt Return) instruction was designed exactly for this purpose. It atomically pops RIP, CS, RFLAGS, RSP, and SS from the stack, then resumes execution at the restored RIP in user mode.
SYSRETQ: The Fast Path:
For returning from system calls specifically, x86-64 provides SYSRET. It's faster than IRET but has restrictions that don't allow it to be used in all cases (e.g., when returning to different code segments or when RFLAGS needs special handling).
```asm
/*
 * Two ways to return from kernel to user space:
 * SYSRETQ (fast) and IRETQ (general)
 */

/**
 * SYSRETQ - Fast return from syscall
 *
 * Preconditions:
 *   - RCX contains return RIP
 *   - R11 contains return RFLAGS
 *   - Must be returning to ring 3 code
 *
 * Fast path for common syscall returns.
 */
ENTRY(syscall_return_sysret)
	/* Restore GPRs (RAX has the syscall return value) */
	POP_REGS

	/*
	 * RSP now points at the saved frame: [RIP, CS, RFLAGS, RSP, SS].
	 * SYSRETQ takes its return state from registers, so load it:
	 */
	movq	0*8(%rsp), %rcx		/* return RIP    -> RCX */
	movq	2*8(%rsp), %r11		/* return RFLAGS -> R11 */
	movq	3*8(%rsp), %rsp		/* switch to the user stack */

	/*
	 * SYSRETQ does the following atomically:
	 *   1. Set RIP = RCX
	 *   2. Set RFLAGS = R11 (masked)
	 *   3. Set CS to user code segment (ring 3)
	 *   4. Set SS to user stack segment
	 *   5. (RSP was loaded with the user stack pointer above -
	 *      SYSRETQ does not restore RSP itself)
	 *
	 * Faster than IRETQ because the CPU doesn't pop from the stack.
	 */
	sysretq
ENDPROC(syscall_return_sysret)

/**
 * IRETQ - General return from interrupt/exception
 *
 * Works for any kernel-to-user transition.
 * Used when SYSRETQ isn't safe/applicable.
 */
ENTRY(interrupt_return_iret)
	/* Restore all GPRs (and skip the orig_ax/error code slot) */
	POP_REGS

	/*
	 * Stack now contains (pushed by the CPU on entry):
	 *   +32: SS
	 *   +24: RSP
	 *   +16: RFLAGS
	 *   + 8: CS
	 *   + 0: RIP
	 */

	/*
	 * IRETQ atomically:
	 *   1. Pop RIP from stack    -> jump there
	 *   2. Pop CS from stack     -> change code segment (ring 3)
	 *   3. Pop RFLAGS from stack -> restore flags
	 *   4. Pop RSP from stack    -> switch to user stack
	 *   5. Pop SS from stack     -> change stack segment
	 *
	 * After this single instruction, we're in user mode,
	 * at the saved RIP, with the saved register state.
	 */
	iretq
ENDPROC(interrupt_return_iret)

/*
 * Why IRETQ is essential:
 *
 * The interrupt/syscall return must be ATOMIC. We can't do:
 *     mov user_rsp, %rsp    ; Now on user stack, still ring 0!
 *     jmp user_rip          ; Still ring 0, security disaster!
 *
 * IRETQ changes privilege level and multiple registers in
 * a single, atomic CPU operation. There's no window where
 * we're in an inconsistent state.
 */
```

| Aspect | IRETQ | SYSRETQ |
|---|---|---|
| State source | Stack (5 values popped) | Registers (RCX=RIP, R11=RFLAGS) |
| Speed | Slower (~20+ cycles) | Faster (~10 cycles) |
| Generality | Works for any transition | Only syscall return to ring 3 |
| RFLAGS handling | Full restore from stack | Masked restore from R11 |
| RSP handling | Popped from stack | Already set by kernel |
| Usage | Interrupts, exceptions, special cases | Fast path syscall returns |
SYSRET has subtle security issues: if RCX contains a non-canonical address (not valid in x86-64), SYSRET causes a #GP fault that executes on the USER stack but in ring 0. Intel and AMD handle this differently, creating potential vulnerabilities. Kernels must validate RCX before using SYSRET, or fall back to IRETQ for suspect cases. This is why both return paths exist.
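A minimal sketch of the kind of gate involved, with assumed helper names rather than the kernel's actual return-path code: before taking the SYSRET fast path, check that the return RIP is canonical for a 48-bit virtual address space, and fall back to IRETQ otherwise. The real kernel applies further checks as well, for example on the saved RFLAGS, before trusting SYSRET.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Canonical check for 48-bit virtual addresses: bits 63..48 must be a
 * sign-extension of bit 47. Illustrative helper, not a kernel function.
 */
static bool is_canonical_48(uint64_t addr)
{
	return (uint64_t)(((int64_t)addr << 16) >> 16) == addr;
}

/* Decide whether the SYSRET fast path is safe to use (sketch). */
static bool can_use_sysret(uint64_t return_rip)
{
	/*
	 * A non-canonical RIP would make SYSRET fault in ring 0 while
	 * already on the user stack on some CPUs, so take IRETQ instead.
	 */
	return is_canonical_48(return_rip);
}
```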
Let's trace through a complete context switch restoration, from scheduler selection to user-mode execution:
| Step | Action | CPU State Change |
|---|---|---|
| 1 | Scheduler selects next process | No CPU change yet |
| 2 | switch_mm() loads CR3 | Address space switched, TLB flushed |
| 3 | __switch_to_asm saves prev's callee-saved regs | Push RBX, RBP, R12-R15 |
| 4 | __switch_to_asm saves prev's RSP to thread.sp | No CPU change (memory write) |
| 5 | __switch_to_asm loads next's RSP from thread.sp | RSP = next's kernel stack |
| 6 | __switch_to_asm pops next's callee-saved regs | RBX, RBP, R12-R15 = next's values |
| 7 | __switch_to updates current_task | Per-CPU points to next |
| 8 | __switch_to updates TSS.SP0 | Next interrupt uses next's kstack |
| 9 | switch_fpu_finish restores FPU | XMM/YMM/ZMM = next's values |
| 10 | Kernel unwinds back through call stack | Each ret restores more context |
| 11 | POP_REGS restores all GPRs from pt_regs | RAX-R15 = next's user values |
| 12 | IRETQ/SYSRETQ returns to user mode | RIP, RSP, RFLAGS, CS, SS restored |
```
CONTEXT SWITCH RESTORATION TIMELINE
====================================

Process A: Running, executes syscall, eventually schedule() is called
Process B: Was sleeping, now selected to run

Time  Location              Action
───────────────────────────────────────────────────────────────────
T0    schedule()            Scheduler picks process B

T1    context_switch()      Check if B has different mm_struct
      └─ switch_mm()        Load B's page table (write CR3)
                            TLB flushed (unless PCID)

T2    switch_to() macro     Calls __switch_to_asm(A, B)

T3    __switch_to_asm       push %rbp, %rbx, %r12-r15   (A's registers)
      (running as A)        movq %rsp, A->thread.sp     (save A's RSP)
      ═══════════════════════════════════════════════════════════
      ║  movq B->thread.sp, %rsp        <-- THE SWITCH MOMENT   ║
      ═══════════════════════════════════════════════════════════
      (now on B's stack!)   pop %r15-r12, %rbx, %rbp    (B's registers)

T4    __switch_to           current_task = B
      (now "B" running)     TSS.SP0 = B's kernel stack
                            XRSTOR B's FPU state
                            Update FS/GS if needed
                            return (to where B was)

T5    <B's call stack>      Functions return, unwinding to...
                            ...context_switch()
                            ...schedule()
                            ...syscall_handler()

T6    syscall_exit          Check for pending signals
                            POP_REGS from B's pt_regs
                            All GPRs now have B's user values

T7    return_to_user        IRETQ or SYSRETQ
                            ───────────────────────────────────
                            RIP    = B's user code position
                            RSP    = B's user stack
                            RFLAGS = B's flags
                            CS     = user code segment (ring 3)
                            SS     = user stack segment

T8    User space            B continues from exactly where it was
                            No idea it was ever suspended!
```

Context restoration is the inverse of context saving, but it's executed with equal precision. Every saved value must be restored to exactly the right register, in exactly the right order, and the final mode transition must happen atomically.
You now understand the complete context restoration process—from stack pointer switch to user mode return. Together with context saving, this forms the complete picture of how the kernel pauses and resumes processes. Next, we'll explore the OVERHEAD of context switching and why minimizing switches matters for performance.