We've seen how the kernel saves the complete state of a running process—registers pushed to the stack, stack pointer stored in the thread structure, FPU state preserved. The process is effectively frozen in time, its execution suspended between one instruction and the next.
Now comes the mirror operation: restoring context. The scheduler has selected a process to run next. Its state was saved earlier, perhaps milliseconds ago, perhaps minutes. The kernel must now recreate that exact execution environment so the process can resume as if nothing happened.
This is not merely reversing the save process. Restoring context must be done with extreme precision—wrong register values mean crashes, wrong instruction pointers mean security vulnerabilities, wrong memory spaces mean processes accessing each other's data. This page explores every step of the restoration process.
By the end of this page, you will understand the complete sequence of operations that restore a process to execution: switching kernel stacks, loading page tables, restoring CPU registers, returning to user mode, and the elegant assembly tricks that make this all happen atomically and correctly.
The most critical moment in a context switch is the stack pointer swap. This single operation—loading a new value into RSP—transitions the kernel from operating in the context of the old process to operating in the context of the new process.
Why the Stack Pointer Is Everything:
Recall from the previous page that we don't directly save/restore most registers to/from a structure. Instead:

- Callee-saved registers are pushed onto the outgoing process's kernel stack.
- Only the stack pointer itself is stored in the task's thread_struct (thread.sp).
- Everything else is recovered later simply by popping from whichever stack RSP points to.
The stack pointer is the handle to all other register state. Change RSP, and suddenly every subsequent pop instruction retrieves a different process's saved values.
```asm
/**
 * __switch_to_asm - The assembly heart of context switch
 *
 * This is the actual code that switches between two processes.
 * Called from schedule() -> context_switch() -> switch_to()
 *
 * Arguments:
 *   %rdi = prev task_struct pointer
 *   %rsi = next task_struct pointer
 *
 * This function "returns" in the context of 'next', not 'prev'!
 */
ENTRY(__switch_to_asm)
	/*
	 * Save callee-saved registers to prev's kernel stack.
	 * These are registers that the C calling convention requires
	 * us to preserve across function calls.
	 */
	pushq	%rbp
	pushq	%rbx
	pushq	%r12
	pushq	%r13
	pushq	%r14
	pushq	%r15

	/*
	 * Save current stack pointer to prev->thread.sp
	 *
	 * THREAD_SP is the offset of the 'sp' field in thread_struct.
	 * After this, prev->thread.sp points to where we saved registers.
	 */
	movq	%rsp, THREAD_SP(%rdi)

	/*============================================================
	 * THE CRITICAL MOMENT: Stack pointer switch
	 *
	 * After this instruction, we are "in" the next process!
	 * RSP now points to next's kernel stack, which has next's
	 * saved registers at the top.
	 *============================================================*/
	movq	THREAD_SP(%rsi), %rsp

	/*
	 * Restore callee-saved registers from next's kernel stack.
	 * These are the values that were pushed when 'next' was
	 * switched OUT previously.
	 */
	popq	%r15
	popq	%r14
	popq	%r13
	popq	%r12
	popq	%rbx
	popq	%rbp

	/*
	 * Jump to __switch_to (C function) for remaining work.
	 * __switch_to handles:
	 *   - FPU state restoration
	 *   - TLS segment base updates
	 *   - Debug register restoration
	 *   - TSS updates
	 *
	 * When __switch_to returns, it returns to wherever next was
	 * when it called __switch_to_asm - because the return address
	 * now on the stack is next's!
	 */
	jmp	__switch_to
ENDPROC(__switch_to_asm)

/*
 * Annotated timeline:
 *
 * Time T1: Process A running, calls schedule()
 * Time T2: schedule() calls __switch_to_asm(A, B)
 * Time T3: A's registers saved, RSP saved to A->thread.sp
 * Time T4: RSP loaded from B->thread.sp   <-- THE SWITCH
 * Time T5: B's registers restored from B's stack
 * Time T6: We're now in B's context, return to where B was
 *
 * A is now frozen; B continues from where it was suspended.
 */
```

When __switch_to_asm eventually returns (via 'ret'), it doesn't return to the caller's code—it returns to wherever the NEW process was when IT called __switch_to_asm. The return address on the stack belongs to the NEW process. This is how control transfers: we literally 'return' into a different execution context.
After the assembly stub switches stacks, control transfers to the C function __switch_to(). This function handles architecture-specific details that are easier to manage in C than assembly.
What __switch_to() Does:
This function runs in the context of the NEW process (because RSP was already switched), but it does the final cleanup and setup work:
```c
/**
 * __switch_to() - Complete the context switch in C
 *
 * Called after __switch_to_asm has switched stack pointers.
 * We're now running on next's kernel stack.
 *
 * @prev: process we're switching FROM
 * @next: process we're switching TO (now current!)
 *
 * Returns: prev (for switch_to() macro bookkeeping)
 */
struct task_struct *__switch_to(struct task_struct *prev,
				struct task_struct *next)
{
	struct thread_struct *next_thread = &next->thread;
	struct thread_struct *prev_thread = &prev->thread;

	/*
	 * 1. Update the current task pointer
	 *
	 * From this point on, the 'current' macro returns 'next'.
	 * This is essential for all kernel code to work correctly.
	 */
	this_cpu_write(current_task, next);

	/*
	 * 2. Update TSS with new kernel stack pointer
	 *
	 * When next returns to user space and later takes an
	 * interrupt, the CPU needs to load the correct kernel stack
	 * from TSS.sp0. It must point to next's kernel stack top.
	 */
	update_task_stack(next);

	/*
	 * 3. Switch the FPU / SIMD state
	 *
	 * Modern kernels do "eager" FPU switching: always save
	 * prev's FPU state and restore next's FPU state here.
	 */
	switch_fpu_finish(next);

	/*
	 * 4. Update Thread-Local Storage segment bases
	 *
	 * FS base (user-space TLS) and GS base (kernel per-CPU)
	 * may differ between processes. Update MSRs as needed.
	 */
	if (prev_thread->fsbase != next_thread->fsbase) {
		wrmsrl(MSR_FS_BASE, next_thread->fsbase);
	}

	if (prev_thread->gsbase != next_thread->gsbase) {
		wrmsrl(MSR_KERNEL_GS_BASE, next_thread->gsbase);
	}

	/*
	 * 5. Load debugging registers if process uses them
	 *
	 * Hardware breakpoints are per-process. If next has
	 * breakpoints set, load them into DR0-DR7.
	 */
	if (unlikely(test_tsk_thread_flag(next, TIF_DEBUG))) {
		load_debug_registers(next_thread);
	}

	/*
	 * 6. Handle I/O permission bitmap if needed
	 *
	 * Some processes have direct I/O port access rights.
	 * Update the TSS I/O bitmap pointer for those processes.
	 */
	if (prev_thread->io_bitmap_max != next_thread->io_bitmap_max) {
		update_io_bitmap(next);
	}

	/*
	 * 7. Architecture-specific accounting
	 */
	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next));

	/*
	 * Return prev for the switch_to() macro.
	 * This allows the caller (which is now next!) to know
	 * which process it switched from.
	 */
	return prev;
}
```

If switching between processes (not threads of the same process), the kernel must activate the new process's address space. This happens in switch_mm(), called from context_switch() before the actual register switch.
The CR3 Switch:
Loading a new value into CR3 tells the CPU to use a different page table hierarchy. After this:

- Every user-space virtual address resolves through the new process's page tables.
- Stale TLB entries are flushed, or kept but tagged with a PCID on hardware that supports it.
- Kernel addresses keep working, because the kernel half of the address space is mapped identically in every page table.
```c
/**
 * context_switch() - The high-level context switch orchestrator
 *
 * Coordinates memory management switch and CPU register switch.
 */
static struct task_struct *context_switch(struct rq *rq,
					  struct task_struct *prev,
					  struct task_struct *next)
{
	struct mm_struct *mm = next->mm;
	struct mm_struct *oldmm = prev->active_mm;

	/*
	 * Handle memory management context switch
	 */
	if (!mm) {
		/*
		 * next is a kernel thread (no user address space).
		 * Kernel threads borrow the previous process's mm
		 * to avoid unnecessary page table switches.
		 */
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);	/* Reference count */
		enter_lazy_tlb(oldmm, next);
	} else {
		/*
		 * next is a user process with its own address space.
		 * Actually switch the page tables.
		 */
		switch_mm_irqs_off(oldmm, mm, next);
	}

	/*
	 * Now switch the CPU register context.
	 * After this, we're executing in next's context.
	 */
	prev = switch_to(prev, next);

	return prev;
}

/**
 * switch_mm_irqs_off() - Switch memory management context
 *
 * This loads the new page tables into CR3.
 */
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
			struct task_struct *tsk)
{
	unsigned long new_cr3;

	/* Same address space? Skip CR3 load. */
	if (likely(prev == next)) {
		return;
	}

	/* Build CR3 value: page table base + PCID if supported */
	if (static_cpu_has(X86_FEATURE_PCID)) {
		new_cr3 = build_cr3(next->pgd, next->context.ctx_id);
	} else {
		new_cr3 = __pa(next->pgd);
	}

	/*
	 * Load new CR3 - this is the actual address space switch!
	 *
	 * Effects:
	 *   - New page table hierarchy active
	 *   - All user-space addresses now resolve using next's mappings
	 *   - TLB flushed (or entries tagged with PCID)
	 */
	native_write_cr3(new_cr3);

	/* Update per-CPU tracking */
	this_cpu_write(cpu_tlbstate.loaded_mm, next);
}
```

Process Context IDentifiers (PCID) allow the CPU to cache TLB entries from multiple address spaces, tagged by a 12-bit ID. When switching with PCID, old TLB entries aren't flushed—they remain valid but tagged with the old PCID. When switching back to that process later, its TLB entries may still be cached. This dramatically reduces context switch overhead for processes that switch frequently.
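To make the PCID mechanics concrete, here is a small sketch of how a CR3 value could be assembled when PCID is enabled. The helper and constants are illustrative (the kernel's own build_cr3() differs in detail), but the bit layout follows the architecture: page-table base in the upper bits, the 12-bit PCID in bits 0-11, and bit 63 as the "don't flush this PCID" hint.

```c
#include <stdint.h>

#define CR3_PCID_MASK   0xFFFULL        /* bits 0-11: Process Context ID */
#define CR3_NOFLUSH_BIT (1ULL << 63)    /* set: keep TLB entries tagged with this PCID */

/* Illustrative helper, not the kernel's build_cr3(). */
static inline uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid, int noflush)
{
	uint64_t cr3 = pgd_phys & ~CR3_PCID_MASK;  /* page-table base (4 KiB aligned) */

	cr3 |= pcid & CR3_PCID_MASK;               /* tag translations with this ID */
	if (noflush)
		cr3 |= CR3_NOFLUSH_BIT;            /* reuse cached entries for this PCID */
	return cr3;
}
```

With noflush set, the CR3 write becomes cheap on the common "switch back to a recently run process" case, because the CPU does not discard that process's cached translations.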
Kernel Threads and Lazy TLB:
Kernel threads have no user-space address space (mm == NULL). They only access kernel memory, which is mapped identically in all page tables (upper half of the address space). So kernel threads don't need their own page tables—they "borrow" whatever page tables were active when they started running.
This optimization is called "lazy TLB": we don't switch page tables when entering a kernel thread, avoiding a costly CR3 load. The kernel thread can access kernel memory just fine since it's mapped the same way in all address spaces.
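The borrow eventually has to be returned. Here is a simplified sketch of the other half of the arrangement, loosely modeled on what happens after the switch completes (details and locking omitted): once we have switched away from a kernel thread, the mm it borrowed gets its reference dropped.

```c
/*
 * Simplified sketch of releasing a borrowed mm after switching away
 * from a kernel thread. Not the kernel's exact code path.
 */
static void release_borrowed_mm(struct task_struct *prev)
{
	if (!prev->mm && prev->active_mm) {        /* prev was a kernel thread */
		struct mm_struct *borrowed = prev->active_mm;

		prev->active_mm = NULL;            /* it no longer borrows anything */
		mmdrop(borrowed);                  /* drop the reference taken when borrowing */
	}
}
```

The reference counting matters: the borrowed mm must not be freed while a kernel thread's page tables still point at it, even if the owning process has already exited.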
After switching stacks, the new process's saved registers are at the top of the kernel stack. Restoring them is conceptually simple—pop each register—but the devil is in the details.
Two Levels of Register State:
Callee-saved registers (RBX, RBP, R12-R15): Pushed/popped by __switch_to_asm. These are the registers that C functions must preserve.
Full register set (all GPRs plus RIP, RFLAGS, etc.): Stored in pt_regs on the kernel stack. These are restored when returning from the system call or interrupt that originally triggered the context switch.
The key insight: after __switch_to_asm returns, control goes back through the kernel call stack until eventually reaching the interrupt/syscall return path, which does the full pt_regs restoration.
```asm
/**
 * Returning to user space after context switch
 *
 * When the new process was originally switched out, it was
 * in the middle of handling an interrupt or syscall. Its
 * pt_regs is still on the kernel stack. We unwind back to
 * that point and restore everything.
 */

/*
 * Path: schedule() returns -> interrupt handler returns ->
 *       syscall handler returns -> here
 */

/**
 * POP_REGS - Restore all general-purpose registers from pt_regs
 */
.macro POP_REGS
	popq	%r15
	popq	%r14
	popq	%r13
	popq	%r12
	popq	%rbp
	popq	%rbx
	popq	%r11
	popq	%r10
	popq	%r9
	popq	%r8
	popq	%rax
	popq	%rcx
	popq	%rdx
	popq	%rsi
	popq	%rdi
	/* Skip orig_ax */
	addq	$8, %rsp
.endm

/**
 * syscall_return_slowpath - Return from syscall to user space
 *
 * Called when returning from a syscall (or after schedule()).
 * Restores all user-space register state.
 */
ENTRY(syscall_return_slowpath)
	/* Check for pending work (signals, reschedule needed) */
	testl	$(_TIF_ALLWORK_MASK), TASK_FLAGS(%r12)
	jnz	work_pending

	/* Disable interrupts for the final return sequence */
	cli

	/* Restore all registers from pt_regs on stack */
	POP_REGS

	/*
	 * At this point:
	 *   - All GPRs have been restored to user values
	 *   - RSP points to the CPU-pushed interrupt frame:
	 *     [RIP, CS, RFLAGS, RSP, SS]
	 */

	/*
	 * SYSRETQ or IRETQ to return to user space
	 *
	 * SYSRETQ is faster but has restrictions:
	 *   - Can only return to ring 3
	 *   - RCX must contain return RIP
	 *   - R11 must contain return RFLAGS
	 *
	 * IRETQ is general-purpose:
	 *   - Pops RIP, CS, RFLAGS, RSP, SS from stack
	 *   - Works for any return (kernel->user, kernel->kernel)
	 */

	/* Use IRETQ for the general case: */
	iretq
ENDPROC(syscall_return_slowpath)

/*
 * The exact moment of restoration:
 *
 * IRETQ atomically:
 *   1. Pops RIP    -> jumps to user code
 *   2. Pops CS     -> changes code segment (ring 3)
 *   3. Pops RFLAGS -> restores condition codes & IF flag
 *   4. Pops RSP    -> switches to user stack
 *   5. Pops SS     -> changes stack segment
 *
 * After IRETQ, we are in user space, at exactly the
 * instruction where the process was interrupted.
 */
```

Floating-point and SIMD registers are restored during __switch_to() (for eager switching) or on the first FPU instruction (for lazy switching). Modern kernels use eager switching for security.
The XRSTOR Instruction:
The XRSTOR family of instructions restores processor state from a memory buffer. The state includes x87 FPU, SSE, AVX, and other extended state components.
```c
/**
 * switch_fpu_finish() - Restore FPU state for incoming process
 *
 * Called from __switch_to() during context switch.
 */
void switch_fpu_finish(struct task_struct *next)
{
	struct fpu *next_fpu = &next->thread.fpu;

	/*
	 * Check if next has any FPU state to restore.
	 * A newly forked process might not have used the FPU yet.
	 */
	if (!fpu_state_valid(next_fpu)) {
		/* Initialize to clean FPU state */
		fpu_init_state(next_fpu);
	}

	/*
	 * Restore the FPU state from memory to registers.
	 *
	 * XRSTOR uses a state component bitmap to select which
	 * parts of the state to restore. The kernel typically
	 * restores everything that was saved.
	 */

	/* The actual restoration - XRSTOR instruction wrapper */
	copy_kernel_to_xregs(&next_fpu->state, next_fpu->state_mask);
}

/**
 * copy_kernel_to_xregs() - Execute XRSTOR to load FPU state
 */
static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
{
	u32 lo = (u32)mask;
	u32 hi = (u32)(mask >> 32);

	/*
	 * XRSTOR loads processor state from memory.
	 *
	 * EDX:EAX specify the state component bitmap:
	 *   - Bit 0: x87 FPU
	 *   - Bit 1: SSE (XMM registers)
	 *   - Bit 2: AVX (YMM registers)
	 *   - Bit 5: AVX-512 opmask
	 *   - Bits 6-7: AVX-512 ZMM
	 *   - etc.
	 */
	asm volatile(
		"xrstor %[xstate]"
		: /* no outputs */
		: [xstate] "m" (*xstate), "a" (lo), "d" (hi)
		: "memory"
	);
}

/*
 * State layout restored by XRSTOR:
 *
 * Offset 0: FXSAVE legacy area (x87 + SSE)
 *   - x87 control/status words
 *   - ST0-ST7 (x87 stack)
 *   - XMM0-XMM15 (SSE registers)
 *
 * Offset 576+: Extended state (when XSAVE used)
 *   - YMM0-YMM15 high halves (AVX)
 *   - ZMM0-ZMM31 (AVX-512)
 *   - Opmask registers (AVX-512)
 *   - etc.
 */
```

The XSAVE/XRSTOR instructions use a bitmask to select which state components to save/restore. This allows the kernel to restore only the components that were actually in use, saving time. For example, if a process never used AVX-512, those huge ZMM registers don't need to be restored.
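As a rough illustration of that selectivity (an illustrative sketch, not a kernel helper): XRSTOR loads a component from the buffer only when its bit is set both in the requested mask in EDX:EAX and in the buffer's XSTATE_BV header field; components that are requested but marked absent are reset to their initial state instead.

```c
#include <stdint.h>

#define XFEATURE_MASK_FP   (1ULL << 0)   /* x87 state        */
#define XFEATURE_MASK_SSE  (1ULL << 1)   /* XMM registers    */
#define XFEATURE_MASK_YMM  (1ULL << 2)   /* AVX upper halves */

/*
 * Illustrative only: which components an XRSTOR with request mask
 * 'rfbm' would actually load from the save buffer, given the
 * XSTATE_BV bitmap recorded in that buffer's header.
 */
static inline uint64_t xrstor_components_loaded(uint64_t rfbm, uint64_t xstate_bv)
{
	return rfbm & xstate_bv;     /* loaded from memory */
}

static inline uint64_t xrstor_components_initialized(uint64_t rfbm, uint64_t xstate_bv)
{
	return rfbm & ~xstate_bv;    /* requested but absent: reset to init state */
}
```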
The last step is the actual mode transition from kernel (ring 0) to user (ring 3). This is accomplished by special instructions that atomically load multiple registers and change privilege level.
IRETQ: The Universal Return:
The IRET (Interrupt Return) instruction was designed exactly for this purpose. It atomically pops RIP, CS, RFLAGS, RSP, and SS from the stack, then resumes execution at the restored RIP in user mode.
SYSRETQ: The Fast Path:
For returning from system calls specifically, x86-64 provides SYSRET. It's faster than IRET but has restrictions that don't allow it to be used in all cases (e.g., when returning to different code segments or when RFLAGS needs special handling).
```asm
/*
 * Two ways to return from kernel to user space:
 * SYSRETQ (fast) and IRETQ (general)
 */

/**
 * SYSRETQ - Fast return from syscall
 *
 * Preconditions:
 *   - RCX contains return RIP
 *   - R11 contains return RFLAGS
 *   - Must be returning to ring 3 code
 *
 * Fast path for common syscall returns.
 */
ENTRY(syscall_return_sysret)
	/* Restore GPRs (RAX has the syscall return value) */
	POP_REGS

	/*
	 * RSP now points at the saved frame: [RIP, CS, RFLAGS, RSP, SS].
	 * SYSRETQ takes its return state from registers, so load it:
	 */
	movq	0*8(%rsp), %rcx		/* return RIP    -> RCX */
	movq	2*8(%rsp), %r11		/* return RFLAGS -> R11 */
	movq	3*8(%rsp), %rsp		/* switch to the user stack */

	/*
	 * SYSRETQ does the following atomically:
	 *   1. Set RIP = RCX
	 *   2. Set RFLAGS = R11 (masked)
	 *   3. Set CS to user code segment (ring 3)
	 *   4. Set SS to user stack segment
	 *   5. (RSP was loaded with the user stack pointer above -
	 *      SYSRETQ does not restore RSP itself)
	 *
	 * Faster than IRETQ because the CPU doesn't pop from the stack.
	 */
	sysretq
ENDPROC(syscall_return_sysret)

/**
 * IRETQ - General return from interrupt/exception
 *
 * Works for any kernel-to-user transition.
 * Used when SYSRETQ isn't safe/applicable.
 */
ENTRY(interrupt_return_iret)
	/* Restore all GPRs (and skip the orig_ax/error code slot) */
	POP_REGS

	/*
	 * Stack now contains (pushed by the CPU on entry):
	 *   +32: SS
	 *   +24: RSP
	 *   +16: RFLAGS
	 *   + 8: CS
	 *   + 0: RIP
	 */

	/*
	 * IRETQ atomically:
	 *   1. Pop RIP from stack    -> jump there
	 *   2. Pop CS from stack     -> change code segment (ring 3)
	 *   3. Pop RFLAGS from stack -> restore flags
	 *   4. Pop RSP from stack    -> switch to user stack
	 *   5. Pop SS from stack     -> change stack segment
	 *
	 * After this single instruction, we're in user mode,
	 * at the saved RIP, with the saved register state.
	 */
	iretq
ENDPROC(interrupt_return_iret)

/*
 * Why IRETQ is essential:
 *
 * The interrupt/syscall return must be ATOMIC. We can't do:
 *     mov user_rsp, %rsp    ; Now on user stack, still ring 0!
 *     jmp user_rip          ; Still ring 0, security disaster!
 *
 * IRETQ changes privilege level and multiple registers in
 * a single, atomic CPU operation. There's no window where
 * we're in an inconsistent state.
 */
```

| Aspect | IRETQ | SYSRETQ |
|---|---|---|
| State source | Stack (5 values popped) | Registers (RCX=RIP, R11=RFLAGS) |
| Speed | Slower (~20+ cycles) | Faster (~10 cycles) |
| Generality | Works for any transition | Only syscall return to ring 3 |
| RFLAGS handling | Full restore from stack | Masked restore from R11 |
| RSP handling | Popped from stack | Already set by kernel |
| Usage | Interrupts, exceptions, special cases | Fast path syscall returns |
SYSRET has subtle security issues: if RCX contains a non-canonical address (not valid in x86-64), SYSRET causes a #GP fault that executes on the USER stack but in ring 0. Intel and AMD handle this differently, creating potential vulnerabilities. Kernels must validate RCX before using SYSRET, or fall back to IRETQ for suspect cases. This is why both return paths exist.
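A minimal sketch of the kind of gate involved, with assumed helper names rather than the kernel's actual return-path code: before taking the SYSRET fast path, check that the return RIP is canonical for a 48-bit virtual address space, and fall back to IRETQ otherwise. The real kernel applies further checks as well, for example on the saved RFLAGS, before trusting SYSRET.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Canonical check for 48-bit virtual addresses: bits 63..48 must be a
 * sign-extension of bit 47. Illustrative helper, not a kernel function.
 */
static bool is_canonical_48(uint64_t addr)
{
	return (uint64_t)(((int64_t)addr << 16) >> 16) == addr;
}

/* Decide whether the SYSRET fast path is safe to use (sketch). */
static bool can_use_sysret(uint64_t return_rip)
{
	/*
	 * A non-canonical RIP would make SYSRET fault in ring 0 while
	 * already on the user stack on some CPUs, so take IRETQ instead.
	 */
	return is_canonical_48(return_rip);
}
```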
Let's trace through a complete context switch restoration, from scheduler selection to user-mode execution:
| Step | Action | CPU State Change |
|---|---|---|
| 1 | Scheduler selects next process | No CPU change yet |
| 2 | switch_mm() loads CR3 | Address space switched, TLB flushed |
| 3 | __switch_to_asm saves prev's callee-saved regs | Push RBX, RBP, R12-R15 |
| 4 | __switch_to_asm saves prev's RSP to thread.sp | No CPU change (memory write) |
| 5 | __switch_to_asm loads next's RSP from thread.sp | RSP = next's kernel stack |
| 6 | __switch_to_asm pops next's callee-saved regs | RBX, RBP, R12-R15 = next's values |
| 7 | __switch_to updates current_task | Per-CPU points to next |
| 8 | __switch_to updates TSS.SP0 | Next interrupt uses next's kstack |
| 9 | switch_fpu_finish restores FPU | XMM/YMM/ZMM = next's values |
| 10 | Kernel unwinds back through call stack | Each ret restores more context |
| 11 | POP_REGS restores all GPRs from pt_regs | RAX-R15 = next's user values |
| 12 | IRETQ/SYSRETQ returns to user mode | RIP, RSP, RFLAGS, CS, SS restored |
```
CONTEXT SWITCH RESTORATION TIMELINE
====================================

Process A: Running, executes syscall, eventually schedule() is called
Process B: Was sleeping, now selected to run

Time  Location              Action
───────────────────────────────────────────────────────────────────
T0    schedule()            Scheduler picks process B

T1    context_switch()      Check if B has different mm_struct
      └─ switch_mm()        Load B's page table (write CR3)
                            TLB flushed (unless PCID)

T2    switch_to() macro     Calls __switch_to_asm(A, B)

T3    __switch_to_asm       push %rbp, %rbx, %r12-r15   (A's registers)
      (running as A)        movq %rsp, A->thread.sp     (save A's RSP)
      ═══════════════════════════════════════════════════════════
      ║  movq B->thread.sp, %rsp        <-- THE SWITCH MOMENT   ║
      ═══════════════════════════════════════════════════════════
      (now on B's stack!)   pop %r15-r12, %rbx, %rbp    (B's registers)

T4    __switch_to           current_task = B
      (now "B" running)     TSS.SP0 = B's kernel stack
                            XRSTOR B's FPU state
                            Update FS/GS if needed
                            return (to where B was)

T5    <B's call stack>      Functions return, unwinding to...
                            ...context_switch()
                            ...schedule()
                            ...syscall_handler()

T6    syscall_exit          Check for pending signals
                            POP_REGS from B's pt_regs
                            All GPRs now have B's user values

T7    return_to_user        IRETQ or SYSRETQ
                            ───────────────────────────────────
                            RIP    = B's user code position
                            RSP    = B's user stack
                            RFLAGS = B's flags
                            CS     = user code segment (ring 3)
                            SS     = user stack segment

T8    User space            B continues from exactly where it was
                            No idea it was ever suspended!
```

Context restoration is the inverse of context saving, but it's executed with equal precision. Every saved value must be restored to exactly the right register, in exactly the right order, and the final mode transition must happen atomically.
You now understand the complete context restoration process—from stack pointer switch to user mode return. Together with context saving, this forms the complete picture of how the kernel pauses and resumes processes. Next, we'll explore the OVERHEAD of context switching and why minimizing switches matters for performance.