Imagine pausing a movie, then resuming it hours later from exactly where you left off. The movie continues seamlessly as if no time had passed. A context switch must achieve precisely this for a running program—but the challenge is vastly more complex.
A running process isn't just reading code from storage like a movie player reads frames. It has a rich, dynamic state: values in CPU registers, the position in the code, the contents of its stack, pending I/O operations, signal masks, and much more. When the kernel preempts this process, every single element of this state must be preserved so that when the process resumes—possibly milliseconds later, possibly seconds—it continues exactly as if nothing happened.
This page explores the intricate process of saving context: what must be saved, where it's stored, how hardware and software cooperate, and why even tiny omissions cause catastrophic failures.
By the end of this page, you will understand exactly which pieces of process state must be saved during a context switch, where they are stored, the roles of hardware and software in the saving process, and the internal kernel data structures that make context switching possible across different CPU architectures.
Process state can be divided into hardware context (state held in CPU registers and processor control structures) and software context (state maintained in kernel data structures). During a context switch, both must be correctly handled.
Hardware Context (CPU State):
This is the state physically present in the CPU at the moment of the switch. If lost, the process cannot resume correctly:
General-Purpose Registers: On x86-64: RAX, RBX, RCX, RDX, RSI, RDI, R8-R15 (16 registers × 64 bits = 128 bytes)
Stack Pointer (RSP): Points to the current position in the process's stack—critical for function returns
Base Pointer (RBP): Often used as a frame pointer for debugging and stack unwinding
Instruction Pointer (RIP): The address of the next instruction to execute—the program's "playback position"
Flags Register (RFLAGS): Contains condition codes (zero, carry, overflow) and control flags (interrupt enable, direction)
Segment Registers: CS, DS, ES, FS, GS, SS—mostly vestigial in 64-bit mode but FS/GS are used for thread-local storage
Floating-Point/SIMD State: x87 FPU registers (ST0-ST7), SSE registers (XMM0-XMM15), AVX registers (YMM/ZMM)—can be 256-2048 bytes
Control Registers: CR0, CR2, CR3, CR4—CR3 is particularly important as it points to the page table
Debug Registers: DR0-DR7—used for hardware breakpoints
Model-Specific Registers (MSRs): Various processor state not in standard registers
| Register Category | Count × Size | Total Size | When Saved |
|---|---|---|---|
| General-Purpose (RAX, RBX, etc.) | 16 × 8 bytes | 128 bytes | Always (every switch) |
| Instruction Pointer (RIP) | 1 × 8 bytes | 8 bytes | Always (every switch) |
| Stack Pointer (RSP) | 1 × 8 bytes | 8 bytes | Always (every switch) |
| Flags (RFLAGS) | 1 × 8 bytes | 8 bytes | Always (every switch) |
| Segment Registers | 6 × 2 bytes | 12 bytes | Typically (may optimize) |
| x87 FPU (ST0-ST7 + control) | 8 × 10 + 14 bytes | 94 bytes | Lazy (on first FP use) |
| SSE/AVX (XMM0-XMM15 or YMM) | 16 × 16-32 bytes | 256-512 bytes | Lazy (on first SIMD use) |
| AVX-512 (ZMM0-ZMM31) | 32 × 64 bytes | 2048 bytes | Lazy (on first AVX-512 use) |
Software Context (Kernel Data Structures):
Beyond CPU registers, the kernel maintains extensive metadata about each process:
Process Control Block (PCB) / task_struct: The master record containing PID, state, priority, scheduling info
Memory Management Info: Pointer to page tables (mm_struct), virtual memory areas (VMAs), memory limits
File Descriptor Table: Array of open file references (files_struct)
Signal State: Pending signals, signal handlers, signal masks (sighand_struct)
Credentials: User ID, group ID, capabilities (cred struct)
Scheduling Information: Priority, time slice remaining, CFS virtual runtime, run queue position
Kernel Stack: Each process has a kernel-mode stack for system calls and interrupts
The process's user-space memory (heap, stack, code) is NOT saved during a context switch—it remains in RAM protected by page tables. Only the CPU state that points INTO this memory (registers, PC) is saved. This is why context switches are relatively fast: we save ~1KB of register state, not gigabytes of process memory.
Modern processors cooperate with the operating system during context switches. When an interrupt or exception occurs, the CPU automatically saves a minimal set of state to the stack before transferring control to the kernel. Understanding this hardware behavior is essential.
What the CPU Saves Automatically (on x86-64 interrupt/exception):
When an interrupt occurs while in user mode, the CPU automatically pushes the following onto the kernel stack (not the user stack): SS, the user RSP, RFLAGS, CS, RIP, and for some exceptions an error code.
This is the absolute minimum needed to return from the interrupt. Notice that general-purpose registers are NOT saved automatically—the kernel must save them manually.
```asm
; x86-64 Interrupt Stack Frame (automatically pushed by CPU)
;
; When an interrupt occurs in user mode, the CPU automatically:
; 1. Switches from user stack to kernel stack (using TSS.RSP0)
; 2. Pushes the following values (growing downward):
;
; Higher addresses
; ┌──────────────────────────────────────┐
; │ SS (user stack segment)              │ +40 (8 bytes, padded)
; ├──────────────────────────────────────┤
; │ RSP (user stack pointer)             │ +32 (8 bytes)
; ├──────────────────────────────────────┤
; │ RFLAGS (flags register)              │ +24 (8 bytes)
; ├──────────────────────────────────────┤
; │ CS (user code segment)               │ +16 (8 bytes, padded)
; ├──────────────────────────────────────┤
; │ RIP (instruction pointer)            │ +8  (8 bytes)
; ├──────────────────────────────────────┤
; │ Error Code (some exceptions only)    │ +0  (8 bytes, optional)
; └──────────────────────────────────────┘
;     (kernel RSP points here)
; Lower addresses
;
; After this automatic push, CPU jumps to interrupt handler.
; Handler must MANUALLY save all general-purpose registers.
;
; The hardware IRET instruction pops this frame to return:
;   iretq    ; Pop RIP, CS, RFLAGS, RSP, SS and return to user space
```

Why Doesn't the CPU Save All Registers?
Design tradeoff. Saving 16+ general-purpose registers plus FPU/SIMD state would make every interrupt extremely expensive, even when no context switch occurs. Most interrupts (timer ticks, network packets) are handled quickly and return to the same process. By having the kernel manually save only what's needed, the system keeps the common interrupt path fast and pays the full cost of saving FPU/SIMD state only when a genuine context switch requires it.
The Task State Segment (TSS):
On x86-64, the TSS contains the kernel stack pointer (RSP0) that the CPU loads when transitioning from user to kernel mode. Each CPU has its own TSS. When an interrupt occurs in user mode, the CPU reads RSP0 from the TSS, switches to that kernel stack, and only then pushes the interrupt frame described above.
The kernel stack is separate from the user stack for security. If the CPU pushed interrupt frames onto the user stack, a malicious process could corrupt its stack to hijack control flow when returning from interrupts. The TSS mechanism ensures we always transition to a trusted kernel stack.
After the CPU's automatic push, the kernel interrupt handler takes over. The very first instructions of any interrupt handler must save all general-purpose registers that might be modified. This is typically done in assembly before any C code runs.
The pt_regs Structure:
Linux defines a struct pt_regs that holds all the register values at the point of interrupt. The interrupt entry code pushes all registers into this format on the kernel stack:
```c
/**
 * struct pt_regs - saved CPU register state
 *
 * This structure is pushed onto the kernel stack at the start
 * of any interrupt or system call entry. It captures the complete
 * CPU register state at the moment of kernel entry.
 *
 * Located in: arch/x86/include/asm/ptrace.h
 */
struct pt_regs {
    /* Manually saved by interrupt entry code */
    unsigned long r15;
    unsigned long r14;
    unsigned long r13;
    unsigned long r12;
    unsigned long bp;      /* RBP - base pointer */
    unsigned long bx;      /* RBX */
    unsigned long r11;
    unsigned long r10;
    unsigned long r9;
    unsigned long r8;
    unsigned long ax;      /* RAX - return value / syscall number */
    unsigned long cx;      /* RCX - 4th C ABI argument (R10 replaces it for syscalls) */
    unsigned long dx;      /* RDX - 3rd argument */
    unsigned long si;      /* RSI - 2nd argument */
    unsigned long di;      /* RDI - 1st argument */

    /* Identifies the interrupt/exception source */
    unsigned long orig_ax; /* Original RAX (syscall number) */

    /* Automatically saved by CPU on interrupt */
    unsigned long ip;      /* RIP - instruction pointer */
    unsigned long cs;      /* Code segment */
    unsigned long flags;   /* RFLAGS */
    unsigned long sp;      /* RSP - user stack pointer */
    unsigned long ss;      /* Stack segment */
};

/* Size: 21 * 8 = 168 bytes of register state */
```
```asm
/**
 * Simplified interrupt entry macro (Linux style)
 *
 * This assembly code runs at the very start of interrupt handling,
 * immediately after the CPU pushes SS, RSP, RFLAGS, CS, RIP.
 *
 * It completes the pt_regs structure by pushing all GPRs.
 */
.macro PUSH_REGS
    /* At this point, CPU has already pushed:
     * SS, RSP, RFLAGS, CS, RIP (and maybe error code)
     */

    /* Push all general-purpose registers to complete pt_regs */
    pushq %rdi      /* Save 1st argument */
    pushq %rsi      /* Save 2nd argument */
    pushq %rdx      /* Save 3rd argument */
    pushq %rcx      /* Save 4th argument */
    pushq %rax      /* Save syscall number / return value */
    pushq %r8       /* Save 5th argument */
    pushq %r9       /* Save 6th argument */
    pushq %r10
    pushq %r11      /* Scratch registers */
    pushq %rbx
    pushq %rbp      /* Frame pointer (callee-saved) */
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15

    /* Now RSP points to a complete pt_regs structure */
    /* We can pass this pointer to C interrupt handlers */
.endm

/* Example interrupt entry point */
ENTRY(interrupt_entry)
    /* CPU already pushed SS, RSP, RFLAGS, CS, RIP */
    PUSH_REGS               /* Push remaining registers */

    movq %rsp, %rdi         /* pt_regs pointer as 1st arg to C */
    call do_IRQ             /* Call C interrupt handler */

    /* After handler returns, restore all registers */
    POP_REGS                /* Reverse of PUSH_REGS */
    iretq                   /* Return from interrupt, pop CPU frame */
ENDPROC(interrupt_entry)

.macro POP_REGS
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbp
    popq %rbx
    popq %r11
    popq %r10
    popq %r9
    popq %r8
    popq %rax
    popq %rcx
    popq %rdx
    popq %rsi
    popq %rdi
.endm
```

Some registers are 'callee-saved' (RBX, RBP, R12-R15): if C code calls a function, that function must preserve these. Others are 'caller-saved' (RAX, RCX, RDX, RSI, RDI, R8-R11): the caller expects them to be clobbered. The interrupt entry code saves EVERYTHING because we don't know which registers the interrupted code was using—we must restore them all exactly.
During interrupt handling, register state is temporarily saved on the kernel stack (in pt_regs). But when a context switch actually occurs, the outgoing process's state must be stored more permanently in kernel memory structures. This is where thread_struct comes in.
Each process has a task_struct (the PCB) which contains an embedded thread_struct for architecture-specific CPU state. When we switch away from a process, the kernel copies critical register values from the kernel stack into this structure.
```c
/**
 * struct thread_struct - Per-thread CPU state (x86-64)
 *
 * This structure is embedded in task_struct and holds the
 * saved CPU context when the thread is not running.
 *
 * Key fields shown (simplified from actual Linux):
 */
struct thread_struct {
    /* Segment descriptor caches - for segment reload on switch */
    unsigned short es;
    unsigned short ds;
    unsigned short fsindex;
    unsigned short gsindex;

    /* Thread-local storage base addresses */
    unsigned long fsbase;   /* FS segment base (TLS) */
    unsigned long gsbase;   /* GS segment base (per-CPU data) */

    /* Saved stack pointer - THE critical value for switch */
    unsigned long sp;       /* Kernel stack pointer (RSP) */

    /* Debug registers */
    unsigned long debugreg[8];

    /* Floating-point state (not directly here in modern kernels) */
    struct fpu fpu;         /* FPU/SSE/AVX state container */

    /* I/O permission bitmap */
    unsigned long *io_bitmap_ptr;
    unsigned long io_bitmap_max;

    /*
     * Note: General registers (RAX, RBX, etc.) and RIP are NOT here!
     * They're stored in the pt_regs on the kernel stack.
     * The 'sp' field points to where pt_regs is located.
     */
};

/**
 * During context switch (simplified):
 *
 * 1. Save current RSP into current->thread.sp
 * 2. Load next->thread.sp into RSP
 * 3. (RSP now points to next's kernel stack with its pt_regs)
 * 4. Return "into" the next process
 */
```

The Elegant Trick: Stack Pointer as Complete State Reference
Notice that thread_struct doesn't store RAX, RBX, or even RIP directly. Instead, it stores the kernel stack pointer (sp). Why? Because at the moment of context switch, the kernel stack contains a complete pt_regs at a known offset from RSP. By saving RSP, we've effectively saved a pointer to all the other registers.
When switching:
1. Save the outgoing kernel stack pointer: current->thread.sp = RSP
2. Load the incoming kernel stack pointer: RSP = next->thread.sp
3. ret pops RIP from next's stack, jumping into next's code

This is brilliantly efficient—we save one register (RSP) and get all the others "for free" because they're already on the stack we're pointing to.
Floating-point and SIMD registers (x87, SSE, AVX, AVX-512) present a special challenge: they are large (up to 2KB for AVX-512) but many processes never use them. Saving/restoring them on every context switch would be wasteful.
Lazy FPU Context Switching:
Modern operating systems use a clever optimization called lazy FPU switching:

1. On a context switch, don't save the FPU state at all; just set the CR0.TS (Task Switched) bit.
2. If the new process later executes an FPU instruction, the TS bit makes it trap (#NM, Device Not Available).
3. The trap handler saves the previous owner's FPU state, restores or initializes the new process's state, and clears TS so subsequent FPU instructions run normally.
This means if neither the old nor new process uses floating-point, we skip saving/restoring 512-2048 bytes of state entirely.
```c
/**
 * Lazy FPU context switching (traditional approach)
 *
 * Modern Linux (since ~4.x) uses "eager" FPU switching for
 * security reasons, but lazy switching illustrates the concept.
 */

/* The CR0.TS (Task Switched) bit controls FPU traps */
#define CR0_TS_BIT (1 << 3)

/**
 * Called during context switch - DON'T save FPU, just set trap
 */
void switch_fpu_lazy(struct task_struct *prev, struct task_struct *next)
{
    /*
     * Set CR0.TS bit - next FPU instruction will trap.
     * We haven't spent time saving prev's FPU state yet.
     */
    unsigned long cr0 = read_cr0();
    write_cr0(cr0 | CR0_TS_BIT);

    /* Remember who "owns" the FPU right now */
    this_cpu_write(fpu_owner, prev);
}

/**
 * Device Not Available exception handler (#NM, vector 7)
 *
 * Called when process tries to use FPU but TS bit is set.
 */
void do_device_not_available(struct pt_regs *regs)
{
    struct task_struct *tsk = get_current();   /* the faulting task */
    struct task_struct *prev_owner = this_cpu_read(fpu_owner);

    /*
     * NOW we actually save the FPU state - but only if someone
     * else had been using it (not just saving empty state)
     */
    if (prev_owner && prev_owner != tsk) {
        /* Save prev_owner's FPU state to their fpu struct */
        fxsave(&prev_owner->thread.fpu.state);
    }

    /*
     * Restore current process's FPU state
     */
    if (tsk->thread.fpu.initialized) {
        fxrstor(&tsk->thread.fpu.state);
    } else {
        /* First time this process uses FPU - initialize to default */
        fninit();
        tsk->thread.fpu.initialized = true;
    }

    /*
     * Clear TS bit - subsequent FPU instructions won't trap
     * (until next context switch sets it again)
     */
    clts();     /* Clear Task-Switched flag */

    /* Mark current process as FPU owner */
    this_cpu_write(fpu_owner, tsk);

    /* Exception handler returns, instruction retries, works now */
}
```

Post-Spectre/Meltdown, lazy FPU saving became a security concern. Speculative execution could leak FPU register contents across processes. Modern Linux kernels now use 'eager' FPU switching: always save/restore FPU state on every context switch.
The performance cost is accepted for security. However, understanding lazy switching remains important for embedded systems and older kernels.
| Instruction | Registers Saved | Size | Speed |
|---|---|---|---|
| FNSAVE/FRSTOR | x87 only | 94 bytes | Legacy, slow |
| FXSAVE/FXRSTOR | x87 + SSE | 512 bytes | Standard, fast |
| XSAVE/XRSTOR | x87 + SSE + AVX + more | 576+ bytes | Extensible, modern |
| XSAVEOPT | XSAVE optimized | Variable | Only saves modified portions |
| XSAVES/XRSTORS | Supervisor + user state | Variable | For kernel, includes more state |
Each process has its own virtual address space, implemented through page tables. During a context switch between processes (not threads), the kernel must switch the active page table. This has profound implications for performance.
The CR3 Register and Page Table Base:
On x86-64, the CR3 register points to the physical address of the top-level page table (PML4 in 4-level paging). Changing CR3 effectively "switches" the entire virtual address space:
```c
/**
 * switch_mm() - Switch memory management context
 *
 * Called during context switch when switching between processes
 * (not needed when switching between threads of same process).
 */
void switch_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm,
               struct task_struct *next)
{
    unsigned long cr3;

    /* If same address space (threads of same process), skip */
    if (likely(prev_mm == next_mm)) {
        return;
    }

    /*
     * Get physical address of next process's PML4 page table.
     * The pgd (Page Global Directory) is the top-level table.
     */
    cr3 = __pa(next_mm->pgd);

    /*
     * Load new CR3 - this switches the entire address space.
     *
     * CRITICAL SIDE EFFECT: Loading CR3 flushes the TLB!
     * All cached virtual->physical translations are invalidated.
     * This is expensive and a major source of context switch overhead.
     */
    write_cr3(cr3);

    /* Update per-CPU tracking of current mm */
    this_cpu_write(cpu_current_mm, next_mm);

    /*
     * PCID optimization (if available):
     * Process Context IDentifiers allow keeping TLB entries
     * from multiple address spaces simultaneously, tagged by PCID.
     * This avoids full TLB flush on context switch.
     */
    if (cpu_has_pcid) {
        /* CR3 includes PCID in lower bits */
        cr3 = build_cr3(next_mm->pgd, next_mm->context.ctx_id);
        write_cr3_pcid(cr3);    /* May not flush TLB */
    }
}

/**
 * TLB flush cost example:
 *
 * - TLB has ~1000+ entries caching page translations
 * - After flush, EVERY memory access causes page table walk
 * - Page table walk: 4-5 memory accesses per translation
 * - Until TLB warms up, process runs much slower
 *
 * This is why process context switches are more expensive
 * than thread context switches (threads share address space).
 */
```

Threads within the same process share the same mm_struct and page tables. When switching between threads of the same process, there's no need to switch CR3 or flush the TLB. This is a major reason why multi-threaded programs can be more efficient than multi-process programs: thread context switches skip the expensive memory management switch.
Beyond CPU registers, a context switch must update kernel data structures to reflect the new running process.
Updating the Current Process Pointer:
The kernel needs to know which process is currently executing on each CPU. On Linux, this is tracked via a per-CPU variable and a special segment register (GS in kernel mode):
1. Store the next process's task_struct address into per-CPU storage
2. The current macro now returns the new process
```c
/**
 * How Linux tracks the current task on x86-64
 *
 * The 'current' macro is used everywhere in the kernel to get
 * the currently executing task. It must be updated during context switch.
 */

/* Per-CPU variable holding pointer to current task */
DEFINE_PER_CPU(struct task_struct *, current_task);

/**
 * The 'current' macro uses the GS segment base to access per-CPU data.
 * GS base points to this CPU's per-CPU area.
 */
#define current get_current()

static inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}

/**
 * During context switch, update current task pointer
 */
void update_current_task(struct task_struct *next)
{
    /* Update per-CPU current_task to point to next */
    this_cpu_write(current_task, next);

    /*
     * From this point on, current == next.
     * Any kernel code executing on this CPU will see
     * 'next' as the current task.
     */
}

/**
 * Also update TSS for next interrupt's kernel stack
 */
void update_tss_sp(struct task_struct *next)
{
    struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);

    /*
     * TSS.SP0 holds the kernel stack top for ring 0.
     * When next interrupt occurs in user mode, CPU loads
     * RSP from here. Must point to next's kernel stack.
     */
    tss->x86_tss.sp0 = (unsigned long)next->stack + THREAD_SIZE;
}
```

Context saving is a carefully choreographed dance between hardware and software, each contributing essential pieces:
| Component | Saved By | Where Stored | When Saved |
|---|---|---|---|
| RIP, RSP, RFLAGS, CS, SS | CPU (automatic) | Kernel stack | Interrupt entry |
| RAX, RBX, ..., R15 (GPRs) | Kernel (assembly) | Kernel stack (pt_regs) | Interrupt entry |
| Kernel RSP | Kernel (C code) | thread_struct.sp | Context switch |
| FPU/SSE/AVX state | Kernel (XSAVE) | thread.fpu structure | Switch or on-demand |
| CR3 / page tables | N/A - not saved | Already in mm_struct | Switch loads new CR3 |
| TLS bases (FS/GS) | Kernel | thread_struct | Context switch |
| current_task pointer | Kernel | Per-CPU variable | Context switch |
You now understand exactly what gets saved during a context switch and how hardware and software cooperate to preserve execution state. Next, we'll explore the reverse process: how the kernel RESTORES context to resume a previously-suspended process.