Imagine pausing a movie, then resuming it hours later from exactly where you left off. The movie continues seamlessly as if no time had passed. A context switch must achieve precisely this for a running program—but the challenge is vastly more complex.
A running process isn't just reading code from storage like a movie player reads frames. It has a rich, dynamic state: values in CPU registers, the position in the code, the contents of its stack, pending I/O operations, signal masks, and much more. When the kernel preempts this process, every single element of this state must be preserved so that when the process resumes—possibly milliseconds later, possibly seconds—it continues exactly as if nothing happened.
This page explores the intricate process of saving context: what must be saved, where it's stored, how hardware and software cooperate, and why even tiny omissions cause catastrophic failures.
By the end of this page, you will understand exactly which pieces of process state must be saved during a context switch, where they are stored, the roles of hardware and software in the saving process, and the internal kernel data structures that make context switching possible across different CPU architectures.
Process state can be divided into hardware context (state held in CPU registers and processor control structures) and software context (state maintained in kernel data structures). During a context switch, both must be correctly handled.
Hardware Context (CPU State):
This is the state physically present in the CPU at the moment of the switch. If lost, the process cannot resume correctly:
General-Purpose Registers: On x86-64: RAX, RBX, RCX, RDX, RSI, RDI, R8-R15 (16 registers × 64 bits = 128 bytes)
Stack Pointer (RSP): Points to the current position in the process's stack—critical for function returns
Base Pointer (RBP): Often used as a frame pointer for debugging and stack unwinding
Instruction Pointer (RIP): The address of the next instruction to execute—the program's "playback position"
Flags Register (RFLAGS): Contains condition codes (zero, carry, overflow) and control flags (interrupt enable, direction)
Segment Registers: CS, DS, ES, FS, GS, SS—mostly vestigial in 64-bit mode but FS/GS are used for thread-local storage
Floating-Point/SIMD State: x87 FPU registers (ST0-ST7), SSE registers (XMM0-XMM15), AVX registers (YMM/ZMM)—can be 256-2048 bytes
Control Registers: CR0, CR2, CR3, CR4—CR3 is particularly important as it points to the page table
Debug Registers: DR0-DR7—used for hardware breakpoints
Model-Specific Registers (MSRs): Various processor state not in standard registers
| Register Category | Count × Size | Total Size | When Saved |
|---|---|---|---|
| General-Purpose (RAX, RBX, etc.) | 16 × 8 bytes | 128 bytes | Always (every switch) |
| Instruction Pointer (RIP) | 1 × 8 bytes | 8 bytes | Always (every switch) |
| Stack Pointer (RSP) | 1 × 8 bytes | 8 bytes | Always (every switch) |
| Flags (RFLAGS) | 1 × 8 bytes | 8 bytes | Always (every switch) |
| Segment Registers | 6 × 2 bytes | 12 bytes | Typically (may optimize) |
| x87 FPU (ST0-ST7 + control) | 8 × 10 + 14 bytes | 94 bytes | Lazy (on first FP use) |
| SSE/AVX (XMM0-XMM15 or YMM) | 16 × 16-32 bytes | 256-512 bytes | Lazy (on first SIMD use) |
| AVX-512 (ZMM0-ZMM31) | 32 × 64 bytes | 2048 bytes | Lazy (on first AVX-512 use) |
Software Context (Kernel Data Structures):
Beyond CPU registers, the kernel maintains extensive metadata about each process:
Process Control Block (PCB) / task_struct: The master record containing PID, state, priority, scheduling info
Memory Management Info: Pointer to page tables (mm_struct), virtual memory areas (VMAs), memory limits
File Descriptor Table: Array of open file references (files_struct)
Signal State: Pending signals, signal handlers, signal masks (sighand_struct)
Credentials: User ID, group ID, capabilities (cred struct)
Scheduling Information: Priority, time slice remaining, CFS virtual runtime, run queue position
Kernel Stack: Each process has a kernel-mode stack for system calls and interrupts
The process's user-space memory (heap, stack, code) is NOT saved during a context switch—it remains in RAM protected by page tables. Only the CPU state that points INTO this memory (registers, PC) is saved. This is why context switches are relatively fast: we save ~1KB of register state, not gigabytes of process memory.
Modern processors cooperate with the operating system during context switches. When an interrupt or exception occurs, the CPU automatically saves a minimal set of state to the stack before transferring control to the kernel. Understanding this hardware behavior is essential.
What the CPU Saves Automatically (on x86-64 interrupt/exception):
When an interrupt occurs while in user mode, the CPU automatically pushes the following onto the kernel stack (not the user stack): SS, the user RSP, RFLAGS, CS, RIP, and for some exceptions an error code.
This is the absolute minimum needed to return from the interrupt. Notice that general-purpose registers are NOT saved automatically—the kernel must save them manually.
```asm
; x86-64 Interrupt Stack Frame (automatically pushed by CPU)
;
; When an interrupt occurs in user mode, the CPU automatically:
; 1. Switches from user stack to kernel stack (using TSS.RSP0)
; 2. Pushes the following values (growing downward):
;
; Higher addresses
; ┌──────────────────────────────────────┐
; │ SS (user stack segment)              │ +40 (8 bytes, padded)
; ├──────────────────────────────────────┤
; │ RSP (user stack pointer)             │ +32 (8 bytes)
; ├──────────────────────────────────────┤
; │ RFLAGS (flags register)              │ +24 (8 bytes)
; ├──────────────────────────────────────┤
; │ CS (user code segment)               │ +16 (8 bytes, padded)
; ├──────────────────────────────────────┤
; │ RIP (instruction pointer)            │ +8  (8 bytes)
; ├──────────────────────────────────────┤
; │ Error Code (some exceptions only)    │ +0  (8 bytes, optional)
; └──────────────────────────────────────┘
;     (kernel RSP points here)
; Lower addresses
;
; After this automatic push, CPU jumps to interrupt handler.
; Handler must MANUALLY save all general-purpose registers.
;
; The hardware IRET instruction pops this frame to return:
;   iretq    ; Pop RIP, CS, RFLAGS, RSP, SS and return to user space
```

Why Doesn't the CPU Save All Registers?
Design tradeoff. Saving 16+ general-purpose registers plus FPU/SIMD state would make every interrupt extremely expensive, even when no context switch occurs. Most interrupts (timer ticks, network packets) are handled quickly and return to the same process. By having the kernel manually save only what's needed, the system keeps the common interrupt path fast and pays the full cost of saving FPU/SIMD state only when a genuine context switch requires it.
The Task State Segment (TSS):
On x86-64, the TSS contains the kernel stack pointer (RSP0) that the CPU loads when transitioning from user to kernel mode. Each CPU has its own TSS. When an interrupt occurs in user mode, the CPU reads RSP0 from the TSS, switches to that kernel stack, and only then pushes the interrupt frame described above.
The kernel stack is separate from the user stack for security. If the CPU pushed interrupt frames onto the user stack, a malicious process could corrupt its stack to hijack control flow when returning from interrupts. The TSS mechanism ensures we always transition to a trusted kernel stack.
After the CPU's automatic push, the kernel interrupt handler takes over. The very first instructions of any interrupt handler must save all general-purpose registers that might be modified. This is typically done in assembly before any C code runs.
The pt_regs Structure:
Linux defines a struct pt_regs that holds all the register values at the point of interrupt. The interrupt entry code pushes all registers into this format on the kernel stack:
```c
/**
 * struct pt_regs - saved CPU register state
 *
 * This structure is pushed onto the kernel stack at the start
 * of any interrupt or system call entry. It captures the complete
 * CPU register state at the moment of kernel entry.
 *
 * Located in: arch/x86/include/asm/ptrace.h
 */
struct pt_regs {
    /* Manually saved by interrupt entry code */
    unsigned long r15;
    unsigned long r14;
    unsigned long r13;
    unsigned long r12;
    unsigned long bp;      /* RBP - base pointer */
    unsigned long bx;      /* RBX */
    unsigned long r11;
    unsigned long r10;
    unsigned long r9;
    unsigned long r8;
    unsigned long ax;      /* RAX - return value / syscall number */
    unsigned long cx;      /* RCX - 4th C ABI argument (R10 replaces it for syscalls) */
    unsigned long dx;      /* RDX - 3rd argument */
    unsigned long si;      /* RSI - 2nd argument */
    unsigned long di;      /* RDI - 1st argument */

    /* Identifies the interrupt/exception source */
    unsigned long orig_ax; /* Original RAX (syscall number) */

    /* Automatically saved by CPU on interrupt */
    unsigned long ip;      /* RIP - instruction pointer */
    unsigned long cs;      /* Code segment */
    unsigned long flags;   /* RFLAGS */
    unsigned long sp;      /* RSP - user stack pointer */
    unsigned long ss;      /* Stack segment */
};

/* Size: 21 * 8 = 168 bytes of register state */
```
```asm
/**
 * Simplified interrupt entry macro (Linux style)
 *
 * This assembly code runs at the very start of interrupt handling,
 * immediately after the CPU pushes SS, RSP, RFLAGS, CS, RIP.
 *
 * It completes the pt_regs structure by pushing all GPRs.
 */
.macro PUSH_REGS
    /* At this point, CPU has already pushed:
     * SS, RSP, RFLAGS, CS, RIP (and maybe error code)
     */

    /* Push all general-purpose registers to complete pt_regs */
    pushq %rdi      /* Save 1st argument */
    pushq %rsi      /* Save 2nd argument */
    pushq %rdx      /* Save 3rd argument */
    pushq %rcx      /* Save 4th argument */
    pushq %rax      /* Save syscall number / return value */
    pushq %r8       /* Save 5th argument */
    pushq %r9       /* Save 6th argument */
    pushq %r10
    pushq %r11      /* Scratch registers */
    pushq %rbx
    pushq %rbp      /* Frame pointer (callee-saved) */
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15

    /* Now RSP points to a complete pt_regs structure */
    /* We can pass this pointer to C interrupt handlers */
.endm

/* Example interrupt entry point */
ENTRY(interrupt_entry)
    /* CPU already pushed SS, RSP, RFLAGS, CS, RIP */
    PUSH_REGS               /* Push remaining registers */

    movq %rsp, %rdi         /* pt_regs pointer as 1st arg to C */
    call do_IRQ             /* Call C interrupt handler */

    /* After handler returns, restore all registers */
    POP_REGS                /* Reverse of PUSH_REGS */
    iretq                   /* Return from interrupt, pop CPU frame */
ENDPROC(interrupt_entry)

.macro POP_REGS
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbp
    popq %rbx
    popq %r11
    popq %r10
    popq %r9
    popq %r8
    popq %rax
    popq %rcx
    popq %rdx
    popq %rsi
    popq %rdi
.endm
```

Some registers are 'callee-saved' (RBX, RBP, R12-R15): if C code calls a function, that function must preserve these. Others are 'caller-saved' (RAX, RCX, RDX, RSI, RDI, R8-R11): the caller expects them to be clobbered. The interrupt entry code saves EVERYTHING because we don't know which registers the interrupted code was using—we must restore them all exactly.
During interrupt handling, register state is temporarily saved on the kernel stack (in pt_regs). But when a context switch actually occurs, the outgoing process's state must be stored more permanently in kernel memory structures. This is where thread_struct comes in.
Each process has a task_struct (the PCB) which contains an embedded thread_struct for architecture-specific CPU state. When we switch away from a process, the kernel copies critical register values from the kernel stack into this structure.
```c
/**
 * struct thread_struct - Per-thread CPU state (x86-64)
 *
 * This structure is embedded in task_struct and holds the
 * saved CPU context when the thread is not running.
 *
 * Key fields shown (simplified from actual Linux):
 */
struct thread_struct {
    /* Segment descriptor caches - for segment reload on switch */
    unsigned short es;
    unsigned short ds;
    unsigned short fsindex;
    unsigned short gsindex;

    /* Thread-local storage base addresses */
    unsigned long fsbase;   /* FS segment base (TLS) */
    unsigned long gsbase;   /* GS segment base (per-CPU data) */

    /* Saved stack pointer - THE critical value for switch */
    unsigned long sp;       /* Kernel stack pointer (RSP) */

    /* Debug registers */
    unsigned long debugreg[8];

    /* Floating-point state (not directly here in modern kernels) */
    struct fpu fpu;         /* FPU/SSE/AVX state container */

    /* I/O permission bitmap */
    unsigned long *io_bitmap_ptr;
    unsigned long io_bitmap_max;

    /*
     * Note: General registers (RAX, RBX, etc.) and RIP are NOT here!
     * They're stored in the pt_regs on the kernel stack.
     * The 'sp' field points to where pt_regs is located.
     */
};

/**
 * During context switch (simplified):
 *
 * 1. Save current RSP into current->thread.sp
 * 2. Load next->thread.sp into RSP
 * 3. (RSP now points to next's kernel stack with its pt_regs)
 * 4. Return "into" the next process
 */
```

The Elegant Trick: Stack Pointer as Complete State Reference
Notice that thread_struct doesn't store RAX, RBX, or even RIP directly. Instead, it stores the kernel stack pointer (sp). Why? Because at the moment of context switch, the kernel stack contains a complete pt_regs at a known offset from RSP. By saving RSP, we've effectively saved a pointer to all the other registers.
When switching:
1. Save the outgoing kernel stack pointer: current->thread.sp = RSP
2. Load the incoming kernel stack pointer: RSP = next->thread.sp
3. ret pops RIP from next's stack, jumping into next's code

This is brilliantly efficient—we save one register (RSP) and get all the others "for free" because they're already on the stack we're pointing to.
Floating-point and SIMD registers (x87, SSE, AVX, AVX-512) present a special challenge: they are large (up to 2KB for AVX-512) but many processes never use them. Saving/restoring them on every context switch would be wasteful.
Lazy FPU Context Switching:
Modern operating systems use a clever optimization called lazy FPU switching:

1. On a context switch, don't save the FPU state at all; just set the CR0.TS (Task Switched) bit.
2. If the new process later executes an FPU instruction, the TS bit makes it trap (#NM, Device Not Available).
3. The trap handler saves the previous owner's FPU state, restores or initializes the new process's state, and clears TS so subsequent FPU instructions run normally.
This means if neither the old nor new process uses floating-point, we skip saving/restoring 512-2048 bytes of state entirely.
```c
/**
 * Lazy FPU context switching (traditional approach)
 *
 * Modern Linux (since ~4.x) uses "eager" FPU switching for
 * security reasons, but lazy switching illustrates the concept.
 */

/* The CR0.TS (Task Switched) bit controls FPU traps */
#define CR0_TS_BIT (1 << 3)

/**
 * Called during context switch - DON'T save FPU, just set trap
 */
void switch_fpu_lazy(struct task_struct *prev, struct task_struct *next)
{
    /*
     * Set CR0.TS bit - next FPU instruction will trap.
     * We haven't spent time saving prev's FPU state yet.
     */
    unsigned long cr0 = read_cr0();
    write_cr0(cr0 | CR0_TS_BIT);

    /* Remember who "owns" the FPU right now */
    this_cpu_write(fpu_owner, prev);
}

/**
 * Device Not Available exception handler (#NM, vector 7)
 *
 * Called when process tries to use FPU but TS bit is set.
 */
void do_device_not_available(struct pt_regs *regs)
{
    struct task_struct *tsk = get_current();   /* the faulting task */
    struct task_struct *prev_owner = this_cpu_read(fpu_owner);

    /*
     * NOW we actually save the FPU state - but only if someone
     * else had been using it (not just saving empty state)
     */
    if (prev_owner && prev_owner != tsk) {
        /* Save prev_owner's FPU state to their fpu struct */
        fxsave(&prev_owner->thread.fpu.state);
    }

    /*
     * Restore current process's FPU state
     */
    if (tsk->thread.fpu.initialized) {
        fxrstor(&tsk->thread.fpu.state);
    } else {
        /* First time this process uses FPU - initialize to default */
        fninit();
        tsk->thread.fpu.initialized = true;
    }

    /*
     * Clear TS bit - subsequent FPU instructions won't trap
     * (until next context switch sets it again)
     */
    clts();     /* Clear Task-Switched flag */

    /* Mark current process as FPU owner */
    this_cpu_write(fpu_owner, tsk);

    /* Exception handler returns, instruction retries, works now */
}
```

Post-Spectre/Meltdown, lazy FPU saving became a security concern. Speculative execution could leak FPU register contents across processes. Modern Linux kernels now use 'eager' FPU switching: always save/restore FPU state on every context switch.
The performance cost is accepted for security. However, understanding lazy switching remains important for embedded systems and older kernels.
| Instruction | Registers Saved | Size | Speed |
|---|---|---|---|
| FNSAVE/FRSTOR | x87 only | 94 bytes | Legacy, slow |
| FXSAVE/FXRSTOR | x87 + SSE | 512 bytes | Standard, fast |
| XSAVE/XRSTOR | x87 + SSE + AVX + more | 576+ bytes | Extensible, modern |
| XSAVEOPT | XSAVE optimized | Variable | Only saves modified portions |
| XSAVES/XRSTORS | Supervisor + user state | Variable | For kernel, includes more state |
Each process has its own virtual address space, implemented through page tables. During a context switch between processes (not threads), the kernel must switch the active page table. This has profound implications for performance.
The CR3 Register and Page Table Base:
On x86-64, the CR3 register points to the physical address of the top-level page table (PML4 in 4-level paging). Changing CR3 effectively "switches" the entire virtual address space:
```c
/**
 * switch_mm() - Switch memory management context
 *
 * Called during context switch when switching between processes
 * (not needed when switching between threads of same process).
 */
void switch_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm,
               struct task_struct *next)
{
    unsigned long cr3;

    /* If same address space (threads of same process), skip */
    if (likely(prev_mm == next_mm)) {
        return;
    }

    /*
     * Get physical address of next process's PML4 page table.
     * The pgd (Page Global Directory) is the top-level table.
     */
    cr3 = __pa(next_mm->pgd);

    /*
     * Load new CR3 - this switches the entire address space.
     *
     * CRITICAL SIDE EFFECT: Loading CR3 flushes the TLB!
     * All cached virtual->physical translations are invalidated.
     * This is expensive and a major source of context switch overhead.
     */
    write_cr3(cr3);

    /* Update per-CPU tracking of current mm */
    this_cpu_write(cpu_current_mm, next_mm);

    /*
     * PCID optimization (if available):
     * Process Context IDentifiers allow keeping TLB entries
     * from multiple address spaces simultaneously, tagged by PCID.
     * This avoids full TLB flush on context switch.
     */
    if (cpu_has_pcid) {
        /* CR3 includes PCID in lower bits */
        cr3 = build_cr3(next_mm->pgd, next_mm->context.ctx_id);
        write_cr3_pcid(cr3);    /* May not flush TLB */
    }
}

/**
 * TLB flush cost example:
 *
 * - TLB has ~1000+ entries caching page translations
 * - After flush, EVERY memory access causes page table walk
 * - Page table walk: 4-5 memory accesses per translation
 * - Until TLB warms up, process runs much slower
 *
 * This is why process context switches are more expensive
 * than thread context switches (threads share address space).
 */
```

Threads within the same process share the same mm_struct and page tables. When switching between threads of the same process, there's no need to switch CR3 or flush the TLB. This is a major reason why multi-threaded programs can be more efficient than multi-process programs: thread context switches skip the expensive memory management switch.
Beyond CPU registers, a context switch must update kernel data structures to reflect the new running process.
Updating the Current Process Pointer:
The kernel needs to know which process is currently executing on each CPU. On Linux, this is tracked via a per-CPU variable and a special segment register (GS in kernel mode):
1. Store the next process's task_struct address into per-CPU storage
2. The current macro now returns the new process
```c
/**
 * How Linux tracks the current task on x86-64
 *
 * The 'current' macro is used everywhere in the kernel to get
 * the currently executing task. It must be updated during context switch.
 */

/* Per-CPU variable holding pointer to current task */
DEFINE_PER_CPU(struct task_struct *, current_task);

/**
 * The 'current' macro uses the GS segment base to access per-CPU data.
 * GS base points to this CPU's per-CPU area.
 */
#define current get_current()

static inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}

/**
 * During context switch, update current task pointer
 */
void update_current_task(struct task_struct *next)
{
    /* Update per-CPU current_task to point to next */
    this_cpu_write(current_task, next);

    /*
     * From this point on, current == next.
     * Any kernel code executing on this CPU will see
     * 'next' as the current task.
     */
}

/**
 * Also update TSS for next interrupt's kernel stack
 */
void update_tss_sp(struct task_struct *next)
{
    struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);

    /*
     * TSS.SP0 holds the kernel stack top for ring 0.
     * When next interrupt occurs in user mode, CPU loads
     * RSP from here. Must point to next's kernel stack.
     */
    tss->x86_tss.sp0 = (unsigned long)next->stack + THREAD_SIZE;
}
```

Context saving is a carefully choreographed dance between hardware and software, each contributing essential pieces:
| Component | Saved By | Where Stored | When Saved |
|---|---|---|---|
| RIP, RSP, RFLAGS, CS, SS | CPU (automatic) | Kernel stack | Interrupt entry |
| RAX, RBX, ..., R15 (GPRs) | Kernel (assembly) | Kernel stack (pt_regs) | Interrupt entry |
| Kernel RSP | Kernel (C code) | thread_struct.sp | Context switch |
| FPU/SSE/AVX state | Kernel (XSAVE) | thread.fpu structure | Switch or on-demand |
| CR3 / page tables | N/A - not saved | Already in mm_struct | Switch loads new CR3 |
| TLS bases (FS/GS) | Kernel | thread_struct | Context switch |
| current_task pointer | Kernel | Per-CPU variable | Context switch |
You now understand exactly what gets saved during a context switch and how hardware and software cooperate to preserve execution state. Next, we'll explore the reverse process: how the kernel RESTORES context to resume a previously-suspended process.