The Program Counter tells us where a process is executing. But a process is doing far more than just pointing at an instruction—it's computing. It has values in registers, intermediate results, loop counters, function arguments, and return values. All of this working state lives in CPU registers.
Registers are the fastest storage in a computer—faster than cache, far faster than RAM. A modern CPU has dozens of registers, and they hold the immediate computational state of whatever code is running. When the operating system switches from one process to another, it must save all these registers to the outgoing process's PCB and restore the incoming process's saved registers.
This register save/restore is the bulk of the context switch overhead. Understanding what registers exist and how they're managed reveals both the mechanics of multitasking and the performance costs involved.
By the end of this page, you will understand the complete register architecture: general-purpose registers, special-purpose registers, status flags, floating-point/SIMD registers, and system registers. You'll learn how these are saved and restored during context switches, the trade-offs in register context size, and modern optimizations that reduce context switch overhead.
CPU registers are small, extremely fast storage locations built directly into the processor core. They hold the data that the CPU is actively working with—operands for arithmetic, addresses for memory access, control information, and execution state.
Why Registers Exist:
The speed hierarchy of computer memory creates a fundamental problem: main memory is too slow. While a CPU can execute multiple instructions per clock cycle, fetching data from RAM takes hundreds of cycles. Registers bridge this gap:
| Storage Type | Typical Access Time | Typical Size |
|---|---|---|
| Registers | 1 cycle (~0.3ns) | Bytes to KB |
| L1 Cache | 3-4 cycles (~1ns) | 32-64 KB |
| L2 Cache | 10-20 cycles (~5ns) | 256-512 KB |
| L3 Cache | 30-50 cycles (~15ns) | 4-32 MB |
| RAM | 100-300 cycles (~60ns) | GB to TB |
| SSD | 10,000+ cycles (~50μs) | TB |
Registers operate at full processor speed because they're part of the CPU itself, not accessed over any bus.
Registers are not addressable like memory. You can't take a pointer to a register. Instructions specify registers by name (RAX, R12, XMM0) not by address. The compiler's job is to keep frequently-used values in registers as much as possible, 'spilling' to memory only when necessary.
Categories of Registers:
Modern CPUs have several categories of registers, each serving different purposes:
General-Purpose Registers (GPRs): Hold data and addresses during computation. Most arithmetic and logic operations use these.
Program Counter / Instruction Pointer: Holds the address of the next instruction (covered in previous page).
Stack Pointer: Points to the current top of the stack in memory.
Status/Flags Register: Contains condition codes (zero, carry, overflow) and control flags.
Floating-Point Registers: For floating-point arithmetic operations.
SIMD/Vector Registers: For parallel operations on multiple data elements (SSE, AVX, NEON).
Segment Registers: (x86 specific) Define memory segments.
Control Registers: System-level registers for memory management, protection, and CPU features.
Debug Registers: For hardware breakpoints and debugging.
Not all registers need to be saved during a context switch—system registers, for instance, are typically per-CPU, not per-process.
General-purpose registers (GPRs) are the workhorses of computation. They hold integers, pointers, addresses, and are the operands for most instructions. The number and width of GPRs varies by architecture.
x86-64 has 16 64-bit general-purpose registers. The first 8 have historical names from the 16-bit and 32-bit eras; the additional 8 use R8-R15 naming.
| 64-bit | 32-bit | 16-bit | 8-bit | Common Use (Linux/System V ABI) |
|---|---|---|---|---|
| RAX | EAX | AX | AL/AH | Return value, accumulator |
| RBX | EBX | BX | BL/BH | Callee-saved, base pointer |
| RCX | ECX | CX | CL/CH | 4th argument, counter for loops |
| RDX | EDX | DX | DL/DH | 3rd argument, I/O operations |
| RSI | ESI | SI | SIL | 2nd argument, source index |
| RDI | EDI | DI | DIL | 1st argument, destination index |
| RBP | EBP | BP | BPL | Base/frame pointer (callee-saved) |
| RSP | ESP | SP | SPL | Stack pointer |
| R8 | R8D | R8W | R8B | 5th argument |
| R9 | R9D | R9W | R9B | 6th argument |
| R10 | R10D | R10W | R10B | Caller-saved |
| R11 | R11D | R11W | R11B | Caller-saved |
| R12 | R12D | R12W | R12B | Callee-saved |
| R13 | R13D | R13W | R13B | Callee-saved |
| R14 | R14D | R14W | R14B | Callee-saved |
| R15 | R15D | R15W | R15B | Callee-saved |
Total GPR Context Size: 16 registers × 8 bytes = 128 bytes for GPRs alone.
Caller-saved registers may be overwritten by function calls; the caller must save them if needed. Callee-saved registers are preserved across calls; the callee must restore them before returning. The kernel follows the callee-saved convention—it saves callee-saved registers during context switch because the interrupted code expects them preserved.
The status register (also called flags register or condition codes register) contains bits that reflect the results of recent operations and control CPU behavior. Preserving this register is critical—conditional branches depend on these flags.
x86-64: RFLAGS Register (64 bits)
The RFLAGS register contains condition flags (set by arithmetic), control flags (affect CPU operation), and system flags (control system-level features).
| Bit | Name | Description | Context Switch |
|---|---|---|---|
| 0 | CF (Carry) | Set on unsigned overflow/underflow | Must save |
| 2 | PF (Parity) | Set if low byte has even parity | Must save |
| 4 | AF (Auxiliary) | BCD arithmetic carry | Must save |
| 6 | ZF (Zero) | Set if result is zero | Must save |
| 7 | SF (Sign) | Set if result is negative | Must save |
| 8 | TF (Trap) | Single-step debugging mode | Must save |
| 9 | IF (Interrupt) | Enable/disable interrupts | System-level |
| 10 | DF (Direction) | String operation direction | Must save |
| 11 | OF (Overflow) | Set on signed overflow | Must save |
| 12-13 | IOPL | I/O privilege level | System-level |
| 14 | NT (Nested Task) | Nested task flag | System-level |
| 21 | ID | CPUID availability | System-level |
ARM64: NZCV Flags in PSTATE
ARM64 takes a simpler approach. The NZCV flags (Negative, Zero, Carry, Overflow) live in the PSTATE process state and are read and written through the NZCV special register using MRS/MSR instructions.
```c
// Why flags must be preserved exactly
int compare_values(int a, int b) {
    // Assembly perspective: cmp sets flags
    // cmp eax, ebx   ; computes a - b, sets ZF, SF, CF, OF
    if (a < b) {          // Uses SF and OF (signed comparison)
        return -1;
    } else if (a > b) {   // Uses flags again
        return 1;
    } else {              // Uses ZF
        return 0;
    }
}

// If a context switch happens between CMP and the conditional jump:
//
//   cmp eax, ebx    ; Flags set here
//   <--- CONTEXT SWITCH HAPPENS HERE --->
//   jl  less_than   ; Uses flags from cmp
//
// If flags weren't restored, the jump would use wrong flags!
// The process would take the wrong branch, corrupting execution.

// The kernel must save the exact flags state:
struct context_switch_frame {
    uint64_t rflags;  // Saved flags register
    uint64_t rip;     // Program counter
    uint64_t rsp;     // Stack pointer
    // ... other registers
};
```

The gap between an instruction that sets flags and the instruction that uses them is a critical region. The kernel's context switch code must ensure this window is handled correctly. Usually, interrupts save flags automatically as part of the interrupt frame, so flags are inherently preserved across context switches triggered by interrupts.
Modern CPUs have extensive floating-point (FP) and SIMD (Single Instruction, Multiple Data) register files. These registers are much larger than GPRs and represent a significant portion of the context switch overhead.
| Architecture | Extension | Registers | Width | Total Size |
|---|---|---|---|---|
| x86-64 | x87 FPU | ST0-ST7 | 80 bits | 80 bytes |
| x86-64 | SSE | XMM0-XMM15 | 128 bits | 256 bytes |
| x86-64 | AVX | YMM0-YMM15 | 256 bits | 512 bytes |
| x86-64 | AVX-512 | ZMM0-ZMM31 | 512 bits | 2048 bytes |
| ARM64 | NEON/FP | V0-V31 | 128 bits | 512 bytes |
| RISC-V | D extension | F0-F31 | 64 bits | 256 bytes |
The Size Problem:
With AVX-512, the FP/SIMD context alone is over 2 KB. Adding control registers (MXCSR, FCW) increases this further. Saving and restoring 2KB on every context switch would be catastrophically expensive.
Lazy FPU Context Switching:
Most processes don't use FP/SIMD registers. A shell script doesn't need floating-point math. This observation enables lazy FPU/SIMD context management:
```c
// Lazy FPU context switching (simplified Linux approach)

// Per-CPU: Which task owns the FPU?
DEFINE_PER_CPU(struct task_struct *, fpu_owner);

// On context switch - DON'T save FPU
void context_switch(struct task_struct *prev, struct task_struct *next) {
    // Save integer registers (always)
    save_integer_context(prev);

    // For FPU: just disable it
    if (cpu_has_fpu()) {
        // Set CR0.TS (Task Switched) bit
        // Any FPU access will now cause #NM (Device Not Available)
        disable_fpu_access();
    }

    // Switch address space and integer context
    switch_mm(next);
    restore_integer_context(next);
}

// Device Not Available exception handler
void do_device_not_available(struct pt_regs *regs) {
    struct task_struct *current_task = current;
    struct task_struct *fpu_owner_task = this_cpu_read(fpu_owner);

    // Save previous owner's FPU state (if different from current)
    if (fpu_owner_task && fpu_owner_task != current_task) {
        save_fpu_state(&fpu_owner_task->fpu_state);
    }

    // Restore current task's FPU state (or initialize if never used)
    if (current_task->fpu_initialized) {
        restore_fpu_state(&current_task->fpu_state);
    } else {
        init_fpu_state();
        current_task->fpu_initialized = true;
    }

    // Mark current as FPU owner
    this_cpu_write(fpu_owner, current_task);

    // Clear CR0.TS to enable FPU access
    enable_fpu_access();

    // Return from exception - the faulting instruction will be retried
}

// Actual save/restore using FXSAVE/FXRSTOR or XSAVE/XRSTOR
void save_fpu_state(struct fpu_state *state) {
    // XSAVE saves all extended state (SSE, AVX, AVX-512, etc.)
    // based on XCR0 register configuration
    asm volatile("xsave %0" : "=m"(state->xsave_area) : "a"(-1), "d"(-1));
}

void restore_fpu_state(struct fpu_state *state) {
    asm volatile("xrstor %0" : : "m"(state->xsave_area), "a"(-1), "d"(-1));
}
```

Linux has since moved to eager FPU switching. With fast XSAVE/XRSTOR instructions and security concerns (lazy switching lets FPU state leak between processes via timing side channels), the simplicity and safety of always saving and restoring outweighs the cost on modern CPUs; recent kernels have dropped lazy FPU switching entirely.
Beyond user-accessible registers, CPUs have system registers that control memory management, protection, and CPU features. Most of these are per-CPU, not per-process, but some are switched during context changes.
| Register | Purpose | Switched Per-Process? |
|---|---|---|
| CR0 | Protected mode, paging, FPU control | No (system-wide) |
| CR2 | Page fault linear address | No (transient) |
| CR3 | Page table base address | Yes (per-process address space) |
| CR4 | CPU feature enables (PAE, SMEP, etc.) | No (system-wide) |
| CR8 (TPR) | Task Priority Register | No (per-CPU) |
| DR0-DR7 | Debug registers | Optional (if debugging) |
| GDTR | Global Descriptor Table pointer | No (per-CPU) |
| LDTR | Local Descriptor Table selector | Rarely (if using LDT) |
| TR | Task State Segment selector | No (per-CPU) |
| FS base | Thread-local storage pointer | Yes (per-thread) |
| GS base | Kernel-space or thread pointer | Yes (varies) |
CR3: The Address Space Switch:
CR3 is critically important. It points to the top-level page table (PML4 on x86-64). Changing CR3 switches the entire virtual address space—all memory mappings change in one instruction.
Loading CR3 is one of the most expensive parts of a context switch because it invalidates the TLB (Translation Lookaside Buffer), forcing re-translation of virtual addresses. Modern CPUs with PCID (Process Context ID) can avoid full TLB flushes by tagging TLB entries with process identifiers.
```c
// Address space switching on x86-64

// The mm_struct contains the page table pointer
struct mm_struct {
    pgd_t *pgd;            // Top-level page table (PML4)
    unsigned long cr3;     // Pre-computed CR3 value
    unsigned int pcid;     // Process Context ID (for TLB tagging)
    // ... other fields
};

// Switch address space
void switch_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm) {
    if (prev_mm == next_mm) {
        // Same address space - no switch needed (e.g., threads)
        return;
    }

    // Build CR3 value: page table base + PCID
    unsigned long new_cr3 = next_mm->cr3;

    if (cpu_has_pcid()) {
        // Include PCID to avoid TLB flush
        new_cr3 |= next_mm->pcid;
        // Set bit 63 to NOT flush TLB entries with this PCID
        new_cr3 |= (1UL << 63);
    }

    // The actual switch - this is expensive!
    write_cr3(new_cr3);

    // Update per-CPU tracking
    this_cpu_write(current_mm, next_mm);
}

// Inline assembly for CR3 access
static inline void write_cr3(unsigned long val) {
    asm volatile("mov %0, %%cr3" : : "r"(val) : "memory");
}

static inline unsigned long read_cr3(void) {
    unsigned long val;
    asm volatile("mov %%cr3, %0" : "=r"(val));
    return val;
}

// TLB impact:
// Without PCID: write_cr3 flushes ALL TLB entries
//   - Subsequent memory accesses suffer TLB misses
//   - Each miss = page table walk = up to 4 memory accesses
// With PCID: only entries with a matching PCID are flushed
//   - Other processes' entries remain cached
//   - Significant performance improvement
```

Threads within the same process share an address space (same CR3). Switching between threads skips the expensive CR3 load, making thread switches 10-100x faster than full process switches. This is why thread pools are so common in high-performance applications.
Putting it all together, here's what a complete CPU context structure looks like. This structure is stored in the PCB (or pointed to by it) and represents the full CPU state needed to resume a process.
```c
// Complete CPU context for context switching (x86-64 example)

// Integer register context
struct integer_context {
    // General-purpose registers
    uint64_t rax, rbx, rcx, rdx;
    uint64_t rsi, rdi, rbp, rsp;
    uint64_t r8,  r9,  r10, r11;
    uint64_t r12, r13, r14, r15;

    // Program counter and flags
    uint64_t rip;
    uint64_t rflags;

    // Segment selectors (usually unchanged in 64-bit mode)
    uint16_t cs, ss, ds, es, fs, gs;

    // FS/GS base addresses (for thread-local storage)
    uint64_t fs_base;
    uint64_t gs_base;
};
// Size: approximately 160 bytes

// FPU/SIMD context (XSAVE area)
// Layout depends on CPU features enabled
struct fpu_context {
    // Legacy x87 + SSE state (FXSAVE format)
    uint16_t fcw;          // FPU control word
    uint16_t fsw;          // FPU status word
    uint8_t  ftw;          // FPU tag word (abridged)
    uint8_t  reserved1;
    uint16_t fop;          // FPU opcode
    uint64_t fip;          // FPU instruction pointer
    uint64_t fdp;          // FPU data pointer
    uint32_t mxcsr;        // SIMD control/status
    uint32_t mxcsr_mask;

    // x87 FPU registers (ST0-ST7)
    uint8_t st_space[128];   // 8 × 16 bytes (80-bit values, padded)

    // SSE registers (XMM0-XMM15)
    uint8_t xmm_space[256];  // 16 × 16 bytes

    // XSAVE header (for extended state)
    uint64_t xstate_bv;      // State components present
    uint64_t xcomp_bv;       // State components format
    uint8_t  reserved2[48];

    // Extended state (AVX, AVX-512, etc.)
    // Size varies based on CPU and enabled features
    // Can be 2KB+ for AVX-512
    uint8_t extended_state[0];  // Variable-length
};
// Size: 512+ bytes (up to 2KB+ for AVX-512)

// Combined context in PCB
struct task_struct {
    // ... process identification fields

    // CPU context
    struct integer_context cpu_context;

    // FPU/SIMD context (dynamically allocated based on CPU features)
    struct fpu_context *fpu;

    // Address space
    unsigned long cr3;        // Page table pointer
    struct mm_struct *mm;     // Memory management

    // Debug registers (only if debugging active)
    unsigned long debugreg[8];

    // ... scheduling, files, signals, etc.
};

// Typical total context size breakdown:
// - Integer registers + RIP + RFLAGS: ~160 bytes
// - SSE state (FXSAVE):               ~512 bytes
// - AVX state:                        +256 bytes
// - AVX-512 state:                    +2048 bytes
// - Miscellaneous:                    ~64 bytes
//
// Total: ~750 bytes (SSE) to ~3000 bytes (AVX-512)
```

A larger context means more memory to save/restore, more cache pollution, and longer context switch times. This is why lazy FPU switching and careful feature enablement matter. Enabling AVX-512 on a server with many context switches per second can noticeably impact performance.
Context switching is a core OS operation that happens thousands of times per second. Understanding its performance characteristics is essential for system design and tuning.
Direct Costs of Context Switching:
| Cost Component | Approximate Time | Notes |
|---|---|---|
| Save/restore GPRs | 50-100 ns | 16-32 registers |
| Save/restore FPU (SSE) | 50-100 ns | If using lazy, often free |
| Save/restore FPU (AVX-512) | 200-500 ns | Much larger state |
| CR3 load (address space) | 50-100 ns | Instruction itself |
| TLB miss overhead | 200-1000 ns | Per miss after switch |
| Cache effects | Variable | Depends on working sets |
Total: A minimal context switch takes ~200ns. With TLB misses and cache effects, it can exceed 1-2μs.
```c
// Measuring context switch time (simplified benchmark)

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>

#define ITERATIONS 100000

int main() {
    int pipe1[2], pipe2[2];
    pipe(pipe1);
    pipe(pipe2);

    pid_t pid = fork();

    if (pid == 0) {
        // Child: ping-pong through pipes
        char c;
        for (int i = 0; i < ITERATIONS; i++) {
            read(pipe1[0], &c, 1);   // Wait for parent
            write(pipe2[1], &c, 1);  // Signal parent
            // Each iteration = 2 context switches
        }
        _exit(0);
    }

    // Parent: measure ping-pong time
    struct timespec start, end;
    char c = 'x';

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        write(pipe1[1], &c, 1);  // Signal child
        read(pipe2[0], &c, 1);   // Wait for child
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    wait(NULL);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    double per_switch = elapsed_ns / (ITERATIONS * 2);

    printf("Context switches: %d\n", ITERATIONS * 2);
    printf("Time per switch: %.1f ns\n", per_switch);

    // Typical results:
    // - Same CPU core:    ~1-2 μs (includes pipe overhead)
    // - Different cores:  ~3-5 μs
    // - Pure context switch (no pipe): ~200-500 ns

    return 0;
}
```

For most applications, context switches are not a bottleneck. But for latency-sensitive workloads (trading systems, real-time audio, game engines), minimizing switches is critical. Techniques include CPU pinning, real-time scheduling (SCHED_FIFO), kernel bypass (DPDK), and busy-waiting instead of blocking.
We've explored the complete CPU register context, from general-purpose registers to floating-point state, and examined how this context is managed during context switches.
What's Next:
We've covered the CPU context extensively. The final page of this module examines Memory Management Information in the PCB—how the kernel tracks each process's virtual address space, page tables, and memory regions.
You now understand CPU registers comprehensively: their categories, their role in execution, how they're saved and restored, and the performance implications of context switching. This knowledge is essential for anyone working on operating systems, compilers, or performance-critical systems.