The Program Counter tells us where a process is executing. But a process is doing far more than just pointing at an instruction—it's computing. It has values in registers, intermediate results, loop counters, function arguments, and return values. All of this working state lives in CPU registers.
Registers are the fastest storage in a computer—faster than cache, far faster than RAM. A modern CPU has dozens of registers, and they hold the immediate computational state of whatever code is running. When the operating system switches from one process to another, it must save all these registers to the outgoing process's PCB and restore the incoming process's saved registers.
This register save/restore is the bulk of the context switch overhead. Understanding what registers exist and how they're managed reveals both the mechanics of multitasking and the performance costs involved.
By the end of this page, you will understand the complete register architecture: general-purpose registers, special-purpose registers, status flags, floating-point/SIMD registers, and system registers. You'll learn how these are saved and restored during context switches, the trade-offs in register context size, and modern optimizations that reduce context switch overhead.
CPU registers are small, extremely fast storage locations built directly into the processor core. They hold the data that the CPU is actively working with—operands for arithmetic, addresses for memory access, control information, and execution state.
Why Registers Exist:
The speed hierarchy of computer memory creates a fundamental problem: main memory is too slow. While a CPU can execute multiple instructions per clock cycle, fetching data from RAM takes hundreds of cycles. Registers bridge this gap:
| Storage Type | Typical Access Time | Typical Size |
|---|---|---|
| Registers | 1 cycle (~0.3ns) | Bytes to KB |
| L1 Cache | 3-4 cycles (~1ns) | 32-64 KB |
| L2 Cache | 10-20 cycles (~5ns) | 256-512 KB |
| L3 Cache | 30-50 cycles (~15ns) | 4-32 MB |
| RAM | 100-300 cycles (~60ns) | GB to TB |
| SSD | 10,000+ cycles (~50μs) | TB |
Registers operate at full processor speed because they're part of the CPU itself, not accessed over any bus.
Registers are not addressable like memory. You can't take a pointer to a register. Instructions specify registers by name (RAX, R12, XMM0) not by address. The compiler's job is to keep frequently-used values in registers as much as possible, 'spilling' to memory only when necessary.
Categories of Registers:
Modern CPUs have several categories of registers, each serving different purposes:
General-Purpose Registers (GPRs): Hold data and addresses during computation. Most arithmetic and logic operations use these.
Program Counter / Instruction Pointer: Holds the address of the next instruction (covered in previous page).
Stack Pointer: Points to the current top of the stack in memory.
Status/Flags Register: Contains condition codes (zero, carry, overflow) and control flags.
Floating-Point Registers: For floating-point arithmetic operations.
SIMD/Vector Registers: For parallel operations on multiple data elements (SSE, AVX, NEON).
Segment Registers: (x86 specific) Define memory segments.
Control Registers: System-level registers for memory management, protection, and CPU features.
Debug Registers: For hardware breakpoints and debugging.
Not all registers need to be saved during a context switch—system registers, for instance, are typically per-CPU, not per-process.
General-purpose registers (GPRs) are the workhorses of computation. They hold integers, pointers, addresses, and are the operands for most instructions. The number and width of GPRs varies by architecture.
x86-64 has 16 64-bit general-purpose registers. The first 8 have historical names from the 16-bit and 32-bit eras; the additional 8 use R8-R15 naming.
| 64-bit | 32-bit | 16-bit | 8-bit | Common Use (Linux/System V ABI) |
|---|---|---|---|---|
| RAX | EAX | AX | AL/AH | Return value, accumulator |
| RBX | EBX | BX | BL/BH | Callee-saved, base pointer |
| RCX | ECX | CX | CL/CH | 4th argument, counter for loops |
| RDX | EDX | DX | DL/DH | 3rd argument, I/O operations |
| RSI | ESI | SI | SIL | 2nd argument, source index |
| RDI | EDI | DI | DIL | 1st argument, destination index |
| RBP | EBP | BP | BPL | Base/frame pointer (callee-saved) |
| RSP | ESP | SP | SPL | Stack pointer |
| R8 | R8D | R8W | R8B | 5th argument |
| R9 | R9D | R9W | R9B | 6th argument |
| R10 | R10D | R10W | R10B | Caller-saved |
| R11 | R11D | R11W | R11B | Caller-saved |
| R12 | R12D | R12W | R12B | Callee-saved |
| R13 | R13D | R13W | R13B | Callee-saved |
| R14 | R14D | R14W | R14B | Callee-saved |
| R15 | R15D | R15W | R15B | Callee-saved |
Total GPR Context Size: 16 registers × 8 bytes = 128 bytes for GPRs alone.
Caller-saved registers may be overwritten by function calls; the caller must save them if needed. Callee-saved registers are preserved across calls; the callee must restore them before returning. The kernel follows the callee-saved convention—it saves callee-saved registers during context switch because the interrupted code expects them preserved.
The status register (also called flags register or condition codes register) contains bits that reflect the results of recent operations and control CPU behavior. Preserving this register is critical—conditional branches depend on these flags.
x86-64: RFLAGS Register (64 bits)
The RFLAGS register contains condition flags (set by arithmetic), control flags (affect CPU operation), and system flags (control system-level features).
| Bit | Name | Description | Context Switch |
|---|---|---|---|
| 0 | CF (Carry) | Set on unsigned overflow/underflow | Must save |
| 2 | PF (Parity) | Set if low byte has even parity | Must save |
| 4 | AF (Auxiliary) | BCD arithmetic carry | Must save |
| 6 | ZF (Zero) | Set if result is zero | Must save |
| 7 | SF (Sign) | Set if result is negative | Must save |
| 8 | TF (Trap) | Single-step debugging mode | Must save |
| 9 | IF (Interrupt) | Enable/disable interrupts | System-level |
| 10 | DF (Direction) | String operation direction | Must save |
| 11 | OF (Overflow) | Set on signed overflow | Must save |
| 12-13 | IOPL | I/O privilege level | System-level |
| 14 | NT (Nested Task) | Nested task flag | System-level |
| 21 | ID | CPUID availability | System-level |
ARM64: NZCV Flags in PSTATE
ARM64 takes a simpler approach. The NZCV flags (Negative, Zero, Carry, Overflow) live in the PSTATE process state and are read and written through the NZCV special register using MRS/MSR instructions.
```c
// Why flags must be preserved exactly
int compare_values(int a, int b) {
    // Assembly perspective: cmp sets flags
    // cmp eax, ebx   ; computes a - b, sets ZF, SF, CF, OF
    if (a < b) {          // Uses SF and OF (signed comparison)
        return -1;
    } else if (a > b) {   // Uses flags again
        return 1;
    } else {              // Uses ZF
        return 0;
    }
}

// If a context switch happens between CMP and the conditional jump:
//
//   cmp eax, ebx    ; Flags set here
//   <--- CONTEXT SWITCH HAPPENS HERE --->
//   jl  less_than   ; Uses flags from cmp
//
// If flags weren't restored, the jump would use wrong flags!
// The process would take the wrong branch, corrupting execution.

// The kernel must save the exact flags state:
struct context_switch_frame {
    uint64_t rflags;  // Saved flags register
    uint64_t rip;     // Program counter
    uint64_t rsp;     // Stack pointer
    // ... other registers
};
```

The gap between an instruction that sets flags and the instruction that uses them is a critical region. The kernel's context switch code must ensure this window is handled correctly. Usually, interrupts save flags automatically as part of the interrupt frame, so flags are inherently preserved across context switches triggered by interrupts.
Modern CPUs have extensive floating-point (FP) and SIMD (Single Instruction, Multiple Data) register files. These registers are much larger than GPRs and represent a significant portion of the context switch overhead.
| Architecture | Extension | Registers | Width | Total Size |
|---|---|---|---|---|
| x86-64 | x87 FPU | ST0-ST7 | 80 bits | 80 bytes |
| x86-64 | SSE | XMM0-XMM15 | 128 bits | 256 bytes |
| x86-64 | AVX | YMM0-YMM15 | 256 bits | 512 bytes |
| x86-64 | AVX-512 | ZMM0-ZMM31 | 512 bits | 2048 bytes |
| ARM64 | NEON/FP | V0-V31 | 128 bits | 512 bytes |
| RISC-V | D extension | F0-F31 | 64 bits | 256 bytes |
The Size Problem:
With AVX-512, the FP/SIMD context alone is over 2 KB. Adding control registers (MXCSR, FCW) increases this further. Saving and restoring 2KB on every context switch would be catastrophically expensive.
Lazy FPU Context Switching:
Most processes don't use FP/SIMD registers. A shell script doesn't need floating-point math. This observation enables lazy FPU/SIMD context management:
```c
// Lazy FPU context switching (simplified Linux approach)

// Per-CPU: Which task owns the FPU?
DEFINE_PER_CPU(struct task_struct *, fpu_owner);

// On context switch - DON'T save FPU
void context_switch(struct task_struct *prev, struct task_struct *next) {
    // Save integer registers (always)
    save_integer_context(prev);

    // For FPU: just disable it
    if (cpu_has_fpu()) {
        // Set CR0.TS (Task Switched) bit
        // Any FPU access will now cause #NM (Device Not Available)
        disable_fpu_access();
    }

    // Switch address space and integer context
    switch_mm(next);
    restore_integer_context(next);
}

// Device Not Available exception handler
void do_device_not_available(struct pt_regs *regs) {
    struct task_struct *current_task = current;
    struct task_struct *fpu_owner_task = this_cpu_read(fpu_owner);

    // Save previous owner's FPU state (if different from current)
    if (fpu_owner_task && fpu_owner_task != current_task) {
        save_fpu_state(&fpu_owner_task->fpu_state);
    }

    // Restore current task's FPU state (or initialize if never used)
    if (current_task->fpu_initialized) {
        restore_fpu_state(&current_task->fpu_state);
    } else {
        init_fpu_state();
        current_task->fpu_initialized = true;
    }

    // Mark current as FPU owner
    this_cpu_write(fpu_owner, current_task);

    // Clear CR0.TS to enable FPU access
    enable_fpu_access();

    // Return from exception - the faulting instruction will be retried
}

// Actual save/restore using FXSAVE/FXRSTOR or XSAVE/XRSTOR
void save_fpu_state(struct fpu_state *state) {
    // XSAVE saves all extended state (SSE, AVX, AVX-512, etc.)
    // based on XCR0 register configuration
    asm volatile("xsave %0" : "=m"(state->xsave_area) : "a"(-1), "d"(-1));
}

void restore_fpu_state(struct fpu_state *state) {
    asm volatile("xrstor %0" : : "m"(state->xsave_area), "a"(-1), "d"(-1));
}
```

Linux has since moved to eager FPU switching. With fast XSAVE/XRSTOR instructions and security concerns (lazy switching lets FPU state leak between processes via timing side channels), the simplicity and safety of always saving and restoring outweighs the cost on modern CPUs; recent kernels have dropped lazy FPU switching entirely.
Beyond user-accessible registers, CPUs have system registers that control memory management, protection, and CPU features. Most of these are per-CPU, not per-process, but some are switched during context changes.
| Register | Purpose | Switched Per-Process? |
|---|---|---|
| CR0 | Protected mode, paging, FPU control | No (system-wide) |
| CR2 | Page fault linear address | No (transient) |
| CR3 | Page table base address | Yes (per-process address space) |
| CR4 | CPU feature enables (PAE, SMEP, etc.) | No (system-wide) |
| CR8 (TPR) | Task Priority Register | No (per-CPU) |
| DR0-DR7 | Debug registers | Optional (if debugging) |
| GDTR | Global Descriptor Table pointer | No (per-CPU) |
| LDTR | Local Descriptor Table selector | Rarely (if using LDT) |
| TR | Task State Segment selector | No (per-CPU) |
| FS base | Thread-local storage pointer | Yes (per-thread) |
| GS base | Kernel-space or thread pointer | Yes (varies) |
CR3: The Address Space Switch:
CR3 is critically important. It points to the top-level page table (PML4 on x86-64). Changing CR3 switches the entire virtual address space—all memory mappings change in one instruction.
Loading CR3 is one of the most expensive parts of a context switch because it invalidates the TLB (Translation Lookaside Buffer), forcing re-translation of virtual addresses. Modern CPUs with PCID (Process Context ID) can avoid full TLB flushes by tagging TLB entries with process identifiers.
```c
// Address space switching on x86-64

// The mm_struct contains the page table pointer
struct mm_struct {
    pgd_t *pgd;            // Top-level page table (PML4)
    unsigned long cr3;     // Pre-computed CR3 value
    unsigned int pcid;     // Process Context ID (for TLB tagging)
    // ... other fields
};

// Switch address space
void switch_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm) {
    if (prev_mm == next_mm) {
        // Same address space - no switch needed (e.g., threads)
        return;
    }

    // Build CR3 value: page table base + PCID
    unsigned long new_cr3 = next_mm->cr3;

    if (cpu_has_pcid()) {
        // Include PCID to avoid TLB flush
        new_cr3 |= next_mm->pcid;
        // Set bit 63 to NOT flush TLB entries with this PCID
        new_cr3 |= (1UL << 63);
    }

    // The actual switch - this is expensive!
    write_cr3(new_cr3);

    // Update per-CPU tracking
    this_cpu_write(current_mm, next_mm);
}

// Inline assembly for CR3 access
static inline void write_cr3(unsigned long val) {
    asm volatile("mov %0, %%cr3" : : "r"(val) : "memory");
}

static inline unsigned long read_cr3(void) {
    unsigned long val;
    asm volatile("mov %%cr3, %0" : "=r"(val));
    return val;
}

// TLB impact:
// Without PCID: write_cr3 flushes ALL TLB entries
//   - Subsequent memory accesses suffer TLB misses
//   - Each miss = page table walk = up to 4 memory accesses
// With PCID: only entries with a matching PCID are flushed
//   - Other processes' entries remain cached
//   - Significant performance improvement
```

Threads within the same process share an address space (same CR3). Switching between threads skips the expensive CR3 load, making thread switches 10-100x faster than full process switches. This is why thread pools are so common in high-performance applications.
Putting it all together, here's what a complete CPU context structure looks like. This structure is stored in the PCB (or pointed to by it) and represents the full CPU state needed to resume a process.
```c
// Complete CPU context for context switching (x86-64 example)

// Integer register context
struct integer_context {
    // General-purpose registers
    uint64_t rax, rbx, rcx, rdx;
    uint64_t rsi, rdi, rbp, rsp;
    uint64_t r8,  r9,  r10, r11;
    uint64_t r12, r13, r14, r15;

    // Program counter and flags
    uint64_t rip;
    uint64_t rflags;

    // Segment selectors (usually unchanged in 64-bit mode)
    uint16_t cs, ss, ds, es, fs, gs;

    // FS/GS base addresses (for thread-local storage)
    uint64_t fs_base;
    uint64_t gs_base;
};
// Size: approximately 160 bytes

// FPU/SIMD context (XSAVE area)
// Layout depends on CPU features enabled
struct fpu_context {
    // Legacy x87 + SSE state (FXSAVE format)
    uint16_t fcw;          // FPU control word
    uint16_t fsw;          // FPU status word
    uint8_t  ftw;          // FPU tag word (abridged)
    uint8_t  reserved1;
    uint16_t fop;          // FPU opcode
    uint64_t fip;          // FPU instruction pointer
    uint64_t fdp;          // FPU data pointer
    uint32_t mxcsr;        // SIMD control/status
    uint32_t mxcsr_mask;

    // x87 FPU registers (ST0-ST7)
    uint8_t st_space[128];   // 8 × 16 bytes (80-bit values, padded)

    // SSE registers (XMM0-XMM15)
    uint8_t xmm_space[256];  // 16 × 16 bytes

    // XSAVE header (for extended state)
    uint64_t xstate_bv;      // State components present
    uint64_t xcomp_bv;       // State components format
    uint8_t  reserved2[48];

    // Extended state (AVX, AVX-512, etc.)
    // Size varies based on CPU and enabled features
    // Can be 2KB+ for AVX-512
    uint8_t extended_state[0];  // Variable-length
};
// Size: 512+ bytes (up to 2KB+ for AVX-512)

// Combined context in PCB
struct task_struct {
    // ... process identification fields

    // CPU context
    struct integer_context cpu_context;

    // FPU/SIMD context (dynamically allocated based on CPU features)
    struct fpu_context *fpu;

    // Address space
    unsigned long cr3;        // Page table pointer
    struct mm_struct *mm;     // Memory management

    // Debug registers (only if debugging active)
    unsigned long debugreg[8];

    // ... scheduling, files, signals, etc.
};

// Typical total context size breakdown:
// - Integer registers + RIP + RFLAGS: ~160 bytes
// - SSE state (FXSAVE):               ~512 bytes
// - AVX state:                        +256 bytes
// - AVX-512 state:                    +2048 bytes
// - Miscellaneous:                    ~64 bytes
//
// Total: ~750 bytes (SSE) to ~3000 bytes (AVX-512)
```

A larger context means more memory to save/restore, more cache pollution, and longer context switch times. This is why lazy FPU switching and careful feature enablement matter. Enabling AVX-512 on a server with many context switches per second can noticeably impact performance.
Context switching is a core OS operation that happens thousands of times per second. Understanding its performance characteristics is essential for system design and tuning.
Direct Costs of Context Switching:
| Cost Component | Approximate Time | Notes |
|---|---|---|
| Save/restore GPRs | 50-100 ns | 16-32 registers |
| Save/restore FPU (SSE) | 50-100 ns | If using lazy, often free |
| Save/restore FPU (AVX-512) | 200-500 ns | Much larger state |
| CR3 load (address space) | 50-100 ns | Instruction itself |
| TLB miss overhead | 200-1000 ns | Per miss after switch |
| Cache effects | Variable | Depends on working sets |
Total: A minimal context switch takes ~200ns. With TLB misses and cache effects, it can exceed 1-2μs.
```c
// Measuring context switch time (simplified benchmark)

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>

#define ITERATIONS 100000

int main() {
    int pipe1[2], pipe2[2];
    pipe(pipe1);
    pipe(pipe2);

    pid_t pid = fork();

    if (pid == 0) {
        // Child: ping-pong through pipes
        char c;
        for (int i = 0; i < ITERATIONS; i++) {
            read(pipe1[0], &c, 1);   // Wait for parent
            write(pipe2[1], &c, 1);  // Signal parent
            // Each iteration = 2 context switches
        }
        _exit(0);
    }

    // Parent: measure ping-pong time
    struct timespec start, end;
    char c = 'x';

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        write(pipe1[1], &c, 1);  // Signal child
        read(pipe2[0], &c, 1);   // Wait for child
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    wait(NULL);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    double per_switch = elapsed_ns / (ITERATIONS * 2);

    printf("Context switches: %d\n", ITERATIONS * 2);
    printf("Time per switch: %.1f ns\n", per_switch);

    // Typical results:
    // - Same CPU core:    ~1-2 μs (includes pipe overhead)
    // - Different cores:  ~3-5 μs
    // - Pure context switch (no pipe): ~200-500 ns

    return 0;
}
```

For most applications, context switches are not a bottleneck. But for latency-sensitive workloads (trading systems, real-time audio, game engines), minimizing switches is critical. Techniques include CPU pinning, real-time scheduling (SCHED_FIFO), kernel bypass (DPDK), and busy-waiting instead of blocking.
We've explored the complete CPU register context, from general-purpose registers to floating-point state, and examined how this context is managed during context switches.
What's Next:
We've covered the CPU context extensively. The final page of this module examines Memory Management Information in the PCB—how the kernel tracks each process's virtual address space, page tables, and memory regions.
You now understand CPU registers comprehensively: their categories, their role in execution, how they're saved and restored, and the performance implications of context switching. This knowledge is essential for anyone working on operating systems, compilers, or performance-critical systems.