The scheduler decides which process should run next. But making a decision is not the same as implementing it. The component that actually performs the context switch—saving the current process's state, loading the next process's state, and transferring CPU control—is the dispatcher.
If the scheduler is the brain of process management, the dispatcher is the hands. It operates at the lowest levels of the operating system, interfacing directly with CPU registers, memory management hardware, and privilege mode transitions. The dispatcher must be fast, because it runs on every single context switch, and it must be exactly correct, because a single misplaced register or stack pointer corrupts the very processes it is switching between.
Understanding the dispatcher reveals the true cost of context switching and explains why scheduling decisions have such tangible performance implications.
By the end of this page, you will understand: (1) The precise role and responsibilities of the dispatcher, (2) The mechanics of context switching at the hardware level, (3) What happens during mode transitions (user ↔ kernel), (4) Dispatch latency and its components, and (5) How dispatchers are implemented in real operating systems.
Before diving into mechanics, let's precisely distinguish these two closely related components:
The Scheduler: decides which process should run next, applying a scheduling policy to the set of ready processes. It answers "who runs, and when?"
The Dispatcher: carries that decision out, saving the outgoing process's state, restoring the incoming process's state, and handing over the CPU. It answers "how does the switch actually happen?"
The separation principle:
This separation of concerns follows a fundamental OS design principle: separate policy from mechanism.
This separation allows scheduling policies to be changed, tuned, or swapped without touching the delicate low-level switching code, and it lets one heavily tested dispatcher serve every policy the system supports.
In some operating systems and textbooks, the terms 'scheduler' and 'dispatcher' are used interchangeably or combined. When reading source code, 'schedule()' functions may include both selection logic and context switch invocation. The conceptual distinction remains valuable for understanding, even if implementation boundaries vary.
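To make the policy/mechanism split concrete, here is a minimal user-space sketch (all names are hypothetical, not taken from any real kernel): the scheduling policy is a swappable table of function pointers, while the dispatch mechanism stays fixed.

```c
// Sketch of the policy/mechanism split (hypothetical names).
// The policy decides; the mechanism performs the switch.
#include <stdio.h>

struct task { const char *name; };

struct sched_policy {
    struct task *(*pick_next)(void);   // POLICY: decide who runs next
};

static void dispatch(struct task *prev, struct task *next) {
    // MECHANISM: a real kernel would save prev's registers, switch address
    // spaces, and restore next's registers here. We just log the hand-off.
    printf("switching %s -> %s\n", prev->name, next->name);
}

static struct task t1 = { "A" }, t2 = { "B" };

static struct task *round_robin_pick(void) {
    static int i;
    return (i++ % 2) ? &t1 : &t2;      // toy round-robin decision
}

static struct sched_policy round_robin = { .pick_next = round_robin_pick };

static void schedule(struct sched_policy *policy, struct task *current) {
    struct task *next = policy->pick_next();   // scheduler: the decision
    if (next != current)
        dispatch(current, next);               // dispatcher: the implementation
}

int main(void) {
    schedule(&round_robin, &t1);   // the policy can be swapped without touching dispatch()
    return 0;
}
```

Linux follows the same pattern with its struct sched_class: the fair, realtime, and deadline classes each supply their own pick_next_task() logic, but all of them funnel into the single context_switch() path shown later on this page.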
The dispatcher performs a precise sequence of operations during every context switch. Each step is critical—omitting or misordering any step results in crashes, data corruption, or security breaches.
The dispatcher must: (1) save the full CPU context of the outgoing process into its PCB, (2) switch to the incoming process's address space if it differs, (3) restore the incoming process's saved context, and (4) transfer control back to user mode at the exact instruction where that process was last interrupted.
What gets saved/restored:
| State Category | Specific Registers | Typical Size |
|---|---|---|
| General-purpose registers | RAX, RBX, RCX, RDX, RSI, RDI, R8-R15 | 128 bytes |
| Program counter | RIP (instruction pointer) | 8 bytes |
| Stack pointer | RSP, RBP (stack and base pointer) | 16 bytes |
| Flags/status | RFLAGS (condition codes, interrupt flag) | 8 bytes |
| Segment registers | CS, DS, ES, FS, GS, SS | 48 bytes |
| FPU/SSE state | XMM0-XMM15, FPU stack, MXCSR | 512+ bytes |
| AVX state (if used) | YMM/ZMM registers | 2KB+ (AVX-512) |
| Total | | ~700 bytes typical, up to 4KB with extensions |
Modern CPUs have ever-larger state to save. AVX-512 adds 32 × 512-bit registers (2KB alone). On context switch, all this must be saved if the process used it. OSes employ 'lazy' state saving—only save extended state for processes that actually use it—but the cost still grows with each CPU generation.
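You can ask your own CPU how much extended state it would need saved: CPUID leaf 0x0D reports the required XSAVE area size. The helper below is a standalone sketch (x86-64 with GCC/Clang inline assembly assumed), not kernel code.

```c
// Query the XSAVE area size via CPUID leaf 0x0D, sub-leaf 0:
//   EBX = bytes needed for the state components currently enabled in XCR0
//   ECX = bytes needed if every supported component were enabled
#include <stdint.h>
#include <stdio.h>

static void cpuid_count(uint32_t leaf, uint32_t subleaf,
                        uint32_t *eax, uint32_t *ebx,
                        uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(leaf), "c"(subleaf));
}

int main(void) {
    uint32_t a, b, c, d;
    cpuid_count(0x0D, 0, &a, &b, &c, &d);
    printf("XSAVE area (currently enabled features): %u bytes\n", b);
    printf("XSAVE area (all supported features):     %u bytes\n", c);
    return 0;
}
```

On machines with AVX-512 enabled, the "currently enabled features" figure often lands in the 2-3KB range, consistent with the table above.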
Let's trace through a context switch in detail. Assume Process A is running and the scheduler decides to switch to Process B.
Phase 1: Entry into kernel (from interrupt)
```c
// PHASE 1: Interrupt entry (hardware-assisted)
// Timer fires → CPU automatically:
//   1. Loads the kernel RSP from the TSS (Task State Segment)
//   2. Pushes the interrupted state onto that kernel stack:
//      SS, RSP, RFLAGS, CS, RIP (what iret will later restore)
//   3. Changes CPL to 0 (kernel mode)
//   4. Jumps to the interrupt handler address (IDT entry)

void timer_interrupt_handler(struct pt_regs *regs) {
    // 'regs' points to the interrupted state pushed on the stack
    // Contains: RIP, CS, RFLAGS, RSP, SS of the interrupted process

    // Save remaining registers (not auto-saved by hardware)
    save_registers();   // Pushes RAX, RBX, RCX, etc.
}
```

Phase 2: Save current process context
```c
// PHASE 2: Save outgoing process (Process A)

void save_process_context(struct task_struct *prev, struct pt_regs *regs) {
    // Save user-mode registers from the interrupt frame
    prev->thread.rip    = regs->rip;
    prev->thread.rsp    = regs->rsp;
    prev->thread.rflags = regs->rflags;

    // Save general registers from the kernel stack
    prev->thread.rax = regs->rax;
    prev->thread.rbx = regs->rbx;
    // ... all other general-purpose registers

    // Save FPU/SSE state (if the process used it)
    if (prev->thread.flags & USED_FPU) {
        fxsave(&prev->thread.fpu_state);   // Save 512 bytes of FPU/SSE state
    }

    // Save extended state (AVX, etc.) if used
    if (cpu_has_xsave && (prev->thread.flags & USED_EXTENDED)) {
        xsaveopt(&prev->thread.xstate);    // Save variable-size extended state
    }

    // Update process state
    prev->state = TASK_READY;   // or TASK_BLOCKED if yielding for I/O
}
```

Phase 3: Switch address space
```c
// PHASE 3: Switch memory context

void switch_mm(struct task_struct *prev, struct task_struct *next) {
    // Check whether an address space change is needed at all
    if (prev->mm == next->mm) {
        // Same address space (e.g., kernel threads, or threads of one process)
        // No page table switch needed - optimization!
        return;
    }

    // Load the new process's page table base
    // CR3 = Page Table Base Register on x86
    unsigned long new_cr3 = next->mm->pgd_phys;

    if (cpu_has_pcid) {
        // PCID (Process Context ID) lets the TLB cache entries for multiple
        // address spaces; this CR3 write does not flush other PCIDs' entries
        new_cr3 |= (next->mm->context.asid & 0xFFF);
        write_cr3_noflush(new_cr3);
    } else {
        // Without PCID this is expensive: writing CR3 flushes the TLB
        write_cr3(new_cr3);
    }
}
```

Phase 4: Load new process context and return
```c
// PHASE 4: Load incoming process (Process B) and return to user mode

void restore_and_switch(struct task_struct *next) {
    // Update current pointer
    current = next;
    next->state = TASK_RUNNING;

    // Restore FPU/SSE state (if the process uses it)
    if (next->thread.flags & USED_FPU) {
        fxrstor(&next->thread.fpu_state);
    }

    // Restore extended state
    if (cpu_has_xsave && (next->thread.flags & USED_EXTENDED)) {
        xrstor(&next->thread.xstate);
    }

    // Load general-purpose registers
    // This is done via a carefully constructed stack frame
    // and the IRET instruction
}

// PHASE 5: Return to user mode (assembly)
// .global return_to_user
// return_to_user:
//     mov  next->thread.rsp, %rsp   # Load user stack pointer
//     pop  %r15                     # Restore registers from stack
//     pop  %r14
//     ...
//     pop  %rax
//     iretq                         # Return from interrupt
//                                   # Restores: RIP, CS, RFLAGS, RSP, SS
//                                   # Transitions to Ring 3 (user mode)

// Process B is now running!
// From B's perspective, it never knew it was interrupted
```

The x86 'iret' (interrupt return) instruction atomically restores RIP (the resume address), RFLAGS (including the interrupt-enable flag), CS (including the privilege level), and RSP and SS (the user stack). This single instruction performs the privilege transition and resumption that no sequence of ordinary instructions could achieve safely.
Dispatch latency is the time from when the scheduler decides to switch processes until the new process actually starts running. This is pure overhead—no useful user work happens during dispatch.
Components of dispatch latency:
| Component | Time | Dominant Factor |
|---|---|---|
| Save registers to PCB | 1-2 μs | Memory writes, FPU save if used |
| Address space switch (CR3 write) | 0.1-1 μs | TLB flush cost (if no PCID) |
| TLB refill (indirect) | 10-100 μs | First accesses after switch |
| Load registers from PCB | 1-2 μs | Memory reads |
| Mode transition (iret) | ~0.1 μs | Microcode execution |
| Direct latency total | 2-5 μs | |
| Indirect cost (cold cache/TLB) | 10-100+ μs | Depends on working set size |
Direct vs. indirect costs:
The direct dispatch latency (register save/restore, mode switch) is typically just a few microseconds. The indirect costs dominate: the incoming process starts with cold caches (its working set was evicted while it waited), a flushed or stale TLB that must be refilled one page walk at a time, and branch predictors trained on the previous process's code.
These indirect costs accumulate as the new process runs, manifesting as slower execution for the first microseconds to milliseconds after a switch.
```c
// Measuring context switch latency using pipe ping-pong

#include <stdio.h>
#include <unistd.h>
#include <time.h>

#define ITERATIONS 100000

int main() {
    int pipe1[2], pipe2[2];
    pipe(pipe1);
    pipe(pipe2);

    char buf = 0;
    struct timespec start, end;

    if (fork() == 0) {
        // Child: read from pipe1, write to pipe2
        while (1) {
            read(pipe1[0], &buf, 1);    // Block until parent writes
            write(pipe2[1], &buf, 1);   // Wake parent
        }
    } else {
        // Parent: time the round trip (2 context switches per iteration)
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (int i = 0; i < ITERATIONS; i++) {
            write(pipe1[1], &buf, 1);   // Wake child
            read(pipe2[0], &buf, 1);    // Block until child responds
        }

        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                            (end.tv_nsec - start.tv_nsec);
        double per_switch_ns = elapsed_ns / (2.0 * ITERATIONS);

        printf("Context switch latency: %.1f ns (%.3f μs)\n",
               per_switch_ns, per_switch_ns / 1000);
    }
    return 0;
}

// Typical output on modern Linux/x86:
// Context switch latency: 1500.0 ns (1.500 μs)
//
// This measures the minimal direct cost; real workloads pay more
// due to cache/TLB displacement
```

If a system performs 10,000 context switches per second (reasonable for interactive systems), and each switch costs 5μs directly plus 50μs of cache-related slowdown, that's 550ms of overhead per second—over 50% of system capacity! Minimizing dispatch latency through hardware support (PCID for TLBs, per-CPU data structures) is crucial for system performance.
The dispatcher must handle transitions between user mode and kernel mode—a fundamental security boundary enforced by CPU hardware.
Privilege levels (x86):
| Ring | Name | Access | Used For |
|---|---|---|---|
| 0 | Kernel mode | Full hardware access | Operating system kernel |
| 1-2 | Supervisor | Limited (rarely used) | Some hypervisors, legacy |
| 3 | User mode | Restricted | Applications |
Mode transitions happen in two contexts: voluntarily, when a process requests kernel service through a system call, and involuntarily, when an interrupt or exception forces entry into the kernel.
The mode switch process:
Transition to kernel mode involves: switching onto the process's kernel stack, saving the user-mode RIP, RSP, and RFLAGS, raising the privilege level to ring 0, and jumping to the appropriate kernel entry point (interrupt handler or system call handler).
Transition back to user mode involves: restoring the saved user-mode register state, validating that state (see the security checks below), dropping the privilege level to ring 3, and resuming at the saved RIP via iretq or sysretq.
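A bare user/kernel round trip (a mode switch with no context switch) can be felt from user space by timing a trivial system call in a loop. The sketch below assumes Linux; calling syscall(SYS_getpid) directly forces a genuine kernel entry on every iteration.

```c
// Rough measurement of a user -> kernel -> user round trip (mode switch only).
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000000

int main(void) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++)
        syscall(SYS_getpid);            // Enter and leave the kernel
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    printf("System call round trip: %.1f ns\n", elapsed_ns / ITERATIONS);
    return 0;
}
```

On current hardware this usually lands in the tens to low hundreds of nanoseconds, and noticeably higher on kernels with Meltdown/Spectre page-table-isolation mitigations enabled, which echoes the warning that follows.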
Mode switching is a primary security boundary. Bugs in this code can allow user processes to gain kernel privileges—a complete system compromise. The dispatcher must meticulously validate all state before returning to user mode. Spectre/Meltdown attacks exploited speculative execution during these transitions, requiring extensive mitigations in dispatch code.
```c
// Security considerations in mode transitions

void return_to_user(struct pt_regs *regs) {
    // SECURITY CHECK: Ensure we're not returning to kernel code
    if ((regs->cs & 3) != 3) {
        panic("Attempted return to ring 0 via user return path!");
    }

    // SECURITY CHECK: Validate segment selectors
    if (!valid_user_segment(regs->cs) || !valid_user_segment(regs->ss)) {
        panic("Invalid segments in user return!");
    }

    // SECURITY CHECK: Clear sensitive flags
    regs->rflags &= ~(FLAG_IOPL | FLAG_NT | FLAG_TF);
    regs->rflags |= FLAG_IF;   // Ensure interrupts are enabled

    // SPECTRE MITIGATION: Clear registers that might leak kernel data
    // speculative_store_bypass_barrier();

    // SPECTRE MITIGATION: Return stack buffer stuffing
    // fill_rsb_on_return();

    // Perform the actual return
    asm volatile("iretq");
}
```

Let's examine how real operating systems implement the dispatcher. The core function is remarkably compact—most of the work is carefully orchestrated register manipulation.
Linux context_switch() simplified:
```c
// Simplified from kernel/sched/core.c and arch/x86/kernel/process.c

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *context_switch(struct rq *rq,
                                                 struct task_struct *prev,
                                                 struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    // Prepare for switch
    prepare_task_switch(rq, prev, next);

    mm = next->mm;
    oldmm = prev->active_mm;

    // STEP 1: Switch memory context if needed
    if (!mm) {
        // Kernel thread: borrow previous mm (lazy TLB)
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
    } else {
        // User process: switch address spaces
        switch_mm_irqs_off(oldmm, mm, next);
    }

    // STEP 2: Switch CPU state (architecture-specific)
    // This is where the actual register switch happens
    switch_to(prev, next, prev);

    // After switch_to returns, we ARE the next task!
    // prev now points to what was the previous task
    return finish_task_switch(prev);
}

// The switch_to macro (x86_64) - this is the core
// Defined in arch/x86/include/asm/switch_to.h
#define switch_to(prev, next, last)                 \
do {                                                \
    prepare_switch_to(prev, next);                  \
                                                    \
    ((last) = __switch_to_asm((prev), (next)));     \
} while (0)
```

The actual register switch (x86_64 assembly):
```asm
# Linux arch/x86/entry/entry_64.S (simplified)
# __switch_to_asm - switch processor context

SYM_FUNC_START(__switch_to_asm)
    # Save callee-saved registers (per C ABI)
    # These are the only registers we need to save;
    # caller-saved registers are already on the stack
    pushq   %rbp
    pushq   %rbx
    pushq   %r12
    pushq   %r13
    pushq   %r14
    pushq   %r15

    # Switch stacks:
    # Save current stack pointer into prev->thread.sp
    movq    %rsp, TASK_threadsp(%rdi)   # %rdi = prev
    # Load next stack pointer from next->thread.sp
    movq    TASK_threadsp(%rsi), %rsp   # %rsi = next

    # Note: We are now on next's kernel stack!
    # The registers we pop are from next's saved state

    # Restore callee-saved registers (now from next)
    popq    %r15
    popq    %r14
    popq    %r13
    popq    %r12
    popq    %rbx
    popq    %rbp

    # Jump to __switch_to() C function for remaining work
    jmp     __switch_to
SYM_FUNC_END(__switch_to_asm)

# Magic insight: The 'ret' at the end of __switch_to returns
# to next's saved return address—which is where next was
# when it previously called __switch_to_asm!
```

The key insight: by saving the stack pointer and loading a different stack, when we 'pop' registers we're loading the next process's saved registers. When the function 'returns,' it returns to whatever return address is on the new stack—i.e., where the next process was when it yielded. This elegantly handles the 'teleportation' of control between processes.
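The same stack-switching "teleportation" can be demonstrated entirely in user space. The sketch below uses the POSIX ucontext API (obsolescent, but still available on Linux/glibc and good for illustration): swapcontext() saves the caller's registers and stack pointer and loads another context's, so control resumes wherever that context last stopped, much like __switch_to_asm does with kernel stacks.

```c
// User-space analogy of the stack-switch trick, using POSIX ucontext.
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];        // stack for the second context

static void coroutine(void) {
    printf("coroutine: first run\n");
    swapcontext(&co_ctx, &main_ctx);    // save ourselves, resume main
    printf("coroutine: resumed exactly where it left off\n");
}

int main(void) {
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link          = &main_ctx;   // where to go when coroutine returns
    makecontext(&co_ctx, coroutine, 0);

    swapcontext(&main_ctx, &co_ctx);    // switch onto the coroutine's stack
    printf("main: back after first switch\n");
    swapcontext(&main_ctx, &co_ctx);    // resume the coroutine mid-function
    printf("main: done\n");
    return 0;
}
```

Running it interleaves the two functions' messages even though neither ever "calls" the other in the ordinary sense: control simply resumes wherever each saved stack says it should.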
Given that dispatch happens thousands of times per second, operating systems employ numerous optimizations to minimize latency; two of the most important, cheap switches between threads of one process and dedicated hardware support, are covered below.
Thread switching vs. process switching:
Switching between threads of the same process is significantly faster than switching between processes:
| Operation | Thread Switch | Process Switch |
|---|---|---|
| Register save/restore | Same | Same |
| Address space switch | Skipped ✓ | Required |
| TLB impact | None ✓ | Flush or PCID overhead |
| Cache impact | Lower (shared address space) ✓ | Higher (different working set) |
| Scheduling overhead | Same | Same |
| Typical latency | 1-2 μs | 3-10 μs + indirect costs |
This latency difference is one reason for the popularity of multi-threaded applications: switching between threads of the same application is cheaper than switching between separate processes. The shared address space means no TLB invalidation and better cache behavior—threads can share data in cache.
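To put numbers on this difference, the earlier pipe ping-pong benchmark can be rebuilt with two threads of one process instead of two processes. This is a sketch assuming POSIX threads (compile with -pthread); for a fair comparison, pin both the fork-based version and this one to a single CPU (e.g., with taskset) so every hand-off really is a context switch.

```c
// Thread-switch variant of the pipe ping-pong benchmark: two threads of the
// same process bounce a byte through two pipes, so each round trip costs two
// context switches but no address-space change.
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000

static int pipe1[2], pipe2[2];

static void *echo_thread(void *arg) {
    char buf;
    for (int i = 0; i < ITERATIONS; i++) {
        read(pipe1[0], &buf, 1);    // Block until the main thread writes
        write(pipe2[1], &buf, 1);   // Wake the main thread
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    char buf = 'x';
    struct timespec start, end;

    pipe(pipe1);
    pipe(pipe2);
    pthread_create(&tid, NULL, echo_thread, NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        write(pipe1[1], &buf, 1);   // Wake the echo thread
        read(pipe2[0], &buf, 1);    // Block until it responds
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(tid, NULL);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    printf("Thread switch latency: %.1f ns\n", elapsed_ns / (2.0 * ITERATIONS));
    return 0;
}
```

Comparing its output with the fork-based version above gives a rough sense of what skipping the address-space switch saves on your particular machine.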
Hardware support evolution:
| Hardware Feature | Purpose | Savings |
|---|---|---|
| SYSCALL/SYSENTER | Fast system call entry | ~50% vs INT 0x80 |
| SYSRET/SYSEXIT | Fast system call return | ~50% vs IRET (for syscalls) |
| FXSAVE/FXRSTOR | Fast FPU state save | Built-in vs manual save |
| XSAVEOPT | Lazy extended state save | Only saves changed state |
| PCID | Process Context IDs | Avoid TLB flush (major!) |
| INVPCID | Selective TLB invalidation | Fine-grained TLB control |
| FSGSBASE | Fast FS/GS base access | Avoid MSR access |
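Which of these features your CPU actually offers can be checked from user space. The sketch below reuses the same CPUID helper idea as earlier (x86-64 with GCC/Clang inline assembly assumed); on Linux the same information also appears as flags such as pcid, invpcid, and fsgsbase in /proc/cpuinfo.

```c
// Probe a few dispatch-related CPU features via CPUID.
// Bit positions: PCID = CPUID.01H:ECX[17], INVPCID = CPUID.07H.0:EBX[10],
// FSGSBASE = CPUID.07H.0:EBX[0].
#include <stdint.h>
#include <stdio.h>

static void cpuid_count(uint32_t leaf, uint32_t subleaf,
                        uint32_t *eax, uint32_t *ebx,
                        uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(leaf), "c"(subleaf));
}

int main(void) {
    uint32_t a, b, c, d;

    cpuid_count(1, 0, &a, &b, &c, &d);
    printf("PCID:     %s\n", (c & (1u << 17)) ? "yes" : "no");

    cpuid_count(7, 0, &a, &b, &c, &d);
    printf("INVPCID:  %s\n", (b & (1u << 10)) ? "yes" : "no");
    printf("FSGSBASE: %s\n", (b & (1u << 0))  ? "yes" : "no");
    return 0;
}
```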
The dispatcher is the mechanical heart of process scheduling—translating scheduling decisions into actual CPU context changes. Let's consolidate our understanding: the dispatcher implements (rather than makes) scheduling decisions; a context switch means saving and restoring hundreds of bytes to several kilobytes of CPU state; dispatch latency has a small direct component and much larger indirect cache and TLB costs; every switch crosses the user/kernel privilege boundary and must be guarded carefully; and real kernels implement the core switch in a few dozen lines of carefully ordered assembly.
Module completion:
With the dispatcher covered, we've completed our exploration of Scheduling Concepts—the foundational theory underlying all CPU scheduling. We've covered: why scheduling exists (CPU bursts and multiprogramming), what we're trying to optimize (the scheduling criteria), the fundamental control mechanism (preemption), and how decisions become reality (the dispatcher).
Coming up in subsequent modules:
With this theoretical foundation, we're ready to explore specific scheduling algorithms—FCFS, SJF, Priority Scheduling, Round Robin, and Multi-Level Feedback Queue—understanding not just how they work, but why they make the tradeoffs they do.
You now have a complete understanding of CPU scheduling fundamentals. You understand why scheduling exists (bursts and multiprogramming), what we're trying to optimize (scheduling criteria), the fundamental control mechanism (preemption), and how decisions become reality (dispatcher). This foundation will make understanding specific algorithms much more intuitive.