The scheduler decides which process should run next. But making a decision is not the same as implementing it. The component that actually performs the context switch—saving the current process's state, loading the next process's state, and transferring CPU control—is the dispatcher.
If the scheduler is the brain of process management, the dispatcher is the hands. It operates at the lowest levels of the operating system, interfacing directly with CPU registers, memory management hardware, and privilege mode transitions. The dispatcher must be fast, because it runs on every single context switch, and it must be exactly correct, because a single misplaced register or stack pointer corrupts the very processes it is switching between.
Understanding the dispatcher reveals the true cost of context switching and explains why scheduling decisions have such tangible performance implications.
By the end of this page, you will understand: (1) The precise role and responsibilities of the dispatcher, (2) The mechanics of context switching at the hardware level, (3) What happens during mode transitions (user ↔ kernel), (4) Dispatch latency and its components, and (5) How dispatchers are implemented in real operating systems.
Before diving into mechanics, let's precisely distinguish these two closely related components:
The Scheduler: decides which process should run next, applying a scheduling policy to the set of ready processes. It answers "who runs, and when?"
The Dispatcher: carries that decision out, saving the outgoing process's state, restoring the incoming process's state, and handing over the CPU. It answers "how does the switch actually happen?"
The separation principle:
This separation of concerns follows a fundamental OS design principle: separate policy from mechanism.
This separation allows scheduling policies to be changed, tuned, or swapped without touching the delicate low-level switching code, and it lets one heavily tested dispatcher serve every policy the system supports.
In some operating systems and textbooks, the terms 'scheduler' and 'dispatcher' are used interchangeably or combined. When reading source code, 'schedule()' functions may include both selection logic and context switch invocation. The conceptual distinction remains valuable for understanding, even if implementation boundaries vary.
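To make the policy/mechanism split concrete, here is a minimal user-space sketch (all names are hypothetical, not taken from any real kernel): the scheduling policy is a swappable table of function pointers, while the dispatch mechanism stays fixed.

```c
// Sketch of the policy/mechanism split (hypothetical names).
// The policy decides; the mechanism performs the switch.
#include <stdio.h>

struct task { const char *name; };

struct sched_policy {
    struct task *(*pick_next)(void);   // POLICY: decide who runs next
};

static void dispatch(struct task *prev, struct task *next) {
    // MECHANISM: a real kernel would save prev's registers, switch address
    // spaces, and restore next's registers here. We just log the hand-off.
    printf("switching %s -> %s\n", prev->name, next->name);
}

static struct task t1 = { "A" }, t2 = { "B" };

static struct task *round_robin_pick(void) {
    static int i;
    return (i++ % 2) ? &t1 : &t2;      // toy round-robin decision
}

static struct sched_policy round_robin = { .pick_next = round_robin_pick };

static void schedule(struct sched_policy *policy, struct task *current) {
    struct task *next = policy->pick_next();   // scheduler: the decision
    if (next != current)
        dispatch(current, next);               // dispatcher: the implementation
}

int main(void) {
    schedule(&round_robin, &t1);   // the policy can be swapped without touching dispatch()
    return 0;
}
```

Linux follows the same pattern with its struct sched_class: the fair, realtime, and deadline classes each supply their own pick_next_task() logic, but all of them funnel into the single context_switch() path shown later on this page.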
The dispatcher performs a precise sequence of operations during every context switch. Each step is critical—omitting or misordering any step results in crashes, data corruption, or security breaches.
The dispatcher must: (1) save the full CPU context of the outgoing process into its PCB, (2) switch to the incoming process's address space if it differs, (3) restore the incoming process's saved context, and (4) transfer control back to user mode at the exact instruction where that process was last interrupted.
What gets saved/restored:
| State Category | Specific Registers | Typical Size |
|---|---|---|
| General-purpose registers | RAX, RBX, RCX, RDX, RSI, RDI, R8-R15 | 128 bytes |
| Program counter | RIP (instruction pointer) | 8 bytes |
| Stack pointer | RSP, RBP (stack and base pointer) | 16 bytes |
| Flags/status | RFLAGS (condition codes, interrupt flag) | 8 bytes |
| Segment registers | CS, DS, ES, FS, GS, SS | 48 bytes |
| FPU/SSE state | XMM0-XMM15, FPU stack, MXCSR | 512+ bytes |
| AVX state (if used) | YMM/ZMM registers | 2KB+ (AVX-512) |
| Total | | ~700 bytes typical, up to 4KB with extensions |
Modern CPUs have ever-larger state to save. AVX-512 adds 32 × 512-bit registers (2KB alone). On context switch, all this must be saved if the process used it. OSes employ 'lazy' state saving—only save extended state for processes that actually use it—but the cost still grows with each CPU generation.
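You can ask your own CPU how much extended state it would need saved: CPUID leaf 0x0D reports the required XSAVE area size. The helper below is a standalone sketch (x86-64 with GCC/Clang inline assembly assumed), not kernel code.

```c
// Query the XSAVE area size via CPUID leaf 0x0D, sub-leaf 0:
//   EBX = bytes needed for the state components currently enabled in XCR0
//   ECX = bytes needed if every supported component were enabled
#include <stdint.h>
#include <stdio.h>

static void cpuid_count(uint32_t leaf, uint32_t subleaf,
                        uint32_t *eax, uint32_t *ebx,
                        uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(leaf), "c"(subleaf));
}

int main(void) {
    uint32_t a, b, c, d;
    cpuid_count(0x0D, 0, &a, &b, &c, &d);
    printf("XSAVE area (currently enabled features): %u bytes\n", b);
    printf("XSAVE area (all supported features):     %u bytes\n", c);
    return 0;
}
```

On machines with AVX-512 enabled, the "currently enabled features" figure often lands in the 2-3KB range, consistent with the table above.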
Let's trace through a context switch in detail. Assume Process A is running and the scheduler decides to switch to Process B.
Phase 1: Entry into kernel (from interrupt)
```c
// PHASE 1: Interrupt entry (hardware-assisted)
// Timer fires → CPU automatically:
//   1. Loads the kernel RSP from the TSS (Task State Segment)
//   2. Pushes the interrupted state onto that kernel stack:
//      SS, RSP, RFLAGS, CS, RIP (what iret will later restore)
//   3. Changes CPL to 0 (kernel mode)
//   4. Jumps to the interrupt handler address (IDT entry)

void timer_interrupt_handler(struct pt_regs *regs) {
    // 'regs' points to the interrupted state pushed on the stack
    // Contains: RIP, CS, RFLAGS, RSP, SS of the interrupted process

    // Save remaining registers (not auto-saved by hardware)
    save_registers();   // Pushes RAX, RBX, RCX, etc.
}
```

Phase 2: Save current process context
```c
// PHASE 2: Save outgoing process (Process A)

void save_process_context(struct task_struct *prev, struct pt_regs *regs) {
    // Save user-mode registers from the interrupt frame
    prev->thread.rip    = regs->rip;
    prev->thread.rsp    = regs->rsp;
    prev->thread.rflags = regs->rflags;

    // Save general registers from the kernel stack
    prev->thread.rax = regs->rax;
    prev->thread.rbx = regs->rbx;
    // ... all other general-purpose registers

    // Save FPU/SSE state (if the process used it)
    if (prev->thread.flags & USED_FPU) {
        fxsave(&prev->thread.fpu_state);   // Save 512 bytes of FPU/SSE state
    }

    // Save extended state (AVX, etc.) if used
    if (cpu_has_xsave && (prev->thread.flags & USED_EXTENDED)) {
        xsaveopt(&prev->thread.xstate);    // Save variable-size extended state
    }

    // Update process state
    prev->state = TASK_READY;   // or TASK_BLOCKED if yielding for I/O
}
```

Phase 3: Switch address space
```c
// PHASE 3: Switch memory context

void switch_mm(struct task_struct *prev, struct task_struct *next) {
    // Check whether an address space change is needed at all
    if (prev->mm == next->mm) {
        // Same address space (e.g., kernel threads, or threads of one process)
        // No page table switch needed - optimization!
        return;
    }

    // Load the new process's page table base
    // CR3 = Page Table Base Register on x86
    unsigned long new_cr3 = next->mm->pgd_phys;

    if (cpu_has_pcid) {
        // PCID (Process Context ID) lets the TLB cache entries for multiple
        // address spaces; this CR3 write does not flush other PCIDs' entries
        new_cr3 |= (next->mm->context.asid & 0xFFF);
        write_cr3_noflush(new_cr3);
    } else {
        // Without PCID this is expensive: writing CR3 flushes the TLB
        write_cr3(new_cr3);
    }
}
```

Phase 4: Load new process context and return
```c
// PHASE 4: Load incoming process (Process B) and return to user mode

void restore_and_switch(struct task_struct *next) {
    // Update current pointer
    current = next;
    next->state = TASK_RUNNING;

    // Restore FPU/SSE state (if the process uses it)
    if (next->thread.flags & USED_FPU) {
        fxrstor(&next->thread.fpu_state);
    }

    // Restore extended state
    if (cpu_has_xsave && (next->thread.flags & USED_EXTENDED)) {
        xrstor(&next->thread.xstate);
    }

    // Load general-purpose registers
    // This is done via a carefully constructed stack frame
    // and the IRET instruction
}

// PHASE 5: Return to user mode (assembly)
// .global return_to_user
// return_to_user:
//     mov  next->thread.rsp, %rsp   # Load user stack pointer
//     pop  %r15                     # Restore registers from stack
//     pop  %r14
//     ...
//     pop  %rax
//     iretq                         # Return from interrupt
//                                   # Restores: RIP, CS, RFLAGS, RSP, SS
//                                   # Transitions to Ring 3 (user mode)

// Process B is now running!
// From B's perspective, it never knew it was interrupted
```

The x86 'iret' (interrupt return) instruction atomically restores RIP (the resume address), RFLAGS (including the interrupt-enable flag), CS (including the privilege level), and RSP and SS (the user stack). This single instruction performs the privilege transition and resumption that no sequence of ordinary instructions could achieve safely.
Dispatch latency is the time from when the scheduler decides to switch processes until the new process actually starts running. This is pure overhead—no useful user work happens during dispatch.
Components of dispatch latency:
| Component | Time | Dominant Factor |
|---|---|---|
| Save registers to PCB | 1-2 μs | Memory writes, FPU save if used |
| Address space switch (CR3 write) | 0.1-1 μs | TLB flush cost (if no PCID) |
| TLB refill (indirect) | 10-100 μs | First accesses after switch |
| Load registers from PCB | 1-2 μs | Memory reads |
| Mode transition (iret) | ~0.1 μs | Microcode execution |
| Direct latency total | 2-5 μs | |
| Indirect cost (cold cache/TLB) | 10-100+ μs | Depends on working set size |
Direct vs. indirect costs:
The direct dispatch latency (register save/restore, mode switch) is typically just a few microseconds. The indirect costs dominate: the incoming process starts with cold caches (its working set was evicted while it waited), a flushed or stale TLB that must be refilled one page walk at a time, and branch predictors trained on the previous process's code.
These indirect costs accumulate as the new process runs, manifesting as slower execution for the first microseconds to milliseconds after a switch.
```c
// Measuring context switch latency using pipe ping-pong

#include <stdio.h>
#include <unistd.h>
#include <time.h>

#define ITERATIONS 100000

int main() {
    int pipe1[2], pipe2[2];
    pipe(pipe1);
    pipe(pipe2);

    char buf = 0;
    struct timespec start, end;

    if (fork() == 0) {
        // Child: read from pipe1, write to pipe2
        while (1) {
            read(pipe1[0], &buf, 1);    // Block until parent writes
            write(pipe2[1], &buf, 1);   // Wake parent
        }
    } else {
        // Parent: time the round trip (2 context switches per iteration)
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (int i = 0; i < ITERATIONS; i++) {
            write(pipe1[1], &buf, 1);   // Wake child
            read(pipe2[0], &buf, 1);    // Block until child responds
        }

        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                            (end.tv_nsec - start.tv_nsec);
        double per_switch_ns = elapsed_ns / (2.0 * ITERATIONS);

        printf("Context switch latency: %.1f ns (%.3f μs)\n",
               per_switch_ns, per_switch_ns / 1000);
    }
    return 0;
}

// Typical output on modern Linux/x86:
// Context switch latency: 1500.0 ns (1.500 μs)
//
// This measures the minimal direct cost; real workloads pay more
// due to cache/TLB displacement
```

If a system performs 10,000 context switches per second (reasonable for interactive systems), and each switch costs 5μs directly plus 50μs of cache-related slowdown, that's 550ms of overhead per second—over 50% of system capacity! Minimizing dispatch latency through hardware support (PCID for TLBs, per-CPU data structures) is crucial for system performance.
The dispatcher must handle transitions between user mode and kernel mode—a fundamental security boundary enforced by CPU hardware.
Privilege levels (x86):
| Ring | Name | Access | Used For |
|---|---|---|---|
| 0 | Kernel mode | Full hardware access | Operating system kernel |
| 1-2 | Supervisor | Limited (rarely used) | Some hypervisors, legacy |
| 3 | User mode | Restricted | Applications |
Mode transitions happen in two contexts: voluntarily, when a process requests kernel service through a system call, and involuntarily, when an interrupt or exception forces entry into the kernel.
The mode switch process:
Transition to kernel mode involves: switching onto the process's kernel stack, saving the user-mode RIP, RSP, and RFLAGS, raising the privilege level to ring 0, and jumping to the appropriate kernel entry point (interrupt handler or system call handler).
Transition back to user mode involves: restoring the saved user-mode register state, validating that state (see the security checks below), dropping the privilege level to ring 3, and resuming at the saved RIP via iretq or sysretq.
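A bare user/kernel round trip (a mode switch with no context switch) can be felt from user space by timing a trivial system call in a loop. The sketch below assumes Linux; calling syscall(SYS_getpid) directly forces a genuine kernel entry on every iteration.

```c
// Rough measurement of a user -> kernel -> user round trip (mode switch only).
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000000

int main(void) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++)
        syscall(SYS_getpid);            // Enter and leave the kernel
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    printf("System call round trip: %.1f ns\n", elapsed_ns / ITERATIONS);
    return 0;
}
```

On current hardware this usually lands in the tens to low hundreds of nanoseconds, and noticeably higher on kernels with Meltdown/Spectre page-table-isolation mitigations enabled, which echoes the warning that follows.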
Mode switching is a primary security boundary. Bugs in this code can allow user processes to gain kernel privileges—a complete system compromise. The dispatcher must meticulously validate all state before returning to user mode. Spectre/Meltdown attacks exploited speculative execution during these transitions, requiring extensive mitigations in dispatch code.
```c
// Security considerations in mode transitions

void return_to_user(struct pt_regs *regs) {
    // SECURITY CHECK: Ensure we're not returning to kernel code
    if ((regs->cs & 3) != 3) {
        panic("Attempted return to ring 0 via user return path!");
    }

    // SECURITY CHECK: Validate segment selectors
    if (!valid_user_segment(regs->cs) || !valid_user_segment(regs->ss)) {
        panic("Invalid segments in user return!");
    }

    // SECURITY CHECK: Clear sensitive flags
    regs->rflags &= ~(FLAG_IOPL | FLAG_NT | FLAG_TF);
    regs->rflags |= FLAG_IF;   // Ensure interrupts are enabled

    // SPECTRE MITIGATION: Clear registers that might leak kernel data
    // speculative_store_bypass_barrier();

    // SPECTRE MITIGATION: Return stack buffer stuffing
    // fill_rsb_on_return();

    // Perform the actual return
    asm volatile("iretq");
}
```

Let's examine how real operating systems implement the dispatcher. The core function is remarkably compact—most of the work is carefully orchestrated register manipulation.
Linux context_switch() simplified:
```c
// Simplified from kernel/sched/core.c and arch/x86/kernel/process.c

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *context_switch(struct rq *rq,
                                                 struct task_struct *prev,
                                                 struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    // Prepare for switch
    prepare_task_switch(rq, prev, next);

    mm = next->mm;
    oldmm = prev->active_mm;

    // STEP 1: Switch memory context if needed
    if (!mm) {
        // Kernel thread: borrow previous mm (lazy TLB)
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
    } else {
        // User process: switch address spaces
        switch_mm_irqs_off(oldmm, mm, next);
    }

    // STEP 2: Switch CPU state (architecture-specific)
    // This is where the actual register switch happens
    switch_to(prev, next, prev);

    // After switch_to returns, we ARE the next task!
    // prev now points to what was the previous task
    return finish_task_switch(prev);
}

// The switch_to macro (x86_64) - this is the core
// Defined in arch/x86/include/asm/switch_to.h
#define switch_to(prev, next, last)                 \
do {                                                \
    prepare_switch_to(prev, next);                  \
                                                    \
    ((last) = __switch_to_asm((prev), (next)));     \
} while (0)
```

The actual register switch (x86_64 assembly):
```asm
# Linux arch/x86/entry/entry_64.S (simplified)
# __switch_to_asm - switch processor context

SYM_FUNC_START(__switch_to_asm)
    # Save callee-saved registers (per C ABI)
    # These are the only registers we need to save;
    # caller-saved registers are already on the stack
    pushq   %rbp
    pushq   %rbx
    pushq   %r12
    pushq   %r13
    pushq   %r14
    pushq   %r15

    # Switch stacks:
    # Save current stack pointer into prev->thread.sp
    movq    %rsp, TASK_threadsp(%rdi)   # %rdi = prev
    # Load next stack pointer from next->thread.sp
    movq    TASK_threadsp(%rsi), %rsp   # %rsi = next

    # Note: We are now on next's kernel stack!
    # The registers we pop are from next's saved state

    # Restore callee-saved registers (now from next)
    popq    %r15
    popq    %r14
    popq    %r13
    popq    %r12
    popq    %rbx
    popq    %rbp

    # Jump to __switch_to() C function for remaining work
    jmp     __switch_to
SYM_FUNC_END(__switch_to_asm)

# Magic insight: The 'ret' at the end of __switch_to returns
# to next's saved return address—which is where next was
# when it previously called __switch_to_asm!
```

The key insight: by saving the stack pointer and loading a different stack, when we 'pop' registers we're loading the next process's saved registers. When the function 'returns,' it returns to whatever return address is on the new stack—i.e., where the next process was when it yielded. This elegantly handles the 'teleportation' of control between processes.
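The same stack-switching "teleportation" can be demonstrated entirely in user space. The sketch below uses the POSIX ucontext API (obsolescent, but still available on Linux/glibc and good for illustration): swapcontext() saves the caller's registers and stack pointer and loads another context's, so control resumes wherever that context last stopped, much like __switch_to_asm does with kernel stacks.

```c
// User-space analogy of the stack-switch trick, using POSIX ucontext.
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];        // stack for the second context

static void coroutine(void) {
    printf("coroutine: first run\n");
    swapcontext(&co_ctx, &main_ctx);    // save ourselves, resume main
    printf("coroutine: resumed exactly where it left off\n");
}

int main(void) {
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link          = &main_ctx;   // where to go when coroutine returns
    makecontext(&co_ctx, coroutine, 0);

    swapcontext(&main_ctx, &co_ctx);    // switch onto the coroutine's stack
    printf("main: back after first switch\n");
    swapcontext(&main_ctx, &co_ctx);    // resume the coroutine mid-function
    printf("main: done\n");
    return 0;
}
```

Running it interleaves the two functions' messages even though neither ever "calls" the other in the ordinary sense: control simply resumes wherever each saved stack says it should.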
Given that dispatch happens thousands of times per second, operating systems employ numerous optimizations to minimize latency; two of the most important, cheap switches between threads of one process and dedicated hardware support, are covered below.
Thread switching vs. process switching:
Switching between threads of the same process is significantly faster than switching between processes:
| Operation | Thread Switch | Process Switch |
|---|---|---|
| Register save/restore | Same | Same |
| Address space switch | Skipped ✓ | Required |
| TLB impact | None ✓ | Flush or PCID overhead |
| Cache impact | Lower (shared address space) ✓ | Higher (different working set) |
| Scheduling overhead | Same | Same |
| Typical latency | 1-2 μs | 3-10 μs + indirect costs |
This latency difference is one reason for the popularity of multi-threaded applications: switching between threads of the same application is cheaper than switching between separate processes. The shared address space means no TLB invalidation and better cache behavior—threads can share data in cache.
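To put numbers on this difference, the earlier pipe ping-pong benchmark can be rebuilt with two threads of one process instead of two processes. This is a sketch assuming POSIX threads (compile with -pthread); for a fair comparison, pin both the fork-based version and this one to a single CPU (e.g., with taskset) so every hand-off really is a context switch.

```c
// Thread-switch variant of the pipe ping-pong benchmark: two threads of the
// same process bounce a byte through two pipes, so each round trip costs two
// context switches but no address-space change.
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000

static int pipe1[2], pipe2[2];

static void *echo_thread(void *arg) {
    char buf;
    for (int i = 0; i < ITERATIONS; i++) {
        read(pipe1[0], &buf, 1);    // Block until the main thread writes
        write(pipe2[1], &buf, 1);   // Wake the main thread
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    char buf = 'x';
    struct timespec start, end;

    pipe(pipe1);
    pipe(pipe2);
    pthread_create(&tid, NULL, echo_thread, NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        write(pipe1[1], &buf, 1);   // Wake the echo thread
        read(pipe2[0], &buf, 1);    // Block until it responds
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(tid, NULL);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    printf("Thread switch latency: %.1f ns\n", elapsed_ns / (2.0 * ITERATIONS));
    return 0;
}
```

Comparing its output with the fork-based version above gives a rough sense of what skipping the address-space switch saves on your particular machine.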
Hardware support evolution:
| Hardware Feature | Purpose | Savings |
|---|---|---|
| SYSCALL/SYSENTER | Fast system call entry | ~50% vs INT 0x80 |
| SYSRET/SYSEXIT | Fast system call return | ~50% vs IRET (for syscalls) |
| FXSAVE/FXRSTOR | Fast FPU state save | Built-in vs manual save |
| XSAVEOPT | Lazy extended state save | Only saves changed state |
| PCID | Process Context IDs | Avoid TLB flush (major!) |
| INVPCID | Selective TLB invalidation | Fine-grained TLB control |
| FSGSBASE | Fast FS/GS base access | Avoid MSR access |
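Which of these features your CPU actually offers can be checked from user space. The sketch below reuses the same CPUID helper idea as earlier (x86-64 with GCC/Clang inline assembly assumed); on Linux the same information also appears as flags such as pcid, invpcid, and fsgsbase in /proc/cpuinfo.

```c
// Probe a few dispatch-related CPU features via CPUID.
// Bit positions: PCID = CPUID.01H:ECX[17], INVPCID = CPUID.07H.0:EBX[10],
// FSGSBASE = CPUID.07H.0:EBX[0].
#include <stdint.h>
#include <stdio.h>

static void cpuid_count(uint32_t leaf, uint32_t subleaf,
                        uint32_t *eax, uint32_t *ebx,
                        uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(leaf), "c"(subleaf));
}

int main(void) {
    uint32_t a, b, c, d;

    cpuid_count(1, 0, &a, &b, &c, &d);
    printf("PCID:     %s\n", (c & (1u << 17)) ? "yes" : "no");

    cpuid_count(7, 0, &a, &b, &c, &d);
    printf("INVPCID:  %s\n", (b & (1u << 10)) ? "yes" : "no");
    printf("FSGSBASE: %s\n", (b & (1u << 0))  ? "yes" : "no");
    return 0;
}
```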
The dispatcher is the mechanical heart of process scheduling—translating scheduling decisions into actual CPU context changes. Let's consolidate our understanding: the dispatcher implements (rather than makes) scheduling decisions; a context switch means saving and restoring hundreds of bytes to several kilobytes of CPU state; dispatch latency has a small direct component and much larger indirect cache and TLB costs; every switch crosses the user/kernel privilege boundary and must be guarded carefully; and real kernels implement the core switch in a few dozen lines of carefully ordered assembly.
Module completion:
With the dispatcher covered, we've completed our exploration of Scheduling Concepts—the foundational theory underlying all CPU scheduling. We've covered: why scheduling exists (CPU bursts and multiprogramming), what we're trying to optimize (the scheduling criteria), the fundamental control mechanism (preemption), and how decisions become reality (the dispatcher).
Coming up in subsequent modules:
With this theoretical foundation, we're ready to explore specific scheduling algorithms—FCFS, SJF, Priority Scheduling, Round Robin, and Multi-Level Feedback Queue—understanding not just how they work, but why they make the tradeoffs they do.
You now have a complete understanding of CPU scheduling fundamentals. You understand why scheduling exists (bursts and multiprogramming), what we're trying to optimize (scheduling criteria), the fundamental control mechanism (preemption), and how decisions become reality (dispatcher). This foundation will make understanding specific algorithms much more intuitive.