A process sitting in the Ready queue has completed all initialization and possesses everything it needs to execute—except one critical resource: the CPU itself. The Dispatch transition is the moment of truth when the operating system selects a process from the Ready queue and grants it control of a processor core.
This transition involves one of the most carefully engineered operations in computer systems: the context switch. In microseconds, the kernel must save the complete state of the currently running process, load the state of the newly selected process, and transfer control seamlessly. The dispatched process resumes execution as if it had never been paused.
Understanding the Dispatch transition reveals how operating systems achieve the illusion of simultaneous execution—how your computer appears to run dozens of programs at once despite having only a handful of CPU cores.
By the end of this page, you will understand the complete Dispatch transition: from scheduler selection algorithms through hardware context switching, dispatcher operation, CPU binding policies, and the performance implications of frequent dispatching. You'll see the precise sequence of operations that unfolds in a few microseconds every time a process gets CPU time.
Before examining the dispatch operation, we must understand what it means for a process to be in the Ready state. A Ready process is runnable—it has all resources except CPU time. It differs fundamentally from a Waiting/Blocked process, which needs some event (I/O completion, lock acquisition, timer expiration) before it can proceed.
| Property | Ready State Status | Implication |
|---|---|---|
| Code loaded | Yes - in memory or pageable | Process can execute immediately if given CPU |
| Data accessible | Yes - address space complete | All memory references can be resolved |
| Pending I/O | None | No outstanding blocking operations |
| Required locks | None waiting | No synchronization preventing progress |
| CPU available | No - waiting for dispatch | This is what Ready is waiting for |
| Time slice | Allocated (for preemptive schedulers) | Process knows how long it can run |
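To make the Ready versus Blocked distinction concrete, here is a minimal sketch of how a PCB-like structure might record it; the field names are illustrative, not taken from any real kernel:

```c
enum proc_state {
    STATE_READY,      /* runnable: waiting only for a CPU */
    STATE_RUNNING,    /* currently executing on some core */
    STATE_BLOCKED,    /* waiting for an event: I/O, lock, timer */
};

struct pcb {
    int pid;
    enum proc_state state;
    void *wait_event;     /* non-NULL only when BLOCKED: the awaited event */
};

/* A process is eligible for dispatch iff it is Ready:
 * it needs nothing except a CPU. */
int dispatchable(const struct pcb *p)
{
    return p->state == STATE_READY;
}
```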
Multiple Ready queues:
Modern operating systems don't maintain a single Ready queue. Instead, they use sophisticated data structures optimized for their scheduling policies:
Per-CPU Ready Queues: each core maintains its own queue, so schedulers running on different cores don't contend for a single lock.
Priority-Based Queues: an array of queues indexed by priority level; the scheduler picks from the highest non-empty level (the approach of Linux's O(1) scheduler).
Run Trees (CFS): Linux's Completely Fair Scheduler keeps runnable tasks in a per-CPU red-black tree ordered by virtual runtime, so the leftmost node is always the next task to pick.
The choice of Ready 'queue' data structure profoundly impacts scheduling performance. Linux 2.4 used a single run queue with O(n) selection. Linux 2.6's O(1) scheduler used bitmaps and arrays. Linux 2.6.23+ CFS uses per-CPU red-black trees. Each evolution improved scalability for systems with more processes and CPUs.
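As a small illustration of why these structures matter, here is a sketch (not Linux code) of O(1)-style selection: one list per priority level plus a bitmap of non-empty levels, so picking the next task never requires scanning all processes. It uses the GCC/Clang builtin `__builtin_ffsll` to find the first set bit:

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_PRIORITIES 64

struct task;                               /* opaque task type for this sketch */

struct ready_queue {
    uint64_t bitmap;                       /* bit p set => queue[p] non-empty */
    struct task *queue[NUM_PRIORITIES];    /* head of each priority list */
};

/* Pick the highest-priority runnable task in O(1):
 * find the first set bit, then take that list's head. */
struct task *pick_next(struct ready_queue *rq)
{
    if (rq->bitmap == 0)
        return NULL;                       /* nothing runnable: go idle */
    int prio = __builtin_ffsll(rq->bitmap) - 1;  /* lowest set bit = highest priority */
    return rq->queue[prio];
}
```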
The scheduler is the kernel component responsible for deciding which Ready process should run next. This decision is the precursor to dispatch: without a selection, there's no process to dispatch. The scheduler must balance multiple competing objectives: fairness across processes, overall throughput, responsiveness (low latency for interactive work), and respect for priorities.
When is the scheduler invoked?
The scheduler runs in several situations, each potentially leading to a dispatch:
Time quantum expiration — The currently running process used its allocated time slice. Scheduler picks the next process.
Process blocks — Current process waits for I/O or lock. Scheduler selects another Ready process.
Process terminates — Current process exits. Scheduler runs to find next process.
Higher-priority process becomes Ready — On preemptive systems, this triggers rescheduling. The scheduler may dispatch the new high-priority process immediately.
Process yields — Current process voluntarily gives up CPU via sched_yield(). Scheduler picks next process.
System call return — The kernel may check for rescheduling before returning to user space.
```c
// Simplified Linux scheduler invocation points
// Actual implementation in kernel/sched/core.c

// 1. Timer interrupt handler (quantum expired)
void scheduler_tick(void)
{
    struct task_struct *curr = current;

    // Update accounting
    curr->sched_class->task_tick(curr);

    // Check if time slice exhausted
    if (need_resched())
        set_tsk_need_resched(curr);   // Mark for rescheduling
}

// 2. When returning from kernel to user space
void exit_to_user_mode(void)
{
    // Check if rescheduling needed
    if (need_resched())
        schedule();                   // Invoke scheduler
}

// 3. When process blocks
void wait_event(wait_queue_head_t *wq)
{
    current->state = TASK_INTERRUPTIBLE;
    schedule();                       // Give up CPU, scheduler picks next
    // Returns here when woken up
}

// 4. Explicit yield
void sys_sched_yield(void)
{
    current->sched_class->yield_task(current);
    schedule();                       // Let scheduler pick another process
}

// The main scheduling function
void schedule(void)
{
    struct task_struct *prev = current;
    struct task_struct *next;

    // Select next task to run
    next = pick_next_task();

    if (prev != next) {
        // Context switch needed
        context_switch(prev, next);
    }
    // Now running as 'next'
}
```

In preemptive scheduling, the kernel can interrupt a running process to dispatch another (e.g., when a higher-priority process becomes Ready). In non-preemptive scheduling, a process runs until it voluntarily yields or blocks. Modern general-purpose OSes are preemptive; some real-time systems are non-preemptive for determinism.
While the scheduler decides which process should run, the dispatcher is the kernel component that actually implements the decision. The dispatcher performs the low-level work of saving and restoring CPU state, switching address spaces, and transferring control.
Key distinction: the scheduler implements policy (which process should run next), while the dispatcher implements mechanism (how control is actually transferred). The dispatcher is invoked by the scheduler after a selection decision is made. In many kernel implementations they're tightly coupled, but conceptually they serve distinct roles.
Dispatcher latency:
Dispatcher latency is the time from when the scheduler makes its decision to when the new process starts executing. This latency is critical for real-time systems that must meet deadlines and for interactive workloads where users directly perceive delays.
Typical dispatcher latencies: on modern hardware, roughly half a microsecond for a thread switch up to tens of microseconds for a cross-NUMA process switch; the comparison table later in this section breaks these costs down by switch type.
Dispatcher latency comes from multiple sources: register save/restore time, memory mapping switch, TLB flush overhead, cache pollution effects, and interrupt handling delays. Modern hardware provides features (ASID, PCID, hardware context switch support) to minimize these overheads, but they remain non-zero.
The context switch is the heart of the dispatch operation. It requires saving the complete execution state of one process and restoring another's state, all while the CPU continues executing kernel code. Let's examine this operation in detail.
What constitutes 'context'?
A process's execution context includes everything needed to resume execution exactly where it left off:
| Category | Components | Save Location |
|---|---|---|
| General-Purpose Registers | RAX, RBX, RCX, RDX, RSI, RDI, R8-R15 (x86-64) | PCB / kernel stack |
| Stack Registers | RSP (stack pointer), RBP (base pointer) | PCB / kernel stack |
| Instruction Pointer | RIP (program counter) | PCB / kernel stack |
| Flags Register | RFLAGS (condition codes, interrupt enable) | PCB / kernel stack |
| Segment Registers | CS, DS, SS, ES, FS, GS | PCB (usually constants) |
| FPU/SIMD State | x87 FPU, XMM, YMM, ZMM registers | Extended save area (XSAVE) |
| Debug Registers | DR0-DR7 (hardware breakpoints) | PCB if debugging active |
| Address Space | CR3 (page table base register) | mm_struct pointer in PCB |
| Kernel Stack | Per-process kernel stack pointer | PCB |
The switch sequence on x86-64 Linux:
Context switching involves both C code for the high-level logic and assembly code for the actual register manipulation. Here's the essential sequence:
```c
// Simplified context switch - Linux kernel
// Real implementation in arch/x86/kernel/process_64.c

struct task_struct *__switch_to(struct task_struct *prev,
                                struct task_struct *next)
{
    struct thread_struct *prev_sp = &prev->thread;
    struct thread_struct *next_sp = &next->thread;

    // 1. Save FPU/SSE/AVX state of previous process
    //    Uses XSAVE instruction for efficiency
    fpu__switch_save(prev);

    // 2. Save kernel stack pointer in prev's PCB
    prev_sp->sp = current_stack_pointer;

    // 3. Switch kernel stacks
    //    The new process will resume on its own stack
    load_sp(next_sp->sp);

    // 4. Switch address space (if different process, not thread)
    if (prev->mm != next->mm) {
        // Load CR3 with new page table base
        load_cr3(next->mm->pgd);
        // Flush TLB entries (or use PCID)
        // Modern CPUs can skip this with PCID
    }

    // 5. Switch FS/GS segment bases (for TLS)
    load_fsbase(next->thread.fsbase);
    load_gsbase(next->thread.gsbase);

    // 6. Restore FPU/SSE/AVX state of next process
    fpu__switch_restore(next);

    // 7. Update per-CPU "current" pointer
    this_cpu_write(current_task, next);

    // 8. Return - execution continues as next process
    return prev;  // Returned to caller in next's context
}
```
```asm
# Simplified x86-64 context switch assembly
# Actual code in arch/x86/entry/entry_64.S

.globl switch_to_asm
switch_to_asm:
    # Save callee-saved registers of prev on its stack
    pushq %rbp
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    pushfq                              # Save flags

    # Save current stack pointer in prev's PCB
    movq %rsp, TASK_THREAD_SP(%rdi)     # prev->thread.sp = rsp

    # Switch to next's kernel stack
    movq TASK_THREAD_SP(%rsi), %rsp     # rsp = next->thread.sp

    # Restore callee-saved registers of next from its stack
    popfq                               # Restore flags
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbx
    popq %rbp

    # Return to caller
    # But now we're in next's context!
    # The return address on stack belongs to next
    ret
```

The key insight is that the final 'ret' returns to different code depending on whose stack is active. By switching stacks, we switch the return address sitting on top of the stack to the one saved when the next process was suspended, so 'ret' lands exactly where that process left off. This is why a context switch 'returns' into the new process seamlessly.
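You can observe the same trick from user space. The sketch below uses the POSIX ucontext functions (getcontext, makecontext, swapcontext; still available in glibc, though removed from newer POSIX) as a runnable analogue: swapcontext saves the current registers and stack pointer, then resumes another context exactly where it previously stopped:

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];        /* the coroutine's own stack */

static void coroutine(void)
{
    printf("coroutine: first run\n");
    swapcontext(&co_ctx, &main_ctx);    /* "dispatch" back to main */
    printf("coroutine: resumed exactly where it left off\n");
}

int main(void)
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link          = &main_ctx;    /* return here when it ends */
    makecontext(&co_ctx, coroutine, 0);

    printf("main: dispatching coroutine\n");
    swapcontext(&main_ctx, &co_ctx);    /* switch stacks, like the kernel */
    printf("main: back; dispatching again\n");
    swapcontext(&main_ctx, &co_ctx);    /* resumes after its swapcontext */
    printf("main: done\n");
    return 0;
}
```

Each swapcontext call "returns" on a different stack than the one it was called on, which is precisely the behavior the kernel's switch_to_asm achieves with its final 'ret'.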
When dispatching a process with a different address space, the kernel must switch the memory context. This involves changing the page table base register and managing the Translation Lookaside Buffer (TLB). These operations can be expensive, making memory context switching a significant component of dispatch overhead.
At the hardware level, the address-space switch itself is a single instruction, `mov cr3, <address>`, which loads the new page table base; the real cost lies in its side effects on the TLB.
Address Space Layout Randomization (ASLR) impact:
Modern systems randomize process address spaces for security (ASLR). This means that even two instances of the same program have different memory layouts, so their page tables and cached translations cannot be shared; each dispatch to a different process installs a genuinely distinct set of mappings.
The KPTI challenge (Meltdown mitigation):
Kernel Page Table Isolation (KPTI), implemented to mitigate the Meltdown vulnerability, adds context switch overhead: each process gets two page table sets, one including kernel mappings and one without them, and every transition between user and kernel mode requires an extra CR3 write. This effectively doubles the page-table switching work per context switch (see the 'Process (KPTI)' row in the cost table later in this section).
| Strategy | Mechanism | Overhead | Hardware Required |
|---|---|---|---|
| Full Flush | Invalidate all TLB entries on switch | High - cold TLB for new process | None |
| Global Pages | Kernel pages never flushed | Medium - user TLB cold | Global bit support |
| ASID/PCID | Tag entries with process ID, no flush needed | Low - TLB entries persist | ASID/PCID support |
| Page-Specific Flush | INVLPG for specific modified pages | Variable - depends on changes | INVLPG instruction |
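To make the ASID/PCID row concrete, here is a conceptual sketch (illustrative, not the Linux implementation; this must run in ring 0): with CR4.PCIDE enabled, the low 12 bits of CR3 carry a PCID tag, and setting bit 63 of the value written tells the CPU to keep that PCID's TLB entries instead of flushing them:

```c
#include <stdint.h>

#define CR3_NOFLUSH (1ULL << 63)    /* bit 63: preserve this PCID's TLB entries */

static inline void write_cr3(uint64_t val)
{
    __asm__ volatile("mov %0, %%cr3" :: "r"(val) : "memory");
}

/* Switch address spaces while keeping TLB entries tagged with this PCID.
 * pgd_phys: physical address of the new top-level page table. */
void switch_mm_pcid(uint64_t pgd_phys, uint16_t pcid)
{
    write_cr3(pgd_phys | (pcid & 0xFFF) | CR3_NOFLUSH);
}
```

With 12 bits of PCID, up to 4,096 address spaces can keep translations resident in the TLB simultaneously, which is why the table rates this strategy's overhead as low.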
Threads within the same process share an address space, so dispatching between threads avoids the expensive page table switch and TLB flush. This is one reason multi-threading often outperforms multi-processing for CPU-bound workloads—context switches are cheaper.
Modern CPUs have extensive state beyond general-purpose registers: floating-point unit (FPU) registers, SIMD registers (SSE, AVX, AVX-512), and various control/status registers. This state can be substantial—AVX-512 adds 2KB+ of register state per process. Efficient handling of this state is crucial for dispatch performance.
| Component | Size | Content | Save/Restore |
|---|---|---|---|
| x87 FPU | 112 bytes | 8 x 80-bit floating-point registers + control/status | FXSAVE/FXRSTOR or XSAVE |
| SSE/SSE2 | 256 bytes | 16 x 128-bit XMM registers | FXSAVE/FXRSTOR or XSAVE |
| AVX/AVX2 | 512 bytes | 16 x 256-bit YMM registers (upper 128 bits) | XSAVE only |
| AVX-512 | 2KB+ | 32 x 512-bit ZMM registers + mask registers | XSAVE only |
| MPX | 64 bytes | Bound registers for memory protection | XSAVE only |
| PKRU | 8 bytes | Protection keys register | XSAVE only |
Lazy FPU context switching:
Historically, operating systems used 'lazy' FPU context switching to avoid unnecessary save/restore: on a switch, the kernel left the old FPU state in the registers and set a trap flag (CR0.TS on x86) so that the next FPU instruction would fault; only if the new process actually touched the FPU did the fault handler save the old owner's state and restore the new one.
This optimization avoided FPU save/restore for processes that didn't use floating-point. However, with SSE being ubiquitous (used even by memcpy), and security concerns (speculative execution attacks), lazy FPU is now mostly deprecated. Modern kernels often do 'eager' FPU switching.
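A conceptual sketch of that mechanism appears below; the helper names (set_cr0_ts, fpu_save, and so on) are illustrative stand-ins for privileged operations, not real kernel APIs:

```c
struct task_struct;                        /* opaque task type for this sketch */

/* Illustrative stubs standing in for privileged operations. */
void set_cr0_ts(void);                     /* arm the FPU trap (CR0.TS = 1) */
void clear_cr0_ts(void);                   /* disarm it (CR0.TS = 0) */
void fpu_save(struct task_struct *t);
void fpu_restore(struct task_struct *t);

static struct task_struct *fpu_owner;      /* task whose state is in the FPU */

/* On context switch: don't touch FPU state, just arm the trap. */
void lazy_fpu_on_switch(void)
{
    set_cr0_ts();                          /* next FPU instruction raises #NM */
}

/* #NM (device-not-available) exception handler. */
void handle_nm_exception(struct task_struct *current_task)
{
    clear_cr0_ts();                        /* permit FPU use again */
    if (fpu_owner != current_task) {
        if (fpu_owner)
            fpu_save(fpu_owner);           /* evict the previous owner's state */
        fpu_restore(current_task);         /* load this task's state */
        fpu_owner = current_task;
    }
    /* The faulting FPU instruction re-executes on return. */
}
```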
```c
// Modern FPU/SIMD state management with XSAVE
// Simplified from Linux arch/x86/kernel/fpu/

struct fpu_state {
    // Dynamically sized based on CPU features
    // Can be 512 bytes (SSE) to 2KB+ (AVX-512)
    u8 state[XSAVE_SIZE];
};

void fpu_save(struct task_struct *task)
{
    // Use XSAVE instruction to save all enabled components
    // The bitmap in XCR0 controls which components to save
    // Efficient: only saves components that have been modified
    // Uses "optimized" XSAVE that checks for actual use
    xsave_to_memory(&task->thread.fpu.state);
}

void fpu_restore(struct task_struct *task)
{
    // Use XRSTOR to restore state
    // Only restores components that were saved
    xrstor_from_memory(&task->thread.fpu.state);
}

// During context switch
void switch_fpu_context(struct task_struct *prev,
                        struct task_struct *next)
{
    // Always save/restore in modern kernels (eager switching)
    if (cpu_has_xsave) {
        fpu_save(prev);
        fpu_restore(next);
    } else {
        // Legacy FXSAVE/FXRSTOR for older CPUs
        fxsave(&prev->thread.fpu);
        fxrstor(&next->thread.fpu);
    }
}
```

The XSAVE instruction family is extensible and efficient: it only saves and restores state that has been modified, tracked via the XSTATE_BV bitmap. A process using only scalar math touches just a few hundred bytes of its save area, while one using AVX-512 may need 2KB+. The hardware tracks which components have been used since the last XRSTOR.
In multiprocessor systems, the dispatcher must decide which CPU will run the selected process. This decision involves balancing load across processors against the benefits of keeping processes on the same CPU (cache affinity).
Setting CPU affinity programmatically:
Applications can control which CPUs they run on using scheduler affinity APIs:
```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    cpu_set_t mask;

    // Get current affinity
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("Current affinity: ");
        for (int i = 0; i < CPU_SETSIZE; i++) {
            if (CPU_ISSET(i, &mask))
                printf("%d ", i);
        }
        printf("\n");
    }

    // Set affinity to CPUs 0 and 1 only
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    CPU_SET(1, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) == 0) {
        printf("Affinity set to CPUs 0 and 1\n");
    }

    // Process will now only be dispatched to CPUs 0 or 1
    // Useful for:
    // - Keeping threads on same NUMA node
    // - Isolating from other workloads
    // - Real-time predictability

    return 0;
}
```

The load balancer:
Linux runs a periodic load balancer that migrates processes between CPUs to maintain balance: it compares run-queue loads across the scheduling domains (SMT siblings, cores, sockets, NUMA nodes) and moves tasks from overloaded CPUs toward idle ones.
Migrating a process to a different CPU invalidates cache benefits. The process's data must be reloaded from higher-level caches or main memory. For memory-intensive workloads, a single migration can cost milliseconds in cache misses. Schedulers are conservative about migration—a 5% load imbalance is often tolerated to avoid migration overhead.
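That tolerance might look like the following sketch; the threshold and names are made up for illustration, not Linux's actual balancer logic:

```c
/* Decide whether pulling a task from the busiest CPU is worthwhile.
 * Small imbalances are tolerated: a migrated task restarts with cold
 * caches, which can cost more than the imbalance it fixes. */
int should_migrate(int busiest_nr_running, int idlest_nr_running)
{
    /* never steal a CPU's only task, and require >5% imbalance */
    return busiest_nr_running >= 2 &&
           busiest_nr_running * 100 > idlest_nr_running * 105;
}
```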
The Dispatch transition occurs in various contexts, each with different characteristics and optimizations:
Switching between threads of the same process:
This is the lightest form of dispatch: the address space is shared, so there is no page table switch and no TLB flush; only the registers, stack pointers, and FPU state are exchanged, along with the kernel stack.
Thread switching is typically 2-5x faster than full process switching because it avoids the expensive memory context change. This is why thread pools and async programming models are popular for I/O-bound workloads.
| Switch Type | Operations Required | Typical Cost | Cache Impact |
|---|---|---|---|
| Thread (same process) | Register + FPU save/restore | 0.5-2 μs | Minimal |
| Process (PCID enabled) | Thread + page table switch | 2-5 μs | Moderate - TLB maintained |
| Process (no PCID) | Thread + page table + TLB flush | 5-15 μs | Significant - cold TLB |
| Process (KPTI) | Process + double page table switch | 10-30 μs | High |
| Cross-NUMA | Process + memory latency | 20-100 μs | Very high - remote cache |
In virtual machines, context switching becomes more complex. The hypervisor may need to intercept CR3 loads (causing a 'VM exit' in VMX terminology), and nested page tables add another level of translation. Container-based virtualization (Docker) avoids this—containers share the host kernel and experience native context switch costs.
The dispatch operation, while necessary for multitasking, represents pure overhead—CPU time spent switching rather than doing useful work. Understanding and minimizing dispatch overhead is crucial for system performance.
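A quick worked example, using illustrative figures consistent with the cost table above: at 5 μs per process switch and 2,000 switches per second on a core, dispatching consumes 2,000 × 5 μs = 10 ms out of every second, about 1% of that core, before counting the indirect cost of cold caches and TLBs after each switch.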
Strategies to reduce dispatch overhead: use longer time quanta so switch costs are amortized over more useful work, set CPU affinity to preserve cache warmth, rely on PCID/ASID hardware to avoid TLB flushes, prefer threads over separate processes where isolation allows, and use event-driven I/O so one thread serves many connections.
Measuring dispatch overhead:
You can measure context switch costs with tools like lmbench or simple benchmarks:
```bash
# Using perf to count context switches
perf stat -e context-switches,cpu-clock ./my_application
# Example output:
#   15,234  context-switches
#   2,500.00 msec cpu-clock
# i.e. ~6.1 switches/ms, one switch every ~164 μs of CPU time

# Using lmbench for precise measurement
lat_ctx -s 0 2               # Measure 2-process context switch time
# Output: "size=0k ovr=0.59 2 1.98"
# Shows ~2 μs per context switch

# Monitor system-wide context switches
vmstat 1 | awk '{print $12}' # 'cs' column
# Shows context switches per second

# Detailed per-CPU scheduling statistics
cat /proc/schedstat
```

High-performance servers (nginx, Node.js, Redis) minimize context switches by using event-driven architectures. Instead of spawning threads/processes per connection, they use a single thread with epoll/kqueue to handle thousands of connections. This dramatically reduces dispatch overhead for I/O-bound workloads.
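If you want a self-contained version of the 'simple benchmark' mentioned above, the sketch below forces context switches by ping-ponging a byte between two processes over a pair of pipes; it assumes Linux/POSIX and reports a rough per-switch figure (pinning both processes to one CPU with taskset gives a cleaner number):

```c
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    int p2c[2], c2p[2];             /* parent->child and child->parent pipes */
    const long N = 100000;
    char buf = 'x';

    if (pipe(p2c) != 0 || pipe(c2p) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {                 /* child: echo each byte straight back */
        for (long i = 0; i < N; i++) {
            read(p2c[0], &buf, 1);
            write(c2p[1], &buf, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) {
        write(p2c[1], &buf, 1);     /* wake child (dispatch), then block */
        read(c2p[0], &buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wait(NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* each round trip includes at least two context switches */
    printf("~%.0f ns per context switch\n", ns / (2.0 * N));
    return 0;
}
```

Because each write blocks the sender until the other process runs, every round trip contains at least two dispatches; the measured figure also includes pipe syscall overhead, so treat it as an upper bound.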
We've now thoroughly examined the Dispatch transition—the mechanism by which processes in the Ready queue gain control of a CPU and begin (or resume) execution.
Looking ahead:
Now that we understand how processes get onto the CPU (Dispatch), we need to understand what happens when their time runs out. The Timeout transition (Running → Ready) occurs when a process exhausts its time quantum—a crucial mechanism for ensuring fairness in multitasking systems.
You now understand the Dispatch transition (Ready → Running) comprehensively—from scheduler selection through context switching mechanics, memory management, FPU state handling, and CPU affinity. Next, we'll examine the Timeout transition that returns processes to the Ready state.