A process sitting in the Ready queue has completed all initialization and possesses everything it needs to execute—except one critical resource: the CPU itself. The Dispatch transition is the moment of truth when the operating system selects a process from the Ready queue and grants it control of a processor core.
This transition involves one of the most carefully engineered operations in computer systems: the context switch. In microseconds, the kernel must save the complete state of the currently running process, load the state of the newly selected process, and transfer control seamlessly. The dispatched process resumes execution as if it had never been paused.
Understanding the Dispatch transition reveals how operating systems achieve the illusion of simultaneous execution—how your computer appears to run dozens of programs at once despite having only a handful of CPU cores.
By the end of this page, you will understand the complete Dispatch transition: from scheduler selection algorithms through hardware context switching, dispatcher operation, CPU binding policies, and the performance implications of frequent dispatching. You'll see the precise sequence of operations that unfolds in a few microseconds every time a process gets CPU time.
Before examining the dispatch operation, we must understand what it means for a process to be in the Ready state. A Ready process is runnable—it has all resources except CPU time. It differs fundamentally from a Waiting/Blocked process, which needs some event (I/O completion, lock acquisition, timer expiration) before it can proceed.
| Property | Ready State Status | Implication |
|---|---|---|
| Code loaded | Yes - in memory or pageable | Process can execute immediately if given CPU |
| Data accessible | Yes - address space complete | All memory references can be resolved |
| Pending I/O | None | No outstanding blocking operations |
| Required locks | None waiting | No synchronization preventing progress |
| CPU available | No - waiting for dispatch | This is what Ready is waiting for |
| Time slice | Allocated (for preemptive schedulers) | Process knows how long it can run |
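To make the Ready versus Blocked distinction concrete, here is a minimal sketch of how a PCB-like structure might record it; the field names are illustrative, not taken from any real kernel:

```c
enum proc_state {
    STATE_READY,      /* runnable: waiting only for a CPU */
    STATE_RUNNING,    /* currently executing on some core */
    STATE_BLOCKED,    /* waiting for an event: I/O, lock, timer */
};

struct pcb {
    int pid;
    enum proc_state state;
    void *wait_event;     /* non-NULL only when BLOCKED: the awaited event */
};

/* A process is eligible for dispatch iff it is Ready:
 * it needs nothing except a CPU. */
int dispatchable(const struct pcb *p)
{
    return p->state == STATE_READY;
}
```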
Multiple Ready queues:
Modern operating systems don't maintain a single Ready queue. Instead, they use sophisticated data structures optimized for their scheduling policies:
Per-CPU Ready Queues: each core maintains its own queue, so schedulers running on different cores don't contend for a single lock.
Priority-Based Queues: an array of queues indexed by priority level; the scheduler picks from the highest non-empty level (the approach of Linux's O(1) scheduler).
Run Trees (CFS): Linux's Completely Fair Scheduler keeps runnable tasks in a per-CPU red-black tree ordered by virtual runtime, so the leftmost node is always the next task to pick.
The choice of Ready 'queue' data structure profoundly impacts scheduling performance. Linux 2.4 used a single run queue with O(n) selection. Linux 2.6's O(1) scheduler used bitmaps and arrays. Linux 2.6.23+ CFS uses per-CPU red-black trees. Each evolution improved scalability for systems with more processes and CPUs.
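As a small illustration of why these structures matter, here is a sketch (not Linux code) of O(1)-style selection: one list per priority level plus a bitmap of non-empty levels, so picking the next task never requires scanning all processes. It uses the GCC/Clang builtin `__builtin_ffsll` to find the first set bit:

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_PRIORITIES 64

struct task;                               /* opaque task type for this sketch */

struct ready_queue {
    uint64_t bitmap;                       /* bit p set => queue[p] non-empty */
    struct task *queue[NUM_PRIORITIES];    /* head of each priority list */
};

/* Pick the highest-priority runnable task in O(1):
 * find the first set bit, then take that list's head. */
struct task *pick_next(struct ready_queue *rq)
{
    if (rq->bitmap == 0)
        return NULL;                       /* nothing runnable: go idle */
    int prio = __builtin_ffsll(rq->bitmap) - 1;  /* lowest set bit = highest priority */
    return rq->queue[prio];
}
```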
The scheduler is the kernel component responsible for deciding which Ready process should run next. This decision is the precursor to dispatch: without a selection, there's no process to dispatch. The scheduler must balance multiple competing objectives: fairness across processes, overall throughput, responsiveness (low latency for interactive work), and respect for priorities.
When is the scheduler invoked?
The scheduler runs in several situations, each potentially leading to a dispatch:
Time quantum expiration — The currently running process used its allocated time slice. Scheduler picks the next process.
Process blocks — Current process waits for I/O or lock. Scheduler selects another Ready process.
Process terminates — Current process exits. Scheduler runs to find next process.
Higher-priority process becomes Ready — On preemptive systems, this triggers rescheduling. The scheduler may dispatch the new high-priority process immediately.
Process yields — Current process voluntarily gives up CPU via sched_yield(). Scheduler picks next process.
System call return — The kernel may check for rescheduling before returning to user space.
```c
// Simplified Linux scheduler invocation points
// Actual implementation in kernel/sched/core.c

// 1. Timer interrupt handler (quantum expired)
void scheduler_tick(void)
{
    struct task_struct *curr = current;

    // Update accounting
    curr->sched_class->task_tick(curr);

    // Check if time slice exhausted
    if (need_resched())
        set_tsk_need_resched(curr);   // Mark for rescheduling
}

// 2. When returning from kernel to user space
void exit_to_user_mode(void)
{
    // Check if rescheduling needed
    if (need_resched())
        schedule();                   // Invoke scheduler
}

// 3. When process blocks
void wait_event(wait_queue_head_t *wq)
{
    current->state = TASK_INTERRUPTIBLE;
    schedule();                       // Give up CPU, scheduler picks next
    // Returns here when woken up
}

// 4. Explicit yield
void sys_sched_yield(void)
{
    current->sched_class->yield_task(current);
    schedule();                       // Let scheduler pick another process
}

// The main scheduling function
void schedule(void)
{
    struct task_struct *prev = current;
    struct task_struct *next;

    // Select next task to run
    next = pick_next_task();

    if (prev != next) {
        // Context switch needed
        context_switch(prev, next);
    }
    // Now running as 'next'
}
```

In preemptive scheduling, the kernel can interrupt a running process to dispatch another (e.g., when a higher-priority process becomes Ready). In non-preemptive scheduling, a process runs until it voluntarily yields or blocks. Modern general-purpose OSes are preemptive; some real-time systems are non-preemptive for determinism.
While the scheduler decides which process should run, the dispatcher is the kernel component that actually implements the decision. The dispatcher performs the low-level work of saving and restoring CPU state, switching address spaces, and transferring control.
Key distinction: the scheduler implements policy (which process should run next), while the dispatcher implements mechanism (how control is actually transferred). The dispatcher is invoked by the scheduler after a selection decision is made. In many kernel implementations they're tightly coupled, but conceptually they serve distinct roles.
Dispatcher latency:
Dispatcher latency is the time from when the scheduler makes its decision to when the new process starts executing. This latency is critical for real-time systems that must meet deadlines and for interactive workloads where users directly perceive delays.
Typical dispatcher latencies: on modern hardware, roughly half a microsecond for a thread switch up to tens of microseconds for a cross-NUMA process switch; the comparison table later in this section breaks these costs down by switch type.
Dispatcher latency comes from multiple sources: register save/restore time, memory mapping switch, TLB flush overhead, cache pollution effects, and interrupt handling delays. Modern hardware provides features (ASID, PCID, hardware context switch support) to minimize these overheads, but they remain non-zero.
The context switch is the heart of the dispatch operation. It requires saving the complete execution state of one process and restoring another's state, all while the CPU continues executing kernel code. Let's examine this operation in detail.
What constitutes 'context'?
A process's execution context includes everything needed to resume execution exactly where it left off:
| Category | Components | Save Location |
|---|---|---|
| General-Purpose Registers | RAX, RBX, RCX, RDX, RSI, RDI, R8-R15 (x86-64) | PCB / kernel stack |
| Stack Registers | RSP (stack pointer), RBP (base pointer) | PCB / kernel stack |
| Instruction Pointer | RIP (program counter) | PCB / kernel stack |
| Flags Register | RFLAGS (condition codes, interrupt enable) | PCB / kernel stack |
| Segment Registers | CS, DS, SS, ES, FS, GS | PCB (usually constants) |
| FPU/SIMD State | x87 FPU, XMM, YMM, ZMM registers | Extended save area (XSAVE) |
| Debug Registers | DR0-DR7 (hardware breakpoints) | PCB if debugging active |
| Address Space | CR3 (page table base register) | mm_struct pointer in PCB |
| Kernel Stack | Per-process kernel stack pointer | PCB |
The switch sequence on x86-64 Linux:
Context switching involves both C code for the high-level logic and assembly code for the actual register manipulation. Here's the essential sequence:
```c
// Simplified context switch - Linux kernel
// Real implementation in arch/x86/kernel/process_64.c

struct task_struct *__switch_to(struct task_struct *prev,
                                struct task_struct *next)
{
    struct thread_struct *prev_sp = &prev->thread;
    struct thread_struct *next_sp = &next->thread;

    // 1. Save FPU/SSE/AVX state of previous process
    //    Uses XSAVE instruction for efficiency
    fpu__switch_save(prev);

    // 2. Save kernel stack pointer in prev's PCB
    prev_sp->sp = current_stack_pointer;

    // 3. Switch kernel stacks
    //    The new process will resume on its own stack
    load_sp(next_sp->sp);

    // 4. Switch address space (if different process, not thread)
    if (prev->mm != next->mm) {
        // Load CR3 with new page table base
        load_cr3(next->mm->pgd);
        // Flush TLB entries (or use PCID)
        // Modern CPUs can skip this with PCID
    }

    // 5. Switch FS/GS segment bases (for TLS)
    load_fsbase(next->thread.fsbase);
    load_gsbase(next->thread.gsbase);

    // 6. Restore FPU/SSE/AVX state of next process
    fpu__switch_restore(next);

    // 7. Update per-CPU "current" pointer
    this_cpu_write(current_task, next);

    // 8. Return - execution continues as next process
    return prev;  // Returned to caller in next's context
}
```
```asm
# Simplified x86-64 context switch assembly
# Actual code in arch/x86/entry/entry_64.S

.globl switch_to_asm
switch_to_asm:
    # Save callee-saved registers of prev on its stack
    pushq %rbp
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    pushfq                              # Save flags

    # Save current stack pointer in prev's PCB
    movq %rsp, TASK_THREAD_SP(%rdi)     # prev->thread.sp = rsp

    # Switch to next's kernel stack
    movq TASK_THREAD_SP(%rsi), %rsp     # rsp = next->thread.sp

    # Restore callee-saved registers of next from its stack
    popfq                               # Restore flags
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbx
    popq %rbp

    # Return to caller
    # But now we're in next's context!
    # The return address on stack belongs to next
    ret
```

The key insight is that the final 'ret' returns to different code depending on whose stack is active. By switching stacks, we switch the return address sitting on top of the stack to the one saved when the next process was suspended, so 'ret' lands exactly where that process left off. This is why a context switch 'returns' into the new process seamlessly.
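You can observe the same trick from user space. The sketch below uses the POSIX ucontext functions (getcontext, makecontext, swapcontext; still available in glibc, though removed from newer POSIX) as a runnable analogue: swapcontext saves the current registers and stack pointer, then resumes another context exactly where it previously stopped:

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];        /* the coroutine's own stack */

static void coroutine(void)
{
    printf("coroutine: first run\n");
    swapcontext(&co_ctx, &main_ctx);    /* "dispatch" back to main */
    printf("coroutine: resumed exactly where it left off\n");
}

int main(void)
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link          = &main_ctx;    /* return here when it ends */
    makecontext(&co_ctx, coroutine, 0);

    printf("main: dispatching coroutine\n");
    swapcontext(&main_ctx, &co_ctx);    /* switch stacks, like the kernel */
    printf("main: back; dispatching again\n");
    swapcontext(&main_ctx, &co_ctx);    /* resumes after its swapcontext */
    printf("main: done\n");
    return 0;
}
```

Each swapcontext call "returns" on a different stack than the one it was called on, which is precisely the behavior the kernel's switch_to_asm achieves with its final 'ret'.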
When dispatching a process with a different address space, the kernel must switch the memory context. This involves changing the page table base register and managing the Translation Lookaside Buffer (TLB). These operations can be expensive, making memory context switching a significant component of dispatch overhead.
At the hardware level, the address-space switch itself is a single instruction, `mov cr3, <address>`, which loads the new page table base; the real cost lies in its side effects on the TLB.
Address Space Layout Randomization (ASLR) impact:
Modern systems randomize process address spaces for security (ASLR). This means that even two instances of the same program have different memory layouts, so their page tables and cached translations cannot be shared; each dispatch to a different process installs a genuinely distinct set of mappings.
The KPTI challenge (Meltdown mitigation):
Kernel Page Table Isolation (KPTI), implemented to mitigate the Meltdown vulnerability, adds context switch overhead: each process gets two page table sets, one including kernel mappings and one without them, and every transition between user and kernel mode requires an extra CR3 write. This effectively doubles the page-table switching work per context switch (see the 'Process (KPTI)' row in the cost table later in this section).
| Strategy | Mechanism | Overhead | Hardware Required |
|---|---|---|---|
| Full Flush | Invalidate all TLB entries on switch | High - cold TLB for new process | None |
| Global Pages | Kernel pages never flushed | Medium - user TLB cold | Global bit support |
| ASID/PCID | Tag entries with process ID, no flush needed | Low - TLB entries persist | ASID/PCID support |
| Page-Specific Flush | INVLPG for specific modified pages | Variable - depends on changes | INVLPG instruction |
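To make the ASID/PCID row concrete, here is a conceptual sketch (illustrative, not the Linux implementation; this must run in ring 0): with CR4.PCIDE enabled, the low 12 bits of CR3 carry a PCID tag, and setting bit 63 of the value written tells the CPU to keep that PCID's TLB entries instead of flushing them:

```c
#include <stdint.h>

#define CR3_NOFLUSH (1ULL << 63)    /* bit 63: preserve this PCID's TLB entries */

static inline void write_cr3(uint64_t val)
{
    __asm__ volatile("mov %0, %%cr3" :: "r"(val) : "memory");
}

/* Switch address spaces while keeping TLB entries tagged with this PCID.
 * pgd_phys: physical address of the new top-level page table. */
void switch_mm_pcid(uint64_t pgd_phys, uint16_t pcid)
{
    write_cr3(pgd_phys | (pcid & 0xFFF) | CR3_NOFLUSH);
}
```

With 12 bits of PCID, up to 4,096 address spaces can keep translations resident in the TLB simultaneously, which is why the table rates this strategy's overhead as low.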
Threads within the same process share an address space, so dispatching between threads avoids the expensive page table switch and TLB flush. This is one reason multi-threading often outperforms multi-processing for CPU-bound workloads—context switches are cheaper.
Modern CPUs have extensive state beyond general-purpose registers: floating-point unit (FPU) registers, SIMD registers (SSE, AVX, AVX-512), and various control/status registers. This state can be substantial—AVX-512 adds 2KB+ of register state per process. Efficient handling of this state is crucial for dispatch performance.
| Component | Size | Content | Save/Restore |
|---|---|---|---|
| x87 FPU | 112 bytes | 8 x 80-bit floating-point registers + control/status | FXSAVE/FXRSTOR or XSAVE |
| SSE/SSE2 | 256 bytes | 16 x 128-bit XMM registers | FXSAVE/FXRSTOR or XSAVE |
| AVX/AVX2 | 512 bytes | 16 x 256-bit YMM registers (upper 128 bits) | XSAVE only |
| AVX-512 | 2KB+ | 32 x 512-bit ZMM registers + mask registers | XSAVE only |
| MPX | 64 bytes | Bound registers for memory protection | XSAVE only |
| PKRU | 8 bytes | Protection keys register | XSAVE only |
Lazy FPU context switching:
Historically, operating systems used 'lazy' FPU context switching to avoid unnecessary save/restore: on a switch, the kernel left the old FPU state in the registers and set a trap flag (CR0.TS on x86) so that the next FPU instruction would fault; only if the new process actually touched the FPU did the fault handler save the old owner's state and restore the new one.
This optimization avoided FPU save/restore for processes that didn't use floating-point. However, with SSE being ubiquitous (used even by memcpy), and security concerns (speculative execution attacks), lazy FPU is now mostly deprecated. Modern kernels often do 'eager' FPU switching.
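A conceptual sketch of that mechanism appears below; the helper names (set_cr0_ts, fpu_save, and so on) are illustrative stand-ins for privileged operations, not real kernel APIs:

```c
struct task_struct;                        /* opaque task type for this sketch */

/* Illustrative stubs standing in for privileged operations. */
void set_cr0_ts(void);                     /* arm the FPU trap (CR0.TS = 1) */
void clear_cr0_ts(void);                   /* disarm it (CR0.TS = 0) */
void fpu_save(struct task_struct *t);
void fpu_restore(struct task_struct *t);

static struct task_struct *fpu_owner;      /* task whose state is in the FPU */

/* On context switch: don't touch FPU state, just arm the trap. */
void lazy_fpu_on_switch(void)
{
    set_cr0_ts();                          /* next FPU instruction raises #NM */
}

/* #NM (device-not-available) exception handler. */
void handle_nm_exception(struct task_struct *current_task)
{
    clear_cr0_ts();                        /* permit FPU use again */
    if (fpu_owner != current_task) {
        if (fpu_owner)
            fpu_save(fpu_owner);           /* evict the previous owner's state */
        fpu_restore(current_task);         /* load this task's state */
        fpu_owner = current_task;
    }
    /* The faulting FPU instruction re-executes on return. */
}
```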
```c
// Modern FPU/SIMD state management with XSAVE
// Simplified from Linux arch/x86/kernel/fpu/

struct fpu_state {
    // Dynamically sized based on CPU features
    // Can be 512 bytes (SSE) to 2KB+ (AVX-512)
    u8 state[XSAVE_SIZE];
};

void fpu_save(struct task_struct *task)
{
    // Use XSAVE instruction to save all enabled components
    // The bitmap in XCR0 controls which components to save
    // Efficient: only saves components that have been modified
    // Uses "optimized" XSAVE that checks for actual use
    xsave_to_memory(&task->thread.fpu.state);
}

void fpu_restore(struct task_struct *task)
{
    // Use XRSTOR to restore state
    // Only restores components that were saved
    xrstor_from_memory(&task->thread.fpu.state);
}

// During context switch
void switch_fpu_context(struct task_struct *prev,
                        struct task_struct *next)
{
    // Always save/restore in modern kernels (eager switching)
    if (cpu_has_xsave) {
        fpu_save(prev);
        fpu_restore(next);
    } else {
        // Legacy FXSAVE/FXRSTOR for older CPUs
        fxsave(&prev->thread.fpu);
        fxrstor(&next->thread.fpu);
    }
}
```

The XSAVE instruction family is extensible and efficient: it only saves and restores state that has been modified, tracked via the XSTATE_BV bitmap. A process using only scalar math touches just a few hundred bytes of its save area, while one using AVX-512 may need 2KB+. The hardware tracks which components have been used since the last XRSTOR.
In multiprocessor systems, the dispatcher must decide which CPU will run the selected process. This decision involves balancing load across processors against the benefits of keeping processes on the same CPU (cache affinity).
Setting CPU affinity programmatically:
Applications can control which CPUs they run on using scheduler affinity APIs:
```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    cpu_set_t mask;

    // Get current affinity
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("Current affinity: ");
        for (int i = 0; i < CPU_SETSIZE; i++) {
            if (CPU_ISSET(i, &mask))
                printf("%d ", i);
        }
        printf("\n");
    }

    // Set affinity to CPUs 0 and 1 only
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    CPU_SET(1, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) == 0) {
        printf("Affinity set to CPUs 0 and 1\n");
    }

    // Process will now only be dispatched to CPUs 0 or 1
    // Useful for:
    // - Keeping threads on same NUMA node
    // - Isolating from other workloads
    // - Real-time predictability

    return 0;
}
```

The load balancer:
Linux runs a periodic load balancer that migrates processes between CPUs to maintain balance: it compares run-queue loads across the scheduling domains (SMT siblings, cores, sockets, NUMA nodes) and moves tasks from overloaded CPUs toward idle ones.
Migrating a process to a different CPU invalidates cache benefits. The process's data must be reloaded from higher-level caches or main memory. For memory-intensive workloads, a single migration can cost milliseconds in cache misses. Schedulers are conservative about migration—a 5% load imbalance is often tolerated to avoid migration overhead.
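That tolerance might look like the following sketch; the threshold and names are made up for illustration, not Linux's actual balancer logic:

```c
/* Decide whether pulling a task from the busiest CPU is worthwhile.
 * Small imbalances are tolerated: a migrated task restarts with cold
 * caches, which can cost more than the imbalance it fixes. */
int should_migrate(int busiest_nr_running, int idlest_nr_running)
{
    /* never steal a CPU's only task, and require >5% imbalance */
    return busiest_nr_running >= 2 &&
           busiest_nr_running * 100 > idlest_nr_running * 105;
}
```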
The Dispatch transition occurs in various contexts, each with different characteristics and optimizations:
Switching between threads of the same process:
This is the lightest form of dispatch: the address space is shared, so there is no page table switch and no TLB flush; only the registers, stack pointers, and FPU state are exchanged, along with the kernel stack.
Thread switching is typically 2-5x faster than full process switching because it avoids the expensive memory context change. This is why thread pools and async programming models are popular for I/O-bound workloads.
| Switch Type | Operations Required | Typical Cost | Cache Impact |
|---|---|---|---|
| Thread (same process) | Register + FPU save/restore | 0.5-2 μs | Minimal |
| Process (PCID enabled) | Thread + page table switch | 2-5 μs | Moderate - TLB maintained |
| Process (no PCID) | Thread + page table + TLB flush | 5-15 μs | Significant - cold TLB |
| Process (KPTI) | Process + double page table switch | 10-30 μs | High |
| Cross-NUMA | Process + memory latency | 20-100 μs | Very high - remote cache |
In virtual machines, context switching becomes more complex. The hypervisor may need to intercept CR3 loads (causing a 'VM exit' in VMX terminology), and nested page tables add another level of translation. Container-based virtualization (Docker) avoids this—containers share the host kernel and experience native context switch costs.
The dispatch operation, while necessary for multitasking, represents pure overhead—CPU time spent switching rather than doing useful work. Understanding and minimizing dispatch overhead is crucial for system performance.
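A quick worked example, using illustrative figures consistent with the cost table above: at 5 μs per process switch and 2,000 switches per second on a core, dispatching consumes 2,000 × 5 μs = 10 ms out of every second, about 1% of that core, before counting the indirect cost of cold caches and TLBs after each switch.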
Strategies to reduce dispatch overhead: use longer time quanta so switch costs are amortized over more useful work, set CPU affinity to preserve cache warmth, rely on PCID/ASID hardware to avoid TLB flushes, prefer threads over separate processes where isolation allows, and use event-driven I/O so one thread serves many connections.
Measuring dispatch overhead:
You can measure context switch costs with tools like lmbench or simple benchmarks:
```bash
# Using perf to count context switches
perf stat -e context-switches,cpu-clock ./my_application
# Example output:
#   15,234  context-switches
#   2,500.00 msec cpu-clock
# i.e. ~6.1 switches/ms, one switch every ~164 μs of CPU time

# Using lmbench for precise measurement
lat_ctx -s 0 2               # Measure 2-process context switch time
# Output: "size=0k ovr=0.59 2 1.98"
# Shows ~2 μs per context switch

# Monitor system-wide context switches
vmstat 1 | awk '{print $12}' # 'cs' column
# Shows context switches per second

# Detailed per-CPU scheduling statistics
cat /proc/schedstat
```

High-performance servers (nginx, Node.js, Redis) minimize context switches by using event-driven architectures. Instead of spawning threads/processes per connection, they use a single thread with epoll/kqueue to handle thousands of connections. This dramatically reduces dispatch overhead for I/O-bound workloads.
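If you want a self-contained version of the 'simple benchmark' mentioned above, the sketch below forces context switches by ping-ponging a byte between two processes over a pair of pipes; it assumes Linux/POSIX and reports a rough per-switch figure (pinning both processes to one CPU with taskset gives a cleaner number):

```c
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    int p2c[2], c2p[2];             /* parent->child and child->parent pipes */
    const long N = 100000;
    char buf = 'x';

    if (pipe(p2c) != 0 || pipe(c2p) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {                 /* child: echo each byte straight back */
        for (long i = 0; i < N; i++) {
            read(p2c[0], &buf, 1);
            write(c2p[1], &buf, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) {
        write(p2c[1], &buf, 1);     /* wake child (dispatch), then block */
        read(c2p[0], &buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wait(NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* each round trip includes at least two context switches */
    printf("~%.0f ns per context switch\n", ns / (2.0 * N));
    return 0;
}
```

Because each write blocks the sender until the other process runs, every round trip contains at least two dispatches; the measured figure also includes pipe syscall overhead, so treat it as an upper bound.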
We've now thoroughly examined the Dispatch transition—the mechanism by which processes in the Ready queue gain control of a CPU and begin (or resume) execution.
Looking ahead:
Now that we understand how processes get onto the CPU (Dispatch), we need to understand what happens when their time runs out. The Timeout transition (Running → Ready) occurs when a process exhausts its time quantum—a crucial mechanism for ensuring fairness in multitasking systems.
You now understand the Dispatch transition (Ready → Running) comprehensively—from scheduler selection through context switching mechanics, memory management, FPU state handling, and CPU affinity. Next, we'll examine the Timeout transition that returns processes to the Ready state.