For decades, computer performance growth followed a simple trajectory: faster clock speeds meant faster programs. Engineers and programmers rode Moore's Law, confident that each generation of processors would automatically make their software run faster. Then, around 2004, the industry hit a wall—the power wall, the frequency wall, and the instruction-level parallelism wall simultaneously. Clock speeds stagnated, but the demand for computing power continued to accelerate.
The solution? Parallelism at the processor level. Instead of one faster core, modern systems deploy multiple cores working in harmony. But this architectural shift fundamentally changes how operating systems must approach CPU scheduling. A scheduler designed for a single processor becomes inadequate—even counterproductive—in a multi-processor environment.
By the end of this page, you will understand Symmetric Multiprocessing (SMP) at a fundamental level—its architectural principles, how processors share resources, why this architecture dominates modern computing, and the profound implications it creates for operating system schedulers. This knowledge forms the bedrock for understanding all multi-processor scheduling strategies.
Symmetric Multiprocessing (SMP) is a multiprocessor architecture where two or more identical processors are connected to a single, shared main memory and are controlled by a single operating system instance. The term "symmetric" is crucial—it defines the fundamental property that distinguishes SMP from other multiprocessor architectures.
The Symmetry Principle:
In an SMP system, all processors are functionally identical and equally privileged:
This symmetry creates an elegant abstraction: from the operating system's perspective, processors are interchangeable. A ready process can be dispatched to any available processor without architectural constraints. This property, while conceptually simple, enables sophisticated load balancing and failure tolerance.
The symmetry in SMP isn't just an implementation detail—it's a design contract that simplifies operating system scheduling. The scheduler can treat the processor pool as a homogeneous resource. Without symmetry, the scheduler would need to track processor-specific capabilities, leading to complex assignment logic and potential bottlenecks. SMP's uniformity enables general-purpose scheduling algorithms that scale naturally with processor count.
Architectural Components of SMP:
An SMP system consists of several tightly integrated components that work together to present a coherent computing environment:
The shared memory characteristic of SMP is not merely a technical feature—it fundamentally shapes how software is written and how operating systems manage concurrent execution. Understanding its implications is essential for effective multi-processor scheduling.
Memory Access Model:
In an ideal SMP system, all processors have Uniform Memory Access (UMA)—the latency to access any memory location is the same for all processors. While true UMA is approximated in small-scale SMP systems, real hardware has nuanced access characteristics:
These differences create implicit "locality" even in UMA systems—data in a processor's cache hierarchy is dramatically faster to access than data elsewhere.
| Memory Level | Typical Latency | Size (per core) | Sharing Scope |
|---|---|---|---|
| L1 Data Cache | ~4 cycles (1-2 ns) | 32-64 KB | Private to core |
| L1 Instruction Cache | ~4 cycles (1-2 ns) | 32-64 KB | Private to core |
| L2 Cache | ~12 cycles (3-5 ns) | 256 KB - 1 MB | Private or shared pair |
| L3 Cache (LLC) | ~40 cycles (10-20 ns) | 8-64 MB total | Shared across all cores |
| Main Memory (DRAM) | ~200 cycles (50-100 ns) | GBs - TBs | Shared across system |
| Remote NUMA Memory | ~300+ cycles (100+ ns) | Varies | Accessible but distant |
Implications for Scheduling:
The memory hierarchy creates significant scheduling considerations:
Cache Affinity: When a process runs on a processor, its working set becomes cached in that processor's cache hierarchy. If the scheduler moves the process to a different processor, the cache must be repopulated—a "cold start" penalty that can cost millions of cycles for large working sets.
False Sharing: When two processors modify different variables that happen to reside on the same cache line, the coherence protocol forces cache invalidations even though no actual data sharing occurs. The scheduler can inadvertently exacerbate this by frequently moving processes.
Memory Bandwidth Contention: Even with shared memory access, the bus or interconnect has finite bandwidth. Too many processors accessing memory simultaneously creates contention, degrading overall system performance.
These factors mean that optimal SMP scheduling isn't simply about keeping all processors busy—it must also respect the memory system's physical reality.
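The false-sharing effect described above is easy to reproduce. Below is a minimal user-space C sketch (the 64-byte line size and the iteration count are assumptions for illustration): two threads each increment their own counter, and whether the two counters share a cache line or are padded onto separate lines makes a large runtime difference with no change in program logic.

```c
/* Illustrative false-sharing demo; assumes 64-byte cache lines. */
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 100000000UL

/* Packed: both counters live on the same cache line, so each write by
 * one thread invalidates the line in the other thread's cache. */
struct packed_counters {
    unsigned long a;
    unsigned long b;
};

/* Padded: each counter is aligned to its own 64-byte line, so the two
 * threads never touch the same line and no coherence ping-pong occurs. */
struct padded_counters {
    _Alignas(64) unsigned long a;
    _Alignas(64) unsigned long b;
};

static struct padded_counters counters;   /* try struct packed_counters too */

static void *worker_a(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        counters.a++;
    return NULL;
}

static void *worker_b(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker_a, NULL);
    pthread_create(&t2, NULL, worker_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%lu b=%lu\n", counters.a, counters.b);
    return 0;
}
```

Switching the type of `counters` from the padded struct to the packed one typically slows the program down dramatically on multi-core hardware, purely from coherence traffic.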
Scheduler decisions that appear optimal from a load-balancing perspective can be catastrophic for performance due to cache effects. Moving a process from CPU 0 to CPU 3 might balance load perfectly, but if that process has 2 MB of hot data in CPU 0's L2 cache, the migration cost could exceed any benefit. Modern schedulers must balance load distribution against cache preservation—a fundamental tension with no universal solution.
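Operating systems expose affinity controls so software can opt out of migration when cache state matters. Here is a minimal sketch using the Linux `sched_setaffinity` call to pin the calling process to CPU 0 (the choice of CPU 0 and the minimal error handling are illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;

    CPU_ZERO(&mask);        /* start with an empty CPU set   */
    CPU_SET(0, &mask);      /* allow execution only on CPU 0 */

    /* Pin the calling process (pid 0 = self) to the chosen CPU.
     * The scheduler will no longer migrate it, preserving whatever
     * cache state it has built up on that core. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("Now restricted to CPU 0\n");
    return 0;
}
```

As a rough back-of-envelope using the latency table above: 2 MB of hot data is about 32,768 cache lines of 64 bytes, and refilling them from DRAM at roughly 200 cycles apiece costs on the order of several million cycles, which is why avoiding an ill-timed migration can matter more than perfect load balance.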
Perhaps the most sophisticated aspect of SMP hardware is the cache coherence protocol—the mechanism that ensures all processors observe a consistent view of memory despite each maintaining local cached copies. Without coherence, processors could operate on stale data, leading to impossible-to-debug corruption.
The Coherence Problem:
Consider a simple scenario: a shared variable x starts at 0 in main memory. Processor P0 loads x into its cache and writes x = 1; with a write-back cache, the new value sits only in P0's cache while main memory still holds 0. Processor P1 then reads x.
Without coherence, P1 sees 0, creating an inconsistent view of memory that violates the shared memory abstraction. Cache coherence protocols solve this by tracking cache line states and propagating updates.
The MESI Protocol:
The most widely implemented coherence protocol is MESI, named for its four cache line states:
- Modified (M): the line is held only in this cache and has been written; main memory is stale.
- Exclusive (E): the line is held only in this cache and matches main memory.
- Shared (S): the line may be held in several caches; all copies match main memory.
- Invalid (I): the line holds no valid data.

State transitions (summarizing the MESI transition diagram):

- Read miss, no other cache holds the line: I → E (this cache is the only holder)
- Read miss, other caches hold copies: I → S
- Write hit in Exclusive: E → M (silent upgrade, no bus traffic)
- Write hit in Shared: S → M (other copies must be invalidated first)
- Remote read request while Modified: M → S (write back, then share the data)
- Remote write request: M/E/S → I (invalidate the local copy)

Coherence Traffic and Scalability:
Cache coherence is not free. Every state transition generates traffic on the interconnect:
As processor count increases, coherence traffic grows—potentially quadratically in pathological cases. This is why traditional bus-based SMP systems rarely scale beyond 8-16 processors. Modern systems use directory-based coherence and hierarchical interconnects to manage this complexity.
For the OS scheduler, cache coherence is largely transparent—hardware handles it automatically. However, understanding coherence explains why certain scheduling patterns have performance implications. Scheduling threads that frequently share data onto the same physical package (where L3 is shared) reduces coherence traffic compared to spreading them across packages, even in a "symmetric" system.
Running an operating system on an SMP platform introduces fundamental challenges that don't exist in uniprocessor systems. The OS must now handle true concurrency—multiple processors executing kernel code simultaneously—while maintaining correctness and performance.
The Single-Kernel Image:
In SMP, all processors run the same operating system kernel. This is a crucial simplification compared to alternatives, but it creates coordination requirements:
Without careful design, an SMP kernel can exhibit race conditions, deadlocks, or worse—silent data corruption.
Evolution of Kernel Locking:
OS kernels have evolved through several approaches to handle SMP concurrency:
1. Giant Kernel Lock (GKL) / Big Kernel Lock (BKL):
The simplest approach—a single lock protecting the entire kernel. Only one processor can execute kernel code at a time.
2. Coarse-Grained Locking:
Divide the kernel into subsystems, each with its own lock. Processors can execute in different subsystems simultaneously.
3. Fine-Grained Locking:
Every data structure has its own lock. Maximum parallelism at the cost of complexity.
4. Lock-Free and Wait-Free Structures:
Modern kernels use atomic operations and carefully designed algorithms to avoid locks entirely for some operations.
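To give a flavor of the lock-free style, here is a small user-space sketch using C11 atomics (a simplification, not a kernel data structure): a compare-and-swap retry loop increments a shared counter without any lock, so no processor ever blocks waiting for another.

```c
#include <stdatomic.h>

static _Atomic unsigned long counter;

/* Lock-free increment: read the current value, try to install
 * value + 1 with a compare-and-swap, and retry if another CPU won
 * the race. No thread ever sleeps or spins on a lock. */
void counter_increment(void)
{
    unsigned long old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
        /* 'old' has been updated to the current value; retry with it. */
    }
}
```

For a plain counter, `atomic_fetch_add` would be simpler; the explicit CAS loop is shown because it is the pattern that generalizes to more complex lock-free updates.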
| Strategy | Parallelism | Complexity | Risk | Use Case |
|---|---|---|---|---|
| Giant Lock | Minimal | Very Low | None | Legacy systems, simple kernels |
| Coarse-Grained | Moderate | Low | Low | Subsystem isolation, mixed workloads |
| Fine-Grained | High | High | Moderate | High-performance servers |
| Lock-Free | Maximum | Very High | High (if incorrect) | Hot paths, read-heavy structures |
Contemporary operating systems like Linux use a hybrid approach: fine-grained locking for most subsystems, lock-free structures for critical hot paths (like scheduler run queues), and RCU (Read-Copy-Update) for read-heavy data structures. This pragmatic mix balances performance with maintainability.
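To illustrate the RCU idea, here is a conceptual sketch in the style of the Linux kernel's RCU API (simplified; writer serialization and error handling are omitted, and the `struct config` type is invented for the example): readers traverse the data with no locks at all, while a writer publishes a new copy and reclaims the old one only after all pre-existing readers have finished.

```c
/* Conceptual: RCU-protected read-mostly configuration data. */
struct config {
    int timeout;
    /* ... */
};

struct config *global_config;                    /* RCU-protected pointer */

int read_timeout(void)
{
    int timeout;

    rcu_read_lock();                             /* begin lockless read-side section */
    timeout = rcu_dereference(global_config)->timeout;
    rcu_read_unlock();
    return timeout;
}

void update_timeout(int new_timeout)
{
    /* Writers are assumed to be serialized by an external lock (omitted). */
    struct config *new_cfg = kmalloc(sizeof(*new_cfg), GFP_KERNEL);
    struct config *old_cfg = global_config;

    *new_cfg = *old_cfg;                         /* copy ...                */
    new_cfg->timeout = new_timeout;              /* ... modify the copy ... */
    rcu_assign_pointer(global_config, new_cfg);  /* ... publish it          */

    synchronize_rcu();                           /* wait for existing readers */
    kfree(old_cfg);                              /* now safe to reclaim       */
}
```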
The scheduler in an SMP system faces challenges that fundamentally differ from uniprocessor scheduling. It must not only decide which process runs, but also where (on which processor) and when to migrate processes between processors.
Key Architectural Decisions:
1. Run Queue Organization:
There are two fundamental approaches to organizing ready processes:
Global Run Queue: A single queue of ready processes, shared by all processors.
Per-Processor Run Queues: Each processor maintains its own queue of ready processes.
Modern SMP schedulers universally use per-processor run queues. The scalability benefits are decisive—global queue lock contention limits practical scalability to perhaps 8-16 processors, while per-processor queues scale to hundreds of cores.
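A conceptual per-CPU run queue layout might look like the sketch below (field names are illustrative, not the actual Linux `struct rq`): each processor owns its queue and its lock, and each queue is padded to its own cache line so that two processors' scheduling state never falsely shares a line. The `cpu_run_queue()` helper used in the IPI example later on this page is assumed to work this way.

```c
#define CACHE_LINE_SIZE 64
#define MAX_CPUS        256

/* Conceptual: one run queue per processor. In the common case a CPU
 * only touches its own queue, so there is no lock contention and no
 * cache-line ping-pong between processors. */
struct run_queue {
    spinlock_t    lock;          /* protects this queue only            */
    unsigned int  nr_running;    /* number of ready tasks queued here   */
    struct task  *head;          /* ready tasks, in dispatch order      */
    unsigned long last_balance;  /* timestamp of last load-balance pass */
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct run_queue run_queues[MAX_CPUS];

static inline struct run_queue *cpu_run_queue(int cpu)
{
    return &run_queues[cpu];
}
```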
2. Load Balancing Strategy:
With per-processor queues, load balancing becomes an explicit scheduler responsibility. The scheduler must periodically compare queue lengths across processors, detect significant imbalance, and migrate processes from overloaded processors to underloaded ones.
Linux, for example, implements both push and pull migration: a periodic balancer pushes tasks away from overloaded CPUs, and a CPU that is about to go idle pulls waiting tasks from busier ones.
The frequency and aggressiveness of load balancing creates tradeoffs. Too frequent rebalancing disrupts cache affinity; too infrequent allows persistent imbalance.
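A conceptual pull-style balancer, continuing the same sketch (all helper functions here are assumed for illustration): a processor periodically compares its queue length with the busiest processor's and migrates a task only when the imbalance exceeds a threshold, so small transient imbalances never pay the cache-affinity cost of migration.

```c
/* Conceptual: periodic pull-style load balancing for one CPU. */
#define IMBALANCE_THRESHOLD 2   /* tolerate small differences */

void balance_cpu(int this_cpu)
{
    struct run_queue *this_rq = cpu_run_queue(this_cpu);
    int busiest_cpu = find_busiest_cpu();            /* assumed helper */
    struct run_queue *busiest_rq = cpu_run_queue(busiest_cpu);

    /* Only migrate when the imbalance is large enough to outweigh
     * the cache-affinity cost of moving a task. */
    if (busiest_rq->nr_running <
        this_rq->nr_running + IMBALANCE_THRESHOLD)
        return;

    /* Lock both queues in a fixed order to avoid deadlock. */
    lock_two_queues(this_rq, busiest_rq);

    struct task *victim = pick_migration_candidate(busiest_rq);
    if (victim) {
        dequeue_task(busiest_rq, victim);
        enqueue_task(this_rq, victim);               /* pull it over */
    }

    unlock_two_queues(this_rq, busiest_rq);
}
```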
The shift from global to per-processor run queues was one of the most significant scheduler architecture changes in OS history. Linux 2.4 used a global queue and struggled past 4 CPUs. Linux 2.6's O(1) scheduler introduced per-CPU queues, and the current CFS scheduler continues this design. This change enabled Linux to scale from desktop machines to servers with hundreds of cores.
SMP systems require mechanisms for processors to communicate with each other. While shared memory provides one form of communication, the operating system often needs immediate, targeted signaling between processors.
Inter-Processor Interrupts (IPIs):
IPIs are hardware-level signals that one processor sends to another, forcing the target processor to respond immediately. Common uses include rescheduling notifications (as in the example below), TLB shootdowns when page mappings change, and requests to execute a function on a specific remote CPU.
IPIs are powerful but expensive—they force the target processor to interrupt its current work, save context, handle the IPI, and restore context. Overuse of IPIs can significantly impact performance.
```c
/* Conceptual: Scheduler IPI to wake an idle CPU */
void wake_up_task(struct task *task, int target_cpu)
{
    /* Add task to target CPU's run queue */
    struct run_queue *rq = cpu_run_queue(target_cpu);

    spin_lock(&rq->lock);
    enqueue_task(rq, task);
    spin_unlock(&rq->lock);

    /* If target CPU is idle, send IPI to wake it */
    if (cpu_is_idle(target_cpu)) {
        send_ipi(target_cpu, IPI_RESCHEDULE);
    }

    /*
     * IPI handler on target CPU:
     * 1. Receives interrupt
     * 2. Sets TIF_NEED_RESCHED flag
     * 3. Returns from interrupt
     * 4. Kernel checks flag, calls scheduler
     * 5. Scheduler picks newly-enqueued task
     */
}

void handle_reschedule_ipi(void)
{
    /* Minimal handler - just set flag */
    set_thread_flag(TIF_NEED_RESCHED);
    /* Actual rescheduling happens on return to user/kernel */
}
```

Memory Barriers and Synchronization:
Modern processors execute instructions out of order and cache writes for efficiency. In SMP systems, this can cause one processor to observe memory operations in a different order than another processor executed them—a phenomenon that violates intuitive expectations about program behavior.
Memory barriers (fences) are special instructions that enforce ordering: a store (write) barrier ensures earlier writes become visible before later ones, a load (read) barrier does the same for reads, a full barrier orders both, and acquire/release operations provide one-way ordering around a synchronizing access.
Without proper barriers, SMP code can exhibit bugs that appear only under particular timing and load, vanish when instrumentation or a debugger is added, and surface only on certain processor models or memory models.
These are among the most difficult bugs to diagnose in systems programming.
Memory ordering bugs are insidious because they often work correctly for years before manifesting. The code "works" because most of the time, the processor happens to execute things in program order, or the timing works out coincidentally. Then a new CPU model, kernel version, or workload pattern triggers the latent bug. Always use proper synchronization primitives—never assume memory operations are ordered without explicit barriers.
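The classic message-passing pattern shows what proper synchronization looks like. In the C11 sketch below, the producer writes the data and then sets a flag with release semantics; the consumer reads the flag with acquire semantics before touching the data. Without that pairing, the compiler or the processor could legally reorder the accesses so that the consumer observes the flag but reads a stale payload.

```c
#include <stdatomic.h>

static int payload;                  /* ordinary data        */
static _Atomic int ready;            /* synchronization flag */

/* Producer: write the data, then publish the flag.
 * The release store guarantees the payload write is visible to any
 * CPU whose acquire load later observes ready == 1. */
void producer(void)
{
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: wait for the flag, then read the data.
 * The acquire load prevents the payload read from being hoisted
 * above the flag check. */
int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                            /* spin (illustrative only) */
    return payload;                  /* guaranteed to see 42     */
}
```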
Contemporary SMP systems have evolved far beyond the simple "multiple processors on a bus" model. Understanding modern implementations is essential for effective scheduling decisions.
Multi-Core Processors:
Today's SMP systems are dominated by multi-core processors—multiple CPU cores integrated onto a single chip (die). This integration provides several advantages: cores communicate over a fast on-die interconnect rather than an external bus, typically share a last-level cache, and consume far less power and board space than the equivalent number of single-core sockets.
| Platform | Cores/Threads | Cache Hierarchy | Notable Features |
|---|---|---|---|
| Intel Core i9-13900K | 24C/32T (8P+16E) | L1: 80KB, L2: 32MB, L3: 36MB | Hybrid P-core/E-core architecture |
| AMD Ryzen 9 7950X | 16C/32T | L1: 64KB, L2: 1MB, L3: 64MB | 3D V-Cache option, chiplet design |
| Apple M2 Ultra | 24C (16P+8E) | L1: 192KB, L2: 64MB | Unified memory architecture |
| AWS Graviton3 | 64C | L1: 64KB, L2: 1MB, L3: 32MB | Custom ARM Neoverse cores |
| Intel Xeon Platinum 8480+ | 56C/112T | L1: 48KB, L2: 2MB, L3: 105MB | Server-class, 8-socket capable |
Simultaneous Multithreading (SMT) / Hyper-Threading:
SMT allows a single physical core to present as multiple logical processors (typically 2) to the operating system. While logical processors share execution resources, they have independent architectural state (registers, instruction pointers).
SMT Scheduling Considerations:
Schedulers must understand the SMT topology to make intelligent decisions—scheduling related threads onto SMT siblings can improve cache utilization, while scheduling competing threads may waste resources.
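On Linux, the kernel discovers the SMT topology from firmware and exposes it through sysfs, which is also how user-space tools can inspect it. A minimal sketch (output format and error handling kept deliberately simple):

```c
#include <stdio.h>

/* Print which logical CPUs share a physical core with CPU 0.
 * On Linux this is exposed via sysfs; e.g. "0,16" means logical
 * CPUs 0 and 16 are SMT siblings on the same core. */
int main(void)
{
    char buf[256];
    FILE *f = fopen(
        "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("SMT siblings of CPU 0: %s", buf);
    fclose(f);
    return 0;
}
```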
Modern processors like Intel's 12th/13th gen and Apple's M-series break the SMP symmetry assumption with heterogeneous cores (Performance + Efficiency cores). While architecturally different, the OS still presents these as SMP to applications. The scheduler must now consider core capabilities—a compute-intensive task should run on P-cores, while background work can run on E-cores. This adds complexity but enables better performance-per-watt.
We have explored Symmetric Multiprocessing from its architectural foundations to its modern implementations. This knowledge provides the essential context for understanding all multi-processor scheduling strategies that follow.
Consolidating Our Understanding:
What's Next:
With SMP fundamentals established, we'll contrast this architecture with Asymmetric Multiprocessing (AMP)—an alternative model where processors have differentiated roles. Understanding both models illuminates why SMP dominates general-purpose computing while AMP persists in specialized domains.
You now understand the architectural principles underlying Symmetric Multiprocessing—the foundation upon which all modern operating system schedulers are built. This knowledge will inform every scheduling concept we explore throughout this module, from processor affinity to NUMA-aware scheduling.