For decades, computer performance growth followed a simple trajectory: faster clock speeds meant faster programs. Engineers and programmers rode Moore's Law, confident that each generation of processors would automatically make their software run faster. Then, around 2004, the industry hit a wall—the power wall, the frequency wall, and the instruction-level parallelism wall simultaneously. Clock speeds stagnated, but the demand for computing power continued to accelerate.
The solution? Parallelism at the processor level. Instead of one faster core, modern systems deploy multiple cores working in harmony. But this architectural shift fundamentally changes how operating systems must approach CPU scheduling. A scheduler designed for a single processor becomes inadequate—even counterproductive—in a multi-processor environment.
By the end of this page, you will understand Symmetric Multiprocessing (SMP) at a fundamental level—its architectural principles, how processors share resources, why this architecture dominates modern computing, and the profound implications it creates for operating system schedulers. This knowledge forms the bedrock for understanding all multi-processor scheduling strategies.
Symmetric Multiprocessing (SMP) is a multiprocessor architecture where two or more identical processors are connected to a single, shared main memory and are controlled by a single operating system instance. The term "symmetric" is crucial—it defines the fundamental property that distinguishes SMP from other multiprocessor architectures.
The Symmetry Principle:
In an SMP system, all processors are functionally identical and equally privileged:
This symmetry creates an elegant abstraction: from the operating system's perspective, processors are interchangeable. A ready process can be dispatched to any available processor without architectural constraints. This property, while conceptually simple, enables sophisticated load balancing and failure tolerance.
The symmetry in SMP isn't just an implementation detail—it's a design contract that simplifies operating system scheduling. The scheduler can treat the processor pool as a homogeneous resource. Without symmetry, the scheduler would need to track processor-specific capabilities, leading to complex assignment logic and potential bottlenecks. SMP's uniformity enables general-purpose scheduling algorithms that scale naturally with processor count.
Architectural Components of SMP:
An SMP system consists of several tightly integrated components that work together to present a coherent computing environment:
The shared memory characteristic of SMP is not merely a technical feature—it fundamentally shapes how software is written and how operating systems manage concurrent execution. Understanding its implications is essential for effective multi-processor scheduling.
Memory Access Model:
In an ideal SMP system, all processors have Uniform Memory Access (UMA)—the latency to access any memory location is the same for all processors. While true UMA is approximated in small-scale SMP systems, real hardware has nuanced access characteristics:
These differences create implicit "locality" even in UMA systems—data in a processor's cache hierarchy is dramatically faster to access than data elsewhere.
| Memory Level | Typical Latency | Size (per core) | Sharing Scope |
|---|---|---|---|
| L1 Data Cache | ~4 cycles (1-2 ns) | 32-64 KB | Private to core |
| L1 Instruction Cache | ~4 cycles (1-2 ns) | 32-64 KB | Private to core |
| L2 Cache | ~12 cycles (3-5 ns) | 256 KB - 1 MB | Private or shared pair |
| L3 Cache (LLC) | ~40 cycles (10-20 ns) | 8-64 MB total | Shared across all cores |
| Main Memory (DRAM) | ~200 cycles (50-100 ns) | GBs - TBs | Shared across system |
| Remote NUMA Memory | ~300+ cycles (100+ ns) | Varies | Accessible but distant |
Implications for Scheduling:
The memory hierarchy creates significant scheduling considerations:
Cache Affinity: When a process runs on a processor, its working set becomes cached in that processor's cache hierarchy. If the scheduler moves the process to a different processor, the cache must be repopulated—a "cold start" penalty that can cost millions of cycles for large working sets.
False Sharing: When two processors modify different variables that happen to reside on the same cache line, the coherence protocol forces cache invalidations even though no actual data sharing occurs. The scheduler can inadvertently exacerbate this by frequently moving processes.
Memory Bandwidth Contention: Even with shared memory access, the bus or interconnect has finite bandwidth. Too many processors accessing memory simultaneously creates contention, degrading overall system performance.
These factors mean that optimal SMP scheduling isn't simply about keeping all processors busy—it must also respect the memory system's physical reality.
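The false-sharing effect described above is easy to reproduce. Below is a minimal user-space C sketch (the 64-byte line size and the iteration count are assumptions for illustration): two threads each increment their own counter, and whether the two counters share a cache line or are padded onto separate lines makes a large runtime difference with no change in program logic.

```c
/* Illustrative false-sharing demo; assumes 64-byte cache lines. */
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 100000000UL

/* Packed: both counters live on the same cache line, so each write by
 * one thread invalidates the line in the other thread's cache. */
struct packed_counters {
    unsigned long a;
    unsigned long b;
};

/* Padded: each counter is aligned to its own 64-byte line, so the two
 * threads never touch the same line and no coherence ping-pong occurs. */
struct padded_counters {
    _Alignas(64) unsigned long a;
    _Alignas(64) unsigned long b;
};

static struct padded_counters counters;   /* try struct packed_counters too */

static void *worker_a(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        counters.a++;
    return NULL;
}

static void *worker_b(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker_a, NULL);
    pthread_create(&t2, NULL, worker_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%lu b=%lu\n", counters.a, counters.b);
    return 0;
}
```

Switching the type of `counters` from the padded struct to the packed one typically slows the program down dramatically on multi-core hardware, purely from coherence traffic.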
Scheduler decisions that appear optimal from a load-balancing perspective can be catastrophic for performance due to cache effects. Moving a process from CPU 0 to CPU 3 might balance load perfectly, but if that process has 2 MB of hot data in CPU 0's L2 cache, the migration cost could exceed any benefit. Modern schedulers must balance load distribution against cache preservation—a fundamental tension with no universal solution.
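Operating systems expose affinity controls so software can opt out of migration when cache state matters. Here is a minimal sketch using the Linux `sched_setaffinity` call to pin the calling process to CPU 0 (the choice of CPU 0 and the minimal error handling are illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;

    CPU_ZERO(&mask);        /* start with an empty CPU set   */
    CPU_SET(0, &mask);      /* allow execution only on CPU 0 */

    /* Pin the calling process (pid 0 = self) to the chosen CPU.
     * The scheduler will no longer migrate it, preserving whatever
     * cache state it has built up on that core. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("Now restricted to CPU 0\n");
    return 0;
}
```

As a rough back-of-envelope using the latency table above: 2 MB of hot data is about 32,768 cache lines of 64 bytes, and refilling them from DRAM at roughly 200 cycles apiece costs on the order of several million cycles, which is why avoiding an ill-timed migration can matter more than perfect load balance.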
Perhaps the most sophisticated aspect of SMP hardware is the cache coherence protocol—the mechanism that ensures all processors observe a consistent view of memory despite each maintaining local cached copies. Without coherence, processors could operate on stale data, leading to impossible-to-debug corruption.
The Coherence Problem:
Consider a simple scenario: a shared variable x starts at 0 in main memory. Processor P0 loads x into its cache and writes x = 1; with a write-back cache, the new value sits only in P0's cache while main memory still holds 0. Processor P1 then reads x.
Without coherence, P1 sees 0, creating an inconsistent view of memory that violates the shared memory abstraction. Cache coherence protocols solve this by tracking cache line states and propagating updates.
The MESI Protocol:
The most widely implemented coherence protocol is MESI, named for its four cache line states:
- Modified (M): the line is held only in this cache and has been written; main memory is stale.
- Exclusive (E): the line is held only in this cache and matches main memory.
- Shared (S): the line may be held in several caches; all copies match main memory.
- Invalid (I): the line holds no valid data.

State transitions (summarizing the MESI transition diagram):

- Read miss, no other cache holds the line: I → E (this cache is the only holder)
- Read miss, other caches hold copies: I → S
- Write hit in Exclusive: E → M (silent upgrade, no bus traffic)
- Write hit in Shared: S → M (other copies must be invalidated first)
- Remote read request while Modified: M → S (write back, then share the data)
- Remote write request: M/E/S → I (invalidate the local copy)

Coherence Traffic and Scalability:
Cache coherence is not free. Every state transition generates traffic on the interconnect:
As processor count increases, coherence traffic grows—potentially quadratically in pathological cases. This is why traditional bus-based SMP systems rarely scale beyond 8-16 processors. Modern systems use directory-based coherence and hierarchical interconnects to manage this complexity.
For the OS scheduler, cache coherence is largely transparent—hardware handles it automatically. However, understanding coherence explains why certain scheduling patterns have performance implications. Scheduling threads that frequently share data onto the same physical package (where L3 is shared) reduces coherence traffic compared to spreading them across packages, even in a "symmetric" system.
Running an operating system on an SMP platform introduces fundamental challenges that don't exist in uniprocessor systems. The OS must now handle true concurrency—multiple processors executing kernel code simultaneously—while maintaining correctness and performance.
The Single-Kernel Image:
In SMP, all processors run the same operating system kernel. This is a crucial simplification compared to alternatives, but it creates coordination requirements:
Without careful design, an SMP kernel can exhibit race conditions, deadlocks, or worse—silent data corruption.
Evolution of Kernel Locking:
OS kernels have evolved through several approaches to handle SMP concurrency:
1. Giant Kernel Lock (GKL) / Big Kernel Lock (BKL):
The simplest approach—a single lock protecting the entire kernel. Only one processor can execute kernel code at a time.
2. Coarse-Grained Locking:
Divide the kernel into subsystems, each with its own lock. Processors can execute in different subsystems simultaneously.
3. Fine-Grained Locking:
Every data structure has its own lock. Maximum parallelism at the cost of complexity.
4. Lock-Free and Wait-Free Structures:
Modern kernels use atomic operations and carefully designed algorithms to avoid locks entirely for some operations.
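To give a flavor of the lock-free style, here is a small user-space sketch using C11 atomics (a simplification, not a kernel data structure): a compare-and-swap retry loop increments a shared counter without any lock, so no processor ever blocks waiting for another.

```c
#include <stdatomic.h>

static _Atomic unsigned long counter;

/* Lock-free increment: read the current value, try to install
 * value + 1 with a compare-and-swap, and retry if another CPU won
 * the race. No thread ever sleeps or spins on a lock. */
void counter_increment(void)
{
    unsigned long old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
        /* 'old' has been updated to the current value; retry with it. */
    }
}
```

For a plain counter, `atomic_fetch_add` would be simpler; the explicit CAS loop is shown because it is the pattern that generalizes to more complex lock-free updates.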
| Strategy | Parallelism | Complexity | Risk | Use Case |
|---|---|---|---|---|
| Giant Lock | Minimal | Very Low | None | Legacy systems, simple kernels |
| Coarse-Grained | Moderate | Low | Low | Subsystem isolation, mixed workloads |
| Fine-Grained | High | High | Moderate | High-performance servers |
| Lock-Free | Maximum | Very High | High (if incorrect) | Hot paths, read-heavy structures |
Contemporary operating systems like Linux use a hybrid approach: fine-grained locking for most subsystems, lock-free structures for critical hot paths (like scheduler run queues), and RCU (Read-Copy-Update) for read-heavy data structures. This pragmatic mix balances performance with maintainability.
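To illustrate the RCU idea, here is a conceptual sketch in the style of the Linux kernel's RCU API (simplified; writer serialization and error handling are omitted, and the `struct config` type is invented for the example): readers traverse the data with no locks at all, while a writer publishes a new copy and reclaims the old one only after all pre-existing readers have finished.

```c
/* Conceptual: RCU-protected read-mostly configuration data. */
struct config {
    int timeout;
    /* ... */
};

struct config *global_config;                    /* RCU-protected pointer */

int read_timeout(void)
{
    int timeout;

    rcu_read_lock();                             /* begin lockless read-side section */
    timeout = rcu_dereference(global_config)->timeout;
    rcu_read_unlock();
    return timeout;
}

void update_timeout(int new_timeout)
{
    /* Writers are assumed to be serialized by an external lock (omitted). */
    struct config *new_cfg = kmalloc(sizeof(*new_cfg), GFP_KERNEL);
    struct config *old_cfg = global_config;

    *new_cfg = *old_cfg;                         /* copy ...                */
    new_cfg->timeout = new_timeout;              /* ... modify the copy ... */
    rcu_assign_pointer(global_config, new_cfg);  /* ... publish it          */

    synchronize_rcu();                           /* wait for existing readers */
    kfree(old_cfg);                              /* now safe to reclaim       */
}
```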
The scheduler in an SMP system faces challenges that fundamentally differ from uniprocessor scheduling. It must not only decide which process runs, but also where (on which processor) and when to migrate processes between processors.
Key Architectural Decisions:
1. Run Queue Organization:
There are two fundamental approaches to organizing ready processes:
Global Run Queue: A single queue of ready processes, shared by all processors.
Per-Processor Run Queues: Each processor maintains its own queue of ready processes.
Modern SMP schedulers universally use per-processor run queues. The scalability benefits are decisive—global queue lock contention limits practical scalability to perhaps 8-16 processors, while per-processor queues scale to hundreds of cores.
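A conceptual per-CPU run queue layout might look like the sketch below (field names are illustrative, not the actual Linux `struct rq`): each processor owns its queue and its lock, and each queue is padded to its own cache line so that two processors' scheduling state never falsely shares a line. The `cpu_run_queue()` helper used in the IPI example later on this page is assumed to work this way.

```c
#define CACHE_LINE_SIZE 64
#define MAX_CPUS        256

/* Conceptual: one run queue per processor. In the common case a CPU
 * only touches its own queue, so there is no lock contention and no
 * cache-line ping-pong between processors. */
struct run_queue {
    spinlock_t    lock;          /* protects this queue only            */
    unsigned int  nr_running;    /* number of ready tasks queued here   */
    struct task  *head;          /* ready tasks, in dispatch order      */
    unsigned long last_balance;  /* timestamp of last load-balance pass */
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct run_queue run_queues[MAX_CPUS];

static inline struct run_queue *cpu_run_queue(int cpu)
{
    return &run_queues[cpu];
}
```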
2. Load Balancing Strategy:
With per-processor queues, load balancing becomes an explicit scheduler responsibility. The scheduler must periodically compare queue lengths across processors, detect significant imbalance, and migrate processes from overloaded processors to underloaded ones.
Linux, for example, implements both push and pull migration: a periodic balancer pushes tasks away from overloaded CPUs, and a CPU that is about to go idle pulls waiting tasks from busier ones.
The frequency and aggressiveness of load balancing creates tradeoffs. Too frequent rebalancing disrupts cache affinity; too infrequent allows persistent imbalance.
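A conceptual pull-style balancer, continuing the same sketch (all helper functions here are assumed for illustration): a processor periodically compares its queue length with the busiest processor's and migrates a task only when the imbalance exceeds a threshold, so small transient imbalances never pay the cache-affinity cost of migration.

```c
/* Conceptual: periodic pull-style load balancing for one CPU. */
#define IMBALANCE_THRESHOLD 2   /* tolerate small differences */

void balance_cpu(int this_cpu)
{
    struct run_queue *this_rq = cpu_run_queue(this_cpu);
    int busiest_cpu = find_busiest_cpu();            /* assumed helper */
    struct run_queue *busiest_rq = cpu_run_queue(busiest_cpu);

    /* Only migrate when the imbalance is large enough to outweigh
     * the cache-affinity cost of moving a task. */
    if (busiest_rq->nr_running <
        this_rq->nr_running + IMBALANCE_THRESHOLD)
        return;

    /* Lock both queues in a fixed order to avoid deadlock. */
    lock_two_queues(this_rq, busiest_rq);

    struct task *victim = pick_migration_candidate(busiest_rq);
    if (victim) {
        dequeue_task(busiest_rq, victim);
        enqueue_task(this_rq, victim);               /* pull it over */
    }

    unlock_two_queues(this_rq, busiest_rq);
}
```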
The shift from global to per-processor run queues was one of the most significant scheduler architecture changes in OS history. Linux 2.4 used a global queue and struggled past 4 CPUs. Linux 2.6's O(1) scheduler introduced per-CPU queues, and the current CFS scheduler continues this design. This change enabled Linux to scale from desktop machines to servers with hundreds of cores.
SMP systems require mechanisms for processors to communicate with each other. While shared memory provides one form of communication, the operating system often needs immediate, targeted signaling between processors.
Inter-Processor Interrupts (IPIs):
IPIs are hardware-level signals that one processor sends to another, forcing the target processor to respond immediately. Common uses include rescheduling notifications (as in the example below), TLB shootdowns when page mappings change, and requests to execute a function on a specific remote CPU.
IPIs are powerful but expensive—they force the target processor to interrupt its current work, save context, handle the IPI, and restore context. Overuse of IPIs can significantly impact performance.
```c
/* Conceptual: Scheduler IPI to wake an idle CPU */
void wake_up_task(struct task *task, int target_cpu)
{
    /* Add task to target CPU's run queue */
    struct run_queue *rq = cpu_run_queue(target_cpu);

    spin_lock(&rq->lock);
    enqueue_task(rq, task);
    spin_unlock(&rq->lock);

    /* If target CPU is idle, send IPI to wake it */
    if (cpu_is_idle(target_cpu)) {
        send_ipi(target_cpu, IPI_RESCHEDULE);
    }

    /*
     * IPI handler on target CPU:
     * 1. Receives interrupt
     * 2. Sets TIF_NEED_RESCHED flag
     * 3. Returns from interrupt
     * 4. Kernel checks flag, calls scheduler
     * 5. Scheduler picks newly-enqueued task
     */
}

void handle_reschedule_ipi(void)
{
    /* Minimal handler - just set flag */
    set_thread_flag(TIF_NEED_RESCHED);
    /* Actual rescheduling happens on return to user/kernel */
}
```

Memory Barriers and Synchronization:
Modern processors execute instructions out of order and cache writes for efficiency. In SMP systems, this can cause one processor to observe memory operations in a different order than another processor executed them—a phenomenon that violates intuitive expectations about program behavior.
Memory barriers (fences) are special instructions that enforce ordering: a store (write) barrier ensures earlier writes become visible before later ones, a load (read) barrier does the same for reads, a full barrier orders both, and acquire/release operations provide one-way ordering around a synchronizing access.
Without proper barriers, SMP code can exhibit bugs that appear only under particular timing and load, vanish when instrumentation or a debugger is added, and surface only on certain processor models or memory models.
These are among the most difficult bugs to diagnose in systems programming.
Memory ordering bugs are insidious because they often work correctly for years before manifesting. The code "works" because most of the time, the processor happens to execute things in program order, or the timing works out coincidentally. Then a new CPU model, kernel version, or workload pattern triggers the latent bug. Always use proper synchronization primitives—never assume memory operations are ordered without explicit barriers.
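The classic message-passing pattern shows what proper synchronization looks like. In the C11 sketch below, the producer writes the data and then sets a flag with release semantics; the consumer reads the flag with acquire semantics before touching the data. Without that pairing, the compiler or the processor could legally reorder the accesses so that the consumer observes the flag but reads a stale payload.

```c
#include <stdatomic.h>

static int payload;                  /* ordinary data        */
static _Atomic int ready;            /* synchronization flag */

/* Producer: write the data, then publish the flag.
 * The release store guarantees the payload write is visible to any
 * CPU whose acquire load later observes ready == 1. */
void producer(void)
{
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: wait for the flag, then read the data.
 * The acquire load prevents the payload read from being hoisted
 * above the flag check. */
int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                            /* spin (illustrative only) */
    return payload;                  /* guaranteed to see 42     */
}
```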
Contemporary SMP systems have evolved far beyond the simple "multiple processors on a bus" model. Understanding modern implementations is essential for effective scheduling decisions.
Multi-Core Processors:
Today's SMP systems are dominated by multi-core processors—multiple CPU cores integrated onto a single chip (die). This integration provides several advantages: cores communicate over a fast on-die interconnect rather than an external bus, typically share a last-level cache, and consume far less power and board space than the equivalent number of single-core sockets.
| Platform | Cores/Threads | Cache Hierarchy | Notable Features |
|---|---|---|---|
| Intel Core i9-13900K | 24C/32T (8P+16E) | L1: 80KB, L2: 32MB, L3: 36MB | Hybrid P-core/E-core architecture |
| AMD Ryzen 9 7950X | 16C/32T | L1: 64KB, L2: 1MB, L3: 64MB | 3D V-Cache option, chiplet design |
| Apple M2 Ultra | 24C (16P+8E) | L1: 192KB, L2: 64MB | Unified memory architecture |
| AWS Graviton3 | 64C | L1: 64KB, L2: 1MB, L3: 32MB | Custom ARM Neoverse cores |
| Intel Xeon Platinum 8480+ | 56C/112T | L1: 48KB, L2: 2MB, L3: 105MB | Server-class, 8-socket capable |
Simultaneous Multithreading (SMT) / Hyper-Threading:
SMT allows a single physical core to present as multiple logical processors (typically 2) to the operating system. While logical processors share execution resources, they have independent architectural state (registers, instruction pointers).
SMT Scheduling Considerations:
Schedulers must understand the SMT topology to make intelligent decisions—scheduling related threads onto SMT siblings can improve cache utilization, while scheduling competing threads may waste resources.
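On Linux, the kernel discovers the SMT topology from firmware and exposes it through sysfs, which is also how user-space tools can inspect it. A minimal sketch (output format and error handling kept deliberately simple):

```c
#include <stdio.h>

/* Print which logical CPUs share a physical core with CPU 0.
 * On Linux this is exposed via sysfs; e.g. "0,16" means logical
 * CPUs 0 and 16 are SMT siblings on the same core. */
int main(void)
{
    char buf[256];
    FILE *f = fopen(
        "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("SMT siblings of CPU 0: %s", buf);
    fclose(f);
    return 0;
}
```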
Modern processors like Intel's 12th/13th gen and Apple's M-series break the SMP symmetry assumption with heterogeneous cores (Performance + Efficiency cores). While architecturally different, the OS still presents these as SMP to applications. The scheduler must now consider core capabilities—a compute-intensive task should run on P-cores, while background work can run on E-cores. This adds complexity but enables better performance-per-watt.
We have explored Symmetric Multiprocessing from its architectural foundations to its modern implementations. This knowledge provides the essential context for understanding all multi-processor scheduling strategies that follow.
Consolidating Our Understanding:
What's Next:
With SMP fundamentals established, we'll contrast this architecture with Asymmetric Multiprocessing (AMP)—an alternative model where processors have differentiated roles. Understanding both models illuminates why SMP dominates general-purpose computing while AMP persists in specialized domains.
You now understand the architectural principles underlying Symmetric Multiprocessing—the foundation upon which all modern operating system schedulers are built. This knowledge will inform every scheduling concept we explore throughout this module, from processor affinity to NUMA-aware scheduling.