Symmetric Multiprocessing (SMP) dominates modern general-purpose computing, but it's not the only multiprocessor architecture, nor is it always the best choice. Asymmetric Multiprocessing (AMP), which predates SMP's ubiquity and persists in specialized domains today, offers an alternative model in which processors have differentiated, specialized roles.
Understanding AMP is essential for several reasons: it illuminates why SMP design decisions were made by contrasting with alternatives; it explains architectures still prevalent in embedded systems, real-time computing, and heterogeneous platforms; and it provides context for emerging hybrid architectures that blend symmetric and asymmetric characteristics.
By the end of this page, you will understand Asymmetric Multiprocessing architectures—their design principles, the master-slave scheduling model, advantages in specialized scenarios, and why general-purpose computing moved toward symmetry. You'll also see how modern heterogeneous processors represent a renaissance of asymmetric concepts within nominally symmetric systems.
Asymmetric Multiprocessing (AMP) describes multiprocessor architectures where processors have different roles, capabilities, or access rights. The "asymmetry" can manifest in several forms (a short code sketch after this list shows how a system might encode the distinctions):
1. Role Asymmetry (Master-Slave):
One processor (the "master") runs the operating system kernel and makes all scheduling decisions. Other processors ("slaves") execute user applications as directed by the master but do not directly handle system services.
2. Capability Asymmetry:
Processors have different instruction sets, performance characteristics, or functional capabilities. Some processors might handle floating-point operations while others cannot; some might have access to specific I/O devices while others do not.
3. Access Asymmetry:
Processors have different access rights to memory regions, I/O devices, or system resources. Even with identical hardware, the operating system may restrict what each processor can do.
4. Execution Asymmetry:
Certain code (kernel, interrupt handlers) runs only on designated processors, while other code (user applications) may run on a different set of processors.
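To make these categories concrete, here is a small, purely hypothetical sketch of how an operating system might record them in a per-CPU descriptor. The structure and field names are illustrative assumptions, not taken from any real kernel.

```c
/* Hypothetical per-CPU descriptor encoding the four kinds of asymmetry
 * described above. Field names are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

struct cpu_descriptor {
    int      cpu_id;
    bool     is_master;          /* role asymmetry: runs the kernel and scheduler */
    bool     has_fpu;            /* capability asymmetry: feature differences */
    uint32_t accessible_regions; /* access asymmetry: bitmask of memory/I/O regions */
    bool     may_run_kernel;     /* execution asymmetry: where kernel code may run */
    bool     handles_interrupts; /* execution asymmetry: where IRQ handlers run */
};

/* Example configuration for a small master-slave system. */
static const struct cpu_descriptor cpus[] = {
    { 0, true,  true,  0xFFFFFFFF, true,  true  },  /* master: full rights */
    { 1, false, true,  0x0000000F, false, false },  /* slave: user code only */
    { 2, false, false, 0x0000000F, false, false },  /* slave without an FPU */
};
```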
Real systems exist on a spectrum between pure SMP and pure AMP. Many "SMP" systems have subtle asymmetries (CPU 0 often handles more interrupts, bootstrap always starts on a specific processor). Conversely, "AMP" systems may have processors that are hardware-identical but software-differentiated. The label reflects the predominant design philosophy rather than absolute purity.
Historical Context:
Asymmetric multiprocessing predates symmetric multiprocessing. Early multiprocessor systems in the 1960s and 1970s often used asymmetric designs for practical reasons:
The master-slave model provided a pragmatic path to multiprocessing without requiring the full complexity of symmetric operation. The master processor ran the single-threaded OS kernel safely, while slave processors provided additional compute capacity.
The most common form of asymmetric multiprocessing is the master-slave (also called master-worker or boss-worker) architecture. This model creates a clear division of responsibilities between processors.
Master Processor Responsibilities:
- Runs the operating system kernel: scheduler, memory management, I/O subsystems, and system call handling
- Receives and services all interrupts, including the timer interrupt
- Assigns processes to slave processors and preempts them when their time slices expire
Slave Processor Responsibilities:
- Execute user processes as assigned by the master
- Trap to the master (via inter-processor interrupt) for system calls and other kernel services
- Report back to the master when a process blocks, exhausts its time slice, or exits
Master-Slave AMP System Architecture:

```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ MASTER (CPU 0)  │   │ SLAVE (CPU 1)   │   │ SLAVE (CPU 2)   │
│ ┌─────────────┐ │   │ ┌─────────────┐ │   │ ┌─────────────┐ │
│ │ Kernel      │ │   │ │ User        │ │   │ │ User        │ │
│ │ - Scheduler │ │   │ │ Process A   │ │   │ │ Process B   │ │
│ │ - Memory    │ │   │ │             │ │   │ │             │ │
│ │ - I/O       │ │   │ │             │ │   │ │             │ │
│ │ - Syscalls  │ │   │ │             │ │   │ │             │ │
│ └─────────────┘ │   │ └─────────────┘ │   │ └─────────────┘ │
│ [IRQ Handler]   │   │ [Trap to Master]│   │ [Trap to Master]│
│ [Timer IRQ]     │   │ [for syscalls]  │   │ [for syscalls]  │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         ▼                     ▼                     ▼
┌─────────────────────────────────────────────────────────────┐
│                      SHARED MEMORY BUS                      │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         MAIN MEMORY                         │
│ ┌──────────────┐  ┌──────────────┐  ┌─────────────────────┐ │
│ │ Kernel Code  │  │ User Process │  │ User Process        │ │
│ │ & Data       │  │ A Memory     │  │ B Memory            │ │
│ │ (Master only)│  │              │  │                     │ │
│ └──────────────┘  └──────────────┘  └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```

Control Flow for a System Call from a Slave:
1. Process on Slave 1 calls read()
2. Slave 1 generates an inter-processor interrupt to the Master
3. Master is interrupted and saves context
4. Master handles the read() syscall (schedules I/O, blocks the process)
5. Master selects the next process for Slave 1
6. Master signals Slave 1 with the new assignment
7. Slave 1 resumes with the new process

System Call Handling in Master-Slave Systems:
One of the most significant differences from SMP is how system calls work. In an SMP system, each processor can execute kernel code directly—a system call runs to completion on the processor where it was invoked. In master-slave AMP, a slave cannot execute kernel code: it must interrupt the master, wait while the master performs the system call on its behalf, and then resume with whatever work the master assigns next.
This round-trip to the master for every system call creates significant overhead, especially for system-call-intensive workloads.
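To make that round trip concrete, the following sketch shows what the slave side of the exchange might look like over a shared-memory mailbox plus an inter-processor interrupt. The mailbox layout and the ipi_send()/wait_for_ipi() primitives are assumptions for illustration, not a real platform API.

```c
/* Hypothetical sketch: a slave forwarding a system call to the master
 * over a shared-memory mailbox plus an inter-processor interrupt. */
#include <stdint.h>
#include <stdatomic.h>

struct syscall_mailbox {
    atomic_int state;        /* 0 = empty, 1 = request posted, 2 = done */
    int        syscall_nr;   /* which system call the slave wants */
    uintptr_t  args[6];      /* argument registers captured at the trap */
    long       retval;       /* filled in by the master */
};

/* One mailbox per slave, living in the shared memory region. */
extern struct syscall_mailbox mailboxes[];
extern void ipi_send(int target_cpu);   /* assumed platform primitive */
extern void wait_for_ipi(void);         /* assumed platform primitive */

/* Runs on the slave when a user process traps for a system call. */
long slave_forward_syscall(int my_cpu, int nr, const uintptr_t args[6])
{
    struct syscall_mailbox *mb = &mailboxes[my_cpu];

    mb->syscall_nr = nr;
    for (int i = 0; i < 6; i++)
        mb->args[i] = args[i];

    /* Publish the request, then poke the master (CPU 0). */
    atomic_store_explicit(&mb->state, 1, memory_order_release);
    ipi_send(0);

    /* Block until the master marks the request complete. */
    while (atomic_load_explicit(&mb->state, memory_order_acquire) != 2)
        wait_for_ipi();

    atomic_store_explicit(&mb->state, 0, memory_order_relaxed);
    return mb->retval;
}
```

Every one of these steps is pure overhead compared with SMP, where the same processor would simply switch to kernel mode and run the system call locally.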
The master processor becomes a critical bottleneck in AMP systems. Every system call, every interrupt, every scheduling decision flows through it. As slave count increases, the master can become saturated, limiting system scalability. This fundamental limitation drove the transition to SMP for general-purpose computing, where any processor can handle any kernel operation.
Scheduling in AMP systems is fundamentally different from SMP scheduling because all scheduling decisions are centralized in the master processor. This creates both simplifications and limitations.
Centralized Scheduling Advantages:
- The master has a complete, global view of every processor and every runnable task
- No scheduling races between processors, so no locking of scheduler data structures is needed
- Behavior is deterministic and easier to analyze, which matters for real-time systems
Centralized Scheduling Disadvantages:
- Every scheduling decision is serialized through the master, which becomes a bottleneck
- Slaves sit idle while waiting for the master to assign new work
- The master is a single point of failure for the entire system
The AMP Scheduling Loop:
The master processor runs a scheduling loop that manages all slaves:
```c
/* Conceptual AMP Master Scheduler */
void master_scheduler_loop(void)
{
    while (true) {
        /* Check for events from slaves */
        for (int slave = 1; slave < num_cpus; slave++) {
            if (slave_event_pending(slave)) {
                handle_slave_event(slave);
            }
        }

        /* Handle timer expiration - check time slice exhaustion */
        check_timers_and_preempt();

        /* Handle I/O completions */
        handle_io_completions();

        /* Assign processes to idle slaves */
        for (int slave = 1; slave < num_cpus; slave++) {
            if (slave_is_idle(slave)) {
                struct task *next = pick_next_task();
                if (next != NULL) {
                    assign_task_to_slave(slave, next);
                }
            }
        }

        /* Run any master-only tasks */
        run_master_tasks();
    }
}

void handle_slave_event(int slave)
{
    enum event_type event = get_slave_event(slave);

    switch (event) {
    case SYSCALL_REQUEST:
        /* Execute syscall on behalf of slave's process */
        execute_syscall_for_slave(slave);
        break;

    case TIME_SLICE_EXPIRED:
        /* Slave's process used its quantum */
        preempt_slave_process(slave);
        break;

    case PROCESS_EXITED:
        /* Clean up and find new work for slave */
        cleanup_exited_process(slave);
        assign_task_to_slave(slave, pick_next_task());
        break;

    case BLOCKING_EVENT:
        /* Process blocked (waiting for I/O, lock, etc.) */
        block_process_on_slave(slave);
        assign_task_to_slave(slave, pick_next_task());
        break;
    }
}
```

Despite its general-purpose limitations, the centralized nature of AMP scheduling provides advantages for real-time systems. Timing analysis is simpler when all scheduling decisions funnel through one point. Worst-Case Execution Time (WCET) calculations don't need to account for inter-processor scheduling races. This predictability is why AMP persists in safety-critical embedded systems where certification requires demonstrable timing guarantees.
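For completeness, here is a hypothetical sketch of the loop each slave might run as the counterpart to the master scheduler above. It reuses struct task and enum event_type from that sketch; the helper functions are assumed primitives, not a real API.

```c
/* Hypothetical slave-side loop complementing the master scheduler above. */
extern struct task *wait_for_assignment(int cpu);               /* block until the master assigns work */
extern enum event_type run_until_trap(struct task *t);          /* run user code until it traps */
extern void post_slave_event(int cpu, enum event_type reason);  /* notify the master */

void slave_main_loop(int my_cpu)
{
    for (;;) {
        /* Idle until the master hands this CPU a process to run. */
        struct task *task = wait_for_assignment(my_cpu);

        /* Execute the process in user mode until it traps: a system call,
         * an exhausted time slice, a blocking event, or an exit. */
        enum event_type reason = run_until_trap(task);

        /* Report the trap; the master's handle_slave_event() takes over
         * and eventually supplies new work via wait_for_assignment(). */
        post_slave_event(my_cpu, reason);
    }
}
```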
Understanding the trade-offs between AMP and SMP illuminates why modern general-purpose systems overwhelmingly use SMP, while AMP persists in specialized domains.
| Characteristic | Asymmetric (AMP) | Symmetric (SMP) |
|---|---|---|
| Kernel execution | Master processor only | Any/all processors |
| Scheduling decisions | Centralized, serialized | Distributed, potentially concurrent |
| System call latency | High (round-trip to master) | Low (local execution) |
| Interrupt handling | Master processor only | Distributed across processors |
| Scalability | Limited by master capacity | Limited by synchronization overhead |
| Implementation complexity | Lower (simpler kernel) | Higher (concurrent kernel) |
| Timing predictability | High (deterministic) | Lower (concurrent interactions) |
| Fault tolerance | Lower (SPOF at master) | Higher (graceful degradation possible) |
| Cache efficiency | Variable (master overhead) | Good (local execution) |
| Load balancing | Centralized, optimal knowledge | Distributed, heuristic-based |
Scalability Deep Dive:
The scalability characteristics of AMP and SMP differ fundamentally:
AMP Scalability Limit:
In AMP, the master processor handles:
- Every system call made by every slave's process
- Every device interrupt
- Every scheduling decision
- All I/O management and completion processing
As slave count increases, master load grows linearly. Eventually, the master becomes 100% utilized handling slave requests, creating a hard ceiling. Practical AMP systems rarely exceed 4-8 slaves.
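A back-of-envelope model makes the ceiling concrete: if each slave generates r kernel requests per second and each request costs the master s seconds of service, master utilization is roughly N × r × s and saturates when that product reaches 1. The numbers below are invented for illustration only.

```c
/* Illustrative back-of-envelope model of master saturation.
 * The request rate and service time are made-up example numbers. */
#include <stdio.h>

int main(void)
{
    double requests_per_slave = 5000.0;  /* kernel requests/second per slave (assumed) */
    double service_time       = 20e-6;   /* master time per request, 20 us (assumed) */

    for (int slaves = 1; slaves <= 16; slaves++) {
        double master_util = slaves * requests_per_slave * service_time;
        printf("%2d slave(s) -> master utilization %5.1f%%%s\n",
               slaves, master_util * 100.0,
               master_util >= 1.0 ? "  (saturated)" : "");
    }
    return 0;
}
```

With these example figures the master saturates at around ten slaves before accounting for any of its own bookkeeping work, which is consistent with the practical 4-8 slave ceiling noted above.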
SMP Scalability Limit:
In SMP, scalability is limited by:
- Contention for shared kernel locks
- Cache-coherence traffic on shared data structures
- Finite memory and interconnect bandwidth shared by all processors
These limitations are less severe for well-designed workloads. SMP systems routinely scale to 64-128+ processors, with careful attention to lock granularity and data locality.
While SMP dominates general-purpose computing, asymmetric multiprocessing remains prevalent in specialized domains where its characteristics provide advantages.
1. Embedded Real-Time Systems:
Safety-critical systems in automotive, aerospace, and industrial control often use AMP because its centralized, deterministic scheduling keeps worst-case timing analysis tractable, which the certification of such systems demands.
2. Heterogeneous Processing:
Systems with different processor types are inherently asymmetric: a general-purpose CPU coordinating specialized processors such as DSPs or GPUs cannot treat all processors interchangeably, so work must be assigned explicitly according to each processor's capabilities.
3. AMP on Multi-Core Chips:
Some multi-core embedded systems deliberately use AMP on otherwise symmetric hardware: the cores are physically identical, but software gives them fixed, differentiated roles, for example one core running a real-time executive while another runs a general-purpose OS, each with its own software image and memory partition.
4. Boot and Initialization:
Even SMP systems often boot asymmetrically: a single bootstrap processor initializes the kernel and hardware, and only then brings the remaining processors online.
Even production SMP systems often have AMP-like behaviors. CPU 0 typically handles time-sensitive operations like timekeeping. Interrupt affinity settings may direct all interrupts to specific processors. The kernel's boot sequence is inherently asymmetric. Understanding AMP helps recognize and debug these asymmetric aspects within nominally symmetric systems.
Modern processors are experiencing a renaissance of asymmetric concepts within nominally symmetric packaging. Heterogeneous multi-core processors combine different core types on a single chip, challenging the pure SMP model.
ARM big.LITTLE and DynamIQ:
ARM pioneered heterogeneous multi-core with the big.LITTLE architecture: high-performance "big" cores are paired with energy-efficient "LITTLE" cores that implement the same instruction set, so the operating system can migrate threads between core types as demand changes.
DynamIQ extends this, allowing more flexible core mixing (e.g., 2 big + 6 LITTLE) and enabling big and LITTLE cores to share cache clusters.
Intel Performance and Efficiency Cores:
Intel's 12th generation (Alder Lake) and beyond use a similar approach, combining Performance (P) cores and Efficiency (E) cores on a single die, as summarized below:
| Characteristic | Performance Cores | Efficiency Cores |
|---|---|---|
| Microarchitecture | Complex out-of-order | Simpler in-order or narrow OoO |
| Clock speed | Higher (up to 5.8 GHz) | Lower (up to 4.3 GHz) |
| Power consumption | Higher per core | 1/4 to 1/2 of P-core |
| SMT/Hyper-Threading | Yes (2 threads/core) | No |
| Die area | Larger | ~1/4 P-core size |
| Cache | Larger L1/L2 | Smaller, often shared L2 |
| Best workloads | Latency-sensitive, bursty | Background, throughput |
Scheduling Implications:
Heterogeneous processors create scheduling challenges that echo AMP concerns: the scheduler must know which cores are fast and which are efficient, decide which threads deserve the high-performance cores, respect capability differences such as the absence of SMT on E-cores, and migrate work between core types without throwing away cache locality.
Linux has developed sophisticated mechanisms for heterogeneous scheduling, including capacity-aware task placement and Energy Aware Scheduling (EAS), which account for cores of unequal performance; more recently its default fair-class scheduler has moved from CFS to EEVDF (Earliest Eligible Virtual Deadline First).
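One user-visible piece of this picture is CPU affinity. The sketch below pins the calling thread to a subset of cores using the Linux sched_setaffinity() API; which CPU numbers correspond to P-cores or E-cores is entirely platform-specific, so the IDs used here are assumptions for illustration.

```c
/* Sketch: restricting the calling thread to a subset of cores with the
 * Linux CPU-affinity API. CPUs 0-3 being P-cores is an assumption. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    /* Assume CPUs 0-3 are the performance cores on this hypothetical machine. */
    for (int cpu = 0; cpu <= 3; cpu++)
        CPU_SET(cpu, &set);

    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("Thread restricted to CPUs 0-3\n");
    return 0;
}
```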
Heterogeneous processors signal a shift away from pure SMP toward hybrid architectures. The scheduling challenge now includes matching task characteristics to core capabilities—a form of resource-aware scheduling that borrows from both SMP (multiple cores running the same OS) and AMP (different core roles) traditions. Expect this trend to accelerate as power efficiency becomes ever more critical.
For system designers considering AMP architectures—common in embedded development—there are practical implementation considerations beyond the theoretical model.
Memory Layout Strategies:
AMP systems typically partition memory explicitly:
- A master-private region for kernel code, data, and stacks
- A shared region for inter-processor mailboxes, work queues, completion queues, and data buffers
- A private region per slave for its application code, stacks, and local data
This explicit partitioning simplifies memory protection but requires careful design of shared data structures and communication protocols.
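As one way to express such a partition in code, the sketch below describes the regions in a static table and programs a memory protection unit from it. The addresses, sizes, and the mpu_configure_region() primitive are hypothetical; the layout mirrors the diagram that follows.

```c
/* Hypothetical static description of an AMP memory partition. Addresses,
 * sizes, and mpu_configure_region() are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_region {
    const char *name;
    uintptr_t   base;
    size_t      size;
    uint32_t    allowed_cpus;   /* bitmask: bit n set => CPU n may access */
};

#define CPU(n) (1u << (n))

static const struct mem_region layout[] = {
    { "master_private", 0x20000000, 0x00400000, CPU(0)                   },
    { "shared_ipc",     0x20400000, 0x00100000, CPU(0) | CPU(1) | CPU(2) },
    { "slave1_private", 0x20500000, 0x00200000, CPU(0) | CPU(1)          },
    { "slave2_private", 0x20700000, 0x00200000, CPU(0) | CPU(2)          },
};

/* Assumed platform primitive that programs one MPU region on one CPU. */
extern void mpu_configure_region(int cpu, uintptr_t base, size_t size,
                                 bool allow);

/* Program each CPU's MPU so that out-of-bounds accesses fault (and are
 * reported to the master), mirroring the protection rules below. */
void configure_mpu_for_cpu(int cpu)
{
    for (size_t i = 0; i < sizeof layout / sizeof layout[0]; i++) {
        bool allow = (layout[i].allowed_cpus & CPU(cpu)) != 0;
        mpu_configure_region(cpu, layout[i].base, layout[i].size, allow);
    }
}
```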
Typical AMP Memory Layout:

```
Address Space
┌─────────────────────────────────────────────────────────────┐
│ 0xFFFFFFFF │ Interrupt Vectors (Master only)                │
├─────────────────────────────────────────────────────────────┤
│            │ Master Private Memory                          │
│            │ - Kernel code and data                         │
│            │ - Kernel stacks                                │
│            │ - Master-only peripherals                      │
├─────────────────────────────────────────────────────────────┤
│            │ Shared Memory Region                           │
│            │ - Inter-processor mailboxes                    │
│            │ - Work queues (tasks for slaves)               │
│            │ - Completion queues (results from slaves)      │
│            │ - Shared data buffers                          │
├─────────────────────────────────────────────────────────────┤
│            │ Slave 1 Private Memory                         │
│            │ - Application code                             │
│            │ - Stacks and local data                        │
├─────────────────────────────────────────────────────────────┤
│            │ Slave 2 Private Memory                         │
│            │ - Application code                             │
│            │ - Stacks and local data                        │
├─────────────────────────────────────────────────────────────┤
│ 0x00000000 │ Boot/ROM region                                │
└─────────────────────────────────────────────────────────────┘
```

Memory Protection Unit (MPU) Configuration:
- Master: Full access to all regions
- Slave 1: Private + Shared only
- Slave 2: Private + Shared only
- Violations generate faults to the Master

Inter-Processor Communication Patterns:
AMP systems need reliable mechanisms for master-slave communication:
1. Hardware Mailboxes: Dedicated hardware registers for message passing. Writing to a mailbox can trigger an interrupt on the receiving processor.
2. Shared Memory Queues: Ring buffers in shared memory with careful synchronization. Lock-free designs using atomic operations are preferred.
3. Software IPIs: Generic inter-processor interrupts that signal attention needed, with details in shared memory.
4. Doorbell Registers: Simple signaling mechanism—one processor writes, another monitors and responds.
Even in AMP systems where the master makes all decisions, concurrent access to shared memory requires careful synchronization. The master might update a work queue while a slave reads it. Memory barriers and atomic operations remain essential. The simplification is in kernel internals, not in all inter-processor coordination.
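As a concrete example of the shared-memory queue pattern (item 2 above) and of the synchronization care just described, here is a minimal single-producer/single-consumer ring buffer built on C11 atomics; the queue size and message format are assumptions for the sketch.

```c
/* Minimal single-producer/single-consumer ring buffer over shared memory,
 * illustrating the lock-free shared-memory queue pattern. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SLOTS 16   /* assumed capacity; kept a power of two */

struct message {
    uint32_t type;
    uint32_t payload;
};

struct spsc_queue {
    _Atomic uint32_t head;              /* written only by the consumer */
    _Atomic uint32_t tail;              /* written only by the producer */
    struct message   slots[QUEUE_SLOTS];
};

/* Producer side (e.g., the master posting work to a slave). */
bool spsc_push(struct spsc_queue *q, const struct message *m)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);

    if (tail - head == QUEUE_SLOTS)
        return false;                    /* queue full */

    q->slots[tail % QUEUE_SLOTS] = *m;
    /* Release ordering publishes the slot contents before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side (e.g., a slave draining its work queue). */
bool spsc_pop(struct spsc_queue *q, struct message *out)
{
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail)
        return false;                    /* queue empty */

    *out = q->slots[head % QUEUE_SLOTS];
    /* Release ordering lets the producer reuse the slot only after the copy. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

Because each queue has exactly one producer and one consumer, no locks are required; acquire/release ordering alone ensures the consumer never reads a slot before its contents have been published.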
We have explored Asymmetric Multiprocessing from its historical origins through its modern applications. This knowledge complements our SMP understanding and provides context for the full spectrum of multiprocessor architectures.
Consolidating Our Understanding:
- AMP differentiates processors by role, capability, access rights, or execution rights, in contrast to SMP's interchangeable processors
- The master-slave model centralizes kernel execution and scheduling on one processor, which simplifies the kernel but makes the master a bottleneck and a single point of failure
- AMP persists where determinism and simplicity matter: real-time, safety-critical, and heterogeneous embedded systems
- Modern heterogeneous processors (ARM big.LITTLE, Intel P/E cores) revive asymmetric ideas inside nominally symmetric systems
What's Next:
With both SMP and AMP architectures understood, we'll explore Processor Affinity—the mechanisms that bind processes to specific processors. Affinity bridges both paradigms: in SMP, it's an optimization to preserve cache locality; in AMP, it can be a necessity for correctness. Understanding affinity is essential for effective multi-processor scheduling.
You now understand both symmetric and asymmetric multiprocessing architectures—the fundamental design choices for multiprocessor systems. This dual perspective enables you to evaluate scheduling strategies in context: what works for SMP may not suit AMP, and vice versa. Modern hybrid architectures require understanding both traditions.