In an ideal SMP world, any process can run on any processor with equal efficiency. Reality is more nuanced. When a process runs on a processor, it populates that processor's caches with its working data—creating a warm cache state. Moving the process to a different processor means starting with a cold cache, suffering cache misses until the working set is reloaded.
This cache locality phenomenon gives rise to processor affinity—the tendency, or explicit requirement, for a process to run on the same processor(s) over time. Understanding affinity is essential for building high-performance multi-processor systems, as intelligent affinity management can dramatically impact throughput, latency, and power consumption.
By the end of this page, you will understand both soft and hard processor affinity, the performance implications of cache-aware scheduling, system interfaces for controlling affinity, and advanced considerations including NUMA effects and SMT siblings. You'll be able to make informed decisions about when and how to apply affinity constraints in production systems.
Processor affinity is fundamentally about cache locality. To understand why affinity matters, we must first understand how modern cache hierarchies work and their dramatic performance impact.
Cache Working Set Dynamics:
When a process executes, it accesses memory in patterns determined by its algorithm and data structures, repeatedly touching the same instructions, stack frames, and heap objects.
Over time, the caches become "tuned" to the process's behavior. The working set—the set of cache lines the process uses during a given time interval—resides in local cache, enabling fast access.
| Scenario | Access Latency | Relative Cost | Impact |
|---|---|---|---|
| L1 cache hit | ~4 cycles (1-2 ns) | 1x (baseline) | Optimal performance |
| L2 cache hit | ~12 cycles (4 ns) | 3x | Minor slowdown |
| L3 cache hit | ~40 cycles (15 ns) | 10x | Noticeable latency |
| Main memory | ~200 cycles (70 ns) | 50x | Significant stall |
| Full working set reload | Millions of cycles | 1000x+ | Major performance hit |
The Migration Cost:
When the scheduler moves a process from Processor A to Processor B, the state cached on A becomes useless: B starts cold and takes a miss on every first touch of the working set, while dirty lines left behind on A must be written back or transferred through the cache-coherence protocol.
For CPU-intensive processes with large working sets, migration cost can amount to millions of cycles—potentially negating any load-balancing benefit that motivated the migration.
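A rough back-of-envelope calculation shows where the "millions of cycles" figure comes from; the sketch below simply multiplies out the approximate latencies from the table above (the 2 MB working set, 64-byte line, and ~200-cycle miss penalty are assumptions, not measurements):

```c
/* Back-of-envelope estimate of the cold-cache reload cost after a migration.
 * Reuses the approximate latencies from the table above; real costs depend on
 * prefetching, access patterns, and available memory bandwidth. */
#include <stdio.h>

int main(void)
{
    const long working_set_bytes = 2L * 1024 * 1024;  /* assume a 2 MB working set */
    const long cache_line_bytes  = 64;                /* typical line size */
    const long miss_penalty_cyc  = 200;               /* ~main-memory latency */

    long lines_to_reload = working_set_bytes / cache_line_bytes;   /* 32768 lines */
    long reload_cycles   = lines_to_reload * miss_penalty_cyc;     /* ~6.5M cycles */

    printf("Cold-cache reload: ~%ld lines, ~%ld cycles (~%.2f ms at 3 GHz)\n",
           lines_to_reload, reload_cycles, reload_cycles / 3.0e6);
    return 0;
}
```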
Quantifying Migration Cost:
Research and production experience show that migration costs vary widely, depending on working-set size, the cache hierarchy, and how much data must ultimately be re-fetched from main memory.
Aggressive load balancing that frequently migrates processes can paradoxically reduce total system throughput. Each migration trades guaranteed cache-miss penalties for potential load-balance gains. If migrations occur faster than caches can warm, processes never reach optimal performance. This is the fundamental tension that affinity mechanisms address.
Soft affinity (also called natural affinity) is the scheduler's default tendency to keep a process on the same processor where it last ran, without any explicit constraint from the user or application.
How Soft Affinity Works:
Modern schedulers implement soft affinity through several mechanisms:
1. Per-CPU Run Queues: When a process becomes ready (e.g., after sleeping), the scheduler places it on its "home" CPU's run queue by default. Migration only occurs if load balancing triggers.
2. Migration Cost Awareness: Advanced schedulers estimate migration cost and factor it into load-balancing decisions. A small imbalance may not justify the migration overhead.
3. Wake-Up Heuristics: When a process is woken (by I/O completion, signal, etc.), the scheduler often places it on the CPU where the waking process runs, anticipating data sharing.
4. Load Balancing Thresholds: Load balancers don't react to minor imbalances. A CPU must be significantly more loaded than others before migrations trigger.
```c
/* Conceptual soft affinity in scheduler wake-up path */

void wake_up_process(struct task *p)
{
    int prev_cpu = p->last_cpu;      /* CPU where process last ran */
    int this_cpu = current_cpu();    /* CPU executing wake_up */
    int target_cpu;

    /* Default: prefer previous CPU (soft affinity) */
    target_cpu = prev_cpu;

    /* Check if previous CPU is available */
    if (!cpu_is_idle(prev_cpu)) {
        /* Consider waking on current CPU if it makes sense */
        if (cpu_is_idle(this_cpu) && on_same_domain(prev_cpu, this_cpu)) {
            target_cpu = this_cpu;
        }
        /* Or find a nearby idle CPU in the same cache domain */
        else {
            target_cpu = find_idle_cpu_in_domain(prev_cpu);
        }
    }

    /* Enqueue on target CPU */
    enqueue_task(cpu_rq(target_cpu), p);

    /* If target was idle, send IPI to wake it */
    if (cpu_is_idle(target_cpu)) {
        send_reschedule_ipi(target_cpu);
    }
}
```

Soft Affinity Characteristics: soft affinity is purely advisory. The scheduler prefers the previous CPU but remains free to migrate the process whenever load balancing justifies it, and it requires no configuration from the application.
When Soft Affinity Works Well:
Soft affinity provides good results for many workloads: general-purpose servers, interactive applications, and I/O-bound processes that carry little cache state and benefit from the scheduler's freedom to balance load.
When Soft Affinity Is Insufficient:
Some scenarios require stronger guarantees: latency-critical and real-time tasks that cannot tolerate migration jitter, NUMA-sensitive applications whose memory is bound to one node, and reproducible benchmarking where placement must be deterministic.
For the majority of workloads, soft affinity provides an appropriate balance between cache efficiency and load distribution. Modern schedulers are sophisticated enough that explicit affinity settings are the exception, not the rule. Premature optimization with hard affinity can actually harm performance by preventing the scheduler from making beneficial migrations.
Hard affinity (also called CPU pinning or processor binding) is an explicit constraint that restricts a process to run only on a specified subset of CPUs. Unlike soft affinity, the scheduler cannot override hard affinity for load balancing.
The CPU Affinity Mask:
Hard affinity is implemented via a CPU affinity mask—a bitmask where each bit represents whether the process is allowed to run on the corresponding CPU:
For example, on an 8-CPU system:
- 0xFF (binary 11111111): Allowed on all CPUs (default)
- 0x01 (binary 00000001): Pinned to CPU 0 only
- 0x0A (binary 00001010): Allowed on CPUs 1 and 3
- 0xF0 (binary 11110000): Allowed on CPUs 4-7
```c
/* Linux CPU Affinity API Example */

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void set_thread_affinity_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);        /* Initialize: clear all bits */
    CPU_SET(cpu, &cpuset);    /* Set bit for target CPU */

    int result = pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset);
    if (result != 0) {
        /* pthread functions return the error number rather than setting errno */
        fprintf(stderr, "pthread_setaffinity_np failed: %s\n", strerror(result));
        exit(EXIT_FAILURE);
    }
    printf("Thread pinned to CPU %d\n", cpu);
}

void set_process_affinity_to_cpus(pid_t pid, int *cpus, int num_cpus)
{
    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);
    for (int i = 0; i < num_cpus; i++) {
        CPU_SET(cpus[i], &cpuset);    /* Allow each specified CPU */
    }

    if (sched_setaffinity(pid, sizeof(cpuset), &cpuset) == -1) {
        perror("sched_setaffinity failed");
        exit(EXIT_FAILURE);
    }
    printf("Process %d affinity set to %d CPUs\n", pid, num_cpus);
}

void get_and_print_affinity(void)
{
    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);
    if (sched_getaffinity(0, sizeof(cpuset), &cpuset) == -1) {
        perror("sched_getaffinity failed");
        return;
    }

    printf("Current process affinity: ");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &cpuset)) {
            printf("CPU%d ", cpu);
        }
    }
    printf("\n");
}

int main(void)
{
    get_and_print_affinity();

    /* Pin current process (pid 0 = calling process) to CPU 0 */
    int target_cpu = 0;
    set_process_affinity_to_cpus(0, &target_cpu, 1);

    get_and_print_affinity();
    return 0;
}
```

Command-Line Tools:
Several utilities allow affinity manipulation without code changes:
taskset (Linux):
# Run command with specific affinity
taskset -c 0,2,4 ./my_program # CPUs 0, 2, and 4
taskset 0x0F ./my_program # CPUs 0-3 (hex mask)
# Change affinity of running process
taskset -p -c 0 1234 # Pin PID 1234 to CPU 0
numactl (Linux with NUMA):
# Run on specific CPUs and memory nodes
numactl --cpunodebind=0 --membind=0 ./my_program
start /affinity (Windows):
REM Start process with an affinity mask (interpreted as hexadecimal); 3 = CPUs 0 and 1
start /affinity 3 MyProgram.exe
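The programmatic equivalent on Windows is the Win32 affinity-mask API; a minimal sketch (the mask value is just an example, and this compiles only with a Windows toolchain):

```c
/* Windows-only sketch: restrict the current process to CPUs 0 and 1. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 0x3;   /* bits 0 and 1 => CPUs 0 and 1 */

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());
        return 1;
    }
    printf("Process restricted to CPUs 0 and 1\n");
    return 0;
}
```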
Setting hard affinity prevents the scheduler from migrating the process for load balancing. If you pin four CPU-intensive processes to the same CPU, the scheduler cannot help—they will time-share that single CPU while other CPUs sit idle. Use hard affinity judiciously and only when you understand the full system workload.
Affinity decisions have measurable performance impact. Understanding when affinity helps versus hurts is essential for effective system tuning.
The table below summarizes when affinity tends to improve performance and when it is likely to hurt:
| Workload Characteristic | Recommendation | Rationale |
|---|---|---|
| Short-lived processes | Use soft affinity (default) | Not enough cache state to benefit from pinning |
| I/O-bound workload | Use soft affinity (default) | Cache state not critical; scheduler flexibility valuable |
| CPU-bound, large working set | Consider hard affinity | Significant migration cost avoided |
| NUMA-aware memory allocation | Hard affinity to local NUMA CPUs | Avoids remote memory latency |
| Real-time or latency-critical | Hard affinity + isolation | Eliminates migration jitter |
| Multi-threaded, data sharing | Pin to same LLC domain | Reduces cache coherence traffic |
Affinity tuning should be data-driven. Measure performance with default scheduler behavior, then experiment with affinity settings and measure again. Tools like perf stat (Linux) can show cache miss rates and context switch costs. Only apply hard affinity when measurements demonstrate benefit.
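One lightweight way to gather such measurements is a small harness that runs the same workload pinned or unpinned so the two runs can be compared; in the sketch below the streaming-sum loop is only a stand-in workload and the CPU number comes from the command line:

```c
/* Minimal A/B harness: run a stand-in workload, optionally pinned to one CPU.
 * Usage: ./bench        (unpinned, scheduler decides)
 *        ./bench 3      (hard affinity to CPU 3)
 * Compare wall-clock times, or wrap with perf stat for cache-miss counts. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (8 * 1024 * 1024)   /* 32 MB of ints: larger than a typical L3 slice */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    if (argc > 1) {                       /* optional: pin to the given CPU */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(atoi(argv[1]), &set);
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }
    }

    int *buf = malloc((size_t)N * sizeof(int));
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 1, (size_t)N * sizeof(int));

    double t0 = now_sec();
    long long sum = 0;
    for (int pass = 0; pass < 10; pass++)  /* sweep the buffer repeatedly */
        for (int i = 0; i < N; i++)
            sum += buf[i];
    double t1 = now_sec();

    printf("checksum=%lld elapsed=%.3f s\n", sum, t1 - t0);
    free(buf);
    return 0;
}
```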
Modern CPUs have complex hierarchical topologies. Effective affinity decisions require understanding this structure.
The Topology Hierarchy:
A typical server might have two sockets, each containing dozens of physical cores, each core exposing two SMT hardware threads, with private L1/L2 caches per core, a shared L3 cache per socket, and one NUMA memory node per socket.
Within this hierarchy, different CPUs share different resources: SMT siblings share a core's execution units and L1/L2 caches, cores on the same socket share the L3 (last-level cache), and CPUs on different sockets share only main memory across the interconnect.
Discovering CPU Topology on Linux:

```
$ lscpu | head -20
Architecture:          x86_64
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          2
Model name:            Intel(R) Xeon(R) Platinum 8358
L1d cache:             48K
L1i cache:             32K
L2 cache:              1280K
L3 cache:              49152K
NUMA node0 CPU(s):     0-31,64-95
NUMA node1 CPU(s):     32-63,96-127

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,64

$ cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
0-31,64-95
```

Understanding the output:
- CPU 0 and CPU 64 are SMT siblings (same physical core)
- CPUs 0-31 and 64-95 share the same socket (L3 cache)
- CPUs 0-31 are on NUMA node 0

Affinity implications:
- Pinning related threads within 0-31 keeps them in the same L3 domain
- Pinning to 0 and 64 shares L2/L1 (good for sharing, bad for contention)
- Pinning to 0 and 32 crosses sockets (maximum memory bandwidth, worst sharing)

Topology-Informed Pinning Strategies:
1. Spread Across Cores (Maximize Throughput): For compute-intensive threads that don't share data, spread across physical cores to maximize cache resources (see the code sketch after this list):
Thread 0 → CPU 0
Thread 1 → CPU 1
Thread 2 → CPU 2
...
(Skipping SMT siblings)
2. Pack on Shared Cache (Data Sharing): For threads that frequently share data, keep them on the same socket/L3:
Producer thread → CPU 0
Consumer thread → CPU 1 (same socket, different core)
3. Isolate Critical Threads: For latency-critical work, isolate entire physical cores:
# Reserve core 0 (CPUs 0 and 64) for critical work
isolcpus=0,64 (kernel boot parameter)
taskset -c 0 ./critical_process
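Below is a sketch of the first strategy (spreading compute threads across physical cores) using the pthread affinity API. It assumes the example topology shown earlier, where CPUs 0 through 31 are distinct physical cores, so mapping thread i to CPU i avoids SMT siblings; on other machines the core list should come from sysfs or a library such as hwloc.

```c
/* Spread worker threads across distinct physical cores.
 * Assumes the example topology above (CPUs 0..31 are separate physical cores);
 * on other systems, discover real core IDs from sysfs or hwloc first. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running on CPU %d\n", id, sched_getcpu());
    /* ... compute-intensive work that does not share data ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);            /* thread i -> physical core i */

        /* Apply the affinity via the creation attributes so the thread
         * never runs on the wrong CPU, even briefly. */
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

The packing strategy is the same call with a different CPU choice: pinning a producer and a consumer to two cores that share an L3 keeps their shared buffers in one cache domain.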
Linux organizes CPUs into nested 'scheduling domains' that reflect the topology. Load balancing prefers migration within the same domain (e.g., same socket) before migrating across domains. Understanding domains helps predict scheduler behavior even without explicit affinity settings.
Beyond process affinity, modern systems allow controlling which CPUs handle hardware interrupts. Interrupt affinity is a powerful tool for optimizing I/O-intensive workloads.
Why Interrupt Affinity Matters:
When a network packet arrives, the NIC raises an interrupt; the interrupt handler and subsequent protocol processing pull the packet data into the handling CPU's caches before the consuming application is woken.
If the interrupt handler runs on CPU 0 but the application runs on CPU 5, the packet data must traverse memory and caches to reach the application. If both run on the same CPU (or at least same L3 domain), data access is faster.
Receive Side Scaling (RSS):
Modern NICs support RSS, distributing interrupt load across multiple CPUs: the NIC hashes each packet's flow tuple (addresses and ports), uses the hash to select one of several receive queues, and each queue's interrupt can be steered to a different CPU.
This allows scaling network processing across cores while maintaining flow affinity (packets from the same connection go to the same CPU).
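A toy sketch of the idea shows why every packet of a given connection lands on the same CPU; note that a real NIC uses a Toeplitz hash and a configurable indirection table, so the hash function and queue-to-CPU table below are placeholders:

```c
/* Toy model of RSS flow steering: hash the flow tuple, pick a receive queue,
 * and deliver to the CPU that queue's IRQ is bound to. Real NICs use a
 * Toeplitz hash and an indirection table; this stand-in only illustrates
 * why one flow's packets keep arriving on the same CPU. */
#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES 4
static const int queue_to_cpu[NUM_QUEUES] = { 0, 1, 2, 3 };  /* IRQ affinity */

static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;                  /* placeholder mixing, not Toeplitz */
    return h * 0x9E3779B1u;
}

int main(void)
{
    /* Every packet of this TCP connection hashes to the same value,
     * hence the same queue, hence the same CPU. */
    uint32_t h = flow_hash(0x0A000001, 0x0A000002, 40000, 443);
    int queue  = (int)(h % NUM_QUEUES);
    printf("flow -> queue %d -> CPU %d\n", queue, queue_to_cpu[queue]);
    return 0;
}
```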
```bash
#!/bin/bash
# Setting interrupt affinity on Linux

# Find interrupts for NIC (eth0)
cat /proc/interrupts | grep eth0
# Output example:
#  72:  123456   0   0   0   IR-PCI-MSI   eth0-rx-0
#  73:  234567   0   0   0   IR-PCI-MSI   eth0-rx-1
#  74:  345678   0   0   0   IR-PCI-MSI   eth0-rx-2
#  75:  456789   0   0   0   IR-PCI-MSI   eth0-rx-3

# Set affinity for each receive queue
# /proc/irq/<irq_num>/smp_affinity        (hex mask)
# /proc/irq/<irq_num>/smp_affinity_list   (CPU list)
echo 0 > /proc/irq/72/smp_affinity_list   # IRQ 72 → CPU 0
echo 1 > /proc/irq/73/smp_affinity_list   # IRQ 73 → CPU 1
echo 2 > /proc/irq/74/smp_affinity_list   # IRQ 74 → CPU 2
echo 3 > /proc/irq/75/smp_affinity_list   # IRQ 75 → CPU 3

# Verify the settings
cat /proc/irq/72/smp_affinity_list
# Output: 0

# Using irqbalance service (automatic balancing)
systemctl status irqbalance

# Disable irqbalance if doing manual tuning
systemctl stop irqbalance

# Software steering (RPS) masks are exposed under:
# /sys/class/net/eth0/queues/rx-*/rps_cpus
```

Coordinating Interrupt and Process Affinity:
For optimal I/O performance, align interrupt handling with application processing:
Pattern: Affinity Alignment. Steer each receive queue's IRQ to a specific CPU and pin the thread that consumes that queue's data to the same CPU, or at least to the same L3 cache domain.
Pattern: NUMA-Local Processing. Keep the IRQ, the consuming thread, and the packet buffers on the NUMA node to which the NIC is physically attached, so packet data never crosses the socket interconnect.
irqbalance:
The irqbalance daemon automatically distributes interrupts across CPUs based on load. For most systems, this provides reasonable defaults. Manual interrupt affinity is typically only needed for high-packet-rate network servers, latency-critical systems, and deliberately isolated CPUs that must not receive device interrupts.
When hardware RSS isn't available or insufficient, Linux provides software alternatives: Receive Packet Steering (RPS) distributes packets across CPUs in software, and Receive Flow Steering (RFS) directs packets to the CPU where the consuming application runs. These enable cache-efficient I/O processing without hardware support.
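For completeness, here is a sketch of enabling RPS from a program rather than a shell; the interface is simply a hex CPU mask written to the queue's rps_cpus file (the device name, queue, and mask below are placeholders, and writing requires root):

```c
/* Enable software packet steering (RPS) for one receive queue by writing a
 * hex CPU mask to its rps_cpus file. Device, queue, and mask are examples. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
    const char *mask = "f";                /* hex mask: CPUs 0-3 */

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen rps_cpus");          /* typically needs root */
        return 1;
    }
    fprintf(f, "%s\n", mask);
    fclose(f);
    printf("RPS mask %s written to %s\n", mask, path);
    return 0;
}
```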
Real-world systems often require sophisticated affinity configurations that go beyond simple process-to-CPU binding.
CPU Isolation (isolcpus):
For extreme latency requirements, CPUs can be completely isolated from the general scheduler:
# Kernel boot parameter
isolcpus=4-7,12-15 # Isolate cores 4-7 and 12-15
Isolated CPUs receive no tasks from the general scheduler or load balancer; only processes explicitly pinned to them (via taskset, sched_setaffinity, or cpusets) will run there, which makes them well suited to latency-critical threads.
cgroups CPU Affinity (cpuset):
The cgroups subsystem provides container-level affinity control:
# Create a cpuset cgroup
mkdir /sys/fs/cgroup/cpuset/my_app
# Assign CPUs 0-3
echo "0-3" > /sys/fs/cgroup/cpuset/my_app/cpuset.cpus
# Assign memory nodes (NUMA)
echo "0" > /sys/fs/cgroup/cpuset/my_app/cpuset.mems
# Move process into the cgroup
echo $PID > /sys/fs/cgroup/cpuset/my_app/tasks
This is the mechanism container runtimes (Docker, Kubernetes) use to restrict containers to specific CPUs; CPU time limits, by contrast, are enforced separately through the cpu controller's bandwidth quotas.
NUMA and Affinity Integration:
On NUMA systems, CPU affinity and memory placement are tightly coupled:
# Pin process to CPUs 0-7 and memory node 0
numactl --cpunodebind=0 --membind=0 ./my_application
# Pin process to CPUs 0-7, prefer local memory but allow remote
numactl --cpunodebind=0 --preferred=0 ./my_application
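The same placement can be requested from inside a program through libnuma; below is a minimal sketch roughly mirroring the cpunodebind/preferred combination above (requires libnuma, link with -lnuma):

```c
/* Bind the calling process to node 0's CPUs and prefer node-0 memory,
 * roughly mirroring `numactl --cpunodebind=0 --preferred=0`. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() == -1) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    if (numa_run_on_node(0) != 0) {   /* restrict execution to node 0's CPUs */
        perror("numa_run_on_node");
        return 1;
    }
    numa_set_preferred(0);            /* prefer, but do not require, node-0 memory */

    printf("Pinned to node 0 CPUs; node-0 memory preferred\n");
    /* ... allocate and run the NUMA-sensitive workload here ... */
    return 0;
}
```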
Affinity in Virtualized Environments:
Virtual machines add complexity: the guest scheduler places processes on virtual CPUs, while the hypervisor independently places those vCPUs on physical CPUs, so cache locality must be managed at both layers.
Without vCPU pinning, a virtual machine's vCPU might migrate between physical CPUs, losing cache state and potentially moving across NUMA boundaries—degrading guest performance.
Kubernetes offers a 'static' CPU manager policy for Guaranteed QoS pods that request integer CPU values. This provides exclusive CPU sets similar to cpuset cgroups. However, by default (none policy), pods share CPUs via CFS bandwidth control, which doesn't provide affinity. Understanding this distinction is critical for latency-sensitive containerized workloads.
We have explored processor affinity from cache locality fundamentals through advanced configuration scenarios. This knowledge enables informed decisions about when and how to constrain process placement for performance optimization.
Consolidating Our Understanding: cache locality is the reason affinity exists; soft affinity is the scheduler's default and suffices for most workloads; hard affinity, topology-aware pinning, interrupt affinity, and cpuset/NUMA integration are the tools for the cases where it does not.
What's Next:
With affinity mechanisms understood, we'll explore Load Balancing—the scheduler's challenge of distributing work across processors while respecting affinity constraints. Load balancing is the dynamic counterpart to static affinity: it redistributes work when imbalance occurs, but must weigh migration costs against the benefits of better distribution.
You now understand processor affinity at a depth sufficient for production system tuning and kernel-level reasoning. You can evaluate when affinity constraints benefit performance, select appropriate APIs and tools, and design affinity strategies that account for CPU topology and NUMA architecture.