In an ideal SMP world, any process can run on any processor with equal efficiency. Reality is more nuanced. When a process runs on a processor, it populates that processor's caches with its working data—creating a warm cache state. Moving the process to a different processor means starting with a cold cache, suffering cache misses until the working set is reloaded.
This cache locality phenomenon gives rise to processor affinity—the tendency, or explicit requirement, for a process to run on the same processor(s) over time. Understanding affinity is essential for building high-performance multi-processor systems, as intelligent affinity management can dramatically impact throughput, latency, and power consumption.
By the end of this page, you will understand both soft and hard processor affinity, the performance implications of cache-aware scheduling, system interfaces for controlling affinity, and advanced considerations including NUMA effects and SMT siblings. You'll be able to make informed decisions about when and how to apply affinity constraints in production systems.
Processor affinity is fundamentally about cache locality. To understand why affinity matters, we must first understand how modern cache hierarchies work and their dramatic performance impact.
Cache Working Set Dynamics:
When a process executes, it accesses memory in patterns determined by its algorithm and data structures, repeatedly touching the same instructions, stack frames, and heap objects.
Over time, the caches become "tuned" to the process's behavior. The working set—the set of cache lines the process uses during a given time interval—resides in local cache, enabling fast access.
| Scenario | Access Latency | Relative Cost | Impact |
|---|---|---|---|
| L1 cache hit | ~4 cycles (1-2 ns) | 1x (baseline) | Optimal performance |
| L2 cache hit | ~12 cycles (4 ns) | 3x | Minor slowdown |
| L3 cache hit | ~40 cycles (15 ns) | 10x | Noticeable latency |
| Main memory | ~200 cycles (70 ns) | 50x | Significant stall |
| Full working set reload | Millions of cycles | 1000x+ | Major performance hit |
The Migration Cost:
When the scheduler moves a process from Processor A to Processor B, the state cached on A becomes useless: B starts cold and takes a miss on every first touch of the working set, while dirty lines left behind on A must be written back or transferred through the cache-coherence protocol.
For CPU-intensive processes with large working sets, migration cost can amount to millions of cycles—potentially negating any load-balancing benefit that motivated the migration.
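A rough back-of-envelope calculation shows where the "millions of cycles" figure comes from; the sketch below simply multiplies out the approximate latencies from the table above (the 2 MB working set, 64-byte line, and ~200-cycle miss penalty are assumptions, not measurements):

```c
/* Back-of-envelope estimate of the cold-cache reload cost after a migration.
 * Reuses the approximate latencies from the table above; real costs depend on
 * prefetching, access patterns, and available memory bandwidth. */
#include <stdio.h>

int main(void)
{
    const long working_set_bytes = 2L * 1024 * 1024;  /* assume a 2 MB working set */
    const long cache_line_bytes  = 64;                /* typical line size */
    const long miss_penalty_cyc  = 200;               /* ~main-memory latency */

    long lines_to_reload = working_set_bytes / cache_line_bytes;   /* 32768 lines */
    long reload_cycles   = lines_to_reload * miss_penalty_cyc;     /* ~6.5M cycles */

    printf("Cold-cache reload: ~%ld lines, ~%ld cycles (~%.2f ms at 3 GHz)\n",
           lines_to_reload, reload_cycles, reload_cycles / 3.0e6);
    return 0;
}
```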
Quantifying Migration Cost:
Research and production experience show that migration costs vary widely, depending on working-set size, the cache hierarchy, and how much data must ultimately be re-fetched from main memory.
Aggressive load balancing that frequently migrates processes can paradoxically reduce total system throughput. Each migration trades guaranteed cache-miss penalties for potential load-balance gains. If migrations occur faster than caches can warm, processes never reach optimal performance. This is the fundamental tension that affinity mechanisms address.
Soft affinity (also called natural affinity) is the scheduler's default tendency to keep a process on the same processor where it last ran, without any explicit constraint from the user or application.
How Soft Affinity Works:
Modern schedulers implement soft affinity through several mechanisms:
1. Per-CPU Run Queues: When a process becomes ready (e.g., after sleeping), the scheduler places it on its "home" CPU's run queue by default. Migration only occurs if load balancing triggers.
2. Migration Cost Awareness: Advanced schedulers estimate migration cost and factor it into load-balancing decisions. A small imbalance may not justify the migration overhead.
3. Wake-Up Heuristics: When a process is woken (by I/O completion, signal, etc.), the scheduler often places it on the CPU where the waking process runs, anticipating data sharing.
4. Load Balancing Thresholds: Load balancers don't react to minor imbalances. A CPU must be significantly more loaded than others before migrations trigger.
```c
/* Conceptual soft affinity in scheduler wake-up path */

void wake_up_process(struct task *p)
{
    int prev_cpu = p->last_cpu;      /* CPU where process last ran */
    int this_cpu = current_cpu();    /* CPU executing wake_up */
    int target_cpu;

    /* Default: prefer previous CPU (soft affinity) */
    target_cpu = prev_cpu;

    /* Check if previous CPU is available */
    if (!cpu_is_idle(prev_cpu)) {
        /* Consider waking on current CPU if it makes sense */
        if (cpu_is_idle(this_cpu) && on_same_domain(prev_cpu, this_cpu)) {
            target_cpu = this_cpu;
        }
        /* Or find a nearby idle CPU in the same cache domain */
        else {
            target_cpu = find_idle_cpu_in_domain(prev_cpu);
        }
    }

    /* Enqueue on target CPU */
    enqueue_task(cpu_rq(target_cpu), p);

    /* If target was idle, send IPI to wake it */
    if (cpu_is_idle(target_cpu)) {
        send_reschedule_ipi(target_cpu);
    }
}
```

Soft Affinity Characteristics: soft affinity is purely advisory. The scheduler prefers the previous CPU but remains free to migrate the process whenever load balancing justifies it, and it requires no configuration from the application.
When Soft Affinity Works Well:
Soft affinity provides good results for many workloads: general-purpose servers, interactive applications, and I/O-bound processes that carry little cache state and benefit from the scheduler's freedom to balance load.
When Soft Affinity Is Insufficient:
Some scenarios require stronger guarantees: latency-critical and real-time tasks that cannot tolerate migration jitter, NUMA-sensitive applications whose memory is bound to one node, and reproducible benchmarking where placement must be deterministic.
For the majority of workloads, soft affinity provides an appropriate balance between cache efficiency and load distribution. Modern schedulers are sophisticated enough that explicit affinity settings are the exception, not the rule. Premature optimization with hard affinity can actually harm performance by preventing the scheduler from making beneficial migrations.
Hard affinity (also called CPU pinning or processor binding) is an explicit constraint that restricts a process to run only on a specified subset of CPUs. Unlike soft affinity, the scheduler cannot override hard affinity for load balancing.
The CPU Affinity Mask:
Hard affinity is implemented via a CPU affinity mask—a bitmask where each bit represents whether the process is allowed to run on the corresponding CPU:
For example, on an 8-CPU system:
- 0xFF (binary 11111111): Allowed on all CPUs (default)
- 0x01 (binary 00000001): Pinned to CPU 0 only
- 0x0A (binary 00001010): Allowed on CPUs 1 and 3
- 0xF0 (binary 11110000): Allowed on CPUs 4-7
```c
/* Linux CPU Affinity API Example */

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void set_thread_affinity_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);        /* Initialize: clear all bits */
    CPU_SET(cpu, &cpuset);    /* Set bit for target CPU */

    int result = pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset);
    if (result != 0) {
        /* pthread functions return the error number rather than setting errno */
        fprintf(stderr, "pthread_setaffinity_np failed: %s\n", strerror(result));
        exit(EXIT_FAILURE);
    }
    printf("Thread pinned to CPU %d\n", cpu);
}

void set_process_affinity_to_cpus(pid_t pid, int *cpus, int num_cpus)
{
    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);
    for (int i = 0; i < num_cpus; i++) {
        CPU_SET(cpus[i], &cpuset);    /* Allow each specified CPU */
    }

    if (sched_setaffinity(pid, sizeof(cpuset), &cpuset) == -1) {
        perror("sched_setaffinity failed");
        exit(EXIT_FAILURE);
    }
    printf("Process %d affinity set to %d CPUs\n", pid, num_cpus);
}

void get_and_print_affinity(void)
{
    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);
    if (sched_getaffinity(0, sizeof(cpuset), &cpuset) == -1) {
        perror("sched_getaffinity failed");
        return;
    }

    printf("Current process affinity: ");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &cpuset)) {
            printf("CPU%d ", cpu);
        }
    }
    printf("\n");
}

int main(void)
{
    get_and_print_affinity();

    /* Pin current process (pid 0 = calling process) to CPU 0 */
    int target_cpu = 0;
    set_process_affinity_to_cpus(0, &target_cpu, 1);

    get_and_print_affinity();
    return 0;
}
```

Command-Line Tools:
Several utilities allow affinity manipulation without code changes:
taskset (Linux):
# Run command with specific affinity
taskset -c 0,2,4 ./my_program # CPUs 0, 2, and 4
taskset 0x0F ./my_program # CPUs 0-3 (hex mask)
# Change affinity of running process
taskset -p -c 0 1234 # Pin PID 1234 to CPU 0
numactl (Linux with NUMA):
# Run on specific CPUs and memory nodes
numactl --cpunodebind=0 --membind=0 ./my_program
start /affinity (Windows):
REM Start process with an affinity mask (interpreted as hexadecimal); 3 = CPUs 0 and 1
start /affinity 3 MyProgram.exe
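The programmatic equivalent on Windows is the Win32 affinity-mask API; a minimal sketch (the mask value is just an example, and this compiles only with a Windows toolchain):

```c
/* Windows-only sketch: restrict the current process to CPUs 0 and 1. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 0x3;   /* bits 0 and 1 => CPUs 0 and 1 */

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());
        return 1;
    }
    printf("Process restricted to CPUs 0 and 1\n");
    return 0;
}
```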
Setting hard affinity prevents the scheduler from migrating the process for load balancing. If you pin four CPU-intensive processes to the same CPU, the scheduler cannot help—they will time-share that single CPU while other CPUs sit idle. Use hard affinity judiciously and only when you understand the full system workload.
Affinity decisions have measurable performance impact. Understanding when affinity helps versus hurts is essential for effective system tuning.
The table below summarizes when affinity tends to improve performance and when it is likely to hurt:
| Workload Characteristic | Recommendation | Rationale |
|---|---|---|
| Short-lived processes | Use soft affinity (default) | Not enough cache state to benefit from pinning |
| I/O-bound workload | Use soft affinity (default) | Cache state not critical; scheduler flexibility valuable |
| CPU-bound, large working set | Consider hard affinity | Significant migration cost avoided |
| NUMA-aware memory allocation | Hard affinity to local NUMA CPUs | Avoids remote memory latency |
| Real-time or latency-critical | Hard affinity + isolation | Eliminates migration jitter |
| Multi-threaded, data sharing | Pin to same LLC domain | Reduces cache coherence traffic |
Affinity tuning should be data-driven. Measure performance with default scheduler behavior, then experiment with affinity settings and measure again. Tools like perf stat (Linux) can show cache miss rates and context switch costs. Only apply hard affinity when measurements demonstrate benefit.
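One lightweight way to gather such measurements is a small harness that runs the same workload pinned or unpinned so the two runs can be compared; in the sketch below the streaming-sum loop is only a stand-in workload and the CPU number comes from the command line:

```c
/* Minimal A/B harness: run a stand-in workload, optionally pinned to one CPU.
 * Usage: ./bench        (unpinned, scheduler decides)
 *        ./bench 3      (hard affinity to CPU 3)
 * Compare wall-clock times, or wrap with perf stat for cache-miss counts. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (8 * 1024 * 1024)   /* 32 MB of ints: larger than a typical L3 slice */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    if (argc > 1) {                       /* optional: pin to the given CPU */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(atoi(argv[1]), &set);
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }
    }

    int *buf = malloc((size_t)N * sizeof(int));
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 1, (size_t)N * sizeof(int));

    double t0 = now_sec();
    long long sum = 0;
    for (int pass = 0; pass < 10; pass++)  /* sweep the buffer repeatedly */
        for (int i = 0; i < N; i++)
            sum += buf[i];
    double t1 = now_sec();

    printf("checksum=%lld elapsed=%.3f s\n", sum, t1 - t0);
    free(buf);
    return 0;
}
```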
Modern CPUs have complex hierarchical topologies. Effective affinity decisions require understanding this structure.
The Topology Hierarchy:
A typical server might have two sockets, each containing dozens of physical cores, each core exposing two SMT hardware threads, with private L1/L2 caches per core, a shared L3 cache per socket, and one NUMA memory node per socket.
Within this hierarchy, different CPUs share different resources: SMT siblings share a core's execution units and L1/L2 caches, cores on the same socket share the L3 (last-level cache), and CPUs on different sockets share only main memory across the interconnect.
Discovering CPU Topology on Linux:

```
$ lscpu | head -20
Architecture:          x86_64
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          2
Model name:            Intel(R) Xeon(R) Platinum 8358
L1d cache:             48K
L1i cache:             32K
L2 cache:              1280K
L3 cache:              49152K
NUMA node0 CPU(s):     0-31,64-95
NUMA node1 CPU(s):     32-63,96-127

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,64

$ cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
0-31,64-95
```

Understanding the output:
- CPU 0 and CPU 64 are SMT siblings (same physical core)
- CPUs 0-31 and 64-95 share the same socket (L3 cache)
- CPUs 0-31 are on NUMA node 0

Affinity implications:
- Pinning related threads within 0-31 keeps them in the same L3 domain
- Pinning to 0 and 64 shares L2/L1 (good for sharing, bad for contention)
- Pinning to 0 and 32 crosses sockets (maximum memory bandwidth, worst sharing)

Topology-Informed Pinning Strategies:
1. Spread Across Cores (Maximize Throughput): For compute-intensive threads that don't share data, spread across physical cores to maximize cache resources (see the code sketch after this list):
Thread 0 → CPU 0
Thread 1 → CPU 1
Thread 2 → CPU 2
...
(Skipping SMT siblings)
2. Pack on Shared Cache (Data Sharing): For threads that frequently share data, keep them on the same socket/L3:
Producer thread → CPU 0
Consumer thread → CPU 1 (same socket, different core)
3. Isolate Critical Threads: For latency-critical work, isolate entire physical cores:
# Reserve core 0 (CPUs 0 and 64) for critical work
isolcpus=0,64 (kernel boot parameter)
taskset -c 0 ./critical_process
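Below is a sketch of the first strategy (spreading compute threads across physical cores) using the pthread affinity API. It assumes the example topology shown earlier, where CPUs 0 through 31 are distinct physical cores, so mapping thread i to CPU i avoids SMT siblings; on other machines the core list should come from sysfs or a library such as hwloc.

```c
/* Spread worker threads across distinct physical cores.
 * Assumes the example topology above (CPUs 0..31 are separate physical cores);
 * on other systems, discover real core IDs from sysfs or hwloc first. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running on CPU %d\n", id, sched_getcpu());
    /* ... compute-intensive work that does not share data ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);            /* thread i -> physical core i */

        /* Apply the affinity via the creation attributes so the thread
         * never runs on the wrong CPU, even briefly. */
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

The packing strategy is the same call with a different CPU choice: pinning a producer and a consumer to two cores that share an L3 keeps their shared buffers in one cache domain.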
Linux organizes CPUs into nested 'scheduling domains' that reflect the topology. Load balancing prefers migration within the same domain (e.g., same socket) before migrating across domains. Understanding domains helps predict scheduler behavior even without explicit affinity settings.
Beyond process affinity, modern systems allow controlling which CPUs handle hardware interrupts. Interrupt affinity is a powerful tool for optimizing I/O-intensive workloads.
Why Interrupt Affinity Matters:
When a network packet arrives, the NIC raises an interrupt; the interrupt handler and subsequent protocol processing pull the packet data into the handling CPU's caches before the consuming application is woken.
If the interrupt handler runs on CPU 0 but the application runs on CPU 5, the packet data must traverse memory and caches to reach the application. If both run on the same CPU (or at least same L3 domain), data access is faster.
Receive Side Scaling (RSS):
Modern NICs support RSS, distributing interrupt load across multiple CPUs: the NIC hashes each packet's flow tuple (addresses and ports), uses the hash to select one of several receive queues, and each queue's interrupt can be steered to a different CPU.
This allows scaling network processing across cores while maintaining flow affinity (packets from the same connection go to the same CPU).
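A toy sketch of the idea shows why every packet of a given connection lands on the same CPU; note that a real NIC uses a Toeplitz hash and a configurable indirection table, so the hash function and queue-to-CPU table below are placeholders:

```c
/* Toy model of RSS flow steering: hash the flow tuple, pick a receive queue,
 * and deliver to the CPU that queue's IRQ is bound to. Real NICs use a
 * Toeplitz hash and an indirection table; this stand-in only illustrates
 * why one flow's packets keep arriving on the same CPU. */
#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES 4
static const int queue_to_cpu[NUM_QUEUES] = { 0, 1, 2, 3 };  /* IRQ affinity */

static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;                  /* placeholder mixing, not Toeplitz */
    return h * 0x9E3779B1u;
}

int main(void)
{
    /* Every packet of this TCP connection hashes to the same value,
     * hence the same queue, hence the same CPU. */
    uint32_t h = flow_hash(0x0A000001, 0x0A000002, 40000, 443);
    int queue  = (int)(h % NUM_QUEUES);
    printf("flow -> queue %d -> CPU %d\n", queue, queue_to_cpu[queue]);
    return 0;
}
```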
```bash
#!/bin/bash
# Setting interrupt affinity on Linux

# Find interrupts for NIC (eth0)
cat /proc/interrupts | grep eth0
# Output example:
#  72:  123456   0   0   0   IR-PCI-MSI   eth0-rx-0
#  73:  234567   0   0   0   IR-PCI-MSI   eth0-rx-1
#  74:  345678   0   0   0   IR-PCI-MSI   eth0-rx-2
#  75:  456789   0   0   0   IR-PCI-MSI   eth0-rx-3

# Set affinity for each receive queue
# /proc/irq/<irq_num>/smp_affinity        (hex mask)
# /proc/irq/<irq_num>/smp_affinity_list   (CPU list)
echo 0 > /proc/irq/72/smp_affinity_list   # IRQ 72 → CPU 0
echo 1 > /proc/irq/73/smp_affinity_list   # IRQ 73 → CPU 1
echo 2 > /proc/irq/74/smp_affinity_list   # IRQ 74 → CPU 2
echo 3 > /proc/irq/75/smp_affinity_list   # IRQ 75 → CPU 3

# Verify the settings
cat /proc/irq/72/smp_affinity_list
# Output: 0

# Using irqbalance service (automatic balancing)
systemctl status irqbalance

# Disable irqbalance if doing manual tuning
systemctl stop irqbalance

# Software steering (RPS) masks are exposed under:
# /sys/class/net/eth0/queues/rx-*/rps_cpus
```

Coordinating Interrupt and Process Affinity:
For optimal I/O performance, align interrupt handling with application processing:
Pattern: Affinity Alignment. Steer each receive queue's IRQ to a specific CPU and pin the thread that consumes that queue's data to the same CPU, or at least to the same L3 cache domain.
Pattern: NUMA-Local Processing. Keep the IRQ, the consuming thread, and the packet buffers on the NUMA node to which the NIC is physically attached, so packet data never crosses the socket interconnect.
irqbalance:
The irqbalance daemon automatically distributes interrupts across CPUs based on load. For most systems, this provides reasonable defaults. Manual interrupt affinity is typically only needed for high-packet-rate network servers, latency-critical systems, and deliberately isolated CPUs that must not receive device interrupts.
When hardware RSS isn't available or insufficient, Linux provides software alternatives: Receive Packet Steering (RPS) distributes packets across CPUs in software, and Receive Flow Steering (RFS) directs packets to the CPU where the consuming application runs. These enable cache-efficient I/O processing without hardware support.
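For completeness, here is a sketch of enabling RPS from a program rather than a shell; the interface is simply a hex CPU mask written to the queue's rps_cpus file (the device name, queue, and mask below are placeholders, and writing requires root):

```c
/* Enable software packet steering (RPS) for one receive queue by writing a
 * hex CPU mask to its rps_cpus file. Device, queue, and mask are examples. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
    const char *mask = "f";                /* hex mask: CPUs 0-3 */

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen rps_cpus");          /* typically needs root */
        return 1;
    }
    fprintf(f, "%s\n", mask);
    fclose(f);
    printf("RPS mask %s written to %s\n", mask, path);
    return 0;
}
```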
Real-world systems often require sophisticated affinity configurations that go beyond simple process-to-CPU binding.
CPU Isolation (isolcpus):
For extreme latency requirements, CPUs can be completely isolated from the general scheduler:
# Kernel boot parameter
isolcpus=4-7,12-15 # Isolate cores 4-7 and 12-15
Isolated CPUs receive no tasks from the general scheduler or load balancer; only processes explicitly pinned to them (via taskset, sched_setaffinity, or cpusets) will run there, which makes them well suited to latency-critical threads.
cgroups CPU Affinity (cpuset):
The cgroups subsystem provides container-level affinity control:
# Create a cpuset cgroup
mkdir /sys/fs/cgroup/cpuset/my_app
# Assign CPUs 0-3
echo "0-3" > /sys/fs/cgroup/cpuset/my_app/cpuset.cpus
# Assign memory nodes (NUMA)
echo "0" > /sys/fs/cgroup/cpuset/my_app/cpuset.mems
# Move process into the cgroup
echo $PID > /sys/fs/cgroup/cpuset/my_app/tasks
This is the mechanism container runtimes (Docker, Kubernetes) use to restrict containers to specific CPUs; CPU time limits, by contrast, are enforced separately through the cpu controller's bandwidth quotas.
NUMA and Affinity Integration:
On NUMA systems, CPU affinity and memory placement are tightly coupled:
# Pin process to CPUs 0-7 and memory node 0
numactl --cpunodebind=0 --membind=0 ./my_application
# Pin process to CPUs 0-7, prefer local memory but allow remote
numactl --cpunodebind=0 --preferred=0 ./my_application
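The same placement can be requested from inside a program through libnuma; below is a minimal sketch roughly mirroring the cpunodebind/preferred combination above (requires libnuma, link with -lnuma):

```c
/* Bind the calling process to node 0's CPUs and prefer node-0 memory,
 * roughly mirroring `numactl --cpunodebind=0 --preferred=0`. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() == -1) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    if (numa_run_on_node(0) != 0) {   /* restrict execution to node 0's CPUs */
        perror("numa_run_on_node");
        return 1;
    }
    numa_set_preferred(0);            /* prefer, but do not require, node-0 memory */

    printf("Pinned to node 0 CPUs; node-0 memory preferred\n");
    /* ... allocate and run the NUMA-sensitive workload here ... */
    return 0;
}
```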
Affinity in Virtualized Environments:
Virtual machines add complexity: the guest scheduler places processes on virtual CPUs, while the hypervisor independently places those vCPUs on physical CPUs, so cache locality must be managed at both layers.
Without vCPU pinning, a virtual machine's vCPU might migrate between physical CPUs, losing cache state and potentially moving across NUMA boundaries—degrading guest performance.
Kubernetes offers a 'static' CPU manager policy for Guaranteed QoS pods that request integer CPU values. This provides exclusive CPU sets similar to cpuset cgroups. However, by default (none policy), pods share CPUs via CFS bandwidth control, which doesn't provide affinity. Understanding this distinction is critical for latency-sensitive containerized workloads.
We have explored processor affinity from cache locality fundamentals through advanced configuration scenarios. This knowledge enables informed decisions about when and how to constrain process placement for performance optimization.
Consolidating Our Understanding: cache locality is the reason affinity exists; soft affinity is the scheduler's default and suffices for most workloads; hard affinity, topology-aware pinning, interrupt affinity, and cpuset/NUMA integration are the tools for the cases where it does not.
What's Next:
With affinity mechanisms understood, we'll explore Load Balancing—the scheduler's challenge of distributing work across processors while respecting affinity constraints. Load balancing is the dynamic counterpart to static affinity: it redistributes work when imbalance occurs, but must weigh migration costs against the benefits of better distribution.
You now understand processor affinity at a depth sufficient for production system tuning and kernel-level reasoning. You can evaluate when affinity constraints benefit performance, select appropriate APIs and tools, and design affinity strategies that account for CPU topology and NUMA architecture.