In a multi-processor system, where should a just-awakened process run? The naive answer might be 'any available CPU,' but decades of operating systems research and engineering have revealed a more nuanced truth: where a process ran last matters profoundly for where it should run next.
This insight gives rise to the concept of processor affinity—the tendency for a process to be scheduled on the same CPU where it previously executed. The most fundamental form of this behavior is soft affinity, a scheduling policy where the operating system attempts to maintain cache locality by preferring the previous CPU, but does not guarantee it.
Understanding soft affinity is essential for any systems engineer working on performance-sensitive applications. It explains why processes tend to 'stick' to certain CPUs, how the scheduler balances locality against load distribution, and when this default behavior helps versus when it hinders performance.
By the end of this page, you will understand: the theoretical foundations of soft affinity, how modern kernels implement this policy, the data structures and algorithms involved, the conditions under which soft affinity breaks down, and how to observe and measure affinity behavior in practice.
To understand soft affinity, we must first understand why CPU affinity matters at all. The answer lies in the architecture of modern processors, particularly the memory hierarchy.
The Cache Locality Principle:
Modern CPUs maintain private caches (L1, L2, and sometimes L3) that store recently accessed data. As a process executes, it gradually fills these caches with its instructions, its working data (heap, stack, globals), and related per-core state such as TLB entries and branch-predictor history.
This cache state represents a significant investment. A typical L1 cache can hold 32-64 KB per core, L2 caches 256 KB - 1 MB, and L3 caches (often shared) 8-32 MB or more. Building this 'warm' cache state takes thousands to millions of CPU cycles.
When a process migrates to a different CPU, its carefully built cache state becomes useless. The new CPU's caches are 'cold' for this process, leading to a cascade of cache misses. A hit in L1 costs roughly 4 cycles; a miss served from L2 costs ~10-20 cycles, one served from L3 costs 50-100+ cycles, and one that must go all the way to main memory costs 100-300+ cycles. This 'cache warmup penalty' can significantly degrade performance.
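To make the penalty concrete, here is a rough back-of-envelope estimate based on the cycle costs above. The working-set size, per-miss cost, and clock rate are assumptions chosen purely for illustration, not measurements:

```c
#include <stdio.h>

/* Rough, illustrative estimate of the cache-warmup penalty after a
 * migration. Real penalties depend on the workload's access pattern
 * and how much of its working set it actually re-touches. */
int main(void)
{
    const int line_size        = 64;   /* bytes per cache line (assumed)   */
    const int working_set_kb   = 256;  /* assumed hot working set          */
    const int miss_cost_cycles = 200;  /* assumed cost of a DRAM miss      */

    long lines  = (working_set_kb * 1024L) / line_size;
    long cycles = lines * miss_cost_cycles;

    printf("Refilling %d KB = %ld cache lines\n", working_set_kb, lines);
    printf("~%ld cycles of warmup penalty (~%.2f ms at 3 GHz)\n",
           cycles, cycles / 3.0e6);
    return 0;
}
```

With these assumed numbers the result is on the order of 800,000 cycles, or roughly a quarter of a millisecond, paid every time the warm cache state is thrown away.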
The Soft Affinity Policy:
Soft affinity is the operating system's default response to this cache locality challenge. The policy can be stated simply:
When scheduling a ready process, prefer the CPU where it last ran, unless compelling reasons dictate otherwise.
The key word is prefer. Soft affinity is a hint, not a constraint. The scheduler weighs the previous CPU against several factors: how heavily loaded that CPU currently is, whether other CPUs are sitting idle, how large the resulting load imbalance would be, and whether the previous CPU is available at all.
Soft affinity represents a best-effort optimization. It improves average-case performance without sacrificing scheduling flexibility.
| Previous CPU State | System State | Scheduling Decision |
|---|---|---|
| Idle | Balanced load | Schedule on previous CPU (affinity preserved) |
| Lightly loaded | Balanced load | Schedule on previous CPU (affinity preserved) |
| Heavily loaded | Imbalanced (other CPUs idle) | Migrate to idle CPU (affinity broken) |
| Heavily loaded | All CPUs loaded | May queue on previous CPU (affinity preserved) |
| Offline/unavailable | Any state | Migrate to available CPU (affinity broken) |
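The table can be read as a small decision procedure. The toy model below is only an illustration of that table, with invented inputs and a made-up load comparison; it is not how the kernel code is structured:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the soft-affinity decision table above.
 * prev_load / min_load are illustrative "runnable tasks per CPU" counts;
 * the real scheduler uses weighted load and per-domain thresholds. */
static int choose_cpu(int prev_cpu, bool prev_online,
                      int prev_load, int idle_cpu, int min_load)
{
    if (!prev_online)                      /* previous CPU offline        */
        return idle_cpu;                   /* affinity must be broken     */

    if (prev_load == 0 || prev_load <= min_load + 1)
        return prev_cpu;                   /* affinity preserved          */

    if (idle_cpu >= 0)                     /* big imbalance, idle CPU     */
        return idle_cpu;                   /* affinity broken             */

    return prev_cpu;                       /* all busy: queue locally     */
}

int main(void)
{
    printf("%d\n", choose_cpu(2, true,  0, -1, 0));  /* prev idle  -> 2 */
    printf("%d\n", choose_cpu(2, true,  5,  6, 0));  /* imbalanced -> 6 */
    printf("%d\n", choose_cpu(2, true,  5, -1, 3));  /* all busy   -> 2 */
    printf("%d\n", choose_cpu(2, false, 0,  6, 0));  /* offline    -> 6 */
    return 0;
}
```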
The concept of processor affinity emerged alongside the development of symmetric multiprocessing (SMP) systems in the 1980s and 1990s. Understanding this history illuminates why modern systems implement affinity as they do.
Early SMP Systems (1980s-1990s):
The first SMP systems faced a fundamental scheduling question: should there be a single scheduler (master) or multiple independent schedulers (per-CPU)? Early designs often used a single scheduler with little consideration for cache affinity—all CPUs were treated as interchangeable.
As cache sizes grew and the performance gap between cache and main memory widened, the cost of ignoring locality became apparent. Systems like SVR4 (System V Release 4) and early versions of Solaris began incorporating affinity heuristics.
The Linux Journey:
Linux's scheduler evolution illustrates the increasing sophistication of affinity handling:
Linux 2.4 (O(n) scheduler): Basic affinity through a single runqueue. Affinity was implicit—processes stayed on CPUs because migration code was expensive.
Linux 2.6 (O(1) scheduler): Introduced per-CPU runqueues and explicit affinity tracking. The task_struct gained a cpus_allowed mask and affinity hints.
Linux 2.6.23+ (CFS - Completely Fair Scheduler): Sophisticated load balancing with multi-level scheduling domains. Affinity decisions now account for cache topology, NUMA distances, and load imbalance thresholds.
Windows NT has included soft affinity since its earliest multiprocessor versions. The Windows scheduler tracks 'ideal processor' for each thread—the processor where the thread last ran or was created. Windows attempts to schedule threads on their ideal processor unless load balancing dictates otherwise. This concept directly parallels Linux's soft affinity implementation.
Let us examine how modern kernels implement soft affinity, using Linux as our primary example. The implementation involves several interconnected subsystems.
Task Structure and Affinity State:
Every process (and thread) in Linux is represented by a task_struct. This structure contains affinity-related fields:
```c
/* Simplified from the Linux kernel's include/linux/sched.h */
struct task_struct {
    /* ... many other fields ... */

    /* CPU affinity mask - which CPUs this task CAN run on */
    cpumask_t  cpus_mask;        /* Effective mask */
    cpumask_t *cpus_ptr;         /* Pointer to effective mask */

    /* Original affinity set by user (for cpuset changes) */
    cpumask_t *user_cpus_ptr;

    /* The last CPU this task ran on - soft affinity hint */
    int recent_used_cpu;

    /* The CPU this task woke up on */
    int wake_cpu;

    /* The CPU currently running this task */
    unsigned int cpu;

    /* Flags indicating migration status */
    unsigned int migration_pending;

    /* Per-entity load tracking for CFS */
    struct sched_entity se;

    /* ... scheduling class, policy, etc. ... */
};
```

The Soft Affinity Algorithm:
When a task becomes runnable (e.g., wakes from sleep), the scheduler must decide where to place it. The select_task_rq() function orchestrates this decision, with soft affinity implemented primarily in select_task_rq_fair() for CFS-scheduled tasks.
The algorithm proceeds roughly as follows:

1. Restrict the search to CPUs the task is permitted to run on (its cpus_mask).
2. Start from the previous CPU as the soft-affinity candidate, also considering a recently used CPU (recent_used_cpu).
3. On a wakeup, check whether the waker's CPU would be a better target (the wake-affine path).
4. If the chosen candidate is busy, look for an idle CPU that shares a cache with it (an 'idle sibling').
5. If no candidate in the task's allowed mask remains, fall back to the load-balancing path to select a runqueue.
```c
/* Conceptual representation of soft affinity selection */
static int select_task_rq_fair(struct task_struct *p, int prev_cpu,
                               int sd_flag, int wake_flags)
{
    int new_cpu = prev_cpu;   /* Start with previous CPU - soft affinity */
    int want_affine = 0;

    /* Only consider affinity for wakeup paths */
    if (sd_flag & SD_BALANCE_WAKE) {
        int waker_cpu = smp_processor_id();

        /* Check if wake_affine makes sense for this task */
        want_affine = cpumask_test_cpu(waker_cpu, p->cpus_ptr);

        /* If waker's CPU is valid and beneficial, consider it */
        if (want_affine && wake_affine_ok(p, waker_cpu, prev_cpu))
            new_cpu = waker_cpu;          /* Wake affine wins */
    }

    /* Find idle sibling if previous/waker CPU is busy */
    if (!idle_cpu(new_cpu)) {
        new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
    }

    /* If all else fails, find CPU via load balance path */
    if (!cpumask_test_cpu(new_cpu, p->cpus_ptr)) {
        new_cpu = select_fallback_rq(p, prev_cpu);
    }

    return new_cpu;
}
```

The actual Linux implementation is significantly more complex, handling edge cases like RT tasks, CPU hotplug, cgroup hierarchies, and NUMA policies. The code above is conceptual; the real scheduler code spans thousands of lines across multiple files in kernel/sched/.
Modern processors have complex topologies—multiple cores share L3 caches, multiple dies share a socket, multiple sockets form a NUMA node. Soft affinity must respect this hierarchy, preferring not just 'the same CPU' but 'a CPU with shared cache' when the previous CPU is unavailable.
Scheduling Domains Explained:
Linux models processor topology through scheduling domains (sched_domains). A scheduling domain is a set of CPUs that share some level of the memory hierarchy. Domains are organized hierarchically:
Scheduling Domain Hierarchy (Example: 2-socket, 8-core/socket system)
┌─────────────────────────────────────────────────────────────────────────┐
│                          NUMA Domain (SD_NUMA)                          │
│                      All 16 cores across 2 sockets                      │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Socket/MC Domain (Socket 0)     │ │ Socket/MC Domain (Socket 1)     │ │
│ │ 8 cores, shared L3 cache        │ │ 8 cores, shared L3 cache        │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ LLC Domain  │ │ LLC Domain  │ │ │ │ LLC Domain  │ │ LLC Domain  │ │ │
│ │ │ Cores 0-3   │ │ Cores 4-7   │ │ │ │ Cores 8-11  │ │ Cores 12-15 │ │ │
│ │ │ Shared L2   │ │ Shared L2   │ │ │ │ Shared L2   │ │ Shared L2   │ │ │
│ │ └─────────────┘ └─────────────┘ │ │ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────┘ └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
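The cache-sharing relationships these domains are built from are visible in sysfs. The sketch below is a minimal Linux-only example that inspects CPU 0 only; it assumes the standard /sys/devices/system/cpu/cpu0/cache/index*/ layout, which may be absent in some containers or on unusual hardware:

```c
#include <stdio.h>
#include <string.h>

/* Print which CPUs share each cache level with CPU 0, using the
 * standard sysfs cache directories. Stops at the first missing index. */
int main(void)
{
    char path[128], level[16], type[32], shared[256];

    for (int idx = 0; idx < 8; idx++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                         /* no more cache indices */
        fscanf(f, "%15s", level);
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/type", idx);
        f = fopen(path, "r");
        if (!f)
            break;
        fscanf(f, "%31s", type);
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list",
                 idx);
        f = fopen(path, "r");
        if (!f)
            break;
        if (!fgets(shared, sizeof(shared), f))
            shared[0] = '\0';
        fclose(f);
        shared[strcspn(shared, "\n")] = '\0';

        printf("L%s %-12s shared by CPUs: %s\n", level, type, shared);
    }
    return 0;
}
```

On a typical multi-core part this prints, for example, L1 and L2 shared only by SMT siblings and an L3 shared by every core in the socket, mirroring the domain levels in the diagram above.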
Affinity Preference Order:
When the scheduler cannot place a task on its previous CPU, it follows the domain hierarchy to find the 'next best' location:

1. An idle SMT sibling on the same physical core (shared L1/L2).
2. An idle CPU sharing the same last-level cache.
3. Another CPU in the same socket.
4. A CPU on the same NUMA node.
5. Only as a last resort, a CPU on a remote NUMA node.
Soft Affinity Across Domains:
Scheduling domains have flags that control balancing behavior. Key flags affecting soft affinity include:
| Flag | Meaning | Impact on Affinity |
|---|---|---|
| SD_BALANCE_WAKE | Balance on wakeup | Enables wake-affine decisions |
| SD_WAKE_AFFINE | Consider waker's CPU | May override previous CPU |
| SD_SHARE_CPUPOWER | Hyperthreading domain | Siblings share execution resources |
| SD_SHARE_PKG_RESOURCES | Same physical package | Cores share L3 cache |
| SD_SHARE_LLC | Same last-level cache | Stronger affinity preference |
| SD_NUMA | NUMA boundary | Higher migration cost |
Each domain has an imbalance_pct threshold—migration across that domain only occurs if the load imbalance exceeds this percentage. Higher-level domains (NUMA) have higher thresholds, making migration across them less likely, effectively strengthening 'soft' affinity at NUMA boundaries.
You can inspect your system's scheduling domains via: cat /proc/sys/kernel/sched_domain/cpu0/domain*/name and examine their flags. This reveals how the kernel models your processor's cache topology and what affinity preferences apply at each level.
Soft affinity typically means 'prefer the previous CPU.' But there's an important exception: the wake-affine heuristic. When one task wakes another (e.g., via completion of an I/O request or unlocking a mutex), should the awakened task run on the CPU where it previously ran (classic soft affinity), or on the CPU of the task that just woke it?
The Wake-Affine Insight:
When tasks communicate, they often share data. If Task A writes data and then wakes Task B to process it, that data is likely in Task A's CPU's cache. Running Task B on the same CPU (or nearby core sharing L3) can be faster than running on B's previous CPU, which would need to fetch the data from memory or via cache coherence.
This creates a tension with soft affinity:
Scenario: Producer-Consumer Communication

CPU 0: Task A (producer)            CPU 1: Task B (consumer)
├── writes data to buffer           [previously ran here]
├── wakes Task B
└── continues working

Option 1: Schedule B on CPU 1 → Cache miss for shared data
Option 2: Schedule B on CPU 0 → Data likely in L1/L2 cache
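One simple way to watch this tension on a real machine is to have one thread wake another and have both report where they ran. The sketch below uses a plain pthread condition variable and sched_getcpu(); which CPUs it prints depends entirely on your kernel version, topology, and current load:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

/* Consumer: sleeps until woken, then reports where the scheduler placed it. */
static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    printf("consumer woke up on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);

    /* Producer: "produce" something, then wake the consumer. */
    printf("producer running on CPU %d\n", sched_getcpu());
    pthread_mutex_lock(&lock);
    ready = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}
```

Compile with gcc -pthread. On a lightly loaded machine you will often see the consumer land on the producer's CPU or on a core sharing its LLC; under heavier load the scheduler is more likely to leave it where it was.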
Wake-Affine Decision Criteria:
The scheduler uses several metrics to decide if wake-affine makes sense: the load on the waker's CPU compared with the wakee's previous CPU, whether the two CPUs share a last-level cache, and the recent wakeup relationship between the two tasks.
The kernel tracks wake-affine statistics and uses them to make adaptive decisions. When tasks have an established producer-consumer relationship, wake-affine often wins over previous-CPU affinity.
```c
/*
 * Simplified wake_affine logic
 * Returns 1 if task should run on waker's CPU, 0 otherwise
 */
static int wake_affine_decision(struct sched_domain *sd, struct task_struct *p,
                                int wake_cpu, int prev_cpu)
{
    int wake_load, prev_load;

    /* If waker and previous are same CPU, no conflict */
    if (wake_cpu == prev_cpu)
        return 1;

    /* Check if within wake-affine allowed domain */
    if (!(sd->flags & SD_WAKE_AFFINE))
        return 0;

    /* Evaluate based on LLC sharing */
    if (cpus_share_cache(wake_cpu, prev_cpu)) {
        /* Same LLC: wake-affine less disruptive */
        return check_wake_affine_metrics(p, wake_cpu, prev_cpu);
    }

    /* Cross-LLC or cross-NUMA: higher bar for wake-affine */
    wake_load = cpu_runqueue_load(wake_cpu);
    prev_load = cpu_runqueue_load(prev_cpu);

    /* Only wake-affine if waker's CPU significantly less loaded */
    if (wake_load < prev_load - WAKE_AFFINE_THRESHOLD)
        return 1;

    /* Default: preserve previous CPU affinity */
    return 0;
}
```

Linux's wake-affine heuristic has been refined over many kernel versions. Early implementations were too aggressive, causing 'task stacking' where all related tasks piled onto one CPU. Modern versions use more nuanced metrics and domain awareness to balance producer-consumer locality with load distribution.
Soft affinity exists in tension with another scheduler goal: load balancing. The scheduler must prevent scenarios where some CPUs are overloaded while others sit idle. But load balancing necessarily means moving tasks—exactly what affinity tries to avoid.
The Balancing Challenge:
Consider a system with 8 CPUs. If all 8 runnable tasks have strong affinity to CPUs 0-3 (perhaps they were all spawned from a process that ran there), perfect affinity would mean two tasks queued on each of CPUs 0-3 while CPUs 4-7 sit completely idle.
This is clearly suboptimal. The scheduler must override soft affinity to achieve reasonable load distribution.
Linux Load Balancing Hierarchy:
Linux balances load at multiple levels:
| Trigger | Frequency | Affinity Impact | Purpose |
|---|---|---|---|
| NEWIDLE | When CPU goes idle | Low—pulls from busiest nearby | Opportunistic work stealing |
| Periodic (SMT) | ~4ms | Low—balances hyperthreads | Keep SMT siblings balanced |
| Periodic (MC) | ~32ms | Medium—balances cores in socket | Prevent core-level hotspots |
| Periodic (NUMA) | ~64-256ms | High—cross-node migration | Last resort for major imbalance |
| Active balance | On extreme imbalance | Very high—forces migration | Emergency load redistribution |
Imbalance Calculation:
The scheduler computes imbalance by comparing the busiest and least busy CPUs (or groups of CPUs) in a domain. Conceptually, migration across the domain occurs only if:

busiest_load * 100 > local_load * imbalance_pct

That is, the busier side must exceed the less busy side by more than imbalance_pct - 100 percent before the scheduler considers migration worth its cost.
Higher-level domains have higher imbalance_pct values:
| Domain Level | Typical imbalance_pct | Meaning |
|---|---|---|
| SMT (Hyperthread) | 110% | Balance if >10% imbalance |
| MC (Multi-Core) | 117% | Balance if >17% imbalance |
| NUMA | 125-140% | Balance only if >25-40% imbalance |
These thresholds encode the cost of cache coldness: small imbalances aren't worth paying the migration penalty, especially across NUMA boundaries.
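As a worked example of these thresholds, the sketch below applies the conceptual check from above to a made-up 20% load imbalance: it clears the MC-level bar (117%) but not the NUMA-level bar (125%). The load values are arbitrary illustrative units, not real kernel metrics:

```c
#include <stdbool.h>
#include <stdio.h>

/* Conceptual version of the imbalance check described above:
 * migrate only if the busiest side's load exceeds the local side's
 * load by more than (imbalance_pct - 100) percent. */
static bool should_migrate(long busiest_load, long local_load, int imbalance_pct)
{
    return busiest_load * 100 > local_load * imbalance_pct;
}

int main(void)
{
    long busiest = 120, local = 100;      /* a 20% imbalance */

    printf("MC   (117%%): migrate? %s\n",
           should_migrate(busiest, local, 117) ? "yes" : "no");   /* yes */
    printf("NUMA (125%%): migrate? %s\n",
           should_migrate(busiest, local, 125) ? "yes" : "no");   /* no  */
    return 0;
}
```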
Scheduling domain parameters are exposed in /proc/sys/kernel/sched_domain/. Advanced users can adjust imbalance thresholds, balance intervals, and other parameters. However, the defaults are carefully tuned for general workloads—modify with caution and benchmarking.
Understanding soft affinity conceptually is valuable, but as engineers we need to observe and measure it in practice. Several tools allow us to examine CPU placement decisions.
Method 1: /proc/<pid>/stat
For any process, /proc/<pid>/stat includes the processor number on which the process last ran. Monitoring this over time reveals migration patterns:
```bash
#!/bin/bash
# Monitor which CPU a process runs on over time
PID=$1

echo "Monitoring CPU placement for PID $PID"
echo "Time, CPU"

while true; do
    # Field 39 in /proc/<pid>/stat is the processor number
    CPU=$(awk '{print $39}' /proc/$PID/stat 2>/dev/null)
    if [ -z "$CPU" ]; then
        echo "Process $PID no longer exists"
        exit 1
    fi
    echo "$(date +%H:%M:%S.%3N), CPU $CPU"
    sleep 0.1
done
```

Method 2: perf sched Tracing
The perf tool can trace scheduler events, showing every context switch and migration:
```bash
# Record scheduler traces for 10 seconds
sudo perf sched record -- sleep 10

# Display migration events
sudo perf sched map
# Shows timeline of tasks across CPUs

# Get migration statistics
sudo perf sched latency
# Shows scheduling latencies including migration impact

# Visualize with timelines
sudo perf sched timehist
# Per-event breakdown with CPU numbers
```

Method 3: Scheduler Statistics in /proc
Linux exposes extensive scheduler statistics:
```bash
# Per-CPU scheduling statistics
cat /proc/schedstat
# Fields include: context switches, schedule calls, load balancing attempts

# Scheduling domain information
for i in /sys/kernel/debug/sched/domains/cpu0/domain*; do
    echo "=== $i ==="
    cat $i/name
    cat $i/flags
done

# Per-task scheduling info
cat /proc/<pid>/sched
# Shows nr_migrations: number of times task was migrated

# Real-time migration counting
watch -n1 'grep nr_migrations /proc/<pid>/sched'
```

The nr_migrations field in /proc/<pid>/sched is invaluable for diagnosing affinity issues. A process with thousands of migrations per second likely has either no stable affinity (perhaps competing with many processes) or has explicit affinity that conflicts with load balancing. Compare this against nr_voluntary_switches and nr_involuntary_switches for context.
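Method 4: Observing from Inside the Process

You can also sample placement from within the program itself using sched_getcpu(), a Linux/glibc extension. The sketch below is a rough polling approach that misses migrations occurring between samples, so treat its count as a lower bound and prefer /proc/<pid>/sched or perf sched for exact numbers:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Sample the current CPU periodically and report migrations that
 * persist across samples. */
int main(void)
{
    int last = sched_getcpu();
    long migrations = 0;

    printf("started on CPU %d\n", last);
    for (int i = 0; i < 1000; i++) {
        int cpu = sched_getcpu();
        if (cpu != last) {
            migrations++;
            printf("migrated: CPU %d -> CPU %d\n", last, cpu);
            last = cpu;
        }
        usleep(10000);                    /* sample every ~10 ms */
    }
    printf("observed %ld migrations in ~10 seconds\n", migrations);
    return 0;
}
```

On an otherwise idle machine this typically reports few or no migrations, which is soft affinity doing its job; run it alongside a CPU-heavy workload to watch load balancing override the affinity hint.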
Soft affinity represents the operating system's default strategy for balancing cache efficiency with scheduling flexibility. Let's consolidate our understanding:

- Soft affinity prefers the CPU where a task last ran in order to reuse warm cache state, but it is a hint, not a guarantee.
- The wake-affine heuristic can override the previous CPU in favor of the waker's CPU when communicating tasks share data.
- Scheduling domains encode the cache and NUMA topology, and per-domain imbalance thresholds make migrations progressively less likely at higher levels.
- Load balancing deliberately breaks affinity when the cost of imbalance outweighs the cache-warmup penalty.
- Tools such as /proc/<pid>/sched, /proc/schedstat, and perf sched let you observe placement and migration behavior directly.
What's Next:
Soft affinity works well for many workloads, but sometimes applications need stronger guarantees. In the next page, we'll explore hard affinity—mechanisms that allow processes to explicitly restrict which CPUs they may run on, providing deterministic placement at the cost of scheduling flexibility.
You now understand soft processor affinity—the scheduler's default cache-aware placement strategy. You've learned how it's implemented, how it interacts with load balancing, and how to observe it in practice. Next, we'll examine hard affinity for applications requiring explicit CPU binding.