In a multi-processor system, where should a just-awakened process run? The naive answer might be 'any available CPU,' but decades of operating systems research and engineering have revealed a more nuanced truth: where a process ran last matters profoundly for where it should run next.
This insight gives rise to the concept of processor affinity—the tendency for a process to be scheduled on the same CPU where it previously executed. The most fundamental form of this behavior is soft affinity, a scheduling policy where the operating system attempts to maintain cache locality by preferring the previous CPU, but does not guarantee it.
Understanding soft affinity is essential for any systems engineer working on performance-sensitive applications. It explains why processes tend to 'stick' to certain CPUs, how the scheduler balances locality against load distribution, and when this default behavior helps versus when it hinders performance.
By the end of this page, you will understand: the theoretical foundations of soft affinity, how modern kernels implement this policy, the data structures and algorithms involved, the conditions under which soft affinity breaks down, and how to observe and measure affinity behavior in practice.
To understand soft affinity, we must first understand why CPU affinity matters at all. The answer lies in the architecture of modern processors, particularly the memory hierarchy.
The Cache Locality Principle:
Modern CPUs maintain private caches (L1, L2, and sometimes L3) that store recently accessed data. As a process executes, it gradually fills these caches with its instructions, its working data (heap, stack, globals), and related per-core state such as TLB entries and branch-predictor history.
This cache state represents a significant investment. A typical L1 cache can hold 32-64 KB per core, L2 caches 256 KB - 1 MB, and L3 caches (often shared) 8-32 MB or more. Building this 'warm' cache state takes thousands to millions of CPU cycles.
When a process migrates to a different CPU, its carefully built cache state becomes useless. The new CPU's caches are 'cold' for this process, leading to a cascade of cache misses. A hit in L1 costs roughly 4 cycles; a miss served from L2 costs ~10-20 cycles, one served from L3 costs 50-100+ cycles, and one that must go all the way to main memory costs 100-300+ cycles. This 'cache warmup penalty' can significantly degrade performance.
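To make the penalty concrete, here is a rough back-of-envelope estimate based on the cycle costs above. The working-set size, per-miss cost, and clock rate are assumptions chosen purely for illustration, not measurements:

```c
#include <stdio.h>

/* Rough, illustrative estimate of the cache-warmup penalty after a
 * migration. Real penalties depend on the workload's access pattern
 * and how much of its working set it actually re-touches. */
int main(void)
{
    const int line_size        = 64;   /* bytes per cache line (assumed)   */
    const int working_set_kb   = 256;  /* assumed hot working set          */
    const int miss_cost_cycles = 200;  /* assumed cost of a DRAM miss      */

    long lines  = (working_set_kb * 1024L) / line_size;
    long cycles = lines * miss_cost_cycles;

    printf("Refilling %d KB = %ld cache lines\n", working_set_kb, lines);
    printf("~%ld cycles of warmup penalty (~%.2f ms at 3 GHz)\n",
           cycles, cycles / 3.0e6);
    return 0;
}
```

With these assumed numbers the result is on the order of 800,000 cycles, or roughly a quarter of a millisecond, paid every time the warm cache state is thrown away.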
The Soft Affinity Policy:
Soft affinity is the operating system's default response to this cache locality challenge. The policy can be stated simply:
When scheduling a ready process, prefer the CPU where it last ran, unless compelling reasons dictate otherwise.
The key word is prefer. Soft affinity is a hint, not a constraint. The scheduler weighs the previous CPU against several factors: how heavily loaded that CPU currently is, whether other CPUs are sitting idle, how large the resulting load imbalance would be, and whether the previous CPU is available at all.
Soft affinity represents a best-effort optimization. It improves average-case performance without sacrificing scheduling flexibility.
| Previous CPU State | System State | Scheduling Decision |
|---|---|---|
| Idle | Balanced load | Schedule on previous CPU (affinity preserved) |
| Lightly loaded | Balanced load | Schedule on previous CPU (affinity preserved) |
| Heavily loaded | Imbalanced (other CPUs idle) | Migrate to idle CPU (affinity broken) |
| Heavily loaded | All CPUs loaded | May queue on previous CPU (affinity preserved) |
| Offline/unavailable | Any state | Migrate to available CPU (affinity broken) |
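The table can be read as a small decision procedure. The toy model below is only an illustration of that table, with invented inputs and a made-up load comparison; it is not how the kernel code is structured:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the soft-affinity decision table above.
 * prev_load / min_load are illustrative "runnable tasks per CPU" counts;
 * the real scheduler uses weighted load and per-domain thresholds. */
static int choose_cpu(int prev_cpu, bool prev_online,
                      int prev_load, int idle_cpu, int min_load)
{
    if (!prev_online)                      /* previous CPU offline        */
        return idle_cpu;                   /* affinity must be broken     */

    if (prev_load == 0 || prev_load <= min_load + 1)
        return prev_cpu;                   /* affinity preserved          */

    if (idle_cpu >= 0)                     /* big imbalance, idle CPU     */
        return idle_cpu;                   /* affinity broken             */

    return prev_cpu;                       /* all busy: queue locally     */
}

int main(void)
{
    printf("%d\n", choose_cpu(2, true,  0, -1, 0));  /* prev idle  -> 2 */
    printf("%d\n", choose_cpu(2, true,  5,  6, 0));  /* imbalanced -> 6 */
    printf("%d\n", choose_cpu(2, true,  5, -1, 3));  /* all busy   -> 2 */
    printf("%d\n", choose_cpu(2, false, 0,  6, 0));  /* offline    -> 6 */
    return 0;
}
```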
The concept of processor affinity emerged alongside the development of symmetric multiprocessing (SMP) systems in the 1980s and 1990s. Understanding this history illuminates why modern systems implement affinity as they do.
Early SMP Systems (1980s-1990s):
The first SMP systems faced a fundamental scheduling question: should there be a single scheduler (master) or multiple independent schedulers (per-CPU)? Early designs often used a single scheduler with little consideration for cache affinity—all CPUs were treated as interchangeable.
As cache sizes grew and the performance gap between cache and main memory widened, the cost of ignoring locality became apparent. Systems like SVR4 (System V Release 4) and early versions of Solaris began incorporating affinity heuristics.
The Linux Journey:
Linux's scheduler evolution illustrates the increasing sophistication of affinity handling:
Linux 2.4 (O(n) scheduler): Basic affinity through a single runqueue. Affinity was implicit—processes stayed on CPUs because migration code was expensive.
Linux 2.6 (O(1) scheduler): Introduced per-CPU runqueues and explicit affinity tracking. The task_struct gained a cpus_allowed mask and affinity hints.
Linux 2.6.23+ (CFS - Completely Fair Scheduler): Sophisticated load balancing with multi-level scheduling domains. Affinity decisions now account for cache topology, NUMA distances, and load imbalance thresholds.
Windows NT has included soft affinity since its earliest multiprocessor versions. The Windows scheduler tracks 'ideal processor' for each thread—the processor where the thread last ran or was created. Windows attempts to schedule threads on their ideal processor unless load balancing dictates otherwise. This concept directly parallels Linux's soft affinity implementation.
Let us examine how modern kernels implement soft affinity, using Linux as our primary example. The implementation involves several interconnected subsystems.
Task Structure and Affinity State:
Every process (and thread) in Linux is represented by a task_struct. This structure contains affinity-related fields:
```c
/* Simplified from the Linux kernel's include/linux/sched.h */
struct task_struct {
    /* ... many other fields ... */

    /* CPU affinity mask - which CPUs this task CAN run on */
    cpumask_t  cpus_mask;        /* Effective mask */
    cpumask_t *cpus_ptr;         /* Pointer to effective mask */

    /* Original affinity set by user (for cpuset changes) */
    cpumask_t *user_cpus_ptr;

    /* The last CPU this task ran on - soft affinity hint */
    int recent_used_cpu;

    /* The CPU this task woke up on */
    int wake_cpu;

    /* The CPU currently running this task */
    unsigned int cpu;

    /* Flags indicating migration status */
    unsigned int migration_pending;

    /* Per-entity load tracking for CFS */
    struct sched_entity se;

    /* ... scheduling class, policy, etc. ... */
};
```

The Soft Affinity Algorithm:
When a task becomes runnable (e.g., wakes from sleep), the scheduler must decide where to place it. The select_task_rq() function orchestrates this decision, with soft affinity implemented primarily in select_task_rq_fair() for CFS-scheduled tasks.
The algorithm proceeds roughly as follows:

1. Restrict the search to CPUs the task is permitted to run on (its cpus_mask).
2. Start from the previous CPU as the soft-affinity candidate, also considering a recently used CPU (recent_used_cpu).
3. On a wakeup, check whether the waker's CPU would be a better target (the wake-affine path).
4. If the chosen candidate is busy, look for an idle CPU that shares a cache with it (an 'idle sibling').
5. If no candidate in the task's allowed mask remains, fall back to the load-balancing path to select a runqueue.
```c
/* Conceptual representation of soft affinity selection */
static int select_task_rq_fair(struct task_struct *p, int prev_cpu,
                               int sd_flag, int wake_flags)
{
    int new_cpu = prev_cpu;   /* Start with previous CPU - soft affinity */
    int want_affine = 0;

    /* Only consider affinity for wakeup paths */
    if (sd_flag & SD_BALANCE_WAKE) {
        int waker_cpu = smp_processor_id();

        /* Check if wake_affine makes sense for this task */
        want_affine = cpumask_test_cpu(waker_cpu, p->cpus_ptr);

        /* If waker's CPU is valid and beneficial, consider it */
        if (want_affine && wake_affine_ok(p, waker_cpu, prev_cpu))
            new_cpu = waker_cpu;          /* Wake affine wins */
    }

    /* Find idle sibling if previous/waker CPU is busy */
    if (!idle_cpu(new_cpu)) {
        new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
    }

    /* If all else fails, find CPU via load balance path */
    if (!cpumask_test_cpu(new_cpu, p->cpus_ptr)) {
        new_cpu = select_fallback_rq(p, prev_cpu);
    }

    return new_cpu;
}
```

The actual Linux implementation is significantly more complex, handling edge cases like RT tasks, CPU hotplug, cgroup hierarchies, and NUMA policies. The code above is conceptual; the real scheduler code spans thousands of lines across multiple files in kernel/sched/.
Modern processors have complex topologies—multiple cores share L3 caches, multiple dies share a socket, multiple sockets form a NUMA node. Soft affinity must respect this hierarchy, preferring not just 'the same CPU' but 'a CPU with shared cache' when the previous CPU is unavailable.
Scheduling Domains Explained:
Linux models processor topology through scheduling domains (sched_domains). A scheduling domain is a set of CPUs that share some level of the memory hierarchy. Domains are organized hierarchically:
Scheduling Domain Hierarchy (Example: 2-socket, 8-core/socket system)
┌─────────────────────────────────────────────────────────────────────────┐
│                          NUMA Domain (SD_NUMA)                          │
│                      All 16 cores across 2 sockets                      │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Socket/MC Domain (Socket 0)     │ │ Socket/MC Domain (Socket 1)     │ │
│ │ 8 cores, shared L3 cache        │ │ 8 cores, shared L3 cache        │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ LLC Domain  │ │ LLC Domain  │ │ │ │ LLC Domain  │ │ LLC Domain  │ │ │
│ │ │ Cores 0-3   │ │ Cores 4-7   │ │ │ │ Cores 8-11  │ │ Cores 12-15 │ │ │
│ │ │ Shared L2   │ │ Shared L2   │ │ │ │ Shared L2   │ │ Shared L2   │ │ │
│ │ └─────────────┘ └─────────────┘ │ │ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────┘ └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
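The cache-sharing relationships these domains are built from are visible in sysfs. The sketch below is a minimal Linux-only example that inspects CPU 0 only; it assumes the standard /sys/devices/system/cpu/cpu0/cache/index*/ layout, which may be absent in some containers or on unusual hardware:

```c
#include <stdio.h>
#include <string.h>

/* Print which CPUs share each cache level with CPU 0, using the
 * standard sysfs cache directories. Stops at the first missing index. */
int main(void)
{
    char path[128], level[16], type[32], shared[256];

    for (int idx = 0; idx < 8; idx++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                         /* no more cache indices */
        fscanf(f, "%15s", level);
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/type", idx);
        f = fopen(path, "r");
        if (!f)
            break;
        fscanf(f, "%31s", type);
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list",
                 idx);
        f = fopen(path, "r");
        if (!f)
            break;
        if (!fgets(shared, sizeof(shared), f))
            shared[0] = '\0';
        fclose(f);
        shared[strcspn(shared, "\n")] = '\0';

        printf("L%s %-12s shared by CPUs: %s\n", level, type, shared);
    }
    return 0;
}
```

On a typical multi-core part this prints, for example, L1 and L2 shared only by SMT siblings and an L3 shared by every core in the socket, mirroring the domain levels in the diagram above.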
Affinity Preference Order:
When the scheduler cannot place a task on its previous CPU, it follows the domain hierarchy to find the 'next best' location:

1. An idle SMT sibling on the same physical core (shared L1/L2).
2. An idle CPU sharing the same last-level cache.
3. Another CPU in the same socket.
4. A CPU on the same NUMA node.
5. Only as a last resort, a CPU on a remote NUMA node.
Soft Affinity Across Domains:
Scheduling domains have flags that control balancing behavior. Key flags affecting soft affinity include:
| Flag | Meaning | Impact on Affinity |
|---|---|---|
| SD_BALANCE_WAKE | Balance on wakeup | Enables wake-affine decisions |
| SD_WAKE_AFFINE | Consider waker's CPU | May override previous CPU |
| SD_SHARE_CPUPOWER | Hyperthreading domain | Siblings share execution resources |
| SD_SHARE_PKG_RESOURCES | Same physical package | Cores share L3 cache |
| SD_SHARE_LLC | Same last-level cache | Stronger affinity preference |
| SD_NUMA | NUMA boundary | Higher migration cost |
Each domain has an imbalance_pct threshold—migration across that domain only occurs if the load imbalance exceeds this percentage. Higher-level domains (NUMA) have higher thresholds, making migration across them less likely, effectively strengthening 'soft' affinity at NUMA boundaries.
You can inspect your system's scheduling domains via: cat /proc/sys/kernel/sched_domain/cpu0/domain*/name and examine their flags. This reveals how the kernel models your processor's cache topology and what affinity preferences apply at each level.
Soft affinity typically means 'prefer the previous CPU.' But there's an important exception: the wake-affine heuristic. When one task wakes another (e.g., via completion of an I/O request or unlocking a mutex), should the awakened task run on the CPU where it previously ran (classic soft affinity), or on the CPU of the task that just woke it?
The Wake-Affine Insight:
When tasks communicate, they often share data. If Task A writes data and then wakes Task B to process it, that data is likely in Task A's CPU's cache. Running Task B on the same CPU (or nearby core sharing L3) can be faster than running on B's previous CPU, which would need to fetch the data from memory or via cache coherence.
This creates a tension with soft affinity:
Scenario: Producer-Consumer Communication

CPU 0: Task A (producer)            CPU 1: Task B (consumer)
├── writes data to buffer           [previously ran here]
├── wakes Task B
└── continues working

Option 1: Schedule B on CPU 1 → Cache miss for shared data
Option 2: Schedule B on CPU 0 → Data likely in L1/L2 cache
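One simple way to watch this tension on a real machine is to have one thread wake another and have both report where they ran. The sketch below uses a plain pthread condition variable and sched_getcpu(); which CPUs it prints depends entirely on your kernel version, topology, and current load:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

/* Consumer: sleeps until woken, then reports where the scheduler placed it. */
static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    printf("consumer woke up on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);

    /* Producer: "produce" something, then wake the consumer. */
    printf("producer running on CPU %d\n", sched_getcpu());
    pthread_mutex_lock(&lock);
    ready = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}
```

Compile with gcc -pthread. On a lightly loaded machine you will often see the consumer land on the producer's CPU or on a core sharing its LLC; under heavier load the scheduler is more likely to leave it where it was.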
Wake-Affine Decision Criteria:
The scheduler uses several metrics to decide if wake-affine makes sense: the load on the waker's CPU compared with the wakee's previous CPU, whether the two CPUs share a last-level cache, and the recent wakeup relationship between the two tasks.
The kernel tracks wake-affine statistics and uses them to make adaptive decisions. When tasks have an established producer-consumer relationship, wake-affine often wins over previous-CPU affinity.
```c
/*
 * Simplified wake_affine logic
 * Returns 1 if task should run on waker's CPU, 0 otherwise
 */
static int wake_affine_decision(struct sched_domain *sd, struct task_struct *p,
                                int wake_cpu, int prev_cpu)
{
    int wake_load, prev_load;

    /* If waker and previous are same CPU, no conflict */
    if (wake_cpu == prev_cpu)
        return 1;

    /* Check if within wake-affine allowed domain */
    if (!(sd->flags & SD_WAKE_AFFINE))
        return 0;

    /* Evaluate based on LLC sharing */
    if (cpus_share_cache(wake_cpu, prev_cpu)) {
        /* Same LLC: wake-affine less disruptive */
        return check_wake_affine_metrics(p, wake_cpu, prev_cpu);
    }

    /* Cross-LLC or cross-NUMA: higher bar for wake-affine */
    wake_load = cpu_runqueue_load(wake_cpu);
    prev_load = cpu_runqueue_load(prev_cpu);

    /* Only wake-affine if waker's CPU significantly less loaded */
    if (wake_load < prev_load - WAKE_AFFINE_THRESHOLD)
        return 1;

    /* Default: preserve previous CPU affinity */
    return 0;
}
```

Linux's wake-affine heuristic has been refined over many kernel versions. Early implementations were too aggressive, causing 'task stacking' where all related tasks piled onto one CPU. Modern versions use more nuanced metrics and domain awareness to balance producer-consumer locality with load distribution.
Soft affinity exists in tension with another scheduler goal: load balancing. The scheduler must prevent scenarios where some CPUs are overloaded while others sit idle. But load balancing necessarily means moving tasks—exactly what affinity tries to avoid.
The Balancing Challenge:
Consider a system with 8 CPUs. If all 8 runnable tasks have strong affinity to CPUs 0-3 (perhaps they were all spawned from a process that ran there), perfect affinity would mean two tasks queued on each of CPUs 0-3 while CPUs 4-7 sit completely idle.
This is clearly suboptimal. The scheduler must override soft affinity to achieve reasonable load distribution.
Linux Load Balancing Hierarchy:
Linux balances load at multiple levels:
| Trigger | Frequency | Affinity Impact | Purpose |
|---|---|---|---|
| NEWIDLE | When CPU goes idle | Low—pulls from busiest nearby | Opportunistic work stealing |
| Periodic (SMT) | ~4ms | Low—balances hyperthreads | Keep SMT siblings balanced |
| Periodic (MC) | ~32ms | Medium—balances cores in socket | Prevent core-level hotspots |
| Periodic (NUMA) | ~64-256ms | High—cross-node migration | Last resort for major imbalance |
| Active balance | On extreme imbalance | Very high—forces migration | Emergency load redistribution |
Imbalance Calculation:
The scheduler computes imbalance by comparing the busiest and least busy CPUs (or groups of CPUs) in a domain. Conceptually, migration across the domain occurs only if:

busiest_load * 100 > local_load * imbalance_pct

That is, the busier side must exceed the less busy side by more than imbalance_pct - 100 percent before the scheduler considers migration worth its cost.
Higher-level domains have higher imbalance_pct values:
| Domain Level | Typical imbalance_pct | Meaning |
|---|---|---|
| SMT (Hyperthread) | 110% | Balance if >10% imbalance |
| MC (Multi-Core) | 117% | Balance if >17% imbalance |
| NUMA | 125-140% | Balance only if >25-40% imbalance |
These thresholds encode the cost of cache coldness: small imbalances aren't worth paying the migration penalty, especially across NUMA boundaries.
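As a worked example of these thresholds, the sketch below applies the conceptual check from above to a made-up 20% load imbalance: it clears the MC-level bar (117%) but not the NUMA-level bar (125%). The load values are arbitrary illustrative units, not real kernel metrics:

```c
#include <stdbool.h>
#include <stdio.h>

/* Conceptual version of the imbalance check described above:
 * migrate only if the busiest side's load exceeds the local side's
 * load by more than (imbalance_pct - 100) percent. */
static bool should_migrate(long busiest_load, long local_load, int imbalance_pct)
{
    return busiest_load * 100 > local_load * imbalance_pct;
}

int main(void)
{
    long busiest = 120, local = 100;      /* a 20% imbalance */

    printf("MC   (117%%): migrate? %s\n",
           should_migrate(busiest, local, 117) ? "yes" : "no");   /* yes */
    printf("NUMA (125%%): migrate? %s\n",
           should_migrate(busiest, local, 125) ? "yes" : "no");   /* no  */
    return 0;
}
```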
Scheduling domain parameters are exposed in /proc/sys/kernel/sched_domain/. Advanced users can adjust imbalance thresholds, balance intervals, and other parameters. However, the defaults are carefully tuned for general workloads—modify with caution and benchmarking.
Understanding soft affinity conceptually is valuable, but as engineers we need to observe and measure it in practice. Several tools allow us to examine CPU placement decisions.
Method 1: /proc/<pid>/stat
For any process, /proc/<pid>/stat includes the processor number on which the process last ran. Monitoring this over time reveals migration patterns:
```bash
#!/bin/bash
# Monitor which CPU a process runs on over time
PID=$1

echo "Monitoring CPU placement for PID $PID"
echo "Time, CPU"

while true; do
    # Field 39 in /proc/<pid>/stat is the processor number
    CPU=$(awk '{print $39}' /proc/$PID/stat 2>/dev/null)
    if [ -z "$CPU" ]; then
        echo "Process $PID no longer exists"
        exit 1
    fi
    echo "$(date +%H:%M:%S.%3N), CPU $CPU"
    sleep 0.1
done
```

Method 2: perf sched Tracing
The perf tool can trace scheduler events, showing every context switch and migration:
```bash
# Record scheduler traces for 10 seconds
sudo perf sched record -- sleep 10

# Display migration events
sudo perf sched map
# Shows timeline of tasks across CPUs

# Get migration statistics
sudo perf sched latency
# Shows scheduling latencies including migration impact

# Visualize with timelines
sudo perf sched timehist
# Per-event breakdown with CPU numbers
```

Method 3: Scheduler Statistics in /proc
Linux exposes extensive scheduler statistics:
```bash
# Per-CPU scheduling statistics
cat /proc/schedstat
# Fields include: context switches, schedule calls, load balancing attempts

# Scheduling domain information
for i in /sys/kernel/debug/sched/domains/cpu0/domain*; do
    echo "=== $i ==="
    cat $i/name
    cat $i/flags
done

# Per-task scheduling info
cat /proc/<pid>/sched
# Shows nr_migrations: number of times task was migrated

# Real-time migration counting
watch -n1 'grep nr_migrations /proc/<pid>/sched'
```

The nr_migrations field in /proc/<pid>/sched is invaluable for diagnosing affinity issues. A process with thousands of migrations per second likely has either no stable affinity (perhaps competing with many processes) or has explicit affinity that conflicts with load balancing. Compare this against nr_voluntary_switches and nr_involuntary_switches for context.
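Method 4: Observing from Inside the Process

You can also sample placement from within the program itself using sched_getcpu(), a Linux/glibc extension. The sketch below is a rough polling approach that misses migrations occurring between samples, so treat its count as a lower bound and prefer /proc/<pid>/sched or perf sched for exact numbers:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Sample the current CPU periodically and report migrations that
 * persist across samples. */
int main(void)
{
    int last = sched_getcpu();
    long migrations = 0;

    printf("started on CPU %d\n", last);
    for (int i = 0; i < 1000; i++) {
        int cpu = sched_getcpu();
        if (cpu != last) {
            migrations++;
            printf("migrated: CPU %d -> CPU %d\n", last, cpu);
            last = cpu;
        }
        usleep(10000);                    /* sample every ~10 ms */
    }
    printf("observed %ld migrations in ~10 seconds\n", migrations);
    return 0;
}
```

On an otherwise idle machine this typically reports few or no migrations, which is soft affinity doing its job; run it alongside a CPU-heavy workload to watch load balancing override the affinity hint.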
Soft affinity represents the operating system's default strategy for balancing cache efficiency with scheduling flexibility. Let's consolidate our understanding:

- Soft affinity prefers the CPU where a task last ran in order to reuse warm cache state, but it is a hint, not a guarantee.
- The wake-affine heuristic can override the previous CPU in favor of the waker's CPU when communicating tasks share data.
- Scheduling domains encode the cache and NUMA topology, and per-domain imbalance thresholds make migrations progressively less likely at higher levels.
- Load balancing deliberately breaks affinity when the cost of imbalance outweighs the cache-warmup penalty.
- Tools such as /proc/<pid>/sched, /proc/schedstat, and perf sched let you observe placement and migration behavior directly.
What's Next:
Soft affinity works well for many workloads, but sometimes applications need stronger guarantees. In the next page, we'll explore hard affinity—mechanisms that allow processes to explicitly restrict which CPUs they may run on, providing deterministic placement at the cost of scheduling flexibility.
You now understand soft processor affinity—the scheduler's default cache-aware placement strategy. You've learned how it's implemented, how it interacts with load balancing, and how to observe it in practice. Next, we'll examine hard affinity for applications requiring explicit CPU binding.