We've established that processor affinity—both soft and hard—exists to preserve cache locality. But what exactly are these 'cache effects' that make affinity so important? Why does it matter which CPU runs our code?
The answer lies in the profound performance gap between CPU execution speed and memory access latency. Modern processors can execute billions of instructions per second, but accessing main memory takes 100-300+ nanoseconds—time for hundreds of instructions to execute. Caches bridge this gap by keeping frequently accessed data close to the CPU.
The Affinity Connection:
When a process is migrated to a different CPU, it typically loses access to the cache state it built up on the previous CPU. The new CPU's caches are 'cold' for this process—filled with another process's data if anything at all. This cache coldness translates directly into performance degradation as the migrated process experiences cache misses while rebuilding its cache footprint.
Understanding cache effects at a deep level is essential for systems engineers who need to reason about when affinity helps, when it hurts, and how to measure its impact.
By the end of this page, you will understand: the cache hierarchy and its performance characteristics, cache coherence and its overhead, the quantitative cost of cache migration, cache warmup dynamics, false sharing pathologies, and techniques for measuring cache performance in scheduling contexts.
Modern processors employ a multi-level cache hierarchy. Understanding each level's characteristics is crucial for reasoning about affinity's impact.
The Typical Cache Hierarchy:
┌───────────────────────────────────────────────────┐
│ Main Memory │
│ (8GB - 512GB+, ~100ns) │
└───────────────────────────────────────────────────┘
▲
┌───────────────┴───────────────┐
│ Last-Level Cache (L3) │
│ Shared across cores, 8-32MB │
│ ~40-50 cycles │
└───────────────────────────────┘
┌────────────────────┴────────────────────┐
┌──────────┴──────────┐ ┌───────────────┴─────────┐
│ L2 Cache (CPU 0) │ │ L2 Cache (CPU 1) │
│ Private, 256KB-1MB│ │ Private, 256KB-1MB │
│ ~10-20 cycles │ │ ~10-20 cycles │
└──────────┬──────────┘ └───────────────┬─────────┘
┌───────────────┴───────────────┐ ┌─────────┴───────────────┐
│ L1 Cache │ │ L1 Cache │
│ ┌──────────┐ ┌───────────┐ │ │ ┌─────────┐ ┌────────┐ │
│ │ L1-D │ │ L1-I │ │ │ │ L1-D │ │ L1-I │ │
│ │(32-48KB) │ │ (32-48KB) │ │ │ │(32-48KB)│ │(32-48KB)││
│ │~4 cycles │ │ ~4 cycles │ │ │ │~4 cycles│ │~4 cycles││
│ └──────────┘ └───────────┘ │ │ └─────────┘ └────────┘ │
└───────────────────────────────┘ └─────────────────────────┘
│ │
CPU Core 0 CPU Core 1
| Cache Level | Size | Latency (cycles) | Shared? | Associativity |
|---|---|---|---|---|
| L1 Data (L1-D) | 32-48 KB/core | 4-5 | No (per-core) | 8-way |
| L1 Instruction (L1-I) | 32-48 KB/core | 4-5 | No (per-core) | 8-way |
| L2 (Mid-Level Cache) | 256KB - 1MB/core | 10-20 | No (per-core)* | 8-16 way |
| L3 (Last-Level Cache) | 8-32+ MB total | 40-50 | Yes (all cores) | 12-16 way |
| Main Memory | 8GB - 512GB+ | 100-300+ | Yes (system-wide) | N/A |
*Note: Some architectures have shared L2 caches between SMT siblings or core pairs.
The Cache Line:
Data moves between memory and caches in fixed-size units called cache lines (typically 64 bytes). When you access a single byte, the entire 64-byte cache line containing that byte is loaded. This has profound implications: sequential access is nearly free once a line is resident (spatial locality), sparse access wastes most of each transfer, and two unrelated variables that happen to share a line become coupled for coherence purposes (the false sharing problem covered later on this page).
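To see this directly, here is a minimal sketch that sums a 64 MB array twice: once touching every element, once touching only one long per 64-byte line. The sparse pass does one eighth of the arithmetic but transfers exactly the same cache lines, so it runs far slower than one eighth of the time. A 64-byte line size and a working set larger than L3 are assumed.

```c
/* Minimal sketch: the cache line, not the byte, is the unit of transfer.
 * Assumes 64-byte lines and a working set larger than the L3 cache.
 * Build: gcc -O2 stride.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(long))   /* 64 MB of longs */

static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    long *data = malloc(N * sizeof(long));
    if (!data) return 1;
    for (size_t i = 0; i < N; i++) data[i] = i;

    struct timespec t0, t1;
    volatile long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) sum += data[i];      /* every element */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 1 (all elements) : %.1f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i += 8) sum += data[i];   /* one long per line */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 8 (one per line) : %.1f ms\n", elapsed_ms(t0, t1));

    free(data);
    return 0;
}
```

On typical hardware the two passes take similar time even though the second reads an eighth of the data: line transfers, not instruction count, dominate.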
Cache Inclusivity:
Cache hierarchies can be inclusive (the L3 holds a copy of everything in the L1/L2 caches, which simplifies coherence lookups), exclusive (a line lives in exactly one level, maximizing total usable capacity), or non-inclusive (neither property is guaranteed).
Modern processors typically use non-inclusive L3 caches to maximize effective cache capacity.
On Linux, examine /sys/devices/system/cpu/cpu0/cache/ for cache topology. Each index* directory describes a cache level with size, type (Data/Instruction/Unified), associativity, and which CPUs share it. Tools like lscpu, lstopo, and hwloc-ls visualize this hierarchy clearly.
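The same geometry can also be queried from code. The sketch below uses glibc's _SC_LEVEL*_CACHE_* sysconf extensions (a glibc-specific convenience, not POSIX; a return of 0 or -1 means the value is unknown), which surface the same information exposed under /sys.

```c
/* Sketch: query cache geometry via glibc's sysconf() extensions.
 * The _SC_LEVEL* names are glibc-specific; 0 or -1 means "unknown". */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1d size      : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d assoc     : %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    printf("L2 size       : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size       : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```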
When multiple CPUs have caches, they might cache copies of the same memory location. What happens when one CPU modifies that data? This is the cache coherence problem, and its solution adds overhead that directly impacts migration costs.
The MESI Protocol:
Most x86 processors use variants of the MESI protocol (Modified, Exclusive, Shared, Invalid) to maintain coherence:
| State | Meaning | Can Read | Can Write | In Other Caches? |
|---|---|---|---|---|
| M (Modified) | Dirty; we have the only copy | Yes | Yes | No |
| E (Exclusive) | Clean; we have the only copy | Yes | Yes → M | No |
| S (Shared) | Clean; others may have copies | Yes | Invalidate → M | Maybe |
| I (Invalid) | Not present or stale | Fetch first | Fetch first | - |
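To make the table concrete, here is a minimal sketch of how one cache's copy of a line transitions between MESI states in response to local and remote accesses. It is a simplification (real hardware adds states such as MESIF's F or MOESI's O, plus snoop filters), but it captures the write-invalidate behavior that drives migration and false-sharing costs.

```c
/* Minimal MESI transition sketch for a single line, from one cache's
 * point of view. Simplified: a local read miss is shown installing as S,
 * though it may install as E when no other cache holds the line. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e) {
    switch (e) {
    case LOCAL_READ:                 /* miss fetches the line; hits keep state */
        return (s == INVALID) ? SHARED : s;
    case LOCAL_WRITE:                /* invalidate other copies, own it dirty */
        return MODIFIED;
    case REMOTE_READ:                /* M/E/S degrade to S; M writes back first */
        return (s == INVALID) ? INVALID : SHARED;
    case REMOTE_WRITE:               /* another CPU took ownership of the line */
        return INVALID;
    }
    return s;
}
```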
Cache-to-Cache Transfers:
When a CPU needs data that's in another CPU's cache, the coherence protocol arranges a cache-to-cache transfer: the request is snooped (or resolved through a directory), the owning cache supplies the line, writing it back or downgrading its state as required, and the requester installs its own copy. This is faster than fetching from DRAM, but still far slower than a local L1/L2 hit, and across sockets it must traverse the inter-socket interconnect.
Migration and Coherence Traffic:
When a process migrates, consider what happens: its first accesses on the new CPU miss in the cold L1 and L2; any lines it dirtied on the old CPU must be forwarded or written back before the new CPU can use them; the old CPU's stale copies are invalidated or eventually evicted; and instruction-fetch, TLB, and branch-predictor state must be rebuilt as well.
This cascade of cache misses and coherence traffic constitutes the migration penalty.
Quantifying Migration Overhead:
| Migration Type | Cache Impact | Memory Cost | Typical Penalty |
|---|---|---|---|
| Same Core (SMT sibling) | Mostly shared L1/L2 | None | Minimal (~1000 cycles) |
| Same L3 (different core) | L1/L2 cold, L3 shared | None | Moderate (~5,000-10,000 cycles) |
| Same Socket (different L3) | All caches cold | Same NUMA node | Significant (~20,000-50,000 cycles) |
| Different Socket/NUMA | All caches cold | Remote memory | Severe (~50,000-200,000 cycles) |
The cycle counts above are rough estimates. Actual costs depend on working set size, cache sizes, memory bandwidth, coherence protocol implementation, and workload characteristics. Measure for your specific system and workload!
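One way to put a number on this for your own machine is to force a migration and time the first pass over a warmed working set. The sketch below pins itself to CPU 0, warms a 512 KB buffer, then moves to CPU 2 and re-reads it; the ratio between the two timed passes approximates the cold-cache penalty for that working-set size. The CPU numbers are assumptions: pick two CPUs that are distinct physical cores on your topology, since SMT siblings share caches and will show little penalty.

```c
/* Sketch: approximate the cold-cache cost of migrating between two CPUs.
 * Assumes CPUs 0 and 2 exist and are distinct physical cores; adjust for
 * your topology. Build: gcc -O2 migrate_cost.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define WSS (512 * 1024)              /* 512 KB working set */
#define N   (WSS / sizeof(long))

static long buf[N];

static void pin_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* migrates us immediately if needed */
}

static double read_pass_ns(void) {
    struct timespec a, b;
    volatile long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < N; i++) sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &b);
    (void)sum;
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    for (size_t i = 0; i < N; i++) buf[i] = i;

    pin_to(0);
    read_pass_ns();                    /* warm CPU 0's private caches */
    double warm = read_pass_ns();      /* warm re-read on CPU 0 */

    pin_to(2);                         /* migrate to a cache-cold core */
    double cold = read_pass_ns();      /* first pass after migration */

    printf("warm pass: %.0f ns, post-migration pass: %.0f ns (%.1fx)\n",
           warm, cold, cold / warm);
    return 0;
}
```

Run it several times: if both cores share an L3 the ratio will be modest, while crossing sockets or NUMA nodes makes it much larger.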
After migration, a process experiences a warmup period—time during which cache misses are elevated as the new CPU's caches are populated. Understanding warmup dynamics helps predict when migration costs are acceptable.
Warmup Phases:
Immediate Phase (first ~1,000 instructions): nearly every data and instruction access misses; the pipeline stalls repeatedly while the hottest lines (the stack, the current code path) are fetched.
Working Set Phase (~1,000 - 100,000 instructions): the miss rate falls steadily as the bulk of the working set is pulled into L1/L2; throughput climbs back toward normal.
Steady State (after ~100K+ instructions): the caches once again reflect the process's access pattern, and the miss rate returns to its pre-migration baseline.
Modeling Warmup Time:
Warmup duration depends on the working set size, the cache line size, the per-miss latency, the available memory bandwidth, and how much of the latency can be hidden by memory-level parallelism and prefetching. A simple model:
"""Simple model of warmup impact after CPU migration""" def estimate_warmup_cost( working_set_bytes: int, cache_line_size: int = 64, l1_miss_cycles: int = 15, # Average L1 miss, hit L2/L3/memory memory_bandwidth_gbps: float = 50.0, # GB/s cpu_freq_ghz: float = 3.0) -> dict: """ Estimate cycles and time to warm a cache after migration. Assumes cold start (no data in new CPU's cache). """ # Number of cache lines to load num_cache_lines = working_set_bytes / cache_line_size # Time to load at memory bandwidth (seconds) load_time_sec = working_set_bytes / (memory_bandwidth_gbps * 1e9) # Cycles at CPU frequency load_cycles = load_time_sec * cpu_freq_ghz * 1e9 # Miss penalty cycles (simplified - doesn't account for MLP) miss_penalty_cycles = num_cache_lines * l1_miss_cycles # Real cost is max of bandwidth limit and latency chain effective_cycles = max(load_cycles, miss_penalty_cycles) # Instructions that could have executed in this time # Assume IPC of 2 for typical workload instructions_lost = effective_cycles * 2 return { "cache_lines": int(num_cache_lines), "warmup_cycles": int(effective_cycles), "warmup_time_us": load_time_sec * 1e6, "instructions_worth": int(instructions_lost), } # Examplesfor wss_kb in [64, 256, 1024, 4096]: result = estimate_warmup_cost(wss_kb * 1024) print(f"WSS: {wss_kb:4d} KB -> ~{result['warmup_cycles']:,} cycles " f"({result['warmup_time_us']:.1f} µs)") # Output:# WSS: 64 KB -> ~20,480 cycles (7 µs)# WSS: 256 KB -> ~81,920 cycles (27 µs)# WSS: 1024 KB -> ~327,680 cycles (109 µs)# WSS: 4096 KB -> ~1,310,720 cycles (437 µs)Modern CPUs can have multiple cache misses in flight simultaneously (memory-level parallelism). This significantly reduces effective warmup time compared to a purely sequential model. Prefetching also helps by loading cache lines before they're needed. Real warmup is often 2-5x faster than naive latency × count estimates.
False sharing is a particularly insidious cache effect that can devastate multi-core performance. It occurs when threads on different CPUs write to independent variables that happen to reside on the same cache line.
The Problem:
Consider two threads, each incrementing its own counter:
```c
/* BAD: False sharing between counters */
struct counters {
    long counter_thread_0;   /* Offset 0-7 */
    long counter_thread_1;   /* Offset 8-15 */
    /* Both fit in one 64-byte cache line! */
};

/* Thread 0 */
void thread0_work(struct counters *c) {
    for (int i = 0; i < 100000000; i++) {
        c->counter_thread_0++;   /* Writes to cache line */
    }
}

/* Thread 1 */
void thread1_work(struct counters *c) {
    for (int i = 0; i < 100000000; i++) {
        c->counter_thread_1++;   /* Writes to SAME cache line! */
    }
}

/*
 * What happens:
 * 1. Thread 0 writes counter_0 -> cache line in CPU 0's L1 (Modified)
 * 2. Thread 1 writes counter_1 -> needs cache line -> invalidate CPU 0's copy
 * 3. CPU 0 flushes Modified line to L3/memory
 * 4. CPU 1 gets line, marks Modified
 * 5. Thread 0 writes counter_0 -> invalidate CPU 1's copy
 * ... (ping-pong continues 100 million times per thread)
 */
```

Performance Impact:
False sharing can cause 10-100x slowdown compared to properly padded structures. The cache line bounces between CPUs constantly, generating coherence traffic that saturates the interconnect.
The Solution: Cache Line Padding
```c
#define CACHE_LINE_SIZE 64

/* GOOD: Each counter on its own cache line */
struct padded_counters {
    long counter_thread_0;
    char pad0[CACHE_LINE_SIZE - sizeof(long)];
    long counter_thread_1;
    char pad1[CACHE_LINE_SIZE - sizeof(long)];
};

/* Or using alignas (C11/C++11; in C11, #include <stdalign.h>) */
struct alignas_counters {
    alignas(CACHE_LINE_SIZE) long counter_thread_0;
    alignas(CACHE_LINE_SIZE) long counter_thread_1;
};

/* Or using compiler attributes (GCC) */
struct __attribute__((aligned(64))) aligned_counter {
    long value;
};

struct counters_v2 {
    struct aligned_counter counter_thread_0;
    struct aligned_counter counter_thread_1;
};

/*
 * Now each counter is on its own cache line.
 * Writes don't invalidate each other's copies.
 * Each thread maintains exclusive (E/M) cache line ownership.
 * Result: near-linear scaling with core count.
 */
```

False Sharing and Affinity:
Interestingly, false sharing is less harmful when threads share a CPU (via time-slicing) or share L3 cache. The cache line stays in the shared cache rather than bouncing across the interconnect. This creates a counter-intuitive situation where packing threads onto fewer CPUs might outperform spreading them across NUMA nodes—if false sharing exists.
However, the right solution is always to fix the false sharing rather than rely on scheduling workarounds.
False sharing is invisible to most profilers because there is no lock contention—each thread writes to 'its own' variable. Use hardware performance counters: high L2/L3 snoop rates, elevated cache coherence traffic, or Intel VTune's 'Memory Access Analysis' can reveal false sharing. The perf c2c tool on Linux is specifically designed to detect such sharing patterns.
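The effect is also easy to reproduce and time directly. The following sketch (assuming a multi-core machine with at least two free CPUs) runs two threads incrementing adjacent counters, then two threads incrementing counters padded onto separate lines; the padded run is typically several times faster, though the exact ratio varies by machine.

```c
/* Sketch: measure false sharing with two pthreads. Adjacent counters share
 * a cache line; padded counters do not. Build: gcc -O2 -pthread fs.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

static struct { long a; long b; } adjacent
    __attribute__((aligned(64)));                          /* same cache line */
static struct { long a; char pad[56]; long b; } padded;    /* 64 bytes apart */

static void *bump(void *p) {
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

static double run_pair(long *x, long *y) {
    struct timespec t0, t1;
    pthread_t t[2];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&t[0], NULL, bump, x);
    pthread_create(&t[1], NULL, bump, y);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("false sharing: %.2f s\n", run_pair(&adjacent.a, &adjacent.b));
    printf("padded       : %.2f s\n", run_pair(&padded.a, &padded.b));
    return 0;
}
```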
On Non-Uniform Memory Access (NUMA) systems, the cost of memory access depends on which CPU is accessing which memory region. This adds another dimension to affinity considerations.
NUMA Architecture:
NUMA Node 0 NUMA Node 1
┌────────────────────────┐ ┌────────────────────────┐
│ ┌────────────────┐ │ │ ┌────────────────┐ │
│ │ CPU 0-7 │ │ │ │ CPU 8-15 │ │
│ │ L1/L2 caches │ │ │ │ L1/L2 caches │ │
│ │ L3 cache │ │ │ │ L3 cache │ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
│ │ │ │ │ │
│ ┌───────┴────────┐ │ QPI/ │ ┌───────┴────────┐ │
│ │ Memory │←──┼─────UPI──────┼──→│ Memory │ │
│ │ Controller │ │ Interconnect│ │ Controller │ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
│ │ │ │ │ │
│ ┌───────┴────────┐ │ │ ┌───────┴────────┐ │
│ │ Memory (32GB) │ │ │ │ Memory (32GB) │ │
│ │ "Local" │ │ │ │ "Local" │ │
│ └────────────────┘ │ │ └────────────────┘ │
└────────────────────────┘ └────────────────────────┘
│ │
└─────── "Remote" to each other ────────────┘
NUMA Access Latencies:
| Access Type | Latency (ns) | Latency (cycles @ 3GHz) | Bandwidth Impact |
|---|---|---|---|
| L1 Cache Hit | ~1.3 | 4 | Unlimited (on-chip) |
| L2 Cache Hit | ~5 | 15 | Very high |
| L3 Cache Hit | ~15-20 | 45-60 | High |
| Local DRAM (same node) | ~80-100 | 240-300 | Full bandwidth |
| Remote DRAM (other node) | ~120-200 | 360-600 | Limited by interconnect |
| Remote DRAM (2 hops) | ~200-300 | 600-900 | Severely limited |
The NUMA Ratio:
The performance penalty for remote vs. local memory access is called the NUMA ratio. A ratio of 1.5 means remote access is 50% slower. Modern systems typically have latency ratios of 1.3-2.0, and remote bandwidth is often only 50-80% of local bandwidth.
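The kernel exposes its view of these distances (the same SLIT values that numactl --hardware prints), and libnuma makes them easy to read from code, as in this sketch (link with -lnuma):

```c
/* Sketch: print the kernel's NUMA distance matrix via libnuma.
 * Distances are SLIT values: 10 means local, so 21 implies ~2.1x.
 * Build: gcc -O2 numa_dist.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA is not available on this system\n");
        return 1;
    }
    int max_node = numa_max_node();
    for (int from = 0; from <= max_node; from++) {
        for (int to = 0; to <= max_node; to++) {
            int d = numa_distance(from, to);
            printf("node %d -> node %d: distance %d (%.1fx local)\n",
                   from, to, d, d / 10.0);
        }
    }
    return 0;
}
```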
NUMA and Affinity:
NUMA dramatically changes the affinity calculus: migrating a thread to another node does more than empty its caches, because its memory stays behind on the old node and every subsequent miss pays the remote-access penalty until the pages are migrated or freed. Keeping a thread on the node that holds its memory is therefore often more important than keeping it on any particular core.
```bash
# Show NUMA topology
numactl --hardware
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB
# node distances:
# node   0   1
#   0:  10  21    <- 21/10 = 2.1x penalty for remote access
#   1:  21  10

# Show which NUMA node a process uses
numastat -p <pid>
# Shows pages on each NUMA node

# Force process to local memory
numactl --membind=0 --cpunodebind=0 ./my_program

# Monitor NUMA statistics
numastat -m
# Shows per-node memory usage statistics
```

By default, Linux uses a 'first-touch' NUMA policy—memory is allocated on the NUMA node where it is first accessed. This means initialization patterns matter! If thread 0 initializes all data, it becomes local to node 0. Consider parallelizing initialization to distribute memory appropriately.
Theory is essential, but systems engineers need to measure cache effects on real workloads. Linux provides powerful tools for this.
Using perf for Cache Analysis:
```bash
# Basic cache statistics
perf stat -e cache-references,cache-misses,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses ./my_program

# Sample output:
#   1,234,567,890  cache-references
#     123,456,789  cache-misses            # 10.0% of all refs
#     987,654,321  L1-dcache-loads
#     123,456,789  L1-dcache-load-misses   # 12.5% miss rate
#      98,765,432  LLC-loads               # L3 (last-level cache)
#       9,876,543  LLC-load-misses         # 10.0% of LLC loads

# Track migrations and cache misses together
perf stat -e migrations,cache-misses,cycles ./my_program

# Detailed cache event sampling for profiling
perf record -e cache-misses -c 10000 ./my_program
perf report   # Shows which functions cause cache misses

# Compare cache behavior with/without pinning
echo "=== Without affinity ==="
perf stat -e cache-misses,migrations ./my_program

echo "=== With CPU pinning ==="
taskset -c 0 perf stat -e cache-misses,migrations ./my_program
```

Using perf c2c for Sharing Analysis:
The perf c2c tool specifically targets cache line sharing and contention:
```bash
# Record cache-to-cache (c2c) events
sudo perf c2c record ./my_program

# Analyze sharing patterns
sudo perf c2c report

# Output shows:
# - Cache lines with shared accesses
# - Which CPUs are involved in sharing
# - Read vs write patterns
# - Potential false sharing candidates

# Filter for specific data symbols
sudo perf c2c report --call-graph=dwarf -d my_program

# Key metrics to look for:
# - High "Hitm" (Hit Modified) counts = true/false sharing
# - "RmtHitm" = remote cache line hit in modified state
# - "LclHitm" = local cache line hit in modified state
```

Benchmarking Migration Impact:
To quantify migration costs, compare performance with and without affinity:
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <time.h>

#define ARRAY_SIZE (4 * 1024 * 1024)   /* 4MB working set */
#define ITERATIONS 1000

long array[ARRAY_SIZE / sizeof(long)];

double time_diff_ms(struct timespec *start, struct timespec *end) {
    return (end->tv_sec - start->tv_sec) * 1000.0 +
           (end->tv_nsec - start->tv_nsec) / 1e6;
}

void benchmark(const char *label) {
    struct timespec start, end;
    volatile long sum = 0;

    /* Warm the cache */
    for (int i = 0; i < ARRAY_SIZE / sizeof(long); i++) {
        sum += array[i];
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int iter = 0; iter < ITERATIONS; iter++) {
        for (int i = 0; i < ARRAY_SIZE / sizeof(long); i++) {
            sum += array[i];
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("%s: %.2f ms, sum=%ld\n", label, time_diff_ms(&start, &end), sum);
}

int main() {
    /* Initialize array */
    for (int i = 0; i < ARRAY_SIZE / sizeof(long); i++) {
        array[i] = i;
    }

    printf("Working set: %d bytes\n", ARRAY_SIZE);
    printf("Iterations: %d\n", ITERATIONS);

    /* Test 1: Free to migrate */
    printf("\n=== Test 1: No affinity ===\n");
    benchmark("Unrestricted");

    /* Test 2: Pinned to CPU 0 */
    printf("\n=== Test 2: Pinned to CPU 0 ===\n");
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    sched_setaffinity(0, sizeof(cpuset), &cpuset);
    benchmark("Pinned CPU 0");

    return 0;
}

/*
 * Run multiple times, observe variance:
 * - Pinned runs should have less variance
 * - Unpinned may occasionally be faster (better load balance)
 *   or slower (migration overhead)
 */
```

When benchmarking cache effects, focus on variance as well as mean. Migrations cause unpredictable performance—a process with frequent migrations will have high variance. Consistent (pinned) execution trades potential peak performance for predictability, which is often more valuable.
Understanding cache effects enables cache-aware software design. These patterns work synergistically with processor affinity.
Pattern 1: Thread-Local Data
Structure data so each thread works primarily on its own memory:
```c
#include <pthread.h>

#define NUM_THREADS 8          /* example values for illustration */
#define ITERATIONS  1000000

/* BAD: Shared counter with locking */
struct shared_state {
    pthread_mutex_t lock;
    long global_counter;
};

/* GOOD: Thread-local counters, merged at end */
struct per_thread_state {
    long local_counter;
    char padding[64 - sizeof(long)];   /* Avoid false sharing */
};

struct per_thread_state thread_states[NUM_THREADS];

void thread_work(int thread_id) {
    /* All writes go to thread-local state */
    for (int i = 0; i < ITERATIONS; i++) {
        thread_states[thread_id].local_counter++;
    }
}

long get_total() {
    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        total += thread_states[i].local_counter;
    }
    return total;
}
```

Pattern 2: NUMA-Aware Data Placement
Allocate data on the NUMA node where it will be accessed:
```c
#include <numa.h>      /* libnuma: link with -lnuma */
#include <omp.h>
#include <sched.h>

/* Allocate memory on specific NUMA node */
void* numa_aware_alloc(size_t size, int target_node) {
    return numa_alloc_onnode(size, target_node);
}

/* Initialize array in parallel for first-touch locality */
void parallel_initialize(double* array, size_t n, int num_threads) {
    #pragma omp parallel for num_threads(num_threads)
    for (size_t i = 0; i < n; i++) {
        /* Each thread touches pages in its portion */
        /* Memory will be allocated on local NUMA node */
        array[i] = 0.0;
    }
}

/* Pin threads to match data allocation */
void run_computation(double* array, size_t per_thread) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int node = numa_node_of_cpu(sched_getcpu());

        /* Thread works on data local to its NUMA node */
        double* my_data = &array[tid * per_thread];
        for (size_t i = 0; i < per_thread; i++) {
            my_data[i] = compute(my_data[i]);   /* compute() is a placeholder */
        }
    }
}
```

Pattern 3: Cache-Line-Aligned Structures
Align hot data to cache line boundaries:
```c
#define CACHE_LINE 64

/* Align structure to cache line */
struct __attribute__((aligned(CACHE_LINE))) hot_data {
    /* Frequently accessed fields grouped together; the pointer comes
     * before the int so the compiler inserts no hidden padding */
    long counter;
    void* ptr;
    int flags;
    /* These all fit in the first cache line */
    char padding[CACHE_LINE - sizeof(long) - sizeof(void*) - sizeof(int)];

    /* Cold data in the following cache line(s) */
    char name[64];
    long last_access_time;
};

/* Array of cache-aligned elements */
struct hot_data items[1024] __attribute__((aligned(CACHE_LINE)));

/* Verify alignment */
_Static_assert(sizeof(struct hot_data) % CACHE_LINE == 0,
               "Structure size must be multiple of cache line");
```

These patterns are most effective when combined with processor affinity. Thread-local data benefits from pinning because the data stays hot in one CPU's cache. NUMA-aware allocation requires affinity to ensure threads stay on the nodes where their data lives. Affinity is the scheduling foundation; these patterns are the data structure foundation. Together, they deliver optimal performance.
Cache effects are the fundamental reason processor affinity matters. Understanding these effects enables informed decisions about when to use affinity and how to design cache-friendly software.
What's Next:
We now understand what soft and hard affinity are, and why cache effects make them important. In the next page, we'll learn how to actually set affinity programmatically—using system calls, library functions, and command-line tools to control where our processes run.
You now understand the cache effects that drive the importance of processor affinity. From cache hierarchy to coherence protocols, from warmup dynamics to false sharing, you can reason about the performance implications of CPU placement decisions. Next, we'll cover the practical APIs and tools for setting affinity.