We've established that processor affinity—both soft and hard—exists to preserve cache locality. But what exactly are these 'cache effects' that make affinity so important? Why does it matter which CPU runs our code?
The answer lies in the profound performance gap between CPU execution speed and memory access latency. Modern processors can execute billions of instructions per second, but accessing main memory takes 100-300+ nanoseconds—time for hundreds of instructions to execute. Caches bridge this gap by keeping frequently accessed data close to the CPU.
The Affinity Connection:
When a process is migrated to a different CPU, it typically loses access to the cache state it built up on the previous CPU. The new CPU's caches are 'cold' for this process—filled with another process's data if anything at all. This cache coldness translates directly into performance degradation as the migrated process experiences cache misses while rebuilding its cache footprint.
Understanding cache effects at a deep level is essential for systems engineers who need to reason about when affinity helps, when it hurts, and how to measure its impact.
By the end of this page, you will understand: the cache hierarchy and its performance characteristics, cache coherence and its overhead, the quantitative cost of cache migration, cache warmup dynamics, false sharing pathologies, and techniques for measuring cache performance in scheduling contexts.
Modern processors employ a multi-level cache hierarchy. Understanding each level's characteristics is crucial for reasoning about affinity's impact.
The Typical Cache Hierarchy:
┌───────────────────────────────────────────────────┐
│ Main Memory │
│ (8GB - 512GB+, ~100ns) │
└───────────────────────────────────────────────────┘
▲
┌───────────────┴───────────────┐
│ Last-Level Cache (L3) │
│ Shared across cores, 8-32MB │
│ ~40-50 cycles │
└───────────────────────────────┘
┌────────────────────┴────────────────────┐
┌──────────┴──────────┐ ┌───────────────┴─────────┐
│ L2 Cache (CPU 0) │ │ L2 Cache (CPU 1) │
│ Private, 256KB-1MB│ │ Private, 256KB-1MB │
│ ~10-20 cycles │ │ ~10-20 cycles │
└──────────┬──────────┘ └───────────────┬─────────┘
┌───────────────┴───────────────┐ ┌─────────┴───────────────┐
│ L1 Cache │ │ L1 Cache │
│ ┌──────────┐ ┌───────────┐ │ │ ┌─────────┐ ┌────────┐ │
│ │ L1-D │ │ L1-I │ │ │ │ L1-D │ │ L1-I │ │
│ │(32-48KB) │ │ (32-48KB) │ │ │ │(32-48KB)│ │(32-48KB)││
│ │~4 cycles │ │ ~4 cycles │ │ │ │~4 cycles│ │~4 cycles││
│ └──────────┘ └───────────┘ │ │ └─────────┘ └────────┘ │
└───────────────────────────────┘ └─────────────────────────┘
│ │
CPU Core 0 CPU Core 1
| Cache Level | Size | Latency (cycles) | Shared? | Associativity |
|---|---|---|---|---|
| L1 Data (L1-D) | 32-48 KB/core | 4-5 | No (per-core) | 8-way |
| L1 Instruction (L1-I) | 32-48 KB/core | 4-5 | No (per-core) | 8-way |
| L2 (Mid-Level Cache) | 256KB - 1MB/core | 10-20 | No (per-core)* | 8-16 way |
| L3 (Last-Level Cache) | 8-32+ MB total | 40-50 | Yes (all cores) | 12-16 way |
| Main Memory | 8GB - 512GB+ | 100-300+ | Yes (system-wide) | N/A |
*Note: Some architectures have shared L2 caches between SMT siblings or core pairs.
The Cache Line:
Data moves between memory and caches in fixed-size units called cache lines (typically 64 bytes). When you access a single byte, the entire 64-byte cache line containing that byte is loaded. This has profound implications: sequential access is nearly free once a line is resident (spatial locality), sparse access wastes most of each transfer, and two unrelated variables that happen to share a line become coupled for coherence purposes (the false sharing problem covered later on this page).
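To see this directly, here is a minimal sketch that sums a 64 MB array twice: once touching every element, once touching only one long per 64-byte line. The sparse pass does one eighth of the arithmetic but transfers exactly the same cache lines, so it runs far slower than one eighth of the time. A 64-byte line size and a working set larger than L3 are assumed.

```c
/* Minimal sketch: the cache line, not the byte, is the unit of transfer.
 * Assumes 64-byte lines and a working set larger than the L3 cache.
 * Build: gcc -O2 stride.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(long))   /* 64 MB of longs */

static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    long *data = malloc(N * sizeof(long));
    if (!data) return 1;
    for (size_t i = 0; i < N; i++) data[i] = i;

    struct timespec t0, t1;
    volatile long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) sum += data[i];      /* every element */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 1 (all elements) : %.1f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i += 8) sum += data[i];   /* one long per line */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 8 (one per line) : %.1f ms\n", elapsed_ms(t0, t1));

    free(data);
    return 0;
}
```

On typical hardware the two passes take similar time even though the second reads an eighth of the data: line transfers, not instruction count, dominate.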
Cache Inclusivity:
Cache hierarchies can be inclusive (the L3 holds a copy of everything in the L1/L2 caches, which simplifies coherence lookups), exclusive (a line lives in exactly one level, maximizing total usable capacity), or non-inclusive (neither property is guaranteed).
Modern processors typically use non-inclusive L3 caches to maximize effective cache capacity.
On Linux, examine /sys/devices/system/cpu/cpu0/cache/ for cache topology. Each index* directory describes a cache level with size, type (Data/Instruction/Unified), associativity, and which CPUs share it. Tools like lscpu, lstopo, and hwloc-ls visualize this hierarchy clearly.
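The same geometry can also be queried from code. The sketch below uses glibc's _SC_LEVEL*_CACHE_* sysconf extensions (a glibc-specific convenience, not POSIX; a return of 0 or -1 means the value is unknown), which surface the same information exposed under /sys.

```c
/* Sketch: query cache geometry via glibc's sysconf() extensions.
 * The _SC_LEVEL* names are glibc-specific; 0 or -1 means "unknown". */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1d size      : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d assoc     : %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    printf("L2 size       : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size       : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```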
When multiple CPUs have caches, they might cache copies of the same memory location. What happens when one CPU modifies that data? This is the cache coherence problem, and its solution adds overhead that directly impacts migration costs.
The MESI Protocol:
Most x86 processors use variants of the MESI protocol (Modified, Exclusive, Shared, Invalid) to maintain coherence:
| State | Meaning | Can Read | Can Write | In Other Caches? |
|---|---|---|---|---|
| M (Modified) | Dirty; we have the only copy | Yes | Yes | No |
| E (Exclusive) | Clean; we have the only copy | Yes | Yes → M | No |
| S (Shared) | Clean; others may have copies | Yes | Invalidate → M | Maybe |
| I (Invalid) | Not present or stale | Fetch first | Fetch first | - |
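To make the table concrete, here is a minimal sketch of how one cache's copy of a line transitions between MESI states in response to local and remote accesses. It is a simplification (real hardware adds states such as MESIF's F or MOESI's O, plus snoop filters), but it captures the write-invalidate behavior that drives migration and false-sharing costs.

```c
/* Minimal MESI transition sketch for a single line, from one cache's
 * point of view. Simplified: a local read miss is shown installing as S,
 * though it may install as E when no other cache holds the line. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e) {
    switch (e) {
    case LOCAL_READ:                 /* miss fetches the line; hits keep state */
        return (s == INVALID) ? SHARED : s;
    case LOCAL_WRITE:                /* invalidate other copies, own it dirty */
        return MODIFIED;
    case REMOTE_READ:                /* M/E/S degrade to S; M writes back first */
        return (s == INVALID) ? INVALID : SHARED;
    case REMOTE_WRITE:               /* another CPU took ownership of the line */
        return INVALID;
    }
    return s;
}
```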
Cache-to-Cache Transfers:
When a CPU needs data that's in another CPU's cache, the coherence protocol arranges a cache-to-cache transfer: the request is snooped (or resolved through a directory), the owning cache supplies the line, writing it back or downgrading its state as required, and the requester installs its own copy. This is faster than fetching from DRAM, but still far slower than a local L1/L2 hit, and across sockets it must traverse the inter-socket interconnect.
Migration and Coherence Traffic:
When a process migrates, consider what happens: its first accesses on the new CPU miss in the cold L1 and L2; any lines it dirtied on the old CPU must be forwarded or written back before the new CPU can use them; the old CPU's stale copies are invalidated or eventually evicted; and instruction-fetch, TLB, and branch-predictor state must be rebuilt as well.
This cascade of cache misses and coherence traffic constitutes the migration penalty.
Quantifying Migration Overhead:
| Migration Type | Cache Impact | Memory Cost | Typical Penalty |
|---|---|---|---|
| Same Core (SMT sibling) | Mostly shared L1/L2 | None | Minimal (~1000 cycles) |
| Same L3 (different core) | L1/L2 cold, L3 shared | None | Moderate (~5,000-10,000 cycles) |
| Same Socket (different L3) | All caches cold | Same NUMA node | Significant (~20,000-50,000 cycles) |
| Different Socket/NUMA | All caches cold | Remote memory | Severe (~50,000-200,000 cycles) |
The cycle counts above are rough estimates. Actual costs depend on working set size, cache sizes, memory bandwidth, coherence protocol implementation, and workload characteristics. Measure for your specific system and workload!
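One way to put a number on this for your own machine is to force a migration and time the first pass over a warmed working set. The sketch below pins itself to CPU 0, warms a 512 KB buffer, then moves to CPU 2 and re-reads it; the ratio between the two timed passes approximates the cold-cache penalty for that working-set size. The CPU numbers are assumptions: pick two CPUs that are distinct physical cores on your topology, since SMT siblings share caches and will show little penalty.

```c
/* Sketch: approximate the cold-cache cost of migrating between two CPUs.
 * Assumes CPUs 0 and 2 exist and are distinct physical cores; adjust for
 * your topology. Build: gcc -O2 migrate_cost.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define WSS (512 * 1024)              /* 512 KB working set */
#define N   (WSS / sizeof(long))

static long buf[N];

static void pin_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* migrates us immediately if needed */
}

static double read_pass_ns(void) {
    struct timespec a, b;
    volatile long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < N; i++) sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &b);
    (void)sum;
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    for (size_t i = 0; i < N; i++) buf[i] = i;

    pin_to(0);
    read_pass_ns();                    /* warm CPU 0's private caches */
    double warm = read_pass_ns();      /* warm re-read on CPU 0 */

    pin_to(2);                         /* migrate to a cache-cold core */
    double cold = read_pass_ns();      /* first pass after migration */

    printf("warm pass: %.0f ns, post-migration pass: %.0f ns (%.1fx)\n",
           warm, cold, cold / warm);
    return 0;
}
```

Run it several times: if both cores share an L3 the ratio will be modest, while crossing sockets or NUMA nodes makes it much larger.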
After migration, a process experiences a warmup period—time during which cache misses are elevated as the new CPU's caches are populated. Understanding warmup dynamics helps predict when migration costs are acceptable.
Warmup Phases:
Immediate Phase (first ~1,000 instructions): nearly every data and instruction access misses; the pipeline stalls repeatedly while the hottest lines (the stack, the current code path) are fetched.
Working Set Phase (~1,000 - 100,000 instructions): the miss rate falls steadily as the bulk of the working set is pulled into L1/L2; throughput climbs back toward normal.
Steady State (after ~100K+ instructions): the caches once again reflect the process's access pattern, and the miss rate returns to its pre-migration baseline.
Modeling Warmup Time:
Warmup duration depends on the working set size, the cache line size, the per-miss latency, the available memory bandwidth, and how much of the latency can be hidden by memory-level parallelism and prefetching. A simple model:
"""Simple model of warmup impact after CPU migration""" def estimate_warmup_cost( working_set_bytes: int, cache_line_size: int = 64, l1_miss_cycles: int = 15, # Average L1 miss, hit L2/L3/memory memory_bandwidth_gbps: float = 50.0, # GB/s cpu_freq_ghz: float = 3.0) -> dict: """ Estimate cycles and time to warm a cache after migration. Assumes cold start (no data in new CPU's cache). """ # Number of cache lines to load num_cache_lines = working_set_bytes / cache_line_size # Time to load at memory bandwidth (seconds) load_time_sec = working_set_bytes / (memory_bandwidth_gbps * 1e9) # Cycles at CPU frequency load_cycles = load_time_sec * cpu_freq_ghz * 1e9 # Miss penalty cycles (simplified - doesn't account for MLP) miss_penalty_cycles = num_cache_lines * l1_miss_cycles # Real cost is max of bandwidth limit and latency chain effective_cycles = max(load_cycles, miss_penalty_cycles) # Instructions that could have executed in this time # Assume IPC of 2 for typical workload instructions_lost = effective_cycles * 2 return { "cache_lines": int(num_cache_lines), "warmup_cycles": int(effective_cycles), "warmup_time_us": load_time_sec * 1e6, "instructions_worth": int(instructions_lost), } # Examplesfor wss_kb in [64, 256, 1024, 4096]: result = estimate_warmup_cost(wss_kb * 1024) print(f"WSS: {wss_kb:4d} KB -> ~{result['warmup_cycles']:,} cycles " f"({result['warmup_time_us']:.1f} µs)") # Output:# WSS: 64 KB -> ~20,480 cycles (7 µs)# WSS: 256 KB -> ~81,920 cycles (27 µs)# WSS: 1024 KB -> ~327,680 cycles (109 µs)# WSS: 4096 KB -> ~1,310,720 cycles (437 µs)Modern CPUs can have multiple cache misses in flight simultaneously (memory-level parallelism). This significantly reduces effective warmup time compared to a purely sequential model. Prefetching also helps by loading cache lines before they're needed. Real warmup is often 2-5x faster than naive latency × count estimates.
False sharing is a particularly insidious cache effect that can devastate multi-core performance. It occurs when threads on different CPUs write to independent variables that happen to reside on the same cache line.
The Problem:
Consider two threads, each incrementing its own counter:
```c
/* BAD: False sharing between counters */
struct counters {
    long counter_thread_0;   /* Offset 0-7 */
    long counter_thread_1;   /* Offset 8-15 */
    /* Both fit in one 64-byte cache line! */
};

/* Thread 0 */
void thread0_work(struct counters *c) {
    for (int i = 0; i < 100000000; i++) {
        c->counter_thread_0++;   /* Writes to cache line */
    }
}

/* Thread 1 */
void thread1_work(struct counters *c) {
    for (int i = 0; i < 100000000; i++) {
        c->counter_thread_1++;   /* Writes to SAME cache line! */
    }
}

/*
 * What happens:
 * 1. Thread 0 writes counter_0 -> cache line in CPU 0's L1 (Modified)
 * 2. Thread 1 writes counter_1 -> needs cache line -> invalidate CPU 0's copy
 * 3. CPU 0 flushes Modified line to L3/memory
 * 4. CPU 1 gets line, marks Modified
 * 5. Thread 0 writes counter_0 -> invalidate CPU 1's copy
 * ... (ping-pong continues 100 million times per thread)
 */
```

Performance Impact:
False sharing can cause 10-100x slowdown compared to properly padded structures. The cache line bounces between CPUs constantly, generating coherence traffic that saturates the interconnect.
The Solution: Cache Line Padding
```c
#define CACHE_LINE_SIZE 64

/* GOOD: Each counter on its own cache line */
struct padded_counters {
    long counter_thread_0;
    char pad0[CACHE_LINE_SIZE - sizeof(long)];
    long counter_thread_1;
    char pad1[CACHE_LINE_SIZE - sizeof(long)];
};

/* Or using alignas (C11/C++11; in C11, #include <stdalign.h>) */
struct alignas_counters {
    alignas(CACHE_LINE_SIZE) long counter_thread_0;
    alignas(CACHE_LINE_SIZE) long counter_thread_1;
};

/* Or using compiler attributes (GCC) */
struct __attribute__((aligned(64))) aligned_counter {
    long value;
};

struct counters_v2 {
    struct aligned_counter counter_thread_0;
    struct aligned_counter counter_thread_1;
};

/*
 * Now each counter is on its own cache line.
 * Writes don't invalidate each other's copies.
 * Each thread maintains exclusive (E/M) cache line ownership.
 * Result: near-linear scaling with core count.
 */
```

False Sharing and Affinity:
Interestingly, false sharing is less harmful when threads share a CPU (via time-slicing) or share L3 cache. The cache line stays in the shared cache rather than bouncing across the interconnect. This creates a counter-intuitive situation where packing threads onto fewer CPUs might outperform spreading them across NUMA nodes—if false sharing exists.
However, the right solution is always to fix the false sharing rather than rely on scheduling workarounds.
False sharing is invisible to most profilers because there is no lock contention—each thread writes to 'its own' variable. Use hardware performance counters: high L2/L3 snoop rates, elevated cache coherence traffic, or Intel VTune's 'Memory Access Analysis' can reveal false sharing. The perf c2c tool on Linux is specifically designed to detect such sharing patterns.
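The effect is also easy to reproduce and time directly. The following sketch (assuming a multi-core machine with at least two free CPUs) runs two threads incrementing adjacent counters, then two threads incrementing counters padded onto separate lines; the padded run is typically several times faster, though the exact ratio varies by machine.

```c
/* Sketch: measure false sharing with two pthreads. Adjacent counters share
 * a cache line; padded counters do not. Build: gcc -O2 -pthread fs.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

static struct { long a; long b; } adjacent
    __attribute__((aligned(64)));                          /* same cache line */
static struct { long a; char pad[56]; long b; } padded;    /* 64 bytes apart */

static void *bump(void *p) {
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

static double run_pair(long *x, long *y) {
    struct timespec t0, t1;
    pthread_t t[2];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&t[0], NULL, bump, x);
    pthread_create(&t[1], NULL, bump, y);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("false sharing: %.2f s\n", run_pair(&adjacent.a, &adjacent.b));
    printf("padded       : %.2f s\n", run_pair(&padded.a, &padded.b));
    return 0;
}
```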
On Non-Uniform Memory Access (NUMA) systems, the cost of memory access depends on which CPU is accessing which memory region. This adds another dimension to affinity considerations.
NUMA Architecture:
NUMA Node 0 NUMA Node 1
┌────────────────────────┐ ┌────────────────────────┐
│ ┌────────────────┐ │ │ ┌────────────────┐ │
│ │ CPU 0-7 │ │ │ │ CPU 8-15 │ │
│ │ L1/L2 caches │ │ │ │ L1/L2 caches │ │
│ │ L3 cache │ │ │ │ L3 cache │ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
│ │ │ │ │ │
│ ┌───────┴────────┐ │ QPI/ │ ┌───────┴────────┐ │
│ │ Memory │←──┼─────UPI──────┼──→│ Memory │ │
│ │ Controller │ │ Interconnect│ │ Controller │ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
│ │ │ │ │ │
│ ┌───────┴────────┐ │ │ ┌───────┴────────┐ │
│ │ Memory (32GB) │ │ │ │ Memory (32GB) │ │
│ │ "Local" │ │ │ │ "Local" │ │
│ └────────────────┘ │ │ └────────────────┘ │
└────────────────────────┘ └────────────────────────┘
│ │
└─────── "Remote" to each other ────────────┘
NUMA Access Latencies:
| Access Type | Latency (ns) | Latency (cycles @ 3GHz) | Bandwidth Impact |
|---|---|---|---|
| L1 Cache Hit | ~1.3 | 4 | Unlimited (on-chip) |
| L2 Cache Hit | ~5 | 15 | Very high |
| L3 Cache Hit | ~15-20 | 45-60 | High |
| Local DRAM (same node) | ~80-100 | 240-300 | Full bandwidth |
| Remote DRAM (other node) | ~120-200 | 360-600 | Limited by interconnect |
| Remote DRAM (2 hops) | ~200-300 | 600-900 | Severely limited |
The NUMA Ratio:
The performance penalty for remote vs. local memory access is called the NUMA ratio. A ratio of 1.5 means remote access is 50% slower. Modern systems typically have latency ratios of 1.3-2.0, and remote bandwidth is often only 50-80% of local bandwidth.
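The kernel exposes its view of these distances (the same SLIT values that numactl --hardware prints), and libnuma makes them easy to read from code, as in this sketch (link with -lnuma):

```c
/* Sketch: print the kernel's NUMA distance matrix via libnuma.
 * Distances are SLIT values: 10 means local, so 21 implies ~2.1x.
 * Build: gcc -O2 numa_dist.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA is not available on this system\n");
        return 1;
    }
    int max_node = numa_max_node();
    for (int from = 0; from <= max_node; from++) {
        for (int to = 0; to <= max_node; to++) {
            int d = numa_distance(from, to);
            printf("node %d -> node %d: distance %d (%.1fx local)\n",
                   from, to, d, d / 10.0);
        }
    }
    return 0;
}
```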
NUMA and Affinity:
NUMA dramatically changes the affinity calculus: migrating a thread to another node does more than empty its caches, because its memory stays behind on the old node and every subsequent miss pays the remote-access penalty until the pages are migrated or freed. Keeping a thread on the node that holds its memory is therefore often more important than keeping it on any particular core.
```bash
# Show NUMA topology
numactl --hardware
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB
# node distances:
# node   0   1
#   0:  10  21    <- 21/10 = 2.1x penalty for remote access
#   1:  21  10

# Show which NUMA node a process uses
numastat -p <pid>
# Shows pages on each NUMA node

# Force process to local memory
numactl --membind=0 --cpunodebind=0 ./my_program

# Monitor NUMA statistics
numastat -m
# Shows per-node memory usage statistics
```

By default, Linux uses a 'first-touch' NUMA policy—memory is allocated on the NUMA node where it is first accessed. This means initialization patterns matter! If thread 0 initializes all data, it becomes local to node 0. Consider parallelizing initialization to distribute memory appropriately.
Theory is essential, but systems engineers need to measure cache effects on real workloads. Linux provides powerful tools for this.
Using perf for Cache Analysis:
```bash
# Basic cache statistics
perf stat -e cache-references,cache-misses,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses ./my_program

# Sample output:
#   1,234,567,890  cache-references
#     123,456,789  cache-misses            # 10.0% of all refs
#     987,654,321  L1-dcache-loads
#     123,456,789  L1-dcache-load-misses   # 12.5% miss rate
#      98,765,432  LLC-loads               # L3 (last-level cache)
#       9,876,543  LLC-load-misses         # 10.0% of LLC loads

# Track migrations and cache misses together
perf stat -e migrations,cache-misses,cycles ./my_program

# Detailed cache event sampling for profiling
perf record -e cache-misses -c 10000 ./my_program
perf report   # Shows which functions cause cache misses

# Compare cache behavior with/without pinning
echo "=== Without affinity ==="
perf stat -e cache-misses,migrations ./my_program

echo "=== With CPU pinning ==="
taskset -c 0 perf stat -e cache-misses,migrations ./my_program
```

Using perf c2c for Sharing Analysis:
The perf c2c tool specifically targets cache line sharing and contention:
```bash
# Record cache-to-cache (c2c) events
sudo perf c2c record ./my_program

# Analyze sharing patterns
sudo perf c2c report

# Output shows:
# - Cache lines with shared accesses
# - Which CPUs are involved in sharing
# - Read vs write patterns
# - Potential false sharing candidates

# Filter for specific data symbols
sudo perf c2c report --call-graph=dwarf -d my_program

# Key metrics to look for:
# - High "Hitm" (Hit Modified) counts = true/false sharing
# - "RmtHitm" = remote cache line hit in modified state
# - "LclHitm" = local cache line hit in modified state
```

Benchmarking Migration Impact:
To quantify migration costs, compare performance with and without affinity:
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <time.h>

#define ARRAY_SIZE (4 * 1024 * 1024)   /* 4MB working set */
#define ITERATIONS 1000

long array[ARRAY_SIZE / sizeof(long)];

double time_diff_ms(struct timespec *start, struct timespec *end) {
    return (end->tv_sec - start->tv_sec) * 1000.0 +
           (end->tv_nsec - start->tv_nsec) / 1e6;
}

void benchmark(const char *label) {
    struct timespec start, end;
    volatile long sum = 0;

    /* Warm the cache */
    for (int i = 0; i < ARRAY_SIZE / sizeof(long); i++) {
        sum += array[i];
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int iter = 0; iter < ITERATIONS; iter++) {
        for (int i = 0; i < ARRAY_SIZE / sizeof(long); i++) {
            sum += array[i];
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("%s: %.2f ms, sum=%ld\n", label, time_diff_ms(&start, &end), sum);
}

int main() {
    /* Initialize array */
    for (int i = 0; i < ARRAY_SIZE / sizeof(long); i++) {
        array[i] = i;
    }

    printf("Working set: %d bytes\n", ARRAY_SIZE);
    printf("Iterations: %d\n", ITERATIONS);

    /* Test 1: Free to migrate */
    printf("\n=== Test 1: No affinity ===\n");
    benchmark("Unrestricted");

    /* Test 2: Pinned to CPU 0 */
    printf("\n=== Test 2: Pinned to CPU 0 ===\n");
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    sched_setaffinity(0, sizeof(cpuset), &cpuset);
    benchmark("Pinned CPU 0");

    return 0;
}

/*
 * Run multiple times, observe variance:
 * - Pinned runs should have less variance
 * - Unpinned may occasionally be faster (better load balance)
 *   or slower (migration overhead)
 */
```

When benchmarking cache effects, focus on variance as well as mean. Migrations cause unpredictable performance—a process with frequent migrations will have high variance. Consistent (pinned) execution trades potential peak performance for predictability, which is often more valuable.
Understanding cache effects enables cache-aware software design. These patterns work synergistically with processor affinity.
Pattern 1: Thread-Local Data
Structure data so each thread works primarily on its own memory:
```c
#include <pthread.h>

#define NUM_THREADS 8          /* example values for illustration */
#define ITERATIONS  1000000

/* BAD: Shared counter with locking */
struct shared_state {
    pthread_mutex_t lock;
    long global_counter;
};

/* GOOD: Thread-local counters, merged at end */
struct per_thread_state {
    long local_counter;
    char padding[64 - sizeof(long)];   /* Avoid false sharing */
};

struct per_thread_state thread_states[NUM_THREADS];

void thread_work(int thread_id) {
    /* All writes go to thread-local state */
    for (int i = 0; i < ITERATIONS; i++) {
        thread_states[thread_id].local_counter++;
    }
}

long get_total() {
    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        total += thread_states[i].local_counter;
    }
    return total;
}
```

Pattern 2: NUMA-Aware Data Placement
Allocate data on the NUMA node where it will be accessed:
```c
#include <numa.h>      /* libnuma: link with -lnuma */
#include <omp.h>
#include <sched.h>

/* Allocate memory on specific NUMA node */
void* numa_aware_alloc(size_t size, int target_node) {
    return numa_alloc_onnode(size, target_node);
}

/* Initialize array in parallel for first-touch locality */
void parallel_initialize(double* array, size_t n, int num_threads) {
    #pragma omp parallel for num_threads(num_threads)
    for (size_t i = 0; i < n; i++) {
        /* Each thread touches pages in its portion */
        /* Memory will be allocated on local NUMA node */
        array[i] = 0.0;
    }
}

/* Pin threads to match data allocation */
void run_computation(double* array, size_t per_thread) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int node = numa_node_of_cpu(sched_getcpu());

        /* Thread works on data local to its NUMA node */
        double* my_data = &array[tid * per_thread];
        for (size_t i = 0; i < per_thread; i++) {
            my_data[i] = compute(my_data[i]);   /* compute() is a placeholder */
        }
    }
}
```

Pattern 3: Cache-Line-Aligned Structures
Align hot data to cache line boundaries:
```c
#define CACHE_LINE 64

/* Align structure to cache line */
struct __attribute__((aligned(CACHE_LINE))) hot_data {
    /* Frequently accessed fields grouped together; the pointer comes
     * before the int so the compiler inserts no hidden padding */
    long counter;
    void* ptr;
    int flags;
    /* These all fit in the first cache line */
    char padding[CACHE_LINE - sizeof(long) - sizeof(void*) - sizeof(int)];

    /* Cold data in the following cache line(s) */
    char name[64];
    long last_access_time;
};

/* Array of cache-aligned elements */
struct hot_data items[1024] __attribute__((aligned(CACHE_LINE)));

/* Verify alignment */
_Static_assert(sizeof(struct hot_data) % CACHE_LINE == 0,
               "Structure size must be multiple of cache line");
```

These patterns are most effective when combined with processor affinity. Thread-local data benefits from pinning because the data stays hot in one CPU's cache. NUMA-aware allocation requires affinity to ensure threads stay on the nodes where their data lives. Affinity is the scheduling foundation; these patterns are the data structure foundation. Together, they deliver optimal performance.
Cache effects are the fundamental reason processor affinity matters. Understanding these effects enables informed decisions about when to use affinity and how to design cache-friendly software.
What's Next:
We now understand what soft and hard affinity are, and why cache effects make them important. In the next page, we'll learn how to actually set affinity programmatically—using system calls, library functions, and command-line tools to control where our processes run.
You now understand the cache effects that drive the importance of processor affinity. From cache hierarchy to coherence protocols, from warmup dynamics to false sharing, you can reason about the performance implications of CPU placement decisions. Next, we'll cover the practical APIs and tools for setting affinity.