Throughout this module, we've explored what processor affinity is, why cache effects make it matter, and how to configure it. Now we address the ultimate question: what are the actual performance implications of affinity decisions?
This is not a simple question. Processor affinity is a tradeoff, not a pure optimization. Depending on the workload, pinning processes to CPUs can improve performance substantially, make no measurable difference, or actively degrade it.
The key is knowing when each outcome applies. This page provides the analytical framework and practical methodology for making informed affinity decisions.
By the end of this page, you will understand: when affinity improves vs. degrades performance, how to quantify affinity benefits, methodology for measuring affinity impact, real-world case studies, common anti-patterns to avoid, and decision frameworks for production deployments.
Let's systematically examine the scenarios where processor affinity delivers measurable performance benefits.
Benefit 1: Reduced Cache Migration Penalty
The primary benefit of affinity is avoiding the cache warmup cost when processes migrate. The benefit is greatest when the working set fits in the per-core caches, as summarized below:
| Working Set Size | Cache Level | Warmup Time (approx) | Affinity Benefit |
|---|---|---|---|
| < 32 KB | L1 | ~1-5 µs | High — entire working set lost on migration |
| 32 KB - 256 KB | L2 | ~5-20 µs | High — significant warmup cost |
| 256 KB - 8 MB | L3 | ~20-100 µs | Medium — L3 may be shared, partial benefit |
| 8 MB - 32 MB | L3 + some DRAM | ~100-500 µs | Low-Medium — already paying memory latency |
| > 32 MB | Primarily DRAM | ~500+ µs | Low — working set doesn't fit in cache anyway |
Benefit 2: Reduced Cache Coherence Traffic
When threads share data and run on the same LLC domain (sharing L3 cache), coherence traffic stays local to the socket. Cross-socket coherence requires expensive interconnect messages (QPI/UPI on Intel, Infinity Fabric on AMD).
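As a minimal sketch of exploiting this, the snippet below finds which CPUs share a given core's L3 and co-locates two communicating processes inside that LLC domain; `producer` and `consumer` are placeholder program names.

```bash
# Which CPUs share cpu0's last-level cache? (index3 is typically the L3 slice)
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
# Example output on a two-socket machine: 0-7

# Keep both halves of a producer/consumer pair inside that LLC domain, so
# cache-line ownership bounces within one L3 rather than across the interconnect
taskset -c 0-7 ./producer &
taskset -c 0-7 ./consumer &
```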
Benefit 3: NUMA Locality Preservation
On NUMA systems, keeping processes on their memory's local node avoids remote memory access penalties of 1.3-2x latency and reduced bandwidth.
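A minimal sketch with numactl, assuming a placeholder binary `myapp` on a two-node system: bind both the CPUs and the memory of the process to node 0, then confirm where its pages actually landed.

```bash
# Run with CPUs and memory both restricted to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./myapp &

# Verify the per-node memory breakdown for the running process
numastat -p $(pgrep -x myapp)
```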
Benefit 4: Reduced Scheduling Overhead
Migration events themselves have overhead: the scheduler must dequeue the task, take the destination CPU's run queue lock, and re-enqueue it there, and the destination CPU then pays cache and TLB warmup costs before the process runs at full speed.
With strict affinity, this overhead is eliminated.
Affinity benefits are proportional to (number of migrations × per-migration cost) / total execution time. If migrations are rare OR cheap OR execution time is long, benefits are small. If migrations are frequent AND expensive AND execution is short, benefits can be dramatic.
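To make that ratio concrete, here is a rough, hypothetical estimate script: it samples `se.nr_migrations` from `/proc/<pid>/sched` over a window and multiplies by an assumed per-migration warmup cost (the cost value and the process name `myapp` are assumptions you would replace with your own measurements).

```bash
#!/bin/bash
# Back-of-the-envelope estimate of migration overhead for a running process.
PID=$(pgrep -x myapp)            # placeholder process name
PER_MIGRATION_COST_US=20         # assumed warmup cost (L2-sized working set)
WINDOW_S=10

m1=$(awk '/se.nr_migrations/ {print $3}' /proc/$PID/sched)
sleep $WINDOW_S
m2=$(awk '/se.nr_migrations/ {print $3}' /proc/$PID/sched)

awk -v m=$((m2 - m1)) -v c=$PER_MIGRATION_COST_US -v w=$WINDOW_S 'BEGIN {
    # overhead fraction = (migrations * per-migration cost) / wall-clock window
    printf "migrations in window: %d, estimated overhead: %.3f%%\n", m, (m * c) / (w * 1000000) * 100
}'
```

If the estimated overhead comes out well under a percent, affinity is unlikely to pay for the scheduler flexibility it gives up.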
Affinity is not free. Constraining process placement has real costs that can outweigh benefits.
Cost 1: Impaired Load Balancing
The scheduler's primary job is to utilize all CPUs effectively. Hard affinity can prevent this:
```text
Scenario: 4 CPUs, 4 processes, all pinned to CPUs 0-1

Without affinity (scheduler can balance):
  CPU 0: Process A (1 unit work)
  CPU 1: Process B (1 unit work)
  CPU 2: Process C (1 unit work)
  CPU 3: Process D (1 unit work)
  Total time: 1 unit (all CPUs utilized)

With affinity (constrained to CPUs 0-1):
  CPU 0: Process A, then C (2 units work)
  CPU 1: Process B, then D (2 units work)
  CPU 2: [IDLE]
  CPU 3: [IDLE]
  Total time: 2 units (50% of CPUs wasted)

Result: 2x worse throughput due to affinity constraint
```

Cost 2: Wasted CPU Capacity
When pinned processes are idle (waiting for I/O, sleeping), their CPU sits idle even if other runnable processes exist. For example, suppose process A is pinned to CPU 0 and blocks on I/O while process B, pinned to CPU 1, is queued behind another runnable task. CPU 0 is idle, yet the scheduler cannot migrate B to CPU 0 due to its affinity constraint.
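If monitoring reveals this pattern, one hedged remediation is simply to widen the mask rather than remove it entirely; `$PID_B` is a placeholder for the starved process.

```bash
taskset -pc $PID_B        # show B's current allowed CPUs (e.g. "1")
taskset -pc 0,1 $PID_B    # let the scheduler also use CPU 0 when it is idle
```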
Cost 3: Interference Amplification
Pinning can concentrate interference rather than distribute it:
```text
Scenario: Noisy neighbor (system monitoring agent)

Without affinity:
  Monitoring agent migrates across all CPUs
  Each application sees ~10% overhead distributed across time
  Impact: 10% average degradation

With affinity (all apps pinned, monitor on CPU 0):
  App on CPU 0: 80% overhead (constant interference)
  Apps on CPUs 1-3: 0% overhead
  Impact: one victim, others untouched

Depending on perspective, this could be:
  - Better: most apps unaffected
  - Worse: one app severely degraded
```

Cost 4: Burst Absorption Reduction
Applications with bursty workloads benefit from access to all CPUs during bursts. Affinity limits this: a process pinned to 4 of 16 CPUs can absorb a burst with only those 4 CPUs, even while the other 12 sit idle.
Cost 5: Complexity and Maintenance
Affinity configuration adds operational complexity: masks must be derived from the hardware topology, kept in sync as hosts and workloads change, and documented so operators understand why processes are constrained.
Aggressive affinity optimization can make systems perform well in benchmarks but poorly in production. Benchmarks often have predictable, steady load. Production has variable, bursty load with mixed workloads. The scheduler's flexibility is valuable precisely because real workloads are unpredictable.
To make informed decisions, we need to measure affinity's impact. Here's a systematic approach.
Step 1: Baseline Measurement
Establish performance without affinity constraints:
```bash
#!/bin/bash
# Baseline performance measurement

OUTPUT_DIR="./benchmark_results"
mkdir -p $OUTPUT_DIR

echo "=== Baseline Measurement (No Affinity) ==="

# Run multiple iterations for statistical significance
for i in {1..10}; do
    echo "Iteration $i..."

    # Capture scheduler statistics before
    grep -E "migrations|nr_switches" /proc/$(pgrep myapp)/sched > \
        "$OUTPUT_DIR/sched_before_$i.txt"

    # Run benchmark
    time ./benchmark_workload 2>&1 | tee "$OUTPUT_DIR/baseline_$i.txt"

    # Capture statistics after
    grep -E "migrations|nr_switches" /proc/$(pgrep myapp)/sched > \
        "$OUTPUT_DIR/sched_after_$i.txt"

    # Capture perf metrics
    perf stat -e cache-misses,cache-references,migrations,context-switches \
        ./benchmark_workload 2>&1 | \
        tee "$OUTPUT_DIR/perf_baseline_$i.txt"
done
```

Step 2: Affinity Measurement
Repeat with affinity applied:
```bash
#!/bin/bash
# Affinity measurement

OUTPUT_DIR="./benchmark_results"

echo "=== Affinity Measurement ==="

# Test different affinity configurations
declare -A AFFINITY_CONFIGS=(
    ["single_cpu"]="0"
    ["core_pair"]="0,1"
    ["numa_node0"]="0-7"
    ["half_system"]="0-7"
    ["all_cpus"]="0-15"    # Baseline comparison
)

for config_name in "${!AFFINITY_CONFIGS[@]}"; do
    cpus=${AFFINITY_CONFIGS[$config_name]}
    echo "Testing configuration: $config_name (CPUs: $cpus)"

    for i in {1..10}; do
        taskset -c "$cpus" perf stat -e \
            cache-misses,cache-references,migrations,context-switches,cpu-clock \
            ./benchmark_workload 2>&1 | \
            tee "$OUTPUT_DIR/perf_${config_name}_$i.txt"
    done
done
```

Step 3: Statistical Analysis
```python
import statistics
import re
from pathlib import Path

def parse_perf_output(filepath):
    """Extract key metrics from perf stat output."""
    metrics = {}
    with open(filepath) as f:
        content = f.read()

    patterns = {
        'cache_misses': r'([\d,]+)\s+cache-misses',
        'cache_refs': r'([\d,]+)\s+cache-references',
        'migrations': r'([\d,]+)\s+migrations',
        'context_switches': r'([\d,]+)\s+context-switches',
        'time_seconds': r'([\d.]+)\s+seconds time elapsed',
    }

    for key, pattern in patterns.items():
        match = re.search(pattern, content)
        if match:
            metrics[key] = float(match.group(1).replace(',', ''))

    return metrics

def analyze_config(output_dir, config_name, iterations=10):
    """Analyze all iterations for a configuration."""
    all_metrics = []

    for i in range(1, iterations + 1):
        filepath = Path(output_dir) / f"perf_{config_name}_{i}.txt"
        if filepath.exists():
            all_metrics.append(parse_perf_output(filepath))

    if not all_metrics:
        return None

    # Calculate statistics for each metric
    results = {}
    for key in all_metrics[0].keys():
        values = [m[key] for m in all_metrics]
        results[key] = {
            'mean': statistics.mean(values),
            'stdev': statistics.stdev(values) if len(values) > 1 else 0,
            'min': min(values),
            'max': max(values),
            # Coefficient of variation
            'cv': (statistics.stdev(values) / statistics.mean(values) * 100
                   if len(values) > 1 else 0),
        }

    return results

def compare_configurations(baseline, test):
    """Compare test config vs baseline."""
    print(f"{'Metric':<20} {'Baseline':>12} {'Test':>12} {'Change':>12}")
    print("-" * 60)

    for key in baseline.keys():
        b = baseline[key]['mean']
        t = test[key]['mean']
        change = (t - b) / b * 100 if b != 0 else 0
        print(f"{key:<20} {b:>12.0f} {t:>12.0f} {change:>+11.1f}%")

# Example usage
output_dir = "./benchmark_results"
baseline = analyze_config(output_dir, "all_cpus")
numa_local = analyze_config(output_dir, "numa_node0")

if baseline and numa_local:
    print("\n=== NUMA Node 0 vs Baseline ===")
    compare_configurations(baseline, numa_local)
```

Let's examine real-world scenarios where affinity decisions had significant impact.
Case Study 1: High-Frequency Trading System
A trading firm experienced inconsistent order execution latency:
```text
PROBLEM:
- Order processing latency: median 15 µs, p99 450 µs
- 30x latency variance unacceptable for trading
- Profiling showed most variance during CPU migrations

DIAGNOSIS:
- Trading threads migrated ~500 times/second
- Each migration caused 5-20 µs cache warmup
- Migrations often crossed NUMA boundaries (100+ µs penalty)

SOLUTION:
- Isolated 4 CPUs on NUMA node 0 using isolcpus
- Pinned trading threads to isolated CPUs
- Moved network interrupts to those same CPUs
- Used huge pages for TLB stability

RESULTS:
- Order processing latency: median 12 µs, p99 18 µs
- Variance reduced from 30x to 1.5x
- Migration count: 0 (as expected)
- Cache miss rate decreased 40%
```
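The commands below sketch the isolation techniques mentioned above; the CPU range, IRQ number, and binary name are illustrative, not the firm's actual values.

```bash
# 1. Remove CPUs 2-5 from general scheduling (kernel command line, reboot needed),
#    e.g. appended to GRUB_CMDLINE_LINUX in /etc/default/grub:
#    isolcpus=2-5 nohz_full=2-5 rcu_nocbs=2-5

# 2. Pin the latency-critical threads onto the isolated CPUs
taskset -c 2-5 ./trading_engine &

# 3. Steer the NIC's interrupts onto the same CPUs so packet handling stays local
grep eth0 /proc/interrupts                            # find the NIC's IRQ numbers
echo 2-5 | sudo tee /proc/irq/120/smp_affinity_list   # 120 is a placeholder IRQ
```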
Case Study 2: Database Server Affinity Failure
A database team applied 'best practice' affinity that backfired:
```text
PROBLEM:
- OLTP database handling mixed read/write workload
- Team pinned each worker thread to its own CPU
- Performance DECREASED 20% after applying affinity

DIAGNOSIS:
- Database has 64 worker threads, server has 16 CPUs
- With pinning: 4 threads per CPU (forced sharing)
- Without pinning: scheduler distributed load dynamically
- Workload is bursty: some threads busy, others idle
- Pinning prevented idle CPU utilization
- Busy CPUs became bottlenecks

SOLUTION:
- Removed per-thread pinning
- Instead: grouped workers into NUMA-node pools
  - Threads 0-31: allowed on NUMA node 0 (CPUs 0-7)
  - Threads 32-63: allowed on NUMA node 1 (CPUs 8-15)
- Memory allocation matched to appropriate node

RESULTS:
- Performance restored, +5% over original baseline
- NUMA-local memory access maintained
- Scheduler flexibility preserved within each node
```
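A sketch of that pool layout, using numactl at the process level rather than per-thread pinning; the `db_worker_pool` binary and its thread-count flag are illustrative.

```bash
# One worker pool per NUMA node; the scheduler balances freely within each node.
numactl --cpunodebind=0 --membind=0 ./db_worker_pool --threads 32 &   # workers 0-31
numactl --cpunodebind=1 --membind=1 ./db_worker_pool --threads 32 &   # workers 32-63
```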
Case Study 3: Microservices on Kubernetes
A cloud team optimized latency-sensitive services:
```text
PROBLEM:
- Payment processing service with p99 latency SLA
- Running on Kubernetes with default scheduling
- 15% of requests exceeded latency SLA

DIAGNOSIS:
- Service pods scheduled dynamically across nodes
- Within nodes, containers shared CPUs (CFS bandwidth)
- Noisy neighbor pods caused latency spikes

SOLUTION:
1. Configured CPUManager policy = static
2. Changed pod spec to Guaranteed QoS:
   - requests.cpu = limits.cpu = 2
3. Deployed to dedicated node pool with:
   - Taints to prevent non-critical pods
   - Larger nodes (more exclusive CPU room)

RESULTS:
- p99 latency reduced 60%
- SLA violations dropped from 15% to <0.1%
- Trade-off: 40% lower pod density (cost increase)
```

Notice the pattern: aggressive per-CPU pinning often fails; NUMA-aware grouping often succeeds. The key insight is preserving scheduler flexibility within locality domains while preventing cross-domain migration. Think 'affinity zones', not 'affinity pins'.
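One way to express such a zone on Linux is a cgroup v2 cpuset, sketched below under the assumptions that cgroup v2 is mounted at /sys/fs/cgroup, the cpuset controller is available, the commands run as root, and `$APP_PID` stands in for your application's PID.

```bash
# Create an "affinity zone" covering NUMA node 0 (CPUs 0-15 here; adjust to your topology)
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/zone-node0
echo "0-15" > /sys/fs/cgroup/zone-node0/cpuset.cpus
echo "0"    > /sys/fs/cgroup/zone-node0/cpuset.mems
echo "$APP_PID" > /sys/fs/cgroup/zone-node0/cgroup.procs

# The application is confined to node 0, but the scheduler still load-balances
# its threads across all CPUs inside the zone.
```

The same effect can usually be achieved through whatever cgroup front end you already run (systemd slices, container runtimes); the point is the zone, not the tool.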
Experience has revealed common affinity mistakes. Avoid these anti-patterns:
Anti-Pattern 1: Over-Pinning
Pinning every thread to its own CPU "just in case." This removes all scheduler flexibility and, as Case Study 2 showed, backfires when threads outnumber CPUs or load is bursty.
Anti-Pattern 2: Ignoring NUMA
Constraining CPUs without also constraining memory. A process pinned to one node's CPUs while its pages live on another node pays remote-access latency on every cache miss.
Anti-Pattern 3: Static Configuration in Dynamic Environments
Hard-coding CPU lists in scripts or unit files while the environment (host hardware, VM sizes, container limits) keeps changing underneath them.
Anti-Pattern 4: Not Reserving CPUs for System
Giving the application every CPU and leaving no room for kernel threads, interrupt handling, and system daemons, which then interfere with the supposedly dedicated workload (see the sketch below).
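As a sketch of the remedy for anti-pattern 4 (the values are illustrative): leave the first couple of CPUs to the kernel, IRQs, and daemons, and give the application the rest, either at launch time or through a systemd drop-in.

```bash
# Launch-time version: application gets CPUs 2-15, the system keeps CPUs 0-1
taskset -c 2-15 ./myapp

# systemd drop-in version (e.g. /etc/systemd/system/myapp.service.d/affinity.conf):
#   [Service]
#   CPUAffinity=2-15
```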
Don't copy affinity configurations from blog posts or Stack Overflow without understanding your specific workload and hardware. What works for a low-latency trading system doesn't work for a batch processing cluster. What works on a 2-socket server doesn't work on a single-socket laptop. Always benchmark YOUR workload on YOUR hardware.
Use this framework to decide whether and how to apply affinity:
Step 1: Characterize Your Workload
| Characteristic | Favors Affinity | Favors No Affinity |
|---|---|---|
| Working set size | Small (fits L1/L2) | Large (exceeds L3) |
| Execution pattern | Long-running, steady | Short bursts, variable |
| Thread count | ≤ CPU count | > CPU count |
| Latency sensitivity | p99 latency critical | Throughput critical |
| Memory access | NUMA-aware, local | Random, distributed |
| Load pattern | Constant, predictable | Bursty, unpredictable |
Step 2: Evaluate Hardware Topology
```bash
#!/bin/bash
# Gather topology information for affinity planning

echo "=== CPU Count ==="
lscpu | grep -E "^CPU\(s\):|^Core|^Socket|^NUMA"

echo -e "\n=== NUMA Topology ==="
numactl --hardware

echo -e "\n=== Cache Topology ==="
lscpu -C

echo -e "\n=== CPU to NUMA Node Mapping ==="
for node in /sys/devices/system/node/node*; do
    node_id=$(basename $node)
    cpus=$(cat $node/cpulist 2>/dev/null)
    echo "$node_id: CPUs $cpus"
done

echo -e "\n=== Recommendations ==="
numa_nodes=$(ls -d /sys/devices/system/node/node* 2>/dev/null | wc -l)
if [ "$numa_nodes" -gt 1 ]; then
    echo "NUMA system detected. Consider:"
    echo "  - Bind processes to complete NUMA nodes"
    echo "  - Use numactl for memory affinity"
    echo "  - Avoid cross-node CPU masks"
else
    echo "UMA system (single NUMA node). Consider:"
    echo "  - Group by LLC sharing if applicable"
    echo "  - Simple CPU range masks usually sufficient"
fi
```

Step 3: Decision Matrix
Based on your analysis, select an approach: no affinity (default scheduling) for general-purpose or bursty workloads; NUMA-node or LLC-domain grouping for memory-bound workloads on multi-socket hardware; strict per-CPU pinning with CPU isolation only for proven latency-critical paths.
Step 4: Implement Incrementally
Apply the minimum affinity constraints that achieve your goals. More constraints mean less scheduler flexibility. For most workloads, NUMA-node awareness is sufficient. Reserve strict per-CPU pinning for proven latency-critical paths.
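An illustrative escalation ladder, assuming a placeholder binary `myapp`: each stage is applied only if measurements from the previous stage still miss the target.

```bash
# Stage 1: zone-level constraint; the scheduler keeps full freedom within node 0
numactl --cpunodebind=0 --membind=0 ./myapp

# Stage 2: tighten to a core pair inside node 0 (only if p99 still misses the target)
taskset -c 2,3 ./myapp

# Stage 3: reserve for proven latency-critical paths only:
# isolcpus on the kernel command line plus per-thread pinning inside the application
```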
Affinity is not 'set and forget.' Monitor its effectiveness and adapt as conditions change.
Key Metrics to Monitor:
```bash
#!/bin/bash
# Continuous affinity effectiveness monitoring

PID=${1:-$(pgrep -x myapp)}

while true; do
    echo "=== $(date) ==="

    # Current CPU placement
    cpu=$(awk '{print $39}' /proc/$PID/stat)
    echo "Current CPU: $cpu"

    # Migration statistics
    grep -E "nr_migrations|nr_switches" /proc/$PID/sched

    # CPU utilization per core
    mpstat -P ALL 1 1 | tail -n +4 | head -n $(nproc)

    # NUMA statistics
    numastat -p $PID 2>/dev/null | head -n 5

    # Cache statistics (requires perf; run periodically, not continuously)
    # timeout 1 perf stat -e cache-misses,cache-references -p $PID 2>&1

    sleep 5
done
```

Alerting Thresholds:
| Metric | Healthy Range | Warning | Investigate |
|---|---|---|---|
| Migrations/second (pinned) | 0 | >0 for minutes | Continuous migrations |
| Migrations/second (soft) | 0-10 | >100 | >1000 |
| CPU imbalance (%) | <20% | 20-50% | >50% |
| NUMA foreign pages (%) | <10% | 10-30% | >30% |
| Latency variance (CV%) | <10% | 10-50% | >50% |
Adaptation Strategies:
Hardware Changes: Re-evaluate affinity after any hardware change (new CPUs, memory, sockets).
Workload Changes: Monitor for workload shifts that might change optimal affinity.
Seasonal Patterns: Some workloads benefit from different affinity at different load levels.
Auto-Tuning: Consider tools like numad (a user-space daemon for automatic NUMA affinity management) for adaptive affinity.
```bash
# numad: automatic NUMA affinity management

# Install
sudo apt install numad    # Ubuntu / Debian
sudo yum install numad    # RHEL / CentOS

# Start the numad daemon
sudo systemctl start numad

# numad will:
# - Monitor process memory and CPU usage
# - Automatically bind processes to optimal NUMA nodes
# - Rebalance as workload changes

# Review numad's placement decisions in its log
sudo tail -f /var/log/numad.log

# Pre-placement advice mode (doesn't bind anything, just recommends nodes
# for a job of a given size, here 4 CPUs and 8 GB of memory)
numad -w 4:8192
```

Treat affinity configuration as a hypothesis, not a truth. Measure → Analyze → Adjust → Repeat. Conditions change: hardware ages, workloads evolve, traffic patterns shift. Periodic review of affinity effectiveness should be part of performance engineering practice.
We've thoroughly examined the performance implications of processor affinity—when it helps, when it hurts, and how to make informed decisions.
Module Complete
Congratulations! You've completed the Processor Affinity module. You now understand what processor affinity is, why cache and NUMA effects make placement matter, how to configure affinity in practice, and how to evaluate when it helps and when it hurts performance.
This knowledge enables you to make informed decisions about CPU placement in performance-critical systems—from low-latency trading to high-throughput databases to latency-sensitive microservices.
You have completed the Processor Affinity module under CPU Scheduling Advanced. You now possess expert-level understanding of processor affinity—its mechanisms, its implications, and its practical application. Use this knowledge to build systems that are not just fast, but consistently and predictably fast.