Throughout this module, we've explored what processor affinity is, why cache effects make it matter, and how to configure it. Now we address the ultimate question: what are the actual performance implications of affinity decisions?
This is not a simple question. Processor affinity is a tradeoff, not a pure optimization. Depending on the workload, pinning processes to CPUs can improve performance substantially, make no measurable difference, or actively degrade it.
The key is knowing when each outcome applies. This page provides the analytical framework and practical methodology for making informed affinity decisions.
By the end of this page, you will understand: when affinity improves vs. degrades performance, how to quantify affinity benefits, methodology for measuring affinity impact, real-world case studies, common anti-patterns to avoid, and decision frameworks for production deployments.
Let's systematically examine the scenarios where processor affinity delivers measurable performance benefits.
Benefit 1: Reduced Cache Migration Penalty
The primary benefit of affinity is avoiding the cache warmup cost when processes migrate. The benefit is greatest when the working set fits in the per-core caches, as summarized below:
| Working Set Size | Cache Level | Warmup Time (approx) | Affinity Benefit |
|---|---|---|---|
| < 32 KB | L1 | ~1-5 µs | High — entire working set lost on migration |
| 32 KB - 256 KB | L2 | ~5-20 µs | High — significant warmup cost |
| 256 KB - 8 MB | L3 | ~20-100 µs | Medium — L3 may be shared, partial benefit |
| 8 MB - 32 MB | L3 + some DRAM | ~100-500 µs | Low-Medium — already paying memory latency |
| > 32 MB | Primarily DRAM | ~500+ µs | Low — working set doesn't fit in cache anyway |
Benefit 2: Reduced Cache Coherence Traffic
When threads share data and run on the same LLC domain (sharing L3 cache), coherence traffic stays local to the socket. Cross-socket coherence requires expensive interconnect messages (QPI/UPI on Intel, Infinity Fabric on AMD).
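As a minimal sketch of exploiting this, the snippet below finds which CPUs share a given core's L3 and co-locates two communicating processes inside that LLC domain; `producer` and `consumer` are placeholder program names.

```bash
# Which CPUs share cpu0's last-level cache? (index3 is typically the L3 slice)
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
# Example output on a two-socket machine: 0-7

# Keep both halves of a producer/consumer pair inside that LLC domain, so
# cache-line ownership bounces within one L3 rather than across the interconnect
taskset -c 0-7 ./producer &
taskset -c 0-7 ./consumer &
```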
Benefit 3: NUMA Locality Preservation
On NUMA systems, keeping processes on their memory's local node avoids remote memory access penalties of 1.3-2x latency and reduced bandwidth.
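A minimal sketch with numactl, assuming a placeholder binary `myapp` on a two-node system: bind both the CPUs and the memory of the process to node 0, then confirm where its pages actually landed.

```bash
# Run with CPUs and memory both restricted to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./myapp &

# Verify the per-node memory breakdown for the running process
numastat -p $(pgrep -x myapp)
```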
Benefit 4: Reduced Scheduling Overhead
Migration events themselves have overhead: the scheduler must dequeue the task, take the destination CPU's run queue lock, and re-enqueue it there, and the destination CPU then pays cache and TLB warmup costs before the process runs at full speed.
With strict affinity, this overhead is eliminated.
Affinity benefits are proportional to (number of migrations × per-migration cost) / total execution time. If migrations are rare OR cheap OR execution time is long, benefits are small. If migrations are frequent AND expensive AND execution is short, benefits can be dramatic.
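To make that ratio concrete, here is a rough, hypothetical estimate script: it samples `se.nr_migrations` from `/proc/<pid>/sched` over a window and multiplies by an assumed per-migration warmup cost (the cost value and the process name `myapp` are assumptions you would replace with your own measurements).

```bash
#!/bin/bash
# Back-of-the-envelope estimate of migration overhead for a running process.
PID=$(pgrep -x myapp)            # placeholder process name
PER_MIGRATION_COST_US=20         # assumed warmup cost (L2-sized working set)
WINDOW_S=10

m1=$(awk '/se.nr_migrations/ {print $3}' /proc/$PID/sched)
sleep $WINDOW_S
m2=$(awk '/se.nr_migrations/ {print $3}' /proc/$PID/sched)

awk -v m=$((m2 - m1)) -v c=$PER_MIGRATION_COST_US -v w=$WINDOW_S 'BEGIN {
    # overhead fraction = (migrations * per-migration cost) / wall-clock window
    printf "migrations in window: %d, estimated overhead: %.3f%%\n", m, (m * c) / (w * 1000000) * 100
}'
```

If the estimated overhead comes out well under a percent, affinity is unlikely to pay for the scheduler flexibility it gives up.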
Affinity is not free. Constraining process placement has real costs that can outweigh benefits.
Cost 1: Impaired Load Balancing
The scheduler's primary job is to utilize all CPUs effectively. Hard affinity can prevent this:
```text
Scenario: 4 CPUs, 4 processes, all pinned to CPUs 0-1

Without affinity (scheduler can balance):
  CPU 0: Process A (1 unit work)
  CPU 1: Process B (1 unit work)
  CPU 2: Process C (1 unit work)
  CPU 3: Process D (1 unit work)
  Total time: 1 unit (all CPUs utilized)

With affinity (constrained to CPUs 0-1):
  CPU 0: Process A, then C (2 units work)
  CPU 1: Process B, then D (2 units work)
  CPU 2: [IDLE]
  CPU 3: [IDLE]
  Total time: 2 units (50% of CPUs wasted)

Result: 2x worse throughput due to affinity constraint
```

Cost 2: Wasted CPU Capacity
When pinned processes are idle (waiting for I/O, sleeping), their CPU sits idle even if other runnable processes exist. For example, suppose process A is pinned to CPU 0 and blocks on I/O while process B, pinned to CPU 1, is queued behind another runnable task. CPU 0 is idle, yet the scheduler cannot migrate B to CPU 0 due to its affinity constraint.
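If monitoring reveals this pattern, one hedged remediation is simply to widen the mask rather than remove it entirely; `$PID_B` is a placeholder for the starved process.

```bash
taskset -pc $PID_B        # show B's current allowed CPUs (e.g. "1")
taskset -pc 0,1 $PID_B    # let the scheduler also use CPU 0 when it is idle
```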
Cost 3: Interference Amplification
Pinning can concentrate interference rather than distribute it:
```text
Scenario: Noisy neighbor (system monitoring agent)

Without affinity:
  Monitoring agent migrates across all CPUs
  Each application sees ~10% overhead distributed across time
  Impact: 10% average degradation

With affinity (all apps pinned, monitor on CPU 0):
  App on CPU 0: 80% overhead (constant interference)
  Apps on CPUs 1-3: 0% overhead
  Impact: one victim, others untouched

Depending on perspective, this could be:
  - Better: most apps unaffected
  - Worse: one app severely degraded
```

Cost 4: Burst Absorption Reduction
Applications with bursty workloads benefit from access to all CPUs during bursts. Affinity limits this: a process pinned to 4 of 16 CPUs can absorb a burst with only those 4 CPUs, even while the other 12 sit idle.
Cost 5: Complexity and Maintenance
Affinity configuration adds operational complexity: masks must be derived from the hardware topology, kept in sync as hosts and workloads change, and documented so operators understand why processes are constrained.
Aggressive affinity optimization can make systems perform well in benchmarks but poorly in production. Benchmarks often have predictable, steady load. Production has variable, bursty load with mixed workloads. The scheduler's flexibility is valuable precisely because real workloads are unpredictable.
To make informed decisions, we need to measure affinity's impact. Here's a systematic approach.
Step 1: Baseline Measurement
Establish performance without affinity constraints:
```bash
#!/bin/bash
# Baseline performance measurement

OUTPUT_DIR="./benchmark_results"
mkdir -p $OUTPUT_DIR

echo "=== Baseline Measurement (No Affinity) ==="

# Run multiple iterations for statistical significance
for i in {1..10}; do
    echo "Iteration $i..."

    # Capture scheduler statistics before
    grep -E "migrations|nr_switches" /proc/$(pgrep myapp)/sched > \
        "$OUTPUT_DIR/sched_before_$i.txt"

    # Run benchmark
    time ./benchmark_workload 2>&1 | tee "$OUTPUT_DIR/baseline_$i.txt"

    # Capture statistics after
    grep -E "migrations|nr_switches" /proc/$(pgrep myapp)/sched > \
        "$OUTPUT_DIR/sched_after_$i.txt"

    # Capture perf metrics
    perf stat -e cache-misses,cache-references,migrations,context-switches \
        ./benchmark_workload 2>&1 | \
        tee "$OUTPUT_DIR/perf_baseline_$i.txt"
done
```

Step 2: Affinity Measurement
Repeat with affinity applied:
```bash
#!/bin/bash
# Affinity measurement

OUTPUT_DIR="./benchmark_results"

echo "=== Affinity Measurement ==="

# Test different affinity configurations
declare -A AFFINITY_CONFIGS=(
    ["single_cpu"]="0"
    ["core_pair"]="0,1"
    ["numa_node0"]="0-7"
    ["half_system"]="0-7"
    ["all_cpus"]="0-15"    # Baseline comparison
)

for config_name in "${!AFFINITY_CONFIGS[@]}"; do
    cpus=${AFFINITY_CONFIGS[$config_name]}
    echo "Testing configuration: $config_name (CPUs: $cpus)"

    for i in {1..10}; do
        taskset -c "$cpus" perf stat -e \
            cache-misses,cache-references,migrations,context-switches,cpu-clock \
            ./benchmark_workload 2>&1 | \
            tee "$OUTPUT_DIR/perf_${config_name}_$i.txt"
    done
done
```

Step 3: Statistical Analysis
```python
import statistics
import re
from pathlib import Path

def parse_perf_output(filepath):
    """Extract key metrics from perf stat output."""
    metrics = {}
    with open(filepath) as f:
        content = f.read()

    patterns = {
        'cache_misses': r'([\d,]+)\s+cache-misses',
        'cache_refs': r'([\d,]+)\s+cache-references',
        'migrations': r'([\d,]+)\s+migrations',
        'context_switches': r'([\d,]+)\s+context-switches',
        'time_seconds': r'([\d.]+)\s+seconds time elapsed',
    }

    for key, pattern in patterns.items():
        match = re.search(pattern, content)
        if match:
            metrics[key] = float(match.group(1).replace(',', ''))

    return metrics

def analyze_config(output_dir, config_name, iterations=10):
    """Analyze all iterations for a configuration."""
    all_metrics = []

    for i in range(1, iterations + 1):
        filepath = Path(output_dir) / f"perf_{config_name}_{i}.txt"
        if filepath.exists():
            all_metrics.append(parse_perf_output(filepath))

    if not all_metrics:
        return None

    # Calculate statistics for each metric
    results = {}
    for key in all_metrics[0].keys():
        values = [m[key] for m in all_metrics]
        results[key] = {
            'mean': statistics.mean(values),
            'stdev': statistics.stdev(values) if len(values) > 1 else 0,
            'min': min(values),
            'max': max(values),
            # Coefficient of variation
            'cv': (statistics.stdev(values) / statistics.mean(values) * 100
                   if len(values) > 1 else 0),
        }

    return results

def compare_configurations(baseline, test):
    """Compare test config vs baseline."""
    print(f"{'Metric':<20} {'Baseline':>12} {'Test':>12} {'Change':>12}")
    print("-" * 60)

    for key in baseline.keys():
        b = baseline[key]['mean']
        t = test[key]['mean']
        change = (t - b) / b * 100 if b != 0 else 0
        print(f"{key:<20} {b:>12.0f} {t:>12.0f} {change:>+11.1f}%")

# Example usage
output_dir = "./benchmark_results"
baseline = analyze_config(output_dir, "all_cpus")
numa_local = analyze_config(output_dir, "numa_node0")

if baseline and numa_local:
    print("\n=== NUMA Node 0 vs Baseline ===")
    compare_configurations(baseline, numa_local)
```

Let's examine real-world scenarios where affinity decisions had significant impact.
Case Study 1: High-Frequency Trading System
A trading firm experienced inconsistent order execution latency:
```text
PROBLEM:
- Order processing latency: median 15 µs, p99 450 µs
- 30x latency variance unacceptable for trading
- Profiling showed most variance during CPU migrations

DIAGNOSIS:
- Trading threads migrated ~500 times/second
- Each migration caused 5-20 µs cache warmup
- Migrations often crossed NUMA boundaries (100+ µs penalty)

SOLUTION:
- Isolated 4 CPUs on NUMA node 0 using isolcpus
- Pinned trading threads to isolated CPUs
- Moved network interrupts to those same CPUs
- Used huge pages for TLB stability

RESULTS:
- Order processing latency: median 12 µs, p99 18 µs
- Variance reduced from 30x to 1.5x
- Migration count: 0 (as expected)
- Cache miss rate decreased 40%
```
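The commands below sketch the isolation techniques mentioned above; the CPU range, IRQ number, and binary name are illustrative, not the firm's actual values.

```bash
# 1. Remove CPUs 2-5 from general scheduling (kernel command line, reboot needed),
#    e.g. appended to GRUB_CMDLINE_LINUX in /etc/default/grub:
#    isolcpus=2-5 nohz_full=2-5 rcu_nocbs=2-5

# 2. Pin the latency-critical threads onto the isolated CPUs
taskset -c 2-5 ./trading_engine &

# 3. Steer the NIC's interrupts onto the same CPUs so packet handling stays local
grep eth0 /proc/interrupts                            # find the NIC's IRQ numbers
echo 2-5 | sudo tee /proc/irq/120/smp_affinity_list   # 120 is a placeholder IRQ
```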
Case Study 2: Database Server Affinity Failure
A database team applied 'best practice' affinity that backfired:
```text
PROBLEM:
- OLTP database handling mixed read/write workload
- Team pinned each worker thread to its own CPU
- Performance DECREASED 20% after applying affinity

DIAGNOSIS:
- Database has 64 worker threads, server has 16 CPUs
- With pinning: 4 threads per CPU (forced sharing)
- Without pinning: scheduler distributed load dynamically
- Workload is bursty: some threads busy, others idle
- Pinning prevented idle CPU utilization
- Busy CPUs became bottlenecks

SOLUTION:
- Removed per-thread pinning
- Instead: grouped workers into NUMA-node pools
  - Threads 0-31: allowed on NUMA node 0 (CPUs 0-7)
  - Threads 32-63: allowed on NUMA node 1 (CPUs 8-15)
- Memory allocation matched to appropriate node

RESULTS:
- Performance restored, +5% over original baseline
- NUMA-local memory access maintained
- Scheduler flexibility preserved within each node
```
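A sketch of that pool layout, using numactl at the process level rather than per-thread pinning; the `db_worker_pool` binary and its thread-count flag are illustrative.

```bash
# One worker pool per NUMA node; the scheduler balances freely within each node.
numactl --cpunodebind=0 --membind=0 ./db_worker_pool --threads 32 &   # workers 0-31
numactl --cpunodebind=1 --membind=1 ./db_worker_pool --threads 32 &   # workers 32-63
```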
Case Study 3: Microservices on Kubernetes
A cloud team optimized latency-sensitive services:
```text
PROBLEM:
- Payment processing service with p99 latency SLA
- Running on Kubernetes with default scheduling
- 15% of requests exceeded latency SLA

DIAGNOSIS:
- Service pods scheduled dynamically across nodes
- Within nodes, containers shared CPUs (CFS bandwidth)
- Noisy neighbor pods caused latency spikes

SOLUTION:
1. Configured CPUManager policy = static
2. Changed pod spec to Guaranteed QoS:
   - requests.cpu = limits.cpu = 2
3. Deployed to dedicated node pool with:
   - Taints to prevent non-critical pods
   - Larger nodes (more exclusive CPU room)

RESULTS:
- p99 latency reduced 60%
- SLA violations dropped from 15% to <0.1%
- Trade-off: 40% lower pod density (cost increase)
```

Notice the pattern: aggressive per-CPU pinning often fails; NUMA-aware grouping often succeeds. The key insight is preserving scheduler flexibility within locality domains while preventing cross-domain migration. Think 'affinity zones', not 'affinity pins'.
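One way to express such a zone on Linux is a cgroup v2 cpuset, sketched below under the assumptions that cgroup v2 is mounted at /sys/fs/cgroup, the cpuset controller is available, the commands run as root, and `$APP_PID` stands in for your application's PID.

```bash
# Create an "affinity zone" covering NUMA node 0 (CPUs 0-15 here; adjust to your topology)
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/zone-node0
echo "0-15" > /sys/fs/cgroup/zone-node0/cpuset.cpus
echo "0"    > /sys/fs/cgroup/zone-node0/cpuset.mems
echo "$APP_PID" > /sys/fs/cgroup/zone-node0/cgroup.procs

# The application is confined to node 0, but the scheduler still load-balances
# its threads across all CPUs inside the zone.
```

The same effect can usually be achieved through whatever cgroup front end you already run (systemd slices, container runtimes); the point is the zone, not the tool.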
Experience has revealed common affinity mistakes. Avoid these anti-patterns:
Anti-Pattern 1: Over-Pinning
Pinning every thread to its own CPU "just in case." This removes all scheduler flexibility and, as Case Study 2 showed, backfires when threads outnumber CPUs or load is bursty.
Anti-Pattern 2: Ignoring NUMA
Constraining CPUs without also constraining memory. A process pinned to one node's CPUs while its pages live on another node pays remote-access latency on every cache miss.
Anti-Pattern 3: Static Configuration in Dynamic Environments
Hard-coding CPU lists in scripts or unit files while the environment (host hardware, VM sizes, container limits) keeps changing underneath them.
Anti-Pattern 4: Not Reserving CPUs for System
Giving the application every CPU and leaving no room for kernel threads, interrupt handling, and system daemons, which then interfere with the supposedly dedicated workload (see the sketch below).
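As a sketch of the remedy for anti-pattern 4 (the values are illustrative): leave the first couple of CPUs to the kernel, IRQs, and daemons, and give the application the rest, either at launch time or through a systemd drop-in.

```bash
# Launch-time version: application gets CPUs 2-15, the system keeps CPUs 0-1
taskset -c 2-15 ./myapp

# systemd drop-in version (e.g. /etc/systemd/system/myapp.service.d/affinity.conf):
#   [Service]
#   CPUAffinity=2-15
```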
Don't copy affinity configurations from blog posts or Stack Overflow without understanding your specific workload and hardware. What works for a low-latency trading system doesn't work for a batch processing cluster. What works on a 2-socket server doesn't work on a single-socket laptop. Always benchmark YOUR workload on YOUR hardware.
Use this framework to decide whether and how to apply affinity:
Step 1: Characterize Your Workload
| Characteristic | Favors Affinity | Favors No Affinity |
|---|---|---|
| Working set size | Small (fits L1/L2) | Large (exceeds L3) |
| Execution pattern | Long-running, steady | Short bursts, variable |
| Thread count | ≤ CPU count | > CPU count |
| Latency sensitivity | p99 latency critical | Throughput critical |
| Memory access | NUMA-aware, local | Random, distributed |
| Load pattern | Constant, predictable | Bursty, unpredictable |
Step 2: Evaluate Hardware Topology
```bash
#!/bin/bash
# Gather topology information for affinity planning

echo "=== CPU Count ==="
lscpu | grep -E "^CPU\(s\):|^Core|^Socket|^NUMA"

echo -e "\n=== NUMA Topology ==="
numactl --hardware

echo -e "\n=== Cache Topology ==="
lscpu -C

echo -e "\n=== CPU to NUMA Node Mapping ==="
for node in /sys/devices/system/node/node*; do
    node_id=$(basename $node)
    cpus=$(cat $node/cpulist 2>/dev/null)
    echo "$node_id: CPUs $cpus"
done

echo -e "\n=== Recommendations ==="
numa_nodes=$(ls -d /sys/devices/system/node/node* 2>/dev/null | wc -l)
if [ "$numa_nodes" -gt 1 ]; then
    echo "NUMA system detected. Consider:"
    echo "  - Bind processes to complete NUMA nodes"
    echo "  - Use numactl for memory affinity"
    echo "  - Avoid cross-node CPU masks"
else
    echo "UMA system (single NUMA node). Consider:"
    echo "  - Group by LLC sharing if applicable"
    echo "  - Simple CPU range masks usually sufficient"
fi
```

Step 3: Decision Matrix
Based on your analysis, select an approach: no affinity (default scheduling) for general-purpose or bursty workloads; NUMA-node or LLC-domain grouping for memory-bound workloads on multi-socket hardware; strict per-CPU pinning with CPU isolation only for proven latency-critical paths.
Step 4: Implement Incrementally
Apply the minimum affinity constraints that achieve your goals. More constraints mean less scheduler flexibility. For most workloads, NUMA-node awareness is sufficient. Reserve strict per-CPU pinning for proven latency-critical paths.
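An illustrative escalation ladder, assuming a placeholder binary `myapp`: each stage is applied only if measurements from the previous stage still miss the target.

```bash
# Stage 1: zone-level constraint; the scheduler keeps full freedom within node 0
numactl --cpunodebind=0 --membind=0 ./myapp

# Stage 2: tighten to a core pair inside node 0 (only if p99 still misses the target)
taskset -c 2,3 ./myapp

# Stage 3: reserve for proven latency-critical paths only:
# isolcpus on the kernel command line plus per-thread pinning inside the application
```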
Affinity is not 'set and forget.' Monitor its effectiveness and adapt as conditions change.
Key Metrics to Monitor:
```bash
#!/bin/bash
# Continuous affinity effectiveness monitoring

PID=${1:-$(pgrep -x myapp)}

while true; do
    echo "=== $(date) ==="

    # Current CPU placement
    cpu=$(awk '{print $39}' /proc/$PID/stat)
    echo "Current CPU: $cpu"

    # Migration statistics
    grep -E "nr_migrations|nr_switches" /proc/$PID/sched

    # CPU utilization per core
    mpstat -P ALL 1 1 | tail -n +4 | head -n $(nproc)

    # NUMA statistics
    numastat -p $PID 2>/dev/null | head -n 5

    # Cache statistics (requires perf; run periodically, not continuously)
    # timeout 1 perf stat -e cache-misses,cache-references -p $PID 2>&1

    sleep 5
done
```

Alerting Thresholds:
| Metric | Healthy Range | Warning | Investigate |
|---|---|---|---|
| Migrations/second (pinned) | 0 | >0 for minutes | Continuous migrations |
| Migrations/second (soft) | 0-10 | >100 | >1000 |
| CPU imbalance (%) | <20% | 20-50% | >50% |
| NUMA foreign pages (%) | <10% | 10-30% | >30% |
| Latency variance (CV%) | <10% | 10-50% | >50% |
Adaptation Strategies:
Hardware Changes: Re-evaluate affinity after any hardware change (new CPUs, memory, sockets).
Workload Changes: Monitor for workload shifts that might change optimal affinity.
Seasonal Patterns: Some workloads benefit from different affinity at different load levels.
Auto-Tuning: Consider tools like numad (a user-space daemon for automatic NUMA affinity management) for adaptive affinity.
```bash
# numad: automatic NUMA affinity management

# Install
sudo apt install numad    # Ubuntu / Debian
sudo yum install numad    # RHEL / CentOS

# Start the numad daemon
sudo systemctl start numad

# numad will:
# - Monitor process memory and CPU usage
# - Automatically bind processes to optimal NUMA nodes
# - Rebalance as workload changes

# Review numad's placement decisions in its log
sudo tail -f /var/log/numad.log

# Pre-placement advice mode (doesn't bind anything, just recommends nodes
# for a job of a given size, here 4 CPUs and 8 GB of memory)
numad -w 4:8192
```

Treat affinity configuration as a hypothesis, not a truth. Measure → Analyze → Adjust → Repeat. Conditions change: hardware ages, workloads evolve, traffic patterns shift. Periodic review of affinity effectiveness should be part of performance engineering practice.
We've thoroughly examined the performance implications of processor affinity—when it helps, when it hurts, and how to make informed decisions.
Module Complete
Congratulations! You've completed the Processor Affinity module. You now understand what processor affinity is, why cache and NUMA effects make placement matter, how to configure affinity in practice, and how to evaluate when it helps and when it hurts performance.
This knowledge enables you to make informed decisions about CPU placement in performance-critical systems—from low-latency trading to high-throughput databases to latency-sensitive microservices.
You have completed the Processor Affinity module under CPU Scheduling Advanced. You now possess expert-level understanding of processor affinity—its mechanisms, its implications, and its practical application. Use this knowledge to build systems that are not just fast, but consistently and predictably fast.