Everything we've discussed about SMP scheduling has assumed, implicitly or explicitly, that all processors can access all memory with equal latency. For small systems—laptops, desktops, small servers—this assumption holds reasonably well. But as systems scale to multiple processor sockets, each with many cores, a fundamental physical reality emerges: memory cannot be equidistant from all processors.
Non-Uniform Memory Access (NUMA) describes architectures where memory access time depends on the processor making the request and the location of the memory being accessed. Memory attached to a processor's local controller is fast; memory attached to a remote processor's controller is slower. This seemingly simple distinction has profound implications for operating system design, particularly scheduling and memory allocation.
By the end of this page, you will understand NUMA architecture at a fundamental level, including memory access latency differences, NUMA topologies, memory placement policies, NUMA-aware scheduling strategies, automatic NUMA balancing, and practical optimization techniques. This knowledge is essential for anyone working with multi-socket servers or large-scale computing infrastructure.
The Physical Reality:
In a traditional SMP system, all processors connect to memory through a shared bus or interconnect. As processor count increases, this shared path becomes a bottleneck: every processor competes for the same memory bandwidth, so adding cores yields diminishing returns.
NUMA solves this by giving each processor (or processor group) its own dedicated memory controller and local memory bank. Processors can access their local memory without contending with other processors. The tradeoff: accessing another processor's memory—remote memory—requires traversing an interconnect, adding latency.
NUMA Node:
A NUMA node is the fundamental unit of a NUMA system: a set of processors and their directly-attached local memory. Modern nodes typically correspond to a physical CPU socket with its attached DIMMs, or, on chiplet-based and sub-NUMA designs, a group of cores and memory channels within a socket.
Within a node, memory access is uniform. Across nodes, access latency varies based on the interconnect topology.
NUMA Architecture: Dual-Socket System (diagram). Each socket is one NUMA node containing its cores, a shared L3 cache, a memory controller, and locally attached memory (typically 64-256 GB). The two nodes are connected by a QPI/UPI interconnect. Typical latencies: local memory access ~70 ns; remote memory access ~170 ns (roughly 2.4x slower, the interconnect adding ~100 ns). Typical bandwidth: ~80 GB/s local versus ~40 GB/s remote per direction. When Core 0 accesses Node 0 memory, the request takes a direct path through the local memory controller; accessing Node 1 memory requires a QPI/UPI hop to Node 1, a pass through its memory controller, and the data path back.
NUMA Distance:
NUMA systems express the relative cost of accessing different nodes through a distance metric. This is typically normalized so that local access has distance 10:
The distance matrix describes the topology:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 65536 MB
node 0 free: 32000 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 65536 MB
node 1 free: 48000 MB
node distances:
node 0 1
0: 10 21
1: 21 10
In this example, accessing remote memory is 2.1x the distance of local—corresponding to significantly higher latency and reduced bandwidth.
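The same topology information is available to programs through libnuma. Below is a minimal sketch, assuming libnuma is installed and the program is linked with -lnuma, that prints each node's size and the distance matrix; on the dual-socket example above it would print the same 10/21 values as numactl --hardware.

/* Sketch: print node sizes and the NUMA distance matrix via libnuma.
 * Assumed build: gcc numa_topo.c -o numa_topo -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA not available on this system\n");
        return 1;
    }

    int max_node = numa_max_node();            /* highest node number */
    for (int n = 0; n <= max_node; n++) {
        long long free_bytes;
        long long total = numa_node_size64(n, &free_bytes);
        printf("node %d: %lld MB total, %lld MB free\n",
               n, total >> 20, free_bytes >> 20);
    }

    /* numa_distance() returns the same normalized values shown by
     * `numactl --hardware` (10 = local). */
    printf("node distances:\n");
    for (int i = 0; i <= max_node; i++) {
        for (int j = 0; j <= max_node; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}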
A 2-3x latency penalty for remote access might seem tolerable, but memory-intensive workloads can spend a large fraction of their execution time waiting for memory. A workload that spends 30% of its time stalled on local memory accesses can spend well over half of it stalled when those accesses go remote. For some workloads, poor NUMA placement reduces performance by 40% or more.
NUMA systems come in various topologies, each with different performance characteristics. Understanding your system's topology is essential for optimization.
Common NUMA Topologies:
1. Two-Node (Dual Socket): The most common server configuration. Two CPU sockets, each a NUMA node, connected by a single high-bandwidth interconnect (Intel QPI/UPI or AMD Infinity Fabric).
2. Four-Node (Quad Socket): Four CPU sockets arranged in a mesh or ring. Some node pairs are direct neighbors; others require a hop.
3. Eight-Node and Beyond: Large systems (8+ sockets) have complex interconnect topologies. Some nodes may be 3-4 hops apart.
NUMA Topology Examples (diagram):

1. Dual-socket (most common): a single direct link between Node 0 and Node 1.
   Distance matrix:
        0   1
   0:  10  21
   1:  21  10

2. Quad-socket ring topology: each node connects directly to two neighbors, so Node 0 to Node 2 requires traversing Node 1 or Node 3 (two hops).
   Distance matrix:
        0   1   2   3
   0:  10  21  31  21
   1:  21  10  21  31
   2:  31  21  10  21
   3:  21  31  21  10

3. Quad-socket full mesh (higher-end): all nodes directly connected, so every remote node is equidistant (better for balanced workloads).
   Distance matrix:
        0   1   2   3
   0:  10  21  21  21
   1:  21  10  21  21
   2:  21  21  10  21
   3:  21  21  21  10

AMD EPYC and Sub-Socket NUMA:
Modern AMD EPYC processors introduce sub-socket NUMA, where a single physical socket contains multiple NUMA nodes. Depending on the NPS (nodes per socket) BIOS setting, one socket can be exposed as one, two, or four nodes.
This design reflects the chiplet architecture: each chiplet has its own L3 cache and memory controller connection. Inter-chiplet communication adds latency compared to intra-chiplet.
Intel Sub-NUMA Clustering (SNC):
Intel offers Sub-NUMA Clustering (SNC) as a BIOS option that splits each socket into two or more NUMA domains, each with its own slice of the last-level cache and a subset of the socket's memory controllers.
Before optimizing NUMA, understand your specific system's topology using 'numactl --hardware' and 'lscpu'. Don't assume all multi-socket systems are identical. Sub-socket NUMA, asymmetric configurations, and varying interconnect speeds all affect the optimal strategy.
How and where memory is allocated has dramatic performance implications on NUMA systems. The operating system provides several memory placement policies that control allocation behavior.
Default Policy (Allocation follows First Touch):
By default, memory is allocated on the NUMA node where it's first accessed:
// First touch example
char *buffer = malloc(1024 * 1024 * 1024); // No physical pages yet
memset(buffer, 0, 1024 * 1024 * 1024); // Pages allocated here
// Memory will be on whatever node executes the memset
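To confirm where first-touch actually placed the data, the kernel can be asked which node backs a given address. Here is a minimal sketch, assuming libnuma's numaif.h is available and the program is linked with -lnuma, using get_mempolicy() with MPOL_F_NODE | MPOL_F_ADDR:

/* Sketch: query which NUMA node backs a given address after first touch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numaif.h>

int main(void) {
    size_t size = 64 * 1024 * 1024;
    char *buffer = malloc(size);
    memset(buffer, 0, size);              /* first touch happens here */

    int node = -1;
    /* MPOL_F_NODE | MPOL_F_ADDR: return the node holding the page at 'buffer' */
    if (get_mempolicy(&node, NULL, 0, buffer, MPOL_F_NODE | MPOL_F_ADDR) == 0)
        printf("first page of buffer is on node %d\n", node);
    else
        perror("get_mempolicy");

    free(buffer);
    return 0;
}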
| Policy | Behavior | Use Case | Linux API |
|---|---|---|---|
| Default/Local | Allocate on local node | General purpose | MPOL_DEFAULT |
| Bind | Allocate only from specified nodes | Dedicated memory regions | MPOL_BIND |
| Preferred | Prefer specified node, fall back if full | Soft locality hints | MPOL_PREFERRED |
| Interleave | Round-robin across nodes | Bandwidth-bound allocations | MPOL_INTERLEAVE |
| Weighted Interleave | Round-robin with weights | Asymmetric node setup | MPOL_WEIGHTED_INTERLEAVE |
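These policies can also be applied to an individual memory region rather than the whole process. Below is a minimal sketch, assuming libnuma linked with -lnuma, that uses the region-level helpers numa_interleave_memory() and numa_tonode_memory(); the policy must be set before the pages are first touched.

/* Sketch: per-region placement with libnuma. */
#include <string.h>
#include <numa.h>

void place_regions(void) {
    size_t size = 256 * 1024 * 1024;

    /* Region 1: interleave pages across all allowed nodes (bandwidth-oriented) */
    char *shared = numa_alloc(size);                  /* pages not touched yet */
    numa_interleave_memory(shared, size, numa_all_nodes_ptr);

    /* Region 2: bind a buffer to node 0 (locality-oriented) */
    char *local = numa_alloc(size);
    numa_tonode_memory(local, size, 0);

    /* Only now touch the memory: pages are placed per the policies above */
    memset(shared, 0, size);
    memset(local, 0, size);

    numa_free(shared, size);
    numa_free(local, size);
}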
Using numactl for Memory Policy:
# Run with memory bound to node 0
numactl --membind=0 ./my_application
# Prefer node 0, but allow others if memory pressure
numactl --preferred=0 ./my_application
# Interleave memory across all nodes
numactl --interleave=all ./my_application
# Combine CPU and memory binding
numactl --cpunodebind=0 --membind=0 ./my_application
When to Use Each Policy:
Bind (--membind): strict placement for workloads that must stay on specific nodes, such as a database instance dedicated to one node. Allocations never spill to other nodes, so the bound nodes can run out of memory while others sit idle.
Preferred (--preferred): a soft locality hint. Allocations favor the specified node but fall back to other nodes under memory pressure, so correctness never depends on the hint.
Interleave (--interleave): spreads pages round-robin across nodes. Best for large, bandwidth-bound allocations accessed roughly uniformly from all nodes, where aggregate bandwidth matters more than the latency of any single access.
/* NUMA Memory API Example (libnuma) */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>
#include <numaif.h>

void demonstrate_numa_memory_policies(void) {
    size_t size = 1024 * 1024 * 1024;  /* 1 GB */

    /* Check NUMA availability */
    if (numa_available() < 0) {
        printf("NUMA not available\n");
        return;
    }

    /* Allocate on specific node */
    void *local_mem = numa_alloc_onnode(size, 0);
    printf("Allocated 1GB on node 0: %p\n", local_mem);

    /* Allocate interleaved across all nodes */
    void *interleaved = numa_alloc_interleaved(size);
    printf("Allocated 1GB interleaved: %p\n", interleaved);

    /* Set memory policy for future allocations */
    struct bitmask *nodemask = numa_allocate_nodemask();
    numa_bitmask_setbit(nodemask, 0);
    numa_bitmask_setbit(nodemask, 1);

    /* Prefer node 0, allow other nodes as fallback */
    numa_set_preferred(0);
    void *preferred = malloc(size);      /* Will prefer node 0 */

    /* Bind strictly to nodes 0 and 1 */
    numa_set_membind(nodemask);
    void *bound = malloc(size);          /* Only from nodes 0 or 1 */

    /* Move an existing page to a different node */
    int nodes[1] = { 1 };
    int status[1];
    void *pages[1] = { local_mem };
    int result = numa_move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (result == 0) {
        printf("Moved page to node %d\n", status[0]);
    }

    /* Clean up: numa_alloc_* memory uses numa_free(), policy-based
     * allocations are freed with normal free() */
    numa_free(local_mem, size);
    numa_free(interleaved, size);
    numa_free_nodemask(nodemask);
    free(preferred);
    free(bound);
}

Many parallel applications initialize data structures on the main thread before spawning workers. With first-touch allocation, all data ends up on one node even though workers on all nodes will access it. Solutions include parallel initialization (each thread touches its portion), explicit interleaving, or post-initialization migration. This is one of the most common NUMA performance issues.
Beyond memory allocation, the scheduler itself must be NUMA-aware. A scheduler that ignores NUMA may frequently run processes on CPUs far from their memory, negating any benefit from careful memory placement.
NUMA and Scheduling Domains:
Linux's scheduling domains naturally encode NUMA structure: CPUs that share a core or a last-level cache form the lower domains, each NUMA node forms a domain above them, and groups of nodes form the topmost domains. Load balancing runs frequently within a node and progressively less aggressively across node boundaries.
Key NUMA Scheduling Principles:
1. Prefer Local Execution: When a process wakes, prefer scheduling it on a CPU in the same NUMA node as its memory:
// Conceptual NUMA-aware wake placement
int select_wakeup_cpu(struct task *p) {
int memory_node = get_task_memory_node(p);
int prev_cpu = p->last_cpu;
// Strong preference for CPUs on the memory node
if (cpu_is_idle_on_node(memory_node)) {
return find_idle_cpu_on_node(memory_node);
}
// Fall back to previous CPU if on same node
if (cpu_to_node(prev_cpu) == memory_node) {
return prev_cpu;
}
// Last resort: any idle CPU, but prefer same node
return find_best_cpu_preferring_node(memory_node);
}
2. Group Related Threads:
Threads that share data should ideally run on the same NUMA node to share the L3 cache and avoid coherence traffic across the interconnect:
// Schedule related threads together by respecting their cgroup's
// memory-node binding (conceptual)
struct cgroup *cg = get_task_cgroup(p);
if (cgroup_has_cpuset_mems(cg)) {
    // Restrict wake placement to the cgroup's allowed memory nodes
    preferred_nodes = cg->allowed_nodes;
}
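In practice this grouping is usually expressed through the cpuset cgroup controller rather than custom scheduler logic. Here is a hedged sketch assuming cgroup v2 with the cpuset controller enabled and a pre-created group; the path /sys/fs/cgroup/webapp and the CPU range 0-7 are illustrative and would come from your own topology.

/* Sketch: confine a cgroup to NUMA node 0 via the v2 cpuset controller. */
#include <stdio.h>

static int write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(value, f);
    fclose(f);
    return 0;
}

int confine_group_to_node0(void) {
    /* CPUs of node 0 (taken from `numactl --hardware` on this example system) */
    if (write_file("/sys/fs/cgroup/webapp/cpuset.cpus", "0-7") < 0)
        return -1;
    /* Memory allocations restricted to node 0 */
    if (write_file("/sys/fs/cgroup/webapp/cpuset.mems", "0") < 0)
        return -1;
    return 0;
}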
3. Avoid Cross-NUMA Migration:
The scheduler should be reluctant to migrate processes across NUMA nodes:
| Scenario | Cache Cost | Memory Cost | Total Cost | Recommendation |
|---|---|---|---|---|
| Same-node migration | L3 shared, L1/L2 cold | None | Low | Allow freely |
| Cross-node, memory follows | All levels cold | Migration overhead | Medium | Allow if balance needed |
| Cross-node, memory stays | All levels cold | Remote access forever | Very High | Avoid unless necessary |
When load balancing requires cross-NUMA process migration, the scheduler faces a choice: migrate just the process (cheap but leaves memory remote), or migrate both process and memory (expensive upfront but optimal long-term). Linux's Automatic NUMA Balancing (AutoNUMA) can migrate memory pages to follow processes, but this has significant overhead. The optimal choice depends on workload duration and memory access patterns.
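If you decide that memory should follow the process, the move can also be triggered explicitly from user space. A hedged sketch using libnuma's numa_migrate_pages() follows; the PID and node numbers are illustrative, and moving another process's pages requires CAP_SYS_NICE.

/* Sketch: move all of a process's pages from node 0 to node 1,
 * e.g. after the process itself has been migrated to node 1. */
#include <stdio.h>
#include <numa.h>

int follow_task_to_node1(int pid) {
    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();

    numa_bitmask_setbit(from, 0);   /* pages currently on node 0 */
    numa_bitmask_setbit(to, 1);     /* destination: node 1 */

    /* Returns -1 on error, otherwise the number of pages that could NOT be moved */
    int ret = numa_migrate_pages(pid, from, to);
    if (ret < 0)
        perror("numa_migrate_pages");
    else
        printf("migration done, %d pages could not be moved\n", ret);

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return ret < 0 ? -1 : 0;
}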
Manual NUMA optimization requires deep application knowledge and tuning effort. Automatic NUMA Balancing (AutoNUMA, introduced in Linux 3.8) attempts to automatically detect and correct suboptimal memory placement.
How AutoNUMA Works:
1. Fault-Based Sampling: The kernel periodically marks a sample of a task's pages inaccessible in the page tables (without actually unmapping them), so the next access triggers a minor "NUMA hint" fault. The fault handler records which CPU, and therefore which NUMA node, accessed the page.
2. Access Pattern Analysis: Over time, the kernel builds a profile of which nodes access which pages. Pages consistently accessed from a remote node are candidates for migration.
3. Page Migration: The kernel migrates pages to the node that accesses them most frequently. This happens transparently to the application.
4. Task Migration: Alternatively (or additionally), the scheduler may migrate the task to the node where its memory resides, rather than moving the memory.
AutoNUMA Operation Cycle (diagram): NUMA hint page faults (sampling) -> access recording -> statistics analysis -> migration decision, repeating over time. The migration decision has three options: (A) page migration, moving pages to the accessing node (expensive, but fixes placement); (B) task migration, moving the task to the node holding its pages (cheaper, but changes CPU allocation); or (C) both, converging the task and its pages onto the same node over time. Overhead considerations: roughly 1-3% CPU for scanning, the cost of re-mapping pages after each sampling fault, the copy cost of each page migration, and convergence times that can run to minutes before placement stabilizes.
Controlling AutoNUMA:
# Check if AutoNUMA is enabled
cat /proc/sys/kernel/numa_balancing
# 1 = enabled, 0 = disabled
# Disable AutoNUMA (if manual control preferred)
echo 0 > /proc/sys/kernel/numa_balancing
# Enable AutoNUMA
echo 1 > /proc/sys/kernel/numa_balancing
# Inspect AutoNUMA scanning parameters (older kernels expose these under
# /proc/sys/kernel; newer kernels move them to debugfs)
cat /proc/sys/kernel/numa_balancing_scan_delay_ms      # Delay before a new task is first scanned
cat /proc/sys/kernel/numa_balancing_scan_period_min_ms # Minimum time between scans
cat /proc/sys/kernel/numa_balancing_scan_period_max_ms # Maximum time between scans
cat /proc/sys/kernel/numa_balancing_scan_size_mb       # How much address space to scan per pass
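Kernels built with NUMA balancing also export activity counters in /proc/vmstat (numa_pte_updates, numa_hint_faults, numa_hint_faults_local, numa_pages_migrated). The small sketch below dumps every numa_* counter so you can watch whether AutoNUMA is actively scanning and migrating; exact counter names vary slightly by kernel version.

/* Sketch: print NUMA-related counters from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) {
        perror("/proc/vmstat");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        /* e.g. numa_hint_faults, numa_hint_faults_local, numa_pages_migrated */
        if (strncmp(line, "numa_", 5) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}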
When AutoNUMA Helps: long-running processes whose access patterns are stable enough for the sampling to converge, workloads started without any explicit NUMA tuning, and multi-tenant systems where manual placement for every job is impractical.
When AutoNUMA Hurts: short-lived processes that exit before placement converges, workloads whose access patterns shift faster than the scanner can track, data genuinely shared by threads on multiple nodes (pages ping-pong between nodes), and latency-sensitive applications where the hint-fault and migration overhead itself is unacceptable.
AutoNUMA is valuable as a safety net—it prevents the worst NUMA placement disasters without manual intervention. For performance-critical workloads, use AutoNUMA during development to identify NUMA issues, then implement explicit policies (numactl, cgroups) for production. The combination of initial AutoNUMA analysis followed by explicit tuning often yields the best results.
Effective NUMA optimization follows established patterns. Here are the key strategies used by NUMA-aware applications and system configurations.
Pattern 1: Node-Bound Instances
Run multiple application instances, each bound to a specific NUMA node:
# Node 0 instance
numactl --cpunodebind=0 --membind=0 ./server --port=8080
# Node 1 instance
numactl --cpunodebind=1 --membind=1 ./server --port=8081
# Load balancer distributes requests between instances
nginx -> 8080, 8081
This pattern is common for databases (run one instance per NUMA node) and stateless services. Each instance operates within its own node, achieving near-UMA performance.
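The binding can also live inside the application instead of a numactl wrapper. Below is a minimal sketch, assuming libnuma and a node number supplied by configuration (for example a hypothetical --numa-node option), in which the server pins itself at startup.

/* Sketch: self-binding a process to one NUMA node at startup.
 * The 'node' argument would come from configuration. */
#include <stdio.h>
#include <numa.h>

int bind_self_to_node(int node) {
    if (numa_available() < 0 || node > numa_max_node()) {
        fprintf(stderr, "invalid NUMA configuration\n");
        return -1;
    }

    /* Restrict this process (and future threads) to CPUs of 'node' */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return -1;
    }

    /* Restrict future memory allocations to 'node' (strict, like --membind) */
    struct bitmask *mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, node);
    numa_set_membind(mask);
    numa_free_nodemask(mask);
    return 0;
}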
Pattern 2: NUMA-Aware Thread Pools
Create per-node thread pools that process data local to their node:
// Allocate per-node work queues and pools
for (int node = 0; node < num_nodes; node++) {
pools[node] = create_thread_pool_on_node(node, threads_per_node);
queues[node] = create_queue_on_node(node);
}
// Route work based on data location
void submit_work(void *data) {
int node = get_memory_node(data);
enqueue(queues[node], work_item(data));
}
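Creating those per-node pools means pinning each worker thread to the CPUs of its node. One possible sketch, assuming libnuma and POSIX threads, is shown below; the helper name spawn_worker_on_node is illustrative.

/* Sketch: spawn a worker thread pinned to the CPUs of a given NUMA node. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <numa.h>

pthread_t spawn_worker_on_node(int node, void *(*fn)(void *), void *arg) {
    /* Discover which CPUs belong to this node */
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(node, cpus);

    /* Convert the libnuma bitmask to a cpu_set_t for pthreads */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned int cpu = 0; cpu < cpus->size; cpu++)
        if (numa_bitmask_isbitset(cpus, cpu))
            CPU_SET(cpu, &set);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_t tid;
    pthread_create(&tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    numa_free_cpumask(cpus);
    return tid;
}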
Pattern 3: Partitioned Data Structures
Divide large data structures so each partition resides on and is accessed from a single node:
struct numa_partitioned_hash {
struct hash_node **buckets[MAX_NODES]; // Per-node bucket arrays
int bucket_count_per_node;
};
void *lookup(struct numa_partitioned_hash *h, uint64_t key) {
// Hash determines which node holds this key
int node = key % num_nodes;
int bucket = (key / num_nodes) % h->bucket_count_per_node;
// Access is always node-local if accessor runs on correct node
return search_bucket(h->buckets[node][bucket], key);
}
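For the partitioning to pay off, each bucket array must actually live on its node. Here is a sketch of initialization under the same assumptions, reusing the struct above and placing each partition with numa_alloc_onnode().

/* Sketch: place each partition's bucket array on its own node. */
#include <string.h>
#include <numa.h>

void init_partitioned_hash(struct numa_partitioned_hash *h, int buckets_per_node) {
    int num_nodes = numa_max_node() + 1;
    h->bucket_count_per_node = buckets_per_node;

    for (int node = 0; node < num_nodes; node++) {
        size_t bytes = buckets_per_node * sizeof(struct hash_node *);
        /* Bucket array for this partition is bound to its home node */
        h->buckets[node] = numa_alloc_onnode(bytes, node);
        memset(h->buckets[node], 0, bytes);  /* pages already bound, so any thread may touch them */
    }
}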
Pattern 4: Initialization-Time Binding
Use parallel initialization to ensure first-touch allocation distributes data:
// BAD: Serial initialization on one node
void init_array_serial(int *array, size_t n) {
for (size_t i = 0; i < n; i++) {
array[i] = initial_value(i); // All on current node
}
}
// GOOD: Parallel initialization across nodes
void init_array_parallel(int *array, size_t n) {
#pragma omp parallel for schedule(static)
for (size_t i = 0; i < n; i++) {
array[i] = initial_value(i); // Each thread's pages on its node
}
}
| Aspect | Check | Action if Failing |
|---|---|---|
| Memory locality | Are pages on same node as accessing CPUs? | Use numactl membind or AutoNUMA |
| Process placement | Are threads running on correct nodes? | Use cpunodebind or taskset |
| Initialization | Is data initialized in parallel? | Parallelize initialization loops |
| Data partitioning | Is data partitioned by access pattern? | Redesign data structures for NUMA |
| Thread-to-data affinity | Do threads consistently access local data? | Implement work queue per node |
Some data structures are inherently shared across nodes—global configuration, shared counters, lock structures. For these, consider: read-mostly patterns (replicate data, invalidate on write), per-node copies with periodic synchronization, or interleaved allocation (no node is more remote than others). Truly shared mutable data remains a NUMA performance challenge with no universal solution.
Identifying NUMA-related performance problems requires specific tools and metrics. Here's a systematic approach to NUMA diagnosis.
Step 1: Confirm NUMA Configuration
# Check system NUMA topology
numactl --hardware
lscpu | grep NUMA
# View current memory distribution
numastat
numastat -c # Compact format
# Per-process NUMA stats
numastat -p <pid>
Step 2: Measure Remote Memory Access
High "Other Node" or "numa_foreign" counts indicate remote access:
# Watch NUMA stats change over time
watch -n 1 numastat
# Performance counters for remote access (Intel)
perf stat -e mem_load_uops_retired.local_dram,\
mem_load_uops_retired.remote_dram \
-p <pid> -- sleep 10
#!/bin/bash
# NUMA Performance Diagnosis Script

echo "=== NUMA Topology ==="
numactl --hardware

echo ""
echo "=== System-Wide NUMA Statistics ==="
numastat

echo ""
echo "=== Process Memory Distribution ==="
# For each significant process, show memory placement
for pid in $(pgrep -f "mysql|postgres|java|python"); do
    echo "--- PID: $pid ($(ps -p $pid -o comm=)) ---"
    numastat -p $pid 2>/dev/null
done

echo ""
echo "=== Memory Policy Check ==="
# Check if processes have explicit NUMA policies
for pid in $(pgrep -f "mysql|postgres"); do
    echo "PID $pid memory policy:"
    cat /proc/$pid/numa_maps | head -5
done

echo ""
echo "=== Performance Counter Analysis (10 seconds) ==="
# Measure local vs remote memory access
if command -v perf &> /dev/null; then
    perf stat -a -e mem_load_uops_retired.local_dram,mem_load_uops_retired.remote_dram -- sleep 10
else
    echo "perf not available - install linux-tools package"
fi

echo ""
echo "=== AutoNUMA Statistics ==="
cat /proc/vmstat | grep numa
echo "numa_balancing enabled: $(cat /proc/sys/kernel/numa_balancing)"

Interpreting numastat Output:
                   node0       node1
numa_hit        12345678    11234567
numa_miss          10234      534123   <-- PROBLEM: high misses on node1
numa_foreign      534123       10234
interleave_hit      1234        1234
local_node      12145678    11034567
other_node        200000      200000
Key Metrics:
numa_hit: pages successfully allocated on this node, as intended.
numa_miss: pages allocated on this node although the process preferred a different node.
numa_foreign: pages intended for this node that were allocated on a different node (each miss on one node shows up as a foreign on another).
interleave_hit: interleaved allocations that landed on the intended node.
local_node: pages allocated on this node by a process running on this node.
other_node: pages allocated on this node by a process running on another node.
Step 3: Verify Process-Memory Colocality
Use /proc/<pid>/numa_maps to see where each memory region is allocated:
cat /proc/$(pgrep mysqld)/numa_maps | head -20
# Output shows per-VMA NUMA placement
# Look for N0= vs N1= values to see distribution
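The same check can be performed from inside the application: move_pages() with a NULL destination list acts as a pure query and reports the node behind each page. The hedged sketch below counts how a buffer's pages are spread across nodes; it assumes at most 64 nodes and a program linked with -lnuma.

/* Sketch: count how many pages of 'buf' reside on each NUMA node.
 * Passing nodes == NULL makes move_pages() a pure query. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numa.h>
#include <numaif.h>

void print_page_distribution(void *buf, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;

    void **pages = malloc(npages * sizeof(void *));
    int *status  = malloc(npages * sizeof(int));
    for (size_t i = 0; i < npages; i++)
        pages[i] = (char *)buf + i * page;

    if (move_pages(0, npages, pages, NULL, status, 0) == 0) {
        int counts[64] = { 0 };   /* assumes <= 64 nodes */
        for (size_t i = 0; i < npages; i++)
            if (status[i] >= 0 && status[i] < 64)   /* negative = page not present */
                counts[status[i]]++;
        for (int n = 0; n <= numa_max_node(); n++)
            printf("node %d: %d pages\n", n, counts[n]);
    } else {
        perror("move_pages");
    }
    free(pages);
    free(status);
}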
A healthy NUMA system should show: >90% local_node accesses in numastat, balanced memory utilization across nodes, low numa_miss counts after initial startup, and minimal numa_balancing page migrations (if AutoNUMA is enabled and the system is well-tuned). Deviations indicate opportunities for optimization.
We have explored Non-Uniform Memory Access architecture from its physical foundations through practical optimization and diagnosis.
Consolidating Our Understanding: NUMA hardware makes memory latency depend on which node a processor accesses, and the distance matrix quantifies that cost. Memory placement policies (default first-touch, bind, preferred, interleave) control where pages land, while NUMA-aware scheduling keeps tasks near their memory. Automatic NUMA Balancing provides a safety net by sampling accesses and migrating pages or tasks, and patterns such as node-bound instances, per-node thread pools, partitioned data structures, and parallel initialization let applications exploit locality deliberately. Tools like numactl, numastat, numa_maps, and hardware performance counters close the loop by revealing where placement has gone wrong.
Module Complete:
With this page, we have completed our exploration of Multi-Processor Scheduling. You now understand how scheduling extends from a single CPU to many: the challenges of SMP scheduling, how load is balanced across cores while preserving cache affinity, and how NUMA adds memory locality as a first-class concern for both the scheduler and the memory allocator.
This knowledge forms the foundation for understanding how modern operating systems manage the most complex multi-core, multi-socket systems in production today.
You now understand NUMA architecture and scheduling at the depth required for production server optimization, kernel-level reasoning, and high-performance application design. You can diagnose NUMA performance issues, apply appropriate memory placement policies, and design NUMA-aware application architectures.