Everything we've discussed about SMP scheduling has assumed, implicitly or explicitly, that all processors can access all memory with equal latency. For small systems—laptops, desktops, small servers—this assumption holds reasonably well. But as systems scale to multiple processor sockets, each with many cores, a fundamental physical reality emerges: memory cannot be equidistant from all processors.
Non-Uniform Memory Access (NUMA) describes architectures where memory access time depends on the processor making the request and the location of the memory being accessed. Memory attached to a processor's local controller is fast; memory attached to a remote processor's controller is slower. This seemingly simple distinction has profound implications for operating system design, particularly scheduling and memory allocation.
By the end of this page, you will understand NUMA architecture at a fundamental level, including memory access latency differences, NUMA topologies, memory placement policies, NUMA-aware scheduling strategies, automatic NUMA balancing, and practical optimization techniques. This knowledge is essential for anyone working with multi-socket servers or large-scale computing infrastructure.
The Physical Reality:
In a traditional SMP system, all processors connect to memory through a shared bus or interconnect. As processor count increases, this shared path becomes a bottleneck: every processor competes for the same memory bandwidth, so adding cores yields diminishing returns.
NUMA solves this by giving each processor (or processor group) its own dedicated memory controller and local memory bank. Processors can access their local memory without contending with other processors. The tradeoff: accessing another processor's memory—remote memory—requires traversing an interconnect, adding latency.
NUMA Node:
A NUMA node is the fundamental unit of a NUMA system: a set of processors and their directly-attached local memory. Modern nodes typically correspond to a physical CPU socket with its attached DIMMs, or, on chiplet-based and sub-NUMA designs, a group of cores and memory channels within a socket.
Within a node, memory access is uniform. Across nodes, access latency varies based on the interconnect topology.
NUMA Architecture: Dual-Socket System (diagram). Each socket is one NUMA node containing its cores, a shared L3 cache, a memory controller, and locally attached memory (typically 64-256 GB). The two nodes are connected by a QPI/UPI interconnect. Typical latencies: local memory access ~70 ns; remote memory access ~170 ns (roughly 2.4x slower, the interconnect adding ~100 ns). Typical bandwidth: ~80 GB/s local versus ~40 GB/s remote per direction. When Core 0 accesses Node 0 memory, the request takes a direct path through the local memory controller; accessing Node 1 memory requires a QPI/UPI hop to Node 1, a pass through its memory controller, and the data path back.
NUMA Distance:
NUMA systems express the relative cost of accessing different nodes through a distance metric. This is typically normalized so that local access has distance 10:
The distance matrix describes the topology:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 65536 MB
node 0 free: 32000 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 65536 MB
node 1 free: 48000 MB
node distances:
node 0 1
0: 10 21
1: 21 10
In this example, accessing remote memory is 2.1x the distance of local—corresponding to significantly higher latency and reduced bandwidth.
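The same topology information is available to programs through libnuma. Below is a minimal sketch, assuming libnuma is installed and the program is linked with -lnuma, that prints each node's size and the distance matrix; on the dual-socket example above it would print the same 10/21 values as numactl --hardware.

/* Sketch: print node sizes and the NUMA distance matrix via libnuma.
 * Assumed build: gcc numa_topo.c -o numa_topo -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA not available on this system\n");
        return 1;
    }

    int max_node = numa_max_node();            /* highest node number */
    for (int n = 0; n <= max_node; n++) {
        long long free_bytes;
        long long total = numa_node_size64(n, &free_bytes);
        printf("node %d: %lld MB total, %lld MB free\n",
               n, total >> 20, free_bytes >> 20);
    }

    /* numa_distance() returns the same normalized values shown by
     * `numactl --hardware` (10 = local). */
    printf("node distances:\n");
    for (int i = 0; i <= max_node; i++) {
        for (int j = 0; j <= max_node; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}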
A 2-3x latency penalty for remote access might seem tolerable, but memory-intensive workloads can spend a large fraction of their execution time waiting for memory. A workload that spends 30% of its time stalled on local memory accesses can spend well over half of it stalled when those accesses go remote. For some workloads, poor NUMA placement reduces performance by 40% or more.
NUMA systems come in various topologies, each with different performance characteristics. Understanding your system's topology is essential for optimization.
Common NUMA Topologies:
1. Two-Node (Dual Socket): The most common server configuration. Two CPU sockets, each a NUMA node, connected by a single high-bandwidth interconnect (Intel QPI/UPI or AMD Infinity Fabric).
2. Four-Node (Quad Socket): Four CPU sockets arranged in a mesh or ring. Some node pairs are direct neighbors; others require a hop.
3. Eight-Node and Beyond: Large systems (8+ sockets) have complex interconnect topologies. Some nodes may be 3-4 hops apart.
NUMA Topology Examples (diagram):

1. Dual-socket (most common): a single direct link between Node 0 and Node 1.
   Distance matrix:
        0   1
   0:  10  21
   1:  21  10

2. Quad-socket ring topology: each node connects directly to two neighbors, so Node 0 to Node 2 requires traversing Node 1 or Node 3 (two hops).
   Distance matrix:
        0   1   2   3
   0:  10  21  31  21
   1:  21  10  21  31
   2:  31  21  10  21
   3:  21  31  21  10

3. Quad-socket full mesh (higher-end): all nodes directly connected, so every remote node is equidistant (better for balanced workloads).
   Distance matrix:
        0   1   2   3
   0:  10  21  21  21
   1:  21  10  21  21
   2:  21  21  10  21
   3:  21  21  21  10

AMD EPYC and Sub-Socket NUMA:
Modern AMD EPYC processors introduce sub-socket NUMA, where a single physical socket contains multiple NUMA nodes. Depending on the NPS (nodes per socket) BIOS setting, one socket can be exposed as one, two, or four nodes.
This design reflects the chiplet architecture: each chiplet has its own L3 cache and memory controller connection. Inter-chiplet communication adds latency compared to intra-chiplet.
Intel Sub-NUMA Clustering (SNC):
Intel offers Sub-NUMA Clustering (SNC) as a BIOS option that splits each socket into two or more NUMA domains, each with its own slice of the last-level cache and a subset of the socket's memory controllers.
Before optimizing NUMA, understand your specific system's topology using 'numactl --hardware' and 'lscpu'. Don't assume all multi-socket systems are identical. Sub-socket NUMA, asymmetric configurations, and varying interconnect speeds all affect the optimal strategy.
How and where memory is allocated has dramatic performance implications on NUMA systems. The operating system provides several memory placement policies that control allocation behavior.
Default Policy (Allocation follows First Touch):
By default, memory is allocated on the NUMA node where it's first accessed:
// First touch example
char *buffer = malloc(1024 * 1024 * 1024); // No physical pages yet
memset(buffer, 0, 1024 * 1024 * 1024); // Pages allocated here
// Memory will be on whatever node executes the memset
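To confirm where first-touch actually placed the data, the kernel can be asked which node backs a given address. Here is a minimal sketch, assuming libnuma's numaif.h is available and the program is linked with -lnuma, using get_mempolicy() with MPOL_F_NODE | MPOL_F_ADDR:

/* Sketch: query which NUMA node backs a given address after first touch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numaif.h>

int main(void) {
    size_t size = 64 * 1024 * 1024;
    char *buffer = malloc(size);
    memset(buffer, 0, size);              /* first touch happens here */

    int node = -1;
    /* MPOL_F_NODE | MPOL_F_ADDR: return the node holding the page at 'buffer' */
    if (get_mempolicy(&node, NULL, 0, buffer, MPOL_F_NODE | MPOL_F_ADDR) == 0)
        printf("first page of buffer is on node %d\n", node);
    else
        perror("get_mempolicy");

    free(buffer);
    return 0;
}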
| Policy | Behavior | Use Case | Linux API |
|---|---|---|---|
| Default/Local | Allocate on local node | General purpose | MPOL_DEFAULT |
| Bind | Allocate only from specified nodes | Dedicated memory regions | MPOL_BIND |
| Preferred | Prefer specified node, fall back if full | Soft locality hints | MPOL_PREFERRED |
| Interleave | Round-robin across nodes | Bandwidth-bound allocations | MPOL_INTERLEAVE |
| Weighted Interleave | Round-robin with weights | Asymmetric node setup | MPOL_WEIGHTED_INTERLEAVE |
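These policies can also be applied to an individual memory region rather than the whole process. Below is a minimal sketch, assuming libnuma linked with -lnuma, that uses the region-level helpers numa_interleave_memory() and numa_tonode_memory(); the policy must be set before the pages are first touched.

/* Sketch: per-region placement with libnuma. */
#include <string.h>
#include <numa.h>

void place_regions(void) {
    size_t size = 256 * 1024 * 1024;

    /* Region 1: interleave pages across all allowed nodes (bandwidth-oriented) */
    char *shared = numa_alloc(size);                  /* pages not touched yet */
    numa_interleave_memory(shared, size, numa_all_nodes_ptr);

    /* Region 2: bind a buffer to node 0 (locality-oriented) */
    char *local = numa_alloc(size);
    numa_tonode_memory(local, size, 0);

    /* Only now touch the memory: pages are placed per the policies above */
    memset(shared, 0, size);
    memset(local, 0, size);

    numa_free(shared, size);
    numa_free(local, size);
}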
Using numactl for Memory Policy:
# Run with memory bound to node 0
numactl --membind=0 ./my_application
# Prefer node 0, but allow others if memory pressure
numactl --preferred=0 ./my_application
# Interleave memory across all nodes
numactl --interleave=all ./my_application
# Combine CPU and memory binding
numactl --cpunodebind=0 --membind=0 ./my_application
When to Use Each Policy:
Bind (--membind): strict placement for workloads that must stay on specific nodes, such as a database instance dedicated to one node. Allocations never spill to other nodes, so the bound nodes can run out of memory while others sit idle.
Preferred (--preferred): a soft locality hint. Allocations favor the specified node but fall back to other nodes under memory pressure, so correctness never depends on the hint.
Interleave (--interleave): spreads pages round-robin across nodes. Best for large, bandwidth-bound allocations accessed roughly uniformly from all nodes, where aggregate bandwidth matters more than the latency of any single access.
/* NUMA Memory API Example (libnuma) */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>
#include <numaif.h>

void demonstrate_numa_memory_policies(void) {
    size_t size = 1024 * 1024 * 1024;  /* 1 GB */

    /* Check NUMA availability */
    if (numa_available() < 0) {
        printf("NUMA not available\n");
        return;
    }

    /* Allocate on specific node */
    void *local_mem = numa_alloc_onnode(size, 0);
    printf("Allocated 1GB on node 0: %p\n", local_mem);

    /* Allocate interleaved across all nodes */
    void *interleaved = numa_alloc_interleaved(size);
    printf("Allocated 1GB interleaved: %p\n", interleaved);

    /* Set memory policy for future allocations */
    struct bitmask *nodemask = numa_allocate_nodemask();
    numa_bitmask_setbit(nodemask, 0);
    numa_bitmask_setbit(nodemask, 1);

    /* Prefer node 0, allow other nodes as fallback */
    numa_set_preferred(0);
    void *preferred = malloc(size);      /* Will prefer node 0 */

    /* Bind strictly to nodes 0 and 1 */
    numa_set_membind(nodemask);
    void *bound = malloc(size);          /* Only from nodes 0 or 1 */

    /* Move an existing page to a different node */
    int nodes[1] = { 1 };
    int status[1];
    void *pages[1] = { local_mem };
    int result = numa_move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (result == 0) {
        printf("Moved page to node %d\n", status[0]);
    }

    /* Clean up: numa_alloc_* memory uses numa_free(), policy-based
     * allocations are freed with normal free() */
    numa_free(local_mem, size);
    numa_free(interleaved, size);
    numa_free_nodemask(nodemask);
    free(preferred);
    free(bound);
}

Many parallel applications initialize data structures on the main thread before spawning workers. With first-touch allocation, all data ends up on one node even though workers on all nodes will access it. Solutions include parallel initialization (each thread touches its portion), explicit interleaving, or post-initialization migration. This is one of the most common NUMA performance issues.
Beyond memory allocation, the scheduler itself must be NUMA-aware. A scheduler that ignores NUMA may frequently run processes on CPUs far from their memory, negating any benefit from careful memory placement.
NUMA and Scheduling Domains:
Linux's scheduling domains naturally encode NUMA structure: CPUs that share a core or a last-level cache form the lower domains, each NUMA node forms a domain above them, and groups of nodes form the topmost domains. Load balancing runs frequently within a node and progressively less aggressively across node boundaries.
Key NUMA Scheduling Principles:
1. Prefer Local Execution: When a process wakes, prefer scheduling it on a CPU in the same NUMA node as its memory:
// Conceptual NUMA-aware wake placement
int select_wakeup_cpu(struct task *p) {
int memory_node = get_task_memory_node(p);
int prev_cpu = p->last_cpu;
// Strong preference for CPUs on the memory node
if (cpu_is_idle_on_node(memory_node)) {
return find_idle_cpu_on_node(memory_node);
}
// Fall back to previous CPU if on same node
if (cpu_to_node(prev_cpu) == memory_node) {
return prev_cpu;
}
// Last resort: any idle CPU, but prefer same node
return find_best_cpu_preferring_node(memory_node);
}
2. Group Related Threads:
Threads that share data should ideally run on the same NUMA node to share the L3 cache and avoid coherence traffic across the interconnect:
// Schedule related threads together by respecting their cgroup's
// memory-node binding (conceptual)
struct cgroup *cg = get_task_cgroup(p);
if (cgroup_has_cpuset_mems(cg)) {
    // Restrict wake placement to the cgroup's allowed memory nodes
    preferred_nodes = cg->allowed_nodes;
}
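In practice this grouping is usually expressed through the cpuset cgroup controller rather than custom scheduler logic. Here is a hedged sketch assuming cgroup v2 with the cpuset controller enabled and a pre-created group; the path /sys/fs/cgroup/webapp and the CPU range 0-7 are illustrative and would come from your own topology.

/* Sketch: confine a cgroup to NUMA node 0 via the v2 cpuset controller. */
#include <stdio.h>

static int write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(value, f);
    fclose(f);
    return 0;
}

int confine_group_to_node0(void) {
    /* CPUs of node 0 (taken from `numactl --hardware` on this example system) */
    if (write_file("/sys/fs/cgroup/webapp/cpuset.cpus", "0-7") < 0)
        return -1;
    /* Memory allocations restricted to node 0 */
    if (write_file("/sys/fs/cgroup/webapp/cpuset.mems", "0") < 0)
        return -1;
    return 0;
}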
3. Avoid Cross-NUMA Migration:
The scheduler should be reluctant to migrate processes across NUMA nodes:
| Scenario | Cache Cost | Memory Cost | Total Cost | Recommendation |
|---|---|---|---|---|
| Same-node migration | L3 shared, L1/L2 cold | None | Low | Allow freely |
| Cross-node, memory follows | All levels cold | Migration overhead | Medium | Allow if balance needed |
| Cross-node, memory stays | All levels cold | Remote access forever | Very High | Avoid unless necessary |
When load balancing requires cross-NUMA process migration, the scheduler faces a choice: migrate just the process (cheap but leaves memory remote), or migrate both process and memory (expensive upfront but optimal long-term). Linux's Automatic NUMA Balancing (AutoNUMA) can migrate memory pages to follow processes, but this has significant overhead. The optimal choice depends on workload duration and memory access patterns.
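If you decide that memory should follow the process, the move can also be triggered explicitly from user space. A hedged sketch using libnuma's numa_migrate_pages() follows; the PID and node numbers are illustrative, and moving another process's pages requires CAP_SYS_NICE.

/* Sketch: move all of a process's pages from node 0 to node 1,
 * e.g. after the process itself has been migrated to node 1. */
#include <stdio.h>
#include <numa.h>

int follow_task_to_node1(int pid) {
    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();

    numa_bitmask_setbit(from, 0);   /* pages currently on node 0 */
    numa_bitmask_setbit(to, 1);     /* destination: node 1 */

    /* Returns -1 on error, otherwise the number of pages that could NOT be moved */
    int ret = numa_migrate_pages(pid, from, to);
    if (ret < 0)
        perror("numa_migrate_pages");
    else
        printf("migration done, %d pages could not be moved\n", ret);

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return ret < 0 ? -1 : 0;
}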
Manual NUMA optimization requires deep application knowledge and tuning effort. Automatic NUMA Balancing (AutoNUMA, introduced in Linux 3.8) attempts to automatically detect and correct suboptimal memory placement.
How AutoNUMA Works:
1. Fault-Based Sampling: The kernel periodically marks a sample of a task's pages inaccessible in the page tables (without actually unmapping them), so the next access triggers a minor "NUMA hint" fault. The fault handler records which CPU, and therefore which NUMA node, accessed the page.
2. Access Pattern Analysis: Over time, the kernel builds a profile of which nodes access which pages. Pages consistently accessed from a remote node are candidates for migration.
3. Page Migration: The kernel migrates pages to the node that accesses them most frequently. This happens transparently to the application.
4. Task Migration: Alternatively (or additionally), the scheduler may migrate the task to the node where its memory resides, rather than moving the memory.
AutoNUMA Operation Cycle (diagram): NUMA hint page faults (sampling) -> access recording -> statistics analysis -> migration decision, repeating over time. The migration decision has three options: (A) page migration, moving pages to the accessing node (expensive, but fixes placement); (B) task migration, moving the task to the node holding its pages (cheaper, but changes CPU allocation); or (C) both, converging the task and its pages onto the same node over time. Overhead considerations: roughly 1-3% CPU for scanning, the cost of re-mapping pages after each sampling fault, the copy cost of each page migration, and convergence times that can run to minutes before placement stabilizes.
Controlling AutoNUMA:
# Check if AutoNUMA is enabled
cat /proc/sys/kernel/numa_balancing
# 1 = enabled, 0 = disabled
# Disable AutoNUMA (if manual control preferred)
echo 0 > /proc/sys/kernel/numa_balancing
# Enable AutoNUMA
echo 1 > /proc/sys/kernel/numa_balancing
# Inspect AutoNUMA scanning parameters (older kernels expose these under
# /proc/sys/kernel; newer kernels move them to debugfs)
cat /proc/sys/kernel/numa_balancing_scan_delay_ms      # Delay before a new task is first scanned
cat /proc/sys/kernel/numa_balancing_scan_period_min_ms # Minimum time between scans
cat /proc/sys/kernel/numa_balancing_scan_period_max_ms # Maximum time between scans
cat /proc/sys/kernel/numa_balancing_scan_size_mb       # How much address space to scan per pass
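Kernels built with NUMA balancing also export activity counters in /proc/vmstat (numa_pte_updates, numa_hint_faults, numa_hint_faults_local, numa_pages_migrated). The small sketch below dumps every numa_* counter so you can watch whether AutoNUMA is actively scanning and migrating; exact counter names vary slightly by kernel version.

/* Sketch: print NUMA-related counters from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) {
        perror("/proc/vmstat");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        /* e.g. numa_hint_faults, numa_hint_faults_local, numa_pages_migrated */
        if (strncmp(line, "numa_", 5) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}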
When AutoNUMA Helps: long-running processes whose access patterns are stable enough for the sampling to converge, workloads started without any explicit NUMA tuning, and multi-tenant systems where manual placement for every job is impractical.
When AutoNUMA Hurts: short-lived processes that exit before placement converges, workloads whose access patterns shift faster than the scanner can track, data genuinely shared by threads on multiple nodes (pages ping-pong between nodes), and latency-sensitive applications where the hint-fault and migration overhead itself is unacceptable.
AutoNUMA is valuable as a safety net—it prevents the worst NUMA placement disasters without manual intervention. For performance-critical workloads, use AutoNUMA during development to identify NUMA issues, then implement explicit policies (numactl, cgroups) for production. The combination of initial AutoNUMA analysis followed by explicit tuning often yields the best results.
Effective NUMA optimization follows established patterns. Here are the key strategies used by NUMA-aware applications and system configurations.
Pattern 1: Node-Bound Instances
Run multiple application instances, each bound to a specific NUMA node:
# Node 0 instance
numactl --cpunodebind=0 --membind=0 ./server --port=8080
# Node 1 instance
numactl --cpunodebind=1 --membind=1 ./server --port=8081
# Load balancer distributes requests between instances
nginx -> 8080, 8081
This pattern is common for databases (run one instance per NUMA node) and stateless services. Each instance operates within its own node, achieving near-UMA performance.
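The binding can also live inside the application instead of a numactl wrapper. Below is a minimal sketch, assuming libnuma and a node number supplied by configuration (for example a hypothetical --numa-node option), in which the server pins itself at startup.

/* Sketch: self-binding a process to one NUMA node at startup.
 * The 'node' argument would come from configuration. */
#include <stdio.h>
#include <numa.h>

int bind_self_to_node(int node) {
    if (numa_available() < 0 || node > numa_max_node()) {
        fprintf(stderr, "invalid NUMA configuration\n");
        return -1;
    }

    /* Restrict this process (and future threads) to CPUs of 'node' */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return -1;
    }

    /* Restrict future memory allocations to 'node' (strict, like --membind) */
    struct bitmask *mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, node);
    numa_set_membind(mask);
    numa_free_nodemask(mask);
    return 0;
}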
Pattern 2: NUMA-Aware Thread Pools
Create per-node thread pools that process data local to their node:
// Allocate per-node work queues and pools
for (int node = 0; node < num_nodes; node++) {
pools[node] = create_thread_pool_on_node(node, threads_per_node);
queues[node] = create_queue_on_node(node);
}
// Route work based on data location
void submit_work(void *data) {
int node = get_memory_node(data);
enqueue(queues[node], work_item(data));
}
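Creating those per-node pools means pinning each worker thread to the CPUs of its node. One possible sketch, assuming libnuma and POSIX threads, is shown below; the helper name spawn_worker_on_node is illustrative.

/* Sketch: spawn a worker thread pinned to the CPUs of a given NUMA node. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <numa.h>

pthread_t spawn_worker_on_node(int node, void *(*fn)(void *), void *arg) {
    /* Discover which CPUs belong to this node */
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(node, cpus);

    /* Convert the libnuma bitmask to a cpu_set_t for pthreads */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned int cpu = 0; cpu < cpus->size; cpu++)
        if (numa_bitmask_isbitset(cpus, cpu))
            CPU_SET(cpu, &set);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_t tid;
    pthread_create(&tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    numa_free_cpumask(cpus);
    return tid;
}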
Pattern 3: Partitioned Data Structures
Divide large data structures so each partition resides on and is accessed from a single node:
struct numa_partitioned_hash {
struct hash_node **buckets[MAX_NODES]; // Per-node bucket arrays
int bucket_count_per_node;
};
void *lookup(struct numa_partitioned_hash *h, uint64_t key) {
// Hash determines which node holds this key
int node = key % num_nodes;
int bucket = (key / num_nodes) % h->bucket_count_per_node;
// Access is always node-local if accessor runs on correct node
return search_bucket(h->buckets[node][bucket], key);
}
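For the partitioning to pay off, each bucket array must actually live on its node. Here is a sketch of initialization under the same assumptions, reusing the struct above and placing each partition with numa_alloc_onnode().

/* Sketch: place each partition's bucket array on its own node. */
#include <string.h>
#include <numa.h>

void init_partitioned_hash(struct numa_partitioned_hash *h, int buckets_per_node) {
    int num_nodes = numa_max_node() + 1;
    h->bucket_count_per_node = buckets_per_node;

    for (int node = 0; node < num_nodes; node++) {
        size_t bytes = buckets_per_node * sizeof(struct hash_node *);
        /* Bucket array for this partition is bound to its home node */
        h->buckets[node] = numa_alloc_onnode(bytes, node);
        memset(h->buckets[node], 0, bytes);  /* pages already bound, so any thread may touch them */
    }
}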
Pattern 4: Initialization-Time Binding
Use parallel initialization to ensure first-touch allocation distributes data:
// BAD: Serial initialization on one node
void init_array_serial(int *array, size_t n) {
for (size_t i = 0; i < n; i++) {
array[i] = initial_value(i); // All on current node
}
}
// GOOD: Parallel initialization across nodes
void init_array_parallel(int *array, size_t n) {
#pragma omp parallel for schedule(static)
for (size_t i = 0; i < n; i++) {
array[i] = initial_value(i); // Each thread's pages on its node
}
}
| Aspect | Check | Action if Failing |
|---|---|---|
| Memory locality | Are pages on same node as accessing CPUs? | Use numactl membind or AutoNUMA |
| Process placement | Are threads running on correct nodes? | Use cpunodebind or taskset |
| Initialization | Is data initialized in parallel? | Parallelize initialization loops |
| Data partitioning | Is data partitioned by access pattern? | Redesign data structures for NUMA |
| Thread-to-data affinity | Do threads consistently access local data? | Implement work queue per node |
Some data structures are inherently shared across nodes—global configuration, shared counters, lock structures. For these, consider: read-mostly patterns (replicate data, invalidate on write), per-node copies with periodic synchronization, or interleaved allocation (no node is more remote than others). Truly shared mutable data remains a NUMA performance challenge with no universal solution.
Identifying NUMA-related performance problems requires specific tools and metrics. Here's a systematic approach to NUMA diagnosis.
Step 1: Confirm NUMA Configuration
# Check system NUMA topology
numactl --hardware
lscpu | grep NUMA
# View current memory distribution
numastat
numastat -c # Compact format
# Per-process NUMA stats
numastat -p <pid>
Step 2: Measure Remote Memory Access
High "Other Node" or "numa_foreign" counts indicate remote access:
# Watch NUMA stats change over time
watch -n 1 numastat
# Performance counters for remote access (Intel)
perf stat -e mem_load_uops_retired.local_dram,\
mem_load_uops_retired.remote_dram \
-p <pid> -- sleep 10
#!/bin/bash
# NUMA Performance Diagnosis Script

echo "=== NUMA Topology ==="
numactl --hardware

echo ""
echo "=== System-Wide NUMA Statistics ==="
numastat

echo ""
echo "=== Process Memory Distribution ==="
# For each significant process, show memory placement
for pid in $(pgrep -f "mysql|postgres|java|python"); do
    echo "--- PID: $pid ($(ps -p $pid -o comm=)) ---"
    numastat -p $pid 2>/dev/null
done

echo ""
echo "=== Memory Policy Check ==="
# Check if processes have explicit NUMA policies
for pid in $(pgrep -f "mysql|postgres"); do
    echo "PID $pid memory policy:"
    cat /proc/$pid/numa_maps | head -5
done

echo ""
echo "=== Performance Counter Analysis (10 seconds) ==="
# Measure local vs remote memory access
if command -v perf &> /dev/null; then
    perf stat -a -e mem_load_uops_retired.local_dram,mem_load_uops_retired.remote_dram -- sleep 10
else
    echo "perf not available - install linux-tools package"
fi

echo ""
echo "=== AutoNUMA Statistics ==="
cat /proc/vmstat | grep numa
echo "numa_balancing enabled: $(cat /proc/sys/kernel/numa_balancing)"

Interpreting numastat Output:
                   node0       node1
numa_hit        12345678    11234567
numa_miss          10234      534123   <-- PROBLEM: high misses on node1
numa_foreign      534123       10234
interleave_hit      1234        1234
local_node      12145678    11034567
other_node        200000      200000
Key Metrics:
numa_hit: pages successfully allocated on this node, as intended.
numa_miss: pages allocated on this node although the process preferred a different node.
numa_foreign: pages intended for this node that were allocated on a different node (each miss on one node shows up as a foreign on another).
interleave_hit: interleaved allocations that landed on the intended node.
local_node: pages allocated on this node by a process running on this node.
other_node: pages allocated on this node by a process running on another node.
Step 3: Verify Process-Memory Colocality
Use /proc/<pid>/numa_maps to see where each memory region is allocated:
cat /proc/$(pgrep mysqld)/numa_maps | head -20
# Output shows per-VMA NUMA placement
# Look for N0= vs N1= values to see distribution
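The same check can be performed from inside the application: move_pages() with a NULL destination list acts as a pure query and reports the node behind each page. The hedged sketch below counts how a buffer's pages are spread across nodes; it assumes at most 64 nodes and a program linked with -lnuma.

/* Sketch: count how many pages of 'buf' reside on each NUMA node.
 * Passing nodes == NULL makes move_pages() a pure query. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numa.h>
#include <numaif.h>

void print_page_distribution(void *buf, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;

    void **pages = malloc(npages * sizeof(void *));
    int *status  = malloc(npages * sizeof(int));
    for (size_t i = 0; i < npages; i++)
        pages[i] = (char *)buf + i * page;

    if (move_pages(0, npages, pages, NULL, status, 0) == 0) {
        int counts[64] = { 0 };   /* assumes <= 64 nodes */
        for (size_t i = 0; i < npages; i++)
            if (status[i] >= 0 && status[i] < 64)   /* negative = page not present */
                counts[status[i]]++;
        for (int n = 0; n <= numa_max_node(); n++)
            printf("node %d: %d pages\n", n, counts[n]);
    } else {
        perror("move_pages");
    }
    free(pages);
    free(status);
}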
A healthy NUMA system should show: >90% local_node accesses in numastat, balanced memory utilization across nodes, low numa_miss counts after initial startup, and minimal numa_balancing page migrations (if AutoNUMA is enabled and the system is well-tuned). Deviations indicate opportunities for optimization.
We have explored Non-Uniform Memory Access architecture from its physical foundations through practical optimization and diagnosis.
Consolidating Our Understanding: NUMA hardware makes memory latency depend on which node a processor accesses, and the distance matrix quantifies that cost. Memory placement policies (default first-touch, bind, preferred, interleave) control where pages land, while NUMA-aware scheduling keeps tasks near their memory. Automatic NUMA Balancing provides a safety net by sampling accesses and migrating pages or tasks, and patterns such as node-bound instances, per-node thread pools, partitioned data structures, and parallel initialization let applications exploit locality deliberately. Tools like numactl, numastat, numa_maps, and hardware performance counters close the loop by revealing where placement has gone wrong.
Module Complete:
With this page, we have completed our exploration of Multi-Processor Scheduling. You now understand how scheduling extends from a single CPU to many: the challenges of SMP scheduling, how load is balanced across cores while preserving cache affinity, and how NUMA adds memory locality as a first-class concern for both the scheduler and the memory allocator.
This knowledge forms the foundation for understanding how modern operating systems manage the most complex multi-core, multi-socket systems in production today.
You now understand NUMA architecture and scheduling at the depth required for production server optimization, kernel-level reasoning, and high-performance application design. You can diagnose NUMA performance issues, apply appropriate memory placement policies, and design NUMA-aware application architectures.