In the early days of multiprocessor computing, a comforting assumption prevailed: every processor could access any memory location in approximately the same time. This Uniform Memory Access (UMA) model simplified programming and operating system design. But as systems grew beyond a single processor socket, scaling to 2, 4, or 8 sockets and hundreds of cores, this assumption became not just inaccurate, but catastrophically misleading.
Welcome to the world of Non-Uniform Memory Access (NUMA), where the physical reality of silicon, copper traces, and interconnects forces us to confront an uncomfortable truth: memory access time depends on which processor is doing the accessing and which memory bank contains the data.
By the end of this page, you will understand why NUMA exists, how it fundamentally differs from symmetric multiprocessing, and why operating system designers must be acutely aware of memory topology. You'll grasp the physics and engineering constraints that make NUMA inevitable in large-scale systems, and the profound implications for software performance.
To understand NUMA, we must first understand what it replaced and why the transition became necessary. The story begins with Symmetric Multiprocessing (SMP), also known as Uniform Memory Access (UMA).
The SMP/UMA Architecture:
In an SMP system, all processors share a common memory bus and can access all memory with identical latency. The memory controller sits between processors and memory, arbitrating access transparently. From a programmer's perspective, memory is memory—there's no distinction between 'near' and 'far' memory because all memory is equidistant from all processors.
This architecture dominated the 1990s and early 2000s. Dual-socket and quad-socket servers used SMP successfully because the number of processors was small enough that bus bandwidth could keep up with aggregate memory demand.
The shared bus in SMP systems becomes a catastrophic bottleneck as processor count increases. With 4 processors, bus contention is manageable. With 8, it's problematic. With 16 or more, the bus becomes so saturated that adding more processors actually decreases performance. This phenomenon is called 'negative scaling.'
The Physics Problem:
Several fundamental physical constraints doom SMP at scale:
Bus Bandwidth Saturation: Each processor can issue memory requests at modern DRAM speeds. A shared bus must handle the aggregate bandwidth of all processors. With more cores, the bus becomes the bottleneck.
Signal Propagation Delays: Electrical signals travel at roughly 1 foot per nanosecond in copper. As systems grow physically larger to accommodate more processors and memory, signal propagation itself introduces latency.
Cache Coherency Traffic: In SMP, when one processor modifies data, all other processors with cached copies must be notified. This coherency traffic consumes bus bandwidth and increases with processor count.
Memory Controller Contention: A single memory controller must serialize all memory requests. This serialization overhead grows non-linearly with processor count.
By the mid-2000s, these constraints forced a fundamental architectural rethinking. The result was NUMA.
Non-Uniform Memory Access (NUMA) is a memory architecture where the time to access memory depends on the memory's location relative to the accessing processor. Some memory is 'local' (attached directly to the processor's memory controller) and some is 'remote' (attached to another processor's memory controller).
The Core Principle:
In NUMA, each processor (or group of processors) has its own local memory connected via a dedicated memory controller. Processors can still access memory attached to other processors, but this requires traversing an interconnect (such as Intel's QPI/UPI or AMD's Infinity Fabric), which introduces additional latency.
Understanding Memory Access Latency in NUMA:
Consider a dual-socket server where CPU 0 needs to access data:
| Memory Location | Access Path | Typical Latency |
|---|---|---|
| Local (Node 0) | CPU 0 → MC 0 → DRAM 0 | ~80 ns |
| Remote (Node 1) | CPU 0 → Interconnect → MC 1 → DRAM 1 | ~150 ns |
The remote access is nearly 2x slower than local access. This ratio is called the NUMA ratio or remote-to-local latency ratio. In real systems, NUMA ratios typically range from 1.5x to 3x, depending on the number of hops through the interconnect.
Despite the non-uniform access times, NUMA systems guarantee that all memory is accessible from all processors. This is crucial—it means existing programs 'just work' on NUMA systems. They might work slowly if memory is poorly placed, but they won't crash or produce incorrect results. NUMA is transparent to correctness, but critical to performance.
Why NUMA Enables Scalability:
NUMA solves SMP's scalability problems by distributing the memory system:
Distributed Memory Controllers: Each processor has its own memory controller. No single controller bottlenecks all memory access.
Local Bandwidth Isolation: A processor accessing local memory doesn't compete with other processors for bandwidth (except for cache coherency traffic).
Scalable Interconnect: Modern interconnects like AMD's Infinity Fabric or Intel's UPI use point-to-point links rather than shared buses. Adding more sockets doesn't degrade existing connections.
Aggregate Bandwidth Growth: Total system memory bandwidth scales with socket count. A 4-socket NUMA system has ~4x the memory bandwidth of a single socket.
This architecture enables systems with hundreds of cores and terabytes of memory—something fundamentally impossible with SMP.
In simple two-socket systems, NUMA is straightforward: memory is either local or one hop away. But modern high-end systems can have 4, 8, or even 16 sockets, creating complex NUMA topologies where some remote memory is 'closer' than other remote memory.
NUMA Distance:
Operating systems model NUMA topology using a concept called NUMA distance—a relative measure of access latency between NUMA nodes. The ACPI SLIT (System Locality Information Table) provides this information to the OS.
By convention, the ACPI SLIT normalizes local access to a distance of 10; a remote value of 21 therefore means roughly 2.1x the local latency:
```bash
# View NUMA topology on Linux
numactl --hardware

# Sample output for a 4-socket system:
# available: 4 nodes (0-3)
# node 0 cpus: 0-15
# node 0 size: 65536 MB
# node 0 free: 48120 MB
# node 1 cpus: 16-31
# node 1 size: 65536 MB
# node 1 free: 51234 MB
# node 2 cpus: 32-47
# node 2 size: 65536 MB
# node 2 free: 49876 MB
# node 3 cpus: 48-63
# node 3 size: 65536 MB
# node 3 free: 52012 MB
# node distances:
# node   0   1   2   3
#   0:  10  21  31  21
#   1:  21  10  21  31
#   2:  31  21  10  21
#   3:  21  31  21  10

# Query distance programmatically
cat /sys/devices/system/node/node0/distance
# Output: 10 21 31 21
```

Interpreting the Distance Matrix:
In the example above, we see a 4-socket system with an interesting topology:
| From Node | To Node 0 | To Node 1 | To Node 2 | To Node 3 |
|---|---|---|---|---|
| Node 0 | 10 (local) | 21 | 31 | 21 |
| Node 1 | 21 | 10 (local) | 21 | 31 |
| Node 2 | 31 | 21 | 10 (local) | 21 |
| Node 3 | 21 | 31 | 21 | 10 (local) |
This reveals a linear or ring topology: Node 0 has fast access to Nodes 1 and 3 (one hop, distance 21), but slower access to Node 2 (two hops, distance 31). The OS must understand this topology to make intelligent memory placement decisions.
When the OS needs to allocate memory remotely (because local memory is full), knowing the topology lets it choose the nearest remote node. Allocating on Node 1 from Node 0 (distance 21) is much better than allocating on Node 2 (distance 31). A 47% latency difference can translate to significant performance differences for memory-bound applications.
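The same idea can be expressed programmatically. The following is a minimal sketch using libnuma (link with -lnuma); the helper nearest_remote_node() and the choice of node 0 as the starting point are illustrative, not part of any standard API.

```c
#include <numa.h>
#include <stdio.h>

// Find the closest node to `from` that is not `from` itself, using the
// SLIT-derived distances exposed by libnuma (10 = local).
int nearest_remote_node(int from) {
    int best = -1, best_dist = 1 << 30;
    for (int n = 0; n <= numa_max_node(); n++) {
        if (n == from) continue;
        int d = numa_distance(from, n);   // e.g., 21 for one hop, 31 for two
        if (d > 0 && d < best_dist) {
            best_dist = d;
            best = n;
        }
    }
    return best;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    int from = 0;  // illustrative: ask from node 0's point of view
    printf("Nearest remote node to node %d: %d\n", from, nearest_remote_node(from));
    return 0;
}
```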
NUMA is not an abstract concept—it's implemented in concrete silicon. Understanding the major hardware implementations helps appreciate why NUMA behaves the way it does.
Intel's Implementation: QPI and UPI
Intel's multi-socket Xeon processors use QuickPath Interconnect (QPI), replaced by Ultra Path Interconnect (UPI) in newer Xeon Scalable processors. These are high-speed point-to-point links that connect processor sockets.
| Vendor | Interconnect | Bandwidth | Notable Features |
|---|---|---|---|
| Intel | UPI (Ultra Path Interconnect) | 10.4 GT/s | Integrated memory controller, 3 links per Xeon Max |
| AMD | Infinity Fabric | ~36 GB/s bidirectional | Scalable to 8+ sockets, CCX awareness |
| IBM | PowerAXON | ~60 GB/s | Large socket counts, memory buffer chips |
| ARM | CCIX/CXL | Variable | Open standard, emerging in datacenter |
AMD's Implementation: Infinity Fabric
AMD's EPYC processors use Infinity Fabric, a highly scalable interconnect with an interesting property: NUMA exists within a single socket.
EPYC processors are composed of multiple chiplets (CCDs), each containing a subset of cores. Memory is attached to a central I/O die, but different CCDs have different distances to different memory channels. This creates multiple NUMA nodes per socket—a 64-core EPYC might expose 8 NUMA nodes.
This design choice reflects a fundamental tradeoff: chiplets improve manufacturing yield and make high core counts economical, but they expose additional NUMA boundaries that the operating system and applications must manage.
Intel also offers Sub-NUMA Clustering (SNC) on Xeon Scalable, which divides a single socket into 2 NUMA nodes. This reduces average local memory latency but increases NUMA complexity. BIOS settings control whether SNC is enabled—sysadmins must understand the performance implications.
Cache Coherency in NUMA:
NUMA systems must maintain cache coherency—ensuring that all processors see consistent values for shared memory locations. This is challenging because coherency messages must traverse the inter-socket interconnect, and broadcast-based snooping traffic grows with the number of caches in the system.
Modern NUMA systems use directory-based coherency protocols where a 'home node' (usually where the memory is physically located) tracks which caches hold copies of each cache line. This is more scalable than broadcast-based snooping used in SMP.
The coherency tax:
When data is shared and modified across NUMA nodes, the coherency traffic eats into interconnect bandwidth. This creates an additional performance cliff beyond pure memory access latency—heavily shared, frequently modified data suffers worse performance than data accessed by a single node.
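As a rough illustration of this effect (a sketch, not a rigorous benchmark), the program below contrasts a counter shared across nodes with padded per-thread counters. It assumes the two threads end up on different NUMA nodes, for example by launching under numactl --cpunodebind, and in practice you would time the two access patterns separately.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 2
#define ITERS 10000000L

// Contended: every increment forces the cache line holding this counter to
// bounce across the interconnect between the nodes.
static atomic_long shared_counter;

// Uncontended: one cache line per thread, so each stays in its node's caches.
struct padded { atomic_long v; char pad[64 - sizeof(atomic_long)]; };
static struct padded per_thread[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) {
        atomic_fetch_add(&shared_counter, 1);    // cross-node coherency traffic
        atomic_fetch_add(&per_thread[id].v, 1);  // stays node-local
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared=%ld per_thread[0]=%ld\n",
           atomic_load(&shared_counter), atomic_load(&per_thread[0].v));
    return 0;
}
```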
NUMA introduces a fundamental performance asymmetry that can cause dramatic performance degradation—or dramatic improvements—depending on memory placement. Understanding this 'performance cliff' is essential for systems engineering.
Quantifying the Impact:
Consider a memory-bound application performing random reads across a working set that exceeds cache. On a typical modern server:
| Access Pattern | Latency | Relative Performance |
|---|---|---|
| L3 Cache Hit | ~15 ns | 1.0x (baseline) |
| Local DRAM Access | ~80 ns | 0.19x |
| Remote DRAM (1 hop) | ~150 ns | 0.10x |
| Remote DRAM (2 hops) | ~200 ns | 0.075x |
The Multiplicative Effect:
If an application's memory is entirely misplaced on a remote node, every access that misses the caches pays the remote penalty: per the table above, roughly half the performance of a correct local placement for one-hop accesses, and worse for two hops. On top of that, all of the traffic now crosses the interconnect, where it competes with cache coherency messages.
This isn't theoretical. Production databases, in-memory caches, and high-frequency trading systems regularly see these degradations when NUMA is ignored.
```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (1024 * 1024 * 256)   // 256 MB
#define ITERATIONS 100000000

// Benchmark memory access latency on different NUMA nodes
void benchmark_numa_latency(int allocation_node, int execution_node) {
    // Bind thread to specified node
    struct bitmask *nodemask = numa_bitmask_alloc(numa_max_node() + 1);
    numa_bitmask_setbit(nodemask, execution_node);
    numa_run_on_node_mask(nodemask);

    // Allocate memory on specified node
    char *buffer = numa_alloc_onnode(SIZE, allocation_node);
    if (!buffer) {
        fprintf(stderr, "NUMA allocation failed\n");
        return;
    }

    // Initialize (page fault to materialize pages)
    for (size_t i = 0; i < SIZE; i += 4096) {
        buffer[i] = 0;
    }

    // Benchmark random access
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    volatile char sum = 0;
    unsigned int seed = 42;
    for (long i = 0; i < ITERATIONS; i++) {
        size_t index = rand_r(&seed) % SIZE;
        sum += buffer[index];  // Random read
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    double latency_ns = elapsed_ns / ITERATIONS;

    printf("Alloc Node %d, Exec Node %d: %.2f ns/access\n",
           allocation_node, execution_node, latency_ns);

    numa_free(buffer, SIZE);
    numa_bitmask_free(nodemask);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int max_node = numa_max_node();
    printf("NUMA nodes: 0-%d\n\n", max_node);

    // Test all combinations
    for (int alloc = 0; alloc <= max_node; alloc++) {
        for (int exec = 0; exec <= max_node; exec++) {
            benchmark_numa_latency(alloc, exec);
        }
    }

    return 0;
}
```

The most insidious NUMA performance problem is silent degradation. Applications don't crash or throw errors when memory is remote—they just run slowly. Without explicit NUMA metrics, teams blame 'the network' or 'the database' when the real culprit is memory placement. Always monitor NUMA metrics (local vs. remote access counters) on multi-socket systems.
Now that we understand what NUMA is and why it matters, let's establish the foundational principles that guide all NUMA-aware system design.
Principle 1: Locality is Everything
The single most important optimization in NUMA systems is ensuring data is placed in memory local to the processor that will access it. This sounds simple, but it requires knowing which threads will touch which data, allocating (or first-touching) that data on the correct node, and keeping those threads from being migrated away from their memory.
Principle 2: Avoid the Memory Allocation Anti-Pattern
A common mistake is to allocate all memory in a single thread (e.g., during initialization) and then distribute work across NUMA nodes. Because of first-touch allocation, all memory ends up on a single node, and most threads suffer remote access penalties.
The Right Approach:
Have each worker thread allocate, or at least first-touch (initialize), the data it will later process, while running on the node where that work will happen.
This ensures each thread's data is local. The initialization cost is minimal compared to the runtime savings.
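Below is a minimal pthreads sketch of this pattern. It relies on the kernel's default first-touch policy and assumes worker threads stay on (or are pinned to) their nodes; the thread count and buffer sizes are arbitrary.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define CHUNK (64UL * 1024 * 1024)   // 64 MB per worker, illustrative

static double *data;   // shared buffer: reserved by main, never touched by it

static void *worker(void *arg) {
    long id = (long)arg;
    size_t n = CHUNK / sizeof(double);
    double *mine = data + id * n;

    // First touch: these stores fault the pages in, so the kernel backs them
    // with memory local to the node this thread is running on.
    memset(mine, 0, CHUNK);

    // The same thread later does the real work on its (now local) partition.
    for (size_t i = 0; i < n; i++)
        mine[i] = mine[i] * 2.0 + 1.0;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];

    // Reserve address space only; initializing here would place every page on
    // the allocating thread's node (the anti-pattern described above).
    data = malloc(NTHREADS * CHUNK);
    if (!data) return 1;

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    free(data);
    return 0;
}
```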
Think of a NUMA system as a collection of servers connected by a fast network, rather than a single unified machine. Each NUMA node has its own memory, and crossing node boundaries is like crossing a network boundary—possible, but with latency overhead. Design your software accordingly.
Operating systems play a critical role in managing NUMA systems. The OS must discover the hardware topology at boot (for example, from the ACPI SLIT), schedule threads close to the memory they use, place newly allocated pages sensibly, and expose placement controls and statistics to applications.
Modern operating systems (Linux, Windows, FreeBSD) all have NUMA support, though the depth and configurability varies.
| Feature | Linux | Windows | Purpose |
|---|---|---|---|
| NUMA-aware scheduler | ✓ Full support | ✓ Full support | Keep threads near their memory |
| Memory placement APIs | numactl, libnuma | NUMA API functions | Application-controlled placement |
| Automatic page migration | numad, AutoNUMA | Memory partition | Rebalance poorly placed pages |
| NUMA statistics | /proc/vmstat, numastat | Performance counters | Monitor local vs remote access |
| Huge page NUMA placement | ✓ Supported | ✓ Large pages | Reduce TLB misses with locality |
Linux's Approach: Default and Tunable
Linux uses several default policies that work well for many workloads: first-touch (local) page allocation, a NUMA-aware scheduler that tries to keep a task on the node holding its memory, and optional automatic NUMA balancing (AutoNUMA) that migrates pages toward the threads using them.
These defaults are reasonable, but high-performance applications typically override them with explicit NUMA control.
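To give a sense of what that explicit control looks like from C, here is a hedged sketch using a few libnuma calls (link with -lnuma); the node number and buffer size are illustrative. The shell-level equivalents follow.

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    size_t len = 1UL << 30;  // 1 GB, illustrative

    numa_set_localalloc();                        // prefer the calling thread's node
    void *on_node1 = numa_alloc_onnode(len, 1);   // force pages onto node 1
    void *striped  = numa_alloc_interleaved(len); // stripe pages across all nodes

    if (on_node1) numa_free(on_node1, len);
    if (striped)  numa_free(striped, len);
    return 0;
}
```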
```bash
# View NUMA statistics
numastat
# Output shows memory allocated per node and access patterns

# Run a command bound to specific NUMA nodes
numactl --cpunodebind=0 --membind=0 ./my_application

# Interleave memory across all nodes (for shared data structures)
numactl --interleave=all ./my_application

# Check if NUMA balancing is enabled
cat /proc/sys/kernel/numa_balancing
# 1 = enabled, 0 = disabled

# View per-process NUMA memory usage
cat /proc/<pid>/numa_maps

# Monitor NUMA page faults and migrations
watch -n 1 'grep numa /proc/vmstat'
```

AutoNUMA (numa_balancing) migrates pages to match thread locality automatically. While helpful for untuned applications, it can cause problems for applications that carefully place their memory. The migration itself consumes CPU and can cause latency spikes. Database vendors often recommend disabling AutoNUMA and using explicit NUMA placement instead.
We've established the foundation of NUMA architecture. Let's consolidate the key concepts before moving forward:
- SMP/UMA stops scaling because the shared bus, single memory controller, and coherency broadcasts become bottlenecks.
- NUMA gives each processor (or group of processors) its own memory controller and local memory, with sockets linked by a point-to-point interconnect.
- All memory remains accessible from every processor; NUMA affects performance, not correctness.
- Remote access is typically 1.5x to 3x slower than local access (the NUMA ratio), and the OS learns the topology from distance tables such as the ACPI SLIT.
- Locality, first-touch-aware allocation, and explicit placement controls are what keep NUMA systems fast.
What's Next:
In the next page, we'll dive deeper into NUMA Nodes—the fundamental unit of NUMA architecture. We'll explore how nodes are structured, what resources they contain, and how to query and work with nodes programmatically.
You now understand why NUMA exists, how it differs from SMP, and the fundamental principles of memory locality that drive NUMA-aware system design. Next, we'll examine NUMA nodes in detail—the building blocks of the NUMA architecture.