In the early days of multiprocessor computing, a comforting assumption prevailed: every processor could access any memory location in approximately the same time. This Uniform Memory Access (UMA) model simplified programming and operating system design. But as systems grew beyond a single processor socket, scaling to 2, 4, or 8 sockets and hundreds of cores, this assumption became not just inaccurate, but catastrophically misleading.
Welcome to the world of Non-Uniform Memory Access (NUMA), where the physical reality of silicon, copper traces, and interconnects forces us to confront an uncomfortable truth: memory access time depends on which processor is doing the accessing and which memory bank contains the data.
By the end of this page, you will understand why NUMA exists, how it fundamentally differs from symmetric multiprocessing, and why operating system designers must be acutely aware of memory topology. You'll grasp the physics and engineering constraints that make NUMA inevitable in large-scale systems, and the profound implications for software performance.
To understand NUMA, we must first understand what it replaced and why the transition became necessary. The story begins with Symmetric Multiprocessing (SMP), also known as Uniform Memory Access (UMA).
The SMP/UMA Architecture:
In an SMP system, all processors share a common memory bus and can access all memory with identical latency. The memory controller sits between processors and memory, arbitrating access transparently. From a programmer's perspective, memory is memory—there's no distinction between 'near' and 'far' memory because all memory is equidistant from all processors.
This architecture dominated the 1990s and early 2000s. Dual-socket and quad-socket servers used SMP successfully because the number of processors was small enough that bus bandwidth could keep up with aggregate memory demand.
The shared bus in SMP systems becomes a catastrophic bottleneck as processor count increases. With 4 processors, bus contention is manageable. With 8, it's problematic. With 16 or more, the bus becomes so saturated that adding more processors actually decreases performance. This phenomenon is called 'negative scaling.'
The Physics Problem:
Several fundamental physical constraints doom SMP at scale:
Bus Bandwidth Saturation: Each processor can issue memory requests at modern DRAM speeds. A shared bus must handle the aggregate bandwidth of all processors. With more cores, the bus becomes the bottleneck.
Signal Propagation Delays: Electrical signals travel at roughly 1 foot per nanosecond in copper. As systems grow physically larger to accommodate more processors and memory, signal propagation itself introduces latency.
Cache Coherency Traffic: In SMP, when one processor modifies data, all other processors with cached copies must be notified. This coherency traffic consumes bus bandwidth and increases with processor count.
Memory Controller Contention: A single memory controller must serialize all memory requests. This serialization overhead grows non-linearly with processor count.
By the mid-2000s, these constraints forced a fundamental architectural rethinking. The result was NUMA.
Non-Uniform Memory Access (NUMA) is a memory architecture where the time to access memory depends on the memory's location relative to the accessing processor. Some memory is 'local' (attached directly to the processor's memory controller) and some is 'remote' (attached to another processor's memory controller).
The Core Principle:
In NUMA, each processor (or group of processors) has its own local memory connected via a dedicated memory controller. Processors can still access memory attached to other processors, but this requires traversing an interconnect (such as Intel's QPI/UPI or AMD's Infinity Fabric), which introduces additional latency.
Understanding Memory Access Latency in NUMA:
Consider a dual-socket server where CPU 0 needs to access data:
| Memory Location | Access Path | Typical Latency |
|---|---|---|
| Local (Node 0) | CPU 0 → MC 0 → DRAM 0 | ~80 ns |
| Remote (Node 1) | CPU 0 → Interconnect → MC 1 → DRAM 1 | ~150 ns |
The remote access is nearly 2x slower than local access. This ratio is called the NUMA ratio or remote-to-local latency ratio. In real systems, NUMA ratios typically range from 1.5x to 3x, depending on the number of hops through the interconnect.
Despite the non-uniform access times, NUMA systems guarantee that all memory is accessible from all processors. This is crucial—it means existing programs 'just work' on NUMA systems. They might work slowly if memory is poorly placed, but they won't crash or produce incorrect results. NUMA is transparent to correctness, but critical to performance.
Why NUMA Enables Scalability:
NUMA solves SMP's scalability problems by distributing the memory system:
Distributed Memory Controllers: Each processor has its own memory controller. No single controller bottlenecks all memory access.
Local Bandwidth Isolation: A processor accessing local memory doesn't compete with other processors for bandwidth (except for cache coherency traffic).
Scalable Interconnect: Modern interconnects like AMD's Infinity Fabric or Intel's UPI use point-to-point links rather than shared buses. Adding more sockets doesn't degrade existing connections.
Aggregate Bandwidth Growth: Total system memory bandwidth scales with socket count. A 4-socket NUMA system has ~4x the memory bandwidth of a single socket.
This architecture enables systems with hundreds of cores and terabytes of memory—something fundamentally impossible with SMP.
In simple two-socket systems, NUMA is straightforward: memory is either local or one hop away. But modern high-end systems can have 4, 8, or even 16 sockets, creating complex NUMA topologies where some remote memory is 'closer' than other remote memory.
NUMA Distance:
Operating systems model NUMA topology using a concept called NUMA distance—a relative measure of access latency between NUMA nodes. The ACPI SLIT (System Locality Information Table) provides this information to the OS.
By convention, the ACPI SLIT normalizes local access to a distance of 10; a remote value of 21 therefore means roughly 2.1x the local latency:
```bash
# View NUMA topology on Linux
numactl --hardware

# Sample output for a 4-socket system:
# available: 4 nodes (0-3)
# node 0 cpus: 0-15
# node 0 size: 65536 MB
# node 0 free: 48120 MB
# node 1 cpus: 16-31
# node 1 size: 65536 MB
# node 1 free: 51234 MB
# node 2 cpus: 32-47
# node 2 size: 65536 MB
# node 2 free: 49876 MB
# node 3 cpus: 48-63
# node 3 size: 65536 MB
# node 3 free: 52012 MB
# node distances:
# node   0   1   2   3
#   0:  10  21  31  21
#   1:  21  10  21  31
#   2:  31  21  10  21
#   3:  21  31  21  10

# Query distance programmatically
cat /sys/devices/system/node/node0/distance
# Output: 10 21 31 21
```

Interpreting the Distance Matrix:
In the example above, we see a 4-socket system with an interesting topology:
| From Node | To Node 0 | To Node 1 | To Node 2 | To Node 3 |
|---|---|---|---|---|
| Node 0 | 10 (local) | 21 | 31 | 21 |
| Node 1 | 21 | 10 (local) | 21 | 31 |
| Node 2 | 31 | 21 | 10 (local) | 21 |
| Node 3 | 21 | 31 | 21 | 10 (local) |
This reveals a linear or ring topology: Node 0 has fast access to Nodes 1 and 3 (one hop, distance 21), but slower access to Node 2 (two hops, distance 31). The OS must understand this topology to make intelligent memory placement decisions.
When the OS needs to allocate memory remotely (because local memory is full), knowing the topology lets it choose the nearest remote node. Allocating on Node 1 from Node 0 (distance 21) is much better than allocating on Node 2 (distance 31). A 47% latency difference can translate to significant performance differences for memory-bound applications.
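The same idea can be expressed programmatically. The following is a minimal sketch using libnuma (link with -lnuma); the helper nearest_remote_node() and the choice of node 0 as the starting point are illustrative, not part of any standard API.

```c
#include <numa.h>
#include <stdio.h>

// Find the closest node to `from` that is not `from` itself, using the
// SLIT-derived distances exposed by libnuma (10 = local).
int nearest_remote_node(int from) {
    int best = -1, best_dist = 1 << 30;
    for (int n = 0; n <= numa_max_node(); n++) {
        if (n == from) continue;
        int d = numa_distance(from, n);   // e.g., 21 for one hop, 31 for two
        if (d > 0 && d < best_dist) {
            best_dist = d;
            best = n;
        }
    }
    return best;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    int from = 0;  // illustrative: ask from node 0's point of view
    printf("Nearest remote node to node %d: %d\n", from, nearest_remote_node(from));
    return 0;
}
```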
NUMA is not an abstract concept—it's implemented in concrete silicon. Understanding the major hardware implementations helps appreciate why NUMA behaves the way it does.
Intel's Implementation: QPI and UPI
Intel's multi-socket Xeon processors use QuickPath Interconnect (QPI), replaced by Ultra Path Interconnect (UPI) in newer Xeon Scalable processors. These are high-speed point-to-point links that connect processor sockets.
| Vendor | Interconnect | Bandwidth | Notable Features |
|---|---|---|---|
| Intel | UPI (Ultra Path Interconnect) | 10.4 GT/s | Integrated memory controller, 3 links per Xeon Max |
| AMD | Infinity Fabric | ~36 GB/s bidirectional | Scalable to 8+ sockets, CCX awareness |
| IBM | PowerAXON | ~60 GB/s | Large socket counts, memory buffer chips |
| ARM | CCIX/CXL | Variable | Open standard, emerging in datacenter |
AMD's Implementation: Infinity Fabric
AMD's EPYC processors use Infinity Fabric, a highly scalable interconnect with an interesting property: NUMA exists within a single socket.
EPYC processors are composed of multiple chiplets (CCDs), each containing a subset of cores. Memory is attached to a central I/O die, but different CCDs have different distances to different memory channels. This creates multiple NUMA nodes per socket—a 64-core EPYC might expose 8 NUMA nodes.
This design choice reflects a fundamental tradeoff: chiplets improve manufacturing yield and make high core counts economical, but they expose additional NUMA boundaries that the operating system and applications must manage.
Intel also offers Sub-NUMA Clustering (SNC) on Xeon Scalable, which divides a single socket into 2 NUMA nodes. This reduces average local memory latency but increases NUMA complexity. BIOS settings control whether SNC is enabled—sysadmins must understand the performance implications.
Cache Coherency in NUMA:
NUMA systems must maintain cache coherency—ensuring that all processors see consistent values for shared memory locations. This is challenging because coherency messages must traverse the inter-socket interconnect, and broadcast-based snooping traffic grows with the number of caches in the system.
Modern NUMA systems use directory-based coherency protocols where a 'home node' (usually where the memory is physically located) tracks which caches hold copies of each cache line. This is more scalable than broadcast-based snooping used in SMP.
The coherency tax:
When data is shared and modified across NUMA nodes, the coherency traffic eats into interconnect bandwidth. This creates an additional performance cliff beyond pure memory access latency—heavily shared, frequently modified data suffers worse performance than data accessed by a single node.
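As a rough illustration of this effect (a sketch, not a rigorous benchmark), the program below contrasts a counter shared across nodes with padded per-thread counters. It assumes the two threads end up on different NUMA nodes, for example by launching under numactl --cpunodebind, and in practice you would time the two access patterns separately.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 2
#define ITERS 10000000L

// Contended: every increment forces the cache line holding this counter to
// bounce across the interconnect between the nodes.
static atomic_long shared_counter;

// Uncontended: one cache line per thread, so each stays in its node's caches.
struct padded { atomic_long v; char pad[64 - sizeof(atomic_long)]; };
static struct padded per_thread[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) {
        atomic_fetch_add(&shared_counter, 1);    // cross-node coherency traffic
        atomic_fetch_add(&per_thread[id].v, 1);  // stays node-local
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared=%ld per_thread[0]=%ld\n",
           atomic_load(&shared_counter), atomic_load(&per_thread[0].v));
    return 0;
}
```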
NUMA introduces a fundamental performance asymmetry that can cause dramatic performance degradation—or dramatic improvements—depending on memory placement. Understanding this 'performance cliff' is essential for systems engineering.
Quantifying the Impact:
Consider a memory-bound application performing random reads across a working set that exceeds cache. On a typical modern server:
| Access Pattern | Latency | Relative Performance |
|---|---|---|
| L3 Cache Hit | ~15 ns | 1.0x (baseline) |
| Local DRAM Access | ~80 ns | 0.19x |
| Remote DRAM (1 hop) | ~150 ns | 0.10x |
| Remote DRAM (2 hops) | ~200 ns | 0.075x |
The Multiplicative Effect:
If an application's memory is entirely misplaced on a remote node, every access that misses the caches pays the remote penalty: per the table above, roughly half the performance of a correct local placement for one-hop accesses, and worse for two hops. On top of that, all of the traffic now crosses the interconnect, where it competes with cache coherency messages.
This isn't theoretical. Production databases, in-memory caches, and high-frequency trading systems regularly see these degradations when NUMA is ignored.
```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (1024 * 1024 * 256)   // 256 MB
#define ITERATIONS 100000000

// Benchmark memory access latency on different NUMA nodes
void benchmark_numa_latency(int allocation_node, int execution_node) {
    // Bind thread to specified node
    struct bitmask *nodemask = numa_bitmask_alloc(numa_max_node() + 1);
    numa_bitmask_setbit(nodemask, execution_node);
    numa_run_on_node_mask(nodemask);

    // Allocate memory on specified node
    char *buffer = numa_alloc_onnode(SIZE, allocation_node);
    if (!buffer) {
        fprintf(stderr, "NUMA allocation failed\n");
        return;
    }

    // Initialize (page fault to materialize pages)
    for (size_t i = 0; i < SIZE; i += 4096) {
        buffer[i] = 0;
    }

    // Benchmark random access
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    volatile char sum = 0;
    unsigned int seed = 42;
    for (long i = 0; i < ITERATIONS; i++) {
        size_t index = rand_r(&seed) % SIZE;
        sum += buffer[index];  // Random read
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    double latency_ns = elapsed_ns / ITERATIONS;

    printf("Alloc Node %d, Exec Node %d: %.2f ns/access\n",
           allocation_node, execution_node, latency_ns);

    numa_free(buffer, SIZE);
    numa_bitmask_free(nodemask);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int max_node = numa_max_node();
    printf("NUMA nodes: 0-%d\n\n", max_node);

    // Test all combinations
    for (int alloc = 0; alloc <= max_node; alloc++) {
        for (int exec = 0; exec <= max_node; exec++) {
            benchmark_numa_latency(alloc, exec);
        }
    }

    return 0;
}
```

The most insidious NUMA performance problem is silent degradation. Applications don't crash or throw errors when memory is remote—they just run slowly. Without explicit NUMA metrics, teams blame 'the network' or 'the database' when the real culprit is memory placement. Always monitor NUMA metrics (local vs. remote access counters) on multi-socket systems.
Now that we understand what NUMA is and why it matters, let's establish the foundational principles that guide all NUMA-aware system design.
Principle 1: Locality is Everything
The single most important optimization in NUMA systems is ensuring data is placed in memory local to the processor that will access it. This sounds simple, but it requires knowing which threads will touch which data, allocating (or first-touching) that data on the correct node, and keeping those threads from being migrated away from their memory.
Principle 2: Avoid the Memory Allocation Anti-Pattern
A common mistake is to allocate all memory in a single thread (e.g., during initialization) and then distribute work across NUMA nodes. Because of first-touch allocation, all memory ends up on a single node, and most threads suffer remote access penalties.
The Right Approach:
Have each worker thread allocate, or at least first-touch (initialize), the data it will later process, while running on the node where that work will happen.
This ensures each thread's data is local. The initialization cost is minimal compared to the runtime savings.
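Below is a minimal pthreads sketch of this pattern. It relies on the kernel's default first-touch policy and assumes worker threads stay on (or are pinned to) their nodes; the thread count and buffer sizes are arbitrary.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define CHUNK (64UL * 1024 * 1024)   // 64 MB per worker, illustrative

static double *data;   // shared buffer: reserved by main, never touched by it

static void *worker(void *arg) {
    long id = (long)arg;
    size_t n = CHUNK / sizeof(double);
    double *mine = data + id * n;

    // First touch: these stores fault the pages in, so the kernel backs them
    // with memory local to the node this thread is running on.
    memset(mine, 0, CHUNK);

    // The same thread later does the real work on its (now local) partition.
    for (size_t i = 0; i < n; i++)
        mine[i] = mine[i] * 2.0 + 1.0;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];

    // Reserve address space only; initializing here would place every page on
    // the allocating thread's node (the anti-pattern described above).
    data = malloc(NTHREADS * CHUNK);
    if (!data) return 1;

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    free(data);
    return 0;
}
```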
Think of a NUMA system as a collection of servers connected by a fast network, rather than a single unified machine. Each NUMA node has its own memory, and crossing node boundaries is like crossing a network boundary—possible, but with latency overhead. Design your software accordingly.
Operating systems play a critical role in managing NUMA systems. The OS must discover the hardware topology at boot (for example, from the ACPI SLIT), schedule threads close to the memory they use, place newly allocated pages sensibly, and expose placement controls and statistics to applications.
Modern operating systems (Linux, Windows, FreeBSD) all have NUMA support, though the depth and configurability varies.
| Feature | Linux | Windows | Purpose |
|---|---|---|---|
| NUMA-aware scheduler | ✓ Full support | ✓ Full support | Keep threads near their memory |
| Memory placement APIs | numactl, libnuma | NUMA API functions | Application-controlled placement |
| Automatic page migration | numad, AutoNUMA | Memory partition | Rebalance poorly placed pages |
| NUMA statistics | /proc/vmstat, numastat | Performance counters | Monitor local vs remote access |
| Huge page NUMA placement | ✓ Supported | ✓ Large pages | Reduce TLB misses with locality |
Linux's Approach: Default and Tunable
Linux uses several default policies that work well for many workloads: first-touch (local) page allocation, a NUMA-aware scheduler that tries to keep a task on the node holding its memory, and optional automatic NUMA balancing (AutoNUMA) that migrates pages toward the threads using them.
These defaults are reasonable, but high-performance applications typically override them with explicit NUMA control.
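To give a sense of what that explicit control looks like from C, here is a hedged sketch using a few libnuma calls (link with -lnuma); the node number and buffer size are illustrative. The shell-level equivalents follow.

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    size_t len = 1UL << 30;  // 1 GB, illustrative

    numa_set_localalloc();                        // prefer the calling thread's node
    void *on_node1 = numa_alloc_onnode(len, 1);   // force pages onto node 1
    void *striped  = numa_alloc_interleaved(len); // stripe pages across all nodes

    if (on_node1) numa_free(on_node1, len);
    if (striped)  numa_free(striped, len);
    return 0;
}
```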
```bash
# View NUMA statistics
numastat
# Output shows memory allocated per node and access patterns

# Run a command bound to specific NUMA nodes
numactl --cpunodebind=0 --membind=0 ./my_application

# Interleave memory across all nodes (for shared data structures)
numactl --interleave=all ./my_application

# Check if NUMA balancing is enabled
cat /proc/sys/kernel/numa_balancing
# 1 = enabled, 0 = disabled

# View per-process NUMA memory usage
cat /proc/<pid>/numa_maps

# Monitor NUMA page faults and migrations
watch -n 1 'grep numa /proc/vmstat'
```

AutoNUMA (numa_balancing) migrates pages to match thread locality automatically. While helpful for untuned applications, it can cause problems for applications that carefully place their memory. The migration itself consumes CPU and can cause latency spikes. Database vendors often recommend disabling AutoNUMA and using explicit NUMA placement instead.
We've established the foundation of NUMA architecture. Let's consolidate the key concepts before moving forward:
- SMP/UMA stops scaling because the shared bus, single memory controller, and coherency broadcasts become bottlenecks.
- NUMA gives each processor (or group of processors) its own memory controller and local memory, with sockets linked by a point-to-point interconnect.
- All memory remains accessible from every processor; NUMA affects performance, not correctness.
- Remote access is typically 1.5x to 3x slower than local access (the NUMA ratio), and the OS learns the topology from distance tables such as the ACPI SLIT.
- Locality, first-touch-aware allocation, and explicit placement controls are what keep NUMA systems fast.
What's Next:
In the next page, we'll dive deeper into NUMA Nodes—the fundamental unit of NUMA architecture. We'll explore how nodes are structured, what resources they contain, and how to query and work with nodes programmatically.
You now understand why NUMA exists, how it differs from SMP, and the fundamental principles of memory locality that drive NUMA-aware system design. Next, we'll examine NUMA nodes in detail—the building blocks of the NUMA architecture.