If NUMA is the architecture, then NUMA nodes are its fundamental building blocks. A NUMA node represents a coherent unit of memory and processing resources with uniform access characteristics. Understanding nodes is essential because every memory allocation decision, every thread scheduling decision, and every performance optimization in a NUMA system operates at the node level.
In this page, we'll dissect what constitutes a NUMA node, examine how they're discovered and managed by the operating system, and explore the deep implications of node boundaries on system behavior.
By the end of this page, you will understand the anatomy of a NUMA node, how to query node configuration on Linux and Windows systems, how nodes relate to physical hardware (sockets, chiplets, memory controllers), and the critical role nodes play in both OS internals and application design.
A NUMA node is a logical grouping of resources that share uniform memory access characteristics. The exact composition varies by hardware platform, but a typical node contains:
Core Components:

- A set of CPU cores (typically one socket, or one chiplet/die within a socket)
- One or more integrated memory controllers
- The local DRAM (DIMMs) attached to those controllers
- The caches shared by those cores
- Interconnect links to other nodes (e.g., Intel UPI or AMD Infinity Fabric)
The Locality Boundary:
The defining characteristic of a NUMA node is the locality boundary: all memory within a node is equidistant from all cores within that node. Cross this boundary, and access latency increases.
This boundary is physical, not arbitrary. It's determined by:

- Which memory controller serves a given set of DIMMs
- The physical wiring between cores and their local memory controllers
- The interconnect topology between sockets (or between chiplets within a socket)
Node Size Variation:
Node size varies dramatically across systems:
| System Type | Typical Node Composition | Example |
|---|---|---|
| Dual-socket Xeon | 1 socket = 1 node | 32 cores, 256 GB per node |
| AMD EPYC (NPS1) | 1 socket = 1 node | 64 cores, 512 GB per node |
| AMD EPYC (NPS4) | 1/4 socket = 1 node | 16 cores, 128 GB per node |
| Intel SNC-2 | 1/2 socket = 1 node | 16 cores, 128 GB per node |
| Large SGI/HPE | Multiple sockets per node | 8+ sockets, 2+ TB per node |
On many server platforms, BIOS/UEFI settings control how many NUMA nodes are exposed. AMD's 'NPS' (Nodes Per Socket) setting can divide a single socket into 1, 2, or 4 NUMA nodes. Intel's 'SNC' (Sub-NUMA Clustering) can split a socket into 2 nodes. Exposing more, smaller nodes adds NUMA complexity but lowers local memory latency; exposing fewer, larger nodes simplifies placement at the cost of higher average latency.
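After changing NPS or SNC settings, it's worth confirming what the OS actually sees. One quick check is `numactl --hardware`, which prints the node count, per-node CPUs and memory, and the distance matrix; the output below is illustrative, for a hypothetical single-socket EPYC system in NPS4 mode.

```bash
# Show the NUMA layout the firmware exposes to the OS
numactl --hardware

# Illustrative output (single EPYC socket configured as NPS4):
# available: 4 nodes (0-3)
# node 0 cpus: 0 1 2 ... 15
# node 0 size: 128000 MB
# node 0 free: 121344 MB
# ...
# node distances:
# node   0   1   2   3
#   0:  10  12  12  12
#   1:  12  10  12  12
#   2:  12  12  10  12
#   3:  12  12  12  10
```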
Before an operating system can manage NUMA effectively, it must discover the NUMA topology from hardware. This discovery happens early in boot and uses standardized mechanisms.
ACPI Tables for NUMA:
The Advanced Configuration and Power Interface (ACPI) provides several tables for NUMA discovery:
- SRAT (System Resource Affinity Table): maps each CPU and each physical memory range to a proximity domain (a NUMA node)
- SLIT (System Locality Information Table): a matrix of relative distances between proximity domains
- HMAT (Heterogeneous Memory Attribute Table): detailed latency and bandwidth attributes, used for heterogeneous and tiered memory

The OS parses these tables during boot to build its internal NUMA representation.
```bash
# View ACPI SRAT table (requires root)
cat /sys/firmware/acpi/tables/SRAT | xxd | head -100

# More user-friendly: use iasl to decompile
# (Install acpica-tools package first)
sudo cp /sys/firmware/acpi/tables/SRAT /tmp/
sudo iasl -d /tmp/SRAT
cat /tmp/SRAT.dsl

# Sample SRAT content (decompiled):
# [Memory Affinity Structure]
#   Proximity Domain: 0x00000000
#   Base Address:     0x0000000000000000
#   Length:           0x0000000080000000 (2 GB)
#   Flags:            0x00000001 (Enabled)
#
# [Processor Local x2APIC Affinity Structure]
#   Proximity Domain: 0x00000000
#   X2APIC ID:        0x00000000
#   Flags:            0x00000001 (Enabled)

# Quick NUMA overview using lscpu
lscpu | grep -i numa
# NUMA node(s):        4
# NUMA node0 CPU(s):   0-15
# NUMA node1 CPU(s):   16-31
# NUMA node2 CPU(s):   32-47
# NUMA node3 CPU(s):   48-63
```

Linux sysfs NUMA Interface:
Once discovered, Linux exposes NUMA information through the sysfs filesystem. This is the primary interface for querying node configuration:
```bash
# Navigate to NUMA node information
ls /sys/devices/system/node/
# Output: has_cpu  has_memory  has_normal_memory  node0  node1  online  possible

# Examine a specific node
ls /sys/devices/system/node/node0/
# Output: compact  cpulist  cpumap  distance  hugepages  meminfo
#         numastat  vmstat  ...

# Get CPUs in this node (as a list)
cat /sys/devices/system/node/node0/cpulist
# Output: 0-15

# Get CPUs as a bitmask
cat /sys/devices/system/node/node0/cpumap
# Output: 0000ffff

# Memory information for this node
cat /sys/devices/system/node/node0/meminfo
# Output:
# Node 0 MemTotal:   65799168 kB
# Node 0 MemFree:    48234560 kB
# Node 0 MemUsed:    17564608 kB
# Node 0 Active:      8234567 kB
# Node 0 Inactive:    4567890 kB
# ... (many more fields)

# NUMA distance from this node to all nodes
cat /sys/devices/system/node/node0/distance
# Output: 10 21 31 21   (distance to nodes 0, 1, 2, 3)

# Detailed NUMA statistics
cat /sys/devices/system/node/node0/numastat
# Output:
# numa_hit 123456789
# numa_miss 1234567
# numa_foreign 987654
# interleave_hit 12345
# local_node 123456789
# other_node 1234567
```

- `numa_hit`: memory was successfully allocated on this node, as intended.
- `numa_miss`: memory was allocated on this node even though the process preferred a different node (the preferred node was full).
- `numa_foreign`: memory intended for this node was allocated on another node instead.
- `local_node`: a process running on this node's CPUs got memory from this node.
- `other_node`: a process running on another node's CPUs got memory from this node.

A healthy NUMA system has high hit/local ratios and low miss/foreign counts.
Each NUMA node contains a portion of the system's physical memory. The operating system must track which physical pages belong to which node, as this determines the locality of every memory access.
Physical Memory Zones per Node:
Linux divides physical memory into zones based on address range and capability:
| Zone | Address Range | Purpose |
|---|---|---|
| ZONE_DMA | 0-16 MB | Legacy ISA DMA compatibility |
| ZONE_DMA32 | 0-4 GB | 32-bit device DMA |
| ZONE_NORMAL | 4 GB - end | General-purpose memory |
| ZONE_MOVABLE | Varies | Memory that can be hot-removed |
In a NUMA system, each node has its own set of zones. A 4-node system with 64 GB per node might have 4 ZONE_NORMAL zones, each containing 64 GB of pages.
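You can see this zone-per-node layout directly on a running system. A small sketch using `/proc/zoneinfo` and `/proc/buddyinfo` (the numbers shown are illustrative):

```bash
# List which zones exist on each node
grep -E '^Node [0-9]+, zone' /proc/zoneinfo
# Node 0, zone      DMA
# Node 0, zone    DMA32
# Node 0, zone   Normal
# Node 1, zone   Normal
# ...

# Free memory per node and zone, broken down by buddy-allocator order
cat /proc/buddyinfo
# Node 0, zone   Normal   1046    824    412    198  ...
# Node 1, zone   Normal    987    765    398    210  ...
```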
Page Frame Ownership:
Every physical page frame in the system 'belongs' to exactly one NUMA node. The kernel's struct page (the descriptor for each physical page) includes a node ID field that identifies ownership.
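Conceptually, that ownership lookup is just a bit-field extraction. The sketch below is self-contained and only illustrates the idea; in the real kernel the node id is packed into `page->flags` and read with `page_to_nid()`, and the exact bit positions differ.

```c
// Self-contained sketch of a page descriptor that records its owning node.
// The bit layout here is illustrative, not the kernel's actual layout.
#include <stdio.h>

#define NODE_SHIFT 56           // illustrative: node id kept in the top bits
#define NODE_MASK  0xFFUL

struct page_sketch {
    unsigned long flags;        // node id packed alongside other page state
};

static int page_to_nid_sketch(const struct page_sketch *page) {
    // Extract the node id bits from the flags word
    return (int)((page->flags >> NODE_SHIFT) & NODE_MASK);
}

int main(void) {
    // A page that "belongs" to node 2, with an unrelated flag bit set
    struct page_sketch p = { .flags = (2UL << NODE_SHIFT) | 0x1 };
    printf("page belongs to node %d\n", page_to_nid_sketch(&p));
    return 0;
}
```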
When allocating memory, the kernel's page allocator starts with the preferred node's zones. If those zones lack free pages, it falls back to other nodes based on a zonelist—an ordered list of zones to try.
Zone Lists and Fallback Order:
Each NUMA node maintains a zonelist that defines fallback order when local memory is unavailable:
```c
// Conceptual zonelist for Node 0 in a 4-node system
// The kernel builds this during boot based on NUMA distances

struct zonelist node0_zonelist = {
    .zones = {
        // First try: local zones (distance 10)
        &node0_ZONE_NORMAL,
        &node0_ZONE_DMA32,
        &node0_ZONE_DMA,

        // Then: nearest remote nodes (distance 21)
        &node1_ZONE_NORMAL,
        &node3_ZONE_NORMAL,

        // Finally: farthest remote nodes (distance 31)
        &node2_ZONE_NORMAL,

        // Null terminator
        NULL
    }
};

// During allocation:
// 1. Kernel scans zonelist in order
// 2. First zone with free pages and correct properties wins
// 3. If all zones exhausted, allocation fails or triggers reclaim
```

When local memory is exhausted, allocations silently fall back to remote nodes. The application continues working—but with degraded performance. There's no error, no exception, just slower memory access. This is why monitoring numa_miss and other_node statistics is critical in production NUMA systems.
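A minimal monitoring sketch built on the per-node `numastat` files shown earlier; it prints the counters whose growth indicates cross-node spills (the system-wide `numastat` command reports the same counters in table form):

```bash
# Print numa_miss and numa_foreign for every node; steadily rising values
# mean allocations are spilling across node boundaries
for f in /sys/devices/system/node/node*/numastat; do
    node=$(basename "$(dirname "$f")")
    miss=$(awk '/^numa_miss/ {print $2}' "$f")
    foreign=$(awk '/^numa_foreign/ {print $2}' "$f")
    echo "$node: numa_miss=$miss numa_foreign=$foreign"
done
```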
Each CPU (logical processor) in a NUMA system belongs to exactly one NUMA node. This mapping is fundamental—it determines which memory is 'local' for each thread.
Mapping Discovery:
The kernel discovers CPU-to-node mappings from ACPI SRAT tables during boot. Applications can query this information through several interfaces:
```c
#define _GNU_SOURCE   // for sched_getcpu()
#include <numa.h>
#include <sched.h>
#include <stdio.h>

void print_cpu_node_mapping() {
    int num_cpus = numa_num_configured_cpus();
    int num_nodes = numa_num_configured_nodes();

    printf("System has %d CPUs across %d NUMA nodes\n\n", num_cpus, num_nodes);

    // Print which node each CPU belongs to
    printf("CPU -> Node mapping:\n");
    for (int cpu = 0; cpu < num_cpus; cpu++) {
        int node = numa_node_of_cpu(cpu);
        printf("  CPU %3d -> Node %d\n", cpu, node);
    }
    printf("\n");

    // Print which CPUs belong to each node
    printf("Node -> CPU mapping:\n");
    for (int node = 0; node < num_nodes; node++) {
        struct bitmask *cpumask = numa_allocate_cpumask();
        numa_node_to_cpus(node, cpumask);

        printf("  Node %d: CPUs ", node);
        for (int cpu = 0; cpu < num_cpus; cpu++) {
            if (numa_bitmask_isbitset(cpumask, cpu)) {
                printf("%d ", cpu);
            }
        }
        printf("\n");

        numa_free_cpumask(cpumask);
    }
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    print_cpu_node_mapping();

    // What node is this thread currently running on?
    int current_cpu = sched_getcpu();
    int current_node = numa_node_of_cpu(current_cpu);
    printf("\nCurrent thread: CPU %d, Node %d\n", current_cpu, current_node);

    return 0;
}
```

Implications for Thread Scheduling:
The CPU-to-node mapping profoundly affects the scheduler. When a thread is ready to run:

- The scheduler prefers a CPU in the node where the thread last ran, since its working set is likely in that node's caches and memory
- Idle CPUs within the same node are considered before CPUs on remote nodes
- Migration across node boundaries happens only when the load imbalance is significant, because it turns the thread's local memory into remote memory
Linux's scheduler uses scheduling domains that align with NUMA boundaries. Load balancing within a node is aggressive; load balancing across nodes is conservative.
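If you want to inspect these domains yourself, recent kernels expose them through debugfs (older kernels used `/proc/sys/kernel/sched_domain` instead); the path and availability depend on the kernel version and configuration, so treat this as a sketch:

```bash
# Scheduling-domain hierarchy for CPU 0 (requires root, a mounted debugfs,
# and a kernel built with scheduler debug support; paths vary by version)
sudo cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
# Typical output on a multi-socket machine:
# SMT
# MC
# NUMA
```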
With hyper-threading, each physical core appears as 2 logical CPUs. Both logical CPUs are on the same NUMA node (they share the same physical core and memory controller). A 4-node system with 8 cores per socket and hyper-threading has 64 logical CPUs: 16 per node. Always verify the actual topology—don't assume!
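A quick way to do that verification is `lscpu --extended`, which prints one row per logical CPU with its node, socket, and physical core (the output below is illustrative, and column availability varies slightly between lscpu versions):

```bash
# One row per logical CPU: which node, socket, and physical core it belongs to
lscpu --extended=CPU,NODE,SOCKET,CORE
# CPU NODE SOCKET CORE
#   0    0      0    0
#   1    0      0    1
# ...
#  32    0      0    0    <- hyper-thread sibling of CPU 0, same node
```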
While most NUMA nodes contain both CPUs and memory, modern systems can have memory-only nodes—NUMA nodes that contain memory but no processors. These arise from memory technologies that add capacity without adding CPUs:
| Memory Type | Typical Latency | Bandwidth | Persistence | Use Case |
|---|---|---|---|---|
| Standard DRAM | ~80 ns | ~25 GB/s/channel | No | General purpose |
| Intel Optane PMEM | ~300 ns | ~8 GB/s | Yes | Large datasets, databases |
| HBM (on package) | ~40 ns | ~1 TB/s | No | GPUs, accelerators |
| CXL Memory | ~150-200 ns | ~16 GB/s | Depends | Memory expansion |
Operating System Handling:
Memory-only nodes present challenges because threads can't run 'locally'—there are no local CPUs. The OS handles this by:

- Associating each memory-only node with its nearest CPU-bearing node(s) for allocation and reclaim decisions
- Exposing the node through the normal zonelists, so memory there can be used explicitly or as fallback
- On recent Linux kernels, treating it as a slower memory tier and demoting cold pages to it (memory tiering)
```bash
# Identify memory-only nodes on Linux
# Check which nodes have CPUs
for node in /sys/devices/system/node/node*/cpulist; do
    node_num=$(echo $node | grep -o 'node[0-9]*')
    cpus=$(cat $node)
    if [ -z "$cpus" ]; then
        echo "$node_num: MEMORY-ONLY (no CPUs)"
    else
        echo "$node_num: CPUs $cpus"
    fi
done

# Example output on system with PMEM:
# node0: CPUs 0-31
# node1: CPUs 32-63
# node2: MEMORY-ONLY (no CPUs)   <- PMEM on Node 2
# node3: MEMORY-ONLY (no CPUs)   <- PMEM on Node 3

# Query memory type (requires kernel 5.5+ with HMAT support)
cat /sys/devices/system/node/node2/access0/initiators/node0/read_latency
# Output: 350 (relative latency value)

# Allocate specifically on memory-only node
numactl --membind=2 ./my_application

# Or use explicit tiering in application
# (requires application-level memory management)
```

CXL (Compute Express Link) is driving rapid growth in memory-only NUMA nodes. CXL allows memory pooling across servers and attaching large memory capacity via PCIe-like fabrics. Operating systems are actively evolving to handle these configurations—Linux kernel 6.x includes significant CXL and tiered memory support.
Applications and administrators can explicitly control which NUMA nodes a process uses. This node binding is the primary mechanism for NUMA optimization.
Types of Binding:
| Binding Type | What It Controls | API/Tool |
|---|---|---|
| CPU Binding | Which CPUs can execute threads | sched_setaffinity(), numactl --cpunodebind |
| Memory Binding | Where memory is allocated | numa_alloc_onnode(), numactl --membind |
| Preferred | Preferred node (with fallback) | numa_set_preferred(), numactl --preferred |
| Interleave | Round-robin across nodes | numa_set_interleave_mask(), numactl --interleave |
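The same policies from the table, expressed as `numactl` invocations (`./my_application` is a placeholder for your own binary):

```bash
# CPU binding: run only on node 0's CPUs (memory follows the default policy)
numactl --cpunodebind=0 ./my_application

# CPU + strict memory binding: allocate only from node 0 (fails if it runs out)
numactl --cpunodebind=0 --membind=0 ./my_application

# Preferred: try node 1 first, fall back to other nodes if it is full
numactl --preferred=1 ./my_application

# Interleave: spread pages round-robin across all nodes
numactl --interleave=all ./my_application

# Show the policy and node mask the current shell would pass on
numactl --show
```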
```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GB (1024UL * 1024 * 1024)

void example_strict_binding() {
    /*
     * Strict binding: Memory MUST come from the specified node.
     * Allocation fails if the node has insufficient memory.
     */
    int target_node = 0;
    size_t size = 1 * GB;

    // Allocate 1 GB strictly on node 0
    void *buffer = numa_alloc_onnode(size, target_node);
    if (!buffer) {
        fprintf(stderr, "Failed to allocate on node %d\n", target_node);
        return;
    }

    printf("Allocated %zu bytes strictly on node %d\n", size, target_node);

    // First-touch to ensure pages are materialized
    memset(buffer, 0, size);

    numa_free(buffer, size);
}

void example_preferred_binding() {
    /*
     * Preferred binding: Try the specified node first, fall back if needed.
     * More flexible, but may result in remote allocations.
     */
    size_t size = 1 * GB;

    // Set preferred node (affects subsequent allocations)
    numa_set_preferred(1);  // Prefer node 1

    // Standard malloc now prefers node 1
    void *buffer = malloc(size);
    memset(buffer, 0, size);  // First-touch

    printf("Allocated %zu bytes with preference for node 1\n", size);

    free(buffer);

    // Reset to local allocation
    numa_set_preferred(-1);
}

void example_interleaved_allocation() {
    /*
     * Interleaved: Spread pages round-robin across the specified nodes.
     * Good for shared data accessed by all nodes equally.
     */
    size_t size = 4 * GB;

    // Interleave across all nodes
    void *buffer = numa_alloc_interleaved(size);
    if (!buffer) {
        fprintf(stderr, "Interleaved allocation failed\n");
        return;
    }

    // First-touch (pages will be distributed across nodes)
    memset(buffer, 0, size);

    printf("Allocated %zu bytes interleaved across all nodes\n", size);

    numa_free(buffer, size);
}

void example_bind_and_run() {
    /*
     * Bind the current thread to a specific node's CPUs,
     * then allocate locally.
     */
    struct bitmask *nodemask = numa_bitmask_alloc(numa_max_node() + 1);
    numa_bitmask_setbit(nodemask, 0);  // Node 0 only

    // Bind thread to run only on node 0's CPUs
    numa_run_on_node_mask(nodemask);

    // Allocate using the local policy (will use node 0)
    void *buffer = numa_alloc_local(1 * GB);

    printf("Thread bound to node 0, allocated 1 GB locally\n");

    numa_free(buffer, 1 * GB);
    numa_bitmask_free(nodemask);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    example_strict_binding();
    example_preferred_binding();
    example_interleaved_allocation();
    example_bind_and_run();

    return 0;
}
```

Over-aggressive binding can backfire. If you bind a process strictly to one node and it needs more memory than that node has, allocations fail (or the OOM killer strikes). Similarly, binding all processes to the same node negates NUMA benefits. Use preferred binding or interleaving for applications with unpredictable memory patterns.
Performance-sensitive applications need to query NUMA node state at runtime—available memory, allocation statistics, and page placement. Here are the key interfaces:
Memory Availability:
```c
#include <numa.h>
#include <numaif.h>   // move_pages()
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void print_node_memory_state() {
    int num_nodes = numa_max_node() + 1;

    printf("NUMA Node Memory Status:\n");
    printf("%-8s %12s %12s %12s\n", "Node", "Total (MB)", "Free (MB)", "Used (MB)");
    printf("%-8s %12s %12s %12s\n", "----", "----------", "----------", "----------");

    for (int node = 0; node < num_nodes; node++) {
        long long total_bytes, free_bytes;

        // Get node memory size
        total_bytes = numa_node_size64(node, &free_bytes);

        if (total_bytes > 0) {
            long long used_bytes = total_bytes - free_bytes;
            printf("Node %-3d %12lld %12lld %12lld\n",
                   node,
                   total_bytes / (1024 * 1024),
                   free_bytes / (1024 * 1024),
                   used_bytes / (1024 * 1024));
        }
    }
}

void print_allocation_policy() {
    struct bitmask *membind = numa_get_membind();
    struct bitmask *interleave = numa_get_interleave_mask();
    int preferred = numa_preferred();

    printf("\nCurrent Memory Policy:\n");

    // Check if we have a preferred node
    if (preferred >= 0) {
        printf("  Preferred node: %d\n", preferred);
    }

    // Print membind mask
    printf("  Membind mask:    ");
    for (int i = 0; i <= numa_max_node(); i++) {
        if (numa_bitmask_isbitset(membind, i)) {
            printf("%d ", i);
        }
    }
    printf("\n");

    // Print interleave mask
    printf("  Interleave mask: ");
    for (int i = 0; i <= numa_max_node(); i++) {
        if (numa_bitmask_isbitset(interleave, i)) {
            printf("%d ", i);
        }
    }
    printf("\n");
}

// Check which node a specific page is on
void check_page_placement(void *addr) {
    int status[1];
    void *pages[1] = { addr };

    // move_pages with a NULL destination reports each page's current node
    if (move_pages(0, 1, pages, NULL, status, 0) == 0) {
        if (status[0] >= 0) {
            printf("Address %p is on node %d\n", addr, status[0]);
        } else if (status[0] == -ENOENT) {
            printf("Address %p is not mapped\n", addr);
        } else {
            printf("Address %p: error %d\n", addr, status[0]);
        }
    }
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    print_node_memory_state();
    print_allocation_policy();

    // Allocate and check placement
    void *test = numa_alloc_onnode(4096, 0);
    memset(test, 0, 4096);   // Materialize the page
    check_page_placement(test);
    numa_free(test, 4096);

    return 0;
}
```

Process-Specific NUMA Statistics:
Linux provides per-process NUMA information through /proc/<pid>/numa_maps:
```bash
# View NUMA memory mapping for a process
cat /proc/$(pidof my_application)/numa_maps

# Sample output (annotated):
# 7f1234560000 default file=/lib/libc.so.6 anon=0 dirty=0 N0=128 N1=0
# 7f1234670000 prefer:0 anon=4096 dirty=4096 N0=4096 N1=0
# 7f1235000000 bind:0 anon=262144 dirty=262144 N0=262144 N1=0
# 7f1240000000 interleave:0-3 anon=1048576 dirty=0 N0=262144 N1=262144 N2=262144 N3=262144

# Fields explained:
# - Address range (hex)
# - Memory policy (default/prefer/bind/interleave)
# - Type (file/anon/stack/heap)
# - Page counts per node (N0=pages, N1=pages, ...)

# Quick summary with numastat
numastat -p $(pidof my_application)

# Output shows per-process NUMA memory usage:
#                 Node 0   Node 1   Node 2   Node 3
# my_application
#   Huge               0        0        0        0
#   Heap            1234      567      890      123
#   Stack              4        0        0        0
#   Private        98765    12345    54321    67890
#   Total         100003    12912    55211    68013
```

For production systems, integrate NUMA metrics into your monitoring stack. Tools like Prometheus node_exporter expose NUMA statistics. Key metrics to track: numa_miss (allocations that could not be satisfied on their preferred node), numa_foreign (allocations intended for a node that were pushed onto another one), and per-node memory utilization imbalance.
We've explored NUMA nodes in depth. Let's consolidate the key concepts:

- A NUMA node is the locality boundary: a set of cores plus the memory that is equally close to all of them
- Node layout is dictated by hardware and firmware (sockets, chiplets, NPS/SNC settings) and discovered from ACPI tables at boot
- Each node has its own memory zones, and a zonelist defines the fallback order when local memory runs out
- Every CPU and every physical page belongs to exactly one node; both the scheduler and the page allocator respect these boundaries
- Memory-only nodes (PMEM, HBM, CXL) contain memory without CPUs and are increasingly common
- Binding policies (strict, preferred, interleave) and interfaces such as sysfs, numastat, and numa_maps let you control and observe placement
What's Next:
In the next page, we'll examine Local vs Remote Access in detail—quantifying the performance difference, understanding the underlying hardware mechanisms, and exploring techniques for minimizing remote access in real applications.
You now understand the fundamental building block of NUMA systems—the NUMA node. You know how nodes are discovered, what they contain, how to query their state, and how to control memory placement. Next, we'll dive into the performance implications of crossing node boundaries.