If NUMA is the architecture, then NUMA nodes are its fundamental building blocks. A NUMA node represents a coherent unit of memory and processing resources with uniform access characteristics. Understanding nodes is essential because every memory allocation decision, every thread scheduling decision, and every performance optimization in a NUMA system operates at the node level.
In this page, we'll dissect what constitutes a NUMA node, examine how they're discovered and managed by the operating system, and explore the deep implications of node boundaries on system behavior.
By the end of this page, you will understand the anatomy of a NUMA node, how to query node configuration on Linux and Windows systems, how nodes relate to physical hardware (sockets, chiplets, memory controllers), and the critical role nodes play in both OS internals and application design.
A NUMA node is a logical grouping of resources that share uniform memory access characteristics. The exact composition varies by hardware platform, but a typical node contains:
Core Components:

- A set of CPU cores (typically one socket, or one chiplet/die within a socket)
- One or more integrated memory controllers
- The local DRAM (DIMMs) attached to those controllers
- The caches shared by those cores
- Interconnect links to other nodes (e.g., Intel UPI or AMD Infinity Fabric)
The Locality Boundary:
The defining characteristic of a NUMA node is the locality boundary: all memory within a node is equidistant from all cores within that node. Cross this boundary, and access latency increases.
This boundary is physical, not arbitrary. It's determined by:

- Which memory controller serves a given set of DIMMs
- The physical wiring between cores and their local memory controllers
- The interconnect topology between sockets (or between chiplets within a socket)
Node Size Variation:
Node size varies dramatically across systems:
| System Type | Typical Node Composition | Example |
|---|---|---|
| Dual-socket Xeon | 1 socket = 1 node | 32 cores, 256 GB per node |
| AMD EPYC (NPS1) | 1 socket = 1 node | 64 cores, 512 GB per node |
| AMD EPYC (NPS4) | 1/4 socket = 1 node | 16 cores, 128 GB per node |
| Intel SNC-2 | 1/2 socket = 1 node | 16 cores, 128 GB per node |
| Large SGI/HPE | Multiple sockets per node | 8+ sockets, 2+ TB per node |
On many server platforms, BIOS/UEFI settings control how many NUMA nodes are exposed. AMD's 'NPS' (Nodes Per Socket) setting can divide a single socket into 1, 2, or 4 NUMA nodes. Intel's 'SNC' (Sub-NUMA Clustering) can split a socket into 2 nodes. Exposing more, smaller nodes adds NUMA complexity but lowers local memory latency; exposing fewer, larger nodes simplifies placement at the cost of higher average latency.
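After changing NPS or SNC settings, it's worth confirming what the OS actually sees. One quick check is `numactl --hardware`, which prints the node count, per-node CPUs and memory, and the distance matrix; the output below is illustrative, for a hypothetical single-socket EPYC system in NPS4 mode.

```bash
# Show the NUMA layout the firmware exposes to the OS
numactl --hardware

# Illustrative output (single EPYC socket configured as NPS4):
# available: 4 nodes (0-3)
# node 0 cpus: 0 1 2 ... 15
# node 0 size: 128000 MB
# node 0 free: 121344 MB
# ...
# node distances:
# node   0   1   2   3
#   0:  10  12  12  12
#   1:  12  10  12  12
#   2:  12  12  10  12
#   3:  12  12  12  10
```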
Before an operating system can manage NUMA effectively, it must discover the NUMA topology from hardware. This discovery happens early in boot and uses standardized mechanisms.
ACPI Tables for NUMA:
The Advanced Configuration and Power Interface (ACPI) provides several tables for NUMA discovery:
- SRAT (System Resource Affinity Table): maps each CPU and each physical memory range to a proximity domain (a NUMA node)
- SLIT (System Locality Information Table): a matrix of relative distances between proximity domains
- HMAT (Heterogeneous Memory Attribute Table): detailed latency and bandwidth attributes, used for heterogeneous and tiered memory

The OS parses these tables during boot to build its internal NUMA representation.
```bash
# View ACPI SRAT table (requires root)
cat /sys/firmware/acpi/tables/SRAT | xxd | head -100

# More user-friendly: use iasl to decompile
# (Install acpica-tools package first)
sudo cp /sys/firmware/acpi/tables/SRAT /tmp/
sudo iasl -d /tmp/SRAT
cat /tmp/SRAT.dsl

# Sample SRAT content (decompiled):
# [Memory Affinity Structure]
#   Proximity Domain: 0x00000000
#   Base Address:     0x0000000000000000
#   Length:           0x0000000080000000 (2 GB)
#   Flags:            0x00000001 (Enabled)
#
# [Processor Local x2APIC Affinity Structure]
#   Proximity Domain: 0x00000000
#   X2APIC ID:        0x00000000
#   Flags:            0x00000001 (Enabled)

# Quick NUMA overview using lscpu
lscpu | grep -i numa
# NUMA node(s):        4
# NUMA node0 CPU(s):   0-15
# NUMA node1 CPU(s):   16-31
# NUMA node2 CPU(s):   32-47
# NUMA node3 CPU(s):   48-63
```

Linux sysfs NUMA Interface:
Once discovered, Linux exposes NUMA information through the sysfs filesystem. This is the primary interface for querying node configuration:
```bash
# Navigate to NUMA node information
ls /sys/devices/system/node/
# Output: has_cpu  has_memory  has_normal_memory  node0  node1  online  possible

# Examine a specific node
ls /sys/devices/system/node/node0/
# Output: compact  cpulist  cpumap  distance  hugepages  meminfo
#         numastat  vmstat  ...

# Get CPUs in this node (as a list)
cat /sys/devices/system/node/node0/cpulist
# Output: 0-15

# Get CPUs as a bitmask
cat /sys/devices/system/node/node0/cpumap
# Output: 0000ffff

# Memory information for this node
cat /sys/devices/system/node/node0/meminfo
# Output:
# Node 0 MemTotal:   65799168 kB
# Node 0 MemFree:    48234560 kB
# Node 0 MemUsed:    17564608 kB
# Node 0 Active:      8234567 kB
# Node 0 Inactive:    4567890 kB
# ... (many more fields)

# NUMA distance from this node to all nodes
cat /sys/devices/system/node/node0/distance
# Output: 10 21 31 21   (distance to nodes 0, 1, 2, 3)

# Detailed NUMA statistics
cat /sys/devices/system/node/node0/numastat
# Output:
# numa_hit 123456789
# numa_miss 1234567
# numa_foreign 987654
# interleave_hit 12345
# local_node 123456789
# other_node 1234567
```

- `numa_hit`: memory was successfully allocated on this node, as intended.
- `numa_miss`: memory was allocated on this node even though the process preferred a different node (the preferred node was full).
- `numa_foreign`: memory intended for this node was allocated on another node instead.
- `local_node`: a process running on this node's CPUs got memory from this node.
- `other_node`: a process running on another node's CPUs got memory from this node.

A healthy NUMA system has high hit/local ratios and low miss/foreign counts.
Each NUMA node contains a portion of the system's physical memory. The operating system must track which physical pages belong to which node, as this determines the locality of every memory access.
Physical Memory Zones per Node:
Linux divides physical memory into zones based on address range and capability:
| Zone | Address Range | Purpose |
|---|---|---|
| ZONE_DMA | 0-16 MB | Legacy ISA DMA compatibility |
| ZONE_DMA32 | 0-4 GB | 32-bit device DMA |
| ZONE_NORMAL | 4 GB - end | General-purpose memory |
| ZONE_MOVABLE | Varies | Memory that can be hot-removed |
In a NUMA system, each node has its own set of zones. A 4-node system with 64 GB per node might have 4 ZONE_NORMAL zones, each containing 64 GB of pages.
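You can see this zone-per-node layout directly on a running system. A small sketch using `/proc/zoneinfo` and `/proc/buddyinfo` (the numbers shown are illustrative):

```bash
# List which zones exist on each node
grep -E '^Node [0-9]+, zone' /proc/zoneinfo
# Node 0, zone      DMA
# Node 0, zone    DMA32
# Node 0, zone   Normal
# Node 1, zone   Normal
# ...

# Free memory per node and zone, broken down by buddy-allocator order
cat /proc/buddyinfo
# Node 0, zone   Normal   1046    824    412    198  ...
# Node 1, zone   Normal    987    765    398    210  ...
```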
Page Frame Ownership:
Every physical page frame in the system 'belongs' to exactly one NUMA node. The kernel's struct page (the descriptor for each physical page) includes a node ID field that identifies ownership.
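Conceptually, that ownership lookup is just a bit-field extraction. The sketch below is self-contained and only illustrates the idea; in the real kernel the node id is packed into `page->flags` and read with `page_to_nid()`, and the exact bit positions differ.

```c
// Self-contained sketch of a page descriptor that records its owning node.
// The bit layout here is illustrative, not the kernel's actual layout.
#include <stdio.h>

#define NODE_SHIFT 56           // illustrative: node id kept in the top bits
#define NODE_MASK  0xFFUL

struct page_sketch {
    unsigned long flags;        // node id packed alongside other page state
};

static int page_to_nid_sketch(const struct page_sketch *page) {
    // Extract the node id bits from the flags word
    return (int)((page->flags >> NODE_SHIFT) & NODE_MASK);
}

int main(void) {
    // A page that "belongs" to node 2, with an unrelated flag bit set
    struct page_sketch p = { .flags = (2UL << NODE_SHIFT) | 0x1 };
    printf("page belongs to node %d\n", page_to_nid_sketch(&p));
    return 0;
}
```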
When allocating memory, the kernel's page allocator starts with the preferred node's zones. If those zones lack free pages, it falls back to other nodes based on a zonelist—an ordered list of zones to try.
Zone Lists and Fallback Order:
Each NUMA node maintains a zonelist that defines fallback order when local memory is unavailable:
```c
// Conceptual zonelist for Node 0 in a 4-node system
// The kernel builds this during boot based on NUMA distances

struct zonelist node0_zonelist = {
    .zones = {
        // First try: local zones (distance 10)
        &node0_ZONE_NORMAL,
        &node0_ZONE_DMA32,
        &node0_ZONE_DMA,

        // Then: nearest remote nodes (distance 21)
        &node1_ZONE_NORMAL,
        &node3_ZONE_NORMAL,

        // Finally: farthest remote nodes (distance 31)
        &node2_ZONE_NORMAL,

        // Null terminator
        NULL
    }
};

// During allocation:
// 1. Kernel scans zonelist in order
// 2. First zone with free pages and correct properties wins
// 3. If all zones exhausted, allocation fails or triggers reclaim
```

When local memory is exhausted, allocations silently fall back to remote nodes. The application continues working—but with degraded performance. There's no error, no exception, just slower memory access. This is why monitoring numa_miss and other_node statistics is critical in production NUMA systems.
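A minimal monitoring sketch built on the per-node `numastat` files shown earlier; it prints the counters whose growth indicates cross-node spills (the system-wide `numastat` command reports the same counters in table form):

```bash
# Print numa_miss and numa_foreign for every node; steadily rising values
# mean allocations are spilling across node boundaries
for f in /sys/devices/system/node/node*/numastat; do
    node=$(basename "$(dirname "$f")")
    miss=$(awk '/^numa_miss/ {print $2}' "$f")
    foreign=$(awk '/^numa_foreign/ {print $2}' "$f")
    echo "$node: numa_miss=$miss numa_foreign=$foreign"
done
```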
Each CPU (logical processor) in a NUMA system belongs to exactly one NUMA node. This mapping is fundamental—it determines which memory is 'local' for each thread.
Mapping Discovery:
The kernel discovers CPU-to-node mappings from ACPI SRAT tables during boot. Applications can query this information through several interfaces:
```c
#define _GNU_SOURCE   // for sched_getcpu()
#include <numa.h>
#include <sched.h>
#include <stdio.h>

void print_cpu_node_mapping() {
    int num_cpus = numa_num_configured_cpus();
    int num_nodes = numa_num_configured_nodes();

    printf("System has %d CPUs across %d NUMA nodes\n\n", num_cpus, num_nodes);

    // Print which node each CPU belongs to
    printf("CPU -> Node mapping:\n");
    for (int cpu = 0; cpu < num_cpus; cpu++) {
        int node = numa_node_of_cpu(cpu);
        printf("  CPU %3d -> Node %d\n", cpu, node);
    }
    printf("\n");

    // Print which CPUs belong to each node
    printf("Node -> CPU mapping:\n");
    for (int node = 0; node < num_nodes; node++) {
        struct bitmask *cpumask = numa_allocate_cpumask();
        numa_node_to_cpus(node, cpumask);

        printf("  Node %d: CPUs ", node);
        for (int cpu = 0; cpu < num_cpus; cpu++) {
            if (numa_bitmask_isbitset(cpumask, cpu)) {
                printf("%d ", cpu);
            }
        }
        printf("\n");

        numa_free_cpumask(cpumask);
    }
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    print_cpu_node_mapping();

    // What node is this thread currently running on?
    int current_cpu = sched_getcpu();
    int current_node = numa_node_of_cpu(current_cpu);
    printf("\nCurrent thread: CPU %d, Node %d\n", current_cpu, current_node);

    return 0;
}
```

Implications for Thread Scheduling:
The CPU-to-node mapping profoundly affects the scheduler. When a thread is ready to run:

- The scheduler prefers a CPU in the node where the thread last ran, since its working set is likely in that node's caches and memory
- Idle CPUs within the same node are considered before CPUs on remote nodes
- Migration across node boundaries happens only when the load imbalance is significant, because it turns the thread's local memory into remote memory
Linux's scheduler uses scheduling domains that align with NUMA boundaries. Load balancing within a node is aggressive; load balancing across nodes is conservative.
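If you want to inspect these domains yourself, recent kernels expose them through debugfs (older kernels used `/proc/sys/kernel/sched_domain` instead); the path and availability depend on the kernel version and configuration, so treat this as a sketch:

```bash
# Scheduling-domain hierarchy for CPU 0 (requires root, a mounted debugfs,
# and a kernel built with scheduler debug support; paths vary by version)
sudo cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
# Typical output on a multi-socket machine:
# SMT
# MC
# NUMA
```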
With hyper-threading, each physical core appears as 2 logical CPUs. Both logical CPUs are on the same NUMA node (they share the same physical core and memory controller). A 4-node system with 8 cores per socket and hyper-threading has 64 logical CPUs: 16 per node. Always verify the actual topology—don't assume!
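A quick way to do that verification is `lscpu --extended`, which prints one row per logical CPU with its node, socket, and physical core (the output below is illustrative, and column availability varies slightly between lscpu versions):

```bash
# One row per logical CPU: which node, socket, and physical core it belongs to
lscpu --extended=CPU,NODE,SOCKET,CORE
# CPU NODE SOCKET CORE
#   0    0      0    0
#   1    0      0    1
# ...
#  32    0      0    0    <- hyper-thread sibling of CPU 0, same node
```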
While most NUMA nodes contain both CPUs and memory, modern systems can have memory-only nodes—NUMA nodes that contain memory but no processors. These arise from memory technologies that add capacity without adding CPUs:
| Memory Type | Typical Latency | Bandwidth | Persistence | Use Case |
|---|---|---|---|---|
| Standard DRAM | ~80 ns | ~25 GB/s/channel | No | General purpose |
| Intel Optane PMEM | ~300 ns | ~8 GB/s | Yes | Large datasets, databases |
| HBM (on package) | ~40 ns | ~1 TB/s | No | GPUs, accelerators |
| CXL Memory | ~150-200 ns | ~16 GB/s | Depends | Memory expansion |
Operating System Handling:
Memory-only nodes present challenges because threads can't run 'locally'—there are no local CPUs. The OS handles this by:

- Associating each memory-only node with its nearest CPU-bearing node(s) for allocation and reclaim decisions
- Exposing the node through the normal zonelists, so memory there can be used explicitly or as fallback
- On recent Linux kernels, treating it as a slower memory tier and demoting cold pages to it (memory tiering)
```bash
# Identify memory-only nodes on Linux
# Check which nodes have CPUs
for node in /sys/devices/system/node/node*/cpulist; do
    node_num=$(echo $node | grep -o 'node[0-9]*')
    cpus=$(cat $node)
    if [ -z "$cpus" ]; then
        echo "$node_num: MEMORY-ONLY (no CPUs)"
    else
        echo "$node_num: CPUs $cpus"
    fi
done

# Example output on system with PMEM:
# node0: CPUs 0-31
# node1: CPUs 32-63
# node2: MEMORY-ONLY (no CPUs)   <- PMEM on Node 2
# node3: MEMORY-ONLY (no CPUs)   <- PMEM on Node 3

# Query memory type (requires kernel 5.5+ with HMAT support)
cat /sys/devices/system/node/node2/access0/initiators/node0/read_latency
# Output: 350 (relative latency value)

# Allocate specifically on memory-only node
numactl --membind=2 ./my_application

# Or use explicit tiering in application
# (requires application-level memory management)
```

CXL (Compute Express Link) is driving rapid growth in memory-only NUMA nodes. CXL allows memory pooling across servers and attaching large memory capacity via PCIe-like fabrics. Operating systems are actively evolving to handle these configurations—Linux kernel 6.x includes significant CXL and tiered memory support.
Applications and administrators can explicitly control which NUMA nodes a process uses. This node binding is the primary mechanism for NUMA optimization.
Types of Binding:
| Binding Type | What It Controls | API/Tool |
|---|---|---|
| CPU Binding | Which CPUs can execute threads | sched_setaffinity(), numactl --cpunodebind |
| Memory Binding | Where memory is allocated | numa_alloc_onnode(), numactl --membind |
| Preferred | Preferred node (with fallback) | numa_set_preferred(), numactl --preferred |
| Interleave | Round-robin across nodes | numa_set_interleave_mask(), numactl --interleave |
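The same policies from the table, expressed as `numactl` invocations (`./my_application` is a placeholder for your own binary):

```bash
# CPU binding: run only on node 0's CPUs (memory follows the default policy)
numactl --cpunodebind=0 ./my_application

# CPU + strict memory binding: allocate only from node 0 (fails if it runs out)
numactl --cpunodebind=0 --membind=0 ./my_application

# Preferred: try node 1 first, fall back to other nodes if it is full
numactl --preferred=1 ./my_application

# Interleave: spread pages round-robin across all nodes
numactl --interleave=all ./my_application

# Show the policy and node mask the current shell would pass on
numactl --show
```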
```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GB (1024UL * 1024 * 1024)

void example_strict_binding() {
    /*
     * Strict binding: Memory MUST come from the specified node.
     * Allocation fails if the node has insufficient memory.
     */
    int target_node = 0;
    size_t size = 1 * GB;

    // Allocate 1 GB strictly on node 0
    void *buffer = numa_alloc_onnode(size, target_node);
    if (!buffer) {
        fprintf(stderr, "Failed to allocate on node %d\n", target_node);
        return;
    }

    printf("Allocated %zu bytes strictly on node %d\n", size, target_node);

    // First-touch to ensure pages are materialized
    memset(buffer, 0, size);

    numa_free(buffer, size);
}

void example_preferred_binding() {
    /*
     * Preferred binding: Try the specified node first, fall back if needed.
     * More flexible, but may result in remote allocations.
     */
    size_t size = 1 * GB;

    // Set preferred node (affects subsequent allocations)
    numa_set_preferred(1);  // Prefer node 1

    // Standard malloc now prefers node 1
    void *buffer = malloc(size);
    memset(buffer, 0, size);  // First-touch

    printf("Allocated %zu bytes with preference for node 1\n", size);

    free(buffer);

    // Reset to local allocation
    numa_set_preferred(-1);
}

void example_interleaved_allocation() {
    /*
     * Interleaved: Spread pages round-robin across the specified nodes.
     * Good for shared data accessed by all nodes equally.
     */
    size_t size = 4 * GB;

    // Interleave across all nodes
    void *buffer = numa_alloc_interleaved(size);
    if (!buffer) {
        fprintf(stderr, "Interleaved allocation failed\n");
        return;
    }

    // First-touch (pages will be distributed across nodes)
    memset(buffer, 0, size);

    printf("Allocated %zu bytes interleaved across all nodes\n", size);

    numa_free(buffer, size);
}

void example_bind_and_run() {
    /*
     * Bind the current thread to a specific node's CPUs,
     * then allocate locally.
     */
    struct bitmask *nodemask = numa_bitmask_alloc(numa_max_node() + 1);
    numa_bitmask_setbit(nodemask, 0);  // Node 0 only

    // Bind thread to run only on node 0's CPUs
    numa_run_on_node_mask(nodemask);

    // Allocate using the local policy (will use node 0)
    void *buffer = numa_alloc_local(1 * GB);

    printf("Thread bound to node 0, allocated 1 GB locally\n");

    numa_free(buffer, 1 * GB);
    numa_bitmask_free(nodemask);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    example_strict_binding();
    example_preferred_binding();
    example_interleaved_allocation();
    example_bind_and_run();

    return 0;
}
```

Over-aggressive binding can backfire. If you bind a process strictly to one node and it needs more memory than that node has, allocations fail (or the OOM killer strikes). Similarly, binding all processes to the same node negates NUMA benefits. Use preferred binding or interleaving for applications with unpredictable memory patterns.
Performance-sensitive applications need to query NUMA node state at runtime—available memory, allocation statistics, and page placement. Here are the key interfaces:
Memory Availability:
```c
#include <numa.h>
#include <numaif.h>   // move_pages()
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void print_node_memory_state() {
    int num_nodes = numa_max_node() + 1;

    printf("NUMA Node Memory Status:\n");
    printf("%-8s %12s %12s %12s\n", "Node", "Total (MB)", "Free (MB)", "Used (MB)");
    printf("%-8s %12s %12s %12s\n", "----", "----------", "----------", "----------");

    for (int node = 0; node < num_nodes; node++) {
        long long total_bytes, free_bytes;

        // Get node memory size
        total_bytes = numa_node_size64(node, &free_bytes);

        if (total_bytes > 0) {
            long long used_bytes = total_bytes - free_bytes;
            printf("Node %-3d %12lld %12lld %12lld\n",
                   node,
                   total_bytes / (1024 * 1024),
                   free_bytes / (1024 * 1024),
                   used_bytes / (1024 * 1024));
        }
    }
}

void print_allocation_policy() {
    struct bitmask *membind = numa_get_membind();
    struct bitmask *interleave = numa_get_interleave_mask();
    int preferred = numa_preferred();

    printf("\nCurrent Memory Policy:\n");

    // Check if we have a preferred node
    if (preferred >= 0) {
        printf("  Preferred node: %d\n", preferred);
    }

    // Print membind mask
    printf("  Membind mask:    ");
    for (int i = 0; i <= numa_max_node(); i++) {
        if (numa_bitmask_isbitset(membind, i)) {
            printf("%d ", i);
        }
    }
    printf("\n");

    // Print interleave mask
    printf("  Interleave mask: ");
    for (int i = 0; i <= numa_max_node(); i++) {
        if (numa_bitmask_isbitset(interleave, i)) {
            printf("%d ", i);
        }
    }
    printf("\n");
}

// Check which node a specific page is on
void check_page_placement(void *addr) {
    int status[1];
    void *pages[1] = { addr };

    // move_pages with a NULL destination reports each page's current node
    if (move_pages(0, 1, pages, NULL, status, 0) == 0) {
        if (status[0] >= 0) {
            printf("Address %p is on node %d\n", addr, status[0]);
        } else if (status[0] == -ENOENT) {
            printf("Address %p is not mapped\n", addr);
        } else {
            printf("Address %p: error %d\n", addr, status[0]);
        }
    }
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    print_node_memory_state();
    print_allocation_policy();

    // Allocate and check placement
    void *test = numa_alloc_onnode(4096, 0);
    memset(test, 0, 4096);   // Materialize the page
    check_page_placement(test);
    numa_free(test, 4096);

    return 0;
}
```

Process-Specific NUMA Statistics:
Linux provides per-process NUMA information through /proc/<pid>/numa_maps:
```bash
# View NUMA memory mapping for a process
cat /proc/$(pidof my_application)/numa_maps

# Sample output (annotated):
# 7f1234560000 default file=/lib/libc.so.6 anon=0 dirty=0 N0=128 N1=0
# 7f1234670000 prefer:0 anon=4096 dirty=4096 N0=4096 N1=0
# 7f1235000000 bind:0 anon=262144 dirty=262144 N0=262144 N1=0
# 7f1240000000 interleave:0-3 anon=1048576 dirty=0 N0=262144 N1=262144 N2=262144 N3=262144

# Fields explained:
# - Address range (hex)
# - Memory policy (default/prefer/bind/interleave)
# - Type (file/anon/stack/heap)
# - Page counts per node (N0=pages, N1=pages, ...)

# Quick summary with numastat
numastat -p $(pidof my_application)

# Output shows per-process NUMA memory usage:
#                 Node 0   Node 1   Node 2   Node 3
# my_application
#   Huge               0        0        0        0
#   Heap            1234      567      890      123
#   Stack              4        0        0        0
#   Private        98765    12345    54321    67890
#   Total         100003    12912    55211    68013
```

For production systems, integrate NUMA metrics into your monitoring stack. Tools like Prometheus node_exporter expose NUMA statistics. Key metrics to track: numa_miss (allocations that could not be satisfied on their preferred node), numa_foreign (allocations intended for a node that were pushed onto another one), and per-node memory utilization imbalance.
We've explored NUMA nodes in depth. Let's consolidate the key concepts:

- A NUMA node is the locality boundary: a set of cores plus the memory that is equally close to all of them
- Node layout is dictated by hardware and firmware (sockets, chiplets, NPS/SNC settings) and discovered from ACPI tables at boot
- Each node has its own memory zones, and a zonelist defines the fallback order when local memory runs out
- Every CPU and every physical page belongs to exactly one node; both the scheduler and the page allocator respect these boundaries
- Memory-only nodes (PMEM, HBM, CXL) contain memory without CPUs and are increasingly common
- Binding policies (strict, preferred, interleave) and interfaces such as sysfs, numastat, and numa_maps let you control and observe placement
What's Next:
In the next page, we'll examine Local vs Remote Access in detail—quantifying the performance difference, understanding the underlying hardware mechanisms, and exploring techniques for minimizing remote access in real applications.
You now understand the fundamental building block of NUMA systems—the NUMA node. You know how nodes are discovered, what they contain, how to query their state, and how to control memory placement. Next, we'll dive into the performance implications of crossing node boundaries.