In a NUMA system, the question 'how fast is memory access?' has no single answer. The latency and bandwidth you experience depend entirely on where the data lives relative to where your code runs. This page dives deep into the local vs. remote access dichotomy—the heart of what makes NUMA both powerful and challenging.
We'll quantify the performance difference, trace the hardware path of memory accesses, and develop the intuition needed to write NUMA-efficient code.
By the end of this page, you will understand the exact hardware mechanisms that cause remote access to be slower, how to measure and benchmark local vs. remote latency, the bandwidth implications of remote access, and concrete patterns for minimizing remote memory traffic in your applications.
To understand why remote access is slower, we need to trace the physical path that memory requests take through the system. Let's follow a memory load operation from a CPU core to DRAM and back.
Path of a Local Memory Access:
1. The core issues a load that misses in its L1, L2, and shared L3 caches.
2. The request goes to the memory controller integrated on the same node.
3. The controller reads the cache line from local DRAM.
4. The data returns to the core over the on-die fabric.

Total local latency: ~80-100 ns for an L3 cache miss hitting DRAM.
Path of a Remote Memory Access:
When data resides in another node's memory, the path becomes significantly longer:
1. The core issues a load that misses in all local caches.
2. The address decoder determines that the address belongs to another node.
3. The request crosses the inter-socket interconnect (Intel UPI, AMD Infinity Fabric) to the home node.
4. The home node's memory controller reads the cache line from its DRAM.
5. The data travels back across the interconnect to the requesting core.

Total remote latency: ~140-200+ ns depending on the number of hops and interconnect congestion.
The interconnect traversal adds 40-100+ ns to each memory access. This 'interconnect tax' is the fundamental cause of the NUMA performance penalty. It's not just latency—interconnect bandwidth is also limited, so high traffic can cause queuing delays beyond the base latency.
Abstract discussion of latency is less useful than concrete numbers. Let's examine real-world measurements from production systems.
Memory Latency Comparison:
The following table shows measured latencies from various server platforms. These are representative values for random access patterns—sequential access can be faster due to prefetching.
| Access Type | Intel Xeon Scalable | AMD EPYC (NPS1) | AMD EPYC (NPS4) |
|---|---|---|---|
| L3 Cache Hit | ~14 ns | ~12 ns | ~12 ns |
| Local DRAM | ~80 ns | ~90 ns | ~75 ns |
| Remote 1-hop | ~140 ns | ~145 ns | ~110 ns |
| Remote 2-hop | ~180 ns | ~180 ns | ~145 ns |
| NUMA Ratio (1-hop) | 1.75x | 1.61x | 1.47x |
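Before benchmarking your own hardware, you can get a rough picture of these ratios from the firmware's distance table, which libnuma exposes through numa_distance() (a value of 10 means local; 21 means roughly 2.1x the local cost). A minimal sketch, assuming the libnuma headers are installed (compile with -lnuma):

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;

    // Relative distances from the ACPI SLIT: 10 = local,
    // 21 would mean roughly 2.1x the local access cost.
    printf("      ");
    for (int j = 0; j < nodes; j++) printf("node%-4d", j);
    printf("\n");

    for (int i = 0; i < nodes; i++) {
        printf("node%-2d", i);
        for (int j = 0; j < nodes; j++) {
            printf("%-8d", numa_distance(i, j));
        }
        printf("\n");
    }
    return 0;
}
```

These firmware distances are coarse ratios supplied by the platform vendor; measured latencies like those in the table above are the ground truth.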
Bandwidth Implications:
Latency isn't the only concern. Remote access also consumes interconnect bandwidth, which is shared among all cross-node traffic. Consider these bandwidth characteristics:
| Bandwidth Type | Intel Xeon (8-channel DDR5) | AMD EPYC (8-channel DDR5) |
|---|---|---|
| Local Memory Bandwidth (per socket) | ~300 GB/s | ~460 GB/s |
| Interconnect Bandwidth (per link) | ~38 GB/s | ~36 GB/s |
| Max Remote Bandwidth | ~75 GB/s (2 links) | ~144 GB/s (4 links) |
| Remote/Local Ratio | ~25% | ~31% |
The Critical Insight:
These numbers reveal a critical asymmetry: local memory bandwidth vastly exceeds what the interconnect can deliver. If your application is bandwidth-bound and accessing remote memory, you're limited to 25-30% of the bandwidth you could achieve with local access.
For latency-sensitive applications (databases, in-memory caches), the 1.5-2x latency penalty on remote access directly translates to reduced operations per second.
For bandwidth-sensitive applications (scientific computing, machine learning), the 3-4x bandwidth reduction for remote access can be catastrophic.
The latency numbers above are for uncontended access. When multiple cores hammer the interconnect simultaneously, requests queue up. Under heavy load, remote access latency can exceed 300 ns—4x the local latency. This is why NUMA effects are often most severe under high load, exactly when good performance matters most.
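One way to see why load and latency are coupled: by Little's Law, sustained bandwidth equals the number of requests in flight times the line size divided by per-request latency. The arithmetic below is purely illustrative, using round numbers from this page rather than measurements, and shows how much concurrency local bandwidth requires and why a saturated interconnect pushes latency up:

```c
#include <stdio.h>

// Little's Law: bandwidth = (requests in flight) x (line size) / latency.
// Round numbers from this page, for illustration only.
int main(void) {
    const double line = 64.0;                   // bytes per cache line

    double local_bw  = 300e9, local_lat  = 80e-9;
    double remote_bw = 75e9,  remote_lat = 140e-9;

    // Concurrency needed to sustain a given bandwidth at a given latency
    printf("Lines in flight to saturate local DRAM:   %.0f\n",
           local_bw * local_lat / line);         // ~375
    printf("Lines in flight to saturate interconnect: %.0f\n",
           remote_bw * remote_lat / line);       // ~164

    // Once requests arrive faster than the interconnect can retire them,
    // they queue, and observed latency climbs well past the uncontended ~140 ns.
    return 0;
}
```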
NUMA complicates cache coherency—the mechanism that ensures all processors see consistent values for shared memory. Understanding coherency is essential because it creates additional remote access patterns beyond simple memory reads.
Cache Coherency Basics:
Modern processors use MESI (Modified, Exclusive, Shared, Invalid) or extended variants like MESIF (forward) or MOESI (owned) protocols. Each cache line is in one of these states:
| State | Meaning | Can Read? | Can Write? |
|---|---|---|---|
| Modified | Cache has only copy, it's dirty | Yes | Yes |
| Exclusive | Cache has only copy, it's clean | Yes | Yes (becomes Modified) |
| Shared | Multiple caches have copies | Yes | No (must invalidate others) |
| Invalid | Cache line not valid | No | No |
Snoop vs. Directory Coherency:
In SMP systems, coherency uses snooping: every cache broadcasts its requests, and all other caches snoop (listen to) these broadcasts. This doesn't scale—broadcast traffic grows quadratically with cache count.
NUMA systems use directory-based coherency:
- Each node's memory controller (the home agent) keeps a directory recording which caches hold each of its lines and in what state.
- A core's read or write request goes to the line's home node instead of being broadcast to every cache.
- The home node consults the directory and forwards, invalidates, or supplies data only where necessary.

This is more scalable but introduces additional latency for cross-node coherency operations.
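Conceptually, the home node keeps one directory entry per cache line of its local memory, recording the line's state and which nodes hold copies. The sketch below is purely illustrative; real directory formats are vendor-specific and far more compact:

```c
#include <stdint.h>

// Illustrative only: real directories are packed into ECC bits or dedicated
// SRAM and may track sharers per socket, per CCX, or per cache slice.
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;    // Coherency state of the line as the home node sees it
    uint64_t    sharers;  // Bit i set => node (or cache) i holds a copy
    uint8_t     owner;    // Valid when state == DIR_EXCLUSIVE
} directory_entry_t;

// Remote read:  the requester asks the line's home node; the home node checks
//               the entry, forwards to the owner if the line is exclusive
//               elsewhere, otherwise serves it from local DRAM and sets the
//               requester's bit in 'sharers'.
// Remote write: the home node sends invalidations to every node in 'sharers'
//               and waits for acknowledgments before granting ownership.
```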
The Coherency Tax on Remote Data:
Consider what happens when Core 0 (Node 0) writes to data that Core 1 (Node 1) has cached:
1. Core 0's write request goes to the line's home node, which checks its directory and finds that Node 1 holds a copy.
2. The home node sends an invalidation (or ownership-transfer) message to Node 1's cache.
3. Node 1 invalidates its copy, returning the data if it was dirty, and sends an acknowledgment; only then can Core 0 complete the write.

This involves 3 cross-node messages just to perform a write. If this data is frequently shared and written ('false sharing' being an extreme case), coherency traffic can saturate the interconnect.
False sharing occurs when unrelated data happens to share a cache line (typically 64 bytes). If Node 0 writes variable A and Node 1 writes variable B, but both are in the same cache line, every write invalidates the other node's cache. This creates ping-pong coherency traffic and can slow programs by 10-100x. Always pad frequently-written data to cache line boundaries on NUMA systems.
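A minimal sketch of the standard fix in C11: align each independently written field to its own 64-byte cache line so writers on different nodes never invalidate each other (the struct and field names here are illustrative):

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64

// BAD: both counters live in the same 64-byte cache line. A writer on Node 0
// and a writer on Node 1 bounce the line across the interconnect on every update.
struct counters_bad {
    uint64_t node0_ops;
    uint64_t node1_ops;
};

// GOOD: each counter starts on its own cache line, so writes from different
// nodes never invalidate each other's copy.
struct counters_good {
    alignas(CACHE_LINE) uint64_t node0_ops;
    alignas(CACHE_LINE) uint64_t node1_ops;
};

// The same idea applies to arrays of per-thread state: pad each element so
// sizeof(struct per_thread_stat) is a multiple of the cache line size.
struct per_thread_stat {
    alignas(CACHE_LINE) uint64_t ops;
};
```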
Measuring NUMA performance requires careful methodology. Naive benchmarks often hide NUMA effects or attribute them to wrong causes. Here's how to measure correctly.
Key Metrics:
- Per-node memory latency: nanoseconds to fetch a cache line from each NUMA node
- NUMA hit ratio: the fraction of memory accesses (or page allocations) served from the local node
- Cross-node traffic: bytes per second crossing the UPI/Infinity Fabric links
- Per-node bandwidth: sustainable GB/s from local vs. remote memory

The pointer-chasing benchmark below measures the first of these, per-node latency, from whichever node the program runs on.
```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define ARRAY_SIZE (256 * 1024 * 1024)   // 256 MB
#define ITERATIONS 10000000

// High-resolution timing
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

double measure_random_read_latency(void *buffer, size_t size, int iterations) {
    /*
     * Pointer-chasing benchmark to measure true memory latency.
     * Each element points to another element, forming one random cycle
     * through the whole buffer, which defeats hardware prefetching.
     */
    void **chain = (void **)buffer;
    size_t num_elements = size / sizeof(void *);

    // Start from the identity mapping, then apply Sattolo's algorithm to
    // produce a single random cycle (this also touches every page).
    for (size_t i = 0; i < num_elements; i++) {
        chain[i] = &chain[i];
    }
    for (size_t i = num_elements - 1; i > 0; i--) {
        size_t j = rand() % i;           // j < i, so no self-loops
        void *tmp = chain[i];
        chain[i] = chain[j];
        chain[j] = tmp;
    }

    // Warm up
    void **p = chain;
    for (int i = 0; i < 10000; i++) {
        p = (void **)*p;
    }

    // Measure
    uint64_t start = rdtsc();
    p = chain;
    for (int i = 0; i < iterations; i++) {
        p = (void **)*p;
    }
    uint64_t end = rdtsc();

    // Prevent the compiler from optimizing the chase away
    volatile void *sink = p;
    (void)sink;

    // Convert cycles to nanoseconds (assumes a 3 GHz TSC; adjust for your CPU)
    double cycles_per_access = (double)(end - start) / iterations;
    return cycles_per_access / 3.0;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int num_nodes = numa_max_node() + 1;
    int current_node = numa_node_of_cpu(sched_getcpu());

    printf("Running on node %d\n", current_node);
    printf("Testing memory latency to each node:\n\n");
    printf("%-12s %15s %12s\n", "Target Node", "Latency (ns)", "Ratio");
    printf("%-12s %15s %12s\n", "-----------", "-------------", "-------");

    double local_latency = 0;

    for (int target_node = 0; target_node < num_nodes; target_node++) {
        // Allocate memory on the target node
        void *buffer = numa_alloc_onnode(ARRAY_SIZE, target_node);
        if (!buffer) {
            printf("Node %-8d Failed to allocate\n", target_node);
            continue;
        }

        // Touch every page so it is materialized on the target node
        memset(buffer, 0, ARRAY_SIZE);

        double latency = measure_random_read_latency(buffer, ARRAY_SIZE, ITERATIONS);

        if (target_node == current_node) {
            local_latency = latency;
            printf("Node %-8d %12.1f ns (local)\n", target_node, latency);
        } else {
            double ratio = (local_latency > 0) ? latency / local_latency : 0.0;
            printf("Node %-8d %12.1f ns %.2fx\n", target_node, latency, ratio);
        }

        numa_free(buffer, ARRAY_SIZE);
    }

    return 0;
}

/*
 * Compile with:
 *   gcc -O3 -o numa_latency numa_latency.c -lnuma
 *
 * Run bound to a specific node:
 *   numactl --cpunodebind=0 ./numa_latency
 */
```

Using Hardware Performance Counters:
Modern CPUs have performance counters that directly measure NUMA behavior. Intel's uncore counters and AMD's Data Fabric counters provide detailed visibility into cross-node traffic.
```bash
# Using perf to measure NUMA metrics

# Intel: measure QPI/UPI traffic
perf stat -e 'uncore_qpi_0/event=0x00,umask=0x04/' \
          -e 'uncore_qpi_0/event=0x00,umask=0x08/' \
          ./my_application

# AMD: measure Infinity Fabric traffic
perf stat -e 'amd_df/event=0x07,umask=0x0f/' \
          ./my_application

# More portable: use Linux perf NUMA events
perf stat -e 'node-loads' \
          -e 'node-load-misses' \
          -e 'node-stores' \
          -e 'node-store-misses' \
          ./my_application

# Example output:
# Performance counter stats for './my_application':
#   1,234,567,890  node-loads
#     123,456,789  node-load-misses    # 10% remote
#     567,890,123  node-stores
#      12,345,678  node-store-misses   # 2.2% remote

# Calculate the NUMA hit ratio:
#   Hit ratio = (loads - misses) / loads * 100
#   Target: >95% hit ratio for good NUMA behavior

# Continuous monitoring with numastat
watch -n 1 numastat -c

# Output shows per-node memory and access patterns:
#                  Node 0    Node 1    Node 2    Node 3
# numa_hit       12345678  11234567  10123456   9012345
# numa_miss        123456     98765     87654     76543
# numa_foreign      98765    123456     76543     87654
```

A well-tuned NUMA application should have <5% remote memory accesses. If numa_miss or node-load-misses exceeds 5-10%, investigate memory placement. Use numastat -p <pid> to see per-process NUMA behavior and identify offending applications.
Remote memory access isn't always avoidable or even bad. Understanding common patterns helps you make informed decisions about when to optimize and when to accept remote access.
Pattern 1: Initialization Anti-Pattern
The most common NUMA mistake: a single thread allocates and initializes all data, then spawns workers across nodes.
```c
// BAD: a single thread allocates and initializes everything,
// so first-touch places all pages on Node 0.
void *buffer = malloc(HUGE_SIZE);
memset(buffer, 0, HUGE_SIZE);

// Workers on all other nodes now access mostly remote memory.
for (int i = 0; i < NUM_NODES; i++) {
    spawn_worker(i, buffer);
}

// GOOD: each worker allocates and first-touches its own memory locally.
for (int i = 0; i < NUM_NODES; i++) {
    spawn_worker(i, NULL);
}

void worker(int node_id) {
    numa_set_preferred(node_id);
    void *local_buf = malloc(SIZE);
    memset(local_buf, 0, SIZE);   // First-touch on the local node
    /* ... */
}
```

Pattern 2: Acceptable Shared Data
Some data is legitimately shared across all nodes—read-only configuration, lookup tables, code pages. For this data, remote access is unavoidable but can be optimized:
- Interleave it across nodes so no single node's memory channels or interconnect links carry all the traffic (numactl --interleave)
- Replicate small, hot read-only structures per node so every reader hits local memory
- Rely on CPU caches: read-mostly data that stays cached pays the remote cost only on the first access

Pattern 3: Thread Migration
When the scheduler migrates a thread to a different node, its memory becomes remote. This is subtle and can happen without any application code change.
Mitigation strategies:
- Pin critical threads to CPUs on a specific node with pthread_setaffinity_np() or numactl --cpunodebind
- Allocate memory after pinning, so first-touch places pages on the same node the thread will run on (see the example below)
```c
#define _GNU_SOURCE
#include <pthread.h>
#include <numa.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 8                        // upper bound for the static arrays below
#define WORK_SIZE (64UL * 1024 * 1024)

// Application-specific work (stubbed here so the example links)
static void do_computation(void *buffer) { (void)buffer; }

void *worker_thread(void *arg) {
    int target_node = *(int *)arg;

    // Step 1: Pin this thread to the CPUs of the target node
    struct bitmask *cpumask = numa_allocate_cpumask();
    numa_node_to_cpus(target_node, cpumask);

    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    for (int i = 0; i < numa_num_configured_cpus(); i++) {
        if (numa_bitmask_isbitset(cpumask, i)) {
            CPU_SET(i, &cpu_set);
        }
    }
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpu_set);

    // Step 2: Prefer local memory for this thread's allocations
    numa_set_preferred(target_node);
    // For strict binding, build a nodemask containing only target_node
    // and pass it to numa_set_membind() instead.

    // Step 3: Now allocate and first-touch memory (lands on the local node)
    void *local_buffer = malloc(WORK_SIZE);
    memset(local_buffer, 0, WORK_SIZE);

    // Do work with local memory
    do_computation(local_buffer);

    numa_free_cpumask(cpumask);
    free(local_buffer);
    return NULL;
}

int main(void) {
    if (numa_available() < 0) {
        return 1;
    }

    int num_nodes = numa_max_node() + 1;
    pthread_t threads[MAX_NODES];
    int node_ids[MAX_NODES];

    // Spawn one worker per node
    for (int i = 0; i < num_nodes; i++) {
        node_ids[i] = i;
        pthread_create(&threads[i], NULL, worker_thread, &node_ids[i]);
    }

    // Join workers
    for (int i = 0; i < num_nodes; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```

Over-aggressive thread pinning can backfire. If one node is overloaded while another sits idle, pinning prevents the scheduler from balancing the load. Use pinning for critical, memory-bound threads; allow flexibility for I/O-bound or lightly loaded threads.
Understanding whether your workload is latency-bound or bandwidth-bound determines which NUMA optimizations matter most.
Latency-Bound Workloads:
These are characterized by:
- Random, dependent access patterns: pointer chasing, hash lookups, tree and graph traversal
- Low memory-level parallelism: the address of the next load depends on the result of the previous one
- Little benefit from hardware prefetching
Impact of remote access: Each operation pays the full latency penalty. A 2x latency increase directly halves throughput.
Bandwidth-Bound Workloads:
These are characterized by:
- Sequential or strided access over large data sets
- High memory-level parallelism: many independent loads and stores in flight
- Heavy reliance on hardware prefetching and wide memory channels
Impact of remote access: The CPU can issue many requests in flight, hiding latency. But bandwidth limits become hard ceilings. A 3-4x bandwidth reduction for remote access is catastrophic.
| Characteristic | Latency-Bound | Bandwidth-Bound |
|---|---|---|
| Access Pattern | Random | Sequential/Strided |
| Key Metric | ns per operation | GB/s throughput |
| Prefetching Helps? | No | Yes |
| Local NUMA Benefit | 1.5-2x lower latency | 3-4x higher bandwidth |
| Interleaving Helps? | Rarely | Sometimes (balances bandwidth) |
| Example Workloads | Databases, hash tables, graph traversal | Matrix operations, ML training, video encoding |
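To see the bandwidth side of the asymmetry directly, a streaming benchmark complements the pointer-chasing latency benchmark shown earlier: sequential, independent reads let the hardware keep many requests in flight. Below is a minimal single-threaded sketch using libnuma; one thread will not saturate a socket, but the local vs. remote ratio is still visible:

```c
#include <numa.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1024UL * 1024 * 1024)   // 1 GB per node

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;
    for (int node = 0; node < nodes; node++) {
        // Place the buffer explicitly on the target node
        uint64_t *buf = numa_alloc_onnode(BUF_SIZE, node);
        if (!buf) continue;
        memset(buf, 1, BUF_SIZE);            // Materialize pages on that node

        size_t n = BUF_SIZE / sizeof(uint64_t);
        uint64_t sum = 0;

        double start = now_sec();
        for (size_t i = 0; i < n; i++) {     // Sequential, independent reads:
            sum += buf[i];                   // prefetch-friendly, many in flight
        }
        double secs = now_sec() - start;

        printf("Node %d: %6.1f GB/s (checksum %llu)\n",
               node, BUF_SIZE / secs / 1e9, (unsigned long long)sum);
        numa_free(buf, BUF_SIZE);
    }
    return 0;
}
```

Compile with gcc -O3 -o numa_bw numa_bw.c -lnuma (file and binary names are just examples) and run pinned to one node, e.g. numactl --cpunodebind=0 ./numa_bw, then compare the per-node GB/s. A multi-threaded version widens the gap, because a single core usually cannot saturate either the local channels or the interconnect.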
Hybrid Workloads:
Many real applications exhibit both patterns:
- A database mixes random index lookups (latency-bound) with sequential table scans (bandwidth-bound)
- ML training mixes random embedding lookups with streaming matrix operations
- In-memory analytics mixes hash joins with columnar scans
For such workloads, NUMA optimization must consider both dimensions. Commonly:
- Keep latency-critical structures (indexes, hash tables, metadata) on the same node as the threads that probe them
- Partition or interleave large streaming data so no single node's memory channels or interconnect links become the bottleneck
- Pin the latency-critical threads; let bulk-processing threads and data spread across nodes
Don't assume your workload type—measure it. Use perf stat to measure Memory Level Parallelism (MLP). Low MLP (1-2 outstanding requests) = latency-bound. High MLP (10+) = bandwidth-bound. NUMA optimizations differ accordingly.
Let's examine how real production systems handle NUMA's local vs. remote access challenge.
Case Study 1: PostgreSQL
PostgreSQL is a latency-sensitive database with complex memory access patterns:
- A large shared buffer pool accessed by every backend process
- Per-connection backend processes with their own work memory for sorts and hash joins
- Random index probes mixed with sequential heap scans
PostgreSQL's NUMA approach:
- Relies largely on the operating system's default placement rather than explicit NUMA code in the server
- Operators commonly run large multi-node instances under numactl --interleave for shared buffers

Case Study 2: Redis
Redis is an in-memory key-value store where latency is everything:
- A (mostly) single-threaded event loop serves all requests
- Each GET or SET is essentially a pointer chase through a hash table
- Target latencies are measured in microseconds, so every DRAM access counts
NUMA challenges:
- If the event-loop thread and its data sit on different nodes, every lookup pays the remote latency penalty
- The scheduler can silently migrate the Redis thread, turning local data into remote data
- A single instance cannot use more than one node's memory without paying interconnect costs
Redis's NUMA approach:
- Pin each instance to a single node's CPUs and memory (numactl --cpunodebind=N --membind=N)
- When the dataset exceeds one node, run one instance per node and shard keys across them
```bash
#!/bin/bash
# Production Redis deployment on a NUMA system

# Option 1: Single instance pinned to Node 0
numactl --cpunodebind=0 --membind=0 redis-server /etc/redis/redis.conf

# Option 2: One instance per NUMA node for maximum throughput
NUM_NODES=$(numactl --hardware | grep 'available:' | awk '{print $2}')

for ((node=0; node<NUM_NODES; node++)); do
    PORT=$((6379 + node))
    CONFIG="/etc/redis/redis-node${node}.conf"

    numactl --cpunodebind=$node --membind=$node \
        redis-server $CONFIG --port $PORT --daemonize yes

    echo "Started Redis on Node $node, port $PORT"
done

# Option 3: Interleaved for unknown/mixed access patterns
numactl --interleave=all redis-server /etc/redis/redis.conf
```

Case Study 3: Apache Spark
Spark processes large datasets with bandwidth-intensive operations:
- Shuffles, sorts, and scans stream through gigabytes of data per task
- Executors run many task threads over a large JVM heap
- Throughput, not single-request latency, is the primary metric
NUMA challenges:
- A large JVM heap can span multiple NUMA nodes, with object placement decided by the allocator and GC rather than the application
- Task threads migrate across cores, so data locality is hard to maintain
- Garbage collection walks the whole heap, generating heavy cross-node traffic
Spark's NUMA approach:
- Size executors to fit within a single NUMA node and pin them with numactl or cgroup cpusets
- Enable NUMA-aware heap allocation in the JVM with -XX:+UseNUMA (for G1GC)

The JVM's NUMA support varies by garbage collector. G1GC has an explicit NUMA mode. ZGC is NUMA-aware by default. ParallelGC and CMS have limited NUMA awareness. For NUMA-critical Java applications, benchmark GC options carefully; the wrong choice can negate NUMA optimizations.
We've thoroughly examined the performance difference between local and remote memory access. Let's consolidate the key insights:
- Remote access costs roughly 1.5-2x the latency of local access, and far more under interconnect contention
- Remote bandwidth is capped by the interconnect at roughly 25-30% of local memory bandwidth
- Cache coherency adds cross-node traffic of its own; false sharing is the pathological case
- Memory placement is decided at first touch, so initialization patterns and thread placement determine locality
- Measure before optimizing: pointer-chasing benchmarks, perf NUMA events, and numastat show where accesses actually go
- Whether a workload is latency-bound or bandwidth-bound determines which NUMA optimization matters most
What's Next:
In the next page, we'll explore NUMA-Aware Allocation—the specific techniques, APIs, and strategies for ensuring memory ends up on the right NUMA node. We'll cover Linux's memory policy system, libnuma programming, and practical allocation patterns.
You now understand the fundamental performance asymmetry at the heart of NUMA systems. You can trace memory access paths, measure latency and bandwidth, identify remote access patterns, and apply this knowledge to real workloads. Next, we'll learn how to control memory placement to ensure locality.