In a NUMA system, the question 'how fast is memory access?' has no single answer. The latency and bandwidth you experience depend entirely on where the data lives relative to where your code runs. This page dives deep into the local vs. remote access dichotomy—the heart of what makes NUMA both powerful and challenging.
We'll quantify the performance difference, trace the hardware path of memory accesses, and develop the intuition needed to write NUMA-efficient code.
By the end of this page, you will understand the exact hardware mechanisms that cause remote access to be slower, how to measure and benchmark local vs. remote latency, the bandwidth implications of remote access, and concrete patterns for minimizing remote memory traffic in your applications.
To understand why remote access is slower, we need to trace the physical path that memory requests take through the system. Let's follow a memory load operation from a CPU core to DRAM and back.
Path of a Local Memory Access:
1. The core issues a load that misses in its L1, L2, and shared L3 caches.
2. The request goes to the memory controller integrated on the same node.
3. The controller reads the cache line from local DRAM.
4. The data returns to the core over the on-die fabric.

Total local latency: ~80-100 ns for an L3 cache miss hitting DRAM.
Path of a Remote Memory Access:
When data resides in another node's memory, the path becomes significantly longer:
1. The core issues a load that misses in all local caches.
2. The address decoder determines that the address belongs to another node.
3. The request crosses the inter-socket interconnect (Intel UPI, AMD Infinity Fabric) to the home node.
4. The home node's memory controller reads the cache line from its DRAM.
5. The data travels back across the interconnect to the requesting core.

Total remote latency: ~140-200+ ns depending on the number of hops and interconnect congestion.
The interconnect traversal adds 40-100+ ns to each memory access. This 'interconnect tax' is the fundamental cause of the NUMA performance penalty. It's not just latency—interconnect bandwidth is also limited, so high traffic can cause queuing delays beyond the base latency.
Abstract discussion of latency is less useful than concrete numbers. Let's examine real-world measurements from production systems.
Memory Latency Comparison:
The following table shows measured latencies from various server platforms. These are representative values for random access patterns—sequential access can be faster due to prefetching.
| Access Type | Intel Xeon Scalable | AMD EPYC (NPS1) | AMD EPYC (NPS4) |
|---|---|---|---|
| L3 Cache Hit | ~14 ns | ~12 ns | ~12 ns |
| Local DRAM | ~80 ns | ~90 ns | ~75 ns |
| Remote 1-hop | ~140 ns | ~145 ns | ~110 ns |
| Remote 2-hop | ~180 ns | ~180 ns | ~145 ns |
| NUMA Ratio (1-hop) | 1.75x | 1.61x | 1.47x |
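Before benchmarking your own hardware, you can get a rough picture of these ratios from the firmware's distance table, which libnuma exposes through numa_distance() (a value of 10 means local; 21 means roughly 2.1x the local cost). A minimal sketch, assuming the libnuma headers are installed (compile with -lnuma):

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;

    // Relative distances from the ACPI SLIT: 10 = local,
    // 21 would mean roughly 2.1x the local access cost.
    printf("      ");
    for (int j = 0; j < nodes; j++) printf("node%-4d", j);
    printf("\n");

    for (int i = 0; i < nodes; i++) {
        printf("node%-2d", i);
        for (int j = 0; j < nodes; j++) {
            printf("%-8d", numa_distance(i, j));
        }
        printf("\n");
    }
    return 0;
}
```

These firmware distances are coarse ratios supplied by the platform vendor; measured latencies like those in the table above are the ground truth.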
Bandwidth Implications:
Latency isn't the only concern. Remote access also consumes interconnect bandwidth, which is shared among all cross-node traffic. Consider these bandwidth characteristics:
| Bandwidth Type | Intel Xeon (8-channel DDR5) | AMD EPYC (8-channel DDR5) |
|---|---|---|
| Local Memory Bandwidth (per socket) | ~300 GB/s | ~460 GB/s |
| Interconnect Bandwidth (per link) | ~38 GB/s | ~36 GB/s |
| Max Remote Bandwidth | ~75 GB/s (2 links) | ~144 GB/s (4 links) |
| Remote/Local Ratio | ~25% | ~31% |
The Critical Insight:
These numbers reveal a critical asymmetry: local memory bandwidth vastly exceeds what the interconnect can deliver. If your application is bandwidth-bound and accessing remote memory, you're limited to 25-30% of the bandwidth you could achieve with local access.
For latency-sensitive applications (databases, in-memory caches), the 1.5-2x latency penalty on remote access directly translates to reduced operations per second.
For bandwidth-sensitive applications (scientific computing, machine learning), the 3-4x bandwidth reduction for remote access can be catastrophic.
The latency numbers above are for uncontended access. When multiple cores hammer the interconnect simultaneously, requests queue up. Under heavy load, remote access latency can exceed 300 ns—4x the local latency. This is why NUMA effects are often most severe under high load, exactly when good performance matters most.
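One way to see why load and latency are coupled: by Little's Law, sustained bandwidth equals the number of requests in flight times the line size divided by per-request latency. The arithmetic below is purely illustrative, using round numbers from this page rather than measurements, and shows how much concurrency local bandwidth requires and why a saturated interconnect pushes latency up:

```c
#include <stdio.h>

// Little's Law: bandwidth = (requests in flight) x (line size) / latency.
// Round numbers from this page, for illustration only.
int main(void) {
    const double line = 64.0;                   // bytes per cache line

    double local_bw  = 300e9, local_lat  = 80e-9;
    double remote_bw = 75e9,  remote_lat = 140e-9;

    // Concurrency needed to sustain a given bandwidth at a given latency
    printf("Lines in flight to saturate local DRAM:   %.0f\n",
           local_bw * local_lat / line);         // ~375
    printf("Lines in flight to saturate interconnect: %.0f\n",
           remote_bw * remote_lat / line);       // ~164

    // Once requests arrive faster than the interconnect can retire them,
    // they queue, and observed latency climbs well past the uncontended ~140 ns.
    return 0;
}
```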
NUMA complicates cache coherency—the mechanism that ensures all processors see consistent values for shared memory. Understanding coherency is essential because it creates additional remote access patterns beyond simple memory reads.
Cache Coherency Basics:
Modern processors use MESI (Modified, Exclusive, Shared, Invalid) or extended variants like MESIF (forward) or MOESI (owned) protocols. Each cache line is in one of these states:
| State | Meaning | Can Read? | Can Write? |
|---|---|---|---|
| Modified | Cache has only copy, it's dirty | Yes | Yes |
| Exclusive | Cache has only copy, it's clean | Yes | Yes (becomes Modified) |
| Shared | Multiple caches have copies | Yes | No (must invalidate others) |
| Invalid | Cache line not valid | No | No |
Snoop vs. Directory Coherency:
In SMP systems, coherency uses snooping: every cache broadcasts its requests, and all other caches snoop (listen to) these broadcasts. This doesn't scale—broadcast traffic grows quadratically with cache count.
NUMA systems use directory-based coherency:
- Each node's memory controller (the home agent) keeps a directory recording which caches hold each of its lines and in what state.
- A core's read or write request goes to the line's home node instead of being broadcast to every cache.
- The home node consults the directory and forwards, invalidates, or supplies data only where necessary.

This is more scalable but introduces additional latency for cross-node coherency operations.
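Conceptually, the home node keeps one directory entry per cache line of its local memory, recording the line's state and which nodes hold copies. The sketch below is purely illustrative; real directory formats are vendor-specific and far more compact:

```c
#include <stdint.h>

// Illustrative only: real directories are packed into ECC bits or dedicated
// SRAM and may track sharers per socket, per CCX, or per cache slice.
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;    // Coherency state of the line as the home node sees it
    uint64_t    sharers;  // Bit i set => node (or cache) i holds a copy
    uint8_t     owner;    // Valid when state == DIR_EXCLUSIVE
} directory_entry_t;

// Remote read:  the requester asks the line's home node; the home node checks
//               the entry, forwards to the owner if the line is exclusive
//               elsewhere, otherwise serves it from local DRAM and sets the
//               requester's bit in 'sharers'.
// Remote write: the home node sends invalidations to every node in 'sharers'
//               and waits for acknowledgments before granting ownership.
```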
The Coherency Tax on Remote Data:
Consider what happens when Core 0 (Node 0) writes to data that Core 1 (Node 1) has cached:
1. Core 0's write request goes to the line's home node, which checks its directory and finds that Node 1 holds a copy.
2. The home node sends an invalidation (or ownership-transfer) message to Node 1's cache.
3. Node 1 invalidates its copy, returning the data if it was dirty, and sends an acknowledgment; only then can Core 0 complete the write.

This involves 3 cross-node messages just to perform a write. If this data is frequently shared and written ('false sharing' being an extreme case), coherency traffic can saturate the interconnect.
False sharing occurs when unrelated data happens to share a cache line (typically 64 bytes). If Node 0 writes variable A and Node 1 writes variable B, but both are in the same cache line, every write invalidates the other node's cache. This creates ping-pong coherency traffic and can slow programs by 10-100x. Always pad frequently-written data to cache line boundaries on NUMA systems.
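A minimal sketch of the standard fix in C11: align each independently written field to its own 64-byte cache line so writers on different nodes never invalidate each other (the struct and field names here are illustrative):

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64

// BAD: both counters live in the same 64-byte cache line. A writer on Node 0
// and a writer on Node 1 bounce the line across the interconnect on every update.
struct counters_bad {
    uint64_t node0_ops;
    uint64_t node1_ops;
};

// GOOD: each counter starts on its own cache line, so writes from different
// nodes never invalidate each other's copy.
struct counters_good {
    alignas(CACHE_LINE) uint64_t node0_ops;
    alignas(CACHE_LINE) uint64_t node1_ops;
};

// The same idea applies to arrays of per-thread state: pad each element so
// sizeof(struct per_thread_stat) is a multiple of the cache line size.
struct per_thread_stat {
    alignas(CACHE_LINE) uint64_t ops;
};
```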
Measuring NUMA performance requires careful methodology. Naive benchmarks often hide NUMA effects or attribute them to wrong causes. Here's how to measure correctly.
Key Metrics:
- Per-node memory latency: nanoseconds to fetch a cache line from each NUMA node
- NUMA hit ratio: the fraction of memory accesses (or page allocations) served from the local node
- Cross-node traffic: bytes per second crossing the UPI/Infinity Fabric links
- Per-node bandwidth: sustainable GB/s from local vs. remote memory

The pointer-chasing benchmark below measures the first of these, per-node latency, from whichever node the program runs on.
```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define ARRAY_SIZE (256 * 1024 * 1024)   // 256 MB
#define ITERATIONS 10000000

// High-resolution timing
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

double measure_random_read_latency(void *buffer, size_t size, int iterations) {
    /*
     * Pointer-chasing benchmark to measure true memory latency.
     * Each element points to another element, forming one random cycle
     * through the whole buffer, which defeats hardware prefetching.
     */
    void **chain = (void **)buffer;
    size_t num_elements = size / sizeof(void *);

    // Start from the identity mapping, then apply Sattolo's algorithm to
    // produce a single random cycle (this also touches every page).
    for (size_t i = 0; i < num_elements; i++) {
        chain[i] = &chain[i];
    }
    for (size_t i = num_elements - 1; i > 0; i--) {
        size_t j = rand() % i;           // j < i, so no self-loops
        void *tmp = chain[i];
        chain[i] = chain[j];
        chain[j] = tmp;
    }

    // Warm up
    void **p = chain;
    for (int i = 0; i < 10000; i++) {
        p = (void **)*p;
    }

    // Measure
    uint64_t start = rdtsc();
    p = chain;
    for (int i = 0; i < iterations; i++) {
        p = (void **)*p;
    }
    uint64_t end = rdtsc();

    // Prevent the compiler from optimizing the chase away
    volatile void *sink = p;
    (void)sink;

    // Convert cycles to nanoseconds (assumes a 3 GHz TSC; adjust for your CPU)
    double cycles_per_access = (double)(end - start) / iterations;
    return cycles_per_access / 3.0;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int num_nodes = numa_max_node() + 1;
    int current_node = numa_node_of_cpu(sched_getcpu());

    printf("Running on node %d\n", current_node);
    printf("Testing memory latency to each node:\n\n");
    printf("%-12s %15s %12s\n", "Target Node", "Latency (ns)", "Ratio");
    printf("%-12s %15s %12s\n", "-----------", "-------------", "-------");

    double local_latency = 0;

    for (int target_node = 0; target_node < num_nodes; target_node++) {
        // Allocate memory on the target node
        void *buffer = numa_alloc_onnode(ARRAY_SIZE, target_node);
        if (!buffer) {
            printf("Node %-8d Failed to allocate\n", target_node);
            continue;
        }

        // Touch every page so it is materialized on the target node
        memset(buffer, 0, ARRAY_SIZE);

        double latency = measure_random_read_latency(buffer, ARRAY_SIZE, ITERATIONS);

        if (target_node == current_node) {
            local_latency = latency;
            printf("Node %-8d %12.1f ns (local)\n", target_node, latency);
        } else {
            double ratio = (local_latency > 0) ? latency / local_latency : 0.0;
            printf("Node %-8d %12.1f ns %.2fx\n", target_node, latency, ratio);
        }

        numa_free(buffer, ARRAY_SIZE);
    }

    return 0;
}

/*
 * Compile with:
 *   gcc -O3 -o numa_latency numa_latency.c -lnuma
 *
 * Run bound to a specific node:
 *   numactl --cpunodebind=0 ./numa_latency
 */
```

Using Hardware Performance Counters:
Modern CPUs have performance counters that directly measure NUMA behavior. Intel's uncore counters and AMD's Data Fabric counters provide detailed visibility into cross-node traffic.
```bash
# Using perf to measure NUMA metrics

# Intel: measure QPI/UPI traffic
perf stat -e 'uncore_qpi_0/event=0x00,umask=0x04/' \
          -e 'uncore_qpi_0/event=0x00,umask=0x08/' \
          ./my_application

# AMD: measure Infinity Fabric traffic
perf stat -e 'amd_df/event=0x07,umask=0x0f/' \
          ./my_application

# More portable: use Linux perf NUMA events
perf stat -e 'node-loads' \
          -e 'node-load-misses' \
          -e 'node-stores' \
          -e 'node-store-misses' \
          ./my_application

# Example output:
# Performance counter stats for './my_application':
#   1,234,567,890  node-loads
#     123,456,789  node-load-misses    # 10% remote
#     567,890,123  node-stores
#      12,345,678  node-store-misses   # 2.2% remote

# Calculate the NUMA hit ratio:
#   Hit ratio = (loads - misses) / loads * 100
#   Target: >95% hit ratio for good NUMA behavior

# Continuous monitoring with numastat
watch -n 1 numastat -c

# Output shows per-node memory and access patterns:
#                  Node 0    Node 1    Node 2    Node 3
# numa_hit       12345678  11234567  10123456   9012345
# numa_miss        123456     98765     87654     76543
# numa_foreign      98765    123456     76543     87654
```

A well-tuned NUMA application should have <5% remote memory accesses. If numa_miss or node-load-misses exceeds 5-10%, investigate memory placement. Use numastat -p <pid> to see per-process NUMA behavior and identify offending applications.
Remote memory access isn't always avoidable or even bad. Understanding common patterns helps you make informed decisions about when to optimize and when to accept remote access.
Pattern 1: Initialization Anti-Pattern
The most common NUMA mistake: a single thread allocates and initializes all data, then spawns workers across nodes.
```c
// BAD: a single thread allocates and initializes everything,
// so first-touch places all pages on Node 0.
void *buffer = malloc(HUGE_SIZE);
memset(buffer, 0, HUGE_SIZE);

// Workers on all other nodes now access mostly remote memory.
for (int i = 0; i < NUM_NODES; i++) {
    spawn_worker(i, buffer);
}

// GOOD: each worker allocates and first-touches its own memory locally.
for (int i = 0; i < NUM_NODES; i++) {
    spawn_worker(i, NULL);
}

void worker(int node_id) {
    numa_set_preferred(node_id);
    void *local_buf = malloc(SIZE);
    memset(local_buf, 0, SIZE);   // First-touch on the local node
    /* ... */
}
```

Pattern 2: Acceptable Shared Data
Some data is legitimately shared across all nodes—read-only configuration, lookup tables, code pages. For this data, remote access is unavoidable but can be optimized:
- Interleave it across nodes so no single node's memory channels or interconnect links carry all the traffic (numactl --interleave)
- Replicate small, hot read-only structures per node so every reader hits local memory
- Rely on CPU caches: read-mostly data that stays cached pays the remote cost only on the first access

Pattern 3: Thread Migration
When the scheduler migrates a thread to a different node, its memory becomes remote. This is subtle and can happen without any application code change.
Mitigation strategies:
- Pin critical threads to CPUs on a specific node with pthread_setaffinity_np() or numactl --cpunodebind
- Allocate memory after pinning, so first-touch places pages on the same node the thread will run on (see the example below)
```c
#define _GNU_SOURCE
#include <pthread.h>
#include <numa.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 8                        // upper bound for the static arrays below
#define WORK_SIZE (64UL * 1024 * 1024)

// Application-specific work (stubbed here so the example links)
static void do_computation(void *buffer) { (void)buffer; }

void *worker_thread(void *arg) {
    int target_node = *(int *)arg;

    // Step 1: Pin this thread to the CPUs of the target node
    struct bitmask *cpumask = numa_allocate_cpumask();
    numa_node_to_cpus(target_node, cpumask);

    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    for (int i = 0; i < numa_num_configured_cpus(); i++) {
        if (numa_bitmask_isbitset(cpumask, i)) {
            CPU_SET(i, &cpu_set);
        }
    }
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpu_set);

    // Step 2: Prefer local memory for this thread's allocations
    numa_set_preferred(target_node);
    // For strict binding, build a nodemask containing only target_node
    // and pass it to numa_set_membind() instead.

    // Step 3: Now allocate and first-touch memory (lands on the local node)
    void *local_buffer = malloc(WORK_SIZE);
    memset(local_buffer, 0, WORK_SIZE);

    // Do work with local memory
    do_computation(local_buffer);

    numa_free_cpumask(cpumask);
    free(local_buffer);
    return NULL;
}

int main(void) {
    if (numa_available() < 0) {
        return 1;
    }

    int num_nodes = numa_max_node() + 1;
    pthread_t threads[MAX_NODES];
    int node_ids[MAX_NODES];

    // Spawn one worker per node
    for (int i = 0; i < num_nodes; i++) {
        node_ids[i] = i;
        pthread_create(&threads[i], NULL, worker_thread, &node_ids[i]);
    }

    // Join workers
    for (int i = 0; i < num_nodes; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```

Over-aggressive thread pinning can backfire. If one node is overloaded while another sits idle, pinning prevents the scheduler from balancing the load. Use pinning for critical, memory-bound threads; allow flexibility for I/O-bound or lightly loaded threads.
Understanding whether your workload is latency-bound or bandwidth-bound determines which NUMA optimizations matter most.
Latency-Bound Workloads:
These are characterized by:
- Random, dependent access patterns: pointer chasing, hash lookups, tree and graph traversal
- Low memory-level parallelism: the address of the next load depends on the result of the previous one
- Little benefit from hardware prefetching
Impact of remote access: Each operation pays the full latency penalty. A 2x latency increase directly halves throughput.
Bandwidth-Bound Workloads:
These are characterized by:
- Sequential or strided access over large data sets
- High memory-level parallelism: many independent loads and stores in flight
- Heavy reliance on hardware prefetching and wide memory channels
Impact of remote access: The CPU can issue many requests in flight, hiding latency. But bandwidth limits become hard ceilings. A 3-4x bandwidth reduction for remote access is catastrophic.
| Characteristic | Latency-Bound | Bandwidth-Bound |
|---|---|---|
| Access Pattern | Random | Sequential/Strided |
| Key Metric | ns per operation | GB/s throughput |
| Prefetching Helps? | No | Yes |
| Local NUMA Benefit | 1.5-2x lower latency | 3-4x higher bandwidth |
| Interleaving Helps? | Rarely | Sometimes (balances bandwidth) |
| Example Workloads | Databases, hash tables, graph traversal | Matrix operations, ML training, video encoding |
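To see the bandwidth side of the asymmetry directly, a streaming benchmark complements the pointer-chasing latency benchmark shown earlier: sequential, independent reads let the hardware keep many requests in flight. Below is a minimal single-threaded sketch using libnuma; one thread will not saturate a socket, but the local vs. remote ratio is still visible:

```c
#include <numa.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1024UL * 1024 * 1024)   // 1 GB per node

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;
    for (int node = 0; node < nodes; node++) {
        // Place the buffer explicitly on the target node
        uint64_t *buf = numa_alloc_onnode(BUF_SIZE, node);
        if (!buf) continue;
        memset(buf, 1, BUF_SIZE);            // Materialize pages on that node

        size_t n = BUF_SIZE / sizeof(uint64_t);
        uint64_t sum = 0;

        double start = now_sec();
        for (size_t i = 0; i < n; i++) {     // Sequential, independent reads:
            sum += buf[i];                   // prefetch-friendly, many in flight
        }
        double secs = now_sec() - start;

        printf("Node %d: %6.1f GB/s (checksum %llu)\n",
               node, BUF_SIZE / secs / 1e9, (unsigned long long)sum);
        numa_free(buf, BUF_SIZE);
    }
    return 0;
}
```

Compile with gcc -O3 -o numa_bw numa_bw.c -lnuma (file and binary names are just examples) and run pinned to one node, e.g. numactl --cpunodebind=0 ./numa_bw, then compare the per-node GB/s. A multi-threaded version widens the gap, because a single core usually cannot saturate either the local channels or the interconnect.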
Hybrid Workloads:
Many real applications exhibit both patterns:
- A database mixes random index lookups (latency-bound) with sequential table scans (bandwidth-bound)
- ML training mixes random embedding lookups with streaming matrix operations
- In-memory analytics mixes hash joins with columnar scans
For such workloads, NUMA optimization must consider both dimensions. Commonly:
- Keep latency-critical structures (indexes, hash tables, metadata) on the same node as the threads that probe them
- Partition or interleave large streaming data so no single node's memory channels or interconnect links become the bottleneck
- Pin the latency-critical threads; let bulk-processing threads and data spread across nodes
Don't assume your workload type—measure it. Use perf stat to measure Memory Level Parallelism (MLP). Low MLP (1-2 outstanding requests) = latency-bound. High MLP (10+) = bandwidth-bound. NUMA optimizations differ accordingly.
Let's examine how real production systems handle NUMA's local vs. remote access challenge.
Case Study 1: PostgreSQL
PostgreSQL is a latency-sensitive database with complex memory access patterns:
- A large shared buffer pool accessed by every backend process
- Per-connection backend processes with their own work memory for sorts and hash joins
- Random index probes mixed with sequential heap scans
PostgreSQL's NUMA approach:
- Relies largely on the operating system's default placement rather than explicit NUMA code in the server
- Operators commonly run large multi-node instances under numactl --interleave for shared buffers

Case Study 2: Redis
Redis is an in-memory key-value store where latency is everything:
- A (mostly) single-threaded event loop serves all requests
- Each GET or SET is essentially a pointer chase through a hash table
- Target latencies are measured in microseconds, so every DRAM access counts
NUMA challenges:
- If the event-loop thread and its data sit on different nodes, every lookup pays the remote latency penalty
- The scheduler can silently migrate the Redis thread, turning local data into remote data
- A single instance cannot use more than one node's memory without paying interconnect costs
Redis's NUMA approach:
- Pin each instance to a single node's CPUs and memory (numactl --cpunodebind=N --membind=N)
- When the dataset exceeds one node, run one instance per node and shard keys across them
```bash
#!/bin/bash
# Production Redis deployment on a NUMA system

# Option 1: Single instance pinned to Node 0
numactl --cpunodebind=0 --membind=0 redis-server /etc/redis/redis.conf

# Option 2: One instance per NUMA node for maximum throughput
NUM_NODES=$(numactl --hardware | grep 'available:' | awk '{print $2}')

for ((node=0; node<NUM_NODES; node++)); do
    PORT=$((6379 + node))
    CONFIG="/etc/redis/redis-node${node}.conf"

    numactl --cpunodebind=$node --membind=$node \
        redis-server $CONFIG --port $PORT --daemonize yes

    echo "Started Redis on Node $node, port $PORT"
done

# Option 3: Interleaved for unknown/mixed access patterns
numactl --interleave=all redis-server /etc/redis/redis.conf
```

Case Study 3: Apache Spark
Spark processes large datasets with bandwidth-intensive operations:
- Shuffles, sorts, and scans stream through gigabytes of data per task
- Executors run many task threads over a large JVM heap
- Throughput, not single-request latency, is the primary metric
NUMA challenges:
- A large JVM heap can span multiple NUMA nodes, with object placement decided by the allocator and GC rather than the application
- Task threads migrate across cores, so data locality is hard to maintain
- Garbage collection walks the whole heap, generating heavy cross-node traffic
Spark's NUMA approach:
- Size executors to fit within a single NUMA node and pin them with numactl or cgroup cpusets
- Enable NUMA-aware heap allocation in the JVM with -XX:+UseNUMA (for G1GC)

The JVM's NUMA support varies by garbage collector. G1GC has an explicit NUMA mode. ZGC is NUMA-aware by default. ParallelGC and CMS have limited NUMA awareness. For NUMA-critical Java applications, benchmark GC options carefully; the wrong choice can negate NUMA optimizations.
We've thoroughly examined the performance difference between local and remote memory access. Let's consolidate the key insights:
- Remote access costs roughly 1.5-2x the latency of local access, and far more under interconnect contention
- Remote bandwidth is capped by the interconnect at roughly 25-30% of local memory bandwidth
- Cache coherency adds cross-node traffic of its own; false sharing is the pathological case
- Memory placement is decided at first touch, so initialization patterns and thread placement determine locality
- Measure before optimizing: pointer-chasing benchmarks, perf NUMA events, and numastat show where accesses actually go
- Whether a workload is latency-bound or bandwidth-bound determines which NUMA optimization matters most
What's Next:
In the next page, we'll explore NUMA-Aware Allocation—the specific techniques, APIs, and strategies for ensuring memory ends up on the right NUMA node. We'll cover Linux's memory policy system, libnuma programming, and practical allocation patterns.
You now understand the fundamental performance asymmetry at the heart of NUMA systems. You can trace memory access paths, measure latency and bandwidth, identify remote access patterns, and apply this knowledge to real workloads. Next, we'll learn how to control memory placement to ensure locality.