You now understand NUMA architecture, nodes, local vs. remote access, and allocation strategies. But theory without application is incomplete. This final page transforms that knowledge into actionable optimization skills—the ability to diagnose NUMA problems, measure improvements, and apply the right fixes in production environments.
We'll cover the complete optimization workflow: profiling to identify problems, benchmarking to quantify them, applying targeted fixes, and validating improvements. By the end, you'll have a systematic approach to NUMA performance optimization.
By the end of this page, you will be able to systematically profile applications for NUMA issues, benchmark and quantify NUMA performance, apply targeted optimizations based on workload characteristics, and avoid common pitfalls that undermine NUMA performance.
Effective NUMA optimization follows a disciplined workflow. Jumping straight to 'fixes' without understanding the problem often makes things worse. Here's the systematic approach:
The Five-Step NUMA Optimization Process:

1. Profile the application to confirm NUMA is actually the bottleneck.
2. Benchmark to quantify the cost of remote access under a representative workload.
3. Apply the optimization that matches the workload's characteristics.
4. Validate with before/after measurements on the same workload.
5. Monitor continuously in production to catch regressions.
Never optimize without measurement. 'It should be faster' is not validation. NUMA optimizations can backfire—interleaving hurts locality-dependent workloads, over-binding prevents load balancing. Always measure before and after, with representative workloads.
Before optimizing, you must identify whether NUMA is even a problem. Many performance issues blamed on NUMA are actually caused by lock contention, I/O latency, or algorithmic inefficiency. Here's how to detect genuine NUMA problems.
Key NUMA Metrics:
| Metric | Healthy Range | Concerning | Source |
|---|---|---|---|
| Local memory hit ratio | > 95% | < 90% | numastat, perf counters |
| numa_miss rate | < 5% of allocations | > 10% | /proc/vmstat, numastat |
| Node memory imbalance | < 20% variance | > 50% variance | numastat -m |
| Remote memory bandwidth | < 10% of total | > 25% of total | UPI/Infinity Fabric counters |
| Cross-node coherency traffic | Low | High under load | uncore counters |
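The headline metric, local hit ratio, can be derived directly from the numa_hit and numa_miss counters in /proc/vmstat. Below is a minimal sketch (not part of any standard tool); note that these counters are cumulative since boot, so for live monitoring you would sample deltas over an interval:

```c
#include <stdio.h>
#include <string.h>

// Compute the local hit ratio from the kernel's lifetime NUMA counters.
int main(void) {
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("/proc/vmstat"); return 1; }

    char key[64];
    unsigned long long value, hit = 0, miss = 0;
    while (fscanf(f, "%63s %llu", key, &value) == 2) {
        if (strcmp(key, "numa_hit") == 0)  hit = value;
        if (strcmp(key, "numa_miss") == 0) miss = value;
    }
    fclose(f);

    if (hit + miss == 0) {
        printf("No NUMA allocations recorded\n");
        return 0;
    }
    double ratio = 100.0 * hit / (double)(hit + miss);
    printf("Local hit ratio: %.2f%% (numa_hit=%llu, numa_miss=%llu)\n",
           ratio, hit, miss);
    return 0;
}
```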
The following script collects the system-wide and per-process views in one pass:

```bash
#!/bin/bash
# Comprehensive NUMA profiling script

echo "=== System NUMA Configuration ==="
numactl --hardware

echo -e "\n=== System-Wide NUMA Statistics ==="
numastat

echo -e "\n=== NUMA Memory Distribution ==="
numastat -m

echo -e "\n=== Per-Process NUMA Breakdown ==="
# Replace 'myapp' with your application name
for pid in $(pgrep myapp); do
    echo "--- PID $pid ---"
    numastat -p $pid
done

echo -e "\n=== Live NUMA Monitoring (5 seconds) ==="
# Monitor numa_hit/miss rate changes
echo "Timestamp | numa_hit | numa_miss | numa_foreign | numa_local | numa_other"
for i in {1..5}; do
    stats=$(grep -E "numa_(hit|miss|foreign|local|other)" /proc/vmstat)
    hit=$(echo "$stats" | grep numa_hit | awk '{print $2}')
    miss=$(echo "$stats" | grep numa_miss | awk '{print $2}')
    foreign=$(echo "$stats" | grep numa_foreign | awk '{print $2}')
    local=$(echo "$stats" | grep numa_local | awk '{print $2}')
    other=$(echo "$stats" | grep numa_other | awk '{print $2}')
    echo "$(date +%H:%M:%S) | $hit | $miss | $foreign | $local | $other"
    sleep 1
done

echo -e "\n=== Memory Placement for Specific Process ==="
# Show NUMA mapping of process memory regions
pid=$(pgrep -f myapp | head -1)
if [ -n "$pid" ]; then
    echo "Memory regions for PID $pid:"
    head -20 /proc/$pid/numa_maps
fi

echo -e "\n=== Hardware Performance Counters (if available) ==="
# Check if perf can measure NUMA events
perf list 2>/dev/null | grep -i numa || echo "No NUMA perf events found"

# Example: measure NUMA events during application run
# perf stat -e node-loads,node-stores,node-load-misses,node-store-misses ./myapp
```

Using perf for Deep NUMA Analysis:
The Linux perf tool can provide detailed NUMA insight using hardware performance counters:
```bash
# Basic NUMA event recording
perf stat -e node-loads,node-stores,node-load-misses,node-store-misses \
    ./my_application

# Sample output:
#   Performance counter stats for './my_application':
#     456,789,012   node-loads
#     123,456,789   node-stores
#      45,678,901   node-load-misses    # 10% remote loads (BAD)
#         345,678   node-store-misses   # 0.3% remote stores (OK)

# Calculate NUMA efficiency:
#   Local Load Ratio = 1 - (node-load-misses / node-loads)
#   Target: > 95%

# Record for detailed analysis
perf record -e node-loads,node-load-misses -g ./my_application
perf report

# AMD systems: Data Fabric events
perf stat -e amd_df/event=0x07,umask=0x02/ ./my_application

# Intel systems: UPI traffic
perf stat -e uncore_upi/event=0x02,umask=0x0f/ ./my_application

# Memory bandwidth by NUMA node (Intel)
perf stat -e uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ \
    -e uncore_imc_1/cas_count_read/,uncore_imc_1/cas_count_write/ \
    ./my_application
```

If more than 10% of your memory accesses are remote (node-load-misses / node-loads > 0.10), NUMA optimization will likely yield significant benefits. Below 5%, other optimizations may be more impactful. Between 5% and 10%, it depends on whether your workload is latency-sensitive.
Profiling identifies problems; benchmarking quantifies them. Good benchmarks isolate NUMA effects from other variables and produce reproducible results.
Benchmarking Best Practices:

- Pin benchmark threads to a known node so you measure the placement you intend.
- Touch all pages before timing, and include warmup iterations.
- Use pointer chasing to measure latency (it defeats prefetching) and sequential reads to measure bandwidth.
- Average multiple runs and report per-node results so local and remote numbers can be compared.
```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <pthread.h>

#define ARRAY_SIZE (512 * 1024 * 1024)  // 512 MB per thread
#define ITERATIONS 100000000
#define WARMUP_ITERATIONS 10000000
#define NUM_RUNS 5

typedef struct {
    int thread_id;
    int numa_node;
    void *buffer;
    double latency_ns;
    double bandwidth_gbps;
} ThreadResult;

// Pointer-chasing latency benchmark
double benchmark_latency(void *buffer, size_t size, int iterations) {
    void **chain = (void **)buffer;
    size_t num_elements = size / sizeof(void *);

    // Create random pointer chain (defeats prefetching)
    for (size_t i = 0; i < num_elements - 1; i++) {
        chain[i] = &chain[(i * 179 + 17) % num_elements];
    }
    chain[num_elements - 1] = &chain[0];

    // Touch all pages
    for (size_t i = 0; i < num_elements; i += 512) {
        volatile void *touch = chain[i];
        (void)touch;
    }

    // Warmup
    void **p = chain;
    for (int i = 0; i < WARMUP_ITERATIONS; i++) {
        p = (void **)*p;
    }

    // Timed run
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    p = chain;
    for (int i = 0; i < iterations; i++) {
        p = (void **)*p;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    // Prevent optimization
    volatile void *sink = p;
    (void)sink;

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / iterations;
}

// Sequential bandwidth benchmark
double benchmark_bandwidth(void *buffer, size_t size) {
    // Warmup
    memset(buffer, 1, size);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // Read entire buffer (bandwidth test)
    volatile long sum = 0;
    long *data = (long *)buffer;
    size_t count = size / sizeof(long);
    for (size_t i = 0; i < count; i += 8) {
        sum += data[i] + data[i+1] + data[i+2] + data[i+3] +
               data[i+4] + data[i+5] + data[i+6] + data[i+7];
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_sec = (end.tv_sec - start.tv_sec) +
                         (end.tv_nsec - start.tv_nsec) / 1e9;
    return (size / 1e9) / elapsed_sec;  // GB/s
}

void *benchmark_thread(void *arg) {
    ThreadResult *result = (ThreadResult *)arg;

    // Pin to NUMA node
    numa_run_on_node(result->numa_node);

    // Allocate locally
    result->buffer = numa_alloc_local(ARRAY_SIZE);
    if (!result->buffer) {
        fprintf(stderr, "Thread %d: allocation failed\n", result->thread_id);
        return NULL;
    }

    // Touch to materialize
    memset(result->buffer, 0, ARRAY_SIZE);

    // Run benchmarks
    result->latency_ns = benchmark_latency(result->buffer, ARRAY_SIZE, ITERATIONS);
    result->bandwidth_gbps = benchmark_bandwidth(result->buffer, ARRAY_SIZE);

    numa_free(result->buffer, ARRAY_SIZE);
    return NULL;
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int num_nodes = numa_max_node() + 1;
    printf("NUMA Benchmark: %d nodes\n\n", num_nodes);

    // Run benchmark on each node
    printf("%-8s %15s %15s\n", "Node", "Latency (ns)", "Bandwidth (GB/s)");
    printf("%-8s %15s %15s\n", "----", "-----------", "---------------");

    for (int node = 0; node < num_nodes; node++) {
        double total_latency = 0;
        double total_bandwidth = 0;

        for (int run = 0; run < NUM_RUNS; run++) {
            ThreadResult result = {
                .thread_id = 0,
                .numa_node = node,
            };

            pthread_t thread;
            pthread_create(&thread, NULL, benchmark_thread, &result);
            pthread_join(thread, NULL);

            total_latency += result.latency_ns;
            total_bandwidth += result.bandwidth_gbps;
        }

        printf("Node %-3d %15.1f %15.1f\n", node,
               total_latency / NUM_RUNS,
               total_bandwidth / NUM_RUNS);
    }

    return 0;
}
```

Microbenchmarks measure theoretical limits. Real applications have mixed access patterns, cache effects, and non-memory bottlenecks.
Use microbenchmarks to understand hardware capabilities, but always validate with realistic application benchmarks before declaring victory.
With measurements in hand, apply the appropriate optimization. The best technique depends on your workload's characteristics.
Decision Framework:
| Workload Characteristic | Recommended Optimization | Tools/Approach |
|---|---|---|
| Single-threaded, memory-intensive | Strict node binding | numactl --cpunodebind=0 --membind=0 |
| Multi-threaded, partitionable data | Per-thread local allocation | NUMA-aware memory pools |
| Shared read-only data | Interleave across all nodes | numactl --interleave=all |
| Heavily shared, frequently modified | Minimize sharing, or replicate | Data structure redesign |
| Unknown/mixed access patterns | Local allocation + AutoNUMA | Enable numa_balancing |
| Database workload | Buffer pool per node | Database-specific tuning |
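As a rough illustration of how two of these rows look in application code, the sketch below (assuming libnuma; the sizes and variable names are placeholders) interleaves a shared read-only table across all nodes with numa_alloc_interleaved(), while keeping a partitionable per-thread slice local with numa_alloc_local():

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    // Shared read-only data: spread pages across all nodes so no single
    // memory controller becomes the hotspot (same effect as
    // `numactl --interleave=all`, but scoped to this one allocation).
    size_t table_size = 1UL << 30;                 // 1 GB lookup table (placeholder)
    void *shared_table = numa_alloc_interleaved(table_size);

    // Partitionable per-thread data: allocate on the node the calling
    // thread is currently running on, keeping accesses local.
    size_t slice_size = 64UL << 20;                // 64 MB slice (placeholder)
    void *local_slice = numa_alloc_local(slice_size);

    if (!shared_table || !local_slice) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* ... populate and use ... */

    numa_free(shared_table, table_size);
    numa_free(local_slice, slice_size);
    return 0;
}
```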
Technique 1: Thread-Data Colocation
The most powerful optimization: ensure threads run near the data they access.
```c
#include <numa.h>
#include <pthread.h>
#include <string.h>   // for memset

typedef struct {
    int node;
    void *local_data;
    size_t data_size;
    void (*work_fn)(void *, size_t);
} Worker;

void *colocated_worker(void *arg) {
    Worker *w = (Worker *)arg;

    // Step 1: Bind thread to designated NUMA node
    numa_run_on_node(w->node);

    // Step 2: Allocate data locally (thread is now on correct node)
    w->local_data = numa_alloc_local(w->data_size);

    // Step 3: Initialize data (first-touch on local node)
    memset(w->local_data, 0, w->data_size);

    // Step 4: Process - all accesses are local
    w->work_fn(w->local_data, w->data_size);

    numa_free(w->local_data, w->data_size);
    return NULL;
}

void spawn_colocated_workers(int num_workers, size_t data_per_worker,
                             void (*work_fn)(void *, size_t)) {
    int num_nodes = numa_max_node() + 1;
    pthread_t threads[num_workers];
    Worker workers[num_workers];

    for (int i = 0; i < num_workers; i++) {
        workers[i].node = i % num_nodes;  // Round-robin across nodes
        workers[i].data_size = data_per_worker;
        workers[i].work_fn = work_fn;
        pthread_create(&threads[i], NULL, colocated_worker, &workers[i]);
    }

    for (int i = 0; i < num_workers; i++) {
        pthread_join(threads[i], NULL);
    }
}
```

Technique 2: Data Partitioning
For embarrassingly parallel workloads, partition data so each NUMA node has a complete, independent slice:
```python
# Conceptual data partitioning for NUMA
# (Actual memory binding requires C/ctypes or specialized libraries)

def partition_for_numa(data, num_nodes):
    """
    Partition data for NUMA-aware processing.

    Each partition should be:
    1. Allocated on its designated node
    2. Processed by threads running on that node
    3. Independent (no cross-partition access during processing)
    """
    partition_size = len(data) // num_nodes
    partitions = []

    for node in range(num_nodes):
        start = node * partition_size
        end = start + partition_size if node < num_nodes - 1 else len(data)

        # In real code, allocate this partition on 'node' using numa_alloc_onnode
        partition = data[start:end]
        partitions.append((node, partition))

    return partitions


def process_partitioned(partitions, process_fn):
    """
    Process each partition on its designated NUMA node.

    In real implementation:
    1. Pin thread to node
    2. Access only local partition
    3. Aggregate results at end (minimize cross-node sync)
    """
    results = []
    for node, partition in partitions:
        # spawn_thread_on_node(node, process_fn, partition)
        result = process_fn(partition)
        results.append(result)

    return aggregate(results)
```
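As the comments above note, the actual binding has to happen in C (or via ctypes). A minimal sketch of the allocation side, assuming libnuma; the Partition type and allocate_partitions name are illustrative:

```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative partition descriptor: one slice per NUMA node.
typedef struct {
    int     node;
    double *data;
    size_t  count;
} Partition;

// Split a 'total'-element array into per-node slices, binding each slice's
// pages to its node with numa_alloc_onnode(). Free each slice later with
// numa_free(parts[node].data, parts[node].count * sizeof(double)).
Partition *allocate_partitions(size_t total, int *out_num_nodes) {
    if (numa_available() < 0) return NULL;

    int num_nodes = numa_max_node() + 1;
    Partition *parts = calloc(num_nodes, sizeof(Partition));
    if (!parts) return NULL;

    size_t base = total / num_nodes;
    for (int node = 0; node < num_nodes; node++) {
        // Last node absorbs the remainder.
        size_t count = (node == num_nodes - 1) ? total - base * node : base;
        parts[node].node  = node;
        parts[node].count = count;
        parts[node].data  = numa_alloc_onnode(count * sizeof(double), node);
        if (!parts[node].data) {
            fprintf(stderr, "allocation on node %d failed\n", node);
            free(parts);  // per-slice cleanup omitted for brevity
            return NULL;
        }
    }
    *out_num_nodes = num_nodes;
    return parts;
}
```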
Technique 3: NUMA-Aware Data Structures

Redesign data structures to respect NUMA boundaries: shard hash tables, counters, and pools per node, replicate read-mostly structures instead of sharing them, and keep per-thread state on the owning thread's node.
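One possible shape for such a structure is sketched below, assuming libnuma and GCC/Clang atomic builtins; NodeShard, stats_inc, and stats_read are hypothetical names. Each node gets its own shard, so updates from local threads never touch remote memory, and only readers pay for cross-node traffic:

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

// One shard per NUMA node; counter_id must be < 64 in this sketch.
typedef struct {
    long counters[64];
} NodeShard;

static NodeShard **shards;   // shards[node] lives on that node
static int num_nodes;

int stats_init(void) {
    if (numa_available() < 0) return -1;
    num_nodes = numa_max_node() + 1;
    shards = calloc(num_nodes, sizeof(NodeShard *));
    if (!shards) return -1;
    for (int node = 0; node < num_nodes; node++) {
        // numa_alloc_onnode binds the pages; memset materializes them on 'node'.
        shards[node] = numa_alloc_onnode(sizeof(NodeShard), node);
        if (!shards[node]) return -1;
        memset(shards[node], 0, sizeof(NodeShard));
    }
    return 0;
}

// Writers update the shard of the node they are currently running on.
void stats_inc(int counter_id) {
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0) node = 0;
    __atomic_fetch_add(&shards[node]->counters[counter_id], 1, __ATOMIC_RELAXED);
}

// Readers aggregate across shards (the only cross-node accesses).
long stats_read(int counter_id) {
    long total = 0;
    for (int node = 0; node < num_nodes; node++)
        total += __atomic_load_n(&shards[node]->counters[counter_id], __ATOMIC_RELAXED);
    return total;
}
```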
Focus optimization on the hot path. A NUMA-oblivious cold path (error handling, logging) barely matters. Identify the 20% of code that consumes 80% of memory bandwidth and optimize that. Leave the rest alone.
Different application types require different NUMA strategies. Here are specific recommendations for common workloads.
Databases (PostgreSQL, MySQL, MongoDB):
```bash
# PostgreSQL NUMA tuning

# Option 1: Interleave shared_buffers (large, shared by all backends)
numactl --interleave=all postgres -c shared_buffers=64GB

# Option 2: Run separate PostgreSQL instances per NUMA node
# Node 0 instance
numactl --cpunodebind=0 --membind=0 \
    postgres -D /var/lib/postgresql/node0 -p 5432

# Node 1 instance
numactl --cpunodebind=1 --membind=1 \
    postgres -D /var/lib/postgresql/node1 -p 5433

# Then use connection pooler (PgBouncer) to route queries

# MySQL NUMA tuning
# In my.cnf:
#   innodb_numa_interleave = 1   # Interleave buffer pool

# Or via command line
numactl --interleave=all mysqld \
    --innodb-buffer-pool-size=64G \
    --innodb-buffer-pool-instances=8

# MongoDB NUMA tuning
# MongoDB strongly recommends disabling zone_reclaim_mode
echo 0 > /proc/sys/vm/zone_reclaim_mode

# Then run with interleave
numactl --interleave=all mongod --dbpath /data/db
```

In-Memory Caches (Redis, Memcached):
```bash
# Redis NUMA tuning
# Redis is single-threaded, so strict node binding is ideal

# Bind to single node (best if data fits in one node's memory)
numactl --cpunodebind=0 --membind=0 redis-server /etc/redis.conf

# For larger datasets, run multiple instances
for node in 0 1 2 3; do
    port=$((6379 + node))
    numactl --cpunodebind=$node --membind=$node \
        redis-server --port $port --daemonize yes
done

# Memcached NUMA tuning
# Memcached is multi-threaded, use interleave for shared hash table
numactl --interleave=all memcached -m 65536 -t $(nproc)

# Or per-node instances (better for very large deployments)
for node in 0 1 2 3; do
    port=$((11211 + node))
    cores_on_node=$(numactl -H | grep "node $node cpus:" | cut -d: -f2)
    num_cores=$(echo $cores_on_node | wc -w)
    numactl --cpunodebind=$node --membind=$node \
        memcached -p $port -t $num_cores -m 16384 -d
done
```

JVM Applications (Spark, Elasticsearch, Kafka):
```bash
# JVM NUMA tuning

# Enable JVM NUMA support (works with G1GC)
java -XX:+UseNUMA -XX:+UseG1GC -Xmx64g -jar application.jar

# For ZGC (NUMA-aware by default in Java 15+)
java -XX:+UseZGC -Xmx64g -jar application.jar

# Combine with numactl for additional control
numactl --interleave=all \
    java -XX:+UseNUMA -XX:+UseG1GC -Xmx64g -jar application.jar

# Elasticsearch NUMA settings
# In jvm.options:
#   -XX:+UseG1GC
#   -XX:+UseNUMA

# Also consider:
# - Running one ES node per NUMA node (smaller heap, more instances)
# - Setting node.processors to CPUs on one NUMA node

# Kafka NUMA tuning
# Bind broker to single NUMA node for best latency
numactl --cpunodebind=0 --membind=0 \
    kafka-server-start.sh config/server.properties

# For throughput, interleave (log data is shared across partitions)
numactl --interleave=all \
    kafka-server-start.sh config/server.properties
```

Many applications have official NUMA tuning guides. Check documentation before applying generic advice. PostgreSQL, MySQL, MongoDB, Redis, and Kafka all have specific recommendations that reflect their internal architectures.
NUMA optimization has many ways to go wrong. Learn from others' mistakes.
Pitfall 1: Over-Binding

Binding every thread and every allocation to a specific node strips the kernel of its ability to balance load: if one node's cores saturate or its memory fills up, strictly bound work cannot spill over to idle resources, and overall throughput drops.
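A gentler approach is a preferred (rather than strict) memory policy. The sketch below is illustrative and assumes libnuma; the strict variant is shown commented out for contrast:

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    size_t size = 1UL << 30;   // 1 GB working set (placeholder)

    // Over-binding: strict policy - if node 0 runs short of free pages,
    // allocations fail or force reclaim instead of spilling to node 1.
    //
    //   struct bitmask *nodes = numa_allocate_nodemask();
    //   numa_bitmask_setbit(nodes, 0);
    //   numa_set_membind(nodes);          // node 0 only, no fallback

    // Gentler alternative: prefer node 0 but allow fallback to other nodes
    // under memory pressure, and leave CPU scheduling to the kernel.
    numa_set_preferred(0);
    void *buf = numa_alloc(size);          // honors the current task policy
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* ... do work ... */

    numa_free(buf, size);
    return 0;
}
```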
Pitfall 2: Ignoring First-Touch
```c
// WRONG: Allocate in main(), workers access remote memory
void *shared_data;

int main() {
    shared_data = malloc(HUGE_SIZE);
    memset(shared_data, 0, HUGE_SIZE);  // All pages on main()'s node!

    for (int i = 0; i < NUM_WORKERS; i++) {
        spawn_worker(i);  // Workers run on various nodes
    }
    // Workers mostly access REMOTE memory!
}

// RIGHT: Workers allocate their own data
void worker(int id) {
    numa_run_on_node(id % (numa_max_node() + 1));  // round-robin over all nodes
    void *my_data = numa_alloc_local(WORKER_SIZE);
    memset(my_data, 0, WORKER_SIZE);  // First touch on local node
    process(my_data);                 // All LOCAL accesses
}
```

Pitfall 3: False Sharing
Multiple threads modify different variables that share a cache line:
```c
// WRONG: Counters share cache lines
struct BadCounters {
    long count_thread_0;   // Offset 0
    long count_thread_1;   // Offset 8  - SAME CACHE LINE!
    long count_thread_2;   // Offset 16 - SAME CACHE LINE!
    long count_thread_3;   // Offset 24 - SAME CACHE LINE!
};
// Every increment invalidates other threads' caches!

// RIGHT: Pad to separate cache lines
#define CACHE_LINE 64

struct GoodCounters {
    long count_thread_0;
    char _pad0[CACHE_LINE - sizeof(long)];
    long count_thread_1;
    char _pad1[CACHE_LINE - sizeof(long)];
    long count_thread_2;
    char _pad2[CACHE_LINE - sizeof(long)];
    long count_thread_3;
    char _pad3[CACHE_LINE - sizeof(long)];
};
// Each counter on its own cache line - no false sharing
```

Pitfall 4: Measuring Wrong

Benchmarking with toy inputs, a single run, or no before/after comparison produces "improvements" that vanish, or even regress, under production load. Always measure with representative workloads, as emphasized at the start of this page.
NUMA optimization adds complexity and can reduce portability. Don't optimize NUMA until (1) you've profiled and confirmed NUMA is a bottleneck, (2) other optimizations (algorithms, caching, I/O) are exhausted, and (3) the workload will run on NUMA hardware in production. Premature NUMA optimization is wasted effort.
Optimization isn't a one-time event. Production workloads change, and NUMA problems can emerge over time. Continuous monitoring catches regressions early.
Key Metrics to Monitor:
| Metric | Source | Alert Threshold |
|---|---|---|
| Local hit ratio | numastat, perf counters | < 90% sustained |
| Per-node memory utilization | numastat -m | > 50% imbalance |
| numa_miss rate | /proc/vmstat | > 10% of numa_hit |
| Cross-node bandwidth | UPI/IF counters | > 50% saturation |
| Page migration rate | /proc/vmstat | Sustained high rate |
```bash
#!/bin/bash
# Production NUMA monitoring script
# Run via cron every minute, export to monitoring system

OUTPUT_FILE="/var/log/numa_metrics.json"

# Collect NUMA statistics
numa_stats=$(numastat -c 2>/dev/null | tail -n +3)

# Parse key metrics
numa_hit=$(grep numa_hit /proc/vmstat | awk '{print $2}')
numa_miss=$(grep numa_miss /proc/vmstat | awk '{print $2}')
numa_foreign=$(grep numa_foreign /proc/vmstat | awk '{print $2}')
pgmigrate=$(grep pgmigrate_success /proc/vmstat | awk '{print $2}')

# Calculate hit ratio
if [ $((numa_hit + numa_miss)) -gt 0 ]; then
    hit_ratio=$(echo "scale=4; $numa_hit / ($numa_hit + $numa_miss) * 100" | bc)
else
    hit_ratio="100"
fi

# Per-node memory (example for 4 nodes)
for node in 0 1 2 3; do
    meminfo="/sys/devices/system/node/node$node/meminfo"
    if [ -f "$meminfo" ]; then
        total=$(grep MemTotal "$meminfo" | awk '{print $4}')
        free=$(grep MemFree "$meminfo" | awk '{print $4}')
        used=$((total - free))
        usage=$((used * 100 / total))
        eval "node${node}_usage=$usage"
        eval "node${node}_used_mb=$((used / 1024))"
    fi
done

# Output as JSON for ingestion by Prometheus/Datadog/etc.
cat > "$OUTPUT_FILE" << EOF
{
  "timestamp": "$(date -Iseconds)",
  "numa_hit": $numa_hit,
  "numa_miss": $numa_miss,
  "numa_foreign": $numa_foreign,
  "hit_ratio_pct": $hit_ratio,
  "pgmigrate_success": $pgmigrate,
  "node0_usage_pct": ${node0_usage:-0},
  "node1_usage_pct": ${node1_usage:-0},
  "node2_usage_pct": ${node2_usage:-0},
  "node3_usage_pct": ${node3_usage:-0}
}
EOF

# Alert if hit ratio drops
if (( $(echo "$hit_ratio < 90" | bc -l) )); then
    echo "ALERT: NUMA hit ratio dropped to ${hit_ratio}%"
    # Integrate with alerting system
fi
```

Integrating with Observability Stack:
For comprehensive monitoring, export NUMA metrics to your observability platform:
- Prometheus node_exporter: enable `--collector.meminfo_numa`, or ship the JSON produced above via a custom exporter

Correlate NUMA metrics with application performance. When latency spikes, check whether the NUMA hit ratio dropped at the same time.
Establish NUMA baselines after tuning. A code change or configuration update might unknowingly break NUMA optimization (e.g., thread pool size change, library update). Compare current metrics to baseline to catch regressions before users notice.
We've completed our deep dive into NUMA performance optimization. Let's consolidate the key insights:

- Measure first: profile with numastat and perf before changing anything, and validate every change with before/after benchmarks on representative workloads.
- Match the technique to the workload: colocate threads with their data, partition where possible, and interleave only what is genuinely shared.
- Respect first-touch, avoid false sharing, and don't over-bind.
- Keep monitoring in production so regressions are caught early.
Module Complete: NUMA Architecture
Over these five pages, you've mastered NUMA architecture and node topology, the cost difference between local and remote access, allocation strategies and policies, and the profiling, benchmarking, and optimization workflow covered here.
You're now equipped to understand, diagnose, and optimize NUMA behavior in production systems—a skill that separates good engineers from exceptional ones in the multi-socket server world.
Congratulations! You've completed the NUMA Architecture module. You now possess deep understanding of how modern multi-socket systems organize memory access and the skills to optimize performance at this fundamental level. This knowledge is invaluable for anyone working with high-performance, memory-intensive applications on enterprise hardware.