You now understand NUMA architecture, nodes, local vs. remote access, and allocation strategies. But theory without application is incomplete. This final page transforms that knowledge into actionable optimization skills—the ability to diagnose NUMA problems, measure improvements, and apply the right fixes in production environments.
We'll cover the complete optimization workflow: profiling to identify problems, benchmarking to quantify them, applying targeted fixes, and validating improvements. By the end, you'll have a systematic approach to NUMA performance optimization.
By the end of this page, you will be able to systematically profile applications for NUMA issues, benchmark and quantify NUMA performance, apply targeted optimizations based on workload characteristics, and avoid common pitfalls that undermine NUMA performance.
Effective NUMA optimization follows a disciplined workflow. Jumping straight to 'fixes' without understanding the problem often makes things worse. Here's the systematic approach:
The Five-Step NUMA Optimization Process:

1. Profile the application to confirm NUMA is actually the bottleneck.
2. Benchmark to quantify the cost of remote access under a representative workload.
3. Apply the optimization that matches the workload's characteristics.
4. Validate with before/after measurements on the same workload.
5. Monitor continuously in production to catch regressions.
Never optimize without measurement. 'It should be faster' is not validation. NUMA optimizations can backfire—interleaving hurts locality-dependent workloads, over-binding prevents load balancing. Always measure before and after, with representative workloads.
Before optimizing, you must identify whether NUMA is even a problem. Many performance issues blamed on NUMA are actually caused by lock contention, I/O latency, or algorithmic inefficiency. Here's how to detect genuine NUMA problems.
Key NUMA Metrics:
| Metric | Healthy Range | Concerning | Source |
|---|---|---|---|
| Local memory hit ratio | > 95% | < 90% | numastat, perf counters |
| numa_miss rate | < 5% of allocations | > 10% | /proc/vmstat, numastat |
| Node memory imbalance | < 20% variance | > 50% variance | numastat -m |
| Remote memory bandwidth | < 10% of total | > 25% of total | UPI/Infinity Fabric counters |
| Cross-node coherency traffic | Low | High under load | uncore counters |
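The headline metric, local hit ratio, can be derived directly from the numa_hit and numa_miss counters in /proc/vmstat. Below is a minimal sketch (not part of any standard tool); note that these counters are cumulative since boot, so for live monitoring you would sample deltas over an interval:

```c
#include <stdio.h>
#include <string.h>

// Compute the local hit ratio from the kernel's lifetime NUMA counters.
int main(void) {
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("/proc/vmstat"); return 1; }

    char key[64];
    unsigned long long value, hit = 0, miss = 0;
    while (fscanf(f, "%63s %llu", key, &value) == 2) {
        if (strcmp(key, "numa_hit") == 0)  hit = value;
        if (strcmp(key, "numa_miss") == 0) miss = value;
    }
    fclose(f);

    if (hit + miss == 0) {
        printf("No NUMA allocations recorded\n");
        return 0;
    }
    double ratio = 100.0 * hit / (double)(hit + miss);
    printf("Local hit ratio: %.2f%% (numa_hit=%llu, numa_miss=%llu)\n",
           ratio, hit, miss);
    return 0;
}
```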
The following script collects the system-wide and per-process views in one pass:

```bash
#!/bin/bash
# Comprehensive NUMA profiling script

echo "=== System NUMA Configuration ==="
numactl --hardware

echo -e "\n=== System-Wide NUMA Statistics ==="
numastat

echo -e "\n=== NUMA Memory Distribution ==="
numastat -m

echo -e "\n=== Per-Process NUMA Breakdown ==="
# Replace 'myapp' with your application name
for pid in $(pgrep myapp); do
    echo "--- PID $pid ---"
    numastat -p $pid
done

echo -e "\n=== Live NUMA Monitoring (5 seconds) ==="
# Monitor numa_hit/miss rate changes
echo "Timestamp | numa_hit | numa_miss | numa_foreign | numa_local | numa_other"
for i in {1..5}; do
    stats=$(grep -E "numa_(hit|miss|foreign|local|other)" /proc/vmstat)
    hit=$(echo "$stats" | grep numa_hit | awk '{print $2}')
    miss=$(echo "$stats" | grep numa_miss | awk '{print $2}')
    foreign=$(echo "$stats" | grep numa_foreign | awk '{print $2}')
    local=$(echo "$stats" | grep numa_local | awk '{print $2}')
    other=$(echo "$stats" | grep numa_other | awk '{print $2}')
    echo "$(date +%H:%M:%S) | $hit | $miss | $foreign | $local | $other"
    sleep 1
done

echo -e "\n=== Memory Placement for Specific Process ==="
# Show NUMA mapping of process memory regions
pid=$(pgrep -f myapp | head -1)
if [ -n "$pid" ]; then
    echo "Memory regions for PID $pid:"
    head -20 /proc/$pid/numa_maps
fi

echo -e "\n=== Hardware Performance Counters (if available) ==="
# Check if perf can measure NUMA events
perf list 2>/dev/null | grep -i numa || echo "No NUMA perf events found"

# Example: measure NUMA events during application run
# perf stat -e node-loads,node-stores,node-load-misses,node-store-misses ./myapp
```

Using perf for Deep NUMA Analysis:
The Linux perf tool can provide detailed NUMA insight using hardware performance counters:
```bash
# Basic NUMA event recording
perf stat -e node-loads,node-stores,node-load-misses,node-store-misses \
    ./my_application

# Sample output:
#   Performance counter stats for './my_application':
#     456,789,012   node-loads
#     123,456,789   node-stores
#      45,678,901   node-load-misses    # 10% remote loads (BAD)
#         345,678   node-store-misses   # 0.3% remote stores (OK)

# Calculate NUMA efficiency:
#   Local Load Ratio = 1 - (node-load-misses / node-loads)
#   Target: > 95%

# Record for detailed analysis
perf record -e node-loads,node-load-misses -g ./my_application
perf report

# AMD systems: Data Fabric events
perf stat -e amd_df/event=0x07,umask=0x02/ ./my_application

# Intel systems: UPI traffic
perf stat -e uncore_upi/event=0x02,umask=0x0f/ ./my_application

# Memory bandwidth by NUMA node (Intel)
perf stat -e uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ \
    -e uncore_imc_1/cas_count_read/,uncore_imc_1/cas_count_write/ \
    ./my_application
```

If more than 10% of your memory accesses are remote (node-load-misses / node-loads > 0.10), NUMA optimization will likely yield significant benefits. Below 5%, other optimizations may be more impactful. Between 5% and 10%, it depends on whether your workload is latency-sensitive.
Profiling identifies problems; benchmarking quantifies them. Good benchmarks isolate NUMA effects from other variables and produce reproducible results.
Benchmarking Best Practices:

- Pin benchmark threads to a known node so you measure the placement you intend.
- Touch all pages before timing, and include warmup iterations.
- Use pointer chasing to measure latency (it defeats prefetching) and sequential reads to measure bandwidth.
- Average multiple runs and report per-node results so local and remote numbers can be compared.
```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <pthread.h>

#define ARRAY_SIZE (512 * 1024 * 1024)  // 512 MB per thread
#define ITERATIONS 100000000
#define WARMUP_ITERATIONS 10000000
#define NUM_RUNS 5

typedef struct {
    int thread_id;
    int numa_node;
    void *buffer;
    double latency_ns;
    double bandwidth_gbps;
} ThreadResult;

// Pointer-chasing latency benchmark
double benchmark_latency(void *buffer, size_t size, int iterations) {
    void **chain = (void **)buffer;
    size_t num_elements = size / sizeof(void *);

    // Create random pointer chain (defeats prefetching)
    for (size_t i = 0; i < num_elements - 1; i++) {
        chain[i] = &chain[(i * 179 + 17) % num_elements];
    }
    chain[num_elements - 1] = &chain[0];

    // Touch all pages
    for (size_t i = 0; i < num_elements; i += 512) {
        volatile void *touch = chain[i];
        (void)touch;
    }

    // Warmup
    void **p = chain;
    for (int i = 0; i < WARMUP_ITERATIONS; i++) {
        p = (void **)*p;
    }

    // Timed run
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    p = chain;
    for (int i = 0; i < iterations; i++) {
        p = (void **)*p;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    // Prevent optimization
    volatile void *sink = p;
    (void)sink;

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / iterations;
}

// Sequential bandwidth benchmark
double benchmark_bandwidth(void *buffer, size_t size) {
    // Warmup
    memset(buffer, 1, size);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // Read entire buffer (bandwidth test)
    volatile long sum = 0;
    long *data = (long *)buffer;
    size_t count = size / sizeof(long);
    for (size_t i = 0; i < count; i += 8) {
        sum += data[i] + data[i+1] + data[i+2] + data[i+3] +
               data[i+4] + data[i+5] + data[i+6] + data[i+7];
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_sec = (end.tv_sec - start.tv_sec) +
                         (end.tv_nsec - start.tv_nsec) / 1e9;
    return (size / 1e9) / elapsed_sec;  // GB/s
}

void *benchmark_thread(void *arg) {
    ThreadResult *result = (ThreadResult *)arg;

    // Pin to NUMA node
    numa_run_on_node(result->numa_node);

    // Allocate locally
    result->buffer = numa_alloc_local(ARRAY_SIZE);
    if (!result->buffer) {
        fprintf(stderr, "Thread %d: allocation failed\n", result->thread_id);
        return NULL;
    }

    // Touch to materialize
    memset(result->buffer, 0, ARRAY_SIZE);

    // Run benchmarks
    result->latency_ns = benchmark_latency(result->buffer, ARRAY_SIZE, ITERATIONS);
    result->bandwidth_gbps = benchmark_bandwidth(result->buffer, ARRAY_SIZE);

    numa_free(result->buffer, ARRAY_SIZE);
    return NULL;
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int num_nodes = numa_max_node() + 1;
    printf("NUMA Benchmark: %d nodes\n\n", num_nodes);

    // Run benchmark on each node
    printf("%-8s %15s %15s\n", "Node", "Latency (ns)", "Bandwidth (GB/s)");
    printf("%-8s %15s %15s\n", "----", "-----------", "---------------");

    for (int node = 0; node < num_nodes; node++) {
        double total_latency = 0;
        double total_bandwidth = 0;

        for (int run = 0; run < NUM_RUNS; run++) {
            ThreadResult result = {
                .thread_id = 0,
                .numa_node = node,
            };

            pthread_t thread;
            pthread_create(&thread, NULL, benchmark_thread, &result);
            pthread_join(thread, NULL);

            total_latency += result.latency_ns;
            total_bandwidth += result.bandwidth_gbps;
        }

        printf("Node %-3d %15.1f %15.1f\n", node,
               total_latency / NUM_RUNS,
               total_bandwidth / NUM_RUNS);
    }

    return 0;
}
```

Microbenchmarks measure theoretical limits. Real applications have mixed access patterns, cache effects, and non-memory bottlenecks.
Use microbenchmarks to understand hardware capabilities, but always validate with realistic application benchmarks before declaring victory.
With measurements in hand, apply the appropriate optimization. The best technique depends on your workload's characteristics.
Decision Framework:
| Workload Characteristic | Recommended Optimization | Tools/Approach |
|---|---|---|
| Single-threaded, memory-intensive | Strict node binding | numactl --cpunodebind=0 --membind=0 |
| Multi-threaded, partitionable data | Per-thread local allocation | NUMA-aware memory pools |
| Shared read-only data | Interleave across all nodes | numactl --interleave=all |
| Heavily shared, frequently modified | Minimize sharing, or replicate | Data structure redesign |
| Unknown/mixed access patterns | Local allocation + AutoNUMA | Enable numa_balancing |
| Database workload | Buffer pool per node | Database-specific tuning |
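As a rough illustration of how two of these rows look in application code, the sketch below (assuming libnuma; the sizes and variable names are placeholders) interleaves a shared read-only table across all nodes with numa_alloc_interleaved(), while keeping a partitionable per-thread slice local with numa_alloc_local():

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    // Shared read-only data: spread pages across all nodes so no single
    // memory controller becomes the hotspot (same effect as
    // `numactl --interleave=all`, but scoped to this one allocation).
    size_t table_size = 1UL << 30;                 // 1 GB lookup table (placeholder)
    void *shared_table = numa_alloc_interleaved(table_size);

    // Partitionable per-thread data: allocate on the node the calling
    // thread is currently running on, keeping accesses local.
    size_t slice_size = 64UL << 20;                // 64 MB slice (placeholder)
    void *local_slice = numa_alloc_local(slice_size);

    if (!shared_table || !local_slice) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* ... populate and use ... */

    numa_free(shared_table, table_size);
    numa_free(local_slice, slice_size);
    return 0;
}
```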
Technique 1: Thread-Data Colocation
The most powerful optimization: ensure threads run near the data they access.
```c
#include <numa.h>
#include <pthread.h>
#include <string.h>   // for memset

typedef struct {
    int node;
    void *local_data;
    size_t data_size;
    void (*work_fn)(void *, size_t);
} Worker;

void *colocated_worker(void *arg) {
    Worker *w = (Worker *)arg;

    // Step 1: Bind thread to designated NUMA node
    numa_run_on_node(w->node);

    // Step 2: Allocate data locally (thread is now on correct node)
    w->local_data = numa_alloc_local(w->data_size);

    // Step 3: Initialize data (first-touch on local node)
    memset(w->local_data, 0, w->data_size);

    // Step 4: Process - all accesses are local
    w->work_fn(w->local_data, w->data_size);

    numa_free(w->local_data, w->data_size);
    return NULL;
}

void spawn_colocated_workers(int num_workers, size_t data_per_worker,
                             void (*work_fn)(void *, size_t)) {
    int num_nodes = numa_max_node() + 1;
    pthread_t threads[num_workers];
    Worker workers[num_workers];

    for (int i = 0; i < num_workers; i++) {
        workers[i].node = i % num_nodes;  // Round-robin across nodes
        workers[i].data_size = data_per_worker;
        workers[i].work_fn = work_fn;
        pthread_create(&threads[i], NULL, colocated_worker, &workers[i]);
    }

    for (int i = 0; i < num_workers; i++) {
        pthread_join(threads[i], NULL);
    }
}
```

Technique 2: Data Partitioning
For embarrassingly parallel workloads, partition data so each NUMA node has a complete, independent slice:
```python
# Conceptual data partitioning for NUMA
# (Actual memory binding requires C/ctypes or specialized libraries)

def partition_for_numa(data, num_nodes):
    """
    Partition data for NUMA-aware processing.

    Each partition should be:
    1. Allocated on its designated node
    2. Processed by threads running on that node
    3. Independent (no cross-partition access during processing)
    """
    partition_size = len(data) // num_nodes
    partitions = []

    for node in range(num_nodes):
        start = node * partition_size
        end = start + partition_size if node < num_nodes - 1 else len(data)

        # In real code, allocate this partition on 'node' using numa_alloc_onnode
        partition = data[start:end]
        partitions.append((node, partition))

    return partitions


def process_partitioned(partitions, process_fn):
    """
    Process each partition on its designated NUMA node.

    In real implementation:
    1. Pin thread to node
    2. Access only local partition
    3. Aggregate results at end (minimize cross-node sync)
    """
    results = []
    for node, partition in partitions:
        # spawn_thread_on_node(node, process_fn, partition)
        result = process_fn(partition)
        results.append(result)

    return aggregate(results)
```
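As the comments above note, the actual binding has to happen in C (or via ctypes). A minimal sketch of the allocation side, assuming libnuma; the Partition type and allocate_partitions name are illustrative:

```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative partition descriptor: one slice per NUMA node.
typedef struct {
    int     node;
    double *data;
    size_t  count;
} Partition;

// Split a 'total'-element array into per-node slices, binding each slice's
// pages to its node with numa_alloc_onnode(). Free each slice later with
// numa_free(parts[node].data, parts[node].count * sizeof(double)).
Partition *allocate_partitions(size_t total, int *out_num_nodes) {
    if (numa_available() < 0) return NULL;

    int num_nodes = numa_max_node() + 1;
    Partition *parts = calloc(num_nodes, sizeof(Partition));
    if (!parts) return NULL;

    size_t base = total / num_nodes;
    for (int node = 0; node < num_nodes; node++) {
        // Last node absorbs the remainder.
        size_t count = (node == num_nodes - 1) ? total - base * node : base;
        parts[node].node  = node;
        parts[node].count = count;
        parts[node].data  = numa_alloc_onnode(count * sizeof(double), node);
        if (!parts[node].data) {
            fprintf(stderr, "allocation on node %d failed\n", node);
            free(parts);  // per-slice cleanup omitted for brevity
            return NULL;
        }
    }
    *out_num_nodes = num_nodes;
    return parts;
}
```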
Technique 3: NUMA-Aware Data Structures

Redesign data structures to respect NUMA boundaries: shard hash tables, counters, and pools per node, replicate read-mostly structures instead of sharing them, and keep per-thread state on the owning thread's node.
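One possible shape for such a structure is sketched below, assuming libnuma and GCC/Clang atomic builtins; NodeShard, stats_inc, and stats_read are hypothetical names. Each node gets its own shard, so updates from local threads never touch remote memory, and only readers pay for cross-node traffic:

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

// One shard per NUMA node; counter_id must be < 64 in this sketch.
typedef struct {
    long counters[64];
} NodeShard;

static NodeShard **shards;   // shards[node] lives on that node
static int num_nodes;

int stats_init(void) {
    if (numa_available() < 0) return -1;
    num_nodes = numa_max_node() + 1;
    shards = calloc(num_nodes, sizeof(NodeShard *));
    if (!shards) return -1;
    for (int node = 0; node < num_nodes; node++) {
        // numa_alloc_onnode binds the pages; memset materializes them on 'node'.
        shards[node] = numa_alloc_onnode(sizeof(NodeShard), node);
        if (!shards[node]) return -1;
        memset(shards[node], 0, sizeof(NodeShard));
    }
    return 0;
}

// Writers update the shard of the node they are currently running on.
void stats_inc(int counter_id) {
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0) node = 0;
    __atomic_fetch_add(&shards[node]->counters[counter_id], 1, __ATOMIC_RELAXED);
}

// Readers aggregate across shards (the only cross-node accesses).
long stats_read(int counter_id) {
    long total = 0;
    for (int node = 0; node < num_nodes; node++)
        total += __atomic_load_n(&shards[node]->counters[counter_id], __ATOMIC_RELAXED);
    return total;
}
```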
Focus optimization on the hot path. A NUMA-oblivious cold path (error handling, logging) barely matters. Identify the 20% of code that consumes 80% of memory bandwidth and optimize that. Leave the rest alone.
Different application types require different NUMA strategies. Here are specific recommendations for common workloads.
Databases (PostgreSQL, MySQL, MongoDB):
```bash
# PostgreSQL NUMA tuning

# Option 1: Interleave shared_buffers (large, shared by all backends)
numactl --interleave=all postgres -c shared_buffers=64GB

# Option 2: Run separate PostgreSQL instances per NUMA node
# Node 0 instance
numactl --cpunodebind=0 --membind=0 \
    postgres -D /var/lib/postgresql/node0 -p 5432

# Node 1 instance
numactl --cpunodebind=1 --membind=1 \
    postgres -D /var/lib/postgresql/node1 -p 5433

# Then use connection pooler (PgBouncer) to route queries

# MySQL NUMA tuning
# In my.cnf:
#   innodb_numa_interleave = 1   # Interleave buffer pool

# Or via command line
numactl --interleave=all mysqld \
    --innodb-buffer-pool-size=64G \
    --innodb-buffer-pool-instances=8

# MongoDB NUMA tuning
# MongoDB strongly recommends disabling zone_reclaim_mode
echo 0 > /proc/sys/vm/zone_reclaim_mode

# Then run with interleave
numactl --interleave=all mongod --dbpath /data/db
```

In-Memory Caches (Redis, Memcached):
```bash
# Redis NUMA tuning
# Redis is single-threaded, so strict node binding is ideal

# Bind to single node (best if data fits in one node's memory)
numactl --cpunodebind=0 --membind=0 redis-server /etc/redis.conf

# For larger datasets, run multiple instances
for node in 0 1 2 3; do
    port=$((6379 + node))
    numactl --cpunodebind=$node --membind=$node \
        redis-server --port $port --daemonize yes
done

# Memcached NUMA tuning
# Memcached is multi-threaded, use interleave for shared hash table
numactl --interleave=all memcached -m 65536 -t $(nproc)

# Or per-node instances (better for very large deployments)
for node in 0 1 2 3; do
    port=$((11211 + node))
    cores_on_node=$(numactl -H | grep "node $node cpus:" | cut -d: -f2)
    num_cores=$(echo $cores_on_node | wc -w)
    numactl --cpunodebind=$node --membind=$node \
        memcached -p $port -t $num_cores -m 16384 -d
done
```

JVM Applications (Spark, Elasticsearch, Kafka):
```bash
# JVM NUMA tuning

# Enable JVM NUMA support (works with G1GC)
java -XX:+UseNUMA -XX:+UseG1GC -Xmx64g -jar application.jar

# For ZGC (NUMA-aware by default in Java 15+)
java -XX:+UseZGC -Xmx64g -jar application.jar

# Combine with numactl for additional control
numactl --interleave=all \
    java -XX:+UseNUMA -XX:+UseG1GC -Xmx64g -jar application.jar

# Elasticsearch NUMA settings
# In jvm.options:
#   -XX:+UseG1GC
#   -XX:+UseNUMA

# Also consider:
# - Running one ES node per NUMA node (smaller heap, more instances)
# - Setting node.processors to CPUs on one NUMA node

# Kafka NUMA tuning
# Bind broker to single NUMA node for best latency
numactl --cpunodebind=0 --membind=0 \
    kafka-server-start.sh config/server.properties

# For throughput, interleave (log data is shared across partitions)
numactl --interleave=all \
    kafka-server-start.sh config/server.properties
```

Many applications have official NUMA tuning guides. Check documentation before applying generic advice. PostgreSQL, MySQL, MongoDB, Redis, and Kafka all have specific recommendations that reflect their internal architectures.
NUMA optimization has many ways to go wrong. Learn from others' mistakes.
Pitfall 1: Over-Binding

Binding every thread and every allocation to a specific node strips the kernel of its ability to balance load: if one node's cores saturate or its memory fills up, strictly bound work cannot spill over to idle resources, and overall throughput drops.
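A gentler approach is a preferred (rather than strict) memory policy. The sketch below is illustrative and assumes libnuma; the strict variant is shown commented out for contrast:

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    size_t size = 1UL << 30;   // 1 GB working set (placeholder)

    // Over-binding: strict policy - if node 0 runs short of free pages,
    // allocations fail or force reclaim instead of spilling to node 1.
    //
    //   struct bitmask *nodes = numa_allocate_nodemask();
    //   numa_bitmask_setbit(nodes, 0);
    //   numa_set_membind(nodes);          // node 0 only, no fallback

    // Gentler alternative: prefer node 0 but allow fallback to other nodes
    // under memory pressure, and leave CPU scheduling to the kernel.
    numa_set_preferred(0);
    void *buf = numa_alloc(size);          // honors the current task policy
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* ... do work ... */

    numa_free(buf, size);
    return 0;
}
```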
Pitfall 2: Ignoring First-Touch
```c
// WRONG: Allocate in main(), workers access remote memory
void *shared_data;

int main() {
    shared_data = malloc(HUGE_SIZE);
    memset(shared_data, 0, HUGE_SIZE);  // All pages on main()'s node!

    for (int i = 0; i < NUM_WORKERS; i++) {
        spawn_worker(i);  // Workers run on various nodes
    }
    // Workers mostly access REMOTE memory!
}

// RIGHT: Workers allocate their own data
void worker(int id) {
    numa_run_on_node(id % (numa_max_node() + 1));  // round-robin over all nodes
    void *my_data = numa_alloc_local(WORKER_SIZE);
    memset(my_data, 0, WORKER_SIZE);  // First touch on local node
    process(my_data);                 // All LOCAL accesses
}
```

Pitfall 3: False Sharing
Multiple threads modify different variables that share a cache line:
```c
// WRONG: Counters share cache lines
struct BadCounters {
    long count_thread_0;   // Offset 0
    long count_thread_1;   // Offset 8  - SAME CACHE LINE!
    long count_thread_2;   // Offset 16 - SAME CACHE LINE!
    long count_thread_3;   // Offset 24 - SAME CACHE LINE!
};
// Every increment invalidates other threads' caches!

// RIGHT: Pad to separate cache lines
#define CACHE_LINE 64

struct GoodCounters {
    long count_thread_0;
    char _pad0[CACHE_LINE - sizeof(long)];
    long count_thread_1;
    char _pad1[CACHE_LINE - sizeof(long)];
    long count_thread_2;
    char _pad2[CACHE_LINE - sizeof(long)];
    long count_thread_3;
    char _pad3[CACHE_LINE - sizeof(long)];
};
// Each counter on its own cache line - no false sharing
```

Pitfall 4: Measuring Wrong

Benchmarking with toy inputs, a single run, or no before/after comparison produces "improvements" that vanish, or even regress, under production load. Always measure with representative workloads, as emphasized at the start of this page.
NUMA optimization adds complexity and can reduce portability. Don't optimize NUMA until (1) you've profiled and confirmed NUMA is a bottleneck, (2) other optimizations (algorithms, caching, I/O) are exhausted, and (3) the workload will run on NUMA hardware in production. Premature NUMA optimization is wasted effort.
Optimization isn't a one-time event. Production workloads change, and NUMA problems can emerge over time. Continuous monitoring catches regressions early.
Key Metrics to Monitor:
| Metric | Source | Alert Threshold |
|---|---|---|
| Local hit ratio | numastat, perf counters | < 90% sustained |
| Per-node memory utilization | numastat -m | > 50% imbalance |
| numa_miss rate | /proc/vmstat | > 10% of numa_hit |
| Cross-node bandwidth | UPI/IF counters | > 50% saturation |
| Page migration rate | /proc/vmstat | Sustained high rate |
```bash
#!/bin/bash
# Production NUMA monitoring script
# Run via cron every minute, export to monitoring system

OUTPUT_FILE="/var/log/numa_metrics.json"

# Collect NUMA statistics
numa_stats=$(numastat -c 2>/dev/null | tail -n +3)

# Parse key metrics
numa_hit=$(grep numa_hit /proc/vmstat | awk '{print $2}')
numa_miss=$(grep numa_miss /proc/vmstat | awk '{print $2}')
numa_foreign=$(grep numa_foreign /proc/vmstat | awk '{print $2}')
pgmigrate=$(grep pgmigrate_success /proc/vmstat | awk '{print $2}')

# Calculate hit ratio
if [ $((numa_hit + numa_miss)) -gt 0 ]; then
    hit_ratio=$(echo "scale=4; $numa_hit / ($numa_hit + $numa_miss) * 100" | bc)
else
    hit_ratio="100"
fi

# Per-node memory (example for 4 nodes)
for node in 0 1 2 3; do
    meminfo="/sys/devices/system/node/node$node/meminfo"
    if [ -f "$meminfo" ]; then
        total=$(grep MemTotal "$meminfo" | awk '{print $4}')
        free=$(grep MemFree "$meminfo" | awk '{print $4}')
        used=$((total - free))
        usage=$((used * 100 / total))
        eval "node${node}_usage=$usage"
        eval "node${node}_used_mb=$((used / 1024))"
    fi
done

# Output as JSON for ingestion by Prometheus/Datadog/etc.
cat > "$OUTPUT_FILE" << EOF
{
  "timestamp": "$(date -Iseconds)",
  "numa_hit": $numa_hit,
  "numa_miss": $numa_miss,
  "numa_foreign": $numa_foreign,
  "hit_ratio_pct": $hit_ratio,
  "pgmigrate_success": $pgmigrate,
  "node0_usage_pct": ${node0_usage:-0},
  "node1_usage_pct": ${node1_usage:-0},
  "node2_usage_pct": ${node2_usage:-0},
  "node3_usage_pct": ${node3_usage:-0}
}
EOF

# Alert if hit ratio drops
if (( $(echo "$hit_ratio < 90" | bc -l) )); then
    echo "ALERT: NUMA hit ratio dropped to ${hit_ratio}%"
    # Integrate with alerting system
fi
```

Integrating with Observability Stack:
For comprehensive monitoring, export NUMA metrics to your observability platform:
- Prometheus node_exporter: enable `--collector.meminfo_numa`, or ship the JSON produced above via a custom exporter

Correlate NUMA metrics with application performance. When latency spikes, check whether the NUMA hit ratio dropped at the same time.
Establish NUMA baselines after tuning. A code change or configuration update might unknowingly break NUMA optimization (e.g., thread pool size change, library update). Compare current metrics to baseline to catch regressions before users notice.
We've completed our deep dive into NUMA performance optimization. Let's consolidate the key insights:

- Measure first: profile with numastat and perf before changing anything, and validate every change with before/after benchmarks on representative workloads.
- Match the technique to the workload: colocate threads with their data, partition where possible, and interleave only what is genuinely shared.
- Respect first-touch, avoid false sharing, and don't over-bind.
- Keep monitoring in production so regressions are caught early.
Module Complete: NUMA Architecture
Over these five pages, you've mastered NUMA architecture and node topology, the cost difference between local and remote access, allocation strategies and policies, and the profiling, benchmarking, and optimization workflow covered here.
You're now equipped to understand, diagnose, and optimize NUMA behavior in production systems—a skill that separates good engineers from exceptional ones in the multi-socket server world.
Congratulations! You've completed the NUMA Architecture module. You now possess deep understanding of how modern multi-socket systems organize memory access and the skills to optimize performance at this fundamental level. This knowledge is invaluable for anyone working with high-performance, memory-intensive applications on enterprise hardware.