Understanding NUMA is only half the battle. The other half is controlling memory placement—ensuring that allocations end up on the right NUMA node. This page dives into the complete toolkit for NUMA-aware memory allocation: Linux memory policies, the libnuma library, system calls, and practical patterns for production systems.
By mastering these techniques, you'll be able to write applications that actively exploit NUMA topology rather than being victimized by poor memory placement.
By the end of this page, you will understand Linux's memory policy system in depth, be able to program with libnuma for fine-grained control, know how to apply policies at different granularities (process, thread, virtual memory area), and apply proven allocation patterns for NUMA-optimized applications.
Linux provides a sophisticated memory policy framework that controls how memory is allocated across NUMA nodes. This framework operates at three levels: the whole process or task (via set_mempolicy() or numactl), individual threads (each thread can install its own task policy), and individual virtual memory areas (via mbind()).
The Four Memory Policies:
Linux supports four fundamental memory allocation policies:
| Policy | Behavior | Use Case |
|---|---|---|
| DEFAULT | Allocate on the local node (where CPU runs) | General purpose, best for most workloads |
| BIND | Strictly allocate from specified node set | Guarantee locality, fail if unavailable |
| PREFERRED | Try specified node, fall back if needed | Prefer locality, allow flexibility |
| INTERLEAVE | Round-robin pages across specified nodes | Shared data, bandwidth spreading |
Policy Representation in the Kernel:
Internally, each policy is represented by a struct mempolicy containing the policy mode (MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED, or MPOL_INTERLEAVE), optional mode flags, and a nodemask of the nodes the policy applies to.
Policies are reference-counted and can be shared between VMAs for efficiency.
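A simplified sketch of that structure is shown below. The field names are abridged from include/linux/mempolicy.h and use kernel types (atomic_t, nodemask_t), so this is an illustration of the layout rather than userspace-compilable code; consult your kernel tree for the authoritative definition.

```c
/* Simplified sketch of the kernel's struct mempolicy (fields abridged;
 * see include/linux/mempolicy.h for the real definition). */
struct mempolicy {
    atomic_t       refcnt;   /* reference count: VMAs can share one policy object */
    unsigned short mode;     /* MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED, MPOL_INTERLEAVE */
    unsigned short flags;    /* optional mode flags, e.g. MPOL_F_STATIC_NODES */
    nodemask_t     nodes;    /* set of NUMA nodes the policy applies to */
};
```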
```c
#include <numaif.h>
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Core system calls for memory policy control:
 *
 *   set_mempolicy()  - Set task policy
 *   get_mempolicy()  - Query current policy
 *   mbind()          - Set policy for a VMA
 *   migrate_pages()  - Move pages between nodes
 *   move_pages()     - Move specific pages
 */

void demonstrate_policies() {
    unsigned long nodemask = 0x3;        // Nodes 0 and 1
    int maxnode = numa_max_node() + 2;   // +2 for mask size convention

    // =====================================================
    // Policy 1: BIND - Strict allocation to specified nodes
    // =====================================================
    printf("Setting BIND policy to nodes 0,1\n");
    if (set_mempolicy(MPOL_BIND, &nodemask, maxnode) != 0) {
        perror("set_mempolicy BIND");
    }
    // All subsequent allocations MUST come from nodes 0 or 1
    void *bind_mem = malloc(1024 * 1024);  // Allocated on node 0 or 1

    // =====================================================
    // Policy 2: INTERLEAVE - Round-robin across nodes
    // =====================================================
    printf("Setting INTERLEAVE policy across nodes 0,1\n");
    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, maxnode) != 0) {
        perror("set_mempolicy INTERLEAVE");
    }
    // Pages will alternate between nodes 0 and 1
    void *interleave_mem = malloc(16 * 1024 * 1024);  // 16 MB
    // Page 0 on node 0, page 1 on node 1, page 2 on node 0, ...

    // =====================================================
    // Policy 3: PREFERRED - Try node, fall back if needed
    // =====================================================
    unsigned long prefer_nodemask = 0x2;  // Node 1
    printf("Setting PREFERRED policy for node 1\n");
    if (set_mempolicy(MPOL_PREFERRED, &prefer_nodemask, maxnode) != 0) {
        perror("set_mempolicy PREFERRED");
    }
    // Tries node 1 first, falls back to other nodes if necessary
    void *prefer_mem = malloc(1024 * 1024);

    // =====================================================
    // Policy 4: DEFAULT - Local allocation
    // =====================================================
    printf("Resetting to DEFAULT policy\n");
    if (set_mempolicy(MPOL_DEFAULT, NULL, 0) != 0) {
        perror("set_mempolicy DEFAULT");
    }
    // Allocates on the node where the CPU is running
    void *default_mem = malloc(1024 * 1024);

    free(bind_mem);
    free(interleave_mem);
    free(prefer_mem);
    free(default_mem);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    demonstrate_policies();
    return 0;
}
```

Setting a policy doesn't allocate memory; it sets the rule for future allocations. Memory is actually allocated on first access (first-touch). If you set a policy, then access the memory from a different thread on a different node, the memory lands according to that thread's policy and node. Always access (touch) memory from the intended node after setting the policy.
While set_mempolicy() sets a task-wide default, mbind() allows per-VMA (per-memory-region) policies. This is essential for applications with different memory types that need different placement strategies.
mbind() Signature:
```c
long mbind(void *addr, unsigned long len, int mode,
           const unsigned long *nodemask, unsigned long maxnode,
           unsigned flags);
```
The flags Parameter:
The flags parameter is critical and often misunderstood:
| Flag | Effect | Use Case |
|---|---|---|
| 0 (no flags) | Policy applies to future allocations only | New allocations, lazy population |
| MPOL_MF_STRICT | Fail if any existing pages violate policy | Verify correctness |
| MPOL_MF_MOVE | Migrate existing pages that process owns | Rebalance owned pages |
| MPOL_MF_MOVE_ALL | Migrate all existing pages (requires CAP_SYS_NICE) | Force migration of shared pages |
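As a small illustration of the MPOL_MF_STRICT row above, the sketch below checks whether an already-touched mapping conforms to a node mask without migrating anything. It is a minimal, self-contained example under stated assumptions: the region must be page-aligned (here it comes from mmap), and note that the call also installs the BIND policy on the VMA as a side effect.

```c
#include <numaif.h>
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define LEN (16UL * 1024 * 1024)   /* 16 MB */

int main(void) {
    /* Create and touch a mapping; under the default policy its pages
     * land on the node local to the touching CPU. */
    void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    memset(buf, 0, LEN);

    /* MPOL_MF_STRICT without MPOL_MF_MOVE: migrate nothing, just fail with
     * EIO if any existing page lies outside the requested node mask.
     * (Side effect: this also installs MPOL_BIND node 0 on the VMA.) */
    unsigned long mask = 1UL << 0;   /* node 0 only */
    if (mbind(buf, LEN, MPOL_BIND, &mask, sizeof(mask) * 8, MPOL_MF_STRICT) != 0) {
        if (errno == EIO)
            fprintf(stderr, "placement check failed: some pages are off node 0\n");
        else
            perror("mbind");
        munmap(buf, LEN);
        return 1;
    }
    printf("all existing pages are on node 0\n");
    munmap(buf, LEN);
    return 0;
}
```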
```c
#include <numaif.h>
#include <numa.h>       // numa_available()
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE (64 * 1024 * 1024)  // 64 MB

void example_mbind_before_touch() {
    /*
     * Pattern 1: Set policy BEFORE first touch
     * This is the most common and efficient approach.
     */

    // Allocate virtual address space (not yet backed by physical pages)
    void *buffer = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED) {
        perror("mmap");
        return;
    }

    // Bind to node 0 BEFORE touching the memory
    unsigned long nodemask = 1UL << 0;  // Node 0
    if (mbind(buffer, SIZE, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        munmap(buffer, SIZE);
        return;
    }

    // Now touch the memory - pages will be allocated on node 0
    memset(buffer, 0, SIZE);

    printf("Allocated 64 MB on node 0 using mbind before touch\n");
    munmap(buffer, SIZE);
}

void example_mbind_migrate() {
    /*
     * Pattern 2: Migrate existing pages to a new node
     * Useful for rebalancing after detecting poor placement.
     */

    // Allocate and touch (pages land on current node)
    void *buffer = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(buffer, 0, SIZE);  // Pages now allocated on current node

    printf("Initial allocation complete on local node\n");

    // Migrate pages to node 1
    unsigned long nodemask = 1UL << 1;  // Node 1
    if (mbind(buffer, SIZE, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8,
              MPOL_MF_MOVE | MPOL_MF_STRICT) != 0) {
        perror("mbind migrate");
        // If this fails, pages may be on the wrong node
    } else {
        printf("Migrated 64 MB to node 1\n");
    }

    munmap(buffer, SIZE);
}

void example_mixed_policies() {
    /*
     * Pattern 3: Different regions with different policies
     * Common in database systems with different access patterns.
     */

    // Allocate 256 MB
    size_t total_size = 256 * 1024 * 1024;
    void *base = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    size_t quarter = total_size / 4;

    // First quarter: BIND to node 0 (hot data)
    unsigned long node0_mask = 1UL << 0;
    mbind(base, quarter, MPOL_BIND, &node0_mask, 64, 0);

    // Second quarter: BIND to node 1 (hot data on other socket)
    unsigned long node1_mask = 1UL << 1;
    mbind(base + quarter, quarter, MPOL_BIND, &node1_mask, 64, 0);

    // Third and fourth quarter: INTERLEAVE for shared data
    unsigned long all_nodes = 0xF;  // Nodes 0-3
    mbind(base + 2*quarter, 2*quarter, MPOL_INTERLEAVE, &all_nodes, 64, 0);

    // Touch each region from appropriate threads/nodes
    // (demonstration simplified here)
    memset(base, 0, total_size);

    printf("Created mixed-policy allocation:\n");
    printf("  - 64 MB on node 0 (BIND)\n");
    printf("  - 64 MB on node 1 (BIND)\n");
    printf("  - 128 MB interleaved across nodes 0-3\n");

    munmap(base, total_size);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    example_mbind_before_touch();
    example_mbind_migrate();
    example_mixed_policies();
    return 0;
}
```

The most reliable NUMA allocation pattern is: (1) mmap with MAP_ANONYMOUS to reserve address space, (2) mbind to set the policy, (3) touch from the correct thread. This gives complete control over placement. malloc() with set_mempolicy() works too but gives less control over address space layout.
While raw system calls offer maximum control, the libnuma library provides a more ergonomic API for common NUMA operations. It wraps the system calls with convenient functions and handles many edge cases.
Key libnuma Functions:
| Function | Purpose | Underlying Mechanism |
|---|---|---|
| numa_alloc_onnode(size, node) | Allocate strictly on specified node | mmap + mbind BIND |
| numa_alloc_local(size) | Allocate on local (current) node | mmap + mbind BIND to current |
| numa_alloc_interleaved(size) | Interleave across all nodes | mmap + mbind INTERLEAVE |
| numa_alloc(size) | Allocate following task policy | mmap (respects set_mempolicy) |
| numa_free(ptr, size) | Free NUMA-allocated memory | munmap |
| numa_set_preferred(node) | Set task policy to PREFERRED | set_mempolicy PREFERRED |
| numa_set_membind(mask) | Set task policy to BIND | set_mempolicy BIND |
| numa_set_interleave_mask(mask) | Set task policy to INTERLEAVE | set_mempolicy INTERLEAVE |
```c
#define _GNU_SOURCE          // for sched_getcpu()
#include <numa.h>
#include <numaif.h>          // move_pages()
#include <sched.h>           // sched_getcpu()
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define GB (1024UL * 1024 * 1024)
#define MB (1024UL * 1024)

/**
 * Comprehensive libnuma usage patterns
 */

void query_numa_configuration() {
    printf("=== NUMA Configuration ===\n");
    printf("NUMA available: %s\n", numa_available() >= 0 ? "yes" : "no");
    printf("Number of nodes: %d\n", numa_max_node() + 1);
    printf("Number of CPUs: %d\n", numa_num_configured_cpus());

    // Print per-node information
    for (int node = 0; node <= numa_max_node(); node++) {
        long long size, free_size;
        size = numa_node_size64(node, &free_size);
        printf("  Node %d: %.2f GB total, %.2f GB free\n",
               node, (double)size / GB, (double)free_size / GB);
    }
    printf("\n");
}

void demonstrate_strict_allocation() {
    printf("=== Strict Node Allocation ===\n");

    int target_node = 0;
    size_t alloc_size = 256 * MB;

    // Allocate strictly on node 0
    void *buffer = numa_alloc_onnode(alloc_size, target_node);
    if (!buffer) {
        printf("Failed to allocate on node %d\n", target_node);
        return;
    }

    // CRITICAL: Touch the memory to materialize pages
    // The allocation reserves address space; first touch allocates pages
    memset(buffer, 0, alloc_size);

    printf("Allocated %.0f MB strictly on node %d\n",
           (double)alloc_size / MB, target_node);

    // Verify placement
    int status;
    void *page = buffer;
    move_pages(0, 1, &page, NULL, &status, 0);
    printf("Verified: first page is on node %d\n", status);

    numa_free(buffer, alloc_size);
}

void demonstrate_interleaved_allocation() {
    printf("\n=== Interleaved Allocation ===\n");

    size_t alloc_size = 1 * GB;

    // Allocate interleaved across all nodes
    void *buffer = numa_alloc_interleaved(alloc_size);
    if (!buffer) {
        printf("Failed to allocate interleaved memory\n");
        return;
    }

    // Touch all pages to materialize
    memset(buffer, 0, alloc_size);
    printf("Allocated 1 GB interleaved across all nodes\n");

    // Count pages per node
    int num_pages = alloc_size / 4096;
    int *node_counts = calloc(numa_max_node() + 1, sizeof(int));

    for (int i = 0; i < num_pages; i += 1000) {  // Sample every 1000th page
        void *page = buffer + (i * 4096);
        int status;
        move_pages(0, 1, &page, NULL, &status, 0);
        if (status >= 0) {
            node_counts[status]++;
        }
    }

    printf("Page distribution (sampled):\n");
    for (int n = 0; n <= numa_max_node(); n++) {
        printf("  Node %d: %d pages\n", n, node_counts[n]);
    }

    free(node_counts);
    numa_free(buffer, alloc_size);
}

void demonstrate_node_masks() {
    printf("\n=== Node Mask Operations ===\n");

    // Create a bitmask for specific nodes
    struct bitmask *nodes = numa_bitmask_alloc(numa_max_node() + 1);

    // Set nodes 0 and 2
    numa_bitmask_setbit(nodes, 0);
    numa_bitmask_setbit(nodes, 2);

    // Bind thread to run only on these nodes' CPUs
    numa_run_on_node_mask(nodes);
    printf("Thread bound to CPUs on nodes 0 and 2\n");

    // Allocate memory interleaved across these specific nodes
    numa_set_interleave_mask(nodes);
    void *buffer = malloc(64 * MB);  // Uses interleave policy
    memset(buffer, 0, 64 * MB);
    printf("Allocated 64 MB interleaved across nodes 0 and 2\n");

    // Reset to default
    numa_set_interleave_mask(numa_no_nodes_ptr);

    free(buffer);
    numa_bitmask_free(nodes);
}

void demonstrate_local_allocation() {
    printf("\n=== Local Allocation ===\n");

    // Get current node
    int current_cpu = sched_getcpu();
    int current_node = numa_node_of_cpu(current_cpu);
    printf("Currently running on CPU %d, Node %d\n", current_cpu, current_node);

    // Allocate locally - memory will be on current node
    void *buffer = numa_alloc_local(128 * MB);
    if (!buffer) {
        printf("Failed to allocate local memory\n");
        return;
    }

    memset(buffer, 0, 128 * MB);
    printf("Allocated 128 MB on local node %d\n", current_node);

    numa_free(buffer, 128 * MB);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    query_numa_configuration();
    demonstrate_strict_allocation();
    demonstrate_interleaved_allocation();
    demonstrate_node_masks();
    demonstrate_local_allocation();

    printf("\n=== All demonstrations complete ===\n");
    return 0;
}

/*
 * Compile: gcc -O2 -o numa_demo numa_demo.c -lnuma
 * Run:     ./numa_demo
 */
```

Memory from the numa_alloc* functions MUST be freed with numa_free(ptr, size). Memory from malloc() under a set_mempolicy can be freed with regular free(). Don't mix them! The implementations are different (mmap vs sbrk/malloc pools).
Linux's default memory policy is first-touch allocation: physical pages are allocated on the node where they are first accessed, not where (or when) they are allocated. Understanding first-touch is critical because it determines where your memory actually lives.
The Mechanics:
1. malloc(1GB) returns immediately with a virtual address; no physical pages are allocated yet.
2. The first write to each page triggers a page fault.
3. The kernel allocates the backing physical page on the node of the CPU that took the fault (under the default policy).

Common First-Touch Pitfalls:
- A main thread that zeroes or initializes a large array places every page on its own node; worker threads on other nodes then pay remote-access latency for the lifetime of the data (see the sketch below).
- Setting a memory policy after the memory has already been touched has no effect on existing pages unless you also request migration (e.g., MPOL_MF_MOVE).
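The following minimal sketch shows the first pitfall in isolation; the array size here is an arbitrary illustrative value. The full pthread example after it shows the corrected, parallel first-touch version.

```c
#include <stdlib.h>
#include <string.h>

#define N (256UL * 1024 * 1024 / sizeof(double))   /* 256 MB of doubles */

int main(void) {
    /* Anti-pattern: serial initialization from the main thread.
     * Under the default (first-touch) policy, every page of 'a' is
     * materialized here, on the main thread's node. Worker threads later
     * pinned to other nodes would access all of this data remotely. */
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    memset(a, 0, N * sizeof *a);   /* first touch: whole array on ONE node */

    /* ... spawning workers bound to other nodes here would leave them
     * paying remote-access latency on most of 'a' ... */

    free(a);
    return 0;
}
```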
```c
#include <numa.h>
#include <numaif.h>     // move_pages()
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_WORKERS 4
#define DATA_PER_WORKER (1024 * 1024 * 256)  // 256 MB each

typedef struct {
    int worker_id;
    int numa_node;
    void *data;
    size_t data_size;
} WorkerContext;

void *worker_function(void *arg) {
    WorkerContext *ctx = (WorkerContext *)arg;

    // Step 1: Bind this thread to its designated NUMA node
    struct bitmask *nodemask = numa_bitmask_alloc(numa_max_node() + 1);
    numa_bitmask_setbit(nodemask, ctx->numa_node);
    numa_run_on_node_mask(nodemask);
    numa_bitmask_free(nodemask);

    printf("Worker %d running on node %d\n", ctx->worker_id, ctx->numa_node);

    // Step 2: Allocate memory (still virtual)
    ctx->data = numa_alloc_onnode(ctx->data_size, ctx->numa_node);
    if (!ctx->data) {
        fprintf(stderr, "Worker %d: allocation failed\n", ctx->worker_id);
        return NULL;
    }

    // Step 3: Touch memory from THIS thread (ensures local allocation)
    // Do NOT let main thread or another worker touch this
    memset(ctx->data, 0, ctx->data_size);

    printf("Worker %d: allocated and touched %zu MB on node %d\n",
           ctx->worker_id, ctx->data_size / (1024*1024), ctx->numa_node);

    // Step 4: Do actual work with locally-placed data
    // ... computation here ...

    return NULL;
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int num_nodes = numa_max_node() + 1;
    pthread_t threads[NUM_WORKERS];
    WorkerContext contexts[NUM_WORKERS];

    printf("Starting %d workers across %d nodes\n", NUM_WORKERS, num_nodes);

    // Create workers - each will bind to its node and allocate locally
    for (int i = 0; i < NUM_WORKERS; i++) {
        contexts[i].worker_id = i;
        contexts[i].numa_node = i % num_nodes;  // Round-robin across nodes
        contexts[i].data = NULL;
        contexts[i].data_size = DATA_PER_WORKER;
        pthread_create(&threads[i], NULL, worker_function, &contexts[i]);
    }

    // Wait for all workers
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);
    }

    // Verify placement
    printf("\nVerifying memory placement:\n");
    for (int i = 0; i < NUM_WORKERS; i++) {
        if (contexts[i].data) {
            int status;
            void *page = contexts[i].data;
            move_pages(0, 1, &page, NULL, &status, 0);
            printf("  Worker %d data: first page on node %d (expected %d)\n",
                   i, status, contexts[i].numa_node);
        }
    }

    // Cleanup
    for (int i = 0; i < NUM_WORKERS; i++) {
        if (contexts[i].data) {
            numa_free(contexts[i].data, contexts[i].data_size);
        }
    }

    return 0;
}
```

For large arrays that will be processed in parallel, use parallel first-touch initialization. Divide the array into regions matching your parallel worker distribution, and have each worker touch its own region during initialization. This ensures data is local when processing begins.
Sometimes memory ends up on the wrong node. Rather than reallocating and copying, the kernel can migrate pages between nodes. This is useful for:
- correcting placement after a poor first touch or after the scheduler moves a thread away from its data,
- rebalancing a long-running process when profiling reveals heavy remote access, and
- consolidating a process's memory onto a target node before pinning it there.
Migration Mechanisms:
| Method | Scope | Use Case |
|---|---|---|
| move_pages() | Move specific pages | Fine-grained migration of select pages |
| migrate_pages() | Move all pages from one node to another | Bulk migration of entire node's allocation |
| mbind() with MPOL_MF_MOVE | Migrate pages in a VMA | Rebalance a memory region |
| AutoNUMA / numa_balancing | Automatic kernel migration | Hands-off optimization |
```c
#include <numaif.h>
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 10000
#define TOTAL_SIZE (PAGE_SIZE * NUM_PAGES)

/**
 * move_pages() - Move specific pages to specified nodes
 */
void demonstrate_move_pages() {
    printf("=== move_pages() demonstration ===\n");

    // Allocate memory on node 0
    void *buffer = numa_alloc_onnode(TOTAL_SIZE, 0);
    memset(buffer, 0, TOTAL_SIZE);  // Materialize pages

    // Prepare arrays for move_pages
    void *pages[NUM_PAGES];
    int nodes[NUM_PAGES];
    int status[NUM_PAGES];

    for (int i = 0; i < NUM_PAGES; i++) {
        pages[i] = buffer + (i * PAGE_SIZE);
        nodes[i] = 1;  // Target: node 1
    }

    printf("Moving %d pages from node 0 to node 1...\n", NUM_PAGES);

    // Move pages
    long ret = move_pages(0,  // pid (0 = current process)
                          NUM_PAGES, pages, nodes, status, 0);

    if (ret != 0) {
        perror("move_pages");
    } else {
        // Count successful migrations
        int success = 0, failed = 0;
        for (int i = 0; i < NUM_PAGES; i++) {
            if (status[i] == 1) {        // Page is now on node 1
                success++;
            } else if (status[i] < 0) {
                failed++;
            }
        }
        printf("Migration complete: %d succeeded, %d failed\n", success, failed);
    }

    numa_free(buffer, TOTAL_SIZE);
}

/**
 * migrate_pages() - Move all pages belonging to a process
 * from one node to another
 */
void demonstrate_migrate_pages() {
    printf("\n=== migrate_pages() demonstration ===\n");

    // Allocate on multiple nodes
    void *buf0 = numa_alloc_onnode(32 * 1024 * 1024, 0);
    void *buf1 = numa_alloc_onnode(32 * 1024 * 1024, 1);
    memset(buf0, 0, 32 * 1024 * 1024);
    memset(buf1, 0, 32 * 1024 * 1024);

    printf("Allocated 32 MB on node 0 and 32 MB on node 1\n");

    // Migrate all pages from node 0 to node 2
    unsigned long old_nodes = 1UL << 0;  // Node 0
    unsigned long new_nodes = 1UL << 2;  // Node 2

    printf("Migrating all node 0 pages to node 2...\n");
    long ret = migrate_pages(0,  // pid (0 = current)
                             sizeof(old_nodes) * 8, &old_nodes, &new_nodes);

    if (ret < 0) {
        perror("migrate_pages");
    } else if (ret > 0) {
        printf("Migration incomplete: %ld pages could not be moved\n", ret);
    } else {
        printf("All pages migrated successfully\n");
    }

    numa_free(buf0, 32 * 1024 * 1024);
    numa_free(buf1, 32 * 1024 * 1024);
}

/**
 * AutoNUMA (NUMA Balancing) - Kernel automatic migration
 */
void explain_autonuma() {
    printf("\n=== AutoNUMA (numa_balancing) ===\n");
    printf("Check status: cat /proc/sys/kernel/numa_balancing\n");
    printf("Enable:       echo 1 > /proc/sys/kernel/numa_balancing\n");
    printf("Disable:      echo 0 > /proc/sys/kernel/numa_balancing\n");
    printf("\n");
    printf("How AutoNUMA works:\n");
    printf("  1. Kernel periodically unmaps random pages (lazy scan)\n");
    printf("  2. When accessed, page fault reveals which CPU touched it\n");
    printf("  3. If CPU's node != page's node, page becomes migration candidate\n");
    printf("  4. Kernel migrates pages that are consistently accessed remotely\n");
    printf("\n");
    printf("Tuning parameters in /proc/sys/kernel/:\n");
    printf("  numa_balancing_scan_delay_ms      - delay before scanning starts\n");
    printf("  numa_balancing_scan_period_min_ms - minimum scan interval\n");
    printf("  numa_balancing_scan_period_max_ms - maximum scan interval\n");
    printf("  numa_balancing_scan_size_mb       - pages scanned per interval\n");
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    if (numa_max_node() < 2) {
        printf("Need at least 3 NUMA nodes for full demo\n");
        printf("Running partial demo...\n");
    }

    demonstrate_move_pages();
    demonstrate_migrate_pages();
    explain_autonuma();

    return 0;
}
```

Page migration isn't free. Each page must be copied (read from the old location, written to the new one), and accesses stall on pages that are unmapped for the copy. For large allocations, migration can take seconds and cause latency spikes. Prefer correct initial placement over post-hoc migration when possible.
Huge pages (2 MB or 1 GB pages instead of 4 KB) reduce TLB pressure for large allocations. Combining huge pages with NUMA requires special consideration.
The NUMA-Huge Page Challenge:
Huge page allocation is fundamentally different from regular page allocation:
- Explicit huge pages come from pre-reserved, per-node pools (nr_hugepages), not from the general page allocator on demand.
- A node's huge page pool can be exhausted while other nodes still have free huge pages, so a strict BIND may fail even though huge pages exist elsewhere.
- Migrating huge pages later is expensive (a 2 MB copy per page) and not always supported, so correct initial placement matters even more than with 4 KB pages.
```c
#include <numa.h>
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>

#define HUGE_PAGE_SIZE (2 * 1024 * 1024)        // 2 MB
#define ALLOCATION_SIZE (1024 * 1024 * 1024)    // 1 GB

/**
 * Pattern 1: MAP_HUGETLB with NUMA binding
 */
void allocate_huge_pages_numa(int target_node) {
    printf("Allocating 1 GB huge pages on node %d\n", target_node);

    // Allocate using MAP_HUGETLB
    void *buffer = mmap(NULL, ALLOCATION_SIZE,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                        -1, 0);
    if (buffer == MAP_FAILED) {
        perror("mmap huge pages");
        printf("Hint: Ensure huge pages are configured:\n");
        printf("  echo 1024 > /proc/sys/vm/nr_hugepages\n");
        return;
    }

    // Bind to specific node BEFORE first touch
    unsigned long nodemask = 1UL << target_node;
    if (mbind(buffer, ALLOCATION_SIZE, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
    }

    // First touch to allocate physical pages
    memset(buffer, 0, ALLOCATION_SIZE);

    // Verify placement
    int status;
    void *page = buffer;
    move_pages(0, 1, &page, NULL, &status, 0);
    printf("Huge pages allocated on node %d\n", status);

    munmap(buffer, ALLOCATION_SIZE);
}

/**
 * Pattern 2: Per-node huge page pools
 */
void configure_per_node_hugepages() {
    printf("\n=== Per-Node Huge Page Configuration ===\n");
    printf("Configure huge pages per node for NUMA-aware allocation:\n\n");

    printf("# Reserve 512 huge pages on each of 4 nodes\n");
    printf("for node in 0 1 2 3; do\n");
    printf("  echo 512 > /sys/devices/system/node/node${node}/hugepages/hugepages-2048kB/nr_hugepages\n");
    printf("done\n\n");

    printf("# Verify allocation\n");
    printf("for node in 0 1 2 3; do\n");
    printf("  free=$(cat /sys/devices/system/node/node${node}/hugepages/hugepages-2048kB/free_hugepages)\n");
    printf("  total=$(cat /sys/devices/system/node/node${node}/hugepages/hugepages-2048kB/nr_hugepages)\n");
    printf("  echo \"Node $node: $free / $total huge pages free\"\n");
    printf("done\n");
}

/**
 * Pattern 3: Transparent Huge Pages (THP) with NUMA
 */
void explain_thp_numa() {
    printf("\n=== Transparent Huge Pages + NUMA ===\n");
    printf("THP automatically uses huge pages when possible.\n\n");
    printf("THP modes:\n");
    printf("  always  - Use THP whenever possible\n");
    printf("  madvise - Use THP only for MADV_HUGEPAGE regions\n");
    printf("  never   - Disable THP\n\n");
    printf("NUMA interaction:\n");
    printf("  - THP respects NUMA policies for allocation\n");
    printf("  - khugepaged daemon can collapse pages to huge pages\n");
    printf("  - Collapse may move pages (NUMA disruption!)\n\n");
    printf("For NUMA-critical applications, consider:\n");
    printf("  echo madvise > /sys/kernel/mm/transparent_hugepage/enabled\n");
    printf("Then use madvise(addr, len, MADV_HUGEPAGE) for specific regions\n");
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    allocate_huge_pages_numa(0);
    configure_per_node_hugepages();
    explain_thp_numa();

    return 0;
}
```

For production NUMA systems, configure huge pages at boot time using the kernel command line (hugepages=N) and use per-node huge page pools. This ensures huge pages are evenly distributed across nodes. Allocating huge pages at runtime risks imbalanced distribution and fragmentation.
Let's consolidate everything into practical patterns you can apply in production code.
Pattern 1: NUMA-Aware Memory Pool
For applications that manage their own memory (databases, caches, game engines), create a NUMA-aware memory pool:
```c
#define _GNU_SOURCE          // for sched_getcpu()
#include <numa.h>
#include <pthread.h>
#include <sched.h>           // sched_getcpu()
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * NUMA-Aware Memory Pool
 *
 * Each NUMA node has its own memory pool. Threads allocate from their
 * local pool, ensuring locality. Cross-node allocation is supported
 * but logged for performance analysis.
 */

typedef struct NumaPool {
    int node_id;
    void *base;
    size_t size;
    size_t used;
    pthread_mutex_t lock;
} NumaPool;

typedef struct NumaAllocator {
    int num_nodes;
    NumaPool *pools;
} NumaAllocator;

NumaAllocator *numa_allocator_create(size_t per_node_size) {
    NumaAllocator *alloc = malloc(sizeof(NumaAllocator));
    alloc->num_nodes = numa_max_node() + 1;
    alloc->pools = calloc(alloc->num_nodes, sizeof(NumaPool));

    for (int i = 0; i < alloc->num_nodes; i++) {
        NumaPool *pool = &alloc->pools[i];
        pool->node_id = i;
        pool->size = per_node_size;
        pool->used = 0;
        pthread_mutex_init(&pool->lock, NULL);

        // Allocate pool memory strictly on this node
        pool->base = numa_alloc_onnode(per_node_size, i);
        if (pool->base) {
            // Touch from a thread on this node to ensure first-touch
            // For simplicity, we bind current thread temporarily
            numa_run_on_node(i);
            memset(pool->base, 0, per_node_size);
            printf("Created pool on node %d: %zu MB\n",
                   i, per_node_size / (1024*1024));
        }
    }

    // Reset thread affinity
    numa_run_on_node(-1);  // All nodes

    return alloc;
}

void *numa_allocator_alloc(NumaAllocator *alloc, size_t size, int *allocated_node) {
    // Determine current node
    int current_node = numa_node_of_cpu(sched_getcpu());

    // Try local pool first
    NumaPool *pool = &alloc->pools[current_node];
    pthread_mutex_lock(&pool->lock);
    if (pool->used + size <= pool->size) {
        void *ptr = pool->base + pool->used;
        pool->used += size;
        *allocated_node = current_node;
        pthread_mutex_unlock(&pool->lock);
        return ptr;
    }
    pthread_mutex_unlock(&pool->lock);

    // Local pool full, try other nodes (with warning)
    for (int i = 0; i < alloc->num_nodes; i++) {
        if (i == current_node) continue;

        pool = &alloc->pools[i];
        pthread_mutex_lock(&pool->lock);
        if (pool->used + size <= pool->size) {
            void *ptr = pool->base + pool->used;
            pool->used += size;
            *allocated_node = i;
            pthread_mutex_unlock(&pool->lock);

            // Log remote allocation for analysis
            fprintf(stderr,
                    "WARNING: Remote allocation: node %d -> node %d (%zu bytes)\n",
                    current_node, i, size);
            return ptr;
        }
        pthread_mutex_unlock(&pool->lock);
    }

    // All pools exhausted
    *allocated_node = -1;
    return NULL;
}

void numa_allocator_destroy(NumaAllocator *alloc) {
    for (int i = 0; i < alloc->num_nodes; i++) {
        if (alloc->pools[i].base) {
            numa_free(alloc->pools[i].base, alloc->pools[i].size);
        }
        pthread_mutex_destroy(&alloc->pools[i].lock);
    }
    free(alloc->pools);
    free(alloc);
}

// Example usage
int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    // Create allocator with 1 GB per node
    NumaAllocator *alloc = numa_allocator_create(1024 * 1024 * 1024);

    // Allocate some memory
    int node;
    void *ptr1 = numa_allocator_alloc(alloc, 1024*1024, &node);
    printf("Allocated 1 MB, placed on node %d\n", node);

    numa_allocator_destroy(alloc);
    return 0;
}
```

Pattern 2: numactl for Process-Level Control
For applications you don't control (or don't want to modify), numactl provides command-line NUMA control:
```bash
#!/bin/bash

# Pattern A: Strict single-node binding (for memory-bound apps)
numactl --cpunodebind=0 --membind=0 ./my_database_server

# Pattern B: Interleaved across all nodes (for shared data structures)
numactl --interleave=all ./shared_cache_server

# Pattern C: Preferred node with fallback (flexible locality)
numactl --preferred=0 ./application

# Pattern D: Specific node subset
numactl --cpunodebind=0,1 --membind=0,1 ./dual_socket_app

# Pattern E: Local allocation policy (allocate where running)
numactl --localalloc ./memory_intensive_app

# Pattern F: Inspect the current policy, then per-node statistics of a running app
numactl --show
numastat -p $(pgrep my_app)

# Pattern G: Hardware info before deciding
numactl --hardware
# Then choose appropriate strategy based on topology

# Common production configurations:

# Redis: Single-node strict binding
numactl --cpunodebind=0 --membind=0 redis-server

# PostgreSQL shared_buffers: Interleaved
numactl --interleave=all postgres -c shared_buffers=32GB

# JVM applications: Let JVM handle with NUMA flags
# (JVM's -XX:+UseNUMA works with G1GC)
numactl --localalloc java -XX:+UseNUMA -XX:+UseG1GC -jar app.jar
```

Not every application needs NUMA tuning. Profile first: if your working set fits in cache, or if the application is I/O-bound, NUMA optimization yields minimal benefit. Focus NUMA efforts on memory-bound, latency-sensitive workloads with large working sets.
We've covered the complete toolkit for NUMA-aware memory allocation. Let's consolidate:
- Four memory policies (DEFAULT, BIND, PREFERRED, INTERLEAVE), applied per task with set_mempolicy() or per VMA with mbind()
- libnuma as the ergonomic wrapper: numa_alloc_onnode(), numa_alloc_interleaved(), numa_alloc_local(), node masks, and numa_free()
- First-touch allocation: pages land on the node of the CPU that first writes them, so initialize data from the thread that will use it
- Page migration with move_pages(), migrate_pages(), mbind() with MPOL_MF_MOVE, and AutoNUMA when placement must be corrected after the fact
- Huge pages drawn from per-node pools, combined with mbind-before-touch
- Production patterns: per-node memory pools in code you control, numactl for processes you don't modify
What's Next:
In the final page, we'll explore Performance Optimization techniques—bringing together everything we've learned to systematically optimize NUMA performance in production systems. We'll cover profiling, benchmarking, real-world tuning examples, and common pitfalls to avoid.
You now have the complete toolkit for NUMA-aware memory allocation. You can control placement at every level—from command-line wrapping to custom memory pools. Next, we'll tie it all together with performance optimization strategies.