Understanding NUMA is only half the battle. The other half is controlling memory placement—ensuring that allocations end up on the right NUMA node. This page dives into the complete toolkit for NUMA-aware memory allocation: Linux memory policies, the libnuma library, system calls, and practical patterns for production systems.
By mastering these techniques, you'll be able to write applications that actively exploit NUMA topology rather than being victimized by poor memory placement.
By the end of this page, you will understand Linux's memory policy system in depth, be able to program with libnuma for fine-grained control, know how to apply policies at different granularities (process, thread, virtual memory area), and apply proven allocation patterns for NUMA-optimized applications.
Linux provides a sophisticated memory policy framework that controls how memory is allocated across NUMA nodes. This framework operates at three levels: the whole process or task (via set_mempolicy() or numactl), individual threads (each thread can install its own task policy), and individual virtual memory areas (via mbind()).
The Four Memory Policies:
Linux supports four fundamental memory allocation policies:
| Policy | Behavior | Use Case |
|---|---|---|
| DEFAULT | Allocate on the local node (where CPU runs) | General purpose, best for most workloads |
| BIND | Strictly allocate from specified node set | Guarantee locality, fail if unavailable |
| PREFERRED | Try specified node, fall back if needed | Prefer locality, allow flexibility |
| INTERLEAVE | Round-robin pages across specified nodes | Shared data, bandwidth spreading |
Policy Representation in the Kernel:
Internally, each policy is represented by a struct mempolicy containing the policy mode (MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED, or MPOL_INTERLEAVE), optional mode flags, and a nodemask of the nodes the policy applies to.
Policies are reference-counted and can be shared between VMAs for efficiency.
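A simplified sketch of that structure is shown below. The field names are abridged from include/linux/mempolicy.h and use kernel types (atomic_t, nodemask_t), so this is an illustration of the layout rather than userspace-compilable code; consult your kernel tree for the authoritative definition.

```c
/* Simplified sketch of the kernel's struct mempolicy (fields abridged;
 * see include/linux/mempolicy.h for the real definition). */
struct mempolicy {
    atomic_t       refcnt;   /* reference count: VMAs can share one policy object */
    unsigned short mode;     /* MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED, MPOL_INTERLEAVE */
    unsigned short flags;    /* optional mode flags, e.g. MPOL_F_STATIC_NODES */
    nodemask_t     nodes;    /* set of NUMA nodes the policy applies to */
};
```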
```c
#include <numaif.h>
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Core system calls for memory policy control:
 *
 *   set_mempolicy()  - Set task policy
 *   get_mempolicy()  - Query current policy
 *   mbind()          - Set policy for a VMA
 *   migrate_pages()  - Move pages between nodes
 *   move_pages()     - Move specific pages
 */

void demonstrate_policies() {
    unsigned long nodemask = 0x3;        // Nodes 0 and 1
    int maxnode = numa_max_node() + 2;   // +2 for mask size convention

    // =====================================================
    // Policy 1: BIND - Strict allocation to specified nodes
    // =====================================================
    printf("Setting BIND policy to nodes 0,1\n");
    if (set_mempolicy(MPOL_BIND, &nodemask, maxnode) != 0) {
        perror("set_mempolicy BIND");
    }
    // All subsequent allocations MUST come from nodes 0 or 1
    void *bind_mem = malloc(1024 * 1024);  // Allocated on node 0 or 1

    // =====================================================
    // Policy 2: INTERLEAVE - Round-robin across nodes
    // =====================================================
    printf("Setting INTERLEAVE policy across nodes 0,1\n");
    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, maxnode) != 0) {
        perror("set_mempolicy INTERLEAVE");
    }
    // Pages will alternate between nodes 0 and 1
    void *interleave_mem = malloc(16 * 1024 * 1024);  // 16 MB
    // Page 0 on node 0, page 1 on node 1, page 2 on node 0, ...

    // =====================================================
    // Policy 3: PREFERRED - Try node, fall back if needed
    // =====================================================
    unsigned long prefer_nodemask = 0x2;  // Node 1
    printf("Setting PREFERRED policy for node 1\n");
    if (set_mempolicy(MPOL_PREFERRED, &prefer_nodemask, maxnode) != 0) {
        perror("set_mempolicy PREFERRED");
    }
    // Tries node 1 first, falls back to other nodes if necessary
    void *prefer_mem = malloc(1024 * 1024);

    // =====================================================
    // Policy 4: DEFAULT - Local allocation
    // =====================================================
    printf("Resetting to DEFAULT policy\n");
    if (set_mempolicy(MPOL_DEFAULT, NULL, 0) != 0) {
        perror("set_mempolicy DEFAULT");
    }
    // Allocates on the node where the CPU is running
    void *default_mem = malloc(1024 * 1024);

    free(bind_mem);
    free(interleave_mem);
    free(prefer_mem);
    free(default_mem);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    demonstrate_policies();
    return 0;
}
```

Setting a policy doesn't allocate memory; it sets the rule for future allocations. Memory is actually allocated on first access (first-touch). If you set a policy, then access the memory from a different thread on a different node, the memory lands according to that thread's policy and node. Always access (touch) memory from the intended node after setting the policy.
While set_mempolicy() sets a task-wide default, mbind() allows per-VMA (per-memory-region) policies. This is essential for applications with different memory types that need different placement strategies.
mbind() Signature:
```c
long mbind(void *addr, unsigned long len, int mode,
           const unsigned long *nodemask, unsigned long maxnode,
           unsigned flags);
```
The flags Parameter:
The flags parameter is critical and often misunderstood:
| Flag | Effect | Use Case |
|---|---|---|
| 0 (no flags) | Policy applies to future allocations only | New allocations, lazy population |
| MPOL_MF_STRICT | Fail if any existing pages violate policy | Verify correctness |
| MPOL_MF_MOVE | Migrate existing pages that process owns | Rebalance owned pages |
| MPOL_MF_MOVE_ALL | Migrate all existing pages (requires CAP_SYS_NICE) | Force migration of shared pages |
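As a small illustration of the MPOL_MF_STRICT row above, the sketch below checks whether an already-touched mapping conforms to a node mask without migrating anything. It is a minimal, self-contained example under stated assumptions: the region must be page-aligned (here it comes from mmap), and note that the call also installs the BIND policy on the VMA as a side effect.

```c
#include <numaif.h>
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define LEN (16UL * 1024 * 1024)   /* 16 MB */

int main(void) {
    /* Create and touch a mapping; under the default policy its pages
     * land on the node local to the touching CPU. */
    void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    memset(buf, 0, LEN);

    /* MPOL_MF_STRICT without MPOL_MF_MOVE: migrate nothing, just fail with
     * EIO if any existing page lies outside the requested node mask.
     * (Side effect: this also installs MPOL_BIND node 0 on the VMA.) */
    unsigned long mask = 1UL << 0;   /* node 0 only */
    if (mbind(buf, LEN, MPOL_BIND, &mask, sizeof(mask) * 8, MPOL_MF_STRICT) != 0) {
        if (errno == EIO)
            fprintf(stderr, "placement check failed: some pages are off node 0\n");
        else
            perror("mbind");
        munmap(buf, LEN);
        return 1;
    }
    printf("all existing pages are on node 0\n");
    munmap(buf, LEN);
    return 0;
}
```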
```c
#include <numaif.h>
#include <numa.h>       // numa_available()
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE (64 * 1024 * 1024)  // 64 MB

void example_mbind_before_touch() {
    /*
     * Pattern 1: Set policy BEFORE first touch
     * This is the most common and efficient approach.
     */

    // Allocate virtual address space (not yet backed by physical pages)
    void *buffer = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED) {
        perror("mmap");
        return;
    }

    // Bind to node 0 BEFORE touching the memory
    unsigned long nodemask = 1UL << 0;  // Node 0
    if (mbind(buffer, SIZE, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        munmap(buffer, SIZE);
        return;
    }

    // Now touch the memory - pages will be allocated on node 0
    memset(buffer, 0, SIZE);

    printf("Allocated 64 MB on node 0 using mbind before touch\n");
    munmap(buffer, SIZE);
}

void example_mbind_migrate() {
    /*
     * Pattern 2: Migrate existing pages to a new node
     * Useful for rebalancing after detecting poor placement.
     */

    // Allocate and touch (pages land on current node)
    void *buffer = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(buffer, 0, SIZE);  // Pages now allocated on current node

    printf("Initial allocation complete on local node\n");

    // Migrate pages to node 1
    unsigned long nodemask = 1UL << 1;  // Node 1
    if (mbind(buffer, SIZE, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8,
              MPOL_MF_MOVE | MPOL_MF_STRICT) != 0) {
        perror("mbind migrate");
        // If this fails, pages may be on the wrong node
    } else {
        printf("Migrated 64 MB to node 1\n");
    }

    munmap(buffer, SIZE);
}

void example_mixed_policies() {
    /*
     * Pattern 3: Different regions with different policies
     * Common in database systems with different access patterns.
     */

    // Allocate 256 MB
    size_t total_size = 256 * 1024 * 1024;
    void *base = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    size_t quarter = total_size / 4;

    // First quarter: BIND to node 0 (hot data)
    unsigned long node0_mask = 1UL << 0;
    mbind(base, quarter, MPOL_BIND, &node0_mask, 64, 0);

    // Second quarter: BIND to node 1 (hot data on other socket)
    unsigned long node1_mask = 1UL << 1;
    mbind(base + quarter, quarter, MPOL_BIND, &node1_mask, 64, 0);

    // Third and fourth quarter: INTERLEAVE for shared data
    unsigned long all_nodes = 0xF;  // Nodes 0-3
    mbind(base + 2*quarter, 2*quarter, MPOL_INTERLEAVE, &all_nodes, 64, 0);

    // Touch each region from appropriate threads/nodes
    // (demonstration simplified here)
    memset(base, 0, total_size);

    printf("Created mixed-policy allocation:\n");
    printf("  - 64 MB on node 0 (BIND)\n");
    printf("  - 64 MB on node 1 (BIND)\n");
    printf("  - 128 MB interleaved across nodes 0-3\n");

    munmap(base, total_size);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    example_mbind_before_touch();
    example_mbind_migrate();
    example_mixed_policies();
    return 0;
}
```

The most reliable NUMA allocation pattern is: (1) mmap with MAP_ANONYMOUS to reserve address space, (2) mbind to set the policy, (3) touch from the correct thread. This gives complete control over placement. malloc() with set_mempolicy() works too but gives less control over address space layout.
While raw system calls offer maximum control, the libnuma library provides a more ergonomic API for common NUMA operations. It wraps the system calls with convenient functions and handles many edge cases.
Key libnuma Functions:
| Function | Purpose | Underlying Mechanism |
|---|---|---|
| numa_alloc_onnode(size, node) | Allocate strictly on specified node | mmap + mbind BIND |
| numa_alloc_local(size) | Allocate on local (current) node | mmap + mbind BIND to current |
| numa_alloc_interleaved(size) | Interleave across all nodes | mmap + mbind INTERLEAVE |
| numa_alloc(size) | Allocate following task policy | mmap (respects set_mempolicy) |
| numa_free(ptr, size) | Free NUMA-allocated memory | munmap |
| numa_set_preferred(node) | Set task policy to PREFERRED | set_mempolicy PREFERRED |
| numa_set_membind(mask) | Set task policy to BIND | set_mempolicy BIND |
| numa_set_interleave_mask(mask) | Set task policy to INTERLEAVE | set_mempolicy INTERLEAVE |
```c
#define _GNU_SOURCE          // for sched_getcpu()
#include <numa.h>
#include <numaif.h>          // move_pages()
#include <sched.h>           // sched_getcpu()
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define GB (1024UL * 1024 * 1024)
#define MB (1024UL * 1024)

/**
 * Comprehensive libnuma usage patterns
 */

void query_numa_configuration() {
    printf("=== NUMA Configuration ===\n");
    printf("NUMA available: %s\n", numa_available() >= 0 ? "yes" : "no");
    printf("Number of nodes: %d\n", numa_max_node() + 1);
    printf("Number of CPUs: %d\n", numa_num_configured_cpus());

    // Print per-node information
    for (int node = 0; node <= numa_max_node(); node++) {
        long long size, free_size;
        size = numa_node_size64(node, &free_size);
        printf("  Node %d: %.2f GB total, %.2f GB free\n",
               node, (double)size / GB, (double)free_size / GB);
    }
    printf("\n");
}

void demonstrate_strict_allocation() {
    printf("=== Strict Node Allocation ===\n");

    int target_node = 0;
    size_t alloc_size = 256 * MB;

    // Allocate strictly on node 0
    void *buffer = numa_alloc_onnode(alloc_size, target_node);
    if (!buffer) {
        printf("Failed to allocate on node %d\n", target_node);
        return;
    }

    // CRITICAL: Touch the memory to materialize pages
    // The allocation reserves address space; first touch allocates pages
    memset(buffer, 0, alloc_size);

    printf("Allocated %.0f MB strictly on node %d\n",
           (double)alloc_size / MB, target_node);

    // Verify placement
    int status;
    void *page = buffer;
    move_pages(0, 1, &page, NULL, &status, 0);
    printf("Verified: first page is on node %d\n", status);

    numa_free(buffer, alloc_size);
}

void demonstrate_interleaved_allocation() {
    printf("\n=== Interleaved Allocation ===\n");

    size_t alloc_size = 1 * GB;

    // Allocate interleaved across all nodes
    void *buffer = numa_alloc_interleaved(alloc_size);
    if (!buffer) {
        printf("Failed to allocate interleaved memory\n");
        return;
    }

    // Touch all pages to materialize
    memset(buffer, 0, alloc_size);
    printf("Allocated 1 GB interleaved across all nodes\n");

    // Count pages per node
    int num_pages = alloc_size / 4096;
    int *node_counts = calloc(numa_max_node() + 1, sizeof(int));

    for (int i = 0; i < num_pages; i += 1000) {  // Sample every 1000th page
        void *page = buffer + (i * 4096);
        int status;
        move_pages(0, 1, &page, NULL, &status, 0);
        if (status >= 0) {
            node_counts[status]++;
        }
    }

    printf("Page distribution (sampled):\n");
    for (int n = 0; n <= numa_max_node(); n++) {
        printf("  Node %d: %d pages\n", n, node_counts[n]);
    }

    free(node_counts);
    numa_free(buffer, alloc_size);
}

void demonstrate_node_masks() {
    printf("\n=== Node Mask Operations ===\n");

    // Create a bitmask for specific nodes
    struct bitmask *nodes = numa_bitmask_alloc(numa_max_node() + 1);

    // Set nodes 0 and 2
    numa_bitmask_setbit(nodes, 0);
    numa_bitmask_setbit(nodes, 2);

    // Bind thread to run only on these nodes' CPUs
    numa_run_on_node_mask(nodes);
    printf("Thread bound to CPUs on nodes 0 and 2\n");

    // Allocate memory interleaved across these specific nodes
    numa_set_interleave_mask(nodes);
    void *buffer = malloc(64 * MB);  // Uses interleave policy
    memset(buffer, 0, 64 * MB);
    printf("Allocated 64 MB interleaved across nodes 0 and 2\n");

    // Reset to default
    numa_set_interleave_mask(numa_no_nodes_ptr);

    free(buffer);
    numa_bitmask_free(nodes);
}

void demonstrate_local_allocation() {
    printf("\n=== Local Allocation ===\n");

    // Get current node
    int current_cpu = sched_getcpu();
    int current_node = numa_node_of_cpu(current_cpu);
    printf("Currently running on CPU %d, Node %d\n", current_cpu, current_node);

    // Allocate locally - memory will be on current node
    void *buffer = numa_alloc_local(128 * MB);
    if (!buffer) {
        printf("Failed to allocate local memory\n");
        return;
    }

    memset(buffer, 0, 128 * MB);
    printf("Allocated 128 MB on local node %d\n", current_node);

    numa_free(buffer, 128 * MB);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    query_numa_configuration();
    demonstrate_strict_allocation();
    demonstrate_interleaved_allocation();
    demonstrate_node_masks();
    demonstrate_local_allocation();

    printf("\n=== All demonstrations complete ===\n");
    return 0;
}

/*
 * Compile: gcc -O2 -o numa_demo numa_demo.c -lnuma
 * Run:     ./numa_demo
 */
```

Memory from the numa_alloc* functions MUST be freed with numa_free(ptr, size). Memory from malloc() under a set_mempolicy can be freed with regular free(). Don't mix them! The implementations are different (mmap vs sbrk/malloc pools).
Linux's default memory policy is first-touch allocation: physical pages are allocated on the node where they are first accessed, not where (or when) they are allocated. Understanding first-touch is critical because it determines where your memory actually lives.
The Mechanics:
1. malloc(1GB) returns immediately with a virtual address; no physical pages are allocated yet.
2. The first write to each page triggers a page fault.
3. The kernel allocates the backing physical page on the node of the CPU that took the fault (under the default policy).

Common First-Touch Pitfalls:
- A main thread that zeroes or initializes a large array places every page on its own node; worker threads on other nodes then pay remote-access latency for the lifetime of the data (see the sketch below).
- Setting a memory policy after the memory has already been touched has no effect on existing pages unless you also request migration (e.g., MPOL_MF_MOVE).
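The following minimal sketch shows the first pitfall in isolation; the array size here is an arbitrary illustrative value. The full pthread example after it shows the corrected, parallel first-touch version.

```c
#include <stdlib.h>
#include <string.h>

#define N (256UL * 1024 * 1024 / sizeof(double))   /* 256 MB of doubles */

int main(void) {
    /* Anti-pattern: serial initialization from the main thread.
     * Under the default (first-touch) policy, every page of 'a' is
     * materialized here, on the main thread's node. Worker threads later
     * pinned to other nodes would access all of this data remotely. */
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    memset(a, 0, N * sizeof *a);   /* first touch: whole array on ONE node */

    /* ... spawning workers bound to other nodes here would leave them
     * paying remote-access latency on most of 'a' ... */

    free(a);
    return 0;
}
```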
```c
#include <numa.h>
#include <numaif.h>     // move_pages()
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_WORKERS 4
#define DATA_PER_WORKER (1024 * 1024 * 256)  // 256 MB each

typedef struct {
    int worker_id;
    int numa_node;
    void *data;
    size_t data_size;
} WorkerContext;

void *worker_function(void *arg) {
    WorkerContext *ctx = (WorkerContext *)arg;

    // Step 1: Bind this thread to its designated NUMA node
    struct bitmask *nodemask = numa_bitmask_alloc(numa_max_node() + 1);
    numa_bitmask_setbit(nodemask, ctx->numa_node);
    numa_run_on_node_mask(nodemask);
    numa_bitmask_free(nodemask);

    printf("Worker %d running on node %d\n", ctx->worker_id, ctx->numa_node);

    // Step 2: Allocate memory (still virtual)
    ctx->data = numa_alloc_onnode(ctx->data_size, ctx->numa_node);
    if (!ctx->data) {
        fprintf(stderr, "Worker %d: allocation failed\n", ctx->worker_id);
        return NULL;
    }

    // Step 3: Touch memory from THIS thread (ensures local allocation)
    // Do NOT let main thread or another worker touch this
    memset(ctx->data, 0, ctx->data_size);

    printf("Worker %d: allocated and touched %zu MB on node %d\n",
           ctx->worker_id, ctx->data_size / (1024*1024), ctx->numa_node);

    // Step 4: Do actual work with locally-placed data
    // ... computation here ...

    return NULL;
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int num_nodes = numa_max_node() + 1;
    pthread_t threads[NUM_WORKERS];
    WorkerContext contexts[NUM_WORKERS];

    printf("Starting %d workers across %d nodes\n", NUM_WORKERS, num_nodes);

    // Create workers - each will bind to its node and allocate locally
    for (int i = 0; i < NUM_WORKERS; i++) {
        contexts[i].worker_id = i;
        contexts[i].numa_node = i % num_nodes;  // Round-robin across nodes
        contexts[i].data = NULL;
        contexts[i].data_size = DATA_PER_WORKER;
        pthread_create(&threads[i], NULL, worker_function, &contexts[i]);
    }

    // Wait for all workers
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);
    }

    // Verify placement
    printf("\nVerifying memory placement:\n");
    for (int i = 0; i < NUM_WORKERS; i++) {
        if (contexts[i].data) {
            int status;
            void *page = contexts[i].data;
            move_pages(0, 1, &page, NULL, &status, 0);
            printf("  Worker %d data: first page on node %d (expected %d)\n",
                   i, status, contexts[i].numa_node);
        }
    }

    // Cleanup
    for (int i = 0; i < NUM_WORKERS; i++) {
        if (contexts[i].data) {
            numa_free(contexts[i].data, contexts[i].data_size);
        }
    }

    return 0;
}
```

For large arrays that will be processed in parallel, use parallel first-touch initialization. Divide the array into regions matching your parallel worker distribution, and have each worker touch its own region during initialization. This ensures data is local when processing begins.
Sometimes memory ends up on the wrong node. Rather than reallocating and copying, the kernel can migrate pages between nodes. This is useful for:
- correcting placement after a poor first touch or after the scheduler moves a thread away from its data,
- rebalancing a long-running process when profiling reveals heavy remote access, and
- consolidating a process's memory onto a target node before pinning it there.
Migration Mechanisms:
| Method | Scope | Use Case |
|---|---|---|
| move_pages() | Move specific pages | Fine-grained migration of select pages |
| migrate_pages() | Move all pages from one node to another | Bulk migration of entire node's allocation |
| mbind() with MPOL_MF_MOVE | Migrate pages in a VMA | Rebalance a memory region |
| AutoNUMA / numa_balancing | Automatic kernel migration | Hands-off optimization |
```c
#include <numaif.h>
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 10000
#define TOTAL_SIZE (PAGE_SIZE * NUM_PAGES)

/**
 * move_pages() - Move specific pages to specified nodes
 */
void demonstrate_move_pages() {
    printf("=== move_pages() demonstration ===\n");

    // Allocate memory on node 0
    void *buffer = numa_alloc_onnode(TOTAL_SIZE, 0);
    memset(buffer, 0, TOTAL_SIZE);  // Materialize pages

    // Prepare arrays for move_pages
    void *pages[NUM_PAGES];
    int nodes[NUM_PAGES];
    int status[NUM_PAGES];

    for (int i = 0; i < NUM_PAGES; i++) {
        pages[i] = buffer + (i * PAGE_SIZE);
        nodes[i] = 1;  // Target: node 1
    }

    printf("Moving %d pages from node 0 to node 1...\n", NUM_PAGES);

    // Move pages
    long ret = move_pages(0,  // pid (0 = current process)
                          NUM_PAGES, pages, nodes, status, 0);

    if (ret != 0) {
        perror("move_pages");
    } else {
        // Count successful migrations
        int success = 0, failed = 0;
        for (int i = 0; i < NUM_PAGES; i++) {
            if (status[i] == 1) {        // Page is now on node 1
                success++;
            } else if (status[i] < 0) {
                failed++;
            }
        }
        printf("Migration complete: %d succeeded, %d failed\n", success, failed);
    }

    numa_free(buffer, TOTAL_SIZE);
}

/**
 * migrate_pages() - Move all pages belonging to a process
 * from one node to another
 */
void demonstrate_migrate_pages() {
    printf("\n=== migrate_pages() demonstration ===\n");

    // Allocate on multiple nodes
    void *buf0 = numa_alloc_onnode(32 * 1024 * 1024, 0);
    void *buf1 = numa_alloc_onnode(32 * 1024 * 1024, 1);
    memset(buf0, 0, 32 * 1024 * 1024);
    memset(buf1, 0, 32 * 1024 * 1024);

    printf("Allocated 32 MB on node 0 and 32 MB on node 1\n");

    // Migrate all pages from node 0 to node 2
    unsigned long old_nodes = 1UL << 0;  // Node 0
    unsigned long new_nodes = 1UL << 2;  // Node 2

    printf("Migrating all node 0 pages to node 2...\n");
    long ret = migrate_pages(0,  // pid (0 = current)
                             sizeof(old_nodes) * 8, &old_nodes, &new_nodes);

    if (ret < 0) {
        perror("migrate_pages");
    } else if (ret > 0) {
        printf("Migration incomplete: %ld pages could not be moved\n", ret);
    } else {
        printf("All pages migrated successfully\n");
    }

    numa_free(buf0, 32 * 1024 * 1024);
    numa_free(buf1, 32 * 1024 * 1024);
}

/**
 * AutoNUMA (NUMA Balancing) - Kernel automatic migration
 */
void explain_autonuma() {
    printf("\n=== AutoNUMA (numa_balancing) ===\n");
    printf("Check status: cat /proc/sys/kernel/numa_balancing\n");
    printf("Enable:       echo 1 > /proc/sys/kernel/numa_balancing\n");
    printf("Disable:      echo 0 > /proc/sys/kernel/numa_balancing\n");
    printf("\n");
    printf("How AutoNUMA works:\n");
    printf("  1. Kernel periodically unmaps random pages (lazy scan)\n");
    printf("  2. When accessed, page fault reveals which CPU touched it\n");
    printf("  3. If CPU's node != page's node, page becomes migration candidate\n");
    printf("  4. Kernel migrates pages that are consistently accessed remotely\n");
    printf("\n");
    printf("Tuning parameters in /proc/sys/kernel/:\n");
    printf("  numa_balancing_scan_delay_ms      - delay before scanning starts\n");
    printf("  numa_balancing_scan_period_min_ms - minimum scan interval\n");
    printf("  numa_balancing_scan_period_max_ms - maximum scan interval\n");
    printf("  numa_balancing_scan_size_mb       - pages scanned per interval\n");
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    if (numa_max_node() < 2) {
        printf("Need at least 3 NUMA nodes for full demo\n");
        printf("Running partial demo...\n");
    }

    demonstrate_move_pages();
    demonstrate_migrate_pages();
    explain_autonuma();

    return 0;
}
```

Page migration isn't free. Each page must be copied (read from the old location, written to the new one), and accesses stall on pages that are unmapped for the copy. For large allocations, migration can take seconds and cause latency spikes. Prefer correct initial placement over post-hoc migration when possible.
Huge pages (2 MB or 1 GB pages instead of 4 KB) reduce TLB pressure for large allocations. Combining huge pages with NUMA requires special consideration.
The NUMA-Huge Page Challenge:
Huge page allocation is fundamentally different from regular page allocation:
- Explicit huge pages come from pre-reserved, per-node pools (nr_hugepages), not from the general page allocator on demand.
- A node's huge page pool can be exhausted while other nodes still have free huge pages, so a strict BIND may fail even though huge pages exist elsewhere.
- Migrating huge pages later is expensive (a 2 MB copy per page) and not always supported, so correct initial placement matters even more than with 4 KB pages.
```c
#include <numa.h>
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>

#define HUGE_PAGE_SIZE (2 * 1024 * 1024)        // 2 MB
#define ALLOCATION_SIZE (1024 * 1024 * 1024)    // 1 GB

/**
 * Pattern 1: MAP_HUGETLB with NUMA binding
 */
void allocate_huge_pages_numa(int target_node) {
    printf("Allocating 1 GB huge pages on node %d\n", target_node);

    // Allocate using MAP_HUGETLB
    void *buffer = mmap(NULL, ALLOCATION_SIZE,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                        -1, 0);
    if (buffer == MAP_FAILED) {
        perror("mmap huge pages");
        printf("Hint: Ensure huge pages are configured:\n");
        printf("  echo 1024 > /proc/sys/vm/nr_hugepages\n");
        return;
    }

    // Bind to specific node BEFORE first touch
    unsigned long nodemask = 1UL << target_node;
    if (mbind(buffer, ALLOCATION_SIZE, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
    }

    // First touch to allocate physical pages
    memset(buffer, 0, ALLOCATION_SIZE);

    // Verify placement
    int status;
    void *page = buffer;
    move_pages(0, 1, &page, NULL, &status, 0);
    printf("Huge pages allocated on node %d\n", status);

    munmap(buffer, ALLOCATION_SIZE);
}

/**
 * Pattern 2: Per-node huge page pools
 */
void configure_per_node_hugepages() {
    printf("\n=== Per-Node Huge Page Configuration ===\n");
    printf("Configure huge pages per node for NUMA-aware allocation:\n\n");

    printf("# Reserve 512 huge pages on each of 4 nodes\n");
    printf("for node in 0 1 2 3; do\n");
    printf("  echo 512 > /sys/devices/system/node/node${node}/hugepages/hugepages-2048kB/nr_hugepages\n");
    printf("done\n\n");

    printf("# Verify allocation\n");
    printf("for node in 0 1 2 3; do\n");
    printf("  free=$(cat /sys/devices/system/node/node${node}/hugepages/hugepages-2048kB/free_hugepages)\n");
    printf("  total=$(cat /sys/devices/system/node/node${node}/hugepages/hugepages-2048kB/nr_hugepages)\n");
    printf("  echo \"Node $node: $free / $total huge pages free\"\n");
    printf("done\n");
}

/**
 * Pattern 3: Transparent Huge Pages (THP) with NUMA
 */
void explain_thp_numa() {
    printf("\n=== Transparent Huge Pages + NUMA ===\n");
    printf("THP automatically uses huge pages when possible.\n\n");
    printf("THP modes:\n");
    printf("  always  - Use THP whenever possible\n");
    printf("  madvise - Use THP only for MADV_HUGEPAGE regions\n");
    printf("  never   - Disable THP\n\n");
    printf("NUMA interaction:\n");
    printf("  - THP respects NUMA policies for allocation\n");
    printf("  - khugepaged daemon can collapse pages to huge pages\n");
    printf("  - Collapse may move pages (NUMA disruption!)\n\n");
    printf("For NUMA-critical applications, consider:\n");
    printf("  echo madvise > /sys/kernel/mm/transparent_hugepage/enabled\n");
    printf("Then use madvise(addr, len, MADV_HUGEPAGE) for specific regions\n");
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    allocate_huge_pages_numa(0);
    configure_per_node_hugepages();
    explain_thp_numa();

    return 0;
}
```

For production NUMA systems, configure huge pages at boot time using the kernel command line (hugepages=N) and use per-node huge page pools. This ensures huge pages are evenly distributed across nodes. Allocating huge pages at runtime risks imbalanced distribution and fragmentation.
Let's consolidate everything into practical patterns you can apply in production code.
Pattern 1: NUMA-Aware Memory Pool
For applications that manage their own memory (databases, caches, game engines), create a NUMA-aware memory pool:
```c
#define _GNU_SOURCE          // for sched_getcpu()
#include <numa.h>
#include <pthread.h>
#include <sched.h>           // sched_getcpu()
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * NUMA-Aware Memory Pool
 *
 * Each NUMA node has its own memory pool. Threads allocate from their
 * local pool, ensuring locality. Cross-node allocation is supported
 * but logged for performance analysis.
 */

typedef struct NumaPool {
    int node_id;
    void *base;
    size_t size;
    size_t used;
    pthread_mutex_t lock;
} NumaPool;

typedef struct NumaAllocator {
    int num_nodes;
    NumaPool *pools;
} NumaAllocator;

NumaAllocator *numa_allocator_create(size_t per_node_size) {
    NumaAllocator *alloc = malloc(sizeof(NumaAllocator));
    alloc->num_nodes = numa_max_node() + 1;
    alloc->pools = calloc(alloc->num_nodes, sizeof(NumaPool));

    for (int i = 0; i < alloc->num_nodes; i++) {
        NumaPool *pool = &alloc->pools[i];
        pool->node_id = i;
        pool->size = per_node_size;
        pool->used = 0;
        pthread_mutex_init(&pool->lock, NULL);

        // Allocate pool memory strictly on this node
        pool->base = numa_alloc_onnode(per_node_size, i);
        if (pool->base) {
            // Touch from a thread on this node to ensure first-touch
            // For simplicity, we bind current thread temporarily
            numa_run_on_node(i);
            memset(pool->base, 0, per_node_size);
            printf("Created pool on node %d: %zu MB\n",
                   i, per_node_size / (1024*1024));
        }
    }

    // Reset thread affinity
    numa_run_on_node(-1);  // All nodes

    return alloc;
}

void *numa_allocator_alloc(NumaAllocator *alloc, size_t size, int *allocated_node) {
    // Determine current node
    int current_node = numa_node_of_cpu(sched_getcpu());

    // Try local pool first
    NumaPool *pool = &alloc->pools[current_node];
    pthread_mutex_lock(&pool->lock);
    if (pool->used + size <= pool->size) {
        void *ptr = pool->base + pool->used;
        pool->used += size;
        *allocated_node = current_node;
        pthread_mutex_unlock(&pool->lock);
        return ptr;
    }
    pthread_mutex_unlock(&pool->lock);

    // Local pool full, try other nodes (with warning)
    for (int i = 0; i < alloc->num_nodes; i++) {
        if (i == current_node) continue;

        pool = &alloc->pools[i];
        pthread_mutex_lock(&pool->lock);
        if (pool->used + size <= pool->size) {
            void *ptr = pool->base + pool->used;
            pool->used += size;
            *allocated_node = i;
            pthread_mutex_unlock(&pool->lock);

            // Log remote allocation for analysis
            fprintf(stderr,
                    "WARNING: Remote allocation: node %d -> node %d (%zu bytes)\n",
                    current_node, i, size);
            return ptr;
        }
        pthread_mutex_unlock(&pool->lock);
    }

    // All pools exhausted
    *allocated_node = -1;
    return NULL;
}

void numa_allocator_destroy(NumaAllocator *alloc) {
    for (int i = 0; i < alloc->num_nodes; i++) {
        if (alloc->pools[i].base) {
            numa_free(alloc->pools[i].base, alloc->pools[i].size);
        }
        pthread_mutex_destroy(&alloc->pools[i].lock);
    }
    free(alloc->pools);
    free(alloc);
}

// Example usage
int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    // Create allocator with 1 GB per node
    NumaAllocator *alloc = numa_allocator_create(1024 * 1024 * 1024);

    // Allocate some memory
    int node;
    void *ptr1 = numa_allocator_alloc(alloc, 1024*1024, &node);
    printf("Allocated 1 MB, placed on node %d\n", node);

    numa_allocator_destroy(alloc);
    return 0;
}
```

Pattern 2: numactl for Process-Level Control
For applications you don't control (or don't want to modify), numactl provides command-line NUMA control:
```bash
#!/bin/bash

# Pattern A: Strict single-node binding (for memory-bound apps)
numactl --cpunodebind=0 --membind=0 ./my_database_server

# Pattern B: Interleaved across all nodes (for shared data structures)
numactl --interleave=all ./shared_cache_server

# Pattern C: Preferred node with fallback (flexible locality)
numactl --preferred=0 ./application

# Pattern D: Specific node subset
numactl --cpunodebind=0,1 --membind=0,1 ./dual_socket_app

# Pattern E: Local allocation policy (allocate where running)
numactl --localalloc ./memory_intensive_app

# Pattern F: Inspect the current policy, then per-node statistics of a running app
numactl --show
numastat -p $(pgrep my_app)

# Pattern G: Hardware info before deciding
numactl --hardware
# Then choose appropriate strategy based on topology

# Common production configurations:

# Redis: Single-node strict binding
numactl --cpunodebind=0 --membind=0 redis-server

# PostgreSQL shared_buffers: Interleaved
numactl --interleave=all postgres -c shared_buffers=32GB

# JVM applications: Let JVM handle with NUMA flags
# (JVM's -XX:+UseNUMA works with G1GC)
numactl --localalloc java -XX:+UseNUMA -XX:+UseG1GC -jar app.jar
```

Not every application needs NUMA tuning. Profile first: if your working set fits in cache, or if the application is I/O-bound, NUMA optimization yields minimal benefit. Focus NUMA efforts on memory-bound, latency-sensitive workloads with large working sets.
We've covered the complete toolkit for NUMA-aware memory allocation. Let's consolidate:
- Four memory policies (DEFAULT, BIND, PREFERRED, INTERLEAVE), applied per task with set_mempolicy() or per VMA with mbind()
- libnuma as the ergonomic wrapper: numa_alloc_onnode(), numa_alloc_interleaved(), numa_alloc_local(), node masks, and numa_free()
- First-touch allocation: pages land on the node of the CPU that first writes them, so initialize data from the thread that will use it
- Page migration with move_pages(), migrate_pages(), mbind() with MPOL_MF_MOVE, and AutoNUMA when placement must be corrected after the fact
- Huge pages drawn from per-node pools, combined with mbind-before-touch
- Production patterns: per-node memory pools in code you control, numactl for processes you don't modify
What's Next:
In the final page, we'll explore Performance Optimization techniques—bringing together everything we've learned to systematically optimize NUMA performance in production systems. We'll cover profiling, benchmarking, real-world tuning examples, and common pitfalls to avoid.
You now have the complete toolkit for NUMA-aware memory allocation. You can control placement at every level—from command-line wrapping to custom memory pools. Next, we'll tie it all together with performance optimization strategies.