Understanding the benefits of huge pages is straightforward—TLB efficiency improves dramatically, page table overhead plummets, and memory-intensive applications accelerate. But obtaining huge pages is where theory meets reality.
Huge pages require contiguous, properly aligned physical memory. A 2MB page needs 2MB of contiguous RAM aligned to a 2MB boundary. A 1GB page requires 1GB of contiguous, GB-aligned memory. As systems run, physical memory fragments. Finding these large contiguous regions becomes progressively harder—sometimes impossible.
This challenge has led to multiple allocation mechanisms, each with different tradeoffs between flexibility, reliability, and ease of use. Mastering these mechanisms is essential for any engineer optimizing memory-intensive workloads.
By the end of this page, you will understand how to configure, reserve, and allocate huge pages at boot time and runtime. You'll master hugetlbfs, sysctl tuning, mmap() with MAP_HUGETLB, and the NUMA considerations that affect huge page allocation in multi-socket systems.
Before diving into allocation mechanisms, we must understand why huge page allocation is fundamentally different from regular page allocation.
Physical memory fragmentation:
When a system boots, physical memory is largely contiguous. But as processes allocate and free memory, the physical address space becomes fragmented—filled with "holes" of various sizes scattered throughout.
With 4KB pages, fragmentation is largely irrelevant: any free 4KB frame can satisfy any 4KB page request. But a 2MB huge page needs 512 consecutive free frames, and a 1GB huge page needs 262,144.
After a system has been running for hours or days, finding 512 consecutive free frames becomes difficult. Finding 262,144 consecutive frames is often impossible. Illustrative success rates as uptime grows:
| System Uptime | Free Memory % | 2MB Success Rate | 1GB Success Rate |
|---|---|---|---|
| Boot (< 1 min) | ~95% | ~99% | ~95% |
| 1 hour | ~70% | ~85% | ~40% |
| 24 hours | ~50% | ~60% | ~10% |
| 7 days | ~40% | ~40% | ~2% |
| 30 days | ~35% | ~25% | ~0% |
Memory compaction:
Linux can attempt memory compaction—migrating movable pages to consolidate free space. But compaction:

- Burns CPU time and can stall the allocating thread while pages are migrated
- Cannot relocate unmovable pages (kernel data, pinned DMA buffers), so it may fail to create a large enough region
- Offers no guarantee: even a successful pass may not yield contiguous, properly aligned blocks of the required size
For latency-sensitive applications, relying on runtime compaction is risky. This is why boot-time reservation is often preferred.
You need huge pages most for workloads with large memory footprints. But large footprints fragment memory quickly. Without planning, the workloads that benefit most from huge pages are the least likely to be able to obtain them at runtime.
The most reliable way to ensure huge page availability is to reserve them at boot time, before any fragmentation occurs.
Kernel command line parameters:
Add these to your bootloader configuration (GRUB, etc.):
```bash
# /etc/default/grub
# Add to GRUB_CMDLINE_LINUX:

# Reserve 1024 2MB huge pages at boot (2GB total)
GRUB_CMDLINE_LINUX="hugepagesz=2M hugepages=1024"

# Reserve 512 2MB huge pages AND 4 1GB huge pages
GRUB_CMDLINE_LINUX="hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=4"

# NUMA-aware: Reserve on specific nodes
# Node 0: 512 2MB pages, Node 1: 512 2MB pages
GRUB_CMDLINE_LINUX="hugepagesz=2M hugepages=0:512,1:512"

# Default huge page size (affects THP and other defaults)
GRUB_CMDLINE_LINUX="default_hugepagesz=2M hugepagesz=2M hugepages=1024"

# After editing, run:
# sudo update-grub
# sudo reboot
```

Verification after boot:
```bash
#!/bin/bash
# Verify huge page reservation

echo "=== HUGE PAGE STATUS ==="

# Overall huge page info from meminfo
echo ""
echo "From /proc/meminfo:"
grep -i huge /proc/meminfo

# Detailed stats per huge page size
echo ""
echo "From /sys/kernel/mm/hugepages/:"
for dir in /sys/kernel/mm/hugepages/hugepages-*; do
    if [ -d "$dir" ]; then
        size=$(basename "$dir" | sed 's/hugepages-//')
        total=$(cat "$dir/nr_hugepages")
        free=$(cat "$dir/free_hugepages")
        reserved=$(cat "$dir/resv_hugepages")
        surplus=$(cat "$dir/surplus_hugepages")
        echo ""
        echo "Page Size: $size"
        echo "  Total:    $total"
        echo "  Free:     $free"
        echo "  Reserved: $reserved"
        echo "  Surplus:  $surplus"
    fi
done

# NUMA distribution
if [ -d /sys/devices/system/node ]; then
    echo ""
    echo "NUMA Distribution:"
    for node in /sys/devices/system/node/node*/hugepages/hugepages-*; do
        if [ -d "$node" ]; then
            path=$(dirname "$node")
            nodename=$(basename $(dirname "$path"))
            size=$(basename "$node" | sed 's/hugepages-//')
            total=$(cat "$node/nr_hugepages")
            free=$(cat "$node/free_hugepages")
            echo "  $nodename ($size): $total total, $free free"
        fi
    done
fi
```

While boot-time reservation is most reliable, runtime configuration through /proc and sysctl offers flexibility for dynamic workloads.
Key configuration files:
| Path | Purpose | Read/Write |
|---|---|---|
| /proc/sys/vm/nr_hugepages | Huge pages of the default size (usually 2MB) | R/W |
| /proc/sys/vm/nr_overcommit_hugepages | Extra pages to allocate on demand | R/W |
| /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages | 2MB huge pages | R/W |
| /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages | 1GB huge pages | R/W |
| /sys/devices/system/node/node*/hugepages/... | Per-NUMA-node allocation | R/W |
| /proc/meminfo | Current huge page statistics | R |
```bash
#!/bin/bash
#
# Huge Page Runtime Configuration Script
# Run as root or with sudo
#
set -e

# Configuration
HUGEPAGES_2MB=1024   # Number of 2MB pages to request
HUGEPAGES_1GB=4      # Number of 1GB pages to request (if supported)

echo "=== CONFIGURING HUGE PAGES ==="

# Check 1GB page support
if grep -q pdpe1gb /proc/cpuinfo; then
    SUPPORTS_1GB=true
    echo "✓ CPU supports 1GB huge pages"
else
    SUPPORTS_1GB=false
    echo "✗ CPU does not support 1GB huge pages"
fi

# Function to allocate huge pages with compaction and cache dropping first
allocate_hugepages() {
    local size=$1
    local count=$2
    local path=$3

    echo ""
    echo "Allocating $count pages of size $size..."

    # Trigger memory compaction first
    echo "  Triggering memory compaction..."
    echo 1 > /proc/sys/vm/compact_memory 2>/dev/null || true
    sleep 1

    # Drop caches to free memory
    echo "  Dropping caches..."
    sync
    echo 3 > /proc/sys/vm/drop_caches
    sleep 1

    # Attempt allocation
    echo "  Requesting $count pages..."
    echo $count > "$path"

    # Verify
    sleep 1
    actual=$(cat "$path")
    if [ "$actual" -eq "$count" ]; then
        echo "  ✓ Successfully allocated $actual pages"
    else
        echo "  ⚠ Only allocated $actual of $count requested pages"
        echo "    This may indicate memory fragmentation."
        echo "    Consider boot-time reservation for reliable allocation."
    fi
}

# Allocate 2MB pages
allocate_hugepages "2MB" $HUGEPAGES_2MB /proc/sys/vm/nr_hugepages

# Allocate 1GB pages if supported
if [ "$SUPPORTS_1GB" = true ] && [ -f /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages ]; then
    allocate_hugepages "1GB" $HUGEPAGES_1GB /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
fi

# Configure overcommit (allow some on-demand allocation)
echo ""
echo "Configuring overcommit..."
echo 256 > /proc/sys/vm/nr_overcommit_hugepages
echo "  ✓ Allowed 256 overcommit huge pages"

# Show final status
echo ""
echo "=== FINAL STATUS ==="
grep -i huge /proc/meminfo

# Calculate memory reserved
RESERVED_2MB=$(($(cat /proc/sys/vm/nr_hugepages) * 2))
echo ""
echo "Total memory reserved for 2MB pages: ${RESERVED_2MB}MB"

if [ "$SUPPORTS_1GB" = true ] && [ -f /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages ]; then
    RESERVED_1GB=$(($(cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages) * 1024))
    echo "Total memory reserved for 1GB pages: ${RESERVED_1GB}MB"
fi
```

To make sysctl settings persistent across reboots, add them to /etc/sysctl.conf:
```
vm.nr_hugepages = 1024
vm.nr_overcommit_hugepages = 256
```

Then run `sysctl -p` to apply them.
hugetlbfs is a pseudo-filesystem that provides a filesystem interface to huge pages. Applications can use standard file operations (open, mmap, close) to work with huge page-backed memory.
Mounting hugetlbfs:
```bash
#!/bin/bash
# Mount hugetlbfs for huge page access

# Create mount points
sudo mkdir -p /mnt/hugepages
sudo mkdir -p /mnt/hugepages-1G

# Mount 2MB hugetlbfs (default)
sudo mount -t hugetlbfs none /mnt/hugepages

# Mount with specific options
sudo mount -t hugetlbfs none /mnt/hugepages -o mode=1770,gid=1000,pagesize=2M

# Mount 1GB hugetlbfs (requires separate mount)
sudo mount -t hugetlbfs none /mnt/hugepages-1G -o pagesize=1G

# Verify mounts
echo "Mounted huge page filesystems:"
mount | grep hugetlbfs

# For persistent mounting, add to /etc/fstab:
# hugetlbfs /mnt/hugepages    hugetlbfs mode=1770,gid=1000,pagesize=2M 0 0
# hugetlbfs /mnt/hugepages-1G hugetlbfs pagesize=1G 0 0
```

Using hugetlbfs programmatically:
```c
/*
 * Using hugetlbfs for huge page allocation
 *
 * Compile: gcc -o hugetlbfs_example hugetlbfs_example.c
 * Requirements: hugetlbfs mounted at /mnt/hugepages
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

#define HUGEPAGE_MOUNT "/mnt/hugepages"
#define HUGEPAGE_SIZE  (2 * 1024 * 1024)  // 2MB

/**
 * Allocate huge pages using hugetlbfs
 *
 * @param name Unique name for the huge page file
 * @param size Size to allocate (rounded up to huge page boundary)
 * @return Pointer to mapped memory, or NULL on failure
 */
void* allocate_huge_pages(const char *name, size_t size)
{
    char path[256];
    int fd;
    void *addr;

    // Round size up to huge page boundary
    size = (size + HUGEPAGE_SIZE - 1) & ~(HUGEPAGE_SIZE - 1);

    // Create file path in hugetlbfs
    snprintf(path, sizeof(path), "%s/%s", HUGEPAGE_MOUNT, name);

    // Create and open the file
    fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open hugetlbfs file");
        fprintf(stderr, "Ensure %s is mounted and writable\n", HUGEPAGE_MOUNT);
        return NULL;
    }

    // Map the file into memory
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap hugetlbfs");
        fprintf(stderr, "Ensure enough huge pages are allocated\n");
        close(fd);
        unlink(path);
        return NULL;
    }

    // File descriptor can be closed after mmap
    close(fd);

    // Touch the pages to ensure allocation
    memset(addr, 0, size);

    printf("Allocated %zu MB of huge pages at %p\n",
           size / (1024*1024), addr);
    return addr;
}

/**
 * Free huge pages allocated via hugetlbfs
 */
void free_huge_pages(void *addr, size_t size, const char *name)
{
    char path[256];

    // Round size up to huge page boundary
    size = (size + HUGEPAGE_SIZE - 1) & ~(HUGEPAGE_SIZE - 1);

    // Unmap the memory
    if (munmap(addr, size) != 0) {
        perror("munmap");
    }

    // Remove the file
    snprintf(path, sizeof(path), "%s/%s", HUGEPAGE_MOUNT, name);
    unlink(path);

    printf("Freed huge pages: %s\n", name);
}

int main()
{
    const char *name = "test_hugepages";
    size_t size = 64 * 1024 * 1024;  // 64MB request

    printf("hugetlbfs Huge Page Allocation Demo\n");
    printf("════════════════════════════════════\n\n");

    // Check hugetlbfs mount
    struct stat st;
    if (stat(HUGEPAGE_MOUNT, &st) != 0) {
        fprintf(stderr, "Error: %s not found. Mount hugetlbfs first:\n",
                HUGEPAGE_MOUNT);
        fprintf(stderr, "  sudo mount -t hugetlbfs none %s\n", HUGEPAGE_MOUNT);
        return 1;
    }

    // Allocate huge pages
    void *mem = allocate_huge_pages(name, size);
    if (!mem) {
        return 1;
    }

    // Use the memory
    printf("Writing pattern to huge pages...\n");
    uint64_t *data = (uint64_t *)mem;
    size_t count = size / sizeof(uint64_t);
    for (size_t i = 0; i < count; i++) {
        data[i] = i;
    }

    // Verify
    printf("Verifying pattern...\n");
    for (size_t i = 0; i < count; i++) {
        if (data[i] != i) {
            fprintf(stderr, "Verification failed at index %zu\n", i);
            break;
        }
    }
    printf("Verification successful!\n");

    // Show in proc
    printf("Process huge page usage (from /proc/self/smaps):\n");
    system("grep -A 5 hugetlbfs /proc/self/smaps 2>/dev/null | head -20");

    // Cleanup
    printf("\n");
    free_huge_pages(mem, size, name);

    return 0;
}
```

For applications that don't need the filesystem interface, mmap() with the MAP_HUGETLB flag provides a simpler path to huge pages.
Direct huge page allocation:
```c
/*
 * Anonymous huge page allocation using mmap()
 *
 * Compile: gcc -o mmap_hugepages mmap_hugepages.c
 * Run: ./mmap_hugepages (requires huge pages to be configured)
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

// Huge page sizes
#define HPAGE_2MB (2UL * 1024 * 1024)
#define HPAGE_1GB (1UL * 1024 * 1024 * 1024)

// MAP_HUGETLB flags for specific sizes
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)  // 2^21 = 2MB
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)  // 2^30 = 1GB
#endif

/**
 * Allocate memory with 2MB huge pages
 */
void* alloc_2mb_hugepages(size_t size)
{
    // Round up to 2MB boundary
    size = (size + HPAGE_2MB - 1) & ~(HPAGE_2MB - 1);

    void *addr = mmap(NULL, size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                      -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap 2MB huge pages");
        return NULL;
    }
    return addr;
}

/**
 * Allocate memory with 1GB huge pages
 */
void* alloc_1gb_hugepages(size_t size)
{
    // Round up to 1GB boundary
    size = (size + HPAGE_1GB - 1) & ~(HPAGE_1GB - 1);

    void *addr = mmap(NULL, size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                      -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap 1GB huge pages");
        return NULL;
    }
    return addr;
}

/**
 * Allocate memory with regular pages (for comparison)
 */
void* alloc_regular_pages(size_t size)
{
    void *addr = mmap(NULL, size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS,
                      -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap regular pages");
        return NULL;
    }
    return addr;
}

/**
 * Benchmark memory access: touch one word per 4KB page
 */
double benchmark_access(void *mem, size_t size, int iterations)
{
    uint64_t stride = 4096;  // Access every page
    volatile uint64_t sum = 0;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int iter = 0; iter < iterations; iter++) {
        for (size_t offset = 0; offset < size; offset += stride) {
            sum += *(uint64_t*)((char*)mem + offset);
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;

    // Prevent optimization
    if (sum == 0) printf("");

    return elapsed;
}

int main()
{
    const size_t size = 256 * 1024 * 1024;  // 256MB
    const int iterations = 10;

    printf("Anonymous Huge Page Allocation Demo\n");
    printf("════════════════════════════════════\n\n");
    printf("Allocation size: %zu MB\n\n", size / (1024*1024));

    // Try 2MB huge pages
    printf("Allocating 2MB huge pages...\n");
    void *mem_2mb = alloc_2mb_hugepages(size);
    if (mem_2mb) {
        memset(mem_2mb, 1, size);
        printf("  ✓ Allocated at %p\n", mem_2mb);
        double time_2mb = benchmark_access(mem_2mb, size, iterations);
        printf("  Scan time: %.3f seconds\n", time_2mb);
        munmap(mem_2mb, size);
    } else {
        printf("  ✗ Failed (ensure huge pages are configured)\n");
    }

    // Try 1GB huge pages
    printf("Allocating 1GB huge pages (need 1GB minimum)...\n");
    void *mem_1gb = alloc_1gb_hugepages(HPAGE_1GB);
    if (mem_1gb) {
        memset(mem_1gb, 1, HPAGE_1GB);
        printf("  ✓ Allocated at %p\n", mem_1gb);
        munmap(mem_1gb, HPAGE_1GB);
    } else {
        printf("  ✗ Failed (1GB pages may not be available)\n");
    }

    // Compare with regular pages
    printf("Allocating regular 4KB pages...\n");
    void *mem_4kb = alloc_regular_pages(size);
    if (mem_4kb) {
        memset(mem_4kb, 1, size);
        printf("  ✓ Allocated at %p\n", mem_4kb);
        double time_4kb = benchmark_access(mem_4kb, size, iterations);
        printf("  Scan time: %.3f seconds\n", time_4kb);

        // Re-run 2MB test for fair comparison
        mem_2mb = alloc_2mb_hugepages(size);
        if (mem_2mb) {
            memset(mem_2mb, 1, size);
            double time_2mb = benchmark_access(mem_2mb, size, iterations);
            printf("Speedup with 2MB pages: %.2fx\n", time_4kb / time_2mb);
            munmap(mem_2mb, size);
        }

        munmap(mem_4kb, size);
    }

    return 0;
}
```

| Flag | Description | Since |
|---|---|---|
| MAP_HUGETLB | Use huge pages (default size) | Linux 2.6.32 |
| MAP_HUGE_2MB | Use 2MB huge pages | Linux 3.8 |
| MAP_HUGE_1GB | Use 1GB huge pages | Linux 3.8 |
| MAP_HUGE_SHIFT | Shift value for custom sizes | Linux 3.8 |
mmap() calls with MAP_HUGETLB fail with ENOMEM if enough huge pages aren't available. Unlike regular mmap(), there is no automatic fallback to small pages: applications must detect this failure and implement their own fallback strategy if needed.
On multi-socket systems with NUMA (Non-Uniform Memory Access) architecture, huge page allocation strategies become more complex. Memory access latency varies depending on which NUMA node the memory resides on and which CPU core is accessing it.
NUMA-aware huge page management:
```bash
#!/bin/bash
# NUMA-aware huge page configuration

echo "=== NUMA HUGE PAGE CONFIGURATION ==="

# Show NUMA topology
echo ""
echo "NUMA Topology:"
numactl --hardware 2>/dev/null || echo "numactl not installed"

# Function to configure huge pages on a specific NUMA node
configure_node_hugepages() {
    local node=$1
    local count=$2
    local path="/sys/devices/system/node/node${node}/hugepages/hugepages-2048kB/nr_hugepages"

    if [ -f "$path" ]; then
        echo "Node $node: Requesting $count 2MB pages..."
        echo $count > "$path"
        actual=$(cat "$path")
        echo "  Allocated: $actual"
    else
        echo "Node $node: Path not found - $path"
    fi
}

# Get number of NUMA nodes
NODES=$(ls -d /sys/devices/system/node/node* 2>/dev/null | wc -l)
echo ""
echo "Detected $NODES NUMA nodes"

# Allocate huge pages evenly across nodes
TOTAL_HUGEPAGES=1024
PER_NODE=$((TOTAL_HUGEPAGES / NODES))

echo ""
echo "Allocating $PER_NODE huge pages per node..."

for i in $(seq 0 $((NODES - 1))); do
    configure_node_hugepages $i $PER_NODE
done

# Show final distribution
echo ""
echo "Final NUMA Huge Page Distribution:"
for dir in /sys/devices/system/node/node*/hugepages/hugepages-2048kB; do
    if [ -d "$dir" ]; then
        node=$(echo "$dir" | grep -oP 'node\K[0-9]+')
        total=$(cat "$dir/nr_hugepages")
        free=$(cat "$dir/free_hugepages")
        echo "  Node $node: $total total, $free free"
    fi
done
```
```c
/*
 * NUMA-aware huge page allocation
 *
 * Compile: gcc -o numa_alloc numa_aware_alloc.c -lnuma
 * Requires: libnuma-dev
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>
#include <numa.h>
#include <numaif.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

/**
 * Allocate huge pages on a specific NUMA node
 */
void* alloc_hugepages_on_node(size_t size, int node)
{
    void *addr;
    unsigned long nodemask = 1UL << node;
    int mode = MPOL_BIND;

    // Round up to huge page boundary
    size = (size + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);

    // Allocate with huge pages
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap huge pages");
        return NULL;
    }

    // Bind to specific NUMA node
    if (mbind(addr, size, mode, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_STRICT | MPOL_MF_MOVE) != 0) {
        perror("mbind");
        fprintf(stderr, "Warning: Could not bind to node %d\n", node);
    }

    // Touch all pages to ensure allocation
    memset(addr, 0, size);

    return addr;
}

/**
 * Allocate huge pages with interleaved NUMA policy
 * Good for bandwidth-bound workloads
 */
void* alloc_hugepages_interleaved(size_t size)
{
    void *addr;

    size = (size + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap huge pages");
        return NULL;
    }

    // Interleave across all nodes
    struct bitmask *all_nodes = numa_allocate_nodemask();
    for (int i = 0; i <= numa_max_node(); i++) {
        numa_bitmask_setbit(all_nodes, i);
    }

    if (mbind(addr, size, MPOL_INTERLEAVE, all_nodes->maskp, all_nodes->size,
              MPOL_MF_STRICT | MPOL_MF_MOVE) != 0) {
        perror("mbind interleave");
    }

    numa_free_nodemask(all_nodes);

    memset(addr, 0, size);
    return addr;
}

int main()
{
    printf("NUMA-Aware Huge Page Allocation\n");
    printf("════════════════════════════════\n\n");

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int num_nodes = numa_num_configured_nodes();
    printf("System has %d NUMA nodes\n\n", num_nodes);

    size_t alloc_size = 64 * 1024 * 1024;  // 64MB

    // Allocate on each node
    for (int node = 0; node < num_nodes; node++) {
        printf("Allocating on node %d...\n", node);
        void *mem = alloc_hugepages_on_node(alloc_size, node);
        if (mem) {
            printf("  ✓ Allocated %zu MB at %p\n",
                   alloc_size/(1024*1024), mem);
            munmap(mem, alloc_size);
        }
    }

    // Interleaved allocation
    printf("Allocating interleaved...\n");
    void *mem = alloc_hugepages_interleaved(alloc_size * num_nodes);
    if (mem) {
        printf("  ✓ Allocated %zu MB at %p\n",
               (alloc_size * num_nodes)/(1024*1024), mem);
        munmap(mem, alloc_size * num_nodes);
    }

    return 0;
}
```

We've covered the complete landscape of huge page allocation mechanisms. Here are the essential takeaways:
| Method | Reliability | Flexibility | Best For |
|---|---|---|---|
| Boot-time reservation | Highest | Low | Production servers, databases |
| sysctl runtime | Medium | High | Development, dynamic workloads |
| hugetlbfs | Medium | Medium | Shared memory, legacy apps |
| mmap() MAP_HUGETLB | Medium | High | New applications, direct allocation |
What's next:
Now that we can allocate huge pages, we'll explore Transparent Huge Pages (THP) — Linux's attempt to provide huge page benefits automatically, without explicit application changes. THP offers convenience but comes with tradeoffs that every system administrator must understand.
You now have comprehensive knowledge of huge page allocation mechanisms—from kernel boot parameters to programmatic mmap() calls to NUMA-aware distribution. Next, we'll explore Transparent Huge Pages and their role in automatic memory optimization.