Every memory access your program makes is a lie—at least from the hardware's perspective. When your code reads from address 0x7ffd5a3b1000, the CPU must translate this virtual address into a physical address before any actual memory operation can occur. This translation happens billions of times per second on a modern system, and its efficiency fundamentally determines application performance.
For decades, operating systems have used a standard page size of 4 kilobytes (4KB) as the fundamental unit of memory management. This choice, made when physical memory was measured in megabytes, has become increasingly problematic as systems now manage hundreds of gigabytes or even terabytes of RAM.
Huge pages represent a fundamental shift in this paradigm—allowing the operating system to manage memory in units of 2MB or 1GB instead of 4KB. This seemingly simple change has profound implications for performance, memory overhead, and system design.
By the end of this page, you will understand the architectural foundations of page size design, compare standard and huge pages across multiple dimensions, and grasp why modern systems increasingly demand larger page sizes to maintain efficient memory management.
Before comparing standard and huge pages, we must understand what a page fundamentally represents in the memory hierarchy and why its size is one of the most consequential design decisions in computer architecture.
A page is the atomic unit of memory management. The operating system allocates, protects, maps, and swaps memory at page granularity. When you request memory via malloc() or mmap(), the kernel ultimately provides you with one or more pages. When the MMU (Memory Management Unit) translates virtual addresses, it does so page by page.
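You can observe these units directly. The following minimal sketch (assuming a Linux system) queries the standard page size via the POSIX `sysconf()` call and reads the default huge page size from the `Hugepagesize:` line of `/proc/meminfo`:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* POSIX: the page size the kernel uses for this process */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("Standard page size: %ld bytes\n", page_size);

    /* Linux-specific: the default huge page size appears in /proc/meminfo */
    FILE *f = fopen("/proc/meminfo", "r");
    if (f) {
        char line[256];
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "Hugepagesize:", 13) == 0) {
                printf("Default huge page size:%s", line + 13);
                break;
            }
        }
        fclose(f);
    }
    return 0;
}
```

On a typical x86-64 Linux machine this prints 4096 bytes for the standard page size and 2048 kB for the default huge page size.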
| Page Type | Size | Offset Bits | Address Coverage per PTE | Typical Use Case |
|---|---|---|---|---|
| Standard | 4 KB | 12 bits | 4 KB | General-purpose computing |
| Large (x86) | 2 MB | 21 bits | 2 MB | Databases, virtualization |
| Huge (x86) | 1 GB | 30 bits | 1 GB | HPC, large memory servers |
| Large (ARM) | 64 KB | 16 bits | 64 KB | Mobile, embedded (optional) |
| Large (ARM) | 2 MB | 21 bits | 2 MB | Server ARM platforms |
The virtual address decomposition:
Every virtual address is conceptually split into two parts:

- The **virtual page number (VPN)**: the upper bits, which identify the page and are translated through the page table
- The **page offset**: the lower bits, which locate a byte within the page and pass through translation unchanged
For a 4KB page, the lower 12 bits form the offset (2¹² = 4096 bytes), and the remaining bits form the virtual page number. For a 2MB page, the lower 21 bits form the offset (2²¹ = 2,097,152 bytes), dramatically reducing the number of pages needed to cover the same address space.
```c
#include <stdio.h>
#include <stdint.h>

// Page size constants
#define PAGE_SIZE_4KB (1UL << 12)  // 4096 bytes
#define PAGE_SIZE_2MB (1UL << 21)  // 2,097,152 bytes
#define PAGE_SIZE_1GB (1UL << 30)  // 1,073,741,824 bytes

// Mask generators
#define PAGE_OFFSET_MASK(page_size) ((page_size) - 1)
#define PAGE_NUMBER_MASK(page_size) (~PAGE_OFFSET_MASK(page_size))

/**
 * Decompose a virtual address into page number and offset
 * for different page sizes
 */
void decompose_address(uintptr_t virtual_addr) {
    printf("Virtual Address: 0x%016lx\n", virtual_addr);
    printf("==================================================\n");

    // 4KB page decomposition
    uintptr_t vpn_4kb = virtual_addr >> 12;
    uintptr_t offset_4kb = virtual_addr & PAGE_OFFSET_MASK(PAGE_SIZE_4KB);
    printf("4KB Page: VPN = 0x%013lx, Offset = 0x%03lx (%lu bytes)\n",
           vpn_4kb, offset_4kb, offset_4kb);

    // 2MB page decomposition
    uintptr_t vpn_2mb = virtual_addr >> 21;
    uintptr_t offset_2mb = virtual_addr & PAGE_OFFSET_MASK(PAGE_SIZE_2MB);
    printf("2MB Page: VPN = 0x%010lx, Offset = 0x%06lx (%lu bytes)\n",
           vpn_2mb, offset_2mb, offset_2mb);

    // 1GB page decomposition
    uintptr_t vpn_1gb = virtual_addr >> 30;
    uintptr_t offset_1gb = virtual_addr & PAGE_OFFSET_MASK(PAGE_SIZE_1GB);
    printf("1GB Page: VPN = 0x%08lx, Offset = 0x%09lx (%lu bytes)\n",
           vpn_1gb, offset_1gb, offset_1gb);
}

int main() {
    // Example: Address within a typical heap region
    uintptr_t test_addr = 0x7f1234567890UL;
    decompose_address(test_addr);
    return 0;
}

/*
 * Output:
 * Virtual Address: 0x00007f1234567890
 * ==================================================
 * 4KB Page: VPN = 0x00007f1234567, Offset = 0x890 (2192 bytes)
 * 2MB Page: VPN = 0x0003f891a2, Offset = 0x167890 (1472656 bytes)
 * 1GB Page: VPN = 0x0001fc48, Offset = 0x034567890 (878082192 bytes)
 */
```

With larger page sizes, fewer pages are needed to cover the same virtual address space. A 128GB address space requires 33,554,432 entries with 4KB pages, but only 65,536 entries with 2MB pages—a 512x reduction. This reduction cascades through the entire memory management hierarchy.
The 4KB page size wasn't chosen arbitrarily—it emerged as a careful balance between competing concerns in an era when memory was expensive and scarce. Understanding this historical context reveals why larger pages weren't adopted earlier and what has changed.
The original constraints:

The 4KB choice balanced several competing pressures of its era:

- **Internal fragmentation**: larger pages waste more memory per allocation, which was intolerable when total RAM was measured in megabytes
- **Disk I/O granularity**: pages are the unit of swapping, and 4KB aligned well with disk sector and filesystem block sizes
- **Page table size**: smaller pages require more entries, but the small address spaces of the time kept this overhead manageable
- **TLB reach**: translation caches held only a handful of entries, and 4KB entries covered the modest working sets of the era
The internal fragmentation tradeoff:
Internal fragmentation occurs when allocated space exceeds what's actually needed. With pages as the allocation unit, any request rounds up to the nearest page boundary.
Average internal fragmentation = Page Size / 2
For 4KB pages: average waste = 2KB per allocation
For 2MB pages: average waste = 1MB per allocation
For 1GB pages: average waste = 512MB per allocation
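The Page Size / 2 figure assumes allocation sizes land uniformly between page boundaries. A quick simulation sketch (using `rand()` and an arbitrary allocation-size range for illustration) shows the heuristic holding in practice:

```c
#include <stdio.h>
#include <stdlib.h>

// Simulate many allocations of random size and measure the average
// space wasted by rounding each one up to a page boundary.
static double average_waste(size_t page_size, int trials) {
    unsigned long long total_waste = 0;
    for (int i = 0; i < trials; i++) {
        // Arbitrary request size between 1 byte and 16MB
        // (rand() is crude, but adequate for this sketch)
        size_t request = 1 + (size_t)(rand() % (16UL << 20));
        size_t rounded = (request + page_size - 1) & ~(page_size - 1);
        total_waste += rounded - request;
    }
    return (double)total_waste / trials;
}

int main(void) {
    srand(42);  // fixed seed for reproducibility
    size_t sizes[] = {1UL << 12, 2UL << 20};  // 4KB, 2MB
    for (int i = 0; i < 2; i++) {
        printf("Page size %8zu: average waste ~= %.0f bytes (predicted %zu)\n",
               sizes[i], average_waste(sizes[i], 1000000), sizes[i] / 2);
    }
    return 0;
}
```

The measured averages converge on roughly 2KB and 1MB respectively, matching the Page Size / 2 prediction.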
When systems had 16MB of RAM, wasting 1MB on internal fragmentation for a small process was unthinkable. Today, with 512GB systems, this constraint has inverted—the overhead of managing millions of 4KB pages exceeds the cost of some internal fragmentation with huge pages.
| Era | Typical RAM | 4KB Pages Required | Internal Fragmentation Concern | Page Table Overhead |
|---|---|---|---|---|
| 1980s | 1-16 MB | 256-4K pages | Critical (2KB waste per page) | Negligible |
| 1990s | 16-256 MB | 4K-64K pages | Significant | Minor |
| 2000s | 256MB-4GB | 64K-1M pages | Moderate | Noticeable |
| 2010s | 4-64 GB | 1M-16M pages | Low concern | Significant |
| 2020s | 64GB-1TB+ | 16M-256M+ pages | Minimal (for large apps) | Dominant concern |
In modern systems, the bottleneck has shifted from internal fragmentation (wasted space within pages) to external overhead (TLB misses, page table walks, and memory consumed by page table structures). Huge pages directly address these new bottlenecks.
As physical memory grew, the infrastructure required to manage 4KB pages expanded dramatically. Consider a modern server with 1TB of RAM running a database that maps its entire dataset into memory:
The mathematics are stark:

- 4KB pages: 1TB ÷ 4KB = 268,435,456 pages, requiring ~2GB of page table entries (at 8 bytes each)
- 2MB pages: 1TB ÷ 2MB = 524,288 pages, requiring ~4MB of page table entries
- 1GB pages: 1TB ÷ 1GB = 1,024 pages, requiring ~8KB of page table entries
Reducing page count from 268 million to 524 thousand (a 512x reduction) ripples through every aspect of memory management:

- Each TLB entry covers 512× more memory, so far fewer entries are needed to map a working set
- Page table walks on TLB misses traverse one fewer level of the hierarchy
- Page tables shrink from gigabytes to megabytes, freeing both memory and cache capacity
- The kernel has 512× fewer per-page structures to allocate, scan, and reclaim
These aren't marginal improvements—they represent fundamental changes in how effectively the memory system operates.
```python
#!/usr/bin/env python3
"""
Calculate page table overhead for different page sizes

This demonstrates the dramatic impact of page size on
memory management overhead
"""


def calculate_page_table_overhead(ram_bytes: int, page_size: int, pte_size: int = 8):
    """
    Calculate page table overhead for a given memory size and page size.

    Args:
        ram_bytes: Total RAM in bytes
        page_size: Page size in bytes
        pte_size: Page table entry size (default 8 bytes for 64-bit)

    Returns:
        Dictionary with overhead statistics
    """
    num_pages = ram_bytes // page_size
    pte_overhead = num_pages * pte_size
    overhead_percentage = (pte_overhead / ram_bytes) * 100

    return {
        'num_pages': num_pages,
        'pte_overhead_bytes': pte_overhead,
        'pte_overhead_mb': pte_overhead / (1024 ** 2),
        'overhead_percentage': overhead_percentage,
    }


def format_size(bytes_val):
    """Format bytes as human-readable string"""
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if bytes_val < 1024:
            return f"{bytes_val:.2f} {unit}"
        bytes_val /= 1024
    return f"{bytes_val:.2f} PB"


# System configurations to analyze
RAM_SIZES = [
    ("Developer Workstation", 32 * 2**30),   # 32 GB
    ("Application Server", 128 * 2**30),     # 128 GB
    ("Database Server", 512 * 2**30),        # 512 GB
    ("In-Memory Analytics", 1 * 2**40),      # 1 TB
    ("Large-Scale NUMA", 4 * 2**40),         # 4 TB
]

PAGE_SIZES = [
    ("4KB (Standard)", 4 * 2**10),
    ("2MB (Huge)", 2 * 2**20),
    ("1GB (Giant)", 1 * 2**30),
]

print("=" * 100)
print("PAGE TABLE OVERHEAD ANALYSIS")
print("=" * 100)

for system_name, ram in RAM_SIZES:
    print(f"\n{'─' * 100}")
    print(f"System: {system_name} ({format_size(ram)} RAM)")
    print(f"{'─' * 100}")
    print(f"{'Page Size':<20} {'Page Count':>20} {'PTE Overhead':>20} {'% of RAM':>15}")
    print(f"{'─' * 20} {'─' * 20} {'─' * 20} {'─' * 15}")

    for page_name, page_size in PAGE_SIZES:
        stats = calculate_page_table_overhead(ram, page_size)
        print(f"{page_name:<20} {stats['num_pages']:>20,} "
              f"{format_size(stats['pte_overhead_bytes']):>20} "
              f"{stats['overhead_percentage']:>14.6f}%")

print("\n" + "=" * 100)
print("KEY INSIGHT: With 4KB pages, a 4TB system needs 8GB just for page table entries!")
print("With 2MB pages, this drops to 16MB. With 1GB pages, just 32KB.")
print("=" * 100)
```

Huge pages aren't simply "bigger pages"—they require specific hardware support and leverage the hierarchical structure of modern page table formats. Understanding this implementation reveals both the power and constraints of huge page technology.
x86-64 Page Table Hierarchy:
The x86-64 architecture uses a four-level page table (five levels with the LA57 extension):

1. **PML4** (Page Map Level 4): indexed by virtual address bits 47:39
2. **PDPT** (Page Directory Pointer Table): indexed by bits 38:30
3. **PD** (Page Directory): indexed by bits 29:21
4. **PT** (Page Table): indexed by bits 20:12; its entries map 4KB frames
Huge pages work by terminating the walk early:

- Setting the PS (Page Size) bit in a **PD entry** makes that entry map a 2MB page directly, eliminating the PT level
- Setting the PS bit in a **PDPT entry** makes that entry map a 1GB page, eliminating both the PD and PT levels
```c
/*
 * x86-64 Page Table Structure and Walk Comparison
 * This illustrates how huge pages reduce translation overhead
 */

#include <stdint.h>

// Page table entry flags
#define PTE_PRESENT  (1UL << 0)   // Page is present in memory
#define PTE_WRITABLE (1UL << 1)   // Page is writable
#define PTE_USER     (1UL << 2)   // User-accessible
#define PTE_PS       (1UL << 7)   // Page Size (huge page flag)
#define PTE_GLOBAL   (1UL << 8)   // Global page (survives TLB flush)
#define PTE_NX       (1UL << 63)  // No-execute

// Index extraction for the four page table levels
#define PML4_INDEX(va) (((va) >> 39) & 0x1FF)  // Bits 47:39
#define PDPT_INDEX(va) (((va) >> 30) & 0x1FF)  // Bits 38:30
#define PD_INDEX(va)   (((va) >> 21) & 0x1FF)  // Bits 29:21
#define PT_INDEX(va)   (((va) >> 12) & 0x1FF)  // Bits 20:12

// Offset extraction for different page sizes
#define OFFSET_4KB(va) ((va) & 0xFFF)        // Bits 11:0 (12 bits)
#define OFFSET_2MB(va) ((va) & 0x1FFFFF)     // Bits 20:0 (21 bits)
#define OFFSET_1GB(va) ((va) & 0x3FFFFFFF)   // Bits 29:0 (30 bits)

// Physical address extraction from PTE (bits 51:12)
#define PTE_TO_PHYS(pte) ((pte) & 0x000FFFFFFFFFF000UL)

/**
 * Translate virtual address with 4KB pages
 * Requires FOUR memory accesses for page table walk
 */
uint64_t translate_4kb_page(uint64_t cr3, uint64_t va) {
    // Level 4: PML4
    uint64_t *pml4 = (uint64_t *)PTE_TO_PHYS(cr3);
    uint64_t pml4e = pml4[PML4_INDEX(va)];            // Memory Access #1
    if (!(pml4e & PTE_PRESENT)) return 0;

    // Level 3: PDPT
    uint64_t *pdpt = (uint64_t *)PTE_TO_PHYS(pml4e);
    uint64_t pdpte = pdpt[PDPT_INDEX(va)];            // Memory Access #2
    if (!(pdpte & PTE_PRESENT)) return 0;

    // Level 2: PD
    uint64_t *pd = (uint64_t *)PTE_TO_PHYS(pdpte);
    uint64_t pde = pd[PD_INDEX(va)];                  // Memory Access #3
    if (!(pde & PTE_PRESENT)) return 0;

    // Level 1: PT
    uint64_t *pt = (uint64_t *)PTE_TO_PHYS(pde);
    uint64_t pte = pt[PT_INDEX(va)];                  // Memory Access #4
    if (!(pte & PTE_PRESENT)) return 0;

    // Combine frame address with offset
    return PTE_TO_PHYS(pte) | OFFSET_4KB(va);
}

/**
 * Translate virtual address with 2MB huge pages
 * Requires only THREE memory accesses - walk terminates at PD level
 */
uint64_t translate_2mb_page(uint64_t cr3, uint64_t va) {
    // Level 4: PML4
    uint64_t *pml4 = (uint64_t *)PTE_TO_PHYS(cr3);
    uint64_t pml4e = pml4[PML4_INDEX(va)];            // Memory Access #1
    if (!(pml4e & PTE_PRESENT)) return 0;

    // Level 3: PDPT
    uint64_t *pdpt = (uint64_t *)PTE_TO_PHYS(pml4e);
    uint64_t pdpte = pdpt[PDPT_INDEX(va)];            // Memory Access #2
    if (!(pdpte & PTE_PRESENT)) return 0;

    // Level 2: PD - Check PS bit for 2MB page
    uint64_t *pd = (uint64_t *)PTE_TO_PHYS(pdpte);
    uint64_t pde = pd[PD_INDEX(va)];                  // Memory Access #3
    if (!(pde & PTE_PRESENT)) return 0;

    // PS bit set means this is a 2MB page - walk terminates here
    if (pde & PTE_PS) {
        // Combine 2MB frame address with 21-bit offset
        return (PTE_TO_PHYS(pde) & ~0x1FFFFFUL) | OFFSET_2MB(va);
    }

    // PS bit not set - this is actually a 4KB page, need PT level
    // (This shouldn't happen for properly configured huge pages)
    return 0;
}

/**
 * Translate virtual address with 1GB giant pages
 * Requires only TWO memory accesses - walk terminates at PDPT level
 */
uint64_t translate_1gb_page(uint64_t cr3, uint64_t va) {
    // Level 4: PML4
    uint64_t *pml4 = (uint64_t *)PTE_TO_PHYS(cr3);
    uint64_t pml4e = pml4[PML4_INDEX(va)];            // Memory Access #1
    if (!(pml4e & PTE_PRESENT)) return 0;

    // Level 3: PDPT - Check PS bit for 1GB page
    uint64_t *pdpt = (uint64_t *)PTE_TO_PHYS(pml4e);
    uint64_t pdpte = pdpt[PDPT_INDEX(va)];            // Memory Access #2
    if (!(pdpte & PTE_PRESENT)) return 0;

    // PS bit set means this is a 1GB page - walk terminates here
    if (pdpte & PTE_PS) {
        // Combine 1GB frame address with 30-bit offset
        return (PTE_TO_PHYS(pdpte) & ~0x3FFFFFFFUL) | OFFSET_1GB(va);
    }

    // PS bit not set - need to continue walk for smaller pages
    return 0;
}

/*
 * Summary of Memory Accesses Required:
 *
 * 4KB Page: 4 memory accesses (PML4 → PDPT → PD → PT)
 * 2MB Page: 3 memory accesses (PML4 → PDPT → PD)
 * 1GB Page: 2 memory accesses (PML4 → PDPT)
 *
 * On a TLB miss, this difference can be 100+ CPU cycles per access!
 * With memory-bound workloads doing billions of accesses,
 * the cumulative savings are enormous.
 */
```

Not all processors support all huge page sizes. 2MB pages require the PSE (Page Size Extension) feature (standard since the Pentium Pro). 1GB pages require the PDPE1GB feature (available since Intel Westmere and AMD Barcelona). Check CPU capabilities via /proc/cpuinfo or the CPUID instruction before relying on specific page sizes.
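As a quick way to perform that check on Linux, the sketch below scans the `flags` line of /proc/cpuinfo for the `pse` and `pdpe1gb` feature strings (these are the names Linux uses in its flag list; the substring match here is deliberately naive):

```c
#include <stdio.h>
#include <string.h>

// Scan the flags line of /proc/cpuinfo for a CPU feature string.
// Naive substring match; a robust version would tokenize the list
// so that e.g. "pse" does not match "pse36".
static int cpu_has_flag(const char *flag) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) return -1;

    char line[4096];
    int found = 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0 && strstr(line, flag)) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

int main(void) {
    printf("PSE (2MB pages):     %s\n",
           cpu_has_flag(" pse ") == 1 ? "supported" : "not found");
    printf("PDPE1GB (1GB pages): %s\n",
           cpu_has_flag(" pdpe1gb ") == 1 ? "supported" : "not found");
    return 0;
}
```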
Huge pages impose strict alignment constraints that fundamentally impact how memory must be organized. These requirements derive directly from the page table structure and represent one of the key challenges in huge page adoption.
The alignment rule:
A page of size N must be aligned to an N-byte boundary in physical memory.
This means:
| Page Size | Alignment | Valid Physical Addresses | Bits Fixed to Zero |
|---|---|---|---|
| 4 KB | 4 KB | 0x0000, 0x1000, 0x2000, 0x3000, ... | Low 12 bits |
| 2 MB | 2 MB | 0x000000, 0x200000, 0x400000, 0x600000, ... | Low 21 bits |
| 1 GB | 1 GB | 0x00000000, 0x40000000, 0x80000000, ... | Low 30 bits |
Why alignment matters:
The alignment requirement exists because the page table structure reuses address bits. When the PS (Page Size) bit is set in a PD entry for a 2MB page, the hardware interprets the physical address field differently:

- Bits 51:21 hold the physical address of the 2MB frame
- Bit 12 is repurposed as the PAT (Page Attribute Table) bit
- Bits 20:13 are reserved and must be zero
If a 2MB page's physical address had non-zero bits in positions 20:12, those bits would conflict with the address calculation, causing incorrect translations.
Fragmentation implications:
This alignment requirement creates a significant challenge: physical memory fragmentation. After a system has been running for a while, finding 2MB of contiguous, properly aligned physical memory becomes difficult. Finding 1GB aligned regions is even harder.
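On Linux, you can observe this fragmentation directly through /proc/buddyinfo, which reports how many free blocks of each order the kernel's buddy allocator currently holds: column N is the count of free blocks of 2^N contiguous 4KB pages, so a 2MB huge page needs a free order-9 block. A minimal sketch that dumps it:

```c
#include <stdio.h>

/*
 * Print /proc/buddyinfo (Linux-specific). If the order-9 and higher
 * columns are all zero, no 2MB huge page can be allocated from that
 * zone until memory is compacted or freed.
 */
int main(void) {
    FILE *f = fopen("/proc/buddyinfo", "r");
    if (!f) {
        perror("fopen /proc/buddyinfo");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof(line), f)) {
        fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```

Running this on a long-lived system typically shows plenty of low-order blocks but few high-order ones, which is exactly the fragmentation problem described above.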
```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PAGE_4KB (1UL << 12)  // 4096
#define PAGE_2MB (1UL << 21)  // 2097152
#define PAGE_1GB (1UL << 30)  // 1073741824

/**
 * Check if address is properly aligned for given page size
 */
bool is_aligned(uintptr_t addr, size_t page_size) {
    return (addr & (page_size - 1)) == 0;
}

/**
 * Round up address to next aligned boundary
 */
uintptr_t align_up(uintptr_t addr, size_t page_size) {
    return (addr + page_size - 1) & ~(page_size - 1);
}

/**
 * Round down address to previous aligned boundary
 */
uintptr_t align_down(uintptr_t addr, size_t page_size) {
    return addr & ~(page_size - 1);
}

/**
 * Calculate potential waste when aligning allocation
 */
size_t alignment_waste(uintptr_t start, size_t size, size_t page_size) {
    uintptr_t aligned_start = align_up(start, page_size);

    // Waste from aligning the start
    size_t start_waste = aligned_start - start;

    // Waste from rounding size up to page boundary
    size_t aligned_size = align_up(size, page_size);
    size_t size_waste = aligned_size - size;

    return start_waste + size_waste;
}

void analyze_region(uintptr_t start, size_t length) {
    printf("\nAnalyzing region: 0x%lx - 0x%lx (%.2f MB)\n",
           start, start + length, length / (1024.0 * 1024.0));
    for (int i = 0; i < 60; i++) printf("─");
    printf("\n");

    struct {
        const char *name;
        size_t page_size;
    } page_types[] = {
        {"4KB pages", PAGE_4KB},
        {"2MB pages", PAGE_2MB},
        {"1GB pages", PAGE_1GB},
    };

    for (int i = 0; i < 3; i++) {
        size_t ps = page_types[i].page_size;
        uintptr_t aligned_start = align_up(start, ps);
        uintptr_t aligned_end = align_down(start + length, ps);

        if (aligned_end > aligned_start) {
            size_t usable = aligned_end - aligned_start;
            size_t num_pages = usable / ps;
            double efficiency = (usable * 100.0) / length;

            printf("%-12s: %6zu pages, Usable: %.2f MB (%.1f%% efficient)\n",
                   page_types[i].name, num_pages,
                   usable / (1024.0 * 1024.0), efficiency);
        } else {
            printf("%-12s: Region too small/misaligned for this page size\n",
                   page_types[i].name);
        }
    }
}

int main() {
    // Simulate analyzing physical memory regions

    // Well-aligned 1GB region
    analyze_region(0x40000000UL, 1UL << 30);

    // Misaligned region (common after fragmentation)
    analyze_region(0x40100000UL, 500UL * 1024 * 1024);

    // Small region that can't use huge pages
    analyze_region(0x55555000UL, 8 * 1024 * 1024);

    return 0;
}

/*
 * Example Output:
 *
 * Analyzing region: 0x40000000 - 0x80000000 (1024.00 MB)
 * ────────────────────────────────────────────────────────────
 * 4KB pages   : 262144 pages, Usable: 1024.00 MB (100.0% efficient)
 * 2MB pages   :    512 pages, Usable: 1024.00 MB (100.0% efficient)
 * 1GB pages   :      1 pages, Usable: 1024.00 MB (100.0% efficient)
 *
 * Analyzing region: 0x40100000 - 0x5f500000 (500.00 MB)
 * ────────────────────────────────────────────────────────────
 * 4KB pages   : 128000 pages, Usable: 500.00 MB (100.0% efficient)
 * 2MB pages   :    249 pages, Usable: 498.00 MB (99.6% efficient)
 * 1GB pages   : Region too small/misaligned for this page size
 *
 * Analyzing region: 0x55555000 - 0x55d55000 (8.00 MB)
 * ────────────────────────────────────────────────────────────
 * 4KB pages   :   2048 pages, Usable: 8.00 MB (100.0% efficient)
 * 2MB pages   :      3 pages, Usable: 6.00 MB (75.0% efficient)
 * 1GB pages   : Region too small/misaligned for this page size
 */
```

Let's consolidate our understanding with a comprehensive comparison across all dimensions that matter for memory management:
| Characteristic | 4KB Standard | 2MB Huge | 1GB Giant |
|---|---|---|---|
| Size | 4,096 bytes | 2,097,152 bytes | 1,073,741,824 bytes |
| Ratio to 4KB | 1× | 512× | 262,144× |
| Offset bits | 12 bits | 21 bits | 30 bits |
| Page table levels | 4 (PML4→PT) | 3 (PML4→PD) | 2 (PML4→PDPT) |
| TLB entries needed (1TB) | 268M | 524K | 1,024 |
| Page table size (1TB) | ~2GB | ~4MB | ~8KB |
| Avg internal fragmentation | 2 KB | 1 MB | 512 MB |
| Allocation flexibility | Excellent | Good | Limited |
| Fragmentation resistance | High | Moderate | Low |
| TLB efficiency | Low | High | Highest |
| Boot-time reservation | Not required | Recommended | Required |
| CPU feature required | None (baseline) | PSE | PDPE1GB |
2MB pages hit a sweet spot for most workloads—providing 512× better TLB coverage than 4KB pages while still being practical to allocate after system startup. 1GB pages offer even greater benefits but require more careful planning and are best suited for specialized, large-memory applications.
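To make this concrete, here is a minimal sketch of the two common ways to request 2MB pages on Linux: an explicit hugetlb mapping (which assumes huge pages have already been reserved, e.g. via `sysctl vm.nr_hugepages=N`) and an madvise() hint that lets transparent huge pages back an ordinary mapping:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE (2UL << 20)  // one 2MB huge page

int main(void) {
    // Option 1: explicit hugetlb mapping. Fails with ENOMEM unless
    // huge pages were reserved beforehand (vm.nr_hugepages).
    void *explicit_hp = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (explicit_hp == MAP_FAILED)
        perror("mmap(MAP_HUGETLB)");
    else
        printf("Explicit 2MB huge page mapped at %p\n", explicit_hp);

    // Option 2: transparent huge pages. Map anonymous memory, hint
    // the kernel, and touch the range; THP may back it with a 2MB
    // page if the region is suitably aligned and memory permits.
    void *thp = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (thp != MAP_FAILED) {
        if (madvise(thp, SIZE, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");
        memset(thp, 0, SIZE);  // faulting in the range triggers allocation
        printf("THP candidate region at %p\n", thp);
    }
    return 0;
}
```

The contrast mirrors the table above: the explicit path gives guaranteed huge pages at the cost of up-front reservation, while the THP path is best-effort but requires no planning.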
We've explored the fundamental differences between standard and huge pages. Here are the key insights:

- A page is the atomic unit of memory management, and its size shapes every layer of the memory hierarchy
- The 4KB standard reflects constraints (scarce RAM, costly internal fragmentation) that no longer dominate
- Moving to 2MB or 1GB pages cuts page counts by 512× or 262,144×, shrinking page tables from gigabytes to megabytes or kilobytes
- Huge pages terminate the page table walk early via the PS bit, reducing translation from four memory accesses to three (2MB) or two (1GB)
- Natural alignment requirements make physical memory fragmentation the central obstacle to huge page allocation
What's next:
Now that we understand the fundamental differences between page sizes, we'll dive deeper into the specific mechanism that makes huge pages so valuable: TLB efficiency. The next page explores how the Translation Lookaside Buffer works, why TLB misses are so expensive, and how huge pages dramatically improve hit rates.
You now understand the architectural foundations of page sizes, the historical context of the 4KB default, and why modern systems need huge pages.