Every memory access your program makes is a lie—at least from the hardware's perspective. When your code reads from address 0x7ffd5a3b1000, the CPU must translate this virtual address into a physical address before any actual memory operation can occur. This translation happens billions of times per second on a modern system, and its efficiency fundamentally determines application performance.
For decades, operating systems have used a standard page size of 4 kilobytes (4KB) as the fundamental unit of memory management. This choice, made when physical memory was measured in megabytes, has become increasingly problematic as systems now manage hundreds of gigabytes or even terabytes of RAM.
Huge pages represent a fundamental shift in this paradigm—allowing the operating system to manage memory in units of 2MB or 1GB instead of 4KB. This seemingly simple change has profound implications for performance, memory overhead, and system design.
By the end of this page, you will understand the architectural foundations of page size design, compare standard and huge pages across multiple dimensions, and grasp why modern systems increasingly demand larger page sizes to maintain efficient memory management.
Before comparing standard and huge pages, we must understand what a page fundamentally represents in the memory hierarchy and why its size is one of the most consequential design decisions in computer architecture.
A page is the atomic unit of memory management. The operating system allocates, protects, maps, and swaps memory at page granularity. When you request memory via malloc() or mmap(), the kernel ultimately provides you with one or more pages. When the MMU (Memory Management Unit) translates virtual addresses, it does so page by page.
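You can observe these units directly. The following minimal sketch (assuming a Linux system) queries the standard page size via the POSIX `sysconf()` call and reads the default huge page size from the `Hugepagesize:` line of `/proc/meminfo`:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* POSIX: the page size the kernel uses for this process */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("Standard page size: %ld bytes\n", page_size);

    /* Linux-specific: the default huge page size appears in /proc/meminfo */
    FILE *f = fopen("/proc/meminfo", "r");
    if (f) {
        char line[256];
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "Hugepagesize:", 13) == 0) {
                printf("Default huge page size:%s", line + 13);
                break;
            }
        }
        fclose(f);
    }
    return 0;
}
```

On a typical x86-64 Linux machine this prints 4096 bytes for the standard page size and 2048 kB for the default huge page size.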
| Page Type | Size | Offset Bits | Address Coverage per PTE | Typical Use Case |
|---|---|---|---|---|
| Standard | 4 KB | 12 bits | 4 KB | General-purpose computing |
| Large (x86) | 2 MB | 21 bits | 2 MB | Databases, virtualization |
| Huge (x86) | 1 GB | 30 bits | 1 GB | HPC, large memory servers |
| Large (ARM) | 64 KB | 16 bits | 64 KB | Mobile, embedded (optional) |
| Large (ARM) | 2 MB | 21 bits | 2 MB | Server ARM platforms |
The virtual address decomposition:
Every virtual address is conceptually split into two parts:

- The **virtual page number (VPN)**: the upper bits, which identify the page and are translated through the page table
- The **page offset**: the lower bits, which locate a byte within the page and pass through translation unchanged
For a 4KB page, the lower 12 bits form the offset (2¹² = 4096 bytes), and the remaining bits form the virtual page number. For a 2MB page, the lower 21 bits form the offset (2²¹ = 2,097,152 bytes), dramatically reducing the number of pages needed to cover the same address space.
```c
#include <stdio.h>
#include <stdint.h>

// Page size constants
#define PAGE_SIZE_4KB (1UL << 12)  // 4096 bytes
#define PAGE_SIZE_2MB (1UL << 21)  // 2,097,152 bytes
#define PAGE_SIZE_1GB (1UL << 30)  // 1,073,741,824 bytes

// Mask generators
#define PAGE_OFFSET_MASK(page_size) ((page_size) - 1)
#define PAGE_NUMBER_MASK(page_size) (~PAGE_OFFSET_MASK(page_size))

/**
 * Decompose a virtual address into page number and offset
 * for different page sizes
 */
void decompose_address(uintptr_t virtual_addr) {
    printf("Virtual Address: 0x%016lx\n", virtual_addr);
    printf("==================================================\n");

    // 4KB page decomposition
    uintptr_t vpn_4kb = virtual_addr >> 12;
    uintptr_t offset_4kb = virtual_addr & PAGE_OFFSET_MASK(PAGE_SIZE_4KB);
    printf("4KB Page: VPN = 0x%013lx, Offset = 0x%03lx (%lu bytes)\n",
           vpn_4kb, offset_4kb, offset_4kb);

    // 2MB page decomposition
    uintptr_t vpn_2mb = virtual_addr >> 21;
    uintptr_t offset_2mb = virtual_addr & PAGE_OFFSET_MASK(PAGE_SIZE_2MB);
    printf("2MB Page: VPN = 0x%010lx, Offset = 0x%06lx (%lu bytes)\n",
           vpn_2mb, offset_2mb, offset_2mb);

    // 1GB page decomposition
    uintptr_t vpn_1gb = virtual_addr >> 30;
    uintptr_t offset_1gb = virtual_addr & PAGE_OFFSET_MASK(PAGE_SIZE_1GB);
    printf("1GB Page: VPN = 0x%08lx, Offset = 0x%09lx (%lu bytes)\n",
           vpn_1gb, offset_1gb, offset_1gb);
}

int main() {
    // Example: Address within a typical heap region
    uintptr_t test_addr = 0x7f1234567890UL;
    decompose_address(test_addr);
    return 0;
}

/*
 * Output:
 * Virtual Address: 0x00007f1234567890
 * ==================================================
 * 4KB Page: VPN = 0x00007f1234567, Offset = 0x890 (2192 bytes)
 * 2MB Page: VPN = 0x0003f891a2, Offset = 0x167890 (1472656 bytes)
 * 1GB Page: VPN = 0x0001fc48, Offset = 0x034567890 (878082192 bytes)
 */
```

With larger page sizes, fewer pages are needed to cover the same virtual address space. A 128GB address space requires 33,554,432 entries with 4KB pages, but only 65,536 entries with 2MB pages—a 512x reduction. This reduction cascades through the entire memory management hierarchy.
The 4KB page size wasn't chosen arbitrarily—it emerged as a careful balance between competing concerns in an era when memory was expensive and scarce. Understanding this historical context reveals why larger pages weren't adopted earlier and what has changed.
The original constraints:

The 4KB choice balanced several competing pressures of its era:

- **Internal fragmentation**: larger pages waste more memory per allocation, which was intolerable when total RAM was measured in megabytes
- **Disk I/O granularity**: pages are the unit of swapping, and 4KB aligned well with disk sector and filesystem block sizes
- **Page table size**: smaller pages require more entries, but the small address spaces of the time kept this overhead manageable
- **TLB reach**: translation caches held only a handful of entries, and 4KB entries covered the modest working sets of the era
The internal fragmentation tradeoff:
Internal fragmentation occurs when allocated space exceeds what's actually needed. With pages as the allocation unit, any request rounds up to the nearest page boundary.
Average internal fragmentation = Page Size / 2
For 4KB pages: average waste = 2KB per allocation
For 2MB pages: average waste = 1MB per allocation
For 1GB pages: average waste = 512MB per allocation
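The Page Size / 2 figure assumes allocation sizes land uniformly between page boundaries. A quick simulation sketch (using `rand()` and an arbitrary allocation-size range for illustration) shows the heuristic holding in practice:

```c
#include <stdio.h>
#include <stdlib.h>

// Simulate many allocations of random size and measure the average
// space wasted by rounding each one up to a page boundary.
static double average_waste(size_t page_size, int trials) {
    unsigned long long total_waste = 0;
    for (int i = 0; i < trials; i++) {
        // Arbitrary request size between 1 byte and 16MB
        // (rand() is crude, but adequate for this sketch)
        size_t request = 1 + (size_t)(rand() % (16UL << 20));
        size_t rounded = (request + page_size - 1) & ~(page_size - 1);
        total_waste += rounded - request;
    }
    return (double)total_waste / trials;
}

int main(void) {
    srand(42);  // fixed seed for reproducibility
    size_t sizes[] = {1UL << 12, 2UL << 20};  // 4KB, 2MB
    for (int i = 0; i < 2; i++) {
        printf("Page size %8zu: average waste ~= %.0f bytes (predicted %zu)\n",
               sizes[i], average_waste(sizes[i], 1000000), sizes[i] / 2);
    }
    return 0;
}
```

The measured averages converge on roughly 2KB and 1MB respectively, matching the Page Size / 2 prediction.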
When systems had 16MB of RAM, wasting 1MB on internal fragmentation for a small process was unthinkable. Today, with 512GB systems, this constraint has inverted—the overhead of managing millions of 4KB pages exceeds the cost of some internal fragmentation with huge pages.
| Era | Typical RAM | 4KB Pages Required | Internal Fragmentation Concern | Page Table Overhead |
|---|---|---|---|---|
| 1980s | 1-16 MB | 256-4K pages | Critical (2KB waste per page) | Negligible |
| 1990s | 16-256 MB | 4K-64K pages | Significant | Minor |
| 2000s | 256MB-4GB | 64K-1M pages | Moderate | Noticeable |
| 2010s | 4-64 GB | 1M-16M pages | Low concern | Significant |
| 2020s | 64GB-1TB+ | 16M-256M+ pages | Minimal (for large apps) | Dominant concern |
In modern systems, the bottleneck has shifted from internal fragmentation (wasted space within pages) to external overhead (TLB misses, page table walks, and memory consumed by page table structures). Huge pages directly address these new bottlenecks.
As physical memory grew, the infrastructure required to manage 4KB pages expanded dramatically. Consider a modern server with 1TB of RAM running a database that maps its entire dataset into memory:
The mathematics are stark:

- 4KB pages: 1TB ÷ 4KB = 268,435,456 pages, requiring ~2GB of page table entries (at 8 bytes each)
- 2MB pages: 1TB ÷ 2MB = 524,288 pages, requiring ~4MB of page table entries
- 1GB pages: 1TB ÷ 1GB = 1,024 pages, requiring ~8KB of page table entries
Reducing page count from 268 million to 524 thousand (a 512x reduction) ripples through every aspect of memory management:

- Each TLB entry covers 512× more memory, so far fewer entries are needed to map a working set
- Page table walks on TLB misses traverse one fewer level of the hierarchy
- Page tables shrink from gigabytes to megabytes, freeing both memory and cache capacity
- The kernel has 512× fewer per-page structures to allocate, scan, and reclaim
These aren't marginal improvements—they represent fundamental changes in how effectively the memory system operates.
```python
#!/usr/bin/env python3
"""
Calculate page table overhead for different page sizes

This demonstrates the dramatic impact of page size on
memory management overhead
"""


def calculate_page_table_overhead(ram_bytes: int, page_size: int, pte_size: int = 8):
    """
    Calculate page table overhead for a given memory size and page size.

    Args:
        ram_bytes: Total RAM in bytes
        page_size: Page size in bytes
        pte_size: Page table entry size (default 8 bytes for 64-bit)

    Returns:
        Dictionary with overhead statistics
    """
    num_pages = ram_bytes // page_size
    pte_overhead = num_pages * pte_size
    overhead_percentage = (pte_overhead / ram_bytes) * 100

    return {
        'num_pages': num_pages,
        'pte_overhead_bytes': pte_overhead,
        'pte_overhead_mb': pte_overhead / (1024 ** 2),
        'overhead_percentage': overhead_percentage,
    }


def format_size(bytes_val):
    """Format bytes as human-readable string"""
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if bytes_val < 1024:
            return f"{bytes_val:.2f} {unit}"
        bytes_val /= 1024
    return f"{bytes_val:.2f} PB"


# System configurations to analyze
RAM_SIZES = [
    ("Developer Workstation", 32 * 2**30),   # 32 GB
    ("Application Server", 128 * 2**30),     # 128 GB
    ("Database Server", 512 * 2**30),        # 512 GB
    ("In-Memory Analytics", 1 * 2**40),      # 1 TB
    ("Large-Scale NUMA", 4 * 2**40),         # 4 TB
]

PAGE_SIZES = [
    ("4KB (Standard)", 4 * 2**10),
    ("2MB (Huge)", 2 * 2**20),
    ("1GB (Giant)", 1 * 2**30),
]

print("=" * 100)
print("PAGE TABLE OVERHEAD ANALYSIS")
print("=" * 100)

for system_name, ram in RAM_SIZES:
    print(f"\n{'─' * 100}")
    print(f"System: {system_name} ({format_size(ram)} RAM)")
    print(f"{'─' * 100}")
    print(f"{'Page Size':<20} {'Page Count':>20} {'PTE Overhead':>20} {'% of RAM':>15}")
    print(f"{'─' * 20} {'─' * 20} {'─' * 20} {'─' * 15}")

    for page_name, page_size in PAGE_SIZES:
        stats = calculate_page_table_overhead(ram, page_size)
        print(f"{page_name:<20} {stats['num_pages']:>20,} "
              f"{format_size(stats['pte_overhead_bytes']):>20} "
              f"{stats['overhead_percentage']:>14.6f}%")

print("\n" + "=" * 100)
print("KEY INSIGHT: With 4KB pages, a 4TB system needs 8GB just for page table entries!")
print("With 2MB pages, this drops to 16MB. With 1GB pages, just 32KB.")
print("=" * 100)
```

Huge pages aren't simply "bigger pages"—they require specific hardware support and leverage the hierarchical structure of modern page table formats. Understanding this implementation reveals both the power and constraints of huge page technology.
x86-64 Page Table Hierarchy:
The x86-64 architecture uses a four-level page table (five levels with the LA57 extension):

1. **PML4** (Page Map Level 4): indexed by virtual address bits 47:39
2. **PDPT** (Page Directory Pointer Table): indexed by bits 38:30
3. **PD** (Page Directory): indexed by bits 29:21
4. **PT** (Page Table): indexed by bits 20:12; its entries map 4KB frames
Huge pages work by terminating the walk early:

- Setting the PS (Page Size) bit in a **PD entry** makes that entry map a 2MB page directly, eliminating the PT level
- Setting the PS bit in a **PDPT entry** makes that entry map a 1GB page, eliminating both the PD and PT levels
```c
/*
 * x86-64 Page Table Structure and Walk Comparison
 * This illustrates how huge pages reduce translation overhead
 */

#include <stdint.h>

// Page table entry flags
#define PTE_PRESENT  (1UL << 0)   // Page is present in memory
#define PTE_WRITABLE (1UL << 1)   // Page is writable
#define PTE_USER     (1UL << 2)   // User-accessible
#define PTE_PS       (1UL << 7)   // Page Size (huge page flag)
#define PTE_GLOBAL   (1UL << 8)   // Global page (survives TLB flush)
#define PTE_NX       (1UL << 63)  // No-execute

// Index extraction for the four page table levels
#define PML4_INDEX(va) (((va) >> 39) & 0x1FF)  // Bits 47:39
#define PDPT_INDEX(va) (((va) >> 30) & 0x1FF)  // Bits 38:30
#define PD_INDEX(va)   (((va) >> 21) & 0x1FF)  // Bits 29:21
#define PT_INDEX(va)   (((va) >> 12) & 0x1FF)  // Bits 20:12

// Offset extraction for different page sizes
#define OFFSET_4KB(va) ((va) & 0xFFF)        // Bits 11:0 (12 bits)
#define OFFSET_2MB(va) ((va) & 0x1FFFFF)     // Bits 20:0 (21 bits)
#define OFFSET_1GB(va) ((va) & 0x3FFFFFFF)   // Bits 29:0 (30 bits)

// Physical address extraction from PTE (bits 51:12)
#define PTE_TO_PHYS(pte) ((pte) & 0x000FFFFFFFFFF000UL)

/**
 * Translate virtual address with 4KB pages
 * Requires FOUR memory accesses for page table walk
 */
uint64_t translate_4kb_page(uint64_t cr3, uint64_t va) {
    // Level 4: PML4
    uint64_t *pml4 = (uint64_t *)PTE_TO_PHYS(cr3);
    uint64_t pml4e = pml4[PML4_INDEX(va)];            // Memory Access #1
    if (!(pml4e & PTE_PRESENT)) return 0;

    // Level 3: PDPT
    uint64_t *pdpt = (uint64_t *)PTE_TO_PHYS(pml4e);
    uint64_t pdpte = pdpt[PDPT_INDEX(va)];            // Memory Access #2
    if (!(pdpte & PTE_PRESENT)) return 0;

    // Level 2: PD
    uint64_t *pd = (uint64_t *)PTE_TO_PHYS(pdpte);
    uint64_t pde = pd[PD_INDEX(va)];                  // Memory Access #3
    if (!(pde & PTE_PRESENT)) return 0;

    // Level 1: PT
    uint64_t *pt = (uint64_t *)PTE_TO_PHYS(pde);
    uint64_t pte = pt[PT_INDEX(va)];                  // Memory Access #4
    if (!(pte & PTE_PRESENT)) return 0;

    // Combine frame address with offset
    return PTE_TO_PHYS(pte) | OFFSET_4KB(va);
}

/**
 * Translate virtual address with 2MB huge pages
 * Requires only THREE memory accesses - walk terminates at PD level
 */
uint64_t translate_2mb_page(uint64_t cr3, uint64_t va) {
    // Level 4: PML4
    uint64_t *pml4 = (uint64_t *)PTE_TO_PHYS(cr3);
    uint64_t pml4e = pml4[PML4_INDEX(va)];            // Memory Access #1
    if (!(pml4e & PTE_PRESENT)) return 0;

    // Level 3: PDPT
    uint64_t *pdpt = (uint64_t *)PTE_TO_PHYS(pml4e);
    uint64_t pdpte = pdpt[PDPT_INDEX(va)];            // Memory Access #2
    if (!(pdpte & PTE_PRESENT)) return 0;

    // Level 2: PD - Check PS bit for 2MB page
    uint64_t *pd = (uint64_t *)PTE_TO_PHYS(pdpte);
    uint64_t pde = pd[PD_INDEX(va)];                  // Memory Access #3
    if (!(pde & PTE_PRESENT)) return 0;

    // PS bit set means this is a 2MB page - walk terminates here
    if (pde & PTE_PS) {
        // Combine 2MB frame address with 21-bit offset
        return (PTE_TO_PHYS(pde) & ~0x1FFFFFUL) | OFFSET_2MB(va);
    }

    // PS bit not set - this is actually a 4KB page, need PT level
    // (This shouldn't happen for properly configured huge pages)
    return 0;
}

/**
 * Translate virtual address with 1GB giant pages
 * Requires only TWO memory accesses - walk terminates at PDPT level
 */
uint64_t translate_1gb_page(uint64_t cr3, uint64_t va) {
    // Level 4: PML4
    uint64_t *pml4 = (uint64_t *)PTE_TO_PHYS(cr3);
    uint64_t pml4e = pml4[PML4_INDEX(va)];            // Memory Access #1
    if (!(pml4e & PTE_PRESENT)) return 0;

    // Level 3: PDPT - Check PS bit for 1GB page
    uint64_t *pdpt = (uint64_t *)PTE_TO_PHYS(pml4e);
    uint64_t pdpte = pdpt[PDPT_INDEX(va)];            // Memory Access #2
    if (!(pdpte & PTE_PRESENT)) return 0;

    // PS bit set means this is a 1GB page - walk terminates here
    if (pdpte & PTE_PS) {
        // Combine 1GB frame address with 30-bit offset
        return (PTE_TO_PHYS(pdpte) & ~0x3FFFFFFFUL) | OFFSET_1GB(va);
    }

    // PS bit not set - need to continue walk for smaller pages
    return 0;
}

/*
 * Summary of Memory Accesses Required:
 *
 * 4KB Page: 4 memory accesses (PML4 → PDPT → PD → PT)
 * 2MB Page: 3 memory accesses (PML4 → PDPT → PD)
 * 1GB Page: 2 memory accesses (PML4 → PDPT)
 *
 * On a TLB miss, this difference can be 100+ CPU cycles per access!
 * With memory-bound workloads doing billions of accesses,
 * the cumulative savings are enormous.
 */
```

Not all processors support all huge page sizes. 2MB pages require the PSE (Page Size Extension) feature (standard since the Pentium Pro). 1GB pages require the PDPE1GB feature (available since Intel Westmere and AMD Barcelona). Check CPU capabilities via /proc/cpuinfo or the CPUID instruction before relying on specific page sizes.
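As a quick way to perform that check on Linux, the sketch below scans the `flags` line of /proc/cpuinfo for the `pse` and `pdpe1gb` feature strings (these are the names Linux uses in its flag list; the substring match here is deliberately naive):

```c
#include <stdio.h>
#include <string.h>

// Scan the flags line of /proc/cpuinfo for a CPU feature string.
// Naive substring match; a robust version would tokenize the list
// so that e.g. "pse" does not match "pse36".
static int cpu_has_flag(const char *flag) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) return -1;

    char line[4096];
    int found = 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0 && strstr(line, flag)) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

int main(void) {
    printf("PSE (2MB pages):     %s\n",
           cpu_has_flag(" pse ") == 1 ? "supported" : "not found");
    printf("PDPE1GB (1GB pages): %s\n",
           cpu_has_flag(" pdpe1gb ") == 1 ? "supported" : "not found");
    return 0;
}
```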
Huge pages impose strict alignment constraints that fundamentally impact how memory must be organized. These requirements derive directly from the page table structure and represent one of the key challenges in huge page adoption.
The alignment rule:
A page of size N must be aligned to an N-byte boundary in physical memory.
This means:
| Page Size | Alignment | Valid Physical Addresses | Bits Fixed to Zero |
|---|---|---|---|
| 4 KB | 4 KB | 0x0000, 0x1000, 0x2000, 0x3000, ... | Low 12 bits |
| 2 MB | 2 MB | 0x000000, 0x200000, 0x400000, 0x600000, ... | Low 21 bits |
| 1 GB | 1 GB | 0x00000000, 0x40000000, 0x80000000, ... | Low 30 bits |
Why alignment matters:
The alignment requirement exists because the page table structure reuses address bits. When the PS (Page Size) bit is set in a PD entry for a 2MB page, the hardware interprets the physical address field differently:

- Bits 51:21 hold the physical address of the 2MB frame
- Bit 12 is repurposed as the PAT (Page Attribute Table) bit
- Bits 20:13 are reserved and must be zero
If a 2MB page's physical address had non-zero bits in positions 20:12, those bits would conflict with the address calculation, causing incorrect translations.
Fragmentation implications:
This alignment requirement creates a significant challenge: physical memory fragmentation. After a system has been running for a while, finding 2MB of contiguous, properly aligned physical memory becomes difficult. Finding 1GB aligned regions is even harder.
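On Linux, you can observe this fragmentation directly through /proc/buddyinfo, which reports how many free blocks of each order the kernel's buddy allocator currently holds: column N is the count of free blocks of 2^N contiguous 4KB pages, so a 2MB huge page needs a free order-9 block. A minimal sketch that dumps it:

```c
#include <stdio.h>

/*
 * Print /proc/buddyinfo (Linux-specific). If the order-9 and higher
 * columns are all zero, no 2MB huge page can be allocated from that
 * zone until memory is compacted or freed.
 */
int main(void) {
    FILE *f = fopen("/proc/buddyinfo", "r");
    if (!f) {
        perror("fopen /proc/buddyinfo");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof(line), f)) {
        fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```

Running this on a long-lived system typically shows plenty of low-order blocks but few high-order ones, which is exactly the fragmentation problem described above.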
```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PAGE_4KB (1UL << 12)  // 4096
#define PAGE_2MB (1UL << 21)  // 2097152
#define PAGE_1GB (1UL << 30)  // 1073741824

/**
 * Check if address is properly aligned for given page size
 */
bool is_aligned(uintptr_t addr, size_t page_size) {
    return (addr & (page_size - 1)) == 0;
}

/**
 * Round up address to next aligned boundary
 */
uintptr_t align_up(uintptr_t addr, size_t page_size) {
    return (addr + page_size - 1) & ~(page_size - 1);
}

/**
 * Round down address to previous aligned boundary
 */
uintptr_t align_down(uintptr_t addr, size_t page_size) {
    return addr & ~(page_size - 1);
}

/**
 * Calculate potential waste when aligning allocation
 */
size_t alignment_waste(uintptr_t start, size_t size, size_t page_size) {
    uintptr_t aligned_start = align_up(start, page_size);

    // Waste from aligning the start
    size_t start_waste = aligned_start - start;

    // Waste from rounding size up to page boundary
    size_t aligned_size = align_up(size, page_size);
    size_t size_waste = aligned_size - size;

    return start_waste + size_waste;
}

void analyze_region(uintptr_t start, size_t length) {
    printf("\nAnalyzing region: 0x%lx - 0x%lx (%.2f MB)\n",
           start, start + length, length / (1024.0 * 1024.0));
    for (int i = 0; i < 60; i++) printf("─");
    printf("\n");

    struct {
        const char *name;
        size_t page_size;
    } page_types[] = {
        {"4KB pages", PAGE_4KB},
        {"2MB pages", PAGE_2MB},
        {"1GB pages", PAGE_1GB},
    };

    for (int i = 0; i < 3; i++) {
        size_t ps = page_types[i].page_size;
        uintptr_t aligned_start = align_up(start, ps);
        uintptr_t aligned_end = align_down(start + length, ps);

        if (aligned_end > aligned_start) {
            size_t usable = aligned_end - aligned_start;
            size_t num_pages = usable / ps;
            double efficiency = (usable * 100.0) / length;

            printf("%-12s: %6zu pages, Usable: %.2f MB (%.1f%% efficient)\n",
                   page_types[i].name, num_pages,
                   usable / (1024.0 * 1024.0), efficiency);
        } else {
            printf("%-12s: Region too small/misaligned for this page size\n",
                   page_types[i].name);
        }
    }
}

int main() {
    // Simulate analyzing physical memory regions

    // Well-aligned 1GB region
    analyze_region(0x40000000UL, 1UL << 30);

    // Misaligned region (common after fragmentation)
    analyze_region(0x40100000UL, 500UL * 1024 * 1024);

    // Small region that can't use huge pages
    analyze_region(0x55555000UL, 8 * 1024 * 1024);

    return 0;
}

/*
 * Example Output:
 *
 * Analyzing region: 0x40000000 - 0x80000000 (1024.00 MB)
 * ────────────────────────────────────────────────────────────
 * 4KB pages   : 262144 pages, Usable: 1024.00 MB (100.0% efficient)
 * 2MB pages   :    512 pages, Usable: 1024.00 MB (100.0% efficient)
 * 1GB pages   :      1 pages, Usable: 1024.00 MB (100.0% efficient)
 *
 * Analyzing region: 0x40100000 - 0x5f500000 (500.00 MB)
 * ────────────────────────────────────────────────────────────
 * 4KB pages   : 128000 pages, Usable: 500.00 MB (100.0% efficient)
 * 2MB pages   :    249 pages, Usable: 498.00 MB (99.6% efficient)
 * 1GB pages   : Region too small/misaligned for this page size
 *
 * Analyzing region: 0x55555000 - 0x55d55000 (8.00 MB)
 * ────────────────────────────────────────────────────────────
 * 4KB pages   :   2048 pages, Usable: 8.00 MB (100.0% efficient)
 * 2MB pages   :      3 pages, Usable: 6.00 MB (75.0% efficient)
 * 1GB pages   : Region too small/misaligned for this page size
 */
```

Let's consolidate our understanding with a comprehensive comparison across all dimensions that matter for memory management:
| Characteristic | 4KB Standard | 2MB Huge | 1GB Giant |
|---|---|---|---|
| Size | 4,096 bytes | 2,097,152 bytes | 1,073,741,824 bytes |
| Ratio to 4KB | 1× | 512× | 262,144× |
| Offset bits | 12 bits | 21 bits | 30 bits |
| Page table levels | 4 (PML4→PT) | 3 (PML4→PD) | 2 (PML4→PDPT) |
| TLB entries needed (1TB) | 268M | 524K | 1,024 |
| Page table size (1TB) | ~2GB | ~4MB | ~8KB |
| Avg internal fragmentation | 2 KB | 1 MB | 512 MB |
| Allocation flexibility | Excellent | Good | Limited |
| Fragmentation resistance | High | Moderate | Low |
| TLB efficiency | Low | High | Highest |
| Boot-time reservation | Not required | Recommended | Required |
| CPU feature required | None (baseline) | PSE | PDPE1GB |
2MB pages hit a sweet spot for most workloads—providing 512× better TLB coverage than 4KB pages while still being practical to allocate after system startup. 1GB pages offer even greater benefits but require more careful planning and are best suited for specialized, large-memory applications.
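To make this concrete, here is a minimal sketch of the two common ways to request 2MB pages on Linux: an explicit hugetlb mapping (which assumes huge pages have already been reserved, e.g. via `sysctl vm.nr_hugepages=N`) and an madvise() hint that lets transparent huge pages back an ordinary mapping:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE (2UL << 20)  // one 2MB huge page

int main(void) {
    // Option 1: explicit hugetlb mapping. Fails with ENOMEM unless
    // huge pages were reserved beforehand (vm.nr_hugepages).
    void *explicit_hp = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (explicit_hp == MAP_FAILED)
        perror("mmap(MAP_HUGETLB)");
    else
        printf("Explicit 2MB huge page mapped at %p\n", explicit_hp);

    // Option 2: transparent huge pages. Map anonymous memory, hint
    // the kernel, and touch the range; THP may back it with a 2MB
    // page if the region is suitably aligned and memory permits.
    void *thp = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (thp != MAP_FAILED) {
        if (madvise(thp, SIZE, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");
        memset(thp, 0, SIZE);  // faulting in the range triggers allocation
        printf("THP candidate region at %p\n", thp);
    }
    return 0;
}
```

The contrast mirrors the table above: the explicit path gives guaranteed huge pages at the cost of up-front reservation, while the THP path is best-effort but requires no planning.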
We've explored the fundamental differences between standard and huge pages. Here are the key insights:

- A page is the atomic unit of memory management, and its size shapes every layer of the memory hierarchy
- The 4KB standard reflects constraints (scarce RAM, costly internal fragmentation) that no longer dominate
- Moving to 2MB or 1GB pages cuts page counts by 512× or 262,144×, shrinking page tables from gigabytes to megabytes or kilobytes
- Huge pages terminate the page table walk early via the PS bit, reducing translation from four memory accesses to three (2MB) or two (1GB)
- Natural alignment requirements make physical memory fragmentation the central obstacle to huge page allocation
What's next:
Now that we understand the fundamental differences between page sizes, we'll dive deeper into the specific mechanism that makes huge pages so valuable: TLB efficiency. The next page explores how the Translation Lookaside Buffer works, why TLB misses are so expensive, and how huge pages dramatically improve hit rates.
You now understand the architectural foundations of page sizes, the historical context of the 4KB default, and why modern systems need huge pages.