The virtual address space we explored in the previous page is an illusion—a powerful abstraction that simplifies programming and enables isolation. But at some point, virtual addresses must become physical addresses that reference actual RAM locations. This translation is performed millions of times per second, making it one of the most performance-critical operations in any operating system.
Page tables are the data structures that define this mapping. They form a hierarchical lookup structure that the CPU's Memory Management Unit (MMU) traverses on every memory access. Linux's page table implementation must balance competing concerns: compact representation (tables can't consume all available memory), fast lookup (every memory access depends on translation), and flexibility (supporting architectures from ARM embedded systems to massive x86_64 servers).
This page provides an expert-level examination of Linux page table management—from the fundamental multi-level hierarchy to advanced topics like kernel page table manipulation and huge page support.
By the end of this page, you will understand: (1) why page tables use a multi-level hierarchy, (2) how Linux implements 4-level and 5-level page tables on x86_64, (3) the structure of page table entries and their flags, (4) TLB operation and management, (5) the kernel APIs for page table manipulation, and (6) huge pages and their performance implications.
To understand why Linux uses multi-level page tables, we must first understand why simpler approaches fail at scale.
The Single-Level Problem:
Consider the simplest possible page table: a flat array where each entry maps one virtual page to one physical frame. On a 32-bit system with 4 KB pages, the address space contains 2³² / 2¹² = 1,048,576 pages; at 4 bytes per entry, the table occupies 4 MB.
This is already problematic: 4 MB of contiguous physical memory for every process, even if the process only uses a few kilobytes. On 64-bit systems it becomes absurd. With 48-bit addressing there are 2⁴⁸ / 2¹² = 2³⁶ pages, and at 8 bytes per entry the flat table would consume 512 GB per process.
Clearly, flat page tables don't scale.
Most processes use only a tiny fraction of their virtual address space. A typical process might have: text segment (~5 MB), data/heap (~100 MB), libraries (~50 MB), and stack (~8 MB). Total: ~163 MB out of 128 TB of available address space—0.0001% utilization. A flat page table would waste enormous memory tracking empty regions.
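To make the contrast concrete, here is a small standalone C sketch (illustrative arithmetic only, not kernel code) that reproduces the flat-table sizes quoted above.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t page_size = 4096;                 /* 4 KB pages */

    /* 32-bit flat table: 4-byte entries */
    uint64_t entries32 = (1ULL << 32) / page_size;   /* 1,048,576 entries */
    printf("32-bit flat table: %llu entries, %llu MB\n",
           (unsigned long long)entries32,
           (unsigned long long)(entries32 * 4 >> 20));   /* 4 MB */

    /* 48-bit flat table: 8-byte entries */
    uint64_t entries48 = (1ULL << 48) / page_size;   /* 2^36 entries */
    printf("48-bit flat table: %llu entries, %llu GB\n",
           (unsigned long long)entries48,
           (unsigned long long)(entries48 * 8 >> 30));   /* 512 GB */

    return 0;
}
```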
The Multi-Level Solution:
Multi-level page tables solve this by making the structure sparse. Instead of allocating entries for the entire address space, we only allocate table pages for regions actually in use.
Think of it like a hierarchical directory structure: the top-level table points to a handful of mid-level tables, each mid-level table points to lower-level tables, and the bottom-level tables finally point to physical page frames.
For unused regions, we simply leave the parent table entry NULL—no memory allocated for child tables. Only the paths to actually-mapped pages exist.
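The following toy C sketch (not kernel code, just two levels of a pointer-based radix tree) illustrates the on-demand allocation: a child table is only allocated the first time something is mapped beneath it.

```c
#include <stdlib.h>
#include <stdint.h>

#define ENTRIES 512

/* Toy two-level "page table": a directory of 512 slots, each either
 * NULL (nothing mapped in that region) or a 512-entry leaf table. */
typedef struct {
    uint64_t *dir[ENTRIES];
} toy_pgd;

static int toy_map(toy_pgd *pgd, unsigned top_idx, unsigned leaf_idx,
                   uint64_t phys_frame)
{
    if (!pgd->dir[top_idx]) {                         /* allocate only on demand */
        pgd->dir[top_idx] = calloc(ENTRIES, sizeof(uint64_t));
        if (!pgd->dir[top_idx])
            return -1;
    }
    pgd->dir[top_idx][leaf_idx] = phys_frame;
    return 0;
}
```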
Cost-Benefit Analysis:
Multi-level tables trade lookup complexity for memory efficiency:
| Scenario | Flat Table (48-bit) | 4-Level Table |
|---|---|---|
| Minimal process (1 MB mapped) | 512 GB | ~16 KB (a few table pages) |
| Typical process (200 MB mapped) | 512 GB | ~1 MB |
| Large process (10 GB mapped) | 512 GB | ~20 MB |
| Fully mapped (128 TB) | 512 GB | ~512 GB + overhead |
On x86_64 with 48-bit virtual addresses, Linux uses a four-level page table hierarchy. With 57-bit addressing (LA57), a fifth level is added. Let's examine the four-level case in detail.
Virtual Address Decomposition (48-bit):
A 48-bit virtual address is split into five components:
| PGD | PUD | PMD | PTE | Offset |
|---|---|---|---|---|
| Bits 47-39 (9 bits) | Bits 38-30 (9 bits) | Bits 29-21 (9 bits) | Bits 20-12 (9 bits) | Bits 11-0 (12 bits) |
Why 9-bit Indices?
Each table level uses 9 bits for indexing, giving 512 entries (2⁹). With 8-byte entries, each table is exactly 4 KB, one physical page. This is not coincidental: table pages can be allocated and freed by the ordinary page allocator like any other page, and four 9-bit indices plus the 12-bit page offset exactly cover a 48-bit virtual address (4 × 9 + 12 = 48).
Coverage at Each Level:
| Level | Linux Name | Entries | Entry Size | Table Size | Coverage per Entry |
|---|---|---|---|---|---|
| 4 (Top) | PGD (Page Global Directory) | 512 | 8 bytes | 4 KB | 512 GB |
| 3 | PUD (Page Upper Directory) | 512 | 8 bytes | 4 KB | 1 GB |
| 2 | PMD (Page Middle Directory) | 512 | 8 bytes | 4 KB | 2 MB |
| 1 (Bottom) | PTE (Page Table Entry) | 512 | 8 bytes | 4 KB | 4 KB |
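The coverage figures in the table fall directly out of the level shift values. Here is a minimal sketch (using the standard x86_64 shifts, not actual kernel headers) that derives them:

```c
#include <stdio.h>
#include <stdint.h>

/* Standard x86_64 4-level shifts: each level adds 9 index bits. */
#define PAGE_SHIFT   12   /* 4 KB pages      */
#define PMD_SHIFT    21   /* PAGE_SHIFT + 9  */
#define PUD_SHIFT    30   /* PMD_SHIFT  + 9  */
#define PGDIR_SHIFT  39   /* PUD_SHIFT  + 9  */

int main(void)
{
    /* Coverage per entry = 1 << shift of that level */
    printf("PTE entry covers %llu KB\n", (1ULL << PAGE_SHIFT) >> 10);   /* 4 KB   */
    printf("PMD entry covers %llu MB\n", (1ULL << PMD_SHIFT) >> 20);    /* 2 MB   */
    printf("PUD entry covers %llu GB\n", (1ULL << PUD_SHIFT) >> 30);    /* 1 GB   */
    printf("PGD entry covers %llu GB\n", (1ULL << PGDIR_SHIFT) >> 30);  /* 512 GB */

    /* Each table: 512 entries x 8 bytes = one 4 KB page */
    printf("table size = %d bytes\n", 512 * 8);
    return 0;
}
```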
Each entry in the page table hierarchy encodes both the address of the next level (or the physical page) and a rich set of flags controlling access and behavior. Understanding these flags is essential for kernel development and security analysis.
```c
/* x86_64 Page Table Entry Format (64 bits)
 *
 * Bit Layout (4 KB PTE):
 *  63    62:59    58:52           51:12           11:9  8   7   6  5   4   3   2  1  0
 * +--+---------+-------+---------------------------+-----+--+----+--+--+---+---+--+--+--+
 * |XD| Prot Key| Avail | Physical Frame Number (40)|Avail|G |PAT |D |A |PCD|PWT|US|RW|P |
 * +--+---------+-------+---------------------------+-----+--+----+--+--+---+---+--+--+--+
 *
 * (In PMD/PUD entries, bit 7 is PS, the Page Size bit, rather than PAT.)
 *
 * Key Fields:
 *   P   (bit 0)  - Present: 1 if page is in memory, 0 if not mapped/swapped
 *   RW  (bit 1)  - Read/Write: 1 = writable, 0 = read-only
 *   US  (bit 2)  - User/Supervisor: 1 = user accessible, 0 = supervisor only
 *   PWT (bit 3)  - Page Write-Through: cache write-through if set
 *   PCD (bit 4)  - Page Cache Disable: disable caching if set
 *   A   (bit 5)  - Accessed: set by MMU when the page is accessed
 *   D   (bit 6)  - Dirty: set by MMU when the page is written
 *   PS  (bit 7)  - Page Size (PMD/PUD): 1 = large page (2MB/1GB), 0 = next table level
 *   G   (bit 8)  - Global: don't flush from TLB on context switch
 *   XD  (bit 63) - Execute Disable (NX): 1 = not executable
 */

/* Linux kernel definitions (arch/x86/include/asm/pgtable_types.h) */
#define _PAGE_BIT_PRESENT   0   /* is present */
#define _PAGE_BIT_RW        1   /* writeable */
#define _PAGE_BIT_USER      2   /* userspace addressable */
#define _PAGE_BIT_PWT       3   /* page write through */
#define _PAGE_BIT_PCD       4   /* page cache disabled */
#define _PAGE_BIT_ACCESSED  5   /* was accessed */
#define _PAGE_BIT_DIRTY     6   /* was written to */
#define _PAGE_BIT_PSE       7   /* 2MB/1GB page */
#define _PAGE_BIT_GLOBAL    8   /* global TLB entry */
#define _PAGE_BIT_NX       63   /* No execute: restrict to data */

#define _PAGE_PRESENT   (1UL << _PAGE_BIT_PRESENT)
#define _PAGE_RW        (1UL << _PAGE_BIT_RW)
#define _PAGE_USER      (1UL << _PAGE_BIT_USER)
#define _PAGE_PWT       (1UL << _PAGE_BIT_PWT)
#define _PAGE_PCD       (1UL << _PAGE_BIT_PCD)
#define _PAGE_ACCESSED  (1UL << _PAGE_BIT_ACCESSED)
#define _PAGE_DIRTY     (1UL << _PAGE_BIT_DIRTY)
#define _PAGE_PSE       (1UL << _PAGE_BIT_PSE)
#define _PAGE_GLOBAL    (1UL << _PAGE_BIT_GLOBAL)
#define _PAGE_NX        (1UL << _PAGE_BIT_NX)

/* Common protection combinations */
#define PAGE_KERNEL      __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | \
                                  _PAGE_ACCESSED | _PAGE_NX)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | \
                                  _PAGE_ACCESSED)
#define PAGE_SHARED      __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                                  _PAGE_ACCESSED | _PAGE_NX)
#define PAGE_READONLY    __pgprot(_PAGE_PRESENT | _PAGE_USER | \
                                  _PAGE_ACCESSED | _PAGE_NX)
#define PAGE_COPY_EXEC   __pgprot(_PAGE_PRESENT | _PAGE_USER | \
                                  _PAGE_ACCESSED)
```

The A and D bits are set automatically by hardware but must be cleared by software. This enables efficient tracking of page usage patterns. The kernel periodically clears these bits and uses their state to implement page replacement policies like LRU. If a page's A bit is still 0 after some time, it hasn't been accessed and is a good eviction candidate.
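As a quick illustration of how these flag bits are read in practice, here is a small userspace-style sketch (a hypothetical helper, not a kernel function; the masks are duplicated locally so it compiles standalone) that decodes a raw 64-bit entry value:

```c
#include <stdio.h>
#include <stdint.h>

/* Flag masks duplicated from the layout above for standalone compilation. */
#define PTE_PRESENT  (1ULL << 0)
#define PTE_RW       (1ULL << 1)
#define PTE_USER     (1ULL << 2)
#define PTE_ACCESSED (1ULL << 5)
#define PTE_DIRTY    (1ULL << 6)
#define PTE_GLOBAL   (1ULL << 8)
#define PTE_NX       (1ULL << 63)
#define PTE_PFN_MASK 0x000FFFFFFFFFF000ULL   /* bits 51:12 */

static void decode_pte(uint64_t pte)
{
    printf("PTE %#018llx:%s%s%s%s%s%s%s\n", (unsigned long long)pte,
           (pte & PTE_PRESENT)  ? " present"  : " not-present",
           (pte & PTE_RW)       ? " writable" : " read-only",
           (pte & PTE_USER)     ? " user"     : " kernel-only",
           (pte & PTE_ACCESSED) ? " accessed" : "",
           (pte & PTE_DIRTY)    ? " dirty"    : "",
           (pte & PTE_GLOBAL)   ? " global"   : "",
           (pte & PTE_NX)       ? " no-exec"  : " executable");
    printf("  physical frame number: %#llx\n",
           (unsigned long long)((pte & PTE_PFN_MASK) >> 12));
}

int main(void)
{
    /* Made-up value: present, writable, user, accessed, dirty, NX */
    decode_pte((1ULL << 63) | 0x12345000ULL | 0x67ULL);
    return 0;
}
```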
When the CPU encounters a virtual address that isn't cached in the TLB, it must perform a page table walk—traversing the multi-level hierarchy to find the physical address. Let's trace through this process step by step:
```c
/* Conceptual page table walk for x86_64.
 * Illustrative only: it treats physical addresses as directly
 * dereferenceable pointers and reuses the _PAGE_* flags defined above.
 * The masks and the PAGE_FAULT sentinel below are simplified placeholders. */

#include <stdint.h>
#include <stdbool.h>

#define PAGE_FAULT      (~0ULL)                 /* sentinel: walk faulted       */
#define PHYS_ADDR_MASK  0x000FFFFFFFFFF000ULL   /* bits 51:12 of an entry       */
#define HUGE_2MB_MASK   0x000FFFFFFFE00000ULL   /* bits 51:21 (2 MB frame base) */
#define HUGE_1GB_MASK   0x000FFFFFC0000000ULL   /* bits 51:30 (1 GB frame base) */

/* Example virtual address: 0x00007F4A12345678
 * Breaking it down (assuming 48-bit addressing):
 *
 * Binary: 0000 0000 0000 0000 0111 1111 0100 1010
 *         0001 0010 0011 0100 0101 0110 0111 1000
 *
 * Components:
 *   PGD index   (bits 47-39): 0x0FE = 254
 *   PUD index   (bits 38-30): 0x128 = 296
 *   PMD index   (bits 29-21): 0x091 = 145
 *   PTE index   (bits 20-12): 0x145 = 325
 *   Page offset (bits 11-0):  0x678 = 1656
 */

uint64_t translate_virtual_to_physical(
    uint64_t cr3,           /* PGD base from CR3 register */
    uint64_t virtual_addr,  /* Address to translate       */
    bool is_user_access,
    bool is_write_access,
    bool is_exec_access)
{
    /* Extract indices from virtual address */
    uint16_t pgd_idx = (virtual_addr >> 39) & 0x1FF;  /* bits 47-39 */
    uint16_t pud_idx = (virtual_addr >> 30) & 0x1FF;  /* bits 38-30 */
    uint16_t pmd_idx = (virtual_addr >> 21) & 0x1FF;  /* bits 29-21 */
    uint16_t pte_idx = (virtual_addr >> 12) & 0x1FF;  /* bits 20-12 */
    uint16_t offset  = virtual_addr & 0xFFF;          /* bits 11-0  */

    /* Step 1: Read PGD entry */
    uint64_t *pgd_table = (uint64_t *)(cr3 & ~0xFFFULL);
    uint64_t pgd_entry = pgd_table[pgd_idx];
    if (!(pgd_entry & _PAGE_PRESENT)) {
        /* PAGE FAULT: PGD entry not present */
        return PAGE_FAULT;
    }

    /* Step 2: Read PUD entry */
    uint64_t *pud_table = (uint64_t *)(pgd_entry & PHYS_ADDR_MASK);
    uint64_t pud_entry = pud_table[pud_idx];
    if (!(pud_entry & _PAGE_PRESENT)) {
        /* PAGE FAULT: PUD entry not present */
        return PAGE_FAULT;
    }

    /* Check for 1GB huge page */
    if (pud_entry & _PAGE_PSE) {
        uint64_t phys_base = pud_entry & HUGE_1GB_MASK;
        uint64_t huge_offset = virtual_addr & 0x3FFFFFFF;  /* 30-bit offset */
        return phys_base | huge_offset;
    }

    /* Step 3: Read PMD entry */
    uint64_t *pmd_table = (uint64_t *)(pud_entry & PHYS_ADDR_MASK);
    uint64_t pmd_entry = pmd_table[pmd_idx];
    if (!(pmd_entry & _PAGE_PRESENT)) {
        /* PAGE FAULT: PMD entry not present */
        return PAGE_FAULT;
    }

    /* Check for 2MB huge page */
    if (pmd_entry & _PAGE_PSE) {
        uint64_t phys_base = pmd_entry & HUGE_2MB_MASK;
        uint64_t huge_offset = virtual_addr & 0x1FFFFF;  /* 21-bit offset */
        return phys_base | huge_offset;
    }

    /* Step 4: Read PTE entry */
    uint64_t *pte_table = (uint64_t *)(pmd_entry & PHYS_ADDR_MASK);
    uint64_t pte_entry = pte_table[pte_idx];
    if (!(pte_entry & _PAGE_PRESENT)) {
        /* PAGE FAULT: PTE entry not present */
        return PAGE_FAULT;
    }

    /* Check permissions */
    if (is_user_access && !(pte_entry & _PAGE_USER)) {
        /* PAGE FAULT: Supervisor-only page accessed from user mode */
        return PAGE_FAULT;
    }
    if (is_write_access && !(pte_entry & _PAGE_RW)) {
        /* PAGE FAULT: Write to read-only page */
        return PAGE_FAULT;
    }
    if (is_exec_access && (pte_entry & _PAGE_NX)) {
        /* PAGE FAULT: Execute on non-executable page */
        return PAGE_FAULT;
    }

    /* Step 5: Construct physical address */
    uint64_t phys_frame = pte_entry & PHYS_ADDR_MASK;
    uint64_t physical_addr = phys_frame | offset;

    /* On real hardware the MMU also updates the in-memory entry: */
    pte_entry |= _PAGE_ACCESSED;      /* Accessed bit on every access */
    if (is_write_access)
        pte_entry |= _PAGE_DIRTY;     /* Dirty bit on writes          */

    return physical_addr;
}
```

On x86, the page table walk is performed entirely in hardware by the MMU—no software intervention is needed for normal translations. The kernel's role is to set up and maintain the page tables. Only when a page fault occurs (Present=0, permission violation, etc.) does software get involved. Some architectures (like early MIPS) use software-managed TLBs where the kernel handles all misses.
Walk Performance:
A complete 4-level page table walk requires 4 memory accesses—a significant overhead if performed for every memory reference. This is where the Translation Lookaside Buffer (TLB) becomes critical: caching recent translations to avoid repeated walks.
The Translation Lookaside Buffer (TLB) is a small, extremely fast cache that stores recent virtual-to-physical address translations. Modern CPUs could not function efficiently without it—memory access would be 4-5x slower if every reference required a full page table walk.
TLB Organization:
Modern processors typically have separate L1 TLBs for instructions (ITLB) and data (DTLB), backed by a larger unified L2 TLB (often called the STLB). TLBs are usually organized as set-associative caches, with separate entries for different page sizes (4 KB, 2 MB, 1 GB). Representative figures for a recent x86 core:
| TLB Type | Entries (4KB) | Entries (2MB) | Associativity | Latency |
|---|---|---|---|---|
| L1 ITLB | 64 | 8 | 8-way | ~1 cycle |
| L1 DTLB | 64 | 32 | 4-way | ~1 cycle |
| L2 STLB | 1536 | 1536 | 12-way | ~7 cycles |
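Those sizes translate directly into TLB reach, the amount of memory that can be addressed without a table walk. A small sketch of the arithmetic, using the representative STLB capacity from the table above:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Representative L2 STLB capacity from the table above */
    const uint64_t stlb_entries = 1536;

    uint64_t reach_4k = stlb_entries * (4ULL << 10);   /* 4 KB pages */
    uint64_t reach_2m = stlb_entries * (2ULL << 20);   /* 2 MB pages */

    printf("STLB reach with 4 KB pages: %llu MB\n",
           (unsigned long long)(reach_4k >> 20));      /* 6 MB */
    printf("STLB reach with 2 MB pages: %llu GB\n",
           (unsigned long long)(reach_2m >> 30));      /* 3 GB */

    /* Working sets larger than the reach keep missing the TLB,
     * and every miss costs a multi-level walk (up to 4 memory accesses). */
    return 0;
}
```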
TLB Invalidation:
When page table entries change, the corresponding TLB entries must be invalidated to prevent stale translations. Linux provides several invalidation mechanisms:
```c
/* TLB invalidation interfaces (simplified/illustrative versions of what
 * lives in arch/x86/include/asm/tlbflush.h and arch/x86/mm/tlb.c) */

/* Flush entire TLB (expensive - avoid when possible) */
static inline void flush_tlb_all(void)
{
    /* Reloading CR3 flushes all non-global TLB entries */
    native_write_cr3(__native_read_cr3());
}

/* Flush TLB entries for a specific address range */
void flush_tlb_range(struct vm_area_struct *vma,
                     unsigned long start, unsigned long end)
{
    /* Uses INVLPG instruction for each page, or
     * full flush if range is large enough */
}

/* Flush single page TLB entry */
static inline void flush_tlb_one_kernel(unsigned long addr)
{
    /* x86 INVLPG instruction */
    asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}

/* Flush TLB on all CPUs (very expensive!) */
void flush_tlb_all_remote(void)
{
    /* Sends Inter-Processor Interrupt (IPI) to all CPUs
     * Each CPU then performs local TLB flush
     * Called when changing kernel page tables */
}

/* Modern x86 provides INVPCID instruction for more granular control */
static inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
{
    /* Flush entry for specific PCID and address
     * More efficient than full flush */
}

/* TLB shootdown optimization: batching */
struct tlb_flush_pending {
    unsigned long start;
    unsigned long end;
    unsigned int  stride_shift;
    bool          flush_required;
};

/* Batch multiple TLB invalidations and execute at once */
void tlb_flush_batched(struct mmu_gather *tlb)
{
    /* Collects invalidation requests then issues efficiently */
}
```

When kernel page tables change, ALL CPUs must flush their TLBs—requiring Inter-Processor Interrupts (IPIs). On systems with many cores, TLB shootdowns can become a significant performance bottleneck. This is one reason why huge pages improve performance: fewer TLB entries means fewer shootdowns during memory management operations.
Process Context Identifiers (PCID):
Traditionally, a context switch required a full TLB flush because TLB entries don't distinguish between processes. Modern x86 processors support PCIDs: 12-bit tags attached to TLB entries. With PCID support, each TLB entry is tagged with the identifier of the address space it belongs to, the CPU only matches entries whose tag equals the current PCID, and the kernel can switch CR3 without flushing, since stale entries from other address spaces simply never match. This significantly reduces context switch overhead, especially for workloads with frequent process switches.
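At the hardware level, the PCID rides in the low bits of CR3. Below is a minimal sketch of how a CR3 value is composed when PCIDs are enabled; the bit positions reflect the x86 architecture, while the helper name is hypothetical.

```c
#include <stdint.h>

/* With CR4.PCIDE enabled:
 *   CR3 bits 11:0  = PCID of the address space being switched to
 *   CR3 bits 51:12 = physical address of the top-level page table
 *   CR3 bit  63    = "no flush": keep TLB entries tagged with this PCID
 */
#define CR3_PCID_MASK   0xFFFULL
#define CR3_NOFLUSH_BIT (1ULL << 63)

static inline uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid, int noflush)
{
    uint64_t cr3 = (pgd_phys & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);

    if (noflush)
        cr3 |= CR3_NOFLUSH_BIT;   /* reuse existing TLB entries for this PCID */
    return cr3;
}
```

Linux's real implementation manages a small per-CPU pool of ASIDs rather than dedicating one PCID to each process, recycling tags as processes are scheduled on and off a CPU.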
Linux provides a rich set of functions for page table manipulation. These APIs abstract architecture-specific details while providing the flexibility kernel subsystems need.
```c
/* Key page table manipulation functions */

/* === Table Entry Access and Navigation === */

/* Get PGD entry for a virtual address */
pgd_t *pgd_offset(struct mm_struct *mm, unsigned long address);
pgd_t *pgd_offset_k(unsigned long address);   /* Kernel address space */

/* Navigate down the hierarchy */
p4d_t *p4d_offset(pgd_t *pgd, unsigned long address);
pud_t *pud_offset(p4d_t *p4d, unsigned long address);
pmd_t *pmd_offset(pud_t *pud, unsigned long address);
pte_t *pte_offset_map(pmd_t *pmd, unsigned long address);

/* === Entry State Queries === */

#define pgd_present(pgd)  (pgd_val(pgd) & _PAGE_PRESENT)
#define pud_present(pud)  (pud_val(pud) & _PAGE_PRESENT)
#define pmd_present(pmd)  (pmd_val(pmd) & _PAGE_PRESENT)
#define pte_present(pte)  (pte_val(pte) & _PAGE_PRESENT)

#define pte_write(pte)    (pte_val(pte) & _PAGE_RW)
#define pte_dirty(pte)    (pte_val(pte) & _PAGE_DIRTY)
#define pte_young(pte)    (pte_val(pte) & _PAGE_ACCESSED)
#define pte_exec(pte)     (!(pte_val(pte) & _PAGE_NX))

/* === Entry Modification === */

/* Create PTE with specific protections */
pte_t mk_pte(struct page *page, pgprot_t pgprot);

/* Modify existing PTE flags */
pte_t pte_wrprotect(pte_t pte);   /* Clear write permission */
pte_t pte_mkwrite(pte_t pte);     /* Set write permission   */
pte_t pte_mkclean(pte_t pte);     /* Clear dirty bit        */
pte_t pte_mkdirty(pte_t pte);     /* Set dirty bit          */
pte_t pte_mkold(pte_t pte);       /* Clear accessed bit     */
pte_t pte_mkyoung(pte_t pte);     /* Set accessed bit       */
pte_t pte_mkexec(pte_t pte);      /* Clear NX bit           */
pte_t pte_exprotect(pte_t pte);   /* Set NX bit             */

/* === Setting Entries in Tables === */

void set_pte_at(struct mm_struct *mm, unsigned long addr,
                pte_t *ptep, pte_t pte);
void set_pmd_at(struct mm_struct *mm, unsigned long addr,
                pmd_t *pmdp, pmd_t pmd);
/* Similar for PUD and PGD */

/* Atomic clear and set (for concurrent access) */
pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
                         pte_t *ptep);
pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
                             pte_t *ptep);
void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
                             pte_t *ptep, pte_t pte);

/* === Table Allocation === */

/* Allocate a new page table page */
pgtable_t pte_alloc_one(struct mm_struct *mm);
pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr);
pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long addr);

/* Free a page table page */
void pte_free(struct mm_struct *mm, pgtable_t pte_page);

/* === High-Level Page Fault Handlers === */

/* Install a new page mapping (called from fault handler) */
vm_fault_t vmf_insert_page(struct vm_area_struct *vma, unsigned long addr,
                           struct page *page);
vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
                          unsigned long pfn);

/* Copy page tables (used in fork) */
int copy_page_range(struct vm_area_struct *dst_vma,
                    struct vm_area_struct *src_vma);
```

These APIs are defined per-architecture but have consistent semantics. On 32-bit systems without all 4/5 levels, the 'missing' levels are folded—pud_offset() simply returns the same pointer if there's no PUD level. This allows architecture-independent kernel code to work correctly across all platforms.
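To see how these pieces compose, here is a hedged sketch of a software walk through the hierarchy using the offset helpers above. It is illustrative only: the function name is hypothetical, and locking (mmap_lock, the PTE lock) and the pte_unmap() pairing are left to the caller.

```c
/* Sketch: find the PTE mapping a user virtual address, or NULL if the
 * path through the hierarchy is not populated. */
static pte_t *lookup_pte(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;

    pgd = pgd_offset(mm, addr);            /* top level, from mm->pgd */
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return NULL;

    p4d = p4d_offset(pgd, addr);           /* folded away on 4-level  */
    if (p4d_none(*p4d) || p4d_bad(*p4d))
        return NULL;

    pud = pud_offset(p4d, addr);
    if (pud_none(*pud) || pud_bad(*pud))
        return NULL;

    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return NULL;

    /* May return NULL; a non-NULL result must be released with pte_unmap() */
    return pte_offset_map(pmd, addr);
}
```

Note that this sketch ignores huge pages: a check such as pmd_trans_huge() or pmd_leaf() would be needed before descending to the PTE level.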
Huge pages use the Page Size (PS) bit in PMD or PUD entries to map larger memory regions with a single TLB entry. Instead of mapping 4 KB per entry, huge pages map 2 MB (PMD-level) or 1 GB (PUD-level) per entry.
Benefits of Huge Pages:
| Page Size | TLB Entries for 1 GB | Page Table Depth | Table Overhead for 1 GB |
|---|---|---|---|
| 4 KB (standard) | 262,144 entries | 4 levels | ~2 MB |
| 2 MB (huge) | 512 entries | 3 levels (stop at PMD) | ~4 KB |
| 1 GB (gigantic) | 1 entry | 2 levels (stop at PUD) | ~8 bytes |
```bash
# Check huge page support and configuration
cat /proc/meminfo | grep -i huge
# HugePages_Total:       0
# HugePages_Free:        0
# HugePages_Rsvd:        0
# HugePages_Surp:        0
# Hugepagesize:       2048 kB
# Hugetlb:               0 kB

# Check available huge page sizes
ls /sys/kernel/mm/hugepages/
# hugepages-1048576kB  hugepages-2048kB

# Allocate huge pages (requires root)
echo 100 > /proc/sys/vm/nr_hugepages     # Request 100 2MB pages
cat /proc/meminfo | grep HugePages_Total
# HugePages_Total:     100

# Mount hugetlbfs for explicit huge page allocation
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge

# Transparent Huge Pages (THP) - automatic huge page usage
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# View THP statistics
grep -i thp /proc/vmstat
# thp_fault_alloc 12345
# thp_collapse_alloc 567
# thp_split_page 89

# Check if a process is using huge pages
cat /proc/self/smaps | grep -i huge
```

While THP improves average performance, it can cause latency spikes. When the kernel promotes pages to huge pages or demotes/splits them, it may stall the application. For latency-sensitive workloads (databases, trading systems), explicit huge pages or 'madvise' mode may be preferable to 'always' mode.
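Beyond the sysfs/procfs knobs, a program can request huge pages directly. The following minimal userspace sketch assumes the hugetlb pool configured above has free 2 MB pages; error handling is kept short.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void)
{
    /* Explicit huge page from the hugetlb pool (needs nr_hugepages > 0) */
    void *buf = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    memset(buf, 0, HUGE_2MB);   /* touch it: backed by a single 2 MB page */
    munmap(buf, HUGE_2MB);

    /* Alternatively, hint THP for an ordinary anonymous mapping */
    void *thp = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (thp != MAP_FAILED) {
        madvise(thp, HUGE_2MB, MADV_HUGEPAGE);  /* eligible for THP promotion */
        munmap(thp, HUGE_2MB);
    }
    return 0;
}
```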
As memory demands grow, the 256 TB limit of 48-bit addressing becomes constraining for some workloads. Intel introduced LA57 (5-level paging with 57-bit linear addresses) to extend the virtual address space to 128 PB (petabytes).
The Additional Level:
LA57 adds a fifth level. Intel calls the new top-level table PML5; Linux keeps the PGD as the top level (now indexed by bits 56-48) and inserts the P4D (Page 4th Directory) between the PGD and the PUD:
| PGD | P4D | PUD | PMD | PTE | Offset |
|---|---|---|---|---|---|
| Bits 56-48 (9 bits) | Bits 47-39 (9 bits) | Bits 38-30 (9 bits) | Bits 29-21 (9 bits) | Bits 20-12 (9 bits) | Bits 11-0 (12 bits) |
Implications: the kernel must support both layouts at runtime, folding the extra level away on CPUs without LA57, and the userspace limit (TASK_SIZE_MAX) grows accordingly:
```c
/* 5-level page table support (simplified from
 * arch/x86/include/asm/pgtable_64_types.h) */

#ifdef CONFIG_X86_5LEVEL

#define PGDIR_SHIFT     48
#define PTRS_PER_PGD    512
#define P4D_SHIFT       39
#define PTRS_PER_P4D    512
#define MAX_VA_BITS     57

/* User/kernel split for 5-level */
#define TASK_SIZE_MAX   ((1UL << 56) - PAGE_SIZE)

/* Navigate through 5 levels */
static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
    if (!pgtable_l5_enabled())
        return (p4d_t *)pgd;   /* Fold if not using 5-level */
    return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
}

#else /* 4-level: P4D is folded into PGD */

#define PGDIR_SHIFT     39
#define P4D_SHIFT       39
#define PTRS_PER_P4D    1
#define MAX_VA_BITS     48

#define TASK_SIZE_MAX   ((1UL << 47) - PAGE_SIZE)

#endif

/* Runtime check for 5-level support */
static inline bool pgtable_l5_enabled(void)
{
    return IS_ENABLED(CONFIG_X86_5LEVEL) &&
           cpu_feature_enabled(X86_FEATURE_LA57);
}

/* Kernel configuration option check */
#ifdef CONFIG_X86_5LEVEL
  /* 5-level page tables enabled in kernel config */
  /* Actual use depends on CPU support (checked at boot) */
#endif
```

Most systems today don't need 5-level page tables—128 TB is sufficient for nearly all workloads. LA57 is primarily relevant for large in-memory databases, scientific computing with massive datasets, and future-proofing. The extra level adds one memory access per TLB miss, so 4-level tables are preferable when 128 TB suffices.
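Even on LA57 hardware with a 5-level kernel, the kernel keeps userspace mappings below the 47-bit boundary by default for compatibility; a process must explicitly pass a hint above that boundary to mmap() to receive higher addresses. A minimal sketch (the hint value is illustrative):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* A hint above the 47-bit boundary opts this mapping into the
     * extended (57-bit) address space on LA57 kernels and CPUs.
     * On 4-level systems the kernel simply picks a normal address. */
    void *hint = (void *)(1UL << 48);

    void *p = mmap(hint, 1UL << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("mapped at %p\n", p);   /* above 0x7fffffffffff only with LA57 */
    munmap(p, 1UL << 20);
    return 0;
}
```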
We have explored the intricate machinery that bridges virtual and physical memory: the multi-level hierarchy that keeps page tables sparse, the entry flags that enforce protection, the TLB that makes translation fast, the kernel APIs that manipulate mappings portably, and the huge pages that stretch TLB reach.
What's Next:
Page tables define the mapping, but memory must also be allocated efficiently. The next page explores the slab allocator—Linux's high-performance object caching layer that dramatically reduces allocation overhead for frequently-used kernel data structures.
You now have an expert understanding of Linux page table management—the multi-level hierarchy, entry formats, TLB optimization, and kernel APIs. This knowledge is essential for kernel development, performance tuning, and understanding security mechanisms like KPTI.