The virtual address space we explored in the previous page is an illusion—a powerful abstraction that simplifies programming and enables isolation. But at some point, virtual addresses must become physical addresses that reference actual RAM locations. This translation is performed millions of times per second, making it one of the most performance-critical operations in any operating system.
Page tables are the data structures that define this mapping. They form a hierarchical lookup structure that the CPU's Memory Management Unit (MMU) traverses on every memory access. Linux's page table implementation must balance competing concerns: compact representation (tables can't consume all available memory), fast lookup (every memory access depends on translation), and flexibility (supporting architectures from ARM embedded systems to massive x86_64 servers).
This page provides an expert-level examination of Linux page table management—from the fundamental multi-level hierarchy to advanced topics like kernel page table manipulation and huge page support.
By the end of this page, you will understand: (1) why page tables use a multi-level hierarchy, (2) how Linux implements 4-level and 5-level page tables on x86_64, (3) the structure of page table entries and their flags, (4) TLB operation and management, (5) the kernel APIs for page table manipulation, and (6) huge pages and their performance implications.
To understand why Linux uses multi-level page tables, we must first understand why simpler approaches fail at scale.
The Single-Level Problem:
Consider the simplest possible page table: a flat array where each entry maps one virtual page to one physical frame. On a 32-bit system with 4 KB pages, the address space contains 2³² / 2¹² = 1,048,576 pages; at 4 bytes per entry, the table occupies 4 MB.
This is already problematic: 4 MB of contiguous physical memory for every process, even if the process only uses a few kilobytes. On 64-bit systems it becomes absurd. With 48-bit addressing there are 2⁴⁸ / 2¹² = 2³⁶ pages, and at 8 bytes per entry the flat table would consume 512 GB per process.
Clearly, flat page tables don't scale.
Most processes use only a tiny fraction of their virtual address space. A typical process might have: text segment (~5 MB), data/heap (~100 MB), libraries (~50 MB), and stack (~8 MB). Total: ~163 MB out of 128 TB of available address space—0.0001% utilization. A flat page table would waste enormous memory tracking empty regions.
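To make the contrast concrete, here is a small standalone C sketch (illustrative arithmetic only, not kernel code) that reproduces the flat-table sizes quoted above.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t page_size = 4096;                 /* 4 KB pages */

    /* 32-bit flat table: 4-byte entries */
    uint64_t entries32 = (1ULL << 32) / page_size;   /* 1,048,576 entries */
    printf("32-bit flat table: %llu entries, %llu MB\n",
           (unsigned long long)entries32,
           (unsigned long long)(entries32 * 4 >> 20));   /* 4 MB */

    /* 48-bit flat table: 8-byte entries */
    uint64_t entries48 = (1ULL << 48) / page_size;   /* 2^36 entries */
    printf("48-bit flat table: %llu entries, %llu GB\n",
           (unsigned long long)entries48,
           (unsigned long long)(entries48 * 8 >> 30));   /* 512 GB */

    return 0;
}
```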
The Multi-Level Solution:
Multi-level page tables solve this by making the structure sparse. Instead of allocating entries for the entire address space, we only allocate table pages for regions actually in use.
Think of it like a hierarchical directory structure: the top-level table points to a handful of mid-level tables, each mid-level table points to lower-level tables, and the bottom-level tables finally point to physical page frames.
For unused regions, we simply leave the parent table entry NULL—no memory allocated for child tables. Only the paths to actually-mapped pages exist.
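The following toy C sketch (not kernel code, just two levels of a pointer-based radix tree) illustrates the on-demand allocation: a child table is only allocated the first time something is mapped beneath it.

```c
#include <stdlib.h>
#include <stdint.h>

#define ENTRIES 512

/* Toy two-level "page table": a directory of 512 slots, each either
 * NULL (nothing mapped in that region) or a 512-entry leaf table. */
typedef struct {
    uint64_t *dir[ENTRIES];
} toy_pgd;

static int toy_map(toy_pgd *pgd, unsigned top_idx, unsigned leaf_idx,
                   uint64_t phys_frame)
{
    if (!pgd->dir[top_idx]) {                         /* allocate only on demand */
        pgd->dir[top_idx] = calloc(ENTRIES, sizeof(uint64_t));
        if (!pgd->dir[top_idx])
            return -1;
    }
    pgd->dir[top_idx][leaf_idx] = phys_frame;
    return 0;
}
```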
Cost-Benefit Analysis:
Multi-level tables trade lookup complexity for memory efficiency:
| Scenario | Flat Table (48-bit) | 4-Level Table |
|---|---|---|
| Minimal process (1 MB mapped) | 512 GB | ~16 KB (a few table pages) |
| Typical process (200 MB mapped) | 512 GB | ~1 MB |
| Large process (10 GB mapped) | 512 GB | ~20 MB |
| Fully mapped (128 TB) | 512 GB | ~512 GB + overhead |
On x86_64 with 48-bit virtual addresses, Linux uses a four-level page table hierarchy. With 57-bit addressing (LA57), a fifth level is added. Let's examine the four-level case in detail.
Virtual Address Decomposition (48-bit):
A 48-bit virtual address is split into five components:
| PGD | PUD | PMD | PTE | Offset |
|---|---|---|---|---|
| Bits 47-39 (9 bits) | Bits 38-30 (9 bits) | Bits 29-21 (9 bits) | Bits 20-12 (9 bits) | Bits 11-0 (12 bits) |
Why 9-bit Indices?
Each table level uses 9 bits for indexing, giving 512 entries (2⁹). With 8-byte entries, each table is exactly 4 KB, one physical page. This is not coincidental: table pages can be allocated and freed by the ordinary page allocator like any other page, and four 9-bit indices plus the 12-bit page offset exactly cover a 48-bit virtual address (4 × 9 + 12 = 48).
Coverage at Each Level:
| Level | Linux Name | Entries | Entry Size | Table Size | Coverage per Entry |
|---|---|---|---|---|---|
| 4 (Top) | PGD (Page Global Directory) | 512 | 8 bytes | 4 KB | 512 GB |
| 3 | PUD (Page Upper Directory) | 512 | 8 bytes | 4 KB | 1 GB |
| 2 | PMD (Page Middle Directory) | 512 | 8 bytes | 4 KB | 2 MB |
| 1 (Bottom) | PTE (Page Table Entry) | 512 | 8 bytes | 4 KB | 4 KB |
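The coverage figures in the table fall directly out of the level shift values. Here is a minimal sketch (using the standard x86_64 shifts, not actual kernel headers) that derives them:

```c
#include <stdio.h>
#include <stdint.h>

/* Standard x86_64 4-level shifts: each level adds 9 index bits. */
#define PAGE_SHIFT   12   /* 4 KB pages      */
#define PMD_SHIFT    21   /* PAGE_SHIFT + 9  */
#define PUD_SHIFT    30   /* PMD_SHIFT  + 9  */
#define PGDIR_SHIFT  39   /* PUD_SHIFT  + 9  */

int main(void)
{
    /* Coverage per entry = 1 << shift of that level */
    printf("PTE entry covers %llu KB\n", (1ULL << PAGE_SHIFT) >> 10);   /* 4 KB   */
    printf("PMD entry covers %llu MB\n", (1ULL << PMD_SHIFT) >> 20);    /* 2 MB   */
    printf("PUD entry covers %llu GB\n", (1ULL << PUD_SHIFT) >> 30);    /* 1 GB   */
    printf("PGD entry covers %llu GB\n", (1ULL << PGDIR_SHIFT) >> 30);  /* 512 GB */

    /* Each table: 512 entries x 8 bytes = one 4 KB page */
    printf("table size = %d bytes\n", 512 * 8);
    return 0;
}
```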
Each entry in the page table hierarchy encodes both the address of the next level (or the physical page) and a rich set of flags controlling access and behavior. Understanding these flags is essential for kernel development and security analysis.
```c
/* x86_64 Page Table Entry Format (64 bits)
 *
 * Bit Layout (4 KB PTE):
 *  63    62:59    58:52           51:12           11:9  8   7   6  5   4   3   2  1  0
 * +--+---------+-------+---------------------------+-----+--+----+--+--+---+---+--+--+--+
 * |XD| Prot Key| Avail | Physical Frame Number (40)|Avail|G |PAT |D |A |PCD|PWT|US|RW|P |
 * +--+---------+-------+---------------------------+-----+--+----+--+--+---+---+--+--+--+
 *
 * (In PMD/PUD entries, bit 7 is PS, the Page Size bit, rather than PAT.)
 *
 * Key Fields:
 *   P   (bit 0)  - Present: 1 if page is in memory, 0 if not mapped/swapped
 *   RW  (bit 1)  - Read/Write: 1 = writable, 0 = read-only
 *   US  (bit 2)  - User/Supervisor: 1 = user accessible, 0 = supervisor only
 *   PWT (bit 3)  - Page Write-Through: cache write-through if set
 *   PCD (bit 4)  - Page Cache Disable: disable caching if set
 *   A   (bit 5)  - Accessed: set by MMU when the page is accessed
 *   D   (bit 6)  - Dirty: set by MMU when the page is written
 *   PS  (bit 7)  - Page Size (PMD/PUD): 1 = large page (2MB/1GB), 0 = next table level
 *   G   (bit 8)  - Global: don't flush from TLB on context switch
 *   XD  (bit 63) - Execute Disable (NX): 1 = not executable
 */

/* Linux kernel definitions (arch/x86/include/asm/pgtable_types.h) */
#define _PAGE_BIT_PRESENT   0   /* is present */
#define _PAGE_BIT_RW        1   /* writeable */
#define _PAGE_BIT_USER      2   /* userspace addressable */
#define _PAGE_BIT_PWT       3   /* page write through */
#define _PAGE_BIT_PCD       4   /* page cache disabled */
#define _PAGE_BIT_ACCESSED  5   /* was accessed */
#define _PAGE_BIT_DIRTY     6   /* was written to */
#define _PAGE_BIT_PSE       7   /* 2MB/1GB page */
#define _PAGE_BIT_GLOBAL    8   /* global TLB entry */
#define _PAGE_BIT_NX       63   /* No execute: restrict to data */

#define _PAGE_PRESENT   (1UL << _PAGE_BIT_PRESENT)
#define _PAGE_RW        (1UL << _PAGE_BIT_RW)
#define _PAGE_USER      (1UL << _PAGE_BIT_USER)
#define _PAGE_PWT       (1UL << _PAGE_BIT_PWT)
#define _PAGE_PCD       (1UL << _PAGE_BIT_PCD)
#define _PAGE_ACCESSED  (1UL << _PAGE_BIT_ACCESSED)
#define _PAGE_DIRTY     (1UL << _PAGE_BIT_DIRTY)
#define _PAGE_PSE       (1UL << _PAGE_BIT_PSE)
#define _PAGE_GLOBAL    (1UL << _PAGE_BIT_GLOBAL)
#define _PAGE_NX        (1UL << _PAGE_BIT_NX)

/* Common protection combinations */
#define PAGE_KERNEL      __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | \
                                  _PAGE_ACCESSED | _PAGE_NX)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | \
                                  _PAGE_ACCESSED)
#define PAGE_SHARED      __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                                  _PAGE_ACCESSED | _PAGE_NX)
#define PAGE_READONLY    __pgprot(_PAGE_PRESENT | _PAGE_USER | \
                                  _PAGE_ACCESSED | _PAGE_NX)
#define PAGE_COPY_EXEC   __pgprot(_PAGE_PRESENT | _PAGE_USER | \
                                  _PAGE_ACCESSED)
```

The A and D bits are set automatically by hardware but must be cleared by software. This enables efficient tracking of page usage patterns. The kernel periodically clears these bits and uses their state to implement page replacement policies like LRU. If a page's A bit is still 0 after some time, it hasn't been accessed and is a good eviction candidate.
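As a quick illustration of how these flag bits are read in practice, here is a small userspace-style sketch (a hypothetical helper, not a kernel function; the masks are duplicated locally so it compiles standalone) that decodes a raw 64-bit entry value:

```c
#include <stdio.h>
#include <stdint.h>

/* Flag masks duplicated from the layout above for standalone compilation. */
#define PTE_PRESENT  (1ULL << 0)
#define PTE_RW       (1ULL << 1)
#define PTE_USER     (1ULL << 2)
#define PTE_ACCESSED (1ULL << 5)
#define PTE_DIRTY    (1ULL << 6)
#define PTE_GLOBAL   (1ULL << 8)
#define PTE_NX       (1ULL << 63)
#define PTE_PFN_MASK 0x000FFFFFFFFFF000ULL   /* bits 51:12 */

static void decode_pte(uint64_t pte)
{
    printf("PTE %#018llx:%s%s%s%s%s%s%s\n", (unsigned long long)pte,
           (pte & PTE_PRESENT)  ? " present"  : " not-present",
           (pte & PTE_RW)       ? " writable" : " read-only",
           (pte & PTE_USER)     ? " user"     : " kernel-only",
           (pte & PTE_ACCESSED) ? " accessed" : "",
           (pte & PTE_DIRTY)    ? " dirty"    : "",
           (pte & PTE_GLOBAL)   ? " global"   : "",
           (pte & PTE_NX)       ? " no-exec"  : " executable");
    printf("  physical frame number: %#llx\n",
           (unsigned long long)((pte & PTE_PFN_MASK) >> 12));
}

int main(void)
{
    /* Made-up value: present, writable, user, accessed, dirty, NX */
    decode_pte((1ULL << 63) | 0x12345000ULL | 0x67ULL);
    return 0;
}
```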
When the CPU encounters a virtual address that isn't cached in the TLB, it must perform a page table walk—traversing the multi-level hierarchy to find the physical address. Let's trace through this process step by step:
```c
/* Conceptual page table walk for x86_64.
 * Illustrative only: it treats physical addresses as directly
 * dereferenceable pointers and reuses the _PAGE_* flags defined above.
 * The masks and the PAGE_FAULT sentinel below are simplified placeholders. */

#include <stdint.h>
#include <stdbool.h>

#define PAGE_FAULT      (~0ULL)                 /* sentinel: walk faulted       */
#define PHYS_ADDR_MASK  0x000FFFFFFFFFF000ULL   /* bits 51:12 of an entry       */
#define HUGE_2MB_MASK   0x000FFFFFFFE00000ULL   /* bits 51:21 (2 MB frame base) */
#define HUGE_1GB_MASK   0x000FFFFFC0000000ULL   /* bits 51:30 (1 GB frame base) */

/* Example virtual address: 0x00007F4A12345678
 * Breaking it down (assuming 48-bit addressing):
 *
 * Binary: 0000 0000 0000 0000 0111 1111 0100 1010
 *         0001 0010 0011 0100 0101 0110 0111 1000
 *
 * Components:
 *   PGD index   (bits 47-39): 0x0FE = 254
 *   PUD index   (bits 38-30): 0x128 = 296
 *   PMD index   (bits 29-21): 0x091 = 145
 *   PTE index   (bits 20-12): 0x145 = 325
 *   Page offset (bits 11-0):  0x678 = 1656
 */

uint64_t translate_virtual_to_physical(
    uint64_t cr3,           /* PGD base from CR3 register */
    uint64_t virtual_addr,  /* Address to translate       */
    bool is_user_access,
    bool is_write_access,
    bool is_exec_access)
{
    /* Extract indices from virtual address */
    uint16_t pgd_idx = (virtual_addr >> 39) & 0x1FF;  /* bits 47-39 */
    uint16_t pud_idx = (virtual_addr >> 30) & 0x1FF;  /* bits 38-30 */
    uint16_t pmd_idx = (virtual_addr >> 21) & 0x1FF;  /* bits 29-21 */
    uint16_t pte_idx = (virtual_addr >> 12) & 0x1FF;  /* bits 20-12 */
    uint16_t offset  = virtual_addr & 0xFFF;          /* bits 11-0  */

    /* Step 1: Read PGD entry */
    uint64_t *pgd_table = (uint64_t *)(cr3 & ~0xFFFULL);
    uint64_t pgd_entry = pgd_table[pgd_idx];
    if (!(pgd_entry & _PAGE_PRESENT)) {
        /* PAGE FAULT: PGD entry not present */
        return PAGE_FAULT;
    }

    /* Step 2: Read PUD entry */
    uint64_t *pud_table = (uint64_t *)(pgd_entry & PHYS_ADDR_MASK);
    uint64_t pud_entry = pud_table[pud_idx];
    if (!(pud_entry & _PAGE_PRESENT)) {
        /* PAGE FAULT: PUD entry not present */
        return PAGE_FAULT;
    }

    /* Check for 1GB huge page */
    if (pud_entry & _PAGE_PSE) {
        uint64_t phys_base = pud_entry & HUGE_1GB_MASK;
        uint64_t huge_offset = virtual_addr & 0x3FFFFFFF;  /* 30-bit offset */
        return phys_base | huge_offset;
    }

    /* Step 3: Read PMD entry */
    uint64_t *pmd_table = (uint64_t *)(pud_entry & PHYS_ADDR_MASK);
    uint64_t pmd_entry = pmd_table[pmd_idx];
    if (!(pmd_entry & _PAGE_PRESENT)) {
        /* PAGE FAULT: PMD entry not present */
        return PAGE_FAULT;
    }

    /* Check for 2MB huge page */
    if (pmd_entry & _PAGE_PSE) {
        uint64_t phys_base = pmd_entry & HUGE_2MB_MASK;
        uint64_t huge_offset = virtual_addr & 0x1FFFFF;  /* 21-bit offset */
        return phys_base | huge_offset;
    }

    /* Step 4: Read PTE entry */
    uint64_t *pte_table = (uint64_t *)(pmd_entry & PHYS_ADDR_MASK);
    uint64_t pte_entry = pte_table[pte_idx];
    if (!(pte_entry & _PAGE_PRESENT)) {
        /* PAGE FAULT: PTE entry not present */
        return PAGE_FAULT;
    }

    /* Check permissions */
    if (is_user_access && !(pte_entry & _PAGE_USER)) {
        /* PAGE FAULT: Supervisor-only page accessed from user mode */
        return PAGE_FAULT;
    }
    if (is_write_access && !(pte_entry & _PAGE_RW)) {
        /* PAGE FAULT: Write to read-only page */
        return PAGE_FAULT;
    }
    if (is_exec_access && (pte_entry & _PAGE_NX)) {
        /* PAGE FAULT: Execute on non-executable page */
        return PAGE_FAULT;
    }

    /* Step 5: Construct physical address */
    uint64_t phys_frame = pte_entry & PHYS_ADDR_MASK;
    uint64_t physical_addr = phys_frame | offset;

    /* On real hardware the MMU also updates the in-memory entry: */
    pte_entry |= _PAGE_ACCESSED;      /* Accessed bit on every access */
    if (is_write_access)
        pte_entry |= _PAGE_DIRTY;     /* Dirty bit on writes          */

    return physical_addr;
}
```

On x86, the page table walk is performed entirely in hardware by the MMU—no software intervention is needed for normal translations. The kernel's role is to set up and maintain the page tables. Only when a page fault occurs (Present=0, permission violation, etc.) does software get involved. Some architectures (like early MIPS) use software-managed TLBs where the kernel handles all misses.
Walk Performance:
A complete 4-level page table walk requires 4 memory accesses—a significant overhead if performed for every memory reference. This is where the Translation Lookaside Buffer (TLB) becomes critical: caching recent translations to avoid repeated walks.
The Translation Lookaside Buffer (TLB) is a small, extremely fast cache that stores recent virtual-to-physical address translations. Modern CPUs could not function efficiently without it—memory access would be 4-5x slower if every reference required a full page table walk.
TLB Organization:
Modern processors typically have separate L1 TLBs for instructions (ITLB) and data (DTLB), backed by a larger unified L2 TLB (often called the STLB). TLBs are usually organized as set-associative caches, with separate entries for different page sizes (4 KB, 2 MB, 1 GB). Representative figures for a recent x86 core:
| TLB Type | Entries (4KB) | Entries (2MB) | Associativity | Latency |
|---|---|---|---|---|
| L1 ITLB | 64 | 8 | 8-way | ~1 cycle |
| L1 DTLB | 64 | 32 | 4-way | ~1 cycle |
| L2 STLB | 1536 | 1536 | 12-way | ~7 cycles |
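Those sizes translate directly into TLB reach, the amount of memory that can be addressed without a table walk. A small sketch of the arithmetic, using the representative STLB capacity from the table above:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Representative L2 STLB capacity from the table above */
    const uint64_t stlb_entries = 1536;

    uint64_t reach_4k = stlb_entries * (4ULL << 10);   /* 4 KB pages */
    uint64_t reach_2m = stlb_entries * (2ULL << 20);   /* 2 MB pages */

    printf("STLB reach with 4 KB pages: %llu MB\n",
           (unsigned long long)(reach_4k >> 20));      /* 6 MB */
    printf("STLB reach with 2 MB pages: %llu GB\n",
           (unsigned long long)(reach_2m >> 30));      /* 3 GB */

    /* Working sets larger than the reach keep missing the TLB,
     * and every miss costs a multi-level walk (up to 4 memory accesses). */
    return 0;
}
```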
TLB Invalidation:
When page table entries change, the corresponding TLB entries must be invalidated to prevent stale translations. Linux provides several invalidation mechanisms:
```c
/* TLB invalidation interfaces (simplified/illustrative versions of what
 * lives in arch/x86/include/asm/tlbflush.h and arch/x86/mm/tlb.c) */

/* Flush entire TLB (expensive - avoid when possible) */
static inline void flush_tlb_all(void)
{
    /* Reloading CR3 flushes all non-global TLB entries */
    native_write_cr3(__native_read_cr3());
}

/* Flush TLB entries for a specific address range */
void flush_tlb_range(struct vm_area_struct *vma,
                     unsigned long start, unsigned long end)
{
    /* Uses INVLPG instruction for each page, or
     * full flush if range is large enough */
}

/* Flush single page TLB entry */
static inline void flush_tlb_one_kernel(unsigned long addr)
{
    /* x86 INVLPG instruction */
    asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}

/* Flush TLB on all CPUs (very expensive!) */
void flush_tlb_all_remote(void)
{
    /* Sends Inter-Processor Interrupt (IPI) to all CPUs
     * Each CPU then performs local TLB flush
     * Called when changing kernel page tables */
}

/* Modern x86 provides INVPCID instruction for more granular control */
static inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
{
    /* Flush entry for specific PCID and address
     * More efficient than full flush */
}

/* TLB shootdown optimization: batching */
struct tlb_flush_pending {
    unsigned long start;
    unsigned long end;
    unsigned int  stride_shift;
    bool          flush_required;
};

/* Batch multiple TLB invalidations and execute at once */
void tlb_flush_batched(struct mmu_gather *tlb)
{
    /* Collects invalidation requests then issues efficiently */
}
```

When kernel page tables change, ALL CPUs must flush their TLBs—requiring Inter-Processor Interrupts (IPIs). On systems with many cores, TLB shootdowns can become a significant performance bottleneck. This is one reason why huge pages improve performance: fewer TLB entries means fewer shootdowns during memory management operations.
Process Context Identifiers (PCID):
Traditionally, a context switch required a full TLB flush because TLB entries don't distinguish between processes. Modern x86 processors support PCIDs: 12-bit tags attached to TLB entries. With PCID support, each TLB entry is tagged with the identifier of the address space it belongs to, the CPU only matches entries whose tag equals the current PCID, and the kernel can switch CR3 without flushing, since stale entries from other address spaces simply never match. This significantly reduces context switch overhead, especially for workloads with frequent process switches.
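At the hardware level, the PCID rides in the low bits of CR3. Below is a minimal sketch of how a CR3 value is composed when PCIDs are enabled; the bit positions reflect the x86 architecture, while the helper name is hypothetical.

```c
#include <stdint.h>

/* With CR4.PCIDE enabled:
 *   CR3 bits 11:0  = PCID of the address space being switched to
 *   CR3 bits 51:12 = physical address of the top-level page table
 *   CR3 bit  63    = "no flush": keep TLB entries tagged with this PCID
 */
#define CR3_PCID_MASK   0xFFFULL
#define CR3_NOFLUSH_BIT (1ULL << 63)

static inline uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid, int noflush)
{
    uint64_t cr3 = (pgd_phys & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);

    if (noflush)
        cr3 |= CR3_NOFLUSH_BIT;   /* reuse existing TLB entries for this PCID */
    return cr3;
}
```

Linux's real implementation manages a small per-CPU pool of ASIDs rather than dedicating one PCID to each process, recycling tags as processes are scheduled on and off a CPU.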
Linux provides a rich set of functions for page table manipulation. These APIs abstract architecture-specific details while providing the flexibility kernel subsystems need.
```c
/* Key page table manipulation functions */

/* === Table Entry Access and Navigation === */

/* Get PGD entry for a virtual address */
pgd_t *pgd_offset(struct mm_struct *mm, unsigned long address);
pgd_t *pgd_offset_k(unsigned long address);   /* Kernel address space */

/* Navigate down the hierarchy */
p4d_t *p4d_offset(pgd_t *pgd, unsigned long address);
pud_t *pud_offset(p4d_t *p4d, unsigned long address);
pmd_t *pmd_offset(pud_t *pud, unsigned long address);
pte_t *pte_offset_map(pmd_t *pmd, unsigned long address);

/* === Entry State Queries === */

#define pgd_present(pgd)  (pgd_val(pgd) & _PAGE_PRESENT)
#define pud_present(pud)  (pud_val(pud) & _PAGE_PRESENT)
#define pmd_present(pmd)  (pmd_val(pmd) & _PAGE_PRESENT)
#define pte_present(pte)  (pte_val(pte) & _PAGE_PRESENT)

#define pte_write(pte)    (pte_val(pte) & _PAGE_RW)
#define pte_dirty(pte)    (pte_val(pte) & _PAGE_DIRTY)
#define pte_young(pte)    (pte_val(pte) & _PAGE_ACCESSED)
#define pte_exec(pte)     (!(pte_val(pte) & _PAGE_NX))

/* === Entry Modification === */

/* Create PTE with specific protections */
pte_t mk_pte(struct page *page, pgprot_t pgprot);

/* Modify existing PTE flags */
pte_t pte_wrprotect(pte_t pte);   /* Clear write permission */
pte_t pte_mkwrite(pte_t pte);     /* Set write permission   */
pte_t pte_mkclean(pte_t pte);     /* Clear dirty bit        */
pte_t pte_mkdirty(pte_t pte);     /* Set dirty bit          */
pte_t pte_mkold(pte_t pte);       /* Clear accessed bit     */
pte_t pte_mkyoung(pte_t pte);     /* Set accessed bit       */
pte_t pte_mkexec(pte_t pte);      /* Clear NX bit           */
pte_t pte_exprotect(pte_t pte);   /* Set NX bit             */

/* === Setting Entries in Tables === */

void set_pte_at(struct mm_struct *mm, unsigned long addr,
                pte_t *ptep, pte_t pte);
void set_pmd_at(struct mm_struct *mm, unsigned long addr,
                pmd_t *pmdp, pmd_t pmd);
/* Similar for PUD and PGD */

/* Atomic clear and set (for concurrent access) */
pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
                         pte_t *ptep);
pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
                             pte_t *ptep);
void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
                             pte_t *ptep, pte_t pte);

/* === Table Allocation === */

/* Allocate a new page table page */
pgtable_t pte_alloc_one(struct mm_struct *mm);
pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr);
pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long addr);

/* Free a page table page */
void pte_free(struct mm_struct *mm, pgtable_t pte_page);

/* === High-Level Page Fault Handlers === */

/* Install a new page mapping (called from fault handler) */
vm_fault_t vmf_insert_page(struct vm_area_struct *vma, unsigned long addr,
                           struct page *page);
vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
                          unsigned long pfn);

/* Copy page tables (used in fork) */
int copy_page_range(struct vm_area_struct *dst_vma,
                    struct vm_area_struct *src_vma);
```

These APIs are defined per-architecture but have consistent semantics. On 32-bit systems without all 4/5 levels, the 'missing' levels are folded—pud_offset() simply returns the same pointer if there's no PUD level. This allows architecture-independent kernel code to work correctly across all platforms.
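To see how these pieces compose, here is a hedged sketch of a software walk through the hierarchy using the offset helpers above. It is illustrative only: the function name is hypothetical, and locking (mmap_lock, the PTE lock) and the pte_unmap() pairing are left to the caller.

```c
/* Sketch: find the PTE mapping a user virtual address, or NULL if the
 * path through the hierarchy is not populated. */
static pte_t *lookup_pte(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;

    pgd = pgd_offset(mm, addr);            /* top level, from mm->pgd */
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return NULL;

    p4d = p4d_offset(pgd, addr);           /* folded away on 4-level  */
    if (p4d_none(*p4d) || p4d_bad(*p4d))
        return NULL;

    pud = pud_offset(p4d, addr);
    if (pud_none(*pud) || pud_bad(*pud))
        return NULL;

    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return NULL;

    /* May return NULL; a non-NULL result must be released with pte_unmap() */
    return pte_offset_map(pmd, addr);
}
```

Note that this sketch ignores huge pages: a check such as pmd_trans_huge() or pmd_leaf() would be needed before descending to the PTE level.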
Huge pages use the Page Size (PS) bit in PMD or PUD entries to map larger memory regions with a single TLB entry. Instead of mapping 4 KB per entry, huge pages map 2 MB (PMD-level) or 1 GB (PUD-level) per entry.
Benefits of Huge Pages:
| Page Size | TLB Entries for 1 GB | Page Table Depth | Table Overhead for 1 GB |
|---|---|---|---|
| 4 KB (standard) | 262,144 entries | 4 levels | ~2 MB |
| 2 MB (huge) | 512 entries | 3 levels (stop at PMD) | ~4 KB |
| 1 GB (gigantic) | 1 entry | 2 levels (stop at PUD) | ~8 bytes |
```bash
# Check huge page support and configuration
cat /proc/meminfo | grep -i huge
# HugePages_Total:       0
# HugePages_Free:        0
# HugePages_Rsvd:        0
# HugePages_Surp:        0
# Hugepagesize:       2048 kB
# Hugetlb:               0 kB

# Check available huge page sizes
ls /sys/kernel/mm/hugepages/
# hugepages-1048576kB  hugepages-2048kB

# Allocate huge pages (requires root)
echo 100 > /proc/sys/vm/nr_hugepages     # Request 100 2MB pages
cat /proc/meminfo | grep HugePages_Total
# HugePages_Total:     100

# Mount hugetlbfs for explicit huge page allocation
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge

# Transparent Huge Pages (THP) - automatic huge page usage
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# View THP statistics
grep -i thp /proc/vmstat
# thp_fault_alloc 12345
# thp_collapse_alloc 567
# thp_split_page 89

# Check if a process is using huge pages
cat /proc/self/smaps | grep -i huge
```

While THP improves average performance, it can cause latency spikes. When the kernel promotes pages to huge pages or demotes/splits them, it may stall the application. For latency-sensitive workloads (databases, trading systems), explicit huge pages or 'madvise' mode may be preferable to 'always' mode.
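Beyond the sysfs/procfs knobs, a program can request huge pages directly. The following minimal userspace sketch assumes the hugetlb pool configured above has free 2 MB pages; error handling is kept short.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void)
{
    /* Explicit huge page from the hugetlb pool (needs nr_hugepages > 0) */
    void *buf = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    memset(buf, 0, HUGE_2MB);   /* touch it: backed by a single 2 MB page */
    munmap(buf, HUGE_2MB);

    /* Alternatively, hint THP for an ordinary anonymous mapping */
    void *thp = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (thp != MAP_FAILED) {
        madvise(thp, HUGE_2MB, MADV_HUGEPAGE);  /* eligible for THP promotion */
        munmap(thp, HUGE_2MB);
    }
    return 0;
}
```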
As memory demands grow, the 256 TB limit of 48-bit addressing becomes constraining for some workloads. Intel introduced LA57 (5-level paging with 57-bit linear addresses) to extend the virtual address space to 128 PB (petabytes).
The Additional Level:
LA57 adds a fifth level. Intel calls the new top-level table PML5; Linux keeps the PGD as the top level (now indexed by bits 56-48) and inserts the P4D (Page 4th Directory) between the PGD and the PUD:
| PGD | P4D | PUD | PMD | PTE | Offset |
|---|---|---|---|---|---|
| Bits 56-48 (9 bits) | Bits 47-39 (9 bits) | Bits 38-30 (9 bits) | Bits 29-21 (9 bits) | Bits 20-12 (9 bits) | Bits 11-0 (12 bits) |
Implications: the kernel must support both layouts at runtime, folding the extra level away on CPUs without LA57, and the userspace limit (TASK_SIZE_MAX) grows accordingly:
```c
/* 5-level page table support (simplified from
 * arch/x86/include/asm/pgtable_64_types.h) */

#ifdef CONFIG_X86_5LEVEL

#define PGDIR_SHIFT     48
#define PTRS_PER_PGD    512
#define P4D_SHIFT       39
#define PTRS_PER_P4D    512
#define MAX_VA_BITS     57

/* User/kernel split for 5-level */
#define TASK_SIZE_MAX   ((1UL << 56) - PAGE_SIZE)

/* Navigate through 5 levels */
static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
    if (!pgtable_l5_enabled())
        return (p4d_t *)pgd;   /* Fold if not using 5-level */
    return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
}

#else /* 4-level: P4D is folded into PGD */

#define PGDIR_SHIFT     39
#define P4D_SHIFT       39
#define PTRS_PER_P4D    1
#define MAX_VA_BITS     48

#define TASK_SIZE_MAX   ((1UL << 47) - PAGE_SIZE)

#endif

/* Runtime check for 5-level support */
static inline bool pgtable_l5_enabled(void)
{
    return IS_ENABLED(CONFIG_X86_5LEVEL) &&
           cpu_feature_enabled(X86_FEATURE_LA57);
}

/* Kernel configuration option check */
#ifdef CONFIG_X86_5LEVEL
  /* 5-level page tables enabled in kernel config */
  /* Actual use depends on CPU support (checked at boot) */
#endif
```

Most systems today don't need 5-level page tables—128 TB is sufficient for nearly all workloads. LA57 is primarily relevant for large in-memory databases, scientific computing with massive datasets, and future-proofing. The extra level adds one memory access per TLB miss, so 4-level tables are preferable when 128 TB suffices.
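Even on LA57 hardware with a 5-level kernel, the kernel keeps userspace mappings below the 47-bit boundary by default for compatibility; a process must explicitly pass a hint above that boundary to mmap() to receive higher addresses. A minimal sketch (the hint value is illustrative):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* A hint above the 47-bit boundary opts this mapping into the
     * extended (57-bit) address space on LA57 kernels and CPUs.
     * On 4-level systems the kernel simply picks a normal address. */
    void *hint = (void *)(1UL << 48);

    void *p = mmap(hint, 1UL << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("mapped at %p\n", p);   /* above 0x7fffffffffff only with LA57 */
    munmap(p, 1UL << 20);
    return 0;
}
```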
We have explored the intricate machinery that bridges virtual and physical memory: the multi-level hierarchy that keeps page tables sparse, the entry flags that enforce protection, the TLB that makes translation fast, the kernel APIs that manipulate mappings portably, and the huge pages that stretch TLB reach.
What's Next:
Page tables define the mapping, but memory must also be allocated efficiently. The next page explores the slab allocator—Linux's high-performance object caching layer that dramatically reduces allocation overhead for frequently-used kernel data structures.
You now have an expert understanding of Linux page table management—the multi-level hierarchy, entry formats, TLB optimization, and kernel APIs. This knowledge is essential for kernel development, performance tuning, and understanding security mechanisms like KPTI.