Memory virtualization is one of the most challenging aspects of building a hypervisor. Every memory access a guest makes uses a guest virtual address (GVA). The guest's operating system translates this to a guest physical address (GPA) using its page tables. But the guest's 'physical' addresses aren't actually physical—they're abstracted by the hypervisor. The hypervisor must then translate GPAs to host physical addresses (HPA) that map to real RAM.
Before hardware-assisted nested paging, hypervisors maintained shadow page tables—complex software structures that collapsed both translation levels into direct GVA→HPA mappings. Shadow page tables worked, but they were expensive to maintain, triggered frequent VM exits on guest page table updates, and added significant hypervisor complexity.
Extended Page Tables (EPT) from Intel and Nested Page Tables (NPT) from AMD changed everything by moving the second-level translation into hardware.
By the end of this page, you will understand the two-dimensional page walk that EPT/NPT enables, the structure of nested page tables, how TLB caching works with two-level translation, EPT/NPT violations vs. traditional page faults, and the performance characteristics of hardware memory virtualization.
To understand EPT/NPT, we must first understand why memory virtualization is hard, and what shadow page tables attempted to solve.
Address Spaces in Virtualization:
| Address Type | Abbreviation | Description |
|---|---|---|
| Guest Virtual Address | GVA | Address used by guest applications and kernel |
| Guest Physical Address | GPA | What guest OS thinks is physical RAM |
| Host Virtual Address | HVA | Hypervisor's own virtual address space |
| Host Physical Address | HPA | Actual physical RAM addresses |
A guest application's memory access requires two translations: GVA → GPA through the guest's page tables, then GPA → HPA through the hypervisor's mapping.
Without hardware support, the CPU only knows about one page table hierarchy. It can translate GVA → HPA directly (if given the right page tables), but it cannot perform a two-step translation.
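The two-step translation can be sketched in miniature. This toy model is hypothetical: flat arrays stand in for the real multi-level page tables, but it shows the composition GVA → GPA → HPA that the hypervisor must somehow realize:

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)
#define NUM_PAGES  16

static uint64_t guest_pt[NUM_PAGES];  /* GVA page -> GPA page (guest-owned) */
static uint64_t nested_pt[NUM_PAGES]; /* GPA page -> HPA page (VMM-owned)   */

/* GVA -> GPA via the guest's tables, then GPA -> HPA via the VMM's. */
uint64_t translate_gva_to_hpa(uint64_t gva) {
    uint64_t gpa = (guest_pt[(gva >> PAGE_SHIFT) % NUM_PAGES] << PAGE_SHIFT)
                   | (gva & (PAGE_SIZE - 1));
    return (nested_pt[(gpa >> PAGE_SHIFT) % NUM_PAGES] << PAGE_SHIFT)
           | (gpa & (PAGE_SIZE - 1));
}
```

In a real system each arrow is itself a multi-level walk; the point here is only that two independent mappings compose, and the CPU historically could walk only one of them.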
The Shadow Page Table Approach (Pre-EPT/NPT):
Without hardware nested paging, hypervisors created shadow page tables that combined both translations into direct GVA→HPA mappings, kept coherent with the guest's tables by the hypervisor.
Problems with Shadow Page Tables: every guest page table update must be intercepted (costing a VM exit), each guest address space needs its own shadow copy (extra memory), and the synchronization logic adds significant hypervisor complexity.
Shadow Page Table Maintenance:
1. Guest writes to its page table
2. Write causes exit (page protected)
3. VMM reads guest page table entry
4. VMM looks up GPA→HPA translation
5. VMM creates/updates shadow entry: GVA→HPA
6. VMM unprotects page temporarily
7. VMM re-enters guest
This happens on EVERY guest page table modification!
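The maintenance cycle above can be modeled in a few lines. This is a deliberately simplified sketch (flat arrays, a stand-in gpa_to_hpa): every guest write to its page table forces the VMM to recompute the corresponding shadow entry:

```c
#include <stdint.h>
#include <assert.h>

#define N 8
static uint64_t guest_pt[N];   /* guest's own table: GVA page -> GPA page */
static uint64_t shadow_pt[N];  /* VMM-maintained:    GVA page -> HPA page */

/* Stand-in for the VMM's GPA->HPA physical map (step 4). */
static uint64_t gpa_to_hpa(uint64_t gpa_page) {
    return gpa_page + 100;
}

/* Steps 3-5: on a write-protection exit, resync one shadow entry. */
void sync_shadow_entry(unsigned idx) {
    shadow_pt[idx] = gpa_to_hpa(guest_pt[idx]);
}

/* Steps 1-2 modeled directly: in reality the write would trap,
 * and the VMM would perform the resync before resuming the guest. */
void guest_writes_pte(unsigned idx, uint64_t gpa_page) {
    guest_pt[idx] = gpa_page;
    sync_shadow_entry(idx);    /* VMM work incurred on every modification */
}
```

The cost is not the arithmetic but the trap: each resync in real hardware is a full VM exit and re-entry.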
Workloads with heavy memory allocation (database systems, JIT compilers, container orchestration) could see 10-40% overhead from shadow page table maintenance. A single fork() call might trigger thousands of shadow table updates. This made certain workloads impractical to virtualize efficiently.
Extended Page Tables (Intel) and Nested Page Tables (AMD) solve memory virtualization by adding a second-level address translation performed entirely in hardware. The guest page tables remain unmodified, and the CPU handles both translation levels automatically.
The Two-Dimensional Walk:
When a guest accesses memory, the CPU performs a coordinated walk of both page table hierarchies:
First dimension (Guest tables): Walk guest CR3 → PML4 → PDPT → PD → PT to translate GVA to GPA
Second dimension (Nested tables): For every guest physical address encountered during the walk (including page table pointers), translate GPA to HPA using the nested page tables
The guest page table walk alone accesses 4 memory locations (in 4-level paging). Each of those locations is addressed by a GPA that must itself be translated via the nested tables. Counting the final data address, a single guest memory access can trigger up to 24 memory references in the worst case!
Walk Complexity Analysis:
For 4-level paging: 4 guest levels and 4 nested levels yield a worst case of (4+1)×(4+1)−1 = 24 memory references, versus 4 for native paging.
This sounds expensive, but page-walk caches, large pages, and high TLB hit rates keep the average cost far lower, and no VM exits are required.
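The worst-case reference count follows a simple formula: the guest walk yields n table pointers plus the final data address, each a GPA costing an m-level nested walk, plus the n guest table reads themselves. A small helper (illustrative only) captures it:

```c
#include <assert.h>

/* (n + 1) GPAs to translate at m nested reads each, plus the n guest
 * table reads: (n + 1) * m + n = (n + 1) * (m + 1) - 1.
 * The final data access itself is not counted. */
int max_walk_accesses(int guest_levels, int nested_levels) {
    return (guest_levels + 1) * (nested_levels + 1) - 1;
}
```

Plugging in 4/4, 3/3, and 2/2 levels reproduces the 24, 15, and 8 figures used in the large-page comparison later in this page.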
Enabling EPT (Intel):
void enable_ept(struct vmcs *vmcs, uint64_t eptp) {
/* Set EPT pointer in VMCS */
/* Format: [Page walk length (3)] [Memory type] [Root table address] */
vmwrite(VMCS_EPT_POINTER,
(eptp & PAGE_MASK) | /* Root table physical addr */
(3 << 3) | /* 4-level walk (encoded as 3) */
(6)); /* Write-back memory type */
/* Enable EPT in secondary processor controls */
uint32_t secondary = vmread(VMCS_SECONDARY_PROC_CONTROLS);
secondary |= SECONDARY_EXEC_ENABLE_EPT;
vmwrite(VMCS_SECONDARY_PROC_CONTROLS, secondary);
}
Enabling NPT (AMD):
void enable_npt(struct vmcb *vmcb, uint64_t ncr3) {
/* Set nested CR3 (root of nested page tables) */
vmcb->control.nested_cr3 = ncr3;
/* Enable NPT */
vmcb->control.nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
}
EPT/NPT trades per-access latency (deeper page walks) for elimination of exit overhead. For most workloads, this is a massive win—shadow table exits are far more expensive than extra memory references. The break-even point is workloads with extremely high TLB miss rates and very few page table modifications.
EPT and NPT use page table structures similar to the standard x86-64 page tables, but with different entry formats designed for virtualization needs.
EPT Entry Format (Intel):
Each EPT entry is 64 bits with the following layout:
| Bits | Field | Description |
|---|---|---|
| 0 | R | Read access allowed |
| 1 | W | Write access allowed |
| 2 | X | Execute access allowed |
| 3-5 | Memory Type | EPT memory type (for leaf entries) |
| 6 | Ignore PAT | Ignore guest PAT settings |
| 7 | Large Page | Maps 2MB (PD) or 1GB (PDPT) page |
| 8 | Accessed | Hardware sets on access (if enabled) |
| 9 | Dirty | Hardware sets on write (if enabled) |
| 10-11 | Reserved | Must be 0 |
| 12-51 | Physical Address | Next table or mapped page |
| 52-62 | Reserved | Must be 0 |
| 63 | Suppress VE | Suppress #VE virtualization exception |
Key Differences from Standard x86 Page Tables:
Separate R/W/X bits: EPT has independent read, write, and execute permissions (standard x86 only has W and NX). This enables fine-grained memory protection.
No User/Supervisor distinction: EPT doesn't distinguish privilege levels—it applies to all guest accesses. Guest user/supervisor is handled by guest page tables.
Memory Type control: EPT can specify caching behavior (uncacheable, write-combining, write-back, etc.) replacing or complementing MTRR and PAT.
Accessed/Dirty bits: Optional (requires CPU support). When enabled, hardware tracks page access for demand paging and dirty tracking.
/* EPT Entry definitions */
#define EPT_READ (1ULL << 0)
#define EPT_WRITE (1ULL << 1)
#define EPT_EXECUTE (1ULL << 2)
#define EPT_MT_MASK (7ULL << 3) /* Memory type */
#define EPT_MT_UC (0ULL << 3) /* Uncacheable */
#define EPT_MT_WC (1ULL << 3) /* Write-combining */
#define EPT_MT_WB (6ULL << 3) /* Write-back */
#define EPT_IGNORE_PAT (1ULL << 6)
#define EPT_LARGE_PAGE (1ULL << 7)
#define EPT_ACCESSED (1ULL << 8)
#define EPT_DIRTY (1ULL << 9)
/* Create EPT entry for 4KB page (leaf entries need a memory type) */
uint64_t create_ept_pte(uint64_t hpa, uint64_t flags) {
return (hpa & PAGE_MASK) | flags | EPT_READ | EPT_WRITE | EPT_EXECUTE | EPT_MT_WB;
}
/* Create EPT entry for 2MB large page */
uint64_t create_ept_pde_2mb(uint64_t hpa, uint64_t flags) {
return (hpa & LARGE_PAGE_MASK) | flags | EPT_LARGE_PAGE |
EPT_READ | EPT_WRITE | EPT_EXECUTE | EPT_MT_WB;
}
NPT Entry Format (AMD):
AMD's NPT uses a format closer to standard x86 page tables:
| Bits | Field | Description |
|---|---|---|
| 0 | P | Present |
| 1 | R/W | Read/Write |
| 2 | U/S | User/Supervisor (must be set; nested walks are treated as user accesses) |
| 3 | PWT | Page Write-Through |
| 4 | PCD | Page Cache Disable |
| 5 | A | Accessed |
| 6 | D | Dirty (leaf entries only) |
| 7 | PS | Large page (2MB or 1GB) |
| 8-11 | Available | Software use |
| 12-51 | Physical Address | Next table or mapped page |
| 52-62 | Available | Software use |
| 63 | NX | No-Execute (if enabled) |
NPT uses the standard x86 Present/Read-Write model, making it simpler for developers familiar with x86 paging. The explicit read/write/execute granularity of EPT requires slightly more complex handling but enables additional security features.
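For comparison with the EPT helpers earlier, here is a sketch of building an NPT leaf entry from the bits in the table above (create_npt_pte is a hypothetical helper name). U/S is always set, since hardware treats nested page table walks as user-mode accesses:

```c
#include <stdint.h>
#include <assert.h>

#define NPT_PRESENT  (1ULL << 0)
#define NPT_WRITABLE (1ULL << 1)
#define NPT_USER     (1ULL << 2)   /* required: nested walks are user accesses */
#define NPT_LARGE    (1ULL << 7)
#define NPT_NX       (1ULL << 63)

/* Build a 4KB NPT leaf entry mapping a host physical page. */
uint64_t create_npt_pte(uint64_t hpa, int writable, int noexec) {
    uint64_t e = (hpa & 0x000FFFFFFFFFF000ULL) | NPT_PRESENT | NPT_USER;
    if (writable) e |= NPT_WRITABLE;
    if (noexec)   e |= NPT_NX;
    return e;
}
```

Unlike EPT, there is no independent read permission: a present NPT entry is always readable, matching the classic x86 model.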
/* Build nested page tables for a guest VM */
struct ept_tables {
    uint64_t *pml4; /* Level 4: 512 entries, 512GB each */
    uint64_t *pdpt; /* Level 3: 512 entries, 1GB each */
    uint64_t *pd;   /* Level 2: 512 entries, 2MB each */
    uint64_t *pt;   /* Level 1: 512 entries, 4KB each */
};

/* Identity map guest physical memory with EPT */
void build_ept_identity_map(struct ept_tables *ept, size_t mem_size) {
    /* Allocate and zero all page tables */
    ept->pml4 = alloc_page_zeroed();
    ept->pdpt = alloc_page_zeroed();

    /* Point PML4[0] to PDPT */
    ept->pml4[0] = virt_to_phys(ept->pdpt) |
                   EPT_READ | EPT_WRITE | EPT_EXECUTE;

    /* Map memory using 2MB large pages for efficiency */
    size_t pdpt_entries = (mem_size + (1ULL << 30) - 1) >> 30; /* 1GB chunks */
    for (size_t i = 0; i < pdpt_entries && i < 512; i++) {
        uint64_t *pd = alloc_page_zeroed();
        ept->pdpt[i] = virt_to_phys(pd) |
                       EPT_READ | EPT_WRITE | EPT_EXECUTE;

        /* Fill PD with 2MB large page entries */
        size_t pd_entries = min(512, (mem_size - i * (1ULL << 30)) >> 21);
        for (size_t j = 0; j < pd_entries; j++) {
            uint64_t gpa = (i << 30) | (j << 21);
            /* Identity map: GPA == HPA (for this example) */
            pd[j] = gpa | EPT_LARGE_PAGE |
                    EPT_READ | EPT_WRITE | EPT_EXECUTE | EPT_MT_WB;
        }
    }
}

/* Map a specific GPA to HPA */
void ept_map_page(struct ept_tables *ept, uint64_t gpa, uint64_t hpa,
                  uint64_t flags) {
    /* Extract indices for each level */
    uint64_t pml4_idx = (gpa >> 39) & 0x1FF;
    uint64_t pdpt_idx = (gpa >> 30) & 0x1FF;
    uint64_t pd_idx   = (gpa >> 21) & 0x1FF;
    uint64_t pt_idx   = (gpa >> 12) & 0x1FF;

    /* Walk/create page table hierarchy */
    uint64_t *pdpt = get_or_create_table(ept->pml4, pml4_idx);
    uint64_t *pd   = get_or_create_table(pdpt, pdpt_idx);
    uint64_t *pt   = get_or_create_table(pd, pd_idx);

    /* Set the final page table entry */
    pt[pt_idx] = (hpa & PAGE_MASK) | flags;
}
Using 2MB or 1GB large pages dramatically reduces EPT/NPT walk overhead: a 1GB page eliminates two levels of table lookup.
Modern hypervisors default to large pages for guest memory when possible, falling back to 4KB pages only when granular control is needed (e.g., MMIO regions).
When a guest memory access cannot be translated by the nested page tables, an EPT violation (Intel) or NPT page fault (AMD) occurs. This is distinct from a regular guest page fault—the guest's translation succeeded, but the hypervisor's translation failed.
Causes of EPT/NPT Violations:
EPT Violation Exit Information (Intel):
When an EPT violation occurs, the VMCS contains detailed information:
Exit Qualification: Bits indicating access type and EPT state
Guest Physical Address: The GPA that caused the violation
Guest Linear Address: The GVA that caused the access (if valid)
/* Handle EPT violation (Intel) */
void handle_ept_violation(struct vcpu *vcpu) {
    uint64_t exit_qual = vmread(VMCS_EXIT_QUALIFICATION);
    uint64_t gpa = vmread(VMCS_GUEST_PHYSICAL_ADDRESS);
    uint64_t gva = vmread(VMCS_GUEST_LINEAR_ADDRESS);

    bool is_read  = exit_qual & (1 << 0);
    bool is_write = exit_qual & (1 << 1);
    bool is_fetch = exit_qual & (1 << 2);
    bool ept_readable   = exit_qual & (1 << 3);
    bool ept_writable   = exit_qual & (1 << 4);
    bool ept_executable = exit_qual & (1 << 5);

    /* Check if this is an MMIO access */
    if (is_mmio_region(vcpu->vm, gpa)) {
        if (is_write) {
            uint64_t value = get_write_value(vcpu);
            emulate_mmio_write(vcpu->vm, gpa, value);
        } else {
            uint64_t value = emulate_mmio_read(vcpu->vm, gpa);
            set_read_result(vcpu, value);
        }
        advance_rip(vcpu);
        return;
    }

    /* Check if page needs to be allocated (demand paging) */
    if (!ept_readable && !ept_writable && !ept_executable) {
        /* Page not present - allocate and map */
        uint64_t hpa = allocate_guest_page(vcpu->vm);
        uint64_t flags = EPT_READ | EPT_WRITE | EPT_EXECUTE | EPT_MT_WB;
        ept_map_page(vcpu->vm->ept, gpa, hpa, flags);
        /* Let guest retry the access */
        return;
    }

    /* Check for write to read-only page (e.g., dirty tracking) */
    if (is_write && ept_readable && !ept_writable) {
        /* Copy-on-write or dirty tracking */
        handle_write_protection(vcpu->vm, gpa);
        return;
    }

    /* Check for execute on non-executable page */
    if (is_fetch && !ept_executable) {
        /* Could be security violation or code injection attempt */
        inject_guest_exception(vcpu, EXC_GP, 0);
        return;
    }

    /* Unknown violation - this shouldn't happen */
    panic("Unexpected EPT violation: GPA=%lx qual=%lx", gpa, exit_qual);
}
NPT Page Fault Handling (AMD):
AMD's NPT faults are handled similarly but with different exit information format:
void handle_npt_fault(struct vcpu *vcpu) {
struct vmcb *vmcb = vcpu->vmcb;
uint64_t error_code = vmcb->control.exitinfo1;
uint64_t gpa = vmcb->control.exitinfo2;
bool present = error_code & (1 << 0);
bool write = error_code & (1 << 1);
bool user = error_code & (1 << 2);
bool reserved = error_code & (1 << 3);
bool fetch = error_code & (1 << 4);
/* Similar handling logic as EPT violations */
if (!present) {
/* Page not mapped - demand allocation */
allocate_and_map_page(vcpu->vm, gpa);
} else if (write) {
/* Write to read-only */
handle_write_protection(vcpu->vm, gpa);
}
}
EPT Misconfiguration vs. Violation:
Intel distinguishes between:
Misconfigurations indicate hypervisor bugs and should never occur in correct operation. They result in immediate VM exit with a different exit reason.
MMIO (Memory-Mapped I/O) regions like device registers must NOT be mapped in EPT/NPT. Accesses to these regions should cause violations so the hypervisor can emulate the device. Common MMIO regions include VGA framebuffer, APIC registers, and PCIe configuration space.
The Translation Lookaside Buffer (TLB) is critical for paging performance—it caches virtual-to-physical translations to avoid expensive page walks. With nested paging, TLB management becomes more complex because translations now span both guest and nested tables.
What's Cached in the TLB:
With EPT/NPT, the TLB caches the combined translation: GVA → HPA. This means a single TLB entry subsumes both translation levels. This is efficient for hits but complicates invalidation—changes to either guest or nested tables can invalidate cached entries.
VPID and ASID: Tagging TLB Entries:
Without tagging, every VM entry/exit would require a TLB flush because guest and host translations might conflict. VPID (Intel) and ASID (AMD) tag TLB entries with an identifier:
TLB Entry: [VPID/ASID | GVA | HPA | attributes]
VM 1: VPID=1, GVA 0x1000 → HPA 0x5000
VM 2: VPID=2, GVA 0x1000 → HPA 0x8000 (different VM, same GVA)
Hypervisor: VPID=0, VA 0x1000 → PA 0x2000
All three can coexist in TLB!
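The coexistence shown above falls out naturally once the tag participates in the TLB match. A toy software model (a hypothetical structure, not the real hardware layout) makes this concrete:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Toy model of a VPID/ASID-tagged TLB: the tag is part of the match,
 * so identical GVAs from different VMs coexist without flushes. */
struct tlb_entry { uint16_t vpid; uint64_t gva_page, hpa_page; bool valid; };

#define TLB_SIZE 8
static struct tlb_entry tlb[TLB_SIZE];

/* Hit only when both the tag and the GVA page match. */
bool tlb_lookup(uint16_t vpid, uint64_t gva_page, uint64_t *hpa_page) {
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpid == vpid && tlb[i].gva_page == gva_page) {
            *hpa_page = tlb[i].hpa_page;
            return true;
        }
    return false;
}

void tlb_fill(int slot, uint16_t vpid, uint64_t gva_page, uint64_t hpa_page) {
    tlb[slot] = (struct tlb_entry){ vpid, gva_page, hpa_page, true };
}
```

Filling entries for VPID 1 and VPID 2 at the same GVA page leaves both live; a lookup under an untagged VPID simply misses instead of returning a stale translation.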
INVEPT and INVVPID (Intel):
Intel provides specific instructions for EPT/VPID invalidation:
/* Invalidate all EPT translations for a specific EPTP */
void invept_single_context(uint64_t eptp) {
struct {
uint64_t eptp;
uint64_t reserved;
} descriptor = { eptp, 0 };
asm volatile("invept %0, %1"
: : "m"(descriptor), "r"(1ULL) /* Type 1 = single context */
: "memory");
}
/* Invalidate all EPT translations globally */
void invept_all_contexts(void) {
struct {
uint64_t eptp;
uint64_t reserved;
} descriptor = { 0, 0 };
asm volatile("invept %0, %1"
: : "m"(descriptor), "r"(2ULL) /* Type 2 = all contexts */
: "memory");
}
/* Invalidate single virtual address for a VPID */
void invvpid_individual_address(uint16_t vpid, uint64_t gva) {
struct {
uint64_t vpid;
uint64_t gva;
} descriptor = { vpid, gva };
asm volatile("invvpid %0, %1"
: : "m"(descriptor), "r"(0ULL) /* Type 0 = individual */
: "memory");
}
ASID and TLB Control (AMD):
AMD uses the TLB_CONTROL field in VMCB:
/* TLB control values for VMRUN */
#define TLB_CONTROL_DO_NOTHING 0 /* Preserve TLB entries */
#define TLB_CONTROL_FLUSH_ALL 1 /* Flush entire TLB (all ASIDs) */
#define TLB_CONTROL_FLUSH_ASID 3 /* Flush this guest's ASID only */
void set_guest_asid(struct vmcb *vmcb, uint32_t asid) {
vmcb->control.guest_asid = asid;
/* Typically flush on first use of new ASID */
vmcb->control.tlb_control = TLB_CONTROL_FLUSH_ASID;
}
void invalidate_guest_page(struct vmcb *vmcb, uint64_t gva) {
/* INVLPGA invalidates a single GVA for a given ASID */
/* (implicit operands: rAX = address, ECX = ASID) */
asm volatile("invlpga %0, %1"
: : "a"(gva), "c"(vmcb->control.guest_asid)
: "memory");
}
Hypervisors typically maintain a VPID/ASID pool and assign unique IDs to each vCPU. When IDs are exhausted, some must be recycled with a flush. Good ASID management is critical for performance—frequent flushes defeat the purpose of tagged TLBs.
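A minimal sketch of such a pool (hypothetical names, tiny MAX_ASID for illustration): IDs are handed out sequentially, and exhausting the pool bumps a generation counter, signaling that a full flush is required before IDs are reused:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define MAX_ASID 4                 /* real CPUs support far more */
static uint32_t next_asid = 1;     /* ASID 0 is reserved for the host */
static uint64_t generation = 1;

/* Allocate an ASID; on pool exhaustion, recycle and request a flush
 * so no stale translations survive under a reused ID. */
uint32_t alloc_asid(bool *needs_flush) {
    *needs_flush = false;
    if (next_asid > MAX_ASID) {
        next_asid = 1;
        generation++;
        *needs_flush = true;
    }
    return next_asid++;
}
```

Real hypervisors track the generation per vCPU so a stale vCPU knows its cached ASID is no longer valid, but the flush-on-recycle invariant is the same.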
EPT/NPT fundamentally changes the performance profile of virtualized memory access. Understanding these characteristics helps optimize VM configurations.
Comparison: Shadow Tables vs. EPT/NPT:
| Metric | Shadow Tables | EPT/NPT | Winner |
|---|---|---|---|
| TLB Hit Performance | Excellent (GVA→HPA direct) | Excellent (same) | Tie |
| TLB Miss Overhead | Standard page walk | 2D walk (higher latency) | Shadow |
| Page Table Updates | Exit + sync required | No exit needed | EPT/NPT |
| Context Switch Overhead | Shadow rebuild/flush | VPID/ASID tagging | EPT/NPT |
| Memory Overhead | Shadow tables per guest | One EPT/NPT per guest | EPT/NPT |
| Implementation Complexity | Very high | Moderate | EPT/NPT |
| Fork Performance | Many exits | Guest-only operation | EPT/NPT |
| mmap Performance | Exits for each mapping | No exits | EPT/NPT |
The TLB Miss Trade-off:
On a TLB miss with EPT/NPT, the 2-dimensional page walk accesses more memory than shadow tables would. Analysis:
Shadow Tables TLB Miss:
4 memory accesses (standard 4-level walk)
EPT/NPT TLB Miss:
Up to 24 memory accesses: each of the 5 GPAs in the guest walk (4 table pointers plus the data address) needs a 4-level nested walk in addition to the guest table reads, giving (4+1)×(4+1)−1 = 24
But consider:
- Modern CPUs have nested page table caches
- Large pages reduce both dimensions
- 24 accesses × 100ns = 2.4μs worst case
- One VM exit = 1,000-10,000 cycles ≈ 0.3-3μs
- Shadow update might trigger multiple exits
Optimization: Large Pages:
Using large pages (2MB or 1GB) in EPT/NPT reduces walk depth:
| Page Size | Guest Levels | EPT Levels | Max Accesses |
|---|---|---|---|
| 4KB | 4 | 4 | 5×4+4 = 24 |
| 2MB | 3 | 3 | 4×3+3 = 15 |
| 1GB | 2 | 2 | 3×2+2 = 8 |
Using 2MB pages throughout reduces worst-case from 24 to 15 accesses—a 37% improvement.
Workload-Dependent Performance:
Different workloads see different benefits from EPT/NPT:
Best for EPT/NPT: workloads that modify page tables frequently, such as fork-heavy servers, JIT compilers, and mmap-intensive databases, where every avoided exit is a win.
Neutral: compute-bound workloads with small, stable working sets that rarely miss the TLB or touch page tables.
Potentially Worse: workloads with extremely high TLB miss rates and almost no page table modifications, where the deeper two-dimensional walk dominates (the break-even case noted earlier).
Real-World Impact:
Benchmark: Linux kernel compile in VM
Shadow Tables:
- 847,000 VM exits for CR3 loads
- 2.1 million exits for page table writes
- Total: 3.2 million exits
- Build time: 142 seconds
With EPT:
- 0 exits for memory operations
- ~12,000 exits (I/O, interrupts only)
- Build time: 98 seconds
Improvement: 31% faster
Modern hypervisors always enable EPT/NPT when available. The performance benefits vastly outweigh the slightly deeper TLB miss path. Shadow page tables are now legacy, used only when hardware doesn't support nested paging or for specialized debugging scenarios.
Modern processors include advanced EPT/NPT features that enable sophisticated virtualization scenarios beyond basic memory translation.
Accessed and Dirty Bits:
When CPU support is present, EPT/NPT can set accessed and dirty bits automatically: the accessed bit on any access to a page, the dirty bit on writes (leaf entries only).
These bits are essential for: demand paging, working-set estimation, and dirty-page tracking for live migration and checkpointing.
/* Enable accessed/dirty bit support (Intel) */
void enable_ept_ad_bits(struct vm *vm) {
/* Check CPU capability first */
if (!cpu_has_ept_ad_bits())
return;
uint64_t eptp = vmread(VMCS_EPT_POINTER);
eptp |= EPT_POINTER_AD_ENABLE; /* Set bit 6 */
vmwrite(VMCS_EPT_POINTER, eptp);
}
/* Scan for dirty pages (for live migration) */
void scan_dirty_pages(struct vm *vm, uint64_t *dirty_bitmap) {
uint64_t *entry;
uint64_t gpa;
/* for_each_ept_leaf() is a hypothetical iterator over leaf entries */
for_each_ept_leaf(vm->ept, entry, gpa) {
if (*entry & EPT_DIRTY) {
set_bit(dirty_bitmap, gpa >> PAGE_SHIFT);
*entry &= ~EPT_DIRTY; /* Clear dirty bit for next iteration */
}
}
invept_single_context(vm->eptp); /* Flush TLB after clearing */
}
Page Modification Logging (PML):
PML is particularly powerful for live migration and checkpointing. Instead of scanning the entire EPT for dirty bits, hardware maintains a log:
/* PML setup */
void enable_pml(struct vm *vm) {
/* Allocate 512-entry PML buffer (4KB page) */
vm->pml_buffer = alloc_page();
vmwrite(VMCS_PML_ADDRESS, virt_to_phys(vm->pml_buffer));
/* Set initial PML index to 511 (grows downward) */
vmwrite(VMCS_PML_INDEX, 511);
/* Enable PML in secondary controls */
uint32_t ctrl = vmread(VMCS_SECONDARY_PROC_CONTROLS);
ctrl |= SECONDARY_EXEC_ENABLE_PML;
vmwrite(VMCS_SECONDARY_PROC_CONTROLS, ctrl);
}
/* Process dirty pages from PML buffer */
void process_pml_buffer(struct vm *vm) {
uint16_t pml_index = vmread(VMCS_PML_INDEX);
/* Entries from pml_index+1 to 511 are dirty GPAs */
for (int i = pml_index + 1; i < 512; i++) {
uint64_t dirty_gpa = vm->pml_buffer[i];
mark_page_dirty(vm, dirty_gpa);
}
/* Reset PML index */
vmwrite(VMCS_PML_INDEX, 511);
}
When the PML buffer fills, a VM exit occurs. The hypervisor drains the buffer and resumes the guest. This is far more efficient than page-table scanning for workloads with scattered dirty pages.
VMFUNC with EPT switching enables powerful isolation primitives. A guest can switch between different memory views (e.g., 'trusted' and 'untrusted' compartments) with a single instruction, without hypervisor involvement. This is used by Xen's altp2m and various intra-guest isolation and introspection systems.
Extended Page Tables (Intel EPT) and Nested Page Tables (AMD NPT) represent a major advancement in virtualization technology. By moving the second-level address translation into hardware, they eliminate the complexity and overhead of shadow page tables while enabling new capabilities impossible with software-only approaches.
What's Next:
In the next page, we'll explore I/O Virtualization (VT-d)—how hardware enables direct device assignment to VMs, DMA remapping for security, and interrupt remapping for isolation. I/O virtualization completes the hardware support picture, enabling near-native I/O performance for virtualized workloads.
You now understand EPT and NPT—the hardware memory virtualization technologies that enable efficient, low-overhead address translation for virtual machines. From two-dimensional page walks to TLB management and advanced features, you have the knowledge to understand modern hypervisor memory management.