Memory virtualization is one of the most challenging aspects of building a hypervisor. Every memory access a guest makes uses a guest virtual address (GVA). The guest's operating system translates this to a guest physical address (GPA) using its page tables. But the guest's 'physical' addresses aren't actually physical—they're abstracted by the hypervisor. The hypervisor must then translate GPAs to host physical addresses (HPA) that map to real RAM.
Before hardware-assisted nested paging, hypervisors maintained shadow page tables—complex software structures that collapsed both translation levels into direct GVA→HPA mappings. Shadow page tables worked, but they were expensive to maintain, triggered frequent VM exits on guest page table updates, and added significant hypervisor complexity.
Extended Page Tables (EPT) from Intel and Nested Page Tables (NPT) from AMD changed everything by moving the second-level translation into hardware.
By the end of this page, you will understand the two-dimensional page walk that EPT/NPT enables, the structure of nested page tables, how TLB caching works with two-level translation, EPT/NPT violations vs. traditional page faults, and the performance characteristics of hardware memory virtualization.
To understand EPT/NPT, we must first understand why memory virtualization is hard, and what shadow page tables attempted to solve.
Address Spaces in Virtualization:
| Address Type | Abbreviation | Description |
|---|---|---|
| Guest Virtual Address | GVA | Address used by guest applications and kernel |
| Guest Physical Address | GPA | What guest OS thinks is physical RAM |
| Host Virtual Address | HVA | Hypervisor's own virtual address space |
| Host Physical Address | HPA | Actual physical RAM addresses |
A guest application's memory access requires two translations: GVA → GPA through the guest's page tables, then GPA → HPA through the hypervisor's mapping.
Without hardware support, the CPU only knows about one page table hierarchy. It can translate GVA → HPA directly (if given the right page tables), but it cannot perform a two-step translation.
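The two-step translation can be sketched in miniature. This toy model is hypothetical: flat arrays stand in for the real multi-level page tables, but it shows the composition GVA → GPA → HPA that the hypervisor must somehow realize:

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)
#define NUM_PAGES  16

static uint64_t guest_pt[NUM_PAGES];  /* GVA page -> GPA page (guest-owned) */
static uint64_t nested_pt[NUM_PAGES]; /* GPA page -> HPA page (VMM-owned)   */

/* GVA -> GPA via the guest's tables, then GPA -> HPA via the VMM's. */
uint64_t translate_gva_to_hpa(uint64_t gva) {
    uint64_t gpa = (guest_pt[(gva >> PAGE_SHIFT) % NUM_PAGES] << PAGE_SHIFT)
                   | (gva & (PAGE_SIZE - 1));
    return (nested_pt[(gpa >> PAGE_SHIFT) % NUM_PAGES] << PAGE_SHIFT)
           | (gpa & (PAGE_SIZE - 1));
}
```

In a real system each arrow is itself a multi-level walk; the point here is only that two independent mappings compose, and the CPU historically could walk only one of them.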
The Shadow Page Table Approach (Pre-EPT/NPT):
Without hardware nested paging, hypervisors created shadow page tables that combined both translations into direct GVA→HPA mappings, kept coherent with the guest's tables by the hypervisor.
Problems with Shadow Page Tables: every guest page table update must be intercepted (costing a VM exit), each guest address space needs its own shadow copy (extra memory), and the synchronization logic adds significant hypervisor complexity.
Shadow Page Table Maintenance:
1. Guest writes to its page table
2. Write causes exit (page protected)
3. VMM reads guest page table entry
4. VMM looks up GPA→HPA translation
5. VMM creates/updates shadow entry: GVA→HPA
6. VMM unprotects page temporarily
7. VMM re-enters guest
This happens on EVERY guest page table modification!
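The maintenance cycle above can be modeled in a few lines. This is a deliberately simplified sketch (flat arrays, a stand-in gpa_to_hpa): every guest write to its page table forces the VMM to recompute the corresponding shadow entry:

```c
#include <stdint.h>
#include <assert.h>

#define N 8
static uint64_t guest_pt[N];   /* guest's own table: GVA page -> GPA page */
static uint64_t shadow_pt[N];  /* VMM-maintained:    GVA page -> HPA page */

/* Stand-in for the VMM's GPA->HPA physical map (step 4). */
static uint64_t gpa_to_hpa(uint64_t gpa_page) {
    return gpa_page + 100;
}

/* Steps 3-5: on a write-protection exit, resync one shadow entry. */
void sync_shadow_entry(unsigned idx) {
    shadow_pt[idx] = gpa_to_hpa(guest_pt[idx]);
}

/* Steps 1-2 modeled directly: in reality the write would trap,
 * and the VMM would perform the resync before resuming the guest. */
void guest_writes_pte(unsigned idx, uint64_t gpa_page) {
    guest_pt[idx] = gpa_page;
    sync_shadow_entry(idx);    /* VMM work incurred on every modification */
}
```

The cost is not the arithmetic but the trap: each resync in real hardware is a full VM exit and re-entry.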
Workloads with heavy memory allocation (database systems, JIT compilers, container orchestration) could see 10-40% overhead from shadow page table maintenance. A single fork() call might trigger thousands of shadow table updates. This made certain workloads impractical to virtualize efficiently.
Extended Page Tables (Intel) and Nested Page Tables (AMD) solve memory virtualization by adding a second-level address translation performed entirely in hardware. The guest page tables remain unmodified, and the CPU handles both translation levels automatically.
The Two-Dimensional Walk:
When a guest accesses memory, the CPU performs a coordinated walk of both page table hierarchies:
First dimension (Guest tables): Walk guest CR3 → PML4 → PDPT → PD → PT to translate GVA to GPA
Second dimension (Nested tables): For every guest physical address encountered during the walk (including page table pointers), translate GPA to HPA using the nested page tables
The guest page table walk alone accesses 4 memory locations (in 4-level paging). Each of those locations is addressed by a GPA that must itself be translated via the nested tables. Counting the final data address, a single guest memory access can trigger up to 24 memory references in the worst case!
Walk Complexity Analysis:
For 4-level paging: 4 guest levels and 4 nested levels yield a worst case of (4+1)×(4+1)−1 = 24 memory references, versus 4 for native paging.
This sounds expensive, but page-walk caches, large pages, and high TLB hit rates keep the average cost far lower, and no VM exits are required.
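The worst-case reference count follows a simple formula: the guest walk yields n table pointers plus the final data address, each a GPA costing an m-level nested walk, plus the n guest table reads themselves. A small helper (illustrative only) captures it:

```c
#include <assert.h>

/* (n + 1) GPAs to translate at m nested reads each, plus the n guest
 * table reads: (n + 1) * m + n = (n + 1) * (m + 1) - 1.
 * The final data access itself is not counted. */
int max_walk_accesses(int guest_levels, int nested_levels) {
    return (guest_levels + 1) * (nested_levels + 1) - 1;
}
```

Plugging in 4/4, 3/3, and 2/2 levels reproduces the 24, 15, and 8 figures used in the large-page comparison later in this page.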
Enabling EPT (Intel):
void enable_ept(struct vmcs *vmcs, uint64_t eptp) {
/* Set EPT pointer in VMCS */
/* Format: [Page walk length (3)] [Memory type] [Root table address] */
vmwrite(VMCS_EPT_POINTER,
(eptp & PAGE_MASK) | /* Root table physical addr */
(3 << 3) | /* 4-level walk (encoded as 3) */
(6)); /* Write-back memory type */
/* Enable EPT in secondary processor controls */
uint32_t secondary = vmread(VMCS_SECONDARY_PROC_CONTROLS);
secondary |= SECONDARY_EXEC_ENABLE_EPT;
vmwrite(VMCS_SECONDARY_PROC_CONTROLS, secondary);
}
Enabling NPT (AMD):
void enable_npt(struct vmcb *vmcb, uint64_t ncr3) {
/* Set nested CR3 (root of nested page tables) */
vmcb->control.nested_cr3 = ncr3;
/* Enable NPT */
vmcb->control.nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
}
EPT/NPT trades per-access latency (deeper page walks) for elimination of exit overhead. For most workloads, this is a massive win—shadow table exits are far more expensive than extra memory references. The break-even point is workloads with extremely high TLB miss rates and very few page table modifications.
EPT and NPT use page table structures similar to the standard x86-64 page tables, but with different entry formats designed for virtualization needs.
EPT Entry Format (Intel):
Each EPT entry is 64 bits with the following layout:
| Bits | Field | Description |
|---|---|---|
| 0 | R | Read access allowed |
| 1 | W | Write access allowed |
| 2 | X | Execute access allowed |
| 3-5 | Memory Type | EPT memory type (for leaf entries) |
| 6 | Ignore PAT | Ignore guest PAT settings |
| 7 | Large Page | Maps 2MB (PD) or 1GB (PDPT) page |
| 8 | Accessed | Hardware sets on access (if enabled) |
| 9 | Dirty | Hardware sets on write (if enabled) |
| 10-11 | Reserved | Must be 0 |
| 12-51 | Physical Address | Next table or mapped page |
| 52-62 | Reserved | Must be 0 |
| 63 | Suppress VE | Suppress #VE virtualization exception |
Key Differences from Standard x86 Page Tables:
Separate R/W/X bits: EPT has independent read, write, and execute permissions (standard x86 only has W and NX). This enables fine-grained memory protection.
No User/Supervisor distinction: EPT doesn't distinguish privilege levels—it applies to all guest accesses. Guest user/supervisor is handled by guest page tables.
Memory Type control: EPT can specify caching behavior (uncacheable, write-combining, write-back, etc.) replacing or complementing MTRR and PAT.
Accessed/Dirty bits: Optional (requires CPU support). When enabled, hardware tracks page access for demand paging and dirty tracking.
/* EPT Entry definitions */
#define EPT_READ (1ULL << 0)
#define EPT_WRITE (1ULL << 1)
#define EPT_EXECUTE (1ULL << 2)
#define EPT_MT_MASK (7ULL << 3) /* Memory type */
#define EPT_MT_UC (0ULL << 3) /* Uncacheable */
#define EPT_MT_WC (1ULL << 3) /* Write-combining */
#define EPT_MT_WB (6ULL << 3) /* Write-back */
#define EPT_IGNORE_PAT (1ULL << 6)
#define EPT_LARGE_PAGE (1ULL << 7)
#define EPT_ACCESSED (1ULL << 8)
#define EPT_DIRTY (1ULL << 9)
/* Create EPT entry for 4KB page (leaf entries need a memory type) */
uint64_t create_ept_pte(uint64_t hpa, uint64_t flags) {
return (hpa & PAGE_MASK) | flags | EPT_READ | EPT_WRITE | EPT_EXECUTE | EPT_MT_WB;
}
/* Create EPT entry for 2MB large page */
uint64_t create_ept_pde_2mb(uint64_t hpa, uint64_t flags) {
return (hpa & LARGE_PAGE_MASK) | flags | EPT_LARGE_PAGE |
EPT_READ | EPT_WRITE | EPT_EXECUTE | EPT_MT_WB;
}
NPT Entry Format (AMD):
AMD's NPT uses a format closer to standard x86 page tables:
| Bits | Field | Description |
|---|---|---|
| 0 | P | Present |
| 1 | R/W | Read/Write |
| 2 | U/S | User/Supervisor (must be set; nested walks are treated as user accesses) |
| 3 | PWT | Page Write-Through |
| 4 | PCD | Page Cache Disable |
| 5 | A | Accessed |
| 6 | D | Dirty (leaf entries only) |
| 7 | PS | Large page (2MB or 1GB) |
| 8-11 | Available | Software use |
| 12-51 | Physical Address | Next table or mapped page |
| 52-62 | Available | Software use |
| 63 | NX | No-Execute (if enabled) |
NPT uses the standard x86 Present/Read-Write model, making it simpler for developers familiar with x86 paging. The explicit read/write/execute granularity of EPT requires slightly more complex handling but enables additional security features.
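For comparison with the EPT helpers earlier, here is a sketch of building an NPT leaf entry from the bits in the table above (create_npt_pte is a hypothetical helper name). U/S is always set, since hardware treats nested page table walks as user-mode accesses:

```c
#include <stdint.h>
#include <assert.h>

#define NPT_PRESENT  (1ULL << 0)
#define NPT_WRITABLE (1ULL << 1)
#define NPT_USER     (1ULL << 2)   /* required: nested walks are user accesses */
#define NPT_LARGE    (1ULL << 7)
#define NPT_NX       (1ULL << 63)

/* Build a 4KB NPT leaf entry mapping a host physical page. */
uint64_t create_npt_pte(uint64_t hpa, int writable, int noexec) {
    uint64_t e = (hpa & 0x000FFFFFFFFFF000ULL) | NPT_PRESENT | NPT_USER;
    if (writable) e |= NPT_WRITABLE;
    if (noexec)   e |= NPT_NX;
    return e;
}
```

Unlike EPT, there is no independent read permission: a present NPT entry is always readable, matching the classic x86 model.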
/* Build nested page tables for a guest VM */
struct ept_tables {
    uint64_t *pml4; /* Level 4: 512 entries, 512GB each */
    uint64_t *pdpt; /* Level 3: 512 entries, 1GB each */
    uint64_t *pd;   /* Level 2: 512 entries, 2MB each */
    uint64_t *pt;   /* Level 1: 512 entries, 4KB each */
};

/* Identity map guest physical memory with EPT */
void build_ept_identity_map(struct ept_tables *ept, size_t mem_size) {
    /* Allocate and zero all page tables */
    ept->pml4 = alloc_page_zeroed();
    ept->pdpt = alloc_page_zeroed();

    /* Point PML4[0] to PDPT */
    ept->pml4[0] = virt_to_phys(ept->pdpt) |
                   EPT_READ | EPT_WRITE | EPT_EXECUTE;

    /* Map memory using 2MB large pages for efficiency */
    size_t pdpt_entries = (mem_size + (1ULL << 30) - 1) >> 30; /* 1GB chunks */
    for (size_t i = 0; i < pdpt_entries && i < 512; i++) {
        uint64_t *pd = alloc_page_zeroed();
        ept->pdpt[i] = virt_to_phys(pd) |
                       EPT_READ | EPT_WRITE | EPT_EXECUTE;

        /* Fill PD with 2MB large page entries */
        size_t pd_entries = min(512, (mem_size - i * (1ULL << 30)) >> 21);
        for (size_t j = 0; j < pd_entries; j++) {
            uint64_t gpa = (i << 30) | (j << 21);
            /* Identity map: GPA == HPA (for this example) */
            pd[j] = gpa | EPT_LARGE_PAGE |
                    EPT_READ | EPT_WRITE | EPT_EXECUTE | EPT_MT_WB;
        }
    }
}

/* Map a specific GPA to HPA */
void ept_map_page(struct ept_tables *ept, uint64_t gpa, uint64_t hpa,
                  uint64_t flags) {
    /* Extract indices for each level */
    uint64_t pml4_idx = (gpa >> 39) & 0x1FF;
    uint64_t pdpt_idx = (gpa >> 30) & 0x1FF;
    uint64_t pd_idx   = (gpa >> 21) & 0x1FF;
    uint64_t pt_idx   = (gpa >> 12) & 0x1FF;

    /* Walk/create page table hierarchy */
    uint64_t *pdpt = get_or_create_table(ept->pml4, pml4_idx);
    uint64_t *pd   = get_or_create_table(pdpt, pdpt_idx);
    uint64_t *pt   = get_or_create_table(pd, pd_idx);

    /* Set the final page table entry */
    pt[pt_idx] = (hpa & PAGE_MASK) | flags;
}
Using 2MB or 1GB large pages dramatically reduces EPT/NPT walk overhead: a 1GB page eliminates two levels of table lookup.
Modern hypervisors default to large pages for guest memory when possible, falling back to 4KB pages only when granular control is needed (e.g., MMIO regions).
When a guest memory access cannot be translated by the nested page tables, an EPT violation (Intel) or NPT page fault (AMD) occurs. This is distinct from a regular guest page fault—the guest's translation succeeded, but the hypervisor's translation failed.
Causes of EPT/NPT Violations:
EPT Violation Exit Information (Intel):
When an EPT violation occurs, the VMCS contains detailed information:
Exit Qualification: Bits indicating access type and EPT state
Guest Physical Address: The GPA that caused the violation
Guest Linear Address: The GVA that caused the access (if valid)
/* Handle EPT violation (Intel) */
void handle_ept_violation(struct vcpu *vcpu) {
    uint64_t exit_qual = vmread(VMCS_EXIT_QUALIFICATION);
    uint64_t gpa = vmread(VMCS_GUEST_PHYSICAL_ADDRESS);
    uint64_t gva = vmread(VMCS_GUEST_LINEAR_ADDRESS);

    bool is_read  = exit_qual & (1 << 0);
    bool is_write = exit_qual & (1 << 1);
    bool is_fetch = exit_qual & (1 << 2);
    bool ept_readable   = exit_qual & (1 << 3);
    bool ept_writable   = exit_qual & (1 << 4);
    bool ept_executable = exit_qual & (1 << 5);

    /* Check if this is an MMIO access */
    if (is_mmio_region(vcpu->vm, gpa)) {
        if (is_write) {
            uint64_t value = get_write_value(vcpu);
            emulate_mmio_write(vcpu->vm, gpa, value);
        } else {
            uint64_t value = emulate_mmio_read(vcpu->vm, gpa);
            set_read_result(vcpu, value);
        }
        advance_rip(vcpu);
        return;
    }

    /* Check if page needs to be allocated (demand paging) */
    if (!ept_readable && !ept_writable && !ept_executable) {
        /* Page not present - allocate and map */
        uint64_t hpa = allocate_guest_page(vcpu->vm);
        uint64_t flags = EPT_READ | EPT_WRITE | EPT_EXECUTE | EPT_MT_WB;
        ept_map_page(vcpu->vm->ept, gpa, hpa, flags);
        /* Let guest retry the access */
        return;
    }

    /* Check for write to read-only page (e.g., dirty tracking) */
    if (is_write && ept_readable && !ept_writable) {
        /* Copy-on-write or dirty tracking */
        handle_write_protection(vcpu->vm, gpa);
        return;
    }

    /* Check for execute on non-executable page */
    if (is_fetch && !ept_executable) {
        /* Could be security violation or code injection attempt */
        inject_guest_exception(vcpu, EXC_GP, 0);
        return;
    }

    /* Unknown violation - this shouldn't happen */
    panic("Unexpected EPT violation: GPA=%lx qual=%lx", gpa, exit_qual);
}
NPT Page Fault Handling (AMD):
AMD's NPT faults are handled similarly but with different exit information format:
void handle_npt_fault(struct vcpu *vcpu) {
struct vmcb *vmcb = vcpu->vmcb;
uint64_t error_code = vmcb->control.exitinfo1;
uint64_t gpa = vmcb->control.exitinfo2;
bool present = error_code & (1 << 0);
bool write = error_code & (1 << 1);
bool user = error_code & (1 << 2);
bool reserved = error_code & (1 << 3);
bool fetch = error_code & (1 << 4);
/* Similar handling logic as EPT violations */
if (!present) {
/* Page not mapped - demand allocation */
allocate_and_map_page(vcpu->vm, gpa);
} else if (write) {
/* Write to read-only */
handle_write_protection(vcpu->vm, gpa);
}
}
EPT Misconfiguration vs. Violation:
Intel distinguishes between:
Misconfigurations indicate hypervisor bugs and should never occur in correct operation. They result in immediate VM exit with a different exit reason.
MMIO (Memory-Mapped I/O) regions like device registers must NOT be mapped in EPT/NPT. Accesses to these regions should cause violations so the hypervisor can emulate the device. Common MMIO regions include VGA framebuffer, APIC registers, and PCIe configuration space.
The Translation Lookaside Buffer (TLB) is critical for paging performance—it caches virtual-to-physical translations to avoid expensive page walks. With nested paging, TLB management becomes more complex because translations now span both guest and nested tables.
What's Cached in the TLB:
With EPT/NPT, the TLB caches the combined translation: GVA → HPA. This means a single TLB entry subsumes both translation levels. This is efficient for hits but complicates invalidation—changes to either guest or nested tables can invalidate cached entries.
VPID and ASID: Tagging TLB Entries:
Without tagging, every VM entry/exit would require a TLB flush because guest and host translations might conflict. VPID (Intel) and ASID (AMD) tag TLB entries with an identifier:
TLB Entry: [VPID/ASID | GVA | HPA | attributes]
VM 1: VPID=1, GVA 0x1000 → HPA 0x5000
VM 2: VPID=2, GVA 0x1000 → HPA 0x8000 (different VM, same GVA)
Hypervisor: VPID=0, VA 0x1000 → PA 0x2000
All three can coexist in TLB!
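The coexistence shown above falls out naturally once the tag participates in the TLB match. A toy software model (a hypothetical structure, not the real hardware layout) makes this concrete:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Toy model of a VPID/ASID-tagged TLB: the tag is part of the match,
 * so identical GVAs from different VMs coexist without flushes. */
struct tlb_entry { uint16_t vpid; uint64_t gva_page, hpa_page; bool valid; };

#define TLB_SIZE 8
static struct tlb_entry tlb[TLB_SIZE];

/* Hit only when both the tag and the GVA page match. */
bool tlb_lookup(uint16_t vpid, uint64_t gva_page, uint64_t *hpa_page) {
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpid == vpid && tlb[i].gva_page == gva_page) {
            *hpa_page = tlb[i].hpa_page;
            return true;
        }
    return false;
}

void tlb_fill(int slot, uint16_t vpid, uint64_t gva_page, uint64_t hpa_page) {
    tlb[slot] = (struct tlb_entry){ vpid, gva_page, hpa_page, true };
}
```

Filling entries for VPID 1 and VPID 2 at the same GVA page leaves both live; a lookup under an untagged VPID simply misses instead of returning a stale translation.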
INVEPT and INVVPID (Intel):
Intel provides specific instructions for EPT/VPID invalidation:
/* Invalidate all EPT translations for a specific EPTP */
void invept_single_context(uint64_t eptp) {
struct {
uint64_t eptp;
uint64_t reserved;
} descriptor = { eptp, 0 };
asm volatile("invept %0, %1"
: : "m"(descriptor), "r"(1ULL) /* Type 1 = single context */
: "memory");
}
/* Invalidate all EPT translations globally */
void invept_all_contexts(void) {
struct {
uint64_t eptp;
uint64_t reserved;
} descriptor = { 0, 0 };
asm volatile("invept %0, %1"
: : "m"(descriptor), "r"(2ULL) /* Type 2 = all contexts */
: "memory");
}
/* Invalidate single virtual address for a VPID */
void invvpid_individual_address(uint16_t vpid, uint64_t gva) {
struct {
uint64_t vpid;
uint64_t gva;
} descriptor = { vpid, gva };
asm volatile("invvpid %0, %1"
: : "m"(descriptor), "r"(0ULL) /* Type 0 = individual */
: "memory");
}
ASID and TLB Control (AMD):
AMD uses the TLB_CONTROL field in VMCB:
/* TLB control values for VMRUN */
#define TLB_CONTROL_DO_NOTHING 0 /* Preserve TLB entries */
#define TLB_CONTROL_FLUSH_ALL 1 /* Flush entire TLB (all ASIDs) */
#define TLB_CONTROL_FLUSH_ASID 3 /* Flush this guest's ASID only */
void set_guest_asid(struct vmcb *vmcb, uint32_t asid) {
vmcb->control.guest_asid = asid;
/* Typically flush on first use of new ASID */
vmcb->control.tlb_control = TLB_CONTROL_FLUSH_ASID;
}
void invalidate_guest_page(struct vmcb *vmcb, uint64_t gva) {
/* INVLPGA invalidates a single GVA for a given ASID */
/* (implicit operands: rAX = address, ECX = ASID) */
asm volatile("invlpga %0, %1"
: : "a"(gva), "c"(vmcb->control.guest_asid)
: "memory");
}
Hypervisors typically maintain a VPID/ASID pool and assign unique IDs to each vCPU. When IDs are exhausted, some must be recycled with a flush. Good ASID management is critical for performance—frequent flushes defeat the purpose of tagged TLBs.
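A minimal sketch of such a pool (hypothetical names, tiny MAX_ASID for illustration): IDs are handed out sequentially, and exhausting the pool bumps a generation counter, signaling that a full flush is required before IDs are reused:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define MAX_ASID 4                 /* real CPUs support far more */
static uint32_t next_asid = 1;     /* ASID 0 is reserved for the host */
static uint64_t generation = 1;

/* Allocate an ASID; on pool exhaustion, recycle and request a flush
 * so no stale translations survive under a reused ID. */
uint32_t alloc_asid(bool *needs_flush) {
    *needs_flush = false;
    if (next_asid > MAX_ASID) {
        next_asid = 1;
        generation++;
        *needs_flush = true;
    }
    return next_asid++;
}
```

Real hypervisors track the generation per vCPU so a stale vCPU knows its cached ASID is no longer valid, but the flush-on-recycle invariant is the same.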
EPT/NPT fundamentally changes the performance profile of virtualized memory access. Understanding these characteristics helps optimize VM configurations.
Comparison: Shadow Tables vs. EPT/NPT:
| Metric | Shadow Tables | EPT/NPT | Winner |
|---|---|---|---|
| TLB Hit Performance | Excellent (GVA→HPA direct) | Excellent (same) | Tie |
| TLB Miss Overhead | Standard page walk | 2D walk (higher latency) | Shadow |
| Page Table Updates | Exit + sync required | No exit needed | EPT/NPT |
| Context Switch Overhead | Shadow rebuild/flush | VPID/ASID tagging | EPT/NPT |
| Memory Overhead | Shadow tables per guest | One EPT/NPT per guest | EPT/NPT |
| Implementation Complexity | Very high | Moderate | EPT/NPT |
| Fork Performance | Many exits | Guest-only operation | EPT/NPT |
| mmap Performance | Exits for each mapping | No exits | EPT/NPT |
The TLB Miss Trade-off:
On a TLB miss with EPT/NPT, the 2-dimensional page walk accesses more memory than shadow tables would. Analysis:
Shadow Tables TLB Miss:
4 memory accesses (standard 4-level walk)
EPT/NPT TLB Miss:
Up to 24 memory accesses: each of the 5 GPAs in the guest walk (4 table pointers plus the data address) needs a 4-level nested walk in addition to the guest table reads, giving (4+1)×(4+1)−1 = 24
But consider:
- Modern CPUs have nested page table caches
- Large pages reduce both dimensions
- 24 accesses × 100ns = 2.4μs worst case
- One VM exit = 1,000-10,000 cycles ≈ 0.3-3μs
- Shadow update might trigger multiple exits
Optimization: Large Pages:
Using large pages (2MB or 1GB) in EPT/NPT reduces walk depth:
| Page Size | Guest Levels | EPT Levels | Max Accesses |
|---|---|---|---|
| 4KB | 4 | 4 | 5×4+4 = 24 |
| 2MB | 3 | 3 | 4×3+3 = 15 |
| 1GB | 2 | 2 | 3×2+2 = 8 |
Using 2MB pages throughout reduces worst-case from 24 to 15 accesses—a 37% improvement.
Workload-Dependent Performance:
Different workloads see different benefits from EPT/NPT:
Best for EPT/NPT: workloads that modify page tables frequently, such as fork-heavy servers, JIT compilers, and mmap-intensive databases, where every avoided exit is a win.
Neutral: compute-bound workloads with small, stable working sets that rarely miss the TLB or touch page tables.
Potentially Worse: workloads with extremely high TLB miss rates and almost no page table modifications, where the deeper two-dimensional walk dominates (the break-even case noted earlier).
Real-World Impact:
Benchmark: Linux kernel compile in VM
Shadow Tables:
- 847,000 VM exits for CR3 loads
- 2.1 million exits for page table writes
- Total: 3.2 million exits
- Build time: 142 seconds
With EPT:
- 0 exits for memory operations
- ~12,000 exits (I/O, interrupts only)
- Build time: 98 seconds
Improvement: 31% faster
Modern hypervisors always enable EPT/NPT when available. The performance benefits vastly outweigh the slightly deeper TLB miss path. Shadow page tables are now legacy, used only when hardware doesn't support nested paging or for specialized debugging scenarios.
Modern processors include advanced EPT/NPT features that enable sophisticated virtualization scenarios beyond basic memory translation.
Accessed and Dirty Bits:
When CPU support is present, EPT/NPT can set accessed and dirty bits automatically: the accessed bit on any access to a page, the dirty bit on writes (leaf entries only).
These bits are essential for: demand paging, working-set estimation, and dirty-page tracking for live migration and checkpointing.
/* Enable accessed/dirty bit support (Intel) */
void enable_ept_ad_bits(struct vm *vm) {
/* Check CPU capability first */
if (!cpu_has_ept_ad_bits())
return;
uint64_t eptp = vmread(VMCS_EPT_POINTER);
eptp |= EPT_POINTER_AD_ENABLE; /* Set bit 6 */
vmwrite(VMCS_EPT_POINTER, eptp);
}
/* Scan for dirty pages (for live migration) */
void scan_dirty_pages(struct vm *vm, uint64_t *dirty_bitmap) {
uint64_t *entry;
uint64_t gpa;
/* for_each_ept_leaf() is a hypothetical iterator over leaf entries */
for_each_ept_leaf(vm->ept, entry, gpa) {
if (*entry & EPT_DIRTY) {
set_bit(dirty_bitmap, gpa >> PAGE_SHIFT);
*entry &= ~EPT_DIRTY; /* Clear dirty bit for next iteration */
}
}
invept_single_context(vm->eptp); /* Flush TLB after clearing */
}
Page Modification Logging (PML):
PML is particularly powerful for live migration and checkpointing. Instead of scanning the entire EPT for dirty bits, hardware maintains a log:
/* PML setup */
void enable_pml(struct vm *vm) {
/* Allocate 512-entry PML buffer (4KB page) */
vm->pml_buffer = alloc_page();
vmwrite(VMCS_PML_ADDRESS, virt_to_phys(vm->pml_buffer));
/* Set initial PML index to 511 (grows downward) */
vmwrite(VMCS_PML_INDEX, 511);
/* Enable PML in secondary controls */
uint32_t ctrl = vmread(VMCS_SECONDARY_PROC_CONTROLS);
ctrl |= SECONDARY_EXEC_ENABLE_PML;
vmwrite(VMCS_SECONDARY_PROC_CONTROLS, ctrl);
}
/* Process dirty pages from PML buffer */
void process_pml_buffer(struct vm *vm) {
uint16_t pml_index = vmread(VMCS_PML_INDEX);
/* Entries from pml_index+1 to 511 are dirty GPAs */
for (int i = pml_index + 1; i < 512; i++) {
uint64_t dirty_gpa = vm->pml_buffer[i];
mark_page_dirty(vm, dirty_gpa);
}
/* Reset PML index */
vmwrite(VMCS_PML_INDEX, 511);
}
When the PML buffer fills, a VM exit occurs. The hypervisor drains the buffer and resumes the guest. This is far more efficient than page-table scanning for workloads with scattered dirty pages.
VMFUNC with EPT switching enables powerful isolation primitives. A guest can switch between different memory views (e.g., 'trusted' and 'untrusted' compartments) with a single instruction, without hypervisor involvement. This is used by Xen's altp2m and various intra-guest isolation and introspection systems.
Extended Page Tables (Intel EPT) and Nested Page Tables (AMD NPT) represent a major advancement in virtualization technology. By moving the second-level address translation into hardware, they eliminate the complexity and overhead of shadow page tables while enabling new capabilities impossible with software-only approaches.
What's Next:
In the next page, we'll explore I/O Virtualization (VT-d)—how hardware enables direct device assignment to VMs, DMA remapping for security, and interrupt remapping for isolation. I/O virtualization completes the hardware support picture, enabling near-native I/O performance for virtualized workloads.
You now understand EPT and NPT—the hardware memory virtualization technologies that enable efficient, low-overhead address translation for virtual machines. From two-dimensional page walks to TLB management and advanced features, you have the knowledge to understand modern hypervisor memory management.