For decades, virtualization on x86 processors required complex software workarounds. The x86 architecture was never designed with virtualization in mind, and certain instructions simply didn't behave correctly when a hypervisor tried to intercept them. This changed dramatically in 2005-2006 when Intel and AMD introduced hardware virtualization extensions—dedicated silicon features specifically designed to make virtualization efficient, secure, and correct.
Today's virtualization story is fundamentally a hardware story. Without VT-x, AMD-V, EPT, NPT, VT-d, and AMD-Vi, the cloud computing revolution would have been far slower and more expensive. Understanding these hardware features is essential for anyone working with virtual machines, whether deploying cloud infrastructure, developing hypervisors, or debugging virtualization-related issues.
By the end of this page, you will understand Intel VT-x and AMD-V CPU extensions, Virtual Machine Control Structures (VMCS), hardware-assisted memory virtualization (EPT/NPT), I/O virtualization (VT-d/AMD-Vi), interrupt virtualization, and the specialized instructions and transitions involved, such as VMLAUNCH, VMRESUME, and VM exits.
To appreciate hardware virtualization, let's first understand the problem it solved.
The Popek and Goldberg Criteria (1974):
Gerald Popek and Robert Goldberg formalized requirements for efficient virtualization. An architecture is classically virtualizable if:
All sensitive instructions are privileged: Any instruction that could observe or modify the machine's true state must trap when executed in user mode.
Complete isolation: The VMM has complete control over system resources.
Efficiency: Innocuous instructions execute directly on hardware without VMM intervention.
x86's Violation:
The x86 architecture violated these requirements. Several instructions behaved differently in user mode vs kernel mode without trapping, making it impossible for a hypervisor to intercept and emulate them:
| Instruction | Expected Behavior | Actual Behavior in User Mode |
|---|---|---|
| SGDT (Store GDT) | Should trap to VMM | Silently returns current GDT address |
| SIDT (Store IDT) | Should trap to VMM | Silently returns current IDT address |
| SLDT (Store LDT) | Should trap to VMM | Silently returns current LDT selector |
| PUSHF/POPF | Should trap if modifying IF | IF modifications silently ignored |
| LAR, LSL, VERR, VERW | Should trap | Execute with different semantics |
| CALL, JMP (far) | May need VMM intervention | Complex segment checking issues |
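The problem is easy to demonstrate. The small program below (x86-64, GCC/Clang inline assembly) executes SGDT in user mode and prints the real GDT base, with no trap a classic trap-and-emulate hypervisor could catch. This is a minimal illustrative sketch; on recent CPUs and kernels with UMIP enabled, the instruction may instead fault or return a dummy value.

```c
/* Minimal sketch: SGDT is not privileged, so it runs at ring 3 and
 * silently stores the current GDT base and limit. (With UMIP enabled,
 * modern CPUs/kernels may fault or emulate this instead.) */
#include <stdio.h>
#include <stdint.h>

struct __attribute__((packed)) desc_table_ptr {
    uint16_t limit;
    uint64_t base;
};

int main(void) {
    struct desc_table_ptr gdt = {0};

    __asm__ volatile("sgdt %0" : "=m"(gdt));   /* no trap, no VM exit */

    printf("GDT base: 0x%016llx, limit: %u\n",
           (unsigned long long)gdt.base, gdt.limit);
    return 0;
}
```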
Software Workarounds:
Binary Translation (VMware): VMware's solution was to scan guest code for problematic instructions and rewrite them at runtime. Sensitive instructions were replaced with calls to VMM handlers. This approach worked but was complex (millions of lines of code) and added overhead.
Paravirtualization (Xen): Xen's approach was to modify the guest operating system to never execute problematic instructions. Guests would use hypercalls (explicit calls to the hypervisor) instead of sensitive instructions. This required guest modifications but offered good performance.
Neither approach was ideal: binary translation was complex and added runtime overhead, while paravirtualization required modified guest operating systems. The industry needed a hardware solution.
Intel VT-x (Virtualization Technology for x86), codenamed "Vanderpool," was introduced in 2005 with the Pentium 4 662/672 processors. It fundamentally changed how virtualization works on x86.
Core Concepts:
VMX Operation Modes: VT-x introduces two new CPU operation modes:
VMX Root Mode: Where the hypervisor runs. The VMM has full control over the processor and can configure virtualization settings.
VMX Non-Root Mode: Where guest VMs run. Instructions and events that require hypervisor intervention automatically cause VM exits.
Transitions: a VM entry (VMLAUNCH or VMRESUME) switches the processor from root mode to non-root mode and begins executing the guest; a VM exit switches back to root mode and transfers control to the hypervisor's exit handler.
The Virtual Machine Control Structure (VMCS):
The VMCS is a hardware-managed data structure that controls VMX operation. It contains:
Guest-State Area: the guest's registers, segment descriptors, and control registers, saved automatically on VM exit and reloaded on VM entry.
Host-State Area: the hypervisor's state (RIP, RSP, CR3, segments) that the processor loads on every VM exit.
VM-Execution Control Fields: settings and bitmaps that determine which guest instructions and events cause VM exits.
VM-Exit Information Fields: read-only fields describing the most recent VM exit (exit reason, exit qualification, instruction details).
```c
// Conceptual VMCS operations (highly simplified)
// In reality, accessed via VMREAD/VMWRITE instructions

// Initialize VMCS for a virtual CPU
void setup_vmcs(struct vcpu *vcpu) {
    // Guest state - what guest sees
    vmwrite(GUEST_CR0, vcpu->cr0);
    vmwrite(GUEST_CR3, vcpu->cr3);          // Guest page tables
    vmwrite(GUEST_CR4, vcpu->cr4);
    vmwrite(GUEST_RSP, vcpu->rsp);
    vmwrite(GUEST_RIP, vcpu->rip);          // Guest instruction pointer
    vmwrite(GUEST_RFLAGS, vcpu->rflags);

    // Guest segments (CS, DS, SS, ES, FS, GS)
    vmwrite(GUEST_CS_SELECTOR, vcpu->cs.selector);
    vmwrite(GUEST_CS_BASE, vcpu->cs.base);
    vmwrite(GUEST_CS_LIMIT, vcpu->cs.limit);
    vmwrite(GUEST_CS_ACCESS_RIGHTS, vcpu->cs.access);
    // ... same for DS, SS, ES, FS, GS

    // Host state - where to return on VM exit
    vmwrite(HOST_CR0, read_cr0());
    vmwrite(HOST_CR3, read_cr3());          // Host page tables
    vmwrite(HOST_CR4, read_cr4());
    vmwrite(HOST_RSP, (u64)vcpu->host_stack);
    vmwrite(HOST_RIP, (u64)vm_exit_handler);

    // Execution controls - what causes VM exits
    vmwrite(PRIMARY_VM_EXEC_CONTROLS,
            CPU_BASED_HLT_EXITING |          // Exit on HLT
            CPU_BASED_IO_EXITING |           // Exit on I/O
            CPU_BASED_MSR_BITMAPS |          // Use MSR bitmap
            CPU_BASED_ACTIVATE_SECONDARY);
    vmwrite(SECONDARY_VM_EXEC_CONTROLS,
            CPU_BASED_ENABLE_EPT |           // Enable EPT
            CPU_BASED_UNRESTRICTED_GUEST);   // Real mode support
}

// Enter guest execution
void run_guest(struct vcpu *vcpu) {
    // VMLAUNCH for first entry, VMRESUME for subsequent
    if (vcpu->launched) {
        vmresume();
    } else {
        vcpu->launched = true;
        vmlaunch();
    }

    // Control returns here after VM exit
    handle_vmexit(vcpu);
}
```

AMD-V (AMD Virtualization), codenamed "Pacifica," was introduced in 2006 with AMD's Athlon 64 processors. While conceptually similar to Intel VT-x, AMD-V uses different terminology and data structures.
Key Concepts:
Secure Virtual Machine (SVM): AMD-V is also known as SVM. Like Intel's VMX, it provides separate execution modes for hypervisor and guest.
Virtual Machine Control Block (VMCB): AMD's equivalent to Intel's VMCS. A 4KB structure containing guest state, control bits, and exit information. Unlike VMCS (which uses special VMREAD/VMWRITE instructions), VMCB fields are accessed directly via normal memory operations.
Key Instructions: VMRUN enters the guest described by a VMCB; VMSAVE and VMLOAD save and restore additional processor state not handled automatically by VMRUN; VMMCALL lets a guest explicitly call into the hypervisor; STGI and CLGI set and clear the global interrupt flag around world switches.
Comparison with Intel VT-x:
| Feature | Intel VT-x | AMD-V (SVM) |
|---|---|---|
| Enable instruction | VMXON | EFER.SVME = 1 |
| Control structure | VMCS (special access) | VMCB (regular memory access) |
| Enter guest | VMLAUNCH / VMRESUME | VMRUN |
| Exit to host | Automatic VM Exit | #VMEXIT exception |
| Memory virtualization | EPT (Extended Page Tables) | NPT (Nested Page Tables) |
| I/O virtualization | VT-d | AMD-Vi (IOMMU) |
| State save/restore | Automatic via VMCS | VMSAVE/VMLOAD |
| Intercept control | VM-Execution controls | VMCB intercept bits |
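To make the contrast concrete, here is a hedged sketch of VMCB setup in C. The structure layout and intercept bit names are simplified placeholders rather than the real offsets from AMD's Architecture Programmer's Manual; the point is that everything is configured with ordinary memory writes instead of VMREAD/VMWRITE.

```c
/* Conceptual VMCB setup (field layout and intercept bit positions are
 * illustrative, not AMD's actual 4KB VMCB encoding). */
#include <stdint.h>

#define INTERCEPT_CPUID  (1ULL << 0)   /* illustrative bit positions */
#define INTERCEPT_HLT    (1ULL << 1)
#define INTERCEPT_IOIO   (1ULL << 2)
#define INTERCEPT_MSR    (1ULL << 3)

struct vmcb_control {                  /* control area */
    uint64_t intercepts;               /* which guest actions trap */
    uint64_t iopm_base_pa;             /* I/O permission bitmap */
    uint64_t msrpm_base_pa;            /* MSR permission bitmap */
    uint32_t asid;                     /* TLB address-space ID */
    uint64_t nested_cr3;               /* NPT root for this guest */
};

struct vmcb_save {                     /* guest state save area */
    uint64_t cr0, cr3, cr4, efer;
    uint64_t rip, rsp, rflags;
    /* ... segment registers, etc. ... */
};

struct vmcb {                          /* one 4KB page per vCPU */
    struct vmcb_control control;
    struct vmcb_save    save;
};

void setup_vmcb(struct vmcb *vmcb, const struct vmcb_save *guest_state,
                uint64_t npt_root_pa) {
    /* Plain stores -- no special instructions needed to edit the VMCB */
    vmcb->control.intercepts = INTERCEPT_CPUID | INTERCEPT_HLT |
                               INTERCEPT_IOIO  | INTERCEPT_MSR;
    vmcb->control.nested_cr3 = npt_root_pa;  /* enable nested paging */
    vmcb->control.asid       = 1;            /* guest ASIDs must be non-zero */
    vmcb->save = *guest_state;               /* initial guest registers */
    /* The hypervisor then executes VMRUN with the VMCB's physical
     * address in RAX to enter the guest. */
}
```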
AMD-V Unique Features:
ASID (Address Space ID): AMD-V includes ASID tags in TLB entries, allowing multiple VMs' translations to coexist in the TLB without flushing on every VM switch. This reduces the performance cost of context switches between VMs.
Clean Bits: The VMCB includes "clean bits" that indicate which state areas have been modified since the last VM exit. The processor can skip loading unmodified state, speeding up VM entries.
Decode Assists: For some VM exits (like string I/O operations), AMD-V provides decoded instruction information in the VMCB, reducing the need for the hypervisor to decode instructions itself.
Practical Differences:
For hypervisor developers, the choice between VT-x and AMD-V is usually academic—you support both. The concepts are parallel, and hypervisors like KVM, Xen, and even VMware abstract these differences behind common interfaces. The key architectural insight is the same: dedicated CPU modes for hypervisor and guest, with automatic transitions on sensitive operations.
On Linux, check /proc/cpuinfo for 'vmx' (Intel) or 'svm' (AMD). On Windows, use systeminfo and look for 'Virtualization Enabled'. Most CPUs manufactured after 2010 support hardware virtualization, though it may be disabled in BIOS/UEFI settings.
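The CPUID instruction reports the same capability directly: leaf 1 sets ECX bit 5 for VMX, and leaf 0x80000001 sets ECX bit 2 for SVM. A small sketch using GCC/Clang's cpuid.h follows; a set bit only means the CPU implements the feature, not that firmware has left it enabled.

```c
/* Check for hardware virtualization support via CPUID (GCC/Clang, x86-64).
 * BIOS/UEFI may still have the feature disabled even if the bit is set. */
#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.01H:ECX[5] = VMX (Intel VT-x) */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
        puts("Intel VT-x (VMX) supported");

    /* CPUID.80000001H:ECX[2] = SVM (AMD-V) */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
        puts("AMD-V (SVM) supported");

    return 0;
}
```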
VM exits and entries are the fundamental transitions between guest and hypervisor execution. Understanding their causes and costs is essential for virtualization performance.
What Causes VM Exits:
VM exits are configured by the hypervisor and triggered by specific guest operations: sensitive instructions such as CPUID, HLT, and RDMSR/WRMSR (subject to the MSR bitmaps), port and memory-mapped I/O to emulated devices, certain control register accesses, EPT/NPT violations, external interrupts, and explicit hypercalls (VMCALL/VMMCALL).
The Anatomy of a VM Exit:
Guest executing in VMX non-root mode
│
▼
┌───────────────┐
│ Trigger event │ (e.g., I/O instruction, CPUID)
└───────────────┘
│
▼
┌───────────────┐
│ Save guest │ CPU state → VMCS guest-state area
│ state │ (RIP, RSP, RFLAGS, segments, etc.)
└───────────────┘
│
▼
┌───────────────┐
│ Record exit │ Exit reason, qualification, instruction info
│ information │
└───────────────┘
│
▼
┌───────────────┐
│ Load host │ VMCS host-state area → CPU state
│ state │ (switches to host page tables, stack)
└───────────────┘
│
▼
┌───────────────┐
│ Execute at │ Hypervisor VM exit handler runs
│ HOST_RIP │ (in VMX root mode)
└───────────────┘
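From HOST_RIP, the hypervisor dispatches on the recorded exit reason, emulates or services the operation, and re-enters the guest. The sketch below uses VMCS field encodings and basic exit-reason numbers from the Intel SDM, but the vcpu type, the vmread/vmwrite wrappers, and the emulate_* helpers are assumed placeholders rather than any particular hypervisor's API.

```c
/* Hedged sketch of a VM-exit dispatch routine on Intel VT-x. */
#include <stdint.h>

/* VMCS field encodings (Intel SDM Vol. 3, Appendix B) */
#define VM_EXIT_REASON       0x4402
#define VM_EXIT_INSTR_LEN    0x440C
#define EXIT_QUALIFICATION   0x6400
#define GUEST_PHYSICAL_ADDR  0x2400
#define GUEST_RIP            0x681E

/* Basic exit reasons (Intel SDM Vol. 3, Appendix C) */
enum { EXIT_CPUID = 10, EXIT_HLT = 12, EXIT_IO = 30, EXIT_EPT_VIOLATION = 48 };

struct vcpu;                                       /* assumed */
uint64_t vmread(uint64_t field);                   /* wraps VMREAD */
void     vmwrite(uint64_t field, uint64_t value);  /* wraps VMWRITE */
void emulate_cpuid(struct vcpu *vcpu);
void emulate_port_io(struct vcpu *vcpu, uint64_t qualification);
void handle_ept_violation(struct vcpu *vcpu, uint64_t guest_phys);
void halt_until_interrupt(struct vcpu *vcpu);

/* Advance the guest past the instruction that triggered the exit */
static void skip_emulated_instruction(void) {
    vmwrite(GUEST_RIP, vmread(GUEST_RIP) + vmread(VM_EXIT_INSTR_LEN));
}

void handle_vmexit(struct vcpu *vcpu) {
    uint16_t reason = vmread(VM_EXIT_REASON) & 0xFFFF;  /* low 16 bits */

    switch (reason) {
    case EXIT_CPUID:
        emulate_cpuid(vcpu);                     /* fill guest RAX..RDX */
        skip_emulated_instruction();
        break;
    case EXIT_HLT:
        skip_emulated_instruction();
        halt_until_interrupt(vcpu);              /* yield the physical CPU */
        break;
    case EXIT_IO:
        emulate_port_io(vcpu, vmread(EXIT_QUALIFICATION));
        skip_emulated_instruction();
        break;
    case EXIT_EPT_VIOLATION:                     /* guest retries the access */
        handle_ept_violation(vcpu, vmread(GUEST_PHYSICAL_ADDR));
        break;
    default:
        break;                                   /* unhandled: stop or inject */
    }
    /* the caller then re-enters the guest with VMRESUME */
}
```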
VM Exit Cost:
VM exits are expensive—typically hundreds to thousands of CPU cycles:
| Component | Approximate Cycles | Notes |
|---|---|---|
| Guest state save | 200-400 cycles | Saving registers to VMCS |
| State checks | 100-200 cycles | Validating VMCS consistency |
| Host state load | 200-400 cycles | Loading host registers |
| TLB/cache effects | Variable | May need to flush TLB |
| Total round-trip | 1000-3000 cycles | Exit + handler + entry |
VM Entry Process:
VM entry (VMLAUNCH or VMRESUME) performs the reverse: the processor validates the VMCS and the host and guest state, loads the guest-state area into the CPU, injects any pending virtual interrupt or exception specified by the hypervisor, and resumes guest execution at the guest RIP in non-root mode.
Minimizing VM Exits:
Hypervisor optimization often focuses on reducing VM exit frequency: using MSR and I/O bitmaps so that only selected accesses exit, relying on EPT/NPT instead of trapping guest page table updates, replacing emulated devices with paravirtualized (virtio-style) devices that batch work, and using APIC virtualization and posted interrupts to avoid exits for interrupt delivery.
Linux's perf tool can profile VM exit reasons: 'perf kvm stat record' and 'perf kvm stat report' show which exits are most frequent. High exit counts for specific reasons indicate optimization opportunities.
Memory virtualization was historically one of the most expensive aspects of virtualization: the hypervisor had to trap guest page table updates, CR3 writes, and page faults to keep its shadow page tables consistent. Extended Page Tables (EPT) (Intel) and Nested Page Tables (NPT) (AMD) solved this by adding a second level of address translation directly in hardware.
The Two-Dimensional Address Translation:
Without EPT/NPT: the hypervisor maintains shadow page tables that map guest virtual addresses directly to host physical addresses, and must intercept every guest page table update, CR3 write, and page fault to keep them in sync.
With EPT/NPT: the hardware performs both translations itself. The guest's own page tables map guest virtual addresses (GVA) to guest physical addresses (GPA), and hypervisor-managed EPT/NPT tables map GPA to host physical addresses (HPA). The guest manages its page tables freely, with no VM exits for ordinary memory management.
Page Walk Overhead:
A 4-level page table walk (standard for 64-bit) normally requires up to 4 memory accesses (plus the final data access). With EPT/NPT, each of those guest page table accesses uses a guest physical address that must itself be translated through the EPT/NPT tables, and so does the final data address:
Worst case: 4 × (1 + 4) + 4 = 24 memory accesses to complete one translation
| Walk Step | Guest Page Walk | EPT Walk for Each | Total |
|---|---|---|---|
| PML4 access | 1 memory access | 4 memory accesses | 5 |
| PDPT access | 1 memory access | 4 memory accesses | 5 |
| PD access | 1 memory access | 4 memory accesses | 5 |
| PT access | 1 memory access | 4 memory accesses | 5 |
| Final data GPA → HPA | — | 4 memory accesses | 4 |
| Total | 4 | 20 | 24 |
Why it's still fast:
Despite the theoretical overhead, EPT/NPT is extremely fast in practice because:
TLB Caching: Successful translations are cached in the TLB. Most accesses hit the TLB, skipping the page walk entirely.
Page Walk Caches: Modern CPUs cache intermediate page walk results. A translation of a nearby address may already have cached paging structure entries.
Large Pages: Using 2MB or 1GB pages reduces page table depth, cutting the number of accesses.
No VM Exits: Even a full 20-access walk is faster than a VM exit (1000+ cycles) that shadow paging would require.
EPT Pointer (EPTP):
The hypervisor sets the EPT pointer in the VMCS, pointing to the root of the EPT page tables for each guest. When the guest runs, the CPU uses this EPT for all GPA → HPA translations.
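Building the EPT itself looks much like building ordinary x86 page tables. The hedged sketch below installs a single 2MB mapping; the permission and memory-type bits follow the Intel SDM's EPT entry format, while alloc_zeroed_page, virt_to_phys, and phys_to_virt are assumed helpers.

```c
/* Hedged sketch: install one 2MB GPA -> HPA mapping in a 4-level EPT.
 * EPT entry bits: 0 = read, 1 = write, 2 = execute, 5:3 = memory type
 * (leaf entries only), 7 = large page. */
#include <stdint.h>

#define EPT_READ     (1ULL << 0)
#define EPT_WRITE    (1ULL << 1)
#define EPT_EXEC     (1ULL << 2)
#define EPT_MT_WB    (6ULL << 3)          /* write-back caching, leaf only */
#define EPT_2MB_PAGE (1ULL << 7)
#define EPT_RWX      (EPT_READ | EPT_WRITE | EPT_EXEC)
#define ADDR_MASK    0x000FFFFFFFFFF000ULL

void    *alloc_zeroed_page(void);         /* assumed: zeroed 4KB table */
uint64_t virt_to_phys(void *va);          /* assumed address conversions */
void    *phys_to_virt(uint64_t pa);

void ept_map_2mb(uint64_t *ept_pml4, uint64_t gpa, uint64_t hpa) {
    uint64_t *table = ept_pml4;
    const int shifts[] = { 39, 30 };      /* PML4 and PDPT index positions */

    /* Walk (allocating as needed) the PML4 and PDPT levels */
    for (int level = 0; level < 2; level++) {
        int idx = (gpa >> shifts[level]) & 0x1FF;
        if (!(table[idx] & EPT_RWX)) {
            void *next = alloc_zeroed_page();
            table[idx] = virt_to_phys(next) | EPT_RWX;
        }
        table = phys_to_virt(table[idx] & ADDR_MASK);
    }

    /* Page-directory level: a single 2MB leaf entry covers the region */
    int pd_idx = (gpa >> 21) & 0x1FF;
    table[pd_idx] = (hpa & ~0x1FFFFFULL) | EPT_RWX | EPT_MT_WB | EPT_2MB_PAGE;
}
```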
EPT Violations:
If the EPT walk fails (the mapping doesn't exist, or access rights are violated), an EPT violation VM exit occurs. The hypervisor can then allocate and map the backing page on demand, emulate the access if it targets a virtual device (MMIO), handle copy-on-write or dirty tracking for migration, or inject a fault into the guest.
EPT Features:
| Feature | Description | Use Case |
|---|---|---|
| Large pages (2MB/1GB) | Reduce translation depth | Improving TLB coverage and walk speed |
| Execute-disable (XD) | Mark pages non-executable | Security (NX bit in guest context) |
| Accessed/Dirty bits | Track page access/modification | Memory management, migration |
| Memory type (UC/WB/WC) | Control caching behavior | Device memory, performance tuning |
| #VE (Virtualization Exception) | Guest-handled EPT violations | Advanced introspection, security |
CPU and memory virtualization enable efficient computation, but I/O devices present unique challenges. Devices use DMA (Direct Memory Access) to read and write memory independently of the CPU. Without hardware support, a device could access any physical memory location—including the hypervisor's memory or other VMs' memory.
Intel VT-d (Virtualization Technology for Directed I/O) and AMD-Vi (AMD I/O Virtualization, also known as IOMMU) solve this by providing DMA remapping, interrupt remapping, and safe device assignment:
DMA Remapping:
The IOMMU maintains page tables similar to CPU page tables, but for device memory access:
Device DMA Address → IOMMU → Physical Address
↓
Access Control
(allowed/denied)
Each device (or device group) can have its own I/O page table, restricting its DMA to specific physical pages. A device assigned to VM1 can only DMA to memory pages allocated to VM1.
Interrupt Remapping:
Without interrupt remapping, devices could send interrupts to arbitrary CPUs or vectors, potentially disrupting the hypervisor or other VMs. Interrupt remapping validates and redirects device interrupts: each interrupt message is looked up in a remapping table programmed by the hypervisor, and only interrupts matching an approved vector and destination are delivered; anything else is blocked and reported as a fault.
Device Assignment (Pass-Through):
With VT-d/AMD-Vi, physical devices can be safely assigned to VMs: the guest's driver programs the real hardware directly, the IOMMU confines the device's DMA to the guest's memory, and interrupt remapping (or posting) steers the device's interrupts to the right vCPU.
Result: Near-native I/O performance with virtualization security.
IOMMU protection is essential for secure device pass-through. Without it, a malicious or buggy guest with an assigned device could use DMA to read or write any memory, completely bypassing virtualization isolation. Always verify IOMMU is enabled (BIOS/UEFI setting) when using device pass-through.
Efficient interrupt handling is critical for I/O-intensive workloads. Traditional virtualization requires a VM exit for every interrupt, adding significant latency. Hardware features now enable interrupt delivery directly to guests.
Traditional Interrupt Handling: a device interrupt arriving while a guest runs forces a VM exit; the hypervisor acknowledges it, determines which VM it belongs to, and queues a virtual interrupt that is injected on the next VM entry. The guest's end-of-interrupt (EOI) write to the APIC typically causes yet another exit.
Advanced Interrupt Features: APIC virtualization (Intel APICv, AMD AVIC) gives the guest a virtual APIC page it can access without exiting, virtual interrupt delivery lets the processor inject interrupts without hypervisor involvement, and posted interrupts let a device or another CPU deliver an interrupt directly to a running vCPU.
Posted Interrupts in Detail:
┌─────────────────────────────────────────────────┐
│ Posted Interrupt Descriptor (in memory) │
├─────────────────────────────────────────────────┤
│ PIR (Posted Interrupt Requests) - 256 bits │
│ Each bit represents an interrupt vector │
├─────────────────────────────────────────────────┤
│ ON (Outstanding Notification) - 1 bit │
│ Set when new interrupts need notification │
├─────────────────────────────────────────────────┤
│ Notification Vector │
│ IPI vector to notify vCPU │
├─────────────────────────────────────────────────┤
│ Notification Destination (APIC ID) │
│ Which physical CPU to notify │
└─────────────────────────────────────────────────┘
Flow with Posted Interrupts: the device (via the IOMMU) or another CPU sets the interrupt's bit in the PIR, sets the ON bit, and sends the notification vector as a physical IPI to the CPU running the target vCPU. If that vCPU is executing in non-root mode, the processor merges the PIR into the guest's virtual APIC and delivers the interrupt directly, with no VM exit; if the vCPU is not running, the hypervisor sees the notification and injects the interrupt on the next VM entry.
Result: Interrupt latency drops from thousands of cycles (VM exit path) to hundreds (posted interrupt path).
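The software half of that flow is small. Below is a hedged sketch of the descriptor as a C structure and of the posting step; the real descriptor is a 64-byte, 64-byte-aligned format defined in the Intel SDM, so the field packing here is simplified, and send_ipi is an assumed helper.

```c
/* Simplified posted-interrupt descriptor and the "posting" operation. */
#include <stdint.h>

struct posted_interrupt_desc {
    uint64_t pir[4];               /* 256 bits, one per interrupt vector */
    uint32_t control;              /* bit 0 = ON (outstanding notification) */
    uint32_t notification_vector;  /* IPI vector used to notify the CPU */
    uint32_t notification_dest;    /* APIC ID of the CPU running the vCPU */
    uint32_t reserved[5];          /* pad to 64 bytes */
} __attribute__((aligned(64)));

void send_ipi(uint32_t apic_id, uint32_t vector);   /* assumed helper */

/* Post 'vector' to a vCPU; callable from the IOMMU path or another CPU. */
void post_interrupt(struct posted_interrupt_desc *pid, uint8_t vector) {
    /* 1. Atomically set the vector's bit in the PIR */
    __atomic_fetch_or(&pid->pir[vector / 64], 1ULL << (vector % 64),
                      __ATOMIC_SEQ_CST);

    /* 2. Set ON; send the notification IPI only if it was not already set */
    if (!(__atomic_fetch_or(&pid->control, 1u, __ATOMIC_SEQ_CST) & 1u))
        send_ipi(pid->notification_dest, pid->notification_vector);
}
```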
Nested virtualization allows running a hypervisor inside a virtual machine. A guest VM runs a hypervisor (L1), which manages its own nested guests (L2). This enables fascinating use cases:
Why Nested Virtualization: developing and testing hypervisors inside VMs, running cloud or lab environments that themselves host VMs, using OS features that depend on a hypervisor (such as Windows virtualization-based security or WSL 2) inside a VM, and building training and demo environments.
Implementation Challenges:
VMCS Shadowing: the L1 hypervisor manipulates a VMCS for its L2 guests, but only L0 can program the real hardware VMCS. L0 must intercept (or, with VMCS shadowing hardware, transparently satisfy) L1's VMREAD/VMWRITE operations, maintain a merged VMCS that combines L0's and L1's controls, run L2 on that merged VMCS, and reflect the exits L1 asked for back to L1 as if they came from hardware.
EPT Translation Chain: with nested virtualization, an L2 guest virtual address is translated by L2's page tables to an L2 guest physical address, by L1's EPT to an L1 guest physical address, and finally by L0's EPT to a host physical address.
L0 often merges these into a single effective EPT for L2, avoiding triple translation overhead.
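A conceptual sketch of that merge: on an EPT violation taken while L2 runs, L0 resolves the L2 guest physical address through L1's EPT, then through its own EPT, and installs the combined mapping in the shadow EPT it actually runs L2 on. The ept_walk and shadow_ept_install helpers are assumed placeholders.

```c
/* Conceptual "shadow EPT" fix-up for nested virtualization: combine
 * L1's EPT (L2 GPA -> L1 GPA) with L0's EPT (L1 GPA -> HPA). */
#include <stdint.h>
#include <stdbool.h>

bool ept_walk(uint64_t *ept_root, uint64_t gpa, uint64_t *out_pa);   /* assumed */
void shadow_ept_install(uint64_t *shadow_root, uint64_t l2_gpa, uint64_t hpa);

/* Called by L0 on an EPT violation taken while L2 was running. */
bool fixup_shadow_ept(uint64_t *l1_ept, uint64_t *l0_ept,
                      uint64_t *shadow_ept, uint64_t l2_gpa) {
    uint64_t l1_gpa, hpa;

    if (!ept_walk(l1_ept, l2_gpa, &l1_gpa))
        return false;   /* L1 has no mapping: reflect the exit to L1 */

    if (!ept_walk(l0_ept, l1_gpa, &hpa))
        return false;   /* L0 must first populate its own EPT */

    shadow_ept_install(shadow_ept, l2_gpa, hpa);
    return true;        /* L2 retries the access and now hits */
}
```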
Hardware Support:
Modern CPUs include features to accelerate nested virtualization: Intel's VMCS shadowing lets L1 execute VMREAD/VMWRITE against a shadow VMCS without exiting to L0, and AMD offers virtualized VMSAVE/VMLOAD and a virtual global interrupt flag (vGIF) so that common SVM operations by L1 no longer trap.
Performance:
Nested virtualization adds overhead: every exit from L2 that L1 wants to handle is first taken by L0 and then replayed to L1, multiplying the number of world switches, so exit-heavy workloads (I/O, frequent interrupts) slow down noticeably while CPU-bound work stays close to single-level speed.
For development and testing, this overhead is acceptable. For production, prefer flat virtualization.
On KVM (Linux): 'modprobe kvm_intel nested=1' or add 'options kvm_intel nested=1' to /etc/modprobe.d/. For VirtualBox/VMware, enable 'Nested VT-x/AMD-V' in VM settings. Note that nested virtualization significantly increases complexity and may expose additional security surface.
We've explored the hardware technologies that make modern virtualization efficient and practical. Let's consolidate these critical concepts: VT-x and AMD-V add root and non-root CPU modes so sensitive guest operations trap cleanly to the hypervisor. The VMCS (Intel) and VMCB (AMD) hold guest state, host state, and the controls that decide which events cause VM exits. EPT and NPT move memory virtualization into hardware and eliminate shadow page tables. VT-d and AMD-Vi make DMA and interrupts safe enough for direct device assignment. Posted interrupts and APIC virtualization cut interrupt-delivery exits, and nested virtualization runs hypervisors inside VMs at some extra cost.
What's Next:
We've covered traditional virtual machines with hypervisors. Next, we'll explore OS-level virtualization (containers)—a complementary approach that virtualizes at the operating system level rather than the hardware level, offering even lighter weight isolation for many use cases.
You now understand the hardware technologies underlying modern virtualization. This knowledge is essential for troubleshooting virtualization issues, optimizing VM performance, and understanding the security model of virtualized environments.