CPU and memory virtualization create the foundation for running isolated virtual machines, but real workloads need I/O—network packets, disk blocks, GPU computations. Traditional virtualization emulates devices: the guest talks to a virtual NIC, the hypervisor translates requests to real hardware. This works but adds overhead and complexity.
VT-d (Virtualization Technology for Directed I/O) enables a radically different approach: device passthrough. With VT-d, a physical device can be assigned directly to a virtual machine. The guest OS talks to real hardware, achieving near-native performance. The hypervisor doesn't intercept every I/O operation—it just sets up the assignment and lets hardware enforce isolation.
But how do you safely let a VM control real hardware? What stops a malicious guest from using DMA to read arbitrary host memory? VT-d's answer: an IOMMU (I/O Memory Management Unit) that applies address translation and access control to all device memory operations.
By the end of this page, you will understand the I/O virtualization problem, IOMMU architecture and DMA remapping, interrupt remapping for isolation, device passthrough mechanisms, SR-IOV for hardware-level device sharing, and the security guarantees VT-d provides.
I/O virtualization presents unique challenges that differ fundamentally from CPU or memory virtualization.
Device Emulation: The Traditional Approach:
In traditional virtualization, devices are emulated in software:
Guest Application
↓ write()
Guest Kernel Driver (e.g., virtio-net)
↓ I/O port write or MMIO
[VM EXIT]
Hypervisor - Device Model (e.g., QEMU)
↓ Real network operation
Host Kernel NIC Driver
↓
Physical NIC
This approach is flexible—the guest needs only a generic virtual-device driver, not a driver for the host's real hardware—but it introduces overhead at every layer.
| Approach | Performance | Flexibility | Isolation | Complexity |
|---|---|---|---|---|
| Full Emulation | 30-50% native | Excellent | Excellent | Very High |
| Paravirtualization (virtio) | 60-80% native | Good | Excellent | Medium |
| Device Passthrough | 95-100% native | Limited | Good (with VT-d) | Low |
| SR-IOV | 95-100% native | Good | Excellent | Medium |
Why Device Emulation Is Slow:
A 10 Gbps NIC can carry about 14.88 million minimum-size packets per second—roughly 67 nanoseconds of CPU budget per packet. A single VM exit typically costs around 1 microsecond, so per-packet emulation would burn many cores' worth of CPU on overhead alone.
The DMA Security Problem:
Direct device access seems ideal but raises a critical security issue: DMA (Direct Memory Access). Devices perform DMA to transfer data without CPU involvement—they read/write physical memory directly. If a guest controls a device, that device can DMA to any physical address the guest programs.
Without protection, a malicious guest could:
- Program its device to DMA-read hypervisor or host kernel memory
- Overwrite other VMs' pages, page tables, or kernel code
- Exfiltrate secrets from any physical address in the system
VT-d's Solution: The IOMMU
Without IOMMU protection, device passthrough is fundamentally unsafe. A device can write to any physical address—including kernel code, page tables, or security-critical data. VT-d isn't just a performance feature; it's essential for security in passthrough scenarios.
The IOMMU (I/O Memory Management Unit) sits between devices and memory, translating device-initiated addresses and enforcing access control. Think of it as a page table unit for DMA—just as the MMU translates CPU addresses, the IOMMU translates device addresses.
Intel VT-d IOMMU:
VT-d introduces DMA Remapping (DMAR) with these components:
- Root Table: one entry per PCI bus, located by the Root Table Address Register
- Context Entries: one per device function, selecting the device's isolation domain and page tables
- Second-level page tables: translate device-issued addresses to host physical addresses
- Fault logging: records and reports blocked DMA attempts
DMA Address Translation:
Device wants to DMA to address 0x1000_0000
↓
IOMMU looks up device identity (Bus:Device:Function)
↓
Root Table[Bus] → Context Entry[DevFn]
↓
Context Entry contains: Domain ID + Page Table Pointer
↓
Page Table Walk: 0x1000_0000 → Translated Physical Address
↓
DMA proceeds to translated address (or faults if unmapped)
Context Entry Structure:
Each device (identified by Bus:Device:Function) has a context entry:
| Field | Description |
|---|---|
| Present | Entry valid |
| Fault Processing Disable | Suppress fault reporting |
| Translation Type | Pass-through, translation, or reserved |
| Address Width | Supported guest address width |
| Second Level Page Table Pointer | Root of translation tables |
| Domain ID | Identifier for this isolation domain |
struct context_entry {
uint64_t lo; /* Pointer to page tables, flags */
uint64_t hi; /* Domain ID, address width */
};
/* Context entry bit definitions */
#define CTX_PRESENT (1 << 0)
#define CTX_FPD (1 << 1) /* Fault processing disable */
#define CTX_TRANS_TYPE(t) ((t) << 2) /* Translation type */
#define CTX_ADDR_WIDTH(w) ((w) << 0) /* In hi word */
#define CTX_DOMAIN_ID(d) ((uint64_t)(d) << 8) /* In hi word */
void setup_context_entry(struct context_entry *ctx,
uint64_t page_table_root,
uint16_t domain_id) {
ctx->lo = (page_table_root & PAGE_MASK) |
CTX_PRESENT |
CTX_TRANS_TYPE(0); /* 0 = Second-level only */
ctx->hi = CTX_ADDR_WIDTH(2) | /* 48-bit */
CTX_DOMAIN_ID(domain_id);
}
Second-Level Page Tables:
VT-d page tables are similar to EPT—4-level structure translating device addresses:
/* VT-d page table entry (similar to EPT) */
#define DMA_PTE_READ (1 << 0) /* Read access */
#define DMA_PTE_WRITE (1 << 1) /* Write access */
#define DMA_PTE_SNP (1 << 11) /* Snoop bit for cache coherency */
#define DMA_PTE_ADDR(a) ((a) & 0x000FFFFFFFFFF000ULL)
/* Build IOMMU page table for a VM */
void build_iommu_page_table(struct vm *vm, struct iommu_domain *domain) {
    /* For passthrough, map the guest's entire physical address space */
    for (uint64_t gpfn = 0; gpfn < vm->num_pages; gpfn++) {
        uint64_t gpa = gpfn << PAGE_SHIFT;
        uint64_t hpa = ept_translate(vm->ept, gpa);
        /* Device sees VM's physical addresses, IOMMU translates to host */
        iommu_map_page(domain, gpa, hpa, DMA_PTE_READ | DMA_PTE_WRITE);
    }
}
Key Insight: Double Translation
With device passthrough:
The guest programs the device with GPAs. The IOMMU translates these to HPAs. The IOMMU page table typically mirrors the EPT, ensuring devices and CPUs see consistent memory mapping.
AMD's equivalent is simply called 'AMD IOMMU' or 'AMD-Vi'. The architecture is similar—device identification, page tables for address translation—with different register layouts and table formats. Linux abstracts both under the common IOMMU API.
DMA remapping protects memory, but devices also generate interrupts. In x86, MSI (Message Signaled Interrupts) work by having the device write to a special memory address that the interrupt controller interprets as an interrupt request. Without protection, a malicious device could forge arbitrary interrupts.
The Interrupt Security Problem:
MSI interrupts contain:
- A target address (0xFEExxxxx on x86) identifying the destination CPU's local APIC
- A data payload encoding the interrupt vector and delivery mode
A compromised device could:
- Write forged MSI messages, injecting any vector on any CPU
- Impersonate another device's interrupts
- Use special delivery modes (e.g., NMI) to destabilize the host
VT-d Interrupt Remapping:
VT-d introduces an Interrupt Remapping Table (IRT) that validates and translates device interrupts:
Interrupt Remapping Table Entry:
struct irte { /* Interrupt Remapping Table Entry */
uint64_t lo;
uint64_t hi;
};
/* Low quadword fields */
#define IRTE_PRESENT (1 << 0)
#define IRTE_FPD (1 << 1) /* Fault Processing Disable */
#define IRTE_DM(m) ((m) << 2) /* Destination Mode */
#define IRTE_RH (1 << 3) /* Redirection Hint */
#define IRTE_TM (1 << 4) /* Trigger Mode */
#define IRTE_DLV(d) ((d) << 5) /* Delivery Mode */
#define IRTE_AVAIL (0xF << 8) /* Available for software */
#define IRTE_VECTOR(v) ((uint64_t)(v) << 16)
#define IRTE_DEST(d) ((uint64_t)(d) << 32)
/* High quadword fields */
#define IRTE_SID(s) ((s) & 0xFFFF) /* Source ID validation */
#define IRTE_SQ(q) ((q) << 16) /* Source ID qualifier */
#define IRTE_SVT(t) ((t) << 18) /* Source validation type */
void setup_irte(struct irte *entry, uint8_t vector, uint32_t dest_apic,
uint16_t source_id) {
entry->lo = IRTE_PRESENT |
IRTE_DM(0) | /* Physical destination mode */
IRTE_DLV(0) | /* Fixed delivery */
IRTE_VECTOR(vector) |
IRTE_DEST(dest_apic);
entry->hi = IRTE_SVT(1) | /* Verify source ID */
IRTE_SID(source_id); /* Expected BDF */
}
Source ID Validation:
Critically, interrupt remapping can validate the source of interrupts. Each IRT entry specifies which device (by Bus:Device:Function) is allowed to use that entry. A device can't use another device's interrupt handles.
Device 0:3:0 tries to use handle belonging to device 0:4:0
→ IOMMU compares source ID
→ Mismatch detected
→ Interrupt blocked, fault logged
Without interrupt remapping, device passthrough remains insecure even with DMA protection. A device could trigger arbitrary interrupt vectors, potentially invoking kernel code paths with unexpected state. Modern hypervisors require both DMA and interrupt remapping for safe passthrough.
With VT-d infrastructure in place, we can implement device passthrough—assigning a physical device exclusively to a virtual machine.
Passthrough Setup Steps:
# Linux VFIO device passthrough example
# 1. Identify device
lspci -nn | grep NVIDIA
# 01:00.0 VGA compatible controller [0300]: NVIDIA ... [10de:1b80]
# 2. Unbind from current driver
echo "0000:01:00.0" > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
# 3. Bind to vfio-pci driver
echo "10de 1b80" > /sys/bus/pci/drivers/vfio-pci/new_id
# 4. Verify IOMMU group
ls /sys/kernel/iommu_groups/*/devices/
# Group contains: 0000:01:00.0 0000:01:00.1 (GPU + GPU audio)
# 5. Start VM with device passthrough (QEMU example)
qemu-system-x86_64 \
  -device vfio-pci,host=01:00.0 \
  -device vfio-pci,host=01:00.1 \
  ...
/* Simplified VFIO device passthrough setup */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int setup_device_passthrough(struct vm *vm, const char *device_path) {
    int container, group, device;
    struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

    /* Open VFIO container */
    container = open("/dev/vfio/vfio", O_RDWR);
    ioctl(container, VFIO_GET_API_VERSION);
    ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU);

    /* Open IOMMU group (e.g., /dev/vfio/42) */
    group = open("/dev/vfio/42", O_RDWR);
    ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

    /* Add group to container */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* Enable IOMMU */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Get device file descriptor */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, device_path);

    /* Query device info */
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

    /* Map guest memory for DMA */
    struct vfio_iommu_type1_dma_map dma_map = {
        .argsz = sizeof(dma_map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)vm->guest_memory,
        .iova  = 0,                 /* Guest physical address 0 */
        .size  = vm->memory_size,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

    /* Set up interrupts via eventfd */
    struct vfio_irq_set *irq_set = malloc(sizeof(*irq_set) + sizeof(int));
    irq_set->argsz = sizeof(*irq_set) + sizeof(int);
    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    irq_set->index = VFIO_PCI_MSI_IRQ_INDEX;
    irq_set->start = 0;
    irq_set->count = 1;
    *(int *)irq_set->data = create_eventfd_for_vm_irq(vm);
    ioctl(device, VFIO_DEVICE_SET_IRQS, irq_set);
    free(irq_set);

    return 0;
}

IOMMU Groups and ACS:
Devices that can communicate without going through the IOMMU (e.g., devices behind the same PCIe switch) must be in the same IOMMU group. If devices can peer-to-peer DMA, they can't be independently isolated.
ACS (Access Control Services) is a PCIe feature that enforces routing through the root complex, enabling finer-grained IOMMU groups:
Without ACS:
PCIe Switch
├── Device A ─┐
└── Device B ─┤── Same IOMMU group
(can peer-to-peer)
With ACS:
PCIe Switch (ACS enabled)
├── Device A ── Separate IOMMU group
└── Device B ── Separate IOMMU group
(traffic forced through root)
For passthrough, you often need to pass through entire IOMMU groups. Consumer motherboards often have poor IOMMU group isolation; server hardware typically has better ACS support.
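A quick way to inspect your machine's grouping (a shell sketch; the directory exists only when the IOMMU is enabled):

```shell
#!/bin/sh
# Print each IOMMU group and the devices it contains.
base=/sys/kernel/iommu_groups
if [ -d "$base" ] && [ -n "$(ls -A "$base" 2>/dev/null)" ]; then
    for g in "$base"/*; do
        echo "Group ${g##*/}:"
        ls "$g/devices" | sed 's/^/  /'
    done
else
    echo "No IOMMU groups found (is the IOMMU enabled?)"
fi
```

Devices sharing a group line must be passed through together (or left on the host together).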
GPU passthrough is a popular use case—enabling gaming or CUDA workloads in VMs. It requires: IOMMU enabled in BIOS, GPU in its own IOMMU group (or with other group members also passed through), and proper driver support in the guest. NVIDIA drivers historically detected virtualization; consumer cards may need workarounds.
Device passthrough gives one VM exclusive access to one physical device. But what if you have 100 VMs and only 4 network ports? SR-IOV (Single Root I/O Virtualization) solves this by making one physical device appear as multiple independent devices.
SR-IOV Architecture:
An SR-IOV device presents:
Physical Function (PF): The actual device, managed by the host/hypervisor. Has full device capabilities including configuration.
Virtual Functions (VFs): Lightweight device instances, each appearing as a separate PCIe function. Each VF can be assigned to a different VM.
SR-IOV NIC Configuration:
Physical NIC (e.g., Intel X540)
├── PF0 (Physical Function) - Managed by host
├── VF0 → Assigned to VM1
├── VF1 → Assigned to VM2
├── VF2 → Assigned to VM3
├── VF3 → Assigned to VM4
└── ... (up to 64+ VFs)
Each VF has its own:
- PCIe configuration space
- Memory-mapped registers
- TX/RX queues
- Interrupts (MSI-X vectors)
Creating Virtual Functions:
# Enable SR-IOV on a network interface
# Check maximum VFs supported
cat /sys/class/net/eth0/device/sriov_totalvfs
# 63
# Create 4 virtual functions
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
# Verify VFs appeared
lspci | grep Virtual
# 03:10.0 Ethernet controller: Intel ... Virtual Function
# 03:10.2 Ethernet controller: Intel ... Virtual Function
# 03:10.4 Ethernet controller: Intel ... Virtual Function
# 03:10.6 Ethernet controller: Intel ... Virtual Function
# Each VF gets its own IOMMU group and can be passed through
ls /sys/kernel/iommu_groups/*/devices/ | grep 03:10
SR-IOV Benefits:
- Near-native performance: each VM drives hardware directly, with no hypervisor in the data path
- Scalability: one physical device serves dozens of VMs
- Isolation: each VF's DMA and interrupts are confined by the IOMMU
- Host policy control: the PF driver enforces per-VF MAC, VLAN, and rate limits
SR-IOV Mechanism:
The hardware includes an internal switch fabric that routes traffic:
/* SR-IOV VF configuration example */
#include <linux/pci.h>

/* Host: Enable SR-IOV on a PF */
int enable_sriov(struct pci_dev *pdev, int num_vfs) {
    int ret;

    /* Check device supports SR-IOV */
    if (!pdev->is_physfn)
        return -ENODEV;
    if (num_vfs > pci_sriov_get_totalvfs(pdev))
        return -EINVAL;

    /* Enable VFs */
    ret = pci_enable_sriov(pdev, num_vfs);
    if (ret)
        return ret;

    /* Configure each VF - it inherits from the PF but can be customized */
    for (int i = 0; i < num_vfs; i++) {
        set_vf_mac(pdev, i, vf_mac_addresses[i]);
        set_vf_vlan(pdev, i, vf_vlans[i]);
        set_vf_rate_limit(pdev, i, vf_rate_mbps[i]);
    }
    return 0;
}

/* Hypervisor: Attach VF to VM */
int attach_vf_to_vm(struct vm *vm, const char *vf_bdf) {
    /* Same as regular passthrough - VF appears as a normal PCIe device */
    return setup_device_passthrough(vm, vf_bdf);
}

SR-IOV requires hardware support in both the device and the platform. Not all devices support SR-IOV, and VF capabilities may be limited compared to the PF (e.g., fewer queues, no promiscuous mode). Live migration of VMs with SR-IOV devices is challenging because hardware state must be saved and restored.
VT-d provides essential security for device passthrough, but correct configuration is critical. Misconfiguration can break isolation guarantees entirely.
Security Guarantees of Properly Configured VT-d:
- DMA confinement: a device can only access memory explicitly mapped into its domain
- Interrupt validation: a device can only signal interrupts through IRT entries assigned to it
- Fault reporting: blocked accesses are logged, enabling detection and response
- Domain isolation: devices assigned to different domains cannot interfere with each other
Common Misconfigurations and Attacks:
| Risk | Description | Mitigation |
|---|---|---|
| IOMMU Bypass | IOMMU not enabled; device has unrestricted DMA | Verify IOMMU enabled in BIOS and kernel |
| RMRR Conflicts | Reserved Memory Regions block safe passthrough | Check dmesg for RMRR warnings, use ACS override |
| IOMMU Group Issues | Multiple devices share translation; escaping possible | Pass through entire group or use ACS |
| Interrupt Remapping Disabled | Device can forge interrupts | Require interrupt remapping for passthrough |
| Hot-plug Attacks | Malicious device inserted while running | Disable hot-plug or use physical security |
| DMA Before Boot | Device DMAs before IOMMU initialized | Enable pre-boot DMA protection (Intel VT-d feature) |
Verifying VT-d Configuration:
# Check IOMMU is enabled in kernel
dmesg | grep -i iommu
# [ 0.000000] DMAR: IOMMU enabled
# [ 0.123456] DMAR-IR: Enabled IRQ remapping
# Verify interrupt remapping
dmesg | grep "IRQ remapping"
# Should show "Enabled"
# Check for IOMMU faults (should be empty normally)
cat /sys/kernel/debug/iommu/intel/dmar_table_errors
# List IOMMU domains and devices
cat /sys/kernel/iommu_groups/*/type
ls -la /sys/kernel/iommu_groups/*/devices/
Pre-Boot DMA Protection:
Modern systems support blocking DMA before the OS loads:
Boot Sequence with DMA Protection:
1. Platform powers on
2. IOMMU initialized in blocking mode
3. All device DMA rejected
4. OS loads, takes IOMMU ownership
5. OS enables DMA only for trusted devices
Some guides suggest kernel parameters like 'intel_iommu=off' for troubleshooting. This completely disables IOMMU protection, making device passthrough fundamentally unsafe. A compromised guest with passthrough device can read/write all system memory. Only disable IOMMU if you fully understand and accept this risk.
TOCTOU and Malicious Devices:
A sophisticated attacker with physical access could use a malicious PCIe device (e.g., via Thunderbolt) to attempt:
- DMA reads of memory holding keys, credentials, or other secrets
- DMA writes that patch kernel code or disable protections
- Forged interrupts to redirect control flow
VT-d fault handling addresses some concerns:
/* VT-d fault handler */
void dmar_fault_handler(uint64_t source_id, uint64_t fault_addr,
uint32_t fault_reason) {
/* Log the fault */
log_security_event("IOMMU fault: device %04x addr %016lx reason %d",
source_id, fault_addr, fault_reason);
/* Common fault reasons:
* 1 = Page not present
* 2 = Write to read-only page
* 5 = Access width violation
*/
/* Consider disabling device if repeated faults */
if (++fault_count[source_id] > FAULT_THRESHOLD) {
disable_device(source_id);
alert_admin("Device %04x disabled due to repeated IOMMU faults",
source_id);
}
}
While VT-d enables near-native I/O performance, several optimizations can further reduce overhead.
IOTLB (I/O Translation Lookaside Buffer):
Like CPU TLBs cache address translations, IOTLBs cache IOMMU translations:
DMA without IOTLB hit:
Device address → Root Table → Context Table → Page Walk → Physical
(Multiple memory accesses)
DMA with IOTLB hit:
Device address → IOTLB → Physical
(Single lookup)
IOTLB Invalidation:
When IOMMU mappings change, IOTLB entries must be invalidated:
void invalidate_iotlb(struct iommu_domain *domain) {
/* Global invalidation - flush all entries for domain */
struct qi_desc desc = {
.qw0 = QI_IOTLB_GRAN(QI_IOTLB_DOMAIN) |
QI_IOTLB_DID(domain->id) |
QI_IOTLB_TYPE,
.qw1 = 0,
};
qi_submit_sync(&desc);
}
void invalidate_iotlb_addr(struct iommu_domain *domain,
uint64_t addr, uint64_t size) {
/* Page-selective invalidation - more efficient */
struct qi_desc desc = {
.qw0 = QI_IOTLB_GRAN(QI_IOTLB_PAGE) |
QI_IOTLB_DID(domain->id) |
QI_IOTLB_TYPE,
.qw1 = addr | QI_IOTLB_AM(size),
};
qi_submit_sync(&desc);
}
ATS (Address Translation Services):
ATS allows PCIe devices to cache address translations locally:
Without ATS:
Every DMA → IOMMU page walk
With ATS:
First DMA → IOMMU translation → Device caches
Subsequent DMAs → Device ATC hit → Direct access
Posted Interrupts for Devices:
Intel's Posted Interrupts feature can deliver device interrupts directly to a running vCPU without VM exit:
/* Configure posted interrupt for device */
void setup_posted_interrupt_irte(struct irte *entry,
struct vcpu *vcpu,
uint8_t vector) {
entry->lo = IRTE_PRESENT |
IRTE_MODE_POSTED | /* Posted interrupt mode */
IRTE_VECTOR(vector);
entry->hi = IRTE_POSTED_ADDR(vcpu->posted_intr_desc) |
IRTE_URGENT_BIT; /* Wake blocked vCPU */
}
Use tools like perf to measure IOMMU overhead: 'perf stat -e dTLB-load-misses,iTLB-load-misses' for CPU TLB, and check /sys/kernel/debug/iommu/intel/ for IOMMU statistics. High IOTLB miss rates indicate need for larger pages or better locality.
VT-d completes the hardware virtualization picture by enabling secure, high-performance I/O for virtual machines. Through DMA remapping, interrupt remapping, and device passthrough, VMs can achieve near-native I/O performance while maintaining isolation guarantees.
What's Next:
In the next page, we'll explore Performance Acceleration techniques—how all the hardware virtualization features (VT-x, EPT, VT-d) work together to minimize overhead, and practical techniques for tuning virtual machine performance. We'll examine real-world benchmarks and optimization strategies.
You now understand VT-d and I/O virtualization—from IOMMU architecture and DMA remapping to SR-IOV and security considerations. Combined with CPU virtualization (VT-x/AMD-V) and memory virtualization (EPT/NPT), you have a complete picture of modern hardware virtualization technology.