The holy grail of virtualization is making virtual machines indistinguishable from bare metal—not just in behavior, but in performance. With modern hardware virtualization features, this goal is remarkably achievable. Well-tuned VMs regularly achieve 95-99% of native performance for most workloads.
But getting there requires understanding how the various hardware and software components interact. A single misconfiguration—missing EPT enablement, incorrect NUMA placement, or excessive VM exits—can devastate performance. Conversely, understanding the optimization landscape lets you make informed trade-offs between isolation, flexibility, and speed.
This page synthesizes everything we've learned about hardware virtualization into a coherent performance optimization strategy.
By the end of this page, you will understand the sources of virtualization overhead, how hardware features eliminate that overhead, practical VM tuning techniques for CPU, memory, and I/O, NUMA considerations, measurement and benchmarking strategies, and real-world performance case studies.
Before optimizing, we must understand what makes VMs slower than bare metal. Virtualization overhead comes from several distinct sources:
1. VM Exits (CPU Overhead):
Every VM exit saves full CPU state, transitions to the hypervisor, processes the exit reason, and transitions back. Even with hardware acceleration, this costs 1,000-10,000 CPU cycles.
2. Address Translation (Memory Overhead):
Two-dimensional page walks (guest PT + EPT/NPT) access more memory than native paging on TLB misses.
3. I/O Emulation:
Emulated devices require hypervisor intervention for every operation, copying data through multiple layers.
4. Interrupt Delivery:
Virtual interrupts traditionally require exits for injection and acknowledgment.
5. Resource Contention:
Multiple VMs compete for physical resources—CPU time, memory bandwidth, cache space, I/O bandwidth.
| Overhead Source | Impact | Hardware Mitigation | Remaining Overhead |
|---|---|---|---|
| VM Exits | High (μs per exit) | VT-x/AMD-V | Minimal if optimized |
| Memory Translation | Medium (TLB misses) | EPT/NPT + VPID/ASID | ~10% extra TLB miss cost |
| Shadow Page Tables | Very High | EPT/NPT (eliminates) | Zero with EPT/NPT |
| I/O Emulation | Very High | VT-d + Passthrough/SR-IOV | Near-zero with passthrough |
| Interrupt Delivery | Medium | Posted Interrupts + APICv/AVIC | Near-zero |
| Timer Operations | Medium | TSC virtualization | Zero if TSC stable |
Exit Frequency Analysis:
The number of VM exits per second is a key performance indicator:
# Monitor VM exit rate on Linux/KVM
perf kvm stat live
# Sample output:
# VM-Exit Samples %time Avg time
# HLT 3421 45.2% 2.1us
# IO 892 12.3% 15.4us
# EPT_VIOLATION 234 3.1% 1.8us
# EXTERNAL_INT 178 2.4% 0.9us
# CPUID 156 0.8% 0.4us
Interpreting Exit Data:
| Exit Type | Normal Rate | Concern Threshold | Mitigation |
|---|---|---|---|
| HLT | Depends on idle | High when busy | Expected when idle |
| IO | < 1000/sec | > 10000/sec | Use virtio, passthrough |
| EPT_VIOLATION | < 100/sec | > 1000/sec | Check memory mapping |
| EXTERNAL_INT | Variable | Too frequent | Enable posted interrupts |
| CR_ACCESS | < 100/sec | > 1000/sec | Check guest behavior |
Focus on eliminating high-frequency exits first. A rarely triggered exit handler can afford to be slow, but an exit that occurs 10,000 times per second had better be fast. Use exit statistics to identify the biggest contributors before optimizing.
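A quick back-of-the-envelope calculation shows why exit frequency matters; the rate and per-exit cost below are illustrative, not measured values:
# Fraction of one core consumed handling exits:
# 10,000 exits/sec at ~5 microseconds each
awk 'BEGIN { printf "%.1f%% of one core\n", 10000 * 5 / 1e6 * 100 }'
# 5.0% of one core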
CPU performance in VMs depends on minimizing exits and ensuring predictable scheduling.
vCPU Pinning:
By default, vCPUs may be scheduled on any physical CPU, migrating as the host scheduler decides. This causes cold caches and TLBs after each migration, lost NUMA locality, and unpredictable scheduling latency.
Pinning assigns vCPUs to specific physical CPUs:
# QEMU/KVM: QEMU itself has no command-line option for vCPU pinning.
# Start the VM normally (e.g. qemu-system-x86_64 ... -smp 4), then pin the
# vCPU threads from the host. Thread IDs come from "info cpus" in the QEMU
# monitor (or "virsh qemu-monitor-command"); the IDs below are placeholders.
taskset -pc 2 <vcpu0-thread-id>
taskset -pc 3 <vcpu1-thread-id>
# libvirt: In domain XML
<vcpu placement='static'>4</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='2'/>
<vcpupin vcpu='1' cpuset='3'/>
<vcpupin vcpu='2' cpuset='4'/>
<vcpupin vcpu='3' cpuset='5'/>
</cputune>
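After defining the pinning, confirm that libvirt actually applied it; the domain name myvm below is an example:
# Current vCPU-to-physical-CPU pinning
virsh vcpupin myvm
# Which physical CPU each vCPU is running on right now
virsh vcpuinfo myvm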
CPU Feature Exposure:
The hypervisor controls which CPU features are visible to guests via CPUID interception:
<!-- libvirt: Use host CPU model with specified features -->
<cpu mode='host-passthrough' check='none'>
<feature policy='disable' name='vmx'/> <!-- Hide nested virt -->
<feature policy='require' name='avx2'/> <!-- Require AVX2 -->
</cpu>
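Inside the guest, you can check which features survived CPUID filtering; a minimal check for the two features configured above:
# Run inside the guest: vmx should be absent, avx2 present
grep -o -w -E 'vmx|avx2' /proc/cpuinfo | sort -u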
CPU Models:
| Model | Description | Use Case |
|---|---|---|
| host-passthrough | Expose exact host CPU | Maximum performance, no migration |
| host-model | Similar to host with safety checks | Good performance, limited migration |
| Named model (Skylake, EPYC) | Specific CPU generation | Migration compatibility |
| qemu64/kvm64 | Minimal baseline | Maximum compatibility |
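The named models available on a given host can be listed directly; output varies with the QEMU and libvirt versions installed:
# CPU models known to this QEMU build
qemu-system-x86_64 -cpu help | head -20
# CPU model names libvirt knows for x86_64
virsh cpu-models x86_64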
Performance Impact of Feature Hiding:
Disabling features impacts the workloads that use them: hiding AES-NI slows disk and TLS encryption noticeably, and hiding AVX2 pushes optimized libraries onto slower fallback code paths. Hide only what migration compatibility or security requires. The script below collects the host-side CPU checks and settings discussed in this section.
#!/bin/bash
# CPU optimization script for KVM host

# 1. Isolate CPUs for VM use (add to kernel cmdline)
# isolcpus=4-15 nohz_full=4-15 rcu_nocbs=4-15

# 2. Disable CPU frequency scaling for consistent performance
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo "performance" > "$cpu"
done

# 3. Disable C-states for lowest latency
for cpu in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
    echo 1 > "$cpu"
done

# 4. Verify TSC is stable (required for good timer performance)
dmesg | grep -i "tsc"
# Should show: "tsc: Refined TSC clocksource calibration"
#              "tsc: Detected X.XXX MHz processor"

# 5. Check for hardware virtualization features
grep -E 'vmx|svm' /proc/cpuinfo | head -1

# 6. Verify EPT/NPT is available
cat /sys/module/kvm_intel/parameters/ept   # Should be Y
cat /sys/module/kvm_intel/parameters/vpid  # Should be Y

# 7. Check for APICv support
cat /sys/module/kvm_intel/parameters/enable_apicv  # Should be Y

Performance tuning often trades latency against throughput. Disabling C-states improves latency but wastes power. CPU pinning improves predictability but reduces scheduling flexibility. Choose based on workload requirements: database queries need low latency, while batch processing prioritizes throughput.
Memory performance is critical because every CPU operation ultimately involves memory. EPT/NPT are essential, but there's more to optimize.
Huge Pages:
Using 2MB or 1GB pages instead of 4KB pages:
# Configure huge pages on host
echo 1024 > /proc/sys/vm/nr_hugepages # Allocate 2GB (1024 * 2MB)
# Verify available huge pages
cat /proc/meminfo | grep Huge
# HugePages_Total: 1024
# HugePages_Free: 1024
# Hugepagesize: 2048 kB
# libvirt: Enable huge pages for VM
<memoryBacking>
<hugepages>
<page size='2048' unit='KiB'/>
</hugepages>
</memoryBacking>
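Once the VM is running, confirm it actually consumed huge pages: HugePages_Free should drop by roughly the VM's memory size divided by 2MB.
# Re-check after the VM starts; Free should now be lower than Total
grep -E 'HugePages_(Total|Free)' /proc/meminfo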
| Page Size | TLB Entries | Coverage per Entry | Total Coverage |
|---|---|---|---|
| 4KB | ~1500 (typical) | 4KB | ~6MB |
| 2MB | ~32 (typical) | 2MB | ~64MB |
| 1GB | 4 (typical) | 1GB | ~4GB |
NUMA Topology:
Non-Uniform Memory Access (NUMA) means memory access speed depends on which CPU socket makes the request. Local memory access is faster than remote:
NUMA System:
┌─────────────────┐ ┌─────────────────┐
│ Socket 0 │ │ Socket 1 │
│ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ │
│ │CPU│ │CPU│ │ │ │CPU│ │CPU│ │
│ └───┘ └───┘ │ │ └───┘ └───┘ │
│ │ │ │ │ │
│ Memory 0 │←──→│ Memory 1 │
│ (64GB) │QPI │ (64GB) │
└─────────────────┘ └─────────────────┘
Local access: ~80ns
Remote access: ~130ns (1.6x slower)
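Before pinning anything, inspect the host's real NUMA layout; node counts, memory sizes, and distances vary by machine:
# Nodes, their CPUs and memory, and inter-node distances
numactl --hardware
# Condensed CPU-to-node mapping
lscpu | grep -i numa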
NUMA-Aware VM Configuration:
<!-- libvirt: Pin VM memory to specific NUMA node -->
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
<!-- Topology: Show guest its own NUMA structure -->
<cpu>
<numa>
<cell id='0' cpus='0-3' memory='4' unit='GiB'/>
<cell id='1' cpus='4-7' memory='4' unit='GiB'/>
</numa>
</cpu>
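After the VM starts, verify that its memory actually landed on the intended node; numastat (from the numactl package) breaks a process's memory down per node. The pgrep pattern below assumes a single qemu-system process:
# Per-NUMA-node memory usage of the QEMU process backing the VM
numastat -p $(pgrep -f qemu-system | head -1)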
Memory Pre-Allocation:
By default, VM memory is allocated on-demand (overcommit). This saves RAM but adds latency:
First access to VM page:
1. Guest accesses GPA
2. EPT violation (page not mapped)
3. Exit to hypervisor
4. Hypervisor allocates host page
5. Maps in EPT
6. Re-enter guest
With pre-allocation:
1. Guest accesses GPA
2. EPT hit (already mapped)
3. Access completes
<!-- libvirt: Pre-allocate all memory -->
<memoryBacking>
<allocation mode='immediate'/>
<locked/> <!-- Prevent swapping -->
</memoryBacking>
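To confirm that allocation and locking actually happened, inspect the QEMU process on the host; the pgrep pattern assumes a qemu-system process name:
# VmRSS = resident memory, VmLck = locked (unswappable) memory
grep -E 'VmRSS|VmLck' /proc/$(pgrep -f qemu-system | head -1)/status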
Memory Overcommit Considerations:
| Overcommit Strategy | Density | Latency | Predictability |
|---|---|---|---|
| None (all pre-allocated) | Low | Best | Best |
| Moderate (1.5x) | Medium | Good | Good |
| Aggressive (2x+) | High | Variable | Poor |
| Balloon + Swap | Highest | Worst | Worst |
For VMs with 64GB+ memory, 1GB huge pages provide massive TLB coverage. They must be allocated at boot (add 'hugepagesz=1G hugepages=N default_hugepagesz=1G' to kernel cmdline). The combination of 1GB pages + NUMA pinning + pre-allocation approaches bare-metal memory performance.
I/O is often the most impactful area for virtualization performance. The difference between emulated and passthrough devices can be 10x or more.
I/O Virtualization Hierarchy (Best to Worst):
| Method | Throughput | Latency | CPU Usage |
|---|---|---|---|
| Native (no VM) | 9.8 Gbps | 15 μs | Low |
| SR-IOV VF Passthrough | 9.7 Gbps | 18 μs | Low |
| vhost-user (DPDK) | 9.5 Gbps | 25 μs | Medium |
| vhost-net | 8.5 Gbps | 45 μs | Medium-High |
| virtio-net | 6.0 Gbps | 80 μs | High |
| e1000 emulation | 1.0 Gbps | 500+ μs | Very High |
virtio Optimization:
When passthrough isn't available, virtio is the next best option. Optimize it:
<!-- libvirt: Optimized virtio-net -->
<interface type='network'>
<source network='default'/>
<model type='virtio'/>
<driver name='vhost' queues='4'/> <!-- Multi-queue -->
<tune>
<sndbuf>1048576</sndbuf> <!-- 1MB send buffer -->
</tune>
</interface>
<!-- Optimized virtio-blk disk -->
<disk type='file' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'
discard='unmap' iothread='1'/>
<source file='/var/lib/libvirt/images/vm.raw'/>
<target dev='vda' bus='virtio'/>
</disk>
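The driver element above references iothread='1'; libvirt requires a matching I/O thread to be declared at the domain level, for example:
<!-- Declare one dedicated I/O thread for the domain (referenced by iothread='1' above) -->
<iothreads>1</iothreads>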
Key virtio Settings:
| Setting | Purpose | Recommendation |
|---|---|---|
| queues | Multi-queue I/O | Set to vCPU count |
| vhost | Kernel bypass | Always enable |
| cache=none | Direct I/O | For consistency |
| io=native | Linux native AIO | For best performance |
| discard=unmap | TRIM support | For SSD backing |
Disk I/O Optimization:
# Disk image format impact
# Raw: No overhead, immediate allocation
# QCOW2: Copy-on-write, slight overhead but flexible
# Create pre-allocated raw image (best performance)
qemu-img create -f raw /var/lib/libvirt/images/vm.raw 100G
# Pre-allocate to eliminate fragmentation
fallocate -l 100G /var/lib/libvirt/images/vm.raw
# For QCOW2, use preallocation
qemu-img create -f qcow2 -o preallocation=metadata \
/var/lib/libvirt/images/vm.qcow2 100G
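qemu-img can confirm the result; for a fully pre-allocated raw image, the on-disk size should match the virtual size (path reused from the example above):
# Show format, virtual size, and actual on-disk allocation
qemu-img info /var/lib/libvirt/images/vm.raw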
I/O Scheduling for KVM:
# For SSD storage backing VMs
echo 'none' > /sys/block/nvme0n1/queue/scheduler
# For HDD storage
echo 'mq-deadline' > /sys/block/sda/queue/scheduler
# Increase queue depth for NVMe
echo 1024 > /sys/block/nvme0n1/queue/nr_requests
vhost-user enables userspace applications (like DPDK-based OVS or Snabb) to serve as the backend for virtio devices. Packets never touch the kernel, achieving line-rate forwarding on modern NICs. This is how cloud providers achieve multi-million PPS networking in VMs.
Interrupts and timers cause exits. Reducing their frequency and enabling hardware acceleration dramatically improves performance.
Posted Interrupts (Intel):
Without posted interrupts: an external interrupt aimed at a guest forces a VM exit, the hypervisor injects a virtual interrupt, and the guest is re-entered; the guest's interrupt acknowledgment (EOI) may trap as well.
With posted interrupts: the hypervisor (or an assigned device) records the interrupt in a posted-interrupt descriptor in memory, and the CPU delivers it directly to the running guest via a notification vector, with no VM exit.
# Verify posted interrupts enabled
cat /sys/module/kvm_intel/parameters/enable_apicv
# Y
# Monitor interrupt delivery
perf kvm stat live | grep -E 'INTERRUPT|POSTED'
Timer Virtualization:
Guests need accurate time for scheduling, timestamps, and application logic. Timer sources:
| Timer | VM Exit Cost | Accuracy | Use Case |
|---|---|---|---|
| PIT (i8254) | Exit per tick | Poor | Legacy only |
| LAPIC Timer | Exit per tick | Good | Default guest timer |
| HPET | Exit per read | Good | Legacy high-precision |
| TSC | Exit on RDTSC (optional) | Best | High-frequency timing |
| kvmclock | Hypercall | Good | KVM paravirt guests |
TSC Virtualization:
The TSC (Time Stamp Counter) is the fastest timer—a simple register read. VT-x provides TSC offsetting:
Guest RDTSC result = Host TSC + TSC Offset
This allows:
1. Each VM sees consistent TSC from boot
2. No exit required for RDTSC
3. Live migration can adjust offset
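Whether a guest can rely on a raw TSC depends on the host CPU; these flags (checked on the host) indicate an invariant TSC and TSC-deadline timer support:
# constant_tsc / nonstop_tsc: TSC does not vary with frequency or idle states
# tsc_deadline_timer: LAPIC timer can be programmed via TSC deadlines
grep -o -w -E 'constant_tsc|nonstop_tsc|tsc_deadline_timer' /proc/cpuinfo | sort -u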
Clock Sources for Linux Guests:
# Check guest clock source
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# tsc (best) or kvm-clock (paravirt) or hpet (legacy)
# Force TSC if stable
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
<!-- libvirt: Optimized timer configuration -->
<clock offset='utc'>
  <!-- Use TSC with proper frequency -->
  <timer name='tsc' present='yes' mode='native'/>
  <!-- HPET: Present but don't use as primary -->
  <timer name='hpet' present='yes' tickpolicy='delay'/>
  <!-- Disable PIT for modern guests -->
  <timer name='pit' tickpolicy='delay'/>
  <!-- Hyper-V enlightenments for Windows guests -->
  <timer name='hypervclock' present='yes'/>
</clock>
<!-- For Windows guests, add Hyper-V enlightenments -->
<features>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>          <!-- PV interrupt controller -->
    <spinlocks state='on' retries='8191'/>
    <vpindex state='on'/>
    <synic state='on'/>          <!-- Synthetic interrupts -->
    <stimer state='on'/>         <!-- Synthetic timers -->
    <reset state='on'/>
    <frequencies state='on'/>
  </hyperv>
</features>
Optimization without measurement is guessing. Proper benchmarking requires understanding what you're measuring and avoiding common pitfalls.
Benchmarking Principles:
| Category | Benchmark | Measures | Use Case |
|---|---|---|---|
| CPU | SPEC CPU 2017 | Integer/float performance | Overall CPU capability |
| CPU | UnixBench | Mixed system performance | Quick overall assessment |
| Memory | STREAM | Memory bandwidth | NUMA and cache effects |
| Memory | mlc (Intel) | Memory latency | NUMA placement validation |
| Network | iperf3 | Throughput | Raw network performance |
| Network | netperf | Latency and throughput | Detailed network analysis |
| Storage | fio | IOPS and bandwidth | Disk subsystem |
| Storage | sysbench | MySQL-style I/O | Database workloads |
| End-to-End | TPC-C/TPC-H | Database transactions | Real application proxy |
Measuring Virtualization Overhead:
# Compare VM vs bare metal
# 1. Run benchmark on bare metal, record results
# 2. Run same benchmark in VM, record results
# 3. Calculate overhead: (native - vm) / native * 100
# CPU overhead example with sysbench
# Bare metal:
sysbench cpu --threads=4 run
# events per second: 10000
# VM:
sysbench cpu --threads=4 run
# events per second: 9850
# Overhead: (10000 - 9850) / 10000 = 1.5%
VM Exit Analysis:
# Detailed exit analysis with perf
perf kvm stat record -p $(pgrep qemu) -- sleep 60
perf kvm stat report
# Output shows:
# - Exit counts by type
# - Time spent in each exit handler
# - Min/max/avg exit handling time
# Identify exit hotspots
perf kvm stat live
# Watch for unexpected high-frequency exits
Common Measurement Mistakes:
| Mistake | Problem | Solution |
|---|---|---|
| Testing with debug enabled | Debug adds overhead | Use production builds |
| Single short run | High variance, cache effects | Multiple runs, 60+ seconds |
| Different hardware | Can't compare results | Same servers for A/B |
| Background activity | Unpredictable interference | Quiesce system first |
| Not waiting for caches | Memory not in cache | Add warmup phase |
| Ignoring NUMA | Remote memory access | Pin to single NUMA node |
Performance Monitoring During Production:
# Continuous monitoring stack
# 1. Host metrics: node_exporter + Prometheus
# 2. VM metrics: libvirt exporter
# 3. Guest metrics: Guest agent + node_exporter
# 4. Storage metrics: iostat, iotop
# 5. Network metrics: iperf tests, packet drops
# Real-time VM performance
virt-top --connect qemu:///system
# Quick health check
virsh domstats myvm | grep -E 'cpu|memory|balloon|net|block'
Benchmarks measure specific scenarios. Production has mixed, unpredictable workloads. A VM that benchmarks well may perform poorly if CPU pinning causes contention with host processes. Always validate with production-representative workloads.
Let's examine how these optimization techniques apply to real scenarios.
Case Study 1: Database Server (PostgreSQL)
Initial Configuration: an untuned VM with default settings, delivering the 15,000 TPS baseline shown in the final result below.
Optimization Steps:
CPU Pinning + NUMA: Pin vCPUs and memory to same NUMA node
Huge Pages: Enable 2MB huge pages for VM memory
Disk: Raw image, cache=none, io=native, NVMe passthrough
Network: Multi-queue virtio, vhost
Final Result: 33,500 TPS vs 15,000 baseline = 2.2x improvement
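A storage-level benchmark such as fio (listed in the benchmark table earlier) is a useful way to validate the disk changes before re-running the database workload; the parameters below are illustrative, not the ones used in this case study:
# 8K random read/write mix with direct I/O, roughly database-like access pattern
fio --name=db-like --filename=/var/lib/libvirt/images/testfile \
    --size=4G --rw=randrw --rwmixread=70 --bs=8k --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting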
Case Study 2: Low-Latency Trading Application
Requirements:
Configuration:
<!-- Extreme low-latency configuration -->
<domain type='kvm'>
<!-- CPU isolation -->
<vcpu placement='static'>8</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<!-- ... -->
<emulatorpin cpuset='0-3'/> <!-- Emulator off perf CPUs -->
</cputune>
<!-- Memory: Huge pages, locked, NUMA-pinned -->
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
<memoryBacking>
<hugepages>
<page size='1048576' unit='KiB'/> <!-- 1GB pages -->
</hugepages>
<locked/>
<nosharepages/>
</memoryBacking>
<!-- Network: SR-IOV passthrough -->
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x82' slot='0x00' function='0x1'/>
</source>
</hostdev>
</domain>
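The PCI address in the hostdev element is host-specific, and the virtual function it refers to must be created on the host first; the standard sysfs interface looks like this (the interface name enp130s0f0 is an example):
# Create 4 virtual functions on the physical NIC
echo 4 > /sys/class/net/enp130s0f0/device/sriov_numvfs
# Confirm the VFs and note their PCI addresses
lspci | grep -i "virtual function"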
Host Tuning:
# Isolate CPUs from scheduler
# Kernel cmdline: isolcpus=4-11 nohz_full=4-11 rcu_nocbs=4-11
# Disable frequency scaling
for cpu in 4 5 6 7 8 9 10 11; do
cpupower -c $cpu frequency-set -g performance
done
# Disable C-states
for cpu in 4 5 6 7 8 9 10 11; do
for state in /sys/devices/system/cpu/cpu$cpu/cpuidle/state*/disable; do
echo 1 > $state
done
done
Results:
Case Study 3: Cloud Provider Compute Instance
Challenge: Balance density, isolation, and performance for thousands of VMs.
Approach:
| Component | Configuration | Rationale |
|---|---|---|
| CPU | Overcommit 2:1, fair scheduling | Density over performance |
| Memory | Overcommit with KSM + ballooning | Maximize density |
| Network | SR-IOV with limited VFs | Performance where needed |
| Storage | Shared storage, virtio-scsi | Flexibility + live migration |
| Migration | Live migration enabled | Operational flexibility |
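The memory row relies on KSM (kernel samepage merging); its effectiveness can be checked on the host, where pages_sharing relative to pages_shared shows how much deduplication is actually happening:
# 1 = KSM running; pages_shared = unique pages, pages_sharing = total pages mapped to them
cat /sys/kernel/mm/ksm/run
grep . /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing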
Trade-offs Accepted:
Mitigations:
Performance Expectation:
Each use case has different needs. Database servers benefit most from I/O optimization; compute-intensive workloads need CPU optimization; latency-sensitive applications need jitter reduction. Profile your specific workload and optimize the bottleneck—not everything at once.
Achieving near-native performance in virtual machines requires understanding the full stack of hardware and software optimizations. With proper configuration, the overhead of virtualization can be reduced to single-digit percentages for most workloads.
Module Complete:
You've now completed the Hardware Virtualization Support module. You understand CPU virtualization with VT-x and AMD-V, memory virtualization with EPT/NPT and VPID/ASID, I/O virtualization with VT-d, SR-IOV, and device passthrough, interrupt and timer virtualization, and the performance tuning techniques that tie them all together.
This knowledge is foundational for building, configuring, and operating virtualized infrastructure—from single-host development environments to massive cloud deployments.
Congratulations! You now have comprehensive knowledge of hardware virtualization support. From CPU virtualization (VT-x/AMD-V) through memory (EPT/NPT) and I/O (VT-d) to performance optimization, you understand how modern hypervisors leverage hardware to achieve near-native performance while maintaining strong isolation.