When the Xen team set out to paravirtualize Linux in the early 2000s, they faced a fundamental question: How much of the kernel must change? The answer was surprisingly modest—roughly 3,000 lines of code in the initial Linux 2.4 port. But those 3,000 lines touched some of the most sensitive and complex parts of the operating system.
Guest modification for paravirtualization isn't about rewriting the OS. It's about surgically replacing the assumptions the kernel makes about running on physical hardware with awareness of the virtualized environment. The goal is to change as little as possible while enabling the hypervisor to efficiently manage the guest.
This page examines exactly what must change in a guest operating system, why those changes are necessary, and how modern kernels (particularly Linux) have evolved to support multiple hypervisors through abstraction layers like paravirt_ops.
By the end of this page, you will understand which operating system subsystems require modification for paravirtualization, how the Linux kernel abstracts hypervisor differences through paravirt_ops, the specific techniques used to replace privileged operations, and the design principles that minimize modification scope while maximizing performance.
Guest modification for paravirtualization targets specific kernel subsystems that interact directly with hardware or make assumptions about the physical environment. The modifications fall into several categories:
Core Categories Requiring Modification:
| Subsystem | Why It Needs Change | Typical Modifications |
|---|---|---|
| CPU Initialization | Boot code assumes direct hardware access | Boot in protected mode with hypervisor cooperation |
| Memory Management | Page table manipulation uses privileged instructions | Replace direct CR3/PTE writes with hypercalls |
| Interrupt Handling | IDT setup and interrupt enable/disable | Virtual interrupt vectors, event channels |
| Time Management | Timer hardware access (PIT, HPET, TSC) | Paravirtualized clocksource and timer events |
| Device Drivers | Direct I/O port and MMIO access | Split driver model with ring buffers |
| SMP Support | IPI delivery, APIC manipulation | Virtual IPIs through event channels |
| Power Management | ACPI, halt instruction, cpuidle | Cooperative idle, hypervisor-aware PM |
The Principle of Minimal Modification:
Effective paravirtualization follows a key principle: modify the thinnest layer possible. Rather than changing scheduling algorithms or file system code, modifications target the architecture abstraction layer—the boundary between platform-independent kernel code and hardware-specific operations.
This principle yields several benefits: the patch stays small enough to maintain against a fast-moving mainline, hypervisor-specific code is confined to well-defined interfaces, and platform-independent subsystems (schedulers, file systems, networking) remain untouched and fully tested.
In practice, paravirtualization modifies less than 1% of kernel code. The original Xen Linux 2.4 port modified approximately 3,000 lines out of over 300,000 lines of kernel code. This small footprint makes paravirtualization-enabled kernels practical to maintain alongside mainline development.
The CPU and memory subsystems require the most significant paravirtualization changes because they involve the most privileged operations. Let's examine each in detail.
CPU Privilege Level Changes:
On bare metal, the kernel runs at Ring 0 (highest privilege). In a paravirtualized environment, the hypervisor occupies Ring 0, and the guest kernel runs at Ring 1 or Ring 3 (deprivileged). This fundamental change affects privileged instruction execution (instructions like CLI, STI, and MOV to CR3 now trap or must be replaced), control register access, and interrupt flag manipulation:
```c
/* Native x86: Direct control register access */
static inline void native_write_cr3(unsigned long val)
{
	asm volatile("mov %0, %%cr3" : : "r"(val) : "memory");
}

static inline unsigned long native_read_cr3(void)
{
	unsigned long val;
	asm volatile("mov %%cr3, %0" : "=r"(val));
	return val;
}

/* Xen Paravirtualized: Hypercall to update page tables */
static inline void xen_write_cr3(unsigned long cr3)
{
	struct mmuext_op op;

	/* Don't directly write CR3 - ask the hypervisor to do it */
	op.cmd = MMUEXT_NEW_BASEPTR;
	op.arg1.mfn = PFN_DOWN(cr3);	/* Convert to machine frame number */
	HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF);
}

static inline unsigned long xen_read_cr3(void)
{
	/* Reading is safe - we can read the virtualized CR3 value */
	return native_read_cr3();
}

/* Interrupt enable/disable: Cannot use CLI/STI directly */
static inline void native_irq_disable(void)
{
	asm volatile("cli" : : : "memory");
}

static inline void xen_irq_disable(void)
{
	/* Update the virtual interrupt flag in shared memory */
	struct vcpu_info *v = this_cpu_read(xen_vcpu);
	v->evtchn_upcall_mask = 1;
	barrier();	/* Ensure the write is visible */
}

static inline void xen_irq_enable(void)
{
	struct vcpu_info *v = this_cpu_read(xen_vcpu);
	v->evtchn_upcall_mask = 0;
	barrier();

	/* Check if events became pending while masked */
	if (unlikely(v->evtchn_upcall_pending))
		force_evtchn_callback();	/* Process pending events */
}
```

Memory Management Modifications:
Memory management paravirtualization is perhaps the most complex area. The kernel must relinquish direct control over page table updates, the page table base register (CR3), and TLB management.

The guest kernel sees pseudo-physical addresses (PFNs) that map to different machine frame numbers (MFNs) in actual hardware. This translation is fundamental to memory isolation.
```c
/* Xen Memory Address Translation */

/*
 * Address spaces in Xen paravirtualization:
 *
 * Guest Virtual Addr → Guest Physical Addr → Machine Physical Addr
 *        (GVA)               (GPA/PFN)             (MPA/MFN)
 *
 * The guest kernel works with PFNs (pseudo-physical frame numbers)
 * but page tables must contain MFNs (machine frame numbers).
 */

/* Translation tables maintained with the hypervisor */
extern unsigned long *phys_to_machine_mapping;	/* PFN → MFN */
extern unsigned long *machine_to_phys_mapping;	/* MFN → PFN */

static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
	return phys_to_machine_mapping[pfn];
}

static inline unsigned long mfn_to_pfn(unsigned long mfn)
{
	return machine_to_phys_mapping[mfn];
}

/* Creating a page table entry: must translate addresses */
static inline pte_t xen_make_pte(unsigned long val)
{
	unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
	unsigned long mfn = pfn_to_mfn(pfn);

	/* Replace PFN with MFN in the PTE */
	val = (val & ~PTE_PFN_MASK) | (mfn << PAGE_SHIFT);
	return (pte_t){ .pte = val };
}

/* Reading a page table entry: must reverse the translation */
static inline unsigned long xen_pte_val(pte_t pte)
{
	unsigned long val = pte.pte;
	unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
	unsigned long pfn = mfn_to_pfn(mfn);

	/* Replace MFN with PFN for the kernel's view */
	return (val & ~PTE_PFN_MASK) | (pfn << PAGE_SHIFT);
}

/* Page table update: must go through the hypervisor */
static void xen_set_pte(pte_t *ptep, pte_t pte)
{
	struct mmu_update u;

	/* Build the update request */
	u.ptr = virt_to_machine(ptep);	/* Machine address of the PTE */
	u.val = pte.pte;		/* New PTE value (already MFN-translated) */

	/* Ask the hypervisor to perform the validated update */
	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
		BUG();	/* Should never fail for valid updates */
}

/* Batched updates for efficiency */
static void xen_set_pte_batch(pte_t *ptes[], pte_t vals[], int count)
{
	struct mmu_update updates[count];

	for (int i = 0; i < count; i++) {
		updates[i].ptr = virt_to_machine(ptes[i]);
		updates[i].val = vals[i].pte;
	}

	/* Single hypercall for multiple updates - amortizes overhead */
	HYPERVISOR_mmu_update(updates, count, NULL, DOMID_SELF);
}
```

Interrupt handling is transformed from hardware-based delivery to software event notification. The guest no longer interacts with interrupt controllers (PIC, APIC) but instead uses event channels provided by the hypervisor.
Event Channel Architecture:
Event channels provide a notification mechanism between the hypervisor and guests (and between guests). Each channel is identified by a port number and can be bound to a physical IRQ (for hardware pass-through), a virtual IRQ such as a timer, an inter-domain port for communicating with another guest, or an intra-domain port used for virtual IPIs.
```c
/* Event Channel Based Interrupt Handling */

/*
 * Native Linux: IDT-based interrupt handling
 * - CPU receives interrupt on vector N
 * - Looks up handler in IDT[N]
 * - Calls handler with interrupt context
 *
 * Xen Paravirtualized: Event channel model
 * - Hypervisor sets pending bit in shared memory
 * - Upcalls registered callback function
 * - Guest dispatches based on event channel port
 */

/* Event channel → IRQ mapping */
struct irq_info {
	int type;			/* IRQT_NONE, IRQT_EVTCHN, etc. */
	unsigned int evtchn;		/* Bound event channel port */
	void (*handler)(int port);	/* Handler function */
};

static struct irq_info irq_info[NR_IRQS];
static unsigned int evtchn_to_irq[NR_EVENT_CHANNELS];

/* Bind an IRQ to an event channel */
int bind_evtchn_to_irq(unsigned int evtchn)
{
	int irq;

	irq = find_unbound_irq();
	if (irq < 0)
		return irq;

	irq_info[irq].type = IRQT_EVTCHN;
	irq_info[irq].evtchn = evtchn;
	evtchn_to_irq[evtchn] = irq;

	return irq;
}

/* The main event callback - replaces the hardware interrupt entry */
void xen_evtchn_do_upcall(struct pt_regs *regs)
{
	struct shared_info *s = HYPERVISOR_shared_info;
	struct vcpu_info *vcpu = this_cpu_read(xen_vcpu);
	unsigned long pending_words, pending_bits;
	int word_idx, bit_idx, port;

	/* Clear the pending flag first */
	vcpu->evtchn_upcall_pending = 0;

	/* Process all pending event channel ports */
	pending_words = xchg(&vcpu->evtchn_pending_sel, 0);

	while (pending_words) {
		word_idx = __ffs(pending_words);
		pending_words &= ~(1UL << word_idx);

		/* Get pending events for this word, excluding masked ones */
		pending_bits = s->evtchn_pending[word_idx];
		pending_bits &= ~s->evtchn_mask[word_idx];

		while (pending_bits) {
			bit_idx = __ffs(pending_bits);
			pending_bits &= ~(1UL << bit_idx);
			port = word_idx * BITS_PER_LONG + bit_idx;

			/* Clear the pending bit (bit index within this word) */
			clear_bit(bit_idx, &s->evtchn_pending[word_idx]);

			/* Dispatch to the handler */
			handle_irq(evtchn_to_irq[port], regs);
		}
	}
}

/* Timer event handling - replaces hardware timer interrupts */
static void xen_timer_interrupt(int port)
{
	struct clock_event_device *evt = this_cpu_ptr(&xen_clock_events);

	/* Acknowledge and handle the timer event */
	evt->event_handler(evt);
}

/* Setting up a timer - single hypercall instead of PIT/APIC programming */
static int xen_set_next_event(unsigned long delta,
			      struct clock_event_device *evt)
{
	/* delta is in nanoseconds; arm a one-shot timer in the hypervisor */
	HYPERVISOR_set_timer_op(xen_clocksource_read() + delta);
	return 0;
}
```

Event channels are more efficient than hardware interrupt emulation because they avoid APIC virtualization complexity. A single shared memory write and lightweight hypercall can batch multiple events, versus the overhead of emulating EOIs, TPR updates, and interrupt delivery for each individual interrupt.
As paravirtualization gained adoption, Linux needed a way to support multiple hypervisors without maintaining separate kernel trees. The solution was paravirt_ops (pv_ops)—an abstraction layer that allows runtime selection of native or paravirtualized operations.
Design Goals of paravirt_ops: support multiple hypervisors (and bare metal) from a single kernel binary, impose near-zero overhead when running natively, keep hypervisor-specific code out of common paths, and let each hypervisor override only the operations it actually needs.
```c
/* Linux paravirt_ops Framework */

/*
 * paravirt_ops provides function pointers for all virtualizable operations.
 * At boot time, these are set to either native or paravirtualized
 * implementations.
 */

/* Structure holding CPU operation function pointers */
struct pv_cpu_ops {
	/* Interrupt flag manipulation */
	unsigned long (*save_fl)(void);
	void (*restore_fl)(unsigned long);
	void (*irq_disable)(void);
	void (*irq_enable)(void);

	/* Control register operations */
	unsigned long (*read_cr0)(void);
	void (*write_cr0)(unsigned long);
	unsigned long (*read_cr3)(void);
	void (*write_cr3)(unsigned long);

	/* Debug register operations */
	unsigned long (*get_debugreg)(int);
	void (*set_debugreg)(int, unsigned long);

	/* Processor state */
	void (*halt)(void);
	void (*safe_halt)(void);

	/* ... many more operations ... */
};

/* Structure for MMU operations */
struct pv_mmu_ops {
	void (*set_pte)(pte_t *ptep, pte_t pte);
	void (*set_pmd)(pmd_t *pmdp, pmd_t pmd);
	void (*set_pud)(pud_t *pudp, pud_t pud);
	void (*set_pgd)(pgd_t *pgdp, pgd_t pgd);

	pte_t (*make_pte)(unsigned long val);
	unsigned long (*pte_val)(pte_t pte);

	void (*flush_tlb_user)(void);
	void (*flush_tlb_kernel)(void);
	void (*flush_tlb_single)(unsigned long addr);

	/* ... many more operations ... */
};

/* Global operation tables - set at boot time */
struct pv_cpu_ops pv_cpu_ops;
struct pv_mmu_ops pv_mmu_ops;

/* Native implementations - used on bare metal */
static struct pv_cpu_ops native_cpu_ops = {
	.save_fl	= native_save_fl,
	.restore_fl	= native_restore_fl,
	.irq_disable	= native_irq_disable,
	.irq_enable	= native_irq_enable,
	.read_cr3	= native_read_cr3,
	.write_cr3	= native_write_cr3,
	.halt		= native_halt,
	/* ... */
};

/* Xen implementations - used when running on Xen */
static struct pv_cpu_ops xen_cpu_ops = {
	.save_fl	= xen_save_fl,
	.restore_fl	= xen_restore_fl,
	.irq_disable	= xen_irq_disable,
	.irq_enable	= xen_irq_enable,
	.read_cr3	= xen_read_cr3,
	.write_cr3	= xen_write_cr3,
	.halt		= xen_halt,
	/* ... */
};

/* Boot-time initialization */
void __init xen_start_kernel(void)
{
	/* Replace native ops with Xen ops */
	pv_cpu_ops = xen_cpu_ops;
	pv_mmu_ops = xen_mmu_ops;
	pv_time_ops = xen_time_ops;

	/* Continue with kernel boot... */
}

/* Actual code uses the ops indirectly */
static inline void arch_local_irq_disable(void)
{
	pv_cpu_ops.irq_disable();	/* Calls either the native or Xen version */
}
```

Patching for Performance:
While function pointer indirection is flexible, it adds overhead on very hot paths such as interrupt flag manipulation. The Linux kernel eliminates this with binary patching: a technique called paravirt patching rewrites each recorded call site at boot with appropriately sized code for the detected environment, padding any leftover bytes with NOPs.
```c
/* Paravirt Patching Mechanism */

/*
 * The kernel records all paravirt call sites for later patching.
 *
 * Original code:
 *     call *pv_cpu_ops.irq_disable
 *
 * Patched for native:
 *     cli
 *     nop; nop; nop	(padding to the original instruction size)
 *
 * Patched for Xen:
 *     call xen_irq_disable
 */

struct paravirt_patch_site {
	u8 *instr;	/* Location of the call instruction */
	u8 type;	/* Type of operation (IRQ_DISABLE, etc.) */
	u8 len;		/* Length of the patch area */
};

/* Collected during compilation via section attributes */
extern struct paravirt_patch_site __parainstructions[];
extern struct paravirt_patch_site __parainstructions_end[];

void __init apply_paravirt_patches(void)
{
	struct paravirt_patch_site *p;

	for (p = __parainstructions; p < __parainstructions_end; p++) {
		unsigned int used;

		/* Let the hypervisor ops provide optimal code */
		used = pv_init_ops.patch(p->type, p->instr, p->len);

		/* Fill remaining bytes with NOPs */
		if (used < p->len)
			add_nops(p->instr + used, p->len - used);
	}
}

/* Xen provides its own patch implementations */
unsigned int xen_patch(u8 type, void *insns, unsigned int len)
{
	switch (type) {
	case PARAVIRT_PATCH(pv_cpu_ops.irq_disable):
		/* Replace with inline disable code */
		return xen_emit_irq_disable(insns);
	case PARAVIRT_PATCH(pv_cpu_ops.irq_enable):
		/* Replace with inline enable code */
		return xen_emit_irq_enable(insns);
	default:
		/* Fall back to a function call */
		return paravirt_patch_default(type, insns, len);
	}
}
```

The paravirt_ops framework isn't Xen-specific. It also supports KVM (via kvmclock), VMware (via vmware-specific ops), Microsoft Hyper-V, and others. This unified framework allows a single kernel to efficiently run across all these platforms with appropriate runtime behavior.
The boot process for a paravirtualized guest differs fundamentally from bare-metal boot. The kernel cannot assume control of hardware initialization and must cooperate with the hypervisor from the very first instruction.
Native Boot Sequence (simplified):

1. Firmware (BIOS/UEFI) initializes hardware and loads the bootloader
2. The bootloader loads the kernel and jumps to its real-mode entry point
3. The kernel enables protected/long mode and builds its own page tables
4. The kernel probes and initializes hardware directly

Paravirtualized Boot Sequence (Xen PV):

1. Xen loads the kernel image directly at a dedicated ELF entry point
2. The guest starts deprivileged, in protected/long mode, with paging already enabled by Xen
3. A register points to the start_info structure describing memory, console, and event channels
4. The kernel installs its pv_ops tables and continues initialization cooperatively

```c
/* Xen Paravirtualized Boot Entry */

/*
 * The kernel is loaded by Xen at a specific entry point
 * with registers set up according to the Xen ABI.
 *
 * On entry:
 *   %rsi = pointer to the start_info structure
 *   %rsp = valid stack
 *   Running in ring 3, with Xen in ring 0
 */

struct start_info {
	char magic[32];			/* "xen-<version>-<platform>" */
	unsigned long nr_pages;		/* Total pages given to the domain */
	unsigned long shared_info;	/* MFN of the shared_info structure */
	uint32_t flags;			/* Flags (SIF_*) */
	xen_pfn_t store_mfn;		/* MFN of the XenStore page */
	uint32_t store_evtchn;		/* Event channel for the store */
	xen_pfn_t console_mfn;		/* MFN of the console page */
	uint32_t console_evtchn;	/* Event channel for the console */
	unsigned long pt_base;		/* PFN of the initial page table */
	unsigned long mod_start;	/* PFN of loaded modules (initrd) */
	unsigned long mod_len;		/* Length of the modules */
	char cmd_line[1024];		/* Kernel command line */
};

/* Entry point from Xen - this is where paravirtualized boot begins */
void __init xen_start_kernel(struct start_info *si)
{
	/* Save the start_info pointer */
	xen_start_info = si;

	/* Map the shared info page - essential for communication */
	HYPERVISOR_shared_info =
		(struct shared_info *)fix_to_virt(FIX_SHARED_INFO);

	/* Set up paravirt_ops to use the Xen implementations */
	pv_info = xen_info;
	pv_init_ops = xen_init_ops;
	pv_cpu_ops = xen_cpu_ops;
	pv_mmu_ops = xen_mmu_ops;

	/* Initialize the P2M (pseudo-physical to machine) table */
	xen_setup_machphys_mapping();

	/* Register the event channel callback */
	xen_init_interrupts();

	/* Set up time management */
	xen_time_init();

	/* Continue with normal kernel initialization... */
}

/* The kernel image identifies this as a Xen-compatible entry point */
#ifdef CONFIG_XEN_PV
__asm__ (
	".pushsection .text\n"
	"xen_pv_start:\n"
	"	mov %rsi, %rdi\n"	/* start_info as the first argument */
	"	call xen_start_kernel\n"
	".popsection\n"
);
#endif
```

Modern Xen supports 'PVH' mode, a hybrid where the guest uses hardware virtualization for CPU/memory (no ring deprivileging) but paravirtualized I/O. This simplifies boot significantly while retaining paravirt I/O performance benefits.
Device access in paravirtualized guests uses the split driver model, where functionality is divided between a frontend driver running in the guest (DomU) and a backend driver running in the privileged domain (Dom0), which multiplexes guest requests onto the real device driver.

This model achieves excellent performance because requests are batched through shared-memory ring buffers, notifications are coalesced over event channels rather than raised per request, and grant tables let data pages be shared instead of copied.
```c
/* Split Driver Architecture: Block Device Example */

/*
 * Guest (DomU) Frontend                    Dom0 Backend
 * ─────────────────────                    ─────────────────
 *
 * [Block Layer Request]
 *         ↓
 * [Frontend Driver] ───ring───→ [Backend Driver]
 *         ↑          ←──ring────        ↓
 *         │                     [Real Block Device]
 *         │
 * [Completion to Block Layer]
 */

/* Ring buffer structures (producer-consumer) */
struct blkif_request {
	uint8_t operation;		/* BLKIF_OP_READ/WRITE/etc. */
	uint8_t nr_segments;		/* Number of memory segments */
	uint64_t id;			/* Request ID (echoed in the response) */
	uint64_t sector_number;		/* Starting sector */
	struct blkif_request_segment {
		grant_ref_t gref;	/* Grant reference for page sharing */
		uint8_t first_sect;	/* First sector in the page */
		uint8_t last_sect;	/* Last sector in the page */
	} seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};

struct blkif_response {
	uint64_t id;		/* Request ID (for matching) */
	uint8_t operation;	/* Original operation */
	int16_t status;		/* BLKIF_RSP_OKAY/ERROR */
};

/* Shared ring macros - handle wrap-around correctly */
DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response);

/* Frontend: submitting a block I/O request */
static int xen_blkfront_queue_request(struct request *req)
{
	struct blkfront_info *info = req->q->queuedata;
	struct blkif_request *ring_req;
	int notify;

	/* Get the next slot in the ring */
	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);

	/* Assign a unique ID for matching the response */
	ring_req->id = get_id_from_request(info, req);

	/* Fill in the request */
	ring_req->operation = rq_data_dir(req) ? BLKIF_OP_WRITE : BLKIF_OP_READ;
	ring_req->sector_number = blk_rq_pos(req);
	ring_req->nr_segments = 0;

	/* Set up grant references for the data pages */
	for_each_segment(bvec, req) {
		grant_ref_t gref = gnttab_grant_foreign_access(
			info->backend_id,
			virt_to_mfn(page_address(bvec->bv_page)),
			rq_data_dir(req) == WRITE /* read-only grant for writes */);

		ring_req->seg[ring_req->nr_segments].gref = gref;
		ring_req->seg[ring_req->nr_segments].first_sect =
			bvec->bv_offset >> 9;
		ring_req->seg[ring_req->nr_segments].last_sect =
			(bvec->bv_offset + bvec->bv_len - 1) >> 9;
		ring_req->nr_segments++;
	}

	/* Advance the producer pointer and notify the backend */
	info->ring.req_prod_pvt++;
	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify);
	if (notify)
		notify_remote_via_evtchn(info->evtchn);

	return 0;
}

/* Frontend: processing completed responses */
static irqreturn_t xen_blkfront_irq(int irq, void *dev_id)
{
	struct blkfront_info *info = dev_id;
	struct blkif_response *response;
	RING_IDX cons, prod;

	prod = info->ring.sring->rsp_prod;
	rmb();	/* Read the producer index before reading responses */

	for (cons = info->ring.rsp_cons; cons != prod; cons++) {
		struct request *req;

		response = RING_GET_RESPONSE(&info->ring, cons);

		/* Find the original request and complete it */
		req = get_request_from_id(info, response->id);
		blk_mq_end_request(req, response->status == BLKIF_RSP_OKAY ?
					BLK_STS_OK : BLK_STS_IOERR);

		/* Release the grant references */
		release_grants_for_request(info, req);
	}
	info->ring.rsp_cons = cons;

	return IRQ_HANDLED;
}
```

Grant Tables for Memory Sharing:
Grant tables provide a secure mechanism for sharing memory pages between domains. Rather than directly accessing each other's memory (which would violate isolation), a domain explicitly grants a named peer access to specific pages, the peer maps or copies them through the hypervisor, and the grant is revoked once the I/O completes.
This model enables zero-copy I/O: the guest grants backend access to data pages, avoiding the need to copy data between domains.
For operating system developers considering paravirtualization support, several principles guide successful implementation: confine changes to the architecture abstraction layer, preserve existing kernel interfaces so common code stays untouched, batch hypercalls on hot paths, keep the native path free of measurable overhead, and use one abstraction (such as pv_ops) for all hypervisors rather than forking the kernel.
We've explored the specific modifications required to paravirtualize a guest operating system. The key insights: the changes are small (on the order of thousands of lines) but concentrated in the most privileged subsystems; CPU and memory management replace privileged instructions with hypercalls and PFN/MFN translation; interrupts become event channels; devices use split drivers over shared rings and grant tables; and paravirt_ops lets a single kernel binary serve bare metal and multiple hypervisors.
What's Next:
Now that we understand what changes in a guest OS, we'll examine how the guest invokes hypervisor services. The next page explores hypercalls—the mechanism by which paravirtualized guests request privileged operations from the hypervisor. We'll see the interface design, calling conventions, and specific examples of hypercall implementations.
You now understand the specific modifications required to paravirtualize an operating system. From CPU privilege changes to split drivers, these surgical modifications enable near-native performance while maintaining hypervisor control.