When the Xen team set out to paravirtualize Linux in the early 2000s, they faced a fundamental question: How much of the kernel must change? The answer was surprisingly modest—roughly 3,000 lines of code in the initial Linux 2.4 port. But those 3,000 lines touched some of the most sensitive and complex parts of the operating system.
Guest modification for paravirtualization isn't about rewriting the OS. It's about surgically replacing the assumptions the kernel makes about running on physical hardware with awareness of the virtualized environment. The goal is to change as little as possible while enabling the hypervisor to efficiently manage the guest.
This page examines exactly what must change in a guest operating system, why those changes are necessary, and how modern kernels (particularly Linux) have evolved to support multiple hypervisors through abstraction layers like paravirt_ops.
By the end of this page, you will understand which operating system subsystems require modification for paravirtualization, how the Linux kernel abstracts hypervisor differences through paravirt_ops, the specific techniques used to replace privileged operations, and the design principles that minimize modification scope while maximizing performance.
Guest modification for paravirtualization targets specific kernel subsystems that interact directly with hardware or make assumptions about the physical environment. The modifications fall into several categories:
Core Categories Requiring Modification:
| Subsystem | Why It Needs Change | Typical Modifications |
|---|---|---|
| CPU Initialization | Boot code assumes direct hardware access | Boot in protected mode with hypervisor cooperation |
| Memory Management | Page table manipulation uses privileged instructions | Replace direct CR3/PTE writes with hypercalls |
| Interrupt Handling | IDT setup and interrupt enable/disable | Virtual interrupt vectors, event channels |
| Time Management | Timer hardware access (PIT, HPET, TSC) | Paravirtualized clocksource and timer events |
| Device Drivers | Direct I/O port and MMIO access | Split driver model with ring buffers |
| SMP Support | IPI delivery, APIC manipulation | Virtual IPIs through event channels |
| Power Management | ACPI, halt instruction, cpuidle | Cooperative idle, hypervisor-aware PM |
The Principle of Minimal Modification:
Effective paravirtualization follows a key principle: modify the thinnest layer possible. Rather than changing scheduling algorithms or file system code, modifications target the architecture abstraction layer—the boundary between platform-independent kernel code and hardware-specific operations.
This principle yields several benefits: the patch stays small enough to maintain against a fast-moving mainline, hypervisor-specific code is confined to well-defined interfaces, and platform-independent subsystems (schedulers, file systems, networking) remain untouched and fully tested.
In practice, paravirtualization modifies less than 1% of kernel code. The original Xen Linux 2.4 port modified approximately 3,000 lines out of over 300,000 lines of kernel code. This small footprint makes paravirtualization-enabled kernels practical to maintain alongside mainline development.
The CPU and memory subsystems require the most significant paravirtualization changes because they involve the most privileged operations. Let's examine each in detail.
CPU Privilege Level Changes:
On bare metal, the kernel runs at Ring 0 (highest privilege). In a paravirtualized environment, the hypervisor occupies Ring 0, and the guest kernel runs at Ring 1 or Ring 3 (deprivileged). This fundamental change affects privileged instruction execution (instructions like CLI, STI, and MOV to CR3 now trap or must be replaced), control register access, and interrupt flag manipulation:
```c
/* Native x86: Direct control register access */
static inline void native_write_cr3(unsigned long val)
{
	asm volatile("mov %0, %%cr3" : : "r"(val) : "memory");
}

static inline unsigned long native_read_cr3(void)
{
	unsigned long val;
	asm volatile("mov %%cr3, %0" : "=r"(val));
	return val;
}

/* Xen Paravirtualized: Hypercall to update page tables */
static inline void xen_write_cr3(unsigned long cr3)
{
	struct mmuext_op op;

	/* Don't directly write CR3 - ask the hypervisor to do it */
	op.cmd = MMUEXT_NEW_BASEPTR;
	op.arg1.mfn = PFN_DOWN(cr3);	/* Convert to machine frame number */
	HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF);
}

static inline unsigned long xen_read_cr3(void)
{
	/* Reading is safe - we can read the virtualized CR3 value */
	return native_read_cr3();
}

/* Interrupt enable/disable: Cannot use CLI/STI directly */
static inline void native_irq_disable(void)
{
	asm volatile("cli" : : : "memory");
}

static inline void xen_irq_disable(void)
{
	/* Update the virtual interrupt flag in shared memory */
	struct vcpu_info *v = this_cpu_read(xen_vcpu);
	v->evtchn_upcall_mask = 1;
	barrier();	/* Ensure the write is visible */
}

static inline void xen_irq_enable(void)
{
	struct vcpu_info *v = this_cpu_read(xen_vcpu);
	v->evtchn_upcall_mask = 0;
	barrier();

	/* Check if events became pending while masked */
	if (unlikely(v->evtchn_upcall_pending))
		force_evtchn_callback();	/* Process pending events */
}
```

Memory Management Modifications:
Memory management paravirtualization is perhaps the most complex area. The kernel must relinquish direct control over page table updates, the page table base register (CR3), and TLB management.

The guest kernel sees pseudo-physical addresses (PFNs) that map to different machine frame numbers (MFNs) in actual hardware. This translation is fundamental to memory isolation.
```c
/* Xen Memory Address Translation */

/*
 * Address spaces in Xen paravirtualization:
 *
 * Guest Virtual Addr → Guest Physical Addr → Machine Physical Addr
 *        (GVA)               (GPA/PFN)             (MPA/MFN)
 *
 * The guest kernel works with PFNs (pseudo-physical frame numbers)
 * but page tables must contain MFNs (machine frame numbers).
 */

/* Translation tables maintained with the hypervisor */
extern unsigned long *phys_to_machine_mapping;	/* PFN → MFN */
extern unsigned long *machine_to_phys_mapping;	/* MFN → PFN */

static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
	return phys_to_machine_mapping[pfn];
}

static inline unsigned long mfn_to_pfn(unsigned long mfn)
{
	return machine_to_phys_mapping[mfn];
}

/* Creating a page table entry: must translate addresses */
static inline pte_t xen_make_pte(unsigned long val)
{
	unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
	unsigned long mfn = pfn_to_mfn(pfn);

	/* Replace PFN with MFN in the PTE */
	val = (val & ~PTE_PFN_MASK) | (mfn << PAGE_SHIFT);
	return (pte_t){ .pte = val };
}

/* Reading a page table entry: must reverse the translation */
static inline unsigned long xen_pte_val(pte_t pte)
{
	unsigned long val = pte.pte;
	unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
	unsigned long pfn = mfn_to_pfn(mfn);

	/* Replace MFN with PFN for the kernel's view */
	return (val & ~PTE_PFN_MASK) | (pfn << PAGE_SHIFT);
}

/* Page table update: must go through the hypervisor */
static void xen_set_pte(pte_t *ptep, pte_t pte)
{
	struct mmu_update u;

	/* Build the update request */
	u.ptr = virt_to_machine(ptep);	/* Machine address of the PTE */
	u.val = pte.pte;		/* New PTE value (already MFN-translated) */

	/* Ask the hypervisor to perform the validated update */
	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
		BUG();	/* Should never fail for valid updates */
}

/* Batched updates for efficiency */
static void xen_set_pte_batch(pte_t *ptes[], pte_t vals[], int count)
{
	struct mmu_update updates[count];

	for (int i = 0; i < count; i++) {
		updates[i].ptr = virt_to_machine(ptes[i]);
		updates[i].val = vals[i].pte;
	}

	/* Single hypercall for multiple updates - amortizes overhead */
	HYPERVISOR_mmu_update(updates, count, NULL, DOMID_SELF);
}
```

Interrupt handling is transformed from hardware-based delivery to software event notification. The guest no longer interacts with interrupt controllers (PIC, APIC) but instead uses event channels provided by the hypervisor.
Event Channel Architecture:
Event channels provide a notification mechanism between the hypervisor and guests (and between guests). Each channel is identified by a port number and can be bound to a physical IRQ (for hardware pass-through), a virtual IRQ such as a timer, an inter-domain port for communicating with another guest, or an intra-domain port used for virtual IPIs.
```c
/* Event Channel Based Interrupt Handling */

/*
 * Native Linux: IDT-based interrupt handling
 * - CPU receives interrupt on vector N
 * - Looks up handler in IDT[N]
 * - Calls handler with interrupt context
 *
 * Xen Paravirtualized: Event channel model
 * - Hypervisor sets pending bit in shared memory
 * - Upcalls registered callback function
 * - Guest dispatches based on event channel port
 */

/* Event channel → IRQ mapping */
struct irq_info {
	int type;			/* IRQT_NONE, IRQT_EVTCHN, etc. */
	unsigned int evtchn;		/* Bound event channel port */
	void (*handler)(int port);	/* Handler function */
};

static struct irq_info irq_info[NR_IRQS];
static unsigned int evtchn_to_irq[NR_EVENT_CHANNELS];

/* Bind an IRQ to an event channel */
int bind_evtchn_to_irq(unsigned int evtchn)
{
	int irq;

	irq = find_unbound_irq();
	if (irq < 0)
		return irq;

	irq_info[irq].type = IRQT_EVTCHN;
	irq_info[irq].evtchn = evtchn;
	evtchn_to_irq[evtchn] = irq;

	return irq;
}

/* The main event callback - replaces the hardware interrupt entry */
void xen_evtchn_do_upcall(struct pt_regs *regs)
{
	struct shared_info *s = HYPERVISOR_shared_info;
	struct vcpu_info *vcpu = this_cpu_read(xen_vcpu);
	unsigned long pending_words, pending_bits;
	int word_idx, bit_idx, port;

	/* Clear the pending flag first */
	vcpu->evtchn_upcall_pending = 0;

	/* Process all pending event channel ports */
	pending_words = xchg(&vcpu->evtchn_pending_sel, 0);

	while (pending_words) {
		word_idx = __ffs(pending_words);
		pending_words &= ~(1UL << word_idx);

		/* Get pending events for this word, excluding masked ones */
		pending_bits = s->evtchn_pending[word_idx];
		pending_bits &= ~s->evtchn_mask[word_idx];

		while (pending_bits) {
			bit_idx = __ffs(pending_bits);
			pending_bits &= ~(1UL << bit_idx);
			port = word_idx * BITS_PER_LONG + bit_idx;

			/* Clear the pending bit (bit index within this word) */
			clear_bit(bit_idx, &s->evtchn_pending[word_idx]);

			/* Dispatch to the handler */
			handle_irq(evtchn_to_irq[port], regs);
		}
	}
}

/* Timer event handling - replaces hardware timer interrupts */
static void xen_timer_interrupt(int port)
{
	struct clock_event_device *evt = this_cpu_ptr(&xen_clock_events);

	/* Acknowledge and handle the timer event */
	evt->event_handler(evt);
}

/* Setting up a timer - single hypercall instead of PIT/APIC programming */
static int xen_set_next_event(unsigned long delta,
			      struct clock_event_device *evt)
{
	/* delta is in nanoseconds; arm a one-shot timer in the hypervisor */
	HYPERVISOR_set_timer_op(xen_clocksource_read() + delta);
	return 0;
}
```

Event channels are more efficient than hardware interrupt emulation because they avoid APIC virtualization complexity. A single shared memory write and lightweight hypercall can batch multiple events, versus the overhead of emulating EOIs, TPR updates, and interrupt delivery for each individual interrupt.
As paravirtualization gained adoption, Linux needed a way to support multiple hypervisors without maintaining separate kernel trees. The solution was paravirt_ops (pv_ops)—an abstraction layer that allows runtime selection of native or paravirtualized operations.
Design Goals of paravirt_ops: support multiple hypervisors (and bare metal) from a single kernel binary, impose near-zero overhead when running natively, keep hypervisor-specific code out of common paths, and let each hypervisor override only the operations it actually needs.
```c
/* Linux paravirt_ops Framework */

/*
 * paravirt_ops provides function pointers for all virtualizable operations.
 * At boot time, these are set to either native or paravirtualized
 * implementations.
 */

/* Structure holding CPU operation function pointers */
struct pv_cpu_ops {
	/* Interrupt flag manipulation */
	unsigned long (*save_fl)(void);
	void (*restore_fl)(unsigned long);
	void (*irq_disable)(void);
	void (*irq_enable)(void);

	/* Control register operations */
	unsigned long (*read_cr0)(void);
	void (*write_cr0)(unsigned long);
	unsigned long (*read_cr3)(void);
	void (*write_cr3)(unsigned long);

	/* Debug register operations */
	unsigned long (*get_debugreg)(int);
	void (*set_debugreg)(int, unsigned long);

	/* Processor state */
	void (*halt)(void);
	void (*safe_halt)(void);

	/* ... many more operations ... */
};

/* Structure for MMU operations */
struct pv_mmu_ops {
	void (*set_pte)(pte_t *ptep, pte_t pte);
	void (*set_pmd)(pmd_t *pmdp, pmd_t pmd);
	void (*set_pud)(pud_t *pudp, pud_t pud);
	void (*set_pgd)(pgd_t *pgdp, pgd_t pgd);

	pte_t (*make_pte)(unsigned long val);
	unsigned long (*pte_val)(pte_t pte);

	void (*flush_tlb_user)(void);
	void (*flush_tlb_kernel)(void);
	void (*flush_tlb_single)(unsigned long addr);

	/* ... many more operations ... */
};

/* Global operation tables - set at boot time */
struct pv_cpu_ops pv_cpu_ops;
struct pv_mmu_ops pv_mmu_ops;

/* Native implementations - used on bare metal */
static struct pv_cpu_ops native_cpu_ops = {
	.save_fl	= native_save_fl,
	.restore_fl	= native_restore_fl,
	.irq_disable	= native_irq_disable,
	.irq_enable	= native_irq_enable,
	.read_cr3	= native_read_cr3,
	.write_cr3	= native_write_cr3,
	.halt		= native_halt,
	/* ... */
};

/* Xen implementations - used when running on Xen */
static struct pv_cpu_ops xen_cpu_ops = {
	.save_fl	= xen_save_fl,
	.restore_fl	= xen_restore_fl,
	.irq_disable	= xen_irq_disable,
	.irq_enable	= xen_irq_enable,
	.read_cr3	= xen_read_cr3,
	.write_cr3	= xen_write_cr3,
	.halt		= xen_halt,
	/* ... */
};

/* Boot-time initialization */
void __init xen_start_kernel(void)
{
	/* Replace native ops with Xen ops */
	pv_cpu_ops = xen_cpu_ops;
	pv_mmu_ops = xen_mmu_ops;
	pv_time_ops = xen_time_ops;

	/* Continue with kernel boot... */
}

/* Actual code uses the ops indirectly */
static inline void arch_local_irq_disable(void)
{
	pv_cpu_ops.irq_disable();	/* Calls either the native or Xen version */
}
```

Patching for Performance:
While function pointer indirection is flexible, it adds overhead on very hot paths such as interrupt flag manipulation. The Linux kernel eliminates this with binary patching: a technique called paravirt patching rewrites each recorded call site at boot with appropriately sized code for the detected environment, padding any leftover bytes with NOPs.
```c
/* Paravirt Patching Mechanism */

/*
 * The kernel records all paravirt call sites for later patching.
 *
 * Original code:
 *     call *pv_cpu_ops.irq_disable
 *
 * Patched for native:
 *     cli
 *     nop; nop; nop	(padding to the original instruction size)
 *
 * Patched for Xen:
 *     call xen_irq_disable
 */

struct paravirt_patch_site {
	u8 *instr;	/* Location of the call instruction */
	u8 type;	/* Type of operation (IRQ_DISABLE, etc.) */
	u8 len;		/* Length of the patch area */
};

/* Collected during compilation via section attributes */
extern struct paravirt_patch_site __parainstructions[];
extern struct paravirt_patch_site __parainstructions_end[];

void __init apply_paravirt_patches(void)
{
	struct paravirt_patch_site *p;

	for (p = __parainstructions; p < __parainstructions_end; p++) {
		unsigned int used;

		/* Let the hypervisor ops provide optimal code */
		used = pv_init_ops.patch(p->type, p->instr, p->len);

		/* Fill remaining bytes with NOPs */
		if (used < p->len)
			add_nops(p->instr + used, p->len - used);
	}
}

/* Xen provides its own patch implementations */
unsigned int xen_patch(u8 type, void *insns, unsigned int len)
{
	switch (type) {
	case PARAVIRT_PATCH(pv_cpu_ops.irq_disable):
		/* Replace with inline disable code */
		return xen_emit_irq_disable(insns);
	case PARAVIRT_PATCH(pv_cpu_ops.irq_enable):
		/* Replace with inline enable code */
		return xen_emit_irq_enable(insns);
	default:
		/* Fall back to a function call */
		return paravirt_patch_default(type, insns, len);
	}
}
```

The paravirt_ops framework isn't Xen-specific. It also supports KVM (via kvmclock), VMware (via vmware-specific ops), Microsoft Hyper-V, and others. This unified framework allows a single kernel to efficiently run across all these platforms with appropriate runtime behavior.
The boot process for a paravirtualized guest differs fundamentally from bare-metal boot. The kernel cannot assume control of hardware initialization and must cooperate with the hypervisor from the very first instruction.
Native Boot Sequence (simplified):

1. Firmware (BIOS/UEFI) initializes hardware and loads the bootloader
2. The bootloader loads the kernel and jumps to its real-mode entry point
3. The kernel enables protected/long mode and builds its own page tables
4. The kernel probes and initializes hardware directly

Paravirtualized Boot Sequence (Xen PV):

1. Xen loads the kernel image directly at a dedicated ELF entry point
2. The guest starts deprivileged, in protected/long mode, with paging already enabled by Xen
3. A register points to the start_info structure describing memory, console, and event channels
4. The kernel installs its pv_ops tables and continues initialization cooperatively

```c
/* Xen Paravirtualized Boot Entry */

/*
 * The kernel is loaded by Xen at a specific entry point
 * with registers set up according to the Xen ABI.
 *
 * On entry:
 *   %rsi = pointer to the start_info structure
 *   %rsp = valid stack
 *   Running in ring 3, with Xen in ring 0
 */

struct start_info {
	char magic[32];			/* "xen-<version>-<platform>" */
	unsigned long nr_pages;		/* Total pages given to the domain */
	unsigned long shared_info;	/* MFN of the shared_info structure */
	uint32_t flags;			/* Flags (SIF_*) */
	xen_pfn_t store_mfn;		/* MFN of the XenStore page */
	uint32_t store_evtchn;		/* Event channel for the store */
	xen_pfn_t console_mfn;		/* MFN of the console page */
	uint32_t console_evtchn;	/* Event channel for the console */
	unsigned long pt_base;		/* PFN of the initial page table */
	unsigned long mod_start;	/* PFN of loaded modules (initrd) */
	unsigned long mod_len;		/* Length of the modules */
	char cmd_line[1024];		/* Kernel command line */
};

/* Entry point from Xen - this is where paravirtualized boot begins */
void __init xen_start_kernel(struct start_info *si)
{
	/* Save the start_info pointer */
	xen_start_info = si;

	/* Map the shared info page - essential for communication */
	HYPERVISOR_shared_info =
		(struct shared_info *)fix_to_virt(FIX_SHARED_INFO);

	/* Set up paravirt_ops to use the Xen implementations */
	pv_info = xen_info;
	pv_init_ops = xen_init_ops;
	pv_cpu_ops = xen_cpu_ops;
	pv_mmu_ops = xen_mmu_ops;

	/* Initialize the P2M (pseudo-physical to machine) table */
	xen_setup_machphys_mapping();

	/* Register the event channel callback */
	xen_init_interrupts();

	/* Set up time management */
	xen_time_init();

	/* Continue with normal kernel initialization... */
}

/* The kernel image identifies this as a Xen-compatible entry point */
#ifdef CONFIG_XEN_PV
__asm__ (
	".pushsection .text\n"
	"xen_pv_start:\n"
	"	mov %rsi, %rdi\n"	/* start_info as the first argument */
	"	call xen_start_kernel\n"
	".popsection\n"
);
#endif
```

Modern Xen supports 'PVH' mode, a hybrid where the guest uses hardware virtualization for CPU/memory (no ring deprivileging) but paravirtualized I/O. This simplifies boot significantly while retaining paravirt I/O performance benefits.
Device access in paravirtualized guests uses the split driver model, where functionality is divided between a frontend driver running in the guest (DomU) and a backend driver running in the privileged domain (Dom0), which multiplexes guest requests onto the real device driver.

This model achieves excellent performance because requests are batched through shared-memory ring buffers, notifications are coalesced over event channels rather than raised per request, and grant tables let data pages be shared instead of copied.
```c
/* Split Driver Architecture: Block Device Example */

/*
 * Guest (DomU) Frontend                    Dom0 Backend
 * ─────────────────────                    ─────────────────
 *
 * [Block Layer Request]
 *         ↓
 * [Frontend Driver] ───ring───→ [Backend Driver]
 *         ↑          ←──ring────        ↓
 *         │                     [Real Block Device]
 *         │
 * [Completion to Block Layer]
 */

/* Ring buffer structures (producer-consumer) */
struct blkif_request {
	uint8_t operation;		/* BLKIF_OP_READ/WRITE/etc. */
	uint8_t nr_segments;		/* Number of memory segments */
	uint64_t id;			/* Request ID (echoed in the response) */
	uint64_t sector_number;		/* Starting sector */
	struct blkif_request_segment {
		grant_ref_t gref;	/* Grant reference for page sharing */
		uint8_t first_sect;	/* First sector in the page */
		uint8_t last_sect;	/* Last sector in the page */
	} seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};

struct blkif_response {
	uint64_t id;		/* Request ID (for matching) */
	uint8_t operation;	/* Original operation */
	int16_t status;		/* BLKIF_RSP_OKAY/ERROR */
};

/* Shared ring macros - handle wrap-around correctly */
DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response);

/* Frontend: submitting a block I/O request */
static int xen_blkfront_queue_request(struct request *req)
{
	struct blkfront_info *info = req->q->queuedata;
	struct blkif_request *ring_req;
	int notify;

	/* Get the next slot in the ring */
	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);

	/* Assign a unique ID for matching the response */
	ring_req->id = get_id_from_request(info, req);

	/* Fill in the request */
	ring_req->operation = rq_data_dir(req) ? BLKIF_OP_WRITE : BLKIF_OP_READ;
	ring_req->sector_number = blk_rq_pos(req);
	ring_req->nr_segments = 0;

	/* Set up grant references for the data pages */
	for_each_segment(bvec, req) {
		grant_ref_t gref = gnttab_grant_foreign_access(
			info->backend_id,
			virt_to_mfn(page_address(bvec->bv_page)),
			rq_data_dir(req) == WRITE /* read-only grant for writes */);

		ring_req->seg[ring_req->nr_segments].gref = gref;
		ring_req->seg[ring_req->nr_segments].first_sect =
			bvec->bv_offset >> 9;
		ring_req->seg[ring_req->nr_segments].last_sect =
			(bvec->bv_offset + bvec->bv_len - 1) >> 9;
		ring_req->nr_segments++;
	}

	/* Advance the producer pointer and notify the backend */
	info->ring.req_prod_pvt++;
	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify);
	if (notify)
		notify_remote_via_evtchn(info->evtchn);

	return 0;
}

/* Frontend: processing completed responses */
static irqreturn_t xen_blkfront_irq(int irq, void *dev_id)
{
	struct blkfront_info *info = dev_id;
	struct blkif_response *response;
	RING_IDX cons, prod;

	prod = info->ring.sring->rsp_prod;
	rmb();	/* Read the producer index before reading responses */

	for (cons = info->ring.rsp_cons; cons != prod; cons++) {
		struct request *req;

		response = RING_GET_RESPONSE(&info->ring, cons);

		/* Find the original request and complete it */
		req = get_request_from_id(info, response->id);
		blk_mq_end_request(req, response->status == BLKIF_RSP_OKAY ?
					BLK_STS_OK : BLK_STS_IOERR);

		/* Release the grant references */
		release_grants_for_request(info, req);
	}
	info->ring.rsp_cons = cons;

	return IRQ_HANDLED;
}
```

Grant Tables for Memory Sharing:
Grant tables provide a secure mechanism for sharing memory pages between domains. Rather than directly accessing each other's memory (which would violate isolation), a domain explicitly grants a named peer access to specific pages, the peer maps or copies them through the hypervisor, and the grant is revoked once the I/O completes.
This model enables zero-copy I/O: the guest grants backend access to data pages, avoiding the need to copy data between domains.
For operating system developers considering paravirtualization support, several principles guide successful implementation: confine changes to the architecture abstraction layer, preserve existing kernel interfaces so common code stays untouched, batch hypercalls on hot paths, keep the native path free of measurable overhead, and use one abstraction (such as pv_ops) for all hypervisors rather than forking the kernel.
We've explored the specific modifications required to paravirtualize a guest operating system. The key insights: the changes are small (on the order of thousands of lines) but concentrated in the most privileged subsystems; CPU and memory management replace privileged instructions with hypercalls and PFN/MFN translation; interrupts become event channels; devices use split drivers over shared rings and grant tables; and paravirt_ops lets a single kernel binary serve bare metal and multiple hypervisors.
What's Next:
Now that we understand what changes in a guest OS, we'll examine how the guest invokes hypervisor services. The next page explores hypercalls—the mechanism by which paravirtualized guests request privileged operations from the hypervisor. We'll see the interface design, calling conventions, and specific examples of hypercall implementations.
You now understand the specific modifications required to paravirtualize an operating system. From CPU privilege changes to split drivers, these surgical modifications enable near-native performance while maintaining hypervisor control.