Cloud computing would be impossible without virtualization. Every virtual machine, container, and serverless function relies on the operating system's ability to abstract and multiplex hardware resources. This page explores the virtualization technologies that transform data centers into elastic computing platforms.
We've covered virtualization fundamentals in earlier chapters. Here, we focus on how these concepts apply at cloud scale—the architectural decisions, performance optimizations, and operational practices that enable hyperscale cloud providers to serve millions of customers from shared infrastructure.
By completing this page, you will understand how cloud providers implement virtualization at scale, the trade-offs between different virtualization technologies, how hardware acceleration improves efficiency, and how cloud operating systems orchestrate resources across data centers.
Cloud virtualization extends traditional hypervisor-based virtualization with sophisticated management layers that handle resource scheduling, multi-tenancy, and global orchestration.
The Cloud Virtualization Stack:
┌─────────────────────────────────────────────────────────────────┐
│ CLOUD MANAGEMENT PLANE │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ API Gateway │ Scheduler │ Resource Manager │ Billing ││
│ └─────────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│ ORCHESTRATION LAYER │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ VM Lifecycle │ Storage Provisioning │ Network Config ││
│ └─────────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│ HYPERVISOR CLUSTER │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Host Node 1 │ │ Host Node 2 │ │ Host Node N │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │Hypervisor│ │ │ │Hypervisor│ │ │ │Hypervisor│ │ │
│ │ │ (KVM) │ │ │ │ (KVM) │ │ │ │ (KVM) │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ │ ┌──┐┌──┐┌──┐│ │ ┌──┐┌──┐┌──┐│ │ ┌──┐┌──┐┌──┐│ │
│ │ │VM││VM││VM││ │ │VM││VM││VM││ │ │VM││VM││VM││ │
│ │ └──┘└──┘└──┘│ │ └──┘└──┘└──┘│ │ └──┘└──┘└──┘│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ PHYSICAL INFRASTRUCTURE │
│ Compute Servers │ Storage Arrays │ Network Fabric │ Power │
└─────────────────────────────────────────────────────────────────┘
Key Architectural Components:
1. Management Plane: the customer-facing control surface at the top of the stack. The API gateway authenticates and routes requests, the scheduler decides which host should run each new VM, the resource manager tracks fleet-wide capacity, and billing meters usage.
2. Orchestration Layer: turns management-plane decisions into actions on individual hosts, driving VM lifecycle operations (create, start, stop, terminate), provisioning storage volumes, and configuring virtual networks.
Multi-Tenancy at the Hypervisor Level:
Cloud hypervisors run VMs from many different customers on the same physical host. This requires rigorous isolation:
Memory Isolation: each VM's guest-physical memory is mapped through hardware second-level translation (EPT/NPT, covered later on this page), so no guest can read or write another tenant's pages.
CPU Isolation: vCPUs are host threads scheduled by the hypervisor; pinning, cpusets, and scheduler weights bound how much one tenant can interfere with another.
I/O Isolation: device access is mediated through paravirtual devices (virtio) or dedicated SR-IOV virtual functions, with per-VM throughput and IOPS limits.
Despite strong isolation, VMs on the same host share physical resources: last-level CPU cache, memory bandwidth, network interface capacity. Performance can vary based on 'neighbors'—a phenomenon that requires cloud architects to design for variability and use strategies like placement groups for latency-sensitive workloads.
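One concrete form of CPU isolation is pinning a vCPU thread to dedicated physical cores so that a noisy neighbor running elsewhere cannot steal its cycles. Below is a minimal sketch using Linux CPU affinity; the thread ID and core numbers are invented for illustration, and real hosts typically drive this through cgroup cpusets and NUMA-aware placement:

```go
// Minimal sketch: pin a (hypothetical) vCPU thread to dedicated physical
// cores using Linux CPU affinity. Thread ID and core numbers are illustrative.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func pinToCores(tid int, cores ...int) error {
	var set unix.CPUSet
	set.Zero()
	for _, c := range cores {
		set.Set(c)
	}
	// A tid of 0 would mean "the calling thread"; here we target a specific thread.
	return unix.SchedSetaffinity(tid, &set)
}

func main() {
	const vcpuThreadID = 12345 // hypothetical QEMU vCPU thread
	if err := pinToCores(vcpuThreadID, 4, 5); err != nil {
		fmt.Println("pinning failed:", err)
		return
	}
	fmt.Println("vCPU thread pinned to cores 4 and 5")
}
```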
Major cloud providers have evolved their hypervisor strategies to balance security, performance, and operational efficiency:
AWS Nitro System:
AWS developed a custom hardware/software platform that radically reimagines hypervisor architecture:
┌─────────────────────────────────────────────────────────────────┐
│ TRADITIONAL HYPERVISOR │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ GUEST VM │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ HYPERVISOR (Xen/KVM) - CPU, Memory, I/O Virtualization │ │
│ │ [Consumes significant CPU for I/O emulation] │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ HOST CPU + HARDWARE │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ AWS NITRO SYSTEM │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ GUEST VM │ │
│ │ [Receives nearly 100% of host CPU resources] │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────┐ ┌────────────────────────────────┐ │
│ │ NITRO HYPERVISOR │ │ NITRO CARDS │ │
│ │ (Minimal KVM) │ │ ┌──────────────────────────┐ │ │
│ │ [CPU + Memory only]│ │ │ Nitro Controller │ │ │
│ └──────────────────────┘ │ │ Nitro Storage (EBS) │ │ │
│ │ │ Nitro Networking (ENA) │ │ │
│ │ │ Nitro Security │ │ │
│ │ └──────────────────────────┘ │ │
│ └────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ HOST CPU + HARDWARE │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Key Nitro Innovations: storage (EBS), networking (ENA), and management functions are offloaded to dedicated Nitro cards; the Nitro hypervisor that remains is a minimal KVM-based layer handling only CPU and memory; and a dedicated security chip provides a hardware root of trust. Offloading everything but CPU and memory is also what enables bare metal instances and Nitro Enclaves.
How the major providers' hypervisor strategies compare:
| Provider | Hypervisor | Key Characteristics | Notable Features |
|---|---|---|---|
| AWS | Nitro (KVM-based) | Custom hardware offload | Bare metal instances, Nitro Enclaves |
| Google Cloud | KVM | Live migration, confidential VMs | Shielded VMs, sole-tenant nodes |
| Azure | Hyper-V (modified) | Root OS hardened | Nested virtualization, SGX support |
| Oracle Cloud | KVM + OCI | Dense packing optimizations | Bare metal, dedicated hosts |
| Alibaba Cloud | KVM | Custom acceleration | Enhanced SSD, GPU virtualization |
Google Cloud's Approach:
Google uses a customized KVM hypervisor with proprietary enhancements, most visibly transparent live migration during host maintenance, Shielded VMs with verified boot, Confidential VMs that encrypt memory in use, and sole-tenant nodes for customers that require dedicated hardware.
Azure's Hyper-V Foundation:
Microsoft leverages its Hyper-V technology with significant cloud-focused modifications, including a hardened and minimized root OS partition, support for nested virtualization inside guest VMs, and SGX-based confidential computing enclaves.
All major cloud providers have converged on similar architectural patterns: minimal hypervisor footprint, hardware-accelerated I/O, hardware-based security features, and live migration capabilities. The differentiation is increasingly in custom silicon (AWS Graviton, Google TPU) rather than hypervisor software.
Modern cloud infrastructure relies heavily on CPU and chipset features designed specifically for virtualization. Understanding these extensions explains why cloud VMs achieve near-native performance.
Intel VT-x and AMD-V (CPU Virtualization):
These extensions add hardware support for the critical operations hypervisors perform:
┌─────────────────────────────────────────────────────────────────┐
│ VT-x/AMD-V ARCHITECTURE │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ GUEST MODE (VMX non-root) │ │
│ │ │ │
│ │ Guest executes normally until privileged instruction │ │
│ │ or sensitive event triggers VM Exit │ │
│ │ │ │
│ └───────────────────────────┬───────────────────────────────┘ │
│ │ VM Exit (automatic) │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ HOST MODE (VMX root) │ │
│ │ │ │
│ │ Hypervisor handles VM Exit: │ │
│ │ - I/O operations │ │
│ │ - Page faults │ │
│ │ - Interrupts │ │
│ │ - Privilege violations │ │
│ │ │ │
│ │ Then executes VMRESUME to return to guest │ │
│ │ │ │
│ └───────────────────────────┬───────────────────────────────┘ │
│ │ VMRESUME │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ GUEST MODE (resumed) │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
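A quick way to check whether a host exposes these extensions is to look for the vmx (Intel VT-x) or svm (AMD-V) flag in /proc/cpuinfo; KVM will not load without one of them. A small Linux-only sketch:

```go
// Minimal sketch: report whether the CPU advertises hardware virtualization
// support by scanning /proc/cpuinfo for the "vmx" (Intel) or "svm" (AMD) flag.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read cpuinfo:", err)
		return
	}
	info := string(data)
	switch {
	case strings.Contains(info, " vmx"):
		fmt.Println("Intel VT-x available (vmx flag present)")
	case strings.Contains(info, " svm"):
		fmt.Println("AMD-V available (svm flag present)")
	default:
		fmt.Println("no hardware virtualization flags found")
	}
}
```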
VMCS (Virtual Machine Control Structure): a per-vCPU, hardware-defined structure that holds guest and host register state plus the control fields that determine which instructions and events trigger VM exits; the hypervisor accesses it with VMREAD/VMWRITE and enters the guest with VMLAUNCH/VMRESUME. (AMD's equivalent is the VMCB.)
Extended Page Tables (EPT) / Nested Page Tables (NPT):
Second-level address translation eliminates software-based memory virtualization overhead:
┌─────────────────────────────────────────────────────────────────┐
│ TWO-DIMENSIONAL PAGE TABLE WALK │
│ │
│ Guest Virtual Guest Page Guest Physical │
│ Address ───► Tables ───► Address (GPA) │
│ (in guest OS) │
│ │ │
│ ▼ │
│ EPT/NPT Walk │
│ │ │
│ ▼ │
│ Host Physical │
│ Address │
│ │
│ Previously: Hypervisor had to trap every page table │
│ modification and maintain shadow page tables │
│ │
│ With EPT/NPT: Hardware performs entire translation │
│ without hypervisor intervention │
└─────────────────────────────────────────────────────────────────┘
Performance Impact: hardware assistance removed the largest sources of virtualization overhead. VM entry and exit are handled in hardware and shadow page tables are gone, so the residual cost is concentrated in the VM exits that remain and in longer page-table walks on TLB misses.
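The page-walk cost is easy to quantify. With 4-level guest page tables and 4-level EPT/NPT tables, every guest page-table reference must itself be translated through the nested tables, giving the textbook worst case below (a general x86-64 calculation, not a provider-specific measurement):

```latex
\text{accesses per nested TLB miss} \;=\; nm + n + m \;=\; 4\cdot 4 + 4 + 4 \;=\; 24
\qquad\text{vs.}\qquad 4 \text{ accesses natively}
```

This is why large TLBs and huge pages matter so much for virtualized workloads: they reduce how often that 24-access walk happens at all.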
Each generation of server hardware adds more virtualization capabilities. SmartNICs (DPUs) now handle network functions traditionally performed by hypervisor software. Storage controllers implement virtualization in silicon. The hypervisor's role is shrinking as functionality moves to hardware.
Containers have become the dominant deployment unit for cloud applications. Unlike VM-based virtualization, containers leverage the host operating system's kernel, providing faster startup, higher density, and more efficient resource utilization.
Container vs. VM Architecture:
┌────────────────────────────────────────────────────────────────────────────────────┐
│ VIRTUAL MACHINES CONTAINERS │
│ ┌───────────────────────────────┐ ┌───────────────────────────────────────────┐ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ │ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │App A│ │App B│ │App C│ │ │ │App A│ │App B│ │App C│ │App D│ │App E│ │ │
│ │ └─────┘ └─────┘ └─────┘ │ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ │ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │Bins/│ │Bins/│ │Bins/│ │ │ │Bins/│ │Bins/│ │Bins/│ │Bins/│ │Bins/│ │ │
│ │ │Libs │ │Libs │ │Libs │ │ │ │Libs │ │Libs │ │Libs │ │Libs │ │Libs │ │ │
│ │ └─────┘ └─────┘ └─────┘ │ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ │ └────────────────────┬────────────────────┘ │ │
│ │ │Guest│ │Guest│ │Guest│ │ │ │ │
│ │ │ OS │ │ OS │ │ OS │ │ ┌────────────────────┴────────────────────┐ │ │
│ │ └─────┘ └─────┘ └─────┘ │ │ CONTAINER RUNTIME │ │ │
│ └───────────────┬───────────────┘ │ (Docker, containerd, CRI-O) │ │ │
│ │ └────────────────────┬────────────────────┘ │ │
│ ┌───────────────┴───────────────┐ ┌────────────────────┴────────────────────┐ │ │
│ │ HYPERVISOR │ │ HOST OPERATING SYSTEM │ │ │
│ └───────────────┬───────────────┘ │ (Linux kernel with namespaces/cgroups)│ │ │
│ │ └────────────────────┬────────────────────┘ │ │
│ ┌───────────────┴───────────────┐ ┌────────────────────┴────────────────────┐ │ │
│ │ HOST OPERATING SYSTEM │ │ │ │ │
│ └───────────────┬───────────────┘ │ │
│ │ │ │
│ ┌───────────────┴──────────────────────────────────────────────────────────┐ │ │
│ │ PHYSICAL HARDWARE │ │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │ │
└────────────────────────────────────────────────────────────────────────────────────┘
Container Isolation Mechanisms:
Containers rely on Linux kernel features for isolation:
Namespaces (Resource Visibility Isolation): each container receives its own view of process IDs (pid), mount points (mnt), network interfaces (net), hostname (uts), IPC objects (ipc), and user/group IDs (user), so it cannot see or name resources that belong to other containers.
Cgroups (Resource Usage Control): control groups cap and account for how much CPU, memory, block I/O, and how many processes each container may consume; a minimal sketch of both mechanisms follows.
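The sketch below is a stripped-down illustration of those two kernel features. It assumes Linux with cgroup v2 mounted at /sys/fs/cgroup and root privileges; the group name and memory limit are invented for the example:

```go
// Minimal sketch: launch a shell in new UTS, PID, and mount namespaces,
// with a cgroup v2 memory cap applied to it. Assumes Linux, cgroup v2 at
// /sys/fs/cgroup, and root privileges; paths and limits are illustrative.
package main

import (
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

func main() {
	// 1. Cgroups: create a group and cap its memory at 256 MiB.
	cg := "/sys/fs/cgroup/demo"
	os.Mkdir(cg, 0755)
	os.WriteFile(filepath.Join(cg, "memory.max"), []byte("268435456"), 0644)
	os.WriteFile(filepath.Join(cg, "cgroup.procs"),
		[]byte(strconv.Itoa(os.Getpid())), 0644) // children inherit this group

	// 2. Namespaces: the child gets its own hostname, PID space, and mount table.
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Container runtimes layer image management, networking, and security policy on top of exactly these primitives, as the comparison table shows.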
| Runtime | Type | Use Case | Key Features |
|---|---|---|---|
| Docker (Moby) | Engine | Development, CI/CD | Build + run, Docker Compose, ease of use |
| containerd | High-level runtime | Kubernetes nodes | OCI compliant, lightweight, production-focused |
| CRI-O | High-level runtime | Kubernetes nodes | Minimal runtime for the Kubernetes CRI |
| runc | Low-level runtime | Container spawning | OCI reference implementation, default for containerd/CRI-O |
| gVisor (runsc) | Sandbox | Untrusted workloads | User-space kernel, syscall interception |
| Kata Containers | MicroVM | Strong isolation | VM-level isolation, hardware virtualization |
| Firecracker | MicroVM | Serverless | Ultra-lightweight VMs, fast startup (<125ms) |
Containers share the host kernel—a kernel vulnerability affects all containers. Defense in depth is essential: don't run as root in containers, use seccomp profiles to limit syscalls, employ AppArmor/SELinux for mandatory access control, and consider gVisor or Kata for untrusted workloads requiring stronger isolation.
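The first of those defenses, not running as root, can be enforced at the moment the container's process is spawned. A minimal sketch, with illustrative UID/GID values (real runtimes combine this with seccomp, capability dropping, and LSM policies):

```go
// Minimal sketch: start a workload process as an unprivileged user instead
// of root. UID/GID 65534 ("nobody") is illustrative.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/usr/bin/env")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Credential: &syscall.Credential{Uid: 65534, Gid: 65534},
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```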
Serverless computing requires virtualization that combines the security of VMs with the density and startup time of containers. MicroVMs and specialized runtimes fill this niche.
AWS Firecracker:
Firecracker is a virtual machine monitor (VMM) designed specifically for serverless and container workloads:
┌─────────────────────────────────────────────────────────────────┐
│ FIRECRACKER ARCHITECTURE │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ LAMBDA WORKER │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐ │ │
│ │ │ MicroVM 1 │ │ MicroVM 2 │ │ MicroVM N │ │ │
│ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌─────────┐ │ │ │
│ │ │ │ Function │ │ │ │ Function │ │ │ │Function │ │ │ │
│ │ │ │ Runtime │ │ │ │ Runtime │ │ │ │Runtime │ │ │ │
│ │ │ └───────────┘ │ │ └───────────┘ │ │ └─────────┘ │ │ │
│ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌─────────┐ │ │ │
│ │ │ │ Minimal │ │ │ │ Minimal │ │ │ │ Minimal │ │ │ │
│ │ │ │ Kernel │ │ │ │ Kernel │ │ │ │ Kernel │ │ │ │
│ │ │ └───────────┘ │ │ └───────────┘ │ │ └─────────┘ │ │ │
│ │ └───────────────┘ └───────────────┘ └─────────────┘ │ │
│ │ │ │ │
│ │ ┌────────────────────────┴───────────────────────────┐ │ │
│ │ │ FIRECRACKER VMM │ │ │
│ │ │ - Minimal device model (virtio-net, virtio-blk) │ │ │
│ │ │ - REST API for VM management │ │ │
│ │ │ - <125ms boot time │ │ │
│ │ │ - <5MB memory overhead per VM │ │ │
│ │ └────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┴───────────────────────────────┐ │
│ │ KVM HYPERVISOR │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ HOST LINUX KERNEL │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Firecracker Key Features: a minimal virtio device model (network and block only), a REST API served over a Unix socket for microVM management, boot times under 125 ms, and less than 5 MB of memory overhead per microVM.
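That REST API is how a host agent drives Firecracker: configure the kernel and root filesystem, then start the instance, all with a few PUT requests over the API socket. A minimal sketch against the documented endpoints (socket and image paths are illustrative; error handling is elided for brevity):

```go
// Minimal sketch: drive Firecracker's REST API over its Unix socket to
// configure and boot a microVM. Socket and image paths are illustrative;
// see the Firecracker API documentation for the full request schema.
package main

import (
	"context"
	"net"
	"net/http"
	"strings"
)

func put(client *http.Client, path, body string) error {
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost"+path, strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	// All requests go through the microVM's API socket, not TCP.
	client := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return net.Dial("unix", "/tmp/firecracker.socket")
		},
	}}

	put(client, "/boot-source",
		`{"kernel_image_path":"/images/vmlinux","boot_args":"console=ttyS0 reboot=k"}`)
	put(client, "/drives/rootfs",
		`{"drive_id":"rootfs","path_on_host":"/images/rootfs.ext4","is_root_device":true,"is_read_only":false}`)
	put(client, "/actions", `{"action_type":"InstanceStart"}`)
}
```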
The gVisor Approach:
gVisor intercepts syscalls from containerized applications and implements them in a user-space kernel:
Use Cases:
Use standard containers for trusted first-party code, gVisor for untrusted code when you control the host, and microVMs (Firecracker, Kata) for multi-tenant serverless platforms that need VM-level isolation at container-like density. The choice depends on your trust model and performance requirements.
Cloud providers must maintain infrastructure—patching security vulnerabilities, upgrading hardware—without disrupting customer workloads. Live migration is the key technology enabling transparent maintenance.
Live Migration Process:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ LIVE MIGRATION PHASES │
│ │
│ Phase 1: Pre-Copy (Iterative) │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ SOURCE HOST DESTINATION HOST │ │
│ │ ┌───────────┐ ┌───────────┐ │ │
│ │ │ VM │ ──Memory Pages──► │ VM Copy │ │ │
│ │ │ (Running) │ (Round 1) │(Not Running)│ │ │
│ │ └───────────┘ └───────────┘ │ │
│ │ │ │ │ │
│ │ │ Dirty pages tracked │ │ │
│ │ ▼ │ │ │
│ │ ┌───────────┐ ┌───────────┐ │ │
│ │ │ VM │ ──Dirty Pages──► │ VM Copy │ │ │
│ │ │ (Running) │ (Round N) │ (Updated) │ │ │
│ │ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Phase 2: Stop-and-Copy │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ ┌───────────┐ ┌───────────┐ │ │
│ │ │ VM │ ──Final State──► │ VM │ │ │
│ │ │ (Paused) │ (CPU, Devices) │ (Ready) │ │ │
│ │ └───────────┘ └───────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Resume VM │ │
│ │ Update Network │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Guest Downtime: Typically 10-100ms │
└─────────────────────────────────────────────────────────────────────────────────┘
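The pre-copy logic in Phase 1 can be written down directly: keep resending the pages dirtied during the previous round until the remaining dirty set is small enough to transfer within the downtime budget, then pause for stop-and-copy. A simplified sketch of that control loop (page counts and thresholds are invented, and the actual transfer and dirty tracking are stubbed out):

```go
// Simplified sketch of the pre-copy live-migration loop. Dirty-page tracking
// and the network transfer are stubbed out; a real hypervisor gets dirty
// bitmaps from hardware-assisted logging (e.g. KVM dirty logging).
package main

import "fmt"

// migrate copies dirty pages round by round until the remaining set is small
// enough to move during a brief pause (the stop-and-copy phase).
func migrate(dirtyPages func() int, send func(pages int), stopCopyBudget, maxRounds int) {
	for round := 1; round <= maxRounds; round++ {
		remaining := dirtyPages()
		if remaining <= stopCopyBudget {
			break // small enough to finish while the VM is paused
		}
		fmt.Printf("round %d: copying %d dirty pages while the VM keeps running\n", round, remaining)
		send(remaining)
	}
	fmt.Println("pause VM, copy final dirty pages + CPU/device state, resume on destination")
}

func main() {
	// Toy workload: each round the guest re-dirties half as many pages.
	dirty := 100000
	migrate(
		func() int { d := dirty; dirty /= 2; return d },
		func(int) { /* transfer over the migration network */ },
		1000, // pages transferable within the downtime budget
		10,   // force stop-and-copy after 10 rounds at most
	)
}
```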
Technical Challenges:
1. Dirty Page Tracking: the hypervisor must know which pages the still-running guest modifies during each copy round, typically by write-protecting guest memory or using hardware dirty-page logging, so that only changed pages are resent.
2. Device State Transfer: vCPU registers, interrupt controller state, timers, and virtual device/virtio queue state must be serialized on the source and restored exactly on the destination.
3. Network Cutover: after the switch, the destination host announces the VM's MAC and IP addresses (for example via gratuitous ARP) so the network fabric redirects traffic without the guest changing its configuration.
4. Clock Synchronization: the guest's clocks must be corrected for the brief pause and for any difference between hosts, or applications may observe time stalling or jumping.
Applications should assume VMs can migrate at any time. Avoid hardcoding IP addresses; use DNS. Tolerate brief network hiccups. Store persistent data in network-attached storage. Test with simulated migrations to ensure graceful behavior.
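On the application side, that advice boils down to resolving endpoints by name and retrying briefly when a connection drops during cutover. A small sketch; the service name and timing values are illustrative:

```go
// Minimal sketch: tolerate a brief network hiccup (e.g. during live-migration
// cutover) by re-resolving the service name and retrying with backoff.
package main

import (
	"fmt"
	"net"
	"time"
)

func dialWithRetry(host, port string, attempts int) (net.Conn, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		// Resolve by DNS on every attempt: the address may have moved.
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, port), 2*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 200 * time.Millisecond) // linear backoff
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	conn, err := dialWithRetry("orders.internal.example", "8080", 5)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```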
At the macro level, cloud infrastructure itself behaves like a distributed operating system—managing resources across thousands of machines, scheduling workloads, providing abstractions that hide complexity.
Warehouse-Scale Computer:
Google conceptualized the "warehouse-scale computer" (WSC)—treating an entire data center as a single computer:
┌─────────────────────────────────────────────────────────────────┐
│ WAREHOUSE-SCALE COMPUTER MODEL │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ RESOURCE MANAGER ││
│ │ (Borg, Kubernetes, Mesos, YARN) ││
│ │ - Cluster-wide resource allocation ││
│ │ - Workload scheduling and placement ││
│ │ - Failure detection and recovery ││
│ │ - Resource efficiency optimization ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Compute Pool │ │ Storage Pool │ │ Network Fabric │ │
│ │ (1000s of │ │ (Distributed │ │ (SDN, Switch │ │
│ │ machines) │ │ File System)│ │ Hierarchy) │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │
│ Traditional OS Model: Cloud OS Model: │
│ CPU → Process Server → Container/VM │
│ RAM → Address Space Memory Pool → Memory Allocation │
│ Disk → Files Storage Cluster → Volumes │
│ NIC → Sockets Network Fabric → Virtual Networks │
└─────────────────────────────────────────────────────────────────┘
Google Borg (and Kubernetes):
Borg is Google's internal cluster management system, handling over 2 billion container starts per week:
Key Concepts: a Borg cell is a set of machines managed as a single unit; users submit jobs made up of tasks; allocs reserve resources on a machine for one or more tasks; and priority and quota decide what runs when a cell is full.
Borg Features: mixing latency-sensitive production services with batch work on the same machines to drive up utilization, automatic rescheduling of tasks after machine failures, and a declarative job specification instead of imperative placement.
Kubernetes as Open Borg: Kubernetes embodies many Borg concepts in an open-source system:
| System | Origin | Scale | Key Use Case |
|---|---|---|---|
| Borg | Google (internal) | Billions of containers/week | Production + batch |
| Kubernetes | Google (open source) | Thousands of pods/cluster | Container orchestration |
| Apache Mesos | UC Berkeley/Twitter | 10,000+ nodes | Multi-framework resource sharing |
| Nomad | HashiCorp | 10,000+ nodes | Simple, multi-region scheduling |
| YARN | Apache/Hadoop | Thousands of nodes | Big data workloads |
| Twine | Meta (internal) | Millions of containers | Facebook services |
Just as an OS schedules processes onto CPUs, cluster managers schedule containers onto machines. Just as an OS provides a filesystem abstraction over raw disks, distributed storage provides a unified namespace over thousands of drives. Understanding OS concepts provides the mental model for understanding cloud-scale systems.
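To make that analogy concrete, the heart of a cluster manager is a placement loop much like an OS scheduler's: filter out machines that cannot fit the task, then score the rest and pick one. A toy best-fit sketch, with a deliberately simplified resource model and scoring function:

```go
// Toy sketch of cluster-scheduler placement: filter nodes that can fit the
// task's CPU/memory request, then pick the tightest fit (best-fit packing).
// Real schedulers (Borg, Kubernetes) add priorities, affinity, spreading,
// and preemption on top of this core loop.
package main

import "fmt"

type Node struct {
	Name          string
	FreeCPU       float64 // cores
	FreeMemoryGiB float64
}

type Task struct {
	CPU       float64
	MemoryGiB float64
}

func place(task Task, nodes []Node) (string, bool) {
	best, bestScore := "", -1.0
	for _, n := range nodes {
		if n.FreeCPU < task.CPU || n.FreeMemoryGiB < task.MemoryGiB {
			continue // filter: this node cannot fit the task
		}
		// Score: prefer the node left with the least slack (tighter packing).
		slack := (n.FreeCPU - task.CPU) + (n.FreeMemoryGiB - task.MemoryGiB)
		if bestScore < 0 || slack < bestScore {
			best, bestScore = n.Name, slack
		}
	}
	return best, bestScore >= 0
}

func main() {
	nodes := []Node{
		{"node-a", 8, 32}, {"node-b", 2, 4}, {"node-c", 4, 8},
	}
	if node, ok := place(Task{CPU: 2, MemoryGiB: 4}, nodes); ok {
		fmt.Println("scheduled on", node)
	} else {
		fmt.Println("unschedulable: no node fits")
	}
}
```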
Virtualization is the foundational technology enabling cloud computing. From hypervisors to containers to microVMs, virtualization technologies provide the isolation, efficiency, and elasticity that define the cloud.
Looking Ahead:
With virtualization foundations established, we'll next explore container orchestration with Kubernetes—the de facto standard for managing containerized applications at scale in cloud environments.
You now understand how virtualization technologies enable cloud computing, from hypervisor architectures to container runtimes to microVMs. Next, we'll dive deep into Kubernetes and container orchestration.