We've explored what virtual memory is, why virtual address spaces can exceed physical memory, how demand paging works, and the comprehensive benefits this system provides. But how does it all come together? What are the concrete mechanisms that transform these concepts into a working system?
This page provides a high-level implementation overview—a roadmap of the components, data structures, algorithms, and hardware features that constitute a complete virtual memory implementation. Think of this as a blueprint: we'll identify all the pieces and how they connect, setting the stage for the deep dives in subsequent modules.
Understanding this big picture is essential for systems programming, performance tuning, and debugging. Whether you're tracing a page fault, optimizing memory usage, or understanding why an application is thrashing, you need to see how the components interact.
By the end of this page, you will understand the major components of a virtual memory implementation, how hardware and software cooperate, the key data structures involved, the flow of address translation and page fault handling, and what topics are covered in subsequent modules.
A complete virtual memory implementation involves hardware, kernel data structures, and algorithms working in concert. Let's identify the major components:
Hardware components:
Virtual memory requires dedicated silicon in the CPU to achieve acceptable performance:
| Component | Location | Primary Function |
|---|---|---|
| Memory Management Unit (MMU) | CPU core | Translates virtual to physical addresses |
| Translation Lookaside Buffer (TLB) | CPU core (L1, L2) | Caches recent address translations |
| Page Table Walker | CPU (hardware state machine) | Traverses page tables on TLB miss |
| Control Registers (CR3, etc.) | CPU registers | Point to current page table; hold control bits |
| Exception/Interrupt Logic | CPU core | Generates page fault exceptions |
Kernel data structures:
The operating system maintains extensive data structures to track virtual memory state:
| Data Structure | Purpose | Key Contents |
|---|---|---|
| Page Table | Map virtual to physical | PTEs with frame numbers, flags |
| Virtual Memory Area (VMA) list | Define valid address ranges | Start/end addresses, permissions, backing |
| Physical Frame Database | Track physical page state | Reference counts, flags, reverse mappings |
| Swap Map | Track swap space allocation | Page-to-swap-slot mapping |
| Page Cache | Cached file-backed pages | Pages indexed by (file, offset) |
| Free Frame List | Available physical frames | List/bitmap of free frames |
Kernel algorithms:
Several algorithms operate on these data structures:
| Algorithm | When Invoked | Goal |
|---|---|---|
| Page Fault Handler | On page fault exception | Load/allocate missing page |
| Page Replacement (LRU, Clock) | When free memory low | Select victim pages to evict |
| Swap Allocation | When evicting anonymous page | Find swap slot to store page |
| Memory Allocation (malloc path) | On brk()/mmap() syscall | Extend address space |
| Working Set Estimation | Periodically | Determine page importance for replacement |
| OOM Killer | When memory exhausted | Kill process to free memory |
Virtual memory is a textbook example of hardware-software co-design. The hardware provides fast-path translation (TLB) and slow-path walking (page table walker). The software handles complex decisions: which pages to evict, where to store them, how to share pages, when to prefetch. Neither could succeed alone.
Let's trace the complete flow of an address translation, from a virtual address in a program to the physical memory access. This is the most frequent operation in virtual memory—it happens for every memory reference.
The translation path:
```
ADDRESS TRANSLATION: FROM INSTRUCTION TO PHYSICAL MEMORY
════════════════════════════════════════════════════════

Step 1: INSTRUCTION DECODE
──────────────────────────
mov rax, [0x7fff12345678]
          └────────────┘
          Virtual address = 0x7fff12345678

Step 2: EXTRACT PAGE NUMBER AND OFFSET
──────────────────────────────────────
Virtual Address: 0x7fff12345678
Page Size:       4 KB (2^12 = 4096 bytes)

VPN (Virtual Page Number): 0x7fff12345 (bits 47:12)
Offset:                    0x678       (bits 11:0)

Step 3: TLB LOOKUP (Hardware, ~1-2 cycles)
──────────────────────────────────────────
TLB Entry Format: [VPN | ASID | PFN | Flags]

Query:  VPN = 0x7fff12345, ASID = current process
Result:
├─ HIT  → PFN from TLB, skip to Step 7
└─ MISS → Proceed to Step 4

Step 4: PAGE TABLE WALK (Hardware, ~10-100 cycles)
──────────────────────────────────────────────────
For x86-64 with 4-level paging:

CR3 contains physical address of PML4 table

a) PML4 lookup: PML4[(VPN >> 27) & 0x1FF] → Get PDPT base address
b) PDPT lookup: PDPT[(VPN >> 18) & 0x1FF] → Get PD base address
c) PD lookup:   PD[(VPN >> 9) & 0x1FF]    → Get PT base address
d) PT lookup:   PT[VPN & 0x1FF]           → Get Page Table Entry

Step 5: CHECK PAGE TABLE ENTRY
──────────────────────────────
PTE Format (x86-64):
┌────────────────────────────────────────────────────┐
│ NX │ ... │ PFN (40 bits) │ Flags │ P │
└────────────────────────────────────────────────────┘

If P (Present) bit = 0:
→ PAGE FAULT! Jump to Step 6

If P = 1:
→ PFN = Page Frame Number from PTE
→ Check access permissions (R/W/X vs operation type)
→ If permission violation: PAGE FAULT (protection error)

Step 6: PAGE FAULT (Software, ~100μs to 10ms)
─────────────────────────────────────────────
If P = 0 or permission violation:

a) CPU pushes exception frame, saves faulting address to CR2
b) Jump to kernel page fault handler (interrupt vector 14)
c) Kernel determines page status:
   ├─ Valid but not present: Load from disk/swap, update PTE
   ├─ COW page: Copy and remap
   └─ Invalid address: Send SIGSEGV to process
d) Return from exception (IRET)
e) Retry the original instruction → Re-walk page tables

Step 7: FORM PHYSICAL ADDRESS
─────────────────────────────
Physical Address = (PFN << 12) | Offset
                 = (FrameNumber * 4096) + 0x678

Step 8: POPULATE TLB (Parallel to memory access)
────────────────────────────────────────────────
Insert: [VPN=0x7fff12345 | ASID=cur | PFN | Flags]
Future accesses will hit TLB (Step 3)

Step 9: ACCESS PHYSICAL MEMORY
──────────────────────────────
Send physical address to memory controller
→ Cache hierarchy lookup (L1/L2/L3)
→ If miss, read from DRAM
→ Data returns to CPU register (rax)
```

TLB hits (Step 3 → Step 7) add only 1-2 cycles to memory access. Page table walks add 10-100 cycles. Page faults add millions of cycles (milliseconds). System performance depends on maximizing TLB hits and minimizing page faults.
The page fault handler is one of the most critical pieces of kernel code. It must quickly determine why the fault occurred and take appropriate action. Let's examine its logic in detail:
Page fault classification:
PAGE FAULT HANDLER PSEUDOCODE:

```c
void page_fault_handler(fault_address, error_code) {
    // 1. Get current process's memory descriptor
    mm = current_process->mm;

    // 2. Find the VMA (virtual memory area) containing fault address
    vma = find_vma(mm, fault_address);

    // 3. Check if address is valid
    if (vma == NULL || fault_address < vma->start) {
        // Address is outside any valid region
        send_signal(SIGSEGV);  // Segmentation fault
        return;
    }

    // 4. Check permissions
    if (error_code.write && !vma->permissions.write) {
        // Write to read-only region (might be COW, check below)
    }
    if (error_code.user && !vma->permissions.user) {
        // User mode accessing kernel memory
        send_signal(SIGSEGV);
        return;
    }

    // 5. Determine fault type and handle
    pte = get_pte(mm->page_table, fault_address);

    if (!pte->present) {
        // Page not in memory - need to load it
        if (pte->swap_entry) {
            // Page is in swap - read from swap
            handle_swap_in(vma, fault_address, pte);
        } else if (vma->file) {
            // File-backed region - read from file
            handle_file_fault(vma, fault_address, pte);
        } else {
            // Anonymous region - allocate zero page
            handle_anon_fault(vma, fault_address, pte);
        }
    } else if (pte->readonly && error_code.write && vma->permissions.write) {
        // Copy-on-write: page is shared but marked RO for COW
        handle_cow(vma, fault_address, pte);
    } else {
        // True protection violation
        send_signal(SIGSEGV);
        return;
    }

    // 6. Return - processor will retry the instruction
}
```

Handling different fault types:
| Fault Type | Detection | Handler Action | Latency |
|---|---|---|---|
| Swap in | PTE has swap entry, not present | Read from swap, update PTE | ~100μs (SSD) |
| File fault | VMA file-backed, page not cached | Read from file, add to page cache | ~100μs (SSD) |
| Anonymous alloc | VMA anon, first access | Allocate frame, zero-fill, map | ~1μs |
| Copy-on-write | Write to read-only shared page | Copy page, give process private copy | ~1μs |
| Permission violation | Access violates VMA permissions | Send SIGSEGV to process | N/A (error) |
| Invalid address | Address outside all VMAs | Send SIGSEGV to process | N/A (error) |
Faults that require disk I/O (swap-in, file read) are 'major faults'—expensive. Faults that only require allocating a frame or copying a page are 'minor faults'—cheap. Reducing major faults is critical for performance.
Page tables are the central data structure of virtual memory. Let's examine how they're organized in modern systems.
Hierarchical page tables:
A flat page table for a 48-bit address space with 4 KB pages would have 2³⁶ entries (about 69 billion), consuming 512 GB just for page tables, per process! Hierarchical (multi-level) page tables solve this by only allocating the page table pages that are actually needed.
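As a sanity check, the sizes work out as follows (8-byte entries, one table per process):

```latex
\frac{2^{48}\ \text{bytes of address space}}{2^{12}\ \text{bytes per page}} = 2^{36}\ \text{pages},
\qquad
2^{36}\ \text{entries} \times 2^{3}\ \text{bytes/entry} = 2^{39}\ \text{bytes} = 512\ \text{GB}
```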
x86-64 four-level structure:
x86-64 FOUR-LEVEL PAGE TABLE STRUCTURE:

```
CR3 (Page Table Base Register)
 │
 ▼
┌─────────────────────────────────────────────────────┐
│ PML4 (Page Map Level 4)                             │
│ 512 entries × 8 bytes = 4 KB                        │
│ Each entry points to a PDPT                         │
│ Each entry covers 512 GB (512 × 512 × 512 × 4 KB)   │
└─────────────────────────────────────────────────────┘
        │
  ┌─────┼─────┐
  ▼     ▼     ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│   PDPT    │ │   PDPT    │ │   PDPT    │  Page Directory Pointer Tables
│ (512 ent) │ │ (512 ent) │ │ (512 ent) │  Each entry covers 1 GB
└───────────┘ └───────────┘ └───────────┘
        │
  ┌─────┼─────┐
  ▼     ▼     ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│    PD     │ │    PD     │ │    PD     │  Page Directories
│ (512 ent) │ │ (512 ent) │ │ (512 ent) │  Each entry covers 2 MB
└───────────┘ └───────────┘ └───────────┘
        │
  ┌─────┼─────┐
  ▼     ▼     ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│    PT     │ │    PT     │ │    PT     │  Page Tables
│ (512 ent) │ │ (512 ent) │ │ (512 ent) │  Each entry covers 4 KB
└───────────┘ └───────────┘ └───────────┘
        │
        ▼
  Physical Frame

Space efficiency example:
─────────────────────────
Process using 1 GB of contiguous memory (256K pages):

Flat table:   2^36 entries × 8 bytes = 512 GB (must cover whole space)
Hierarchical: 1 PML4 + 1 PDPT + 1 PD + 512 PTs
            = 515 table pages × 4 KB ≈ 2 MB

Process using 100 MB scattered across the address space:

Flat table:   still 512 GB (must cover whole space)
Hierarchical: only the table pages for used regions ≈ hundreds of KB

Sparse address spaces benefit enormously from hierarchical tables!
```

Page table entry details:
Each PTE contains: the physical frame number (PFN), a Present bit, permission bits (writable, user/supervisor, no-execute), and status bits such as Accessed and Dirty.
The Accessed and Dirty bits are crucial for page replacement—they let the kernel know which pages are being used without expensive tracking.
The CPU automatically sets the Accessed bit when a page is read and the Dirty bit when written. This hardware assistance is essential for efficient page replacement—the OS can check these bits to find 'cold' pages without intercepting every memory access.
While page tables map individual pages, the kernel needs higher-level data structures to track memory regions—contiguous ranges of virtual addresses with uniform properties. In Linux, these are called Virtual Memory Areas (VMAs).
What a VMA represents:
Each VMA describes a region of the address space with:
VMA DATA STRUCTURE (Conceptual):

```c
struct vm_area_struct {
    unsigned long vm_start;      // Start address (inclusive)
    unsigned long vm_end;        // End address (exclusive)
    unsigned long vm_flags;      // VM_READ, VM_WRITE, VM_EXEC, VM_SHARED, etc.
    struct file* vm_file;        // File if file-backed, NULL if anonymous
    unsigned long vm_pgoff;      // Offset into file (in pages)
    struct mm_struct* vm_mm;     // Owning memory descriptor

    // For efficient lookup and iteration
    struct rb_node vm_rb;        // Red-black tree of VMAs
    struct list_head vm_list;    // List of all VMAs
};
```

EXAMPLE PROCESS VMA LIST:

```
VMA 0: 0x00400000 - 0x00452000          r-x  file=/bin/bash offset=0
       (Text segment - executable code)
VMA 1: 0x00651000 - 0x00652000          r--  file=/bin/bash offset=0x51000
       (Read-only data)
VMA 2: 0x00652000 - 0x0065b000          rw-  file=/bin/bash offset=0x52000
       (Initialized data - copy-on-write from file)
VMA 3: 0x00c00000 - 0x00c21000          rw-  file=NULL (anonymous)
       (Heap)
VMA 4: 0x7f0000000000 - 0x7f00001bc000  r-x  file=/lib/libc.so
       (Shared library code)
VMA 5: 0x7fff12340000 - 0x7fff12400000  rw-  file=NULL (anonymous)
       (Stack)
```

Why VMAs matter:
Page fault disambiguation: When a fault occurs, the kernel checks VMAs to determine if the address is valid and what the backing store is.
System call handling: mmap(), munmap(), mprotect(), brk() all operate on VMAs, not individual pages.
/proc/[pid]/maps: The VMA list is what you see when you examine a process's memory map.
Memory sharing: Shared mappings reference the same underlying structures, tracked via VMAs.
VMA organization:
For fast lookup (needed on every page fault), VMAs are organized in a balanced search tree keyed by address (a red-black tree in the structure above) plus a linked list for ordered iteration.
A typical process has dozens to hundreds of VMAs. Complex applications (browsers, VMs) may have thousands.
VMAs and page tables are complementary. VMAs describe what should exist (regions, permissions, backing). Page tables describe what does exist in physical memory right now. A VMA can span millions of pages, most not present in RAM. The page table has entries only for present pages (or pages with swap locations).
While most of virtual memory deals with virtual addresses, the kernel must also carefully manage physical frames—the actual storage in RAM.
Frame states:
Each physical frame can be in various states:
| State | Description | Example Use |
|---|---|---|
| Free | Not allocated to any purpose | On the free list, available |
| Anonymous page | Backing process heap/stack | malloc'd memory |
| File page (clean) | Caching file, unmodified | Read-only file cache |
| File page (dirty) | Caching file, modified | Written file cache |
| Kernel use | Kernel data structures | Page tables, slabs |
| Reserved | Hardware or special use | BIOS, MMIO ranges |
Frame metadata (struct page in Linux):
The kernel maintains metadata for every physical frame:
FRAME METADATA STRUCTURE (Conceptual):

```c
struct page {
    unsigned long flags;            // Page state flags (PG_locked, PG_dirty, etc.)
    atomic_t _refcount;             // Reference count (0 = free)
    atomic_t _mapcount;             // Number of page table mappings

    // For page cache pages:
    struct address_space* mapping;  // File this page belongs to
    pgoff_t index;                  // Offset within file

    // For LRU management:
    struct list_head lru;           // Position in LRU list

    // For memory zones:
    struct zone* zone;              // ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM
};
```

FRAME ALLOCATION LOGIC:

When the kernel needs a free frame:

1. Check the free lists (fast path)
   └─ If sufficient free frames, allocate directly
2. If free frames are low, invoke page reclaim (kswapd)
   └─ Scan LRU lists for reclaimable pages
      ├─ Clean file pages: just remove from page cache
      ├─ Dirty file pages: write back first, then reclaim
      └─ Anonymous pages: swap out, then reclaim
3. If still insufficient, invoke the OOM killer (last resort)
   └─ Select a process to kill based on memory usage
   └─ Kill it to free memory

Reference counting:
Frames have reference counts indicating how many entities currently use them: page table mappings, the page cache, kernel pins for I/O, and so on.
When reference count reaches 0, the frame can be freed. This enables safe sharing—a shared library page might have 100 mappings, and it's only freed when the last one is removed.
Not all frames are equal. On x86, ZONE_DMA (first 16 MB) is for legacy ISA DMA. ZONE_DMA32 (first 4 GB) is for 32-bit DMA. ZONE_NORMAL is regular memory. The allocator must respect these constraints when allocating for specific purposes (e.g., DMA buffers must come from the right zone).
When physical memory fills up and a new frame is needed, the kernel must choose a victim page to evict. This decision is critical—evicting a page that will be accessed soon causes expensive page faults.
The page replacement problem:
Key algorithms (covered in detail in subsequent modules):
| Algorithm | Selection Criterion | Pros | Cons |
|---|---|---|---|
| Optimal (OPT) | Evict page used farthest in future | Theoretically best | Requires future knowledge (impossible) |
| FIFO | Evict oldest page | Simple to implement | Poor performance, Belady's anomaly |
| LRU | Evict least recently used | Good approximation of optimal | Expensive to implement exactly |
| Clock (Second Chance) | Evict page with A=0 | Good performance, practical; widely used in real systems | Only coarsely approximates LRU |
| LRU-K | Consider K most recent accesses | Better for some workloads | More complex bookkeeping |
How real systems (Linux) do it:
Linux uses a multi-list approach: pages sit on an active list (believed to be in recent use) or an inactive list (reclamation candidates), maintained separately for anonymous and file-backed pages.
Pages move between lists based on access patterns. Eviction preferentially comes from the inactive list. This approximates LRU without the overhead of exact LRU ordering.
The working set concept:
The working set is the set of pages a process is actively using. If a process's working set exceeds available frames, it will thrash—constantly faulting.
Good page replacement keeps each process's working set resident in memory and evicts only pages outside it.
Page replacement algorithms are a rich topic covered extensively in upcoming modules. We'll analyze each algorithm in detail, prove their properties, implement them, and understand how they're used in production operating systems.
Swap is the backing store for anonymous pages—pages with no file backing (heap, stack, anonymous mmap). When these pages are evicted, they must be written to swap space.
Swap space organization:
```
SWAP SPACE ORGANIZATION:
════════════════════════

Swap Partition/File:
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │... │ N  │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
  ↑    ↑    ↑    ↑
 hdr  used used free
      by   by
      P1   P2

Swap Map (kernel data structure):
┌────────────────────────────────────┐
│ slot 0: reserved (header)          │
│ slot 1: refcount=1, PID=100        │
│ slot 2: refcount=2, PID=100,200    │ (shared by two processes after fork)
│ slot 3: refcount=0 (free)          │
│ ...                                │
└────────────────────────────────────┘

SWAP OUT FLOW:
══════════════
1. Select victim page (anonymous page from inactive list)
2. If dirty or never swapped: Write to swap
   a) Allocate swap slot from swap map
   b) Write page contents to swap slot
3. Update page table entry:
   - Clear Present bit
   - Store swap slot identifier
4. Free physical frame

SWAP IN FLOW (on page fault):
═════════════════════════════
1. Page fault handler sees PTE with swap entry
2. Allocate physical frame
3. Read from swap slot into frame
4. Update page table entry:
   - Set Present bit
   - Store physical frame number
5. Optionally free swap slot (or keep for future swap out)
6. Return from fault, instruction retries
```

Swap performance considerations:
| Factor | Impact | Mitigation |
|---|---|---|
| SSD vs HDD | 100x speed difference | Use SSD for swap, or avoid swapping |
| Swap space size | Too small = OOM; too large = wasted disk | Size based on workload |
| Swappiness | How aggressively to swap vs reclaim file cache | Tune /proc/sys/vm/swappiness |
| Swap encryption | Prevents data leaks at performance cost | Enable for security-sensitive systems |
Modern alternatives to traditional swap trade CPU time for reduced disk I/O: zswap keeps a compressed in-RAM cache of would-be-swapped pages in front of the swap device, and zram exposes a compressed RAM-backed block device that can itself serve as the swap target.
Well-provisioned systems rarely use swap heavily. Swap exists to handle temporary memory spikes gracefully rather than crashing. If swap is consistently active, the system needs more RAM, not more swap.
We've surveyed the complete implementation landscape of virtual memory. Let's consolidate this knowledge and preview what's ahead:
What's next in the Virtual Memory series:
| Module | Topic | What You'll Learn |
|---|---|---|
| Module 2 | Demand Paging | Deep dive into page fault handling, demand loading strategies |
| Module 3 | Page Fault Handling | Complete page fault lifecycle, error handling, performance |
| Module 4 | Copy-on-Write | COW implementation, fork optimization, memory efficiency |
| Module 5 | Memory-Mapped Files | mmap() implementation, page cache integration |
| Module 6 | Shared Memory | IPC via shared memory, protection, synchronization |
These modules build on the concepts introduced here, providing implementation-level detail for each aspect of virtual memory.
Congratulations! You've completed the Virtual Memory Concepts module. You now understand the fundamental concepts, mechanisms, benefits, and implementation of virtual memory. This foundation prepares you for the detailed treatment of each subsystem in subsequent modules.