We've explored what virtual memory is, why virtual address spaces can exceed physical memory, how demand paging works, and the comprehensive benefits this system provides. But how does it all come together? What are the concrete mechanisms that transform these concepts into a working system?
This page provides a high-level implementation overview—a roadmap of the components, data structures, algorithms, and hardware features that constitute a complete virtual memory implementation. Think of this as a blueprint: we'll identify all the pieces and how they connect, setting the stage for the deep dives in subsequent modules.
Understanding this big picture is essential for systems programming, performance tuning, and debugging. Whether you're tracing a page fault, optimizing memory usage, or understanding why an application is thrashing, you need to see how the components interact.
By the end of this page, you will understand the major components of a virtual memory implementation, how hardware and software cooperate, the key data structures involved, the flow of address translation and page fault handling, and what topics are covered in subsequent modules.
A complete virtual memory implementation involves hardware, kernel data structures, and algorithms working in concert. Let's identify the major components:
Hardware components:
Virtual memory requires dedicated silicon in the CPU to achieve acceptable performance:
| Component | Location | Primary Function |
|---|---|---|
| Memory Management Unit (MMU) | CPU core | Translates virtual to physical addresses |
| Translation Lookaside Buffer (TLB) | CPU core (L1, L2) | Caches recent address translations |
| Page Table Walker | CPU (hardware state machine) | Traverses page tables on TLB miss |
| Control Registers (CR3, etc.) | CPU registers | Point to current page table; hold control bits |
| Exception/Interrupt Logic | CPU core | Generates page fault exceptions |
Kernel data structures:
The operating system maintains extensive data structures to track virtual memory state:
| Data Structure | Purpose | Key Contents |
|---|---|---|
| Page Table | Map virtual to physical | PTEs with frame numbers, flags |
| Virtual Memory Area (VMA) list | Define valid address ranges | Start/end addresses, permissions, backing |
| Physical Frame Database | Track physical page state | Reference counts, flags, reverse mappings |
| Swap Map | Track swap space allocation | Page-to-swap-slot mapping |
| Page Cache | Cached file-backed pages | Pages indexed by (file, offset) |
| Free Frame List | Available physical frames | List/bitmap of free frames |
Kernel algorithms:
Several algorithms operate on these data structures:
| Algorithm | When Invoked | Goal |
|---|---|---|
| Page Fault Handler | On page fault exception | Load/allocate missing page |
| Page Replacement (LRU, Clock) | When free memory low | Select victim pages to evict |
| Swap Allocation | When evicting anonymous page | Find swap slot to store page |
| Memory Allocation (malloc path) | On brk()/mmap() syscall | Extend address space |
| Working Set Estimation | Periodically | Determine page importance for replacement |
| OOM Killer | When memory exhausted | Kill process to free memory |
Virtual memory is a textbook example of hardware-software co-design. The hardware provides fast-path translation (TLB) and slow-path walking (page table walker). The software handles complex decisions: which pages to evict, where to store them, how to share pages, when to prefetch. Neither could succeed alone.
Let's trace the complete flow of an address translation, from a virtual address in a program to the physical memory access. This is the most frequent operation in virtual memory—it happens for every memory reference.
The translation path:
```
ADDRESS TRANSLATION: FROM INSTRUCTION TO PHYSICAL MEMORY
════════════════════════════════════════════════════════

Step 1: INSTRUCTION DECODE
──────────────────────────
mov rax, [0x7fff12345678]
          └────────────┘
          Virtual address = 0x7fff12345678

Step 2: EXTRACT PAGE NUMBER AND OFFSET
──────────────────────────────────────
Virtual Address: 0x7fff12345678
Page Size:       4 KB (2^12 = 4096 bytes)

VPN (Virtual Page Number): 0x7fff12345 (bits 47:12)
Offset:                    0x678       (bits 11:0)

Step 3: TLB LOOKUP (Hardware, ~1-2 cycles)
──────────────────────────────────────────
TLB Entry Format: [VPN | ASID | PFN | Flags]

Query:  VPN = 0x7fff12345, ASID = current process
Result:
├─ HIT  → PFN from TLB, skip to Step 7
└─ MISS → Proceed to Step 4

Step 4: PAGE TABLE WALK (Hardware, ~10-100 cycles)
──────────────────────────────────────────────────
For x86-64 with 4-level paging:

CR3 contains physical address of PML4 table

a) PML4 lookup: PML4[(VPN >> 27) & 0x1FF] → Get PDPT base address
b) PDPT lookup: PDPT[(VPN >> 18) & 0x1FF] → Get PD base address
c) PD lookup:   PD[(VPN >> 9) & 0x1FF]    → Get PT base address
d) PT lookup:   PT[VPN & 0x1FF]           → Get Page Table Entry

Step 5: CHECK PAGE TABLE ENTRY
──────────────────────────────
PTE Format (x86-64):
┌────────────────────────────────────────────────────┐
│ NX │ ... │ PFN (40 bits) │ Flags │ P │
└────────────────────────────────────────────────────┘

If P (Present) bit = 0:
→ PAGE FAULT! Jump to Step 6

If P = 1:
→ PFN = Page Frame Number from PTE
→ Check access permissions (R/W/X vs operation type)
→ If permission violation: PAGE FAULT (protection error)

Step 6: PAGE FAULT (Software, ~100μs to 10ms)
─────────────────────────────────────────────
If P = 0 or permission violation:

a) CPU pushes exception frame, saves faulting address to CR2
b) Jump to kernel page fault handler (interrupt vector 14)
c) Kernel determines page status:
   ├─ Valid but not present: Load from disk/swap, update PTE
   ├─ COW page: Copy and remap
   └─ Invalid address: Send SIGSEGV to process
d) Return from exception (IRET)
e) Retry the original instruction → Re-walk page tables

Step 7: FORM PHYSICAL ADDRESS
─────────────────────────────
Physical Address = (PFN << 12) | Offset
                 = (FrameNumber * 4096) + 0x678

Step 8: POPULATE TLB (Parallel to memory access)
────────────────────────────────────────────────
Insert: [VPN=0x7fff12345 | ASID=cur | PFN | Flags]
Future accesses will hit TLB (Step 3)

Step 9: ACCESS PHYSICAL MEMORY
──────────────────────────────
Send physical address to memory controller
→ Cache hierarchy lookup (L1/L2/L3)
→ If miss, read from DRAM
→ Data returns to CPU register (rax)
```

TLB hits (Step 3 → Step 7) add only 1-2 cycles to memory access. Page table walks add 10-100 cycles. Page faults add millions of cycles (milliseconds). System performance depends on maximizing TLB hits and minimizing page faults.
The page fault handler is one of the most critical pieces of kernel code. It must quickly determine why the fault occurred and take appropriate action. Let's examine its logic in detail:
Page fault classification:
PAGE FAULT HANDLER PSEUDOCODE:

```c
void page_fault_handler(fault_address, error_code) {
    // 1. Get current process's memory descriptor
    mm = current_process->mm;

    // 2. Find the VMA (virtual memory area) containing fault address
    vma = find_vma(mm, fault_address);

    // 3. Check if address is valid
    if (vma == NULL || fault_address < vma->start) {
        // Address is outside any valid region
        send_signal(SIGSEGV);  // Segmentation fault
        return;
    }

    // 4. Check permissions
    if (error_code.write && !vma->permissions.write) {
        // Write to read-only region (might be COW, check below)
    }
    if (error_code.user && !vma->permissions.user) {
        // User mode accessing kernel memory
        send_signal(SIGSEGV);
        return;
    }

    // 5. Determine fault type and handle
    pte = get_pte(mm->page_table, fault_address);

    if (!pte->present) {
        // Page not in memory - need to load it
        if (pte->swap_entry) {
            // Page is in swap - read from swap
            handle_swap_in(vma, fault_address, pte);
        } else if (vma->file) {
            // File-backed region - read from file
            handle_file_fault(vma, fault_address, pte);
        } else {
            // Anonymous region - allocate zero page
            handle_anon_fault(vma, fault_address, pte);
        }
    } else if (pte->readonly && error_code.write && vma->permissions.write) {
        // Copy-on-write: page is shared but marked RO for COW
        handle_cow(vma, fault_address, pte);
    } else {
        // True protection violation
        send_signal(SIGSEGV);
        return;
    }

    // 6. Return - processor will retry the instruction
}
```

Handling different fault types:
| Fault Type | Detection | Handler Action | Latency |
|---|---|---|---|
| Swap in | PTE has swap entry, not present | Read from swap, update PTE | ~100μs (SSD) |
| File fault | VMA file-backed, page not cached | Read from file, add to page cache | ~100μs (SSD) |
| Anonymous alloc | VMA anon, first access | Allocate frame, zero-fill, map | ~1μs |
| Copy-on-write | Write to read-only shared page | Copy page, give process private copy | ~1μs |
| Permission violation | Access violates VMA permissions | Send SIGSEGV to process | N/A (error) |
| Invalid address | Address outside all VMAs | Send SIGSEGV to process | N/A (error) |
Faults that require disk I/O (swap-in, file read) are 'major faults'—expensive. Faults that only require allocating a frame or copying a page are 'minor faults'—cheap. Reducing major faults is critical for performance.
Page tables are the central data structure of virtual memory. Let's examine how they're organized in modern systems.
Hierarchical page tables:
A flat page table for a 48-bit address space with 4 KB pages would have 2³⁶ entries (about 69 billion), consuming 512 GB just for page tables, per process! Hierarchical (multi-level) page tables solve this by only allocating the page table pages that are actually needed.
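As a sanity check, the sizes work out as follows (8-byte entries, one table per process):

```latex
\frac{2^{48}\ \text{bytes of address space}}{2^{12}\ \text{bytes per page}} = 2^{36}\ \text{pages},
\qquad
2^{36}\ \text{entries} \times 2^{3}\ \text{bytes/entry} = 2^{39}\ \text{bytes} = 512\ \text{GB}
```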
x86-64 four-level structure:
x86-64 FOUR-LEVEL PAGE TABLE STRUCTURE:

```
CR3 (Page Table Base Register)
 │
 ▼
┌─────────────────────────────────────────────────────┐
│ PML4 (Page Map Level 4)                             │
│ 512 entries × 8 bytes = 4 KB                        │
│ Each entry points to a PDPT                         │
│ Each entry covers 512 GB (512 × 512 × 512 × 4 KB)   │
└─────────────────────────────────────────────────────┘
        │
  ┌─────┼─────┐
  ▼     ▼     ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│   PDPT    │ │   PDPT    │ │   PDPT    │  Page Directory Pointer Tables
│ (512 ent) │ │ (512 ent) │ │ (512 ent) │  Each entry covers 1 GB
└───────────┘ └───────────┘ └───────────┘
        │
  ┌─────┼─────┐
  ▼     ▼     ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│    PD     │ │    PD     │ │    PD     │  Page Directories
│ (512 ent) │ │ (512 ent) │ │ (512 ent) │  Each entry covers 2 MB
└───────────┘ └───────────┘ └───────────┘
        │
  ┌─────┼─────┐
  ▼     ▼     ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│    PT     │ │    PT     │ │    PT     │  Page Tables
│ (512 ent) │ │ (512 ent) │ │ (512 ent) │  Each entry covers 4 KB
└───────────┘ └───────────┘ └───────────┘
        │
        ▼
  Physical Frame

Space efficiency example:
─────────────────────────
Process using 1 GB of contiguous memory (256K pages):

Flat table:   2^36 entries × 8 bytes = 512 GB (must cover whole space)
Hierarchical: 1 PML4 + 1 PDPT + 1 PD + 512 PTs
            = 515 table pages × 4 KB ≈ 2 MB

Process using 100 MB scattered across the address space:

Flat table:   still 512 GB (must cover whole space)
Hierarchical: only the table pages for used regions ≈ hundreds of KB

Sparse address spaces benefit enormously from hierarchical tables!
```

Page table entry details:
Each PTE contains: the physical frame number (PFN), a Present bit, permission bits (writable, user/supervisor, no-execute), and status bits such as Accessed and Dirty.
The Accessed and Dirty bits are crucial for page replacement—they let the kernel know which pages are being used without expensive tracking.
The CPU automatically sets the Accessed bit when a page is read and the Dirty bit when written. This hardware assistance is essential for efficient page replacement—the OS can check these bits to find 'cold' pages without intercepting every memory access.
While page tables map individual pages, the kernel needs higher-level data structures to track memory regions—contiguous ranges of virtual addresses with uniform properties. In Linux, these are called Virtual Memory Areas (VMAs).
What a VMA represents:
Each VMA describes a region of the address space with:
VMA DATA STRUCTURE (Conceptual):

```c
struct vm_area_struct {
    unsigned long vm_start;      // Start address (inclusive)
    unsigned long vm_end;        // End address (exclusive)
    unsigned long vm_flags;      // VM_READ, VM_WRITE, VM_EXEC, VM_SHARED, etc.
    struct file* vm_file;        // File if file-backed, NULL if anonymous
    unsigned long vm_pgoff;      // Offset into file (in pages)
    struct mm_struct* vm_mm;     // Owning memory descriptor

    // For efficient lookup and iteration
    struct rb_node vm_rb;        // Red-black tree of VMAs
    struct list_head vm_list;    // List of all VMAs
};
```

EXAMPLE PROCESS VMA LIST:

```
VMA 0: 0x00400000 - 0x00452000          r-x  file=/bin/bash offset=0
       (Text segment - executable code)
VMA 1: 0x00651000 - 0x00652000          r--  file=/bin/bash offset=0x51000
       (Read-only data)
VMA 2: 0x00652000 - 0x0065b000          rw-  file=/bin/bash offset=0x52000
       (Initialized data - copy-on-write from file)
VMA 3: 0x00c00000 - 0x00c21000          rw-  file=NULL (anonymous)
       (Heap)
VMA 4: 0x7f0000000000 - 0x7f00001bc000  r-x  file=/lib/libc.so
       (Shared library code)
VMA 5: 0x7fff12340000 - 0x7fff12400000  rw-  file=NULL (anonymous)
       (Stack)
```

Why VMAs matter:
Page fault disambiguation: When a fault occurs, the kernel checks VMAs to determine if the address is valid and what the backing store is.
System call handling: mmap(), munmap(), mprotect(), brk() all operate on VMAs, not individual pages.
/proc/[pid]/maps: The VMA list is what you see when you examine a process's memory map.
Memory sharing: Shared mappings reference the same underlying structures, tracked via VMAs.
VMA organization:
For fast lookup (needed on every page fault), VMAs are organized in a balanced search tree keyed by address (a red-black tree in the structure above) plus a linked list for ordered iteration.
A typical process has dozens to hundreds of VMAs. Complex applications (browsers, VMs) may have thousands.
VMAs and page tables are complementary. VMAs describe what should exist (regions, permissions, backing). Page tables describe what does exist in physical memory right now. A VMA can span millions of pages, most not present in RAM. The page table has entries only for present pages (or pages with swap locations).
While most of virtual memory deals with virtual addresses, the kernel must also carefully manage physical frames—the actual storage in RAM.
Frame states:
Each physical frame can be in various states:
| State | Description | Example Use |
|---|---|---|
| Free | Not allocated to any purpose | On the free list, available |
| Anonymous page | Backing process heap/stack | malloc'd memory |
| File page (clean) | Caching file, unmodified | Read-only file cache |
| File page (dirty) | Caching file, modified | Written file cache |
| Kernel use | Kernel data structures | Page tables, slabs |
| Reserved | Hardware or special use | BIOS, MMIO ranges |
Frame metadata (struct page in Linux):
The kernel maintains metadata for every physical frame:
FRAME METADATA STRUCTURE (Conceptual):

```c
struct page {
    unsigned long flags;            // Page state flags (PG_locked, PG_dirty, etc.)
    atomic_t _refcount;             // Reference count (0 = free)
    atomic_t _mapcount;             // Number of page table mappings

    // For page cache pages:
    struct address_space* mapping;  // File this page belongs to
    pgoff_t index;                  // Offset within file

    // For LRU management:
    struct list_head lru;           // Position in LRU list

    // For memory zones:
    struct zone* zone;              // ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM
};
```

FRAME ALLOCATION LOGIC:

When the kernel needs a free frame:

1. Check the free lists (fast path)
   └─ If sufficient free frames, allocate directly
2. If free frames are low, invoke page reclaim (kswapd)
   └─ Scan LRU lists for reclaimable pages
      ├─ Clean file pages: just remove from page cache
      ├─ Dirty file pages: write back first, then reclaim
      └─ Anonymous pages: swap out, then reclaim
3. If still insufficient, invoke the OOM killer (last resort)
   └─ Select a process to kill based on memory usage
   └─ Kill it to free memory

Reference counting:
Frames have reference counts indicating how many entities currently use them: page table mappings, the page cache, kernel pins for I/O, and so on.
When reference count reaches 0, the frame can be freed. This enables safe sharing—a shared library page might have 100 mappings, and it's only freed when the last one is removed.
Not all frames are equal. On x86, ZONE_DMA (first 16 MB) is for legacy ISA DMA. ZONE_DMA32 (first 4 GB) is for 32-bit DMA. ZONE_NORMAL is regular memory. The allocator must respect these constraints when allocating for specific purposes (e.g., DMA buffers must come from the right zone).
When physical memory fills up and a new frame is needed, the kernel must choose a victim page to evict. This decision is critical—evicting a page that will be accessed soon causes expensive page faults.
The page replacement problem:
Key algorithms (covered in detail in subsequent modules):
| Algorithm | Selection Criterion | Pros | Cons |
|---|---|---|---|
| Optimal (OPT) | Evict page used farthest in future | Theoretically best | Requires future knowledge (impossible) |
| FIFO | Evict oldest page | Simple to implement | Poor performance, Belady's anomaly |
| LRU | Evict least recently used | Good approximation of optimal | Expensive to implement exactly |
| Clock (Second Chance) | Evict page with A=0 | Good performance, practical; widely used in real systems | Only coarsely approximates LRU |
| LRU-K | Consider K most recent accesses | Better for some workloads | More complex bookkeeping |
How real systems (Linux) do it:
Linux uses a multi-list approach: pages sit on an active list (believed to be in recent use) or an inactive list (reclamation candidates), maintained separately for anonymous and file-backed pages.
Pages move between lists based on access patterns. Eviction preferentially comes from the inactive list. This approximates LRU without the overhead of exact LRU ordering.
The working set concept:
The working set is the set of pages a process is actively using. If a process's working set exceeds available frames, it will thrash—constantly faulting.
Good page replacement keeps each process's working set resident in memory and evicts only pages outside it.
Page replacement algorithms are a rich topic covered extensively in upcoming modules. We'll analyze each algorithm in detail, prove their properties, implement them, and understand how they're used in production operating systems.
Swap is the backing store for anonymous pages—pages with no file backing (heap, stack, anonymous mmap). When these pages are evicted, they must be written to swap space.
Swap space organization:
```
SWAP SPACE ORGANIZATION:
════════════════════════

Swap Partition/File:
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │... │ N  │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
  ↑    ↑    ↑    ↑
 hdr  used used free
      by   by
      P1   P2

Swap Map (kernel data structure):
┌────────────────────────────────────┐
│ slot 0: reserved (header)          │
│ slot 1: refcount=1, PID=100        │
│ slot 2: refcount=2, PID=100,200    │ (shared by two processes after fork)
│ slot 3: refcount=0 (free)          │
│ ...                                │
└────────────────────────────────────┘

SWAP OUT FLOW:
══════════════
1. Select victim page (anonymous page from inactive list)
2. If dirty or never swapped: Write to swap
   a) Allocate swap slot from swap map
   b) Write page contents to swap slot
3. Update page table entry:
   - Clear Present bit
   - Store swap slot identifier
4. Free physical frame

SWAP IN FLOW (on page fault):
═════════════════════════════
1. Page fault handler sees PTE with swap entry
2. Allocate physical frame
3. Read from swap slot into frame
4. Update page table entry:
   - Set Present bit
   - Store physical frame number
5. Optionally free swap slot (or keep for future swap out)
6. Return from fault, instruction retries
```

Swap performance considerations:
| Factor | Impact | Mitigation |
|---|---|---|
| SSD vs HDD | 100x speed difference | Use SSD for swap, or avoid swapping |
| Swap space size | Too small = OOM; too large = wasted disk | Size based on workload |
| Swappiness | How aggressively to swap vs reclaim file cache | Tune /proc/sys/vm/swappiness |
| Swap encryption | Prevents data leaks at performance cost | Enable for security-sensitive systems |
Modern alternatives to traditional swap trade CPU time for reduced disk I/O: zswap keeps a compressed in-RAM cache of would-be-swapped pages in front of the swap device, and zram exposes a compressed RAM-backed block device that can itself serve as the swap target.
Well-provisioned systems rarely use swap heavily. Swap exists to handle temporary memory spikes gracefully rather than crashing. If swap is consistently active, the system needs more RAM, not more swap.
We've surveyed the complete implementation landscape of virtual memory. Let's consolidate this knowledge and preview what's ahead:
What's next in the Virtual Memory series:
| Module | Topic | What You'll Learn |
|---|---|---|
| Module 2 | Demand Paging | Deep dive into page fault handling, demand loading strategies |
| Module 3 | Page Fault Handling | Complete page fault lifecycle, error handling, performance |
| Module 4 | Copy-on-Write | COW implementation, fork optimization, memory efficiency |
| Module 5 | Memory-Mapped Files | mmap() implementation, page cache integration |
| Module 6 | Shared Memory | IPC via shared memory, protection, synchronization |
These modules build on the concepts introduced here, providing implementation-level detail for each aspect of virtual memory.
Congratulations! You've completed the Virtual Memory Concepts module. You now understand the fundamental concepts, mechanisms, benefits, and implementation of virtual memory. This foundation prepares you for the detailed treatment of each subsystem in subsequent modules.