Paging elegantly solves the external fragmentation problem of contiguous allocation. By dividing both virtual and physical memory into fixed-size pages and frames, the operating system can place any page in any available frame, achieving flexible, non-contiguous memory allocation.
But as virtual address spaces grew from 16 bits to 32 bits to 64 bits, a fundamental problem emerged: page tables became unmanageably large.
Consider a straightforward 32-bit system with 4KB pages. The virtual address space spans 2³² bytes (4GB), divided into 2³²/2¹² = 2²⁰ pages—over one million pages. Each page requires a page table entry (PTE), typically 4 bytes. That means every process needs a 4MB page table, even if the process only uses a tiny fraction of its address space. With hundreds or thousands of processes, page tables alone could consume all available memory.
A single-level page table for a 32-bit address space with 4KB pages requires 4MB per process. If 100 processes run simultaneously, page tables alone consume 400MB—a significant fraction of physical memory on many systems. For 64-bit systems, the situation is exponentially worse: a flat page table would require petabytes of memory.
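To make the arithmetic concrete, here is a small throwaway C program (not part of any OS, just the calculation above) that reproduces the flat-table sizes for the configurations discussed on this page:

```c
// Flat page table size arithmetic for several address widths.
// A minimal sketch; parameters mirror the comparison table below.
#include <stdio.h>
#include <stdint.h>

static void flat_table_size(int addr_bits, uint64_t page_size, uint64_t pte_size) {
    uint64_t num_pages   = (1ULL << addr_bits) / page_size;  // one PTE per page
    uint64_t table_bytes = num_pages * pte_size;
    printf("%2d-bit, %lluKB pages: %llu entries, %llu MB per process\n",
           addr_bits, (unsigned long long)(page_size / 1024),
           (unsigned long long)num_pages,
           (unsigned long long)(table_bytes >> 20));
}

int main(void) {
    flat_table_size(32, 4096, 4);   // 2^20 entries -> 4MB
    flat_table_size(32, 8192, 4);   // 2^19 entries -> 2MB
    flat_table_size(48, 4096, 8);   // 2^36 entries -> 512GB
    return 0;
}
```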
The insight that changed everything:
Most processes use only a small portion of their virtual address space. A typical application might use:

- A few megabytes of code near the bottom of the address space
- A heap of a few megabytes that grows upward
- A small stack near the top that grows downward

The gigabytes in between sit completely unused.
Allocating page table entries for the unused region wastes memory. The solution? Make the page table itself hierarchical. Instead of a flat array with an entry for every possible page, we create a tree of smaller tables, allocating only the branches we actually need. This is the essence of multi-level paging.
By the end of this page, you will understand the two-level paging scheme: how the virtual address is decomposed, how the outer and inner page tables work together, how address translation proceeds through the hierarchy, and the precise memory savings this structure achieves.
Before diving into two-level paging, let's precisely understand what a single-level (flat) page table looks like and why it becomes problematic.
Structure of a flat page table:
In a single-level paging system, the page table is a contiguous array of page table entries (PTEs). Each entry maps one virtual page to a physical frame. The virtual page number (VPN), extracted from the virtual address, directly indexes into this array.
```c
// Single-Level (Flat) Page Table Structure
// 32-bit virtual address, 4KB pages

#define PAGE_SIZE  4096       // 4KB = 2^12 bytes
#define PAGE_SHIFT 12         // log2(PAGE_SIZE)
#define NUM_PAGES  (1 << 20)  // 2^20 = 1,048,576 pages

// Page Table Entry (PTE) - 32 bits
typedef struct {
    uint32_t present     : 1;  // Page is in physical memory
    uint32_t writable    : 1;  // Page is writable
    uint32_t user        : 1;  // Page accessible from user mode
    uint32_t accessed    : 1;  // Page has been read
    uint32_t dirty       : 1;  // Page has been written
    uint32_t unused      : 7;  // Available for OS use
    uint32_t frame_number: 20; // Physical frame number
} __attribute__((packed)) PTE;

// The entire page table: 4MB contiguous array
PTE page_table[NUM_PAGES];  // 1,048,576 entries × 4 bytes = 4MB

// Address translation
uint32_t translate(uint32_t virtual_addr) {
    // Extract virtual page number (top 20 bits)
    uint32_t vpn = virtual_addr >> PAGE_SHIFT;

    // Extract page offset (bottom 12 bits)
    uint32_t offset = virtual_addr & (PAGE_SIZE - 1);

    // Look up the VPN in the page table
    PTE entry = page_table[vpn];

    if (!entry.present) {
        // Page fault - page not in physical memory
        trigger_page_fault(virtual_addr);
    }

    // Combine frame number with offset
    uint32_t physical_addr = (entry.frame_number << PAGE_SHIFT) | offset;
    return physical_addr;
}
```

The problem visualized:
Imagine a simple process with 10MB of code, 5MB of heap, and 1MB of stack. Its virtual address space layout might look like:
Virtual Address Space (4GB total):
┌─────────────────────────────────────────┐ 0xFFFFFFFF (4GB)
│ │
│ Stack (1MB) │
│ │
├─────────────────────────────────────────┤ ~0xFFF00000
│ │
│ │
│ │
│ UNUSED REGION (~3.98GB) │
│ │
│ │
│ │
├─────────────────────────────────────────┤ ~0x01000000
│ Heap (5MB) │
├─────────────────────────────────────────┤ ~0x00B00000
│ Code (10MB) │
├─────────────────────────────────────────┤ ~0x00100000
│ Reserved/OS (1MB) │
└─────────────────────────────────────────┘ 0x00000000
The process uses only about 17MB of its 4GB address space—less than 0.5%. Yet a flat page table allocates entries for all 4GB. We're storing over a million entries to describe fewer than 5,000 used pages.
| Address Space | Page Size | Number of PTEs | PTE Size | Page Table Size |
|---|---|---|---|---|
| 32-bit (4GB) | 4KB | 2²⁰ = 1M | 4 bytes | 4MB per process |
| 32-bit (4GB) | 8KB | 2¹⁹ = 512K | 4 bytes | 2MB per process |
| 48-bit (256TB) | 4KB | 2³⁶ = 64G | 8 bytes | 512GB per process |
| 64-bit (16EB) | 4KB | 2⁵² = 4P | 8 bytes | 32PB per process |
A 64-bit flat page table would require 32 petabytes per process, vastly more than the physical memory of any real machine. Even for 32-bit systems, dedicating 4MB per process becomes prohibitive when running hundreds of processes. We need a fundamentally different approach.
Two-level paging (also called hierarchical paging or multi-level paging) addresses the page table size problem by breaking the flat table into a tree structure with two levels:
Outer Page Table (Page Directory): A single table that contains pointers to inner page tables. Each entry points to an inner table or is marked invalid.
Inner Page Tables (Page Tables): Multiple smaller tables, each covering a portion of the address space. Only the inner tables needed for the process's actual memory regions are allocated.
The key insight:
If a region of the virtual address space is unused, we don't allocate an inner page table for it. The outer table entry for that region is simply marked invalid. This transforms the space requirement from O(address space size) to O(used memory size).
```c
// Two-Level Page Table Structure
// 32-bit virtual address, 4KB pages

#define PAGE_SIZE         4096  // 4KB
#define PAGE_SHIFT        12
#define ENTRIES_PER_TABLE 1024  // 2^10 entries per table
#define OUTER_SHIFT       22    // Bits 22-31 for outer index
#define INNER_SHIFT       12    // Bits 12-21 for inner index

// Page Directory Entry (PDE) - points to inner page table
typedef struct {
    uint32_t present    : 1;   // Inner table is allocated
    uint32_t writable   : 1;   // Region is writable
    uint32_t user       : 1;   // Accessible from user mode
    uint32_t unused     : 9;   // Available bits
    uint32_t table_addr : 20;  // Physical address of inner table (aligned)
} __attribute__((packed)) PDE;

// Page Table Entry (PTE) - maps to physical frame
typedef struct {
    uint32_t present     : 1;  // Page is in memory
    uint32_t writable    : 1;  // Page is writable
    uint32_t user        : 1;  // Accessible from user mode
    uint32_t accessed    : 1;  // Page has been accessed
    uint32_t dirty       : 1;  // Page has been modified
    uint32_t unused      : 7;  // Available bits
    uint32_t frame_number: 20; // Physical frame number
} __attribute__((packed)) PTE;

// Each inner page table is one page (4KB)
typedef PTE InnerPageTable[ENTRIES_PER_TABLE];  // 1024 × 4 = 4KB

// The outer page table (page directory) is also one page
typedef PDE OuterPageTable[ENTRIES_PER_TABLE];  // 1024 × 4 = 4KB

// Per-process page directory
OuterPageTable* cr3;  // CR3 register points to page directory
```

The elegance of page-sized tables:
Notice that both the outer and inner tables are exactly 4KB—the same size as a page. This is not coincidental: a page-sized table can be allocated and freed with the same frame-granularity machinery the OS already uses for data pages, and it never straddles a frame boundary.
With 1024 entries in the outer table, and each inner table covering 1024 pages × 4KB = 4MB, we get:
Outer table coverage: 1024 inner tables × 4MB each = 4GB ✓
The entire 32-bit address space is covered by a 4KB outer table plus any needed 4KB inner tables.
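A quick way to sanity-check this coverage arithmetic is with compile-time assertions. The sketch below simply encodes the multiplications above (it is illustrative, not from the original listing):

```c
// Compile-time check that the 10/10/12 split covers the full 32-bit space.
#include <assert.h>

#define PAGE_SIZE         4096u
#define ENTRIES_PER_TABLE 1024u

// One inner table maps 1024 pages × 4KB = 4MB of virtual space.
static_assert(ENTRIES_PER_TABLE * PAGE_SIZE == 4u * 1024 * 1024,
              "inner table covers 4MB");

// The directory's 1024 entries × 4MB each = 4GB = the full 2^32 space.
static_assert((unsigned long long)ENTRIES_PER_TABLE
              * ENTRIES_PER_TABLE * PAGE_SIZE == (1ULL << 32),
              "directory covers the full 32-bit address space");
```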
The 32-bit address is divided into three fields: 10 bits for the outer index, 10 bits for the inner index, and 12 bits for the page offset. This gives 2¹⁰ = 1024 entries at each level and 2¹² = 4096 bytes per page. The 10-10 split works out neatly because it makes both table levels exactly one page in size.
In two-level paging, the virtual address is split into three distinct fields, each serving a specific purpose in the translation process:
┌────────────────────────────────────────────────────────────────┐
│ 32-bit Virtual Address │
├─────────────┬─────────────┬────────────────────────────────────┤
│ Bits 31-22│ Bits 21-12│ Bits 11-0 │
│ (10 bits) │ (10 bits) │ (12 bits) │
├─────────────┼─────────────┼────────────────────────────────────┤
│ P1 │ P2 │ D (Offset) │
│ (Outer idx) │ (Inner idx) │ (Page Offset) │
└─────────────┴─────────────┴────────────────────────────────────┘
Field breakdown:
P1 (Outer Index / Directory Index): Selects one of 1024 entries in the page directory. This entry points to an inner page table.
P2 (Inner Index / Page Table Index): Selects one of 1024 entries in the inner page table. This entry contains the physical frame number.
D (Offset): The position within the 4KB page. This is copied directly to the physical address without translation.
```c
// Extracting address components from a 32-bit virtual address

#define OFFSET_MASK 0x00000FFF  // Bits 0-11: page offset
#define P2_MASK     0x003FF000  // Bits 12-21: inner table index
#define P1_MASK     0xFFC00000  // Bits 22-31: outer table index

#define PAGE_SHIFT 12
#define P2_SHIFT   12
#define P1_SHIFT   22

// Extract page offset (12 bits)
static inline uint32_t get_offset(uint32_t vaddr) {
    return vaddr & OFFSET_MASK;          // Mask: 0x00000FFF
}

// Extract inner table index (10 bits)
static inline uint32_t get_p2(uint32_t vaddr) {
    return (vaddr >> P2_SHIFT) & 0x3FF;  // Shift right 12, mask 10 bits
}

// Extract outer table index (10 bits)
static inline uint32_t get_p1(uint32_t vaddr) {
    return (vaddr >> P1_SHIFT) & 0x3FF;  // Shift right 22, mask 10 bits
}

// Example: Virtual address 0x12345678
// Binary: 0001 0010 0011 0100 0101 0110 0111 1000
//         ├─────────┤├─────────┤├───────────────┤
//          P1 (10b)   P2 (10b)    Offset (12b)
//
// P1 = 0001001000   = 72   (outer index)
// P2 = 1101000101   = 837  (inner index)
// D  = 011001111000 = 1656 (offset within page)

void decompose_example() {
    uint32_t vaddr = 0x12345678;
    printf("Virtual Address: 0x%08X\n", vaddr);
    printf("P1 (Directory Index): %u\n", get_p1(vaddr));
    printf("P2 (Table Index):     %u\n", get_p2(vaddr));
    printf("Offset:               %u\n", get_offset(vaddr));
}
```

| Field | Bits | Range | Purpose |
|---|---|---|---|
| P1 (Outer) | 22-31 | 0-1023 | Index into page directory to find inner table pointer |
| P2 (Inner) | 12-21 | 0-1023 | Index into inner page table to find frame number |
| D (Offset) | 0-11 | 0-4095 | Byte position within the 4KB page (unchanged) |
Visualizing the decomposition:
Consider accessing memory at virtual address 0xC0123456. Let's decompose this address:
0xC0123456 = 1100000000 0100100011 010001010110
             └───P1───┘ └───P2───┘ └──Offset──┘
               = 768      = 291       = 1110
              (0x300)    (0x123)     (0x456)
This address refers to: page directory entry 768 selects the inner page table, entry 291 of that table supplies the physical frame number, and byte 1110 within that frame is the target location.
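To verify these field values mechanically, the following snippet checks them with the same shifts and masks used by get_p1, get_p2, and get_offset above (an illustrative check, not part of the original listing):

```c
// Verify the worked decomposition of 0xC0123456.
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t vaddr = 0xC0123456;
    assert(((vaddr >> 22) & 0x3FF) == 768);   // P1 = 0x300
    assert(((vaddr >> 12) & 0x3FF) == 291);   // P2 = 0x123
    assert((vaddr & 0xFFF) == 1110);          // D  = 0x456
    return 0;
}
```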
The hierarchical lookup transforms a single flat array access into two sequential table lookups, but enables dramatic memory savings.
The complete translation process in two-level paging involves accessing two separate tables in sequence. Let's trace through the entire process:
```c
// Complete Two-Level Address Translation

typedef uint32_t* PageDirectory;
typedef uint32_t* PageTable;

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1 << PAGE_SHIFT)
#define ENTRY_MASK 0x3FF            // 10-bit mask for P1/P2

#define PDE_PRESENT    0x001        // Directory entry present
#define PTE_PRESENT    0x001        // Page table entry present
#define PDE_ADDR_MASK  0xFFFFF000   // Physical address of inner table
#define PTE_FRAME_MASK 0xFFFFF000   // Physical frame number

// Hardware CR3 register (page directory base)
extern PageDirectory cr3;

// Translate virtual address to physical address
// Returns physical address, or triggers page fault
uint32_t translate_two_level(uint32_t virtual_address) {
    // STEP 1: Decompose virtual address into components
    uint32_t p1     = (virtual_address >> 22) & ENTRY_MASK;  // Outer index
    uint32_t p2     = (virtual_address >> 12) & ENTRY_MASK;  // Inner index
    uint32_t offset = virtual_address & (PAGE_SIZE - 1);     // Page offset

    // STEP 2: Read page directory entry (first memory access)
    uint32_t pde = cr3[p1];

    // STEP 3: Check if inner page table exists
    if (!(pde & PDE_PRESENT)) {
        // Inner table not allocated - page fault
        // OS will either allocate page table or terminate process
        page_fault(virtual_address, FAULT_NOT_PRESENT);
        return 0;  // Not reached
    }

    // STEP 4: Extract physical address of inner page table
    uint32_t inner_table_phys = pde & PDE_ADDR_MASK;
    PageTable inner_table = physical_to_virtual(inner_table_phys);

    // STEP 5: Read page table entry (second memory access)
    uint32_t pte = inner_table[p2];

    // STEP 6: Check if page is present
    if (!(pte & PTE_PRESENT)) {
        // Page not in memory - page fault
        // OS will load from disk or allocate new frame
        page_fault(virtual_address, FAULT_PAGE_NOT_PRESENT);
        return 0;  // Not reached
    }

    // Additional checks (simplified)
    check_protection(pte, access_type);
    update_accessed_bit(&inner_table[p2]);

    // STEP 7: Extract frame number from PTE
    uint32_t frame_number = pte & PTE_FRAME_MASK;

    // STEP 8: Form physical address
    uint32_t physical_address = frame_number | offset;
    return physical_address;
}
```

Memory access overhead:
The critical observation is that two-level paging requires two memory accesses to translate a single virtual address (compared to one for a flat page table):

1. Read the page directory entry (PDE) from the directory pointed to by CR3.
2. Read the page table entry (PTE) from the inner table the PDE points to.
Only after both reads can we access the actual data at a third memory location. This 3× memory access overhead would be devastating for performance if not mitigated by the Translation Lookaside Buffer (TLB), which caches recently used translations.
Without TLB caching, every memory access would require three physical memory reads. With 100ns RAM latency, a single memory access would take 300ns—a 3× slowdown. The TLB is essential for making multi-level paging viable. We cover TLB mechanics in detail in Module 4.
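As a rough illustration of why the TLB rescues performance, the sketch below computes the effective access time. It uses the 100ns RAM latency from the callout above; the 99% hit ratio and 1ns TLB latency are assumed values for illustration only:

```c
// Back-of-the-envelope effective access time (EAT) with a TLB.
#include <stdio.h>

int main(void) {
    double t_mem = 100.0;  // ns per physical memory access
    double t_tlb = 1.0;    // ns per TLB lookup (assumed)
    double hit   = 0.99;   // TLB hit ratio (assumed)

    // Hit:  TLB lookup + 1 data access.
    // Miss: TLB lookup + 2 table walks (PDE, PTE) + 1 data access.
    double eat = hit * (t_tlb + t_mem)
               + (1 - hit) * (t_tlb + 3 * t_mem);
    printf("EAT = %.1f ns (vs. 300 ns with no TLB)\n", eat);  // ~103 ns
    return 0;
}
```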
Let's rigorously analyze the memory savings achieved by two-level paging for realistic workloads.
Baseline: Flat Page Table

A flat table always occupies 2²⁰ entries × 4 bytes = 4MB per process, no matter how little of the address space is used.

Two-Level Page Table: Sparse Process

Consider a process using:

- 10MB of code starting at 0x00100000
- 5MB of heap just above the code
- 1MB of stack at the top of the address space

Total memory used: 16MB = 4,096 pages
```python
# Two-Level Paging Memory Analysis

# System parameters
ADDRESS_BITS = 32
PAGE_SIZE = 4096            # 4KB pages
PDE_SIZE = 4                # 4 bytes per directory entry
PTE_SIZE = 4                # 4 bytes per page table entry
ENTRIES_PER_TABLE = 1024    # 2^10 entries

# Each inner table covers 4MB of virtual address space
INNER_TABLE_COVERAGE = ENTRIES_PER_TABLE * PAGE_SIZE  # 4MB

def calculate_two_level_overhead(memory_regions):
    """
    Calculate page table memory for given memory regions.
    Each region is (start_addr, size_bytes).
    """
    # Always need the page directory
    outer_table_size = ENTRIES_PER_TABLE * PDE_SIZE  # 4KB

    # Count which inner tables are needed
    inner_tables_needed = set()
    for start_addr, size_bytes in memory_regions:
        end_addr = start_addr + size_bytes - 1
        # Calculate which 4MB regions are touched
        start_p1 = start_addr // INNER_TABLE_COVERAGE
        end_p1 = end_addr // INNER_TABLE_COVERAGE
        for p1 in range(start_p1, end_p1 + 1):
            inner_tables_needed.add(p1)

    inner_tables_size = len(inner_tables_needed) * ENTRIES_PER_TABLE * PTE_SIZE
    total_overhead = outer_table_size + inner_tables_size

    return outer_table_size, len(inner_tables_needed), inner_tables_size, total_overhead

# Example: Typical process layout
regions = [
    (0x00100000, 10 * 1024 * 1024),  # 10MB code
    (0x00B00000, 5 * 1024 * 1024),   # 5MB heap (just above the code)
    (0xFFF00000, 1 * 1024 * 1024),   # 1MB stack
]

directory, count, inner, total = calculate_two_level_overhead(regions)

print("Two-Level Page Table Analysis")
print("=" * 40)
print(f"Process memory used: 16MB")
print(f"Page directory: {directory:,} bytes (4KB)")
print(f"Inner tables needed: {count}")
print(f"Inner tables total: {inner:,} bytes ({inner // 1024}KB)")
print(f"Total overhead: {total:,} bytes ({total // 1024}KB)")
print()
print(f"Flat table would need: 4,194,304 bytes (4MB)")
print(f"Savings: {(4194304 - total):,} bytes ({(1 - total/4194304)*100:.1f}%)")
```

| Memory Layout | Flat Table | Two-Level Tables | Savings |
|---|---|---|---|
| 16MB sparse (code/heap/stack) | 4MB | ~24KB (1 dir + 5 inner) | 99.4% |
| 100MB contiguous | 4MB | ~104KB (1 dir + 25 inner) | 97.5% |
| 512MB contiguous | 4MB | ~516KB (1 dir + 128 inner) | 87.4% |
| 2GB contiguous | 4MB | ~2MB (1 dir + 512 inner) | 49.9% |
| 4GB fully mapped | 4MB | ~4.01MB (all tables) | -0.1% |
Key observations:
Sparse processes save dramatically: The typical case—small code, small heap, small stack—achieves 99%+ savings.
Savings scale with sparsity: The more unused virtual address space, the greater the benefit.
Dense usage reduces benefit: If a process uses most of its address space, two-level paging adds slight overhead due to the extra directory.
The worst case is rare: Few processes actually use 4GB of virtual address space.
The critical insight:
Two-level paging trades time for space. We add an extra memory access during translation, but we dramatically reduce the total memory devoted to page tables. This trade-off is overwhelmingly favorable because:

- The TLB absorbs the extra lookup on the vast majority of accesses, so the time cost is rarely paid.
- The space saving applies to every process on the system, all the time.
- The dense worst case (a fully mapped 4GB address space) is rare in practice.
On a system running 500 processes, each using 20MB of RAM: Flat tables would consume 2GB just for page tables. Two-level tables might use only 15-20MB total—a 100× reduction. This is why multi-level paging is universally adopted in modern systems.
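The arithmetic behind that claim, with the per-process table count as an explicit assumption (here, 1 directory plus 9 inner tables each), looks like this:

```c
// Rough aggregate arithmetic for the 500-process scenario above.
// The 10 tables per process is an assumed figure for illustration.
#include <stdio.h>

int main(void) {
    long long procs      = 500;
    long long flat_bytes = procs * 4 * 1024 * 1024;  // 4MB flat table each
    long long two_level  = procs * 10 * 4096;        // ~40KB of tables each
    printf("Flat:      %lld MB\n", flat_bytes >> 20);        // 2000 MB
    printf("Two-level: %lld MB\n", two_level >> 20);         // ~19 MB
    printf("Reduction: ~%lldx\n", flat_bytes / two_level);   // ~100x
    return 0;
}
```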
The Page Directory Entry (PDE) in a two-level system serves as a pointer to an inner page table. Its structure must encode both the physical location of the inner table and control information for the entire 4MB region that table covers.
```c
// Page Directory Entry (PDE) - 32-bit x86 Format
//
// Bit layout:
// ┌──────────────────────┬───────┬───┬────┬───┬───┬─────┬─────┬─────┬─────┬───┐
// │ 31                12 │ 11  9 │ 8 │ 7  │ 6 │ 5 │  4  │  3  │  2  │  1  │ 0 │
// │  Page Table Address  │ Avail │ G │ PS │ 0 │ A │ PCD │ PWT │ U/S │ R/W │ P │
// └──────────────────────┴───────┴───┴────┴───┴───┴─────┴─────┴─────┴─────┴───┘

typedef struct {
    union {
        uint32_t raw;
        struct {
            uint32_t present        : 1;  // P:   Entry is valid
            uint32_t read_write     : 1;  // R/W: 0=read-only, 1=read-write
            uint32_t user_supervisor: 1;  // U/S: 0=supervisor only, 1=user accessible
            uint32_t write_through  : 1;  // PWT: Write-through caching
            uint32_t cache_disable  : 1;  // PCD: Page caching disabled
            uint32_t accessed       : 1;  // A:   Entry has been used for translation
            uint32_t reserved       : 1;  // (Reserved, must be 0)
            uint32_t page_size      : 1;  // PS:  0=4KB pages, 1=4MB page (PSE)
            uint32_t global         : 1;  // G:   Global entry (TLB behavior)
            uint32_t available      : 3;  // Available for OS use
            uint32_t table_address  : 20; // Physical frame number of inner table
        } fields;
    };
} PDE;

// PDE operations
#define PDE_PRESENT    0x001
#define PDE_WRITE      0x002
#define PDE_USER       0x004
#define PDE_ACCESSED   0x020
#define PDE_LARGE_PAGE 0x080
#define PDE_ADDR_MASK  0xFFFFF000

// Create a PDE pointing to an inner page table
PDE make_pde(uint32_t table_phys_addr, uint32_t flags) {
    PDE pde;
    // Table address must be 4KB aligned
    pde.raw = (table_phys_addr & PDE_ADDR_MASK) | (flags & 0xFFF);
    return pde;
}

// Get physical address of inner page table from PDE
uint32_t pde_get_table_addr(PDE pde) {
    return pde.raw & PDE_ADDR_MASK;
}

// Example: Create a present, user-accessible, writable PDE
// Inner page table at physical address 0x00345000
void example_pde() {
    uint32_t flags = PDE_PRESENT | PDE_WRITE | PDE_USER;
    PDE pde = make_pde(0x00345000, flags);
    // pde.raw = 0x00345007
}
```

Protection bits in the PDE apply to the entire 4MB region. Inner PTEs can only further restrict access—never grant more. For example, if the PDE is read-only, no PTE in that inner table can make a page writable. This hierarchical protection simplifies permission management.
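One way to picture this hierarchical rule is as a bitwise AND of the two levels' permission bits. The helper below is a hypothetical sketch of that idea, not x86-mandated behavior in every detail:

```c
// Effective permission = AND of PDE and PTE permission bits (sketch).
#include <stdint.h>
#include <stdbool.h>

#define FLAG_PRESENT 0x001
#define FLAG_WRITE   0x002

// True only if both levels allow a write to the page.
bool can_write(uint32_t pde_flags, uint32_t pte_flags) {
    bool mapped = (pde_flags & FLAG_PRESENT) && (pte_flags & FLAG_PRESENT);
    bool write  = (pde_flags & FLAG_WRITE)   && (pte_flags & FLAG_WRITE);
    return mapped && write;  // a read-only PDE forces the whole 4MB region read-only
}
```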
The true power of two-level paging emerges in handling sparse address spaces—where a process uses small, scattered regions of a large virtual address space.
How sparse mappings work:
When a process is created, the only page tables allocated are:

- The page directory itself (one 4KB page, with all entries initially invalid)
- An inner table for each 4MB region the process actually maps at startup (typically a handful for code, data, and stack)
Vast regions of unused address space require no memory at all—only an invalid entry in the directory.
```c
// Managing Sparse Address Space with Two-Level Paging

// Process creation: minimal page table allocation
void create_process_page_tables(Process* proc) {
    // Allocate ONLY the page directory (4KB)
    proc->page_directory = allocate_physical_page();
    memset(proc->page_directory, 0, PAGE_SIZE);  // All entries invalid

    // No inner tables allocated yet!
    // Total initial overhead: 4KB (just the directory)
}

// Map a virtual address on demand
int map_page(Process* proc, uint32_t vaddr, uint32_t paddr, uint32_t flags) {
    uint32_t p1 = (vaddr >> 22) & 0x3FF;
    uint32_t p2 = (vaddr >> 12) & 0x3FF;

    PDE* pde = &proc->page_directory[p1];

    // Check if inner page table exists
    if (!pde->present) {
        // Allocate new inner table on demand
        uint32_t new_table = allocate_physical_page();
        memset(phys_to_virt(new_table), 0, PAGE_SIZE);

        // Point directory entry to new table
        pde->raw = new_table | PDE_PRESENT | PDE_WRITE | PDE_USER;
        // Now we've allocated 8KB total (directory + 1 inner table)
    }

    // Get inner table
    PTE* inner_table = phys_to_virt(pde_get_table_addr(*pde));

    // Create mapping
    inner_table[p2] = (paddr & 0xFFFFF000) | flags | PTE_PRESENT;

    // Invalidate TLB entry for this address
    invlpg(vaddr);

    return 0;
}

// Example: Load segments of an executable
void load_executable(Process* proc, ELF_Header* elf) {
    // Code segment at 0x00400000 (4MB-aligned for simplicity)
    // Uses directory entry 1, allocates 1 inner table
    for (int i = 0; i < code_pages; i++) {
        map_page(proc, 0x00400000 + i * PAGE_SIZE,
                 alloc_frame(), PTE_USER);  // Read-only code
    }

    // Data segment at 0x00800000
    // Uses directory entry 2, allocates 1 inner table
    for (int i = 0; i < data_pages; i++) {
        map_page(proc, 0x00800000 + i * PAGE_SIZE,
                 alloc_frame(), PTE_USER | PTE_WRITE);
    }

    // Stack just below 0xC0000000 (grows down)
    // Uses directory entry 767, allocates 1 inner table
    for (int i = 0; i < INITIAL_STACK_PAGES; i++) {
        map_page(proc, 0xC0000000 - (i + 1) * PAGE_SIZE,
                 alloc_frame(), PTE_USER | PTE_WRITE);
    }

    // Total: 4KB directory + 3 × 4KB inner tables = 16KB
    // Flat table would be 4MB regardless of actual memory used
}
```

Dynamic growth with lazy allocation:
The heap and stack grow dynamically. With two-level paging:
Heap growth — When malloc() exhausts the current heap and calls sbrk() or mmap(), the kernel maps new pages. If the new pages fall in an already-allocated inner table, only new PTEs are filled. If they cross into a new 4MB region, a new inner table is allocated.
Stack growth — When the stack pointer moves below currently-mapped pages, a page fault occurs. The kernel allocates a new frame and creates a PTE. New inner tables are allocated only when the stack grows into a new 4MB region.
This lazy allocation strategy minimizes memory overhead by allocating page table structures only when they're actually needed.
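A sketch of the stack-growth path might look like the following. It reuses the map_page() helper from the listing above; is_stack_region and terminate_process are hypothetical helpers, not from a real kernel:

```c
// Demand allocation on a stack-growth page fault (illustrative sketch).
void handle_page_fault(Process* proc, uint32_t fault_addr) {
    if (is_stack_region(proc, fault_addr)) {
        // Allocate a fresh frame and map it; map_page() creates a new
        // inner table automatically if the fault crossed into a new
        // 4MB region (a new directory entry).
        map_page(proc, fault_addr & ~(PAGE_SIZE - 1),
                 alloc_frame(), PTE_USER | PTE_WRITE);
        return;  // Retry the faulting instruction
    }
    // Not a growable region: deliver a fault to the process
    terminate_process(proc, fault_addr);
}
```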
Memory-mapped files and shared libraries often map at high virtual addresses (e.g., 0x7F000000 on some systems). Each shared library might touch a different 4MB region, requiring its own inner page table. However, these tables can be shared between processes mapping the same library at the same address, further reducing per-process overhead.
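A minimal sketch of that sharing, using the assumed Process and PDE structures from the earlier listings: both directories simply point at the same physical inner table, so its PTEs exist only once in memory.

```c
// Two processes mapping the same library at the same virtual address
// can share one inner page table (illustrative sketch).
void share_library_tables(Process* a, Process* b, uint32_t lib_vaddr,
                          uint32_t shared_table_phys) {
    uint32_t p1 = (lib_vaddr >> 22) & 0x3FF;  // Same directory slot in both

    // Both PDEs reference the identical inner table; the read-only
    // library pages it maps are described exactly once.
    a->page_directory[p1].raw = shared_table_phys | PDE_PRESENT | PDE_USER;
    b->page_directory[p1].raw = shared_table_phys | PDE_PRESENT | PDE_USER;
}
```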
Two-level paging is the foundational hierarchical page table design that makes large virtual address spaces practical. Let's consolidate the key concepts:

- The 32-bit virtual address splits into 10 bits of outer index, 10 bits of inner index, and 12 bits of page offset.
- A 4KB page directory points to 4KB inner page tables; an unused 4MB region needs no inner table at all, just an invalid directory entry.
- Translation reads the PDE, then the PTE, then the data: three memory accesses unless the TLB short-circuits the walk.
- A sparse 16MB process needs roughly 24KB of page tables instead of a 4MB flat table, a saving of over 99%.
What's next:
Two-level paging suffices for 32-bit systems, but 64-bit address spaces are vastly larger. Even with two-level paging, a 64-bit system would need an enormous outer table. The solution? Add more levels. In the next page, we'll explore three-level paging and understand how modern 64-bit systems use four or even five levels to manage their immense address spaces.
You now understand two-level paging: the motivation for hierarchical page tables, the structure of the page directory and inner tables, the complete translation process, and the memory savings achieved. Next, we'll extend these concepts to three-level paging for larger address spaces.