We've detected the page fault, trapped to the OS, and located where the page data resides on disk. Now comes the critical operation: actually loading that data into physical memory.
This phase involves orchestrating multiple subsystems: the physical memory allocator, the block I/O layer, the page tables, and the TLB, all coordinated under the kernel's memory-management locks.
This page explores each aspect in depth. You'll understand the full lifecycle of a page from its arrival in RAM to its integration into the process's address space—the moment when the trap can finally return and the instruction can successfully complete.
By the end of this page, you will understand: (1) How physical frames are allocated during page faults, (2) The mechanics of reading page content from disk, (3) How page tables are atomically updated, (4) TLB considerations when establishing new mappings, (5) How these operations are synchronized for correctness.
Before we can load content from disk, we need somewhere to put it—a physical frame. The OS maintains elaborate data structures to track which frames are free, in use, or reclaimable.
The Frame Allocation Challenge:
Frame allocation during page fault handling must be fast (it sits on the critical path of every fault), safe under concurrent faults from other threads, and able to make progress even when free memory is scarce.
Free Frame Sources:
The OS can obtain free frames from several sources, in rough order of preference:

1. Free lists: frames the buddy allocator already has available (the fast path).
2. Clean reclaimable pages: page cache pages whose contents can be dropped and re-read later if needed.
3. Dirty reclaimable pages: pages that must first be written back to swap or to their backing file before the frame can be reused.
```c
// Physical frame allocator (simplified)

// Per-NUMA-node free lists
struct free_area {
    struct list_head free_list;   // List of free page blocks
    unsigned long nr_free;        // Count of free pages
};

// Per-zone free area lists (one per order for the buddy allocator)
struct zone {
    struct free_area free_area[MAX_ORDER]; // Order 0-10 (1, 2, 4, ... 1024 pages)
    unsigned long managed_pages;
    unsigned long watermark[NR_WMARK];     // Min/low/high watermarks
};

// Core page allocation function
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
    struct page *page;
    unsigned int alloc_flags;

    // Step 1: Determine allocation flags from gfp_mask
    alloc_flags = gfp_to_alloc_flags(gfp_mask);

    // Step 2: Try to get a page from the free lists
    page = get_page_from_freelist(gfp_mask, order, alloc_flags);
    if (page)
        return page;  // Got a page immediately

    // Step 3: No free pages - enter the slow path.
    // This may involve reclaiming pages from caches.
    page = __alloc_pages_slowpath(gfp_mask, order);

    return page;  // May be NULL if truly out of memory
}

// Page-fault-specific allocation
struct page *alloc_page_for_fault(struct vm_area_struct *vma, unsigned long address)
{
    gfp_t gfp_flags = GFP_HIGHUSER_MOVABLE;

    // Prefer memory close to the faulting CPU (NUMA)
    int preferred_nid = numa_node_id();

    // The memory policy might specify different behavior
    if (vma->vm_policy)
        preferred_nid = get_policy_node(vma->vm_policy, address);

    // Allocate the page
    struct page *page = __alloc_pages_node(preferred_nid, gfp_flags, 0);
    if (!page) {
        // Emergency: retry without the node preference (any node will do)
        page = __alloc_pages(gfp_flags, 0);
    }

    return page;
}
```

GFP (Get Free Pages) flags control how aggressively the allocator tries to satisfy the request. GFP_KERNEL allows blocking and reclaim. GFP_ATOMIC never blocks (for interrupt context). GFP_HIGHUSER_MOVABLE is typical for user page faults: it allows using high memory, and the page can be migrated during compaction.
When free memory is low, the OS must reclaim pages from current users to satisfy new allocation requests. This reclamation can happen synchronously (during the page fault) or asynchronously (by background kernel threads).
The Reclaim Process:
1. Identify candidates: scan the inactive page lists for pages that haven't been accessed recently.
2. Determine the reclaim action: clean pages can simply be dropped; dirty anonymous pages must be written to swap, and dirty file-backed pages written back to their files.
3. Update page tables: all mappings to the reclaimed page must be removed.
4. Free the frame: the page is now available for new allocation.
The Watermark System:
Linux uses watermarks to trigger proactive reclamation:
| Watermark | State | Action |
|---|---|---|
| High | Plenty of free memory | No action |
| Low | Getting low | Wake kswapd for background reclaim |
| Min | Critical | Direct reclaim in allocating process |
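The allocation-time decision can be pictured as a three-way check against those thresholds. The sketch below is illustrative only: the structure fields, helper names, and threshold values are simplified stand-ins, not the real kernel API (the actual zone_watermark_ok() also accounts for allocation order and reserved pools).

```c
// Illustrative watermark check at allocation time (simplified stand-in,
// not the real kernel API).
#include <stdio.h>

enum wmark { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

struct zone_stub {
    unsigned long nr_free_pages;
    unsigned long watermark[NR_WMARK];
};

// Returns 0 if the allocation can proceed now, -1 if the caller must
// perform direct reclaim first.
int allocate_with_watermarks(struct zone_stub *zone)
{
    if (zone->nr_free_pages > zone->watermark[WMARK_LOW])
        return 0;                       // Plenty of memory: fast path

    if (zone->nr_free_pages > zone->watermark[WMARK_MIN]) {
        // Getting low: allocate now, but wake the background reclaimer
        // so it can refill the free lists before we hit the min mark.
        // wake_kswapd(zone);   (placeholder for the real wakeup)
        return 0;
    }

    // Critical: the faulting task must reclaim pages itself.
    // try_to_free_pages(...);  (direct reclaim, shown in the next listing)
    return -1;
}

int main(void)
{
    struct zone_stub z = { .nr_free_pages = 900,
                           .watermark = { 1000, 2000, 3000 } };
    printf("decision: %s\n",
           allocate_with_watermarks(&z) == 0 ? "allocate now"
                                             : "direct reclaim first");
    return 0;
}
```

With 900 free pages against a min watermark of 1000, the sketch reports that direct reclaim is needed, matching the "Min / Critical" row of the table.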
```c
// Simplified page reclaim logic

// kswapd - background reclaim daemon
int kswapd(void *p)
{
    struct zone *zone = p;
    struct scan_control sc = { .priority = DEF_PRIORITY };

    while (!kthread_should_stop()) {
        if (zone_watermark_ok(zone, 0, zone->watermark[WMARK_HIGH]))
            // Enough free memory: sleep until an allocator wakes us
            wait_event_interruptible(zone->kswapd_wait,
                !zone_watermark_ok(zone, 0, zone->watermark[WMARK_HIGH]));
        else
            // Below the high watermark: reclaim pages
            shrink_zone(zone, &sc);
    }
    return 0;
}

// Direct reclaim - called when an allocation fails
static unsigned long try_to_free_pages(struct zonelist *zonelist, gfp_t gfp_mask)
{
    struct scan_control sc = {
        .nr_to_reclaim = SWAP_CLUSTER_MAX,
        .gfp_mask = gfp_mask,
        .priority = DEF_PRIORITY,
    };
    unsigned long nr_reclaimed = 0;

    do {
        nr_reclaimed += shrink_zones(zonelist, &sc);
        sc.priority--;  // Get more aggressive
    } while (nr_reclaimed < sc.nr_to_reclaim && sc.priority >= 0);

    return nr_reclaimed;
}

// Shrink by scanning the LRU lists
static unsigned long shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
    unsigned long nr_reclaimed = 0;

    // Scan inactive anonymous pages (candidates for swap)
    nr_reclaimed += shrink_inactive_list(LRU_INACTIVE_ANON, lruvec, sc);

    // Scan inactive file pages (candidates for discarding/writeback)
    nr_reclaimed += shrink_inactive_list(LRU_INACTIVE_FILE, lruvec, sc);

    return nr_reclaimed;
}

// For a single page, decide on and execute reclaim
static int shrink_page(struct page *page, struct scan_control *sc)
{
    // Was the page referenced recently? Move it to the active list
    if (page_referenced(page)) {
        activate_page(page);
        return PAGEREF_ACTIVE;
    }

    // Anonymous page - must be swapped out
    if (PageAnon(page)) {
        if (!add_to_swap(page))
            return PAGEREF_KEEP;  // No swap space, can't reclaim

        // Write to swap (may be async)
        swap_writepage(page);
    }

    // File-backed dirty page - write it back
    if (PageDirty(page))
        writepage(page);

    // Remove it from all page tables (reverse mapping)
    try_to_unmap(page);

    // Free the page
    free_page(page);
    return PAGEREF_RECLAIMED;
}
```

When a page fault triggers direct reclaim, the faulting process pays the cost of evicting other pages. This can add significant latency: writing pages to swap, waiting for I/O, and scanning LRU lists. High-performance systems try to keep enough free memory to avoid direct reclaim.
With a frame allocated, we now need to fill it with the page's content. This involves issuing I/O to the storage device—the most time-consuming part of page fault handling.
I/O Paths:
1. Swap Read:
Page Fault → alloc_page() → swap_readpage() → Block Layer → Storage Driver → Wait → Complete
2. File Read:
Page Fault → alloc_page() → readpage() → Filesystem → Block Layer → Storage Driver → Wait → Complete
3. Zero Fill (no I/O):
Page Fault → alloc_page() → clear_page() → Complete
Blocking vs Async I/O:
Most page fault I/O is synchronous: the faulting process blocks until the I/O completes. However, some variations exist: the kernel may issue readahead for neighboring swap slots or file pages so that subsequent faults find their data already in memory.
```c
// Reading a page from swap

int swap_readpage(struct page *page, swp_entry_t entry)
{
    struct swap_info_struct *sis;
    struct bio *bio;
    sector_t sector;

    // Lock the page - prevents concurrent I/O or access
    lock_page(page);

    // Get the swap device info
    sis = swp_swap_info(entry);

    // Calculate the disk sector for this swap slot
    sector = swp_offset(entry);
    sector <<= PAGE_SHIFT - SECTOR_SHIFT;  // Convert page offset to sectors

    // Allocate and set up a bio (block I/O request)
    bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
    bio->bi_iter.bi_sector = sector + sis->start_sector;
    bio_add_page(bio, page, PAGE_SIZE, 0);

    // Submit the I/O request
    bio->bi_end_io = swap_read_endio;  // Completion callback
    bio->bi_private = page;
    submit_bio(bio);

    // Wait for completion (the page fault path waits synchronously)
    wait_on_page_locked(page);

    // Check whether the read succeeded
    if (PageError(page)) {
        ClearPageError(page);
        return -EIO;
    }

    SetPageUptodate(page);  // Mark the page as having valid content
    return 0;
}

// Completion callback for the swap read
static void swap_read_endio(struct bio *bio)
{
    struct page *page = bio->bi_private;

    if (bio->bi_status)
        SetPageError(page);
    else
        SetPageUptodate(page);

    unlock_page(page);  // Wake up waiters
    bio_put(bio);
}

// Reading from a file (filesystem-specific)
int generic_file_read_page(struct file *file, struct page *page)
{
    struct inode *inode = file->f_inode;
    struct address_space *mapping = inode->i_mapping;

    // Read from the filesystem into the page
    return mapping->a_ops->readpage(file, page);
}

// Special case: zero-fill (no I/O needed)
void zero_fill_page(struct page *page)
{
    void *kaddr = kmap_local_page(page);
    memset(kaddr, 0, PAGE_SIZE);
    kunmap_local(kaddr);
    SetPageUptodate(page);
}
```

| Storage Type | Random 4KB Read | Impact on Page Fault |
|---|---|---|
| DDR4 RAM (for reference) | ~60 ns | Negligible |
| NVMe SSD | ~70-150 µs | ~70-150 µs per fault |
| SATA SSD | ~100-300 µs | ~100-300 µs per fault |
| HDD (7200 RPM) | ~5-15 ms | ~5-15 ms per fault |
| Network storage (NFS) | ~1-100 ms | Highly variable |
A modern CPU can execute 3+ billion instructions per second. A single HDD page fault (10ms) means ~30 million instructions lost. Even an NVMe fault (100µs) costs ~300,000 instructions. This is why keeping the working set in memory is so critical for performance.
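To put numbers on that intuition, here is a small standalone back-of-the-envelope calculation (not kernel code; the latencies are the rough figures from the table above) that computes the effective memory access time for a given major-fault rate.

```c
// Effective access time: (1 - p) * t_mem + p * t_fault
// Latencies are rough figures from the table above.
#include <stdio.h>

int main(void)
{
    const double t_mem  = 60e-9;    // ~60 ns DRAM access
    const double t_nvme = 100e-6;   // ~100 us NVMe major fault
    const double t_hdd  = 10e-3;    // ~10 ms HDD major fault
    const double rates[] = { 1e-6, 1e-4 };  // major faults per memory access

    for (int i = 0; i < 2; i++) {
        double p = rates[i];
        double eat_nvme = (1 - p) * t_mem + p * t_nvme;
        double eat_hdd  = (1 - p) * t_mem + p * t_hdd;
        printf("p = %.0e: NVMe-backed EAT %6.0f ns (%5.1fx), "
               "HDD-backed EAT %6.0f ns (%5.1fx)\n",
               p, eat_nvme * 1e9, eat_nvme / t_mem,
               eat_hdd * 1e9, eat_hdd / t_mem);
    }
    return 0;
}
```

Even one HDD fault per ten thousand memory accesses inflates the average access time by more than an order of magnitude, which is exactly why the working set needs to stay resident.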
With the page now in memory (loaded from disk or zero-filled), we must create the mapping in the page table so that the faulting instruction can access it.
The Page Table Entry (PTE) Update:
1. Walk to the PTE: navigate the page table hierarchy to the correct entry.
2. Construct the new PTE value: the physical frame number combined with permission bits derived from the VMA, plus the accessed bit (and, for write faults, the dirty bit).
3. Atomically update the PTE: use atomic operations to prevent races.
4. Handle concurrent modifications: check whether another thread installed a mapping while we were loading the page.
The Atomicity Requirement:
PTE updates must be atomic because the hardware page-table walker and other CPUs can read the entry at any instant; a torn or partially written PTE could expose a translation that points to the wrong frame or carries the wrong permissions.
```c
// Updating the page table entry after loading the page

int finish_fault(struct vm_fault *vmf)
{
    struct mm_struct *mm = vmf->vma->vm_mm;
    struct page *page = vmf->page;
    pte_t *pte = vmf->pte;
    pte_t entry;
    spinlock_t *ptl;

    // Step 1: Construct the new PTE
    entry = mk_pte(page, vmf->vma->vm_page_prot);

    // Apply VMA permissions
    if (vmf->vma->vm_flags & VM_WRITE)
        entry = pte_mkwrite(entry);
    if (vmf->vma->vm_flags & VM_EXEC)
        entry = pte_mkexec(entry);

    // Mark as young (just accessed)
    entry = pte_mkyoung(entry);

    // For write faults, also mark it dirty
    if (vmf->flags & FAULT_FLAG_WRITE)
        entry = pte_mkdirty(entry);

    // Step 2: Lock the page table
    ptl = pte_lockptr(mm, vmf->pmd);
    spin_lock(ptl);

    // Step 3: Check whether the PTE was modified by another thread
    if (!pte_none(*pte) && pte_present(*pte)) {
        // Another thread already handled this fault!
        spin_unlock(ptl);
        put_page(page);  // We don't need this page
        return VM_FAULT_NOPAGE;
    }

    // Step 4: Atomically install the new PTE
    set_pte_at(mm, vmf->address, pte, entry);

    // Step 5: Update memory accounting
    mm->_rss++;  // Increment the resident set size

    spin_unlock(ptl);
    return 0;
}

// Architecture-specific PTE setting (x86-64)
static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t pte)
{
    // On x86-64, PTEs are 64-bit and naturally aligned,
    // so a simple aligned store is atomic
    WRITE_ONCE(*ptep, pte);

    // Memory barrier to ensure the PTE is visible before TLB operations
    smp_wmb();
}

// Allocating intermediate page table levels if needed
pte_t *walk_and_allocate(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;

    pgd = pgd_offset(mm, addr);

    p4d = p4d_alloc(mm, pgd, addr);
    if (!p4d)
        return NULL;

    pud = pud_alloc(mm, p4d, addr);
    if (!pud)
        return NULL;

    pmd = pmd_alloc(mm, pud, addr);
    if (!pmd)
        return NULL;

    return pte_alloc_map(mm, pmd, addr);
}
```

If intermediate page table levels don't exist (e.g., on the first access to a previously unmapped region), the fault handler allocates them. Each level is a page-sized structure. The allocation uses the same mechanisms as regular page allocation, adding to the fault handling time.
After updating the page table, we must consider the TLB (Translation Lookaside Buffer). The TLB caches page table entries for fast translation, and it must reflect our changes.
For New Mappings (page fault):
We're mapping a page that wasn't previously present. The TLB definitely doesn't have a stale entry for this mapping because the PTE was marked not-present, and hardware never caches not-present entries in the TLB: a not-present PTE produces a fault, not a cached translation.
Therefore, no TLB invalidation is needed when installing a new mapping. The next access will cause a TLB miss, the hardware walker will load our new PTE, and the mapping will be cached.
For Modified Mappings (COW, permission change):
When modifying an existing mapping, the TLB may still hold the old translation. The stale entry must be invalidated on the local CPU, and on a multiprocessor every other CPU that may have cached it must be told to do the same (a TLB shootdown, shown below).
```c
// TLB operations in page fault context

// When installing a completely new mapping (not present -> present),
// no TLB invalidation is needed - there was no valid TLB entry
void install_new_pte(struct mm_struct *mm, unsigned long addr,
                     pte_t *pte, pte_t entry)
{
    set_pte_at(mm, addr, pte, entry);
    // The next access will TLB-miss and load our entry - no flush needed
}

// When modifying an existing mapping (e.g., COW, permission upgrade)
void modify_existing_pte(struct mm_struct *mm, unsigned long addr,
                         pte_t *pte, pte_t new_entry)
{
    // Install the new PTE
    set_pte_at(mm, addr, pte, new_entry);

    // Must invalidate stale TLB entries.
    // Single-CPU case:
    //     flush_tlb_page(mm, addr);
    // Multi-CPU case: this is a TLB shootdown - an IPI must be sent
    // to every CPU that might have this mapping cached.
}

// TLB shootdown - invalidate on all relevant CPUs
void flush_tlb_page_all(struct mm_struct *mm, unsigned long addr)
{
    // Get the mask of CPUs that have used this mm
    const cpumask_t *cpus = mm_cpumask(mm);

    // Current CPU - local flush
    local_flush_tlb_page(addr);

    // Other CPUs - send an Inter-Processor Interrupt
    smp_call_function_many(cpus, tlb_flush_func, &addr, 1 /* wait */);
}

// IPI handler on a remote CPU
void tlb_flush_func(void *info)
{
    unsigned long addr = *(unsigned long *)info;
    local_flush_tlb_page(addr);
}

// Local TLB flush of a single entry (x86)
static inline void local_flush_tlb_page(unsigned long addr)
{
    // The INVLPG instruction invalidates a single TLB entry
    asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}

// Flush the entire TLB (x86) - used for larger changes
static inline void local_flush_tlb(void)
{
    // Reloading CR3 flushes the entire TLB (except global entries)
    unsigned long cr3 = read_cr3();
    write_cr3(cr3);
}
```

| Scenario | TLB State Before | TLB Action Needed |
|---|---|---|
| New anonymous page | No entry (faulted) | None - will load on next access |
| Page loaded from swap | No valid entry | None - will load on next access |
| File page loaded from disk | No valid entry | None - will load on next access |
| COW page (write protection) | Had read-only entry | Flush to remove stale entry |
| Permission upgrade | Had restrictive entry | Flush to allow reload with new perms |
Sending IPIs to multiple CPUs and waiting for acknowledgment takes microseconds. When modifying mappings (especially during process exit or munmap), batching TLB invalidations can significantly improve performance. Fortunately, for basic page fault handling (new mappings), no shootdown is needed.
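One common batching strategy is to fall back to a full TLB flush once the number of pages to invalidate crosses a threshold, since many single-entry invalidations plus per-page IPIs quickly become more expensive than one flush. The sketch below reuses the local_flush_tlb helpers from the listing above; the threshold value is illustrative (Linux tunes a comparable ceiling), not a constant of the hardware.

```c
// Sketch of batched invalidation built on the helpers shown above.
// The threshold is illustrative; the right value depends on the CPU.
#define FLUSH_ALL_THRESHOLD 32

void flush_tlb_range_batched(unsigned long start, unsigned long end)
{
    unsigned long npages = (end - start) >> PAGE_SHIFT;

    if (npages > FLUSH_ALL_THRESHOLD) {
        // One full flush is cheaper than many single-entry invalidations
        local_flush_tlb();
        return;
    }

    // Small ranges: invalidate each entry individually and keep the
    // rest of the TLB (and its useful translations) intact
    for (unsigned long addr = start; addr < end; addr += PAGE_SIZE)
        local_flush_tlb_page(addr);
}
```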
Page fault handling involves multiple data structures that could be accessed concurrently. Proper synchronization is essential for correctness and safety.
Key Locks in Page Fault Path:
Page Table Lock (PTL): Protects individual page table pages during modification. Fine-grained—different PTEs can be modified in parallel.
mmap_lock (formerly mmap_sem): Protects the VMA tree structure. Held in read mode during fault handling, write mode during mmap/munmap.
Page Lock: Individual page lock prevents concurrent I/O to the same page. Held while reading from disk.
Swap Map Locks: Protect swap slot reference counts.
Lock Ordering:
To prevent deadlocks, locks are acquired in a consistent order:
mmap_lock → page table lock → page lock → swap locks
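The reason a fixed order works is the standard one for any multi-lock system: if every path acquires locks in the same global order, no cycle of waiters can form. The fragment below is a generic user-space illustration with pthread mutexes standing in for the kernel locks listed above; it is not kernel code, and the names are only placeholders.

```c
// Generic illustration of consistent lock ordering (user-space pthreads;
// the mutex names are stand-ins for the kernel locks, not the real thing).
#include <pthread.h>

static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ptl       = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

// Every path that needs several of these locks takes them in the same
// order and releases them in reverse, so two threads can never each
// hold one lock while waiting on the other's.
void fault_path(void)
{
    pthread_mutex_lock(&mmap_lock);   // outermost: VMA layout is stable
    pthread_mutex_lock(&ptl);         // then the page table page
    pthread_mutex_lock(&page_lock);   // then the page being populated

    // ... install the mapping ...

    pthread_mutex_unlock(&page_lock);
    pthread_mutex_unlock(&ptl);
    pthread_mutex_unlock(&mmap_lock);
}

int main(void)
{
    fault_path();
    return 0;
}
```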
Lock Contention Concerns:
High page fault rates can cause contention: every fault in a process takes mmap_lock in read mode (which still contends with writers such as mmap and munmap), and faults on nearby addresses serialize on the same page table lock.
```c
// Locking in the page fault path

int __do_page_fault(struct mm_struct *mm, unsigned long address,
                    unsigned int flags, struct pt_regs *regs)
{
    struct vm_area_struct *vma;
    int fault;

    // Step 1: Acquire mmap_lock in read mode.
    // This protects the VMA structures from concurrent modification.
    mmap_read_lock(mm);

    // Find the VMA under lock protection
    vma = find_vma(mm, address);
    if (!vma) {
        mmap_read_unlock(mm);
        return VM_FAULT_SIGSEGV;
    }

    // Step 2: Handle the fault (acquires the PTL internally)
    fault = handle_mm_fault(mm, vma, address, flags);

    // Step 3: Release mmap_lock
    mmap_read_unlock(mm);
    return fault;
}

// Fine-grained page table locking
int finish_fault(struct vm_fault *vmf)
{
    spinlock_t *ptl;
    pte_t *pte;

    // Acquire the page table lock for this specific page table page.
    // Different from mmap_lock - very fine-grained.
    pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
                              vmf->address, &ptl);

    // ... update the PTE ...

    pte_unmap_unlock(pte, ptl);
    return 0;
}

// Page lock for I/O synchronization
int read_swap_page(struct page *page, swp_entry_t entry)
{
    // Lock the page - blocks other threads from issuing I/O on it
    lock_page(page);

    // Do the swap read I/O
    int err = swap_readpage(page, entry);

    // The page stays locked until the I/O completes;
    // swap_read_endio() then unlocks it.
    return err;
}

// Lockless fast path (per-VMA / speculative fault handling in recent kernels)
// Try to handle the fault without mmap_lock first
int do_user_addr_fault(struct pt_regs *regs, unsigned long error_code,
                       unsigned long address)
{
    // First, try lockless handling.
    // This works if the VMA is stable and the page tables are populated.
    if (maybe_handle_faults_without_lock(mm, address, flags))
        return 0;

    // Fall back to the locked path
    mmap_read_lock(mm);
    // ... full fault handling ...
    mmap_read_unlock(mm);
    return 0;
}
```

The mmap_lock is a well-known scalability bottleneck. When many threads fault concurrently, they all contend for this lock. Recent kernels have introduced lockless fault handling (such as per-VMA locks) that can bypass mmap_lock in common cases, significantly improving scalability for memory-intensive parallel workloads.
Many things can go wrong during page loading. Robust error handling is essential to prevent data corruption, security vulnerabilities, and system crashes.
Potential errors include: failure to allocate a frame (out of memory), I/O errors while reading from swap or a file, pages that never become up-to-date (corrupted or truncated backing data), and invalid accesses that must ultimately kill the faulting process.
```c
// Error handling in the page fault path

int do_fault_around(struct vm_fault *vmf, pgoff_t start_pgoff)
{
    struct page *page;
    int ret;

    // Try to allocate a page
    page = alloc_page(GFP_HIGHUSER_MOVABLE);
    if (!page) {
        // OOM - couldn't allocate
        return VM_FAULT_OOM;
    }

    // Try to read from the file or swap
    ret = do_read_page(page, vmf);
    if (ret < 0) {
        put_page(page);
        if (ret == -EIO)
            return VM_FAULT_SIGBUS;   // I/O error
        if (ret == -ENOMEM)
            return VM_FAULT_OOM;      // OOM during I/O
        return VM_FAULT_ERROR;        // Generic error
    }

    // Check for data corruption
    if (!PageUptodate(page)) {
        put_page(page);
        return VM_FAULT_SIGBUS;       // Page didn't load correctly
    }

    return 0;  // Success
}

// After do_page_fault returns, a higher level handles the result
void handle_page_fault_result(int fault_result, struct pt_regs *regs,
                              unsigned long address)
{
    if (fault_result & VM_FAULT_OOM) {
        // Out of memory - this is serious
        pagefault_out_of_memory();   // May invoke the OOM killer
        return;
    }

    if (fault_result & VM_FAULT_SIGBUS) {
        // I/O error or similar
        do_sigbus(regs, address);    // Send SIGBUS to the process
        return;
    }

    if (fault_result & VM_FAULT_SIGSEGV) {
        // Invalid access
        do_sigsegv(regs, address);   // Send SIGSEGV to the process
        return;
    }

    // Success - return to user mode and retry the instruction
}

// OOM killer - last resort
void pagefault_out_of_memory(void)
{
    if (current_thread_has_oom_victim())
        return;  // This thread is already being killed

    // Select a process to kill to free memory
    out_of_memory(&oc);
}
```

The error handling philosophy is to be as graceful as possible. An I/O error on one page shouldn't crash the whole system, only that process. OOM situations try reclaim and OOM killing before giving up. Only truly unrecoverable situations (kernel bugs, hardware failures) cause panics.
Let's put together everything we've learned into a complete timeline of loading a page into a frame:
| Step | Operation | Typical Time | Blocking? |
|---|---|---|---|
| 1 | Frame allocation (fast path) | ~100 ns | No |
| 1a | Frame allocation (reclaim) | ~1-100 ms | Yes |
| 2 | Determine page source | ~100 ns | No |
| 3a | Zero-fill page | ~1 µs | No |
| 3b | Read from NVMe swap | ~70-150 µs | Yes |
| 3c | Read from file (cached) | ~1 µs | No |
| 3c | Read from file (disk) | ~100 µs - 10 ms | Yes |
| 4 | Acquire PTL | ~100 ns | Maybe (contention) |
| 5 | Update PTE | ~10 ns | No |
| 6 | Update statistics | ~100 ns | No |
| 7 | Return to user mode | ~500 ns | No |
The fastest page faults are minor faults on anonymous pages—zero-fill takes only microseconds with no disk I/O. This is why modern systems can handle millions of minor faults per second. The killer is major faults (disk I/O), which are 1000x slower at minimum.
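A quick way to see the fast path from user space is to map an anonymous region and touch each page while watching the process's fault counters. The program below is a Linux/POSIX sketch using mmap and getrusage; nothing in it is kernel code.

```c
// Observe zero-fill (minor) faults from user space: the first touch of
// each anonymous page triggers one minor fault and no disk I/O.
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 64UL * 1024 * 1024;          // 64 MiB
    long page = sysconf(_SC_PAGESIZE);
    struct rusage before, after;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    getrusage(RUSAGE_SELF, &before);

    // The first write to each page faults it in (zero-fill, no I/O)
    for (size_t off = 0; off < len; off += (size_t)page)
        buf[off] = 1;

    getrusage(RUSAGE_SELF, &after);

    printf("minor faults: %ld, major faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);

    munmap(buf, len);
    return 0;
}
```

On a typical system this reports roughly one minor fault per touched page (about 16,384 for 64 MiB of 4 KiB pages) and zero major faults, matching the zero-fill row of the timeline.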
Loading a page into a frame is the heart of demand paging, the operation that makes virtual memory's promise possible. The key concepts to consolidate: a frame is allocated (reclaiming other pages if memory is low), the content is read from swap or a file (or simply zero-filled), the PTE is constructed and installed atomically under the page table lock, no TLB invalidation is needed for a brand-new mapping, and the whole path is coordinated by mmap_lock, the page table lock, and the page lock.
What's Next:
With the page now loaded and mapped, the handler is almost done. The final page explores Restart Instruction—how control returns to user mode and the faulting instruction successfully completes, making the entire page fault completely transparent to the application.
You now understand how pages are loaded from secondary storage into physical memory. From frame allocation through disk I/O to PTE updates, you've seen the complete process that fulfills virtual memory's promise. Next, we'll see how the instruction that caused all this activity finally completes successfully.