We've established that Copy-on-Write defers page copying until a write occurs. Now we examine the critical moment when this deferred work finally happens: the COW fault. This is where the operating system's illusion of private memory becomes reality—where a shared page transforms into a private copy, invisible to the writing process but orchestrated by a complex sequence of hardware exceptions and kernel handling.
Understanding the COW fault mechanism is essential because it represents the cost paid for COW's benefits. Every optimization has a price, and COW's price is paid at write time. The kernel must detect the write attempt, determine that it's a legitimate COW situation, allocate a new frame, copy the data, update page tables, and resume the process—all while maintaining correctness in a concurrent, multi-processor environment.
By the end of this page, you will understand the complete COW fault handling sequence: from the initial hardware trap through kernel page fault handling, frame allocation, memory copying, page table updates, and process resumption. You'll learn how the kernel distinguishes COW faults from other page faults and the optimizations that make COW efficient in practice.
When a process attempts to write to a COW-protected page, the following sequence unfolds. Understanding each step is crucial for grasping both the elegance and the cost of COW:
| Step | Location | What Happens |
|---|---|---|
| 1 | CPU | Process executes a store (write) instruction to a virtual address |
| 2 | MMU | TLB lookup; may miss, or hit an entry with read-only permission |
| 3 | MMU | On a miss, hardware walks the page table and finds the PTE with the read-only bit set |
| 4 | MMU | The write to a read-only page triggers a protection fault |
| 5 | CPU | CPU switches to kernel mode and saves context |
| 6 | Kernel | OS page fault handler receives control |
| 7 | Kernel | Handler determines the fault type: COW fault |
| 8 | Kernel | Allocates a new physical frame for the private copy |
| 9 | Kernel | Copies 4KB (or more) from the shared frame to the new frame |
| 10 | Kernel | Updates the faulting PTE to point to the new frame and marks it writable |
| 11 | Kernel | Decrements the old frame's refcount; if it is now 1, marks the remaining mapping writable |
| 12 | CPU/Kernel | Flushes the stale TLB entry for this address |
| 13 | CPU | Returns from the exception and re-executes the store instruction |
| 14 | CPU | The store instruction completes successfully |
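To make the sequence concrete, here is a minimal user-space sketch (assuming Linux and 4KB pages) that sets up exactly this situation: parent and child share a frame after fork(), and the child's first store walks through every step in the table above.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void)
{
    // One anonymous, writable page; touched so it is present before fork
    char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    strcpy(page, "shared");

    pid_t pid = fork();            // both processes now map the same frame, read-only
    if (pid == 0) {
        page[0] = 'S';             // first store: triggers the full COW fault sequence
        printf("child sees:  %s\n", page);   // private copy: "Shared"
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", page);       // untouched original: "shared"
    munmap(page, 4096);
    return 0;
}
```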
Time Analysis:
Let's estimate the time cost of a COW fault:
| Component | Time Estimate | Notes |
|---|---|---|
| Exception entry | ~100 cycles | Mode switch, save registers |
| Fault handler lookup | ~50 cycles | Find VMA, determine fault type |
| Frame allocation | ~500-5000 cycles | Depends on allocator state |
| Page copy (4KB) | ~1000 cycles | Memory bandwidth limited |
| PTE update | ~100 cycles | Write to page table |
| TLB invalidation | ~100-1000 cycles | May involve IPI on SMP |
| Exception return | ~100 cycles | Restore registers, mode switch |
| Total | ~2000-7000 cycles | ~0.7-2.3 microseconds at 3 GHz |
This may seem fast, but compare it to a normal memory write: ~4 cycles. A COW fault is roughly 500-1750x slower than an unprotected write.
While a single COW fault takes microseconds, the cost is amortized across all subsequent writes to that page. A 4KB page might receive millions of writes over its lifetime, so the one-time ~2μs fault penalty is negligible. The problem arises when many pages fault in succession (e.g., initializing a large array after fork).
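To ground these numbers, here is a minimal user-space microbenchmark sketch, assuming Linux and 4KB pages (the buffer size and names like NPAGES are ours): it forks and times the child's first store to each page, so the average per-page time approximates one COW fault plus loop overhead.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define NPAGES    4096
#define PAGE_SIZE 4096

int main(void)
{
    // Anonymous writable mapping, touched so every page is present pre-fork
    char *buf = mmap(NULL, (size_t)NPAGES * PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    memset(buf, 1, (size_t)NPAGES * PAGE_SIZE);

    pid_t pid = fork();
    if (pid == 0) {
        // Child: the first write to each page takes one COW fault
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < NPAGES; i++)
            buf[i * PAGE_SIZE] = 2;          // one faulting store per page
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("avg per-page COW cost: %.0f ns\n", ns / NPAGES);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```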
The COW fault begins with hardware. When the CPU attempts a store to a read-only page, the Memory Management Unit (MMU) generates a page fault exception. Let's examine the hardware's role in detail:
x86-64 Page Fault Details:
On x86-64, a page fault generates interrupt vector 14 (#PF). The CPU automatically:
1. Loads the faulting virtual address into the CR2 register
2. Pushes an error code describing the access onto the kernel stack
3. Saves the interrupted context (RIP, CS, RFLAGS, and stack pointer)
4. Transfers control to the kernel's #PF handler through the IDT
```c
// x86-64 Page Fault Error Code Bits
// This is pushed to the stack by the CPU on page fault

#define PF_PROT  (1 << 0)   // 0 = non-present page, 1 = protection violation
#define PF_WRITE (1 << 1)   // 0 = read access, 1 = write access
#define PF_USER  (1 << 2)   // 0 = kernel mode, 1 = user mode
#define PF_RSVD  (1 << 3)   // 1 = reserved bit set in PTE
#define PF_INSTR (1 << 4)   // 1 = instruction fetch (NX violation)

// Determining fault type from the error code
static inline bool is_cow_fault(unsigned long error_code, pte_t pte,
                                struct vm_area_struct *vma)
{
    // COW fault characteristics:
    // 1. It's a write access (PF_WRITE set)
    // 2. It's a protection violation (PF_PROT set) - page is present but read-only
    // 3. The PTE has a COW marker, or the VMA is writable but the PTE is not
    if (!(error_code & PF_WRITE))
        return false;   // Not a write
    if (!(error_code & PF_PROT))
        return false;   // Page not present - different fault type

    // At this point: write to a present, read-only page
    // Check if the VMA says it should be writable (COW scenario)
    return pte_present(pte) && !pte_write(pte) && vma_is_writable(vma);
}

// x86-64 page fault entry point (simplified)
void page_fault_handler(struct pt_regs *regs, unsigned long error_code)
{
    unsigned long address = read_cr2();   // Get faulting address
    struct vm_area_struct *vma;
    pte_t *pte;

    // Find the VMA for this address
    vma = find_vma(current->mm, address);
    if (!vma || address < vma->vm_start) {
        // No VMA - invalid access (SIGSEGV)
        do_sigsegv(regs, error_code, address);
        return;
    }

    // Get the PTE
    pte = lookup_pte(current->mm, address);

    // Determine fault type and handle it
    if (is_cow_fault(error_code, *pte, vma)) {
        // COW fault - handle the copy
        handle_cow_fault(vma, pte, address);
    } else if (!(error_code & PF_PROT)) {
        // Page not present - demand paging
        handle_demand_fault(vma, pte, address);
    } else {
        // Real protection violation (e.g., write to .text section)
        do_sigsegv(regs, error_code, address);
    }
}
```

ARM architectures handle page faults similarly but with different register names (FAR instead of CR2) and different exception vector mechanisms. The fundamental flow—protection bit triggers exception, kernel handles exception, updates mappings, resumes—is universal across architectures that support virtual memory.
The kernel's page fault handler must quickly determine what kind of fault occurred and take appropriate action. This classification is performance-critical since page faults are relatively common during normal execution:
| Fault Type | Condition | Action |
|---|---|---|
| Invalid Access | Address not in any VMA | Send SIGSEGV to process |
| Stack Growth | Address below stack VMA, within growth limit | Extend stack VMA, allocate pages |
| Demand Paging (file) | Valid VMA, page not present, file-backed | Read page from file into frame |
| Demand Paging (anon) | Valid VMA, page not present, anonymous | Allocate zeroed frame |
| Swap In | Valid VMA, PTE points to swap entry | Read page from swap space |
| COW Fault | Valid VMA, present page, write to read-only COW page | Copy page, update mapping |
| NUMA Migration | Valid page, wrong NUMA node | Migrate page to local node |
| Huge Page Split | Fault within huge page requiring split | Split huge page to base pages |
| Permission Violation | Write to truly read-only page (e.g., .text) | Send SIGSEGV to process |
The Classification Algorithm:
The kernel follows a decision tree to classify faults:
1. Is address in kernel space?
└─ Yes: Handle kernel fault (oops if bad)
└─ No: Continue...
2. Is address in a valid VMA?
└─ No: Maybe stack expansion? If not, SIGSEGV
└─ Yes: Continue...
3. Is access type allowed by VMA?
└─ Write to read-only VMA: SIGSEGV (unless COW)
└─ Execute from non-exec VMA: SIGSEGV
└─ Allowed: Continue...
4. Is page present? (Check PTE)
└─ No: Demand fault (allocate/load page)
└─ Yes: Continue...
5. Is page writable? (Check PTE write bit)
   └─ Yes: Spurious fault (e.g., stale TLB entry); safe to retry
└─ No: Is this COW? Check VMA + PTE flags
└─ Yes: COW fault
└─ No: SIGSEGV
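Putting the tree into code, here is a hedged sketch in the style of the simplified handler above; the helpers within_stack_limit() and pte_is_swap_entry() are hypothetical stand-ins for the real checks, and kernel-space faults (step 1) are assumed to be handled before this point.

```c
// Fault classification following the decision tree above (sketch)
enum fault_kind {
    FAULT_SIGSEGV, FAULT_STACK_GROWTH, FAULT_DEMAND,
    FAULT_SWAP_IN, FAULT_COW, FAULT_SPURIOUS,
};

enum fault_kind classify_fault(struct mm_struct *mm, unsigned long addr,
                               unsigned long error_code)
{
    // Step 2: is the address in a valid VMA? (Maybe stack expansion)
    struct vm_area_struct *vma = find_vma(mm, addr);
    if (!vma || addr < vma->vm_start) {
        if (vma && (vma->vm_flags & VM_GROWSDOWN) &&
            within_stack_limit(vma, addr))          // hypothetical helper
            return FAULT_STACK_GROWTH;
        return FAULT_SIGSEGV;
    }

    // Step 3: is the access type allowed by the VMA?
    if ((error_code & PF_WRITE) && !(vma->vm_flags & VM_WRITE))
        return FAULT_SIGSEGV;            // write to a read-only VMA
    if ((error_code & PF_INSTR) && !(vma->vm_flags & VM_EXEC))
        return FAULT_SIGSEGV;            // execute from a non-exec VMA

    // Step 4: is the page present?
    pte_t pte = *lookup_pte(mm, addr);
    if (!pte_present(pte))
        return pte_is_swap_entry(pte) ? FAULT_SWAP_IN : FAULT_DEMAND;

    // Step 5: present, read-only page + write + writable VMA = COW
    if ((error_code & PF_WRITE) && !pte_write(pte))
        return FAULT_COW;

    return FAULT_SPURIOUS;               // e.g., a stale TLB entry
}
```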
The kernel optimizes common paths. COW faults and demand faults are frequent and expected; the handler is tuned to classify and process them quickly. Invalid accesses (SIGSEGVs) are rare and can take the slow path. This priority ordering affects the code structure and branching.
Once a COW fault is identified, the kernel must copy the page content from the shared frame to a new private frame. This copy operation is more nuanced than it appears:
```c
// COW fault handling: the copy operation
// (Simplified from Linux's wp_page_copy())

static int do_cow_copy(struct mm_struct *mm, struct vm_area_struct *vma,
                       unsigned long address, pte_t *page_table,
                       pmd_t *pmd, spinlock_t *ptl,
                       pte_t orig_pte, struct page *old_page)
{
    struct page *new_page = NULL;
    pte_t entry;
    int ret = 0;

    // Optimization: Check if we're now the sole owner
    // (Another process may have exited while we were handling the fault)
    if (page_count(old_page) == 1) {
        // We're the only reference! No copy needed.
        // Just make the page writable.
        reuse_swap_page(old_page);
        goto reuse;
    }

    // Allocate a new page frame
    // Use the same NUMA node as the old page for locality
    new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
    if (!new_page) {
        ret = -ENOMEM;
        goto out;
    }

    // Copy the page content
    // Architecture-specific, may use optimized copy routines
    cow_user_page(new_page, old_page, address, vma);

    // Ensure the copy is visible before updating the PTE
    // (Important on weakly-ordered architectures)
    smp_wmb();

    // Lock the page table for the update
    spin_lock(ptl);

    // Re-check: Did something change while we allocated/copied?
    if (!pte_same(*page_table, orig_pte)) {
        // PTE changed - another CPU handled this or the process died
        // Abort our copy, free the new page
        spin_unlock(ptl);
        free_page(new_page);
        return 0;   // Will retry
    }

    // Set up mappings for the new page
    page_add_new_anon_rmap(new_page, vma, address);
    lru_cache_add_active_or_unevictable(new_page, vma);

    // Create the new PTE: new frame number, writable, dirty
    entry = mk_pte(new_page, vma->vm_page_prot);
    entry = pte_mkdirty(entry);
    entry = pte_mkwrite(entry);
    entry = pte_mkyoung(entry);

    // Update the page table entry atomically
    ptep_clear_flush_notify(vma, address, page_table);
    set_pte_at(mm, address, page_table, entry);

    // Let the architecture update its MMU caches
    update_mmu_cache(vma, address, page_table);

    spin_unlock(ptl);

    // Clean up the old page reference
    page_remove_rmap(old_page);
    put_page(old_page);
    return 0;

reuse:
    // Sole-owner path: just make the page writable, no copy
    spin_lock(ptl);
    if (!pte_same(*page_table, orig_pte)) {
        spin_unlock(ptl);
        return 0;
    }
    entry = pte_mkdirty(orig_pte);
    entry = pte_mkwrite(entry);
    set_pte_at(mm, address, page_table, entry);
    update_mmu_cache(vma, address, page_table);
    spin_unlock(ptl);
    return 0;

out:
    return ret;
}

// Optimized page copy (architecture-specific)
// On x86-64, may use REP MOVSQ or even non-temporal stores
void cow_user_page(struct page *dst, struct page *src,
                   unsigned long addr, struct vm_area_struct *vma)
{
    void *dst_addr = kmap_atomic(dst);
    void *src_addr = kmap_atomic(src);

    // 4KB copy, possibly using SIMD or enhanced REP instructions
    // Non-temporal stores avoid polluting the cache with the destination
    copy_user_page(dst_addr, src_addr, addr, vma);

    kunmap_atomic(src_addr);
    kunmap_atomic(dst_addr);
}
```

Notice the 'reuse' path: if the page's reference count dropped to 1 while we were preparing to copy (perhaps another sharing process exited), we skip the copy entirely. This is a significant optimization for processes that become sole owners between fork and write.
After the copy, the kernel must update the page table entry and ensure the CPU uses the new mapping. This involves careful handling of both the page table and the Translation Lookaside Buffer (TLB):
PTE Update Requirements:
Atomic Update — The PTE update must be atomic. A partial update visible to another CPU could cause corruption.
Content — The new PTE carries the new frame number and updated permission bits; the table below summarizes each field before and after the fault.
Old PTE Handling — The old mapping is invalidated, and the old frame's reference count is decremented.
| Field | Before (COW) | After (Private) | Why |
|---|---|---|---|
| Present | 1 | 1 | Page remains in memory |
| Read/Write | 0 (R) | 1 (RW) | Now writable, no more faults |
| User/Supervisor | 1 (U) | 1 (U) | Still user-accessible |
| Dirty | 0/1 | 1 | About to be written |
| Accessed | 1 | 1 | Recently used |
| Frame Number | X (shared) | Y (private) | Points to new frame |
| COW marker\* | 1 | 0 | No longer COW-protected |

\*Implementation-dependent: some kernels track COW in a software-available PTE bit, others infer it from a VMA that is writable while its PTE is not.
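As a concrete illustration, here is a small standalone sketch of this transition, assuming x86-64 bit positions and treating the COW marker as a software-available PTE bit (bit 9, our assumption):

```c
#include <stdint.h>
#include <stdio.h>

// x86-64 PTE bit positions (Present=0, R/W=1, U/S=2, Accessed=5, Dirty=6);
// using software bit 9 as a COW marker is an assumption for this sketch
#define PTE_PRESENT   (1ULL << 0)
#define PTE_RW        (1ULL << 1)
#define PTE_USER      (1ULL << 2)
#define PTE_ACCESSED  (1ULL << 5)
#define PTE_DIRTY     (1ULL << 6)
#define PTE_SW_COW    (1ULL << 9)
#define PTE_PFN_MASK  0x000FFFFFFFFFF000ULL
#define PTE_PFN(pfn)  (((uint64_t)(pfn) << 12) & PTE_PFN_MASK)

int main(void)
{
    uint64_t shared_pfn = 0x1234, private_pfn = 0x5678;

    // Before: present, read-only, user, COW-marked, pointing at shared frame X
    uint64_t pte = PTE_PRESENT | PTE_USER | PTE_ACCESSED |
                   PTE_SW_COW | PTE_PFN(shared_pfn);

    // After the copy: drop the COW marker and old frame, install the private
    // frame Y, and set writable + dirty (the store is about to complete)
    pte = (pte & ~(PTE_PFN_MASK | PTE_SW_COW))
        | PTE_RW | PTE_DIRTY | PTE_PFN(private_pfn);

    printf("private PTE: %#018llx\n", (unsigned long long)pte);
    return 0;
}
```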
TLB Invalidation:
The TLB caches recent address translations. After updating the PTE, the old (incorrect) TLB entry must be invalidated. This is non-trivial on multiprocessor systems:
Single-Processor Case:
- Update PTE in page table
- Execute INVLPG instruction for the virtual address
- TLB entry evicted, next access uses new PTE
Multi-Processor Case:
- Update PTE in page table
- Send Inter-Processor Interrupt (IPI) to all CPUs that might have the old TLB entry
- Each CPU executes INVLPG locally
- Wait for acknowledgment (barrier)
- Only then is it safe to free old frame
The multi-processor case is expensive. TLB shootdown IPIs can cost hundreds of cycles and add latency to COW faults.
```c
// TLB shootdown for a PTE update (simplified)
// Real Linux code uses more sophisticated batching and tracking

void ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
                      pte_t *ptep)
{
    pte_t pte = *ptep;

    // Clear the PTE
    pte_clear(vma->vm_mm, address, ptep);

    // Now we need to flush TLBs
    // Which CPUs might have cached this PTE?
    cpumask_t flush_cpus;

    // In Linux, mm_cpumask() tracks which CPUs have used this mm
    cpumask_copy(&flush_cpus, mm_cpumask(vma->vm_mm));

    if (cpumask_any_but(&flush_cpus, smp_processor_id()) < nr_cpu_ids) {
        // Other CPUs need flushing - send IPIs
        // This is the expensive path
        smp_call_function_many(&flush_cpus, flush_tlb_func,
                               (void *)address, 1 /* wait */);
    }

    // Flush the local TLB
    __flush_tlb_one(address);
}

// Called on each CPU receiving the IPI
void flush_tlb_func(void *addr)
{
    unsigned long address = (unsigned long)addr;

    // Invalidate the single TLB entry
    // Architecture-specific instruction (INVLPG on x86)
    __flush_tlb_one(address);
}

// Optimization: Lazy TLB invalidation
// Instead of an immediate IPI, mark the mm as needing a flush;
// CPUs will flush on the next context switch to this mm
void flush_tlb_batched(struct vm_area_struct *vma, unsigned long address)
{
    // Increment a per-mm generation counter
    atomic_inc(&vma->vm_mm->tlb_flush_pending);

    // Each CPU checks this counter on mm switch
    // If changed since last switch, full TLB flush
    // This batches multiple invalidations
}
```

Modern kernels batch TLB invalidations to amortize IPI costs. Instead of sending an IPI per page, the kernel accumulates invalidations and sends a single IPI covering a range (or schedules a full TLB flush on context switch). This is especially important when many COW faults occur in succession.
COW fault handling must be correct in concurrent, multi-processor environments. Multiple threads or processes might trigger COW faults on the same page simultaneously. The kernel uses several mechanisms to ensure correctness:
| Mechanism | What It Protects | Scope |
|---|---|---|
| mmap_sem/mmap_lock | VMA list, mm_struct changes | Per-address-space |
| Page table lock (PTL) | PTE modifications | Per-PTE or per-table portion |
| Page lock | Page state during I/O | Per-page |
| Reference counts (atomic) | Frame lifecycle | Per-page |
| Compare-and-swap on PTE | Atomic PTE updates | Per-PTE |
The Critical Race: Duplicate COW Handling
Consider two threads faulting on the same COW page:
Thread A (CPU 0) Thread B (CPU 1)
───────────────── ─────────────────
Write to VAddr 0x1000 Write to VAddr 0x1000
COW fault triggered COW fault triggered
Read PTE (read-only) Read PTE (read-only)
Allocate new frame Allocate new frame
Copy page Copy page
Lock PTL [waits for PTL]
Verify PTE unchanged ...
Update PTE to new frame ...
Unlock PTL Lock PTL
Verify PTE - CHANGED!
Abort, free new frame
Unlock PTL
Thread B's copy is wasted, but correctness is maintained through the lock and re-verification pattern.
The COW handler uses an optimistic approach: do expensive work (allocate, copy) without holding locks, then lock briefly to verify and commit. If verification fails, work is discarded. This maximizes parallelism at the cost of occasional wasted work. For rare races, this trade-off is worthwhile.
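The same pattern can be sketched with an atomic compare-and-swap on the PTE (the last mechanism in the table above). Here, cmpxchg_pte(ptep, old, new) is a hypothetical helper that atomically installs new only if the entry still equals old, returning true on success; real kernels typically use the page table lock as shown earlier.

```c
// Optimistic COW commit via compare-and-swap (sketch; cmpxchg_pte is
// hypothetical, the other helpers follow the simplified code above)
static int cow_fault_optimistic(pte_t *ptep, pte_t orig_pte,
                                struct page *old_page,
                                struct vm_area_struct *vma,
                                unsigned long address)
{
    // Expensive work first, with no locks held
    struct page *new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
    if (!new_page)
        return -ENOMEM;
    cow_user_page(new_page, old_page, address, vma);
    smp_wmb();   // the copy must be visible before the new PTE is

    pte_t new_pte = pte_mkwrite(pte_mkdirty(mk_pte(new_page,
                                                   vma->vm_page_prot)));

    // Commit: succeeds only if no other CPU changed the PTE meanwhile
    if (!cmpxchg_pte(ptep, orig_pte, new_pte)) {
        free_page(new_page);   // lost the race; the winner's copy stands
        return 0;              // the faulting store retries and succeeds
    }

    put_page(old_page);        // drop our reference to the shared frame
    return 0;
}
```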
Real-world COW fault handling must address numerous special cases that complicate the basic flow:
Huge Pages and COW:
Huge pages (2MB on x86-64) complicate COW handling:
Option 1: Copy Entire Huge Page — Preserve the huge page but copy 2MB of data. This is expensive but maintains huge page benefits.
Option 2: Split Then COW — Convert the huge page to 512 base pages, then COW just the faulting 4KB page. Cheaper for sparse writes but loses huge page TLB efficiency.
Linux generally chooses Option 2 for anonymous huge pages, but the optimal choice depends on access patterns. Some workloads explicitly use madvise(MADV_DONTFORK) to avoid COW on huge page regions entirely.
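For illustration, here is how a process might opt a region out of COW sharing entirely using the madvise(MADV_DONTFORK) call mentioned above; a minimal sketch with only basic error handling.

```c
#include <stddef.h>
#include <sys/mman.h>

#define REGION_SIZE (2 * 1024 * 1024)   // one 2MB huge page worth of memory

void *alloc_nofork_region(void)
{
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    // Children created by fork() will not map this region at all,
    // so no COW sharing (and no COW faults) can occur on it
    if (madvise(p, REGION_SIZE, MADV_DONTFORK) != 0) {
        munmap(p, REGION_SIZE);
        return NULL;
    }
    return p;
}
```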
GUP (Get User Pages) + fork() is a known problematic combination. If one process has pinned pages (for DMA or other reasons) and forks, the kernel must break COW immediately for pinned pages to prevent the DMA device from writing to the wrong physical frame. This 'GUP-fast' vs 'COW' interaction has been the source of serious security vulnerabilities.
Let's consolidate our understanding of the COW write fault mechanism:
- A write to a read-only COW page triggers a hardware protection fault that transfers control to the kernel.
- The kernel classifies the fault using the error code, the VMA permissions, and the PTE state.
- If the faulting process is the sole owner of the frame, the page is simply made writable; otherwise a new frame is allocated and the page is copied.
- The PTE is updated atomically under the page table lock, and stale TLB entries are invalidated (with IPIs on SMP systems).
- The faulting store is re-executed and completes against the private copy.
What's Next:
Now that we understand how writes trigger copies, we'll examine fork optimization—the specific use case that motivated COW and where it provides the most dramatic benefits. We'll see how fork() leverages COW to achieve near-instant process creation.
You now understand the complete COW fault handling mechanism: from hardware trap to kernel classification, through copying and page table updates, to TLB management and concurrency control. This is the core machinery that makes Copy-on-Write work in practice.