We've established that Copy-on-Write defers page copying until a write occurs. Now we examine the critical moment when this deferred work finally happens: the COW fault. This is where the operating system's illusion of private memory becomes reality—where a shared page transforms into a private copy, invisible to the writing process but orchestrated by a complex sequence of hardware exceptions and kernel handling.
Understanding the COW fault mechanism is essential because it represents the cost paid for COW's benefits. Every optimization has a price, and COW's price is paid at write time. The kernel must detect the write attempt, determine that it's a legitimate COW situation, allocate a new frame, copy the data, update page tables, and resume the process—all while maintaining correctness in a concurrent, multi-processor environment.
By the end of this page, you will understand the complete COW fault handling sequence: from the initial hardware trap through kernel page fault handling, frame allocation, memory copying, page table updates, and process resumption. You'll learn how the kernel distinguishes COW faults from other page faults and the optimizations that make COW efficient in practice.
When a process attempts to write to a COW-protected page, the following sequence unfolds. Understanding each step is crucial for grasping both the elegance and the cost of COW:
| Step | Location | What Happens |
|---|---|---|
| 1 | CPU | Process executes a store (write) instruction to a virtual address |
| 2 | MMU | TLB lookup; may miss, or hit an entry with read-only permission |
| 3 | MMU | On a miss, hardware walks the page table and finds the PTE with the read-only bit set |
| 4 | MMU | The write to a read-only page triggers a protection fault |
| 5 | CPU | CPU switches to kernel mode and saves context |
| 6 | Kernel | OS page fault handler receives control |
| 7 | Kernel | Handler determines the fault type: COW fault |
| 8 | Kernel | Allocates a new physical frame for the private copy |
| 9 | Kernel | Copies 4KB (or more) from the shared frame to the new frame |
| 10 | Kernel | Updates the faulting PTE to point to the new frame and marks it writable |
| 11 | Kernel | Decrements the old frame's refcount; if it is now 1, marks the remaining mapping writable |
| 12 | CPU/Kernel | Flushes the stale TLB entry for this address |
| 13 | CPU | Returns from the exception and re-executes the store instruction |
| 14 | CPU | The store instruction completes successfully |
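To make the sequence concrete, here is a minimal user-space sketch (assuming Linux and 4KB pages) that sets up exactly this situation: parent and child share a frame after fork(), and the child's first store walks through every step in the table above.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void)
{
    // One anonymous, writable page; touched so it is present before fork
    char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    strcpy(page, "shared");

    pid_t pid = fork();            // both processes now map the same frame, read-only
    if (pid == 0) {
        page[0] = 'S';             // first store: triggers the full COW fault sequence
        printf("child sees:  %s\n", page);   // private copy: "Shared"
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", page);       // untouched original: "shared"
    munmap(page, 4096);
    return 0;
}
```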
Time Analysis:
Let's estimate the time cost of a COW fault:
| Component | Time Estimate | Notes |
|---|---|---|
| Exception entry | ~100 cycles | Mode switch, save registers |
| Fault handler lookup | ~50 cycles | Find VMA, determine fault type |
| Frame allocation | ~500-5000 cycles | Depends on allocator state |
| Page copy (4KB) | ~1000 cycles | Memory bandwidth limited |
| PTE update | ~100 cycles | Write to page table |
| TLB invalidation | ~100-1000 cycles | May involve IPI on SMP |
| Exception return | ~100 cycles | Restore registers, mode switch |
| Total | ~2000-7000 cycles | ~0.7-2.3 microseconds at 3 GHz |
This may seem fast, but compare it to a normal memory write: ~4 cycles. A COW fault is roughly 500-1750x slower than an unprotected write.
While a single COW fault takes microseconds, the cost is amortized across all subsequent writes to that page. A 4KB page might receive millions of writes over its lifetime, so the one-time ~2μs fault penalty is negligible. The problem arises when many pages fault in succession (e.g., initializing a large array after fork).
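To ground these numbers, here is a minimal user-space microbenchmark sketch, assuming Linux and 4KB pages (the buffer size and names like NPAGES are ours): it forks and times the child's first store to each page, so the average per-page time approximates one COW fault plus loop overhead.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define NPAGES    4096
#define PAGE_SIZE 4096

int main(void)
{
    // Anonymous writable mapping, touched so every page is present pre-fork
    char *buf = mmap(NULL, (size_t)NPAGES * PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    memset(buf, 1, (size_t)NPAGES * PAGE_SIZE);

    pid_t pid = fork();
    if (pid == 0) {
        // Child: the first write to each page takes one COW fault
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < NPAGES; i++)
            buf[i * PAGE_SIZE] = 2;          // one faulting store per page
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("avg per-page COW cost: %.0f ns\n", ns / NPAGES);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```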
The COW fault begins with hardware. When the CPU attempts a store to a read-only page, the Memory Management Unit (MMU) generates a page fault exception. Let's examine the hardware's role in detail:
x86-64 Page Fault Details:
On x86-64, a page fault generates interrupt vector 14 (#PF). The CPU automatically:
1. Loads the faulting virtual address into the CR2 register
2. Pushes an error code describing the access onto the kernel stack
3. Saves the interrupted context (RIP, CS, RFLAGS, and stack pointer)
4. Transfers control to the kernel's #PF handler through the IDT
```c
// x86-64 Page Fault Error Code Bits
// This is pushed to the stack by the CPU on page fault

#define PF_PROT  (1 << 0)   // 0 = non-present page, 1 = protection violation
#define PF_WRITE (1 << 1)   // 0 = read access, 1 = write access
#define PF_USER  (1 << 2)   // 0 = kernel mode, 1 = user mode
#define PF_RSVD  (1 << 3)   // 1 = reserved bit set in PTE
#define PF_INSTR (1 << 4)   // 1 = instruction fetch (NX violation)

// Determining fault type from the error code
static inline bool is_cow_fault(unsigned long error_code, pte_t pte,
                                struct vm_area_struct *vma)
{
    // COW fault characteristics:
    // 1. It's a write access (PF_WRITE set)
    // 2. It's a protection violation (PF_PROT set) - page is present but read-only
    // 3. The PTE has a COW marker, or the VMA is writable but the PTE is not
    if (!(error_code & PF_WRITE))
        return false;   // Not a write
    if (!(error_code & PF_PROT))
        return false;   // Page not present - different fault type

    // At this point: write to a present, read-only page
    // Check if the VMA says it should be writable (COW scenario)
    return pte_present(pte) && !pte_write(pte) && vma_is_writable(vma);
}

// x86-64 page fault entry point (simplified)
void page_fault_handler(struct pt_regs *regs, unsigned long error_code)
{
    unsigned long address = read_cr2();   // Get faulting address
    struct vm_area_struct *vma;
    pte_t *pte;

    // Find the VMA for this address
    vma = find_vma(current->mm, address);
    if (!vma || address < vma->vm_start) {
        // No VMA - invalid access (SIGSEGV)
        do_sigsegv(regs, error_code, address);
        return;
    }

    // Get the PTE
    pte = lookup_pte(current->mm, address);

    // Determine fault type and handle it
    if (is_cow_fault(error_code, *pte, vma)) {
        // COW fault - handle the copy
        handle_cow_fault(vma, pte, address);
    } else if (!(error_code & PF_PROT)) {
        // Page not present - demand paging
        handle_demand_fault(vma, pte, address);
    } else {
        // Real protection violation (e.g., write to .text section)
        do_sigsegv(regs, error_code, address);
    }
}
```

ARM architectures handle page faults similarly but with different register names (FAR instead of CR2) and different exception vector mechanisms. The fundamental flow—protection bit triggers exception, kernel handles exception, updates mappings, resumes—is universal across architectures that support virtual memory.
The kernel's page fault handler must quickly determine what kind of fault occurred and take appropriate action. This classification is performance-critical since page faults are relatively common during normal execution:
| Fault Type | Condition | Action |
|---|---|---|
| Invalid Access | Address not in any VMA | Send SIGSEGV to process |
| Stack Growth | Address below stack VMA, within growth limit | Extend stack VMA, allocate pages |
| Demand Paging (file) | Valid VMA, page not present, file-backed | Read page from file into frame |
| Demand Paging (anon) | Valid VMA, page not present, anonymous | Allocate zeroed frame |
| Swap In | Valid VMA, PTE points to swap entry | Read page from swap space |
| COW Fault | Valid VMA, present page, write to read-only COW page | Copy page, update mapping |
| NUMA Migration | Valid page, wrong NUMA node | Migrate page to local node |
| Huge Page Split | Fault within huge page requiring split | Split huge page to base pages |
| Permission Violation | Write to truly read-only page (e.g., .text) | Send SIGSEGV to process |
The Classification Algorithm:
The kernel follows a decision tree to classify faults:
1. Is address in kernel space?
└─ Yes: Handle kernel fault (oops if bad)
└─ No: Continue...
2. Is address in a valid VMA?
└─ No: Maybe stack expansion? If not, SIGSEGV
└─ Yes: Continue...
3. Is access type allowed by VMA?
└─ Write to read-only VMA: SIGSEGV (unless COW)
└─ Execute from non-exec VMA: SIGSEGV
└─ Allowed: Continue...
4. Is page present? (Check PTE)
└─ No: Demand fault (allocate/load page)
└─ Yes: Continue...
5. Is page writable? (Check PTE write bit)
   └─ Yes: Spurious fault (e.g., stale TLB entry); safe to retry
└─ No: Is this COW? Check VMA + PTE flags
└─ Yes: COW fault
└─ No: SIGSEGV
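Putting the tree into code, here is a hedged sketch in the style of the simplified handler above; the helpers within_stack_limit() and pte_is_swap_entry() are hypothetical stand-ins for the real checks, and kernel-space faults (step 1) are assumed to be handled before this point.

```c
// Fault classification following the decision tree above (sketch)
enum fault_kind {
    FAULT_SIGSEGV, FAULT_STACK_GROWTH, FAULT_DEMAND,
    FAULT_SWAP_IN, FAULT_COW, FAULT_SPURIOUS,
};

enum fault_kind classify_fault(struct mm_struct *mm, unsigned long addr,
                               unsigned long error_code)
{
    // Step 2: is the address in a valid VMA? (Maybe stack expansion)
    struct vm_area_struct *vma = find_vma(mm, addr);
    if (!vma || addr < vma->vm_start) {
        if (vma && (vma->vm_flags & VM_GROWSDOWN) &&
            within_stack_limit(vma, addr))          // hypothetical helper
            return FAULT_STACK_GROWTH;
        return FAULT_SIGSEGV;
    }

    // Step 3: is the access type allowed by the VMA?
    if ((error_code & PF_WRITE) && !(vma->vm_flags & VM_WRITE))
        return FAULT_SIGSEGV;            // write to a read-only VMA
    if ((error_code & PF_INSTR) && !(vma->vm_flags & VM_EXEC))
        return FAULT_SIGSEGV;            // execute from a non-exec VMA

    // Step 4: is the page present?
    pte_t pte = *lookup_pte(mm, addr);
    if (!pte_present(pte))
        return pte_is_swap_entry(pte) ? FAULT_SWAP_IN : FAULT_DEMAND;

    // Step 5: present, read-only page + write + writable VMA = COW
    if ((error_code & PF_WRITE) && !pte_write(pte))
        return FAULT_COW;

    return FAULT_SPURIOUS;               // e.g., a stale TLB entry
}
```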
The kernel optimizes common paths. COW faults and demand faults are frequent and expected; the handler is tuned to classify and process them quickly. Invalid accesses (SIGSEGVs) are rare and can take the slow path. This priority ordering affects the code structure and branching.
Once a COW fault is identified, the kernel must copy the page content from the shared frame to a new private frame. This copy operation is more nuanced than it appears:
```c
// COW fault handling: the copy operation
// (Simplified from Linux's wp_page_copy())

static int do_cow_copy(struct mm_struct *mm, struct vm_area_struct *vma,
                       unsigned long address, pte_t *page_table,
                       pmd_t *pmd, spinlock_t *ptl,
                       pte_t orig_pte, struct page *old_page)
{
    struct page *new_page = NULL;
    pte_t entry;
    int ret = 0;

    // Optimization: Check if we're now the sole owner
    // (Another process may have exited while we were handling the fault)
    if (page_count(old_page) == 1) {
        // We're the only reference! No copy needed.
        // Just make the page writable.
        reuse_swap_page(old_page);
        goto reuse;
    }

    // Allocate a new page frame
    // Use the same NUMA node as the old page for locality
    new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
    if (!new_page) {
        ret = -ENOMEM;
        goto out;
    }

    // Copy the page content
    // Architecture-specific, may use optimized copy routines
    cow_user_page(new_page, old_page, address, vma);

    // Ensure the copy is visible before updating the PTE
    // (Important on weakly-ordered architectures)
    smp_wmb();

    // Lock the page table for the update
    spin_lock(ptl);

    // Re-check: Did something change while we allocated/copied?
    if (!pte_same(*page_table, orig_pte)) {
        // PTE changed - another CPU handled this or the process died
        // Abort our copy, free the new page
        spin_unlock(ptl);
        free_page(new_page);
        return 0;   // Will retry
    }

    // Set up mappings for the new page
    page_add_new_anon_rmap(new_page, vma, address);
    lru_cache_add_active_or_unevictable(new_page, vma);

    // Create the new PTE: new frame number, writable, dirty
    entry = mk_pte(new_page, vma->vm_page_prot);
    entry = pte_mkdirty(entry);
    entry = pte_mkwrite(entry);
    entry = pte_mkyoung(entry);

    // Update the page table entry atomically
    ptep_clear_flush_notify(vma, address, page_table);
    set_pte_at(mm, address, page_table, entry);

    // Let the architecture update its MMU caches
    update_mmu_cache(vma, address, page_table);

    spin_unlock(ptl);

    // Clean up the old page reference
    page_remove_rmap(old_page);
    put_page(old_page);
    return 0;

reuse:
    // Sole-owner path: just make the page writable, no copy
    spin_lock(ptl);
    if (!pte_same(*page_table, orig_pte)) {
        spin_unlock(ptl);
        return 0;
    }
    entry = pte_mkdirty(orig_pte);
    entry = pte_mkwrite(entry);
    set_pte_at(mm, address, page_table, entry);
    update_mmu_cache(vma, address, page_table);
    spin_unlock(ptl);
    return 0;

out:
    return ret;
}

// Optimized page copy (architecture-specific)
// On x86-64, may use REP MOVSQ or even non-temporal stores
void cow_user_page(struct page *dst, struct page *src,
                   unsigned long addr, struct vm_area_struct *vma)
{
    void *dst_addr = kmap_atomic(dst);
    void *src_addr = kmap_atomic(src);

    // 4KB copy, possibly using SIMD or enhanced REP instructions
    // Non-temporal stores avoid polluting the cache with the destination
    copy_user_page(dst_addr, src_addr, addr, vma);

    kunmap_atomic(src_addr);
    kunmap_atomic(dst_addr);
}
```

Notice the 'reuse' path: if the page's reference count dropped to 1 while we were preparing to copy (perhaps another sharing process exited), we skip the copy entirely. This is a significant optimization for processes that become sole owners between fork and write.
After the copy, the kernel must update the page table entry and ensure the CPU uses the new mapping. This involves careful handling of both the page table and the Translation Lookaside Buffer (TLB):
PTE Update Requirements:
Atomic Update — The PTE update must be atomic. A partial update visible to another CPU could cause corruption.
Content — The new PTE carries the new frame number and updated permission bits; the table below summarizes each field before and after the fault.
Old PTE Handling — The old mapping is invalidated, and the old frame's reference count is decremented.
| Field | Before (COW) | After (Private) | Why |
|---|---|---|---|
| Present | 1 | 1 | Page remains in memory |
| Read/Write | 0 (R) | 1 (RW) | Now writable, no more faults |
| User/Supervisor | 1 (U) | 1 (U) | Still user-accessible |
| Dirty | 0/1 | 1 | About to be written |
| Accessed | 1 | 1 | Recently used |
| Frame Number | X (shared) | Y (private) | Points to new frame |
| COW marker\* | 1 | 0 | No longer COW-protected |

\*Implementation-dependent: some kernels track COW in a software-available PTE bit, others infer it from a VMA that is writable while its PTE is not.
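As a concrete illustration, here is a small standalone sketch of this transition, assuming x86-64 bit positions and treating the COW marker as a software-available PTE bit (bit 9, our assumption):

```c
#include <stdint.h>
#include <stdio.h>

// x86-64 PTE bit positions (Present=0, R/W=1, U/S=2, Accessed=5, Dirty=6);
// using software bit 9 as a COW marker is an assumption for this sketch
#define PTE_PRESENT   (1ULL << 0)
#define PTE_RW        (1ULL << 1)
#define PTE_USER      (1ULL << 2)
#define PTE_ACCESSED  (1ULL << 5)
#define PTE_DIRTY     (1ULL << 6)
#define PTE_SW_COW    (1ULL << 9)
#define PTE_PFN_MASK  0x000FFFFFFFFFF000ULL
#define PTE_PFN(pfn)  (((uint64_t)(pfn) << 12) & PTE_PFN_MASK)

int main(void)
{
    uint64_t shared_pfn = 0x1234, private_pfn = 0x5678;

    // Before: present, read-only, user, COW-marked, pointing at shared frame X
    uint64_t pte = PTE_PRESENT | PTE_USER | PTE_ACCESSED |
                   PTE_SW_COW | PTE_PFN(shared_pfn);

    // After the copy: drop the COW marker and old frame, install the private
    // frame Y, and set writable + dirty (the store is about to complete)
    pte = (pte & ~(PTE_PFN_MASK | PTE_SW_COW))
        | PTE_RW | PTE_DIRTY | PTE_PFN(private_pfn);

    printf("private PTE: %#018llx\n", (unsigned long long)pte);
    return 0;
}
```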
TLB Invalidation:
The TLB caches recent address translations. After updating the PTE, the old (incorrect) TLB entry must be invalidated. This is non-trivial on multiprocessor systems:
Single-Processor Case:
- Update PTE in page table
- Execute INVLPG instruction for the virtual address
- TLB entry evicted, next access uses new PTE
Multi-Processor Case:
- Update PTE in page table
- Send Inter-Processor Interrupt (IPI) to all CPUs that might have the old TLB entry
- Each CPU executes INVLPG locally
- Wait for acknowledgment (barrier)
- Only then is it safe to free old frame
The multi-processor case is expensive. TLB shootdown IPIs can cost hundreds of cycles and add latency to COW faults.
```c
// TLB shootdown for a PTE update (simplified)
// Real Linux code uses more sophisticated batching and tracking

void ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
                      pte_t *ptep)
{
    pte_t pte = *ptep;

    // Clear the PTE
    pte_clear(vma->vm_mm, address, ptep);

    // Now we need to flush TLBs
    // Which CPUs might have cached this PTE?
    cpumask_t flush_cpus;

    // In Linux, mm_cpumask() tracks which CPUs have used this mm
    cpumask_copy(&flush_cpus, mm_cpumask(vma->vm_mm));

    if (cpumask_any_but(&flush_cpus, smp_processor_id()) < nr_cpu_ids) {
        // Other CPUs need flushing - send IPIs
        // This is the expensive path
        smp_call_function_many(&flush_cpus, flush_tlb_func,
                               (void *)address, 1 /* wait */);
    }

    // Flush the local TLB
    __flush_tlb_one(address);
}

// Called on each CPU receiving the IPI
void flush_tlb_func(void *addr)
{
    unsigned long address = (unsigned long)addr;

    // Invalidate the single TLB entry
    // Architecture-specific instruction (INVLPG on x86)
    __flush_tlb_one(address);
}

// Optimization: Lazy TLB invalidation
// Instead of an immediate IPI, mark the mm as needing a flush;
// CPUs will flush on the next context switch to this mm
void flush_tlb_batched(struct vm_area_struct *vma, unsigned long address)
{
    // Increment a per-mm generation counter
    atomic_inc(&vma->vm_mm->tlb_flush_pending);

    // Each CPU checks this counter on mm switch
    // If changed since last switch, full TLB flush
    // This batches multiple invalidations
}
```

Modern kernels batch TLB invalidations to amortize IPI costs. Instead of sending an IPI per page, the kernel accumulates invalidations and sends a single IPI covering a range (or schedules a full TLB flush on context switch). This is especially important when many COW faults occur in succession.
COW fault handling must be correct in concurrent, multi-processor environments. Multiple threads or processes might trigger COW faults on the same page simultaneously. The kernel uses several mechanisms to ensure correctness:
| Mechanism | What It Protects | Scope |
|---|---|---|
| mmap_sem/mmap_lock | VMA list, mm_struct changes | Per-address-space |
| Page table lock (PTL) | PTE modifications | Per-PTE or per-table portion |
| Page lock | Page state during I/O | Per-page |
| Reference counts (atomic) | Frame lifecycle | Per-page |
| Compare-and-swap on PTE | Atomic PTE updates | Per-PTE |
The Critical Race: Duplicate COW Handling
Consider two threads faulting on the same COW page:
Thread A (CPU 0) Thread B (CPU 1)
───────────────── ─────────────────
Write to VAddr 0x1000 Write to VAddr 0x1000
COW fault triggered COW fault triggered
Read PTE (read-only) Read PTE (read-only)
Allocate new frame Allocate new frame
Copy page Copy page
Lock PTL [waits for PTL]
Verify PTE unchanged ...
Update PTE to new frame ...
Unlock PTL Lock PTL
Verify PTE - CHANGED!
Abort, free new frame
Unlock PTL
Thread B's copy is wasted, but correctness is maintained through the lock and re-verification pattern.
The COW handler uses an optimistic approach: do expensive work (allocate, copy) without holding locks, then lock briefly to verify and commit. If verification fails, work is discarded. This maximizes parallelism at the cost of occasional wasted work. For rare races, this trade-off is worthwhile.
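The same pattern can be sketched with an atomic compare-and-swap on the PTE (the last mechanism in the table above). Here, cmpxchg_pte(ptep, old, new) is a hypothetical helper that atomically installs new only if the entry still equals old, returning true on success; real kernels typically use the page table lock as shown earlier.

```c
// Optimistic COW commit via compare-and-swap (sketch; cmpxchg_pte is
// hypothetical, the other helpers follow the simplified code above)
static int cow_fault_optimistic(pte_t *ptep, pte_t orig_pte,
                                struct page *old_page,
                                struct vm_area_struct *vma,
                                unsigned long address)
{
    // Expensive work first, with no locks held
    struct page *new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
    if (!new_page)
        return -ENOMEM;
    cow_user_page(new_page, old_page, address, vma);
    smp_wmb();   // the copy must be visible before the new PTE is

    pte_t new_pte = pte_mkwrite(pte_mkdirty(mk_pte(new_page,
                                                   vma->vm_page_prot)));

    // Commit: succeeds only if no other CPU changed the PTE meanwhile
    if (!cmpxchg_pte(ptep, orig_pte, new_pte)) {
        free_page(new_page);   // lost the race; the winner's copy stands
        return 0;              // the faulting store retries and succeeds
    }

    put_page(old_page);        // drop our reference to the shared frame
    return 0;
}
```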
Real-world COW fault handling must address numerous special cases that complicate the basic flow:
Huge Pages and COW:
Huge pages (2MB on x86-64) complicate COW handling:
Option 1: Copy Entire Huge Page — Preserve the huge page but copy 2MB of data. This is expensive but maintains huge page benefits.
Option 2: Split Then COW — Convert the huge page to 512 base pages, then COW just the faulting 4KB page. Cheaper for sparse writes but loses huge page TLB efficiency.
Linux generally chooses Option 2 for anonymous huge pages, but the optimal choice depends on access patterns. Some workloads explicitly use madvise(MADV_DONTFORK) to avoid COW on huge page regions entirely.
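For illustration, here is how a process might opt a region out of COW sharing entirely using the madvise(MADV_DONTFORK) call mentioned above; a minimal sketch with only basic error handling.

```c
#include <stddef.h>
#include <sys/mman.h>

#define REGION_SIZE (2 * 1024 * 1024)   // one 2MB huge page worth of memory

void *alloc_nofork_region(void)
{
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    // Children created by fork() will not map this region at all,
    // so no COW sharing (and no COW faults) can occur on it
    if (madvise(p, REGION_SIZE, MADV_DONTFORK) != 0) {
        munmap(p, REGION_SIZE);
        return NULL;
    }
    return p;
}
```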
GUP (Get User Pages) + fork() is a known problematic combination. If one process has pinned pages (for DMA or other reasons) and forks, the kernel must break COW immediately for pinned pages to prevent the DMA device from writing to the wrong physical frame. This 'GUP-fast' vs 'COW' interaction has been the source of serious security vulnerabilities.
Let's consolidate our understanding of the COW write fault mechanism:
- A write to a read-only COW page triggers a hardware protection fault that transfers control to the kernel.
- The kernel classifies the fault using the error code, the VMA permissions, and the PTE state.
- If the faulting process is the sole owner of the frame, the page is simply made writable; otherwise a new frame is allocated and the page is copied.
- The PTE is updated atomically under the page table lock, and stale TLB entries are invalidated (with IPIs on SMP systems).
- The faulting store is re-executed and completes against the private copy.
What's Next:
Now that we understand how writes trigger copies, we'll examine fork optimization—the specific use case that motivated COW and where it provides the most dramatic benefits. We'll see how fork() leverages COW to achieve near-instant process creation.
You now understand the complete COW fault handling mechanism: from hardware trap to kernel classification, through copying and page table updates, to TLB management and concurrency control. This is the core machinery that makes Copy-on-Write work in practice.