Loading learning content...
Consider a web server process with 500MB of memory calling fork() to handle a request. If fork() literally copied all 500MB, it would be devastatingly slow—potentially taking hundreds of milliseconds and consuming vast amounts of RAM. With thousands of requests per second, this would be completely impractical.
Yet Unix servers routinely fork millions of times per day with negligible overhead. The secret is Copy-on-Write (COW)—an elegant optimization that makes fork() fast by being lazy about actual copying.
Copy-on-Write is not merely an optimization; it fundamentally changes how we think about fork()'s cost model. Understanding COW is essential for writing memory-efficient multi-process applications.
This page provides deep mastery of Copy-on-Write mechanics. You will understand why naive copying is impractical, how COW works at the page-table level, when actual copying occurs, memory sharing patterns, performance implications, and how to write COW-aware code that minimizes memory usage.
To appreciate Copy-on-Write, we must first understand the naive alternative and why it fails at scale.
Naive fork() Implementation:
In a naive implementation, fork() would:
The Problem:
| Process Size | Copy Time (est.) | Memory Consumed | Practical? |
|---|---|---|---|
| 10 MB | ~1 ms | 20 MB total | Acceptable for infrequent use |
| 100 MB | ~10 ms | 200 MB total | Noticeable latency |
| 500 MB | ~50 ms | 1 GB total | Significant delay |
| 2 GB | ~200 ms | 4 GB total | Unacceptable for most uses |
| 8 GB | ~800 ms | 16 GB total | System would grind to halt |
Additional Problems with Naive Copying:
Early Unix systems did use naive copying, which worked acceptably on 1970s machines with kilobytes of process memory. As programs grew, this approach became untenable. BSD Unix introduced COW in the 1980s, and it's now universal in modern operating systems.
Copy-on-Write operates on a simple principle: defer copying until absolutely necessary. Most forked processes don't modify most of their memory, so copying it would be wasted work.
The Core Insight:
When fork() is called, instead of copying memory:
Visual Representation:
123456789101112131415161718192021222324252627282930313233343536373839
IMMEDIATELY AFTER fork() - NO COPYING HAS OCCURRED Parent Process Child Process┌─────────────────────┐ ┌─────────────────────┐│ Virtual Address │ │ Virtual Address ││ Space │ │ Space ││ │ │ ││ Page 0 ─────────────┼──────┐ ┌──────┼───────── Page 0 ││ Page 1 ─────────────┼────┐ │ │ ┌────┼───────── Page 1 ││ Page 2 ─────────────┼──┐ │ │ │ │ ┌──┼───────── Page 2 ││ ... │ │ │ │ │ │ │ │ ... │└─────────────────────┘ │ │ │ │ │ │ └─────────────────────┘ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ┌─────────────────────┐Physical Memory: │ Page Frame 0 (R/O) │ ← Shared, read-only │ Page Frame 1 (R/O) │ ← Shared, read-only │ Page Frame 2 (R/O) │ ← Shared, read-only │ ... │ └─────────────────────┘ AFTER CHILD WRITES TO PAGE 1 - COW TRIGGERED Parent Process Child Process┌─────────────────────┐ ┌─────────────────────┐│ Page 0 ─────────────┼──────────────X┼───────── Page 0 ││ Page 1 ─────────────┼─────┐ ┌───┼───────── Page 1 ││ Page 2 ─────────────┼─────┼────┼───┼───────── Page 2 │└─────────────────────┘ │ │ └─────────────────────┘ │ │ ▼ ▼ ┌─────────────────────┐Physical Memory: │ Page Frame 0 (R/O) │ ← Still shared │ Page Frame 1 (R/W) │ ← Parent's copy (now R/W) │ Page Frame 2 (R/O) │ ← Still shared │ Page Frame 99(R/W) │ ← Child's NEW copy └─────────────────────┘ Key: Only the WRITTEN page was copied. All other pages remain shared!COW embodies the laziness principle in computer science: don't do work until you absolutely must. By deferring copies until writes occur, the system does minimal work—often no copying at all if the child quickly exec()s into a new program.
Copy-on-Write operates at the page level—the fundamental unit of virtual memory. Understanding page mechanics is essential for understanding COW behavior.
Page Table Entry Manipulation:
Each page table entry (PTE) contains flags that control access:
| Flag | Purpose | COW Usage |
|---|---|---|
| Present (P) | Page is in physical memory | Always set for mapped pages |
| Read/Write (R/W) | 0=read-only, 1=writable | Set to 0 for COW pages |
| User/Supervisor (U/S) | User mode can access | Set for user pages |
| Accessed (A) | Page has been read | Used for page replacement |
| Dirty (D) | Page has been written | Indicates need to write back |
The COW Page Fault Sequence:
When a process writes to a COW page, the following sequence occurs:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859
COW Page Fault Handling Sequence================================= 1. WRITE ATTEMPT ┌─────────────────────────────────────────────────────────┐ │ Process attempts to write to address 0x7FFF8000 │ │ This address maps to a COW-protected page │ └─────────────────────────────────────────────────────────┘ │ ▼2. HARDWARE PAGE FAULT ┌─────────────────────────────────────────────────────────┐ │ MMU detects write to read-only page │ │ CPU raises Page Fault exception (interrupt 14 on x86) │ │ Control transfers to kernel page fault handler │ └─────────────────────────────────────────────────────────┘ │ ▼3. KERNEL FAULT ANALYSIS ┌─────────────────────────────────────────────────────────┐ │ Kernel examines fault address and type │ │ Checks VMA (Virtual Memory Area) permissions │ │ Determines this is a valid COW situation, not SIGSEGV │ └─────────────────────────────────────────────────────────┘ │ ▼4. REFERENCE COUNT CHECK ┌─────────────────────────────────────────────────────────┐ │ Kernel checks page reference count │ │ If count > 1: page is shared, must copy │ │ If count == 1: this process owns it, just make writable │ └─────────────────────────────────────────────────────────┘ │ ▼5. PAGE COPY (if needed) ┌─────────────────────────────────────────────────────────┐ │ Allocate new physical page frame │ │ Copy 4KB from shared page to new page │ │ Decrement reference count on original page │ └─────────────────────────────────────────────────────────┘ │ ▼6. PAGE TABLE UPDATE ┌─────────────────────────────────────────────────────────┐ │ Update PTE to point to new page │ │ Set R/W bit to 1 (writable) │ │ Flush TLB entry for this address │ └─────────────────────────────────────────────────────────┘ │ ▼7. RESUME EXECUTION ┌─────────────────────────────────────────────────────────┐ │ Return to user mode │ │ Re-execute the faulting instruction │ │ Write now succeeds to the private copy │ └─────────────────────────────────────────────────────────┘ Total time: ~microseconds (for one page)Far faster than copying entire address space!Each physical page has a reference count tracking how many page tables point to it. When count reaches zero, the page can be freed. COW increments this count during fork() and decrements it during page copies or process exit.
Let's examine how a modern operating system implements fork() using Copy-on-Write:
Simplified fork() with COW:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293
/* * Simplified Conceptual Implementation of fork() with COW * (Actual kernel code is far more complex) */ /* Called when user invokes fork() system call */pid_t sys_fork(void) { struct task_struct *parent = current; /* Current process */ struct task_struct *child; pid_t child_pid; /* 1. Allocate new task structure (PCB) for child */ child = alloc_task_struct(); if (!child) return -ENOMEM; /* 2. Allocate unique PID */ child_pid = alloc_pid(); if (child_pid < 0) { free_task_struct(child); return -EAGAIN; } /* 3. Copy task structure from parent */ *child = *parent; /* Shallow copy */ child->pid = child_pid; child->ppid = parent->pid; child->state = TASK_READY; child->pending_signals = 0; /* Clear pending signals */ /* 4. Create new page table for child (THE COW MAGIC) */ child->page_table = copy_page_table_cow(parent->page_table); if (!child->page_table) { free_pid(child_pid); free_task_struct(child); return -ENOMEM; } /* 5. Duplicate file descriptor table (not the files themselves) */ child->fd_table = dup_fd_table(parent->fd_table); /* 6. Copy other resources: signal handlers, etc. */ copy_signal_handlers(parent, child); /* 7. Set up return values */ /* Parent's context will return child_pid */ /* Child's context will return 0 */ child->saved_registers.return_value = 0; /* 8. Add child to scheduler's ready queue */ add_to_runqueue(child); /* 9. Return child's PID to parent */ return child_pid;} /* * The critical COW function: copy page table without copying pages */page_table_t* copy_page_table_cow(page_table_t *parent_pt) { page_table_t *child_pt = alloc_page_table(); /* Iterate through all page table entries */ for (int i = 0; i < NUM_PAGES; i++) { pte_t *parent_pte = &parent_pt->entries[i]; pte_t *child_pte = &child_pt->entries[i]; if (!pte_present(parent_pte)) continue; /* Skip unmapped pages */ /* Copy the PTE (same physical page!) */ *child_pte = *parent_pte; /* Mark BOTH entries as read-only for COW */ if (pte_writable(parent_pte)) { clear_pte_writable(parent_pte); /* Parent too! */ clear_pte_writable(child_pte); /* Mark as COW (implementation-specific flag) */ set_pte_cow(parent_pte); set_pte_cow(child_pte); } /* Increment reference count on the physical page */ struct page *pg = pte_to_page(parent_pte); atomic_inc(&pg->ref_count); } /* Flush TLB since we changed parent's page table too */ flush_tlb_all(); return child_pt;}A subtle but critical detail: the parent's pages are ALSO marked read-only. If only child pages were marked and parent writes first, the child would see parent's modifications to 'shared' memory! Both must fault on write to maintain isolation.
Copy-on-Write provides dramatic memory savings in typical usage patterns. Let's analyze the efficiency gains:
Typical Post-fork() Memory Sharing:
| Scenario | Child Action | Pages Copied | Memory Saved |
|---|---|---|---|
| fork() + immediate exec() | Child runs different program | Near zero (~stack page) | ~100% |
| fork() for parallel computation | Child processes data | Only modified data pages | 60-90% |
| fork() in web server | Child handles request | Minimal (stack, small heap) | 90%+ |
| fork() with full memory modification | Child touches all memory | All pages eventually | 0% (but amortized) |
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677
/* * Demonstrating COW Memory Efficiency * Compile: gcc -o cow_demo cow_demo.c * Run: watch -n 0.5 'ps -o pid,vsz,rss -p $(pgrep cow_demo)' * * VSZ = Virtual Size (mapped memory) * RSS = Resident Set Size (physical memory actually used) */ #include <stdio.h>#include <stdlib.h>#include <unistd.h>#include <string.h>#include <sys/wait.h> #define ARRAY_SIZE (100 * 1024 * 1024) /* 100 MB */ int main() { /* Allocate and touch large array to ensure physical pages exist */ char *large_array = malloc(ARRAY_SIZE); memset(large_array, 'A', ARRAY_SIZE); /* Force allocation */ printf("Before fork: Allocated %d MB\n", ARRAY_SIZE / (1024*1024)); printf("PID %d. Check memory with: ps -o vsz,rss -p %d\n", getpid(), getpid()); sleep(3); pid_t pid = fork(); if (pid == 0) { /* CHILD */ printf("\nChild PID %d born. Initially sharing all memory with parent.\n", getpid()); printf("Check RSS - should be small (COW, sharing pages)\n"); sleep(3); /* Modify first 10MB - triggers COW on those pages */ printf("Child: Writing to first 10MB (triggering COW)...\n"); memset(large_array, 'B', 10 * 1024 * 1024); printf("Child: Now should have ~10MB more physical memory\n"); sleep(3); /* Modify rest - triggers COW on remaining pages */ printf("Child: Writing to ALL memory...\n"); memset(large_array, 'C', ARRAY_SIZE); printf("Child: Now should have full 100MB physical memory\n"); sleep(5); free(large_array); exit(0); } /* PARENT */ printf("Parent: Waiting, NOT modifying memory (staying shared)\n"); int status; waitpid(pid, &status, 0); /* Parent's array should still be 'A' - unchanged by child */ printf("Parent: Verifying data integrity after child exit...\n"); if (large_array[0] == 'A' && large_array[ARRAY_SIZE-1] == 'A') { printf("Parent: Data unchanged (as expected from COW)\n"); } free(large_array); return 0;} /* * Expected memory observations: * * 1. Before fork: Parent has ~100MB RSS * 2. After fork: Parent same, Child ~small RSS (shared pages) * 3. Child writes 10MB: Child RSS grows ~10MB * 4. Child writes all: Child RSS grows to ~100MB * 5. Parent RSS unchanged throughout! */Virtual Size (VSZ/VSS) shows mapped memory and is the same for parent and child after fork. Resident Set Size (RSS) shows actual physical memory—this reveals COW in action. After fork, RSS is much smaller than VSZ because pages are shared until written.
The most dramatic efficiency gain from COW comes in the common fork-exec pattern. When a child calls exec() immediately after fork(), Copy-on-Write means virtually no memory was actually copied.
The Ideal Case: fork() followed by exec()
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758
/* * Why fork() + exec() is Very Efficient with COW */ #include <stdio.h>#include <unistd.h>#include <sys/wait.h> int main() { /* * Parent might have 1GB of memory. * Without COW: fork() would copy all 1GB * With COW: fork() copies only page tables (~few KB) */ pid_t pid = fork(); if (pid == 0) { /* * CHILD PROCESS * * At this point, child's address space shares all pages * with parent via COW. * * If we immediately exec(), we: * 1. Replace entire address space with new program * 2. Release all COW page references * 3. Never actually copy any pages! * * Total memory copied: ~0 bytes * Total pages copied: ~0 (maybe 1-2 for stack activity) */ execlp("ls", "ls", "-l", NULL); _exit(127); /* Only reached if exec fails */ } /* Parent waits */ waitpid(pid, NULL, 0); return 0;} /* * Memory timeline: * * Time 0: Parent exists with 1GB memory * Time 1: fork() called * - Child created with COW sharing all 1GB * - Physical memory used: still ~1GB (shared) * Time 2: Child calls exec("ls") * - Child's COW references released * - New "ls" program loaded (~few MB) * - Physical memory: ~1GB (parent) + ~few MB (child) * * WITHOUT COW: * Time 1 would require 2GB total, then Time 2 would free 1GB * Wasteful! */Pages That Might Be Copied Even with Immediate exec():
Before COW was universal, Unix provided vfork() which didn't copy page tables at all—parent was suspended while child ran using parent's address space directly. With modern COW, vfork() offers minimal benefit and is dangerous (child can corrupt parent's memory). Prefer fork() or posix_spawn().
While COW is almost always beneficial, certain patterns can cause performance problems:
Pathological Case 1: Immediate Full Memory Write
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374
/* * COW Pathological Case: Full Memory Modification * * This is WORSE than naive copying because: * 1. We still iterate page tables during fork() * 2. We then fault on EVERY page * 3. Each fault has overhead (context switch, allocation, copy) * * Total cost: fork() overhead + (N pages * fault handling overhead) */ #include <stdio.h>#include <stdlib.h>#include <string.h>#include <unistd.h>#include <sys/wait.h>#include <sys/time.h> #define MB (1024 * 1024)#define ARRAY_SIZE (500 * MB) long long get_time_us() { struct timeval tv; gettimeofday(&tv, NULL); return tv.tv_sec * 1000000LL + tv.tv_usec;} int main() { char *data = malloc(ARRAY_SIZE); memset(data, 0, ARRAY_SIZE); /* Touch all pages */ long long start = get_time_us(); pid_t pid = fork(); if (pid == 0) { /* * Child IMMEDIATELY writes to ALL memory * This triggers COW faults on all ~125,000 pages * * Each page fault: * - Trap to kernel * - Find free page frame * - Copy 4KB * - Update page table * - Flush TLB entry * - Return to user mode * * This is slower than if we had just copied upfront! */ memset(data, 'X', ARRAY_SIZE); long long end = get_time_us(); printf("Child: Wrote %d MB in %lld ms\n", ARRAY_SIZE / MB, (end - start) / 1000); free(data); exit(0); } waitpid(pid, NULL, 0); /* Parent didn't modify - data still zero */ free(data); return 0;} /* * LESSON: COW is optimized for the common case (partial modification). * If you KNOW child will modify all memory, consider alternatives: * - Pre-allocate in child after fork (no COW overhead) * - Use threads instead of processes for shared-memory parallelism * - Use mmap with MAP_PRIVATE (similar to COW but explicit) */Pathological Case 2: Many Forks Sharing Memory
Redis uses fork() for background saves (BGSAVE). If write load is high during save, COW causes massive memory consumption as modified pages are copied. Redis documentation warns about this 'memory explosion' scenario. Solutions: schedule saves during low-write periods, or use jemalloc's transparent huge pages setting.
Understanding COW enables you to write more memory-efficient multi-process applications. Here are practical guidelines:
Dos and Don'ts:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
/* * COW-Friendly Pattern: Pre-fork Initialization */ #include <stdio.h>#include <stdlib.h>#include <unistd.h>#include <sys/mman.h>#include <sys/wait.h> /* Large shared dataset */#define DATA_SIZE (100 * 1024 * 1024) /* 100 MB */ int main() { /* * PATTERN: Initialize read-only data BEFORE forking * All children share this via COW, never copying */ char *shared_config = malloc(DATA_SIZE); /* Load configuration/model/data that all children need */ load_large_readonly_dataset(shared_config); /* Hypothetical */ /* For truly shared read-write data, use mmap */ int *shared_counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); *shared_counter = 0; /* Now fork - children share shared_config via COW */ for (int i = 0; i < 4; i++) { pid_t pid = fork(); if (pid == 0) { /* Child: Read shared_config (no copy) */ /* Child: Allocate OWN work buffer (child's private pages) */ char *my_buffer = malloc(1024); /* Process using shared_config as lookup (read-only) */ /* Write results to my_buffer (private) */ /* Shared counter modified via true shared memory (not COW) */ __atomic_add_fetch(shared_counter, 1, __ATOMIC_SEQ_CST); free(my_buffer); exit(0); } } /* Wait for all children */ while (wait(NULL) > 0); printf("Processed by %d children\n", *shared_counter); free(shared_config); munmap(shared_counter, sizeof(int)); return 0;}Use /proc/[pid]/smaps to see detailed memory sharing. Look for 'Shared_Clean' and 'Shared_Dirty' to understand COW behavior. The 'pmap -X' command also shows sharing. For real-time monitoring, 'smem' provides user-friendly output.
We have explored Copy-on-Write in depth, understanding how this optimization makes fork() practical. Let's consolidate the key concepts:
What's Next:
With COW understood, we'll examine what happens after fork() from the perspective of execution: how parent and child diverge in their execution paths, scheduling considerations, and the dynamics of parent-child relationships during concurrent execution.
You now have deep understanding of Copy-on-Write mechanics. You understand why COW exists, how it works at the page-table level, its efficiency characteristics, and how to write COW-aware code. Next, we'll explore parent vs child execution dynamics.