Operating SystemsVirtual Memory

Copy-on-Write

LevelIntermediate

Duration75 mins

TopicVirtual Memory

4 / 5

Fork Optimization

The System Call That Copies Everything (Without Copying Anything)

The fork() system call is one of the most elegant and powerful abstractions in the Unix tradition. It creates a new process that is an exact copy of the calling process—same code, same data, same open files, same everything except the process ID. This seemingly simple operation enables the shell, web servers, databases, and countless other programs to create child processes.

But here's the paradox: if fork() truly copies everything, it should be impossibly slow for large processes. A process with 8GB of memory would require copying 8 billion bytes. Yet on modern systems, fork() completes in microseconds, regardless of process size.

The answer is Copy-on-Write. And understanding how COW optimizes fork() is essential for any systems programmer, because it reveals both the power and the pitfalls of this technique.

What You Will Learn

By the end of this page, you will understand the fork-exec pattern and why it dominates Unix-like systems, how COW transforms fork() from O(memory) to O(page_tables), the specific optimizations used during fork(), and real-world scenarios where this matters.

The Fork-Exec Pattern

To understand why COW matters so much for fork(), we need to understand the fork-exec pattern—the canonical Unix method for creating new processes running new programs:

pid_t pid = fork();
if (pid == 0) {
    // Child process
    exec("/path/to/program", args);
    // exec replaces the entire address space
    // If we reach here, exec failed
    exit(1);
} else if (pid > 0) {
    // Parent process
    wait(&status);  // Wait for child to finish
} else {
    // fork failed
    perror("fork");
}

Why Fork-Exec?

The fork-exec pattern separates process creation (fork) from program loading (exec). This separation enables:

Fork-Exec Advantages

•Inheritance Control — Between fork and exec, the child can manipulate inherited state: close files, redirect stdio, change environment, set resource limits, change directories.
•Simple Shell Implementation — Shells can set up pipes by forking, creating pipes, dup2'ing file descriptors, then exec'ing the commands.
•Process Hierarchy — Parent-child relationships are explicit, enabling wait(), job control, and process groups.
•Uniform Mechanism — One pattern works for all program creation, from shells to web servers to init.
•Copy-On-Write Benefits — Since exec discards all memory, COW means fork cost is minimal.

shell_fork_exec.c
C (Shell Example)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
// Simplified shell command execution
// Demonstrates why fork-exec is so powerful
 
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <fcntl.h>
 
// Execute: command > output.txt
void execute_with_redirect(char *command[], char *output_file) {
    pid_t pid = fork();
    
    if (pid == 0) {
        // Child process - set up redirect BEFORE exec
        
        // Open output file (shell's responsibility)
        int fd = open(output_file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            exit(1);
        }
        
        // Redirect stdout to file
        // The command being exec'd knows nothing about this!
        dup2(fd, STDOUT_FILENO);
        close(fd);
        
        // Now execute the command
        // Command sees stdout as already pointing to the file
        execvp(command[0], command);
        
        // If we get here, exec failed
        perror("exec");
        exit(1);
    }
    
    // Parent waits for child
    waitpid(pid, NULL, 0);
}
 
// Execute: cmd1 | cmd2  (pipeline)
void execute_pipeline(char *cmd1[], char *cmd2[]) {
    int pipefd[2];
    pipe(pipefd);  // Create pipe before any fork
    
    pid_t pid1 = fork();
    if (pid1 == 0) {
        // First child: writes to pipe
        close(pipefd[0]);                  // Close read end
        dup2(pipefd[1], STDOUT_FILENO);    // Stdout -> pipe write
        close(pipefd[1]);
        execvp(cmd1[0], cmd1);
        exit(1);
    }
    
    pid_t pid2 = fork();
    if (pid2 == 0) {
        // Second child: reads from pipe
        close(pipefd[1]);                  // Close write end
        dup2(pipefd[0], STDIN_FILENO);     // Stdin <- pipe read
        close(pipefd[0]);
        execvp(cmd2[0], cmd2);
        exit(1);
    }
    
    // Parent: close pipe ends and wait
    close(pipefd[0]);
    close(pipefd[1]);
    waitpid(pid1, NULL, 0);
    waitpid(pid2, NULL, 0);
}
 
// NOTE: Without COW, the shell process (perhaps 100MB with readline,
// history, etc.) would be fully copied just to be immediately discarded
// by exec(). With COW, fork() is instant regardless of shell size.

Windows CreateProcess vs Unix Fork-Exec

Windows uses CreateProcess() which combines fork and exec into one call. This is more efficient when you don't need to modify child state, but less flexible. The Unix separation enables powerful patterns that are awkward on Windows (like shell pipelines). Each approach has trade-offs.

Fork Without COW: The Problem

To appreciate COW's impact on fork, let's examine what fork would require without it. Imagine implementing fork() with eager copying:

naive_fork.c
C (Naive, No COW)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// Naive fork implementation WITHOUT Copy-on-Write
// This is how early Unix systems worked (before virtual memory)
 
int naive_fork() {
    // Allocate new process control block
    struct process *child = alloc_process();
    
    // Get parent's memory size
    size_t parent_memory_size = current->mm->total_pages * PAGE_SIZE;
    
    // Allocate ALL physical memory for child up-front
    void *child_memory = kmalloc(parent_memory_size);
    if (!child_memory) {
        return -ENOMEM;  // Can't allocate - fail
    }
    
    // Copy EVERY BYTE of parent's address space
    // This is the killer - O(memory_size) operation
    memcpy(child_memory, parent_memory_base, parent_memory_size);
    
    // Set up child's page tables to point to new memory
    setup_page_tables(child, child_memory);
    
    // Copy other resources...
    copy_files(child);
    copy_signals(child);
    copy_credentials(child);
    
    // Schedule the child
    add_to_scheduler(child);
    
    return child->pid;
}
 
// Time analysis for 1GB process:
// - Memory copy: ~200ms (at ~5 GB/s copy speed)
// - Page table setup: ~1ms
// - Other copies: ~microseconds
// Total: dominated by memory copy
 
// If this fork() is followed by exec():
// - exec() will FREE all that copied memory immediately
// - 200ms of work for NOTHING

Fork Performance Without COW
Process Size	Copy Time	Memory Required	After Exec
10 MB	~2 ms	20 MB	10 MB freed instantly
100 MB	~20 ms	200 MB	100 MB freed instantly
1 GB	~200 ms	2 GB	1 GB freed instantly
10 GB	~2 sec	20 GB	10 GB freed instantly
100 GB	~20 sec	200 GB	OOM likely, if not, 100 GB freed

The Fatal Flaw:

Without COW, fork() has pathological time complexity:

fork() = O(memory_size)
fork() + exec() = O(memory_size) + O(new_program_size)
But child uses: O(new_program_size)

The fork-exec pattern becomes unusable for large processes. Imagine a database server with 64GB of memory trying to spawn a helper script—it would take minutes just for fork().

Historical Context:

Early Unix systems (before virtual memory) did work this way. Systems were small enough (kilobytes of memory) that eager copying was acceptable. The advent of virtual memory enabled COW, which in turn enabled large processes to use fork-exec without penalty.

Memory Overcommit Pressure

Even ignoring time, eager fork doubles memory usage temporarily. A system with 90% memory used couldn't fork() at all—even if the child exec()s immediately. This would make fork() unusable under memory pressure, exactly when process creation might be needed (e.g., spawning a helper to free memory).

Fork With COW: The Solution

With Copy-on-Write, fork() transforms from an O(memory_size) operation to approximately O(page_table_entries)—a dramatically smaller quantity. Here's how:

cow_fork.c
C (Linux-style COW Fork)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
// COW-optimized fork implementation (simplified from Linux)
// Key insight: copy page TABLES, not page CONTENTS
 
int cow_fork() {
    struct mm_struct *parent_mm = current->mm;
    struct mm_struct *child_mm;
    struct vm_area_struct *vma;
    
    // Allocate new mm_struct for child (small, fixed-size)
    child_mm = allocate_mm();
    
    // Copy mm_struct metadata (quick, small)
    duplicate_mm_metadata(child_mm, parent_mm);
    
    // Now the key: copy VMAs and page tables
    // But NOT the actual page contents!
    
    for_each_vma(vma, parent_mm) {
        struct vm_area_struct *child_vma;
        
        // Create corresponding VMA in child
        child_vma = copy_vma(vma);
        insert_vma(child_mm, child_vma);
        
        // Copy page table entries, with COW magic
        copy_page_range(child_mm, parent_mm, vma);
    }
    
    // Copy file descriptors, signals, etc.
    copy_files(child);
    copy_signals(child);
    
    return child->pid;
}
 
// The magic happens here:
int copy_page_range(struct mm_struct *dst_mm,
                    struct mm_struct *src_mm,
                    struct vm_area_struct *vma) {
    pgd_t *src_pgd, *dst_pgd;
    unsigned long addr = vma->vm_start;
    unsigned long end = vma->vm_end;
    
    // Walk page tables at each level
    for (; addr < end; addr = next_pgd_addr(addr)) {
        src_pgd = pgd_offset(src_mm, addr);
        dst_pgd = pgd_alloc(dst_mm, addr);  // Allocate PGD entry
        
        if (pgd_none(*src_pgd))
            continue;  // No mapping - skip
        
        // Recurse through PUD, PMD, PTE levels...
        copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
                       vma, addr, next);
    }
    return 0;
}
 
// At the PTE (leaf) level:
int copy_pte_range(struct mm_struct *dst_mm,
                   struct mm_struct *src_mm,
                   pmd_t *dst_pmd, pmd_t *src_pmd,
                   struct vm_area_struct *vma,
                   unsigned long addr) {
    pte_t *src_pte, *dst_pte;
    pte_t pte;
    struct page *page;
    
    src_pte = pte_offset_map(src_pmd, addr);
    dst_pte = pte_alloc(dst_mm, dst_pmd, addr);
    
    // Read the source PTE
    pte = *src_pte;
    
    if (!pte_present(pte)) {
        // Not present - copy swap entry or skip
        *dst_pte = pte;
        return 0;
    }
    
    // Get the physical page
    page = pte_page(pte);
    
    // KEY OPTIMIZATION: If page is writable, mark read-only
    // in BOTH parent and child for COW
    if (pte_write(pte) && is_cow_mapping(vma->vm_flags)) {
        // Clear write bit in parent
        pte = pte_wrprotect(pte);
        set_pte_at(src_mm, addr, src_pte, pte);
        
        // Flush parent's TLB for this address
        flush_tlb_page(vma, addr);
    }
    
    // Increment page reference count (now shared)
    page_ref_inc(page);
    
    // Copy the (now read-only) PTE to child
    set_pte_at(dst_mm, addr, dst_pte, pte);
    
    return 0;
}

Fork Performance WITH COW
Process Size	Fork Time	Immediate Memory	After Fork+Exec
10 MB	~100 μs	~0.1 MB (page tables)	Parent unchanged
100 MB	~100 μs	~1 MB (page tables)	Parent unchanged
1 GB	~200 μs	~8 MB (page tables)	Parent unchanged
10 GB	~500 μs	~80 MB (page tables)	Parent unchanged
100 GB	~2 ms	~800 MB (page tables)	Parent unchanged

The Dramatic Improvement

For a 1GB process: Eager fork = ~200ms, COW fork = ~200μs. That's a 1000x improvement. And critically, immediate memory overhead is just page tables (~8MB), not a full copy (~1GB). This makes fork() viable for any process size.

What Fork Actually Copies

It's instructive to understand exactly what fork() does copy immediately, versus what it defers through COW:

Fork Resources: Copied vs. Shared
Resource	Copied or Shared	Cost	Notes
Process Control Block	Copied	Small (KB)	PID, state, scheduling info
Memory Descriptor (mm)	Copied	Small (KB)	VMA list, memory limits
Page Tables (all levels)	Copied (structure)	Proportional to mappings	PTEs point to shared frames
Physical Page Frames	Shared (COW)	Zero at fork	Actually copied only on write
File Descriptor Table	Copied (table)	Small	FDs point to shared file structs
Open File Structures	Shared	Zero	Reference counts incremented
Signal Handlers	Copied (table)	Small	Pointers to handler functions
Current Working Dir	Shared	Zero	Reference to inode
Credentials (uid, etc)	Copied	Tiny	A few integers
Resource Limits	Copied	Tiny	RLIMIT array

Page Table Size Analysis:

Page tables are the primary immediate cost of fork(). For a 4-level page table (x86-64):

Page size: 4 KB
Entries per table: 512 (9 bits per level)
Address bits covered: 9 + 9 + 9 + 9 + 12 = 48 bits

For a process with sparse address space (typical):

Virtual range: 0 - 128 TB (48 bits)
Actual used:   ~1 GB (common for applications)

Page table overhead:
- PGD (level 1): 1 page = 4 KB
- PUD (level 2): ~1-2 pages = 4-8 KB
- PMD (level 3): ~2-4 pages = 8-16 KB
- PTE (level 4): ~256 pages = 1 MB

Total: ~1-2 MB for page tables for 1 GB process

This is orders of magnitude less than the 1 GB of actual data.

Converting Mermaid diagram...

Fork Scalability

Fork time scales with the number of page table entries (mapped pages), not with actual memory content. A process with 1GB of densely-mapped memory forks at similar speed to a process with 1GB of sparsely-mapped memory, as long as page table size is similar. This is why fork() remains fast even for huge processes.

The Fork-Exec Optimization Path

The fork-exec pattern is so common that the kernel provides additional optimizations beyond basic COW. Let's trace the optimal path:

Optimal Fork-Exec Timeline:

Parent Process (500 MB)
        │
        ├─ fork()  [~100 μs]
        │   └── All 500MB shared as COW
        │
        ├─ Child runs briefly
        │   └── Typical: 0 pages modified (no COW faults)
        │
        ├─ exec()  [~10 ms to load new program]
        │   └── All 500MB of shared mappings dropped
        │   └── Page refcounts decremented (no frames freed, parent still uses)
        │   └── New 5MB program loaded
        │
        └─ Net result:
            ├── Parent: unchanged
            ├── Child: 5MB new program
            └── Total time: ~10 ms

Compare to eager fork: ~100ms just for fork + ~10ms for exec = ~110ms total, with 500MB temporary memory spike.

Fork-Exec Optimizations

•vfork() — Ultra-fast fork for immediate exec. Child borrows parent's address space (no page table copy). Parent blocks until child execs or exits. Dangerous but fast.
•clone(CLONE_VM) — Share address space entirely (threads). No COW needed because no copy semantics expected.
•posix_spawn() — Combines fork+exec into one call. Can use vfork internally. Avoids some COW setup if exec is guaranteed.
•Exec drops COW — When exec() replaces the address space, all COW shared mappings are unmapped. Reference counts drop. No actual copies ever needed for exec-ing child.

vfork_example.c
C (vfork vs fork)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
// vfork() - even faster than COW fork for immediate exec
// WARNING: vfork has serious restrictions!
 
#include <unistd.h>
#include <sys/wait.h>
 
// Using vfork() for maximum speed
void fast_spawn_with_vfork(char *program, char **args) {
    pid_t pid = vfork();  // Create child, borrow parent's memory
    
    // CRITICAL: With vfork, child runs in parent's address space!
    // Until exec or _exit, parent is SUSPENDED.
    
    if (pid == 0) {
        // Child: we're borrowing parent's memory
        // MUST NOT:
        // - Modify any variables (would corrupt parent)
        // - Return from this function (would corrupt parent's stack)
        // - Call any functions that might modify memory
        
        // CAN ONLY:
        // - Call exec (replaces address space, safe)
        // - Call _exit (terminates child, releases parent)
        
        execvp(program, args);
        _exit(127);  // Use _exit, not exit (exit would run cleanup)
    }
    
    // Parent: resumes after child execs or exits
    waitpid(pid, NULL, 0);
}
 
// Performance comparison:
//
// Process size  | fork() time | vfork() time
// --------------|-------------|-------------
// 10 MB         | ~100 μs     | ~50 μs
// 100 MB        | ~150 μs     | ~50 μs
// 1 GB          | ~200 μs     | ~50 μs
// 10 GB         | ~500 μs     | ~50 μs
//
// vfork() is constant time - no page table copying at all!
// But the restrictions make it tricky to use correctly.
 
// Modern alternative: posix_spawn()
#include <spawn.h>
 
void modern_spawn(char *program, char **args, char **envp) {
    pid_t pid;
    posix_spawnattr_t attr;
    posix_spawn_file_actions_t actions;
    
    posix_spawnattr_init(&attr);
    posix_spawn_file_actions_init(&actions);
    
    // Can set up file redirects, signals, etc. via actions
    
    // Spawn the process - may use vfork internally
    posix_spawn(&pid, program, &actions, &attr, args, envp);
    
    waitpid(pid, NULL, 0);
    
    posix_spawnattr_destroy(&attr);
    posix_spawn_file_actions_destroy(&actions);
}

vfork Dangers

vfork() is faster but extremely dangerous. The child shares the parent's stack; any modifications corrupt the parent. Even function calls can be problematic. Modern glibc implements vfork using clone() with safeguards, but direct syscall use requires extreme care. Prefer posix_spawn() for safe, fast spawning.

Real-World Fork Scenarios

Let's examine how fork+COW behaves in common real-world scenarios, including both optimal and problematic cases:

COW Wins: Best Cases

•Shell commands — Large shell forks, child immediately exec()s. Zero COW faults.
•Web server prefork — Apache forks workers sharing most code/config. Minimal divergence.
•Database streaming replica — Fork for consistent snapshot, child mostly reads. Few copies.
•Container runtimes — Fork to spawn container, most memory is shared libraries.

COW Costs: Challenging Cases

•Fork + heavy mutation — Child modifies most memory. COW overhead exceeds eager copy.
•Redis BGSAVE with writes — Background save + active writes cause massive COW.
•Fork in garbage-collected runtimes — GC marks/updates trigger COW on many pages.
•Large mmap'd files forked — Even read-only file pages are COW-protected.

Case Study: Redis BGSAVE

Redis uses fork() to create a point-in-time snapshot for background saving:

Parent (Redis server): 10GB dataset in memory
        │
        └─ fork() for BGSAVE
             │
             └─ Child: writes dataset to disk
                      (reads 10GB, doesn't modify)
             │
             └─ Parent: continues serving requests
                      (modifies ~1% of data during save)

The COW Behavior:

Fork instant (~500μs for 10GB)
Child reads all pages (fine, no COW fault)
Parent modifies pages → COW fault for each modified page
If 1% modified: 100MB of COW copies during save
Memory peaks at 10GB + 100MB, not 20GB

Under Heavy Write Load:

If parent modifies 50% during save...
5GB of COW copies occur!
Memory peaks at 10GB + 5GB = 15GB
Save is slower due to continuous COW faults

Redis documents this and recommends sufficient memory headroom.

Designing for COW

Applications can be designed to work well with COW: place frequently-modified data together (single page), use huge pages for large read-mostly data (fewer COW events), or explicitly pre-copy hot pages before fork. Redis Jemalloc configuration and huge page settings are examples of COW-aware tuning.

Measuring Fork Performance

Understanding and measuring fork performance is crucial for systems programming. Here's how to benchmark and profile fork behavior:

fork_benchmark.c
C (Benchmark)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
// Fork performance benchmark
// Measures fork time across different memory sizes
 
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <string.h>
 
double get_time_ms() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}
 
void benchmark_fork(size_t memory_mb) {
    size_t size = memory_mb * 1024 * 1024;
    
    // Allocate and touch memory to ensure it's resident
    char *mem = malloc(size);
    memset(mem, 'x', size);  // Touch every page
    
    printf("Memory: %zu MB, ", memory_mb);
    
    // Measure fork time
    double start = get_time_ms();
    
    pid_t pid = fork();
    
    if (pid == 0) {
        // Child: exit immediately (fork-exec pattern)
        _exit(0);
    } else {
        double fork_time = get_time_ms() - start;
        waitpid(pid, NULL, 0);
        double total_time = get_time_ms() - start;
        
        printf("Fork: %.3f ms, Fork+Wait: %.3f ms\n", 
               fork_time, total_time);
    }
    
    free(mem);
}
 
void benchmark_fork_with_cow(size_t memory_mb, double write_fraction) {
    size_t size = memory_mb * 1024 * 1024;
    size_t pages = size / 4096;
    size_t pages_to_write = pages * write_fraction;
    
    char *mem = malloc(size);
    memset(mem, 'x', size);
    
    printf("Memory: %zu MB, Write fraction: %.0f%%, ", 
           memory_mb, write_fraction * 100);
    
    double start = get_time_ms();
    
    pid_t pid = fork();
    
    if (pid == 0) {
        // Child: simulate work by writing to some pages
        for (size_t i = 0; i < pages_to_write; i++) {
            // Write to one byte per page (triggers COW)
            mem[i * 4096] = 'y';
        }
        _exit(0);
    } else {
        waitpid(pid, NULL, 0);
        double total_time = get_time_ms() - start;
        printf("Total time: %.3f ms\n", total_time);
    }
    
    free(mem);
}
 
int main() {
    printf("=== Fork Benchmark (COW) ===\n\n");
    
    printf("--- Fork-only time (child exits immediately) ---\n");
    benchmark_fork(10);
    benchmark_fork(100);
    benchmark_fork(1000);
    
    printf("\n--- Fork with COW writes (child modifies memory) ---\n");
    benchmark_fork_with_cow(100, 0.0);   // No writes
    benchmark_fork_with_cow(100, 0.01);  // Write 1%
    benchmark_fork_with_cow(100, 0.10);  // Write 10%
    benchmark_fork_with_cow(100, 0.50);  // Write 50%
    benchmark_fork_with_cow(100, 1.0);   // Write 100%
    
    return 0;
}
 
/* Sample output on Linux x86-64:
 
=== Fork Benchmark (COW) ===
 
--- Fork-only time (child exits immediately) ---
Memory: 10 MB, Fork: 0.15 ms, Fork+Wait: 0.21 ms
Memory: 100 MB, Fork: 0.18 ms, Fork+Wait: 0.24 ms
Memory: 1000 MB, Fork: 0.25 ms, Fork+Wait: 0.32 ms
 
--- Fork with COW writes (child modifies memory) ---
Memory: 100 MB, Write fraction: 0%, Total time: 0.24 ms
Memory: 100 MB, Write fraction: 1%, Total time: 2.5 ms
Memory: 100 MB, Write fraction: 10%, Total time: 18 ms
Memory: 100 MB, Write fraction: 50%, Total time: 85 ms
Memory: 100 MB, Write fraction: 100%, Total time: 165 ms
 
Key insight: Fork itself is fast. COW costs come during child execution.
*/

Profiling Fork and COW:

Linux provides several ways to observe fork and COW behavior:

# Watch page faults during execution
perf stat -e page-faults,minor-faults,major-faults ./program

# Track COW events specifically
perf record -e 'page-faults' -g ./program
perf report

# Watch memory changes in real-time
watch -n 1 'cat /proc/$(pgrep myprogram)/smaps | grep -A 20 heap'

# System-wide fork monitoring
perf trace -e 'syscalls:sys_enter_fork,syscalls:sys_enter_clone'

Interpreting Results

Minor page faults on a post-fork process often indicate COW events. A high minor fault count after fork, especially correlated with writes, suggests heavy COW activity. Tools like /proc/[pid]/smaps and pmap -x can show Private_Dirty pages (COW copies made) vs Shared pages (still COW-shared).

Summary and Looking Ahead

Let's consolidate our understanding of fork optimization through COW:

Key Takeaways

•Fork-exec is the Unix process creation pattern — fork() duplicates, exec() replaces. The separation enables powerful setup between fork and exec.
•Without COW, fork would be O(memory_size) — Eager copying would make fork() unusable for large processes, especially with immediate exec().
•COW transforms fork to O(page_table_size) — Only page tables are copied at fork; actual page contents are shared until written.
•Fork time is now independent of data — A 100GB process can fork as fast as a 100MB process (page table sizes may vary slightly).
•vfork and posix_spawn provide even faster paths — For immediate exec(), these avoid even page table copying.
•COW costs are paid later on write — Fork is cheap; COW faults during child execution add up if many pages are modified.

What's Next:

We'll conclude our exploration of Copy-on-Write with zero-fill on demand—a related technique that extends lazy initialization to fresh memory allocations, further reducing the cost of memory operations.

Page Complete

You now understand how COW revolutionizes fork(): transforming a potentially minutes-long operation into microseconds, enabling the fork-exec pattern at any process size, and making the entire Unix process model practical for modern workloads.

4 / 5

Loading learning content...

Operating SystemsVirtual Memory

Copy-on-Write

LevelIntermediate

Duration75 mins

TopicVirtual Memory

4 / 5

Fork Optimization

The System Call That Copies Everything (Without Copying Anything)

The answer is Copy-on-Write. And understanding how COW optimizes fork() is essential for any systems programmer, because it reveals both the power and the pitfalls of this technique.

What You Will Learn

The Fork-Exec Pattern

To understand why COW matters so much for fork(), we need to understand the fork-exec pattern—the canonical Unix method for creating new processes running new programs:

pid_t pid = fork();
if (pid == 0) {
    // Child process
    exec("/path/to/program", args);
    // exec replaces the entire address space
    // If we reach here, exec failed
    exit(1);
} else if (pid > 0) {
    // Parent process
    wait(&status);  // Wait for child to finish
} else {
    // fork failed
    perror("fork");
}

Why Fork-Exec?

The fork-exec pattern separates process creation (fork) from program loading (exec). This separation enables:

Fork-Exec Advantages

•Inheritance Control — Between fork and exec, the child can manipulate inherited state: close files, redirect stdio, change environment, set resource limits, change directories.
•Simple Shell Implementation — Shells can set up pipes by forking, creating pipes, dup2'ing file descriptors, then exec'ing the commands.
•Process Hierarchy — Parent-child relationships are explicit, enabling wait(), job control, and process groups.
•Uniform Mechanism — One pattern works for all program creation, from shells to web servers to init.
•Copy-On-Write Benefits — Since exec discards all memory, COW means fork cost is minimal.

shell_fork_exec.c
C (Shell Example)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
// Simplified shell command execution
// Demonstrates why fork-exec is so powerful
 
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <fcntl.h>
 
// Execute: command > output.txt
void execute_with_redirect(char *command[], char *output_file) {
    pid_t pid = fork();
    
    if (pid == 0) {
        // Child process - set up redirect BEFORE exec
        
        // Open output file (shell's responsibility)
        int fd = open(output_file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            exit(1);
        }
        
        // Redirect stdout to file
        // The command being exec'd knows nothing about this!
        dup2(fd, STDOUT_FILENO);
        close(fd);
        
        // Now execute the command
        // Command sees stdout as already pointing to the file
        execvp(command[0], command);
        
        // If we get here, exec failed
        perror("exec");
        exit(1);
    }
    
    // Parent waits for child
    waitpid(pid, NULL, 0);
}
 
// Execute: cmd1 | cmd2  (pipeline)
void execute_pipeline(char *cmd1[], char *cmd2[]) {
    int pipefd[2];
    pipe(pipefd);  // Create pipe before any fork
    
    pid_t pid1 = fork();
    if (pid1 == 0) {
        // First child: writes to pipe
        close(pipefd[0]);                  // Close read end
        dup2(pipefd[1], STDOUT_FILENO);    // Stdout -> pipe write
        close(pipefd[1]);
        execvp(cmd1[0], cmd1);
        exit(1);
    }
    
    pid_t pid2 = fork();
    if (pid2 == 0) {
        // Second child: reads from pipe
        close(pipefd[1]);                  // Close write end
        dup2(pipefd[0], STDIN_FILENO);     // Stdin <- pipe read
        close(pipefd[0]);
        execvp(cmd2[0], cmd2);
        exit(1);
    }
    
    // Parent: close pipe ends and wait
    close(pipefd[0]);
    close(pipefd[1]);
    waitpid(pid1, NULL, 0);
    waitpid(pid2, NULL, 0);
}
 
// NOTE: Without COW, the shell process (perhaps 100MB with readline,
// history, etc.) would be fully copied just to be immediately discarded
// by exec(). With COW, fork() is instant regardless of shell size.

Windows CreateProcess vs Unix Fork-Exec

Fork Without COW: The Problem

To appreciate COW's impact on fork, let's examine what fork would require without it. Imagine implementing fork() with eager copying:

naive_fork.c
C (Naive, No COW)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// Naive fork implementation WITHOUT Copy-on-Write
// This is how early Unix systems worked (before virtual memory)
 
int naive_fork() {
    // Allocate new process control block
    struct process *child = alloc_process();
    
    // Get parent's memory size
    size_t parent_memory_size = current->mm->total_pages * PAGE_SIZE;
    
    // Allocate ALL physical memory for child up-front
    void *child_memory = kmalloc(parent_memory_size);
    if (!child_memory) {
        return -ENOMEM;  // Can't allocate - fail
    }
    
    // Copy EVERY BYTE of parent's address space
    // This is the killer - O(memory_size) operation
    memcpy(child_memory, parent_memory_base, parent_memory_size);
    
    // Set up child's page tables to point to new memory
    setup_page_tables(child, child_memory);
    
    // Copy other resources...
    copy_files(child);
    copy_signals(child);
    copy_credentials(child);
    
    // Schedule the child
    add_to_scheduler(child);
    
    return child->pid;
}
 
// Time analysis for 1GB process:
// - Memory copy: ~200ms (at ~5 GB/s copy speed)
// - Page table setup: ~1ms
// - Other copies: ~microseconds
// Total: dominated by memory copy
 
// If this fork() is followed by exec():
// - exec() will FREE all that copied memory immediately
// - 200ms of work for NOTHING

Fork Performance Without COW
Process Size	Copy Time	Memory Required	After Exec
10 MB	~2 ms	20 MB	10 MB freed instantly
100 MB	~20 ms	200 MB	100 MB freed instantly
1 GB	~200 ms	2 GB	1 GB freed instantly
10 GB	~2 sec	20 GB	10 GB freed instantly
100 GB	~20 sec	200 GB	OOM likely, if not, 100 GB freed

The Fatal Flaw:

Without COW, fork() has pathological time complexity:

fork() = O(memory_size)
fork() + exec() = O(memory_size) + O(new_program_size)
But child uses: O(new_program_size)

The fork-exec pattern becomes unusable for large processes. Imagine a database server with 64GB of memory trying to spawn a helper script—it would take minutes just for fork().

Historical Context:

Memory Overcommit Pressure

Fork With COW: The Solution

With Copy-on-Write, fork() transforms from an O(memory_size) operation to approximately O(page_table_entries)—a dramatically smaller quantity. Here's how:

cow_fork.c
C (Linux-style COW Fork)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
// COW-optimized fork implementation (simplified from Linux)
// Key insight: copy page TABLES, not page CONTENTS
 
int cow_fork() {
    struct mm_struct *parent_mm = current->mm;
    struct mm_struct *child_mm;
    struct vm_area_struct *vma;
    
    // Allocate new mm_struct for child (small, fixed-size)
    child_mm = allocate_mm();
    
    // Copy mm_struct metadata (quick, small)
    duplicate_mm_metadata(child_mm, parent_mm);
    
    // Now the key: copy VMAs and page tables
    // But NOT the actual page contents!
    
    for_each_vma(vma, parent_mm) {
        struct vm_area_struct *child_vma;
        
        // Create corresponding VMA in child
        child_vma = copy_vma(vma);
        insert_vma(child_mm, child_vma);
        
        // Copy page table entries, with COW magic
        copy_page_range(child_mm, parent_mm, vma);
    }
    
    // Copy file descriptors, signals, etc.
    copy_files(child);
    copy_signals(child);
    
    return child->pid;
}
 
// The magic happens here:
int copy_page_range(struct mm_struct *dst_mm,
                    struct mm_struct *src_mm,
                    struct vm_area_struct *vma) {
    pgd_t *src_pgd, *dst_pgd;
    unsigned long addr = vma->vm_start;
    unsigned long end = vma->vm_end;
    
    // Walk page tables at each level
    for (; addr < end; addr = next_pgd_addr(addr)) {
        src_pgd = pgd_offset(src_mm, addr);
        dst_pgd = pgd_alloc(dst_mm, addr);  // Allocate PGD entry
        
        if (pgd_none(*src_pgd))
            continue;  // No mapping - skip
        
        // Recurse through PUD, PMD, PTE levels...
        copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
                       vma, addr, next);
    }
    return 0;
}
 
// At the PTE (leaf) level:
int copy_pte_range(struct mm_struct *dst_mm,
                   struct mm_struct *src_mm,
                   pmd_t *dst_pmd, pmd_t *src_pmd,
                   struct vm_area_struct *vma,
                   unsigned long addr) {
    pte_t *src_pte, *dst_pte;
    pte_t pte;
    struct page *page;
    
    src_pte = pte_offset_map(src_pmd, addr);
    dst_pte = pte_alloc(dst_mm, dst_pmd, addr);
    
    // Read the source PTE
    pte = *src_pte;
    
    if (!pte_present(pte)) {
        // Not present - copy swap entry or skip
        *dst_pte = pte;
        return 0;
    }
    
    // Get the physical page
    page = pte_page(pte);
    
    // KEY OPTIMIZATION: If page is writable, mark read-only
    // in BOTH parent and child for COW
    if (pte_write(pte) && is_cow_mapping(vma->vm_flags)) {
        // Clear write bit in parent
        pte = pte_wrprotect(pte);
        set_pte_at(src_mm, addr, src_pte, pte);
        
        // Flush parent's TLB for this address
        flush_tlb_page(vma, addr);
    }
    
    // Increment page reference count (now shared)
    page_ref_inc(page);
    
    // Copy the (now read-only) PTE to child
    set_pte_at(dst_mm, addr, dst_pte, pte);
    
    return 0;
}

Fork Performance WITH COW
Process Size	Fork Time	Immediate Memory	After Fork+Exec
10 MB	~100 μs	~0.1 MB (page tables)	Parent unchanged
100 MB	~100 μs	~1 MB (page tables)	Parent unchanged
1 GB	~200 μs	~8 MB (page tables)	Parent unchanged
10 GB	~500 μs	~80 MB (page tables)	Parent unchanged
100 GB	~2 ms	~800 MB (page tables)	Parent unchanged

The Dramatic Improvement

What Fork Actually Copies

It's instructive to understand exactly what fork() does copy immediately, versus what it defers through COW:

Fork Resources: Copied vs. Shared
Resource	Copied or Shared	Cost	Notes
Process Control Block	Copied	Small (KB)	PID, state, scheduling info
Memory Descriptor (mm)	Copied	Small (KB)	VMA list, memory limits
Page Tables (all levels)	Copied (structure)	Proportional to mappings	PTEs point to shared frames
Physical Page Frames	Shared (COW)	Zero at fork	Actually copied only on write
File Descriptor Table	Copied (table)	Small	FDs point to shared file structs
Open File Structures	Shared	Zero	Reference counts incremented
Signal Handlers	Copied (table)	Small	Pointers to handler functions
Current Working Dir	Shared	Zero	Reference to inode
Credentials (uid, etc)	Copied	Tiny	A few integers
Resource Limits	Copied	Tiny	RLIMIT array

Page Table Size Analysis:

Page tables are the primary immediate cost of fork(). For a 4-level page table (x86-64):

Page size: 4 KB
Entries per table: 512 (9 bits per level)
Address bits covered: 9 + 9 + 9 + 9 + 12 = 48 bits

For a process with sparse address space (typical):

Virtual range: 0 - 128 TB (48 bits)
Actual used:   ~1 GB (common for applications)

Page table overhead:
- PGD (level 1): 1 page = 4 KB
- PUD (level 2): ~1-2 pages = 4-8 KB
- PMD (level 3): ~2-4 pages = 8-16 KB
- PTE (level 4): ~256 pages = 1 MB

Total: ~1-2 MB for page tables for 1 GB process

This is orders of magnitude less than the 1 GB of actual data.

Converting Mermaid diagram...

Fork Scalability

The Fork-Exec Optimization Path

The fork-exec pattern is so common that the kernel provides additional optimizations beyond basic COW. Let's trace the optimal path:

Optimal Fork-Exec Timeline:

Parent Process (500 MB)
        │
        ├─ fork()  [~100 μs]
        │   └── All 500MB shared as COW
        │
        ├─ Child runs briefly
        │   └── Typical: 0 pages modified (no COW faults)
        │
        ├─ exec()  [~10 ms to load new program]
        │   └── All 500MB of shared mappings dropped
        │   └── Page refcounts decremented (no frames freed, parent still uses)
        │   └── New 5MB program loaded
        │
        └─ Net result:
            ├── Parent: unchanged
            ├── Child: 5MB new program
            └── Total time: ~10 ms

Compare to eager fork: ~100ms just for fork + ~10ms for exec = ~110ms total, with 500MB temporary memory spike.

Fork-Exec Optimizations

•vfork() — Ultra-fast fork for immediate exec. Child borrows parent's address space (no page table copy). Parent blocks until child execs or exits. Dangerous but fast.
•clone(CLONE_VM) — Share address space entirely (threads). No COW needed because no copy semantics expected.
•posix_spawn() — Combines fork+exec into one call. Can use vfork internally. Avoids some COW setup if exec is guaranteed.
•Exec drops COW — When exec() replaces the address space, all COW shared mappings are unmapped. Reference counts drop. No actual copies ever needed for exec-ing child.

vfork_example.c
C (vfork vs fork)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
// vfork() - even faster than COW fork for immediate exec
// WARNING: vfork has serious restrictions!
 
#include <unistd.h>
#include <sys/wait.h>
 
// Using vfork() for maximum speed
void fast_spawn_with_vfork(char *program, char **args) {
    pid_t pid = vfork();  // Create child, borrow parent's memory
    
    // CRITICAL: With vfork, child runs in parent's address space!
    // Until exec or _exit, parent is SUSPENDED.
    
    if (pid == 0) {
        // Child: we're borrowing parent's memory
        // MUST NOT:
        // - Modify any variables (would corrupt parent)
        // - Return from this function (would corrupt parent's stack)
        // - Call any functions that might modify memory
        
        // CAN ONLY:
        // - Call exec (replaces address space, safe)
        // - Call _exit (terminates child, releases parent)
        
        execvp(program, args);
        _exit(127);  // Use _exit, not exit (exit would run cleanup)
    }
    
    // Parent: resumes after child execs or exits
    waitpid(pid, NULL, 0);
}
 
// Performance comparison:
//
// Process size  | fork() time | vfork() time
// --------------|-------------|-------------
// 10 MB         | ~100 μs     | ~50 μs
// 100 MB        | ~150 μs     | ~50 μs
// 1 GB          | ~200 μs     | ~50 μs
// 10 GB         | ~500 μs     | ~50 μs
//
// vfork() is constant time - no page table copying at all!
// But the restrictions make it tricky to use correctly.
 
// Modern alternative: posix_spawn()
#include <spawn.h>
 
void modern_spawn(char *program, char **args, char **envp) {
    pid_t pid;
    posix_spawnattr_t attr;
    posix_spawn_file_actions_t actions;
    
    posix_spawnattr_init(&attr);
    posix_spawn_file_actions_init(&actions);
    
    // Can set up file redirects, signals, etc. via actions
    
    // Spawn the process - may use vfork internally
    posix_spawn(&pid, program, &actions, &attr, args, envp);
    
    waitpid(pid, NULL, 0);
    
    posix_spawnattr_destroy(&attr);
    posix_spawn_file_actions_destroy(&actions);
}

vfork Dangers

Real-World Fork Scenarios

Let's examine how fork+COW behaves in common real-world scenarios, including both optimal and problematic cases:

COW Wins: Best Cases

•Shell commands — Large shell forks, child immediately exec()s. Zero COW faults.
•Web server prefork — Apache forks workers sharing most code/config. Minimal divergence.
•Database streaming replica — Fork for consistent snapshot, child mostly reads. Few copies.
•Container runtimes — Fork to spawn container, most memory is shared libraries.

COW Costs: Challenging Cases

•Fork + heavy mutation — Child modifies most memory. COW overhead exceeds eager copy.
•Redis BGSAVE with writes — Background save + active writes cause massive COW.
•Fork in garbage-collected runtimes — GC marks/updates trigger COW on many pages.
•Large mmap'd files forked — Even read-only file pages are COW-protected.

Case Study: Redis BGSAVE

Redis uses fork() to create a point-in-time snapshot for background saving:

Parent (Redis server): 10GB dataset in memory
        │
        └─ fork() for BGSAVE
             │
             └─ Child: writes dataset to disk
                      (reads 10GB, doesn't modify)
             │
             └─ Parent: continues serving requests
                      (modifies ~1% of data during save)

The COW Behavior:

Fork instant (~500μs for 10GB)
Child reads all pages (fine, no COW fault)
Parent modifies pages → COW fault for each modified page
If 1% modified: 100MB of COW copies during save
Memory peaks at 10GB + 100MB, not 20GB

Under Heavy Write Load:

If parent modifies 50% during save...
5GB of COW copies occur!
Memory peaks at 10GB + 5GB = 15GB
Save is slower due to continuous COW faults

Redis documents this and recommends sufficient memory headroom.

Designing for COW

Measuring Fork Performance

Understanding and measuring fork performance is crucial for systems programming. Here's how to benchmark and profile fork behavior:

fork_benchmark.c
C (Benchmark)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
// Fork performance benchmark
// Measures fork time across different memory sizes
 
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <string.h>
 
double get_time_ms() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}
 
void benchmark_fork(size_t memory_mb) {
    size_t size = memory_mb * 1024 * 1024;
    
    // Allocate and touch memory to ensure it's resident
    char *mem = malloc(size);
    memset(mem, 'x', size);  // Touch every page
    
    printf("Memory: %zu MB, ", memory_mb);
    
    // Measure fork time
    double start = get_time_ms();
    
    pid_t pid = fork();
    
    if (pid == 0) {
        // Child: exit immediately (fork-exec pattern)
        _exit(0);
    } else {
        double fork_time = get_time_ms() - start;
        waitpid(pid, NULL, 0);
        double total_time = get_time_ms() - start;
        
        printf("Fork: %.3f ms, Fork+Wait: %.3f ms\n", 
               fork_time, total_time);
    }
    
    free(mem);
}
 
void benchmark_fork_with_cow(size_t memory_mb, double write_fraction) {
    size_t size = memory_mb * 1024 * 1024;
    size_t pages = size / 4096;
    size_t pages_to_write = pages * write_fraction;
    
    char *mem = malloc(size);
    memset(mem, 'x', size);
    
    printf("Memory: %zu MB, Write fraction: %.0f%%, ", 
           memory_mb, write_fraction * 100);
    
    double start = get_time_ms();
    
    pid_t pid = fork();
    
    if (pid == 0) {
        // Child: simulate work by writing to some pages
        for (size_t i = 0; i < pages_to_write; i++) {
            // Write to one byte per page (triggers COW)
            mem[i * 4096] = 'y';
        }
        _exit(0);
    } else {
        waitpid(pid, NULL, 0);
        double total_time = get_time_ms() - start;
        printf("Total time: %.3f ms\n", total_time);
    }
    
    free(mem);
}
 
int main() {
    printf("=== Fork Benchmark (COW) ===\n\n");
    
    printf("--- Fork-only time (child exits immediately) ---\n");
    benchmark_fork(10);
    benchmark_fork(100);
    benchmark_fork(1000);
    
    printf("\n--- Fork with COW writes (child modifies memory) ---\n");
    benchmark_fork_with_cow(100, 0.0);   // No writes
    benchmark_fork_with_cow(100, 0.01);  // Write 1%
    benchmark_fork_with_cow(100, 0.10);  // Write 10%
    benchmark_fork_with_cow(100, 0.50);  // Write 50%
    benchmark_fork_with_cow(100, 1.0);   // Write 100%
    
    return 0;
}
 
/* Sample output on Linux x86-64:
 
=== Fork Benchmark (COW) ===
 
--- Fork-only time (child exits immediately) ---
Memory: 10 MB, Fork: 0.15 ms, Fork+Wait: 0.21 ms
Memory: 100 MB, Fork: 0.18 ms, Fork+Wait: 0.24 ms
Memory: 1000 MB, Fork: 0.25 ms, Fork+Wait: 0.32 ms
 
--- Fork with COW writes (child modifies memory) ---
Memory: 100 MB, Write fraction: 0%, Total time: 0.24 ms
Memory: 100 MB, Write fraction: 1%, Total time: 2.5 ms
Memory: 100 MB, Write fraction: 10%, Total time: 18 ms
Memory: 100 MB, Write fraction: 50%, Total time: 85 ms
Memory: 100 MB, Write fraction: 100%, Total time: 165 ms
 
Key insight: Fork itself is fast. COW costs come during child execution.
*/

Profiling Fork and COW:

Linux provides several ways to observe fork and COW behavior:

# Watch page faults during execution
perf stat -e page-faults,minor-faults,major-faults ./program

# Track COW events specifically
perf record -e 'page-faults' -g ./program
perf report

# Watch memory changes in real-time
watch -n 1 'cat /proc/$(pgrep myprogram)/smaps | grep -A 20 heap'

# System-wide fork monitoring
perf trace -e 'syscalls:sys_enter_fork,syscalls:sys_enter_clone'

Interpreting Results

Summary and Looking Ahead

Let's consolidate our understanding of fork optimization through COW:

Key Takeaways

•Fork-exec is the Unix process creation pattern — fork() duplicates, exec() replaces. The separation enables powerful setup between fork and exec.
•Without COW, fork would be O(memory_size) — Eager copying would make fork() unusable for large processes, especially with immediate exec().
•COW transforms fork to O(page_table_size) — Only page tables are copied at fork; actual page contents are shared until written.
•Fork time is now independent of data — A 100GB process can fork as fast as a 100MB process (page table sizes may vary slightly).
•vfork and posix_spawn provide even faster paths — For immediate exec(), these avoid even page table copying.
•COW costs are paid later on write — Fork is cheap; COW faults during child execution add up if many pages are modified.

What's Next:

Page Complete

4 / 5