When you allocate memory with malloc() or extend your stack, the operating system promises you a range of zeroed memory. But here's a secret that might surprise you: that memory doesn't physically exist when it's allocated. The OS has made you a promise it hasn't yet kept.
This technique is called Zero-Fill on Demand (ZFOD), and it's a close cousin to Copy-on-Write. Where COW defers copying until a write, ZFOD defers allocation and zeroing until a read or write actually occurs. Together, these lazy evaluation techniques form the foundation of efficient memory management in modern operating systems.
Understanding ZFOD is essential because it explains why your programs can allocate far more memory than exists, why memory usage numbers are often misleading, and how the OS creates the illusion of abundant resources from limited physical memory.
By the end of this page, you will understand how ZFOD works, why it exists, the zero page optimization, how it interacts with COW during fork(), and the implications for application design and memory overcommit.
Zero-Fill on Demand (ZFOD) is a memory allocation strategy where:

- Virtual address space is reserved immediately when requested
- No physical frame is allocated at that time
- A physical frame is allocated, and zeroed, only when the page is first touched
This is lazy allocation in its purest form: don't do any work until absolutely necessary.
Why Zero?
Fresh memory is zeroed for security and correctness: zeroing prevents a process from reading stale data left behind by another process (an information leak), and it gives programs a deterministic initial state to rely on.
| Stage | Virtual | Page Table | Physical | Usable? |
|---|---|---|---|---|
| Before malloc() | N/A | N/A | N/A | No |
| After malloc() | Allocated | Not present / Zero-page | None | Appears usable |
| First read | Allocated | Points to zero page | Zero page (shared) | Yes, reads zero |
| First write | Allocated | Points to private frame | Zeroed frame allocated | Yes, fully private |
ZFOD can work in stages: (1) Virtual-only with no PTE, (2) Read-only mapping to a shared zero page, (3) Private writable page on first write. This mirrors COW's lazy copying, applied to fresh allocations instead of inherited data.
Modern operating systems employ a particularly elegant optimization for ZFOD: the zero page. Instead of leaving fresh anonymous mappings unmapped (causing a fault on first access) or allocating zeroed frames immediately, the OS maps freshly allocated memory to a single, special, read-only page filled with zeros.
How the Zero Page Works:

- On allocation, the PTEs for the range map the shared zero page, read-only
- Reads hit the zero page and correctly return zero, with no private allocation
- A write triggers a protection fault; the kernel allocates a private zeroed frame and remaps the PTE read-write
Benefits of the Zero Page:
| Benefit | Explanation |
|---|---|
| Memory savings | Untouched allocations consume no physical memory (except shared zero page) |
| Time savings | No zeroing needed until write; zeroing happens on-demand, spread over time |
| Cache efficiency | Zero page often hot in cache; reading zeros doesn't pollute cache |
| Reduced TLB misses | Many PTEs point to same frame; TLB can reuse entries |
| Symmetry with COW | Same mechanism as COW (read-only + fault on write) |
The Zero Page in Linux:
Linux uses a special page called ZERO_PAGE (or empty_zero_page on x86). It's a single 4KB page of zeros, mapped read-only into any process that has ZFOD mappings. The page is never modified and is initialized once at boot.
```asm
/* In the Linux kernel (arch/x86/kernel/head_64.S, simplified) */
	.balign PAGE_SIZE
ENTRY(empty_zero_page)
	.fill 4096, 1, 0	/* 4096 bytes of zero */
END(empty_zero_page)
```
For transparent huge pages (THP), Linux extends this concept to huge zero pages—2MB pages of zeros shared among processes. This preserves the ZFOD benefit at huge page granularity, though the first-write cost is higher (must allocate and zero 2MB instead of 4KB).
Let's trace through the exact sequence of events for a ZFOD allocation, from userspace request to physical memory:
```c
// Tracing ZFOD from malloc to physical allocation
// Each step annotated with kernel behavior

#include <stdlib.h>
#include <string.h>

int main() {
    // Step 1: malloc requests 1 GB of memory
    // ----------------------------------------
    // - malloc() calls mmap(NULL, 1GB, PROT_READ|PROT_WRITE,
    //                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
    // - Kernel creates a VMA for the range
    // - Kernel sets up page table entries pointing to ZERO_PAGE (read-only)
    //   OR leaves PTEs as "not present" (implementation varies)
    // - Physical memory allocated: 0 bytes
    // - Time: ~microseconds
    char *huge = malloc(1UL << 30);  // 1 GB
    // At this point: 1GB virtual, ~0 physical

    // Step 2: Read from the allocation
    // ----------------------------------------
    // - CPU accesses virtual address
    // - If PTE points to zero page: TLB hit or soft miss, reads zero
    // - If PTE is "not present": page fault triggers
    //   - Kernel maps the zero page read-only
    //   - Returns to userspace
    // - Physical memory allocated: still 0 (zero page is shared)
    char val = huge[0];  // Reads 0, no private allocation yet
    (void)val;           // val == 0

    // Step 3: Write to the allocation
    // ----------------------------------------
    // - CPU executes store instruction
    // - PTE is read-only (points to zero page)
    // - Hardware raises protection fault (write to read-only)
    // - Kernel page fault handler:
    //   1. Identifies as ZFOD fault (write to zero-page-mapped address)
    //   2. Allocates a new physical frame
    //   3. Zeros the frame (or copies from zero page)
    //   4. Updates PTE: new frame, read-write permission
    //   5. Returns to userspace
    // - CPU re-executes store instruction, succeeds
    // - Physical memory allocated: 4 KB
    huge[0] = 'x';  // Triggers ZFOD allocation of one page

    // Step 4: Write to another page
    // ----------------------------------------
    // - Process repeats for each page on first write
    // - Each 4KB page is allocated independently
    // - Physical allocation grows as more pages are touched
    huge[4096] = 'y';  // Another page: now 8 KB physical
    huge[8192] = 'z';  // Another page: now 12 KB physical

    // Step 5: Bulk initialization
    // ----------------------------------------
    // - memset touches every page in the range
    // - Each page triggers ZFOD fault on first write
    // - After memset: entire 1 GB is physically allocated
    memset(huge, 0, 1UL << 30);  // ~262,144 ZFOD faults (1GB / 4KB)!
    // Now: 1 GB physical (plus overhead)

    return 0;
}

// Memory usage timeline:
//   After malloc():    Virtual=1GB, Physical=~0
//   After read:        Virtual=1GB, Physical=~0 (zero page shared)
//   After first write: Virtual=1GB, Physical=4KB
//   After more writes: Physical grows proportionally
//   After memset:      Virtual=1GB, Physical=1GB
```

| Step | Component | Action |
|---|---|---|
| 1 | CPU | Execute store instruction to virtual address |
| 2 | TLB | Miss (no entry) or hit with read-only permission |
| 3 | MMU | Walk page table, find read-only or not-present PTE |
| 4 | MMU | Raise page fault exception (protection or not-present) |
| 5 | CPU | Switch to kernel mode, invoke page fault handler |
| 6 | Kernel | Look up VMA for faulting address |
| 7 | Kernel | Determine fault type: ZFOD (anonymous, first touch) |
| 8 | Kernel | Allocate physical frame from page allocator |
| 9 | Kernel | Zero the frame (or copy from zero page) |
| 10 | Kernel | Create/update PTE: frame number, RW, present |
| 11 | Kernel | Invalidate TLB entry if needed |
| 12 | CPU | Return from exception, re-execute store instruction |
| 13 | CPU | Store succeeds, normal execution continues |
Initializing large allocations (e.g., memset on 1GB) triggers a ZFOD fault for each page—over 250,000 faults for 1GB with 4KB pages. This can be slow! Some allocators prefault memory for large allocations, or applications use mmap(MAP_POPULATE) to force immediate allocation.
The interaction between userspace memory allocators (like glibc's malloc, jemalloc, or tcmalloc) and the kernel's ZFOD mechanism is crucial for understanding memory behavior:
```c
// calloc() can skip zeroing due to ZFOD

#include <stdlib.h>
#include <string.h>

// Naive calloc implementation (for illustration;
// real implementations also check nmemb * size for overflow)
void *naive_calloc(size_t nmemb, size_t size) {
    size_t total = nmemb * size;
    void *ptr = malloc(total);
    if (ptr) {
        memset(ptr, 0, total);  // Explicit zeroing - SLOW!
    }
    return ptr;
}

// Smart calloc implementation (like glibc)
void *smart_calloc(size_t nmemb, size_t size) {
    size_t total = nmemb * size;
    void *ptr = malloc(total);
    if (ptr) {
        // Check: is this memory fresh from mmap (ZFOD)?
        // If so, it's already zero - skip memset!
        // Glibc tracks whether the returned chunk comes from:
        //   a) Fresh mmap (ZFOD) - already zero, skip memset
        //   b) Recycled memory   - may have old data, must zero
        if (!is_fresh_mmap_memory(ptr)) {  // illustrative helper, not a real API
            memset(ptr, 0, total);
        }
        // For fresh mmap memory, skip memset entirely
    }
    return ptr;
}

// Benchmark: malloc + memset vs calloc
// For large, fresh allocations:
//
//   malloc(1GB) + memset(0): ~300ms  (~256K page faults + 1GB memset)
//   calloc(1, 1GB):          ~0.1ms  (mmap only, no page faults!)
//
// The calloc version is >3000x faster for large fresh allocations
// because it defers ZFOD until actual use.

// However, if you immediately use all the memory:
//   malloc(1GB) + memset(0) + use: ~300ms + work
//   calloc(1, 1GB) + use:          ~300ms (faults during use) + work
//
// Total work is similar, but calloc spreads the faults over use time.
```

For latency-sensitive applications, you can force immediate allocation:
- `mmap(..., MAP_POPULATE)`: kernel pre-faults all pages
- `madvise(MADV_WILLNEED)`: hint to the kernel to populate the range
- `memset(ptr, 0, size)`: touching every page forces allocation

This trades startup time for predictable runtime latency.
ZFOD and COW interact in interesting ways during fork(). Understanding this interaction is crucial for correctly predicting memory behavior after fork:
Case 1: Untouched ZFOD Pages During Fork
```
Before fork:
  Parent has 100 pages allocated via mmap()
  - 30 pages written (private, allocated)
  - 70 pages still ZFOD (mapped to zero page)

After fork:
  - 30 private pages become COW (shared between parent/child)
  - 70 ZFOD pages: still mapped to zero page in both!

Result: Both parent and child's ZFOD pages are already 'shared'
        via the zero page. No additional COW setup needed.
```
This is elegant: ZFOD pages are naturally shareable because they all point to the same zero page. Fork doesn't need to do anything special for them.
| Page Type | Before Fork | After Fork (both) | On Child Write |
|---|---|---|---|
| Untouched (ZFOD) | → Zero Page (RO) | → Zero Page (RO) | Allocate private, zeroed |
| Read-only Data | → Frame A (RO) | → Frame A (RO) | Regular COW to private copy |
| Modified Data | → Frame B (RW) | → Frame B (RO, COW) | COW to private copy |
Case 2: Sparse Allocation After Fork
```c
#include <stdlib.h>
#include <unistd.h>

char *big = malloc(1UL << 30);  // ZFOD: 1 GB virtual only
pid_t pid = fork();             // 1GB shared as zero page mappings
if (pid == 0) {
    // Child: touch 1MB
    for (int i = 0; i < 256; i++) {
        big[i * 4096] = 'x';    // Touch 256 pages
    }
    // Physical: 1MB private
} else {
    // Parent: touch a different 1MB
    for (int i = 256; i < 512; i++) {
        big[i * 4096] = 'y';    // Touch 256 pages
    }
    // Physical: 1MB private
}
// Total physical memory: ~2MB, not 2GB!
// Each process has 1MB private + shared zero page for the rest
```
This demonstrates how ZFOD enables extreme memory efficiency for sparse access patterns after fork.
ZFOD and COW together provide multiplicative memory savings. Fork doesn't copy ZFOD pages (already shared). Each process only allocates physical memory for pages it actually writes. For sparse workloads, the savings can be enormous—100+ processes might share 90% of their 'allocated' memory.
ZFOD enables a powerful but controversial feature: memory overcommit. The OS can promise more virtual memory than physically exists, betting that not all of it will be touched at once.
How Overcommit Works:
```
System has: 16 GB RAM + 4 GB swap = 20 GB total

Process A: malloc(20 GB) → ZFOD, touches 8 GB → 8 GB physical
Process B: malloc(20 GB) → ZFOD, touches 8 GB → 8 GB physical
Process C: malloc(20 GB) → ZFOD, touches 4 GB → 4 GB physical

Virtual allocated: 60 GB
Physical used:     20 GB (at capacity)
Everyone is happy (for now)

But if processes start touching more memory...
```
| Mode | Value | Behavior |
|---|---|---|
| Heuristic | 0 | Allow reasonable overcommit; reject obviously excessive requests. Default. |
| Always | 1 | Always allow any allocation, no matter how large. Maximum overcommit. |
| Strict | 2 | Never overcommit. Commit limit = swap + RAM × overcommit_ratio/100. Most conservative. |
The OOM Killer:
When overcommit fails—when processes actually try to use more memory than exists—the system faces a crisis. Memory cannot be manufactured. Linux's solution is the OOM (Out Of Memory) Killer: the kernel scores every process (its oom_score, based on memory usage and administrator adjustments), selects the worst candidate, and kills it to reclaim memory.
This is the downside of overcommit: processes that allocated memory in good faith may be killed because the OS made promises it couldn't keep.
```bash
#!/bin/bash
# Examining and controlling memory overcommit on Linux

# Check current overcommit setting
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic, 1 = always, 2 = strict

# Check overcommit ratio (for mode 2)
cat /proc/sys/vm/overcommit_ratio
# Default 50, meaning commit limit = swap + 50% of RAM

# Check current commit limit and usage
grep -i commit /proc/meminfo
# CommitLimit:  maximum memory that can be committed
# Committed_AS: currently committed address space

# Example output:
#   CommitLimit:    24645736 kB
#   Committed_AS:   18234512 kB

# Disable overcommit (strict mode)
echo 2 > /proc/sys/vm/overcommit_memory
echo 80 > /proc/sys/vm/overcommit_ratio   # commit limit = swap + 80% of RAM

# Check OOM score for a process
cat /proc/$(pgrep myapp)/oom_score
# Higher = more likely to be killed

# Protect a process from the OOM killer
echo -1000 > /proc/$(pgrep critical_app)/oom_score_adj
# -1000 = never kill, 1000 = kill first
```

For production servers, consider: (1) mode 2 with an appropriate ratio for predictability, (2) careful oom_score_adj to protect critical services, (3) monitoring Committed_AS vs CommitLimit, (4) alerting before OOM conditions develop. The default mode 0 is often too aggressive for servers where process death has real consequences.
Understanding ZFOD behavior requires distinguishing between virtual allocations and physical commitments. Several tools help examine this:
```bash
#!/bin/bash
# Observing ZFOD behavior

# Create test program
cat << 'EOF' > zfod_test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main() {
    printf("PID: %d\n", getpid());
    printf("Before malloc...\n");
    sleep(3);

    // Allocate 1GB (ZFOD)
    char *buf = malloc(1UL << 30);
    printf("After malloc (1GB)... VSZ increased, RSS minimal\n");
    sleep(3);

    // Touch 100MB
    memset(buf, 'x', 100 * 1024 * 1024);
    printf("After touching 100MB... RSS ~100MB\n");
    sleep(3);

    // Touch the remaining 924MB
    memset(buf + 100*1024*1024, 'y', 924*1024*1024);
    printf("After touching all... RSS ~1GB\n");
    sleep(60);
    return 0;
}
EOF
gcc zfod_test.c -o zfod_test

# Run in background
./zfod_test &
PID=$!

# Watch memory evolution
watch -n 1 "ps -p $PID -o pid,vsz,rss"

# Output evolution:
#                 VSZ      RSS
#   Before:       low      low
#   After malloc: ~1.05GB  ~2MB    ← Virtual huge, physical tiny!
#   After 100MB:  ~1.05GB  ~100MB  ← RSS grew with touches
#   After all:    ~1.05GB  ~1GB    ← RSS now matches allocation

# Examine /proc/[pid]/smaps for details
grep -A 15 "heap" /proc/$PID/smaps
# Shows Anonymous, Private_Clean, Private_Dirty

# Page fault counters
ps -p $PID -o maj_flt,min_flt
# Minor faults = ZFOD/COW faults (memory, no I/O)
# Major faults = pages read from disk

# Using perf to trace page faults
perf stat -e page-faults,minor-faults,major-faults ./zfod_test
```

| Metric | Source | Meaning |
|---|---|---|
| VSZ (Virtual Size) | ps, top | Total virtual address space (includes ZFOD) |
| RSS (Resident Set) | ps, top | Physical memory actually allocated |
| Anonymous | /proc/[pid]/smaps | Non-file-backed memory (heap, stack) |
| Private_Dirty | /proc/[pid]/smaps | Private pages modified (post-ZFOD/COW) |
| Minor Faults | ps, /proc/[pid]/stat | Page faults satisfied without I/O (ZFOD+COW) |
| Committed_AS | /proc/meminfo | Total committed address space system-wide |
A large gap between VSZ and RSS indicates ZFOD at work: memory is allocated virtually but not yet physically committed. This is normal and often desirable. However, watch Committed_AS at the system level—if it approaches CommitLimit, OOM risk increases regardless of current RSS.
Let's consolidate our understanding of Zero-Fill on Demand and bring together the entire Copy-on-Write module:
Module Summary: Copy-on-Write
We've covered Copy-on-Write comprehensively: how fork() shares pages read-only and copies only on write, the page-fault mechanics that make the sharing safe, Zero-Fill on Demand and the shared zero page for fresh allocations, and the overcommit policies (and OOM killer) that these lazy techniques enable.
Together, these techniques represent a masterclass in systems design: using the indirection of virtual memory to defer work, providing the illusion of unlimited resources from finite hardware, and paying costs only when benefits are actually consumed.
You now have a deep understanding of Copy-on-Write and Zero-Fill on Demand—the lazy evaluation techniques that make modern memory management efficient. These principles apply beyond OS kernels: databases, containers, version control systems, and many other systems use similar ideas. The next time you see 'instant' operations on large data, you'll know the lazy magic behind the scenes.