When you allocate memory with malloc() or extend your stack, the operating system promises you a range of zeroed memory. But here's a secret that might surprise you: that memory doesn't physically exist when it's allocated. The OS has made you a promise it hasn't yet kept.
This technique is called Zero-Fill on Demand (ZFOD), and it's a close cousin to Copy-on-Write. Where COW defers copying until a write, ZFOD defers allocation and zeroing until a read or write actually occurs. Together, these lazy evaluation techniques form the foundation of efficient memory management in modern operating systems.
Understanding ZFOD is essential because it explains why your programs can allocate far more memory than exists, why memory usage numbers are often misleading, and how the OS creates the illusion of abundant resources from limited physical memory.
By the end of this page, you will understand how ZFOD works, why it exists, the zero page optimization, how it interacts with COW during fork(), and the implications for application design and memory overcommit.
Zero-Fill on Demand (ZFOD) is a memory allocation strategy where:

- Virtual address space is reserved immediately when requested
- No physical frame is allocated at that time
- A physical frame is allocated, and zeroed, only when the page is first touched
This is lazy allocation in its purest form: don't do any work until absolutely necessary.
Why Zero?
Fresh memory is zeroed for security and correctness: zeroing prevents a process from reading stale data left behind by another process (an information leak), and it gives programs a deterministic initial state to rely on.
| Stage | Virtual | Page Table | Physical | Usable? |
|---|---|---|---|---|
| Before malloc() | N/A | N/A | N/A | No |
| After malloc() | Allocated | Not present / Zero-page | None | Appears usable |
| First read | Allocated | Points to zero page | Zero page (shared) | Yes, reads zero |
| First write | Allocated | Points to private frame | Zeroed frame allocated | Yes, fully private |
ZFOD can work in stages: (1) Virtual-only with no PTE, (2) Read-only mapping to a shared zero page, (3) Private writable page on first write. This mirrors COW's lazy copying, applied to fresh allocations instead of inherited data.
Modern operating systems employ a particularly elegant optimization for ZFOD: the zero page. Instead of leaving fresh anonymous mappings unmapped (causing a fault on first access) or allocating zeroed frames immediately, the OS maps freshly allocated memory to a single, special, read-only page filled with zeros.
How the Zero Page Works:

- On allocation, the PTEs for the range map the shared zero page, read-only
- Reads hit the zero page and correctly return zero, with no private allocation
- A write triggers a protection fault; the kernel allocates a private zeroed frame and remaps the PTE read-write
Benefits of the Zero Page:
| Benefit | Explanation |
|---|---|
| Memory savings | Untouched allocations consume no physical memory (except shared zero page) |
| Time savings | No zeroing needed until write; zeroing happens on-demand, spread over time |
| Cache efficiency | Zero page often hot in cache; reading zeros doesn't pollute cache |
| Reduced TLB misses | Many PTEs point to same frame; TLB can reuse entries |
| Symmetry with COW | Same mechanism as COW (read-only + fault on write) |
The Zero Page in Linux:
Linux uses a special page called ZERO_PAGE (or empty_zero_page on x86). It's a single 4KB page of zeros, mapped read-only into any process that has ZFOD mappings. The page is never modified and is initialized once at boot.
```asm
/* In the Linux kernel (arch/x86/kernel/head_64.S, simplified) */
	.balign PAGE_SIZE
ENTRY(empty_zero_page)
	.fill 4096, 1, 0	/* 4096 bytes of zero */
END(empty_zero_page)
```
For transparent huge pages (THP), Linux extends this concept to huge zero pages—2MB pages of zeros shared among processes. This preserves the ZFOD benefit at huge page granularity, though the first-write cost is higher (must allocate and zero 2MB instead of 4KB).
Let's trace through the exact sequence of events for a ZFOD allocation, from userspace request to physical memory:
```c
// Tracing ZFOD from malloc to physical allocation
// Each step annotated with kernel behavior

#include <stdlib.h>
#include <string.h>

int main() {
    // Step 1: malloc requests 1 GB of memory
    // ----------------------------------------
    // - malloc() calls mmap(NULL, 1GB, PROT_READ|PROT_WRITE,
    //                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
    // - Kernel creates a VMA for the range
    // - Kernel sets up page table entries pointing to ZERO_PAGE (read-only)
    //   OR leaves PTEs as "not present" (implementation varies)
    // - Physical memory allocated: 0 bytes
    // - Time: ~microseconds
    char *huge = malloc(1UL << 30);  // 1 GB
    // At this point: 1GB virtual, ~0 physical

    // Step 2: Read from the allocation
    // ----------------------------------------
    // - CPU accesses virtual address
    // - If PTE points to zero page: TLB hit or soft miss, reads zero
    // - If PTE is "not present": page fault triggers
    //   - Kernel maps the zero page read-only
    //   - Returns to userspace
    // - Physical memory allocated: still 0 (zero page is shared)
    char val = huge[0];  // Reads 0, no private allocation yet
    (void)val;           // val == 0

    // Step 3: Write to the allocation
    // ----------------------------------------
    // - CPU executes store instruction
    // - PTE is read-only (points to zero page)
    // - Hardware raises protection fault (write to read-only)
    // - Kernel page fault handler:
    //   1. Identifies as ZFOD fault (write to zero-page-mapped address)
    //   2. Allocates a new physical frame
    //   3. Zeros the frame (or copies from zero page)
    //   4. Updates PTE: new frame, read-write permission
    //   5. Returns to userspace
    // - CPU re-executes store instruction, succeeds
    // - Physical memory allocated: 4 KB
    huge[0] = 'x';  // Triggers ZFOD allocation of one page

    // Step 4: Write to another page
    // ----------------------------------------
    // - Process repeats for each page on first write
    // - Each 4KB page is allocated independently
    // - Physical allocation grows as more pages are touched
    huge[4096] = 'y';  // Another page: now 8 KB physical
    huge[8192] = 'z';  // Another page: now 12 KB physical

    // Step 5: Bulk initialization
    // ----------------------------------------
    // - memset touches every page in the range
    // - Each page triggers ZFOD fault on first write
    // - After memset: entire 1 GB is physically allocated
    memset(huge, 0, 1UL << 30);  // ~262,144 ZFOD faults (1GB / 4KB)!
    // Now: 1 GB physical (plus overhead)

    return 0;
}

// Memory usage timeline:
//   After malloc():    Virtual=1GB, Physical=~0
//   After read:        Virtual=1GB, Physical=~0 (zero page shared)
//   After first write: Virtual=1GB, Physical=4KB
//   After more writes: Physical grows proportionally
//   After memset:      Virtual=1GB, Physical=1GB
```

| Step | Component | Action |
|---|---|---|
| 1 | CPU | Execute store instruction to virtual address |
| 2 | TLB | Miss (no entry) or hit with read-only permission |
| 3 | MMU | Walk page table, find read-only or not-present PTE |
| 4 | MMU | Raise page fault exception (protection or not-present) |
| 5 | CPU | Switch to kernel mode, invoke page fault handler |
| 6 | Kernel | Look up VMA for faulting address |
| 7 | Kernel | Determine fault type: ZFOD (anonymous, first touch) |
| 8 | Kernel | Allocate physical frame from page allocator |
| 9 | Kernel | Zero the frame (or copy from zero page) |
| 10 | Kernel | Create/update PTE: frame number, RW, present |
| 11 | Kernel | Invalidate TLB entry if needed |
| 12 | CPU | Return from exception, re-execute store instruction |
| 13 | CPU | Store succeeds, normal execution continues |
Initializing large allocations (e.g., memset on 1GB) triggers a ZFOD fault for each page—over 250,000 faults for 1GB with 4KB pages. This can be slow! Some allocators prefault memory for large allocations, or applications use mmap(MAP_POPULATE) to force immediate allocation.
The interaction between userspace memory allocators (like glibc's malloc, jemalloc, or tcmalloc) and the kernel's ZFOD mechanism is crucial for understanding memory behavior:
```c
// calloc() can skip zeroing due to ZFOD

#include <stdlib.h>
#include <string.h>

// Naive calloc implementation (for illustration;
// real implementations also check nmemb * size for overflow)
void *naive_calloc(size_t nmemb, size_t size) {
    size_t total = nmemb * size;
    void *ptr = malloc(total);
    if (ptr) {
        memset(ptr, 0, total);  // Explicit zeroing - SLOW!
    }
    return ptr;
}

// Smart calloc implementation (like glibc)
void *smart_calloc(size_t nmemb, size_t size) {
    size_t total = nmemb * size;
    void *ptr = malloc(total);
    if (ptr) {
        // Check: is this memory fresh from mmap (ZFOD)?
        // If so, it's already zero - skip memset!
        // Glibc tracks whether the returned chunk comes from:
        //   a) Fresh mmap (ZFOD) - already zero, skip memset
        //   b) Recycled memory   - may have old data, must zero
        if (!is_fresh_mmap_memory(ptr)) {  // illustrative helper, not a real API
            memset(ptr, 0, total);
        }
        // For fresh mmap memory, skip memset entirely
    }
    return ptr;
}

// Benchmark: malloc + memset vs calloc
// For large, fresh allocations:
//
//   malloc(1GB) + memset(0): ~300ms  (~256K page faults + 1GB memset)
//   calloc(1, 1GB):          ~0.1ms  (mmap only, no page faults!)
//
// The calloc version is >3000x faster for large fresh allocations
// because it defers ZFOD until actual use.

// However, if you immediately use all the memory:
//   malloc(1GB) + memset(0) + use: ~300ms + work
//   calloc(1, 1GB) + use:          ~300ms (faults during use) + work
//
// Total work is similar, but calloc spreads the faults over use time.
```

For latency-sensitive applications, you can force immediate allocation:
- `mmap(..., MAP_POPULATE)`: kernel pre-faults all pages
- `madvise(MADV_WILLNEED)`: hint to the kernel to populate the range
- `memset(ptr, 0, size)`: touching every page forces allocation

This trades startup time for predictable runtime latency.
ZFOD and COW interact in interesting ways during fork(). Understanding this interaction is crucial for correctly predicting memory behavior after fork:
Case 1: Untouched ZFOD Pages During Fork
```
Before fork:
  Parent has 100 pages allocated via mmap()
  - 30 pages written (private, allocated)
  - 70 pages still ZFOD (mapped to zero page)

After fork:
  - 30 private pages become COW (shared between parent/child)
  - 70 ZFOD pages: still mapped to zero page in both!

Result: Both parent and child's ZFOD pages are already 'shared'
        via the zero page. No additional COW setup needed.
```
This is elegant: ZFOD pages are naturally shareable because they all point to the same zero page. Fork doesn't need to do anything special for them.
| Page Type | Before Fork | After Fork (both) | On Child Write |
|---|---|---|---|
| Untouched (ZFOD) | → Zero Page (RO) | → Zero Page (RO) | Allocate private, zeroed |
| Read-only Data | → Frame A (RO) | → Frame A (RO) | Regular COW to private copy |
| Modified Data | → Frame B (RW) | → Frame B (RO, COW) | COW to private copy |
Case 2: Sparse Allocation After Fork
```c
#include <stdlib.h>
#include <unistd.h>

char *big = malloc(1UL << 30);  // ZFOD: 1 GB virtual only
pid_t pid = fork();             // 1GB shared as zero page mappings
if (pid == 0) {
    // Child: touch 1MB
    for (int i = 0; i < 256; i++) {
        big[i * 4096] = 'x';    // Touch 256 pages
    }
    // Physical: 1MB private
} else {
    // Parent: touch a different 1MB
    for (int i = 256; i < 512; i++) {
        big[i * 4096] = 'y';    // Touch 256 pages
    }
    // Physical: 1MB private
}
// Total physical memory: ~2MB, not 2GB!
// Each process has 1MB private + shared zero page for the rest
```
This demonstrates how ZFOD enables extreme memory efficiency for sparse access patterns after fork.
ZFOD and COW together provide multiplicative memory savings. Fork doesn't copy ZFOD pages (already shared). Each process only allocates physical memory for pages it actually writes. For sparse workloads, the savings can be enormous—100+ processes might share 90% of their 'allocated' memory.
ZFOD enables a powerful but controversial feature: memory overcommit. The OS can promise more virtual memory than physically exists, betting that not all of it will be touched at once.
How Overcommit Works:
```
System has: 16 GB RAM + 4 GB swap = 20 GB total

Process A: malloc(20 GB) → ZFOD, touches 8 GB → 8 GB physical
Process B: malloc(20 GB) → ZFOD, touches 8 GB → 8 GB physical
Process C: malloc(20 GB) → ZFOD, touches 4 GB → 4 GB physical

Virtual allocated: 60 GB
Physical used:     20 GB (at capacity)
Everyone is happy (for now)

But if processes start touching more memory...
```
| Mode | Value | Behavior |
|---|---|---|
| Heuristic | 0 | Allow reasonable overcommit; reject obviously excessive requests. Default. |
| Always | 1 | Always allow any allocation, no matter how large. Maximum overcommit. |
| Strict | 2 | Never overcommit. Commit limit = swap + RAM × overcommit_ratio/100. Most conservative. |
The OOM Killer:
When overcommit fails—when processes actually try to use more memory than exists—the system faces a crisis. Memory cannot be manufactured. Linux's solution is the OOM (Out Of Memory) Killer: the kernel scores every process (its oom_score, based on memory usage and administrator adjustments), selects the worst candidate, and kills it to reclaim memory.
This is the downside of overcommit: processes that allocated memory in good faith may be killed because the OS made promises it couldn't keep.
```bash
#!/bin/bash
# Examining and controlling memory overcommit on Linux

# Check current overcommit setting
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic, 1 = always, 2 = strict

# Check overcommit ratio (for mode 2)
cat /proc/sys/vm/overcommit_ratio
# Default 50, meaning commit limit = swap + 50% of RAM

# Check current commit limit and usage
grep -i commit /proc/meminfo
# CommitLimit:  maximum memory that can be committed
# Committed_AS: currently committed address space

# Example output:
#   CommitLimit:    24645736 kB
#   Committed_AS:   18234512 kB

# Disable overcommit (strict mode)
echo 2 > /proc/sys/vm/overcommit_memory
echo 80 > /proc/sys/vm/overcommit_ratio   # commit limit = swap + 80% of RAM

# Check OOM score for a process
cat /proc/$(pgrep myapp)/oom_score
# Higher = more likely to be killed

# Protect a process from the OOM killer
echo -1000 > /proc/$(pgrep critical_app)/oom_score_adj
# -1000 = never kill, 1000 = kill first
```

For production servers, consider: (1) mode 2 with an appropriate ratio for predictability, (2) careful oom_score_adj to protect critical services, (3) monitoring Committed_AS vs CommitLimit, (4) alerting before OOM conditions develop. The default mode 0 is often too aggressive for servers where process death has real consequences.
Understanding ZFOD behavior requires distinguishing between virtual allocations and physical commitments. Several tools help examine this:
```bash
#!/bin/bash
# Observing ZFOD behavior

# Create test program
cat << 'EOF' > zfod_test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main() {
    printf("PID: %d\n", getpid());
    printf("Before malloc...\n");
    sleep(3);

    // Allocate 1GB (ZFOD)
    char *buf = malloc(1UL << 30);
    printf("After malloc (1GB)... VSZ increased, RSS minimal\n");
    sleep(3);

    // Touch 100MB
    memset(buf, 'x', 100 * 1024 * 1024);
    printf("After touching 100MB... RSS ~100MB\n");
    sleep(3);

    // Touch the remaining 924MB
    memset(buf + 100*1024*1024, 'y', 924*1024*1024);
    printf("After touching all... RSS ~1GB\n");
    sleep(60);
    return 0;
}
EOF
gcc zfod_test.c -o zfod_test

# Run in background
./zfod_test &
PID=$!

# Watch memory evolution
watch -n 1 "ps -p $PID -o pid,vsz,rss"

# Output evolution:
#                 VSZ      RSS
#   Before:       low      low
#   After malloc: ~1.05GB  ~2MB    ← Virtual huge, physical tiny!
#   After 100MB:  ~1.05GB  ~100MB  ← RSS grew with touches
#   After all:    ~1.05GB  ~1GB    ← RSS now matches allocation

# Examine /proc/[pid]/smaps for details
grep -A 15 "heap" /proc/$PID/smaps
# Shows Anonymous, Private_Clean, Private_Dirty

# Page fault counters
ps -p $PID -o maj_flt,min_flt
# Minor faults = ZFOD/COW faults (memory, no I/O)
# Major faults = pages read from disk

# Using perf to trace page faults
perf stat -e page-faults,minor-faults,major-faults ./zfod_test
```

| Metric | Source | Meaning |
|---|---|---|
| VSZ (Virtual Size) | ps, top | Total virtual address space (includes ZFOD) |
| RSS (Resident Set) | ps, top | Physical memory actually allocated |
| Anonymous | /proc/[pid]/smaps | Non-file-backed memory (heap, stack) |
| Private_Dirty | /proc/[pid]/smaps | Private pages modified (post-ZFOD/COW) |
| Minor Faults | ps, /proc/[pid]/stat | Page faults satisfied without I/O (ZFOD+COW) |
| Committed_AS | /proc/meminfo | Total committed address space system-wide |
A large gap between VSZ and RSS indicates ZFOD at work: memory is allocated virtually but not yet physically committed. This is normal and often desirable. However, watch Committed_AS at the system level—if it approaches CommitLimit, OOM risk increases regardless of current RSS.
Let's consolidate our understanding of Zero-Fill on Demand and bring together the entire Copy-on-Write module:
Module Summary: Copy-on-Write
We've covered Copy-on-Write comprehensively: how fork() shares pages read-only and copies only on write, the page-fault mechanics that make the sharing safe, Zero-Fill on Demand and the shared zero page for fresh allocations, and the overcommit policies (and OOM killer) that these lazy techniques enable.
Together, these techniques represent a masterclass in systems design: using the indirection of virtual memory to defer work, providing the illusion of unlimited resources from finite hardware, and paying costs only when benefits are actually consumed.
You now have a deep understanding of Copy-on-Write and Zero-Fill on Demand—the lazy evaluation techniques that make modern memory management efficient. These principles apply beyond OS kernels: databases, containers, version control systems, and many other systems use similar ideas. The next time you see 'instant' operations on large data, you'll know the lazy magic behind the scenes.