When you call mmap() on a 10GB file, something remarkable happens—or rather, doesn't happen. The kernel creates data structures describing the mapping, returns a pointer, and your application resumes execution in microseconds. No disk I/O. No memory allocation. No data copying. A 10GB file is "loaded" in the time it takes to read a few memory locations.
This seemingly impossible feat is achieved through lazy loading (also called demand paging when applied to virtual memory)—a fundamental principle that pervades modern operating system design. Rather than doing work upfront, the system defers work until the last possible moment: when you actually try to use the data.
void *map = mmap(NULL, 10737418240, PROT_READ, MAP_PRIVATE, fd, 0); // 10GB
// Returns instantly! No 10GB of RAM allocated, no disk reads
char first_byte = ((char *)map)[0]; // NOW the kernel does work:
// - Allocates a physical page
// - Reads 4KB from disk
// - Maps it into your address space
// - Returns the byte
This lazy approach isn't just an optimization—it's the fundamental mechanism that makes virtual memory practical. Without it, computers couldn't run programs larger than physical RAM, couldn't share libraries efficiently between processes, and couldn't provide the illusion of abundant memory to applications.
This page explores lazy loading in comprehensive depth—the page fault mechanism, what triggers kernel intervention, read-ahead optimizations, working set dynamics, and strategies for optimizing memory-mapped access patterns. You'll gain the understanding needed to predict and tune memory-mapped I/O performance.
A page fault is a CPU exception that occurs when a program accesses a virtual address that isn't currently mapped to physical memory. Far from being an error (despite the name), page faults are the mechanism that enables lazy loading.
The CPU's Perspective:
When your program executes an instruction that accesses memory, the CPU's Memory Management Unit (MMU) translates the virtual address to a physical address using the page table. If the page table entry indicates "not present," the CPU raises a page fault exception and transfers control to the kernel's fault handler, giving it the faulting address (on x86, via the CR2 register) and an error code describing the access.
Types of Page Faults:
Page faults come in three categories, with dramatically different performance implications:
| Type | Cause | Resolution | Cost |
|---|---|---|---|
| Minor (Soft) Fault | Page in memory but not in process's page table | Update page table entry only | ~1-10 microseconds |
| Major (Hard) Fault | Page not in memory, must be fetched from disk | Disk I/O + page table update | ~1-10 milliseconds (1000x slower) |
| Invalid Fault | Access to unmapped address or permission violation | Deliver SIGSEGV/SIGBUS | Process termination |
Minor faults are the fast path. They occur when:
- The page is already in the page cache—for example, loaded by read-ahead or mapped by another process sharing the same file
- An anonymous page needs only a fresh zero-filled frame (or the shared zero page)
- A copy-on-write fault hits a page that is already resident

Major faults are the slow path. They require actual disk I/O: the kernel must allocate a fresh page, submit a read request to the storage device, and put the faulting thread to sleep until the data arrives.
For memory-mapped files, the goal is to maximize minor faults and minimize major faults by keeping your working set in the page cache.
The difference between a minor and major fault is dramatic—roughly 1,000 to 10,000 times slower. A minor fault involves only CPU and memory operations (microseconds). A major fault requires disk I/O (milliseconds). For an SSD, this might be 50-100 microseconds; for a spinning disk, 5-10 milliseconds. Your access pattern's ratio of major to minor faults dominates overall performance.
When a page fault occurs, the kernel must quickly determine what to do. The page fault handler is one of the most performance-critical paths in the kernel—it's invoked millions of times per second on a busy system.
Linux's do_page_fault() Function:
On Linux, the main entry point is architecture-specific (e.g., do_page_fault() on x86), which delegates to the generic handle_mm_fault(). Here's the conceptual flow:
// Simplified pseudocode
void handle_page_fault(unsigned long address, unsigned int error_code) {
    struct mm_struct *mm = current->mm; // Process's memory descriptor

    // Step 1: Find the VMA containing this address
    struct vm_area_struct *vma = find_vma(mm, address);
    if (!vma || address < vma->vm_start) {
        // No mapping exists at this address
        // -> Deliver SIGSEGV
        return do_segfault(address);
    }

    // Step 2: Check access permissions
    if ((error_code & PF_WRITE) && !(vma->vm_flags & VM_WRITE)) {
        // Write to read-only region
        // -> Deliver SIGSEGV
        return do_segfault(address);
    }

    // Step 3: Handle the fault based on VMA type
    return handle_mm_fault(vma, address, error_code);
}
VMA-Specific Fault Handling:
Each VMA has a pointer to a vm_operations_struct containing fault handlers appropriate for its type:
// Each VMA points to operations appropriate for its type
struct vm_operations_struct {
    // Called when this VMA is created
    void (*open)(struct vm_area_struct *);
    // Called when this VMA is destroyed
    void (*close)(struct vm_area_struct *);
    // THE KEY: Called to handle page faults
    vm_fault_t (*fault)(struct vm_fault *vmf);
    // Called for huge page faults
    vm_fault_t (*huge_fault)(struct vm_fault *vmf, ...);
    // Called to map pages (batch fault)
    vm_fault_t (*map_pages)(struct vm_fault *vmf, ...);
    // Called when page is about to be written
    vm_fault_t (*page_mkwrite)(struct vm_fault *vmf);
};

// For file-backed mappings, this typically points to
// filemap_fault() which retrieves pages from the page cache

// For anonymous mappings, this handles zero-page allocation
// and copy-on-write logic

File-Backed Fault Handling (filemap_fault):
When you fault on a memory-mapped file, the kernel's filemap_fault() function handles it:
Calculate file position: offset = (faulting_address - vma->vm_start) + vma->vm_pgoff * PAGE_SIZE
Search page cache: Look for the page at (file_inode, offset) in the page cache
If found (minor fault): map the cached page into the process's page table and resume execution—no disk I/O is needed.
If not found (major fault): allocate a fresh page, issue a read request to the filesystem, block the faulting thread until the I/O completes, then insert the page into both the page cache and the page table.
Performance Implications:
The path through the page fault handler critically affects performance: even a minor fault costs an exception, a VMA lookup, and a page table update (microseconds), while a major fault adds a full disk round trip (milliseconds). The kernel mitigates per-page overhead with "fault-around" (the map_pages operation), which maps a batch of neighboring cached pages on a single fault.
Loading pages one at a time, each requiring a separate disk I/O, would be disastrously slow for sequential access. The kernel employs read-ahead to predict future page accesses and load them proactively.
The Read-Ahead Mechanism:
When the kernel detects sequential access patterns, it loads pages ahead of where you're currently accessing:
Access Pattern Detection:
Access page 0 → Load pages 0-3 (initial window: 4 pages)
Access page 1 → Already cached (read-ahead working!)
Access page 2 → Already cached
Access page 3 → Load pages 4-11 (window doubled: 8 pages)
Access page 4 → Already cached
...
Access page 11 → Load pages 12-27 (window grows: 16 pages)
The read-ahead window grows exponentially up to a maximum (often 128-256 KB by default), enabling very high sequential throughput.
On Linux, you can view and modify the read-ahead size using /sys/block/sdX/queue/read_ahead_kb. For workloads with known sequential patterns on fast storage, increasing this can improve throughput. For random access patterns, reducing it avoids wasting I/O bandwidth on unused pages.
Read-Ahead State Machine:
The kernel maintains read-ahead state per file mapping:
| State Variable | Purpose |
|---|---|
| ra_start | Starting offset of the current read-ahead window |
| ra_size | Current size of the read-ahead window |
| ra_async_size | Portion of window that triggers async read-ahead |
| prev_pos | Previous access position (for pattern detection) |
Synchronous vs. Asynchronous Read-Ahead:
Synchronous read-ahead: Triggered when you access a page not in cache. The kernel loads a batch of pages, and your process waits for the I/O to complete.
Asynchronous read-ahead: Triggered when you access a page within the "async" portion of the current window (typically the last quarter). The kernel initiates I/O for the next window without blocking your access—the pages you're accessing are already cached.
|--------- Current Window ----------|------ Next Window ------|
|-- Already Accessed --|-- Async ---|
^
| Access here triggers async read-ahead
for next window. You don't block;
current page is already cached.
How mmap() Interacts with Read-Ahead:
With mmap(), read-ahead is triggered by page faults, not system calls. The kernel can't always detect sequential patterns as easily because:
- It sees only the addresses that fault, one page at a time, rather than an explicit request like read(fd, buf, len)
- Pages already in the cache generate no faults at all, leaving gaps in the observed pattern
- Multiple threads faulting on the same mapping can interleave and obscure an otherwise sequential pattern
To help the kernel, you can use madvise() to declare your access pattern:
#include <sys/mman.h>
// Hint: I'll access sequentially - please read ahead aggressively
madvise(mapped_addr, length, MADV_SEQUENTIAL);
// Hint: I'll access randomly - don't bother with read-ahead
madvise(mapped_addr, length, MADV_RANDOM);
// Hint: I'll need this soon - start loading now
madvise(mapped_addr, length, MADV_WILLNEED);
// Hint: I'm done with this - feel free to reclaim
madvise(mapped_addr, length, MADV_DONTNEED);
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

// Process a large file with optimal read-ahead
void process_file_optimized(const char *path) {
    int fd = open(path, O_RDONLY);
    struct stat sb;
    fstat(fd, &sb);
    void *map = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (map == MAP_FAILED) {
        perror("mmap");
        return;
    }

    // Tell kernel we'll access sequentially
    // This maximizes read-ahead effectiveness
    if (madvise(map, sb.st_size, MADV_SEQUENTIAL) == -1) {
        perror("madvise SEQUENTIAL");
        // Non-fatal, continue anyway
    }

    // Process the file
    size_t sum = 0;
    unsigned char *bytes = (unsigned char *)map;
    for (size_t i = 0; i < (size_t)sb.st_size; i++) {
        sum += bytes[i];
    }
    printf("Sum of bytes: %zu\n", sum);
    munmap(map, sb.st_size);
}

// Pre-load data that will be needed soon
void preload_index(void *index_map, size_t index_size) {
    // Before we need the index, tell kernel to load it
    // This initiates I/O asynchronously
    madvise(index_map, index_size, MADV_WILLNEED);
    // Do other work while index loads in background
    // ...
    // By the time we access the index, it's likely cached
}

// Release data we no longer need
void release_processed_data(void *data_map, size_t offset, size_t length) {
    // After processing a section, hint that we're done
    // Kernel can prioritize these pages for reclaim
    madvise((char *)data_map + offset, length, MADV_DONTNEED);
}

The working set is the set of pages actually being used by your program at a given time. Understanding working set dynamics is crucial for optimizing memory-mapped file performance.
Definition:
Formally, the working set at time t with window w is the set of pages accessed in the time interval [t-w, t]. In practice, we care about:
- Whether the working set fits in available RAM—if it doesn't, the system thrashes
- How the working set shifts as the program moves between phases (e.g., index lookup vs. bulk scan)
- The major fault rate it produces, which is the directly observable symptom
The Page Cache and Memory Pressure:
File-backed mmap() pages reside in the kernel's page cache. Under memory pressure:
                  ┌─────────────────────────────────────┐
                  │          Physical Memory            │
                  │  ┌───────────────────────────────┐  │
                  │  │         Page Cache            │  │
 Memory-mapped ──►│  │    (File-backed pages)        │  │
 files            │  │    [Clean: reclaimable]       │  │
                  │  │    [Dirty: must sync first]   │  │
                  │  └───────────────────────────────┘  │
                  │  ┌───────────────────────────────┐  │
 Anonymous ──────►│  │      Anonymous Pages          │  │
 mappings,        │  │   (Heap, stack, MAP_ANON)     │  │
 malloc'd         │  │   [May be swapped out]        │  │
 memory           │  └───────────────────────────────┘  │
                  └─────────────────────────────────────┘
If your working set exceeds available RAM, pages are constantly being evicted and re-faulted. With file-backed mappings, this means continuous disk I/O—reading the same pages over and over. Performance collapses. Monitor your page fault rates: high major fault rates indicate your working set doesn't fit in memory.
Measuring Working Set and Page Faults:
On Linux, monitor page faults using:
# Per-process stats via /proc
cat /proc/<pid>/stat # Field 10: minor faults, Field 12: major faults
# Real-time monitoring
watch -n 1 'grep pgfault /proc/vmstat'
# Tool-based monitoring
perf stat -e page-faults ./your_program
# Detailed via sar
sar -B 1 # Page-in/out, fault rates
Strategies for Working Set Optimization:
- Process data in bounded chunks instead of touching the whole file at once
- Prefetch the next chunk with MADV_WILLNEED and release finished chunks with MADV_DONTNEED
- Pin truly critical pages with mlock() so they cannot be evicted
- Choose data layouts that keep hot fields together, shrinking the number of pages touched
#include <sys/mman.h>
#include <stdio.h>

// Example: Processing a large file in chunks to manage working set
#define CHUNK_SIZE (64 * 1024 * 1024) // 64 MB working set per chunk

void process_chunk(void *chunk, size_t size); // application-defined

void process_large_file_chunked(void *map, size_t file_size) {
    size_t offset = 0;
    while (offset < file_size) {
        size_t chunk_size = CHUNK_SIZE;
        if (offset + chunk_size > file_size) {
            chunk_size = file_size - offset;
        }
        void *chunk = (char *)map + offset;

        // Pre-load this chunk
        madvise(chunk, chunk_size, MADV_WILLNEED);

        // Process the chunk
        process_chunk(chunk, chunk_size);

        // Release this chunk - hint to kernel to reclaim these pages
        // This makes room in the page cache for the next chunk
        madvise(chunk, chunk_size, MADV_DONTNEED);

        offset += chunk_size;
    }
}

/*
 * Benefits of this pattern:
 * 1. Working set stays bounded at ~64 MB
 * 2. MADV_WILLNEED prefetches next chunk while processing current
 * 3. MADV_DONTNEED releases processed chunks, avoiding memory bloat
 * 4. Works even for files vastly larger than RAM
 */

Sometimes lazy loading isn't what you want. If you know you'll access every page of a mapping, loading them lazily incurs page fault overhead for each page. In such cases, you can request eager loading using MAP_POPULATE.
What MAP_POPULATE Does:
void *map = mmap(NULL, size, PROT_READ,
MAP_PRIVATE | MAP_POPULATE, // <-- Force pre-loading
fd, 0);
With MAP_POPULATE:
- The kernel faults in every page of the mapping during the mmap() call itself
- For file-backed mappings, the entire range is read from disk using efficient sequential I/O
- mmap() doesn't return until population completes—potentially seconds for a large file
- Subsequent accesses find the pages resident (until memory pressure evicts them)
Performance Profile Comparison:
| Aspect | Lazy (default) | Eager (MAP_POPULATE) |
|---|---|---|
| mmap() call time | Microseconds | May be seconds for large files |
| First access latency | Page fault per page | None |
| Memory usage pattern | Gradual growth as accessed | Full allocation upfront |
| CPU overhead | Page fault handling per page | None post-mmap() |
| I/O pattern | Potentially fragmented | Contiguous read-ahead |
| Unused pages | Never loaded (efficient) | Loaded anyway (wasteful) |
When to Use MAP_POPULATE:
- You know you'll touch (nearly) every page of the mapping
- You can pay the cost at startup but not during latency-sensitive operation
- The mapping comfortably fits in available RAM
Alternative: madvise() with MADV_WILLNEED:
For more control, you can combine lazy mmap() with explicit pre-faulting:
void *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
// Pre-fault asynchronously - returns immediately, I/O happens in background
madvise(map, size, MADV_WILLNEED);
// Do other initialization work while pages load...
initialize_other_subsystems();
// By the time you access the mapping, pages are likely cached
process(map, size);
This approach combines lazy mmap() with proactive loading, giving you:
- An mmap() call that still returns immediately
- Read-ahead I/O that proceeds in the background while you do other initialization
- Fine-grained control over which ranges are pre-faulted, and when
MAP_POPULATE loads pages into memory but doesn't prevent them from being reclaimed under memory pressure. If you need pages to stay resident (for real-time guarantees), use mlock() or mlockall() after mapping, or combine with MAP_LOCKED.
When memory-mapped file performance is poor, page faults are often the culprit. Here's how to investigate:
High-Level Monitoring:
# System-wide page fault statistics
vmstat 1
# Watch columns: si/so (swap), bi/bo (block I/O)
# High bi values during your workload indicate major faults
# Per-process page faults
pidstat -r 1 -p <pid>
# minflt/s: minor faults/sec (cheap)
# majflt/s: major faults/sec (expensive - these hurt!)
# Detailed breakdown
sar -B 1
# pgpgin/s, pgpgout/s: pages read from/written to disk
# fault/s, majflt/s: fault rates
Detailed Analysis with perf:
# Record page fault events
perf record -e page-faults,major-faults,minor-faults ./your_program
# Analyze results
perf report
# See which code paths trigger the most faults
perf annotate --symbol=your_function
#include <sys/mman.h>
#include <sys/resource.h>
#include <stdio.h>

// Self-monitoring page faults within your program
void print_page_fault_stats(const char *label) {
    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    printf("[%s] Minor faults: %ld, Major faults: %ld\n",
           label, usage.ru_minflt, usage.ru_majflt);
}

// Usage pattern: track faults during different phases
int main() {
    print_page_fault_stats("Before mmap");

    void *map = mmap(...); // map your file here; size is its length

    print_page_fault_stats("After mmap (no change expected)");

    // Access all pages
    volatile char c;
    for (size_t i = 0; i < size; i += 4096) {
        c = ((char *)map)[i]; // Touch each page
    }
    print_page_fault_stats("After first access");

    // Access again - should be all minor faults (or none at all)
    for (size_t i = 0; i < size; i += 4096) {
        c = ((char *)map)[i];
    }
    print_page_fault_stats("After second access");

    return 0;
}

Understanding mincore():
The mincore() system call lets you query which pages of a mapping are currently in memory:
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
void check_page_residency(void *map, size_t length) {
long page_size = sysconf(_SC_PAGESIZE);
size_t num_pages = (length + page_size - 1) / page_size;
unsigned char *vec = malloc(num_pages);
if (mincore(map, length, vec) == -1) {
perror("mincore");
return;
}
size_t resident = 0;
for (size_t i = 0; i < num_pages; i++) {
if (vec[i] & 1) resident++;
}
printf("Resident: %zu / %zu pages (%.1f%%)\n",
resident, num_pages, 100.0 * resident / num_pages);
free(vec);
}
Common Performance Problems:
| Symptom | Likely Cause | Solution |
|---|---|---|
| High major fault rate | Working set exceeds RAM | Reduce working set, add RAM, or use streaming approach |
| Major faults on re-access | Pages being reclaimed prematurely | Use mlock() for critical data, or reduce memory pressure |
| High minor fault rate | Many first-time accesses | Use MAP_POPULATE or MADV_WILLNEED |
| I/O wait despite cache hits | Read-ahead not effective | Use MADV_SEQUENTIAL for sequential access |
| Pages loaded but never used | Speculative loading waste | Use MADV_RANDOM, reduce read-ahead size |
When you combine MAP_PRIVATE with read-write access, lazy loading interacts with copy-on-write (COW) semantics:
Initial State: after the first read fault, your page table entry points at the shared page cache page, mapped read-only—even though you requested PROT_WRITE. No private copy exists yet.
On First Write: the CPU faults on the read-only entry; the kernel allocates a new page, copies the original's contents into it, maps the copy read-write into your page table, and leaves the page cache original untouched.
Timeline: MAP_PRIVATE with lazy loading

mmap() returned        Read page 0           Write to page 0
      |                     |                      |
      v                     v                      v
   [Empty]             [Read-only]           [Read-write]
                     from page cache         private copy,
                                           original unchanged
This Has Performance Implications:
For read-heavy workloads on MAP_PRIVATE mappings:
- Pages remain shared with the page cache, so memory cost is the same as MAP_SHARED
- Only read faults occur—minor faults whenever the data is already cached

For write-heavy workloads on MAP_PRIVATE mappings:
- Every page you modify costs an extra fault plus a full page copy
- Each written page consumes private anonymous memory on top of the page cache copy, roughly doubling the footprint
- Private copies are swap-backed rather than file-backed, making them more expensive to evict under memory pressure
Optimization: Write-Only Access Pattern:
If you're going to overwrite entire pages (not read-modify-write), consider:
// For pure overwrites, MAP_ANONYMOUS avoids reading the file at all
void *workspace = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Pages are zero-filled on first access (minor fault only)
// No file I/O involved
This is faster than mapping a file with MAP_PRIVATE when you don't need the original content.
On Linux, you can track copy-on-write page faults using perf events. 'perf stat -e page-faults,minor-faults,major-faults ./program' shows the breakdown. If minor faults are high but major faults are low, you're likely seeing COW copies of already-resident pages.
We've comprehensively explored lazy loading—the fundamental mechanism that makes memory-mapped files efficient and enables programs to work with datasets larger than physical memory.
What's Next:
With lazy loading understood, we'll explore shared mappings—how multiple processes can map the same file into their address spaces and see each other's changes. This is the foundation of shared-memory IPC, memory-mapped databases, and collaborative file access patterns.
You now deeply understand how lazy loading works in memory-mapped files—from page fault mechanics to read-ahead optimization to working set management. This knowledge enables you to predict, measure, and optimize the performance of memory-mapped applications.