In our exploration of Copy-on-Write, we established a fundamental principle: multiple processes can share the same physical memory frames until one of them needs to modify the data. But this raises immediate practical questions: How does the operating system know which frames are shared? How does it track how many processes reference each frame? What data structures make this efficient?
This page dives deep into the mechanics of page sharing—the bookkeeping, data structures, and algorithms that transform COW from an elegant idea into a working system. Understanding shared pages is essential because this infrastructure underpins not just fork(), but also shared libraries, memory-mapped files, and inter-process communication.
By the end of this page, you will understand how the OS tracks page sharing, the critical role of reference counting, the data structures used to map between virtual and physical memory, and how these mechanisms interact during fork(), exec(), and exit().
When multiple page table entries point to the same physical frame, the OS needs to track this relationship for several reasons:
1. Knowing When to Copy: When a process writes to a COW-protected page, the OS must determine whether a copy is needed. If the reference count is 1 (sole owner), no copy is needed—just mark the page writable. If count > 1, a copy must be made.
2. Safe Frame Deallocation: When a process exits or unmaps memory, the OS must know whether to free the physical frame. A frame can only be freed when its reference count reaches zero. Freeing a shared frame would corrupt other processes.
3. Memory Accounting: To enforce memory limits and report accurate usage, the OS must distinguish between private and shared memory. Shared pages shouldn't be double-counted across processes.
4. Page-Out Decisions: When selecting pages to swap out, the kernel considers reference counts. Swapping a highly-shared page affects many processes, which may be undesirable.
5. Page Table Updates: When a shared page is swapped out or relocated, all referencing page tables must be updated. This requires tracking all PTEs pointing to each frame.
| Information Needed | Why It's Needed | Where It's Stored |
|---|---|---|
| Reference count | Determine if copy needed, safe to free | Page frame descriptor |
| PTE list (reverse mapping) | Update all PTEs when frame moves | Reverse mapping structure |
| Sharing type | Distinguish COW vs. true shared | PTE flags or VMA |
| Owning address spaces | Memory accounting, limits | VMA and mm_struct |
| Dirty/clean status | Writeback decisions | Page frame flags |
Maintaining sharing metadata consumes memory and CPU cycles. The OS must carefully design these structures to minimize overhead while providing the information needed for correct operation. Every fork() increments reference counts; every write may trigger lookups; every exit traverses PTEs. This bookkeeping is the hidden cost of COW's benefits.
Reference counting is the foundational mechanism for tracking page sharing. Each physical frame has an associated count indicating how many page table entries reference it. The rules are deceptively simple:
Basic Operations:
1. Share: When a new PTE is made to point at a frame (for example, during fork()), increment the frame's count.
2. Unshare: When a PTE is removed (unmap, exit, or after a COW copy), decrement the count.
3. Free: When the count reaches zero, nothing references the frame and it can be returned to the free list.
4. Copy decision: On a COW write fault, copy only if the count is greater than 1; a sole owner's page is simply marked writable.
These rules look simple, but real-world reference counting must handle several complications: counts must be updated atomically because multiple CPUs can fault and fork concurrently, and the kernel must separate total references from PTE mappings (Linux's _refcount versus _mapcount, shown in the code below):
```c
// Simplified representation of page frame reference counting
// Inspired by the Linux kernel's struct page and page_count() mechanism

// Each physical frame has a descriptor (very simplified)
struct page {
    atomic_t _refcount;             // Total reference count (kernel + mappings)
    atomic_t _mapcount;             // PTE mappings, biased: -1 means unmapped
    unsigned long flags;            // Page state flags
    struct address_space *mapping;  // File mapping (if file-backed)
    pgoff_t index;                  // Offset within mapping
    struct list_head lru;           // For page replacement lists
    // ... many more fields in the real kernel
};

// Get the current reference count
static inline int page_count(struct page *page)
{
    return atomic_read(&page->_refcount);
}

// Get the number of PTEs mapping this page (undo the -1 bias)
static inline int page_mapcount(struct page *page)
{
    return atomic_read(&page->_mapcount) + 1;
}

// Increment the reference count
static inline void get_page(struct page *page)
{
    atomic_inc(&page->_refcount);
}

// Decrement the reference count, free if it reaches zero
static inline void put_page(struct page *page)
{
    if (atomic_dec_and_test(&page->_refcount)) {
        // Reference count hit zero - free the page
        __free_page(page);
    }
}

// Increment the map count (called when creating a PTE pointing to this page)
static inline void page_add_anon_rmap(struct page *page,
                                      struct vm_area_struct *vma,
                                      unsigned long address)
{
    atomic_inc(&page->_mapcount);
    // Also add to the reverse mapping for this VMA
    // (details omitted - uses anon_vma structures)
}

// Decrement the map count (called when removing a PTE)
static inline void page_remove_rmap(struct page *page)
{
    if (atomic_add_negative(-1, &page->_mapcount)) {
        // Page is no longer mapped by any PTE (mapcount back to -1)
        // It may still have kernel references (page_count > 0)
    }
}

// Example: handling COW-fault reference counting
int handle_cow_fault(struct vm_area_struct *vma, struct page *old_page,
                     pte_t *pte, unsigned long address)
{
    struct page *new_page;

    // Check if we're the sole owner
    if (page_mapcount(old_page) == 1) {
        // Sole owner - just make the page writable, no copy needed
        return make_page_writable(vma, address);
    }

    // Multiple owners - need to copy
    new_page = alloc_page(GFP_HIGHUSER);
    if (!new_page)
        return -ENOMEM;

    // Copy the page contents
    copy_page(page_address(new_page), page_address(old_page));

    // Account the new page in the reverse map
    page_add_anon_rmap(new_page, vma, address);

    // Update the PTE to point at the new, writable page
    set_pte_at(vma->vm_mm, address, pte,
               pte_mkwrite(mk_pte(new_page, vma->vm_page_prot)));

    // Drop our mapping and reference on the old page
    page_remove_rmap(old_page);
    put_page(old_page);

    return 0;
}
```

Linux distinguishes between _refcount (total references, including kernel users) and _mapcount (PTE mappings only). A page might have mapcount = 0 but refcount > 0 if the kernel is using it for I/O or caching. This distinction is crucial for correct page lifecycle management.
The operating system maintains metadata for every physical page frame in the system. In Linux, this is the famous struct page (or in modern code, struct folio for compound pages). This structure is the nerve center for shared page management.
The Page Descriptor Array:
The kernel maintains an array of page descriptors, one for each physical page frame in the system. Given a physical frame number, the kernel can index directly into this array to find its descriptor. Conversely, given a descriptor, the kernel can compute the frame number.
| Field | Purpose | Sharing Relevance |
|---|---|---|
| _refcount | Total reference count | When 0, frame can be freed |
| _mapcount | PTE mapping count | Determines if COW copy needed |
| flags | Page state bits | Dirty, locked, uptodate, etc. |
| mapping | Pointer to address_space | File mapping or anon_vma |
| index | Offset in mapping | Position in file or swap |
| lru | LRU list pointers | Page replacement tracking |
| private | FS-specific data | Buffer heads, etc. |
Memory Overhead of Page Descriptors:
Each struct page in Linux is approximately 64 bytes on x86-64. For a system with 64GB of RAM and 4KB pages, there are 16 million page frames:
16,000,000 frames × 64 bytes = 1 GB of descriptor overhead
This ~1.5% overhead is the price of flexible memory management. The kernel places these descriptors in a dedicated region at boot time, ensuring they're always accessible without page faults.
On NUMA systems with non-contiguous physical memory, the simple array model breaks down. Linux uses SPARSEMEM or SPARSEMEM_VMEMMAP models where the page descriptor array is not physically contiguous but appears so through virtual mapping. This adds complexity but maintains the logical simplicity of array indexing.
Reference counting tells us how many PTEs reference a page, but not which PTEs. For many operations—particularly unmapping shared pages for swapout or migration—the kernel needs to find and update all page table entries pointing to a given frame. This is the job of reverse mapping (rmap).
The Reverse Mapping Problem:
Given a physical page frame, find all (process, virtual address) pairs that map it.
This is challenging because the normal lookup direction is reversed: page tables efficiently answer "virtual address → physical frame," but nothing in that forward structure answers "physical frame → which virtual addresses, in which processes."
Why Reverse Mapping Matters:
Without rmap, the only way to find every PTE referencing a frame would be to scan every page table in the system. Rmap makes targeted operations practical: swapping a page out requires clearing every PTE that maps it, and migrating or relocating a frame requires repointing them all.
Linux Reverse Mapping Implementation:
Linux uses different rmap strategies for different page types:
Anonymous Pages (COW-relevant):
Anonymous pages (heap, stack, COW copies) use anon_vma structures. When a page is first created or during fork(), it's linked to an anon_vma that tracks all VMAs (Virtual Memory Areas) that might map it. Walking the rmap involves:
1. Looking up the page's anon_vma and taking its read lock.
2. Walking the interval tree of anon_vma_chains to find every VMA that could contain the page.
3. Computing the page's virtual address within each candidate VMA.
4. Visiting that PTE (to unmap, update, or inspect it).
File-Backed Pages:
Pages from file mappings use the file's address_space. Each file has an interval tree of VMAs mapping it. Rmap walks this tree to find all mappings of a given page offset.
```c
// Simplified reverse mapping traversal for anonymous pages
// (Real Linux code is significantly more complex)

// Structure linking VMAs that might share anonymous pages
struct anon_vma {
    struct anon_vma *root;           // Root of the anon_vma tree
    struct rb_root_cached rb_root;   // Interval tree of anon_vma_chains
    atomic_t refcount;
    spinlock_t lock;
};

// Chain linking a VMA to its anon_vma
struct anon_vma_chain {
    struct vm_area_struct *vma;
    struct anon_vma *anon_vma;
    struct list_head same_vma;       // List of all chains for this VMA
    struct rb_node rb;               // Node in the anon_vma's interval tree
};

// Traverse all PTEs mapping a given page
int try_to_unmap(struct page *page, enum ttu_flags flags)
{
    struct anon_vma *anon_vma;
    struct anon_vma_chain *avc;
    int ret = 0;

    // Get the anon_vma for this page
    anon_vma = page_get_anon_vma(page);
    if (!anon_vma)
        return SWAP_SUCCESS;         // No mappings

    // Lock to prevent concurrent modification
    anon_vma_lock_read(anon_vma);

    // Walk all anon_vma_chains in the interval tree
    // that might contain our page
    anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
                                   page->index, page->index) {
        struct vm_area_struct *vma = avc->vma;
        unsigned long address;

        // Compute the virtual address of the page in this VMA
        address = vma_address(page, vma);
        if (address == -EFAULT)
            continue;                // Page not in this VMA's range

        // Try to unmap from this address space
        ret = try_to_unmap_one(page, vma, address, flags);
        if (ret != SWAP_AGAIN)
            break;                   // Stop on success or unrecoverable failure
    }

    anon_vma_unlock_read(anon_vma);
    put_anon_vma(anon_vma);
    return ret;
}

// Unmap a single PTE
int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                     unsigned long address, enum ttu_flags flags)
{
    struct mm_struct *mm = vma->vm_mm;
    pte_t *pte;
    pte_t pteval;
    spinlock_t *ptl;

    // Get the PTE for this address (with the page-table lock held)
    pte = page_check_address(page, mm, address, &ptl);
    if (!pte)
        return SWAP_AGAIN;           // Not mapped here

    // Clear the PTE (and flush the TLB entry)
    pteval = ptep_clear_flush(vma, address, pte);
    (void)pteval;                    // Would feed swap-entry bookkeeping (omitted)

    // Update page counts
    page_remove_rmap(page);
    put_page(page);

    pte_unmap_unlock(pte, ptl);
    return SWAP_SUCCESS;
}
```

Reverse mapping can be expensive when a page is shared by many processes (e.g., libc mapped by 1000 processes). Walking all mappings and updating PTEs takes O(n) time. Linux optimizes for common cases, but pathological sharing patterns can impact performance. This is one reason some workloads disable KSM (Kernel Same-page Merging).
Not all shared pages are created equal. The OS distinguishes between different sharing types, each with distinct semantics and handling:
| Type | Source | Write Behavior | Examples |
|---|---|---|---|
| COW Anonymous | fork() duplication | Private copy on write | Heap, stack after fork |
| Shared Anonymous | Explicit shared mmap | Writes visible to all | IPC shared memory |
| Private File-Backed | mmap(MAP_PRIVATE) | COW on write | Executable code segments |
| Shared File-Backed | mmap(MAP_SHARED) | Writes visible + file | Shared libs, DB files |
| KSM-Merged | Kernel deduplication | COW unmerge on write | VM memory optimization |
Deep Dive: Each Sharing Type
COW Anonymous Pages: These arise from fork(). Pages that were private to the parent become COW-shared with the child; the first write by either process triggers a COW fault and creates a private copy. Once copied, a page never returns to the shared state.
Shared Anonymous Pages (SYSV shm, MAP_SHARED|MAP_ANONYMOUS): Explicitly created for IPC. Writes by any process are immediately visible to all. No COW—all processes see the same memory. Reference count tracks participants.
Private File-Backed Pages: When you mmap a file with MAP_PRIVATE, the initial pages come from the page cache (shared). If you write, a private copy is made—your modifications are never written to the file. This is how executable code pages work.
Shared File-Backed Pages: With MAP_SHARED, writes go to the page cache and are flushed to disk. All processes sharing the mapping see writes. File provides the synchronization semantics.
Understanding sharing types is crucial for reasoning about memory usage. Tools like pmap -X, smem, and /proc/[pid]/smaps report PSS (Proportional Set Size), which divides each shared page's 'cost' among the processes sharing it. Knowing what's actually shared vs. COW-pending helps optimize memory footprint and understand actual resource usage.
Let's trace how page sharing evolves through a process's lifecycle, seeing how reference counts and mappings change at each stage:
| Event | What Happens to Pages | Reference Counts |
|---|---|---|
| Process Creation (exec) | Load code from file (shared page cache), allocate heap/stack (private) | Code pages: high (all processes); Heap/stack: 1 |
| fork() | All pages become COW-shared; PTEs duplicated as read-only | All counts increment by 1 |
| Child writes to heap | COW fault; copy made; child gets private page | Old page count--, new page count=1 |
| Child exec() | All pages unmapped; new program loaded | All counts decrement (many reach 0, freed) |
| Parent/Child modifies | Each writer gets private copy if needed | Counts adjust per page |
| Process exit() | All pages unmapped; reference counts decremented | Pages with count→0 are freed |
Example: Web Server Worker Processes
Consider Apache's pre-fork model with 100 worker processes forked from a master:
Initial State:
Master: 500MB (code: 100MB, data: 400MB)
After 100 forks (naive):
100 × 500MB = 50GB (impossible on 32GB machine)
After 100 forks (with COW):
Shared code: 100MB (refcount=101)
Read-only master data: ~380MB shared
Modified data per worker: ~20MB × 100 = 2GB private
Total: ~100MB + 380MB + 2GB ≈ 2.5GB (easily fits)
COW provides a 20x memory reduction for this workload, enabling 100 workers on a machine that couldn't even start them with eager copying.
After fork(), parent and child start 100% shared. As each writes to pages, they gradually diverge. A child that immediately exec()s diverges 100% instantly (discarding all shared mappings). A child that runs similar code to parent may remain highly shared. This spectrum is the beauty of COW—you pay for divergence as it happens.
Shared pages complicate memory accounting. If a page is shared by 10 processes, who 'owns' that memory? Various metrics provide different perspectives:
| Metric | Definition | How Sharing is Counted |
|---|---|---|
| RSS (Resident Set Size) | Physical pages currently in memory for this process | Each process counts full page (overcounts) |
| USS (Unique Set Size) | Pages private to this process | Shared pages don't count (undercounts) |
| PSS (Proportional Set Size) | RSS with shared pages divided among sharers | 1 page shared by 10 = 0.1 per process |
| VSZ (Virtual Size) | Total virtual address space mapped | Doesn't reflect physical usage |
| Shared Memory | Pages with mapcount > 1 | Explicitly tracks shared portion |
```
# Example from /proc/[pid]/smaps showing a memory breakdown
# (a private file-backed mapping with some pages COW-modified after fork)

00400000-00452000 r-xp 00000000 08:01 1234567    /usr/bin/myapp
Size:                328 kB   # Virtual size of mapping
Rss:                 280 kB   # Resident in physical memory
Pss:                 112 kB   # Proportional share (shared with 3 other procs)
Shared_Clean:        224 kB   # Shared, unmodified pages
Shared_Dirty:          0 kB   # Shared, modified pages
Private_Clean:         0 kB   # Private, unmodified pages
Private_Dirty:        56 kB   # Private, modified pages (COW copies)
Referenced:          280 kB   # Recently accessed
Anonymous:            56 kB   # Not file-backed (our COW copies)
AnonHugePages:         0 kB   # Huge page portions
Swap:                  0 kB   # Swapped out pages
KernelPageSize:        4 kB   # Standard page size

# Interpretation:
# - 280 kB RSS, but only 112 kB PSS (heavily shared)
# - 224 kB is shared code/rodata (read-only, no COW needed)
# - 56 kB is private (COW copies we've made)
# - With 4 sharers, the 224 kB of shared pages contributes
#   224 / 4 = 56 kB to each process's PSS, plus 56 kB private
```

PSS (Proportional Set Size) is often the most useful metric for shared environments. If you sum PSS across all processes, you get actual physical memory usage. The sum of RSS would dramatically overcount (counting shared pages multiple times); the sum of USS would dramatically undercount (ignoring shared pages entirely).
Practical Implications:
For capacity planning, sum PSS across processes; it adds up to real physical memory usage. For finding leaks within a single process, watch USS and Private_Dirty, since growth there is memory no other process can share. And treat RSS comparisons across forked workers with suspicion: each worker counts the full size of every shared page.
Let's consolidate our understanding of shared pages:
1. Every physical frame has a descriptor whose reference count (_refcount) and PTE mapping count (_mapcount) drive sharing decisions.
2. A COW write copies only when the mapping count exceeds 1; a frame is freed only when its reference count reaches zero.
3. Reverse mapping (anon_vma chains for anonymous pages, address_space interval trees for file-backed pages) finds every PTE referencing a frame.
4. Sharing comes in several flavors (COW anonymous, shared anonymous, private and shared file-backed, KSM-merged), each with distinct write semantics.
5. PSS apportions shared pages among their sharers, so summing PSS across processes yields actual physical memory usage.
What's Next:
Now that we understand how pages are shared and tracked, we'll examine what triggers a copy—the mechanics of the write fault that converts a shared page into a private copy. This is where the 'Write' in 'Copy-on-Write' actually happens.
You now understand the infrastructure supporting shared pages: reference counting, page descriptors, reverse mapping, and the different types of sharing. This knowledge is foundational for understanding how the kernel manages memory efficiently across processes.