In our exploration of Copy-on-Write, we established a fundamental principle: multiple processes can share the same physical memory frames until one of them needs to modify the data. But this raises immediate practical questions: How does the operating system know which frames are shared? How does it track how many processes reference each frame? What data structures make this efficient?
This page dives deep into the mechanics of page sharing—the bookkeeping, data structures, and algorithms that transform COW from an elegant idea into a working system. Understanding shared pages is essential because this infrastructure underpins not just fork(), but also shared libraries, memory-mapped files, and inter-process communication.
By the end of this page, you will understand how the OS tracks page sharing, the critical role of reference counting, the data structures used to map between virtual and physical memory, and how these mechanisms interact during fork(), exec(), and exit().
When multiple page table entries point to the same physical frame, the OS needs to track this relationship for several reasons:
1. Knowing When to Copy: When a process writes to a COW-protected page, the OS must determine whether a copy is needed. If the reference count is 1 (sole owner), no copy is needed—just mark the page writable. If count > 1, a copy must be made.
2. Safe Frame Deallocation: When a process exits or unmaps memory, the OS must know whether to free the physical frame. A frame can only be freed when its reference count reaches zero. Freeing a shared frame would corrupt other processes.
3. Memory Accounting: To enforce memory limits and report accurate usage, the OS must distinguish between private and shared memory. Shared pages shouldn't be double-counted across processes.
4. Page-Out Decisions: When selecting pages to swap out, the kernel considers reference counts. Swapping a highly-shared page affects many processes, which may be undesirable.
5. Page Table Updates: When a shared page is swapped out or relocated, all referencing page tables must be updated. This requires tracking all PTEs pointing to each frame.
| Information Needed | Why It's Needed | Where It's Stored |
|---|---|---|
| Reference count | Determine if copy needed, safe to free | Page frame descriptor |
| PTE list (reverse mapping) | Update all PTEs when frame moves | Reverse mapping structure |
| Sharing type | Distinguish COW vs. true shared | PTE flags or VMA |
| Owning address spaces | Memory accounting, limits | VMA and mm_struct |
| Dirty/clean status | Writeback decisions | Page frame flags |
Maintaining sharing metadata consumes memory and CPU cycles. The OS must carefully design these structures to minimize overhead while providing the information needed for correct operation. Every fork() increments reference counts; every write may trigger lookups; every exit traverses PTEs. This bookkeeping is the hidden cost of COW's benefits.
Reference counting is the foundational mechanism for tracking page sharing. Each physical frame has an associated count indicating how many page table entries reference it. The rules are deceptively simple:
Basic Operations:
1. Share: When a new PTE is made to point at a frame (for example, during fork()), increment the frame's count.
2. Unshare: When a PTE is removed (unmap, exit, or after a COW copy), decrement the count.
3. Free: When the count reaches zero, nothing references the frame and it can be returned to the free list.
4. Copy decision: On a COW write fault, copy only if the count is greater than 1; a sole owner's page is simply marked writable.
These rules look simple, but real-world reference counting must handle several complications: counts must be updated atomically because multiple CPUs can fault and fork concurrently, and the kernel must separate total references from PTE mappings (Linux's _refcount versus _mapcount, shown in the code below):
```c
// Simplified representation of page frame reference counting
// Inspired by the Linux kernel's struct page and page_count() mechanism

// Each physical frame has a descriptor (very simplified)
struct page {
    atomic_t _refcount;             // Total reference count (kernel + mappings)
    atomic_t _mapcount;             // PTE mappings, biased: -1 means unmapped
    unsigned long flags;            // Page state flags
    struct address_space *mapping;  // File mapping (if file-backed)
    pgoff_t index;                  // Offset within mapping
    struct list_head lru;           // For page replacement lists
    // ... many more fields in the real kernel
};

// Get the current reference count
static inline int page_count(struct page *page)
{
    return atomic_read(&page->_refcount);
}

// Get the number of PTEs mapping this page (undo the -1 bias)
static inline int page_mapcount(struct page *page)
{
    return atomic_read(&page->_mapcount) + 1;
}

// Increment the reference count
static inline void get_page(struct page *page)
{
    atomic_inc(&page->_refcount);
}

// Decrement the reference count, free if it reaches zero
static inline void put_page(struct page *page)
{
    if (atomic_dec_and_test(&page->_refcount)) {
        // Reference count hit zero - free the page
        __free_page(page);
    }
}

// Increment the map count (called when creating a PTE pointing to this page)
static inline void page_add_anon_rmap(struct page *page,
                                      struct vm_area_struct *vma,
                                      unsigned long address)
{
    atomic_inc(&page->_mapcount);
    // Also add to the reverse mapping for this VMA
    // (details omitted - uses anon_vma structures)
}

// Decrement the map count (called when removing a PTE)
static inline void page_remove_rmap(struct page *page)
{
    if (atomic_add_negative(-1, &page->_mapcount)) {
        // Page is no longer mapped by any PTE (mapcount back to -1)
        // It may still have kernel references (page_count > 0)
    }
}

// Example: handling COW-fault reference counting
int handle_cow_fault(struct vm_area_struct *vma, struct page *old_page,
                     pte_t *pte, unsigned long address)
{
    struct page *new_page;

    // Check if we're the sole owner
    if (page_mapcount(old_page) == 1) {
        // Sole owner - just make the page writable, no copy needed
        return make_page_writable(vma, address);
    }

    // Multiple owners - need to copy
    new_page = alloc_page(GFP_HIGHUSER);
    if (!new_page)
        return -ENOMEM;

    // Copy the page contents
    copy_page(page_address(new_page), page_address(old_page));

    // Account the new page in the reverse map
    page_add_anon_rmap(new_page, vma, address);

    // Update the PTE to point at the new, writable page
    set_pte_at(vma->vm_mm, address, pte,
               pte_mkwrite(mk_pte(new_page, vma->vm_page_prot)));

    // Drop our mapping and reference on the old page
    page_remove_rmap(old_page);
    put_page(old_page);

    return 0;
}
```

Linux distinguishes between _refcount (total references, including kernel users) and _mapcount (PTE mappings only). A page might have mapcount = 0 but refcount > 0 if the kernel is using it for I/O or caching. This distinction is crucial for correct page lifecycle management.
The operating system maintains metadata for every physical page frame in the system. In Linux, this is the famous struct page (or in modern code, struct folio for compound pages). This structure is the nerve center for shared page management.
The Page Descriptor Array:
The kernel maintains an array of page descriptors, one for each physical page frame in the system. Given a physical frame number, the kernel can index directly into this array to find its descriptor. Conversely, given a descriptor, the kernel can compute the frame number.
| Field | Purpose | Sharing Relevance |
|---|---|---|
| _refcount | Total reference count | When 0, frame can be freed |
| _mapcount | PTE mapping count | Determines if COW copy needed |
| flags | Page state bits | Dirty, locked, uptodate, etc. |
| mapping | Pointer to address_space | File mapping or anon_vma |
| index | Offset in mapping | Position in file or swap |
| lru | LRU list pointers | Page replacement tracking |
| private | FS-specific data | Buffer heads, etc. |
Memory Overhead of Page Descriptors:
Each struct page in Linux is approximately 64 bytes on x86-64. For a system with 64GB of RAM and 4KB pages, there are 16 million page frames:
16,000,000 frames × 64 bytes = 1 GB of descriptor overhead
This ~1.5% overhead is the price of flexible memory management. The kernel places these descriptors in a dedicated region at boot time, ensuring they're always accessible without page faults.
On NUMA systems with non-contiguous physical memory, the simple array model breaks down. Linux uses SPARSEMEM or SPARSEMEM_VMEMMAP models where the page descriptor array is not physically contiguous but appears so through virtual mapping. This adds complexity but maintains the logical simplicity of array indexing.
Reference counting tells us how many PTEs reference a page, but not which PTEs. For many operations—particularly unmapping shared pages for swapout or migration—the kernel needs to find and update all page table entries pointing to a given frame. This is the job of reverse mapping (rmap).
The Reverse Mapping Problem:
Given a physical page frame, find all (process, virtual address) pairs that map it.
This is challenging because the normal lookup direction is reversed: page tables efficiently answer "virtual address → physical frame," but nothing in that forward structure answers "physical frame → which virtual addresses, in which processes."
Why Reverse Mapping Matters:
Without rmap, the only way to find every PTE referencing a frame would be to scan every page table in the system. Rmap makes targeted operations practical: swapping a page out requires clearing every PTE that maps it, and migrating or relocating a frame requires repointing them all.
Linux Reverse Mapping Implementation:
Linux uses different rmap strategies for different page types:
Anonymous Pages (COW-relevant):
Anonymous pages (heap, stack, COW copies) use anon_vma structures. When a page is first created or during fork(), it's linked to an anon_vma that tracks all VMAs (Virtual Memory Areas) that might map it. Walking the rmap involves:
1. Looking up the page's anon_vma and taking its read lock.
2. Walking the interval tree of anon_vma_chains to find every VMA that could contain the page.
3. Computing the page's virtual address within each candidate VMA.
4. Visiting that PTE (to unmap, update, or inspect it).
File-Backed Pages:
Pages from file mappings use the file's address_space. Each file has an interval tree of VMAs mapping it. Rmap walks this tree to find all mappings of a given page offset.
```c
// Simplified reverse mapping traversal for anonymous pages
// (Real Linux code is significantly more complex)

// Structure linking VMAs that might share anonymous pages
struct anon_vma {
    struct anon_vma *root;           // Root of the anon_vma tree
    struct rb_root_cached rb_root;   // Interval tree of anon_vma_chains
    atomic_t refcount;
    spinlock_t lock;
};

// Chain linking a VMA to its anon_vma
struct anon_vma_chain {
    struct vm_area_struct *vma;
    struct anon_vma *anon_vma;
    struct list_head same_vma;       // List of all chains for this VMA
    struct rb_node rb;               // Node in the anon_vma's interval tree
};

// Traverse all PTEs mapping a given page
int try_to_unmap(struct page *page, enum ttu_flags flags)
{
    struct anon_vma *anon_vma;
    struct anon_vma_chain *avc;
    int ret = 0;

    // Get the anon_vma for this page
    anon_vma = page_get_anon_vma(page);
    if (!anon_vma)
        return SWAP_SUCCESS;         // No mappings

    // Lock to prevent concurrent modification
    anon_vma_lock_read(anon_vma);

    // Walk all anon_vma_chains in the interval tree
    // that might contain our page
    anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
                                   page->index, page->index) {
        struct vm_area_struct *vma = avc->vma;
        unsigned long address;

        // Compute the virtual address of the page in this VMA
        address = vma_address(page, vma);
        if (address == -EFAULT)
            continue;                // Page not in this VMA's range

        // Try to unmap from this address space
        ret = try_to_unmap_one(page, vma, address, flags);
        if (ret != SWAP_AGAIN)
            break;                   // Stop on success or unrecoverable failure
    }

    anon_vma_unlock_read(anon_vma);
    put_anon_vma(anon_vma);
    return ret;
}

// Unmap a single PTE
int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                     unsigned long address, enum ttu_flags flags)
{
    struct mm_struct *mm = vma->vm_mm;
    pte_t *pte;
    pte_t pteval;
    spinlock_t *ptl;

    // Get the PTE for this address (with the page-table lock held)
    pte = page_check_address(page, mm, address, &ptl);
    if (!pte)
        return SWAP_AGAIN;           // Not mapped here

    // Clear the PTE (and flush the TLB entry)
    pteval = ptep_clear_flush(vma, address, pte);
    (void)pteval;                    // Would feed swap-entry bookkeeping (omitted)

    // Update page counts
    page_remove_rmap(page);
    put_page(page);

    pte_unmap_unlock(pte, ptl);
    return SWAP_SUCCESS;
}
```

Reverse mapping can be expensive when a page is shared by many processes (e.g., libc mapped by 1000 processes). Walking all mappings and updating PTEs takes O(n) time. Linux optimizes for common cases, but pathological sharing patterns can impact performance. This is one reason some workloads disable KSM (Kernel Same-page Merging).
Not all shared pages are created equal. The OS distinguishes between different sharing types, each with distinct semantics and handling:
| Type | Source | Write Behavior | Examples |
|---|---|---|---|
| COW Anonymous | fork() duplication | Private copy on write | Heap, stack after fork |
| Shared Anonymous | Explicit shared mmap | Writes visible to all | IPC shared memory |
| Private File-Backed | mmap(MAP_PRIVATE) | COW on write | Executable code segments |
| Shared File-Backed | mmap(MAP_SHARED) | Writes visible + file | Shared libs, DB files |
| KSM-Merged | Kernel deduplication | COW unmerge on write | VM memory optimization |
Deep Dive: Each Sharing Type
COW Anonymous Pages: These arise from fork(). Pages that were private to the parent become COW-shared with the child; the first write by either process triggers a COW fault and creates a private copy. Once copied, a page never returns to the shared state.
Shared Anonymous Pages (SYSV shm, MAP_SHARED|MAP_ANONYMOUS): Explicitly created for IPC. Writes by any process are immediately visible to all. No COW—all processes see the same memory. Reference count tracks participants.
Private File-Backed Pages: When you mmap a file with MAP_PRIVATE, the initial pages come from the page cache (shared). If you write, a private copy is made—your modifications are never written to the file. This is how executable code pages work.
Shared File-Backed Pages: With MAP_SHARED, writes go to the page cache and are flushed to disk. All processes sharing the mapping see writes. File provides the synchronization semantics.
Understanding sharing types is crucial for reasoning about memory usage. Tools like pmap -X, smem, and /proc/[pid]/smaps report PSS (Proportional Set Size), which divides each shared page's 'cost' among the processes sharing it. Knowing what's actually shared vs. COW-pending helps optimize memory footprint and understand actual resource usage.
Let's trace how page sharing evolves through a process's lifecycle, seeing how reference counts and mappings change at each stage:
| Event | What Happens to Pages | Reference Counts |
|---|---|---|
| Process Creation (exec) | Load code from file (shared page cache), allocate heap/stack (private) | Code pages: high (all processes); Heap/stack: 1 |
| fork() | All pages become COW-shared; PTEs duplicated as read-only | All counts increment by 1 |
| Child writes to heap | COW fault; copy made; child gets private page | Old page count--, new page count=1 |
| Child exec() | All pages unmapped; new program loaded | All counts decrement (many reach 0, freed) |
| Parent/Child modifies | Each writer gets private copy if needed | Counts adjust per page |
| Process exit() | All pages unmapped; reference counts decremented | Pages with count→0 are freed |
Example: Web Server Worker Processes
Consider Apache's pre-fork model with 100 worker processes forked from a master:
Initial State:
Master: 500MB (code: 100MB, data: 400MB)
After 100 forks (naive):
100 × 500MB = 50GB (impossible on 32GB machine)
After 100 forks (with COW):
Shared code: 100MB (refcount=101)
Read-only master data: ~380MB shared
Modified data per worker: ~20MB × 100 = 2GB private
Total: ~100MB + 380MB + 2GB ≈ 2.5GB (easily fits)
COW provides a 20x memory reduction for this workload, enabling 100 workers on a machine that couldn't even start them with eager copying.
After fork(), parent and child start 100% shared. As each writes to pages, they gradually diverge. A child that immediately exec()s diverges 100% instantly (discarding all shared mappings). A child that runs similar code to parent may remain highly shared. This spectrum is the beauty of COW—you pay for divergence as it happens.
Shared pages complicate memory accounting. If a page is shared by 10 processes, who 'owns' that memory? Various metrics provide different perspectives:
| Metric | Definition | How Sharing is Counted |
|---|---|---|
| RSS (Resident Set Size) | Physical pages currently in memory for this process | Each process counts full page (overcounts) |
| USS (Unique Set Size) | Pages private to this process | Shared pages don't count (undercounts) |
| PSS (Proportional Set Size) | RSS with shared pages divided among sharers | 1 page shared by 10 = 0.1 per process |
| VSZ (Virtual Size) | Total virtual address space mapped | Doesn't reflect physical usage |
| Shared Memory | Pages with mapcount > 1 | Explicitly tracks shared portion |
```
# Example from /proc/[pid]/smaps showing a memory breakdown
# (a private file-backed mapping with some pages COW-modified after fork)

00400000-00452000 r-xp 00000000 08:01 1234567    /usr/bin/myapp
Size:                328 kB   # Virtual size of mapping
Rss:                 280 kB   # Resident in physical memory
Pss:                 112 kB   # Proportional share (shared with 3 other procs)
Shared_Clean:        224 kB   # Shared, unmodified pages
Shared_Dirty:          0 kB   # Shared, modified pages
Private_Clean:         0 kB   # Private, unmodified pages
Private_Dirty:        56 kB   # Private, modified pages (COW copies)
Referenced:          280 kB   # Recently accessed
Anonymous:            56 kB   # Not file-backed (our COW copies)
AnonHugePages:         0 kB   # Huge page portions
Swap:                  0 kB   # Swapped out pages
KernelPageSize:        4 kB   # Standard page size

# Interpretation:
# - 280 kB RSS, but only 112 kB PSS (heavily shared)
# - 224 kB is shared code/rodata (read-only, no COW needed)
# - 56 kB is private (COW copies we've made)
# - With 4 sharers, the 224 kB of shared pages contributes
#   224 / 4 = 56 kB to each process's PSS, plus 56 kB private
```

PSS (Proportional Set Size) is often the most useful metric for shared environments. If you sum PSS across all processes, you get actual physical memory usage. The sum of RSS would dramatically overcount (counting shared pages multiple times); the sum of USS would dramatically undercount (ignoring shared pages entirely).
Practical Implications:
For capacity planning, sum PSS across processes; it adds up to real physical memory usage. For finding leaks within a single process, watch USS and Private_Dirty, since growth there is memory no other process can share. And treat RSS comparisons across forked workers with suspicion: each worker counts the full size of every shared page.
Let's consolidate our understanding of shared pages:
1. Every physical frame has a descriptor whose reference count (_refcount) and PTE mapping count (_mapcount) drive sharing decisions.
2. A COW write copies only when the mapping count exceeds 1; a frame is freed only when its reference count reaches zero.
3. Reverse mapping (anon_vma chains for anonymous pages, address_space interval trees for file-backed pages) finds every PTE referencing a frame.
4. Sharing comes in several flavors (COW anonymous, shared anonymous, private and shared file-backed, KSM-merged), each with distinct write semantics.
5. PSS apportions shared pages among their sharers, so summing PSS across processes yields actual physical memory usage.
What's Next:
Now that we understand how pages are shared and tracked, we'll examine what triggers a copy—the mechanics of the write fault that converts a shared page into a private copy. This is where the 'Write' in 'Copy-on-Write' actually happens.
You now understand the infrastructure supporting shared pages: reference counting, page descriptors, reverse mapping, and the different types of sharing. This knowledge is foundational for understanding how the kernel manages memory efficiently across processes.