Consider a typical Linux server running 50 instances of a web application. Each process requires 100 MB of code and libraries. Without memory sharing, you'd need 5 GB of physical memory just for the redundant code. With page sharing, that same code occupies a mere 100 MB — a 50x reduction.
This isn't theoretical optimization; it's the difference between a $200/month cloud instance and a $4,000/month one. Page sharing is the fundamental mechanism that makes modern multi-process computing economically viable.
In this page, we'll dissect exactly how virtual memory enables this remarkable efficiency, exploring the hardware and software mechanisms that allow multiple processes to share physical memory pages while maintaining complete isolation and security.
By the end of this page, you will understand the fundamental mechanisms of page sharing: how page tables enable multiple virtual addresses to map to the same physical frame, the different types of sharing (read-only vs copy-on-write), and the architectural principles that make sharing both efficient and safe. You'll gain the vocabulary and conceptual framework to understand how operating systems optimize memory usage through sharing.
Page sharing is made possible by the fundamental architecture of virtual memory systems. To understand it fully, we must first revisit the core abstraction that enables all modern memory management.
The Key Insight: Indirection
Virtual memory introduces an indirection layer between the addresses a process uses (virtual addresses) and the actual physical memory locations (physical addresses). This indirection is implemented through page tables, which the Memory Management Unit (MMU) consults on every memory access.
Here's the crucial observation: nothing in the virtual memory architecture requires that a physical frame be referenced by only one virtual page. The page table is simply a mapping structure, and multiple entries — whether in the same page table or in different processes' page tables — can point to the same physical frame.
This seemingly simple insight has profound implications. It means that two processes can have page table entries that contain the same physical frame number, allowing them to access the same physical memory through their own distinct virtual addresses.
| Field | Size (typical) | Role in Sharing |
|---|---|---|
| Physical Frame Number (PFN) | 20-40 bits | The actual physical address; identical in shared PTEs |
| Present/Valid bit | 1 bit | Must be set for accessible shared pages |
| Read/Write bit | 1 bit | Often read-only for shared pages to enable COW |
| User/Supervisor bit | 1 bit | Determines if user-space can access the shared page |
| Global bit | 1 bit | Prevents TLB flush; useful for widely-shared pages (e.g., kernel) |
| Accessed/Dirty bits | 2 bits | Track usage; shared pages complicate dirty tracking |
| Cache control bits | 2-3 bits | Must be consistent across all sharing PTEs |
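To make the roles of these fields concrete, here is a minimal, compilable C sketch. The bit positions, flag names, and frame number are illustrative only, not any particular CPU's exact PTE format.

// Illustrative PTE construction: two processes sharing one physical frame.
// Bit positions and the frame number are made up for illustration.
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT (1ULL << 0)   // Present/Valid bit
#define PTE_WRITE   (1ULL << 1)   // Read/Write bit (left clear below: read-only)
#define PTE_USER    (1ULL << 2)   // User/Supervisor bit
#define PFN_SHIFT   12            // flags live in the low bits; PFN starts here

static uint64_t make_pte(uint64_t pfn, uint64_t flags) {
    return (pfn << PFN_SHIFT) | flags;
}

int main(void) {
    uint64_t shared_pfn = 0x1234;                                       // hypothetical frame number
    uint64_t pte_in_A = make_pte(shared_pfn, PTE_PRESENT | PTE_USER);   // A's read-only mapping
    uint64_t pte_in_B = make_pte(shared_pfn, PTE_PRESENT | PTE_USER);   // B's read-only mapping
    // Both PTEs carry the same PFN, so A and B reach the same physical memory.
    printf("PFN in A: 0x%llx, PFN in B: 0x%llx\n",
           (unsigned long long)(pte_in_A >> PFN_SHIFT),
           (unsigned long long)(pte_in_B >> PFN_SHIFT));
    return 0;
}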
When multiple page tables reference the same physical frame, the operating system must maintain a reference count for each frame. The frame can only be freed when the reference count drops to zero, meaning no page table entries point to it anymore. This reference counting is managed by the OS's frame allocator and is critical for correctness.
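A minimal sketch of that lifecycle, using hypothetical helper names rather than any real allocator's API:

// Hypothetical frame reference counting (not a real kernel API)
#include <stdatomic.h>

struct frame {
    atomic_int refcount;                    // number of PTEs referencing this frame
    /* ... other frame metadata ... */
};

void frame_map(struct frame *f) {
    atomic_fetch_add(&f->refcount, 1);      // a new PTE now points at this frame
}

void frame_unmap(struct frame *f) {
    if (atomic_fetch_sub(&f->refcount, 1) == 1) {
        /* last mapping gone: the allocator may now free the frame */
    }
}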
Let's trace exactly how page sharing is established and maintained at the hardware and software levels.
Establishing a Shared Mapping
When the operating system decides to share a page between two processes, it performs the following operations:
1. Locate (or allocate) the physical frame that holds the shared data.
2. Install a page table entry in each process that points to that frame, at whatever virtual address each process chooses, with the appropriate permission bits.
3. Increment the frame's reference count once per new mapping so the frame is not freed while any process still uses it.
Let's visualize this with a concrete example:
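A sketch of the resulting mappings (the frame number 0x5 is arbitrary; only the structure matters):

Process A page table            Process B page table
  VA 0x1000 -> PFN 0x5            VA 0x2000 -> PFN 0x5
          \                          /
           v                        v
        Physical frame 0x5 (one copy in RAM, refcount = 2)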
Virtual Address Independence
Notice that Process A maps the shared page at virtual address 0x1000, while Process B maps the same physical frame at virtual address 0x2000. The virtual addresses don't need to match — each process has complete autonomy over its virtual address space layout.
However, some types of sharing do require consistent virtual addresses across processes. Code that contains absolute addresses (rather than position-independent code) will only work correctly if mapped at the expected virtual address. This is why position-independent code matters:
Position-Independent Code uses relative addressing for all internal references, allowing the same code to execute correctly regardless of where it's loaded in virtual memory. This is achieved using instruction-pointer-relative addressing (for code) and the Global Offset Table (GOT) with Procedure Linkage Table (PLT) (for data and function calls). Modern shared libraries are always compiled as PIC.
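As a minimal C-level illustration of why this keeps code pages shareable (the comments describe typical code generation; details vary by compiler and architecture):

// Why PIC keeps code pages shareable (illustrative)
static long hits;            // writable data: each process gets its own (COW) copy

long count_hit(void) {
    // Without PIC, the absolute address of 'hits' is baked into the instruction,
    // so the code page only works at one load address. With PIC, the compiler
    // emits an instruction-pointer-relative reference (or an indirect load through
    // the GOT), so the code page is byte-for-byte identical wherever the library
    // is loaded and can be shared read-only by every process that maps it.
    return ++hits;
}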
Page sharing in operating systems manifests in several distinct forms, each with specific use cases, implementation requirements, and performance characteristics. Understanding these categories is essential for systems programming and OS development.
Pure Read-Only Sharing
This is the simplest and most common form of sharing. The shared pages are marked read-only in all processes' page tables, and no process ever modifies them.
Use Cases: program text (executable code), shared library code and read-only data, and files mapped read-only into multiple processes.
Implementation:
// Pseudo-code for establishing read-only sharing of a file-backed page
frame_t *frame = find_or_create_frame(file, offset);        // reuse the cached frame if the page is already in memory
add_pte(process_A, vaddr_A, frame, PROT_READ | PROT_EXEC);   // map into A's page table at A's chosen address
frame->refcount++;
add_pte(process_B, vaddr_B, frame, PROT_READ | PROT_EXEC);   // map into B at a possibly different virtual address
frame->refcount++;                                           // frame is freed only when refcount drops to 0
Properties: the pages are never modified, so no copy-on-write machinery is required; content never diverges between processes; and because the pages stay clean, the OS can simply drop them under memory pressure and re-read them from the backing file later.
Example: When 100 processes run /usr/bin/python3, they all share the same physical pages containing the Python interpreter's executable code. The 15 MB Python binary exists once in memory, not 100 times.
For page sharing to work correctly and efficiently, the operating system must maintain sophisticated data structures that track which frames are shared, by whom, and with what properties. Let's examine this infrastructure in detail.
Linux describes every physical frame with a struct page; its _mapcount field tracks how many page table entries reference this frame.
// Simplified representation of Linux's page frame metadata
struct page {
    unsigned long flags;              // Page state flags (Locked, Dirty, Active, etc.)
    union {
        atomic_t _mapcount;           // Count of page table mappings (-1 = not mapped)
        unsigned int page_type;       // For special pages
    };
    atomic_t _refcount;               // Usage count (must be > 0 to use page)
    struct address_space *mapping;    // File this page belongs to (if file-backed)
    pgoff_t index;                    // Offset within file (in pages)
    struct list_head lru;             // Position on LRU list for reclamation

    // For anonymous pages (COW, heap, stack)
    struct anon_vma *anon_vma;        // Reverse mapping for anonymous pages
};

// VMA describes a region of virtual address space
struct vm_area_struct {
    unsigned long vm_start;           // Start virtual address
    unsigned long vm_end;             // End address (exclusive)
    pgprot_t vm_page_prot;            // Access permissions
    unsigned long vm_flags;           // Flags: VM_READ, VM_WRITE, VM_SHARED, etc.
    struct file *vm_file;             // Backing file (NULL for anonymous)
    unsigned long vm_pgoff;           // Offset in file (in pages)
    struct mm_struct *vm_mm;          // Owning process's address space

    // For shared file mappings, all processes share same address_space
    // For anonymous sharing, anon_vma links all COW-related VMAs
};

The Reference Counting Challenge
Reference counting for shared pages must handle several edge cases:
Transient references: The kernel often takes temporary references (e.g., during I/O). The _refcount must account for these beyond just page table mappings.
Split references: _mapcount counts page table entries, while _refcount counts all references. A page with _mapcount = 5 (five PTEs) might have _refcount = 7 (the five PTEs plus 2 ongoing I/O operations); this split is sketched in the example after this list.
Large pages: When using huge pages (2 MB), all component 4 KB pages must maintain consistent counts when the huge page is split.
NUMA proximity: Reference tracking must preserve NUMA node information to avoid migrating pages away from their optimal memory node.
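A toy model of the split-reference case, with hypothetical helpers rather than the kernel's real functions:

// Toy model of split reference counts (hypothetical helpers, not kernel functions)
#include <stdatomic.h>

struct page_meta {
    atomic_int mapcount;     // page table entries only
    atomic_int refcount;     // PTEs plus transient kernel references (I/O, pinning, ...)
};

void install_pte(struct page_meta *p) {        // a process maps the page
    atomic_fetch_add(&p->mapcount, 1);
    atomic_fetch_add(&p->refcount, 1);
}

void start_io(struct page_meta *p)  { atomic_fetch_add(&p->refcount, 1); }  // pin during I/O
void finish_io(struct page_meta *p) { atomic_fetch_sub(&p->refcount, 1); }

// After five install_pte() calls and two start_io() calls:
// mapcount = 5, refcount = 7 -- exactly the split described above.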
The hardware plays a crucial role in enabling efficient page sharing. Modern CPUs provide several features specifically designed to support sharing semantics.
| Scenario | TLB Impact | Optimization |
|---|---|---|
| Context switch between sharing processes | TLB entries with different ASIDs can coexist | No flush needed if ASID differs |
| COW page becomes private | Must invalidate other CPUs' TLB entries (TLB shootdown) | IPI to affected CPUs only |
| Global page (kernel, vDSO) | Shared TLB entry, never flushed | One TLB entry serves all processes |
| File page evicted from memory | All PTEs must be cleared, TLB shootdown | Batch invalidations for efficiency |
| Huge page shared | Single TLB entry covers 2MB/1GB | Massive TLB coverage improvement |
When a shared page's mapping changes (e.g., COW copy, unmapping), the OS must ensure all CPUs invalidate their TLB entries for that page. This 'TLB shootdown' requires inter-processor interrupts (IPIs), which are expensive. On a 128-core system, a shootdown can take thousands of cycles. This is a hidden cost of sharing that can become a bottleneck for write-heavy workloads.
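Sketched as pseudo-code in the same style as the earlier examples (the function names are placeholders, not a real kernel interface):

// Pseudo-code: TLB shootdown when a shared page's mapping changes
void update_shared_mapping(pte, vaddr, cpus_that_may_cache_it) {
    modify_pte(pte);                                  // e.g., make it read-only or point elsewhere
    flush_local_tlb_entry(vaddr);                     // drop this CPU's stale translation
    send_ipi(cpus_that_may_cache_it, FLUSH, vaddr);   // interrupt only the CPUs that might cache it
    wait_for_acknowledgements();                      // the old mapping is gone only after all ACKs
}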
On Non-Uniform Memory Access (NUMA) systems, memory access time depends on which CPU socket is accessing which memory bank. Shared pages introduce interesting NUMA considerations.
The NUMA Sharing Dilemma
When a page is shared across processes running on different NUMA nodes:
| Scenario | Memory Latency | Practical Impact |
|---|---|---|
| Local access | 80-100 ns | Optimal performance |
| Remote access (1 hop) | 120-150 ns | 30-50% slower |
| Remote access (2 hops) | 200+ ns | 2x+ slower |
For a shared library page accessed by processes on different nodes, someone will have remote access. The OS must decide where to place the page.
Linux AutoNUMA
Linux implements automatic NUMA balancing (AutoNUMA) that monitors access patterns and migrates pages to the accessing node. For shared pages, this creates interesting dynamics:
Page initially on Node 0
|
v
Process on Node 1 accesses frequently
|
v
AutoNUMA detects remote access pattern
|
v
Migrates page to Node 1
|
v
But now Process on Node 0 has remote access!
This back-and-forth ('NUMA ping-pong') can occur with shared pages accessed equally from multiple nodes. The solution is often to either pin the page on one node with an explicit placement policy or interleave the shared region across the contending nodes so the remote-access cost is split evenly.
For shared memory IPC on NUMA systems:
1. Keep communicating processes on the same node when possible.
2. Use interleaving for truly shared structures that are accessed equally from all nodes.
3. For producer-consumer patterns, place the shared region on the consumer's node (reads are typically more latency-sensitive than writes due to CPU stalls).
4. Use numactl and mbind() for explicit control in performance-critical applications.
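A minimal sketch of point (4), assuming libnuma's development headers are installed (link with -lnuma); the node mask and region size are illustrative:

// Interleave a shared region across two NUMA nodes (illustrative)
#define _GNU_SOURCE
#include <numaif.h>        // mbind(), MPOL_INTERLEAVE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 16 * 4096;
    // A shared anonymous region, e.g., for producer/consumer IPC
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = 0x3;    // nodes 0 and 1 (illustrative)
    // Spread the region's pages across both nodes so neither side
    // always pays the remote-access penalty.
    if (mbind(buf, len, MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");             // e.g., fails on single-node systems
    return 0;
}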
Understanding sharing efficiency requires metrics to measure actual sharing in a running system. Operating systems provide various tools and interfaces to observe sharing behavior.
# Check memory sharing statistics on Linux

# 1. System-wide sharing stats
cat /proc/meminfo | grep -E 'Shmem|Mapped|AnonPages'
# Shmem: explicitly shared memory (shm_open, etc.)
# Mapped: file-backed pages mapped into processes
# AnonPages: private process memory (heap, stack, COW)

# 2. Per-process sharing analysis
# PSS (Proportional Set Size) divides shared pages by sharers
# Example: 100 processes share 100 MB libc → each shows ~1 MB PSS
cat /proc/<pid>/smaps | grep -E 'Pss|Shared|Private'

# 3. Shared memory segments
ipcs -m        # System V shared memory
ls /dev/shm    # POSIX shared memory

# 4. Analyze a specific library's sharing
# Count how many processes map libc
lsof /lib/x86_64-linux-gnu/libc.so.* | wc -l

# 5. Detailed page flags analysis
cat /proc/<pid>/pagemap    # Raw page table data
# Requires specialized tools like /proc/kpageflags

| Metric | Definition | Interpretation |
|---|---|---|
| RSS (Resident Set Size) | All pages in physical memory | Includes shared pages (counted fully) |
| PSS (Proportional Set Size) | Private pages + shared/N | Fair share accounting for shared pages |
| USS (Unique Set Size) | Only private pages | Memory freed if process terminates |
| Shared_Clean | Shared pages not modified | True read-only sharing |
| Shared_Dirty | Shared pages with pending writes | File-backed with unsaved changes |
| Private_Clean | Private pages not modified | COW pages not yet written |
| Private_Dirty | Private pages that were modified | Memory this process alone consumes; cannot be shared or dropped without swap |
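As a worked example of the PSS column: if a 40 MB library is mapped by 8 processes, each process's RSS counts the full 40 MB, but each PSS charge is 40 MB / 8 = 5 MB, and summing PSS across all 8 processes recovers the single 40 MB that is actually resident.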
Limits on Sharing
Several factors limit how much memory can be shared:
Alignment requirements: Pages must be naturally aligned. Partially-overlapping data cannot share a page.
Write frequency: High-write pages break COW sharing immediately. Data structures with frequent writes are poor sharing candidates.
ASLR entropy: Address Space Layout Randomization places libraries at random addresses. Position-independent code handles this, but offset randomization within libraries can affect page alignment.
Execution state: Writable data (globals, static variables) in shared libraries is COW'd per-process.
Huge page granularity: 2 MB huge pages can only share if the entire 2 MB section is shareable.
Maximum mapcount: Linux limits how many PTEs can reference one page (128*1024 by default). Systems with extreme sharing (containers) can hit this limit.
We've established the foundational concepts of page sharing in virtual memory systems. Let's consolidate the key takeaways:
- The indirection of page tables is what makes sharing possible: nothing prevents multiple PTEs, in one or many processes, from holding the same physical frame number.
- The OS must reference-count shared frames and can free a frame only when no page table entry references it.
- Sharing comes in distinct forms, from pure read-only sharing of code and file pages to copy-on-write for writable data.
- Kernel structures (struct page, VMAs) track who shares what, while hardware features (TLB ASIDs, global pages, huge pages) and NUMA placement determine how efficient sharing is in practice.
- Metrics such as RSS, PSS, and USS let you measure how much sharing a running system actually achieves.
What's Next:
Now that we understand the fundamental mechanisms of page sharing, we'll explore one of its most important applications: shared libraries. We'll see how operating systems enable hundreds of processes to share a single copy of libc, how symbol resolution works with shared code, and the performance implications of library loading strategies.
You now understand the fundamental mechanisms of page sharing in virtual memory systems. You've learned how the indirection provided by page tables enables sharing, the different types of sharing and their use cases, the kernel data structures that track sharing, and the hardware features that make it efficient. This knowledge forms the foundation for understanding shared libraries, IPC, and memory optimization in real systems.