We've explored what shared memory does and why it's useful. Now we'll examine how operating systems actually implement it.
When you call shm_open() or mmap(), what happens inside the kernel? Which data structures are created, modified, or linked? How does the OS ensure that multiple processes access the same physical pages? What cleanup occurs when processes exit?
Understanding implementation details isn't just academic curiosity; it's essential for debugging shared-memory bugs, diagnosing resource leaks, and reasoning about performance under memory pressure.
This page traces the complete lifecycle of shared memory in the Linux kernel, from creation through mapping to cleanup, revealing the elegant data structures and algorithms that make sharing work.
By the end of this page, you will understand: the kernel data structures for shared memory (VMA, address_space, page cache), how shm_open and mmap translate to kernel operations, the page fault handling path for shared mappings, how the kernel tracks sharing with reverse mappings, and the cleanup process when shared memory is released.
Linux implements shared memory through a carefully designed set of interconnected data structures. Let's examine each component and its role in enabling sharing.
| Structure | Purpose | Key Fields for Sharing |
|---|---|---|
| mm_struct | Per-process address space | mmap (VMA tree), pgd (page table root), map_count (VMA count) |
| vm_area_struct (VMA) | Describes one contiguous virtual address region | vm_start, vm_end, vm_file, vm_flags (VM_SHARED) |
| struct file | Open file instance | f_mapping (points to address_space), f_mode (access mode) |
| struct inode | Filesystem object (file/device) | i_mapping (address_space for file's pages), i_ino (inode number) |
| address_space | Page cache for one backing object | host (inode), i_pages (xarray of cached pages), a_ops (operations) |
| struct page | Metadata for one physical page frame | mapping, index, _mapcount, _refcount, flags |
```c
// Simplified representations of key Linux structures

// Per-process address space descriptor (one per process)
struct mm_struct {
    struct maple_tree mm_mt;        // Tree of VMAs (was an rbtree, now a maple tree)
    pgd_t *pgd;                     // Root of page tables
    atomic_t mm_users;              // Processes sharing this mm
    atomic_t mm_count;              // Reference count
    unsigned long start_brk, brk;   // Heap boundaries
    unsigned long start_stack;      // Stack start
    unsigned long total_vm;         // Total pages mapped
    unsigned long shared_vm;        // Shared pages
    // ...
};

// Virtual Memory Area - describes one mapping
struct vm_area_struct {
    unsigned long vm_start;         // Start address (inclusive)
    unsigned long vm_end;           // End address (exclusive)
    struct mm_struct *vm_mm;        // Owning mm_struct
    pgprot_t vm_page_prot;          // Page protection bits
    unsigned long vm_flags;         // VM_READ, VM_WRITE, VM_SHARED, etc.
    struct file *vm_file;           // Backing file (NULL for anonymous)
    unsigned long vm_pgoff;         // Offset within file (in pages)
    const struct vm_operations_struct *vm_ops;  // Fault handlers, etc.
    // (Older kernels linked VMAs into a list with vm_next/vm_prev;
    //  the maple tree in mm_struct replaced that list.)
    // ...
};

// Key flags in vm_flags:
#define VM_READ     0x00000001      // Readable
#define VM_WRITE    0x00000002      // Writable
#define VM_EXEC     0x00000004      // Executable
#define VM_SHARED   0x00000008      // Shared (vs private/COW)
#define VM_MAYSHARE 0x00000080      // Can be shared
#define VM_LOCKED   0x00002000      // Pages locked in RAM
#define VM_HUGETLB  0x00400000      // Huge TLB pages

// Page cache (address_space) - shared by all mappers of a file
struct address_space {
    struct inode *host;             // Owning inode
    struct xarray i_pages;          // XArray of cached pages
    atomic_t i_mmap_writable;       // Count of writable mmap users
    struct rb_root_cached i_mmap;   // Tree of VMAs mapping this
    const struct address_space_operations *a_ops;
    // ...
};

// Physical page frame descriptor
struct page {
    unsigned long flags;            // PG_locked, PG_dirty, PG_lru, etc.
    union {
        struct address_space *mapping;  // address_space if page cache
        // Or other uses for non-file pages
    };
    pgoff_t index;                  // Offset in mapping's page tree
    atomic_t _refcount;             // Usage count (must be > 0)
    atomic_t _mapcount;             // PTE mapping count (-1 = unmapped)
    struct list_head lru;           // LRU list for reclamation
    // ...
};
```

The address_space structure is central to sharing. All processes mapping the same file share the same address_space, which contains the page cache. When any process accesses a page, it comes from this shared cache. The i_mmap tree tracks all VMAs mapping this address_space, enabling the kernel to find all PTEs pointing to a given page (reverse mapping).
When a process calls shm_open(), what actually happens in the kernel? Let's trace the complete path.
```text
User space:
  shm_open("/my_shm", O_CREAT | O_RDWR, 0644)
        ↓
Glibc wrapper:
  - Prepends "/dev/shm/" to name
  - Calls open("/dev/shm/my_shm", O_CREAT | O_RDWR, 0644)
        ↓
Kernel VFS layer (sys_open → do_sys_open):
  1. Allocate struct file
  2. Resolve path "/dev/shm/my_shm"
     - /dev/shm is a tmpfs mount point
        ↓
tmpfs filesystem (shmem_file_setup / shmem_create):
  3. If O_CREAT: create new inode in tmpfs
     - inode type: S_IFREG (regular file)
     - inode operations: shmem_iops
     - file operations: shmem_file_operations
  4. Initialize address_space with shmem_aops
     - No disk backing; pages come from swap or memory
  5. Link file to inode
        ↓
Return to user space:
  6. Return file descriptor
     - fd references struct file
     - struct file → struct inode → address_space

At this point:
- Object exists in /dev/shm namespace
- No physical pages allocated yet (demand paging)
- address_space ready to cache pages when accessed
```

Key insight: shm_open() is essentially open() on a tmpfs filesystem. POSIX shared memory is implemented as files in a memory-backed filesystem. This elegant design leverages existing VFS infrastructure rather than implementing a parallel mechanism.
The tmpfs connection:
| Property | How tmpfs Provides It |
|---|---|
| Memory backing | Pages allocated from RAM, can swap if needed |
| Size limits | /dev/shm mounted with size limit (typically 50% RAM) |
| Persistence | Survives until unmount (reboot clears it) |
| Permissions | Standard Unix file permissions |
| Naming | Filesystem namespace (/dev/shm/name) |
| Sharing | Multiple open() calls get same inode → same address_space |
Run `mount | grep tmpfs` to see tmpfs mounts. `/dev/shm` typically has a size limit (e.g., `tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,size=50%)`). This limit constrains total POSIX shared memory. Increase it with `mount -o remount,size=4G /dev/shm` if needed.
After shm_open() or open(), the next step is mmap() to map the shared memory into the process's address space. This is where the virtual-to-physical connection is established.
```text
User space:
  mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
        ↓
Kernel (sys_mmap → do_mmap):
  1. Input validation:
     - Check fd is valid
     - Check file supports mmap (f_op->mmap exists)
     - Check offset alignment (must be page-aligned)
     - Check requested protections compatible with file mode
  2. Find virtual address range:
     - If addr == NULL: kernel picks address (get_unmapped_area)
     - Consider ASLR, existing mappings, alignment
     - Result: vma_start = 0x7f0000000000 (example)
  3. Create VMA (vm_area_alloc + vma initialization):
     vma->vm_start = 0x7f0000000000
     vma->vm_end   = 0x7f0000001000
     vma->vm_flags = VM_READ | VM_WRITE | VM_SHARED | VM_MAYSHARE
     vma->vm_file  = get_file(fd's struct file)  // Increment refcount
     vma->vm_pgoff = 0                           // Offset in pages
  4. Call file's mmap handler (shmem_mmap for tmpfs):
     - Verify mapping is allowed
     - Set vma->vm_ops = &shmem_vm_ops  // Page fault handler
     - Link VMA into file's address_space->i_mmap
  5. Insert VMA into process's mm:
     - Insert into mm->mm_mt (maple tree)
     - Update mm->total_vm, mm->shared_vm
  6. Return virtual address to user space:
     - 0x7f0000000000
        ↓
User space:
  ptr = (returned address)

IMPORTANT: No page table entries created yet!
Physical pages will be allocated on first access (demand paging).
```

The mmap system call does NOT allocate physical memory. It only creates the VMA — a metadata structure describing that this virtual address range is mapped to this file. The actual physical memory allocation is deferred until the process tries to access the memory.
When a VMA is created for a file mapping, it's linked into the address_space's i_mmap tree. This allows the kernel to find all VMAs (and thus all PTEs) mapping a given file page — essential for page reclamation, COW handling, and shared page updates.
When a process first accesses a shared memory page, a page fault occurs because no page table entry exists yet. The kernel must handle this fault by setting up the mapping.
```text
CPU executes: mov eax, [0x7f0000000100]   ; First access to shared page
        ↓
Hardware page fault (no PTE exists):
  - Save fault address: CR2 = 0x7f0000000100
  - Exception raised: #PF (Page Fault)
        ↓
Kernel page fault handler (handle_page_fault):
  1. Get fault info:
     - address = 0x7f0000000100
     - error_code: read access, user mode, not present
  2. Find VMA for faulting address (find_vma):
     - Search mm->mm_mt for VMA containing address
     - Found: vma = {start=0x7f0000000000, end=0x7f0000001000, flags=VM_SHARED}
  3. Check permissions:
     - Read access + VMA has VM_READ → OK
     - (If write access, check VM_WRITE)
  4. Call VMA fault handler (handle_mm_fault → __handle_mm_fault):
     - vma->vm_ops->fault = shmem_fault
        ↓
shmem_fault() handler (for tmpfs/shm):
  5. Calculate file offset:
     - vmf->pgoff = (address - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff
     -            = 0x100 / 4096 + 0 = 0  (page 0 of file)
  6. Look up page in page cache (shmem_get_folio):
     - Search address_space->i_pages for index 0
     - If found: use existing page (SHARING!)
     - If not found: allocate new page, add to cache
  7. For new page (shmem_alloc_and_add_folio):
     - page = __alloc_pages(GFP_HIGHUSER | __GFP_ZERO)
     - Add to address_space->i_pages at index 0
     - page->mapping = address_space
     - page->index = 0
  8. Install PTE (do_set_pte):
     - pte = mk_pte(page, vma->vm_page_prot)
     - Set PTE in process's page table
     - page->_mapcount++ (now 1)
        ↓
Return from fault handler:
  - CPU retries instruction
  - TLB now has entry, instruction succeeds!

SECOND PROCESS accesses SAME shared memory:
------------------------------------------
Same path until step 6...
  6. Look up page in page cache:
     - Search address_space->i_pages for index 0
     - FOUND! (first process already faulted it in)
     - Return existing page
  8. Install PTE for SECOND process:
     - Same physical page, NEW PTE in second process's tables
     - page->_mapcount++ (now 2)

Result: Both processes share the SAME physical page!
```

Shared memory sharing happens through the page cache (address_space). The first accessor's page fault populates the cache; subsequent accessors find the same page in the cache. This is why file-backed mmaps share automatically — they all use the file's page cache.
The kernel sometimes needs to find all page table entries pointing to a given physical page. This is called reverse mapping (rmap), and it's essential for page reclamation, COW handling, and shared page updates, since each of these must locate every PTE that references the page.
| Approach | How It Works | Trade-off |
|---|---|---|
| Object-based rmap (Linux) | Page → address_space → i_mmap tree → VMAs → PTEs | Efficient for file pages; some overhead per VMA |
| PTE chain (old approach) | Each page has linked list of all PTEs | Perfect accuracy; huge memory overhead |
| Page table scan | Walk all page tables looking for page | No overhead; very slow |
```text
Kernel needs to unmap all users of page P (e.g., for reclamation):

try_to_unmap(page P):
  1. Get the address_space:
     mapping = page->mapping
  2. Lock the mapping's i_mmap:
     i_mmap_lock_read(mapping)
  3. For each VMA in mapping->i_mmap that overlaps page's index:
     for each vma in vma_interval_tree_iter(mapping->i_mmap, start, end):
       4. Calculate virtual address in this VMA:
          address = vma->vm_start + (page->index - vma->vm_pgoff) * PAGE_SIZE
       5. Find page table entry:
          pte = find_pte(vma->vm_mm, address)
          if (pte && pte_present(*pte) && pte_page(pte) == page):
            6. Unmap the PTE:
               ptep_clear(pte)
               page->_mapcount--
            7. Flush TLB entry:
               flush_tlb_page(vma, address)
  8. Unlock and return:
     i_mmap_unlock_read(mapping)
     return (page->_mapcount == -1)  // True if fully unmapped

Key data structure: vma_interval_tree
  - Indexed by (vm_pgoff, vm_pgoff + size/PAGE_SIZE)
  - Quickly finds all VMAs that map a given file page index
  - Much faster than scanning all VMAs
```

Anonymous Page Reverse Mapping
For non-file-backed shared memory (e.g., after fork() with anonymous mappings), Linux uses a different structure called anon_vma:
```text
                anon_vma (shared)
                       │
       ┌───────────────┼───────────────┐
       │               │               │
  VMA (parent)    VMA (child1)    VMA (child2)
       │               │               │
  page tables     page tables     page tables
       │               │               │
       └───────────────┴───────────────┘
                       │
                 physical page
```
The anon_vma tree groups all VMAs that share the same anonymous pages (due to fork()), enabling efficient reverse mapping without the file/address_space structure.
When many processes map the same file (or many child processes exist after fork), the rmap tree can become very large. Unmapping a single page requires iterating through many VMAs. This is why extremely high levels of sharing (e.g., 10,000 containers mapping the same library) can show performance issues during memory pressure.
System V shared memory has a separate implementation path, though it ultimately uses similar underlying mechanisms.
```c
// System V shared memory kernel structures

// Global IPC namespace contains all shared memory segments
struct ipc_namespace {
    struct ipc_ids shm_ids;         // All shm segments in this namespace
    size_t shm_ctlmax;              // Max segment size
    size_t shm_ctlall;              // Max total shared memory
    unsigned long shm_ctlmni;       // Max number of segments
    // ...
};

// Per-segment metadata
struct shmid_kernel {
    struct kern_ipc_perm shm_perm;  // IPC permissions, key, id
    struct file *shm_file;          // Underlying file (shmem/tmpfs!)
    unsigned long shm_nattch;       // Number of attached processes
    unsigned long shm_segsz;        // Size in bytes
    time64_t shm_atim, shm_dtim, shm_ctim;  // Timestamps
    // ...
};

/*
 * shmget() creates a shmid_kernel and an underlying tmpfs file.
 * shmat() calls do_mmap() on that file — similar to POSIX!
 *
 * Key insight: System V shm is implemented ON TOP OF tmpfs/shmem,
 * just with a different API and namespace.
 */

// shmget() simplified:
long do_shmget(key_t key, size_t size, int shmflg)
{
    // 1. Look up or create shmid_kernel
    struct shmid_kernel *shp = /* find by key or create */;

    // 2. Create underlying shmem file (like shm_open result)
    shp->shm_file = shmem_kernel_file_setup("SYSV...", size, 0);

    // 3. Return identifier (not fd — this is different from POSIX)
    return shp->shm_perm.id;
}

// shmat() simplified:
long do_shmat(int shmid, char *shmaddr, int shmflg)
{
    struct shmid_kernel *shp = shm_lock_check(shmid);

    // Map the underlying file — same as POSIX mmap!
    void *addr = do_mmap(shp->shm_file, /* params */);

    shp->shm_nattch++;
    return addr;
}
```

Both POSIX and System V shared memory are built on the same foundation: Linux's shmem/tmpfs filesystem. This is a memory-backed filesystem that can optionally swap pages to disk. shm_open creates files in /dev/shm (mounted tmpfs); shmget creates anonymous tmpfs files tracked by the shmid_kernel structure.
Understanding how shared memory is cleaned up is essential for debugging resource leaks and understanding system behavior.
```text
=== POSIX Shared Memory Cleanup ===

Process calls munmap(ptr, size):
  1. Find VMA covering [ptr, ptr+size)
  2. Remove VMA from mm's VMA tree
  3. For each page in the unmapped range:
     - Clear PTE
     - page->_mapcount--
     - If _mapcount == -1 (no more mappers):
       - Page remains in page cache (could be re-mapped)
       - Or kernel may reclaim if memory pressure
  4. VMA memory freed
  5. file->f_count-- (from vma->vm_file)
  6. If f_count == 0: file structure freed

Process calls shm_unlink("/name"):
  1. Remove /dev/shm/name from filesystem
  2. inode->i_nlink = 0
  3. inode NOT freed yet (existing mappings keep refcount)
  4. When last mapping is unmapped:
     - inode refcount → 0
     - All pages in address_space freed
     - inode and address_space freed

Process exits without explicit cleanup:
  1. Kernel calls exit_mm() to clean up address space
  2. All VMAs unmapped (implicit munmap for each)
  3. File refcounts decremented
  4. If shm_unlink was called: resources freed when last process exits
  5. If NOT unlinked: /dev/shm/name persists! (potential leak)

=== System V Shared Memory Cleanup ===

Process calls shmdt(ptr):
  1. Find VMA for this attachment
  2. munmap equivalent
  3. shm_segment->shm_nattch--

Admin calls shmctl(shmid, IPC_RMID, NULL):
  1. Mark segment for removal
  2. Segment remains accessible until shm_nattch == 0
  3. When last process detaches:
     - Free underlying tmpfs file
     - Free shmid_kernel
     - Remove from shm_ids

Process exits:
  1. All attachments auto-detached
  2. If IPC_RMID was called and nattch → 0: freed
  3. If NOT IPC_RMID: segment persists! (use ipcs/ipcrm to clean)
```

| Scenario | Potential Leak | Prevention |
|---|---|---|
| Process crash | Mappings auto-cleaned, but shm object may persist | Always shm_unlink when creating; check /dev/shm on startup |
| System V segment orphaned | Segment with nattch=0 persists forever | Use IPC_RMID immediately after shmget; periodic ipcs audit |
| mmap without munmap | VMA cleaned on exit, but process consumes memory | Track all mappings; RAII patterns in C++ |
| Container shutdown | IPC namespace destroyed; all resources freed | No issue if using namespaces properly |
| fork without exec/exit | Child inherits all mappings (increased refcounts) | Understand fork semantics; close_on_exec for FDs |
Common pattern question: When should you call shm_unlink? Options: (1) Immediately after creation — object disappears from namespace but remains usable by existing mappers. Prevents orphaning. (2) At explicit cleanup — allows late-comers to attach by name. Requires discipline. (3) Never — let it persist for restart recovery. Must handle stale objects. Best practice for most cases: unlink immediately after all participants have opened it.
We've traced the complete implementation of shared memory in the Linux kernel: the core data structures, the shm_open and mmap paths, page fault handling, reverse mapping, and cleanup.
Module Complete:
You've now completed a comprehensive study of shared memory via virtual memory. From the fundamental concept of page sharing, through shared libraries and inter-process communication, to protection mechanisms and implementation details, you have the knowledge to debug shared-memory problems at the kernel level, reason about their performance, and design robust IPC mechanisms.
Congratulations! You've mastered shared memory via virtual memory — one of the most powerful mechanisms in modern operating systems. You understand not just the APIs, but the underlying implementation: how page tables, page caches, and kernel data structures work together to enable zero-copy, high-performance memory sharing. This knowledge is essential for systems programming, performance engineering, and OS development.