We've explored what shared memory does and why it's useful. Now we'll examine how operating systems actually implement it.
When you call shm_open() or mmap(), what happens inside the kernel? Which data structures are created, modified, or linked? How does the OS ensure that multiple processes access the same physical pages? What cleanup occurs when processes exit?
Understanding implementation details isn't just academic curiosity; it's essential for debugging shared-memory bugs, diagnosing resource leaks, and reasoning about performance under memory pressure.
This page traces the complete lifecycle of shared memory in the Linux kernel, from creation through mapping to cleanup, revealing the elegant data structures and algorithms that make sharing work.
By the end of this page, you will understand: the kernel data structures for shared memory (VMA, address_space, page cache), how shm_open and mmap translate to kernel operations, the page fault handling path for shared mappings, how the kernel tracks sharing with reverse mappings, and the cleanup process when shared memory is released.
Linux implements shared memory through a carefully designed set of interconnected data structures. Let's examine each component and its role in enabling sharing.
| Structure | Purpose | Key Fields for Sharing |
|---|---|---|
| mm_struct | Per-process address space | mmap (VMA tree), pgd (page table root), map_count (VMA count) |
| vm_area_struct (VMA) | Describes one contiguous virtual address region | vm_start, vm_end, vm_file, vm_flags (VM_SHARED) |
| struct file | Open file instance | f_mapping (points to address_space), f_mode (access mode) |
| struct inode | Filesystem object (file/device) | i_mapping (address_space for file's pages), i_ino (inode number) |
| address_space | Page cache for one backing object | host (inode), i_pages (xarray of cached pages), a_ops (operations) |
| struct page | Metadata for one physical page frame | mapping, index, _mapcount, _refcount, flags |
```c
// Simplified representations of key Linux structures

// Per-process address space descriptor (one per process)
struct mm_struct {
    struct maple_tree mm_mt;        // Tree of VMAs (was an rbtree, now a maple tree)
    pgd_t *pgd;                     // Root of page tables
    atomic_t mm_users;              // Processes sharing this mm
    atomic_t mm_count;              // Reference count
    unsigned long start_brk, brk;   // Heap boundaries
    unsigned long start_stack;      // Stack start
    unsigned long total_vm;         // Total pages mapped
    unsigned long shared_vm;        // Shared pages
    // ...
};

// Virtual Memory Area - describes one mapping
struct vm_area_struct {
    unsigned long vm_start;         // Start address (inclusive)
    unsigned long vm_end;           // End address (exclusive)
    struct mm_struct *vm_mm;        // Owning mm_struct
    pgprot_t vm_page_prot;          // Page protection bits
    unsigned long vm_flags;         // VM_READ, VM_WRITE, VM_SHARED, etc.
    struct file *vm_file;           // Backing file (NULL for anonymous)
    unsigned long vm_pgoff;         // Offset within file (in pages)
    const struct vm_operations_struct *vm_ops;  // Fault handlers, etc.
    // (Older kernels linked VMAs into a list with vm_next/vm_prev;
    //  the maple tree in mm_struct replaced that list.)
    // ...
};

// Key flags in vm_flags:
#define VM_READ     0x00000001      // Readable
#define VM_WRITE    0x00000002      // Writable
#define VM_EXEC     0x00000004      // Executable
#define VM_SHARED   0x00000008      // Shared (vs private/COW)
#define VM_MAYSHARE 0x00000080      // Can be shared
#define VM_LOCKED   0x00002000      // Pages locked in RAM
#define VM_HUGETLB  0x00400000      // Huge TLB pages

// Page cache (address_space) - shared by all mappers of a file
struct address_space {
    struct inode *host;             // Owning inode
    struct xarray i_pages;          // XArray of cached pages
    atomic_t i_mmap_writable;       // Count of writable mmap users
    struct rb_root_cached i_mmap;   // Tree of VMAs mapping this
    const struct address_space_operations *a_ops;
    // ...
};

// Physical page frame descriptor
struct page {
    unsigned long flags;            // PG_locked, PG_dirty, PG_lru, etc.
    union {
        struct address_space *mapping;  // address_space if page cache
        // Or other uses for non-file pages
    };
    pgoff_t index;                  // Offset in mapping's page tree
    atomic_t _refcount;             // Usage count (must be > 0)
    atomic_t _mapcount;             // PTE mapping count (-1 = unmapped)
    struct list_head lru;           // LRU list for reclamation
    // ...
};
```

The address_space structure is central to sharing. All processes mapping the same file share the same address_space, which contains the page cache. When any process accesses a page, it comes from this shared cache. The i_mmap tree tracks all VMAs mapping this address_space, enabling the kernel to find all PTEs pointing to a given page (reverse mapping).
When a process calls shm_open(), what actually happens in the kernel? Let's trace the complete path.
```text
User space:
  shm_open("/my_shm", O_CREAT | O_RDWR, 0644)
        ↓
Glibc wrapper:
  - Prepends "/dev/shm/" to name
  - Calls open("/dev/shm/my_shm", O_CREAT | O_RDWR, 0644)
        ↓
Kernel VFS layer (sys_open → do_sys_open):
  1. Allocate struct file
  2. Resolve path "/dev/shm/my_shm"
     - /dev/shm is a tmpfs mount point
        ↓
tmpfs filesystem (shmem_file_setup / shmem_create):
  3. If O_CREAT: create new inode in tmpfs
     - inode type: S_IFREG (regular file)
     - inode operations: shmem_iops
     - file operations: shmem_file_operations
  4. Initialize address_space with shmem_aops
     - No disk backing; pages come from swap or memory
  5. Link file to inode
        ↓
Return to user space:
  6. Return file descriptor
     - fd references struct file
     - struct file → struct inode → address_space

At this point:
- Object exists in /dev/shm namespace
- No physical pages allocated yet (demand paging)
- address_space ready to cache pages when accessed
```

Key insight: shm_open() is essentially open() on a tmpfs filesystem. POSIX shared memory is implemented as files in a memory-backed filesystem. This elegant design leverages existing VFS infrastructure rather than implementing a parallel mechanism.
The tmpfs connection:
| Property | How tmpfs Provides It |
|---|---|
| Memory backing | Pages allocated from RAM, can swap if needed |
| Size limits | /dev/shm mounted with size limit (typically 50% RAM) |
| Persistence | Survives until unmount (reboot clears it) |
| Permissions | Standard Unix file permissions |
| Naming | Filesystem namespace (/dev/shm/name) |
| Sharing | Multiple open() calls get same inode → same address_space |
Run `mount | grep tmpfs` to see tmpfs mounts. `/dev/shm` typically has a size limit (e.g., `tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,size=50%)`). This limit constrains total POSIX shared memory. Increase it with `mount -o remount,size=4G /dev/shm` if needed.
After shm_open() or open(), the next step is mmap() to map the shared memory into the process's address space. This is where the virtual-to-physical connection is established.
```text
User space:
  mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
        ↓
Kernel (sys_mmap → do_mmap):
  1. Input validation:
     - Check fd is valid
     - Check file supports mmap (f_op->mmap exists)
     - Check offset alignment (must be page-aligned)
     - Check requested protections compatible with file mode
  2. Find virtual address range:
     - If addr == NULL: kernel picks address (get_unmapped_area)
     - Consider ASLR, existing mappings, alignment
     - Result: vma_start = 0x7f0000000000 (example)
  3. Create VMA (vm_area_alloc + vma initialization):
     vma->vm_start = 0x7f0000000000
     vma->vm_end   = 0x7f0000001000
     vma->vm_flags = VM_READ | VM_WRITE | VM_SHARED | VM_MAYSHARE
     vma->vm_file  = get_file(fd's struct file)  // Increment refcount
     vma->vm_pgoff = 0                           // Offset in pages
  4. Call file's mmap handler (shmem_mmap for tmpfs):
     - Verify mapping is allowed
     - Set vma->vm_ops = &shmem_vm_ops  // Page fault handler
     - Link VMA into file's address_space->i_mmap
  5. Insert VMA into process's mm:
     - Insert into mm->mm_mt (maple tree)
     - Update mm->total_vm, mm->shared_vm
  6. Return virtual address to user space:
     - 0x7f0000000000
        ↓
User space:
  ptr = (returned address)

IMPORTANT: No page table entries created yet!
Physical pages will be allocated on first access (demand paging).
```

The mmap system call does NOT allocate physical memory. It only creates the VMA — a metadata structure describing that this virtual address range is mapped to this file. The actual physical memory allocation is deferred until the process tries to access the memory.
When a VMA is created for a file mapping, it's linked into the address_space's i_mmap tree. This allows the kernel to find all VMAs (and thus all PTEs) mapping a given file page — essential for page reclamation, COW handling, and shared page updates.
When a process first accesses a shared memory page, a page fault occurs because no page table entry exists yet. The kernel must handle this fault by setting up the mapping.
```text
CPU executes: mov eax, [0x7f0000000100]   ; First access to shared page
        ↓
Hardware page fault (no PTE exists):
  - Save fault address: CR2 = 0x7f0000000100
  - Exception raised: #PF (Page Fault)
        ↓
Kernel page fault handler (handle_page_fault):
  1. Get fault info:
     - address = 0x7f0000000100
     - error_code: read access, user mode, not present
  2. Find VMA for faulting address (find_vma):
     - Search mm->mm_mt for VMA containing address
     - Found: vma = {start=0x7f0000000000, end=0x7f0000001000, flags=VM_SHARED}
  3. Check permissions:
     - Read access + VMA has VM_READ → OK
     - (If write access, check VM_WRITE)
  4. Call VMA fault handler (handle_mm_fault → __handle_mm_fault):
     - vma->vm_ops->fault = shmem_fault
        ↓
shmem_fault() handler (for tmpfs/shm):
  5. Calculate file offset:
     - vmf->pgoff = (address - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff
     -            = 0x100 / 4096 + 0 = 0  (page 0 of file)
  6. Look up page in page cache (shmem_get_folio):
     - Search address_space->i_pages for index 0
     - If found: use existing page (SHARING!)
     - If not found: allocate new page, add to cache
  7. For new page (shmem_alloc_and_add_folio):
     - page = __alloc_pages(GFP_HIGHUSER | __GFP_ZERO)
     - Add to address_space->i_pages at index 0
     - page->mapping = address_space
     - page->index = 0
  8. Install PTE (do_set_pte):
     - pte = mk_pte(page, vma->vm_page_prot)
     - Set PTE in process's page table
     - page->_mapcount++ (now 1)
        ↓
Return from fault handler:
  - CPU retries instruction
  - TLB now has entry, instruction succeeds!

SECOND PROCESS accesses SAME shared memory:
------------------------------------------
Same path until step 6...
  6. Look up page in page cache:
     - Search address_space->i_pages for index 0
     - FOUND! (first process already faulted it in)
     - Return existing page
  8. Install PTE for SECOND process:
     - Same physical page, NEW PTE in second process's tables
     - page->_mapcount++ (now 2)

Result: Both processes share the SAME physical page!
```

Shared memory sharing happens through the page cache (address_space). The first accessor's page fault populates the cache; subsequent accessors find the same page in the cache. This is why file-backed mmaps share automatically — they all use the file's page cache.
The kernel sometimes needs to find all page table entries pointing to a given physical page. This is called reverse mapping (rmap), and it's essential for page reclamation, COW handling, and shared page updates, since each of these must locate every PTE that references the page.
| Approach | How It Works | Trade-off |
|---|---|---|
| Object-based rmap (Linux) | Page → address_space → i_mmap tree → VMAs → PTEs | Efficient for file pages; some overhead per VMA |
| PTE chain (old approach) | Each page has linked list of all PTEs | Perfect accuracy; huge memory overhead |
| Page table scan | Walk all page tables looking for page | No overhead; very slow |
```text
Kernel needs to unmap all users of page P (e.g., for reclamation):

try_to_unmap(page P):
  1. Get the address_space:
     mapping = page->mapping
  2. Lock the mapping's i_mmap:
     i_mmap_lock_read(mapping)
  3. For each VMA in mapping->i_mmap that overlaps page's index:
     for each vma in vma_interval_tree_iter(mapping->i_mmap, start, end):
       4. Calculate virtual address in this VMA:
          address = vma->vm_start + (page->index - vma->vm_pgoff) * PAGE_SIZE
       5. Find page table entry:
          pte = find_pte(vma->vm_mm, address)
          if (pte && pte_present(*pte) && pte_page(pte) == page):
            6. Unmap the PTE:
               ptep_clear(pte)
               page->_mapcount--
            7. Flush TLB entry:
               flush_tlb_page(vma, address)
  8. Unlock and return:
     i_mmap_unlock_read(mapping)
     return (page->_mapcount == -1)  // True if fully unmapped

Key data structure: vma_interval_tree
  - Indexed by (vm_pgoff, vm_pgoff + size/PAGE_SIZE)
  - Quickly finds all VMAs that map a given file page index
  - Much faster than scanning all VMAs
```

Anonymous Page Reverse Mapping
For non-file-backed shared memory (e.g., after fork() with anonymous mappings), Linux uses a different structure called anon_vma:
```text
                anon_vma (shared)
                       │
       ┌───────────────┼───────────────┐
       │               │               │
  VMA (parent)    VMA (child1)    VMA (child2)
       │               │               │
  page tables     page tables     page tables
       │               │               │
       └───────────────┴───────────────┘
                       │
                 physical page
```
The anon_vma tree groups all VMAs that share the same anonymous pages (due to fork()), enabling efficient reverse mapping without the file/address_space structure.
When many processes map the same file (or many child processes exist after fork), the rmap tree can become very large. Unmapping a single page requires iterating through many VMAs. This is why extremely high levels of sharing (e.g., 10,000 containers mapping the same library) can show performance issues during memory pressure.
System V shared memory has a separate implementation path, though it ultimately uses similar underlying mechanisms.
```c
// System V shared memory kernel structures

// Global IPC namespace contains all shared memory segments
struct ipc_namespace {
    struct ipc_ids shm_ids;         // All shm segments in this namespace
    size_t shm_ctlmax;              // Max segment size
    size_t shm_ctlall;              // Max total shared memory
    unsigned long shm_ctlmni;       // Max number of segments
    // ...
};

// Per-segment metadata
struct shmid_kernel {
    struct kern_ipc_perm shm_perm;  // IPC permissions, key, id
    struct file *shm_file;          // Underlying file (shmem/tmpfs!)
    unsigned long shm_nattch;       // Number of attached processes
    unsigned long shm_segsz;        // Size in bytes
    time64_t shm_atim, shm_dtim, shm_ctim;  // Timestamps
    // ...
};

/*
 * shmget() creates a shmid_kernel and an underlying tmpfs file.
 * shmat() calls do_mmap() on that file — similar to POSIX!
 *
 * Key insight: System V shm is implemented ON TOP OF tmpfs/shmem,
 * just with a different API and namespace.
 */

// shmget() simplified:
long do_shmget(key_t key, size_t size, int shmflg)
{
    // 1. Look up or create shmid_kernel
    struct shmid_kernel *shp = /* find by key or create */;

    // 2. Create underlying shmem file (like shm_open result)
    shp->shm_file = shmem_kernel_file_setup("SYSV...", size, 0);

    // 3. Return identifier (not fd — this is different from POSIX)
    return shp->shm_perm.id;
}

// shmat() simplified:
long do_shmat(int shmid, char *shmaddr, int shmflg)
{
    struct shmid_kernel *shp = shm_lock_check(shmid);

    // Map the underlying file — same as POSIX mmap!
    void *addr = do_mmap(shp->shm_file, /* params */);

    shp->shm_nattch++;
    return addr;
}
```

Both POSIX and System V shared memory are built on the same foundation: Linux's shmem/tmpfs filesystem. This is a memory-backed filesystem that can optionally swap pages to disk. shm_open creates files in /dev/shm (mounted tmpfs); shmget creates anonymous tmpfs files tracked by the shmid_kernel structure.
Understanding how shared memory is cleaned up is essential for debugging resource leaks and understanding system behavior.
```text
=== POSIX Shared Memory Cleanup ===

Process calls munmap(ptr, size):
  1. Find VMA covering [ptr, ptr+size)
  2. Remove VMA from mm's VMA tree
  3. For each page in the unmapped range:
     - Clear PTE
     - page->_mapcount--
     - If _mapcount == -1 (no more mappers):
       - Page remains in page cache (could be re-mapped)
       - Or kernel may reclaim if memory pressure
  4. VMA memory freed
  5. file->f_count-- (from vma->vm_file)
  6. If f_count == 0: file structure freed

Process calls shm_unlink("/name"):
  1. Remove /dev/shm/name from filesystem
  2. inode->i_nlink = 0
  3. inode NOT freed yet (existing mappings keep refcount)
  4. When last mapping is unmapped:
     - inode refcount → 0
     - All pages in address_space freed
     - inode and address_space freed

Process exits without explicit cleanup:
  1. Kernel calls exit_mm() to clean up address space
  2. All VMAs unmapped (implicit munmap for each)
  3. File refcounts decremented
  4. If shm_unlink was called: resources freed when last process exits
  5. If NOT unlinked: /dev/shm/name persists! (potential leak)

=== System V Shared Memory Cleanup ===

Process calls shmdt(ptr):
  1. Find VMA for this attachment
  2. munmap equivalent
  3. shm_segment->shm_nattch--

Admin calls shmctl(shmid, IPC_RMID, NULL):
  1. Mark segment for removal
  2. Segment remains accessible until shm_nattch == 0
  3. When last process detaches:
     - Free underlying tmpfs file
     - Free shmid_kernel
     - Remove from shm_ids

Process exits:
  1. All attachments auto-detached
  2. If IPC_RMID was called and nattch → 0: freed
  3. If NOT IPC_RMID: segment persists! (use ipcs/ipcrm to clean)
```

| Scenario | Potential Leak | Prevention |
|---|---|---|
| Process crash | Mappings auto-cleaned, but shm object may persist | Always shm_unlink when creating; check /dev/shm on startup |
| System V segment orphaned | Segment with nattch=0 persists forever | Use IPC_RMID immediately after shmget; periodic ipcs audit |
| mmap without munmap | VMA cleaned on exit, but process consumes memory | Track all mappings; RAII patterns in C++ |
| Container shutdown | IPC namespace destroyed; all resources freed | No issue if using namespaces properly |
| fork without exec/exit | Child inherits all mappings (increased refcounts) | Understand fork semantics; close_on_exec for FDs |
Common pattern question: When should you call shm_unlink? Options: (1) Immediately after creation — object disappears from namespace but remains usable by existing mappers. Prevents orphaning. (2) At explicit cleanup — allows late-comers to attach by name. Requires discipline. (3) Never — let it persist for restart recovery. Must handle stale objects. Best practice for most cases: unlink immediately after all participants have opened it.
We've traced the complete implementation of shared memory in the Linux kernel: the core data structures, the shm_open and mmap paths, page fault handling, reverse mapping, and cleanup.
Module Complete:
You've now completed a comprehensive study of shared memory via virtual memory. From the fundamental concept of page sharing, through shared libraries and inter-process communication, to protection mechanisms and implementation details, you have the knowledge to debug shared-memory problems at the kernel level, reason about their performance, and design robust IPC mechanisms.
Congratulations! You've mastered shared memory via virtual memory — one of the most powerful mechanisms in modern operating systems. You understand not just the APIs, but the underlying implementation: how page tables, page caches, and kernel data structures work together to enable zero-copy, high-performance memory sharing. This knowledge is essential for systems programming, performance engineering, and OS development.