When two people need to collaborate, one approach is to give them a shared whiteboard. Both can write on it, both can read from it, changes appear instantly, and there's no middleman passing notes. This is the essence of shared memory IPC.
In the shared memory model, multiple processes map the same physical memory region into their respective virtual address spaces. Once established, processes can read and write this shared region as if it were regular memory—at processor speeds, without kernel involvement for each access.
This approach offers the highest possible performance for IPC. But like a shared whiteboard with no rules about who writes where and when, shared memory without proper synchronization leads to chaos. Understanding both the power and the pitfalls of shared memory is essential for systems programming.
By the end of this page, you will understand how shared memory works at the hardware and OS level, the System V and POSIX APIs for creating and mapping shared memory, why synchronization is mandatory (not optional), common patterns for safe shared memory usage, and when shared memory is the right choice.
Shared memory is deceptively simple in concept: take a region of physical memory and make it accessible from multiple process address spaces. But understanding the implementation requires grasping how virtual memory works.
Virtual to Physical Mapping
Recall that each process has its own virtual address space. The Memory Management Unit (MMU) translates virtual addresses to physical addresses using page tables. Normally, each process's page tables point to distinct physical pages—this is what provides isolation.
Shared memory works by having multiple processes' page tables point to the same physical pages:
```
Process A virtual address space             Physical RAM
┌──────────────────────────┐
│ ...                      │
├──────────────────────────┤            ┌──────────────────────┐
│ 0x7fff00000000           │───────────►│ Physical page 0x1A3  │
│ [Shared region, 4096 B]  │            │ [Shared data here]   │
├──────────────────────────┤            └──────────────────────┘
│ ...                      │                       ▲
└──────────────────────────┘                       │
  Page table A: VPN 0x7fff0000 → PFN 0x1A3         │
                                                   │
Process B virtual address space                    │
┌──────────────────────────┐                       │
│ ...                      │                       │
├──────────────────────────┤                       │
│ 0x40000000               │───────────────────────┘
│ [Shared region, 4096 B]  │
├──────────────────────────┤   Same physical page (0x1A3) is mapped at
│ ...                      │   different virtual addresses in each process!
└──────────────────────────┘
  Page table B: VPN 0x4000000 → PFN 0x1A3
```

Key Observations:
Same physical memory, different virtual addresses: Process A sees the shared region at 0x7fff00000000; Process B sees it at 0x40000000. The virtual addresses differ, but the physical page is identical.
No kernel involvement for access: Once mapped, reading and writing the shared region is just normal memory access. The CPU's memory bus handles it directly—no system call overhead.
Cache coherency is automatic: Modern multi-core CPUs maintain cache coherency through hardware protocols (like MESI). When one core writes to shared memory, other cores see the update (eventually—we'll discuss memory ordering later).
Explicit setup required: Unlike normal process memory, shared memory must be explicitly created and attached. This is where the IPC system calls come in.
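To see "same physical page, plain memory access" in action, here is a minimal sketch using an anonymous shared mapping inherited across fork(), one of the mmap-based setups covered below:

```c
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

int main() {
    // One shared page: after fork(), both parent and child page tables
    // point to the same physical page
    int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    *shared = 0;
    if (fork() == 0) {
        *shared = 42;   // plain store, no system call involved
        _exit(0);
    }
    wait(NULL);
    printf("parent sees: %d\n", *shared);   // prints 42
    return 0;
}
```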
Shared memory segments exist independently of any process. A process creates a segment, and it persists until explicitly destroyed—even if the creating process terminates. This is both a feature (allows processes to communicate after restarts) and a hazard (leaked segments consume memory indefinitely).
There are three primary mechanisms for creating shared memory in Unix-like systems:
1. System V shared memory (shmget, shmat)
2. POSIX shared memory (shm_open, mmap)
3. Memory-mapped files (open, mmap with MAP_SHARED)

Each has distinct characteristics. Let's examine them in detail.
System V shared memory is the oldest IPC shared memory mechanism, dating to Unix System V in 1983. It remains widely supported and is still used in many production systems.
API Overview:
```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    // Step 1: Generate a key
    // ftok() creates a key from a file path and project ID
    // The file must exist; the path and ID together create a unique key
    key_t key = ftok("/tmp/myapp", 'A');
    if (key == -1) {
        perror("ftok failed");
        exit(1);
    }

    // Step 2: Create or get the shared memory segment
    // Arguments: key, size in bytes, flags
    //   IPC_CREAT: create if doesn't exist
    //   IPC_EXCL: fail if already exists (use with IPC_CREAT for exclusive create)
    //   0666: permissions (read/write for owner, group, others)
    int shmid = shmget(key, 4096, IPC_CREAT | 0666);
    if (shmid == -1) {
        perror("shmget failed");
        exit(1);
    }
    printf("Created/accessed shared memory segment ID: %d\n", shmid);

    // Step 3: Attach the segment to our address space
    // shmat() maps the segment into our virtual address space
    // Arguments: segment ID, preferred address (NULL = let kernel choose), flags
    // Returns the address where the segment is attached
    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void*)-1) {
        perror("shmat failed");
        exit(1);
    }
    printf("Attached at address: %p\n", addr);

    // Step 4: Use the shared memory like regular memory
    strcpy((char*)addr, "Hello from Process A!");

    // Step 5: Detach when done (optional before exit, but good practice)
    // This only removes the mapping from this process; the segment still exists
    if (shmdt(addr) == -1) {
        perror("shmdt failed");
        exit(1);
    }

    // Step 6: Optionally remove the segment entirely
    // Only do this when no processes need it anymore
    // if (shmctl(shmid, IPC_RMID, NULL) == -1) {
    //     perror("shmctl IPC_RMID failed");
    // }

    return 0;
}
```

System V shared memory essentials:

- Naming: key_t values generated via ftok() or manually
- Cleanup: shmctl(shmid, IPC_RMID, NULL) to remove a segment
- Limits: /proc/sys/kernel/shmmax, /proc/sys/kernel/shmmni control max sizes
- ipcs/ipcrm commands: list (ipcs -m) and remove (ipcrm -m shmid) segments from the shell

Use POSIX shared memory (shm_open) for in-memory IPC with modern, clean APIs. Use memory-mapped files when you need persistence to disk or want file-based access control. Use System V shared memory only for legacy compatibility or on systems lacking POSIX shared memory support.
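Since POSIX shared memory is the recommended path, here is a minimal sketch of the equivalent workflow; the segment name /myapp_shm is a placeholder. On older glibc versions this needs -lrt at link time.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main() {
    // Create (or open) a named POSIX shared memory object
    int fd = shm_open("/myapp_shm", O_CREAT | O_RDWR, 0666);
    if (fd == -1) { perror("shm_open"); return 1; }

    // Size it: POSIX segments start at length 0
    if (ftruncate(fd, 4096) == -1) { perror("ftruncate"); return 1; }

    // Map it into our address space
    char *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);   // the mapping stays valid after closing the fd

    strcpy(addr, "Hello from POSIX shared memory!");

    munmap(addr, 4096);          // detach (like shmdt)
    shm_unlink("/myapp_shm");    // remove the name (like IPC_RMID)
    return 0;
}
```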
Shared memory provides the fastest possible IPC—but this speed comes with responsibility. Shared memory provides no automatic synchronization whatsoever. The operating system maps the memory and then steps back. What happens next is entirely up to your code.
This section is critical. More bugs in shared memory systems come from synchronization failures than from the actual mapping code.
Without proper synchronization, two processes writing to overlapping memory regions will produce unpredictable results. This isn't a 'sometimes fails' situation—it's a 'silently corrupts data in ways you won't notice until production' situation.
Why Is Synchronization Necessary?
Consider a simple counter shared between two processes:
```c
// Shared memory contains:
struct shared_data {
    int counter;   // Both processes increment this
};

// Process A and Process B both run:
void increment_counter(struct shared_data *shared) {
    shared->counter = shared->counter + 1;   // Looks atomic, but ISN'T!
}

// This simple line compiles to multiple CPU instructions:
//   1. LOAD  shared->counter from memory into a register
//   2. ADD   1 to the register
//   3. STORE the register value back to shared->counter

// Race condition scenario:
// Time   Process A            Process B            Memory
// ─────────────────────────────────────────────────────────────
// t0                                               counter = 0
// t1     LOAD counter (=0)                         0
// t2                          LOAD counter (=0)    0
// t3     ADD 1 (register=1)                        0
// t4                          ADD 1 (register=1)   0
// t5     STORE counter (=1)                        1
// t6                          STORE counter (=1)   1  ← BUG!
//
// Expected: counter = 2 (incremented twice)
// Actual:   counter = 1 (one increment lost!)
```
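For this particular bug, the smallest fix is a C11 atomic increment; a sketch, assuming the counter is declared with an atomic type:

```c
#include <stdatomic.h>

struct shared_data {
    atomic_int counter;   // atomic type makes the increment indivisible
};

void increment_counter(struct shared_data *shared) {
    // LOAD/ADD/STORE happen as one indivisible operation,
    // so concurrent increments can no longer be lost
    atomic_fetch_add(&shared->counter, 1);
}
```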
Memory Ordering and Visibility

The problem goes deeper than race conditions. Modern CPUs and compilers reorder memory operations for performance. Without proper memory barriers:
Writes may not be immediately visible: Process A writes to location X, then Y. Process B might see the new Y but the old X.
Reads may return stale values: Process B's CPU cache might hold an old copy of memory that hasn't been invalidated yet.
Compilers reorder code: The compiler might reorder statements for optimization, breaking assumptions about execution order.
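All three problems are addressed by release/acquire ordering. Here is a minimal publication sketch using C11 atomics; the mailbox_t type is illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    int payload;        // written before the flag is set
    atomic_bool ready;
} mailbox_t;

// Writer: the release store guarantees the payload write
// is visible before ready becomes true
void publish(mailbox_t *m, int value) {
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

// Reader: the acquire load guarantees that if ready is true,
// the payload written before it is visible too
int try_consume(mailbox_t *m, int *out) {
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return -1;   // not published yet
    *out = m->payload;
    return 0;
}
```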
Synchronization Primitives for Shared Memory:
To use shared memory safely, you need synchronization primitives:
| Primitive | Mechanism | Use Case | API |
|---|---|---|---|
| Semaphore | Integer counter with atomic decrement/increment | Counting resources, binary mutex behavior | POSIX: sem_open(), sem_wait(), sem_post() |
| Mutexes | Binary lock for mutual exclusion | Protecting critical sections in shared memory | pthread_mutex_t with PTHREAD_PROCESS_SHARED |
| Condition Variables | Wait for a condition with mutex protection | Producer-consumer, event notification | pthread_cond_t with PTHREAD_PROCESS_SHARED |
| Read-Write Locks | Multiple readers or one writer | Read-heavy workloads with occasional writes | pthread_rwlock_t with PTHREAD_PROCESS_SHARED |
| Spin Locks | Busy-wait lock for short critical sections | Very short operations, low contention | Atomic operations or pthread_spin_lock |
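From the first row of the table, a named POSIX semaphore can serve as a cross-process binary lock; a minimal sketch, where the name /myapp_sem is a placeholder:

```c
#include <semaphore.h>
#include <fcntl.h>
#include <stdio.h>

int main() {
    // Named semaphore, initial value 1 => binary mutex behavior
    sem_t *sem = sem_open("/myapp_sem", O_CREAT, 0666, 1);
    if (sem == SEM_FAILED) { perror("sem_open"); return 1; }

    sem_wait(sem);   // enter critical section (decrement, may block)
    // ... touch shared memory here ...
    sem_post(sem);   // leave critical section (increment)

    sem_close(sem);
    // sem_unlink("/myapp_sem");   // remove the name when no longer needed
    return 0;
}
```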
For mutex-based protection, the mutex itself must live inside the shared region and be initialized as process-shared:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// Shared memory structure with embedded synchronization
typedef struct {
    pthread_mutex_t mutex;   // MUST be in shared memory too!
    int counter;
} shared_data_t;

int main() {
    // Create shared memory
    int fd = shm_open("/safe_shared", O_CREAT | O_RDWR, 0666);
    if (fd == -1) { perror("shm_open"); exit(1); }
    ftruncate(fd, sizeof(shared_data_t));
    shared_data_t *shared = mmap(NULL, sizeof(shared_data_t),
                                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (shared == MAP_FAILED) { perror("mmap"); exit(1); }

    // Initialize mutex with PTHREAD_PROCESS_SHARED attribute
    // This is CRITICAL - default mutexes only work within one process!
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shared->mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    shared->counter = 0;

    // Safe increment - works correctly across processes
    pthread_mutex_lock(&shared->mutex);
    shared->counter++;   // Protected by mutex
    pthread_mutex_unlock(&shared->mutex);

    printf("Counter: %d\n", shared->counter);

    // Cleanup...
    // Note: Must coordinate mutex destruction across all processes!
    return 0;
}
```

Default pthread mutexes and condition variables only work between threads of the same process. For inter-process synchronization, you MUST set the PTHREAD_PROCESS_SHARED attribute. Forgetting this is a common bug that may appear to work during testing but fails under load.
Certain patterns appear repeatedly in shared memory systems. Understanding these patterns helps you design robust shared memory IPC.
Pattern 1: Producer-Consumer with Circular Buffer
A classic pattern where one process writes data and another reads it, using a fixed-size buffer:
```c
#include <stdatomic.h>
#include <stdint.h>

#define BUFFER_SIZE 1024   // Must be power of 2 for efficient modulo
#define BUFFER_MASK (BUFFER_SIZE - 1)

typedef struct {
    // Use atomics for lock-free single-producer single-consumer.
    // Padding keeps each index on its own cache line to avoid false
    // sharing (a cache line is typically 64 bytes)
    atomic_uint write_idx;                 // Only producer modifies
    char pad1[64 - sizeof(atomic_uint)];
    atomic_uint read_idx;                  // Only consumer modifies
    char pad2[64 - sizeof(atomic_uint)];

    // The actual buffer
    uint8_t data[BUFFER_SIZE];
} spsc_queue_t;

// Producer side (only one producer!)
int push(spsc_queue_t *q, uint8_t value) {
    unsigned int write = atomic_load_explicit(&q->write_idx, memory_order_relaxed);
    unsigned int read  = atomic_load_explicit(&q->read_idx, memory_order_acquire);

    // Check if buffer is full
    if (((write + 1) & BUFFER_MASK) == (read & BUFFER_MASK)) {
        return -1;   // Buffer full
    }

    // Write data
    q->data[write & BUFFER_MASK] = value;

    // Publish write (release semantics ensure data is visible before index update)
    atomic_store_explicit(&q->write_idx, write + 1, memory_order_release);
    return 0;
}

// Consumer side (only one consumer!)
int pop(spsc_queue_t *q, uint8_t *value) {
    unsigned int read  = atomic_load_explicit(&q->read_idx, memory_order_relaxed);
    unsigned int write = atomic_load_explicit(&q->write_idx, memory_order_acquire);

    // Check if buffer is empty
    if ((read & BUFFER_MASK) == (write & BUFFER_MASK)) {
        return -1;   // Buffer empty
    }

    // Read data
    *value = q->data[read & BUFFER_MASK];

    // Publish read (release semantics)
    atomic_store_explicit(&q->read_idx, read + 1, memory_order_release);
    return 0;
}
```

Pattern 2: Read-Heavy Shared State with SeqLock
When readers vastly outnumber writers, a sequence lock provides excellent performance:
```c
#include <stdatomic.h>

typedef struct {
    atomic_uint sequence;   // Even when unlocked, odd during write
    // Protected data - can be any structure
    int x;
    int y;
    int z;
} seqlock_data_t;

// Writer (must have external mutual exclusion between writers)
void write_data(seqlock_data_t *data, int x, int y, int z) {
    unsigned int seq = atomic_load(&data->sequence);
    atomic_store(&data->sequence, seq + 1);   // Mark as being written (odd)
    atomic_thread_fence(memory_order_release);

    // Write the data
    data->x = x;
    data->y = y;
    data->z = z;

    atomic_thread_fence(memory_order_release);
    atomic_store(&data->sequence, seq + 2);   // Mark complete (even again)
}

// Reader (multiple readers can proceed simultaneously)
int read_data(seqlock_data_t *data, int *x, int *y, int *z) {
    unsigned int seq1, seq2;
    do {
        // Wait until no writer is in progress (sequence is even)
        do {
            seq1 = atomic_load_explicit(&data->sequence, memory_order_acquire);
        } while (seq1 & 1);

        // Read the data
        *x = data->x;
        *y = data->y;
        *z = data->z;

        atomic_thread_fence(memory_order_acquire);
        seq2 = atomic_load_explicit(&data->sequence, memory_order_acquire);
    } while (seq1 != seq2);   // Retry if a writer modified during the read
    return 0;
}

// Readers never block writers, and writers never block readers!
// Readers may retry if a write occurs during their read.
```

Never store raw pointers in shared memory! Each process maps the shared region at a different virtual address, so pointers from one process are meaningless to another. Use offsets from the start of the shared region instead, or embed all data directly in the shared structure.
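A sketch of the offset technique that the warning suggests; all names here are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

// Instead of a pointer, store an offset from the region's base address
typedef struct {
    uint32_t head_offset;   // offset of the first node, 0 = "null"
    char     storage[4096]; // nodes are allocated inside the region
} shared_region_t;

typedef struct {
    int      value;
    uint32_t next_offset;   // offset of the next node, 0 = "null"
} node_t;

// Each process converts offsets to pointers using ITS OWN base address
static inline node_t *node_at(shared_region_t *base, uint32_t offset) {
    return offset ? (node_t *)((char *)base + offset) : NULL;
}

static inline uint32_t offset_of(shared_region_t *base, node_t *node) {
    return node ? (uint32_t)((char *)node - (char *)base) : 0;
}
```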
Shared memory is the fastest IPC mechanism, but understanding why requires analyzing its performance characteristics.
| Metric | Shared Memory | Pipes/Message Queues | Reason |
|---|---|---|---|
| Setup cost | Higher (mmap, shm_open) | Lower (pipe, socketpair) | Memory mapping has more kernel work |
| Per-operation cost | ~0 (memory access) | System call overhead (1-10μs) | No kernel involvement for shm access |
| Data copy overhead | Zero copy | 2 copies (user→kernel→user) | Data written directly to shared region |
| Synchronization | Explicit (your responsibility) | Implicit (kernel handles) | Shared memory requires manual sync |
| Cache behavior | Best (data in CPU cache) | Worse (kernel buffers involved) | Shared memory leverages CPU cache |
| Max throughput | Memory bandwidth (~50-100 GB/s) | Limited by copy bandwidth (~1-5 GB/s) | Direct memory access vs. copying |
When Shared Memory Excels:
Large Data Transfers: Sending a 100MB data structure? With pipes, you copy 100MB into the kernel, then 100MB out—200MB of memory bandwidth consumed. With shared memory, the receiver accesses the same physical memory—zero copies.
High-Frequency Communication: If processes exchange data 100,000 times per second, the ~2μs system call overhead per pipe operation adds up to 200ms of pure overhead per second. Shared memory operations are memory accesses taking nanoseconds.
Random Access Patterns: Need to read just one field from a large shared structure? Shared memory accesses exactly those bytes. Pipes must send entire messages.
Multiple Readers: Multiple processes can simultaneously read shared memory without interference (with appropriate locking). Pipe data can only be read once.
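For that multiple-readers case, the process-shared read-write lock from the primitives table is the natural fit; a minimal initialization sketch:

```c
#include <pthread.h>

// The lock must live inside the shared memory region, like the mutex earlier
pthread_rwlock_t *init_shared_rwlock(pthread_rwlock_t *lock) {
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    // Without PTHREAD_PROCESS_SHARED the lock only works within one process
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_init(lock, &attr);
    pthread_rwlockattr_destroy(&attr);
    return lock;
}

// Readers proceed in parallel:
//   pthread_rwlock_rdlock(lock); ... read ...; pthread_rwlock_unlock(lock);
// Writers get exclusive access:
//   pthread_rwlock_wrlock(lock); ... write ...; pthread_rwlock_unlock(lock);
```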
Benchmark Example:
```
# Benchmark: Transfer 1 million 4KB messages between two processes
# (Single-threaded producer and consumer, Linux x86_64, pinned to same NUMA node)

Mechanism                  Throughput    Latency (avg)   CPU Utilization
═══════════════════════════════════════════════════════════════════════
Shared Memory (SPSC)       18.2 GB/s     ~50 ns          12% (spin wait)
Unix Domain Socket          2.3 GB/s     ~1.7 μs         48%
TCP Socket (loopback)       1.8 GB/s     ~2.2 μs         62%
Pipe                        1.9 GB/s     ~2.0 μs         55%
POSIX Message Queue         0.8 GB/s     ~4.8 μs         78%

# Notes:
# - Shared memory is 8-20x faster in throughput
# - Shared memory latency is ~40x lower
# - Socket/pipe CPU usage is higher due to system call overhead
# - Shared memory spins waiting for data (configurable)

# For mixed or bursty workloads, differences may be less dramatic due to:
# - CPU cache effects
# - Synchronization contention
# - Application processing time dominating IPC time
```

The Hidden Costs:
Shared memory's raw performance can be misleading. Consider these hidden costs:
Synchronization overhead: Mutex locks, especially contended ones, can add microseconds of latency
False sharing: If two processes access different variables that happen to be on the same cache line, cache invalidation traffic kills performance
Memory ordering complexity: Incorrect use of atomics or missing memory barriers causes subtle bugs that can take weeks to diagnose
Debugging difficulty: Shared memory bugs don't produce nice error messages. Data corruption manifests as seemingly random failures far from the actual bug.
Shared memory is faster, but it's also harder to get right. A pipe that's 'slow enough' but correct will serve you better than shared memory that's fast but subtly corrupts data once a month. Choose shared memory when you've measured that other IPC is a bottleneck, not as a premature optimization.
Shared memory is used extensively in performance-critical systems. Understanding these real-world applications illustrates when shared memory is the right choice.
Case Study: PostgreSQL Shared Memory Architecture
PostgreSQL's use of shared memory is a master class in practical shared memory design:
```
Postmaster process (main)
  │  creates and initializes shared memory
  ▼
┌─────────────────────────────────────────────────────────────┐
│ Shared Memory Segment                                       │
│                                                             │
│  Shared Buffers (shared_buffers config parameter)           │
│   - Cached database pages (typically 25% of RAM)            │
│   - Buffer descriptors with pin counts, dirty flags         │
│   - Protected by buffer manager locks                       │
│                                                             │
│  Lock Tables                                                │
│   - Row locks, table locks, advisory locks                  │
│   - Fast path for common cases                              │
│                                                             │
│  WAL Buffers (Write-Ahead Log)                              │
│   - Transaction log before disk write                       │
│   - Protected by WAL insert locks                           │
│                                                             │
│  Various tables: CLOG, Subtrans, Proc Array, PGXACT...      │
└─────────────────────────────────────────────────────────────┘
        ▲                 ▲                  ▲
  ┌─────┴──────┐    ┌─────┴──────┐     ┌─────┴──────┐
  │ Backend 1  │    │ Backend 2  │ ... │ Backend N  │
  │ (Client 1) │    │ (Client 2) │     │ (Client N) │
  └────────────┘    └────────────┘     └────────────┘
  Each backend attaches to the same shared memory segment
```

Why PostgreSQL uses shared memory:

1. Buffer cache shared across all connections (no duplication)
2. Lock visibility: all processes see the same lock state
3. Transaction status visible to all (MVCC coordination)
4. Extremely high-frequency access (every query touches shared_buffers)

PostgreSQL uses lightweight locks (LWLocks), spinlocks, and buffer pins for synchronization. The lock manager itself is in shared memory. This design evolved over 25+ years and handles thousands of concurrent connections efficiently—but it required immense expertise to get right.
Shared memory's power comes with serious pitfalls. Here are the most common mistakes and how to avoid them.
The most common pitfall is leaked segments: shared memory persists after the processes using it exit, so clean up explicitly with shm_unlink, shmctl(IPC_RMID), or signal handlers that release segments on abnormal termination.
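A defensive sketch, assuming a hypothetical segment name and using atexit() plus simple signal handlers (note that POSIX does not guarantee shm_unlink is async-signal-safe; treat this as illustrative):

```c
#include <sys/mman.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static const char *g_shm_name = "/myapp_shm";   // hypothetical segment name

static void cleanup_shm(void) {
    shm_unlink(g_shm_name);   // remove the name; memory freed once unmapped
}

static void on_signal(int sig) {
    cleanup_shm();
    _exit(128 + sig);
}

void install_cleanup(void) {
    atexit(cleanup_shm);         // normal exit
    signal(SIGINT, on_signal);   // Ctrl-C
    signal(SIGTERM, on_signal);  // kill
}
```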
More broadly, a versioned, self-describing layout catches many of these mistakes early:

```c
// Best Practice: Versioned, self-describing shared memory structure

#define SHM_VERSION 1
#define CACHE_LINE_SIZE 64

typedef struct {
    // Header - always at the start
    uint32_t version;   // Detect structure mismatches
    uint32_t size;      // Total size for validation
    uint32_t magic;     // 0xDEADBEEF to detect corruption

    // Synchronization primitives
    pthread_mutex_t mutex;
    pthread_cond_t cond;

    // Padding to align data to cache line
    char _pad[CACHE_LINE_SIZE - (sizeof(uint32_t)*3 +
              sizeof(pthread_mutex_t) + sizeof(pthread_cond_t)) % CACHE_LINE_SIZE];

    // Application data - aligned to cache line
    struct {
        // Hot data that one process writes, others read
        // Separate cache lines for different writers!
        _Alignas(CACHE_LINE_SIZE) int producer_counter;
        _Alignas(CACHE_LINE_SIZE) int consumer_counter;

        // Bulk data
        _Alignas(CACHE_LINE_SIZE) char data[4096];
    } payload;
} shared_region_t;

_Static_assert(sizeof(shared_region_t) % CACHE_LINE_SIZE == 0,
               "shared_region_t must be cache-line aligned");

// Initialization with validation
shared_region_t* init_shared(const char *name, int create) {
    int flags = O_RDWR | (create ? O_CREAT | O_EXCL : 0);
    int fd = shm_open(name, flags, 0666);
    if (fd < 0) return NULL;

    if (create) {
        ftruncate(fd, sizeof(shared_region_t));
    }

    shared_region_t *shm = mmap(NULL, sizeof(shared_region_t),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (shm == MAP_FAILED) return NULL;

    if (create) {
        // Initialize the structure
        memset(shm, 0, sizeof(*shm));
        shm->version = SHM_VERSION;
        shm->size = sizeof(shared_region_t);
        shm->magic = 0xDEADBEEF;

        // Initialize mutex for multi-process
        pthread_mutexattr_t mattr;
        pthread_mutexattr_init(&mattr);
        pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&shm->mutex, &mattr);
        pthread_mutexattr_destroy(&mattr);

        // Initialize condvar similarly...
    } else {
        // Validate existing structure
        if (shm->magic != 0xDEADBEEF) {
            fprintf(stderr, "Shared memory corrupted!\n");
            munmap(shm, sizeof(*shm));
            return NULL;
        }
        if (shm->version != SHM_VERSION) {
            fprintf(stderr, "Version mismatch: expected %d, got %d\n",
                    SHM_VERSION, shm->version);
            munmap(shm, sizeof(*shm));
            return NULL;
        }
    }

    return shm;
}
```

Writing correct shared memory code is hard. Consider using established libraries like Boost.Interprocess (C++), shared_memory (Rust), or similar. These handle the tricky details and provide tested, cross-platform implementations of common patterns.
We've explored the shared memory model for IPC in depth. Let's consolidate the key insights:
Mechanism: multiple processes map the same physical pages into different virtual address spaces. Once mapped, access is plain memory access, with zero copies and no per-access kernel involvement.

APIs: three exist, System V (shmget/shmat), POSIX (shm_open/mmap), and memory-mapped files. POSIX is recommended for new development.

Synchronization is mandatory: the OS provides none. Use process-shared mutexes (PTHREAD_PROCESS_SHARED), semaphores, or atomics with correct memory ordering.

Patterns: SPSC ring buffers and seqlocks cover many producer-consumer and read-heavy workloads; never store raw pointers in shared regions.

Trade-off: the fastest IPC available, but the hardest to get right. Adopt it only when measurement shows other IPC is the bottleneck.

What's Next: Message Passing Model
Shared memory represents one fundamental IPC paradigm—processes communicate by reading and writing common memory. The next page explores the contrasting approach: message passing, where processes exchange discrete messages through kernel-managed channels. Message passing provides automatic synchronization and cleaner semantics, trading some performance for safety and simplicity.
You now understand the shared memory model for IPC—its mechanisms, APIs, synchronization requirements, performance characteristics, and common patterns. This is the fastest IPC mechanism available, but it demands careful use. Next, we'll explore how message passing offers a different trade-off between performance and safety.