Every IPC mechanism carries overhead. Pipes and sockets require system calls, kernel buffering, and data copying. Message queues add queuing logic and metadata management. Even local Unix domain sockets—often considered "fast"—still traverse the kernel.
Shared memory bypasses all of this. Once the initial mapping is established, processes read and write directly to RAM through ordinary memory operations. There are no system calls, no kernel interventions, no copies. A write by one process appears instantly in another's address space—often within the time it takes for cache coherency to propagate between CPU cores.
This architectural simplicity translates to measurable performance: latencies measured in nanoseconds instead of microseconds, bandwidths limited only by memory bus speed, and CPU overhead reduced to the synchronization primitive itself.
This page quantifies these benefits, examines the architectural reasons behind them, and identifies the scenarios where shared memory's performance advantage is worth the added complexity.
By the end of this page, you will understand: (1) The zero-copy advantage and what makes shared memory so fast, (2) Latency and bandwidth comparisons with other IPC mechanisms, (3) Cache coherency and its performance implications, (4) When shared memory's overhead does matter, (5) Real-world use cases that demand shared memory performance, and (6) Techniques for maximizing shared memory throughput.
The fundamental performance advantage of shared memory stems from its zero-copy architecture. To understand why this matters, let's trace the data path for different IPC mechanisms.
Traditional IPC (Pipes, Sockets, Message Queues):
For a simple message transfer, data is copied at least twice: the sender's write() copies data from its user-space buffer into a kernel buffer, and the receiver's read() copies it from the kernel buffer into the receiver's user-space buffer. Each copy consumes CPU cycles and memory bandwidth.
Shared Memory:
There is no data copying between address spaces. The sender writes into the mapped region and the receiver reads the same physical pages. The "transfer" is the act of writing to memory—something every program already does.
Quantifying the Improvement:
For large data transfers, the copy overhead dominates. Memory bandwidth on modern systems is roughly 50-100 GB/s, and every copy consumes a share of it: two copies halve effective throughput. System call overhead adds latency on top.
| Operation | Time |
|---|---|
| Memory copy (4KB) | ~0.5-1 µs |
| System call overhead | ~0.2-0.5 µs |
| Context switch | ~1-5 µs |
| Shared memory write | ~10-100 ns (memory access time) |
For a 4KB message via pipes: two copies (~1-2 µs) plus two system calls (~0.4-1 µs), and often a context switch (~1-5 µs) if the receiver is blocked.
For a 4KB write to shared memory: a single write into the shared region, roughly 0.5-1 µs of memcpy time, with no system calls at all.
The improvement is roughly 2-5x for small messages and grows as message size increases.
You might think "but the sender still writes the data somewhere." True—but they write it once, directly to the shared region. With pipes, they first write to their own buffer, then copy to the kernel. The elimination of redundant copies is the core win.
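The difference is visible in code. The sketch below contrasts the two data paths for the same 4 KB payload; it assumes a pipe created with pipe() and a region already mapped with mmap(MAP_SHARED), and omits error handling and the receiver-side signaling.

```c
#include <string.h>
#include <unistd.h>

#define MSG_SIZE 4096

/* Pipe path: write() copies the payload into a kernel buffer (copy #1);
 * the receiver's read() later copies it out again (copy #2). */
void send_via_pipe(int pipe_wr_fd, const char *payload) {
    write(pipe_wr_fd, payload, MSG_SIZE);   /* system call + copy into kernel */
}

/* Shared memory path: the payload is written once, directly into the
 * region both processes have mapped. No system call, no second copy. */
void send_via_shm(char *shared_region, const char *payload) {
    memcpy(shared_region, payload, MSG_SIZE);  /* the only copy */
    /* ...a separate flag or counter (not shown) tells the receiver
     * that the data is ready. */
}
```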
Latency—the time from when a sender initiates a message until the receiver can access it—is critical for interactive systems, real-time applications, and high-frequency trading. Let's compare IPC mechanisms on small message latency.
Test Setup: Single-threaded ping-pong test, 64-byte messages, 1 million iterations, median latency.
| IPC Mechanism | Round-Trip Latency | Relative Performance |
|---|---|---|
| Shared Memory + Spinlock | ~200-500 ns | 1x (baseline) |
| Shared Memory + Futex | ~500-1,000 ns | 2-3x slower |
| Unix Domain Socket | ~2,000-5,000 ns | 4-10x slower |
| Named Pipe (FIFO) | ~3,000-7,000 ns | 6-15x slower |
| POSIX Message Queue | ~4,000-8,000 ns | 8-16x slower |
| System V Message Queue | ~5,000-10,000 ns | 10-20x slower |
| TCP Loopback Socket | ~10,000-20,000 ns | 20-40x slower |
Why the Dramatic Differences?
Shared Memory + Spinlock: No system calls in the data path. The sender writes data and sets a flag. The receiver spins on the flag (checks in a tight loop). When the flag changes, the receiver reads immediately. Total path: memory writes and reads plus cache coherency propagation.
Unix Domain Sockets: Require write() and read() system calls. Kernel must copy data, manage socket buffers, check for signals, potentially schedule. Even optimized paths add significant overhead.
TCP Loopback: Adds TCP/IP processing, checksum computation, and more complex buffering. Even though data never leaves the machine, the full network stack runs.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <time.h>

#define SHM_NAME "/latency_bench"
#define ITERATIONS 1000000
#define WARMUP 10000
#define MESSAGE_SIZE 64

typedef struct {
    atomic_int flag;            // 0: empty, 1: has request, 2: has response
    char data[MESSAGE_SIZE];
} SharedBuffer;

static inline uint64_t rdtsc() {
    unsigned int lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/**
 * Shared memory latency benchmark using spinlock-style synchronization.
 * Measures round-trip latency (ping-pong) between two processes.
 */
int main() {
    int shm_fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    ftruncate(shm_fd, sizeof(SharedBuffer));
    SharedBuffer *buf = mmap(NULL, sizeof(SharedBuffer),
                             PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    close(shm_fd);

    atomic_store(&buf->flag, 0);

    pid_t pid = fork();

    if (pid == 0) {
        // Child: Responder (serves the warmup and the measured iterations)
        for (int i = 0; i < ITERATIONS + WARMUP; i++) {
            // Wait for request
            while (atomic_load(&buf->flag) != 1) {
                asm volatile("pause");  // CPU hint: we're spinning
            }
            // Process (in a real app, would modify data)
            buf->data[0]++;
            // Send response
            atomic_store(&buf->flag, 2);
        }
        munmap(buf, sizeof(SharedBuffer));
        exit(0);
    }

    // Parent: Requester
    sleep(1);  // Let child initialize

    // Warmup
    for (int i = 0; i < WARMUP; i++) {
        atomic_store(&buf->flag, 1);
        while (atomic_load(&buf->flag) != 2) {
            asm volatile("pause");
        }
        atomic_store(&buf->flag, 0);
    }

    // Benchmark
    uint64_t start = rdtsc();
    struct timespec ts_start, ts_end;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start);

    for (int i = 0; i < ITERATIONS; i++) {
        atomic_store(&buf->flag, 1);          // Send request
        // Wait for response
        while (atomic_load(&buf->flag) != 2) {
            asm volatile("pause");
        }
        atomic_store(&buf->flag, 0);          // Reset for next iteration
    }

    clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end);
    uint64_t end = rdtsc();

    waitpid(pid, NULL, 0);

    // Calculate results
    double elapsed_ns = (ts_end.tv_sec - ts_start.tv_sec) * 1e9 +
                        (ts_end.tv_nsec - ts_start.tv_nsec);
    double avg_latency_ns = elapsed_ns / ITERATIONS;
    uint64_t cycles_per_trip = (end - start) / ITERATIONS;

    printf("=== Shared Memory Latency Benchmark ===\n");
    printf("Iterations: %d\n", ITERATIONS);
    printf("Message size: %d bytes\n", MESSAGE_SIZE);
    printf("\n");
    printf("Total time: %.2f ms\n", elapsed_ns / 1e6);
    printf("Average round-trip latency: %.0f ns\n", avg_latency_ns);
    printf("CPU cycles per round-trip: %lu\n", cycles_per_trip);
    printf("\n");
    printf("Throughput: %.2f million round-trips/sec\n",
           1e9 / avg_latency_ns / 1e6);

    munmap(buf, sizeof(SharedBuffer));
    shm_unlink(SHM_NAME);
    return 0;
}
```

The spinlock approach (busy-waiting) achieves minimum latency but consumes CPU cycles while waiting. For latency-critical applications with sufficient CPU headroom, this is acceptable. For systems where CPU efficiency matters, use futex-based waiting, which sleeps but adds roughly 500 ns of latency.
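For comparison, a minimal socketpair-based ping-pong over the same 64-byte messages is sketched below. It is a rough counterpart to the benchmark above, not a tuned implementation; it relies on the fact that 64-byte reads on a local stream socket almost always complete in one call, and on typical Linux systems it lands in the microsecond range rather than hundreds of nanoseconds.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <time.h>

#define ITERATIONS 100000
#define MESSAGE_SIZE 64

int main(void) {
    int sv[2];
    char msg[MESSAGE_SIZE] = {0};

    // AF_UNIX socketpair: same kernel data path as a Unix domain socket
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        perror("socketpair");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        // Child: echo every message back
        for (int i = 0; i < ITERATIONS; i++) {
            read(sv[1], msg, MESSAGE_SIZE);
            write(sv[1], msg, MESSAGE_SIZE);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC_RAW, &t0);

    // Parent: send, then wait for the echo (one round trip per iteration)
    for (int i = 0; i < ITERATIONS; i++) {
        write(sv[0], msg, MESSAGE_SIZE);
        read(sv[0], msg, MESSAGE_SIZE);
    }

    clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
    waitpid(pid, NULL, 0);

    double elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                        (t1.tv_nsec - t0.tv_nsec);
    printf("Unix domain socket round-trip: %.0f ns average\n",
           elapsed_ns / ITERATIONS);
    return 0;
}
```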
For bulk data transfer, bandwidth becomes more important than latency. Shared memory's ability to avoid data copies shines brightest when moving large amounts of data.
Theoretical Limits:
With traditional IPC, the 2x copy overhead means effective peak throughput is halved. Shared memory can approach the raw memory bandwidth limit.
| IPC Mechanism | Bandwidth | % of Memory BW |
|---|---|---|
| Shared Memory (memcpy) | ~40-50 GB/s | 80-100% |
| Shared Memory (streaming) | ~50-60 GB/s | 100%+* |
| Unix Domain Socket | ~8-12 GB/s | 16-24% |
| Pipe | ~5-10 GB/s | 10-20% |
| TCP Loopback | ~3-6 GB/s | 6-12% |
| POSIX MQ (large msg) | ~2-5 GB/s | 4-10% |
*Streaming writes using non-temporal stores can exceed the throughput of ordinary cached copies because they bypass the cache and avoid the read-for-ownership traffic a normal write triggers.
Why Such Large Differences?
Copy Overhead: Pipes/sockets copy data twice. For a 64KB transfer, that's 128KB of memory traffic.
System Call Overhead: Each send/receive requires a syscall. For small messages, syscall overhead dominates. For large messages, it's amortized.
Kernel Processing: Sockets involve buffer management, locking, socket buffer allocation, and often memory allocation for each operation.
Cache Pollution: Kernel buffers may evict application data from cache, causing additional main memory accesses.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <time.h>

#define SHM_NAME "/bandwidth_bench"
#define BUFFER_SIZE (64 * 1024 * 1024)  // 64 MB shared buffer
#define BLOCK_SIZE (64 * 1024)          // 64 KB blocks
#define ITERATIONS 1000

typedef struct {
    atomic_int writer_done;
    atomic_int reader_done;
    char data[BUFFER_SIZE];
} SharedBuffer;

/**
 * Bandwidth benchmark for shared memory bulk transfers.
 * One process writes, another reads, measuring throughput.
 */
int main() {
    int shm_fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    ftruncate(shm_fd, sizeof(SharedBuffer));
    SharedBuffer *buf = mmap(NULL, sizeof(SharedBuffer),
                             PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    close(shm_fd);

    if (buf == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    atomic_store(&buf->writer_done, 0);
    atomic_store(&buf->reader_done, 0);

    pid_t pid = fork();

    if (pid == 0) {
        // Child: Reader
        char *local_buf = malloc(BLOCK_SIZE);
        size_t total_read = 0;

        struct timespec ts_start, ts_end;
        clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start);

        for (int iter = 0; iter < ITERATIONS; iter++) {
            // Wait for writer to signal data is ready
            while (atomic_load(&buf->writer_done) <= iter) {
                asm volatile("pause");
            }

            // Read data (simulate processing)
            size_t offset = (iter * BLOCK_SIZE) % BUFFER_SIZE;
            memcpy(local_buf, &buf->data[offset], BLOCK_SIZE);
            total_read += BLOCK_SIZE;

            atomic_fetch_add(&buf->reader_done, 1);
        }

        clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end);

        double elapsed_s = (ts_end.tv_sec - ts_start.tv_sec) +
                           (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9;
        double bandwidth = (total_read / 1e9) / elapsed_s;

        printf("Reader: %.2f GB transferred in %.2f s = %.2f GB/s\n",
               total_read / 1e9, elapsed_s, bandwidth);

        free(local_buf);
        munmap(buf, sizeof(SharedBuffer));
        exit(0);
    }

    // Parent: Writer
    char *source_data = malloc(BLOCK_SIZE);
    memset(source_data, 'X', BLOCK_SIZE);
    size_t total_written = 0;

    struct timespec ts_start, ts_end;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start);

    for (int iter = 0; iter < ITERATIONS; iter++) {
        // Write data to shared memory
        size_t offset = (iter * BLOCK_SIZE) % BUFFER_SIZE;
        memcpy(&buf->data[offset], source_data, BLOCK_SIZE);
        total_written += BLOCK_SIZE;

        // Signal data is ready
        atomic_fetch_add(&buf->writer_done, 1);

        // Wait for reader to process before overwriting (ring buffer)
        while (atomic_load(&buf->reader_done) <
                   iter - (BUFFER_SIZE / BLOCK_SIZE) + 1 &&
               iter > 10) {
            asm volatile("pause");
        }
    }

    clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end);

    double elapsed_s = (ts_end.tv_sec - ts_start.tv_sec) +
                       (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9;
    double bandwidth = (total_written / 1e9) / elapsed_s;

    printf("Writer: %.2f GB transferred in %.2f s = %.2f GB/s\n",
           total_written / 1e9, elapsed_s, bandwidth);

    waitpid(pid, NULL, 0);
    free(source_data);
    munmap(buf, sizeof(SharedBuffer));
    shm_unlink(SHM_NAME);
    return 0;
}
```

For write-streaming workloads, non-temporal stores (MOVNTPS and friends on x86) bypass the cache and write directly to memory. This is faster for large sequential writes that won't be re-read soon. Use _mm_stream_ps() or similar intrinsics in performance-critical code.
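As a rough illustration, a non-temporal copy loop might look like the sketch below. It assumes an x86-64 target with SSE2, a 16-byte-aligned destination, and a length that is a multiple of 16; a production version would handle the unaligned head and tail separately.

```c
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#include <stddef.h>

/* Copy 'len' bytes into a shared region using non-temporal stores.
 * Assumes: dst is 16-byte aligned and len is a multiple of 16. */
void stream_copy(void *dst, const void *src, size_t len) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++) {
        __m128i chunk = _mm_loadu_si128(&s[i]);  // normal load from source
        _mm_stream_si128(&d[i], chunk);          // store that bypasses the cache
    }

    // Make the streamed data globally visible before signaling the reader
    _mm_sfence();
}
```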
While shared memory eliminates software copies, hardware cache coherency introduces its own performance considerations. Understanding these effects is crucial for optimizing high-performance shared memory systems.
How Cache Coherency Works:
Modern multiprocessor systems maintain cache coherency through protocols like MESI (Modified, Exclusive, Shared, Invalid). When one CPU writes to a cache line, other CPUs' caches are invalidated. The next access by those CPUs must fetch the updated data.
For shared memory between processes on different CPUs: when the writer's core modifies a shared cache line, the copies held by other cores are invalidated, and the reader's next access misses in its local cache and must fetch the updated line from the writer's cache or from memory.
This coherency traffic has latency—typically 50-100 nanoseconds for inter-core communication, more for cross-socket.
False Sharing:
One of the most insidious performance problems in shared memory is false sharing: when two processes access different data that happens to reside on the same cache line (typically 64 bytes).
```c
// BAD: These two counters are likely on the same cache line
struct SharedCounters {
    atomic_int counter_a;   // bytes 0-3
    atomic_int counter_b;   // bytes 4-7
};
// Process A increments counter_a
// Process B increments counter_b
// Both experience full cache line bouncing!

// GOOD: Pad to separate cache lines
struct SharedCounters {
    atomic_int counter_a;
    char pad_a[60];         // Padding to a 64-byte cache line
    atomic_int counter_b;
    char pad_b[60];
};
```
False sharing can degrade performance by 10-100x for frequently accessed data.
Linux exposes cache line size via sysconf(_SC_LEVEL1_DCACHE_LINESIZE) or by reading /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size. Most x86 systems use 64 bytes. Some ARM processors use 128 bytes. Design your shared structures to accommodate the largest expected cache line size.
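A small sketch of both approaches follows. The sysconf name and sysfs path are as described above; the 64-byte fallback and the PaddedCounter struct are illustrative assumptions.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

// Query the L1 data cache line size, falling back to a conservative default.
static long cache_line_size(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        return line;

    FILE *f = fopen(
        "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");
    if (f) {
        long from_sysfs = 0;
        int ok = fscanf(f, "%ld", &from_sysfs);
        fclose(f);
        if (ok == 1 && from_sysfs > 0)
            return from_sysfs;
    }
    return 64;  // Assumed fallback: common on x86
}

// Align hot fields to cache-line boundaries at compile time (C11 _Alignas).
#define CACHE_LINE 64
struct PaddedCounter {
    _Alignas(CACHE_LINE) atomic_int value;
};

int main(void) {
    printf("Detected cache line size: %ld bytes\n", cache_line_size());
    printf("sizeof(struct PaddedCounter) = %zu bytes\n",
           sizeof(struct PaddedCounter));
    return 0;
}
```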
While shared memory eliminates many sources of overhead, it introduces its own costs. Understanding when these matter helps you make informed design decisions.
Synchronization Overhead:
The performance gain from zero-copy can be lost if synchronization is poorly designed:
| Synchronization Method | Typical Cost | When to Use |
|---|---|---|
| Spinlock (busy wait) | 50-200 ns | Ultra-low latency, short waits |
| Futex | 500-2000 ns | Moderate latency, long waits |
| Mutex (robust) | 1000-5000 ns | General purpose, recovery needed |
| Named semaphore | 2000-5000 ns | Cross-process coordination |
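For reference, a futex-based wait looks roughly like the sketch below. It is Linux-specific; there is no glibc wrapper for futex, so the raw syscall is used, error handling is trimmed, and the flag is assumed to live in the shared mapping. The waiter sleeps in the kernel instead of burning CPU, which is where the extra latency relative to a spinlock comes from.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

// Block until *addr changes away from 'expected' (sleeps in the kernel).
// The futex word must be a 32-bit int located in the shared region.
static void futex_wait(atomic_int *addr, int expected) {
    while (atomic_load(addr) == expected) {
        // Returns immediately if *addr != expected, otherwise sleeps until woken
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }
}

// Store a new value and wake one sleeping waiter, if any.
static void futex_set_and_wake(atomic_int *addr, int value) {
    atomic_store(addr, value);
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```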
Setup Overhead:
Shared memory has significant setup cost: creating or opening the object (shm_open), sizing it (ftruncate), mapping it into each process (mmap), faulting in pages on first touch, and initializing synchronization primitives. For large regions this can take anywhere from tens of microseconds to milliseconds.
For short-lived communications, this setup cost may exceed the benefits. Rule of thumb: shared memory pays off when messages are large (kilobytes rather than bytes), exchanged frequently, latency or throughput genuinely matters, and the processes live long enough to amortize the one-time mapping cost—trade-offs summarized in the decision table below.
Memory Usage:
Shared memory consumes physical RAM (or swap) from the moment it's created. Unlike pipe buffers that grow dynamically, you pre-allocate the full size. Over-provisioning wastes memory; under-provisioning limits functionality.
| Consideration | Favor Shared Memory | Favor Other IPC |
|---|---|---|
| Message size | >4KB typical | <100 bytes typical |
| Message frequency | High (>10K/sec) | Low (<100/sec) |
| Latency requirement | <1µs needed | >10µs acceptable |
| Process lifetime | Long-running services | Short-lived workers |
| Data structure | Complex, in-place updates | Simple messages |
| Synchronization | You can manage it | Want kernel to handle |
| Debugging ease | Can handle complexity | Want simple semantics |
Shared memory's performance comes at the cost of complexity: custom synchronization, careful memory management, debugging difficulty, and potential for hard-to-reproduce bugs. If your IPC needs don't require sub-microsecond latency or multi-GB throughput, Unix domain sockets or pipes may be simpler and fast enough.
Shared memory's performance characteristics make it indispensable for specific application domains. Here are real-world scenarios where shared memory is the right—often the only—choice.
```c
/**
 * Simplified video frame sharing pattern used in media pipelines.
 * Producer captures frames, consumer processes them, zero copies.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/video_frames"
#define NUM_FRAMES 4                  // Ring buffer size
#define FRAME_WIDTH 1920
#define FRAME_HEIGHT 1080
#define BYTES_PER_PIXEL 4             // RGBA
#define FRAME_SIZE (FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_PIXEL)

typedef struct {
    uint64_t timestamp;
    uint32_t frame_number;
    atomic_int state;                 // 0: empty, 1: filled, 2: processing
    char pixels[FRAME_SIZE];
} VideoFrame;

typedef struct {
    atomic_uint write_index;
    atomic_uint read_index;
    VideoFrame frames[NUM_FRAMES];
} FrameBuffer;

// Producer: Capture and write frames
void capture_frames(FrameBuffer *fb, int count) {
    for (int i = 0; i < count; i++) {
        // Get next write slot
        uint32_t slot = atomic_fetch_add(&fb->write_index, 1) % NUM_FRAMES;
        VideoFrame *frame = &fb->frames[slot];

        // Wait if consumer hasn't processed yet
        while (atomic_load(&frame->state) == 1) {
            usleep(1000);  // Back off
        }

        // "Capture" frame (in reality, hardware fills this directly)
        frame->frame_number = i;
        frame->timestamp = /* get_timestamp_ns() */ 0;
        // memcpy from capture device...

        // Signal frame is ready
        atomic_store(&frame->state, 1);
    }
}

// Consumer: Process frames
void process_frames(FrameBuffer *fb, int count) {
    for (int i = 0; i < count; i++) {
        uint32_t slot = atomic_fetch_add(&fb->read_index, 1) % NUM_FRAMES;
        VideoFrame *frame = &fb->frames[slot];

        // Wait for frame
        while (atomic_load(&frame->state) != 1) {
            usleep(100);
        }

        atomic_store(&frame->state, 2);  // Processing

        // Process frame in-place (no copy!)
        // ... apply filters, encode, display ...

        atomic_store(&frame->state, 0);  // Done
    }
}

/**
 * Key insight: 8MB frames pass between processes with ZERO copies.
 * The only overhead is the atomic flag operations and cache coherency.
 *
 * At 60fps, this is 480 MB/s of video data with minimal CPU overhead.
 * Using pipes would require copying 960 MB/s (2x for send+receive).
 */
```

Beyond the inherent advantages of zero-copy, several optimization techniques can squeeze additional performance from shared memory systems.
```c
#define _GNU_SOURCE   // for sched_setaffinity, CPU_SET
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <numaif.h>   // mbind(); link with -lnuma

/**
 * Examples of shared memory optimization techniques.
 */

// 1. Huge page allocation
void *alloc_huge_pages(size_t size) {
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr == MAP_FAILED) {
        perror("Huge pages not available, falling back");
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }
    return ptr;
}

// 2. Pre-fault pages
void prefault_memory(void *ptr, size_t size) {
    volatile char *p = (volatile char *)ptr;
    size_t page_size = sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < size; i += page_size) {
        p[i] = p[i];  // Touch each page
    }
}

// 3. Lock pages in RAM (prevent swapping)
void lock_memory(void *ptr, size_t size) {
    if (mlock(ptr, size) == -1) {
        perror("mlock failed (may need CAP_IPC_LOCK)");
    }
}

// 4. CPU pinning
void pin_to_cpu(int cpu_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);
    if (sched_setaffinity(0, sizeof(cpuset), &cpuset) == -1) {
        perror("sched_setaffinity");
    }
}

// 5. NUMA-aware allocation
void *alloc_numa_local(size_t size, int numa_node) {
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ptr != MAP_FAILED) {
        unsigned long nodemask = 1UL << numa_node;
        mbind(ptr, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
    }
    return ptr;
}

// 6. Cache line prefetching
void prefetch_next(const void *ptr) {
    __builtin_prefetch(ptr, 0, 3);                     // Read, high locality
    __builtin_prefetch((const char *)ptr + 64, 0, 3);  // Next cache line
}

int main() {
    // Example: Optimized shared memory setup
    size_t size = 64 * 1024 * 1024;  // 64 MB

    printf("Allocating optimized shared memory...\n");

    // Pin to CPU 0
    pin_to_cpu(0);

    // Allocate with huge pages if available
    void *shm = alloc_huge_pages(size);
    if (shm == MAP_FAILED) {
        printf("Allocation failed\n");
        return 1;
    }

    // Pre-fault all pages
    printf("Pre-faulting pages...\n");
    prefault_memory(shm, size);

    // Lock in RAM
    lock_memory(shm, size);

    printf("Optimized shared memory ready at %p\n", shm);

    // Use the memory...

    munmap(shm, size);
    return 0;
}
```

We've explored why shared memory is the performance king of IPC mechanisms and when that performance advantage justifies its complexity. Let's consolidate the key takeaways:

- Zero-copy is the core advantage: data is written once into the shared mapping instead of being copied through kernel buffers.
- Round-trip latency drops from microseconds (sockets, pipes, message queues) to hundreds of nanoseconds with spin-based synchronization.
- Bulk throughput can approach raw memory bandwidth, while copy-based IPC typically reaches only a fraction of it.
- Cache coherency and false sharing are the hidden costs; pad hot fields to cache-line boundaries.
- The choice of synchronization (spinlock vs. futex vs. mutex) and the setup cost determine whether the zero-copy gain survives in practice.
- Huge pages, pre-faulting, mlock, CPU pinning, and NUMA-aware placement squeeze out the remaining overhead.
What's Next:
We've covered System V shared memory and implementation details throughout this module. The final page provides a comprehensive look at POSIX shared memory—the modern, file-oriented API that offers cleaner semantics, better tooling support, and integration with the standard mmap()/munmap() memory model.
You now understand why shared memory delivers unmatched IPC performance and when to choose it. You've learned about zero-copy architecture, latency and bandwidth characteristics, cache considerations, and optimization techniques. Next, we'll explore the POSIX shared memory API in depth.