Every IPC mechanism carries overhead. Pipes and sockets require system calls, kernel buffering, and data copying. Message queues add queuing logic and metadata management. Even local Unix domain sockets—often considered "fast"—still traverse the kernel.
Shared memory bypasses all of this. Once the initial mapping is established, processes read and write directly to RAM through ordinary memory operations. There are no system calls, no kernel interventions, no copies. A write by one process appears instantly in another's address space—often within the time it takes for cache coherency to propagate between CPU cores.
This architectural simplicity translates to measurable performance: latencies measured in nanoseconds instead of microseconds, bandwidths limited only by memory bus speed, and CPU overhead reduced to the synchronization primitive itself.
This page quantifies these benefits, examines the architectural reasons behind them, and identifies the scenarios where shared memory's performance advantage is worth the added complexity.
By the end of this page, you will understand: (1) The zero-copy advantage and what makes shared memory so fast, (2) Latency and bandwidth comparisons with other IPC mechanisms, (3) Cache coherency and its performance implications, (4) When shared memory's overhead does matter, (5) Real-world use cases that demand shared memory performance, and (6) Techniques for maximizing shared memory throughput.
The fundamental performance advantage of shared memory stems from its zero-copy architecture. To understand why this matters, let's trace the data path for different IPC mechanisms.
Traditional IPC (Pipes, Sockets, Message Queues):
For a simple message transfer, data is copied at least twice: the sender's write() copies data from its user-space buffer into a kernel buffer, and the receiver's read() copies it from the kernel buffer into the receiver's user-space buffer. Each copy consumes CPU cycles and memory bandwidth.
Shared Memory:
There is no data copying between address spaces. The sender writes into the mapped region and the receiver reads the same physical pages. The "transfer" is the act of writing to memory—something every program already does.
Quantifying the Improvement:
For large data transfers, the copy overhead dominates. Memory bandwidth on modern systems is roughly 50-100 GB/s, and every copy consumes a share of it: two copies halve effective throughput. System call overhead adds latency on top.
| Operation | Time |
|---|---|
| Memory copy (4KB) | ~0.5-1 µs |
| System call overhead | ~0.2-0.5 µs |
| Context switch | ~1-5 µs |
| Shared memory write | ~10-100 ns (memory access time) |
For a 4KB message via pipes: two copies (~1-2 µs) plus two system calls (~0.4-1 µs), and often a context switch (~1-5 µs) if the receiver is blocked.
For a 4KB write to shared memory: a single write into the shared region, roughly 0.5-1 µs of memcpy time, with no system calls at all.
The improvement is roughly 2-5x for small messages and grows as message size increases.
You might think "but the sender still writes the data somewhere." True—but they write it once, directly to the shared region. With pipes, they first write to their own buffer, then copy to the kernel. The elimination of redundant copies is the core win.
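The difference is visible in code. The sketch below contrasts the two data paths for the same 4 KB payload; it assumes a pipe created with pipe() and a region already mapped with mmap(MAP_SHARED), and omits error handling and the receiver-side signaling.

```c
#include <string.h>
#include <unistd.h>

#define MSG_SIZE 4096

/* Pipe path: write() copies the payload into a kernel buffer (copy #1);
 * the receiver's read() later copies it out again (copy #2). */
void send_via_pipe(int pipe_wr_fd, const char *payload) {
    write(pipe_wr_fd, payload, MSG_SIZE);   /* system call + copy into kernel */
}

/* Shared memory path: the payload is written once, directly into the
 * region both processes have mapped. No system call, no second copy. */
void send_via_shm(char *shared_region, const char *payload) {
    memcpy(shared_region, payload, MSG_SIZE);  /* the only copy */
    /* ...a separate flag or counter (not shown) tells the receiver
     * that the data is ready. */
}
```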
Latency—the time from when a sender initiates a message until the receiver can access it—is critical for interactive systems, real-time applications, and high-frequency trading. Let's compare IPC mechanisms on small message latency.
Test Setup: Single-threaded ping-pong test, 64-byte messages, 1 million iterations, median latency.
| IPC Mechanism | Round-Trip Latency | Relative Performance |
|---|---|---|
| Shared Memory + Spinlock | ~200-500 ns | 1x (baseline) |
| Shared Memory + Futex | ~500-1,000 ns | 2-3x slower |
| Unix Domain Socket | ~2,000-5,000 ns | 4-10x slower |
| Named Pipe (FIFO) | ~3,000-7,000 ns | 6-15x slower |
| POSIX Message Queue | ~4,000-8,000 ns | 8-16x slower |
| System V Message Queue | ~5,000-10,000 ns | 10-20x slower |
| TCP Loopback Socket | ~10,000-20,000 ns | 20-40x slower |
Why the Dramatic Differences?
Shared Memory + Spinlock: No system calls in the data path. The sender writes data and sets a flag. The receiver spins on the flag (checks in a tight loop). When the flag changes, the receiver reads immediately. Total path: memory writes and reads plus cache coherency propagation.
Unix Domain Sockets: Require write() and read() system calls. Kernel must copy data, manage socket buffers, check for signals, potentially schedule. Even optimized paths add significant overhead.
TCP Loopback: Adds TCP/IP processing, checksum computation, and more complex buffering. Even though data never leaves the machine, the full network stack runs.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <time.h>

#define SHM_NAME "/latency_bench"
#define ITERATIONS 1000000
#define WARMUP 10000
#define MESSAGE_SIZE 64

typedef struct {
    atomic_int flag;            // 0: empty, 1: has request, 2: has response
    char data[MESSAGE_SIZE];
} SharedBuffer;

static inline uint64_t rdtsc() {
    unsigned int lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/**
 * Shared memory latency benchmark using spinlock-style synchronization.
 * Measures round-trip latency (ping-pong) between two processes.
 */
int main() {
    int shm_fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    ftruncate(shm_fd, sizeof(SharedBuffer));
    SharedBuffer *buf = mmap(NULL, sizeof(SharedBuffer),
                             PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    close(shm_fd);

    atomic_store(&buf->flag, 0);

    pid_t pid = fork();

    if (pid == 0) {
        // Child: Responder (serves the warmup and the measured iterations)
        for (int i = 0; i < ITERATIONS + WARMUP; i++) {
            // Wait for request
            while (atomic_load(&buf->flag) != 1) {
                asm volatile("pause");  // CPU hint: we're spinning
            }
            // Process (in a real app, would modify data)
            buf->data[0]++;
            // Send response
            atomic_store(&buf->flag, 2);
        }
        munmap(buf, sizeof(SharedBuffer));
        exit(0);
    }

    // Parent: Requester
    sleep(1);  // Let child initialize

    // Warmup
    for (int i = 0; i < WARMUP; i++) {
        atomic_store(&buf->flag, 1);
        while (atomic_load(&buf->flag) != 2) {
            asm volatile("pause");
        }
        atomic_store(&buf->flag, 0);
    }

    // Benchmark
    uint64_t start = rdtsc();
    struct timespec ts_start, ts_end;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start);

    for (int i = 0; i < ITERATIONS; i++) {
        atomic_store(&buf->flag, 1);          // Send request
        // Wait for response
        while (atomic_load(&buf->flag) != 2) {
            asm volatile("pause");
        }
        atomic_store(&buf->flag, 0);          // Reset for next iteration
    }

    clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end);
    uint64_t end = rdtsc();

    waitpid(pid, NULL, 0);

    // Calculate results
    double elapsed_ns = (ts_end.tv_sec - ts_start.tv_sec) * 1e9 +
                        (ts_end.tv_nsec - ts_start.tv_nsec);
    double avg_latency_ns = elapsed_ns / ITERATIONS;
    uint64_t cycles_per_trip = (end - start) / ITERATIONS;

    printf("=== Shared Memory Latency Benchmark ===\n");
    printf("Iterations: %d\n", ITERATIONS);
    printf("Message size: %d bytes\n", MESSAGE_SIZE);
    printf("\n");
    printf("Total time: %.2f ms\n", elapsed_ns / 1e6);
    printf("Average round-trip latency: %.0f ns\n", avg_latency_ns);
    printf("CPU cycles per round-trip: %lu\n", cycles_per_trip);
    printf("\n");
    printf("Throughput: %.2f million round-trips/sec\n",
           1e9 / avg_latency_ns / 1e6);

    munmap(buf, sizeof(SharedBuffer));
    shm_unlink(SHM_NAME);
    return 0;
}
```

The spinlock approach (busy-waiting) achieves minimum latency but consumes CPU cycles while waiting. For latency-critical applications with sufficient CPU headroom, this is acceptable. For systems where CPU efficiency matters, use futex-based waiting, which sleeps but adds roughly 500 ns of latency.
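For comparison, a minimal socketpair-based ping-pong over the same 64-byte messages is sketched below. It is a rough counterpart to the benchmark above, not a tuned implementation; it relies on the fact that 64-byte reads on a local stream socket almost always complete in one call, and on typical Linux systems it lands in the microsecond range rather than hundreds of nanoseconds.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <time.h>

#define ITERATIONS 100000
#define MESSAGE_SIZE 64

int main(void) {
    int sv[2];
    char msg[MESSAGE_SIZE] = {0};

    // AF_UNIX socketpair: same kernel data path as a Unix domain socket
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        perror("socketpair");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        // Child: echo every message back
        for (int i = 0; i < ITERATIONS; i++) {
            read(sv[1], msg, MESSAGE_SIZE);
            write(sv[1], msg, MESSAGE_SIZE);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC_RAW, &t0);

    // Parent: send, then wait for the echo (one round trip per iteration)
    for (int i = 0; i < ITERATIONS; i++) {
        write(sv[0], msg, MESSAGE_SIZE);
        read(sv[0], msg, MESSAGE_SIZE);
    }

    clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
    waitpid(pid, NULL, 0);

    double elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                        (t1.tv_nsec - t0.tv_nsec);
    printf("Unix domain socket round-trip: %.0f ns average\n",
           elapsed_ns / ITERATIONS);
    return 0;
}
```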
For bulk data transfer, bandwidth becomes more important than latency. Shared memory's ability to avoid data copies shines brightest when moving large amounts of data.
Theoretical Limits:
With traditional IPC, the 2x copy overhead means effective peak throughput is halved. Shared memory can approach the raw memory bandwidth limit.
| IPC Mechanism | Bandwidth | % of Memory BW |
|---|---|---|
| Shared Memory (memcpy) | ~40-50 GB/s | 80-100% |
| Shared Memory (streaming) | ~50-60 GB/s | 100%+* |
| Unix Domain Socket | ~8-12 GB/s | 16-24% |
| Pipe | ~5-10 GB/s | 10-20% |
| TCP Loopback | ~3-6 GB/s | 6-12% |
| POSIX MQ (large msg) | ~2-5 GB/s | 4-10% |
*Streaming writes using non-temporal stores can exceed the throughput of ordinary cached copies because they bypass the cache and avoid the read-for-ownership traffic a normal write triggers.
Why Such Large Differences?
Copy Overhead: Pipes/sockets copy data twice. For a 64KB transfer, that's 128KB of memory traffic.
System Call Overhead: Each send/receive requires a syscall. For small messages, syscall overhead dominates. For large messages, it's amortized.
Kernel Processing: Sockets involve buffer management, locking, socket buffer allocation, and often memory allocation for each operation.
Cache Pollution: Kernel buffers may evict application data from cache, causing additional main memory accesses.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <time.h>

#define SHM_NAME "/bandwidth_bench"
#define BUFFER_SIZE (64 * 1024 * 1024)  // 64 MB shared buffer
#define BLOCK_SIZE (64 * 1024)          // 64 KB blocks
#define ITERATIONS 1000

typedef struct {
    atomic_int writer_done;
    atomic_int reader_done;
    char data[BUFFER_SIZE];
} SharedBuffer;

/**
 * Bandwidth benchmark for shared memory bulk transfers.
 * One process writes, another reads, measuring throughput.
 */
int main() {
    int shm_fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    ftruncate(shm_fd, sizeof(SharedBuffer));
    SharedBuffer *buf = mmap(NULL, sizeof(SharedBuffer),
                             PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    close(shm_fd);

    if (buf == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    atomic_store(&buf->writer_done, 0);
    atomic_store(&buf->reader_done, 0);

    pid_t pid = fork();

    if (pid == 0) {
        // Child: Reader
        char *local_buf = malloc(BLOCK_SIZE);
        size_t total_read = 0;

        struct timespec ts_start, ts_end;
        clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start);

        for (int iter = 0; iter < ITERATIONS; iter++) {
            // Wait for writer to signal data is ready
            while (atomic_load(&buf->writer_done) <= iter) {
                asm volatile("pause");
            }

            // Read data (simulate processing)
            size_t offset = (iter * BLOCK_SIZE) % BUFFER_SIZE;
            memcpy(local_buf, &buf->data[offset], BLOCK_SIZE);
            total_read += BLOCK_SIZE;

            atomic_fetch_add(&buf->reader_done, 1);
        }

        clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end);

        double elapsed_s = (ts_end.tv_sec - ts_start.tv_sec) +
                           (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9;
        double bandwidth = (total_read / 1e9) / elapsed_s;

        printf("Reader: %.2f GB transferred in %.2f s = %.2f GB/s\n",
               total_read / 1e9, elapsed_s, bandwidth);

        free(local_buf);
        munmap(buf, sizeof(SharedBuffer));
        exit(0);
    }

    // Parent: Writer
    char *source_data = malloc(BLOCK_SIZE);
    memset(source_data, 'X', BLOCK_SIZE);
    size_t total_written = 0;

    struct timespec ts_start, ts_end;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start);

    for (int iter = 0; iter < ITERATIONS; iter++) {
        // Write data to shared memory
        size_t offset = (iter * BLOCK_SIZE) % BUFFER_SIZE;
        memcpy(&buf->data[offset], source_data, BLOCK_SIZE);
        total_written += BLOCK_SIZE;

        // Signal data is ready
        atomic_fetch_add(&buf->writer_done, 1);

        // Wait for reader to process before overwriting (ring buffer)
        while (atomic_load(&buf->reader_done) <
                   iter - (BUFFER_SIZE / BLOCK_SIZE) + 1 &&
               iter > 10) {
            asm volatile("pause");
        }
    }

    clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end);

    double elapsed_s = (ts_end.tv_sec - ts_start.tv_sec) +
                       (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9;
    double bandwidth = (total_written / 1e9) / elapsed_s;

    printf("Writer: %.2f GB transferred in %.2f s = %.2f GB/s\n",
           total_written / 1e9, elapsed_s, bandwidth);

    waitpid(pid, NULL, 0);
    free(source_data);
    munmap(buf, sizeof(SharedBuffer));
    shm_unlink(SHM_NAME);
    return 0;
}
```

For write-streaming workloads, non-temporal stores (MOVNTPS and friends on x86) bypass the cache and write directly to memory. This is faster for large sequential writes that won't be re-read soon. Use _mm_stream_ps() or similar intrinsics in performance-critical code.
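As a rough illustration, a non-temporal copy loop might look like the sketch below. It assumes an x86-64 target with SSE2, a 16-byte-aligned destination, and a length that is a multiple of 16; a production version would handle the unaligned head and tail separately.

```c
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#include <stddef.h>

/* Copy 'len' bytes into a shared region using non-temporal stores.
 * Assumes: dst is 16-byte aligned and len is a multiple of 16. */
void stream_copy(void *dst, const void *src, size_t len) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++) {
        __m128i chunk = _mm_loadu_si128(&s[i]);  // normal load from source
        _mm_stream_si128(&d[i], chunk);          // store that bypasses the cache
    }

    // Make the streamed data globally visible before signaling the reader
    _mm_sfence();
}
```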
While shared memory eliminates software copies, hardware cache coherency introduces its own performance considerations. Understanding these effects is crucial for optimizing high-performance shared memory systems.
How Cache Coherency Works:
Modern multiprocessor systems maintain cache coherency through protocols like MESI (Modified, Exclusive, Shared, Invalid). When one CPU writes to a cache line, other CPUs' caches are invalidated. The next access by those CPUs must fetch the updated data.
For shared memory between processes on different CPUs: when the writer's core modifies a shared cache line, the copies held by other cores are invalidated, and the reader's next access misses in its local cache and must fetch the updated line from the writer's cache or from memory.
This coherency traffic has latency—typically 50-100 nanoseconds for inter-core communication, more for cross-socket.
False Sharing:
One of the most insidious performance problems in shared memory is false sharing: when two processes access different data that happens to reside on the same cache line (typically 64 bytes).
```c
// BAD: These two counters are likely on the same cache line
struct SharedCounters {
    atomic_int counter_a;   // bytes 0-3
    atomic_int counter_b;   // bytes 4-7
};
// Process A increments counter_a
// Process B increments counter_b
// Both experience full cache line bouncing!

// GOOD: Pad to separate cache lines
struct SharedCounters {
    atomic_int counter_a;
    char pad_a[60];         // Padding to a 64-byte cache line
    atomic_int counter_b;
    char pad_b[60];
};
```
False sharing can degrade performance by 10-100x for frequently accessed data.
Linux exposes cache line size via sysconf(_SC_LEVEL1_DCACHE_LINESIZE) or by reading /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size. Most x86 systems use 64 bytes. Some ARM processors use 128 bytes. Design your shared structures to accommodate the largest expected cache line size.
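A small sketch of both approaches follows. The sysconf name and sysfs path are as described above; the 64-byte fallback and the PaddedCounter struct are illustrative assumptions.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

// Query the L1 data cache line size, falling back to a conservative default.
static long cache_line_size(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        return line;

    FILE *f = fopen(
        "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");
    if (f) {
        long from_sysfs = 0;
        int ok = fscanf(f, "%ld", &from_sysfs);
        fclose(f);
        if (ok == 1 && from_sysfs > 0)
            return from_sysfs;
    }
    return 64;  // Assumed fallback: common on x86
}

// Align hot fields to cache-line boundaries at compile time (C11 _Alignas).
#define CACHE_LINE 64
struct PaddedCounter {
    _Alignas(CACHE_LINE) atomic_int value;
};

int main(void) {
    printf("Detected cache line size: %ld bytes\n", cache_line_size());
    printf("sizeof(struct PaddedCounter) = %zu bytes\n",
           sizeof(struct PaddedCounter));
    return 0;
}
```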
While shared memory eliminates many sources of overhead, it introduces its own costs. Understanding when these matter helps you make informed design decisions.
Synchronization Overhead:
The performance gain from zero-copy can be lost if synchronization is poorly designed:
| Synchronization Method | Typical Cost | When to Use |
|---|---|---|
| Spinlock (busy wait) | 50-200 ns | Ultra-low latency, short waits |
| Futex | 500-2000 ns | Moderate latency, long waits |
| Mutex (robust) | 1000-5000 ns | General purpose, recovery needed |
| Named semaphore | 2000-5000 ns | Cross-process coordination |
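For reference, a futex-based wait looks roughly like the sketch below. It is Linux-specific; there is no glibc wrapper for futex, so the raw syscall is used, error handling is trimmed, and the flag is assumed to live in the shared mapping. The waiter sleeps in the kernel instead of burning CPU, which is where the extra latency relative to a spinlock comes from.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

// Block until *addr changes away from 'expected' (sleeps in the kernel).
// The futex word must be a 32-bit int located in the shared region.
static void futex_wait(atomic_int *addr, int expected) {
    while (atomic_load(addr) == expected) {
        // Returns immediately if *addr != expected, otherwise sleeps until woken
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }
}

// Store a new value and wake one sleeping waiter, if any.
static void futex_set_and_wake(atomic_int *addr, int value) {
    atomic_store(addr, value);
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```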
Setup Overhead:
Shared memory has significant setup cost: creating or opening the object (shm_open), sizing it (ftruncate), mapping it into each process (mmap), faulting in pages on first touch, and initializing synchronization primitives. For large regions this can take anywhere from tens of microseconds to milliseconds.
For short-lived communications, this setup cost may exceed the benefits. Rule of thumb: shared memory pays off when messages are large (kilobytes rather than bytes), exchanged frequently, latency or throughput genuinely matters, and the processes live long enough to amortize the one-time mapping cost—trade-offs summarized in the decision table below.
Memory Usage:
Shared memory consumes physical RAM (or swap) from the moment it's created. Unlike pipe buffers that grow dynamically, you pre-allocate the full size. Over-provisioning wastes memory; under-provisioning limits functionality.
| Consideration | Favor Shared Memory | Favor Other IPC |
|---|---|---|
| Message size | >4KB typical | <100 bytes typical |
| Message frequency | High (>10K/sec) | Low (<100/sec) |
| Latency requirement | <1µs needed | >10µs acceptable |
| Process lifetime | Long-running services | Short-lived workers |
| Data structure | Complex, in-place updates | Simple messages |
| Synchronization | You can manage it | Want kernel to handle |
| Debugging ease | Can handle complexity | Want simple semantics |
Shared memory's performance comes at the cost of complexity: custom synchronization, careful memory management, debugging difficulty, and potential for hard-to-reproduce bugs. If your IPC needs don't require sub-microsecond latency or multi-GB throughput, Unix domain sockets or pipes may be simpler and fast enough.
Shared memory's performance characteristics make it indispensable for specific application domains. Here are real-world scenarios where shared memory is the right—often the only—choice.
```c
/**
 * Simplified video frame sharing pattern used in media pipelines.
 * Producer captures frames, consumer processes them, zero copies.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/video_frames"
#define NUM_FRAMES 4                  // Ring buffer size
#define FRAME_WIDTH 1920
#define FRAME_HEIGHT 1080
#define BYTES_PER_PIXEL 4             // RGBA
#define FRAME_SIZE (FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_PIXEL)

typedef struct {
    uint64_t timestamp;
    uint32_t frame_number;
    atomic_int state;                 // 0: empty, 1: filled, 2: processing
    char pixels[FRAME_SIZE];
} VideoFrame;

typedef struct {
    atomic_uint write_index;
    atomic_uint read_index;
    VideoFrame frames[NUM_FRAMES];
} FrameBuffer;

// Producer: Capture and write frames
void capture_frames(FrameBuffer *fb, int count) {
    for (int i = 0; i < count; i++) {
        // Get next write slot
        uint32_t slot = atomic_fetch_add(&fb->write_index, 1) % NUM_FRAMES;
        VideoFrame *frame = &fb->frames[slot];

        // Wait if consumer hasn't processed yet
        while (atomic_load(&frame->state) == 1) {
            usleep(1000);  // Back off
        }

        // "Capture" frame (in reality, hardware fills this directly)
        frame->frame_number = i;
        frame->timestamp = /* get_timestamp_ns() */ 0;
        // memcpy from capture device...

        // Signal frame is ready
        atomic_store(&frame->state, 1);
    }
}

// Consumer: Process frames
void process_frames(FrameBuffer *fb, int count) {
    for (int i = 0; i < count; i++) {
        uint32_t slot = atomic_fetch_add(&fb->read_index, 1) % NUM_FRAMES;
        VideoFrame *frame = &fb->frames[slot];

        // Wait for frame
        while (atomic_load(&frame->state) != 1) {
            usleep(100);
        }

        atomic_store(&frame->state, 2);  // Processing

        // Process frame in-place (no copy!)
        // ... apply filters, encode, display ...

        atomic_store(&frame->state, 0);  // Done
    }
}

/**
 * Key insight: 8MB frames pass between processes with ZERO copies.
 * The only overhead is the atomic flag operations and cache coherency.
 *
 * At 60fps, this is 480 MB/s of video data with minimal CPU overhead.
 * Using pipes would require copying 960 MB/s (2x for send+receive).
 */
```

Beyond the inherent advantages of zero-copy, several optimization techniques can squeeze additional performance from shared memory systems.
```c
#define _GNU_SOURCE   // for sched_setaffinity, CPU_SET
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <numaif.h>   // mbind(); link with -lnuma

/**
 * Examples of shared memory optimization techniques.
 */

// 1. Huge page allocation
void *alloc_huge_pages(size_t size) {
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr == MAP_FAILED) {
        perror("Huge pages not available, falling back");
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }
    return ptr;
}

// 2. Pre-fault pages
void prefault_memory(void *ptr, size_t size) {
    volatile char *p = (volatile char *)ptr;
    size_t page_size = sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < size; i += page_size) {
        p[i] = p[i];  // Touch each page
    }
}

// 3. Lock pages in RAM (prevent swapping)
void lock_memory(void *ptr, size_t size) {
    if (mlock(ptr, size) == -1) {
        perror("mlock failed (may need CAP_IPC_LOCK)");
    }
}

// 4. CPU pinning
void pin_to_cpu(int cpu_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);
    if (sched_setaffinity(0, sizeof(cpuset), &cpuset) == -1) {
        perror("sched_setaffinity");
    }
}

// 5. NUMA-aware allocation
void *alloc_numa_local(size_t size, int numa_node) {
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ptr != MAP_FAILED) {
        unsigned long nodemask = 1UL << numa_node;
        mbind(ptr, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
    }
    return ptr;
}

// 6. Cache line prefetching
void prefetch_next(const void *ptr) {
    __builtin_prefetch(ptr, 0, 3);                     // Read, high locality
    __builtin_prefetch((const char *)ptr + 64, 0, 3);  // Next cache line
}

int main() {
    // Example: Optimized shared memory setup
    size_t size = 64 * 1024 * 1024;  // 64 MB

    printf("Allocating optimized shared memory...\n");

    // Pin to CPU 0
    pin_to_cpu(0);

    // Allocate with huge pages if available
    void *shm = alloc_huge_pages(size);
    if (shm == MAP_FAILED) {
        printf("Allocation failed\n");
        return 1;
    }

    // Pre-fault all pages
    printf("Pre-faulting pages...\n");
    prefault_memory(shm, size);

    // Lock in RAM
    lock_memory(shm, size);

    printf("Optimized shared memory ready at %p\n", shm);

    // Use the memory...

    munmap(shm, size);
    return 0;
}
```

We've explored why shared memory is the performance king of IPC mechanisms and when that performance advantage justifies its complexity. Let's consolidate the key takeaways:

- Zero-copy is the core advantage: data is written once into the shared mapping instead of being copied through kernel buffers.
- Round-trip latency drops from microseconds (sockets, pipes, message queues) to hundreds of nanoseconds with spin-based synchronization.
- Bulk throughput can approach raw memory bandwidth, while copy-based IPC typically reaches only a fraction of it.
- Cache coherency and false sharing are the hidden costs; pad hot fields to cache-line boundaries.
- The choice of synchronization (spinlock vs. futex vs. mutex) and the setup cost determine whether the zero-copy gain survives in practice.
- Huge pages, pre-faulting, mlock, CPU pinning, and NUMA-aware placement squeeze out the remaining overhead.
What's Next:
We've covered System V shared memory and implementation details throughout this module. The final page provides a comprehensive look at POSIX shared memory—the modern, file-oriented API that offers cleaner semantics, better tooling support, and integration with the standard mmap()/munmap() memory model.
You now understand why shared memory delivers unmatched IPC performance and when to choose it. You've learned about zero-copy architecture, latency and bandwidth characteristics, cache considerations, and optimization techniques. Next, we'll explore the POSIX shared memory API in depth.