Imagine a simple scenario: two processors, each with its own cache, both caching the same memory location. Processor A writes a new value, but Processor B's cache still holds the old one. When B reads the location, it sees stale data—its private copy no longer reflects the most recent write.
This is the cache coherence problem: in multiprocessor systems with private caches, different processors can have inconsistent views of the same memory location. Left unsolved, this makes shared-memory parallel programming impossible—threads could not reliably communicate through memory.
This page explains why cache coherence is a problem, how the problem manifests in real systems, the formal requirements for a coherent cache system, and the architectural approaches to maintaining coherence. By the end, you'll understand why cache coherence is fundamental to shared-memory multiprocessing and how hardware maintains the illusion of a single, consistent memory.
Let's trace through a concrete example that shows exactly how incoherence arises.
Without coherence, a simple flag variable breaks: Thread 0 sets flag=1 to signal Thread 1, but Thread 1 never sees the update. All communication between threads through shared memory becomes unreliable. Locks, condition variables, message passing—all impossible without coherence.
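To make this concrete, here is a minimal pthread sketch of that flag handshake (the names `data` and `flag` are illustrative, not from a particular codebase). On coherent hardware the consumer eventually observes `flag = 1`; on a machine without coherence it could spin forever on a stale cached copy. Real code would also need proper synchronization for ordering between the two variables, which is a memory consistency concern discussed later on this page.

```c
#include <pthread.h>
#include <stdio.h>

// Hypothetical shared variables for the handshake described above.
// volatile keeps the compiler from caching them in registers; it does NOT
// provide ordering guarantees (that is the consistency model's job).
volatile int data = 0;
volatile int flag = 0;

void* producer(void* arg) {
    data = 42;   // Thread 0: write the payload ...
    flag = 1;    // ... then raise the flag to signal Thread 1
    return NULL;
}

void* consumer(void* arg) {
    while (flag == 0) { /* spin: needs coherence to ever see flag = 1 */ }
    printf("data = %d\n", data);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t1, NULL, consumer, NULL);
    pthread_create(&t0, NULL, producer, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```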
Cache incoherence can arise from multiple sources in a multiprocessor system. Understanding all sources is essential for designing correct coherence protocols.
The Processor Write Case (In Detail):
This is the most common source of incoherence and the primary focus of coherence protocols. The problem has two variants:
1. Write to Shared Data (Multiple Readers): one processor writes a line that several other processors are caching for reading; every one of those cached copies instantly becomes stale.
2. Write to Previously-Written Data (Writer-Writer): two processors take turns writing the same line; unless each write removes or updates the other's copy, updates can be lost and the caches disagree about the latest value.
The I/O Incoherence Problem:
I/O devices that use DMA (Direct Memory Access) bypass the CPU cache:
- DMA write (device → memory, cache bypassed): the CPU cache may still hold a stale copy of the buffer.
- DMA read (memory → device, cache bypassed): dirty data sitting in the CPU cache is not visible to the device.
Solutions:
Coherence protocols typically maintain coherence only between CPU caches (CPU coherence). On systems without hardware I/O coherence, I/O coherence is handled by software: device drivers invalidate the cache lines covering a buffer before the device DMAs data into it (so later CPU reads fetch the fresh data), and write back (flush) dirty lines before the device DMAs data out of it (so the device sees the latest values). This is why DMA buffers are often allocated as uncacheable or require explicit cache management.
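The sketch below shows that driver-side pattern schematically. The helper names (`cache_clean_range`, `cache_invalidate_range`, and the `dma_*` calls) are hypothetical stubs, not a real platform API; actual drivers use their OS's DMA mapping and cache maintenance interfaces.

```c
#include <stdio.h>
#include <stddef.h>

// Hypothetical platform hooks (illustrative stubs, not a real driver API).
static void cache_clean_range(void* addr, size_t len)      { printf("clean  %p +%zu\n", addr, len); }
static void cache_invalidate_range(void* addr, size_t len) { printf("inval  %p +%zu\n", addr, len); }
static void dma_to_device(void* addr, size_t len)          { printf("device reads  %p +%zu\n", addr, len); }
static void dma_from_device(void* addr, size_t len)        { printf("device writes %p +%zu\n", addr, len); }

// CPU -> device: write back dirty lines so the device sees the CPU's latest data.
static void send_buffer(void* buf, size_t len) {
    cache_clean_range(buf, len);       // flush before the device reads memory
    dma_to_device(buf, len);
}

// Device -> CPU: invalidate so later CPU reads miss and fetch the new data.
static void recv_buffer(void* buf, size_t len) {
    cache_invalidate_range(buf, len);  // discard possibly stale cached copies
    dma_from_device(buf, len);
}

int main(void) {
    char buf[4096];
    send_buffer(buf, sizeof buf);
    recv_buffer(buf, sizeof buf);
    return 0;
}
```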
What exactly does it mean for a memory system to be "coherent"? Two conditions define coherence:
1. Single-Writer, Multiple-Reader (SWMR) Invariant:
For any memory location, at any point in time, one of two situations holds:
- exactly one processor may read and write the location (a read-write epoch), or
- any number of processors may read it, and none may write it (a read-only epoch).
This is the fundamental invariant that prevents incoherence: if only one processor can write at a time, and it's the only one that can read while writing, there's no opportunity for stale reads.
2. Data-Value Invariant:
The value of a memory location at the start of an epoch is the same as the value written in the last read-write epoch for that location.
In simpler terms: when you read a location, you get the value from the most recent write to that location (by any processor).
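As a rough illustration, the data-value requirement can be phrased as a check over a simplified, totally ordered access trace: every read must return the value of the most recent write to that location. The `Access` type and the trace below are illustrative, replaying the two-processor scenario from the start of this page.

```c
#include <stdio.h>

// One access to a single memory location: which CPU, read or write, and the value.
typedef struct { int cpu; char op; int value; } Access;   // op: 'R' or 'W'

// Returns -1 if every read saw the latest written value, else the index of
// the first stale read. Assumes the location starts at 0.
int check_coherent(const Access* trace, int n) {
    int last_written = 0;
    for (int i = 0; i < n; i++) {
        if (trace[i].op == 'W') last_written = trace[i].value;
        else if (trace[i].value != last_written) return i;   // stale read found
    }
    return -1;
}

int main(void) {
    // Both CPUs cache 0, CPU 0 writes 1, then CPU 1 still reads its stale 0.
    Access trace[] = { {0, 'R', 0}, {1, 'R', 0}, {0, 'W', 1}, {1, 'R', 0} };
    int bad = check_coherent(trace, 4);
    if (bad >= 0)
        printf("incoherent: access %d (CPU %d read %d, not the latest write)\n",
               bad, trace[bad].cpu, trace[bad].value);
    return 0;
}
```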
Coherence vs Memory Consistency:
A critical distinction that causes confusion:
Coherence answers: "What value does a read return for a single memory location?"
Memory Consistency answers: "When can a processor see another processor's writes to different locations?"
Coherence is necessary but not sufficient for correct parallel programs. You also need a well-defined memory consistency model. But coherence is the foundation—without it, consistency is meaningless.
| Aspect | Cache Coherence | Memory Consistency |
|---|---|---|
| Scope | Single memory location | Multiple memory locations |
| Question | What value does read return? | When are writes visible? |
| Mechanism | Hardware protocols (MESI, etc.) | Memory ordering rules + barriers |
| Programmer Visible? | Usually invisible (hardware handles) | Very visible (affects synchronization) |
| Example Issue | Read returns stale value | Flag seen before data it protects |
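To see the difference in code, the sketch below revisits the earlier flag handshake using C11 release/acquire atomics. Coherence alone guarantees that each location individually converges to its latest written value; the ordering between `payload` and `ready` (two different locations) comes from the consistency model and the explicit memory-order annotations. Names are illustrative.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int payload = 0;           // ordinary data, protected by the flag below
atomic_int ready = 0;      // publication flag

void* writer(void* arg) {
    payload = 42;                                              // write data first
    atomic_store_explicit(&ready, 1, memory_order_release);    // then publish
    return NULL;
}

void* reader(void* arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                      // wait for the flag
    printf("payload = %d\n", payload);                         // guaranteed to see 42
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, reader, NULL);
    pthread_create(&b, NULL, writer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```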
Cache coherence creates the illusion that there's only one copy of each memory location. Even though physically each CPU has its own cache with its own copy, coherence ensures they all agree. A write by one processor is eventually seen by all others as if they were sharing a single memory. This illusion is fundamental to shared-memory programming.
Two fundamental approaches exist for maintaining coherence: snooping protocols and directory-based protocols. The choice depends on system size and interconnect topology.
Snooping Protocols:
All caches monitor (snoop) a shared bus. When any cache issues a request, all other caches see it and can respond.
How It Works: every cache controller watches the shared bus. When it sees another cache's write (or read-for-ownership) to a line it holds, it invalidates its copy; when it sees a read for a line it holds in a dirty state, it supplies the data and writes it back. (A toy model of this behavior appears just before the comparison table.)
Advantages: conceptually simple, needs no extra storage to track sharers, and resolves requests with low latency because a single broadcast reaches every cache at once.
Disadvantages: every cache must examine every transaction, and the shared bus or broadcast fabric becomes a bandwidth bottleneck, which in practice limits snooping to roughly 8-16 cores per snoop domain.
Used In: small symmetric multiprocessors (SMPs) and early multi-core CPUs; larger NUMA and many-core systems use directory-based protocols, which track which caches hold each line and send coherence messages only to those sharers.
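Here is a toy model of the snooping mechanism itself (two caches, one memory location, illustrative types): every bus transaction is shown to every cache, which invalidates its copy on a remote write and writes back dirty data on a remote read.

```c
#include <stdio.h>
#include <stdbool.h>

typedef enum { BUS_READ, BUS_WRITE } BusOp;
typedef struct { bool valid; bool dirty; int value; } Line;

Line caches[2];
int  memory = 0;

// Each cache snoops the bus: on a remote write, drop our copy;
// on a remote read, write back our dirty data so the reader sees it.
void snoop(int me, int requester, BusOp op) {
    if (me == requester || !caches[me].valid) return;
    if (op == BUS_WRITE) {
        caches[me].valid = false;              // invalidate the now-stale copy
    } else if (caches[me].dirty) {
        memory = caches[me].value;             // write back for the reader
        caches[me].dirty = false;
    }
}

void bus_transaction(int requester, BusOp op, int value) {
    for (int c = 0; c < 2; c++) snoop(c, requester, op);
    if (op == BUS_WRITE) caches[requester] = (Line){ true, true, value };
    else                 caches[requester] = (Line){ true, false, memory };
}

int main(void) {
    bus_transaction(0, BUS_READ, 0);    // CPU 0 caches the line (value 0)
    bus_transaction(1, BUS_WRITE, 7);   // CPU 1 writes 7: CPU 0's copy invalidated
    bus_transaction(0, BUS_READ, 0);    // CPU 0 re-reads: CPU 1 writes back 7 first
    printf("CPU 0 sees %d\n", caches[0].value);   // prints 7, not stale 0
    return 0;
}
```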
| Property | Snooping | Directory |
|---|---|---|
| Scalability | Limited (≤8-16 nodes) | High (1000s of nodes) |
| Latency (uncached) | Lower (broadcast) | Higher (point-to-point) |
| Bandwidth per transaction | High (broadcast) | Lower (targeted) |
| Total bandwidth | Limited by bus | Scales with network |
| Complexity | Simpler | More complex |
| Storage Overhead | None | 5-10% for directory |
| Typical Use | Small SMP, early multi-core | NUMA, many-core |
When a processor writes to a shared cache line, what should other caches do with their copies? Two approaches exist: invalidate those copies, forcing a miss on the next access, or update them in place with the newly written value.
Why Invalidation Dominates:
Consider a producer writing to a buffer that will be read once by a consumer:
With Invalidation: the producer's first write to each line invalidates the consumer's copy; the remaining writes to that line proceed locally with no further traffic, and the consumer takes a single miss when it finally reads the line.
With Update: every individual write is broadcast to the consumer's cache, even though the consumer will read the finished buffer only once.
The update protocol generates vastly more traffic. Since writes are often multiple words to the same line, and data is often read infrequently by other processors, invalidation is almost always more efficient.
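A back-of-the-envelope count makes the difference concrete. Assume a 64-byte line written as 16 four-byte words by the producer and read once by the consumer (the numbers are illustrative, not measurements):

```c
#include <stdio.h>

int main(void) {
    int words_written = 16;   // producer fills one 64-byte line, 4 bytes at a time

    // Invalidation: first write invalidates the consumer's copy (1 message),
    // the remaining 15 writes hit locally, and the consumer later misses once.
    int invalidation_msgs = 1 /* invalidate */ + 1 /* consumer refetch */;

    // Update: every single write pushes the new word to the consumer's cache.
    int update_msgs = words_written;

    printf("invalidation: %d coherence messages\n", invalidation_msgs);
    printf("update:       %d coherence messages\n", update_msgs);
    return 0;
}
```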
When Update Might Win:
Update can be better when data written by one processor is read again soon and repeatedly by others—for example a heavily shared flag, a barrier counter, or tight producer-consumer handoffs where each written value is consumed immediately.
In practice, these cases are rare enough that all major modern processors use invalidation protocols.
Early multiprocessors (1980s-1990s) experimented with update protocols (e.g., Dragon, Firefly). The appeal was reducing read latency by keeping caches up-to-date. But bandwidth limitations made this impractical. By the late 1990s, invalidation protocols became dominant and remain so today.
Cache coherence isn't free. Maintaining the illusion of a single shared memory imposes significant costs that affect software performance and hardware design.
False Sharing—The Silent Performance Killer:
False sharing deserves special attention because it's a common source of performance bugs that's invisible at the source code level.
```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERATIONS 100000000

// BAD: False sharing - all counters in the same cache line!
struct counters_bad {
    int counter[NUM_THREADS];   // 4 ints = 16 bytes, all in one cache line
};

// GOOD: Padding to separate cache lines
struct counters_good {
    struct {
        int counter;
        char padding[60];       // Pad to 64 bytes
    } per_thread[NUM_THREADS];
};

struct counters_bad  bad_counters;
struct counters_good good_counters;

void* increment_bad(void* arg) {
    int id = *(int*)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        bad_counters.counter[id]++;               // All threads hammer same cache line!
    }
    return NULL;
}

void* increment_good(void* arg) {
    int id = *(int*)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        good_counters.per_thread[id].counter++;   // Separate cache lines
    }
    return NULL;
}

// Minimal driver (added so the snippet runs): launches the false-sharing
// version; swap increment_bad for increment_good to compare timings.
int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, increment_bad, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    printf("counter[0] = %d\n", bad_counters.counter[0]);
    return 0;
}

// Typical results on 4-core system:
// Bad (false sharing):     ~8 seconds
// Good (no false sharing): ~0.4 seconds
// 20x SLOWDOWN from false sharing!

/*
What's happening:
1. Thread 0 writes counter[0] → acquires exclusive cache line
2. Thread 1 writes counter[1] → invalidates Thread 0's line, acquires exclusive
3. Thread 0 writes counter[0] → invalidates Thread 1's line, acquires exclusive
4. ... ping-pong continues for every single increment!

With padding:
- Each counter in its own cache line
- No invalidations between threads
- Near-linear scaling
*/
```

False sharing doesn't cause incorrect results—just poor performance. It's invisible in source code (the variables look unrelated), so finding it requires profiling tools such as perf, VTune, or hardware performance counters that track coherence events. Always pad or align shared structures that different threads write to.
Modern multi-core CPUs implement sophisticated coherence mechanisms that have evolved significantly from early multiprocessors. Here's how coherence works in current architectures:
| Architecture | L1/L2 Coherence | L3/LLC Coherence | Multi-Socket |
|---|---|---|---|
| Intel (12th gen+) | MESIF protocol | Inclusive/NINE L3 acts as snoop filter | UPI with home-snoop protocol |
| AMD (Zen 3+) | MOESI protocol | Shared L3 per CCX (snoop domain) | Infinity Fabric with probe filter |
| Apple M-series | MESI variant | Shared L2 per cluster | Unified memory, single socket |
| ARM (Neoverse) | MOESI/MESI | Coherent Mesh Network (CMN) | CCIX/CXL for multi-chip |
Intel's Approach (Sapphire Rapids, 12th/13th gen): cores run a MESIF-style protocol; the L3/LLC doubles as a snoop filter so most requests are resolved without broadcasting to every core, and multi-socket systems carry coherence traffic over UPI links using a home-snoop scheme.
AMD's Approach (Zen 4, EPYC Genoa): cores use a MOESI-style protocol, with each core complex (CCX) sharing an L3 that forms its local snoop domain; across CCXs and sockets, Infinity Fabric carries coherence traffic and a probe filter limits how many caches must be probed.
The Role of the Last-Level Cache (LLC/L3):
In modern designs, the shared L3 cache serves as both a large shared cache backing the per-core L1/L2 caches and a coherence-tracking structure (a snoop filter or directory) that records which cores may hold a copy of each line.
When a core needs exclusive access, instead of snooping all cores, the L3 snoop filter tells it exactly which cores to ask. This dramatically reduces coherence traffic.
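A snoop-filter entry can be pictured as a small per-line sharer record. The sketch below (8 cores, a one-byte bit-vector, illustrative names) shows how a write consults it to send invalidations only to the cores that may actually hold a copy:

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_CORES 8

// Bit i set => core i may hold a copy of the line in its private cache.
typedef struct { uint8_t sharers; } FilterEntry;

void record_fill(FilterEntry* e, int core) {
    e->sharers |= (uint8_t)(1u << core);       // core fetched the line: remember it
}

void request_exclusive(FilterEntry* e, int writer) {
    for (int c = 0; c < NUM_CORES; c++) {
        if (c != writer && (e->sharers & (1u << c)))
            printf("  invalidate core %d\n", c);   // targeted message, not a broadcast
    }
    e->sharers = (uint8_t)(1u << writer);      // only the writer holds the line now
}

int main(void) {
    FilterEntry line = {0};
    record_fill(&line, 2);                     // cores 2 and 5 read the line
    record_fill(&line, 5);
    printf("core 0 wants to write:\n");
    request_exclusive(&line, 0);               // invalidates only cores 2 and 5
    return 0;
}
```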
The complexity of modern coherence protocols is hidden from software. From the programmer's perspective, memory simply works—writes by one core become visible to other cores. The only visible effect is performance: coherence traffic affects cache miss rates and memory latency. Understanding coherence helps you write cache-friendly parallel code.
We've covered why cache coherence is necessary and how it's fundamentally approached. Let's consolidate:
- Private caches let processors hold inconsistent copies of the same location; without coherence, threads cannot reliably communicate through shared memory, and DMA devices face the same problem.
- A coherent system maintains the single-writer/multiple-reader invariant and the data-value invariant, creating the illusion of a single copy of each memory location.
- Coherence is a per-location guarantee; memory consistency governs ordering across locations, and correct parallel programs need both.
- Snooping protocols broadcast on a shared bus and suit small systems; directory-based protocols track sharers explicitly and scale to much larger ones.
- Invalidation-based protocols dominate over update-based ones because they generate far less coherence traffic.
- Coherence is invisible to program correctness but not to performance: false sharing alone can slow code by an order of magnitude.
What's Next:
Now that we understand the coherence problem and the approaches to solve it, the next page dives into the MESI Protocol—the industry-standard cache coherence protocol used (with variations) in virtually all modern multiprocessors. We'll examine each state, the transitions, and how MESI efficiently maintains coherence.
You understand why cache coherence is fundamental to shared-memory multiprocessing, the formal requirements for coherence, and the architectural approaches to maintaining it. Next, we'll see exactly how the MESI protocol implements coherence in practice.