Imagine a simple scenario: two processors, each with its own cache, both caching the same memory location. Processor A writes a new value, but Processor B's cache still holds the old one. When B reads the location, it sees stale data—its private copy no longer reflects the most recent write.
This is the cache coherence problem: in multiprocessor systems with private caches, different processors can have inconsistent views of the same memory location. Left unsolved, this makes shared-memory parallel programming impossible—threads could not reliably communicate through memory.
This page explains why cache coherence is a problem, how the problem manifests in real systems, the formal requirements for a coherent cache system, and the architectural approaches to maintaining coherence. By the end, you'll understand why cache coherence is fundamental to shared-memory multiprocessing and how hardware maintains the illusion of a single, consistent memory.
Let's trace through a concrete example that shows exactly how incoherence arises.
Without coherence, a simple flag variable breaks: Thread 0 sets flag=1 to signal Thread 1, but Thread 1 never sees the update. All communication between threads through shared memory becomes unreliable. Locks, condition variables, message passing—all impossible without coherence.
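To make this concrete, here is a minimal pthread sketch of that flag handshake (the names `data` and `flag` are illustrative, not from a particular codebase). On coherent hardware the consumer eventually observes `flag = 1`; on a machine without coherence it could spin forever on a stale cached copy. Real code would also need proper synchronization for ordering between the two variables, which is a memory consistency concern discussed later on this page.

```c
#include <pthread.h>
#include <stdio.h>

// Hypothetical shared variables for the handshake described above.
// volatile keeps the compiler from caching them in registers; it does NOT
// provide ordering guarantees (that is the consistency model's job).
volatile int data = 0;
volatile int flag = 0;

void* producer(void* arg) {
    data = 42;   // Thread 0: write the payload ...
    flag = 1;    // ... then raise the flag to signal Thread 1
    return NULL;
}

void* consumer(void* arg) {
    while (flag == 0) { /* spin: needs coherence to ever see flag = 1 */ }
    printf("data = %d\n", data);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t1, NULL, consumer, NULL);
    pthread_create(&t0, NULL, producer, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```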
Cache incoherence can arise from multiple sources in a multiprocessor system. Understanding all sources is essential for designing correct coherence protocols.
The Processor Write Case (In Detail):
This is the most common source of incoherence and the primary focus of coherence protocols. The problem has two variants:
1. Write to Shared Data (Multiple Readers): one processor writes a line that several other processors are caching for reading; every one of those cached copies instantly becomes stale.
2. Write to Previously-Written Data (Writer-Writer): two processors take turns writing the same line; unless each write removes or updates the other's copy, updates can be lost and the caches disagree about the latest value.
The I/O Incoherence Problem:
I/O devices that use DMA (Direct Memory Access) bypass the CPU cache:
- DMA write (device → memory, cache bypassed): the CPU cache may still hold a stale copy of the buffer.
- DMA read (memory → device, cache bypassed): dirty data sitting in the CPU cache is not visible to the device.
Solutions:
Coherence protocols typically maintain coherence only between CPU caches (CPU coherence). On systems without hardware I/O coherence, I/O coherence is handled by software: device drivers invalidate the cache lines covering a buffer before the device DMAs data into it (so later CPU reads fetch the fresh data), and write back (flush) dirty lines before the device DMAs data out of it (so the device sees the latest values). This is why DMA buffers are often allocated as uncacheable or require explicit cache management.
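The sketch below shows that driver-side pattern schematically. The helper names (`cache_clean_range`, `cache_invalidate_range`, and the `dma_*` calls) are hypothetical stubs, not a real platform API; actual drivers use their OS's DMA mapping and cache maintenance interfaces.

```c
#include <stdio.h>
#include <stddef.h>

// Hypothetical platform hooks (illustrative stubs, not a real driver API).
static void cache_clean_range(void* addr, size_t len)      { printf("clean  %p +%zu\n", addr, len); }
static void cache_invalidate_range(void* addr, size_t len) { printf("inval  %p +%zu\n", addr, len); }
static void dma_to_device(void* addr, size_t len)          { printf("device reads  %p +%zu\n", addr, len); }
static void dma_from_device(void* addr, size_t len)        { printf("device writes %p +%zu\n", addr, len); }

// CPU -> device: write back dirty lines so the device sees the CPU's latest data.
static void send_buffer(void* buf, size_t len) {
    cache_clean_range(buf, len);       // flush before the device reads memory
    dma_to_device(buf, len);
}

// Device -> CPU: invalidate so later CPU reads miss and fetch the new data.
static void recv_buffer(void* buf, size_t len) {
    cache_invalidate_range(buf, len);  // discard possibly stale cached copies
    dma_from_device(buf, len);
}

int main(void) {
    char buf[4096];
    send_buffer(buf, sizeof buf);
    recv_buffer(buf, sizeof buf);
    return 0;
}
```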
What exactly does it mean for a memory system to be "coherent"? Two conditions define coherence:
1. Single-Writer, Multiple-Reader (SWMR) Invariant:
For any memory location, at any point in time, one of two situations holds:
- exactly one processor may read and write the location (a read-write epoch), or
- any number of processors may read it, and none may write it (a read-only epoch).
This is the fundamental invariant that prevents incoherence: if only one processor can write at a time, and it's the only one that can read while writing, there's no opportunity for stale reads.
2. Data-Value Invariant:
The value of a memory location at the start of an epoch is the same as the value written in the last read-write epoch for that location.
In simpler terms: when you read a location, you get the value from the most recent write to that location (by any processor).
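As a rough illustration, the data-value requirement can be phrased as a check over a simplified, totally ordered access trace: every read must return the value of the most recent write to that location. The `Access` type and the trace below are illustrative, replaying the two-processor scenario from the start of this page.

```c
#include <stdio.h>

// One access to a single memory location: which CPU, read or write, and the value.
typedef struct { int cpu; char op; int value; } Access;   // op: 'R' or 'W'

// Returns -1 if every read saw the latest written value, else the index of
// the first stale read. Assumes the location starts at 0.
int check_coherent(const Access* trace, int n) {
    int last_written = 0;
    for (int i = 0; i < n; i++) {
        if (trace[i].op == 'W') last_written = trace[i].value;
        else if (trace[i].value != last_written) return i;   // stale read found
    }
    return -1;
}

int main(void) {
    // Both CPUs cache 0, CPU 0 writes 1, then CPU 1 still reads its stale 0.
    Access trace[] = { {0, 'R', 0}, {1, 'R', 0}, {0, 'W', 1}, {1, 'R', 0} };
    int bad = check_coherent(trace, 4);
    if (bad >= 0)
        printf("incoherent: access %d (CPU %d read %d, not the latest write)\n",
               bad, trace[bad].cpu, trace[bad].value);
    return 0;
}
```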
Coherence vs Memory Consistency:
A critical distinction that causes confusion:
Coherence answers: "What value does a read return for a single memory location?"
Memory Consistency answers: "When can a processor see another processor's writes to different locations?"
Coherence is necessary but not sufficient for correct parallel programs. You also need a well-defined memory consistency model. But coherence is the foundation—without it, consistency is meaningless.
| Aspect | Cache Coherence | Memory Consistency |
|---|---|---|
| Scope | Single memory location | Multiple memory locations |
| Question | What value does read return? | When are writes visible? |
| Mechanism | Hardware protocols (MESI, etc.) | Memory ordering rules + barriers |
| Programmer Visible? | Usually invisible (hardware handles) | Very visible (affects synchronization) |
| Example Issue | Read returns stale value | Flag seen before data it protects |
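To see the difference in code, the sketch below revisits the earlier flag handshake using C11 release/acquire atomics. Coherence alone guarantees that each location individually converges to its latest written value; the ordering between `payload` and `ready` (two different locations) comes from the consistency model and the explicit memory-order annotations. Names are illustrative.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int payload = 0;           // ordinary data, protected by the flag below
atomic_int ready = 0;      // publication flag

void* writer(void* arg) {
    payload = 42;                                              // write data first
    atomic_store_explicit(&ready, 1, memory_order_release);    // then publish
    return NULL;
}

void* reader(void* arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                      // wait for the flag
    printf("payload = %d\n", payload);                         // guaranteed to see 42
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, reader, NULL);
    pthread_create(&b, NULL, writer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```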
Cache coherence creates the illusion that there's only one copy of each memory location. Even though physically each CPU has its own cache with its own copy, coherence ensures they all agree. A write by one processor is eventually seen by all others as if they were sharing a single memory. This illusion is fundamental to shared-memory programming.
Two fundamental approaches exist for maintaining coherence: snooping protocols and directory-based protocols. The choice depends on system size and interconnect topology.
Snooping Protocols:
All caches monitor (snoop) a shared bus. When any cache issues a request, all other caches see it and can respond.
How It Works: every cache controller watches the shared bus. When it sees another cache's write (or read-for-ownership) to a line it holds, it invalidates its copy; when it sees a read for a line it holds in a dirty state, it supplies the data and writes it back. (A toy model of this behavior appears just before the comparison table.)
Advantages: conceptually simple, needs no extra storage to track sharers, and resolves requests with low latency because a single broadcast reaches every cache at once.
Disadvantages: every cache must examine every transaction, and the shared bus or broadcast fabric becomes a bandwidth bottleneck, which in practice limits snooping to roughly 8-16 cores per snoop domain.
Used In: small symmetric multiprocessors (SMPs) and early multi-core CPUs; larger NUMA and many-core systems use directory-based protocols, which track which caches hold each line and send coherence messages only to those sharers.
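Here is a toy model of the snooping mechanism itself (two caches, one memory location, illustrative types): every bus transaction is shown to every cache, which invalidates its copy on a remote write and writes back dirty data on a remote read.

```c
#include <stdio.h>
#include <stdbool.h>

typedef enum { BUS_READ, BUS_WRITE } BusOp;
typedef struct { bool valid; bool dirty; int value; } Line;

Line caches[2];
int  memory = 0;

// Each cache snoops the bus: on a remote write, drop our copy;
// on a remote read, write back our dirty data so the reader sees it.
void snoop(int me, int requester, BusOp op) {
    if (me == requester || !caches[me].valid) return;
    if (op == BUS_WRITE) {
        caches[me].valid = false;              // invalidate the now-stale copy
    } else if (caches[me].dirty) {
        memory = caches[me].value;             // write back for the reader
        caches[me].dirty = false;
    }
}

void bus_transaction(int requester, BusOp op, int value) {
    for (int c = 0; c < 2; c++) snoop(c, requester, op);
    if (op == BUS_WRITE) caches[requester] = (Line){ true, true, value };
    else                 caches[requester] = (Line){ true, false, memory };
}

int main(void) {
    bus_transaction(0, BUS_READ, 0);    // CPU 0 caches the line (value 0)
    bus_transaction(1, BUS_WRITE, 7);   // CPU 1 writes 7: CPU 0's copy invalidated
    bus_transaction(0, BUS_READ, 0);    // CPU 0 re-reads: CPU 1 writes back 7 first
    printf("CPU 0 sees %d\n", caches[0].value);   // prints 7, not stale 0
    return 0;
}
```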
| Property | Snooping | Directory |
|---|---|---|
| Scalability | Limited (≤8-16 nodes) | High (1000s of nodes) |
| Latency (uncached) | Lower (broadcast) | Higher (point-to-point) |
| Bandwidth per transaction | High (broadcast) | Lower (targeted) |
| Total bandwidth | Limited by bus | Scales with network |
| Complexity | Simpler | More complex |
| Storage Overhead | None | 5-10% for directory |
| Typical Use | Small SMP, early multi-core | NUMA, many-core |
When a processor writes to a shared cache line, what should other caches do with their copies? Two approaches exist: invalidate those copies, forcing a miss on the next access, or update them in place with the newly written value.
Why Invalidation Dominates:
Consider a producer writing to a buffer that will be read once by a consumer:
With Invalidation: the producer's first write to each line invalidates the consumer's copy; the remaining writes to that line proceed locally with no further traffic, and the consumer takes a single miss when it finally reads the line.
With Update: every individual write is broadcast to the consumer's cache, even though the consumer will read the finished buffer only once.
The update protocol generates vastly more traffic. Since writes are often multiple words to the same line, and data is often read infrequently by other processors, invalidation is almost always more efficient.
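A back-of-the-envelope count makes the difference concrete. Assume a 64-byte line written as 16 four-byte words by the producer and read once by the consumer (the numbers are illustrative, not measurements):

```c
#include <stdio.h>

int main(void) {
    int words_written = 16;   // producer fills one 64-byte line, 4 bytes at a time

    // Invalidation: first write invalidates the consumer's copy (1 message),
    // the remaining 15 writes hit locally, and the consumer later misses once.
    int invalidation_msgs = 1 /* invalidate */ + 1 /* consumer refetch */;

    // Update: every single write pushes the new word to the consumer's cache.
    int update_msgs = words_written;

    printf("invalidation: %d coherence messages\n", invalidation_msgs);
    printf("update:       %d coherence messages\n", update_msgs);
    return 0;
}
```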
When Update Might Win:
Update can be better when data written by one processor is read again soon and repeatedly by others—for example a heavily shared flag, a barrier counter, or tight producer-consumer handoffs where each written value is consumed immediately.
In practice, these cases are rare enough that all major modern processors use invalidation protocols.
Early multiprocessors (1980s-1990s) experimented with update protocols (e.g., Dragon, Firefly). The appeal was reducing read latency by keeping caches up-to-date. But bandwidth limitations made this impractical. By the late 1990s, invalidation protocols became dominant and remain so today.
Cache coherence isn't free. Maintaining the illusion of a single shared memory imposes significant costs that affect software performance and hardware design.
False Sharing—The Silent Performance Killer:
False sharing deserves special attention because it's a common source of performance bugs that's invisible at the source code level.
```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERATIONS 100000000

// BAD: False sharing - all counters in the same cache line!
struct counters_bad {
    int counter[NUM_THREADS];   // 4 ints = 16 bytes, all in one cache line
};

// GOOD: Padding to separate cache lines
struct counters_good {
    struct {
        int counter;
        char padding[60];       // Pad to 64 bytes
    } per_thread[NUM_THREADS];
};

struct counters_bad  bad_counters;
struct counters_good good_counters;

void* increment_bad(void* arg) {
    int id = *(int*)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        bad_counters.counter[id]++;               // All threads hammer same cache line!
    }
    return NULL;
}

void* increment_good(void* arg) {
    int id = *(int*)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        good_counters.per_thread[id].counter++;   // Separate cache lines
    }
    return NULL;
}

// Minimal driver (added so the snippet runs): launches the false-sharing
// version; swap increment_bad for increment_good to compare timings.
int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, increment_bad, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    printf("counter[0] = %d\n", bad_counters.counter[0]);
    return 0;
}

// Typical results on 4-core system:
// Bad (false sharing):     ~8 seconds
// Good (no false sharing): ~0.4 seconds
// 20x SLOWDOWN from false sharing!

/*
What's happening:
1. Thread 0 writes counter[0] → acquires exclusive cache line
2. Thread 1 writes counter[1] → invalidates Thread 0's line, acquires exclusive
3. Thread 0 writes counter[0] → invalidates Thread 1's line, acquires exclusive
4. ... ping-pong continues for every single increment!

With padding:
- Each counter in its own cache line
- No invalidations between threads
- Near-linear scaling
*/
```

False sharing doesn't cause incorrect results—just poor performance. It's invisible in source code (the variables look unrelated), so finding it requires profiling tools such as perf, VTune, or hardware performance counters that track coherence events. Always pad or align shared structures that different threads write to.
Modern multi-core CPUs implement sophisticated coherence mechanisms that have evolved significantly from early multiprocessors. Here's how coherence works in current architectures:
| Architecture | L1/L2 Coherence | L3/LLC Coherence | Multi-Socket |
|---|---|---|---|
| Intel (12th gen+) | MESIF protocol | Inclusive/NINE L3 acts as snoop filter | UPI with home-snoop protocol |
| AMD (Zen 3+) | MOESI protocol | Shared L3 per CCX (snoop domain) | Infinity Fabric with probe filter |
| Apple M-series | MESI variant | Shared L2 per cluster | Unified memory, single socket |
| ARM (Neoverse) | MOESI/MESI | Coherent Mesh Network (CMN) | CCIX/CXL for multi-chip |
Intel's Approach (Sapphire Rapids, 12th/13th gen): cores run a MESIF-style protocol; the L3/LLC doubles as a snoop filter so most requests are resolved without broadcasting to every core, and multi-socket systems carry coherence traffic over UPI links using a home-snoop scheme.
AMD's Approach (Zen 4, EPYC Genoa): cores use a MOESI-style protocol, with each core complex (CCX) sharing an L3 that forms its local snoop domain; across CCXs and sockets, Infinity Fabric carries coherence traffic and a probe filter limits how many caches must be probed.
The Role of the Last-Level Cache (LLC/L3):
In modern designs, the shared L3 cache serves as both a large shared cache backing the per-core L1/L2 caches and a coherence-tracking structure (a snoop filter or directory) that records which cores may hold a copy of each line.
When a core needs exclusive access, instead of snooping all cores, the L3 snoop filter tells it exactly which cores to ask. This dramatically reduces coherence traffic.
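A snoop-filter entry can be pictured as a small per-line sharer record. The sketch below (8 cores, a one-byte bit-vector, illustrative names) shows how a write consults it to send invalidations only to the cores that may actually hold a copy:

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_CORES 8

// Bit i set => core i may hold a copy of the line in its private cache.
typedef struct { uint8_t sharers; } FilterEntry;

void record_fill(FilterEntry* e, int core) {
    e->sharers |= (uint8_t)(1u << core);       // core fetched the line: remember it
}

void request_exclusive(FilterEntry* e, int writer) {
    for (int c = 0; c < NUM_CORES; c++) {
        if (c != writer && (e->sharers & (1u << c)))
            printf("  invalidate core %d\n", c);   // targeted message, not a broadcast
    }
    e->sharers = (uint8_t)(1u << writer);      // only the writer holds the line now
}

int main(void) {
    FilterEntry line = {0};
    record_fill(&line, 2);                     // cores 2 and 5 read the line
    record_fill(&line, 5);
    printf("core 0 wants to write:\n");
    request_exclusive(&line, 0);               // invalidates only cores 2 and 5
    return 0;
}
```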
The complexity of modern coherence protocols is hidden from software. From the programmer's perspective, memory simply works—writes by one core become visible to other cores. The only visible effect is performance: coherence traffic affects cache miss rates and memory latency. Understanding coherence helps you write cache-friendly parallel code.
We've covered why cache coherence is necessary and how it's fundamentally approached. Let's consolidate:
- Private caches let processors hold inconsistent copies of the same location; without coherence, threads cannot reliably communicate through shared memory, and DMA devices face the same problem.
- A coherent system maintains the single-writer/multiple-reader invariant and the data-value invariant, creating the illusion of a single copy of each memory location.
- Coherence is a per-location guarantee; memory consistency governs ordering across locations, and correct parallel programs need both.
- Snooping protocols broadcast on a shared bus and suit small systems; directory-based protocols track sharers explicitly and scale to much larger ones.
- Invalidation-based protocols dominate over update-based ones because they generate far less coherence traffic.
- Coherence is invisible to program correctness but not to performance: false sharing alone can slow code by an order of magnitude.
What's Next:
Now that we understand the coherence problem and the approaches to solve it, the next page dives into the MESI Protocol—the industry-standard cache coherence protocol used (with variations) in virtually all modern multiprocessors. We'll examine each state, the transitions, and how MESI efficiently maintains coherence.
You understand why cache coherence is fundamental to shared-memory multiprocessing, the formal requirements for coherence, and the architectural approaches to maintaining it. Next, we'll see exactly how the MESI protocol implements coherence in practice.