Reads are conceptually simple: check the cache, find the data, return it. But writes introduce a fundamental question: when should modified data be written to the next level of the memory hierarchy?
This seemingly simple question has profound implications for performance, power consumption, data consistency, memory traffic, and system reliability. Two primary strategies have evolved—write-through and write-back—each with distinct tradeoffs that influence when each is appropriate.
This page covers the two fundamental cache write policies, their mechanisms, tradeoffs, and implications. You'll understand write allocation strategies, write buffers, the relationship between write policies and cache coherence, and how modern systems combine these approaches across the cache hierarchy.
In a write-through cache, every write to the cache is simultaneously written to the next level of the memory hierarchy (whether that's L2, L3, or main memory). The cache and memory always contain the same data for any cached address.
Operation:
1. The CPU writes data to the cache line (updating it on a hit).
2. The same write is sent to the next level of the hierarchy at the same time.
3. The write completes once both the cache and the next level hold the new value.
Key Characteristic: Memory always has the current value. The cache never contains data 'newer' than memory.
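As a concrete picture of that invariant, here is a minimal, illustrative C model of a write-through store over a single cache line; the one-line cache and the toy 64 KB memory array are assumptions of the sketch, not the structure of any real cache:

```c
#include <stdint.h>
#include <stdbool.h>

// Toy model: one 64-byte cache line in front of a 64 KB "next level".
static struct { uint64_t tag; bool valid; uint8_t data[64]; } line;
static uint8_t memory[1 << 16];

void wt_store(uint64_t addr, uint8_t byte) {
    if (line.valid && line.tag == addr / 64) {
        line.data[addr % 64] = byte;  // hit: update the cached copy
    }
    memory[addr] = byte;              // and ALWAYS the next level too
}
// Invariant: a cached byte always equals the corresponding memory byte.
```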
Write Buffers—Making Write-Through Practical:
The naive write-through implementation would stall the CPU on every write, waiting for memory acknowledgment. This is clearly unacceptable. The solution: write buffers.
A write buffer is a small FIFO queue (typically 4-16 entries) that holds pending writes:
- The CPU deposits the write into the buffer and continues executing immediately.
- The buffer drains to the next level in the background, at memory speed.
- The CPU stalls only if the buffer is full when it tries to write.
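A minimal sketch of such a FIFO, assuming an 8-entry ring buffer (the entry count, types, and function names are illustrative, not a hardware interface):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 8   // illustrative; real buffers are ~4-16 entries

typedef struct { uint64_t addr, data; } wb_entry_t;

typedef struct {
    wb_entry_t entries[WB_ENTRIES];
    int head, tail, count;     // zero-initialized ring buffer
} write_buffer_t;

// CPU side: returns false (CPU must stall) only if the buffer is full
bool wb_enqueue(write_buffer_t *wb, uint64_t addr, uint64_t data) {
    if (wb->count == WB_ENTRIES) return false;
    wb->entries[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;               // CPU continues without waiting for memory
}

// Memory side: drains the oldest pending write, FIFO order
bool wb_drain_one(write_buffer_t *wb, wb_entry_t *out) {
    if (wb->count == 0) return false;
    *out = wb->entries[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;               // caller performs the actual memory write
}
```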
Write Buffer Considerations:
- Buffer full: if writes arrive faster than memory can absorb them, the buffer fills and the CPU must stall after all.
- Read hazards: a load must check the write buffer for a pending write to the same address, or it could read stale data from memory.
- Merging: some buffers combine multiple writes to the same line into a single memory transaction.
Modern CPU caches rarely use pure write-through. However, write-through (with write buffers) is common in: L1 to L2 paths in some embedded processors, GPU caches with high bandwidth to memory, and write-through levels in hybrid configurations. SSDs and storage systems also use write-through semantics when data integrity matters more than performance.
In a write-back (also called copy-back or writeback) cache, writes update only the cache. Modified data is written to the next level only when the cache line is evicted. This dramatically reduces memory traffic.
Operation:
1. On a write hit, only the cache line is updated and its dirty bit is set.
2. Repeated writes to the same line stay entirely within the cache.
3. When a dirty line is evicted, its contents are written back to the next level; the dirty bit identifies which lines need this.
Key Characteristic: The cache may contain data that is 'newer' than memory. Memory may be stale. The dirty bit tracks which lines need writeback.
The Power of Write Absorption:
Consider a simple loop that writes to a counter:
```c
for (int i = 0; i < 1000000; i++) {
    counter++;  // Writes to same address 1 million times
}
```
With write-through: 1,000,000 memory writes
With write-back: 1 memory write (when the line is eventually evicted)
This is a 1,000,000x reduction in memory traffic! Real programs don't achieve this extreme ratio, but write-heavy code routinely sees 10-100x reductions.
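The extreme ratio is easy to reproduce with a toy model. The following self-contained sketch simulates a single 64-byte cache line under each policy and counts the writes that reach memory; the structure and the allocate-on-miss behavior are simplifications for illustration, not a model of any real cache:

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

// One 64-byte cache line plus a counter of writes that reach "memory".
typedef struct {
    uint64_t tag;
    bool valid, dirty;
    long mem_writes;
} line_t;

static void store(line_t *c, uint64_t addr, bool write_back) {
    uint64_t tag = addr / 64;
    if (!c->valid || c->tag != tag) {               // miss: allocate the line
        if (c->valid && c->dirty) c->mem_writes++;  // write back dirty victim
        c->tag = tag; c->valid = true; c->dirty = false;
    }
    if (write_back) c->dirty = true;                // absorb the write in cache
    else            c->mem_writes++;                // write through to memory
}

int main(void) {
    line_t wt = {0}, wb = {0};
    for (int i = 0; i < 1000000; i++) {             // the counter++ loop above
        store(&wt, 0x1000, false);                  // write-through policy
        store(&wb, 0x1000, true);                   // write-back policy
    }
    if (wb.valid && wb.dirty) wb.mem_writes++;      // final eviction/flush
    printf("write-through: %ld memory writes\n", wt.mem_writes);  // 1000000
    printf("write-back:    %ld memory writes\n", wb.mem_writes);  // 1
    return 0;
}
```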
Eviction Complexity:
Write-back complicates eviction:
- Clean line (dirty bit clear): simply discard it; memory already has the data.
- Dirty line (dirty bit set): its contents must first be written back to the next level before the line can be reused.
This asymmetry means a read miss that needs to evict a dirty line takes longer (miss + writeback) than one that evicts a clean line (miss only). High-performance caches may use victim buffers to handle writebacks in the background.
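To make the asymmetry concrete, here is a small cost model; the cycle counts are made-up round numbers for illustration, and real victim buffers add more nuance than a simple hide-it-all flag:

```c
#include <stdio.h>
#include <stdbool.h>

// Illustrative latencies only; not measurements of any real machine.
enum { MISS_FETCH = 100, WRITEBACK = 100 };

int miss_latency(bool victim_dirty, bool has_victim_buffer) {
    if (!victim_dirty)     return MISS_FETCH;       // clean: fetch only
    if (has_victim_buffer) return MISS_FETCH;       // writeback hidden in background
    return WRITEBACK + MISS_FETCH;                  // serialize writeback, then fetch
}

int main(void) {
    printf("clean victim:              %d cycles\n", miss_latency(false, false));
    printf("dirty victim, no buffer:   %d cycles\n", miss_latency(true, false));
    printf("dirty victim, with buffer: %d cycles\n", miss_latency(true, true));
    return 0;
}
```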
Write-back caches can lose data on power failure. Dirty cache lines that haven't been written back are lost. This is why databases, file systems, and storage controllers implement explicit cache flushing. The fsync() and msync() system calls, and the x86 CLFLUSH/CLWB instructions, force dirty data to stable storage.
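For example, a minimal sketch of durable file output on Linux using the POSIX calls named above (error handling kept deliberately minimal):

```c
#include <fcntl.h>
#include <unistd.h>

// Sketch: make a write durable with fsync(2). Without the fsync,
// the data may still sit in volatile OS/drive caches after write()
// returns, and be lost on power failure.
int durable_write(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }  // force to stable storage
    return close(fd);
}
```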
A separate but related question: what happens when we write to an address that is not currently in the cache? Two strategies exist:
| Policy | Also Called | On Write Miss | Typical Pairing |
|---|---|---|---|
| Write-Allocate | Fetch-on-Write | Load the cache line from memory, then perform the write | Write-Back caches |
| No-Write-Allocate | Write-Around, Write-No-Allocate | Write directly to memory without loading into cache | Write-Through caches |
Write-Allocate (Fetch-on-Write):
Rationale: We're likely to access this data again soon (temporal locality). By fetching the whole line, we can satisfy subsequent reads/writes from cache. Also, if we're only writing part of the line, we need the rest of the line's data.
No-Write-Allocate (Write-Around):
Rationale: If we're writing but not reading, why pollute the cache? The written data might not be needed soon. Common for streaming writes (e.g., video encoding output).
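A sketch contrasting the two strategies on a write miss; the line structure and the traffic-counting helpers are hypothetical stand-ins for the memory system, not a real cache interface:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    uint64_t tag;
    bool valid, dirty;
    uint8_t data[64];
} wline_t;

// Hypothetical memory-side helpers that just count traffic
static long mem_line_reads, mem_writes;
static void mem_read_line(uint64_t tag, uint8_t *dst) {
    (void)tag; mem_line_reads++;
    memset(dst, 0, 64);                 // stand-in for the fetched data
}
static void mem_write_bytes(size_t n) { (void)n; mem_writes++; }

// Write-allocate: fetch the whole line, then write into the cache
void write_allocate_miss(wline_t *l, uint64_t tag, size_t off,
                         const void *src, size_t n) {
    mem_read_line(tag, l->data);        // fetch-on-write
    l->tag = tag; l->valid = true;
    memcpy(&l->data[off], src, n);      // subsequent accesses now hit
    l->dirty = true;                    // writeback deferred to eviction
}

// No-write-allocate: send the write around the cache
void no_write_allocate_miss(wline_t *l, uint64_t tag, size_t off,
                            const void *src, size_t n) {
    (void)l; (void)tag; (void)off; (void)src;
    mem_write_bytes(n);                 // memory updated; cache untouched
}
```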
Modern CPUs offer 'non-temporal' store instructions (MOVNTI on x86) that bypass the cache entirely—a software-controlled no-write-allocate. These are ideal for streaming writes to large buffers that won't be read soon, preventing cache pollution and improving cache efficiency for other data.
Modern processor cache hierarchies don't uniformly apply one policy. Instead, they use hybrid configurations optimized for each cache level's characteristics:
Typical Modern Configuration:
| Cache Level | Write Policy | Write Allocation | Reason |
|---|---|---|---|
| L1 Data Cache | Write-Back | Write-Allocate | Maximize write absorption; L1 is small and fast, writes are frequent |
| L2 Cache | Write-Back | Write-Allocate | Continue absorbing writes; still private per-core |
| L3 Cache (LLC) | Write-Back | Write-Allocate | Absorb cross-core sharing writes; largest on-chip cache |
| Memory Controller | Write-Back buffered | N/A | Combine writes, optimize for DRAM burst patterns |
Per-Region Write Policies:
Memory Type Range Registers (MTRRs) and the Page Attribute Table (PAT) allow different write policies for different memory regions:
- Write-Back (WB): the default for ordinary RAM.
- Write-Through (WT): reads are cached; writes go straight through to memory.
- Write-Combining (WC): writes are buffered and merged into bursts; typical for framebuffers.
- Uncacheable (UC): no caching at all; required for device registers (MMIO).
Write-Combining (WC):
A special policy for graphics memory and other streaming writes:
- Stores are collected in write-combining buffers rather than cached.
- A full buffer is flushed to memory as a single wide burst.
- Ordering is weak: combined writes may reach memory in a different order than issued.
- Reads are not cached, so WC suits regions that are written heavily but rarely read back.
Example: Writing 4 bytes at a time to a GPU framebuffer at 1920×1080×4 (8 MB). With WC, writes combine into efficient 64-byte bursts. Without WC, each 4-byte write would be a separate memory transaction.
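As a quick sanity check on those numbers, a few lines of arithmetic (assuming exactly one transaction per 4-byte store without WC and one per 64-byte burst with it):

```c
#include <stdio.h>

int main(void) {
    long fb_bytes = 1920L * 1080 * 4;                     // 8,294,400 bytes
    printf("framebuffer bytes:   %ld\n", fb_bytes);
    printf("4-byte transactions: %ld\n", fb_bytes / 4);   // 2,073,600
    printf("64-byte WC bursts:   %ld\n", fb_bytes / 64);  // 129,600 (16x fewer)
    return 0;
}
```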
```c
// Setting memory types via Page Attribute Table (PAT) on x86
// Typically done by OS during memory mapping

// mmap with different caching attributes (Linux)
#include <sys/mman.h>
#include <fcntl.h>

int fd = open("/dev/mem", O_RDWR);

// Normal write-back cached mapping (default)
void *wb_region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, phys_addr);

// Write-combining for frame buffer (requires /dev/fb or similar)
// The driver sets WC attributes automatically for frame buffer mappings

// Using non-temporal stores for write-around behavior
#include <immintrin.h>

void stream_write(void *dest, void *src, size_t n) {
    // Copy using non-temporal stores (bypasses cache)
    __m256i *d = (__m256i *)dest;
    __m256i *s = (__m256i *)src;
    for (size_t i = 0; i < n / 32; i++) {
        __m256i data = _mm256_load_si256(s + i);
        _mm256_stream_si256(d + i, data);  // Non-temporal store
    }
    _mm_sfence();  // Ensure stores are visible
}
// This prevents cache pollution when writing large buffers
```

Write policies significantly impact cache coherence in multiprocessor systems. The coherence protocol must handle the case where one processor writes to a cache line that other processors may have cached.
Write-Through Coherence Simplicity:
With write-through, memory always has the current value. When processor A writes:
- The write updates A's cache and propagates immediately to memory.
- Other processors' copies of the line are invalidated (or updated) by the protocol.
- Any processor that misses afterward fetches the current value from memory.
The coherence protocol is simpler because memory is the 'source of truth.'
Write-Back Coherence Complexity:
With write-back, the cache may have data that memory doesn't:
- Processor A writes a line; only A's cache holds the new value, and memory is stale.
- If processor B fetched that line from memory, it would read the old value.
This requires more sophisticated protocols (like MESI, covered later) where:
- Each cache line carries a coherence state (e.g., Modified, Exclusive, Shared, Invalid).
- A processor must obtain exclusive ownership of a line before writing it.
- Requests for a modified line are satisfied by the owning cache rather than by stale memory.
The performance benefits of write-back (reduced memory traffic, lower latency writes) far outweigh the coherence complexity. Modern coherence protocols efficiently handle write-back caches. The complexity is in hardware—software sees a coherent memory abstraction. Only kernel developers and those writing memory barriers need to understand the details.
Writes in modern processors pass through several buffering stages before reaching the cache or memory. Understanding these mechanisms is crucial for understanding memory ordering.
Store Buffer (Store Queue):
The store buffer sits between the CPU pipeline and L1 cache:
- Retired stores are placed in the buffer rather than written to the cache immediately.
- Entries drain into L1 when the cache is ready to accept them (and, on a miss, once write permission is obtained).
- Later loads must check the buffer for a pending store to the same address (store-to-load forwarding), as sketched below.
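A sketch of that forwarding check, assuming a small circular buffer (the names, sizes, and the cache_read stub are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define SB_ENTRIES 8
typedef struct { uint64_t addr, data; bool valid; } sb_entry_t;
typedef struct { sb_entry_t e[SB_ENTRIES]; int next; } store_buffer_t;

// Stand-in for an L1 cache read (illustrative)
static uint64_t cache_read(uint64_t addr) { (void)addr; return 0; }

uint64_t load_with_forwarding(store_buffer_t *sb, uint64_t addr) {
    // Scan pending stores youngest-to-oldest; the newest matching
    // store must win, or the load would see stale data
    for (int i = 1; i <= SB_ENTRIES; i++) {
        int idx = (sb->next - i + SB_ENTRIES) % SB_ENTRIES;
        if (sb->e[idx].valid && sb->e[idx].addr == addr)
            return sb->e[idx].data;   // store-to-load forwarding
    }
    return cache_read(addr);          // no pending store: read the cache
}
```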
Why buffer stores?
- A store may miss in L1 or need write permission from the coherence protocol; buffering lets the pipeline keep executing instead of stalling.
- It decouples instruction retirement from cache timing and smooths bursts of stores.
Write Combining:
Multiple stores to the same cache line can be combined in the store buffer:
- Adjacent or overlapping stores merge into a single entry, so the line is written into L1 once rather than once per store, reducing cache port pressure.
Store Buffer and Memory Ordering:
The store buffer is the primary source of store-load reordering on modern CPUs:
```
Initially: x = 0, y = 0

Processor 1:       Processor 2:
store x = 1        store y = 1
load  r1 = y       load  r2 = x
```
Can we end up with r1 = 0 AND r2 = 0? Yes!
This is why memory barriers (fences) exist—they force store buffer draining before proceeding.
```c
#include <stdatomic.h>
#include <pthread.h>

// Global variables (volatile alone is NOT sufficient on modern CPUs!)
atomic_int x = 0;
atomic_int y = 0;
int r1, r2;

void* thread1(void* arg) {
    // Store x, then load y
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    // Without a fence, the load might see stale y
    // because our store to x is still in the store buffer
    atomic_thread_fence(memory_order_seq_cst);  // Full fence
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void* thread2(void* arg) {
    // Store y, then load x
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  // Full fence
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

// With the fences, we're guaranteed:
// NOT (r1 == 0 AND r2 == 0)
// At least one thread will see the other's store.

// Without fences, on x86 (Total Store Order):
// r1 == 0 AND r2 == 0 is possible due to store buffer effects!
```

Many programmers expect memory to behave sequentially—if I write X then read Y, my write 'happened' before my read. Store buffers break this intuition. Your write might still be in the store buffer while you read stale data from cache. Understanding this is essential for correct lock-free programming.
Sometimes software needs explicit control over cache behavior—flushing dirty data, invalidating stale data, or bypassing caching entirely. Modern ISAs provide specific instructions for these purposes.
| Instruction | Purpose | Use Case |
|---|---|---|
| CLFLUSH | Flush cache line to memory and invalidate | Persistent memory, security (clear sensitive data) |
| CLFLUSHOPT | Optimized flush (can be reordered) | Persistent memory writes, higher throughput |
| CLWB | Write-back line but keep in cache | Persistent memory with continued use |
| INVD | Invalidate all caches (no writeback!) | System initialization only—destroys data |
| WBINVD | Write-back and invalidate all caches | Power management, system state transitions |
| PREFETCH | Hint to load data into cache | Pre-load data before needed |
| MOVNTI | Non-temporal store (bypass cache) | Streaming writes, avoid cache pollution |
Persistent Memory (PMEM) and Cache Flushing:
With technologies like Intel Optane PMEM, cache flushing becomes critical for durability:
```c
#include <string.h>
#include <immintrin.h>

// Writing to persistent memory
void pmem_persist(void *addr, const void *data, size_t len) {
    // Write data normally (goes to cache)
    memcpy(addr, data, len);
    // Flush each 64-byte line from cache to PMEM
    for (char *p = addr; p < (char *)addr + len; p += 64) {
        _mm_clwb(p);  // Write back (line stays in cache for performance)
    }
    // Ensure flushes complete before proceeding
    _mm_sfence();
}
```
Without explicit flushing, data in write-back caches would be lost on power failure. Persistent memory programming requires careful use of these instructions.
Cache Flushing and Security:
Flush instructions have security implications:
- CLFLUSH is a key primitive in cache timing side channels such as Flush+Reload and in triggering Rowhammer bit flips, so some environments restrict or monitor its use.
- Used defensively, flushing pushes sensitive data (keys, passwords) out of the cache hierarchy after it has been cleared, as sketched below.
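For the defensive use, a sketch of scrubbing a secret and flushing it out of the cache hierarchy; note the caveat in the comments, and treat this as illustrative only:

```c
#include <string.h>
#include <immintrin.h>

// Illustrative only: overwrite a secret, then flush it out of the
// cache hierarchy. (A plain memset may be optimized away; real code
// would use memset_s/explicit_bzero or a volatile write loop.)
void scrub_and_flush(void *buf, size_t len) {
    memset(buf, 0, len);                         // overwrite the secret
    for (size_t off = 0; off < len; off += 64) {
        _mm_clflush((char *)buf + off);          // evict each line to memory
    }
    _mm_sfence();                                // order the flushes
}
```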
Most application code never needs cache control instructions—the hardware handles caching automatically. You need explicit control for: persistent memory programming, device driver I/O buffers, security-sensitive data clearing, and high-performance streaming computations. Premature cache 'optimization' usually hurts performance.
We've covered how writes propagate through the cache hierarchy. Let's consolidate:
- Write-through keeps memory current on every write; it is simple but traffic-heavy, and write buffers make it practical.
- Write-back defers memory updates until eviction via the dirty bit, absorbing repeated writes; it dominates modern CPU cache hierarchies.
- Write-allocate pairs naturally with write-back, while no-write-allocate (write-around) pairs with write-through and streaming stores.
- Real systems mix policies per cache level and per memory region (WB, WT, WC, UC).
- Store buffers hide write latency but reorder stores relative to loads; fences and explicit flush instructions restore ordering and durability where software requires them.
What's Next:
Now we understand how individual caches handle reads and writes. But modern systems have multiple processors, each with their own caches, potentially caching the same data. The next page addresses the Cache Coherence Problem—how the system ensures all processors see a consistent view of memory despite having private caches.
You understand cache write policies and their profound implications for performance, consistency, and system behavior. You can reason about write traffic, store buffers, and when explicit cache control is needed. Next, we'll see why multiple caches create the coherence problem and how systems solve it.