Reads are conceptually simple: check the cache, find the data, return it. But writes introduce a fundamental question: when should modified data be written to the next level of the memory hierarchy?
This seemingly simple question has profound implications for performance, power consumption, data consistency, memory traffic, and system reliability. Two primary strategies have evolved—write-through and write-back—each with distinct tradeoffs that influence when each is appropriate.
This page covers the two fundamental cache write policies, their mechanisms, tradeoffs, and implications. You'll understand write allocation strategies, write buffers, the relationship between write policies and cache coherence, and how modern systems combine these approaches across the cache hierarchy.
In a write-through cache, every write to the cache is simultaneously written to the next level of the memory hierarchy (whether that's L2, L3, or main memory). The cache and memory always contain the same data for any cached address.
Operation:
1. The CPU writes data to the cache line (updating it on a hit).
2. The same write is sent to the next level of the hierarchy at the same time.
3. The write completes once both the cache and the next level hold the new value.
Key Characteristic: Memory always has the current value. The cache never contains data 'newer' than memory.
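As a concrete picture of that invariant, here is a minimal, illustrative C model of a write-through store over a single cache line; the one-line cache and the toy 64 KB memory array are assumptions of the sketch, not the structure of any real cache:

```c
#include <stdint.h>
#include <stdbool.h>

// Toy model: one 64-byte cache line in front of a 64 KB "next level".
static struct { uint64_t tag; bool valid; uint8_t data[64]; } line;
static uint8_t memory[1 << 16];

void wt_store(uint64_t addr, uint8_t byte) {
    if (line.valid && line.tag == addr / 64) {
        line.data[addr % 64] = byte;  // hit: update the cached copy
    }
    memory[addr] = byte;              // and ALWAYS the next level too
}
// Invariant: a cached byte always equals the corresponding memory byte.
```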
Write Buffers—Making Write-Through Practical:
The naive write-through implementation would stall the CPU on every write, waiting for memory acknowledgment. This is clearly unacceptable. The solution: write buffers.
A write buffer is a small FIFO queue (typically 4-16 entries) that holds pending writes:
- The CPU deposits the write into the buffer and continues executing immediately.
- The buffer drains to the next level in the background, at memory speed.
- The CPU stalls only if the buffer is full when it tries to write.
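A minimal sketch of such a FIFO, assuming an 8-entry ring buffer (the entry count, types, and function names are illustrative, not a hardware interface):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 8   // illustrative; real buffers are ~4-16 entries

typedef struct { uint64_t addr, data; } wb_entry_t;

typedef struct {
    wb_entry_t entries[WB_ENTRIES];
    int head, tail, count;     // zero-initialized ring buffer
} write_buffer_t;

// CPU side: returns false (CPU must stall) only if the buffer is full
bool wb_enqueue(write_buffer_t *wb, uint64_t addr, uint64_t data) {
    if (wb->count == WB_ENTRIES) return false;
    wb->entries[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;               // CPU continues without waiting for memory
}

// Memory side: drains the oldest pending write, FIFO order
bool wb_drain_one(write_buffer_t *wb, wb_entry_t *out) {
    if (wb->count == 0) return false;
    *out = wb->entries[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;               // caller performs the actual memory write
}
```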
Write Buffer Considerations:
- Buffer full: if writes arrive faster than memory can absorb them, the buffer fills and the CPU must stall after all.
- Read hazards: a load must check the write buffer for a pending write to the same address, or it could read stale data from memory.
- Merging: some buffers combine multiple writes to the same line into a single memory transaction.
Modern CPU caches rarely use pure write-through. However, write-through (with write buffers) is common in: L1 to L2 paths in some embedded processors, GPU caches with high bandwidth to memory, and write-through levels in hybrid configurations. SSDs and storage systems also use write-through semantics when data integrity matters more than performance.
In a write-back (also called copy-back or writeback) cache, writes update only the cache. Modified data is written to the next level only when the cache line is evicted. This dramatically reduces memory traffic.
Operation:
1. On a write hit, only the cache line is updated and its dirty bit is set.
2. Repeated writes to the same line stay entirely within the cache.
3. When a dirty line is evicted, its contents are written back to the next level; the dirty bit identifies which lines need this.
Key Characteristic: The cache may contain data that is 'newer' than memory. Memory may be stale. The dirty bit tracks which lines need writeback.
The Power of Write Absorption:
Consider a simple loop that writes to a counter:
```c
for (int i = 0; i < 1000000; i++) {
    counter++;  // Writes to same address 1 million times
}
```
With write-through: 1,000,000 memory writes
With write-back: 1 memory write (when the line is eventually evicted)
This is a 1,000,000x reduction in memory traffic! Real programs don't achieve this extreme ratio, but write-heavy code routinely sees 10-100x reductions.
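The extreme ratio is easy to reproduce with a toy model. The following self-contained sketch simulates a single 64-byte cache line under each policy and counts the writes that reach memory; the structure and the allocate-on-miss behavior are simplifications for illustration, not a model of any real cache:

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

// One 64-byte cache line plus a counter of writes that reach "memory".
typedef struct {
    uint64_t tag;
    bool valid, dirty;
    long mem_writes;
} line_t;

static void store(line_t *c, uint64_t addr, bool write_back) {
    uint64_t tag = addr / 64;
    if (!c->valid || c->tag != tag) {               // miss: allocate the line
        if (c->valid && c->dirty) c->mem_writes++;  // write back dirty victim
        c->tag = tag; c->valid = true; c->dirty = false;
    }
    if (write_back) c->dirty = true;                // absorb the write in cache
    else            c->mem_writes++;                // write through to memory
}

int main(void) {
    line_t wt = {0}, wb = {0};
    for (int i = 0; i < 1000000; i++) {             // the counter++ loop above
        store(&wt, 0x1000, false);                  // write-through policy
        store(&wb, 0x1000, true);                   // write-back policy
    }
    if (wb.valid && wb.dirty) wb.mem_writes++;      // final eviction/flush
    printf("write-through: %ld memory writes\n", wt.mem_writes);  // 1000000
    printf("write-back:    %ld memory writes\n", wb.mem_writes);  // 1
    return 0;
}
```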
Eviction Complexity:
Write-back complicates eviction:
- Clean line (dirty bit clear): simply discard it; memory already has the data.
- Dirty line (dirty bit set): its contents must first be written back to the next level before the line can be reused.
This asymmetry means a read miss that needs to evict a dirty line takes longer (miss + writeback) than one that evicts a clean line (miss only). High-performance caches may use victim buffers to handle writebacks in the background.
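To make the asymmetry concrete, here is a small cost model; the cycle counts are made-up round numbers for illustration, and real victim buffers add more nuance than a simple hide-it-all flag:

```c
#include <stdio.h>
#include <stdbool.h>

// Illustrative latencies only; not measurements of any real machine.
enum { MISS_FETCH = 100, WRITEBACK = 100 };

int miss_latency(bool victim_dirty, bool has_victim_buffer) {
    if (!victim_dirty)     return MISS_FETCH;       // clean: fetch only
    if (has_victim_buffer) return MISS_FETCH;       // writeback hidden in background
    return WRITEBACK + MISS_FETCH;                  // serialize writeback, then fetch
}

int main(void) {
    printf("clean victim:              %d cycles\n", miss_latency(false, false));
    printf("dirty victim, no buffer:   %d cycles\n", miss_latency(true, false));
    printf("dirty victim, with buffer: %d cycles\n", miss_latency(true, true));
    return 0;
}
```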
Write-back caches can lose data on power failure. Dirty cache lines that haven't been written back are lost. This is why databases, file systems, and storage controllers implement explicit cache flushing. The fsync() and msync() system calls, and the x86 CLFLUSH/CLWB instructions, force dirty data to stable storage.
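For example, a minimal sketch of durable file output on Linux using the POSIX calls named above (error handling kept deliberately minimal):

```c
#include <fcntl.h>
#include <unistd.h>

// Sketch: make a write durable with fsync(2). Without the fsync,
// the data may still sit in volatile OS/drive caches after write()
// returns, and be lost on power failure.
int durable_write(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }  // force to stable storage
    return close(fd);
}
```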
A separate but related question: what happens when we write to an address that is not currently in the cache? Two strategies exist:
| Policy | Also Called | On Write Miss | Typical Pairing |
|---|---|---|---|
| Write-Allocate | Fetch-on-Write | Load the cache line from memory, then perform the write | Write-Back caches |
| No-Write-Allocate | Write-Around, Write-No-Allocate | Write directly to memory without loading into cache | Write-Through caches |
Write-Allocate (Fetch-on-Write):
Rationale: We're likely to access this data again soon (temporal locality). By fetching the whole line, we can satisfy subsequent reads/writes from cache. Also, if we're only writing part of the line, we need the rest of the line's data.
No-Write-Allocate (Write-Around):
Rationale: If we're writing but not reading, why pollute the cache? The written data might not be needed soon. Common for streaming writes (e.g., video encoding output).
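A sketch contrasting the two strategies on a write miss; the line structure and the traffic-counting helpers are hypothetical stand-ins for the memory system, not a real cache interface:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    uint64_t tag;
    bool valid, dirty;
    uint8_t data[64];
} wline_t;

// Hypothetical memory-side helpers that just count traffic
static long mem_line_reads, mem_writes;
static void mem_read_line(uint64_t tag, uint8_t *dst) {
    (void)tag; mem_line_reads++;
    memset(dst, 0, 64);                 // stand-in for the fetched data
}
static void mem_write_bytes(size_t n) { (void)n; mem_writes++; }

// Write-allocate: fetch the whole line, then write into the cache
void write_allocate_miss(wline_t *l, uint64_t tag, size_t off,
                         const void *src, size_t n) {
    mem_read_line(tag, l->data);        // fetch-on-write
    l->tag = tag; l->valid = true;
    memcpy(&l->data[off], src, n);      // subsequent accesses now hit
    l->dirty = true;                    // writeback deferred to eviction
}

// No-write-allocate: send the write around the cache
void no_write_allocate_miss(wline_t *l, uint64_t tag, size_t off,
                            const void *src, size_t n) {
    (void)l; (void)tag; (void)off; (void)src;
    mem_write_bytes(n);                 // memory updated; cache untouched
}
```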
Modern CPUs offer 'non-temporal' store instructions (MOVNTI on x86) that bypass the cache entirely—a software-controlled no-write-allocate. These are ideal for streaming writes to large buffers that won't be read soon, preventing cache pollution and improving cache efficiency for other data.
Modern processor cache hierarchies don't uniformly apply one policy. Instead, they use hybrid configurations optimized for each cache level's characteristics:
Typical Modern Configuration:
| Cache Level | Write Policy | Write Allocation | Reason |
|---|---|---|---|
| L1 Data Cache | Write-Back | Write-Allocate | Maximize write absorption; L1 is small and fast, writes are frequent |
| L2 Cache | Write-Back | Write-Allocate | Continue absorbing writes; still private per-core |
| L3 Cache (LLC) | Write-Back | Write-Allocate | Absorb cross-core sharing writes; largest on-chip cache |
| Memory Controller | Write-Back buffered | N/A | Combine writes, optimize for DRAM burst patterns |
Per-Region Write Policies:
Memory Type Range Registers (MTRRs) and the Page Attribute Table (PAT) allow different write policies for different memory regions:
- Write-Back (WB): the default for ordinary RAM.
- Write-Through (WT): reads are cached; writes go straight through to memory.
- Write-Combining (WC): writes are buffered and merged into bursts; typical for framebuffers.
- Uncacheable (UC): no caching at all; required for device registers (MMIO).
Write-Combining (WC):
A special policy for graphics memory and other streaming writes:
- Stores are collected in write-combining buffers rather than cached.
- A full buffer is flushed to memory as a single wide burst.
- Ordering is weak: combined writes may reach memory in a different order than issued.
- Reads are not cached, so WC suits regions that are written heavily but rarely read back.
Example: Writing 4 bytes at a time to a GPU framebuffer at 1920×1080×4 (8 MB). With WC, writes combine into efficient 64-byte bursts. Without WC, each 4-byte write would be a separate memory transaction.
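As a quick sanity check on those numbers, a few lines of arithmetic (assuming exactly one transaction per 4-byte store without WC and one per 64-byte burst with it):

```c
#include <stdio.h>

int main(void) {
    long fb_bytes = 1920L * 1080 * 4;                     // 8,294,400 bytes
    printf("framebuffer bytes:   %ld\n", fb_bytes);
    printf("4-byte transactions: %ld\n", fb_bytes / 4);   // 2,073,600
    printf("64-byte WC bursts:   %ld\n", fb_bytes / 64);  // 129,600 (16x fewer)
    return 0;
}
```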
```c
// Setting memory types via Page Attribute Table (PAT) on x86
// Typically done by OS during memory mapping

// mmap with different caching attributes (Linux)
#include <sys/mman.h>
#include <fcntl.h>

int fd = open("/dev/mem", O_RDWR);

// Normal write-back cached mapping (default)
void *wb_region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, phys_addr);

// Write-combining for frame buffer (requires /dev/fb or similar)
// The driver sets WC attributes automatically for frame buffer mappings

// Using non-temporal stores for write-around behavior
#include <immintrin.h>

void stream_write(void *dest, void *src, size_t n) {
    // Copy using non-temporal stores (bypasses cache)
    __m256i *d = (__m256i *)dest;
    __m256i *s = (__m256i *)src;
    for (size_t i = 0; i < n / 32; i++) {
        __m256i data = _mm256_load_si256(s + i);
        _mm256_stream_si256(d + i, data);  // Non-temporal store
    }
    _mm_sfence();  // Ensure stores are visible
}
// This prevents cache pollution when writing large buffers
```

Write policies significantly impact cache coherence in multiprocessor systems. The coherence protocol must handle the case where one processor writes to a cache line that other processors may have cached.
Write-Through Coherence Simplicity:
With write-through, memory always has the current value. When processor A writes:
- The write updates A's cache and propagates immediately to memory.
- Other processors' copies of the line are invalidated (or updated) by the protocol.
- Any processor that misses afterward fetches the current value from memory.
The coherence protocol is simpler because memory is the 'source of truth.'
Write-Back Coherence Complexity:
With write-back, the cache may have data that memory doesn't:
- Processor A writes a line; only A's cache holds the new value, and memory is stale.
- If processor B fetched that line from memory, it would read the old value.
This requires more sophisticated protocols (like MESI, covered later) where:
- Each cache line carries a coherence state (e.g., Modified, Exclusive, Shared, Invalid).
- A processor must obtain exclusive ownership of a line before writing it.
- Requests for a modified line are satisfied by the owning cache rather than by stale memory.
The performance benefits of write-back (reduced memory traffic, lower latency writes) far outweigh the coherence complexity. Modern coherence protocols efficiently handle write-back caches. The complexity is in hardware—software sees a coherent memory abstraction. Only kernel developers and those writing memory barriers need to understand the details.
Writes in modern processors pass through several buffering stages before reaching the cache or memory. Understanding these mechanisms is crucial for understanding memory ordering.
Store Buffer (Store Queue):
The store buffer sits between the CPU pipeline and L1 cache:
- Retired stores are placed in the buffer rather than written to the cache immediately.
- Entries drain into L1 when the cache is ready to accept them (and, on a miss, once write permission is obtained).
- Later loads must check the buffer for a pending store to the same address (store-to-load forwarding), as sketched below.
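A sketch of that forwarding check, assuming a small circular buffer (the names, sizes, and the cache_read stub are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define SB_ENTRIES 8
typedef struct { uint64_t addr, data; bool valid; } sb_entry_t;
typedef struct { sb_entry_t e[SB_ENTRIES]; int next; } store_buffer_t;

// Stand-in for an L1 cache read (illustrative)
static uint64_t cache_read(uint64_t addr) { (void)addr; return 0; }

uint64_t load_with_forwarding(store_buffer_t *sb, uint64_t addr) {
    // Scan pending stores youngest-to-oldest; the newest matching
    // store must win, or the load would see stale data
    for (int i = 1; i <= SB_ENTRIES; i++) {
        int idx = (sb->next - i + SB_ENTRIES) % SB_ENTRIES;
        if (sb->e[idx].valid && sb->e[idx].addr == addr)
            return sb->e[idx].data;   // store-to-load forwarding
    }
    return cache_read(addr);          // no pending store: read the cache
}
```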
Why buffer stores?
- A store may miss in L1 or need write permission from the coherence protocol; buffering lets the pipeline keep executing instead of stalling.
- It decouples instruction retirement from cache timing and smooths bursts of stores.
Write Combining:
Multiple stores to the same cache line can be combined in the store buffer:
- Adjacent or overlapping stores merge into a single entry, so the line is written into L1 once rather than once per store, reducing cache port pressure.
Store Buffer and Memory Ordering:
The store buffer is the primary source of store-load reordering on modern CPUs:
```
Initially: x = 0, y = 0

Processor 1:       Processor 2:
store x = 1        store y = 1
load  r1 = y       load  r2 = x
```
Can we end up with r1 = 0 AND r2 = 0? Yes!
This is why memory barriers (fences) exist—they force store buffer draining before proceeding.
```c
#include <stdatomic.h>
#include <pthread.h>

// Global variables (volatile alone is NOT sufficient on modern CPUs!)
atomic_int x = 0;
atomic_int y = 0;
int r1, r2;

void* thread1(void* arg) {
    // Store x, then load y
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    // Without a fence, the load might see stale y
    // because our store to x is still in the store buffer
    atomic_thread_fence(memory_order_seq_cst);  // Full fence
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void* thread2(void* arg) {
    // Store y, then load x
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  // Full fence
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

// With the fences, we're guaranteed:
// NOT (r1 == 0 AND r2 == 0)
// At least one thread will see the other's store.

// Without fences, on x86 (Total Store Order):
// r1 == 0 AND r2 == 0 is possible due to store buffer effects!
```

Many programmers expect memory to behave sequentially—if I write X then read Y, my write 'happened' before my read. Store buffers break this intuition. Your write might still be in the store buffer while you read stale data from cache. Understanding this is essential for correct lock-free programming.
Sometimes software needs explicit control over cache behavior—flushing dirty data, invalidating stale data, or bypassing caching entirely. Modern ISAs provide specific instructions for these purposes.
| Instruction | Purpose | Use Case |
|---|---|---|
| CLFLUSH | Flush cache line to memory and invalidate | Persistent memory, security (clear sensitive data) |
| CLFLUSHOPT | Optimized flush (can be reordered) | Persistent memory writes, higher throughput |
| CLWB | Write-back line but keep in cache | Persistent memory with continued use |
| INVD | Invalidate all caches (no writeback!) | System initialization only—destroys data |
| WBINVD | Write-back and invalidate all caches | Power management, system state transitions |
| PREFETCH | Hint to load data into cache | Pre-load data before needed |
| MOVNTI | Non-temporal store (bypass cache) | Streaming writes, avoid cache pollution |
Persistent Memory (PMEM) and Cache Flushing:
With technologies like Intel Optane PMEM, cache flushing becomes critical for durability:
```c
#include <string.h>
#include <immintrin.h>

// Writing to persistent memory
void pmem_persist(void *addr, const void *data, size_t len) {
    // Write data normally (goes to cache)
    memcpy(addr, data, len);
    // Flush each 64-byte line from cache to PMEM
    for (char *p = addr; p < (char *)addr + len; p += 64) {
        _mm_clwb(p);  // Write back (line stays in cache for performance)
    }
    // Ensure flushes complete before proceeding
    _mm_sfence();
}
```
Without explicit flushing, data in write-back caches would be lost on power failure. Persistent memory programming requires careful use of these instructions.
Cache Flushing and Security:
Flush instructions have security implications:
- CLFLUSH is a key primitive in cache timing side channels such as Flush+Reload and in triggering Rowhammer bit flips, so some environments restrict or monitor its use.
- Used defensively, flushing pushes sensitive data (keys, passwords) out of the cache hierarchy after it has been cleared, as sketched below.
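For the defensive use, a sketch of scrubbing a secret and flushing it out of the cache hierarchy; note the caveat in the comments, and treat this as illustrative only:

```c
#include <string.h>
#include <immintrin.h>

// Illustrative only: overwrite a secret, then flush it out of the
// cache hierarchy. (A plain memset may be optimized away; real code
// would use memset_s/explicit_bzero or a volatile write loop.)
void scrub_and_flush(void *buf, size_t len) {
    memset(buf, 0, len);                         // overwrite the secret
    for (size_t off = 0; off < len; off += 64) {
        _mm_clflush((char *)buf + off);          // evict each line to memory
    }
    _mm_sfence();                                // order the flushes
}
```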
Most application code never needs cache control instructions—the hardware handles caching automatically. You need explicit control for: persistent memory programming, device driver I/O buffers, security-sensitive data clearing, and high-performance streaming computations. Premature cache 'optimization' usually hurts performance.
We've covered how writes propagate through the cache hierarchy. Let's consolidate:
- Write-through keeps memory current on every write; it is simple but traffic-heavy, and write buffers make it practical.
- Write-back defers memory updates until eviction via the dirty bit, absorbing repeated writes; it dominates modern CPU cache hierarchies.
- Write-allocate pairs naturally with write-back, while no-write-allocate (write-around) pairs with write-through and streaming stores.
- Real systems mix policies per cache level and per memory region (WB, WT, WC, UC).
- Store buffers hide write latency but reorder stores relative to loads; fences and explicit flush instructions restore ordering and durability where software requires them.
What's Next:
Now we understand how individual caches handle reads and writes. But modern systems have multiple processors, each with their own caches, potentially caching the same data. The next page addresses the Cache Coherence Problem—how the system ensures all processors see a consistent view of memory despite having private caches.
You understand cache write policies and their profound implications for performance, consistency, and system behavior. You can reason about write traffic, store buffers, and when explicit cache control is needed. Next, we'll see why multiple caches create the coherence problem and how systems solve it.