The fundamental challenge of computer architecture is the processor-memory gap—CPUs can process data orders of magnitude faster than main memory can supply it. In the 1980s, processor speeds and memory speeds were roughly balanced. Since then, CPU speeds have improved by roughly 60% per year while DRAM speeds have improved by only 7% per year.
The result? A modern CPU can execute instructions in cycles measured in fractions of nanoseconds, while accessing main memory takes 50-100 nanoseconds—a ratio of 100:1 or more. Without intervention, the CPU would spend most of its time waiting for data.
Cache memory is the ingenious solution to this problem. Caches are small, fast memories positioned between the processor and main memory, holding copies of the most frequently accessed data. By exploiting patterns in how programs access memory (locality), caches dramatically reduce effective memory access time.
By the end of this page, you will understand: how caches work at a fundamental level; the organization of L1, L2, and L3 caches; cache policies for replacement and writes; cache coherence protocols in multicore systems; and how to analyze and optimize for cache performance.
Caches work because program behavior is remarkably predictable. The principle of locality states that programs tend to access data and instructions in patterns that favor recently-used items and nearby locations. This principle has two fundamental forms:
Temporal locality (locality in time):
If a memory location is accessed now, it is likely to be accessed again in the near future. Examples: a loop counter or accumulator updated on every iteration, the instructions of a loop body executing over and over, and data touched by a frequently called function.
Spatial locality (locality in space):
If a memory location is accessed now, nearby locations are likely to be accessed soon. Examples: traversing an array element by element, reading several fields of the same struct, and fetching instructions in straight-line code.
Locality isn't accidental—it's a natural consequence of how programs are structured. Loops create temporal locality. Data structures create spatial locality. The abstraction boundaries we use to organize code (functions, modules, classes) tend to group related data together. Well-designed caches exploit these natural patterns.
Quantifying locality:
The working set of a program is the set of memory locations accessed within a given time interval. If a cache can hold the working set, almost all accesses will be cache hits. The art of cache design—and performance optimization—involves sizing caches to capture working sets while managing the tradeoffs of larger, slower caches versus smaller, faster ones.
Sequential access patterns:
Consider iterating through an array of 1 million integers:
```c
int arr[1000000];
int sum = 0;

for (int i = 0; i < 1000000; i++) {
    sum += arr[i];  // Pure sequential access
}
```
This exhibits nearly perfect spatial locality. A cache line (typically 64 bytes) holds 16 integers. When arr[0] is accessed, the entire cache line is loaded. Accesses to arr[1] through arr[15] are all cache hits. The effective miss rate is 1/16 = 6.25%.
Random access patterns:
Now consider random accesses:
```c
for (int i = 0; i < 1000000; i++) {
    sum += arr[random() % 1000000];  // Random access
}
```
Each access is equally likely to hit any cache line. If the array exceeds cache capacity, almost every access is a cache miss. This can be 10-100× slower than sequential access.
| Access Pattern | Spatial Locality | Temporal Locality | Typical Hit Rate | Performance Impact |
|---|---|---|---|---|
| Sequential array traversal | Excellent | Low | ~94-97% | Near-optimal |
| Repeated array traversal | Excellent | Excellent | ~99%+ | Optimal |
| Strided access (stride=cache line) | None | Low | <10% | Very poor |
| Random access (large array) | None | None | <5% | Worst case |
| Linked list traversal (scattered) | None | Low | Variable | Often poor |
| Matrix multiply (naive) | Mixed | Low | ~50-80% | Improvable |
| Matrix multiply (blocked) | Excellent | Excellent | ~95%+ | Near-optimal |
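The last two rows of the table are worth making concrete. The sketch below is a generic C illustration (not any particular library's implementation) contrasting a naive triple loop with a blocked version; the matrix size N and tile size B are assumed values, with B chosen so that a few B×B tiles fit comfortably in cache, and the blocked version expects C to be zero-initialized.

```c
#include <stddef.h>

#define N 512   /* matrix dimension, assumed for illustration          */
#define B 64    /* tile size, tuned so a few B x B tiles fit in cache  */

/* Naive order: the inner loop walks Bm down a column, touching a new
   cache line on almost every iteration once N is large. */
void matmul_naive(const double *A, const double *Bm, double *C) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i * N + k] * Bm[k * N + j];
            C[i * N + j] = sum;
        }
}

/* Blocked (tiled) order: each B x B tile of A, Bm, and C is reused many
   times while it is still cache-resident. C must be zero-initialized. */
void matmul_blocked(const double *A, const double *Bm, double *C) {
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t jj = 0; jj < N; jj += B)
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t k = kk; k < kk + B; k++) {
                        double a = A[i * N + k];
                        for (size_t j = jj; j < jj + B; j++)
                            C[i * N + j] += a * Bm[k * N + j];
                    }
}
```

The blocked ordering reuses each tile many times while it is still cache-resident, which is where the ~95%+ hit rate in the table comes from.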
Understanding cache organization is fundamental to understanding cache behavior. A cache is not simply a small memory—it's a highly structured system with specific mechanisms for storing, locating, and replacing data.
Cache lines (blocks):
The fundamental unit of data in a cache is the cache line (also called a cache block). Rather than loading individual bytes, caches load entire lines at once—typically 64 bytes on modern systems. This exploits spatial locality: when you access one byte, nearby bytes come along for free.
A cache line contains: the data block itself (e.g., 64 bytes), a tag identifying which memory block it holds, a valid bit, and status bits such as a dirty bit (for write-back caches) and coherence state.
Address decomposition:
When the CPU accesses a memory address, that address is decomposed into three parts:
Block offset (b bits): Identifies the specific byte within the cache line. For 64-byte lines, b = 6 (2⁶ = 64)
Index (i bits): Identifies which cache set this address maps to. The number of sets S = 2ⁱ
Tag (t bits): Identifies which specific memory block is stored in the cache line. t = address_bits - i - b
For a 64-bit address space with 64-byte lines and 64 sets: the offset uses 6 bits (2⁶ = 64), the index uses 6 bits (2⁶ = 64 sets), and the tag uses the remaining 52 bits.
Cache lookup process:
1. Decompose the address into tag, index, and offset bits.
2. Use the index to select one cache set.
3. Compare the tag against the tags of every valid line in that set (all ways are checked in parallel).
4. On a match, it's a hit: the offset selects the requested bytes within the line.
5. On no match, it's a miss: the line is fetched from the next level, and a victim line in the set is chosen for eviction.
Higher associativity reduces conflict misses but increases access latency (more tags to compare) and power consumption (more lines active simultaneously). Modern L1 caches typically use 4-8 way associativity. L3 caches, being larger and slower anyway, often use 12-16+ way associativity to maximize hit rates.
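Putting these parameters together, the number of sets follows directly from capacity, line size, and associativity: sets = cache size / (line size × ways). The quick sketch below (cache configurations assumed for illustration) computes the set count and index bits, and the conceptual model that follows then shows how an address is split against that geometry.

```c
/* Cache geometry arithmetic: sets = cache_size / (line_size * ways).
   The example configurations are assumed, roughly matching the caches
   discussed in this section. */
#include <stdio.h>

static void geometry(const char *name, int size_kb, int line, int ways) {
    int sets = size_kb * 1024 / (line * ways);
    int index_bits = 0;
    for (int s = sets; s > 1; s >>= 1) index_bits++;   /* log2(sets) */
    printf("%-6s %4d KB, %2d-way: %4d sets, %d index bits\n",
           name, size_kb, ways, sets, index_bits);
}

int main(void) {
    geometry("L1-D", 48, 64, 12);   /*   64 sets,  6 index bits */
    geometry("L2", 2048, 64, 16);   /* 2048 sets, 11 index bits */
    return 0;
}
```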
```python
# Conceptual model of cache address decomposition
class CacheAddressDecomposition:
    def __init__(self, line_size_bytes=64, num_sets=64, address_bits=64):
        self.offset_bits = (line_size_bytes - 1).bit_length()  # e.g., 6 for 64-byte lines
        self.index_bits = (num_sets - 1).bit_length()          # e.g., 6 for 64 sets
        self.tag_bits = address_bits - self.offset_bits - self.index_bits

    def decompose(self, address):
        offset = address & ((1 << self.offset_bits) - 1)
        index = (address >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = address >> (self.offset_bits + self.index_bits)
        return {"tag": tag, "index": index, "offset": offset}

# Example: L1 cache with 64-byte lines and 64 sets (4 KB direct-mapped: 4096 / 64 = 64)
cache = CacheAddressDecomposition(line_size_bytes=64, num_sets=64)
result = cache.decompose(0x7fff8c2a4b80)
# tag: 0x7fff8c2a4, index: 46, offset: 0
```

The L1 cache (Level 1 cache) is the smallest, fastest cache, and the one closest to the CPU execution units. On modern processors, L1 cache is so tightly integrated that it's essentially part of the processor pipeline. Every CPU core has its own private L1 cache.
Key characteristics of L1 cache:
- Size: typically 32-64 KB per core for data (L1-D) and a similar amount for instructions (L1-I); some designs, such as Apple's performance cores, go considerably larger
- Split into separate instruction (L1-I) and data (L1-D) caches
- Latency: roughly 3-5 cycles
- Associativity: typically 4-12 way
- Private to each core
Why split L1 into instruction and data?
The split L1 design (Harvard architecture at L1) provides several advantages: the core can fetch instructions and load/store data in the same cycle without contending for ports; each cache can be tuned to its access pattern (instruction fetches are sequential and read-only, data accesses are read/write and less predictable); and a burst of data traffic cannot evict the hot instruction stream, or vice versa.
L1 cache optimizations:
L1 caches employ aggressive optimizations to keep access latency to a few cycles: virtually-indexed, physically-tagged (VIPT) organization so the set lookup can begin in parallel with address translation; way prediction to avoid reading every way on each access; banking and multiple ports to sustain several loads and stores per cycle; and fully pipelined access so a new request can start every cycle.
| Processor | L1-D per Core | L1-I per Core | Associativity | Latency (cycles) |
|---|---|---|---|---|
| Intel Core i9-13900K | 48 KB | 32 KB | 12-way (D), 8-way (I) | 5 |
| AMD Ryzen 9 7950X | 32 KB | 32 KB | 8-way | 4 |
| Apple M2 (P-core) | 128 KB | 192 KB | 8-way | 3 |
| ARM Cortex-X3 | 64 KB | 64 KB | 4-way | 4 |
| Graviton3 | 64 KB | 64 KB | 4-way | 3-4 |
An L1 cache miss doesn't go directly to main memory—it goes to L2 cache. The L1 miss penalty is the L2 access time: typically 10-15 cycles. While this seems small, remember that in tight loops the CPU can be limited by L1 bandwidth. Optimizing for L1 hits is critical in performance-sensitive code.
The L2 cache (Level 2 cache) serves as the middle tier of the cache hierarchy. It's larger than L1 but smaller than L3, striking a balance between speed and capacity. L2 cache catches misses from L1, preventing them from incurring the much higher latency of L3 or main memory access.
Key characteristics of L2 cache:
- Size: typically 256 KB to 2 MB per core
- Usually private to each core on current designs (shared in some older or low-power parts)
- Latency: roughly 10-15 cycles
- Unified: holds both instructions and data
- Typically 8-16 way set associative
Inclusion policies:
The relationship between cache levels is defined by their inclusion policy:
Inclusive L2: Every line in L1 is also present in L2. This simplifies coherence (checking L2 is sufficient to know a core's cache state) but wastes capacity (data is duplicated). Intel historically used inclusive caches.
Exclusive L2: L1 and L2 contain disjoint data. When a line is promoted from L2 to L1, it's evicted from L2. This maximizes effective capacity but complicates coherence. AMD Zen uses this approach.
Non-inclusive L2: Neither strictly inclusive nor exclusive. A line might be in L1 only, L2 only, or both. Modern Intel processors have moved toward non-inclusive designs.
L2 as a victim cache (exclusive):
In exclusive configurations, L2 often acts as a victim cache—when a line is evicted from L1, it moves to L2 rather than being discarded. This captures temporal locality: recently-evicted lines that are accessed again can be found in L2 rather than fetched from L3 or memory.
For many applications, L2 is the critical cache level. If your application's hot working set fits in L2 (~256KB-1MB per core), you'll rarely incur L3 or memory latencies. Profile your applications to understand their working set sizes and optimize data layouts accordingly.
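One practical way to estimate a working set is a pointer-chasing sweep: time random, dependent loads over buffers of increasing size and watch the latency jump as each cache level is exceeded. The sketch below is a minimal, assumption-laden version (POSIX clock_gettime, buffer sizes and access counts chosen arbitrarily); only the trend in the output is meaningful.

```c
/* Working-set sweep sketch: chase pointers through buffers of increasing
   size and report nanoseconds per access. Latency steps up as the buffer
   outgrows L1, then L2, then L3. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_access(size_t bytes) {
    size_t n = bytes / sizeof(size_t);
    size_t *next = malloc(n * sizeof(size_t));
    if (!next) return 0.0;

    /* Sattolo's algorithm: a random single-cycle permutation, so following
       next[idx] visits every element in a prefetcher-unfriendly order. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    const size_t accesses = 10 * 1000 * 1000;
    size_t idx = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < accesses; i++) idx = next[idx];  /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = idx;   /* keep the chase from being optimized away */
    (void)sink;
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / accesses;
}

int main(void) {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KB working set: %6.2f ns/access\n", kb, ns_per_access(kb * 1024));
    return 0;
}
```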
L2 bandwidth and ports:
L2 cache typically has lower bandwidth than L1 (measured in bytes per cycle). While L1 might sustain two loads per cycle (with wide vector loads, 64-128 bytes in total), L2 might deliver data to L1 at only 32-64 bytes per cycle. This asymmetry enforces the hierarchy: L1 must absorb most accesses to avoid saturating L2 bandwidth.
Hardware prefetching at L2:
Many processors implement hardware prefetchers at L2 that detect access patterns (sequential, strided) and speculatively fetch data before it's requested. Effective prefetching can hide memory latency by having data arrive in L2 just before the CPU needs it. However, incorrect prefetching wastes bandwidth and can evict useful data.
The L3 cache (Level 3 cache), often called the Last Level Cache (LLC), represents the final cache barrier before main memory. It's significantly larger than L1 and L2, but also slower. Its primary role is to reduce the number of expensive main memory accesses and to facilitate data sharing between cores.
Key characteristics of L3 cache:
- Size: tens of megabytes (roughly 16-128 MB on current desktop and server parts)
- Shared among all cores, or among a cluster of cores (e.g., a CCD)
- Latency: roughly 30-50 cycles
- Unified, and typically highly associative (12-16+ ways)
- Often serves as the coherence point for the cores that share it
L3 as the coherence hub:
In many processor designs, L3 serves as the central point for cache coherence. When one core modifies data, the coherence protocol checks L3's directory to determine which other cores might have copies. This makes L3 access latency variable—a simple hit is faster than a hit that requires coherence actions.
Sliced L3 architecture:
Modern processors with many cores often organize L3 into slices, with each slice associated with a subset of cores. The address bits determine which slice holds a given cache line. This distributes traffic and provides better bandwidth scaling but can introduce Non-Uniform Cache Access (NUCA) effects where some slices are faster to reach than others.
AMD's revolutionary 3D V-Cache:
AMD's 3D V-Cache technology stacks additional L3 cache vertically on top of the processor die, dramatically increasing L3 capacity (e.g., 96 MB on the stacked CCD of the Ryzen 9 7950X3D, 128 MB in total). This has proven particularly effective for gaming and other workloads with large working sets, demonstrating the continued importance of cache capacity.
| Processor | L3 Size | Cores Sharing | Latency (cycles) | Architecture |
|---|---|---|---|---|
| Intel Core i9-13900K | 36 MB | 24 cores (8P+16E) | ~42 | Ring bus, sliced |
| AMD Ryzen 9 7950X | 64 MB | 16 cores (2 CCDs) | ~46 | 2 × 32 MB, one slice per CCD |
| AMD Ryzen 9 7950X3D | 128 MB | 16 cores | ~50 | 3D V-Cache stacking |
| Apple M2 Ultra | 192 MB | 24 cores | Unknown | Unified memory architecture |
| Intel Xeon Scalable | 52.5 MB | 36+ cores | ~60+ | Mesh interconnect |
L3 latency (30-50 cycles, ~15 ns) versus main memory latency (200-300 cycles, ~70-100 ns) represents a 4-7× difference. This makes L3 hits dramatically faster than main memory accesses. For server workloads, L3 hit rates often determine overall system throughput.
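These latencies combine through the average memory access time (AMAT): each level contributes its hit time plus its miss rate times the cost of going one level further down. The snippet below works the arithmetic for one assumed set of per-level hit rates and latencies, roughly matching the figures quoted in this section.

```c
/* AMAT sketch. Latencies (cycles) and per-level hit rates are assumed,
   illustrative values; hit rates are local (hits among accesses that
   reach that level). */
#include <stdio.h>

int main(void) {
    double l1_hit = 4, l2_hit = 14, l3_hit = 42, mem = 250;   /* cycles */
    double l1_hr = 0.95, l2_hr = 0.80, l3_hr = 0.70;          /* local hit rates */

    /* AMAT = L1_hit + L1_miss * (L2_hit + L2_miss * (L3_hit + L3_miss * mem)) */
    double amat = l1_hit
        + (1 - l1_hr) * (l2_hit + (1 - l2_hr) * (l3_hit + (1 - l3_hr) * mem));

    /* 4 + 0.05 * (14 + 0.2 * (42 + 0.3 * 250)) = 5.87 cycles */
    printf("AMAT ~ %.1f cycles\n", amat);
    return 0;
}
```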
When a cache is full and a new line must be loaded, the cache must decide which existing line to evict. This decision is governed by the replacement policy. The goal is to evict the line least likely to be used again, preserving the lines with the highest future utility.
The optimal (Bélády's) policy:
In theory, the best replacement policy is to evict the line that will be used furthest in the future. Known as Bélády's MIN algorithm, this is optimal but impossible to implement—we can't know future accesses. Practical policies approximate this ideal.
Common replacement policies:
- Least Recently Used (LRU): evict the line that has gone unused the longest. A good proxy for temporal locality, but tracking exact recency gets expensive at high associativity.
- Pseudo-LRU (PLRU): a tree-based approximation of LRU that needs only a few bits per set; widely used in real hardware.
- Random and FIFO: cheap to implement and surprisingly competitive at high associativity.
- Re-Reference Interval Prediction (RRIP): predicts how soon each line is likely to be re-referenced and evicts the line predicted to be needed furthest in the future; variants are used in modern last-level caches.
The scan resistance problem:
LRU has a critical weakness: a single scan through a large array can evict all useful data. In the worst case, a program that repeatedly cycles through N+1 unique cache lines, where the cache holds only N, gets a 0% hit rate under LRU: every access evicts exactly the line that will be needed next.
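A toy model makes this concrete. The sketch below simulates a fully associative cache of 4 lines with true LRU (both numbers assumed purely for illustration) and repeatedly loops over 5 distinct lines; every access misses, so the hit rate is exactly 0%.

```c
/* Toy fully-associative cache with true LRU, small enough to show the
   scan problem: looping over N+1 distinct lines with an N-line cache. */
#include <stdio.h>

#define WAYS 4

static long tags[WAYS];
static int age[WAYS];                 /* higher = older */

static int access_line(long tag) {
    int hit = -1, victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (tags[i] == tag) hit = i;
        if (age[i] > age[victim]) victim = i;   /* track the LRU way */
    }
    int slot = (hit >= 0) ? hit : victim;
    for (int i = 0; i < WAYS; i++) age[i]++;    /* every line ages */
    tags[slot] = tag;
    age[slot] = 0;                              /* most recently used */
    return hit >= 0;
}

int main(void) {
    for (int i = 0; i < WAYS; i++) tags[i] = -1;   /* start empty */
    int hits = 0, total = 0;
    for (int pass = 0; pass < 100; pass++)
        for (long tag = 0; tag < WAYS + 1; tag++) {   /* 5 lines, 4-line cache */
            hits += access_line(tag);
            total++;
        }
    printf("hit rate: %.0f%%\n", 100.0 * hits / total);   /* prints 0% */
    return 0;
}
```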
Modern policies like RRIP address this by treating newly-loaded lines as less valuable until they prove their worth through re-reference.
Adaptive policies:
State-of-the-art caches often use adaptive policies that observe behavior and switch strategies. Hardware monitors hit rates under different policies (using sampling) and dynamically selects the best performer. This is particularly valuable in shared L3 caches where different applications may have different access patterns.
While you can't control hardware replacement policies, you can structure data access to work well with them. Sequential access patterns, reusing data before moving on, and minimizing working set size all improve cache behavior regardless of the specific replacement policy in use.
When the CPU writes data, the cache must decide how to handle that write. The write policy determines whether and when data is propagated to lower cache levels and main memory. This decision has significant implications for performance, coherence, and consistency.
Two fundamental choices:
- Write-through: every store updates both the cache and the next level immediately. Lower levels always hold current data, but every write consumes downstream bandwidth (usually smoothed by a write buffer).
- Write-back: stores update only the cache line and mark it dirty; the modified line is written to the next level only when it is evicted or flushed. Repeated writes to the same line are coalesced into a single write-back, at the cost of lower levels being temporarily stale.
Write miss policies:
What happens when a write targets an address not currently in the cache?
Write-allocate: Fetch the target cache line from memory, then perform the write. This is the common choice for write-back caches—future writes and reads to neighboring bytes will hit in cache.
No-write-allocate: Write directly to memory without loading the line into cache. Common with write-through caches—since data goes to memory anyway, why waste cache space?
Combining policies:
Modern systems typically use: write-back with write-allocate throughout the cache hierarchy (L1, L2, and L3), because it minimizes traffic to lower levels and exploits the locality of write streams.
Write-through is rare in general-purpose processors but common in specialized systems where coherence simplicity outweighs performance costs.
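To see why write-back saves so much traffic, consider a small model: a single-line write-back cache with a dirty bit, compared against write-through, processing a short assumed trace of stores. The trace and the 64-byte line size are illustrative only.

```c
/* Minimal single-line write-back cache model with a dirty bit, compared
   against write-through, for an assumed store trace. */
#include <stdio.h>

int main(void) {
    /* trace of store addresses; most fall in the same 64-byte line */
    long trace[] = {0x100, 0x108, 0x110, 0x100, 0x2000, 0x100};
    int n = sizeof trace / sizeof trace[0];

    long cached_line = -1;
    int dirty = 0;
    int wb_mem_writes = 0, wt_mem_writes = 0;

    for (int i = 0; i < n; i++) {
        long line = trace[i] / 64;
        wt_mem_writes++;                 /* write-through: every store goes down */
        if (line != cached_line) {       /* write-back: miss on a new line */
            if (dirty) wb_mem_writes++;  /* evicting a dirty line costs one writeback */
            cached_line = line;          /* write-allocate: bring the line in */
        }
        dirty = 1;                       /* the store marks the line dirty */
    }
    if (dirty) wb_mem_writes++;          /* final eviction/flush */

    printf("write-through: %d memory writes\n", wt_mem_writes);  /* 6 */
    printf("write-back:    %d memory writes\n", wb_mem_writes);  /* 3 */
    return 0;
}
```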
Write-back creates a consistency challenge: memory doesn't reflect the latest data until the dirty line is evicted or explicitly flushed. For I/O devices that read memory directly (DMA), or for power loss scenarios, this can cause data loss. Operating systems and device drivers must explicitly flush caches when coherence with external entities is required.
Not every memory access can be a cache hit. Understanding why misses occur is essential for optimization. The classic Three Cs model categorizes cache misses into three types:
1. Compulsory misses (cold misses):
The first access to any memory location always misses; the data isn't in the cache yet because it has never been fetched. Compulsory misses would occur even with an infinitely large cache. Larger cache lines reduce their number (one fetch brings in more neighboring data), and prefetching can hide their latency by loading data before it's requested.
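Software prefetching is one way to do that explicitly. The sketch below uses the GCC/Clang builtin __builtin_prefetch while summing an array; the prefetch distance (8 cache lines ahead) and the locality hint are assumed tuning parameters, and for a simple sequential stream like this the hardware prefetcher would usually do the job on its own.

```c
#include <stddef.h>

/* Sum an array while prefetching ahead. __builtin_prefetch(addr, rw, locality)
   is a GCC/Clang builtin; the distance of 8 lines ahead is an assumed knob. */
long sum_with_prefetch(const int *arr, size_t n) {
    const size_t ahead = 8 * 16;   /* 8 cache lines * 16 ints per 64-byte line */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&arr[i + ahead], 0 /* read */, 3 /* keep in cache */);
        sum += arr[i];
    }
    return sum;
}
```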
2. Capacity misses:
Occur when the working set exceeds cache size. Even with perfect replacement, if you access more unique data than the cache can hold, evictions are inevitable. The evicted lines will miss when re-accessed.
Reducing capacity misses: process data in cache-sized blocks (tiling), shrink the working set with more compact data types and layouts, and make a single streaming pass over large data rather than repeated full traversals.
3. Conflict misses (collision misses):
Occur when multiple addresses map to the same cache set, causing evictions even when the overall cache has free space. This is a limitation of direct-mapped and set-associative caches. Fully associative caches have no conflict misses.
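A classic way to trigger conflict misses is walking a power-of-two-sized matrix down a column. With the geometry assumed below (64-byte lines, 64 sets per way), every row is 4096 bytes long, so successive column elements map to the same set; padding each row by one cache line spreads them across sets. The dimensions and padding are illustrative.

```c
/* Conflict-miss sketch: with 64-byte lines and 64 sets per way, addresses
   4096 bytes apart map to the same set, so column-walking a 1024 x 1024
   int matrix hammers one set. Padding each row by one line fixes it. */
#define ROWS 1024
#define COLS 1024
#define PAD  16                          /* 16 ints = one 64-byte line of padding */

static int matrix[ROWS][COLS];           /* rows 4096 B apart: same set every row */
static int padded[ROWS][COLS + PAD];     /* rows 4160 B apart: sets rotate by one */

long sum_column(int col) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        sum += matrix[r][col];           /* each access contends for one set */
    return sum;
}

long sum_column_padded(int col) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        sum += padded[r][col];           /* padding shifts each row to a new set */
    return sum;
}
```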
Extending the model:
A fourth C is sometimes added:
4. Coherence misses:
In multicore systems, misses caused by cache coherence protocols. When one core modifies data, copies in other cores are invalidated. Subsequent accesses to that data from other cores miss despite the data being "recently used" from their perspective.
Diagnosing miss types:
Profiling tools (Intel VTune, AMD uProf, perf) can measure cache miss rates and, with analysis, attribute misses to categories:
| Miss Type | Cause | Mitigation Strategies | Tools |
|---|---|---|---|
| Compulsory | First access to data | Prefetching, batching first accesses | Hardware prefetchers, __builtin_prefetch |
| Capacity | Working set > cache size | Blocking/tiling, reduce footprint, compression | perf stat, VTune |
| Conflict | Address collisions in sets | Array padding, data alignment, struct reorganization | Cache simulation, hardware counters |
| Coherence | Multi-core invalidation | Reduce false sharing, partitioning, careful sync | perf c2c, VTune memory analysis |
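False sharing, listed in the table above, deserves a concrete picture. The C11 sketch below shows two per-thread counters that share one 64-byte line versus a padded layout that gives each counter its own line; the threading code is omitted because the data layout is the whole point, and the 64-byte line size is an assumption about the target machine.

```c
/* False-sharing sketch (C11): two per-thread counters in the same 64-byte
   line ping-pong between cores as each thread's writes invalidate the
   other core's copy; aligning each counter to its own line avoids this. */
#include <stdalign.h>

struct counters_shared {      /* both fields live in one cache line */
    long thread0_count;
    long thread1_count;
};

struct counters_padded {      /* each field occupies a full 64-byte line */
    alignas(64) long thread0_count;
    alignas(64) long thread1_count;
};

/* sizeof(struct counters_shared) == 16, sizeof(struct counters_padded) == 128 */
```

In real code this layout is paired with one counter per thread, with results combined only at the end.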
We've explored the cache hierarchy in depth—the critical bridge between the CPU's execution capability and main memory's vast storage. Understanding cache behavior is essential for writing high-performance code and designing efficient operating systems.
What's next:
With registers and caches covered, we'll now explore main memory (RAM)—the primary system memory that stores the bulk of program data and code. We'll examine DRAM technology, memory organization, and how operating systems manage physical memory through techniques like paging and virtual memory.
You now have a comprehensive understanding of CPU cache memory—from fundamental organization through advanced topics like replacement policies and coherence. This knowledge is essential for understanding system performance, writing cache-friendly code, and designing OS memory management systems.