Modern processors can execute billions of instructions per second. Yet, accessing main memory takes hundreds of processor cycles. This gap—known as the memory wall—is one of the most critical performance bottlenecks in computing. Without clever engineering solutions, your multi-gigahertz processor would spend most of its time waiting for data.
The solution? Cache memory—small, fast memories strategically placed between the processor and main memory, organized into a sophisticated cache hierarchy that dramatically reduces average memory access times. Understanding this hierarchy isn't just academic; it's essential for writing high-performance code and understanding why some programs run 10x faster than functionally equivalent alternatives.
By the end of this page, you will understand the complete architecture of modern cache hierarchies—from L1 through L3 and beyond. You'll learn why caches are organized this way, the fundamental tradeoffs at each level, and how cache hierarchy design profoundly impacts both hardware architecture and software performance.
To appreciate cache hierarchies, we must first understand the problem they solve. The history of computing reveals a widening chasm between processor and memory speeds:
Historical Divergence:
In the early 1980s, processor and memory speeds were roughly matched—both operated at a few megahertz. A processor could access memory in approximately the same time it took to execute an instruction. Life was simple.
But then the paths diverged dramatically. Processor performance improved by roughly 50-60% per year from the mid-1980s into the early 2000s, driven by advances in transistor density and microarchitecture, and continued to outpace memory even after clock scaling slowed. DRAM latency improved by only 7-10% per year, constrained by the physics of DRAM technology.
| Year | Typical Processor Speed | DRAM Access Time | DRAM Latency (CPU cycles) |
|---|---|---|---|
| 1980 | 10 MHz | 150 ns | ~1.5 cycles |
| 1990 | 40 MHz | 100 ns | ~4 cycles |
| 2000 | 1,000 MHz (1 GHz) | 70 ns | ~70 cycles |
| 2010 | 3,000 MHz (3 GHz) | 50 ns | ~150 cycles |
| 2020 | 4,500 MHz (4.5 GHz) | 40 ns | ~180 cycles |
| Present | 5,000+ MHz (5+ GHz) | 35 ns | ~175-200 cycles |
At 5 GHz, a single memory access taking 200 cycles means the processor sits idle for 199 cycles waiting for data. Without caches, even the fastest processor becomes a slow memory-access machine. Cache hierarchies transform this idle time into productive computation.
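The cycle counts in the table's last column are just the DRAM latency converted into processor cycles: cycles ≈ latency in nanoseconds × clock rate in GHz, since 1 GHz corresponds to one cycle per nanosecond. A quick sketch (illustrative only, reproducing the table's numbers):

```c
// Illustrative only: reproduces the table's last column.
// Cycles per DRAM access ~= (DRAM latency in ns) x (clock rate in GHz).
#include <stdio.h>

int main(void) {
    const char *years[] = {"1980", "1990", "2000", "2010", "2020", "Present"};
    double clock_ghz[]  = {0.010, 0.040, 1.0, 3.0, 4.5, 5.0};
    double dram_ns[]    = {150.0, 100.0, 70.0, 50.0, 40.0, 35.0};

    for (int i = 0; i < 6; i++) {
        double cycles = dram_ns[i] * clock_ghz[i];   // ns * (cycles per ns)
        printf("%-8s  %6.1f cycles per DRAM access\n", years[i], cycles);
    }
    return 0;
}
```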
Why DRAM Can't Keep Up:
DRAM (Dynamic Random-Access Memory) stores each bit as a charge in a tiny capacitor paired with a single access transistor. This design offers excellent density and a low cost per bit, but reads are comparatively slow and every cell must be refreshed continuously as its charge leaks away.
Making DRAM faster requires shrinking capacitors, which reduces stored charge and increases error rates. Physics imposes fundamental limits on DRAM speed improvements.
SRAM: The Alternative
Static RAM (SRAM) stores bits using flip-flop circuits: six transistors per bit instead of one transistor and one capacitor. SRAM offers much faster access and needs no refresh, but it takes several times the silicon area per bit and costs far more per gigabyte.
We can't build gigabytes of SRAM economically—but we can strategically place small amounts where they matter most.
Caches work because of a fundamental property of programs: locality of reference. Programs don't access memory randomly; they exhibit predictable patterns that caches exploit.
Two Dimensions of Locality:
Temporal locality: a memory location accessed recently is likely to be accessed again soon (loop counters, accumulators, frequently called functions).
Spatial locality: locations near a recently accessed address are likely to be accessed soon (sequential array elements, adjacent fields of a structure).
Quantifying Locality:
Empirical studies consistently show that typical programs spend roughly 90% of their execution time in about 10% of their code, with a similarly small fraction of data accounting for most memory references.
This concentration of access makes caching viable. If every memory access were random, caches would be useless—we'd never find the data we need in the cache.
Locality in Different Program Patterns:
```c
// HIGH TEMPORAL LOCALITY: Same variable accessed repeatedly
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i];  // 'sum' accessed N times - excellent temporal locality
}

// HIGH SPATIAL LOCALITY: Sequential array access
for (int i = 0; i < N; i++) {
    process(array[i]);  // Elements accessed in order - perfect spatial locality
}

// POOR SPATIAL LOCALITY: Strided access
for (int i = 0; i < N; i += 64) {
    process(array[i]);  // Skipping 64 elements - poor spatial locality
}

// POOR TEMPORAL & SPATIAL LOCALITY: Random access
for (int i = 0; i < N; i++) {
    int random_index = rand() % ARRAY_SIZE;
    process(array[random_index]);  // Random access - cache-hostile
}

// MATRIX ACCESS: Row-major vs Column-major
// In C, arrays are stored row-major (row elements are contiguous)
int matrix[1000][1000];

// GOOD: Row-major traversal (matches memory layout)
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        process(matrix[i][j]);  // Sequential in memory

// BAD: Column-major traversal (jumps across rows)
for (int j = 0; j < 1000; j++)
    for (int i = 0; i < 1000; i++)
        process(matrix[i][j]);  // Jumps 1000 elements each access
```

The row-major vs column-major traversal difference can cause 10-50x performance differences for large matrices. The 'wrong' order thrashes the cache, turning a memory-bound problem into a cache-miss nightmare. Understanding cache hierarchy makes this obvious; without it, the performance difference seems mysterious.
Modern processors don't use a single cache—they employ a hierarchy of caches, each level trading off speed, size, and proximity to the processor. This design emerges from the insight that a single cache can't simultaneously be fast, large, and practical.
The Classic Three-Level Hierarchy:
| Cache Level | Size (per core) | Access Latency | Technology | Sharing |
|---|---|---|---|---|
| L1 Instruction | 32-64 KB | ~1-4 cycles (1-2 ns) | SRAM | Private per core |
| L1 Data | 32-64 KB | ~1-4 cycles (1-2 ns) | SRAM | Private per core |
| L2 Unified | 256 KB - 1 MB | ~10-15 cycles (5-7 ns) | SRAM | Private per core |
| L3 Unified | 8-96 MB (total) | ~30-50 cycles (15-25 ns) | SRAM | Shared across cores |
| Main Memory | 16-128 GB | ~150-200 cycles (50-100 ns) | DRAM | Shared system-wide |
Why Multiple Levels?
The existence of multiple cache levels represents a series of engineering tradeoffs:
L1 Cache (First Level): split into separate instruction and data caches, each only tens of kilobytes so they can respond in 1-4 cycles at core clock speed; private to each core and tightly integrated with the pipeline.
L2 Cache (Second Level): a unified cache holding both instructions and data, several times larger than L1 (hundreds of KB to about 1 MB) at roughly 10-15 cycles; still private to each core, it catches most L1 misses.
L3 Cache (Third Level / Last-Level Cache): tens of megabytes shared across all cores, the last stop before main memory; at roughly 30-50 cycles it is still several times faster than DRAM, and its shared nature makes it central to inter-core data sharing.
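One way to see these levels from software is a pointer-chasing microbenchmark: walk a randomly shuffled cycle through arrays of increasing size, so every load depends on the previous one and prefetching cannot hide the latency. The sketch below is illustrative rather than a tuned benchmark; it assumes a POSIX system (clock_gettime) and a build along the lines of gcc -O2, and the exact step points depend on your processor's cache sizes.

```c
// Illustrative pointer-chasing sketch: as the working set outgrows L1, L2,
// and L3, the measured time per access steps up toward DRAM latency.
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_access(const size_t *next, size_t steps) {
    struct timespec t0, t1;
    volatile size_t idx = 0;               // volatile keeps the loop alive under -O2
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];                   // serialized, dependent loads
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}

int main(void) {
    srand(42);
    // Working sets from 16 KB (comfortably inside L1) to 64 MB (spills to DRAM).
    for (size_t bytes = 16 * 1024; bytes <= 64 * 1024 * 1024; bytes *= 4) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;

        // Sattolo's algorithm: build one random cycle so the walk visits
        // every slot, defeating simple hardware prefetch patterns.
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        printf("%8zu KB: %6.2f ns/access\n", bytes / 1024,
               ns_per_access(next, 1u << 24));
        free(next);
    }
    return 0;
}
```

On a typical desktop part you would expect the nanoseconds per access to plateau near the L1, L2, L3, and DRAM latencies as the working set crosses each capacity boundary.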
An important design decision in multi-level caches is the inclusion policy: the relationship between what data exists at different cache levels. Inclusive means every line held in an inner cache (L1/L2) is also present in the outer cache; exclusive means a line lives in at most one level; NINE (non-inclusive, non-exclusive) enforces neither guarantee.
| Aspect | Inclusive | Exclusive | NINE |
|---|---|---|---|
| Effective Capacity | Lower (duplication) | Maximum (no duplication) | Moderate |
| Coherence Simplicity | Simple (check higher level) | Complex (check all levels) | Moderate |
| Back-Invalidation | Yes (L2 eviction invalidates L1) | No | No |
| Implementation Complexity | Moderate | Higher | Lower |
| Used In | Intel (older), some ARM | AMD, some Intel L2-L3 | Intel (newer L3) |
Intel's Evolution:
Intel traditionally used inclusive L3 caches because it simplified coherence—to check if any core has a cache line, you only need to check the L3. If it's not there, no core has it.
However, inclusive caches suffer from back-invalidation: when L3 evicts a line, it must also invalidate that line in any L1/L2 cache that holds it. With growing core counts and relatively smaller L3 per core, this became problematic.
Starting with the Skylake-SP server parts, Intel moved to non-inclusive (NINE) L3 caches, accepting the coherence complexity in exchange for better effective capacity.
AMD's Approach:
AMD has used exclusive or NINE caches for longer, particularly between L2 and L3. This maximizes cache capacity at the cost of more complex coherence tracking.
Operating systems don't typically need to know inclusion policies—hardware handles this transparently. However, OS developers and performance engineers benefit from understanding these tradeoffs when analyzing cache behavior and designing cache-aware data structures.
The three-level hierarchy isn't universal. Some processors add L4 caches or use hybrid technologies:
L4 Cache (eDRAM):
Intel introduced L4 caches in some Haswell and Broadwell processors (models with Iris Pro graphics) using embedded DRAM (eDRAM): a 128 MB cache on a separate die in the same package, slower than the on-die SRAM caches but far faster than main memory, shared between the GPU and the CPU cores.
3D V-Cache (AMD):
AMD's 3D V-Cache technology stacks additional SRAM cache directly on top of the existing L3 die: in its first generation, 64 MB of stacked SRAM joins the 32 MB of on-die L3 for 96 MB total, with only a small latency penalty.
AMD's 3D V-Cache demonstrates how dramatic cache capacity effects can be. The Ryzen 7 5800X3D, with its 96 MB L3, outperforms the higher-clocked 5800X in many games by 10-15%. When working sets fit in cache, even modest capacity increases eliminate expensive memory accesses.
Emerging Technologies:
Non-Volatile Memory (NVM) as Cache: Intel's Optane (3D XPoint) technology briefly served as a large, persistent cache tier between DRAM and SSD. While discontinued, the concept of tiered memory with different performance characteristics continues to evolve.
Compute Express Link (CXL): CXL enables memory expansion and pooling, creating new cache hierarchy considerations. Future systems may treat local DRAM as one more tier sitting in front of larger, slower CXL-attached memory pools, each with its own latency and caching behavior.
Smart Caches in Accelerators: GPUs and AI accelerators have their own cache hierarchies optimized for massively parallel workloads. These often feature software-managed scratchpads alongside hardware caches, very large register files, and L2 caches shared across many compute units, trading per-thread latency for aggregate bandwidth.
Understanding cache hierarchy performance requires quantifying the impact of cache hits and misses at each level.
Average Memory Access Time (AMAT):
The fundamental equation for cache performance is:
AMAT = T_hit + (Miss_Rate × Miss_Penalty)
For a multi-level hierarchy, this extends recursively:
AMAT = T_L1 + (MR_L1 × [T_L2 + (MR_L2 × [T_L3 + (MR_L3 × T_Memory)])])
Where:
T_Lx = hit time at level x
MR_Lx = miss rate at level x (fraction of accesses to that level that miss)
T_Memory = main memory access time
Global vs Local Miss Rates:
Two ways to express cache miss rates:
Local Miss Rate: Fraction of accesses to this cache level that miss
Global Miss Rate: Fraction of total memory references that miss a given level
Relationship: the global L2 miss rate = L1 miss rate × L2 local miss rate (and likewise for L3 and beyond)
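As a concrete, hypothetical example, the snippet below plugs the latencies from the hierarchy table into the recursive AMAT formula, using assumed local miss rates of 5%, 20%, and 30% for L1, L2, and L3 (example values, not measurements of any particular processor):

```c
// Hypothetical AMAT calculation. Latencies come from the hierarchy table
// above; the local miss rates are assumed example values.
#include <stdio.h>

int main(void) {
    double t_l1 = 4.0, t_l2 = 12.0, t_l3 = 40.0, t_mem = 200.0;  // cycles
    double mr_l1 = 0.05, mr_l2 = 0.20, mr_l3 = 0.30;             // local miss rates

    // AMAT = T_L1 + MR_L1 * (T_L2 + MR_L2 * (T_L3 + MR_L3 * T_Memory))
    double amat = t_l1 + mr_l1 * (t_l2 + mr_l2 * (t_l3 + mr_l3 * t_mem));

    // Global miss rates: fraction of ALL memory references that miss a level.
    double global_l2_miss = mr_l1 * mr_l2;           // 1% of all accesses
    double global_l3_miss = mr_l1 * mr_l2 * mr_l3;   // 0.3% of all accesses

    printf("AMAT                = %.2f cycles\n", amat);            // ~5.60
    printf("Global L2 miss rate = %.4f\n", global_l2_miss);
    printf("Global L3 miss rate = %.4f\n", global_l3_miss);
    return 0;
}
```

With those assumed rates, AMAT works out to about 5.6 cycles: even though a DRAM access costs around 200 cycles, only 0.3% of references ever get that far, so the average stays close to the L1 hit time.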
| Miss Type | Definition | How to Reduce |
|---|---|---|
| Compulsory (Cold) | First access to a block; unavoidable cache miss | Increase cache line size (prefetch more data); use software prefetching |
| Capacity | Cache too small to hold all needed blocks | Increase cache size; improve data structure locality; blocking/tiling algorithms |
| Conflict | Multiple blocks map to same cache set (associativity limit) | Increase associativity; use better placement/replacement policies |
| Coherence | Invalidation due to another processor's write (multiprocessor) | Reduce sharing; use synchronization carefully; false sharing avoidance |
Compulsory, Capacity, and Conflict misses are the classic '3 Cs' of cache misses. In multiprocessor systems, Coherence misses become the '4th C.' Performance optimization requires identifying which miss type dominates and addressing it specifically—bigger caches won't help compulsory misses, and higher associativity won't help capacity misses.
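Coherence misses can appear even when each thread touches only its own data. The sketch below is an illustration, assuming 64-byte cache lines, POSIX threads, and a multi-core machine (build with something like gcc -O2 -pthread): two threads increment independent counters, and when the counters sit in the same cache line every write invalidates that line in the other core's cache. Padding them onto separate lines typically makes the same work several times faster.

```c
// Illustrative false-sharing demo (assumes 64-byte cache lines).
#define _POSIX_C_SOURCE 199309L
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

struct pair_shared {                 // a and b land in the same 64-byte line
    volatile unsigned long a, b;
};

struct pair_padded {                 // padding pushes b onto a different line
    volatile unsigned long a;
    char pad[64];
    volatile unsigned long b;
};

static struct pair_shared g_shared;
static struct pair_padded g_padded;

static double now_s(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *bump_shared_b(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++) g_shared.b++;
    return NULL;
}

static void *bump_padded_b(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++) g_padded.b++;
    return NULL;
}

int main(void) {
    pthread_t t;
    double t0;

    t0 = now_s();
    pthread_create(&t, NULL, bump_shared_b, NULL);
    for (unsigned long i = 0; i < ITERS; i++) g_shared.a++;   // fights over one line
    pthread_join(t, NULL);
    printf("same cache line:      %.2f s\n", now_s() - t0);

    t0 = now_s();
    pthread_create(&t, NULL, bump_padded_b, NULL);
    for (unsigned long i = 0; i < ITERS; i++) g_padded.a++;   // independent lines
    pthread_join(t, NULL);
    printf("separate cache lines: %.2f s\n", now_s() - t0);
    return 0;
}
```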
Software developers can dramatically improve performance by understanding cache hierarchy. Here are practical techniques:
```c
// NAIVE MATRIX MULTIPLY - Poor cache utilization
void matmul_naive(double *A, double *B, double *C, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];  // B[k][j] has poor spatial locality
            }
            C[i*N + j] = sum;
        }
    }
}
// Problem: For large N, B column access jumps N elements per iteration,
// causing cache misses every access.

// LOOP BLOCKED (TILED) MATRIX MULTIPLY - Much better cache utilization
#define BLOCK_SIZE 64  // Chosen to fit in L1 cache

void matmul_blocked(double *A, double *B, double *C, int N) {
    for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
        for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
            for (int kk = 0; kk < N; kk += BLOCK_SIZE) {
                // Process one block
                for (int i = ii; i < ii + BLOCK_SIZE && i < N; i++) {
                    for (int j = jj; j < jj + BLOCK_SIZE && j < N; j++) {
                        double sum = C[i*N + j];
                        for (int k = kk; k < kk + BLOCK_SIZE && k < N; k++) {
                            sum += A[i*N + k] * B[k*N + j];
                        }
                        C[i*N + j] = sum;
                    }
                }
            }
        }
    }
}
// Benefit: Each block of B fits in cache and is reused many times
// before eviction. Can achieve 5-10x speedup for large matrices!

// PREFETCH HINT EXAMPLE
void process_array_with_prefetch(double *data, int N) {
    for (int i = 0; i < N; i++) {
        // Prefetch data 8 iterations ahead
        __builtin_prefetch(&data[i + 8], 0, 3);  // GCC builtin
        process(data[i]);
    }
}
```

Loop blocking is used extensively in high-performance linear algebra libraries (BLAS, LAPACK) and can achieve 80-90% of theoretical peak performance. Understanding cache hierarchy transforms naive code achieving 5% of peak into optimized code approaching hardware limits.
We've covered the foundational concepts of cache hierarchy. Let's consolidate the key takeaways:
The memory wall: processors outpaced DRAM, so an uncached memory access costs on the order of 200 cycles.
Locality: temporal and spatial locality are what make small caches effective.
The hierarchy: small, fast, private L1 and L2 caches sit under a large, shared L3, with inclusion policies governing how the levels relate.
Performance: AMAT quantifies the hierarchy's benefit, and the 3 Cs (plus coherence misses) classify why misses occur.
Software matters: blocking, prefetching, and layout-aware traversal can deliver order-of-magnitude speedups.
What's Next:
Now that we understand the cache hierarchy structure, the next page dives into Cache Lines—the fundamental unit of data transfer between cache levels. We'll explore how cache lines are addressed, organized, and why their size profoundly affects both hardware design and software performance.
You now understand modern multi-level cache hierarchies and their critical role in bridging the processor-memory gap. Next, we'll examine cache lines—the building blocks that make cache operation possible.