The fundamental challenge of computer architecture is the processor-memory gap—CPUs can process data orders of magnitude faster than main memory can supply it. In the 1980s, processor speeds and memory speeds were roughly balanced. Since then, CPU speeds have improved by roughly 60% per year while DRAM speeds have improved by only 7% per year.
The result? A modern CPU can execute instructions in cycles measured in fractions of nanoseconds, while accessing main memory takes 50-100 nanoseconds—a ratio of 100:1 or more. Without intervention, the CPU would spend most of its time waiting for data.
Cache memory is the ingenious solution to this problem. Caches are small, fast memories positioned between the processor and main memory, holding copies of the most frequently accessed data. By exploiting patterns in how programs access memory (locality), caches dramatically reduce effective memory access time.
By the end of this page, you will understand: how caches work at a fundamental level; the organization of L1, L2, and L3 caches; cache policies for replacement and writes; cache coherence protocols in multicore systems; and how to analyze and optimize for cache performance.
Caches work because program behavior is remarkably predictable. The principle of locality states that programs tend to access data and instructions in patterns that favor recently-used items and nearby locations. This principle has two fundamental forms:
Temporal locality (locality in time):
If a memory location is accessed now, it is likely to be accessed again in the near future. Examples: a loop counter or accumulator updated on every iteration, the instructions of a loop body executing over and over, and data touched by a frequently called function.
Spatial locality (locality in space):
If a memory location is accessed now, nearby locations are likely to be accessed soon. Examples: traversing an array element by element, reading several fields of the same struct, and fetching instructions in straight-line code.
Locality isn't accidental—it's a natural consequence of how programs are structured. Loops create temporal locality. Data structures create spatial locality. The abstraction boundaries we use to organize code (functions, modules, classes) tend to group related data together. Well-designed caches exploit these natural patterns.
Quantifying locality:
The working set of a program is the set of memory locations accessed within a given time interval. If a cache can hold the working set, almost all accesses will be cache hits. The art of cache design—and performance optimization—involves sizing caches to capture working sets while managing the tradeoffs of larger, slower caches versus smaller, faster ones.
Sequential access patterns:
Consider iterating through an array of 1 million integers:
```c
int arr[1000000];
int sum = 0;

for (int i = 0; i < 1000000; i++) {
    sum += arr[i];  // Pure sequential access
}
```
This exhibits nearly perfect spatial locality. A cache line (typically 64 bytes) holds 16 integers. When arr[0] is accessed, the entire cache line is loaded. Accesses to arr[1] through arr[15] are all cache hits. The effective miss rate is 1/16 = 6.25%.
Random access patterns:
Now consider random accesses:
```c
for (int i = 0; i < 1000000; i++) {
    sum += arr[random() % 1000000];  // Random access
}
```
Each access is equally likely to hit any cache line. If the array exceeds cache capacity, almost every access is a cache miss. This can be 10-100× slower than sequential access.
| Access Pattern | Spatial Locality | Temporal Locality | Typical Hit Rate | Performance Impact |
|---|---|---|---|---|
| Sequential array traversal | Excellent | Low | ~94-97% | Near-optimal |
| Repeated array traversal | Excellent | Excellent | ~99%+ | Optimal |
| Strided access (stride=cache line) | None | Low | <10% | Very poor |
| Random access (large array) | None | None | <5% | Worst case |
| Linked list traversal (scattered) | None | Low | Variable | Often poor |
| Matrix multiply (naive) | Mixed | Low | ~50-80% | Improvable |
| Matrix multiply (blocked) | Excellent | Excellent | ~95%+ | Near-optimal |
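The last two rows of the table are worth making concrete. The sketch below is a generic C illustration (not any particular library's implementation) contrasting a naive triple loop with a blocked version; the matrix size N and tile size B are assumed values, with B chosen so that a few B×B tiles fit comfortably in cache, and the blocked version expects C to be zero-initialized.

```c
#include <stddef.h>

#define N 512   /* matrix dimension, assumed for illustration          */
#define B 64    /* tile size, tuned so a few B x B tiles fit in cache  */

/* Naive order: the inner loop walks Bm down a column, touching a new
   cache line on almost every iteration once N is large. */
void matmul_naive(const double *A, const double *Bm, double *C) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i * N + k] * Bm[k * N + j];
            C[i * N + j] = sum;
        }
}

/* Blocked (tiled) order: each B x B tile of A, Bm, and C is reused many
   times while it is still cache-resident. C must be zero-initialized. */
void matmul_blocked(const double *A, const double *Bm, double *C) {
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t jj = 0; jj < N; jj += B)
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t k = kk; k < kk + B; k++) {
                        double a = A[i * N + k];
                        for (size_t j = jj; j < jj + B; j++)
                            C[i * N + j] += a * Bm[k * N + j];
                    }
}
```

The blocked ordering reuses each tile many times while it is still cache-resident, which is where the ~95%+ hit rate in the table comes from.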
Understanding cache organization is fundamental to understanding cache behavior. A cache is not simply a small memory—it's a highly structured system with specific mechanisms for storing, locating, and replacing data.
Cache lines (blocks):
The fundamental unit of data in a cache is the cache line (also called a cache block). Rather than loading individual bytes, caches load entire lines at once—typically 64 bytes on modern systems. This exploits spatial locality: when you access one byte, nearby bytes come along for free.
A cache line contains: the data block itself (e.g., 64 bytes), a tag identifying which memory block it holds, a valid bit, and status bits such as a dirty bit (for write-back caches) and coherence state.
Address decomposition:
When the CPU accesses a memory address, that address is decomposed into three parts:
Block offset (b bits): Identifies the specific byte within the cache line. For 64-byte lines, b = 6 (2⁶ = 64)
Index (i bits): Identifies which cache set this address maps to. The number of sets S = 2ⁱ
Tag (t bits): Identifies which specific memory block is stored in the cache line. t = address_bits - i - b
For a 64-bit address space with 64-byte lines and 64 sets: the offset uses 6 bits (2⁶ = 64), the index uses 6 bits (2⁶ = 64 sets), and the tag uses the remaining 52 bits.
Cache lookup process:
1. Decompose the address into tag, index, and offset bits.
2. Use the index to select one cache set.
3. Compare the tag against the tags of every valid line in that set (all ways are checked in parallel).
4. On a match, it's a hit: the offset selects the requested bytes within the line.
5. On no match, it's a miss: the line is fetched from the next level, and a victim line in the set is chosen for eviction.
Higher associativity reduces conflict misses but increases access latency (more tags to compare) and power consumption (more lines active simultaneously). Modern L1 caches typically use 4-8 way associativity. L3 caches, being larger and slower anyway, often use 12-16+ way associativity to maximize hit rates.
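Putting these parameters together, the number of sets follows directly from capacity, line size, and associativity: sets = cache size / (line size × ways). The quick sketch below (cache configurations assumed for illustration) computes the set count and index bits, and the conceptual model that follows then shows how an address is split against that geometry.

```c
/* Cache geometry arithmetic: sets = cache_size / (line_size * ways).
   The example configurations are assumed, roughly matching the caches
   discussed in this section. */
#include <stdio.h>

static void geometry(const char *name, int size_kb, int line, int ways) {
    int sets = size_kb * 1024 / (line * ways);
    int index_bits = 0;
    for (int s = sets; s > 1; s >>= 1) index_bits++;   /* log2(sets) */
    printf("%-6s %4d KB, %2d-way: %4d sets, %d index bits\n",
           name, size_kb, ways, sets, index_bits);
}

int main(void) {
    geometry("L1-D", 48, 64, 12);   /*   64 sets,  6 index bits */
    geometry("L2", 2048, 64, 16);   /* 2048 sets, 11 index bits */
    return 0;
}
```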
```python
# Conceptual model of cache address decomposition
class CacheAddressDecomposition:
    def __init__(self, line_size_bytes=64, num_sets=64, address_bits=64):
        self.offset_bits = (line_size_bytes - 1).bit_length()  # e.g., 6 for 64-byte lines
        self.index_bits = (num_sets - 1).bit_length()          # e.g., 6 for 64 sets
        self.tag_bits = address_bits - self.offset_bits - self.index_bits

    def decompose(self, address):
        offset = address & ((1 << self.offset_bits) - 1)
        index = (address >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = address >> (self.offset_bits + self.index_bits)
        return {"tag": tag, "index": index, "offset": offset}

# Example: L1 cache with 64-byte lines and 64 sets (4 KB direct-mapped: 4096 / 64 = 64)
cache = CacheAddressDecomposition(line_size_bytes=64, num_sets=64)
result = cache.decompose(0x7fff8c2a4b80)
# tag: 0x7fff8c2a4, index: 46, offset: 0
```

The L1 cache (Level 1 cache) is the smallest, fastest cache, and the one closest to the CPU execution units. On modern processors, L1 cache is so tightly integrated that it's essentially part of the processor pipeline. Every CPU core has its own private L1 cache.
Key characteristics of L1 cache:
- Size: typically 32-64 KB per core for data (L1-D) and a similar amount for instructions (L1-I); some designs, such as Apple's performance cores, go considerably larger
- Split into separate instruction (L1-I) and data (L1-D) caches
- Latency: roughly 3-5 cycles
- Associativity: typically 4-12 way
- Private to each core
Why split L1 into instruction and data?
The split L1 design (Harvard architecture at L1) provides several advantages: the core can fetch instructions and load/store data in the same cycle without contending for ports; each cache can be tuned to its access pattern (instruction fetches are sequential and read-only, data accesses are read/write and less predictable); and a burst of data traffic cannot evict the hot instruction stream, or vice versa.
L1 cache optimizations:
L1 caches employ aggressive optimizations to keep access latency to a few cycles: virtually-indexed, physically-tagged (VIPT) organization so the set lookup can begin in parallel with address translation; way prediction to avoid reading every way on each access; banking and multiple ports to sustain several loads and stores per cycle; and fully pipelined access so a new request can start every cycle.
| Processor | L1-D per Core | L1-I per Core | Associativity | Latency (cycles) |
|---|---|---|---|---|
| Intel Core i9-13900K | 48 KB | 32 KB | 12-way (D), 8-way (I) | 5 |
| AMD Ryzen 9 7950X | 32 KB | 32 KB | 8-way | 4 |
| Apple M2 (P-core) | 128 KB | 192 KB | 8-way | 3 |
| ARM Cortex-X3 | 64 KB | 64 KB | 4-way | 4 |
| Graviton3 | 64 KB | 64 KB | 4-way | 3-4 |
An L1 cache miss doesn't go directly to main memory—it goes to L2 cache. The L1 miss penalty is the L2 access time: typically 10-15 cycles. While this seems small, remember that in tight loops the CPU can be limited by L1 bandwidth. Optimizing for L1 hits is critical in performance-sensitive code.
The L2 cache (Level 2 cache) serves as the middle tier of the cache hierarchy. It's larger than L1 but smaller than L3, striking a balance between speed and capacity. L2 cache catches misses from L1, preventing them from incurring the much higher latency of L3 or main memory access.
Key characteristics of L2 cache:
- Size: typically 256 KB to 2 MB per core
- Usually private to each core on current designs (shared in some older or low-power parts)
- Latency: roughly 10-15 cycles
- Unified: holds both instructions and data
- Typically 8-16 way set associative
Inclusion policies:
The relationship between cache levels is defined by their inclusion policy:
Inclusive L2: Every line in L1 is also present in L2. This simplifies coherence (checking L2 is sufficient to know a core's cache state) but wastes capacity (data is duplicated). Intel historically used inclusive caches.
Exclusive L2: L1 and L2 contain disjoint data. When a line is promoted from L2 to L1, it's evicted from L2. This maximizes effective capacity but complicates coherence. AMD Zen uses this approach.
Non-inclusive L2: Neither strictly inclusive nor exclusive. A line might be in L1 only, L2 only, or both. Modern Intel processors have moved toward non-inclusive designs.
L2 as a victim cache (exclusive):
In exclusive configurations, L2 often acts as a victim cache—when a line is evicted from L1, it moves to L2 rather than being discarded. This captures temporal locality: recently-evicted lines that are accessed again can be found in L2 rather than fetched from L3 or memory.
For many applications, L2 is the critical cache level. If your application's hot working set fits in L2 (~256KB-1MB per core), you'll rarely incur L3 or memory latencies. Profile your applications to understand their working set sizes and optimize data layouts accordingly.
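One practical way to estimate a working set is a pointer-chasing sweep: time random, dependent loads over buffers of increasing size and watch the latency jump as each cache level is exceeded. The sketch below is a minimal, assumption-laden version (POSIX clock_gettime, buffer sizes and access counts chosen arbitrarily); only the trend in the output is meaningful.

```c
/* Working-set sweep sketch: chase pointers through buffers of increasing
   size and report nanoseconds per access. Latency steps up as the buffer
   outgrows L1, then L2, then L3. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_access(size_t bytes) {
    size_t n = bytes / sizeof(size_t);
    size_t *next = malloc(n * sizeof(size_t));
    if (!next) return 0.0;

    /* Sattolo's algorithm: a random single-cycle permutation, so following
       next[idx] visits every element in a prefetcher-unfriendly order. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    const size_t accesses = 10 * 1000 * 1000;
    size_t idx = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < accesses; i++) idx = next[idx];  /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = idx;   /* keep the chase from being optimized away */
    (void)sink;
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / accesses;
}

int main(void) {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KB working set: %6.2f ns/access\n", kb, ns_per_access(kb * 1024));
    return 0;
}
```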
L2 bandwidth and ports:
L2 cache typically has lower bandwidth than L1 (measured in bytes per cycle). While L1 might sustain two loads per cycle (with wide vector loads, 64-128 bytes in total), L2 might deliver data to L1 at only 32-64 bytes per cycle. This asymmetry enforces the hierarchy: L1 must absorb most accesses to avoid saturating L2 bandwidth.
Hardware prefetching at L2:
Many processors implement hardware prefetchers at L2 that detect access patterns (sequential, strided) and speculatively fetch data before it's requested. Effective prefetching can hide memory latency by having data arrive in L2 just before the CPU needs it. However, incorrect prefetching wastes bandwidth and can evict useful data.
The L3 cache (Level 3 cache), often called the Last Level Cache (LLC), represents the final cache barrier before main memory. It's significantly larger than L1 and L2, but also slower. Its primary role is to reduce the number of expensive main memory accesses and to facilitate data sharing between cores.
Key characteristics of L3 cache:
- Size: tens of megabytes (roughly 16-128 MB on current desktop and server parts)
- Shared among all cores, or among a cluster of cores (e.g., a CCD)
- Latency: roughly 30-50 cycles
- Unified, and typically highly associative (12-16+ ways)
- Often serves as the coherence point for the cores that share it
L3 as the coherence hub:
In many processor designs, L3 serves as the central point for cache coherence. When one core modifies data, the coherence protocol checks L3's directory to determine which other cores might have copies. This makes L3 access latency variable—a simple hit is faster than a hit that requires coherence actions.
Sliced L3 architecture:
Modern processors with many cores often organize L3 into slices, with each slice associated with a subset of cores. The address bits determine which slice holds a given cache line. This distributes traffic and provides better bandwidth scaling but can introduce Non-Uniform Cache Access (NUCA) effects where some slices are faster to reach than others.
AMD's revolutionary 3D V-Cache:
AMD's 3D V-Cache technology stacks additional L3 cache vertically on top of the processor die, dramatically increasing L3 capacity (e.g., 96 MB on the stacked CCD of the Ryzen 9 7950X3D, 128 MB in total). This has proven particularly effective for gaming and other workloads with large working sets, demonstrating the continued importance of cache capacity.
| Processor | L3 Size | Cores Sharing | Latency (cycles) | Architecture |
|---|---|---|---|---|
| Intel Core i9-13900K | 36 MB | 24 cores (8P+16E) | ~42 | Ring bus, sliced |
| AMD Ryzen 9 7950X | 64 MB | 16 cores (2 CCDs) | ~46 | 2 × 32 MB, one slice per CCD |
| AMD Ryzen 9 7950X3D | 128 MB | 16 cores | ~50 | 3D V-Cache stacking |
| Apple M2 Ultra | 192 MB | 24 cores | Unknown | Unified memory architecture |
| Intel Xeon Scalable | 52.5 MB | 36+ cores | ~60+ | Mesh interconnect |
L3 latency (30-50 cycles, ~15 ns) versus main memory latency (200-300 cycles, ~70-100 ns) represents a 4-7× difference. This makes L3 hits dramatically faster than main memory accesses. For server workloads, L3 hit rates often determine overall system throughput.
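These latencies combine through the average memory access time (AMAT): each level contributes its hit time plus its miss rate times the cost of going one level further down. The snippet below works the arithmetic for one assumed set of per-level hit rates and latencies, roughly matching the figures quoted in this section.

```c
/* AMAT sketch. Latencies (cycles) and per-level hit rates are assumed,
   illustrative values; hit rates are local (hits among accesses that
   reach that level). */
#include <stdio.h>

int main(void) {
    double l1_hit = 4, l2_hit = 14, l3_hit = 42, mem = 250;   /* cycles */
    double l1_hr = 0.95, l2_hr = 0.80, l3_hr = 0.70;          /* local hit rates */

    /* AMAT = L1_hit + L1_miss * (L2_hit + L2_miss * (L3_hit + L3_miss * mem)) */
    double amat = l1_hit
        + (1 - l1_hr) * (l2_hit + (1 - l2_hr) * (l3_hit + (1 - l3_hr) * mem));

    /* 4 + 0.05 * (14 + 0.2 * (42 + 0.3 * 250)) = 5.87 cycles */
    printf("AMAT ~ %.1f cycles\n", amat);
    return 0;
}
```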
When a cache is full and a new line must be loaded, the cache must decide which existing line to evict. This decision is governed by the replacement policy. The goal is to evict the line least likely to be used again, preserving the lines with the highest future utility.
The optimal (Bélády's) policy:
In theory, the best replacement policy is to evict the line that will be used furthest in the future. Known as Bélády's MIN algorithm, this is optimal but impossible to implement—we can't know future accesses. Practical policies approximate this ideal.
Common replacement policies:
- Least Recently Used (LRU): evict the line that has gone unused the longest. A good proxy for temporal locality, but tracking exact recency gets expensive at high associativity.
- Pseudo-LRU (PLRU): a tree-based approximation of LRU that needs only a few bits per set; widely used in real hardware.
- Random and FIFO: cheap to implement and surprisingly competitive at high associativity.
- Re-Reference Interval Prediction (RRIP): predicts how soon each line is likely to be re-referenced and evicts the line predicted to be needed furthest in the future; variants are used in modern last-level caches.
The scan resistance problem:
LRU has a critical weakness: a single scan through a large array can evict all useful data. In the worst case, a program that repeatedly cycles through N+1 unique cache lines, where the cache holds only N, gets a 0% hit rate under LRU: every access evicts exactly the line that will be needed next.
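A toy model makes this concrete. The sketch below simulates a fully associative cache of 4 lines with true LRU (both numbers assumed purely for illustration) and repeatedly loops over 5 distinct lines; every access misses, so the hit rate is exactly 0%.

```c
/* Toy fully-associative cache with true LRU, small enough to show the
   scan problem: looping over N+1 distinct lines with an N-line cache. */
#include <stdio.h>

#define WAYS 4

static long tags[WAYS];
static int age[WAYS];                 /* higher = older */

static int access_line(long tag) {
    int hit = -1, victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (tags[i] == tag) hit = i;
        if (age[i] > age[victim]) victim = i;   /* track the LRU way */
    }
    int slot = (hit >= 0) ? hit : victim;
    for (int i = 0; i < WAYS; i++) age[i]++;    /* every line ages */
    tags[slot] = tag;
    age[slot] = 0;                              /* most recently used */
    return hit >= 0;
}

int main(void) {
    for (int i = 0; i < WAYS; i++) tags[i] = -1;   /* start empty */
    int hits = 0, total = 0;
    for (int pass = 0; pass < 100; pass++)
        for (long tag = 0; tag < WAYS + 1; tag++) {   /* 5 lines, 4-line cache */
            hits += access_line(tag);
            total++;
        }
    printf("hit rate: %.0f%%\n", 100.0 * hits / total);   /* prints 0% */
    return 0;
}
```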
Modern policies like RRIP address this by treating newly-loaded lines as less valuable until they prove their worth through re-reference.
Adaptive policies:
State-of-the-art caches often use adaptive policies that observe behavior and switch strategies. Hardware monitors hit rates under different policies (using sampling) and dynamically selects the best performer. This is particularly valuable in shared L3 caches where different applications may have different access patterns.
While you can't control hardware replacement policies, you can structure data access to work well with them. Sequential access patterns, reusing data before moving on, and minimizing working set size all improve cache behavior regardless of the specific replacement policy in use.
When the CPU writes data, the cache must decide how to handle that write. The write policy determines whether and when data is propagated to lower cache levels and main memory. This decision has significant implications for performance, coherence, and consistency.
Two fundamental choices:
- Write-through: every store updates both the cache and the next level immediately. Lower levels always hold current data, but every write consumes downstream bandwidth (usually smoothed by a write buffer).
- Write-back: stores update only the cache line and mark it dirty; the modified line is written to the next level only when it is evicted or flushed. Repeated writes to the same line are coalesced into a single write-back, at the cost of lower levels being temporarily stale.
Write miss policies:
What happens when a write targets an address not currently in the cache?
Write-allocate: Fetch the target cache line from memory, then perform the write. This is the common choice for write-back caches—future writes and reads to neighboring bytes will hit in cache.
No-write-allocate: Write directly to memory without loading the line into cache. Common with write-through caches—since data goes to memory anyway, why waste cache space?
Combining policies:
Modern systems typically use: write-back with write-allocate throughout the cache hierarchy (L1, L2, and L3), because it minimizes traffic to lower levels and exploits the locality of write streams.
Write-through is rare in general-purpose processors but common in specialized systems where coherence simplicity outweighs performance costs.
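To see why write-back saves so much traffic, consider a small model: a single-line write-back cache with a dirty bit, compared against write-through, processing a short assumed trace of stores. The trace and the 64-byte line size are illustrative only.

```c
/* Minimal single-line write-back cache model with a dirty bit, compared
   against write-through, for an assumed store trace. */
#include <stdio.h>

int main(void) {
    /* trace of store addresses; most fall in the same 64-byte line */
    long trace[] = {0x100, 0x108, 0x110, 0x100, 0x2000, 0x100};
    int n = sizeof trace / sizeof trace[0];

    long cached_line = -1;
    int dirty = 0;
    int wb_mem_writes = 0, wt_mem_writes = 0;

    for (int i = 0; i < n; i++) {
        long line = trace[i] / 64;
        wt_mem_writes++;                 /* write-through: every store goes down */
        if (line != cached_line) {       /* write-back: miss on a new line */
            if (dirty) wb_mem_writes++;  /* evicting a dirty line costs one writeback */
            cached_line = line;          /* write-allocate: bring the line in */
        }
        dirty = 1;                       /* the store marks the line dirty */
    }
    if (dirty) wb_mem_writes++;          /* final eviction/flush */

    printf("write-through: %d memory writes\n", wt_mem_writes);  /* 6 */
    printf("write-back:    %d memory writes\n", wb_mem_writes);  /* 3 */
    return 0;
}
```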
Write-back creates a consistency challenge: memory doesn't reflect the latest data until the dirty line is evicted or explicitly flushed. For I/O devices that read memory directly (DMA), or for power loss scenarios, this can cause data loss. Operating systems and device drivers must explicitly flush caches when coherence with external entities is required.
Not every memory access can be a cache hit. Understanding why misses occur is essential for optimization. The classic Three Cs model categorizes cache misses into three types:
1. Compulsory misses (cold misses):
The first access to any memory location always misses; the data isn't in the cache yet because it has never been fetched. Compulsory misses would occur even with an infinitely large cache. Larger cache lines reduce their number (one fetch brings in more neighboring data), and prefetching can hide their latency by loading data before it's requested.
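Software prefetching is one way to do that explicitly. The sketch below uses the GCC/Clang builtin __builtin_prefetch while summing an array; the prefetch distance (8 cache lines ahead) and the locality hint are assumed tuning parameters, and for a simple sequential stream like this the hardware prefetcher would usually do the job on its own.

```c
#include <stddef.h>

/* Sum an array while prefetching ahead. __builtin_prefetch(addr, rw, locality)
   is a GCC/Clang builtin; the distance of 8 lines ahead is an assumed knob. */
long sum_with_prefetch(const int *arr, size_t n) {
    const size_t ahead = 8 * 16;   /* 8 cache lines * 16 ints per 64-byte line */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&arr[i + ahead], 0 /* read */, 3 /* keep in cache */);
        sum += arr[i];
    }
    return sum;
}
```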
2. Capacity misses:
Occur when the working set exceeds cache size. Even with perfect replacement, if you access more unique data than the cache can hold, evictions are inevitable. The evicted lines will miss when re-accessed.
Reducing capacity misses: process data in cache-sized blocks (tiling), shrink the working set with more compact data types and layouts, and make a single streaming pass over large data rather than repeated full traversals.
3. Conflict misses (collision misses):
Occur when multiple addresses map to the same cache set, causing evictions even when the overall cache has free space. This is a limitation of direct-mapped and set-associative caches. Fully associative caches have no conflict misses.
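A classic way to trigger conflict misses is walking a power-of-two-sized matrix down a column. With the geometry assumed below (64-byte lines, 64 sets per way), every row is 4096 bytes long, so successive column elements map to the same set; padding each row by one cache line spreads them across sets. The dimensions and padding are illustrative.

```c
/* Conflict-miss sketch: with 64-byte lines and 64 sets per way, addresses
   4096 bytes apart map to the same set, so column-walking a 1024 x 1024
   int matrix hammers one set. Padding each row by one line fixes it. */
#define ROWS 1024
#define COLS 1024
#define PAD  16                          /* 16 ints = one 64-byte line of padding */

static int matrix[ROWS][COLS];           /* rows 4096 B apart: same set every row */
static int padded[ROWS][COLS + PAD];     /* rows 4160 B apart: sets rotate by one */

long sum_column(int col) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        sum += matrix[r][col];           /* each access contends for one set */
    return sum;
}

long sum_column_padded(int col) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        sum += padded[r][col];           /* padding shifts each row to a new set */
    return sum;
}
```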
Extending the model:
A fourth C is sometimes added:
4. Coherence misses:
In multicore systems, misses caused by cache coherence protocols. When one core modifies data, copies in other cores are invalidated. Subsequent accesses to that data from other cores miss despite the data being "recently used" from their perspective.
Diagnosing miss types:
Profiling tools (Intel VTune, AMD uProf, perf) can measure cache miss rates and, with analysis, attribute misses to categories:
| Miss Type | Cause | Mitigation Strategies | Tools |
|---|---|---|---|
| Compulsory | First access to data | Prefetching, batching first accesses | Hardware prefetchers, __builtin_prefetch |
| Capacity | Working set > cache size | Blocking/tiling, reduce footprint, compression | perf stat, VTune |
| Conflict | Address collisions in sets | Array padding, data alignment, struct reorganization | Cache simulation, hardware counters |
| Coherence | Multi-core invalidation | Reduce false sharing, partitioning, careful sync | perf c2c, VTune memory analysis |
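False sharing, listed in the table above, deserves a concrete picture. The C11 sketch below shows two per-thread counters that share one 64-byte line versus a padded layout that gives each counter its own line; the threading code is omitted because the data layout is the whole point, and the 64-byte line size is an assumption about the target machine.

```c
/* False-sharing sketch (C11): two per-thread counters in the same 64-byte
   line ping-pong between cores as each thread's writes invalidate the
   other core's copy; aligning each counter to its own line avoids this. */
#include <stdalign.h>

struct counters_shared {      /* both fields live in one cache line */
    long thread0_count;
    long thread1_count;
};

struct counters_padded {      /* each field occupies a full 64-byte line */
    alignas(64) long thread0_count;
    alignas(64) long thread1_count;
};

/* sizeof(struct counters_shared) == 16, sizeof(struct counters_padded) == 128 */
```

In real code this layout is paired with one counter per thread, with results combined only at the end.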
We've explored the cache hierarchy in depth—the critical bridge between the CPU's execution capability and main memory's vast storage. Understanding cache behavior is essential for writing high-performance code and designing efficient operating systems.
What's next:
With registers and caches covered, we'll now explore main memory (RAM)—the primary system memory that stores the bulk of program data and code. We'll examine DRAM technology, memory organization, and how operating systems manage physical memory through techniques like paging and virtual memory.
You now have a comprehensive understanding of CPU cache memory—from fundamental organization through advanced topics like replacement policies and coherence. This knowledge is essential for understanding system performance, writing cache-friendly code, and designing OS memory management systems.