When a processor accesses a single byte from memory, the cache doesn't fetch just that byte—it fetches an entire cache line (also called a cache block). This seemingly simple design decision has profound implications for system performance, memory architecture, and software optimization.
Cache lines are the fundamental unit of data transfer between cache levels and main memory. Understanding their organization is essential for understanding how caches actually work—and for writing code that doesn't fight against the hardware.
This page covers cache line structure and organization, addressing schemes (direct-mapped, set-associative, fully-associative), tag matching and lookup mechanics, and the critical implications of cache line size. You'll understand why cache lines exist and how they affect everything from memory bandwidth to false sharing bugs.
A cache line (or cache block) is the smallest unit of data that can be transferred between the cache and main memory (or between cache levels). Key characteristics:
Size: Typically 64 bytes on modern x86 and ARM processors; some designs use 32 or 128 bytes.
Alignment: A cache line is aligned to its own size, so a 64-byte line always begins at an address that is a multiple of 64.
Atomic Unit: The line is fetched, evicted, and tracked as a whole; the cache never holds a partial line.
Why Not Fetch Just the Byte We Need?
Fetching cache lines instead of individual bytes exploits spatial locality and amortizes memory access overhead:
Spatial Locality Exploitation: When you access byte N, you're likely to access bytes N+1, N+2, etc. By fetching 64 bytes, subsequent accesses are 'free' cache hits.
Memory Bus Efficiency: Modern memory systems transfer data on wide buses (256-512 bits). A 64-byte transfer uses the bus efficiently; fetching 1 byte would waste most of the bus width.
DRAM Access Patterns: DRAM accesses have long latency but high bandwidth once started. Reading 64 bytes takes only slightly longer than reading 8 bytes—most of the time is row activation and column selection.
Metadata Overhead: Each cache line requires tag storage, valid bits, and potentially dirty bits. Smaller lines would require more metadata per byte of data stored.
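A rough back-of-envelope calculation makes the amortization concrete. The sketch below uses illustrative latency and bandwidth numbers (they are assumptions for the example, not measurements of any specific memory part):

```python
# Illustrative numbers (assumptions): ~50 ns to open a DRAM row and reach the
# first data, then ~0.5 ns per additional 8-byte beat on the memory bus.
ROW_ACCESS_NS = 50.0
NS_PER_8B_BEAT = 0.5

def effective_bandwidth(bytes_fetched):
    """Useful bytes delivered per nanosecond if we fetch `bytes_fetched` bytes."""
    beats = max(1, bytes_fetched // 8)
    total_ns = ROW_ACCESS_NS + beats * NS_PER_8B_BEAT
    return bytes_fetched / total_ns

print(f"{effective_bandwidth(8):.2f} B/ns for an 8-byte fetch")   # ~0.16 B/ns
print(f"{effective_bandwidth(64):.2f} B/ns for a 64-byte line")   # ~1.19 B/ns
```

Under these assumptions the 64-byte fetch takes only about 7% longer than the 8-byte fetch yet delivers eight times the data, which is exactly the amortization argument above.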
Every memory address can be decomposed into three parts for cache lookup purposes. Understanding this decomposition is key to understanding cache operation.
Address Decomposition:
For a cache with 2^S sets and 2^B byte blocks (cache lines), an N-bit address is split as:
|<----- Tag ----->|<--- Index --->|<--- Offset --->|
     N-S-B bits         S bits          B bits
Components:
Offset (B bits): Selects the byte within the cache line.
Index (S bits): Selects which set of the cache to search.
Tag (N-S-B bits): Identifies which memory block currently occupies the line; it is stored alongside the data and compared on every lookup.
The index bits determine which lines conflict for cache space. Two addresses with the same index bits (regardless of tag) compete for the same cache set. This is why access patterns whose addresses differ by a multiple of (number of sets × line size) cause conflict misses: they hit the same set repeatedly.
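As a concrete illustration, here is a small Python sketch that decomposes an address, assuming a cache with 64-byte lines and 64 sets (so 6 offset bits and 6 index bits):

```python
OFFSET_BITS = 6   # 64-byte lines -> 6 offset bits (assumed geometry)
INDEX_BITS = 6    # 64 sets       -> 6 index bits (assumed geometry)

def decompose(address):
    """Split an address into (tag, index, offset) for the assumed geometry."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# 0x12345 -> tag = 0x12, index = 0x0D, offset = 0x05
print(tuple(hex(x) for x in decompose(0x12345)))
```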
Associativity refers to how many different cache locations a given memory address can occupy. This critically affects conflict misses and hardware complexity.
Direct-Mapped Cache (1-way associative):
Each memory address maps to exactly ONE cache location. The index bits directly select the cache line.
Pros: Simplest hardware, a single tag comparison per lookup, and the lowest hit latency and power of any organization.
Cons: High conflict-miss rate; two hot addresses that share an index evict each other even if the rest of the cache sits empty.
Example Conflict:
Address A: index = 42, tag = X
Address B: index = 42, tag = Y
Accessing A, B, A, B, A, B...
→ Every access is a miss! (ping-pong effect)
Used In: Rare in modern general-purpose CPUs (see the comparison table below); still found in some simple embedded designs where minimal latency and power matter more than hit rate.
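The ping-pong effect is easy to reproduce in simulation. A minimal Python sketch of a direct-mapped cache, with an assumed geometry of 64 sets of 64-byte lines, shows every access in the A, B, A, B pattern missing:

```python
def simulate_direct_mapped(accesses, num_sets=64, line_size=64):
    """Count hits/misses for a direct-mapped cache (sketch, assumed geometry)."""
    tags = [None] * num_sets          # one tag per set (1-way)
    hits = misses = 0
    for addr in accesses:
        index = (addr // line_size) % num_sets
        tag = addr // (line_size * num_sets)
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag         # evict whatever was there
    return hits, misses

# A and B differ by exactly num_sets * line_size bytes: same index, different tag
A, B = 0x10000, 0x10000 + 64 * 64
print(simulate_direct_mapped([A, B] * 4))   # -> (0, 8): every access misses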
| Property | Direct-Mapped | 4-Way Set | 8-Way Set | Fully Associative |
|---|---|---|---|---|
| Conflict Misses | Highest | Moderate | Low | Minimal |
| Tag Comparisons/Lookup | 1 | 4 | 8 | All entries |
| Hardware Complexity | Lowest | Moderate | Higher | Highest |
| Power Consumption | Lowest | Moderate | Higher | Highest |
| Hit Latency | Lowest | Low | Moderate | Variable |
| Typical Use | Rare | L1 caches | L2/L3 caches | TLBs, small caches |
A cache lookup involves several coordinated steps. Understanding the mechanics helps explain cache timing and power characteristics.
Cache Lookup Procedure:
1. Decompose the address into tag, index, and offset.
2. Use the index to select one set.
3. Compare the tag against the stored tag of every valid line in that set (done in parallel in hardware).
4. On a hit, read the requested bytes at the given offset from the matching line and update the replacement state; on a miss, fetch the line from the next level and install it, evicting a victim if the set is full.
Timing Considerations:
For an L1 cache with 1-4 cycle hit latency, all these steps must complete in 1-4 clock cycles. This extreme timing constraint drives several design decisions:
Way Prediction: Some caches predict which way will hit and start reading that data early. If prediction is wrong, access takes extra cycles.
Serial vs Parallel Tag/Data Access: L1 caches typically read the tags and the data of all ways in parallel and then select the matching way, minimizing latency at the cost of power. L2/L3 caches often read the tags first and access only the matching way's data, trading an extra cycle or two for lower energy per access.
Virtually-Indexed Physically-Tagged (VIPT): L1 caches commonly index with virtual address bits (available before translation) while the stored tags are physical. This lets the TLB lookup and the set selection proceed in parallel, but it constrains the geometry: the index and offset bits should fit within the page offset so the index is identical in virtual and physical addresses, or aliasing must be handled explicitly.
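A quick sketch of that sizing constraint, using the simple "each way must fit in one page" formulation and assuming 4 KB pages:

```python
import math

def min_vipt_associativity(cache_size, page_size=4096):
    """Smallest associativity for which each way fits in one page, so the
    index + offset bits fall entirely within the page offset (simple rule)."""
    return math.ceil(cache_size / page_size)

# A 32 KB VIPT L1 with 4 KB pages needs each way <= 4 KB, i.e. at least 8 ways.
print(min_vipt_associativity(32 * 1024))   # -> 8
```

This is one reason many L1 data caches pair a 32 KB capacity with 8-way associativity.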
```python
def cache_lookup(address, cache, access_size=8):
    """
    Pseudocode for cache lookup in an N-way set-associative cache.
    """
    # Step 1: Decompose address
    offset = address & ((1 << cache.offset_bits) - 1)
    index = (address >> cache.offset_bits) & ((1 << cache.index_bits) - 1)
    tag = address >> (cache.offset_bits + cache.index_bits)

    # Step 2: Select the set
    cache_set = cache.sets[index]

    # Steps 3 & 4: Check tags in the set (in parallel in hardware)
    for way in range(cache.associativity):
        line = cache_set.lines[way]
        if line.valid and line.tag == tag:
            # CACHE HIT
            cache.update_lru(index, way)  # Update replacement state
            return {
                'hit': True,
                'data': line.data[offset:offset + access_size],
                'cycles': cache.hit_latency
            }

    # CACHE MISS
    return {
        'hit': False,
        'data': None,
        'cycles': cache.miss_penalty  # Must fetch from next level
    }


def cache_fill(address, data, cache):
    """
    Pseudocode for filling a cache line after a miss.
    """
    # Decompose address
    offset = address & ((1 << cache.offset_bits) - 1)
    index = (address >> cache.offset_bits) & ((1 << cache.index_bits) - 1)
    tag = address >> (cache.offset_bits + cache.index_bits)

    # Find victim using replacement policy
    cache_set = cache.sets[index]
    victim_way = cache.select_victim(index)  # LRU, random, etc.
    victim_line = cache_set.lines[victim_way]

    # Writeback if dirty
    if victim_line.valid and victim_line.dirty:
        writeback_address = (victim_line.tag << (cache.index_bits + cache.offset_bits)) | \
                            (index << cache.offset_bits)
        cache.writeback(writeback_address, victim_line.data)

    # Install new line
    victim_line.valid = True
    victim_line.tag = tag
    victim_line.data = data
    victim_line.dirty = False
    cache.update_lru(index, victim_way)
```

Each cache line carries metadata beyond the data itself. This metadata is essential for correct operation and replacement decisions.
| Field | Size (typical) | Purpose |
|---|---|---|
| Tag | 36-48 bits | Identifies which memory address is stored in this line |
| Valid Bit | 1 bit | Indicates if the line contains valid data (vs garbage after power-on) |
| Dirty Bit | 1 bit | Indicates if data has been modified (needs writeback on eviction) |
| LRU State | 2-4 bits (varies) | Tracks recency for replacement decisions (may be approximate) |
| MESI State | 2 bits | Coherence state: Modified, Exclusive, Shared, or Invalid (covered later) |
| ECC | 8 bits per 8 bytes (64 bits per line) | Error-correcting code for soft error detection/correction |
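One way to model this metadata in software, as a minimal sketch whose field widths follow the table (the class itself is illustrative, not how hardware stores it):

```python
from dataclasses import dataclass, field
from enum import Enum

class MESIState(Enum):
    MODIFIED = 0
    EXCLUSIVE = 1
    SHARED = 2
    INVALID = 3

@dataclass
class CacheLine:
    """One cache line with the metadata fields from the table above (sketch)."""
    tag: int = 0                       # which memory block is cached here
    valid: bool = False                # line holds real data, not power-on garbage
    dirty: bool = False                # modified since fill; needs writeback on eviction
    lru_state: int = 0                 # small counter used by approximate LRU
    mesi: MESIState = MESIState.INVALID
    data: bytearray = field(default_factory=lambda: bytearray(64))  # 64-byte payload
```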
Storage Overhead:
Let's calculate the overhead for a 32KB 8-way set-associative cache:
Data storage: 32 KB = 262,144 bits
Number of lines: 32KB / 64B = 512 lines
Per line metadata:
- Tag: 36 bits (for 48-bit physical address, 6-bit index, 6-bit offset)
- Valid: 1 bit
- Dirty: 1 bit
- LRU state: 3 bits (approximate for 8-way)
- MESI: 2 bits
- ECC: 64 bits (8 bits per 8 bytes)
Total: 107 bits per line
Total metadata: 512 × 107 = 54,784 bits ≈ 6.7 KB
Overhead: 6.7 KB / 32 KB = 21%
This 21% overhead represents the cost of the cache's 'intelligence'—the ability to know what data is present and how to manage it.
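The same arithmetic generalizes to other cache configurations. A small Python helper, with defaults that mirror the worked example above (48-bit physical addresses, 64-byte lines, 8 ways, 3 LRU bits, 2 MESI bits, 8 ECC bits per 8 bytes):

```python
def metadata_overhead(cache_bytes=32 * 1024, line_bytes=64,
                      phys_addr_bits=48, assoc=8,
                      lru_bits=3, mesi_bits=2, ecc_bits_per_8B=8):
    """Recompute the per-line metadata and total overhead from the worked example."""
    lines = cache_bytes // line_bytes
    sets = lines // assoc
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (sets - 1).bit_length()
    tag_bits = phys_addr_bits - index_bits - offset_bits
    per_line = (tag_bits + 1 + 1 + lru_bits + mesi_bits
                + ecc_bits_per_8B * (line_bytes // 8))
    total_bits = lines * per_line
    return per_line, total_bits, total_bits / (cache_bytes * 8)

print(metadata_overhead())   # -> (107, 54784, ~0.209), matching the 21% above
```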
The Valid Bit:
The valid bit seems simple but is crucial:
Power-On / Flush: After reset or a cache flush, every valid bit is cleared so that leftover garbage in the data array is never returned as real data.
Hit Condition: A lookup counts as a hit only when the tag matches AND the valid bit is set.
Invalidation: Coherence traffic or explicit invalidate operations simply clear the valid bit rather than touching the data.
The Dirty Bit:
The dirty bit tracks modifications:
Set on Write: In a write-back cache, a store that hits the line sets its dirty bit.
Checked on Eviction: A dirty victim must be written back to the next level before its slot is reused; a clean victim can simply be dropped.
(Write-through caches, covered on the next page, keep the next level up to date on every store and do not need a dirty bit.)
True LRU for 8-way associativity requires tracking 8! = 40,320 possible orderings per set, which is impractical in hardware. Real caches use approximate schemes such as tree-based Pseudo-LRU (PLRU) or RRIP that need only a few bits of state per line while providing 'good enough' replacement decisions.
Cache line size is a critical design decision that affects performance, power, and behavior. There's no universally optimal size—it's a tradeoff.
Why 64 Bytes?
The industry has converged on 64-byte cache lines as a sweet spot:
Spatial Locality: 64 bytes covers several adjacent array elements, a handful of pointers, or a small struct, so nearby accesses become hits.
DRAM Match: A standard DRAM burst (8 transfers on a 64-bit channel) delivers exactly 64 bytes, so line fills map cleanly onto the memory system.
Waste Control: Larger lines would fetch more unused data on sparse access patterns, raise miss penalties, and make false sharing more likely.
Metadata Balance: Smaller lines would multiply tag, state, and ECC overhead per byte of data.
Cache Line Size Impact on Software:
```c
// Structure padding to avoid false sharing in multi-threaded code
// Assume 64-byte cache lines
#include <stdalign.h>   // for alignas (C11)

// BAD: Two variables in the same cache line
struct shared_counters_bad {
    volatile int counter_thread_1;  // 4 bytes
    volatile int counter_thread_2;  // 4 bytes - same cache line!
};
// Both threads constantly invalidate each other's cache line

// GOOD: Padding to separate cache lines
struct shared_counters_good {
    volatile int counter_thread_1;  // 4 bytes
    char padding1[60];              // Pad to 64 bytes
    volatile int counter_thread_2;  // 4 bytes - different cache line!
    char padding2[60];              // Pad for alignment
};

// BETTER: Use language-provided alignment
struct shared_counters_best {
    alignas(64) volatile int counter_thread_1;
    alignas(64) volatile int counter_thread_2;
};

// Practical example: Array traversal stride effects
#define SIZE (1024 * 1024)
int array[SIZE];

void traversal_demo(void) {
    // Access every element (stride 1): excellent cache utilization
    // Each cache line (16 ints) is fully used before moving on
    for (int i = 0; i < SIZE; i++) {
        array[i] = 0;               // 1 miss per 16 accesses
    }

    // Access every 16th element (stride 16 = 64 bytes = cache line size)
    // Each cache line is fetched but only 1 element is used!
    for (int i = 0; i < SIZE; i += 16) {
        array[i] = 0;               // 1 miss per access = maximum cache waste
    }
}

// The strided version touches 16 times fewer elements, yet it incurs the same
// number of cache misses as the full stride-1 traversal.
```

Be aware of 64-byte boundaries in performance-critical code. Data structures that span cache line boundaries may require two cache line fetches for a single access. Similarly, unintended sharing of cache lines between threads causes devastating false sharing. Modern compilers and language standards offer alignas and similar mechanisms to control alignment.
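Rather than hard-coding 64, you can often query the line size at runtime. A small Python sketch for Linux systems that expose the standard sysfs attribute (the path is Linux-specific and may be absent elsewhere):

```python
def cache_line_size(default=64):
    """Read the line size of the first reported cache level from sysfs,
    falling back to a default when the attribute is unavailable."""
    path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
    try:
        with open(path) as f:
            return int(f.read())
    except (OSError, ValueError):
        return default

print(cache_line_size())   # typically 64 on x86-64 and most ARM cores
```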
When a cache set is full and a new line must be loaded, the cache must choose a victim to evict. The replacement policy makes this decision.
Common Replacement Policies:
| Policy | Description | Pros | Cons |
|---|---|---|---|
| LRU (Least Recently Used) | Evict line accessed longest ago | Good temporal locality exploitation | Expensive to track for high associativity |
| PLRU (Pseudo-LRU) | Tree-based approximation of LRU | Low overhead, reasonable performance | Not true LRU, can make suboptimal choices |
| Random | Evict random line in set | Zero tracking overhead, immune to pathological patterns | Doesn't exploit locality, unpredictable |
| FIFO (First-In First-Out) | Evict oldest installed line | Simple to implement | Doesn't consider access frequency |
| RRIP (Re-Reference Interval Prediction) | Predict re-use distance | Good for scan-resistant behavior | More complex, requires tuning |
LRU Implementation Challenge:
True LRU for an 8-way cache requires maintaining a total ordering of 8 elements. On every access, the accessed line must move to the 'most recent' position.
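For reference, true LRU is easy to express in software even though it is costly in hardware. A minimal sketch for a single set, tracking way numbers in recency order:

```python
class TrueLRUSet:
    """True LRU for one cache set: a recency-ordered list of way numbers (sketch)."""
    def __init__(self, associativity=8):
        self.order = list(range(associativity))  # front = LRU, back = MRU

    def touch(self, way):
        """Record an access: move the way to the most-recently-used position."""
        self.order.remove(way)
        self.order.append(way)

    def victim(self):
        """The least recently used way is the eviction candidate."""
        return self.order[0]

s = TrueLRUSet()
for way in [3, 1, 3, 7]:
    s.touch(way)
print(s.victim())   # -> 0 (never touched, so least recently used)
```

The list manipulation is trivial here, but in hardware every access would have to reshuffle the ordering state within the hit latency, which is why real caches approximate instead.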
Tree-Based Pseudo-LRU:
A binary tree with 7 internal nodes can approximately track LRU for 8 ways: each node stores one bit pointing toward the half of the ways that was used less recently. On every access, the bits along the path to the accessed way are set to point away from it; to choose a victim, the hardware simply follows the bits from the root down to a leaf. That is 7 bits per set instead of the roughly 16 bits a full ordering would require.
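A minimal sketch of tree-based PLRU for an 8-way set follows; the bit and node-indexing conventions here are one common choice, not the only one:

```python
class TreePLRUSet:
    """Tree-based Pseudo-LRU for one 8-way set: 7 bits arranged as a binary tree.
    bits[0] is the root; the children of node n are 2n+1 and 2n+2 (sketch)."""
    def __init__(self):
        self.bits = [0] * 7   # each bit points toward the less-recently-used half

    def touch(self, way):
        """On an access, make every bit on the path point AWAY from this way."""
        node = 0
        for level in (2, 1, 0):              # walk root -> leaf using the way's bits
            direction = (way >> level) & 1   # 0 = left subtree, 1 = right subtree
            self.bits[node] = 1 - direction  # point at the other half
            node = 2 * node + 1 + direction

    def victim(self):
        """Follow the bits from the root to find the pseudo-LRU way."""
        node, way = 0, 0
        for _ in range(3):
            direction = self.bits[node]
            way = (way << 1) | direction
            node = 2 * node + 1 + direction
        return way

s = TreePLRUSet()
for w in [0, 1, 2, 3]:
    s.touch(w)
print(s.victim())   # -> 4: the victim comes from the untouched half (ways 4-7)
```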
Scan Resistance:
Streaming workloads such as large database scans can thrash LRU caches: every newly fetched line is installed as most-recently-used even though it will never be referenced again, and in the process it evicts lines the rest of the program would have reused.
Policies like RRIP predict that newly-loaded lines have short re-use intervals (likely scan) and don't immediately make them MRU. This provides scan resistance while still exploiting temporal locality for truly repeated accesses.
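A simplified sketch of a 2-bit RRIP-style policy for one set is shown below; the constants and insertion value are illustrative, and real SRRIP/BRRIP implementations differ in detail:

```python
class RRIPSet:
    """Simplified 2-bit RRIP for one set: higher value = longer predicted re-reference."""
    MAX = 3                                     # "distant future": eviction candidate

    def __init__(self, associativity=8):
        self.rrpv = [self.MAX] * associativity  # everything starts as distant

    def hit(self, way):
        self.rrpv[way] = 0                      # a real reuse gets a near prediction

    def insert(self):
        """Pick a victim, install the new line with a long (but not max) prediction."""
        while self.MAX not in self.rrpv:        # age everyone until a candidate appears
            self.rrpv = [v + 1 for v in self.rrpv]
        victim = self.rrpv.index(self.MAX)
        self.rrpv[victim] = self.MAX - 1        # scan-resistant: not immediately MRU
        return victim

s = RRIPSet()
print(s.insert())   # -> 0: all ways start as candidates, so way 0 is chosen
s.hit(0)            # only genuine reuse promotes the line toward "near"
```

Because new lines are not inserted as most-recently-used, a one-pass scan ages out quickly while genuinely reused lines stay protected.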
The theoretical optimal replacement policy (Belady's MIN) evicts the line that will be used furthest in the future. This is impossible to implement in practice (requires knowing the future) but serves as an upper bound for comparison. Real policies typically achieve 90-95% of optimal performance on common workloads.
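Although unimplementable online, Belady's MIN is trivial to simulate offline when the whole access trace is known. A minimal sketch for a single fully-associative group of `capacity` lines:

```python
def belady_min_misses(trace, capacity):
    """Simulate Belady's optimal replacement on a known trace of line addresses."""
    cache, misses = set(), 0
    for i, line in enumerate(trace):
        if line in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            # Evict the resident line whose next use is furthest in the future
            def next_use(resident):
                try:
                    return trace.index(resident, i + 1)
                except ValueError:
                    return float('inf')      # never used again: perfect victim
            cache.remove(max(cache, key=next_use))
        cache.add(line)
    return misses

print(belady_min_misses([1, 2, 3, 1, 2, 4, 1, 2], capacity=3))   # -> 4
```

Running a real policy and this oracle over the same trace is a common way to report how close the policy comes to optimal.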
We've covered the fundamental unit of cache operation. Let's consolidate the key concepts:
Cache Lines: The cache moves and tracks data in fixed-size lines (typically 64 bytes), not individual bytes.
Address Decomposition: Every address splits into tag, index, and offset; the index selects a set and the tag identifies which block occupies a line.
Associativity: More ways reduce conflict misses at the cost of extra comparisons, power, and complexity.
Metadata: Tags, valid/dirty bits, replacement state, coherence state, and ECC add roughly 20% to the cache's storage.
Line Size: 64 bytes balances spatial locality against bandwidth waste and false sharing, and it shapes how software should lay out and traverse data.
Replacement: Practical caches approximate LRU cheaply (PLRU) or add scan resistance (RRIP); Belady's MIN bounds what any policy can achieve.
What's Next:
We've seen how cache lines are organized and looked up. But what happens when data is written? The next page covers Write-Through vs Write-Back policies—the critical decision of when and how modified data propagates through the cache hierarchy to main memory.
You now understand cache lines—the fundamental building blocks of cache systems. You know how addresses decompose, how associativity works, and why cache-aware programming matters. Next, we'll explore how written data moves through the cache hierarchy.