When a processor accesses a single byte from memory, the cache doesn't fetch just that byte—it fetches an entire cache line (also called a cache block). This seemingly simple design decision has profound implications for system performance, memory architecture, and software optimization.
Cache lines are the fundamental unit of data transfer between cache levels and main memory. Understanding their organization is essential for understanding how caches actually work—and for writing code that doesn't fight against the hardware.
This page covers cache line structure and organization, addressing schemes (direct-mapped, set-associative, fully-associative), tag matching and lookup mechanics, and the critical implications of cache line size. You'll understand why cache lines exist and how they affect everything from memory bandwidth to false sharing bugs.
A cache line (or cache block) is the smallest unit of data that can be transferred between the cache and main memory (or between cache levels). Key characteristics:
Size: Typically 64 bytes on modern x86 and ARM processors; some designs use 32 or 128 bytes.
Alignment: A cache line is aligned to its own size, so a 64-byte line always begins at an address that is a multiple of 64.
Atomic Unit: The line is fetched, evicted, and tracked as a whole; the cache never holds a partial line.
Why Not Fetch Just the Byte We Need?
Fetching cache lines instead of individual bytes exploits spatial locality and amortizes memory access overhead:
Spatial Locality Exploitation: When you access byte N, you're likely to access bytes N+1, N+2, etc. By fetching 64 bytes, subsequent accesses are 'free' cache hits.
Memory Bus Efficiency: Modern memory systems transfer data on wide buses (256-512 bits). A 64-byte transfer uses the bus efficiently; fetching 1 byte would waste most of the bus width.
DRAM Access Patterns: DRAM accesses have long latency but high bandwidth once started. Reading 64 bytes takes only slightly longer than reading 8 bytes—most of the time is row activation and column selection.
Metadata Overhead: Each cache line requires tag storage, valid bits, and potentially dirty bits. Smaller lines would require more metadata per byte of data stored.
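A rough back-of-envelope calculation makes the amortization concrete. The sketch below uses illustrative latency and bandwidth numbers (they are assumptions for the example, not measurements of any specific memory part):

```python
# Illustrative numbers (assumptions): ~50 ns to open a DRAM row and reach the
# first data, then ~0.5 ns per additional 8-byte beat on the memory bus.
ROW_ACCESS_NS = 50.0
NS_PER_8B_BEAT = 0.5

def effective_bandwidth(bytes_fetched):
    """Useful bytes delivered per nanosecond if we fetch `bytes_fetched` bytes."""
    beats = max(1, bytes_fetched // 8)
    total_ns = ROW_ACCESS_NS + beats * NS_PER_8B_BEAT
    return bytes_fetched / total_ns

print(f"{effective_bandwidth(8):.2f} B/ns for an 8-byte fetch")   # ~0.16 B/ns
print(f"{effective_bandwidth(64):.2f} B/ns for a 64-byte line")   # ~1.19 B/ns
```

Under these assumptions the 64-byte fetch takes only about 7% longer than the 8-byte fetch yet delivers eight times the data, which is exactly the amortization argument above.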
Every memory address can be decomposed into three parts for cache lookup purposes. Understanding this decomposition is key to understanding cache operation.
Address Decomposition:
For a cache with 2^S sets and 2^B byte blocks (cache lines), an N-bit address is split as:
|<----- Tag ----->|<--- Index --->|<--- Offset --->|
     N-S-B bits         S bits          B bits
Components:
Offset (B bits): Selects the byte within the cache line.
Index (S bits): Selects which set of the cache to search.
Tag (N-S-B bits): Identifies which memory block currently occupies the line; it is stored alongside the data and compared on every lookup.
The index bits determine which lines conflict for cache space. Two addresses with the same index bits (regardless of tag) compete for the same cache set. This is why access patterns whose addresses differ by a multiple of (number of sets × line size) cause conflict misses: they hit the same set repeatedly.
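As a concrete illustration, here is a small Python sketch that decomposes an address, assuming a cache with 64-byte lines and 64 sets (so 6 offset bits and 6 index bits):

```python
OFFSET_BITS = 6   # 64-byte lines -> 6 offset bits (assumed geometry)
INDEX_BITS = 6    # 64 sets       -> 6 index bits (assumed geometry)

def decompose(address):
    """Split an address into (tag, index, offset) for the assumed geometry."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# 0x12345 -> tag = 0x12, index = 0x0D, offset = 0x05
print(tuple(hex(x) for x in decompose(0x12345)))
```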
Associativity refers to how many different cache locations a given memory address can occupy. This critically affects conflict misses and hardware complexity.
Direct-Mapped Cache (1-way associative):
Each memory address maps to exactly ONE cache location. The index bits directly select the cache line.
Pros: Simplest hardware, a single tag comparison per lookup, and the lowest hit latency and power of any organization.
Cons: High conflict-miss rate; two hot addresses that share an index evict each other even if the rest of the cache sits empty.
Example Conflict:
Address A: index = 42, tag = X
Address B: index = 42, tag = Y
Accessing A, B, A, B, A, B...
→ Every access is a miss! (ping-pong effect)
Used In: Rare in modern general-purpose CPUs (see the comparison table below); still found in some simple embedded designs where minimal latency and power matter more than hit rate.
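The ping-pong effect is easy to reproduce in simulation. A minimal Python sketch of a direct-mapped cache, with an assumed geometry of 64 sets of 64-byte lines, shows every access in the A, B, A, B pattern missing:

```python
def simulate_direct_mapped(accesses, num_sets=64, line_size=64):
    """Count hits/misses for a direct-mapped cache (sketch, assumed geometry)."""
    tags = [None] * num_sets          # one tag per set (1-way)
    hits = misses = 0
    for addr in accesses:
        index = (addr // line_size) % num_sets
        tag = addr // (line_size * num_sets)
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag         # evict whatever was there
    return hits, misses

# A and B differ by exactly num_sets * line_size bytes: same index, different tag
A, B = 0x10000, 0x10000 + 64 * 64
print(simulate_direct_mapped([A, B] * 4))   # -> (0, 8): every access misses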
| Property | Direct-Mapped | 4-Way Set | 8-Way Set | Fully Associative |
|---|---|---|---|---|
| Conflict Misses | Highest | Moderate | Low | Minimal |
| Tag Comparisons/Lookup | 1 | 4 | 8 | All entries |
| Hardware Complexity | Lowest | Moderate | Higher | Highest |
| Power Consumption | Lowest | Moderate | Higher | Highest |
| Hit Latency | Lowest | Low | Moderate | Variable |
| Typical Use | Rare | L1 caches | L2/L3 caches | TLBs, small caches |
A cache lookup involves several coordinated steps. Understanding the mechanics helps explain cache timing and power characteristics.
Cache Lookup Procedure:
1. Decompose the address into tag, index, and offset.
2. Use the index to select one set.
3. Compare the tag against the stored tag of every valid line in that set (done in parallel in hardware).
4. On a hit, read the requested bytes at the given offset from the matching line and update the replacement state; on a miss, fetch the line from the next level and install it, evicting a victim if the set is full.
Timing Considerations:
For an L1 cache with 1-4 cycle hit latency, all these steps must complete in 1-4 clock cycles. This extreme timing constraint drives several design decisions:
Way Prediction: Some caches predict which way will hit and start reading that data early. If prediction is wrong, access takes extra cycles.
Serial vs Parallel Tag/Data Access: L1 caches typically read the tags and the data of all ways in parallel and then select the matching way, minimizing latency at the cost of power. L2/L3 caches often read the tags first and access only the matching way's data, trading an extra cycle or two for lower energy per access.
Virtually-Indexed Physically-Tagged (VIPT): L1 caches commonly index with virtual address bits (available before translation) while the stored tags are physical. This lets the TLB lookup and the set selection proceed in parallel, but it constrains the geometry: the index and offset bits should fit within the page offset so the index is identical in virtual and physical addresses, or aliasing must be handled explicitly.
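A quick sketch of that sizing constraint, using the simple "each way must fit in one page" formulation and assuming 4 KB pages:

```python
import math

def min_vipt_associativity(cache_size, page_size=4096):
    """Smallest associativity for which each way fits in one page, so the
    index + offset bits fall entirely within the page offset (simple rule)."""
    return math.ceil(cache_size / page_size)

# A 32 KB VIPT L1 with 4 KB pages needs each way <= 4 KB, i.e. at least 8 ways.
print(min_vipt_associativity(32 * 1024))   # -> 8
```

This is one reason many L1 data caches pair a 32 KB capacity with 8-way associativity.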
```python
def cache_lookup(address, cache, access_size=8):
    """
    Pseudocode for cache lookup in an N-way set-associative cache.
    """
    # Step 1: Decompose address
    offset = address & ((1 << cache.offset_bits) - 1)
    index = (address >> cache.offset_bits) & ((1 << cache.index_bits) - 1)
    tag = address >> (cache.offset_bits + cache.index_bits)

    # Step 2: Select the set
    cache_set = cache.sets[index]

    # Steps 3 & 4: Check tags in the set (in parallel in hardware)
    for way in range(cache.associativity):
        line = cache_set.lines[way]
        if line.valid and line.tag == tag:
            # CACHE HIT
            cache.update_lru(index, way)  # Update replacement state
            return {
                'hit': True,
                'data': line.data[offset:offset + access_size],
                'cycles': cache.hit_latency
            }

    # CACHE MISS
    return {
        'hit': False,
        'data': None,
        'cycles': cache.miss_penalty  # Must fetch from next level
    }


def cache_fill(address, data, cache):
    """
    Pseudocode for filling a cache line after a miss.
    """
    # Decompose address
    offset = address & ((1 << cache.offset_bits) - 1)
    index = (address >> cache.offset_bits) & ((1 << cache.index_bits) - 1)
    tag = address >> (cache.offset_bits + cache.index_bits)

    # Find victim using replacement policy
    cache_set = cache.sets[index]
    victim_way = cache.select_victim(index)  # LRU, random, etc.
    victim_line = cache_set.lines[victim_way]

    # Writeback if dirty
    if victim_line.valid and victim_line.dirty:
        writeback_address = (victim_line.tag << (cache.index_bits + cache.offset_bits)) | \
                            (index << cache.offset_bits)
        cache.writeback(writeback_address, victim_line.data)

    # Install new line
    victim_line.valid = True
    victim_line.tag = tag
    victim_line.data = data
    victim_line.dirty = False
    cache.update_lru(index, victim_way)
```

Each cache line carries metadata beyond the data itself. This metadata is essential for correct operation and replacement decisions.
| Field | Size (typical) | Purpose |
|---|---|---|
| Tag | 36-48 bits | Identifies which memory address is stored in this line |
| Valid Bit | 1 bit | Indicates if the line contains valid data (vs garbage after power-on) |
| Dirty Bit | 1 bit | Indicates if data has been modified (needs writeback on eviction) |
| LRU State | 2-4 bits (varies) | Tracks recency for replacement decisions (may be approximate) |
| MESI State | 2 bits | Coherence state: Modified, Exclusive, Shared, or Invalid (covered later) |
| ECC | 8 bits per 8 bytes (64 bits per line) | Error-correcting code for soft error detection/correction |
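One way to model this metadata in software, as a minimal sketch whose field widths follow the table (the class itself is illustrative, not how hardware stores it):

```python
from dataclasses import dataclass, field
from enum import Enum

class MESIState(Enum):
    MODIFIED = 0
    EXCLUSIVE = 1
    SHARED = 2
    INVALID = 3

@dataclass
class CacheLine:
    """One cache line with the metadata fields from the table above (sketch)."""
    tag: int = 0                       # which memory block is cached here
    valid: bool = False                # line holds real data, not power-on garbage
    dirty: bool = False                # modified since fill; needs writeback on eviction
    lru_state: int = 0                 # small counter used by approximate LRU
    mesi: MESIState = MESIState.INVALID
    data: bytearray = field(default_factory=lambda: bytearray(64))  # 64-byte payload
```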
Storage Overhead:
Let's calculate the overhead for a 32KB 8-way set-associative cache:
Data storage: 32 KB = 262,144 bits
Number of lines: 32KB / 64B = 512 lines
Per line metadata:
- Tag: 36 bits (for 48-bit physical address, 6-bit index, 6-bit offset)
- Valid: 1 bit
- Dirty: 1 bit
- LRU state: 3 bits (approximate for 8-way)
- MESI: 2 bits
- ECC: 64 bits (8 bits per 8 bytes)
Total: 107 bits per line
Total metadata: 512 × 107 = 54,784 bits ≈ 6.7 KB
Overhead: 6.7 KB / 32 KB = 21%
This 21% overhead represents the cost of the cache's 'intelligence'—the ability to know what data is present and how to manage it.
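The same arithmetic generalizes to other cache configurations. A small Python helper, with defaults that mirror the worked example above (48-bit physical addresses, 64-byte lines, 8 ways, 3 LRU bits, 2 MESI bits, 8 ECC bits per 8 bytes):

```python
def metadata_overhead(cache_bytes=32 * 1024, line_bytes=64,
                      phys_addr_bits=48, assoc=8,
                      lru_bits=3, mesi_bits=2, ecc_bits_per_8B=8):
    """Recompute the per-line metadata and total overhead from the worked example."""
    lines = cache_bytes // line_bytes
    sets = lines // assoc
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (sets - 1).bit_length()
    tag_bits = phys_addr_bits - index_bits - offset_bits
    per_line = (tag_bits + 1 + 1 + lru_bits + mesi_bits
                + ecc_bits_per_8B * (line_bytes // 8))
    total_bits = lines * per_line
    return per_line, total_bits, total_bits / (cache_bytes * 8)

print(metadata_overhead())   # -> (107, 54784, ~0.209), matching the 21% above
```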
The Valid Bit:
The valid bit seems simple but is crucial:
Power-On / Flush: After reset or a cache flush, every valid bit is cleared so that leftover garbage in the data array is never returned as real data.
Hit Condition: A lookup counts as a hit only when the tag matches AND the valid bit is set.
Invalidation: Coherence traffic or explicit invalidate operations simply clear the valid bit rather than touching the data.
The Dirty Bit:
The dirty bit tracks modifications:
Set on Write: In a write-back cache, a store that hits the line sets its dirty bit.
Checked on Eviction: A dirty victim must be written back to the next level before its slot is reused; a clean victim can simply be dropped.
(Write-through caches, covered on the next page, keep the next level up to date on every store and do not need a dirty bit.)
True LRU for 8-way associativity requires tracking 8! = 40,320 possible orderings per set, which is impractical in hardware. Real caches use approximate schemes such as tree-based Pseudo-LRU (PLRU) or RRIP that need only a few bits of state per line while providing 'good enough' replacement decisions.
Cache line size is a critical design decision that affects performance, power, and behavior. There's no universally optimal size—it's a tradeoff.
Why 64 Bytes?
The industry has converged on 64-byte cache lines as a sweet spot:
Spatial Locality: 64 bytes covers several adjacent array elements, a handful of pointers, or a small struct, so nearby accesses become hits.
DRAM Match: A standard DRAM burst (8 transfers on a 64-bit channel) delivers exactly 64 bytes, so line fills map cleanly onto the memory system.
Waste Control: Larger lines would fetch more unused data on sparse access patterns, raise miss penalties, and make false sharing more likely.
Metadata Balance: Smaller lines would multiply tag, state, and ECC overhead per byte of data.
Cache Line Size Impact on Software:
```c
// Structure padding to avoid false sharing in multi-threaded code
// Assume 64-byte cache lines
#include <stdalign.h>   // for alignas (C11)

// BAD: Two variables in the same cache line
struct shared_counters_bad {
    volatile int counter_thread_1;  // 4 bytes
    volatile int counter_thread_2;  // 4 bytes - same cache line!
};
// Both threads constantly invalidate each other's cache line

// GOOD: Padding to separate cache lines
struct shared_counters_good {
    volatile int counter_thread_1;  // 4 bytes
    char padding1[60];              // Pad to 64 bytes
    volatile int counter_thread_2;  // 4 bytes - different cache line!
    char padding2[60];              // Pad for alignment
};

// BETTER: Use language-provided alignment
struct shared_counters_best {
    alignas(64) volatile int counter_thread_1;
    alignas(64) volatile int counter_thread_2;
};

// Practical example: Array traversal stride effects
#define SIZE (1024 * 1024)
int array[SIZE];

void traversal_demo(void) {
    // Access every element (stride 1): excellent cache utilization
    // Each cache line (16 ints) is fully used before moving on
    for (int i = 0; i < SIZE; i++) {
        array[i] = 0;               // 1 miss per 16 accesses
    }

    // Access every 16th element (stride 16 = 64 bytes = cache line size)
    // Each cache line is fetched but only 1 element is used!
    for (int i = 0; i < SIZE; i += 16) {
        array[i] = 0;               // 1 miss per access = maximum cache waste
    }
}

// The strided version touches 16 times fewer elements, yet it incurs the same
// number of cache misses as the full stride-1 traversal.
```

Be aware of 64-byte boundaries in performance-critical code. Data structures that span cache line boundaries may require two cache line fetches for a single access. Similarly, unintended sharing of cache lines between threads causes devastating false sharing. Modern compilers and language standards offer alignas and similar mechanisms to control alignment.
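Rather than hard-coding 64, you can often query the line size at runtime. A small Python sketch for Linux systems that expose the standard sysfs attribute (the path is Linux-specific and may be absent elsewhere):

```python
def cache_line_size(default=64):
    """Read the line size of the first reported cache level from sysfs,
    falling back to a default when the attribute is unavailable."""
    path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
    try:
        with open(path) as f:
            return int(f.read())
    except (OSError, ValueError):
        return default

print(cache_line_size())   # typically 64 on x86-64 and most ARM cores
```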
When a cache set is full and a new line must be loaded, the cache must choose a victim to evict. The replacement policy makes this decision.
Common Replacement Policies:
| Policy | Description | Pros | Cons |
|---|---|---|---|
| LRU (Least Recently Used) | Evict line accessed longest ago | Good temporal locality exploitation | Expensive to track for high associativity |
| PLRU (Pseudo-LRU) | Tree-based approximation of LRU | Low overhead, reasonable performance | Not true LRU, can make suboptimal choices |
| Random | Evict random line in set | Zero tracking overhead, immune to pathological patterns | Doesn't exploit locality, unpredictable |
| FIFO (First-In First-Out) | Evict oldest installed line | Simple to implement | Doesn't consider access frequency |
| RRIP (Re-Reference Interval Prediction) | Predict re-use distance | Good for scan-resistant behavior | More complex, requires tuning |
LRU Implementation Challenge:
True LRU for an 8-way cache requires maintaining a total ordering of 8 elements. On every access, the accessed line must move to the 'most recent' position.
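For reference, true LRU is easy to express in software even though it is costly in hardware. A minimal sketch for a single set, tracking way numbers in recency order:

```python
class TrueLRUSet:
    """True LRU for one cache set: a recency-ordered list of way numbers (sketch)."""
    def __init__(self, associativity=8):
        self.order = list(range(associativity))  # front = LRU, back = MRU

    def touch(self, way):
        """Record an access: move the way to the most-recently-used position."""
        self.order.remove(way)
        self.order.append(way)

    def victim(self):
        """The least recently used way is the eviction candidate."""
        return self.order[0]

s = TrueLRUSet()
for way in [3, 1, 3, 7]:
    s.touch(way)
print(s.victim())   # -> 0 (never touched, so least recently used)
```

The list manipulation is trivial here, but in hardware every access would have to reshuffle the ordering state within the hit latency, which is why real caches approximate instead.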
Tree-Based Pseudo-LRU:
A binary tree with 7 internal nodes can approximately track LRU for 8 ways: each node stores one bit pointing toward the half of the ways that was used less recently. On every access, the bits along the path to the accessed way are set to point away from it; to choose a victim, the hardware simply follows the bits from the root down to a leaf. That is 7 bits per set instead of the roughly 16 bits a full ordering would require.
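A minimal sketch of tree-based PLRU for an 8-way set follows; the bit and node-indexing conventions here are one common choice, not the only one:

```python
class TreePLRUSet:
    """Tree-based Pseudo-LRU for one 8-way set: 7 bits arranged as a binary tree.
    bits[0] is the root; the children of node n are 2n+1 and 2n+2 (sketch)."""
    def __init__(self):
        self.bits = [0] * 7   # each bit points toward the less-recently-used half

    def touch(self, way):
        """On an access, make every bit on the path point AWAY from this way."""
        node = 0
        for level in (2, 1, 0):              # walk root -> leaf using the way's bits
            direction = (way >> level) & 1   # 0 = left subtree, 1 = right subtree
            self.bits[node] = 1 - direction  # point at the other half
            node = 2 * node + 1 + direction

    def victim(self):
        """Follow the bits from the root to find the pseudo-LRU way."""
        node, way = 0, 0
        for _ in range(3):
            direction = self.bits[node]
            way = (way << 1) | direction
            node = 2 * node + 1 + direction
        return way

s = TreePLRUSet()
for w in [0, 1, 2, 3]:
    s.touch(w)
print(s.victim())   # -> 4: the victim comes from the untouched half (ways 4-7)
```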
Scan Resistance:
Streaming workloads such as large database scans can thrash LRU caches: every newly fetched line is installed as most-recently-used even though it will never be referenced again, and in the process it evicts lines the rest of the program would have reused.
Policies like RRIP predict that newly-loaded lines have short re-use intervals (likely scan) and don't immediately make them MRU. This provides scan resistance while still exploiting temporal locality for truly repeated accesses.
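A simplified sketch of a 2-bit RRIP-style policy for one set is shown below; the constants and insertion value are illustrative, and real SRRIP/BRRIP implementations differ in detail:

```python
class RRIPSet:
    """Simplified 2-bit RRIP for one set: higher value = longer predicted re-reference."""
    MAX = 3                                     # "distant future": eviction candidate

    def __init__(self, associativity=8):
        self.rrpv = [self.MAX] * associativity  # everything starts as distant

    def hit(self, way):
        self.rrpv[way] = 0                      # a real reuse gets a near prediction

    def insert(self):
        """Pick a victim, install the new line with a long (but not max) prediction."""
        while self.MAX not in self.rrpv:        # age everyone until a candidate appears
            self.rrpv = [v + 1 for v in self.rrpv]
        victim = self.rrpv.index(self.MAX)
        self.rrpv[victim] = self.MAX - 1        # scan-resistant: not immediately MRU
        return victim

s = RRIPSet()
print(s.insert())   # -> 0: all ways start as candidates, so way 0 is chosen
s.hit(0)            # only genuine reuse promotes the line toward "near"
```

Because new lines are not inserted as most-recently-used, a one-pass scan ages out quickly while genuinely reused lines stay protected.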
The theoretical optimal replacement policy (Belady's MIN) evicts the line that will be used furthest in the future. This is impossible to implement in practice (requires knowing the future) but serves as an upper bound for comparison. Real policies typically achieve 90-95% of optimal performance on common workloads.
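Although unimplementable online, Belady's MIN is trivial to simulate offline when the whole access trace is known. A minimal sketch for a single fully-associative group of `capacity` lines:

```python
def belady_min_misses(trace, capacity):
    """Simulate Belady's optimal replacement on a known trace of line addresses."""
    cache, misses = set(), 0
    for i, line in enumerate(trace):
        if line in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            # Evict the resident line whose next use is furthest in the future
            def next_use(resident):
                try:
                    return trace.index(resident, i + 1)
                except ValueError:
                    return float('inf')      # never used again: perfect victim
            cache.remove(max(cache, key=next_use))
        cache.add(line)
    return misses

print(belady_min_misses([1, 2, 3, 1, 2, 4, 1, 2], capacity=3))   # -> 4
```

Running a real policy and this oracle over the same trace is a common way to report how close the policy comes to optimal.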
We've covered the fundamental unit of cache operation. Let's consolidate the key concepts:
Cache Lines: The cache moves and tracks data in fixed-size lines (typically 64 bytes), not individual bytes.
Address Decomposition: Every address splits into tag, index, and offset; the index selects a set and the tag identifies which block occupies a line.
Associativity: More ways reduce conflict misses at the cost of extra comparisons, power, and complexity.
Metadata: Tags, valid/dirty bits, replacement state, coherence state, and ECC add roughly 20% to the cache's storage.
Line Size: 64 bytes balances spatial locality against bandwidth waste and false sharing, and it shapes how software should lay out and traverse data.
Replacement: Practical caches approximate LRU cheaply (PLRU) or add scan resistance (RRIP); Belady's MIN bounds what any policy can achieve.
What's Next:
We've seen how cache lines are organized and looked up. But what happens when data is written? The next page covers Write-Through vs Write-Back policies—the critical decision of when and how modified data propagates through the cache hierarchy to main memory.
You now understand cache lines—the fundamental building blocks of cache systems. You know how addresses decompose, how associativity works, and why cache-aware programming matters. Next, we'll explore how written data moves through the cache hierarchy.