Main memory, commonly referred to as RAM (Random Access Memory), is the primary workspace of a computer system. It holds the operating system kernel, running applications, and their data during execution. While caches provide speed and storage provides capacity, main memory occupies the critical middle ground—large enough to hold active workloads, fast enough to keep the CPU reasonably fed.
From an operating system perspective, main memory is one of the most precious and carefully managed resources. The OS must:
- track which physical page frames are free, in use, or reserved;
- allocate frames to processes and to the kernel itself;
- reclaim memory when it runs low (trimming the page cache, swapping);
- place memory sensibly on NUMA systems.
Understanding how RAM works at the hardware level is foundational to understanding all of these OS functions.
By the end of this page, you will understand: how DRAM technology works at the cell level; how memory is organized into channels, DIMMs, ranks, and banks; the evolution of DDR memory standards; memory timing and access patterns; memory controllers and their scheduling algorithms; and how the OS views and manages physical memory.
Main memory in nearly all modern computers uses DRAM (Dynamic Random Access Memory), a technology that stores each bit as a charge in a tiny capacitor. Understanding DRAM's fundamental operation explains many of its performance characteristics and why it behaves differently from caches.
The DRAM cell:
Each DRAM cell consists of:
- one access transistor, which connects the cell to its bitline when the row's wordline is asserted; and
- one capacitor, whose charge (present or absent) stores the bit.
This "1T1C" design is extremely compact—the smallest practical memory cell—which is why DRAM achieves high densities at low cost. Compare this to SRAM (cache), which uses 6 transistors per bit.
The problem with capacitors:
Capacitors leak charge over time. Left alone, a DRAM cell would lose its stored value within milliseconds. This creates two critical consequences:
- every row must be periodically rewritten (refreshed), typically every 32-64 ms, which is what makes this memory "dynamic"; and
- refresh consumes power and bandwidth, since a bank cannot serve requests while it is being refreshed.
Sensing the bit:
DRAM cells are arranged in a 2D grid of rows and columns. Reading a cell involves:
1. precharging the bitlines to a reference voltage;
2. activating the target row (asserting its wordline), which shares each cell's tiny charge onto its bitline;
3. letting the sense amplifiers detect and amplify the resulting voltage swing, latching the entire row; and
4. selecting the desired columns from the latched row and driving them onto the data pins.
This process takes tens of nanoseconds—orders of magnitude slower than an SRAM read. The sense amplifiers act as a row buffer, holding the entire activated row (typically 8KB) for subsequent column accesses.
When a row is activated, subsequent accesses to different columns within that row are much faster (column access time ~15ns) than accesses requiring a new row activation (row access time ~30-50ns). This is row buffer locality—a critical consideration for memory-efficient code and OS page allocation.
| Characteristic | DRAM | SRAM |
|---|---|---|
| Transistors per bit | 1 | 6 |
| Density | Very high | Low |
| Cost per bit | Low ($) | High ($$$) |
| Speed | ~10-20 ns | ~1-2 ns |
| Power (static) | Low (but refresh) | Higher (leakage) |
| Volatility | Volatile | Volatile |
| Refresh required | Yes (every 32-64 ms) | No |
| Typical use | Main memory | Cache memory |
Modern memory systems are hierarchically organized to maximize bandwidth and parallelism while managing physical constraints. Understanding this organization is essential for understanding memory performance characteristics.
Memory hierarchy (from CPU outward):
Memory controller (on the CPU die) → memory channels → DIMMs (memory modules) → ranks (sets of chips that respond together) → DRAM chips → banks → rows and columns of cells.
Address mapping example:
When the memory controller receives a physical address, it decodes it into:
- channel select bits,
- rank select bits,
- bank group and bank select bits,
- row address bits, and
- column address bits (see the decoding sketch below).
The exact bit positions vary by system and can significantly impact performance. Interleaving lower address bits across channels/banks improves parallelism for sequential accesses.
Modern memory controllers exploit bank-level parallelism: while one bank is activating a row (slow), another bank can be serving a read from its already-activated row (fast). Address mappings that spread sequential accesses across banks achieve higher bandwidth than those that concentrate accesses in one bank.
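To make the decoding concrete, here is a minimal sketch of one possible mapping. The field widths and bit positions are illustrative assumptions (two channels, 4 bank groups of 4 banks, 8 KB rows), not the layout of any particular controller — real controllers often also hash the bank bits to avoid pathological conflicts.

```c
#include <stdint.h>
#include <stdio.h>

// Illustrative physical-address decoding (field widths are assumptions, not a
// real controller's layout). The low 6 bits select the byte within a 64-byte
// cache line; the next bits interleave consecutive lines across channels and
// bank groups so that sequential streams spread out.
typedef struct {
    unsigned channel, bank_group, bank, rank, column, row;
} dram_addr_t;

static dram_addr_t decode(uint64_t paddr) {
    dram_addr_t d;
    uint64_t line = paddr >> 6;               // 64-byte cache-line index
    d.channel    = line & 0x1;  line >>= 1;   // 1 bit  -> 2 channels
    d.bank_group = line & 0x3;  line >>= 2;   // 2 bits -> 4 bank groups
    d.bank       = line & 0x3;  line >>= 2;   // 2 bits -> 4 banks per group
    d.column     = line & 0x7F; line >>= 7;   // 7 bits -> 128 lines per 8 KB row
    d.rank       = line & 0x1;  line >>= 1;   // 1 bit  -> 2 ranks
    d.row        = (unsigned)line;            // remaining bits select the row
    return d;
}

int main(void) {
    // Consecutive cache lines land on alternating channels, so a sequential
    // stream uses both channels (and, further up, several banks) in parallel.
    for (uint64_t a = 0x40000000; a < 0x40000100; a += 64) {
        dram_addr_t d = decode(a);
        printf("%#llx -> ch %u bg %u bank %u row %u col %u\n",
               (unsigned long long)a, d.channel, d.bank_group, d.bank, d.row, d.column);
    }
    return 0;
}
```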
| Component | Typical Values | Purpose |
|---|---|---|
| Channels per DIMM | 2 | DDR5 splits each DIMM into 2 independent 32-bit channels |
| Ranks per channel | 1-2 | More ranks = more capacity but also more command bus contention |
| Banks per rank | 32 (8 bank groups × 4 banks) | More banks = more parallelism |
| Row buffer size | 8 KB per bank | Larger = more row buffer hits |
| Chip width | ×8 typical | Wider chips = fewer chips per rank |
DDR SDRAM (Double Data Rate Synchronous DRAM) is the dominant memory technology in computers. "Double data rate" means data is transferred on both the rising and falling edges of the clock signal, effectively doubling bandwidth compared to single data rate (SDR) memory at the same clock frequency.
Each DDR generation roughly doubles bandwidth through higher clock speeds and architectural improvements, while the underlying DRAM cell technology remains similar.
| Generation | Data Rate (MT/s) | Voltage | Bandwidth (per channel) | Key Features | Era |
|---|---|---|---|---|---|
| DDR | 200-400 | 2.5V | 1.6-3.2 GB/s | Double data rate, prefetch 2n | 2000-2003 |
| DDR2 | 400-1066 | 1.8V | 3.2-8.5 GB/s | Prefetch 4n, higher density | 2003-2008 |
| DDR3 | 800-2133 | 1.5V | 6.4-17 GB/s | Prefetch 8n, lower power | 2007-2015 |
| DDR4 | 1600-3200 | 1.2V | 12.8-25.6 GB/s | Bank groups, higher density | 2014-present |
| DDR5 | 3200-8800+ | 1.1V | 25.6-70+ GB/s | Dual channel per DIMM, on-DIMM power management, 32 banks | 2021-present |
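The per-channel bandwidth figures above follow directly from the transfer rate: a standard channel is 64 bits (8 bytes) wide, so peak bandwidth is simply transfers per second × 8 bytes (DDR5 splits the DIMM into two 32-bit subchannels, but the total width per DIMM is unchanged). A quick sketch of the arithmetic, using example speed grades from the table:

```c
#include <stdio.h>

// Peak bandwidth of a 64-bit (8-byte wide) channel: transfers/sec x 8 bytes.
// The speed grades below are examples matching the table above.
int main(void) {
    const char *grade[] = {"DDR-400", "DDR2-1066", "DDR3-2133",
                           "DDR4-3200", "DDR5-6400"};
    double mts[]        = {400, 1066, 2133, 3200, 6400};
    for (int i = 0; i < 5; i++) {
        double gb_per_s = mts[i] * 1e6 * 8.0 / 1e9;   // bytes/sec -> GB/s
        printf("%-10s ~%.1f GB/s per channel\n", grade[i], gb_per_s);
    }
    return 0;
}
```

Real sustained bandwidth is lower than these peaks because of refresh, bank conflicts, and read/write turnaround on the bus.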
Key evolutionary improvements:
Prefetch architecture: Each DDR generation increases the prefetch width—the number of bits fetched from the memory array in a single internal access. DDR5 prefetches 16n bits (16 bits per data pin per access). Higher prefetch enables higher external data rates while keeping internal array speeds manageable.
Bank groups: DDR4 introduced bank groups—clusters of banks that can serve back-to-back requests faster than banks in different groups. This helps maintain high bandwidth for interleaved access patterns.
DDR5 innovations:
- Each DIMM is split into two independent 32-bit subchannels, doubling the number of channels the controller can schedule.
- Power management moves onto the module (an on-DIMM PMIC), improving efficiency and signal integrity.
- 32 banks per rank, up from 16 in DDR4, for more bank-level parallelism.
- On-die ECC inside each DRAM chip to cope with the higher error rates of denser cells (distinct from the full ECC DIMMs discussed later).
While bandwidth has increased ~30× from DDR to DDR5, latency has improved only modestly. DDR5-4800 has similar absolute latency (~14-16 ns first access) to DDR-400. Modern architectures are increasingly bandwidth-limited, not latency-limited. Larger caches and out-of-order execution hide latency; parallelism exploits bandwidth.
DRAM operations are governed by precise timing requirements. Understanding these timings helps explain why memory access patterns dramatically affect performance and why the memory controller is so complex.
The fundamental timing parameters:
- tCL (CAS latency): time from issuing a column read (with the row already open) to the first data appearing on the bus.
- tRCD (RAS-to-CAS delay): time from activating a row until a column command may be issued.
- tRP (row precharge time): time to close (precharge) the currently open row before a different row in the same bank can be activated.
- tRAS (row active time): minimum time a row must stay open before it may be precharged.
Access latency scenarios:
Row buffer hit (best case): the requested row is already latched in the sense amplifiers, so only the column access is paid — tCL, roughly 14-15 ns.
Row buffer miss (row already activated, needs different row): the open row must first be precharged, the new row activated, then the column read issued — tRP + tRCD + tCL, roughly 40-50 ns.
Row buffer closed (no row activated): the bank is idle and already precharged, so the access pays tRCD + tCL, roughly 28-30 ns. (A worked conversion from cycles to nanoseconds follows below.)
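To put rough numbers on these cases, DRAM timings quoted in cycles can be converted to nanoseconds using the I/O clock (half the transfer rate). The sketch below assumes illustrative DDR4-3200 CL22-22-22 timings, not any specific module:

```c
#include <stdio.h>

// Convert DRAM timings (in clock cycles) to nanoseconds and compare the
// three access scenarios. Timings are illustrative (roughly DDR4-3200 CL22).
int main(void) {
    double clock_mhz    = 1600.0;          // DDR4-3200: 3200 MT/s = 1600 MHz clock
    double ns_per_cycle = 1000.0 / clock_mhz;
    int tCL = 22, tRCD = 22, tRP = 22;     // cycles (assumed timings)

    double hit    = tCL * ns_per_cycle;                 // row already open
    double closed = (tRCD + tCL) * ns_per_cycle;        // bank idle, precharged
    double miss   = (tRP + tRCD + tCL) * ns_per_cycle;  // wrong row currently open

    printf("row buffer hit:    %.1f ns\n", hit);        // ~13.8 ns
    printf("row buffer closed: %.1f ns\n", closed);     // ~27.5 ns
    printf("row buffer miss:   %.1f ns\n", miss);       // ~41.2 ns
    return 0;
}
```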
Impact on programming:
Sequential memory access achieves row buffer hits—only paying tCL for each cache line after the first. Random access within a large region pays full tRP + tRCD + tCL for most accesses. This 3-4× latency difference is why access patterns matter so much.
```c
#include <stddef.h>

// Sequential access: achieves row buffer hits
// ~12 GB/s on a typical DDR4 single-channel system
long sequential_sum(int* arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];              // Sequential: exploits row buffer
    }
    return sum;
}

// Random access: constant row buffer misses
// ~1-2 GB/s on the same system (6-10× slower!)
long random_sum(int* arr, size_t n, size_t* indices) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[indices[i]];     // Random: row miss each time
    }
    return sum;
}

// Strided access: may or may not hit the row buffer
// Stride < 8 KB: stays in the same row; stride >= 8 KB: always misses
long strided_sum(int* arr, size_t n, size_t stride) {
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        sum += arr[i];              // Stride-dependent performance
    }
    return sum;
}
```

The memory controller is the hardware unit that translates CPU memory requests into DRAM commands (activate, precharge, read, write, refresh). Modern memory controllers are integrated into the CPU die and implement sophisticated scheduling algorithms to maximize performance.
Memory controller responsibilities:
- queue and reorder incoming read/write requests;
- translate them into DRAM command sequences while respecting every timing constraint (tRCD, tRP, tCL, tRAS, refresh intervals);
- decide when to keep rows open or close them (row buffer policy);
- schedule the periodic refresh of every row;
- arbitrate fairly among requests from multiple cores and devices.
Command scheduling policies:
FCFS (First Come First Served): Process requests in arrival order. Simple but poor performance—ignores row buffer locality.
FR-FCFS (First Ready - First Come First Served): Prioritize row buffer hits over misses, using FCFS as a tiebreaker. Much better performance, but it can starve requests to other rows if one row stays hot (a simplified sketch of the selection logic follows this list).
ATLAS (Adaptive per-Thread Least-Attained-Service): Tracks how much service each thread has received; prioritizes under-served threads. Improves fairness in multi-core systems.
BLISS (Blacklisting memory scheduler): Identifies and temporarily de-prioritizes memory-hogging threads to prevent them from blocking others. Improves quality of service for latency-sensitive workloads.
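To illustrate the FR-FCFS idea, here is a deliberately simplified model of the per-bank selection step (a sketch only, not how any real controller is built): among pending requests, service the oldest one that hits the currently open row; if there is none, fall back to the oldest request overall.

```c
#include <stddef.h>

// Simplified FR-FCFS selection over one bank's request queue. A request is a
// "row hit" if it targets the row currently open in that bank. This is an
// illustrative model, not a hardware implementation.
typedef struct {
    unsigned row;           // target DRAM row
    unsigned long arrival;  // arrival timestamp (smaller = older)
    int valid;              // slot holds a pending request
} request_t;

// Returns the index of the request to service next, or -1 if the queue is empty.
int fr_fcfs_pick(const request_t *q, size_t n, unsigned open_row, int row_is_open) {
    int oldest_hit = -1, oldest_any = -1;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].valid)
            continue;
        // Track the oldest pending request (the plain FCFS fallback).
        if (oldest_any < 0 || q[i].arrival < q[oldest_any].arrival)
            oldest_any = (int)i;
        // Track the oldest request that would be a row buffer hit ("first ready").
        if (row_is_open && q[i].row == open_row &&
            (oldest_hit < 0 || q[i].arrival < q[oldest_hit].arrival))
            oldest_hit = (int)i;
    }
    return oldest_hit >= 0 ? oldest_hit : oldest_any;
}
```

The starvation risk mentioned above is visible here: as long as hits to the open row keep arriving, an older request to a different row is never chosen, which is why real schedulers add age caps or fairness mechanisms.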
Row buffer policies:
- Open-page policy: leave the row open after an access, betting that the next access to the bank hits the same row. Best for workloads with spatial locality.
- Closed-page policy: precharge immediately after each access so the next access never pays tRP. Better for random traffic.
- Adaptive policies: modern controllers monitor recent hit rates per bank and switch between the two.
The memory controller operates below the OS abstraction layer—the operating system cannot directly control scheduling decisions. However, the OS affects memory behavior through physical page placement, huge pages (which improve row buffer hit rates), and NUMA-aware allocation.
The operating system must manage physical memory as a precious, finite resource. The OS doesn't see DRAM cells and timing parameters—it sees a contiguous range of physical addresses that must be partitioned, tracked, and allocated efficiently.
Physical address space layout:
Not all physical addresses correspond to RAM. The physical address space includes:
- usable RAM regions (often with holes between them);
- memory-mapped I/O regions (device registers, framebuffers, PCIe BARs);
- firmware and option ROM areas; and
- regions reserved by the platform (ACPI tables, SMM memory).
The BIOS/UEFI provides a memory map to the OS at boot time, describing which regions are usable RAM, reserved, or memory-mapped I/O.
Page-based memory management:
The OS manages physical memory in fixed-size units called page frames (typically 4 KB on x86). The page frame allocator tracks which frames are:
- free and available for allocation;
- in use by user processes;
- in use by the kernel (code, data, DMA buffers);
- holding page cache contents; or
- reserved and unusable (firmware regions, device memory).
Key data structures:
- a per-frame metadata array (in Linux, the struct page array, mem_map) recording each frame's state, reference count, and owner;
- free lists grouped by block size, managed by a buddy allocator so that physically contiguous runs can be found quickly; and
- per-zone bookkeeping (DMA, Normal, and so on) reflecting hardware addressing constraints.

A toy illustration of the frame-tracking idea appears below.
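As a sketch of the bookkeeping (not how Linux's buddy allocator actually works), a bitmap allocator keeps one bit per frame and scans for a free one. The sizes and the linear scan are simplifications for clarity.

```c
#include <stdint.h>
#include <stddef.h>

// Toy bitmap page-frame allocator: one bit per 4 KB frame.
// Real kernels use richer structures (e.g. Linux's buddy allocator plus the
// per-frame struct page array); this only shows the bookkeeping idea.
#define FRAME_SIZE  4096UL
#define NUM_FRAMES  (1UL << 20)               /* enough to track 4 GB of RAM */
#define NO_FRAME    ((uint64_t)-1)

static uint8_t frame_bitmap[NUM_FRAMES / 8];  /* bit = 1 -> frame in use */

static int  frame_in_use(size_t f) { return (frame_bitmap[f / 8] >> (f % 8)) & 1; }
static void frame_mark(size_t f)   { frame_bitmap[f / 8] |=  (uint8_t)(1u << (f % 8)); }
static void frame_unmark(size_t f) { frame_bitmap[f / 8] &= (uint8_t)~(1u << (f % 8)); }

// Allocate one free frame; returns its physical address, or NO_FRAME if
// physical memory is exhausted. (Linear scan: simple, not fast.)
uint64_t alloc_frame(void) {
    for (size_t f = 0; f < NUM_FRAMES; f++) {
        if (!frame_in_use(f)) {
            frame_mark(f);
            return (uint64_t)f * FRAME_SIZE;
        }
    }
    return NO_FRAME;
}

// Return a previously allocated frame to the free pool.
void free_frame(uint64_t paddr) {
    frame_unmark((size_t)(paddr / FRAME_SIZE));
}
```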
Modern CPUs support larger page sizes (2 MB and 1 GB on x86-64). Huge pages reduce TLB misses for large allocations and improve row buffer locality (a 2 MB huge page spans ~250 DRAM rows, keeping more accesses on the same row). Databases, VMs, and HPC applications commonly use huge pages for performance.
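On Linux, an application can request huge pages explicitly with mmap's MAP_HUGETLB flag; this succeeds only if hugetlb pages have been reserved (for example via /proc/sys/vm/nr_hugepages), and transparent huge pages may otherwise be applied automatically. A minimal sketch:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

// Map 64 MB backed by 2 MB huge pages. Requires reserved hugetlb pages,
// e.g.: echo 64 > /proc/sys/vm/nr_hugepages (and possibly privileges).
int main(void) {
    size_t len = 64UL << 20;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   // most likely: no hugetlb pages reserved
        return 1;
    }
    memset(p, 0, len);                 // touch the range so pages are populated
    printf("mapped %zu MB using huge pages at %p\n", len >> 20, p);
    munmap(p, len);
    return 0;
}
```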
Memory pressure and reclamation:
When physical memory runs low, the OS must reclaim pages. Strategies include:
- dropping clean page cache pages (they can be re-read from disk);
- writing back dirty page cache pages, then freeing them;
- swapping out anonymous pages (heap, stack) to swap space;
- compacting memory to create contiguous free blocks; and
- as a last resort, killing a process (the Linux OOM killer).
The OS continuously balances page cache size (for I/O performance) against free memory (for allocation headroom). The kswapd daemon in Linux proactively reclaims pages when free memory drops below thresholds.
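One simple way to observe memory pressure from user space is to read /proc/meminfo: MemAvailable estimates how much memory could be made available without swapping (it includes reclaimable page cache), so it is usually a better indicator than MemFree. A small sketch:

```c
#include <stdio.h>
#include <string.h>

// Print the MemTotal, MemFree and MemAvailable lines from /proc/meminfo.
// MemAvailable accounts for reclaimable page cache, so it reflects how much
// headroom the system really has before reclaim or swapping kicks in.
int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("/proc/meminfo"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (!strncmp(line, "MemTotal:", 9) ||
            !strncmp(line, "MemFree:", 8) ||
            !strncmp(line, "MemAvailable:", 13))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```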
NUMA (Non-Uniform Memory Access) describes architectures where memory access time depends on which processor accesses which memory. In NUMA systems, each processor (or group of processors) has "local" memory that it can access faster than "remote" memory attached to other processors.
Why NUMA exists:
As core counts increased, memory bandwidth became a bottleneck. If all cores share a single memory controller, that controller becomes a chokepoint. NUMA distributes memory controllers across sockets, providing:
- higher aggregate bandwidth (each socket has its own controllers and channels);
- lower latency to local memory; and
- capacity and bandwidth that scale with the number of sockets.
| Access Type | Latency (ns) | Relative Cost | Bandwidth |
|---|---|---|---|
| Local memory (same socket) | ~70-80 | 1x | Full local bandwidth |
| Remote memory (other socket) | ~120-150 | 1.5-2x | Shared interconnect |
| Cross-NUMA write | ~150-200 | 2-2.5x | Often worse than reads |
OS NUMA support:
Operating systems expose NUMA topology to applications and implement NUMA-aware policies:
Allocation policies:
- Local (first-touch): allocate a page on the node of the CPU that first touches it — the default on Linux.
- Interleave: spread pages round-robin across nodes, trading latency for balanced bandwidth.
- Bind / preferred: restrict (or prefer) allocation to specific nodes, set via mbind()/set_mempolicy() or numactl.
Process scheduling:
- The scheduler tries to keep threads running on the node that holds their memory.
- Linux's automatic NUMA balancing periodically samples access patterns and migrates pages (or tasks) to improve locality.
Linux tools:
- `numactl`: Run programs with specific NUMA policies
- `numastat`: Display NUMA memory statistics
- `/sys/devices/system/node/`: NUMA topology information
- `mbind()`, `set_mempolicy()`: System calls for memory policies (a libnuma-based example follows below)

NUMA-unaware applications can suffer severe performance degradation. Common mistakes: allocating all memory on first touch from one thread (so it all lands on one node), spawning threads that access memory allocated by other threads, and reading/writing shared data structures from multiple sockets. Profiling with `perf stat -e numa_hit,numa_miss` reveals NUMA access patterns.
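For programmatic placement, the libnuma library (a wrapper around mbind()/set_mempolicy(); link with -lnuma) is commonly used. The sketch below allocates a buffer on one node and pins the calling thread there; the node number is purely illustrative.

```c
#include <numa.h>     // libnuma; build with: gcc numa_demo.c -lnuma
#include <stdio.h>
#include <string.h>

// Allocate a buffer bound to NUMA node 0 and run the calling thread on that
// node, so the memory it touches stays local. Node 0 is an illustrative choice.
int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    int node = 0;
    size_t len = 64UL << 20;                   // 64 MB
    void *buf = numa_alloc_onnode(len, node);  // pages bound to 'node'
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    numa_run_on_node(node);                    // keep this thread near its memory
    memset(buf, 0, len);                       // first touch happens on 'node'

    printf("allocated %zu MB on node %d (highest node id: %d)\n",
           len >> 20, node, numa_max_node());
    numa_free(buf, len);
    return 0;
}
```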
Memory errors occur in real systems—cosmic rays, electrical noise, manufacturing defects, and aging can all cause bit flips. For systems where reliability matters (servers, storage, scientific computing), ECC (Error Correcting Code) memory provides protection.
Types of memory errors:
- Soft errors: transient bit flips caused by cosmic-ray strikes or electrical noise. The data is corrupted but the hardware is fine; rewriting the location fixes it.
- Hard errors: permanent faults from manufacturing defects or aging, where the same cell, row, or chip fails repeatedly.
How ECC works:
ECC memory uses additional bits to store error detection/correction codes. The most common scheme is SECDED (Single Error Correction, Double Error Detection):
- every 64 bits of data are stored with 8 extra check bits (a 72-bit word; ECC DIMMs carry an extra DRAM chip per rank to hold them);
- on every read, the memory controller recomputes the code and compares it with the stored check bits;
- a single flipped bit produces a syndrome that pinpoints its position, so it is corrected transparently; and
- two flipped bits are detected (and typically reported as a machine-check event) but cannot be corrected.

A toy illustration of the mechanism follows.
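The sketch below shows the mechanism on a toy scale — a SECDED Hamming code over 8 data bits with 4 check bits plus one overall parity bit. Real ECC DIMMs apply the same idea to 64 data bits with 8 check bits; this is an illustration, not the exact code used by any controller.

```c
#include <stdint.h>
#include <stdio.h>

// Toy SECDED Hamming code over 8 data bits. Codeword positions 1..12:
// positions 1, 2, 4, 8 hold check bits; 3, 5, 6, 7, 9, 10, 11, 12 hold data.
// An extra overall-parity bit distinguishes single from double errors.
static const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};

typedef struct { uint16_t bits; int overall; } codeword_t;  // bits: positions 1..12

static codeword_t encode(uint8_t data) {
    codeword_t c = {0, 0};
    for (int i = 0; i < 8; i++)
        if (data >> i & 1) c.bits |= (uint16_t)(1 << data_pos[i]);
    // Each check bit p makes the parity of all positions containing p even.
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && (c.bits >> pos & 1)) parity ^= 1;
        if (parity) c.bits |= (uint16_t)(1 << p);
    }
    for (int pos = 1; pos <= 12; pos++) c.overall ^= c.bits >> pos & 1;
    return c;
}

// Returns 0 = clean, 1 = single-bit error corrected, 2 = double error detected.
static int decode(codeword_t *c, uint8_t *data_out) {
    int syndrome = 0, parity = 0;
    for (int pos = 1; pos <= 12; pos++)
        if (c->bits >> pos & 1) { syndrome ^= pos; parity ^= 1; }
    int status = 0;
    if (syndrome && parity != c->overall) {      // single error: syndrome = position
        c->bits ^= (uint16_t)(1 << syndrome);
        status = 1;
    } else if (syndrome) {                       // syndrome set but parity consistent
        status = 2;                              // => two bits flipped, uncorrectable
    }
    uint8_t d = 0;
    for (int i = 0; i < 8; i++)
        if (c->bits >> data_pos[i] & 1) d |= (uint8_t)(1 << i);
    *data_out = d;
    return status;
}

int main(void) {
    codeword_t c = encode(0xA7);
    c.bits ^= 1 << 6;                            // simulate a cosmic-ray bit flip
    uint8_t d;
    int status = decode(&c, &d);
    printf("status %d, data %#x\n", status, (unsigned)d);  // status 1, data 0xa7
    return 0;
}
```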
ECC overhead:
- Capacity: 8 check bits per 64 data bits means 12.5% more DRAM, which is why ECC DIMMs are wider (72-bit) and cost more.
- Latency: checking and correcting adds a small delay to every access.
- Platform: the memory controller and motherboard must support ECC; many consumer platforms do not.
Error rates in practice:
Google's study (2009) found:
- correctable error rates in the field were far higher than earlier laboratory estimates;
- roughly a third of machines, and more than 8% of DIMMs, saw at least one correctable error per year;
- error rates correlated strongly with utilization and DIMM age, and only weakly with temperature; and
- hard errors were more common relative to soft errors than previously assumed.
For enterprise workloads and systems with large memory (TB scale), ECC is essential—without it, silent data corruption is statistically likely.
Advanced ECC schemes like Intel's Chipkill or AMD's SDDC (Single Device Data Correction) can correct all errors caused by the complete failure of a single DRAM chip (e.g., an entire ×4 device). This protects against hard failures taking out a whole chip, not just single bit flips.
We've explored main memory in depth—from DRAM cell physics to operating system memory management. Main memory sits at a critical juncture in the memory hierarchy: large enough to hold working data, but slow enough that access patterns dramatically affect performance.
What's next:
With volatile memory covered, we'll now explore secondary storage—the persistent tier of the memory hierarchy. We'll examine storage technologies from magnetic disks to SSDs, their performance characteristics, and how they interface with the operating system through storage drivers and file systems.
You now understand main memory from DRAM physics to OS management. This knowledge is essential for writing memory-efficient software, understanding system performance, and designing operating system memory management subsystems.