In the vast landscape of the memory hierarchy, two tiers dominate the database professional's daily concerns: main memory (RAM) and persistent storage (disk). While caches accelerate and tapes archive, it is the boundary between RAM and disk that determines whether your database runs at blazing speed or grinding slowness.
This divide isn't merely a matter of degree—it represents fundamentally different technologies with fundamentally different behaviors. RAM stores data as electrical charges in capacitors, delivering nanosecond access with no mechanical movement. Disks—whether spinning magnetic platters or solid-state flash—store data in persistent physical states, requiring microseconds to milliseconds for each access.
Understanding the detailed characteristics of each technology is essential because every database optimization ultimately reduces to one question: How do we minimize the expensive crossings between these two worlds?
By the end of this page, you will understand the internal architecture of RAM and disk technologies (HDD and SSD), their detailed performance characteristics, how access patterns affect each differently, and the strategies databases employ to bridge the RAM-disk divide efficiently.
Main memory, commonly called RAM (Random Access Memory), is the primary working space for active database operations. When queries execute, tables are scanned, or indexes are traversed, the data flows through RAM. Understanding RAM's architecture explains both its remarkable speed and its fundamental limitation—volatility.
The DRAM Cell:
Modern RAM uses Dynamic RAM (DRAM) technology. Each bit is stored in a tiny structure consisting of:
- One transistor, acting as a switch that controls access to the cell
- One capacitor, holding the electrical charge that represents a 0 or 1

This simplicity—just two components per bit (the "1T1C" cell)—enables incredible density. Modern DRAM chips pack billions of cells onto a single die.
Why "Dynamic"?
The capacitor holding each bit leaks charge over time. Within milliseconds, the stored charge would dissipate, losing the data. To prevent this, DRAM controllers continuously refresh each cell, reading and rewriting its value before the charge decays. This refresh cycle consumes time and power—a tax for DRAM's density advantage over static alternatives.
| Type | Structure | Speed | Density | Power | Use Case |
|---|---|---|---|---|---|
| SRAM (Static) | 6 transistors/bit | Fastest (~1ns) | Lower | Lower when idle | CPU caches |
| DRAM (Dynamic) | 1T1C per bit | Fast (~50-100ns) | Higher | Refresh overhead | Main memory |
| DDR4 SDRAM | DRAM + synchronous clock | ~15-20ns effective | High | Moderate | Current standard |
| DDR5 SDRAM | DDR4 + improvements | ~12-14ns effective | Higher | Improved | Latest systems |
| HBM (High Bandwidth) | Stacked DRAM | Wide interface | Very high | Lower per bit | GPUs, accelerators |
DRAM Organization:
DRAM is organized hierarchically:
- Channels: independent paths between the memory controller and the DIMMs
- DIMMs: the physical memory modules plugged into the motherboard
- Ranks: sets of chips on a DIMM that respond to a command together
- Banks: independent cell arrays within each chip that can operate in parallel
- Rows and columns: the two-dimensional grid of cells within each bank
Accessing DRAM:
A memory access follows this sequence:
1. Row activation (RAS): the target row is opened and copied into the bank's row buffer
2. Column access (CAS): the desired column is read from or written to the row buffer
3. Data transfer: data moves across the memory bus in bursts
4. Precharge: the row is closed so another row in the bank can be opened
Row Buffer Hit vs. Miss:
- Row buffer hit: the requested address falls within the currently open row, so only a column access is needed
- Row buffer miss: a different row must be opened first, adding precharge and activation delays
These patterns explain why sequential access in RAM is faster than random access—sequential reads hit the same row buffer repeatedly.
Despite the name 'Random Access Memory,' truly random access patterns are significantly slower than sequential patterns. Row buffer effects mean that accessing addresses sequentially can be 5-7x faster than random access. Database buffer pools exploit this by organizing pages to maximize row buffer hits.
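The row-buffer effect can be captured in a small model. The timing values below are illustrative assumptions, not a specific DRAM part's datasheet; real sequential access also benefits from burst transfers, which widen the gap beyond what latency alone shows.

```python
# Model of average DRAM access latency under row-buffer hits vs. misses.
# Timing values are illustrative assumptions (roughly DDR4-class, in ns).
T_CAS = 15.0   # column access: the row is already open (hit)
T_RP = 15.0    # precharge: close the currently open row
T_RCD = 15.0   # activate: open the new row

def effective_latency_ns(hit_rate: float) -> float:
    """Average access latency given a row-buffer hit rate in [0, 1]."""
    hit_cost = T_CAS
    miss_cost = T_RP + T_RCD + T_CAS   # close old row, open new row, then read
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

# Sequential scans keep hitting the open row; random access mostly misses.
sequential = effective_latency_ns(0.95)
random_access = effective_latency_ns(0.05)
print(f"sequential ~{sequential:.1f} ns, random ~{random_access:.1f} ns")
```

Even this latency-only model shows random access costing several times more per access; burst transfers on hits push the real-world ratio toward the 5-7x cited above.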
Despite the rise of solid-state storage, hard disk drives remain crucial in database environments—particularly for bulk storage, backups, and archival systems where cost-per-gigabyte dominates over speed.
Physical Architecture:
An HDD is an electromechanical device consisting of:
- Platters: rigid disks coated with magnetic material, stacked on a central spindle
- Spindle motor: rotates the platters at a constant speed (measured in RPM)
- Read/write heads: float nanometers above each platter surface, one head per surface
- Actuator arm: swings the heads across the platters to reach different tracks
- Controller: electronics that manage commands, caching, and error correction
| RPM Rating | Rotation Period | Avg. Rotational Latency | Typical Use | Seek Time |
|---|---|---|---|---|
| 5,400 RPM | 11.1ms | 5.56ms | Consumer, archive | 12-15ms |
| 7,200 RPM | 8.33ms | 4.17ms | Desktop, NAS | 9-12ms |
| 10,000 RPM | 6.0ms | 3.0ms | Enterprise (legacy) | 4-6ms |
| 15,000 RPM | 4.0ms | 2.0ms | High-performance enterprise | 3-4ms |
Data Organization on Disk:
- Tracks: concentric circles on each platter surface
- Sectors: fixed-size subdivisions of a track (traditionally 512 bytes, now commonly 4KB)
- Cylinders: the set of tracks at the same radius across all platters
The Three Components of Access Time:
Every random HDD access involves three delays:
Seek Time: Time to move the head assembly to the correct track
Rotational Latency: Time for the desired sector to rotate under the head
Transfer Time: Time to read/write the data
Total random access time = Seek + Rotational Latency + Transfer ≈ 8-15ms
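The formula above can be checked with a quick calculation. The drive parameters here (7,200 RPM, 9ms average seek, 200 MB/s sustained transfer) are representative assumptions taken from the tables in this section:

```python
# Decompose a random HDD access into its three delays.
def hdd_random_access_ms(seek_ms: float, rpm: int, io_bytes: int,
                         transfer_mb_s: float) -> float:
    """Seek + average rotational latency + transfer time, in milliseconds."""
    rotational_ms = 0.5 * 60_000 / rpm               # average: half a rotation
    transfer_ms = io_bytes / (transfer_mb_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms

# Illustrative 7,200 RPM drive, 9 ms average seek, 4 KB read at 200 MB/s:
t = hdd_random_access_ms(seek_ms=9.0, rpm=7200, io_bytes=4096, transfer_mb_s=200)
print(f"{t:.2f} ms")  # seek + rotation dominate; transfer is only ~0.02 ms
```

Note how the transfer term is negligible: for small random I/O, virtually all the time goes to mechanical positioning.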
No amount of controller intelligence can escape the mechanical reality of HDDs. The head must physically move across the platter (seek), and the platter must rotate (latency). These operations are governed by Newtonian physics, not Moore's Law. HDD speeds have improved only modestly in 30 years while CPUs have accelerated 10,000x.
Sequential vs. Random Performance:
The sequential vs. random dichotomy is extreme for HDDs:
| Access Pattern | Typical Speed | Why |
|---|---|---|
| Sequential Read | 150-250 MB/s | No seeks, continuous rotation |
| Sequential Write | 140-220 MB/s | Same, slight head stabilization delay |
| Random Read (4KB) | 0.5-1.5 MB/s | Dominated by seek + rotational latency |
| Random Write (4KB) | 0.3-1.0 MB/s | Same, plus write completion confirmation |
The 100x Gap: HDDs can read sequentially at 200 MB/s but only manage ~1 MB/s for random small reads. This 100x or greater gap fundamentally shapes how databases organize data on disk.
Database Implications:
- Favor sequential layouts: store related rows contiguously so scans avoid seeks
- Use append-only logs: transaction logs write sequentially, matching HDD strengths
- Batch and reorder random I/O: elevator-style scheduling reduces head movement
- Keep random-write hot paths off HDDs: they bottleneck at roughly 100-150 IOPS
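To make the sequential vs. random gap concrete, here is the arithmetic for reading 1 GB from an HDD both ways, using representative figures from the table above (200 MB/s sequential, ~12ms per random 4KB access):

```python
# Reading 1 GB from an HDD: sequential scan vs. random 4 KB reads.
GB = 1_000_000_000
seq_mb_s = 200        # sustained sequential throughput (MB/s)
random_ms = 12        # per random 4 KB access: seek + rotation + transfer

seq_seconds = GB / (seq_mb_s * 1e6)
n_reads = GB // 4096                      # ~244,000 individual 4 KB reads
rand_seconds = n_reads * random_ms / 1000

print(f"sequential: {seq_seconds:.0f} s")
print(f"random 4KB: {rand_seconds / 60:.0f} minutes")
```

Five seconds versus roughly three quarters of an hour for the same gigabyte: this is why databases go to such lengths to convert random I/O into sequential I/O on HDDs.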
Solid-state drives have transformed database performance by eliminating the mechanical constraints of HDDs. With no moving parts, SSDs offer dramatically lower latency and vastly improved random access performance.
Flash Memory Fundamentals:
SSDs store data in NAND flash memory—a non-volatile storage technology based on floating-gate transistors. Each cell stores charge in a floating gate insulated from the control circuit, where it can persist for years without power.
Cell Types (SLC, MLC, TLC, QLC):
NAND cells can store multiple bits by distinguishing voltage levels:
| Type | Bits/Cell | Voltage Levels | Speed | Endurance | Cost | Typical Use |
|---|---|---|---|---|---|---|
| SLC (Single) | 1 | 2 | Fastest | ~100K cycles | $$$$ | Enterprise, critical apps |
| MLC (Multi) | 2 | 4 | Fast | ~10K cycles | $$$ | Enterprise mixed |
| TLC (Triple) | 3 | 8 | Moderate | ~3K cycles | $$ | Consumer, datacenter read |
| QLC (Quad) | 4 | 16 | Slower | ~1K cycles | $ | Read-heavy, cold data |
NAND Organization:
Flash is organized hierarchically:
- Cells: individual floating-gate transistors storing 1-4 bits each
- Pages: the smallest unit that can be read or programmed (typically 4-16KB)
- Blocks: the smallest unit that can be erased (typically hundreds of pages, i.e., hundreds of KB to several MB)
- Planes and dies: groups of blocks that can operate in parallel
The Asymmetric Operations Problem:
NAND flash has a critical asymmetry:
- Reads and writes happen at page granularity (4-16KB)
- Erases happen only at block granularity (hundreds of pages at once)
- A page can be written only once before its entire block must be erased again

This means you cannot simply overwrite data in place. To modify a page:
1. Write the new version to a fresh, already-erased page elsewhere
2. Update the mapping so the logical address points to the new location
3. Mark the old page invalid, to be reclaimed later by garbage collection
Flash Translation Layer (FTL):
To hide this complexity, SSDs implement a Flash Translation Layer:
- Address mapping: translates logical block addresses to physical flash pages
- Wear leveling: spreads writes across all blocks so no block wears out prematurely
- Garbage collection: reclaims blocks full of invalid pages by relocating live data and erasing
- Bad block management: retires failing blocks and remaps their contents
The FTL makes the SSD appear as a simple block device to the operating system and database.
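A toy model illustrates the core FTL trick: overwrites are redirected to fresh pages while a mapping table keeps logical addresses stable. Everything here (class name, block size, eager invalidation, the absence of real garbage collection) is a simplifying assumption, not how any production FTL is implemented:

```python
# Toy Flash Translation Layer: out-of-place writes behind a logical-to-physical map.
PAGES_PER_BLOCK = 4

class ToyFTL:
    def __init__(self, num_blocks: int):
        self.mapping = {}     # logical page number -> (block, page)
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.free = [(b, p) for b in range(num_blocks)
                     for p in range(PAGES_PER_BLOCK)]
        self.physical_writes = 0
        self.logical_writes = 0

    def write(self, lpn: int, data: bytes) -> None:
        """Logically an overwrite; physically a write to a fresh page plus a remap."""
        old = self.mapping.get(lpn)
        if old is not None:
            b, p = old
            self.blocks[b][p] = "INVALID"   # stale copy awaits garbage collection
        b, p = self.free.pop(0)             # allocate a fresh, erased page
        self.blocks[b][p] = data
        self.mapping[lpn] = (b, p)
        self.logical_writes += 1
        self.physical_writes += 1

    def read(self, lpn: int) -> bytes:
        b, p = self.mapping[lpn]
        return self.blocks[b][p]

ftl = ToyFTL(num_blocks=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")          # same logical page lands on a new physical page
assert ftl.read(0) == b"v2"  # the OS only ever sees the logical address
```

Real FTLs add garbage collection (which copies live pages out of mostly-invalid blocks before erasing them), and that copying is precisely where write amplification comes from.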
Due to the block-erase requirement, writing 4KB of data may require erasing and rewriting 512KB or more. This 'write amplification' reduces effective write speed and accelerates flash cell wear. Database workload patterns significantly affect SSD lifespan—heavy random write workloads are the most damaging.
SSD Performance Characteristics:
| Metric | Typical Range | Notes |
|---|---|---|
| Sequential Read | 500 MB/s - 7 GB/s | Limited by interface (SATA vs NVMe) |
| Sequential Write | 400 MB/s - 5 GB/s | Depends on write buffer, SLC cache |
| Random Read (4KB) | 10,000 - 1,000,000 IOPS | Major advantage over HDD |
| Random Write (4KB) | 5,000 - 300,000 IOPS | Varies with FTL efficiency |
| Latency (Read) | 25-100 μs | ~100x faster than HDD |
| Latency (Write) | 50-500 μs | Can spike during GC |
Interface Matters:
- SATA caps throughput at roughly 550 MB/s and carries protocol overhead designed for HDDs
- NVMe runs over PCIe, delivering multiple GB/s and deep command queues that expose the internal parallelism of modern flash
Database Implications:
- Random reads are cheap: index-heavy access patterns that crippled HDDs perform well
- Small random writes still cost: they trigger write amplification and consume endurance
- Sequential logs remain valuable: they minimize write amplification and exploit SLC caches
- Plan for endurance: match drive DWPD ratings to the workload's write volume
Let's consolidate the performance differences between the three primary storage technologies that databases interact with. These numbers represent typical values for modern enterprise-grade components.
| Metric | DDR4 RAM | NVMe SSD | SATA SSD | Enterprise HDD |
|---|---|---|---|---|
| Random Read Latency | 60-100 ns | 25-100 μs | 100-200 μs | 5-15 ms |
| Random Write Latency | 60-100 ns | 50-500 μs | 100-500 μs | 5-15 ms |
| Sequential Read (MB/s) | 25,000-50,000 | 3,000-7,000 | 500-550 | 150-250 |
| Sequential Write (MB/s) | 25,000-50,000 | 2,000-5,000 | 400-520 | 140-220 |
| Random 4KB Read IOPS | Millions | 100K-1M | 50K-100K | 100-200 |
| Random 4KB Write IOPS | Millions | 50K-500K | 30K-80K | 100-150 |
| Capacity (typical) | 64GB-2TB | 256GB-8TB | 256GB-4TB | 1TB-20TB |
| Cost per GB | $3-5 | $0.10-0.30 | $0.08-0.12 | $0.02-0.03 |
| Persistence | No (volatile) | Yes | Yes | Yes |
| Power (active) | 3-6W per DIMM | 5-15W | 2-5W | 7-15W |
| Endurance | Unlimited | 1-5 DWPD* | 0.3-1 DWPD | Unlimited |
*DWPD = Drive Writes Per Day (the full drive capacity can be written this many times per day over the warranty period)
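The DWPD definition above converts directly into a total-bytes-written budget. A quick sketch of that arithmetic (the function name and the example drive are illustrative):

```python
# Convert a DWPD endurance rating into total terabytes written (TBW)
# over the drive's warranty period.
def endurance_tbw(capacity_tb: float, dwpd: float, warranty_years: int) -> float:
    """TBW = capacity x DWPD x 365 days x warranty years."""
    return capacity_tb * dwpd * 365 * warranty_years

# A hypothetical 1 TB drive rated at 1 DWPD over a 5-year warranty:
print(endurance_tbw(1.0, 1.0, 5))  # 1825.0 TBW
```

Dividing your measured daily write volume into this budget tells you whether a given drive class will survive its warranty period under your workload.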
Key Ratios to Remember:
- RAM vs. NVMe SSD: ~1,000x lower latency (about 100 ns vs. about 100 μs)
- NVMe SSD vs. HDD: ~100x lower latency (about 100 μs vs. about 10 ms)
- RAM vs. HDD: ~100,000x lower latency
- HDD cost advantage: roughly 5-10x cheaper per GB than SSD, which keeps it relevant for cold data
When reasoning about database performance, think in latency first. A query that requires 100 random reads takes about 1 second on HDD (100 × 10ms), about 10ms on NVMe SSD (100 × 100μs), and about 10μs if every page is cached in RAM (100 × 100ns). This span of roughly 100,000x determines whether your application feels fast or slow.
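That latency-first reasoning can be packaged as a tiny calculator. The per-tier latencies are representative midpoints from the comparison table above, not measurements:

```python
# Latency budget for a query needing N random reads, per storage tier.
TIER_LATENCY_S = {
    "DDR4 RAM": 100e-9,   # ~100 ns
    "NVMe SSD": 100e-6,   # ~100 us
    "SATA SSD": 200e-6,   # ~200 us
    "HDD": 10e-3,         # ~10 ms
}

def query_time_s(random_reads: int, tier: str) -> float:
    """Total time spent waiting on storage, ignoring CPU and queuing."""
    return random_reads * TIER_LATENCY_S[tier]

for tier in TIER_LATENCY_S:
    print(f"{tier:>8}: {query_time_s(100, tier) * 1000:.3f} ms")
```

Running this for 100 reads reproduces the numbers in the paragraph above: one second on HDD, tens of milliseconds on SSD, microseconds in RAM.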
Different database operations exhibit different access patterns, and these patterns interact differently with each storage technology. Understanding these interactions guides storage allocation and query optimization.
Common Database Access Patterns:
| Operation | Read Pattern | Write Pattern | Best Storage | Notes |
|---|---|---|---|---|
| Table full scan | Sequential read | None | HDD acceptable | Throughput matters, not latency |
| Index lookup (point) | Random read | None | SSD/RAM | Latency critical |
| Index range scan | Sequential read | None | SSD preferred | After initial seek, sequential |
| Transaction log write | None | Sequential append | NVMe SSD | Critical path, must be durable |
| OLTP mixed workload | Random read/write | Random write | NVMe SSD + RAM cache | IOPS-bound |
| Data warehouse query | Sequential scan | Temp writes | SSD + HDD mix | Throughput-bound |
| Backup | Sequential read | Sequential write | HDD sufficient | Cost-sensitive |
| Random row updates | Random read | Random write | SSD preferred | HDD would bottleneck |
Most production databases use hybrid storage strategies. Hot data and logs on NVMe SSD, warm data on SATA SSD, cold data and backups on HDD. Some databases (like SQL Server and Oracle) support automatic tiering that moves data between tiers based on access patterns.
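The hybrid placement described above can be sketched as a simple policy function. The access-frequency thresholds here are illustrative assumptions, not any vendor's actual tiering algorithm:

```python
# A sketch of an access-frequency-based tiering policy.
def choose_tier(reads_per_day: int, is_log: bool = False) -> str:
    """Pick a storage tier for a data object based on how hot it is."""
    if is_log:
        return "NVMe SSD"      # transaction logs belong on the fastest durable tier
    if reads_per_day > 1000:
        return "NVMe SSD"      # hot data
    if reads_per_day > 10:
        return "SATA SSD"      # warm data
    return "HDD"               # cold data and backups

assert choose_tier(5000) == "NVMe SSD"
assert choose_tier(100) == "SATA SSD"
assert choose_tier(1) == "HDD"
```

Automatic tiering systems do essentially this, but continuously, with per-extent access statistics and background migration between tiers.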
Given the enormous performance gap between RAM and disk (even SSDs), databases employ sophisticated techniques to minimize the impact of slow storage access. These techniques are central to database architecture.
The Core Strategies:
- Buffer pool caching: keep hot pages in RAM so repeated accesses never touch disk
- Write-ahead logging: convert random data-page writes into sequential log appends
- Prefetching/read-ahead: detect sequential scans and fetch pages before they are requested
- Write batching and group commit: amortize disk latency across many operations
- Asynchronous background flushing: write dirty pages to disk off the critical path
Buffer Pool Deep Dive:
The buffer pool deserves special attention as the primary bridge between RAM and disk:
+-------------------+ +-------------------+
| Query Executor | | Storage Engine |
+-------------------+ +-------------------+
| |
v v
+-------------------------------------------+
| BUFFER POOL (RAM) |
| +-------+ +-------+ +-------+ ... |
| | Page | | Page | | Page | |
| | (8KB) | | (8KB) | | (8KB) | |
| +-------+ +-------+ +-------+ |
| [Clean] [Dirty] [Pinned] |
+-------------------------------------------+
| ^
v |
+-------------------------------------------+
| PERSISTENT STORAGE (Disk) |
+-------------------------------------------+
Page States:
- Clean: identical to the on-disk copy; can be evicted instantly
- Dirty: modified in RAM; must be written back to disk before eviction
- Pinned: currently in use by an operation; cannot be evicted
Buffer Pool Operations:
- Page request: check the pool first; a hit returns the page from RAM, a miss triggers a disk read
- Eviction: when the pool is full, a replacement policy (LRU variants, clock sweep) chooses a victim page
- Flushing: background writers and checkpoints write dirty pages to disk, bounding recovery time
Monitor your buffer pool hit ratio religiously. A ratio of 99% means only 1 in 100 page requests hits disk. At 95%, that's 1 in 20—potentially 5x more disk I/O. For OLTP workloads, under 99% often indicates the buffer pool is too small for the working set.
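The mechanics of hits, misses, eviction, and the hit ratio can be shown with a minimal LRU buffer pool. This is a sketch: real buffer pools use richer policies (clock sweep, LRU-K) and track clean/dirty/pinned state, none of which is modeled here:

```python
# Minimal LRU buffer pool that tracks its own hit ratio.
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()   # page_id -> page data, in LRU order
        self.hits = 0
        self.misses = 0

    def get_page(self, page_id: int) -> str:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)        # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1                       # would trigger a disk read
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)     # evict least recently used page
            self.pages[page_id] = f"data-{page_id}"
        return self.pages[page_id]

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

pool = BufferPool(capacity_pages=3)
for pid in [1, 2, 3, 1, 2, 4, 1]:   # page 3 is evicted when page 4 arrives
    pool.get_page(pid)
print(f"hit ratio: {pool.hit_ratio():.2f}")
```

Each miss in this model stands in for a disk read costing four to five orders of magnitude more than a hit, which is why the hit ratio is the single number worth watching.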
The strict dichotomy between RAM and disk is beginning to blur as new technologies fill the gap with intermediate performance and persistence characteristics.
Persistent Memory (PMEM):
Intel Optane Persistent Memory (and similar technologies) provides:
- Byte-addressable access like RAM, via load/store instructions rather than block I/O
- Persistence across power loss, like disk
- Latency of roughly 300-400ns: several times slower than DRAM, but far faster than any SSD
- Higher capacity per module than DRAM at lower cost per GB
Databases can use PMEM as:
- An extended, persistent buffer cache larger than DRAM alone allows
- A low-latency home for transaction logs, easing the log-write bottleneck
- Directly addressed storage for persistent data structures, bypassing the block layer
Storage Class Memory (SCM):
A category including various technologies:
- 3D XPoint (the basis of Optane)
- Phase-change memory (PCM)
- Resistive RAM (ReRAM)
- Magnetoresistive RAM (MRAM)
All offer persistence with lower latency than NAND flash.
Computational Storage:
Moving computation closer to storage:
- Computational storage drives that filter or aggregate data before it crosses the bus
- On-drive compression and encryption offload
- Less data movement overall, which is often the real bottleneck for analytics
| Technology | Latency | Persistent | Capacity | Database Application |
|---|---|---|---|---|
| SRAM (Cache) | <1ns | No | KB-MB | CPU internal |
| DRAM | 50-100ns | No | GB-TB | Buffer pool, indexes |
| Optane PMEM | 300-400ns | Yes | 128-512GB/DIMM | Extended cache, logs |
| Optane SSD | 10-15μs | Yes | 100GB-1.5TB | Hot data, redo logs |
| NVMe SSD | 25-100μs | Yes | 256GB-8TB | Primary storage |
| SATA SSD | 100-200μs | Yes | 256GB-4TB | Warm data |
| HDD | 5-15ms | Yes | 1TB-20TB | Cold data, backups |
As the storage hierarchy gains more tiers, database architectures adapt. Traditional designs assumed 'fast but volatile vs. slow but persistent.' With byte-addressable persistent memory, new data structures become possible—like persistent B-trees that don't need recovery after a crash.
The divide between RAM and disk (SSD and HDD) is the single most important performance consideration in database systems. Let's consolidate our understanding:
- RAM delivers nanosecond access but is volatile and expensive per GB
- HDDs are mechanical: random access costs milliseconds, while sequential throughput remains respectable
- SSDs eliminate seeks, making random reads cheap, but write amplification and endurance still shape workload design
- The sequential vs. random distinction matters on every tier, from DRAM row buffers to HDD seeks
- Databases bridge the gap with buffer pools, write-ahead logs, prefetching, and tiered storage
What's Next:
We've examined the performance differences between RAM and disk. But what determines the actual latency of a storage access? The next page dives into Access Times—dissecting the physics and engineering behind storage latency, from CPU clock cycles to disk rotational delays.
You now understand the detailed characteristics of RAM, HDD, and SSD storage technologies and how databases bridge the performance gap between them. This knowledge is essential for storage configuration, capacity planning, and performance optimization decisions.