Secondary storage represents a fundamental shift in the memory hierarchy. Unlike registers, caches, and RAM—which are volatile and lose data when power is removed—secondary storage is persistent. It retains data across power cycles, system reboots, and hardware changes. This persistence is what makes durable file systems, databases, and long-term data retention possible.
However, persistence comes at a cost: secondary storage is dramatically slower than main memory. Understanding this tier's technology, performance characteristics, and interface with the operating system is essential for systems engineering.
By the end of this page, you will understand: the fundamental technologies behind HDDs and SSDs; their drastically different performance characteristics; storage interfaces and protocols; the OS storage stack; and emerging technologies bridging the gap between memory and storage.
The Hard Disk Drive (HDD) has been the dominant mass storage technology for over 50 years. Despite the rise of SSDs, HDDs remain important for cost-sensitive bulk storage and archival applications.
Physical construction:

An HDD contains:

- Platters: rigid disks coated with magnetic material, stacked on a common spindle
- Spindle motor: spins the platters at a constant rate (commonly 5,400-15,000 RPM)
- Read/write heads: one per platter surface, flying a few nanometers above it
- Actuator arm: swings the heads across the platters to reach different tracks
- Controller electronics: on-board cache and logic that translate host commands into head movements

Data organization:

Data on a platter is organized as:

- Tracks: concentric circles on each platter surface
- Sectors: fixed-size slices of a track (traditionally 512 bytes, 4 KB on modern drives)
- Cylinders: the set of tracks at the same radius across all platters
- Logical block addresses (LBAs): the linear numbering the host uses; the drive maps LBAs to physical sectors

Access latency components:

HDD access time is dominated by mechanical delays:

- Seek time: moving the actuator arm to the target track
- Rotational latency: waiting for the target sector to rotate under the head (on average, half a revolution)
- Transfer time: reading the data as it passes under the head (small compared to the other two)
| Specification | 7,200 RPM Desktop | 15,000 RPM Enterprise | 5,400 RPM Laptop |
|---|---|---|---|
| Average seek time | 8-10 ms | 3-4 ms | 10-14 ms |
| Rotational latency | 4.17 ms | 2 ms | 5.56 ms |
| Average access time | ~12-14 ms | ~5-6 ms | ~15-20 ms |
| Sequential read | 150-200 MB/s | 200-300 MB/s | 100-140 MB/s |
| Random 4KB IOPS | 75-150 | 180-300 | 50-80 |
| Power (active) | 6-10W | 10-15W | 1.5-3W |
HDDs are fundamentally sequential devices. Sequential access achieves 150+ MB/s, while random 4KB access achieves only about 0.3 MB/s of effective bandwidth (75 IOPS × 4 KB ≈ 0.3 MB/s). This roughly 500× difference makes HDDs unsuitable for random access workloads like databases without extensive caching.
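To make that gap concrete, here is a small back-of-the-envelope calculation in C using the 7,200 RPM figures from the table above (the numbers are illustrative, not from any specific drive):

```c
/* Back-of-the-envelope HDD math using the 7,200 RPM column above. */
#include <stdio.h>

int main(void) {
    double seek_ms     = 10.0;   /* average seek time                                   */
    double rotation_ms = 4.17;   /* average rotational latency: half of 60,000/7,200 ms */
    double seq_mbs     = 150.0;  /* sequential transfer rate                            */
    double request_kb  = 4.0;    /* random 4 KB request                                 */

    double transfer_ms = request_kb / 1024.0 / seq_mbs * 1000.0;  /* ~0.03 ms, negligible */
    double access_ms   = seek_ms + rotation_ms + transfer_ms;

    double iops       = 1000.0 / access_ms;          /* one request in flight at a time */
    double random_mbs = iops * request_kb / 1024.0;  /* effective random bandwidth      */

    printf("access time : %.1f ms\n", access_ms);          /* ~14.2 ms   */
    printf("random IOPS : %.0f\n", iops);                   /* ~70        */
    printf("random MB/s : %.2f\n", random_mbs);             /* ~0.28 MB/s */
    printf("seq/random  : %.0fx\n", seq_mbs / random_mbs);  /* ~545x, i.e. roughly the 500x above */
    return 0;
}
```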
Solid State Drives (SSDs) use NAND flash memory to store data, eliminating mechanical components entirely. This fundamental change brings dramatically different performance characteristics.
NAND flash basics:
NAND flash stores data in floating-gate transistors:

- Each cell traps charge on an electrically isolated (floating) gate; the amount of trapped charge shifts the transistor's threshold voltage
- Reading senses the threshold voltage; programming injects charge; erasing removes it
- Cells are organized into pages for reading and programming, but can only be erased a whole block at a time

Cell types by bits per cell:

- SLC (single-level cell): 1 bit per cell — fastest and most durable, most expensive
- MLC (multi-level cell): 2 bits per cell
- TLC (triple-level cell): 3 bits per cell — the consumer mainstream
- QLC (quad-level cell): 4 bits per cell — highest density, lowest endurance and slowest writes
SSD internal organization:
SSDs have a complex internal structure:

- Controller: an embedded processor running the flash translation layer (FTL), wear leveling, and error correction
- DRAM cache: holds the logical-to-physical mapping table and buffers writes (some budget designs are DRAM-less)
- NAND packages spread across multiple channels, so many dies can be read or written in parallel
- Pages (commonly 4-16 KB) grouped into blocks (hundreds of pages), the unit of erasure
The program/erase asymmetry:
NAND flash has a critical constraint: you cannot overwrite data in place. Pages can only be programmed after the entire block containing them has been erased, and erasing is far slower than programming.

This creates the garbage collection problem: to reuse a "used" page, the SSD must:

- Write the new data to a free page elsewhere and mark the old page invalid
- Eventually pick a mostly-invalid block and copy its remaining valid pages to fresh locations
- Erase the block so its pages become writable again
This background activity can cause performance variability and write amplification.
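Write amplification can be quantified as the ratio of bytes physically written to NAND (host writes plus garbage-collection copies) to bytes the host requested. A minimal sketch with purely illustrative numbers:

```c
/* Write amplification factor (WAF) = NAND writes / host writes.
 * The figures below are illustrative, not measured from a real drive. */
#include <stdio.h>

int main(void) {
    double host_writes_gb = 100.0;  /* data the application wrote       */
    double gc_copies_gb   = 80.0;   /* valid pages relocated during GC  */

    double nand_writes_gb = host_writes_gb + gc_copies_gb;
    double waf = nand_writes_gb / host_writes_gb;

    printf("WAF = %.2f\n", waf);    /* 1.80: each host GB costs 1.8 GB of NAND wear */
    return 0;
}
```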
| Specification | Consumer NVMe SSD | Enterprise NVMe SSD | SATA SSD |
|---|---|---|---|
| Sequential read | 3-7 GB/s | 7-14 GB/s | 500-550 MB/s |
| Sequential write | 2-5 GB/s | 3-10 GB/s | 450-520 MB/s |
| Random 4KB read IOPS | 400K-1M | 1-2M | 90-100K |
| Random 4KB write IOPS | 200K-700K | 200K-500K | 70-90K |
| Average latency (read) | ~10-50 μs | ~10-30 μs | ~100 μs |
| Average latency (write) | ~20-100 μs | ~20-50 μs | ~100 μs |
| Endurance (rated writes) | 300-600 TBW | 5-50 PBW | 150-300 TBW |
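Endurance ratings are quoted either as total bytes written (TBW/PBW) or as drive writes per day (DWPD); converting between them is simple arithmetic. The sketch below assumes a hypothetical 1 TB drive rated at 600 TBW over a 5-year warranty:

```c
/* TBW -> DWPD conversion for a hypothetical drive. */
#include <stdio.h>

int main(void) {
    double capacity_tb  = 1.0;    /* drive capacity             */
    double rated_tbw    = 600.0;  /* rated terabytes written    */
    double warranty_yrs = 5.0;

    double dwpd = rated_tbw / (capacity_tb * warranty_yrs * 365.0);
    printf("DWPD = %.2f\n", dwpd);  /* ~0.33 full drive writes per day */
    return 0;
}
```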
The interface between storage devices and the computer system significantly impacts performance. Different interfaces evolved for different use cases, and modern NVMe represents a clean-slate design for flash storage.
Major storage interfaces:

- SATA (Serial ATA): the legacy consumer interface, limited to ~550 MB/s and a single 32-entry command queue
- SAS (Serial Attached SCSI): the enterprise evolution of SCSI, with higher bandwidth, deeper queues, and dual-porting for redundancy
- NVMe (Non-Volatile Memory Express): a PCIe-attached protocol designed specifically for flash, with massive parallelism and minimal protocol overhead
NVMe deep dive:
NVMe was designed from the ground up for low-latency, high-parallelism flash storage:
Queue architecture: up to 64K submission/completion queue pairs, each up to 64K entries deep, typically one pair per CPU core so cores can submit I/O without locking each other.

Streamlined command set: a small set of required I/O commands (read, write, flush, and a handful more) instead of the layered SCSI/ATA command sets.

Low latency optimizations: commands are posted to queues in host memory and announced with a single doorbell register write; completions arrive via MSI-X interrupts (or polling), with no host bus adapter translation layer in the path.
| Interface | Max Bandwidth | Queue Depth | Latency | Use Case |
|---|---|---|---|---|
| SATA 3.0 | ~550 MB/s | 32 (single queue) | ~100 μs | Consumer HDD/SSD |
| SAS-3 | ~1.2 GB/s | 256+ per LUN | ~70-100 μs | Enterprise HDD/SSD |
| NVMe PCIe 3.0 x4 | ~3.5 GB/s | 64K × 64K | ~10-30 μs | Consumer NVMe SSD |
| NVMe PCIe 4.0 x4 | ~7 GB/s | 64K × 64K | ~10-30 μs | Modern NVMe SSD |
| NVMe PCIe 5.0 x4 | ~14 GB/s | 64K × 64K | ~10-30 μs | Latest Gen NVMe |
Deep queues are essential for SSD performance. With 30 μs latency, one outstanding request at a time (queue depth 1) yields only ~33K IOPS; with 32 commands in flight, the same drive can exceed 1M IOPS. NVMe's per-CPU queues eliminate the contention that SATA's single queue creates.
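The arithmetic behind that note is just Little's Law: achievable IOPS ≈ outstanding requests ÷ per-request latency. A minimal sketch (it assumes the device can overlap all outstanding requests perfectly):

```c
/* Little's Law: IOPS ~= queue depth / per-request latency. */
#include <stdio.h>

int main(void) {
    double latency_us = 30.0;   /* per-request device latency */

    for (int qd = 1; qd <= 32; qd *= 2) {
        double iops = qd / (latency_us / 1e6);   /* assumes perfect overlap of requests */
        printf("queue depth %2d -> %9.0f IOPS\n", qd, iops);
    }
    return 0;   /* queue depth 1 -> ~33K IOPS; queue depth 32 -> ~1.07M IOPS */
}
```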
Understanding storage performance requires considering multiple dimensions beyond simple bandwidth numbers. Workload characteristics dramatically affect achieved performance.
Key performance metrics:

- Bandwidth (throughput): MB/s or GB/s moved, the headline number for large transfers
- IOPS: I/O operations per second, the limiting factor for small requests
- Latency: time to complete a single operation, including any queueing delay
- Queue depth: how many operations are outstanding at once

Workload patterns:

Sequential access: large contiguous transfers (streaming, backups, log writes). Both HDDs and SSDs deliver their best bandwidth here.

Random access: small requests scattered across the device (databases, virtual machines, metadata). HDDs collapse to double-digit IOPS; SSDs remain fast because there is no seek.

Mixed workloads: real applications blend reads and writes at varying sizes and queue depths; the read/write ratio and request size often matter more than any single headline figure.
Latency under load:
As queue depth increases, latency typically increases (more operations waiting). SSDs handle this better than HDDs. At very high loads, SSDs may exhibit garbage collection pauses causing latency spikes.
| Metric | 7200 RPM HDD | SATA SSD | NVMe SSD | Difference |
|---|---|---|---|---|
| Sequential read | 180 MB/s | 550 MB/s | 3500 MB/s | ~20× (NVMe vs HDD) |
| Sequential write | 170 MB/s | 520 MB/s | 3000 MB/s | ~18× (NVMe vs HDD) |
| Random 4KB read | ~100 IOPS | ~95K IOPS | ~500K IOPS | ~5000× (NVMe vs HDD) |
| Random 4KB write | ~100 IOPS | ~85K IOPS | ~400K IOPS | ~4000× (NVMe vs HDD) |
| Access latency | ~10 ms | ~100 μs | ~30 μs | ~300× (NVMe vs HDD) |
| Power (active) | 8W | 3-5W | 5-10W | Varies |
Real-world performance differs from benchmarks. SSD performance degrades as drives fill up (less free space for garbage collection). Sustained writes may throttle due to heat. Consumer SSDs often slow dramatically once their fast SLC write cache is exhausted. Always benchmark with realistic workloads.
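As a starting point for such measurements, here is a minimal random-read micro-benchmark sketch in C. The file path and size are placeholders, and it uses O_DIRECT (covered in the storage-stack section below) so the page cache does not distort the result:

```c
/* Minimal random 4 KB read micro-benchmark (Linux).
 * Path and size are placeholders: point it at a large existing file
 * (or a raw device you can safely read) at least file_size bytes long. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const char *path      = "/path/to/testfile";  /* placeholder */
    const off_t file_size = 1L << 30;             /* assume 1 GiB of readable data */
    const int   n  = 10000;
    const int   bs = 4096;

    int fd = open(path, O_RDONLY | O_DIRECT);     /* O_DIRECT: keep the page cache out */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, bs) != 0)      /* O_DIRECT requires aligned buffers */
        return 1;

    long blocks = file_size / bs;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) {
        off_t off = (off_t)(rand() % blocks) * bs;            /* block-aligned offset */
        if (pread(fd, buf, bs, off) != bs) { perror("pread"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f IOPS, %.1f us average latency\n", n / secs, secs / n * 1e6);

    free(buf);
    close(fd);
    return 0;
}
```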
The operating system provides a complex software stack between applications and storage hardware. This stack provides abstraction, caching, scheduling, and management services.
Linux storage stack (representative):

From top to bottom:

- Application: issues read()/write()/mmap() or async I/O (io_uring)
- VFS (virtual file system): the common API over all file systems
- File system: ext4, XFS, Btrfs, and others, mapping files to block addresses
- Page cache: caches file data in RAM and absorbs writes
- Block layer (blk-mq): request queues, merging, and the I/O scheduler
- Device driver: nvme, ahci/libata, or SCSI drivers that speak the hardware protocol
- Hardware: the controller and the storage device itself
I/O Schedulers:
I/O schedulers order and merge requests to optimize performance:
none (noop): No scheduling—pass requests directly to device. Best for NVMe SSDs with internal parallelism and no seek penalty.
mq-deadline: Batch requests while ensuring no request waits too long (deadline). Good balance for SSDs.
BFQ (Budget Fair Queueing): Per-process fairness and latency guarantees. Better for interactive workloads on slower devices.
kyber: Simple, low-overhead latency-targeting scheduler for fast devices.
Historical (now deprecated): the single-queue CFQ, deadline, and noop schedulers were removed along with the legacy block layer (around Linux 5.0); their roles are now filled by BFQ, mq-deadline, and none respectively.
For NVMe SSDs, the 'none' scheduler is often optimal. The SSD controller has sophisticated internal scheduling; adding OS-level scheduling introduces latency without benefit. The multi-queue block layer (blk-mq) enables parallel I/O submission to all hardware queues.
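The active scheduler for a device is exposed through sysfs at /sys/block/&lt;device&gt;/queue/scheduler, with the active one shown in brackets. A minimal sketch that prints it; the device name nvme0n1 is a placeholder for whatever your system uses:

```c
/* Print the active I/O scheduler for a block device via sysfs.
 * The device name is a placeholder; adjust for your system (e.g. sda, nvme0n1). */
#include <stdio.h>

int main(void) {
    const char *path = "/sys/block/nvme0n1/queue/scheduler";
    char line[256];

    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }

    /* Output looks like "[none] mq-deadline kyber bfq"; brackets mark the active one. */
    if (fgets(line, sizeof line, f))
        printf("%s", line);

    fclose(f);
    return 0;
}
```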
Direct I/O and Bypass:
For applications needing direct device access (databases, high-performance storage):
O_DIRECT: Bypasses page cache, reading/writing directly to device. Avoids double-buffering for applications managing their own cache.
io_uring: Modern Linux async I/O interface with submission/completion rings in shared memory. Achieves near-hardware latency with batched syscalls (a minimal read example appears below).
SPDK (Storage Performance Development Kit): User-space NVMe driver completely bypassing kernel. Achieves maximum performance but requires application modification.
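To illustrate the io_uring model, here is a minimal single-read sketch using the liburing helper library (link with -luring); the file path is a placeholder and error handling is trimmed for brevity:

```c
/* Minimal io_uring read using liburing: one 4 KB read from offset 0. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;    /* 8-entry SQ/CQ rings */

    int fd = open("/path/to/file", O_RDONLY);               /* placeholder path    */
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(4096);

    /* Fill a submission queue entry, then submit (one syscall rings the doorbell). */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) return 1;
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    /* Reap the completion from the shared-memory completion ring. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}
```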
Device drivers are kernel modules that interface between the OS block layer and specific storage hardware. Well-designed drivers are essential for performance and reliability.
NVMe driver architecture (Linux nvme):
The NVMe driver is relatively simple compared to SCSI/ATA drivers because NVMe is a clean protocol:
Key driver responsibilities:

- Creating the admin queue and per-CPU I/O submission/completion queue pairs at probe time
- Translating block-layer requests into NVMe commands and mapping buffers for DMA
- Ringing doorbell registers to notify the device and handling completions via interrupts or polling
- Namespace discovery, error handling, controller resets, and power management
Interrupt modes:
Legacy interrupts: Single shared interrupt line. High overhead from shared interrupt handling. Rarely used for modern storage.
MSI (Message Signaled Interrupts): Dedicated interrupt per device. Lower latency than legacy.
MSI-X (Extended MSI): Multiple interrupts per device—typically one per CPU core/queue. Enables parallel completion processing without contention.
Polling mode: Driver continuously polls for completions instead of waiting for interrupts. Lower latency at cost of CPU usage. Useful for ultra-low-latency NVMe.
```c
// Simplified NVMe I/O flow

// 1. Application requests read
read(fd, buffer, 4096);

// 2. VFS → file system → block layer
block_request = create_request(LBA, length, READ);

// 3. Block layer dispatches to NVMe driver
nvme_queue_request(queue, block_request);

// 4. Driver builds NVMe command
nvme_cmd = {
    opcode: NVME_READ,
    lba:    block_request->lba,
    length: block_request->length / 512 - 1,
    prp1:   dma_addr(buffer),   // Physical address for DMA
};

// 5. Submit to hardware queue
submission_queue[tail] = nvme_cmd;
writel(tail, doorbell_register);   // Ring doorbell

// 6. Device processes request via DMA
// ...hardware reads from NAND, DMAs to buffer...

// 7. Completion interrupt fires
irq_handler() {
    while (completion_queue[head].valid) {
        complete_block_request(completion_queue[head].request);
        head++;
    }
}

// 8. Application unblocks with data in buffer
```

The traditional gap between volatile memory (DRAM) and persistent storage (HDD/SSD) is being bridged by new technologies that combine the characteristics of both.
Storage-class memory technologies:

- 3D XPoint (Intel Optane): byte-addressable and persistent, with latencies in the hundreds of nanoseconds — between DRAM and NAND (since discontinued, but influential)
- NVDIMMs: DRAM modules backed by flash and an energy store so contents survive power loss
- Emerging candidates such as MRAM and ReRAM targeting the same byte-addressable, persistent niche
Persistent memory programming:
With byte-addressable persistent memory, traditional file I/O becomes optional: applications can map persistent memory directly into their address space (DAX mappings) and update durable data structures with ordinary loads and stores, using cache-line flushes and fences (or libraries such as PMDK) to control when data actually reaches persistence.
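As a rough illustration of the load/store programming model, the sketch below maps a file and updates it with an ordinary store. The path is a placeholder, and on real persistent memory the file would sit on a DAX-mounted filesystem with PMDK-style cache flushes standing in for msync:

```c
/* Store-to-persistence sketch: map a file and update it with ordinary stores.
 * The path is a placeholder; on a real pmem setup the file would live on a
 * DAX-mounted filesystem, and libpmem would replace msync with flush + fence. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/mnt/pmem/counter", O_RDWR | O_CREAT, 0644);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    long *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    *p += 1;                      /* ordinary load/store, no read()/write() syscalls */
    msync(p, 4096, MS_SYNC);      /* force the update to persistent media            */
    printf("counter = %ld\n", *p);

    munmap(p, 4096);
    close(fd);
    return 0;
}
```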
ZNS (Zoned Namespace) SSDs:
A new SSD interface where the drive is divided into zones that must be written sequentially: the host writes each zone at its write pointer and resets whole zones to reclaim space. Because data placement is controlled by the host, the device needs little or no internal garbage collection, which reduces write amplification, over-provisioning, and on-device DRAM requirements.
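The write-pointer discipline is easy to model. The toy sketch below is not a real ZNS API, just an illustration of the rules: writes are accepted only at the zone's write pointer, and space is reclaimed only by resetting the whole zone:

```c
/* Toy model of a ZNS zone (illustrative, not a real ZNS API). */
#include <stdbool.h>
#include <stdio.h>

struct zone {
    long start;     /* first LBA of the zone             */
    long capacity;  /* writable blocks in the zone       */
    long wp;        /* write pointer, relative to start  */
};

/* Append-only write: succeeds only at the current write pointer. */
bool zone_write(struct zone *z, long lba, long blocks) {
    if (lba != z->start + z->wp || z->wp + blocks > z->capacity)
        return false;            /* out-of-order or overflowing writes are rejected */
    z->wp += blocks;
    return true;
}

void zone_reset(struct zone *z) { z->wp = 0; }   /* erase the zone, reclaim its space */

int main(void) {
    struct zone z = { .start = 0, .capacity = 1024, .wp = 0 };
    printf("sequential write: %s\n", zone_write(&z, 0, 8)   ? "ok" : "rejected");
    printf("random write:     %s\n", zone_write(&z, 512, 8) ? "ok" : "rejected");
    zone_reset(&z);
    return 0;
}
```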
The traditional 3-tier hierarchy (cache, RAM, disk) is becoming more nuanced: registers → cache → DRAM → Optane/NVRAM → fast NVMe SSD → slower QLC SSD → HDD → tape. Operating systems and applications are adapting to manage these multiple tiers intelligently.
We've explored secondary storage in depth—from spinning platters to flash cells, from SATA to NVMe, from device drivers to emerging technologies. Storage is where persistence meets performance, and understanding this tier is essential for systems design.
What's next:
Having explored all tiers of the memory hierarchy—registers, caches, main memory, and secondary storage—we'll now examine the access times and tradeoffs across the entire hierarchy. We'll quantify the performance gaps, understand the economic factors driving design decisions, and learn how to reason about memory hierarchy when designing systems and writing performance-critical code.
You now understand secondary storage technologies, interfaces, and OS integration. This knowledge is essential for understanding file systems, optimizing I/O-intensive applications, and designing storage architectures.