The recovery manager can guarantee durability only if it can write data to storage that actually survives failures. But here's an uncomfortable truth: no physical storage device is perfectly reliable. Disks fail. SSDs corrupt. Entire data centers lose power. Yet databases promise that committed transactions persist forever.
How can we build reliable systems from unreliable components? The answer is stable storage—an abstraction that provides reliability guarantees through redundancy, replication, and carefully designed I/O patterns. Understanding stable storage is understanding how theoretical durability becomes practical reality.
By the end of this page, you will understand the stable storage abstraction, the hierarchy of storage reliability levels, how redundancy techniques like mirroring and RAID create reliability, the role of careful I/O protocols in preventing data corruption, and how modern systems implement stable storage in practice.
Database theory defines stable storage as an idealized abstraction:
Definition: Stable storage is an abstraction of storage that never loses data. Once data is written to stable storage, it can always be read back, regardless of any failures.
Of course, this is physically impossible—any storage can be destroyed by sufficient physical force. But the abstraction is useful because we can approximate stable storage to an arbitrarily high degree of reliability using redundancy.
Storage Reliability Hierarchy:
| Storage Type | Reliability | Failure Modes | Use in Databases |
|---|---|---|---|
| Volatile Storage | Lowest—lost on power loss | Power failure, process crash | Buffer pool, query execution memory |
| Non-Volatile Storage | Medium—survives power loss | Disk failure, wear-out, corruption | Data files, log files |
| Stable Storage | Highest—survives any single failure | None by design (approximated via redundancy) | Log (abstractly), critical metadata |
Approximating Stable Storage:
Since true stable storage is impossible, databases use multiple independent redundant copies to approximate it:
┌────────────────────────────────────────────┐
Write Request ──▶ │ Stable Storage Interface │
└────────────────┬───────────────────────────┘
│
┌────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Copy 1 │ │ Copy 2 │ ... │ Copy N │
│ (Disk A) │ │ (Disk B) │ │ (Site Z) │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└────────────────┴───────────────────────────┘
│
Read Request ◀───────────────────┘ (any surviving copy)
(validate, resolve conflicts if needed)
The key insight: if copies are independent (stored on different physical devices, in different locations), then all copies failing simultaneously becomes arbitrarily unlikely.
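The write-to-all, read-any-surviving pattern in the diagram above can be sketched in a few lines. This is a minimal illustration, not a production protocol: the file names, the three-copy count, and the CRC32 checksum are stand-ins for genuinely independent devices.

```python
import os
import zlib

# Stand-ins for independent physical devices (illustrative paths)
COPIES = ["copy_a.bin", "copy_b.bin", "copy_c.bin"]

def stable_write(data: bytes) -> None:
    """Write data plus a checksum to every copy, flushing each to disk."""
    record = zlib.crc32(data).to_bytes(4, "big") + data
    for path in COPIES:
        with open(path, "wb") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # barrier: don't report success until durable

def stable_read() -> bytes:
    """Return data from any copy whose checksum still validates."""
    for path in COPIES:
        try:
            with open(path, "rb") as f:
                record = f.read()
            stored_crc, data = record[:4], record[4:]
            if zlib.crc32(data).to_bytes(4, "big") == stored_crc:
                return data
        except OSError:
            continue  # this copy is unreadable; try the next
    raise RuntimeError("all copies lost or corrupt")
```

A read succeeds as long as any one copy survives intact, which is exactly the property the redundancy argument below quantifies.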
Probability Example:
If a single disk has 99.9% annual reliability (a 0.1% annual failure probability), then data is lost only if every independent copy fails in the same year: with two copies, 0.001 × 0.001 = one in a million; with three, one in a billion.
With sufficient redundancy, data loss becomes vanishingly improbable—approximating stable storage.
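The arithmetic can be checked directly. The only inputs are the 0.1% annual failure rate above and the assumption that copies fail independently:

```python
def p_total_loss(p_fail: float, n_copies: int) -> float:
    """Probability that every independent copy fails in the same period."""
    return p_fail ** n_copies

# With a 0.1% annual failure probability per disk:
#   1 copy   -> ~1e-3  (one in a thousand)
#   2 copies -> ~1e-6  (one in a million)
#   3 copies -> ~1e-9  (one in a billion)
```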
Redundancy only helps if failures are independent. Two disks in the same enclosure may fail from the same power surge. Two copies in the same data center may burn together. True stable storage requires geographically and electrically independent copies—a point often overlooked until disaster strikes.
RAID is the foundational technology for implementing stable storage in practice. By spreading data across multiple disks with redundancy, RAID provides both improved performance and fault tolerance.
RAID Levels Relevant to Database Storage:
| RAID Level | Redundancy Type | Disk Overhead | Write Performance | Read Performance | Fault Tolerance |
|---|---|---|---|---|---|
| RAID 0 (Striping) | None | 0% | Excellent | Excellent | None (one disk fails = all data lost) |
| RAID 1 (Mirroring) | Full copy | 50% | Good (write to both) | Excellent (read from either) | Survives 1 disk failure |
| RAID 5 (Distributed Parity) | Parity blocks | 1/N | Moderate (parity calculation) | Good | Survives 1 disk failure |
| RAID 6 (Double Parity) | Two parity blocks | 2/N | Worse (dual parity) | Good | Survives 2 disk failures |
| RAID 10 (1+0) | Mirroring + Striping | 50% | Excellent | Excellent | Survives multiple failures (luck-dependent) |
Database Log vs Data File RAID Considerations:
| Component | Write Pattern | Recommended RAID | Rationale |
|---|---|---|---|
| Transaction Log | Sequential writes | RAID 1 or RAID 10 | Mirroring provides fast sequential write without parity overhead |
| Data Files | Random reads/writes | RAID 10 or RAID 6 | Need balance of performance and redundancy |
| Temp Space | Random, ephemeral | RAID 0 or RAID 10 | Performance matters; data is recreatable |
| Archive Logs | Sequential write, rare read | RAID 5/6 | Capacity matters; read performance secondary |
The RAID 5 Write Penalty:
RAID 5 has a notorious write penalty. For each small write, the array must:
1. Read the old data block
2. Read the old parity block
3. Write the new data block
4. Write the new parity block (old parity XOR old data XOR new data)
This '4 I/Os per write' penalty makes RAID 5 unsuitable for transaction logs or other high-write workloads. RAID 1/10, with only 2 I/Os per write, is preferred for performance-critical paths.
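The parity update at the heart of the penalty is a pure XOR identity: the new parity equals the old parity XOR old data XOR new data, so the other data disks never need to be read. A small sketch, assuming byte-aligned blocks of equal size:

```python
def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Recompute the parity block for a small write: P' = P ^ D_old ^ D_new.

    This identity is why each small write costs 4 I/Os: read old data,
    read old parity, write new data, write new parity.
    """
    return bytes(p ^ do ^ dn for p, do, dn in zip(old_parity, old_data, new_data))
```

Because XOR is its own inverse, the result matches recomputing parity from scratch across all data disks, without touching the unmodified ones.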
RAID protects against disk failure but not against corruption, accidental deletion, ransomware, or site-wide disasters. RAID keeps the database running when a disk dies, but proper backups are still essential for complete data protection. They serve different purposes.
Even with redundant disks, careless write protocols can corrupt data. Stable storage implementations must carefully handle partial writes and write ordering to prevent corruption.
The Partial Write Problem:
A disk block write is atomic (either it completes or it doesn't), but database pages are often larger than disk blocks. An 8KB database page may span multiple 512-byte or 4KB disk sectors. If power fails during a page write, some sectors may have new data, others have old data—a torn page.
Torn Page Preventions:
- Full-page writes: log a complete image of each page the first time it is modified after a checkpoint, so recovery can restore a clean copy (PostgreSQL's approach)
- Doublewrite buffer: write pages to a sequential scratch area before their final location, so a torn write always leaves one intact copy (InnoDB's approach)
- Page checksums: detect tears at read time so a corrupted page is never silently used
- Atomic-write hardware or copy-on-write filesystems: make the page write itself atomic
Write Ordering and Barrier Semantics:
Storage devices and operating systems may reorder writes for performance. This can violate assumptions about ordering:
Application writes:
1. Write log record (commit indication)
2. Acknowledge commit to application
3. Write data page (lazily)
Without barriers, storage might reorder to:
1. Write data page
2. Write log record
3. [CRASH HERE]
Result: Data page written without corresponding log record.
Recovery cannot redo/undo correctly.
Barrier Operations:
Databases must use appropriate barrier operations after log writes and before data writes to ensure correct ordering. This is a common source of bugs in database storage engines.
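The required ordering can be sketched with POSIX fsync() as the barrier. This is only an illustration of the ordering constraint: real engines batch flushes (group commit) and write data pages lazily at checkpoints, so the final flush here stands in for work that normally happens much later.

```python
import os

def commit(log_file, data_file, log_record: bytes, page: bytes) -> None:
    """WAL ordering sketch: the log record must be durable *before* the
    commit is acknowledged and before the data page may reach disk."""
    log_file.write(log_record)
    log_file.flush()
    os.fsync(log_file.fileno())   # barrier 1: log record is durable first
    # ... safe to acknowledge the commit to the application here ...
    data_file.write(page)
    data_file.flush()
    os.fsync(data_file.fileno())  # barrier 2: in real systems this happens
                                  # lazily, at checkpoint time
```

Dropping the first fsync (or letting the OS reorder the two writes) recreates exactly the failure scenario above: a data page on disk with no corresponding log record.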
Many storage devices have volatile write caches and may not honor fsync() correctly—they report success before data reaches persistent media. Some consumer SSDs are notorious for this. Production databases should use enterprise storage with proven fsync() compliance, or battery-backed cache that makes volatile caching safe.
Beyond RAID's local disk mirroring, databases employ additional mirroring and replication strategies for higher levels of durability:
Levels of Durability Through Replication:
| Strategy | Scope | Survives | Latency Impact | Example |
|---|---|---|---|---|
| RAID 1/10 | Same server | Single disk failure | Minimal (2 writes) | Local RAID controller |
| Synchronous local copy | Same server, different disks | Multiple disk failures | Minimal | Log mirroring to separate spindle |
| Synchronous remote copy | Different server, same rack | Server failure | Sub-millisecond | Storage replication, DRBD |
| Synchronous cross-AZ | Different availability zone | AZ failure | 1-5 ms | Cloud multi-AZ databases |
| Synchronous cross-region | Different geographic region | Regional disaster | 50-200 ms | CockroachDB, Spanner |
| Asynchronous replication | Remote standby | Primary failure (with RPO gap) | None (async) | PostgreSQL streaming, MySQL replication |
Synchronous vs Asynchronous Replication:
Durability strength correlates with replication synchronicity:
Synchronous Replication: Transaction doesn't commit until all replicas acknowledge the write. This guarantees durability even if the primary fails immediately after commit. Cost: increased commit latency.
Asynchronous Replication: Transaction commits when primary writes complete; replication happens in background. If primary fails before replication completes, committed transactions may be lost. Benefit: no added commit latency.
The Commit Latency Budget:
For a transaction commit involving synchronous cross-region replication, the budget stacks up roughly as: local WAL write and fsync (0.1-1 ms) + network round trip to the remote region (50-200 ms) + remote fsync and acknowledgment (0.1-1 ms). The network round trip dominates.
For latency-sensitive applications, this may be unacceptable. Solutions include quorum commits that wait only for the nearest majority of replicas, asynchronous replication to the most distant regions (accepting a non-zero RPO), and placing data in the region where it is most often written.
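The budget is a simple sum over the slowest required acknowledgment path. The figures below are illustrative, taken from the ranges in the replication table above:

```python
def commit_latency_ms(local_fsync: float, network_rtt: float,
                      remote_fsync: float) -> float:
    """Synchronous commit must wait for local durability, one network
    round trip, and the remote flush before acknowledging."""
    return local_fsync + network_rtt + remote_fsync

# Illustrative cross-region numbers: 0.5 ms local fsync, 80 ms RTT,
# 0.5 ms remote fsync -> 81 ms total, dominated by the network.
```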
Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time. RPO=0 means no committed transaction can ever be lost, requiring synchronous replication. Recovery Time Objective (RTO) is the maximum acceptable downtime. These two metrics drive replication and recovery architecture decisions.
The choice of storage media significantly impacts stable storage implementation. Each technology has different reliability characteristics and requires different handling:
Comparison of Storage Technologies:
| Characteristic | HDD | SATA SSD | NVMe SSD |
|---|---|---|---|
| Sequential Write | 100-200 MB/s | 400-500 MB/s | 2000-7000 MB/s |
| Random Write IOPS | 100-200 | 20,000-80,000 | 100,000-1,000,000 |
| fsync Latency | 10-15 ms | 0.1-0.5 ms | 0.01-0.1 ms |
| Power-Loss Protection | Physical (magnetic) | Varies (check specs) | Often included (enterprise) |
| Wear-Out Mechanism | Mechanical (bearings) | Flash cell degradation | Flash cell degradation |
| Endurance Rating | N/A (mechanical) | 0.1-3 DWPD | 1-10+ DWPD |
| Typical Failure Mode | Gradual (SMART warnings) | Sudden (cell exhaustion) | Sudden (cell exhaustion) |
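A rough way to locate a device in this table is to time small write-plus-fsync cycles; the file name and trial count below are arbitrary. A suspiciously low median on consumer hardware can indicate a volatile write cache lying about durability rather than a genuinely fast device:

```python
import os
import time

def measure_fsync_ms(path: str, trials: int = 50) -> float:
    """Median time for a small write + fsync: a crude probe of
    commit-path latency on the underlying device."""
    samples = []
    with open(path, "wb") as f:
        for _ in range(trials):
            f.write(b"x" * 512)
            f.flush()
            t0 = time.perf_counter()
            os.fsync(f.fileno())          # time only the durability barrier
            samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return samples[len(samples) // 2]     # median, robust to outliers
```

Compare the result against the table: roughly 10-15 ms suggests an HDD, sub-millisecond suggests flash, and tens of microseconds suggests NVMe or a write cache absorbing the flush.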
Stable Storage Implications:
HDDs:
- Fail gradually, often with SMART warnings that allow proactive replacement
- Mechanical seek makes fsync expensive (10-15 ms), so group commit is essential
- On-board write caches must be disabled or battery-backed to trust fsync
SSDs:
- Fail suddenly when flash cells wear out, so endurance (DWPD) must be monitored
- Power-loss protection varies by model and must be verified before trusting fsync
- No mechanical latency, so commit latency drops by orders of magnitude
NVMe SSDs:
- Same flash characteristics at far higher IOPS and lower latency
- Enterprise models typically include capacitor-backed power-loss protection
- The commit path becomes fast enough that software, not the device, is often the bottleneck
Database Configuration Differences:
With NVMe, traditional wisdom changes:
- Separating log and data onto different spindles matters far less; there is no seek arm to contend for
- The random-versus-sequential distinction largely disappears
- Group commit yields smaller gains because each individual fsync is already cheap
- Bottlenecks shift from the storage device to CPU, locking, and software overhead in the I/O path
Never use consumer SSDs for database transaction logs without understanding the risks. Consumer SSDs often lack power-loss protection and may report commits as durable when they're cached in volatile DRAM. Enterprise SSDs cost more but provide the guarantees stable storage requires.
Stable storage must not only survive failures but also detect corruption. Silent data corruption—where bits flip without detection—can be worse than outright failure because it propagates unnoticed.
Sources of Silent Corruption:
- Media degradation ('bit rot') on disk platters or flash cells
- Firmware bugs in drives and RAID controllers
- Misdirected writes (data written to the wrong sector) and lost writes (reported complete but never persisted)
- Errors in cables, controllers, or non-ECC memory
- Software bugs anywhere in the I/O stack
Checksum Strategies:
Per-Page Checksums: Databases store a checksum with each page (PostgreSQL, MySQL, Oracle). When reading a page, recalculate the checksum and compare:
┌──────────────────────────────────────────────────────────┐
│ Page Header │
│ ├── Page LSN │
│ ├── Checksum (CRC32, xxHash, etc.) │
│ └── Other metadata │
├──────────────────────────────────────────────────────────┤
│ Page Data (rows, index entries, etc.) │
├──────────────────────────────────────────────────────────┤
│ Page Footer (optional additional checksum, etc.) │
└──────────────────────────────────────────────────────────┘
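The layout above suggests a checksum that covers the whole page with its own field zeroed during computation, which is roughly how per-page checksums work in practice. A sketch, where the 4-byte CRC32 field at offset 8 in the header is an assumed layout detail, not any particular database's format:

```python
import zlib

PAGE_SIZE = 8192
CRC_OFF = 8  # assumed: checksum stored at bytes 8..11 of the page header

def write_checksum(page: bytearray) -> None:
    """Zero the checksum field, CRC the whole page, store the result."""
    page[CRC_OFF:CRC_OFF + 4] = b"\x00" * 4
    page[CRC_OFF:CRC_OFF + 4] = zlib.crc32(page).to_bytes(4, "big")

def verify_checksum(page: bytes) -> bool:
    """Recompute the CRC with the field zeroed; compare to the stored value."""
    stored = page[CRC_OFF:CRC_OFF + 4]
    scratch = bytearray(page)
    scratch[CRC_OFF:CRC_OFF + 4] = b"\x00" * 4
    return zlib.crc32(scratch).to_bytes(4, "big") == stored
```

Any single flipped bit anywhere in the page changes the CRC, so a read that fails verification can be routed to a mirror or replica instead of being served.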
Detection Coverage:
| Checksum Location | Protects Against |
|---|---|
| Database page checksum | Corruption on disk, in controller cache |
| Log record checksum | Log file corruption |
| Network checksum | Corruption in transit (replication) |
| End-to-end checksum (T10 DIF) | Corruption anywhere in stack |
Recovery from Detected Corruption:
Checksums detect corruption after the fact. By themselves, they don't prevent data loss. When a checksum mismatch is detected, the database needs a reliable source to recover from—RAID, replicas, or backups. Checksums without redundancy just report that data is gone.
Modern production databases implement stable storage through careful combination of multiple techniques:
PostgreSQL Stable Storage Implementation:
| Mechanism | Default | Purpose |
|---|---|---|
| full_page_writes | on | Prevents torn page corruption (full page image in WAL after checkpoint) |
| fsync | on | Forces WAL to durable storage |
| wal_sync_method | platform-dependent (fdatasync on Linux) | Controls how WAL reaches disk |
| synchronous_commit | on | Waits for WAL flush (and synchronous replicas, if configured) before reporting commit |
| data_checksums | off in many releases (enabled at initdb) | Validates page integrity |
MySQL/InnoDB Stable Storage Implementation:
| Mechanism | Default | Purpose |
|---|---|---|
| innodb_doublewrite | ON | Prevents torn page corruption |
| innodb_flush_log_at_trx_commit | 1 | fsync log at each commit |
| innodb_flush_method | fsync/O_DIRECT | Controls how I/O reaches disk |
| innodb_checksums | ON | Validates page integrity |
| sync_binlog | 1 | Synchronous binary log for replication |
Cloud Database Implementations:
Cloud databases often provide stable storage as a managed service:
- Amazon Aurora replicates each write six ways across three availability zones and acknowledges on a 4-of-6 quorum
- Google Spanner replicates synchronously via Paxos across zones or regions
- Azure SQL Database offers locally redundant and geo-redundant storage tiers
These managed implementations abstract stable storage complexity, providing durability guarantees without manual RAID/replication configuration.
Default configurations are often tuned for performance benchmarks, not maximum durability. Production deployments should explicitly review and configure durability settings. Document your durability posture so operations teams understand what's protected and what's at risk.
Stable storage is the foundation upon which durability is built. Through redundancy, careful protocols, and verification, we transform unreliable physical components into reliable logical storage. Let's consolidate the key insights:
- Stable storage cannot exist physically, but redundancy approximates it to any desired reliability
- Redundancy only counts when failures are independent; shared enclosures, power, or sites undermine it
- Careful write protocols (barriers, fsync, torn-page protection) matter as much as the redundancy itself
- Checksums detect corruption; only redundancy (RAID, replicas, backups) can repair it
- Replication synchronicity determines RPO: synchronous replication is the only path to RPO=0
What's Next:
With stable storage providing the reliable foundation and the recovery manager coordinating operations, we're ready to explore recovery algorithm overviews. The next page will survey the major recovery algorithm families—from simple deferred/immediate update schemes to the sophisticated ARIES algorithm that powers most modern databases.
You now understand stable storage from theory to implementation—the abstraction, the redundancy techniques, the write protocols, and the verification mechanisms. This foundation enables the recovery algorithms we'll explore next to provide meaningful durability guarantees.