The recovery manager can guarantee durability only if it can write data to storage that actually survives failures. But here's an uncomfortable truth: no physical storage device is perfectly reliable. Disks fail. SSDs corrupt. Entire data centers lose power. Yet databases promise that committed transactions persist forever.
How can we build reliable systems from unreliable components? The answer is stable storage—an abstraction that provides reliability guarantees through redundancy, replication, and carefully designed I/O patterns. Understanding stable storage is understanding how theoretical durability becomes practical reality.
By the end of this page, you will understand the stable storage abstraction, the hierarchy of storage reliability levels, how redundancy techniques like mirroring and RAID create reliability, the role of careful I/O protocols in preventing data corruption, and how modern systems implement stable storage in practice.
Database theory defines stable storage as an idealized abstraction:
Definition: Stable storage is an abstraction of storage that never loses data. Once data is written to stable storage, it can always be read back, regardless of any failures.
Of course, this is physically impossible—any storage can be destroyed by sufficient physical force. But the abstraction is useful because we can approximate stable storage to an arbitrarily high degree of reliability using redundancy.
Storage Reliability Hierarchy:
| Storage Type | Reliability | Failure Modes | Use in Databases |
|---|---|---|---|
| Volatile Storage | Lowest—lost on power loss | Power failure, process crash | Buffer pool, query execution memory |
| Non-Volatile Storage | Medium—survives power loss | Disk failure, wear-out, corruption | Data files, log files |
| Stable Storage | Highest—survives any single failure | None by design (approximated via redundancy) | Log (abstractly), critical metadata |
Approximating Stable Storage:
Since true stable storage is impossible, databases use multiple independent redundant copies to approximate it:
┌────────────────────────────────────────────┐
Write Request ──▶ │ Stable Storage Interface │
└────────────────┬───────────────────────────┘
│
┌────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Copy 1 │ │ Copy 2 │ ... │ Copy N │
│ (Disk A) │ │ (Disk B) │ │ (Site Z) │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└────────────────┴───────────────────────────┘
│
Read Request ◀───────────────────┘ (any surviving copy)
(validate, resolve conflicts if needed)
The key insight: if copies are independent (stored on different physical devices, in different locations), then all copies failing simultaneously becomes arbitrarily unlikely.
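The write-to-all, read-any-surviving pattern in the diagram above can be sketched in a few lines. This is a minimal illustration, not a production protocol: the file names, the three-copy count, and the CRC32 checksum are stand-ins for genuinely independent devices.

```python
import os
import zlib

# Stand-ins for independent physical devices (illustrative paths)
COPIES = ["copy_a.bin", "copy_b.bin", "copy_c.bin"]

def stable_write(data: bytes) -> None:
    """Write data plus a checksum to every copy, flushing each to disk."""
    record = zlib.crc32(data).to_bytes(4, "big") + data
    for path in COPIES:
        with open(path, "wb") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # barrier: don't report success until durable

def stable_read() -> bytes:
    """Return data from any copy whose checksum still validates."""
    for path in COPIES:
        try:
            with open(path, "rb") as f:
                record = f.read()
            stored_crc, data = record[:4], record[4:]
            if zlib.crc32(data).to_bytes(4, "big") == stored_crc:
                return data
        except OSError:
            continue  # this copy is unreadable; try the next
    raise RuntimeError("all copies lost or corrupt")
```

A read succeeds as long as any one copy survives intact, which is exactly the property the redundancy argument below quantifies.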
Probability Example:
If a single disk has 99.9% annual reliability (a 0.1% annual failure probability), then data is lost only if every independent copy fails in the same year: with two copies, 0.001 × 0.001 = one in a million; with three, one in a billion.
With sufficient redundancy, data loss becomes vanishingly improbable—approximating stable storage.
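The arithmetic can be checked directly. The only inputs are the 0.1% annual failure rate above and the assumption that copies fail independently:

```python
def p_total_loss(p_fail: float, n_copies: int) -> float:
    """Probability that every independent copy fails in the same period."""
    return p_fail ** n_copies

# With a 0.1% annual failure probability per disk:
#   1 copy   -> ~1e-3  (one in a thousand)
#   2 copies -> ~1e-6  (one in a million)
#   3 copies -> ~1e-9  (one in a billion)
```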
Redundancy only helps if failures are independent. Two disks in the same enclosure may fail from the same power surge. Two copies in the same data center may burn together. True stable storage requires geographically and electrically independent copies—a point often overlooked until disaster strikes.
RAID is the foundational technology for implementing stable storage in practice. By spreading data across multiple disks with redundancy, RAID provides both improved performance and fault tolerance.
RAID Levels Relevant to Database Storage:
| RAID Level | Redundancy Type | Disk Overhead | Write Performance | Read Performance | Fault Tolerance |
|---|---|---|---|---|---|
| RAID 0 (Striping) | None | 0% | Excellent | Excellent | None (one disk fails = all data lost) |
| RAID 1 (Mirroring) | Full copy | 50% | Good (write to both) | Excellent (read from either) | Survives 1 disk failure |
| RAID 5 (Distributed Parity) | Parity blocks | 1/N | Moderate (parity calculation) | Good | Survives 1 disk failure |
| RAID 6 (Double Parity) | Two parity blocks | 2/N | Worse (dual parity) | Good | Survives 2 disk failures |
| RAID 10 (1+0) | Mirroring + Striping | 50% | Excellent | Excellent | Survives multiple failures (luck-dependent) |
Database Log vs Data File RAID Considerations:
| Component | Write Pattern | Recommended RAID | Rationale |
|---|---|---|---|
| Transaction Log | Sequential writes | RAID 1 or RAID 10 | Mirroring provides fast sequential write without parity overhead |
| Data Files | Random reads/writes | RAID 10 or RAID 6 | Need balance of performance and redundancy |
| Temp Space | Random, ephemeral | RAID 0 or RAID 10 | Performance matters; data is recreatable |
| Archive Logs | Sequential write, rare read | RAID 5/6 | Capacity matters; read performance secondary |
The RAID 5 Write Penalty:
RAID 5 has a notorious write penalty. For each small write, the array must:
1. Read the old data block
2. Read the old parity block
3. Write the new data block
4. Write the new parity block (old parity XOR old data XOR new data)
This '4 I/Os per write' penalty makes RAID 5 unsuitable for transaction logs or other high-write workloads. RAID 1/10, with only 2 I/Os per write, is preferred for performance-critical paths.
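The parity update at the heart of the penalty is a pure XOR identity: the new parity equals the old parity XOR old data XOR new data, so the other data disks never need to be read. A small sketch, assuming byte-aligned blocks of equal size:

```python
def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Recompute the parity block for a small write: P' = P ^ D_old ^ D_new.

    This identity is why each small write costs 4 I/Os: read old data,
    read old parity, write new data, write new parity.
    """
    return bytes(p ^ do ^ dn for p, do, dn in zip(old_parity, old_data, new_data))
```

Because XOR is its own inverse, the result matches recomputing parity from scratch across all data disks, without touching the unmodified ones.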
RAID protects against disk failure but not against corruption, accidental deletion, ransomware, or site-wide disasters. RAID keeps the database running when a disk dies, but proper backups are still essential for complete data protection. They serve different purposes.
Even with redundant disks, careless write protocols can corrupt data. Stable storage implementations must carefully handle partial writes and write ordering to prevent corruption.
The Partial Write Problem:
A disk block write is atomic (either it completes or it doesn't), but database pages are often larger than disk blocks. An 8KB database page may span multiple 512-byte or 4KB disk sectors. If power fails during a page write, some sectors may have new data, others have old data—a torn page.
Torn Page Preventions:
- Full-page writes: log a complete image of each page the first time it is modified after a checkpoint, so recovery can restore a clean copy (PostgreSQL's approach)
- Doublewrite buffer: write pages to a sequential scratch area before their final location, so a torn write always leaves one intact copy (InnoDB's approach)
- Page checksums: detect tears at read time so a corrupted page is never silently used
- Atomic-write hardware or copy-on-write filesystems: make the page write itself atomic
Write Ordering and Barrier Semantics:
Storage devices and operating systems may reorder writes for performance. This can violate assumptions about ordering:
Application writes:
1. Write log record (commit indication)
2. Acknowledge commit to application
3. Write data page (lazily)
Without barriers, storage might reorder to:
1. Write data page
2. Write log record
3. [CRASH HERE]
Result: Data page written without corresponding log record.
Recovery cannot redo/undo correctly.
Barrier Operations:
Databases must use appropriate barrier operations after log writes and before data writes to ensure correct ordering. This is a common source of bugs in database storage engines.
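The required ordering can be sketched with POSIX fsync() as the barrier. This is only an illustration of the ordering constraint: real engines batch flushes (group commit) and write data pages lazily at checkpoints, so the final flush here stands in for work that normally happens much later.

```python
import os

def commit(log_file, data_file, log_record: bytes, page: bytes) -> None:
    """WAL ordering sketch: the log record must be durable *before* the
    commit is acknowledged and before the data page may reach disk."""
    log_file.write(log_record)
    log_file.flush()
    os.fsync(log_file.fileno())   # barrier 1: log record is durable first
    # ... safe to acknowledge the commit to the application here ...
    data_file.write(page)
    data_file.flush()
    os.fsync(data_file.fileno())  # barrier 2: in real systems this happens
                                  # lazily, at checkpoint time
```

Dropping the first fsync (or letting the OS reorder the two writes) recreates exactly the failure scenario above: a data page on disk with no corresponding log record.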
Many storage devices have volatile write caches and may not honor fsync() correctly—they report success before data reaches persistent media. Some consumer SSDs are notorious for this. Production databases should use enterprise storage with proven fsync() compliance, or battery-backed cache that makes volatile caching safe.
Beyond RAID's local disk mirroring, databases employ additional mirroring and replication strategies for higher levels of durability:
Levels of Durability Through Replication:
| Strategy | Scope | Survives | Latency Impact | Example |
|---|---|---|---|---|
| RAID 1/10 | Same server | Single disk failure | Minimal (2 writes) | Local RAID controller |
| Synchronous local copy | Same server, different disks | Multiple disk failures | Minimal | Log mirroring to separate spindle |
| Synchronous remote copy | Different server, same rack | Server failure | Sub-millisecond | Storage replication, DRBD |
| Synchronous cross-AZ | Different availability zone | AZ failure | 1-5 ms | Cloud multi-AZ databases |
| Synchronous cross-region | Different geographic region | Regional disaster | 50-200 ms | CockroachDB, Spanner |
| Asynchronous replication | Remote standby | Primary failure (with RPO gap) | None (async) | PostgreSQL streaming, MySQL replication |
Synchronous vs Asynchronous Replication:
Durability strength correlates with replication synchronicity:
Synchronous Replication: Transaction doesn't commit until all replicas acknowledge the write. This guarantees durability even if the primary fails immediately after commit. Cost: increased commit latency.
Asynchronous Replication: Transaction commits when primary writes complete; replication happens in background. If primary fails before replication completes, committed transactions may be lost. Benefit: no added commit latency.
The Commit Latency Budget:
For a transaction commit involving synchronous cross-region replication, the budget stacks up roughly as: local WAL write and fsync (0.1-1 ms) + network round trip to the remote region (50-200 ms) + remote fsync and acknowledgment (0.1-1 ms). The network round trip dominates.
For latency-sensitive applications, this may be unacceptable. Solutions include quorum commits that wait only for the nearest majority of replicas, asynchronous replication to the most distant regions (accepting a non-zero RPO), and placing data in the region where it is most often written.
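The budget is a simple sum over the slowest required acknowledgment path. The figures below are illustrative, taken from the ranges in the replication table above:

```python
def commit_latency_ms(local_fsync: float, network_rtt: float,
                      remote_fsync: float) -> float:
    """Synchronous commit must wait for local durability, one network
    round trip, and the remote flush before acknowledging."""
    return local_fsync + network_rtt + remote_fsync

# Illustrative cross-region numbers: 0.5 ms local fsync, 80 ms RTT,
# 0.5 ms remote fsync -> 81 ms total, dominated by the network.
```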
Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time. RPO=0 means no committed transaction can ever be lost, requiring synchronous replication. Recovery Time Objective (RTO) is the maximum acceptable downtime. These two metrics drive replication and recovery architecture decisions.
The choice of storage media significantly impacts stable storage implementation. Each technology has different reliability characteristics and requires different handling:
Comparison of Storage Technologies:
| Characteristic | HDD | SATA SSD | NVMe SSD |
|---|---|---|---|
| Sequential Write | 100-200 MB/s | 400-500 MB/s | 2000-7000 MB/s |
| Random Write IOPS | 100-200 | 20,000-80,000 | 100,000-1,000,000 |
| fsync Latency | 10-15 ms | 0.1-0.5 ms | 0.01-0.1 ms |
| Power-Loss Protection | Physical (magnetic) | Varies (check specs) | Often included (enterprise) |
| Wear-Out Mechanism | Mechanical (bearings) | Flash cell degradation | Flash cell degradation |
| Endurance Rating | N/A (mechanical) | 0.1-3 DWPD | 1-10+ DWPD |
| Typical Failure Mode | Gradual (SMART warnings) | Sudden (cell exhaustion) | Sudden (cell exhaustion) |
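A rough way to locate a device in this table is to time small write-plus-fsync cycles; the file name and trial count below are arbitrary. A suspiciously low median on consumer hardware can indicate a volatile write cache lying about durability rather than a genuinely fast device:

```python
import os
import time

def measure_fsync_ms(path: str, trials: int = 50) -> float:
    """Median time for a small write + fsync: a crude probe of
    commit-path latency on the underlying device."""
    samples = []
    with open(path, "wb") as f:
        for _ in range(trials):
            f.write(b"x" * 512)
            f.flush()
            t0 = time.perf_counter()
            os.fsync(f.fileno())          # time only the durability barrier
            samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return samples[len(samples) // 2]     # median, robust to outliers
```

Compare the result against the table: roughly 10-15 ms suggests an HDD, sub-millisecond suggests flash, and tens of microseconds suggests NVMe or a write cache absorbing the flush.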
Stable Storage Implications:
HDDs:
- Fail gradually, often with SMART warnings that allow proactive replacement
- Mechanical seek makes fsync expensive (10-15 ms), so group commit is essential
- On-board write caches must be disabled or battery-backed to trust fsync
SSDs:
- Fail suddenly when flash cells wear out, so endurance (DWPD) must be monitored
- Power-loss protection varies by model and must be verified before trusting fsync
- No mechanical latency, so commit latency drops by orders of magnitude
NVMe SSDs:
- Same flash characteristics at far higher IOPS and lower latency
- Enterprise models typically include capacitor-backed power-loss protection
- The commit path becomes fast enough that software, not the device, is often the bottleneck
Database Configuration Differences:
With NVMe, traditional wisdom changes:
- Separating log and data onto different spindles matters far less; there is no seek arm to contend for
- The random-versus-sequential distinction largely disappears
- Group commit yields smaller gains because each individual fsync is already cheap
- Bottlenecks shift from the storage device to CPU, locking, and software overhead in the I/O path
Never use consumer SSDs for database transaction logs without understanding the risks. Consumer SSDs often lack power-loss protection and may report commits as durable when they're cached in volatile DRAM. Enterprise SSDs cost more but provide the guarantees stable storage requires.
Stable storage must not only survive failures but also detect corruption. Silent data corruption—where bits flip without detection—can be worse than outright failure because it propagates unnoticed.
Sources of Silent Corruption:
- Media degradation ('bit rot') on disk platters or flash cells
- Firmware bugs in drives and RAID controllers
- Misdirected writes (data written to the wrong sector) and lost writes (reported complete but never persisted)
- Errors in cables, controllers, or non-ECC memory
- Software bugs anywhere in the I/O stack
Checksum Strategies:
Per-Page Checksums: Databases store a checksum with each page (PostgreSQL, MySQL, Oracle). When reading a page, recalculate the checksum and compare:
┌──────────────────────────────────────────────────────────┐
│ Page Header │
│ ├── Page LSN │
│ ├── Checksum (CRC32, xxHash, etc.) │
│ └── Other metadata │
├──────────────────────────────────────────────────────────┤
│ Page Data (rows, index entries, etc.) │
├──────────────────────────────────────────────────────────┤
│ Page Footer (optional additional checksum, etc.) │
└──────────────────────────────────────────────────────────┘
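The layout above suggests a checksum that covers the whole page with its own field zeroed during computation, which is roughly how per-page checksums work in practice. A sketch, where the 4-byte CRC32 field at offset 8 in the header is an assumed layout detail, not any particular database's format:

```python
import zlib

PAGE_SIZE = 8192
CRC_OFF = 8  # assumed: checksum stored at bytes 8..11 of the page header

def write_checksum(page: bytearray) -> None:
    """Zero the checksum field, CRC the whole page, store the result."""
    page[CRC_OFF:CRC_OFF + 4] = b"\x00" * 4
    page[CRC_OFF:CRC_OFF + 4] = zlib.crc32(page).to_bytes(4, "big")

def verify_checksum(page: bytes) -> bool:
    """Recompute the CRC with the field zeroed; compare to the stored value."""
    stored = page[CRC_OFF:CRC_OFF + 4]
    scratch = bytearray(page)
    scratch[CRC_OFF:CRC_OFF + 4] = b"\x00" * 4
    return zlib.crc32(scratch).to_bytes(4, "big") == stored
```

Any single flipped bit anywhere in the page changes the CRC, so a read that fails verification can be routed to a mirror or replica instead of being served.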
Detection Coverage:
| Checksum Location | Protects Against |
|---|---|
| Database page checksum | Corruption on disk, in controller cache |
| Log record checksum | Log file corruption |
| Network checksum | Corruption in transit (replication) |
| End-to-end checksum (T10 DIF) | Corruption anywhere in stack |
Recovery from Detected Corruption:
Checksums detect corruption after the fact. By themselves, they don't prevent data loss. When a checksum mismatch is detected, the database needs a reliable source to recover from—RAID, replicas, or backups. Checksums without redundancy just report that data is gone.
Modern production databases implement stable storage through careful combination of multiple techniques:
PostgreSQL Stable Storage Implementation:
| Mechanism | Default | Purpose |
|---|---|---|
| full_page_writes | on | Prevents torn page corruption (full page image in WAL after checkpoint) |
| fsync | on | Forces WAL to durable storage |
| wal_sync_method | platform-dependent (fdatasync on Linux) | Controls how WAL reaches disk |
| synchronous_commit | on | Waits for WAL flush (and synchronous replicas, if configured) before reporting commit |
| data_checksums | off in many releases (enabled at initdb) | Validates page integrity |
MySQL/InnoDB Stable Storage Implementation:
| Mechanism | Default | Purpose |
|---|---|---|
| innodb_doublewrite | ON | Prevents torn page corruption |
| innodb_flush_log_at_trx_commit | 1 | fsync log at each commit |
| innodb_flush_method | fsync/O_DIRECT | Controls how I/O reaches disk |
| innodb_checksums | ON | Validates page integrity |
| sync_binlog | 1 | Synchronous binary log for replication |
Cloud Database Implementations:
Cloud databases often provide stable storage as a managed service:
- Amazon Aurora replicates each write six ways across three availability zones and acknowledges on a 4-of-6 quorum
- Google Spanner replicates synchronously via Paxos across zones or regions
- Azure SQL Database offers locally redundant and geo-redundant storage tiers
These managed implementations abstract stable storage complexity, providing durability guarantees without manual RAID/replication configuration.
Default configurations are often tuned for performance benchmarks, not maximum durability. Production deployments should explicitly review and configure durability settings. Document your durability posture so operations teams understand what's protected and what's at risk.
Stable storage is the foundation upon which durability is built. Through redundancy, careful protocols, and verification, we transform unreliable physical components into reliable logical storage. Let's consolidate the key insights:
- Stable storage cannot exist physically, but redundancy approximates it to any desired reliability
- Redundancy only counts when failures are independent; shared enclosures, power, or sites undermine it
- Careful write protocols (barriers, fsync, torn-page protection) matter as much as the redundancy itself
- Checksums detect corruption; only redundancy (RAID, replicas, backups) can repair it
- Replication synchronicity determines RPO: synchronous replication is the only path to RPO=0
What's Next:
With stable storage providing the reliable foundation and the recovery manager coordinating operations, we're ready to explore recovery algorithm overviews. The next page will survey the major recovery algorithm families—from simple deferred/immediate update schemes to the sophisticated ARIES algorithm that powers most modern databases.
You now understand stable storage from theory to implementation—the abstraction, the redundancy techniques, the write protocols, and the verification mechanisms. This foundation enables the recovery algorithms we'll explore next to provide meaningful durability guarantees.