When a database system confirms that your transaction has committed, it makes a profound promise: your data will survive. Not just for a few minutes, not just until the next power cycle, but permanently—surviving hardware failures, power outages, earthquakes, and even catastrophic disasters. This promise is the durability guarantee of ACID.
But how can any system make such an audacious promise? Computer hardware fails. Hard drives crash. SSDs wear out. Memory chips degrade. Data centers lose power. Buildings catch fire. If all storage is ultimately physical and all physical systems ultimately fail, how can any database claim to guarantee permanent storage?
The answer lies in a fundamental abstraction called stable storage—and the engineering techniques that approximate this theoretical ideal in practice.
By the end of this page, you will understand what stable storage means in theory, why it's essential for database durability, how real systems approximate stable storage guarantees, and the fundamental principles that guide the design of storage systems that 'never' lose data.
Before we can appreciate stable storage, we must understand the problem it solves. All practical storage devices are fundamentally unreliable at some level. Let's examine the failure characteristics of different storage types:
Volatile Storage (Main Memory / RAM)
RAM provides the fastest access times of any storage medium, but it has a critical flaw: volatility. When power is lost, all data in RAM disappears instantly and irrecoverably. This makes RAM unsuitable for durability guarantees—any data that exists only in RAM is vulnerable to power failures, system crashes, and unplanned restarts.
Moreover, RAM suffers from occasional bit flips due to cosmic rays and other radiation sources. While Error-Correcting Code (ECC) memory can detect and correct single-bit errors, multi-bit errors can cause silent data corruption.
Non-Volatile Storage (Hard Disks, SSDs)
Non-volatile storage retains data when power is removed, but it introduces new failure modes:
Industry studies consistently show that annual failure rates for hard drives range from 1-5%, with some models significantly higher. Google's famous study of 100,000+ drives found that after five years, roughly 25% of drives had failed. Even 'enterprise' SSDs designed for high reliability experience failures. No single device can be trusted for critical data.
The Compounding Problem of Scale
Individual device failure rates might seem manageable—a 2% annual failure rate means any single drive has a 98% chance of surviving the year. But database systems at scale operate many storage devices simultaneously:
With 10,000 drives at 2% annual failure rate, you expect 200 drive failures per year—roughly one every two days. At this scale, device failure isn't an exceptional event; it's a routine operational reality that systems must handle gracefully.
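The arithmetic behind this can be sketched in a few lines (a minimal illustration, assuming independent drives with a fixed annual failure rate):

```python
# Sketch: expected failures and the chance of at least one failure per year,
# assuming independent drives with the same annual failure rate.
def expected_failures(n_drives: int, annual_rate: float) -> float:
    return n_drives * annual_rate

def prob_at_least_one_failure(n_drives: int, annual_rate: float) -> float:
    # P(no drive fails) = (1 - rate)^n, so the complement is P(>= 1 failure)
    return 1 - (1 - annual_rate) ** n_drives

print(expected_failures(10_000, 0.02))      # → 200.0 expected failures/year
print(prob_at_least_one_failure(10, 0.02))  # even 10 drives: ~18% chance/year
```

Even a rack of ten drives has roughly an 18% chance of seeing a failure in a given year, which is why failure handling must be routine rather than exceptional.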
Given that all physical storage devices fail, database theorists defined an idealized abstraction called stable storage. This is storage that, by definition, never loses data under any circumstances.
Stable storage represents a theoretical concept—an impossibility in the strict sense, since all physical systems eventually fail. However, the concept is invaluable because it allows us to reason about database correctness and recovery separately from implementation details.
Definition of Stable Storage:
Stable storage is a storage abstraction with the following property: information stored in stable storage is never lost. Once a write to stable storage completes successfully, the written data survives all possible failures—power outages, system crashes, hardware failures, natural disasters, and any other event.
Mathematically, we can model stable storage as storage with probability of data loss equal to zero: P(loss) = 0.
Why Stable Storage Matters for Database Theory
The concept of stable storage is central to database recovery theory because it provides a foundation for proving correctness. When we specify recovery algorithms like ARIES, we say things like:
By assuming stable storage, we can prove that a recovery algorithm will correctly restore the database after any failure. The theoretical guarantees hold because stable storage, by definition, preserves all data.
This separation of concerns is powerful: we can design and verify recovery algorithms against the stable storage abstraction, then separately implement stable storage approximations using real hardware. Each layer can be analyzed independently.
The Storage Hierarchy Revisited:
| Storage Type | Characteristics | Failure Behavior | Role in Recovery |
|---|---|---|---|
| Volatile Storage | Fast, temporary (RAM) | Lost on power failure or crash | Holds working data; contents not trusted after crash |
| Non-Volatile Storage | Persistent (HDD, SSD) | Survives power loss but can fail | Primary storage; vulnerable to device failure |
| Stable Storage | Theoretically perfect persistence | Never loses data (by definition) | Log storage; foundation for durability guarantees |
The Key Insight:
The stable storage abstraction teaches us a crucial lesson: durability is not a property of any single device, but an emergent property of a system design. We achieve durability not by finding perfectly reliable hardware (which doesn't exist) but by designing systems that remain reliable despite component failures.
True stable storage is impossible—all physical systems eventually fail. However, we can build systems that approximate stable storage to an arbitrarily high degree of reliability. The fundamental technique is redundancy: storing data on multiple independent storage devices such that all copies would have to fail simultaneously for data loss to occur.
The Probability Argument:
Consider a storage device with annual failure probability P(fail) = 0.02 (2%). If we store data on a single device:
P(data loss) = 0.02
Now consider storing the same data on two independent devices. Data loss requires both devices to fail before either can be used to recover the other:
P(data loss) = 0.02 × 0.02 = 0.0004 (0.04%)
With three independent copies:
P(data loss) = 0.02 × 0.02 × 0.02 = 0.000008 (0.0008%)
Each additional copy multiplies the loss probability by another factor of 0.02—a fifty-fold improvement per copy in this example. With sufficient copies on sufficiently independent devices, we can achieve practical data safety indistinguishable from theoretical stable storage.
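The same calculation in code (a sketch that, like the figures above, assumes the copies fail independently):

```python
# Sketch: probability of losing data with k independent copies,
# each with the same annual failure probability.
def p_data_loss(p_fail: float, copies: int) -> float:
    # Data is lost only if every copy fails before any can be repaired.
    return p_fail ** copies

for k in (1, 2, 3):
    print(f"{k} copies: P(loss) = {p_data_loss(0.02, k):.6%}")
# matches the figures above: 2%, 0.04%, 0.0008%
```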
The probability calculation above assumes independent failures. If both devices share a power supply and a power surge destroys them simultaneously, they're not truly independent. Practical stable storage implementations must consider: separate power supplies, separate physical locations, different device manufacturers/batches, different failure modes. Independence determines actual redundancy value.
Three Fundamental Techniques for Stable Storage:
Real systems approximate stable storage using combinations of three main techniques:
1. Replication (Mirroring)
Maintain identical copies of data on multiple devices. Every write updates all copies. Any read can use any copy. As long as one copy survives, data is preserved.
2. Redundancy Coding (RAID Parity)
Store data plus computed redundancy information that allows reconstruction if some portion is lost. Less storage overhead than full mirroring, but more complex recovery.
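The parity idea can be illustrated with XOR, the operation RAID parity schemes are built on (a simplified sketch; real RAID also handles striping, device mapping, and partial-stripe writes):

```python
# Sketch: RAID-style parity with XOR. The parity block is the XOR of all
# data blocks; any single missing block can be rebuilt from the survivors.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def compute_parity(blocks: list) -> bytes:
    parity = bytes(len(blocks[0]))  # all-zero block
    for block in blocks:
        parity = xor_blocks(parity, block)
    return parity

blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = compute_parity(blocks)

# Simulate losing block 1: XOR the survivors with the parity to rebuild it.
rebuilt = compute_parity([blocks[0], blocks[2], parity])
assert rebuilt == blocks[1]
```

One parity block protects any number of data blocks against a single loss, which is why parity needs far less space than full mirroring.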
3. Geographic Distribution (Remote Backup)
Store copies at physically separate locations to survive site-level disasters (fire, flood, earthquake). Trades latency for disaster resilience.
In database systems, stable storage is most critical for the transaction log (also called the write-ahead log or WAL). The log is the foundation of durability—it records every change made by every transaction, allowing the system to recover to a consistent state after any failure.
Why the Log Requires Stable Storage:
The Write-Ahead Logging protocol specifies that log records must be written to stable storage before the corresponding data pages are written. This ordering guarantee, combined with stable storage, ensures:
Redo Recovery — If a committed transaction's changes were not yet written to the database, the log contains enough information to redo those changes.
Undo Recovery — If an uncommitted transaction's changes were written to the database, the log contains enough information to undo those changes.
If the log itself could be lost, recovery would be impossible. Data pages might contain partial updates from a transaction that never committed, with no way to identify or reverse them.
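The ordering requirement can be sketched with a toy model (the class names and record format here are illustrative, not a real DBMS API):

```python
# Sketch of write-ahead ordering: a page change is applied only after its
# log record (with before/after images) is in the assumed-stable log.
class MiniWAL:
    def __init__(self):
        self.stable_log = []  # stands in for stable storage
        self.pages = {}       # stands in for on-disk data pages

    def update(self, txn_id, page_id, old, new):
        # 1. Log first -- the before image enables undo, the after image redo.
        self.stable_log.append((txn_id, page_id, old, new))
        # 2. Only now may the page itself be written.
        self.pages[page_id] = new

    def recover(self, committed_txns):
        # Redo committed changes, then undo uncommitted ones in reverse order.
        for txn_id, page_id, old, new in self.stable_log:
            if txn_id in committed_txns:
                self.pages[page_id] = new
        for txn_id, page_id, old, new in reversed(self.stable_log):
            if txn_id not in committed_txns:
                self.pages[page_id] = old
```

Because every change reaches the log before the page, `recover` always has enough information to finish committed work and roll back the rest.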
After a crash, the database's data files may be in an inconsistent state. The log is what tells the recovery system what changes were committed and what changes were not. If the log is lost or corrupted, there's no authoritative record of what happened. This is why the log requires the highest level of storage reliability—stable storage or its best practical approximation.
Log Storage Implementation in Practice:
Real database systems protect log storage through multiple mechanisms:
1. Synchronous Log Writes
Log writes use synchronous I/O (fsync/fdatasync) to ensure data reaches persistent storage before returning success. Buffered writes that remain in OS caches are not considered durable because they would be lost on power failure.
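In Python, a durable append looks roughly like this (a sketch; `os.fdatasync` can be substituted where file metadata need not be flushed):

```python
# Sketch: a durable append. write() alone only reaches the OS page cache;
# fsync() blocks until the data reaches persistent storage.
import os

def durable_append(path: str, data: bytes) -> None:
    with open(path, "ab") as f:
        f.write(data)
        f.flush()             # user-space buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> storage device
```

Skipping the `fsync` makes the append dramatically faster and dramatically less durable: the data can sit in the OS cache for seconds before reaching the device.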
2. Log Mirroring
Many database systems support log mirroring, maintaining copies of the log on multiple devices. Oracle multiplexes its redo logs across multiple destinations, DB2 supports a mirror log path, and PostgreSQL recommends placing pg_wal on separate, reliable storage.
3. Battery-Backed Write Caches
Storage controllers with battery-backed or flash-backed write caches can acknowledge writes immediately while guaranteeing data will be written even if power fails before the write completes. This provides durability with improved latency.
4. Network-Attached Stable Storage
Enterprise storage arrays (SAN/NAS) provide stable storage services with built-in redundancy, mirroring, and battery backup. The database treats the storage array as stable, delegating redundancy management to specialized hardware.
| Configuration | Protection Level | Failure Scenarios Survived | Latency Impact |
|---|---|---|---|
| Single Disk (Async) | Minimal | Clean shutdown only | None |
| Single Disk (Sync) | Basic | Application crash, OS crash | Moderate |
| Mirrored Disks (Sync) | Good | Single device failure | Low |
| RAID + Battery Cache | High | Multiple failures, power loss | Minimal |
| Replicated Storage + Remote Backup | Excellent | Site disaster | Variable |
Redundancy alone is not sufficient for stable storage. We must also detect when stored data becomes corrupted. Silent data corruption—where bits change without the storage system detecting an error—can propagate to all redundant copies if not caught in time.
The Checksum Solution:
A checksum is a computed value derived from the data that can be used to verify data integrity. When writing data, we compute and store the checksum alongside the data. When reading, we recompute the checksum and compare it to the stored value.
If checksums differ, we know the data is corrupted. With redundant copies, we can read from an alternative copy that passes checksum verification.
```
// Writing data with checksum
function write_with_checksum(block_id, data):
    checksum = compute_checksum(data)
    storage.write(block_id, data + checksum)
    // For stable storage, write to all replicas
    for replica in replicas:
        replica.write(block_id, data + checksum)
    // Only return success when all replicas confirm
    wait_for_all_replicas()

// Reading data with checksum verification
function read_with_verification(block_id):
    data_with_checksum = storage.read(block_id)
    data = data_with_checksum[:-checksum_size]
    stored_checksum = data_with_checksum[-checksum_size:]
    computed_checksum = compute_checksum(data)
    if computed_checksum == stored_checksum:
        return data
    else:
        // Primary copy corrupted, try replicas
        for replica in replicas:
            replica_data = replica.read(block_id)
            if verify_checksum(replica_data):
                // Repair primary from replica
                storage.write(block_id, replica_data)
                return replica_data[:-checksum_size]
        // All copies corrupted - data loss
        raise DataCorruptionError("Unrecoverable corruption")
```

Types of Checksums Used in Database Systems:
CRC32 (Cyclic Redundancy Check)
Widely used for error detection. Fast to compute, 32-bit output detects most corruption patterns. Used by PostgreSQL for data pages, InnoDB for pages and log records.
xxHash
A modern, extremely fast non-cryptographic hash. Provides excellent error detection with minimal CPU overhead. Seeing increasing adoption in performance-sensitive systems.
SHA-256 (Cryptographic Hash)
Provides stronger guarantees against intentional tampering in addition to corruption detection. Higher CPU cost, used when security is a concern (e.g., detecting unauthorized modifications).
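For the two checksum families available in Python's standard library, the trade-off can be demonstrated directly (xxHash requires a third-party package and is omitted here):

```python
# Sketch: CRC-32 (fast error detection) vs SHA-256 (tamper-resistant,
# higher CPU cost), both from the standard library.
import zlib, hashlib

page = b"database page contents" * 100

crc = zlib.crc32(page)                  # 32-bit checksum
sha = hashlib.sha256(page).hexdigest()  # 256-bit cryptographic digest

# A single flipped bit changes both values.
corrupted = bytes([page[0] ^ 0x01]) + page[1:]
assert zlib.crc32(corrupted) != crc
assert hashlib.sha256(corrupted).hexdigest() != sha
```

Both catch accidental corruption; only the cryptographic hash resists an adversary who can recompute checksums to hide deliberate tampering.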
Enterprise storage systems perform periodic 'scrubbing'—reading all data and verifying checksums. This proactive verification detects latent corruption before it spreads or before redundancy degrades due to device failures. ZFS, Btrfs, and enterprise RAID controllers all implement scrubbing. The goal is to repair corruption while enough redundancy remains to recover the correct data.
Another critical aspect of stable storage is write atomicity—ensuring that writes either complete fully or have no effect. Partial writes, called torn writes or fractured writes, can corrupt data in ways that are difficult to detect and recover from.
The Torn Write Problem:
Consider a database that writes 16KB pages to storage that has 4KB atomic write units. If power fails during a page write, some 4KB sectors might contain new data while others contain old data:
Page Write in Progress:
```
| Sector 1: NEW | Sector 2: NEW | Sector 3: OLD | Sector 4: OLD |
                                ^
                        Power failed here
```
The resulting page is a corrupted mix of old and new data. Per-sector checksums won't catch the problem—each sector individually contains valid data. A whole-page checksum will detect the mixture, but detection alone doesn't help: without an intact copy elsewhere, neither the old page nor the new page can be reconstructed.
SSDs often have 4KB atomic write units regardless of how they present themselves to the OS. Many filesystems have 4KB block sizes. Database pages are often 8KB or 16KB. Any mismatch creates torn write vulnerability during power loss. Even 'enterprise' drives with power-loss protection may not guarantee atomic writes larger than their internal block size.
Solutions to Torn Writes:
1. Double-Write Buffer (MySQL InnoDB)
Before writing a page to its final location, InnoDB first writes it to a special double-write buffer and flushes that buffer to disk. After a crash, recovery checks each page: if the copy in the tablespace fails its checksum, the page is restored from its double-write copy.
Because the two writes happen sequentially, a crash can tear at most one of the two locations—at least one always holds a valid version of the page.
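The recovery check can be sketched as follows (illustrative names and a CRC-32 page checksum, not InnoDB's actual on-disk layout):

```python
# Sketch of double-write recovery: after a crash, any tablespace page that
# fails its checksum is restored from the double-write copy, which was
# flushed before the tablespace write began.
import zlib

def page_ok(page_with_crc: bytes) -> bool:
    data, stored = page_with_crc[:-4], page_with_crc[-4:]
    return zlib.crc32(data).to_bytes(4, "big") == stored

def recover_doublewrite(tablespace: dict, doublewrite: dict) -> None:
    for page_id, dw_copy in doublewrite.items():
        if not page_ok(tablespace.get(page_id, b"")) and page_ok(dw_copy):
            tablespace[page_id] = dw_copy  # repair the torn page
```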
2. Full Page Logging (PostgreSQL)
After a checkpoint, the first modification to each page logs the entire page content in the WAL. After a crash, recovery can restore the complete page from the log, avoiding reliance on the potentially-torn on-disk version.
3. Copy-on-Write Filesystems (ZFS, Btrfs)
Never overwrite existing data. New versions are written to new locations, and metadata is atomically updated to point to the new location. Crashes leave either the old version or the new version, never a mixture.
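At the file level, the same either-old-or-new guarantee can be approximated with a write-then-atomic-rename pattern (a sketch; on some filesystems the containing directory also needs an fsync for full durability):

```python
# Sketch of the copy-on-write idea at file granularity: write the new
# version elsewhere, make it durable, then atomically switch names.
# A crash leaves either the old file or the new one, never a mixture.
import os, tempfile

def atomic_replace(path: str, new_contents: bytes) -> None:
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        os.write(fd, new_contents)
        os.fsync(fd)             # new version is durable before the switch
    finally:
        os.close(fd)
    os.replace(tmp_path, path)   # atomic rename on POSIX filesystems
```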
4. Hardware Atomic Writes
Some modern SSDs and storage controllers support configurable atomic write sizes, guaranteeing that writes up to a certain size (e.g., 16KB) are atomic. This requires specific hardware support and configuration.
| Mechanism | Implementation | Space Overhead | Performance Impact |
|---|---|---|---|
| Double-Write Buffer | MySQL InnoDB | 2x for double-write area | Moderate (extra writes) |
| Full Page Logging | PostgreSQL | Log size increases | Low (first write only) |
| Copy-on-Write | ZFS, Btrfs | Filesystem overhead | Variable |
| Hardware Atomic Write | NVMe with atomicity support | None | None |
Let's examine how real database systems implement stable storage principles for their transaction logs and critical data:
PostgreSQL Write-Ahead Log (WAL)
PostgreSQL's WAL implementation demonstrates several stable storage principles:
Synchronous Writes — By default, WAL records are fsynced to disk before transaction commit returns. The synchronous_commit setting controls this behavior.
WAL Segment Files — The log is divided into 16MB segment files. Full segments can be archived before being recycled.
Checksum Protection — WAL records include CRC-32 checksums. Recovery validates checksums and refuses to apply corrupted records.
Full Page Images — After checkpoints, the first modification to each page includes a full page image, protecting against torn writes.
```sql
-- PostgreSQL WAL and durability settings
-- (most of these are server parameters: set them in postgresql.conf
--  or via ALTER SYSTEM, not per session)

-- Ensure WAL is synced before commit returns
SET synchronous_commit = on;               -- Default: wait for WAL flush
-- SET synchronous_commit = remote_apply;  -- Wait for replica to apply

-- Control checkpointing frequency (postgresql.conf)
-- checkpoint_timeout = '5min'             -- Maximum time between checkpoints
-- checkpoint_completion_target = 0.9      -- Spread checkpoint I/O

-- Full page writes protect against torn pages (postgresql.conf)
-- full_page_writes = on                   -- Required for crash safety

-- WAL archiving for point-in-time recovery (postgresql.conf)
-- archive_mode = on
-- archive_command = 'cp %p /archive/%f'

-- Verify data integrity
-- wal_log_hints = on            -- Log full pages even for hint-bit changes
-- initdb --data-checksums       -- Enable page checksums (at cluster init)
```

MySQL InnoDB Redo Log
InnoDB's redo log implementation includes:
Circular Log Files — Two or more log files of fixed size, written in round-robin fashion. Simplifies space management.
Log Sequence Numbers (LSN) — Every log entry has a unique, monotonically increasing LSN. Enables precise recovery positioning.
Double-Write Buffer — Protects against torn page writes to the tablespace.
Group Commit — Multiple transactions can share a single fsync(), improving throughput without sacrificing durability.
```sql
-- MySQL InnoDB durability settings
-- (several of these are static server options: set them in my.cnf
--  and restart, rather than with SET)

-- Control durability guarantees
SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- Fsync on every commit (safest)
-- = 2: write to OS cache on commit, fsync ~once per second (faster, less safe)
-- = 0: write to log buffer, flush ~once per second (fastest, least safe)

-- Redo log configuration (static; my.cnf)
-- innodb_log_file_size = 1G        -- Size of each redo log file
-- innodb_log_files_in_group = 2    -- Number of redo log files

-- Double-write protection (typically set in my.cnf)
-- innodb_doublewrite = ON          -- Required for torn-page protection

-- Checksum algorithm
SET GLOBAL innodb_checksum_algorithm = crc32;   -- Page checksum method

-- Flush method (static; my.cnf)
-- innodb_flush_method = O_DIRECT   -- Bypass OS cache for data files
```

Both PostgreSQL and MySQL allow relaxing durability guarantees for improved performance. This is appropriate for testing, development, or workloads where losing the last second of transactions is acceptable. For financial systems, healthcare records, or any data loss intolerant application, always use the most conservative settings.
We've established the conceptual foundation of stable storage—the theoretical abstraction that enables database durability guarantees. The key insights:

- Stable storage is an idealized abstraction—storage that never loses data—that lets us reason about recovery correctness separately from hardware details.
- Real systems approximate it through redundancy: replication, parity coding, and geographic distribution across independent failure domains.
- Redundancy must be paired with detection: checksums and periodic scrubbing catch silent corruption before it spreads to every copy.
- Writes must be atomic: double-write buffers, full-page logging, and copy-on-write designs prevent torn pages.
- The transaction log demands the strongest stable-storage approximation, because recovery depends entirely on it.
What's Next:
Now that we understand the concept of stable storage and why it matters, we'll explore the specific technologies used to implement it. The next page examines RAID (Redundant Array of Independent Disks)—the foundational technology for building reliable storage from unreliable components.
You now understand stable storage as a theoretical concept and practical engineering challenge. This foundation will help you appreciate the specific technologies—RAID, mirroring, and remote backup—that we'll explore in subsequent pages of this module.