When a database system confirms that your transaction has committed, it makes a profound promise: your data will survive. Not just for a few minutes, not just until the next power cycle, but permanently—surviving hardware failures, power outages, earthquakes, and even catastrophic disasters. This promise is the durability guarantee of ACID.
But how can any system make such an audacious promise? Computer hardware fails. Hard drives crash. SSDs wear out. Memory chips degrade. Data centers lose power. Buildings catch fire. If all storage is ultimately physical and all physical systems ultimately fail, how can any database claim to guarantee permanent storage?
The answer lies in a fundamental abstraction called stable storage—and the engineering techniques that approximate this theoretical ideal in practice.
By the end of this page, you will understand what stable storage means in theory, why it's essential for database durability, how real systems approximate stable storage guarantees, and the fundamental principles that guide the design of storage systems that 'never' lose data.
Before we can appreciate stable storage, we must understand the problem it solves. All practical storage devices are fundamentally unreliable at some level. Let's examine the failure characteristics of different storage types:
Volatile Storage (Main Memory / RAM)
RAM provides the fastest access times of any storage medium, but it has a critical flaw: volatility. When power is lost, all data in RAM disappears instantly and irrecoverably. This makes RAM unsuitable for durability guarantees—any data that exists only in RAM is vulnerable to power failures, system crashes, and unplanned restarts.
Moreover, RAM suffers from occasional bit flips due to cosmic rays and other radiation sources. While Error-Correcting Code (ECC) memory can detect and correct single-bit errors, multi-bit errors can cause silent data corruption.
Non-Volatile Storage (Hard Disks, SSDs)
Non-volatile storage retains data when power is removed, but it introduces new failure modes:
Industry studies consistently show that annual failure rates for hard drives range from 1-5%, with some models significantly higher. Google's famous study of 100,000+ drives found that after five years, roughly 25% of drives had failed. Even 'enterprise' SSDs designed for high reliability experience failures. No single device can be trusted for critical data.
The Compounding Problem of Scale
Individual device failure rates might seem manageable—a 2% annual failure rate means any single drive has a 98% chance of surviving the year. But database systems at scale operate many storage devices simultaneously:
With 10,000 drives at 2% annual failure rate, you expect 200 drive failures per year—roughly one every two days. At this scale, device failure isn't an exceptional event; it's a routine operational reality that systems must handle gracefully.
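The arithmetic behind this can be sketched in a few lines (a minimal illustration, assuming independent drives with a fixed annual failure rate):

```python
# Sketch: expected failures and the chance of at least one failure per year,
# assuming independent drives with the same annual failure rate.
def expected_failures(n_drives: int, annual_rate: float) -> float:
    return n_drives * annual_rate

def prob_at_least_one_failure(n_drives: int, annual_rate: float) -> float:
    # P(no drive fails) = (1 - rate)^n, so the complement is P(>= 1 failure)
    return 1 - (1 - annual_rate) ** n_drives

print(expected_failures(10_000, 0.02))      # → 200.0 expected failures/year
print(prob_at_least_one_failure(10, 0.02))  # even 10 drives: ~18% chance/year
```

Even a rack of ten drives has roughly an 18% chance of seeing a failure in a given year, which is why failure handling must be routine rather than exceptional.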
Given that all physical storage devices fail, database theorists defined an idealized abstraction called stable storage. This is storage that, by definition, never loses data under any circumstances.
Stable storage represents a theoretical concept—an impossibility in the strict sense, since all physical systems eventually fail. However, the concept is invaluable because it allows us to reason about database correctness and recovery separately from implementation details.
Definition of Stable Storage:
Stable storage is a storage abstraction with the following property: information stored in stable storage is never lost. Once a write to stable storage completes successfully, the written data survives all possible failures—power outages, system crashes, hardware failures, natural disasters, and any other event.
Mathematically, we can model stable storage as storage with probability of data loss equal to zero: P(loss) = 0.
Why Stable Storage Matters for Database Theory
The concept of stable storage is central to database recovery theory because it provides a foundation for proving correctness. When we specify recovery algorithms like ARIES, we say things like:
By assuming stable storage, we can prove that a recovery algorithm will correctly restore the database after any failure. The theoretical guarantees hold because stable storage, by definition, preserves all data.
This separation of concerns is powerful: we can design and verify recovery algorithms against the stable storage abstraction, then separately implement stable storage approximations using real hardware. Each layer can be analyzed independently.
The Storage Hierarchy Revisited:
| Storage Type | Characteristics | Failure Behavior | Role in Recovery |
|---|---|---|---|
| Volatile Storage | Fast, temporary (RAM) | Lost on power failure or crash | Holds working data; contents not trusted after crash |
| Non-Volatile Storage | Persistent (HDD, SSD) | Survives power loss but can fail | Primary storage; vulnerable to device failure |
| Stable Storage | Theoretically perfect persistence | Never loses data (by definition) | Log storage; foundation for durability guarantees |
The Key Insight:
The stable storage abstraction teaches us a crucial lesson: durability is not a property of any single device, but an emergent property of a system design. We achieve durability not by finding perfectly reliable hardware (which doesn't exist) but by designing systems that remain reliable despite component failures.
True stable storage is impossible—all physical systems eventually fail. However, we can build systems that approximate stable storage to an arbitrarily high degree of reliability. The fundamental technique is redundancy: storing data on multiple independent storage devices such that all copies would have to fail simultaneously for data loss to occur.
The Probability Argument:
Consider a storage device with annual failure probability P(fail) = 0.02 (2%). If we store data on a single device:
P(data loss) = 0.02
Now consider storing the same data on two independent devices. Data loss requires both devices to fail before either can be used to recover the other:
P(data loss) = 0.02 × 0.02 = 0.0004 (0.04%)
With three independent copies:
P(data loss) = 0.02 × 0.02 × 0.02 = 0.000008 (0.0008%)
Each additional copy multiplies the loss probability by another factor of 0.02—a fifty-fold improvement per copy in this example. With sufficient copies on sufficiently independent devices, we can achieve practical data safety indistinguishable from theoretical stable storage.
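The same calculation in code (a sketch that, like the figures above, assumes the copies fail independently):

```python
# Sketch: probability of losing data with k independent copies,
# each with the same annual failure probability.
def p_data_loss(p_fail: float, copies: int) -> float:
    # Data is lost only if every copy fails before any can be repaired.
    return p_fail ** copies

for k in (1, 2, 3):
    print(f"{k} copies: P(loss) = {p_data_loss(0.02, k):.6%}")
# matches the figures above: 2%, 0.04%, 0.0008%
```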
The probability calculation above assumes independent failures. If both devices share a power supply and a power surge destroys them simultaneously, they're not truly independent. Practical stable storage implementations must consider: separate power supplies, separate physical locations, different device manufacturers/batches, different failure modes. Independence determines actual redundancy value.
Three Fundamental Techniques for Stable Storage:
Real systems approximate stable storage using combinations of three main techniques:
1. Replication (Mirroring)
Maintain identical copies of data on multiple devices. Every write updates all copies. Any read can use any copy. As long as one copy survives, data is preserved.
2. Redundancy Coding (RAID Parity)
Store data plus computed redundancy information that allows reconstruction if some portion is lost. Less storage overhead than full mirroring, but more complex recovery.
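The parity idea can be illustrated with XOR, the operation RAID parity schemes are built on (a simplified sketch; real RAID also handles striping, device mapping, and partial-stripe writes):

```python
# Sketch: RAID-style parity with XOR. The parity block is the XOR of all
# data blocks; any single missing block can be rebuilt from the survivors.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def compute_parity(blocks: list) -> bytes:
    parity = bytes(len(blocks[0]))  # all-zero block
    for block in blocks:
        parity = xor_blocks(parity, block)
    return parity

blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = compute_parity(blocks)

# Simulate losing block 1: XOR the survivors with the parity to rebuild it.
rebuilt = compute_parity([blocks[0], blocks[2], parity])
assert rebuilt == blocks[1]
```

One parity block protects any number of data blocks against a single loss, which is why parity needs far less space than full mirroring.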
3. Geographic Distribution (Remote Backup)
Store copies at physically separate locations to survive site-level disasters (fire, flood, earthquake). Trades latency for disaster resilience.
In database systems, stable storage is most critical for the transaction log (also called the write-ahead log or WAL). The log is the foundation of durability—it records every change made by every transaction, allowing the system to recover to a consistent state after any failure.
Why the Log Requires Stable Storage:
The Write-Ahead Logging protocol specifies that log records must be written to stable storage before the corresponding data pages are written. This ordering guarantee, combined with stable storage, ensures:
Redo Recovery — If a committed transaction's changes were not yet written to the database, the log contains enough information to redo those changes.
Undo Recovery — If an uncommitted transaction's changes were written to the database, the log contains enough information to undo those changes.
If the log itself could be lost, recovery would be impossible. Data pages might contain partial updates from a transaction that never committed, with no way to identify or reverse them.
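The ordering requirement can be sketched with a toy model (the class names and record format here are illustrative, not a real DBMS API):

```python
# Sketch of write-ahead ordering: a page change is applied only after its
# log record (with before/after images) is in the assumed-stable log.
class MiniWAL:
    def __init__(self):
        self.stable_log = []  # stands in for stable storage
        self.pages = {}       # stands in for on-disk data pages

    def update(self, txn_id, page_id, old, new):
        # 1. Log first -- the before image enables undo, the after image redo.
        self.stable_log.append((txn_id, page_id, old, new))
        # 2. Only now may the page itself be written.
        self.pages[page_id] = new

    def recover(self, committed_txns):
        # Redo committed changes, then undo uncommitted ones in reverse order.
        for txn_id, page_id, old, new in self.stable_log:
            if txn_id in committed_txns:
                self.pages[page_id] = new
        for txn_id, page_id, old, new in reversed(self.stable_log):
            if txn_id not in committed_txns:
                self.pages[page_id] = old
```

Because every change reaches the log before the page, `recover` always has enough information to finish committed work and roll back the rest.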
After a crash, the database's data files may be in an inconsistent state. The log is what tells the recovery system what changes were committed and what changes were not. If the log is lost or corrupted, there's no authoritative record of what happened. This is why the log requires the highest level of storage reliability—stable storage or its best practical approximation.
Log Storage Implementation in Practice:
Real database systems protect log storage through multiple mechanisms:
1. Synchronous Log Writes
Log writes use synchronous I/O (fsync/fdatasync) to ensure data reaches persistent storage before returning success. Buffered writes that remain in OS caches are not considered durable because they would be lost on power failure.
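In Python, a durable append looks roughly like this (a sketch; `os.fdatasync` can be substituted where file metadata need not be flushed):

```python
# Sketch: a durable append. write() alone only reaches the OS page cache;
# fsync() blocks until the data reaches persistent storage.
import os

def durable_append(path: str, data: bytes) -> None:
    with open(path, "ab") as f:
        f.write(data)
        f.flush()             # user-space buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> storage device
```

Skipping the `fsync` makes the append dramatically faster and dramatically less durable: the data can sit in the OS cache for seconds before reaching the device.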
2. Log Mirroring
Many database systems support log mirroring, maintaining copies of the log on multiple devices. Oracle multiplexes its redo logs across multiple destinations, DB2 supports a mirror log path, and PostgreSQL recommends placing pg_wal on separate, reliable storage.
3. Battery-Backed Write Caches
Storage controllers with battery-backed or flash-backed write caches can acknowledge writes immediately while guaranteeing data will be written even if power fails before the write completes. This provides durability with improved latency.
4. Network-Attached Stable Storage
Enterprise storage arrays (SAN/NAS) provide stable storage services with built-in redundancy, mirroring, and battery backup. The database treats the storage array as stable, delegating redundancy management to specialized hardware.
| Configuration | Protection Level | Failure Scenarios Survived | Latency Impact |
|---|---|---|---|
| Single Disk (Async) | Minimal | Clean shutdown only | None |
| Single Disk (Sync) | Basic | Application crash, OS crash | Moderate |
| Mirrored Disks (Sync) | Good | Single device failure | Low |
| RAID + Battery Cache | High | Multiple failures, power loss | Minimal |
| Replicated Storage + Remote Backup | Excellent | Site disaster | Variable |
Redundancy alone is not sufficient for stable storage. We must also detect when stored data becomes corrupted. Silent data corruption—where bits change without the storage system detecting an error—can propagate to all redundant copies if not caught in time.
The Checksum Solution:
A checksum is a computed value derived from the data that can be used to verify data integrity. When writing data, we compute and store the checksum alongside the data. When reading, we recompute the checksum and compare it to the stored value.
If checksums differ, we know the data is corrupted. With redundant copies, we can read from an alternative copy that passes checksum verification.
```
// Writing data with checksum
function write_with_checksum(block_id, data):
    checksum = compute_checksum(data)
    storage.write(block_id, data + checksum)
    // For stable storage, write to all replicas
    for replica in replicas:
        replica.write(block_id, data + checksum)
    // Only return success when all replicas confirm
    wait_for_all_replicas()

// Reading data with checksum verification
function read_with_verification(block_id):
    data_with_checksum = storage.read(block_id)
    data = data_with_checksum[:-checksum_size]
    stored_checksum = data_with_checksum[-checksum_size:]
    computed_checksum = compute_checksum(data)
    if computed_checksum == stored_checksum:
        return data
    else:
        // Primary copy corrupted, try replicas
        for replica in replicas:
            replica_data = replica.read(block_id)
            if verify_checksum(replica_data):
                // Repair primary from replica
                storage.write(block_id, replica_data)
                return replica_data[:-checksum_size]
        // All copies corrupted - data loss
        raise DataCorruptionError("Unrecoverable corruption")
```

Types of Checksums Used in Database Systems:
CRC32 (Cyclic Redundancy Check)
Widely used for error detection. Fast to compute, 32-bit output detects most corruption patterns. Used by PostgreSQL for data pages, InnoDB for pages and log records.
xxHash
A modern, extremely fast non-cryptographic hash. Provides excellent error detection with minimal CPU overhead. Seeing increasing adoption in performance-sensitive systems.
SHA-256 (Cryptographic Hash)
Provides stronger guarantees against intentional tampering in addition to corruption detection. Higher CPU cost, used when security is a concern (e.g., detecting unauthorized modifications).
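For the two checksum families available in Python's standard library, the trade-off can be demonstrated directly (xxHash requires a third-party package and is omitted here):

```python
# Sketch: CRC-32 (fast error detection) vs SHA-256 (tamper-resistant,
# higher CPU cost), both from the standard library.
import zlib, hashlib

page = b"database page contents" * 100

crc = zlib.crc32(page)                  # 32-bit checksum
sha = hashlib.sha256(page).hexdigest()  # 256-bit cryptographic digest

# A single flipped bit changes both values.
corrupted = bytes([page[0] ^ 0x01]) + page[1:]
assert zlib.crc32(corrupted) != crc
assert hashlib.sha256(corrupted).hexdigest() != sha
```

Both catch accidental corruption; only the cryptographic hash resists an adversary who can recompute checksums to hide deliberate tampering.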
Enterprise storage systems perform periodic 'scrubbing'—reading all data and verifying checksums. This proactive verification detects latent corruption before it spreads or before redundancy degrades due to device failures. ZFS, Btrfs, and enterprise RAID controllers all implement scrubbing. The goal is to repair corruption while enough redundancy remains to recover the correct data.
Another critical aspect of stable storage is write atomicity—ensuring that writes either complete fully or have no effect. Partial writes, called torn writes or fractured writes, can corrupt data in ways that are difficult to detect and recover from.
The Torn Write Problem:
Consider a database that writes 16KB pages to storage that has 4KB atomic write units. If power fails during a page write, some 4KB sectors might contain new data while others contain old data:
Page Write in Progress:
```
| Sector 1: NEW | Sector 2: NEW | Sector 3: OLD | Sector 4: OLD |
                                ^
                        Power failed here
```
The resulting page is a corrupted mix of old and new data. Per-sector checksums won't catch the problem—each sector individually contains valid data. A whole-page checksum will detect the mixture, but detection alone doesn't help: without an intact copy elsewhere, neither the old page nor the new page can be reconstructed.
SSDs often have 4KB atomic write units regardless of how they present themselves to the OS. Many filesystems have 4KB block sizes. Database pages are often 8KB or 16KB. Any mismatch creates torn write vulnerability during power loss. Even 'enterprise' drives with power-loss protection may not guarantee atomic writes larger than their internal block size.
Solutions to Torn Writes:
1. Double-Write Buffer (MySQL InnoDB)
Before writing a page to its final location, InnoDB first writes it to a special double-write buffer and flushes that buffer to disk. After a crash, recovery checks each page: if the copy in the tablespace fails its checksum, the page is restored from its double-write copy.
Because the two writes happen sequentially, a crash can tear at most one of the two locations—at least one always holds a valid version of the page.
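The recovery check can be sketched as follows (illustrative names and a CRC-32 page checksum, not InnoDB's actual on-disk layout):

```python
# Sketch of double-write recovery: after a crash, any tablespace page that
# fails its checksum is restored from the double-write copy, which was
# flushed before the tablespace write began.
import zlib

def page_ok(page_with_crc: bytes) -> bool:
    data, stored = page_with_crc[:-4], page_with_crc[-4:]
    return zlib.crc32(data).to_bytes(4, "big") == stored

def recover_doublewrite(tablespace: dict, doublewrite: dict) -> None:
    for page_id, dw_copy in doublewrite.items():
        if not page_ok(tablespace.get(page_id, b"")) and page_ok(dw_copy):
            tablespace[page_id] = dw_copy  # repair the torn page
```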
2. Full Page Logging (PostgreSQL)
After a checkpoint, the first modification to each page logs the entire page content in the WAL. After a crash, recovery can restore the complete page from the log, avoiding reliance on the potentially-torn on-disk version.
3. Copy-on-Write Filesystems (ZFS, Btrfs)
Never overwrite existing data. New versions are written to new locations, and metadata is atomically updated to point to the new location. Crashes leave either the old version or the new version, never a mixture.
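At the file level, the same either-old-or-new guarantee can be approximated with a write-then-atomic-rename pattern (a sketch; on some filesystems the containing directory also needs an fsync for full durability):

```python
# Sketch of the copy-on-write idea at file granularity: write the new
# version elsewhere, make it durable, then atomically switch names.
# A crash leaves either the old file or the new one, never a mixture.
import os, tempfile

def atomic_replace(path: str, new_contents: bytes) -> None:
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        os.write(fd, new_contents)
        os.fsync(fd)             # new version is durable before the switch
    finally:
        os.close(fd)
    os.replace(tmp_path, path)   # atomic rename on POSIX filesystems
```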
4. Hardware Atomic Writes
Some modern SSDs and storage controllers support configurable atomic write sizes, guaranteeing that writes up to a certain size (e.g., 16KB) are atomic. This requires specific hardware support and configuration.
| Mechanism | Implementation | Space Overhead | Performance Impact |
|---|---|---|---|
| Double-Write Buffer | MySQL InnoDB | 2x for double-write area | Moderate (extra writes) |
| Full Page Logging | PostgreSQL | Log size increases | Low (first write only) |
| Copy-on-Write | ZFS, Btrfs | Filesystem overhead | Variable |
| Hardware Atomic Write | NVMe with atomicity support | None | None |
Let's examine how real database systems implement stable storage principles for their transaction logs and critical data:
PostgreSQL Write-Ahead Log (WAL)
PostgreSQL's WAL implementation demonstrates several stable storage principles:
Synchronous Writes — By default, WAL records are fsynced to disk before transaction commit returns. The synchronous_commit setting controls this behavior.
WAL Segment Files — The log is divided into 16MB segment files. Full segments can be archived before being recycled.
Checksum Protection — WAL records include CRC-32 checksums. Recovery validates checksums and refuses to apply corrupted records.
Full Page Images — After checkpoints, the first modification to each page includes a full page image, protecting against torn writes.
```sql
-- PostgreSQL WAL and durability settings
-- (most of these are server parameters: set them in postgresql.conf
--  or via ALTER SYSTEM, not per session)

-- Ensure WAL is synced before commit returns
SET synchronous_commit = on;               -- Default: wait for WAL flush
-- SET synchronous_commit = remote_apply;  -- Wait for replica to apply

-- Control checkpointing frequency (postgresql.conf)
-- checkpoint_timeout = '5min'             -- Maximum time between checkpoints
-- checkpoint_completion_target = 0.9      -- Spread checkpoint I/O

-- Full page writes protect against torn pages (postgresql.conf)
-- full_page_writes = on                   -- Required for crash safety

-- WAL archiving for point-in-time recovery (postgresql.conf)
-- archive_mode = on
-- archive_command = 'cp %p /archive/%f'

-- Verify data integrity
-- wal_log_hints = on            -- Log full pages even for hint-bit changes
-- initdb --data-checksums       -- Enable page checksums (at cluster init)
```

MySQL InnoDB Redo Log
InnoDB's redo log implementation includes:
Circular Log Files — Two or more log files of fixed size, written in round-robin fashion. Simplifies space management.
Log Sequence Numbers (LSN) — Every log entry has a unique, monotonically increasing LSN. Enables precise recovery positioning.
Double-Write Buffer — Protects against torn page writes to the tablespace.
Group Commit — Multiple transactions can share a single fsync(), improving throughput without sacrificing durability.
```sql
-- MySQL InnoDB durability settings
-- (several of these are static server options: set them in my.cnf
--  and restart, rather than with SET)

-- Control durability guarantees
SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- Fsync on every commit (safest)
-- = 2: write to OS cache on commit, fsync ~once per second (faster, less safe)
-- = 0: write to log buffer, flush ~once per second (fastest, least safe)

-- Redo log configuration (static; my.cnf)
-- innodb_log_file_size = 1G        -- Size of each redo log file
-- innodb_log_files_in_group = 2    -- Number of redo log files

-- Double-write protection (typically set in my.cnf)
-- innodb_doublewrite = ON          -- Required for torn-page protection

-- Checksum algorithm
SET GLOBAL innodb_checksum_algorithm = crc32;   -- Page checksum method

-- Flush method (static; my.cnf)
-- innodb_flush_method = O_DIRECT   -- Bypass OS cache for data files
```

Both PostgreSQL and MySQL allow relaxing durability guarantees for improved performance. This is appropriate for testing, development, or workloads where losing the last second of transactions is acceptable. For financial systems, healthcare records, or any data loss intolerant application, always use the most conservative settings.
We've established the conceptual foundation of stable storage—the theoretical abstraction that enables database durability guarantees. The key insights:

- Stable storage is an idealized abstraction—storage that never loses data—that lets us reason about recovery correctness separately from hardware details.
- Real systems approximate it through redundancy: replication, parity coding, and geographic distribution across independent failure domains.
- Redundancy must be paired with detection: checksums and periodic scrubbing catch silent corruption before it spreads to every copy.
- Writes must be atomic: double-write buffers, full-page logging, and copy-on-write designs prevent torn pages.
- The transaction log demands the strongest stable-storage approximation, because recovery depends entirely on it.
What's Next:
Now that we understand the concept of stable storage and why it matters, we'll explore the specific technologies used to implement it. The next page examines RAID (Redundant Array of Independent Disks)—the foundational technology for building reliable storage from unreliable components.
You now understand stable storage as a theoretical concept and practical engineering challenge. This foundation will help you appreciate the specific technologies—RAID, mirroring, and remote backup—that we'll explore in subsequent pages of this module.