The stock trade executed perfectly. The buyer clicked 'Confirm,' the system deducted funds, recorded the position, and displayed 'Transaction Complete.' The customer took a screenshot and closed their browser. Thirty seconds later, a transformer exploded at the data center. Power vanished. Servers died instantly.
When power returned and the database restarted, was that trade recorded? Did the customer buy the stock or not? If the database says 'no trade exists' while the customer has a screenshot confirming it did, the brokerage faces a serious legal and financial problem.
Durability is the ACID property that answers this question definitively: Once a transaction commits, its effects are permanent. Not 'probably permanent.' Not 'permanent unless there's a power failure.' Absolutely, unconditionally permanent—surviving crashes, restarts, hardware failures, and any other disaster short of physical destruction of all storage media.
By the end of this page, you will understand how databases achieve the seemingly impossible: guaranteeing that data survives failures that occur after the commit but before all data reaches stable storage. You'll learn about write-ahead logging, checkpointing, fsync, and the storage hierarchy that makes durability possible.
Durability is the ACID property that guarantees committed transactions are never lost. It represents the ultimate promise a database makes about data persistence.
Formal Definition:
A transaction exhibits durability if, once the transaction has been committed, its effects are permanently recorded in the database and will survive any subsequent failure—including power outages, operating system crashes, and hardware failures.
The key insight is that durability must survive failures that occur after the commit. The moment the database acknowledges 'COMMIT successful,' it has made an irrevocable promise: this data will be here when you look for it, regardless of what happens next.
Durability means 'data is not lost.' High availability means 'data is accessible.' A durable but crashed system still has the data—it's just temporarily inaccessible until recovery. These are complementary but distinct properties. Durability is about data persistence; availability is about data accessibility.
The Problem Durability Solves:
Computers have a memory hierarchy:
The challenge: for performance, databases keep recent data in RAM. But committing a transaction requires the data to reach persistent storage before acknowledging success. Otherwise, a power failure loses 'committed' data.
| Storage Level | Speed | Survives Power Loss? | Typical Size |
|---|---|---|---|
| CPU Registers | ~1 nanosecond | No | ~1 KB |
| L1/L2/L3 Cache | ~1-10 nanoseconds | No | ~64 MB |
| Main Memory (RAM) | ~100 nanoseconds | No | ~128 GB |
| NVMe SSD | ~10-100 microseconds | Yes* | ~1-8 TB |
| SATA SSD | ~100-500 microseconds | Yes* | ~1-4 TB |
| HDD | ~1-10 milliseconds | Yes | ~10-20 TB |
SSDs have volatile write caches that improve performance but can lose data on power failure. Enterprise SSDs include capacitors to flush the cache to persistent NAND on power loss. Consumer SSDs may not. Additionally, the operating system and filesystem have their own caches. True durability requires bypassing or flushing all these caches—which is what fsync() does.
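To make the cache problem concrete, here is a minimal Python sketch of a durable append (an illustration, not any database's actual code): `os.write()` alone may leave the data in the OS page cache, and only `os.fsync()` blocks until the device confirms persistence.

```python
import os

def durable_append(path: str, record: bytes) -> None:
    """Append a record and force it through the volatile caches to disk."""
    # O_APPEND guarantees each write lands at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)  # data may still sit in the OS page cache
        os.fsync(fd)          # block until the storage device confirms the write
    finally:
        os.close(fd)

durable_append("wal.log", b"LSN 000004 T1 COMMIT\n")
```

Note that even this is not the whole story: for a newly created file, a careful implementation also fsyncs the containing directory so the file's directory entry survives a crash.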
The primary mechanism for achieving durability is Write-Ahead Logging (WAL), also called transaction logging or redo logging. The concept is elegant: before modifying any data page, first write a log record describing the change. This log record is forced to stable storage before acknowledging the commit.
The WAL Protocol:

The protocol imposes two ordering rules:

1. Before a modified data page is written to disk, every log record describing changes to that page must already be on stable storage.
2. Before a commit is acknowledged to the client, all of the transaction's log records, including its COMMIT record, must be on stable storage.
Writing to the log is sequential (append-only), while writing data pages requires random I/O to different disk locations. Sequential writes are much faster, especially on HDDs. By making the commit path sequential (log write only) and deferring random I/O (data page writes) to background processes, WAL achieves both durability and performance.
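This split between the sequential commit path and deferred random writes can be sketched as a toy model (illustrative only, not a real database's internals): the only forced I/O on commit is the log append plus fsync, while data pages change in memory.

```python
import json
import os

class MiniWAL:
    """Toy write-ahead log: the commit path touches only the sequential log."""

    def __init__(self, log_path: str):
        self.log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pages = {}  # in-memory "buffer pool" of data pages
        self.lsn = 0

    def _log(self, record: dict) -> None:
        self.lsn += 1
        record["lsn"] = self.lsn
        os.write(self.log_fd, (json.dumps(record) + "\n").encode())

    def update(self, txn: str, page: str, value) -> None:
        self._log({"txn": txn, "type": "UPDATE", "page": page, "after": value})
        self.pages[page] = value  # modify in memory only; no data-file I/O

    def commit(self, txn: str) -> None:
        self._log({"txn": txn, "type": "COMMIT"})
        os.fsync(self.log_fd)  # the only forced disk I/O on the commit path

wal = MiniWAL("mini.wal")
wal.update("T1", "accounts_page_42", {"id": 1, "balance": 700})
wal.commit("T1")  # durable once fsync returns
```

A background checkpoint process would later write `wal.pages` to the data files; the commit itself never waits for that random I/O.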
Write-Ahead Logging Process:

1. The transaction writes to the WAL first (forced to disk on commit).
2. Data pages are modified in memory only (the buffer pool).
3. On crash, recovery reads the WAL to learn what was committed.
4. A background checkpoint lazily writes dirty pages to the data files (random writes).
5. The WAL is replayed to redo any committed changes not yet in the data files.

The WAL receives fast sequential writes; the data files receive deferred random writes.

What the Log Contains:
Each log record contains sufficient information to either redo or undo a change:
```
-- Conceptual structure of WAL log records
-- (Actual binary format varies by database)

-- Transaction begins
LSN: 000001  TxnID: T1  Type: BEGIN
  Timestamp: 2024-01-15 10:30:00.001

-- Update operation with before/after images
LSN: 000002  TxnID: T1  Type: UPDATE
  PageID: accounts_page_42  Offset: 384  Length: 64
  BeforeImage: {id: 1, balance: 1000, ...}
  AfterImage:  {id: 1, balance: 700, ...}

-- Another update in same transaction
LSN: 000003  TxnID: T1  Type: UPDATE
  PageID: accounts_page_107  Offset: 192  Length: 64
  BeforeImage: {id: 2, balance: 500, ...}
  AfterImage:  {id: 2, balance: 800, ...}

-- Commit record - THE DURABILITY GUARANTEE
-- Once this is on disk, the transaction is committed forever
LSN: 000004  TxnID: T1  Type: COMMIT
  Timestamp: 2024-01-15 10:30:00.047
-- CRITICAL: COMMIT record must be fsync'd before acknowledging to client
```

The exact moment a transaction becomes durable is precisely defined: a transaction is committed when its COMMIT log record reaches stable storage. This is the "commit point," the point of no return.
The Commit Sequence:

1. The client issues COMMIT.
2. The database appends the COMMIT record to the WAL buffer in memory.
3. The WAL buffer is flushed and fsync()'d to stable storage.
4. Only after fsync() returns does the database acknowledge "COMMIT successful" to the client.
5. The modified data pages are written to the data files later, in the background.
fsync() is the most expensive operation in the commit path. It waits for the disk to confirm that data has physically reached persistent storage. For HDDs, this means waiting for the disk platter to rotate to the write position. Even for SSDs, the firmware must confirm the write. This is why high-transaction-rate systems invest heavily in fast storage, battery-backed write caches, and techniques like group commit.
Group Commit: Amortizing fsync() Costs
Calling fsync() for every transaction is expensive. Group commit batches multiple transactions' log flushes into a single fsync(), dramatically improving throughput:
This trades slightly higher latency for significantly higher throughput. A busy system might batch 100+ commits per fsync(), reducing per-transaction I/O overhead by 100x.
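The batching idea can be sketched as follows (a toy model: it counts flushes to show the amortization, whereas a real group commit also blocks each client until the batch containing its commit record has been fsync'd, and flushes on a timeout as well as on a full batch):

```python
import os

class GroupCommitLog:
    """Batch several transactions' commit records into one fsync."""

    def __init__(self, path: str, batch_size: int = 4):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = []     # commit records waiting for the next flush
        self.fsync_count = 0  # how many fsync() calls we actually made

    def commit(self, txn_id: str) -> None:
        self.pending.append(f"COMMIT {txn_id}\n".encode())
        # A real system would also flush after a short timeout (commit_delay);
        # this sketch flushes only when the batch fills.
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        os.write(self.fd, b"".join(self.pending))  # one write for N commits
        os.fsync(self.fd)                          # one fsync for N commits
        self.fsync_count += 1
        self.pending.clear()

log = GroupCommitLog("group.wal", batch_size=4)
for i in range(8):
    log.commit(f"T{i}")
# 8 commits, but only 2 fsync() calls
```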
```sql
-- PostgreSQL: Configure commit behavior for durability vs. performance

-- Full durability (default) - synchronous fsync on every commit
-- Safest, but lowest throughput
SET synchronous_commit = on;

-- Commit returns after WAL is applied on a synchronous standby
-- Provides durability across server failure (if the standby survives)
SET synchronous_commit = remote_apply;

-- Commit returns before fsync completes (DANGEROUS!)
-- Recent commits may be lost on crash, though the database stays consistent
-- Only use when you can tolerate data loss
SET synchronous_commit = off;

-- PostgreSQL group commit: delay to batch commits
-- Increase for throughput at the cost of latency
SET commit_delay = 10;    -- microseconds to wait before fsync
SET commit_siblings = 5;  -- minimum concurrent transactions to trigger the delay

-- MySQL InnoDB: similar concept
-- innodb_flush_log_at_trx_commit = 1 (full durability, default)
-- innodb_flush_log_at_trx_commit = 2 (flush to OS buffer each commit, fsync once/second)
-- innodb_flush_log_at_trx_commit = 0 (no flush, dangerous)
```

If the WAL contains the complete history of all changes, couldn't the log grow forever? And wouldn't recovery require replaying the entire log from the beginning of time? Checkpointing solves both problems.
What is a Checkpoint?
A checkpoint is a point-in-time marker at which:

- All data pages dirtied before the checkpoint are flushed from the buffer pool to the data files
- A checkpoint record is written to the WAL, recording the position from which recovery must begin
After a checkpoint, recovery only needs to replay the log from the checkpoint forward—not from the beginning of time.
Types of Checkpoints:

- Sharp checkpoints pause new writes while every dirty page is flushed at once. Recovery logic is simple, but the I/O spike and pause hurt latency.
- Fuzzy (spread) checkpoints flush dirty pages gradually in the background while transactions continue. Most production databases, including PostgreSQL and InnoDB, take this approach.
```sql
-- PostgreSQL checkpoint configuration
-- (These are server-level settings: use ALTER SYSTEM plus a config reload,
--  not session-level SET)

-- Maximum time between automatic checkpoints
-- Lower = faster recovery, more I/O overhead
-- Higher = longer recovery, less I/O overhead
ALTER SYSTEM SET checkpoint_timeout = '5min';

-- Maximum WAL size before forcing a checkpoint
-- Limits how much log must be replayed on recovery
ALTER SYSTEM SET max_wal_size = '1GB';

-- Spread checkpoint I/O over this fraction of checkpoint_timeout
-- 0.9 = spread writes over 90% of the checkpoint interval
-- Reduces I/O spikes
ALTER SYSTEM SET checkpoint_completion_target = 0.9;

-- Apply the changes
SELECT pg_reload_conf();

-- Force a manual checkpoint (useful before maintenance)
CHECKPOINT;

-- MySQL InnoDB equivalent concepts:
-- innodb_log_file_size - determines recovery replay time
-- innodb_max_dirty_pages_pct - triggers flushing when the buffer pool gets too dirty
-- innodb_adaptive_flushing - adjusts the flush rate based on workload
```

More frequent checkpoints mean faster recovery (less log to replay) but more I/O overhead during normal operation. Less frequent checkpoints reduce normal I/O but extend recovery time. Tune based on your recovery time objective (RTO) and I/O capacity. Most production systems aim for recovery times under 5 minutes.
When a database restarts after a crash, the recovery process ensures that all committed transactions are present in the data files and all uncommitted transactions are removed. This is the ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) recovery protocol used by most modern databases.
The Three Phases of ARIES Recovery:
| Phase | Direction | Objective | Result |
|---|---|---|---|
| Analysis | Forward from checkpoint | Identify work to do | Lists of transactions and dirty pages |
| Redo | Forward from checkpoint | Restore committed changes | Data files match committed state |
| Undo | Backward | Remove uncommitted changes | No trace of aborted transactions |
```
Recovery Example: WAL Contents at Crash
─────────────────────────────────────────────────────────────────────
LSN 1000: CHECKPOINT (all pages clean at this point)
LSN 1001: T1 BEGIN
LSN 1002: T1 UPDATE Page_A (balance: 1000 → 700)
LSN 1003: T2 BEGIN
LSN 1004: T2 UPDATE Page_B (quantity: 50 → 45)
LSN 1005: T1 COMMIT                  ← T1 is committed (durable)
LSN 1006: T2 UPDATE Page_C (status: 'pending' → 'processing')
<<< CRASH >>>                          (T2 never committed)

Analysis Phase:
- T1: Has COMMIT at LSN 1005 → will be redone if necessary
- T2: No COMMIT → will be undone (active at crash)
- Dirty pages: A, B, C potentially need work

Redo Phase (forward from checkpoint):
- LSN 1002: Check Page_A's LSN. If < 1002, redo the update
- LSN 1004: Check Page_B's LSN. If < 1004, redo the update
- LSN 1006: Check Page_C's LSN. If < 1006, redo the update
(We redo ALL operations to bring the data files current, even for T2)

Undo Phase (backward, for uncommitted transactions):
- T2 at LSN 1006: Undo Page_C (restore 'pending')
- T2 at LSN 1004: Undo Page_B (restore 50)
- Write CLR records for each undo operation

Final State:
- Page_A: balance = 700    (T1's committed change preserved)
- Page_B: quantity = 50    (T2's change undone)
- Page_C: status = 'pending' (T2's change undone)
```

We redo ALL operations (even for transactions we'll undo) because the data pages on disk might be in any state—some operations applied, some not. Redo brings everything to a consistent current state. Then undo rolls back uncommitted changes cleanly. This separation simplifies recovery logic significantly.
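The three phases can be sketched in a few lines of Python. This is a toy model of ARIES (real implementations handle CLRs, partial page writes, and checkpoint records), but it runs the crash scenario above end to end:

```python
def recover(log, pages):
    """Toy ARIES-style recovery: analysis, redo, then undo.

    `log` is a list of records like
      {"lsn": 1002, "txn": "T1", "type": "UPDATE",
       "page": "A", "before": 1000, "after": 700}
    `pages` maps page name -> {"lsn": last_applied_lsn, "value": ...}.
    """
    # Analysis: transactions with no COMMIT record are losers.
    committed = {r["txn"] for r in log if r["type"] == "COMMIT"}
    losers = {r["txn"] for r in log if r["type"] == "UPDATE"} - committed

    # Redo (forward): replay every update the page hasn't absorbed yet.
    for r in log:
        if r["type"] == "UPDATE" and pages[r["page"]]["lsn"] < r["lsn"]:
            pages[r["page"]] = {"lsn": r["lsn"], "value": r["after"]}

    # Undo (backward): roll back losers' updates, newest first.
    for r in reversed(log):
        if r["type"] == "UPDATE" and r["txn"] in losers:
            pages[r["page"]]["value"] = r["before"]  # a real system logs a CLR here
    return pages

# The crash scenario above: T1 committed at LSN 1005, T2 never committed.
log = [
    {"lsn": 1002, "txn": "T1", "type": "UPDATE", "page": "A", "before": 1000, "after": 700},
    {"lsn": 1004, "txn": "T2", "type": "UPDATE", "page": "B", "before": 50, "after": 45},
    {"lsn": 1005, "txn": "T1", "type": "COMMIT"},
    {"lsn": 1006, "txn": "T2", "type": "UPDATE", "page": "C", "before": "pending", "after": "processing"},
]
pages = {p: {"lsn": 0, "value": None} for p in "ABC"}
recover(log, pages)
# A = 700 (T1 kept), B = 50 and C = 'pending' (T2 undone)
```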
Single-node durability protects against process crashes and power failures. But what about catastrophic events—disk failure, server destruction, data center fires? Distributed durability extends protection across multiple physical locations.
Synchronous vs. Asynchronous Replication:

- Synchronous: the primary waits for the standby to acknowledge the WAL before confirming the commit. No committed data is lost if the primary fails, at the cost of added commit latency.
- Asynchronous: the primary acknowledges the commit immediately and ships WAL to the standby in the background. Latency is lower, but recently committed transactions can be lost on failover.
```sql
-- PostgreSQL: Configure synchronous replication for durability

-- On the primary: require confirmation from at least one synchronous standby
-- before reporting commit success (server-level setting, so ALTER SYSTEM)
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1, standby2)';
SELECT pg_reload_conf();

-- Commit modes:
-- 'on' (default): wait for WAL to be flushed on the standby
-- 'remote_write': wait for WAL to reach the standby's OS
-- 'remote_apply': wait for WAL to be applied on the standby (strongest)
SET synchronous_commit = remote_apply;

-- This configuration means:
-- 1. Primary writes WAL to local disk
-- 2. Primary sends WAL to the standby
-- 3. Standby writes WAL to disk and applies it
-- 4. Standby acknowledges to the primary
-- 5. Primary acknowledges the commit to the client
--
-- If the primary fails after step 5, the standby has all committed data.
-- The client receives confirmation only after the data exists in two places.
```

Synchronous replication guarantees durability (data exists on multiple nodes) but reduces availability (if the synchronous replica is unreachable, commits block). This is the CAP theorem in action: you can't have perfect durability AND perfect availability across network partitions. Design systems with clear priorities.
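The five-step sequence can be modeled in a few lines of Python. This is a sketch with hypothetical interfaces (`replicate`, `InMemoryStandby` are inventions for illustration, not any real replication API): the commit is acknowledged only after the local fsync and every synchronous standby's confirmation.

```python
import os

def replicated_commit(record: bytes, local_log, standbys) -> bool:
    """Toy synchronous-replication commit path (hypothetical interfaces).

    `local_log` is an open binary file; each standby is any object with a
    `replicate(record) -> bool` method that returns True once the record
    is durable on that node.
    """
    local_log.write(record)
    local_log.flush()
    os.fsync(local_log.fileno())                      # steps 1-2: durable locally
    acks = [s.replicate(record) for s in standbys]    # steps 3-4: ship and wait
    return all(acks)                                  # step 5: ack only if all confirmed

class InMemoryStandby:
    """Stand-in standby that just records what it received."""
    def __init__(self):
        self.log = []
    def replicate(self, record: bytes) -> bool:
        self.log.append(record)
        return True

with open("primary.wal", "ab") as f:
    ok = replicated_commit(b"COMMIT T1\n", f, [InMemoryStandby(), InMemoryStandby()])
```

A real system would ship WAL over the network and time out or block if a synchronous standby is unreachable, which is exactly the durability/availability trade-off described above.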
Even with a correct understanding of durability, several pitfalls can silently compromise data safety:

- Consumer SSDs and RAID controllers with volatile write caches that acknowledge writes before they are persistent
- Disabling fsync or synchronous_commit for performance and forgetting to re-enable it
- Filesystems or virtualization layers that ignore or reorder flush requests
- Relying on a single machine: a destroyed disk defeats single-node durability
Periodically kill database servers ungracefully (kill -9, pull power) while recent transactions are in flight, then verify on restart that all committed transactions are present. Custom scripts can automate this, and tools like diskchecker.pl can verify that the storage stack actually honors flushes. Trust but verify your durability configuration.
```bash
#!/bin/bash
# Simple durability test script

# Insert a record with a unique timestamp
TEST_ID=$(date +%s)
psql -c "INSERT INTO durability_test (id, created_at) VALUES ($TEST_ID, NOW());"
echo "Inserted test record: $TEST_ID"

# Wait for any async replication
sleep 2

# Verify the record exists
COUNT=$(psql -t -c "SELECT COUNT(*) FROM durability_test WHERE id = $TEST_ID;")
if [ "$COUNT" -eq 1 ]; then
  echo "Record confirmed in database"
else
  echo "ERROR: Record not found after insert!"
  exit 1
fi

# Now: manually kill the database server (kill -9 or power off)
# Then: restart and verify the record still exists

# After restart:
# COUNT=$(psql -t -c "SELECT COUNT(*) FROM durability_test WHERE id = $TEST_ID;")
# if [ "$COUNT" -eq 1 ]; then
#   echo "PASS: Record survived crash"
# else
#   echo "FAIL: Record lost - durability compromised!"
# fi
```

We've explored durability as the ACID property that guarantees committed data persists forever. From write-ahead logging to crash recovery to distributed replication, here are the key takeaways:

- Durability means committed transactions survive any failure that occurs after the COMMIT acknowledgment.
- Write-ahead logging makes commits durable with a single sequential, fsync()'d log write, deferring random data-file I/O.
- Checkpointing bounds how much log must be replayed, keeping recovery time manageable.
- ARIES-style recovery redoes committed work and undoes uncommitted work after a crash.
- Synchronous replication extends durability across machines and data centers, at a cost in latency and availability.
What's Next:
With all four ACID properties covered—Atomicity, Consistency, Isolation, and Durability—we now understand the complete guarantee that transactions provide. The final page of this module synthesizes these properties, exploring how they interact, when they conflict, and why ACID as a unified whole is more than the sum of its parts.
You now understand durability as the guarantee that committed data persists forever. You know how WAL, fsync, and checkpointing work together to make this possible, and how distributed replication extends protection across physical boundaries. Next, we explore how all four ACID properties work together as a unified guarantee.