Your bank confirms a $10,000 deposit. You see the confirmation screen—transaction complete, funds available. Then the power goes out. When the bank's systems come back online, the question arises: is your money still there?
This is the essence of Durability—the ACID property that makes commitments permanent. Once a database tells you 'transaction committed,' that commitment is irrevocable. The data will survive power failures, operating system crashes, hardware malfunctions, and even complete server destruction (with proper replication).
Durability is what makes databases trustworthy repositories for critical data—financial records, medical history, legal documents, anything where 'we lost it' is simply not an acceptable answer.
This page explores durability comprehensively: the formal definition and guarantees, the storage hierarchy from RAM to persistent storage, how databases achieve durability through WAL and fsync, the critical role of the commit acknowledgment, durability considerations at different levels (single machine, replicated, distributed), performance trade-offs, and common durability pitfalls in system design.
Durability guarantees that once a transaction is committed, its effects are permanent. The changes will survive any subsequent failure—power loss, software crash, hardware malfunction—until explicitly modified by another committed transaction.
The Formal Definition:
After a transaction commits successfully:

- Its changes are recorded on non-volatile storage, not merely in memory
- The changes survive power loss, software crashes, and hardware malfunctions
- They remain in effect until explicitly modified by another committed transaction
Durability draws a bright line in time: before commit, all changes are tentative and may be lost. After commit, changes are permanent and protected.
Durability is not a silver bullet. It does not protect against:

1. Disk destruction (unless replicated)
2. Data center fires/floods (unless geo-replicated)
3. Logical errors (your application writing wrong data commits that wrong data durably)
4. Malicious actors (committed deletions are durable deletions)

Durability preserves committed state; it cannot judge whether that state is correct.
Understanding durability requires understanding where data lives at different stages of processing. The storage hierarchy represents a trade-off between speed and persistence.
Volatile and Non-Volatile Storage:
| Storage Type | Latency | Survives Power Loss? | Role in Database |
|---|---|---|---|
| CPU Registers | < 1 ns | No ❌ | Active computation only |
| L1/L2/L3 Cache | 1-20 ns | No ❌ | CPU-level caching |
| RAM (DRAM) | 50-100 ns | No ❌ | Buffer pool, query processing |
| NVMe SSD | 10-100 μs | Yes ✓ | Database files, WAL |
| SATA SSD | 50-200 μs | Yes ✓ | Database files, WAL |
| Spinning HDD | 5-15 ms | Yes ✓ | Archival, large sequential writes |
| Network Storage | Variable | Yes ✓ | Distributed durability |
The Critical Observation:
Data in RAM is fast to access but will be lost on power failure. Data on disk survives power failure but is 1,000-100,000x slower to access.
Databases must navigate this trade-off: keeping frequently accessed data in RAM for performance while ensuring that committed data reaches disk for durability.
The Commit Critical Path:
When a transaction commits, the database must ensure that enough information is on non-volatile storage to reconstruct the committed state after a crash. This is the 'commit critical path'—the minimum that must hit disk before acknowledging commit.
```
TRANSACTION COMMIT: From RAM to Durable Storage
═══════════════════════════════════════════════════════════════

Application Layer
        │
        │ SQL: COMMIT
        ▼
┌────────────────────────────────────────────────────────────┐
│ DATABASE ENGINE                                            │
│                                                            │
│ 1. Transaction operations executed in memory               │
│    └── Buffer Pool (RAM): modified pages                   │
│                                                            │
│ 2. WAL records written for each operation                  │
│    └── WAL Buffer (RAM): log entries awaiting write        │
│                                                            │
│ 3. On COMMIT:                                              │
│    └── WAL records MUST reach disk before ACK              │
└────────────────────────────────────────────────────────────┘
        │
        │ fsync / fdatasync / O_DIRECT
        ▼
┌────────────────────────────────────────────────────────────┐
│ NON-VOLATILE STORAGE                                       │
│                                                            │
│ WAL Log File: 0001.log, 0002.log, ...                      │
│   └── Contains all information to recover committed txns   │
│                                                            │
│ Data Files: base/16384/2619                                │
│   └── Actual table data (may lag behind WAL)               │
└────────────────────────────────────────────────────────────┘
        │
        │ Only AFTER fsync completes
        ▼
┌────────────────────────────────────────────────────────────┐
│ COMMIT ACKNOWLEDGMENT                                      │
│ "Transaction Committed"                                    │
│ (Now the promise is made)                                  │
└────────────────────────────────────────────────────────────┘
```

We introduced WAL in the Atomicity section as a mechanism for rollback. But WAL is equally essential for Durability—it's the foundation that ensures committed transactions survive crashes.
The WAL Durability Guarantee:
Before a commit is acknowledged, the WAL record containing the commit must be:

- Written to the WAL file on disk (a sequential append)
- Flushed through the OS page cache to persistent storage via fsync (or an equivalent)
Only then can the database tell the client 'transaction committed.'
Why Write Log Before Data?
Writing the log entry is much faster than updating the actual data pages:

- A WAL write is a small sequential append to a single file; updating data pages means random I/O scattered across the database files.
- Only the change itself is logged, which is far smaller than the full pages it touches.
By requiring only the log to be durable at commit, not the actual data pages, we achieve durability with a minimal performance penalty.
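To make this concrete, here is a minimal, illustrative Python sketch of the write-ahead rule. The `MiniWAL` class and its one-record-per-line text format are invented for this example (real engines log binary redo records); the point is that the commit path fsyncs only the sequential log while the modified data pages stay in RAM:

```python
import os

class MiniWAL:
    """Toy write-ahead log: commit durability comes from the log alone."""

    def __init__(self, path):
        # O_APPEND: every record is a cheap sequential append
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def log(self, record: str):
        os.write(self.fd, (record + "\n").encode())  # still in OS cache

    def commit(self, txid: str):
        self.log(f"{txid}: COMMIT")
        os.fsync(self.fd)       # the ONLY flush needed before acknowledging
        return "Commit OK"      # safe: the log alone can rebuild the pages

wal = MiniWAL("wal.log")
dirty_pages = {"accounts/A": {"balance": 1000}}  # modified only in RAM
wal.log("T1: UPDATE accounts SET balance=1000 WHERE id='A'")
print(wal.commit("T1"))  # durable even though dirty_pages never hit disk
```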
```
SCENARIO: System crash after commit acknowledged

Timeline:
1. Transaction T1 modifies table 'accounts'
2. WAL record written: "T1: UPDATE accounts SET balance=1000 WHERE id='A'"
3. WAL record written: "T1: COMMIT"
4. fsync() of WAL file completes  ← DATA IS NOW DURABLE
5. Client receives "Commit OK"
6. Background: Data page with accounts table still in memory
7. ** CRASH ** (before data page written to disk)
8. System restarts

Recovery:
1. Database reads WAL from last checkpoint
2. Finds T1 COMMIT record → T1 is a "winner"
3. Replays T1's changes using WAL records
4. Data page reconstructed with balance=1000

Result: Despite data pages never reaching disk directly, the
committed transaction is fully recovered. Durability PRESERVED! ✓
```

A properly maintained WAL log contains complete information to rebuild the entire database state. In PostgreSQL, you can ship WAL files to another server and replay them to create an exact replica. This is the foundation of Point-In-Time Recovery (PITR) and streaming replication.
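The recovery pass in this scenario is equally sketchable. This toy `recover` function reuses the invented one-record-per-line format from the sketch above; a real engine would reapply binary page-level redo records rather than printing them:

```python
def recover(wal_path: str):
    """Replay committed ('winner') transactions from a toy WAL file."""
    with open(wal_path) as f:
        records = [line.rstrip("\n") for line in f]

    # Pass 1: winners are transactions whose COMMIT record reached the log
    winners = {r.split(":")[0] for r in records if r.endswith(": COMMIT")}

    # Pass 2: redo winners' changes; uncommitted ('loser') records are skipped
    for r in records:
        txid, _, action = r.partition(": ")
        if txid in winners and action != "COMMIT":
            print(f"REDO {txid}: {action}")  # a real engine reapplies the change

recover("wal.log")
```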
When an application writes data, the data doesn't immediately reach the physical disk platters or flash cells. It passes through multiple layers of caching, each adding a gap between 'written' and 'durable.'
The Write Path Through Caches:
```
write() system call
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│ USER SPACE BUFFER (Application)                             │
│   └── Your application's memory                             │
└─────────────────────────────────────────────────────────────┘
        │
        │ write() copies data to kernel
        ▼
┌─────────────────────────────────────────────────────────────┐
│ KERNEL PAGE CACHE (OS Buffer Cache)                         │
│   └── OS-managed memory cache of file system data           │
│   └── write() returns SUCCESS here ← NOT DURABLE!           │
└─────────────────────────────────────────────────────────────┘
        │
        │ Background pdflush / writeback (or fsync)
        ▼
┌─────────────────────────────────────────────────────────────┐
│ DISK CONTROLLER CACHE (Hardware Write Buffer)               │
│   └── RAM on the disk controller / drive                    │
│   └── 8MB - 256MB typically                                 │
│   └── STILL NOT DURABLE on power failure!                   │
└─────────────────────────────────────────────────────────────┘
        │
        │ Disk firmware commits to media
        ▼
┌─────────────────────────────────────────────────────────────┐
│ PERSISTENT MEDIA (Platters / Flash Cells)                   │
│   └── Finally durable!                                      │
│   └── Survives power failure                                │
└─────────────────────────────────────────────────────────────┘
```

The fsync() System Call:
fsync(fd) is the critical system call that forces all buffered data for a file descriptor to be written through to persistent storage.
Without fsync, data exists only in volatile caches—a power failure loses it.
| Call | Flushes | Use Case |
|---|---|---|
| fsync(fd) | File data + metadata | Full durability for specific file |
| fdatasync(fd) | File data only (not metadata) | Slightly faster; sufficient if metadata unchanged |
| sync() | All files, system-wide | Rarely used by databases; too broad |
| O_DIRECT | Bypasses page cache entirely | Reduces double-buffering; still needs fsync |
| O_SYNC | Each write() syncs immediately | Very slow; effectively fsync on every write |
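A quick way to feel the difference in practice is to time the calls yourself. The sketch below (file name arbitrary) measures the gap between write() returning and fsync() confirming, on whatever disk backs the current directory; absolute numbers vary enormously by hardware:

```python
import os
import time

fd = os.open("durability_probe.dat", os.O_WRONLY | os.O_CREAT, 0o600)
payload = b"x" * 4096

t0 = time.perf_counter()
os.write(fd, payload)    # lands in the kernel page cache only
t1 = time.perf_counter()
os.fsync(fd)             # forces the data through to persistent storage
t2 = time.perf_counter()
os.close(fd)

print(f"write() alone : {(t1 - t0) * 1e6:9.1f} us  (NOT durable yet)")
print(f"write + fsync : {(t2 - t0) * 1e6:9.1f} us  (durable, if the disk is honest)")
```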
Many disks have write caches (volatile RAM) that lie to the operating system. fsync returns 'success' while data still sits in volatile disk cache. On power failure, this data is lost. Enterprise drives have 'power loss protection' (capacitors to flush cache on power loss). Consumer drives often don't. Check your hardware and configure 'Force Unit Access' (FUA) if needed.
The moment a database acknowledges a commit is the moment of truth for durability. This acknowledgment is a promise: no matter what happens next, this transaction's effects will survive.
What Must Happen Before Acknowledgment:

1. All of the transaction's WAL records, including the final COMMIT record, are written to the WAL file.
2. fsync (or an equivalent) is issued on the WAL file.
3. The storage layer confirms that the flush completed.
Only after these steps complete successfully does the database return success to the client.
```
CLIENT                               DATABASE
  │                                     │
  │  INSERT INTO accounts ...           │
  │ ───────────────────────────────────►│
  │                                     │  Execute in buffer pool
  │                                     │  Write WAL: INSERT record
  │                                     │
  │  COMMIT                             │
  │ ───────────────────────────────────►│
  │                                     │  Write WAL: COMMIT record
  │                                     │
  │  ... waiting ...                    │  fsync(wal_file)
  │                                     │  ← Waiting for disk confirmation
  │                                     │
  │                                     │  fsync complete!
  │                                     │
  │  COMMIT                             │
  │◄─────────────────────────────────── │
  │                                     │
  │  THE PROMISE IS NOW MADE            │
  │  Data survives any crash            │
  │  from this point forward            │
  ▼                                     ▼

┌──────────────────────────────────────────────────────────────────┐
│ BEFORE "COMMIT" response:                                        │
│ • Changes may be lost on crash                                   │
│ • Transaction can be rolled back                                 │
│ • No durability guarantee                                        │
├──────────────────────────────────────────────────────────────────┤
│ AFTER "COMMIT" response:                                         │
│ • Changes WILL survive any crash                                 │
│ • Transaction CANNOT be undone (only by new compensating tx)     │
│ • Full durability guarantee                                      │
└──────────────────────────────────────────────────────────────────┘
```

By default, PostgreSQL and most databases wait for WAL fsync before acknowledging commits. This adds latency (typically 1-10ms) but ensures durability. Some databases offer 'asynchronous commit' options that acknowledge before fsync—faster, but data loss is possible on crash. Use with extreme caution.
fsync is expensive: a typical disk takes 1-10 ms to confirm a flush. If every transaction requires its own fsync, throughput is capped at roughly 100-1,000 transactions per second, no matter how fast your CPU is or how much RAM you have.
Group Commit solves this by batching:
Instead of one fsync per transaction, the database accumulates WAL writes from multiple concurrent transactions and fsyncs them together. One fsync confirms durability for potentially hundreds of transactions.
```
WITHOUT GROUP COMMIT (Serial fsync):
────────────────────────────────────────────────────────
Time  │ Action
──────┼─────────────────────────────────────────────────
 0ms  │ T1: COMMIT → WAL write → fsync → ack
 5ms  │ T2: COMMIT → WAL write → fsync → ack
10ms  │ T3: COMMIT → WAL write → fsync → ack
15ms  │ T4: COMMIT → WAL write → fsync → ack
20ms  │ 4 transactions complete

Result: 4 transactions in 20ms = 200 TPS max

WITH GROUP COMMIT (Batched fsync):
────────────────────────────────────────────────────────
Time  │ Action
──────┼─────────────────────────────────────────────────
 0ms  │ T1: COMMIT → WAL write (no fsync yet)
 1ms  │ T2: COMMIT → WAL write (no fsync yet)
 2ms  │ T3: COMMIT → WAL write (no fsync yet)
 3ms  │ T4: COMMIT → WAL write (no fsync yet)
 3ms  │ DB: commit_delay reached / buffer full
 3ms  │ DB: Single fsync for T1,T2,T3,T4 together
 8ms  │ fsync complete → ack all 4 transactions

Result: 4 transactions in 8ms = 500 TPS
(and the ratio improves with more concurrency)
```

Group commit introduces a small latency increase (the time spent waiting for the batch) in exchange for dramatically higher throughput. Under high concurrency, this trade-off is almost always worthwhile. Individual transactions wait slightly longer, but total system capacity multiplies.
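The batching logic itself is small enough to sketch. In this toy Python version (the `GroupCommitLog` class and its 3 ms window are invented for illustration; real engines do this inside the WAL writer), each committer appends its record and then blocks on a shared event that the background flusher sets once per fsync:

```python
import os
import threading
import time

class GroupCommitLog:
    """Toy WAL where one fsync acknowledges a whole batch of commits."""

    def __init__(self, path, window=0.003):          # ~3 ms batching window
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.lock = threading.Lock()
        self.flushed = threading.Event()             # set when current batch is durable
        self.window = window
        threading.Thread(target=self._flusher, daemon=True).start()

    def _flusher(self):
        while True:
            time.sleep(self.window)                  # let a batch accumulate
            with self.lock:
                os.fsync(self.fd)                    # ONE fsync for the whole batch
                done, self.flushed = self.flushed, threading.Event()
            done.set()                               # wake every waiting committer

    def commit(self, record: bytes):
        with self.lock:
            os.write(self.fd, record + b"\n")        # sequential append, no fsync
            waiter = self.flushed                    # the batch we belong to
        waiter.wait()                                # acknowledge only once durable

log = GroupCommitLog("group_wal.log")
workers = [threading.Thread(target=log.commit, args=(f"T{i}: COMMIT".encode(),))
           for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()   # all four transactions acknowledged by a single fsync
```

Note the trade this encodes: no committer returns faster than the batching window, but one fsync now amortizes across every transaction that arrived during it.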
Single-machine durability protects against software crashes and power failures, but what about disk failures? Fire in the data center? Earthquake?
The Replication Durability Spectrum:
True durability requires data to exist on multiple independent failure domains. Replication is the mechanism, but not all replication is equal for durability.
```
SYNCHRONOUS REPLICATION:
─────────────────────────────────────────────────────────────────
Client        Primary                              Replica
  │              │                                    │
  │  COMMIT      │                                    │
  │─────────────►│                                    │
  │              │──── Write WAL ────────────────────►│
  │              │                                    │
  │              │◄───────────────── ACK ─────────────│
  │              │                                    │
  │◄─────────────│  COMMIT OK (durable on both!)      │
  │              │                                    │

If primary dies NOW: No data loss—replica has everything

ASYNCHRONOUS REPLICATION:
─────────────────────────────────────────────────────────────────
Client        Primary                              Replica
  │              │                                    │
  │  COMMIT      │                                    │
  │─────────────►│                                    │
  │              │  Write local WAL                   │
  │◄─────────────│  COMMIT OK                         │
  │              │                                    │
  │              │──── Write WAL (background) ───────►│
  │              │                                    │

If primary dies NOW: Transactions in "replication lag" are LOST
```

With asynchronous replication, the gap between primary and replica represents potentially lost transactions. A typical lag of 100ms means up to 100ms of transactions could be lost if the primary fails. For financial systems, this is often unacceptable. For social media 'likes,' it might be fine.
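Stripped to its essence, the difference between the two modes is only where the client acknowledgment sits relative to the replica's ACK. A stand-in sketch (the `local_fsync` and `replicate` callables are placeholders for real disk and network operations):

```python
def commit(txn, mode, local_fsync, replicate):
    """Acknowledge a commit under either replication mode (illustrative)."""
    local_fsync(txn)           # WAL durable on the primary's own disk
    if mode == "synchronous":
        replicate(txn)         # block until the replica confirms the WAL
        return "COMMIT OK"     # durable on TWO independent failure domains
    # Asynchronous: the replica catches up in the background, so this
    # transaction sits in the replication-lag window until it arrives there.
    return "COMMIT OK"         # durable on ONE failure domain only
```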
Production databases require careful configuration to balance durability guarantees against performance. Here are key settings for major databases:
PostgreSQL:
```
# postgresql.conf - Durability settings

# fsync: Actually write to disk (NEVER turn off in production!)
fsync = on                              # Default: on

# synchronous_commit: When to acknowledge commit
synchronous_commit = on                 # Default: on
# Options:
#   on           - Wait for WAL write to local disk
#   remote_write - Wait for WAL to reach standby OS
#   remote_apply - Wait for WAL to be applied on standby
#   off          - Return immediately (DANGEROUS: data loss possible)

# full_page_writes: Protect against partial page writes
full_page_writes = on                   # Default: on

# wal_sync_method: How to sync WAL to disk
wal_sync_method = fdatasync             # Default: platform-dependent
# Options: fsync, fdatasync, open_sync, open_datasync

# Synchronous replication
synchronous_standby_names = 'standby1'  # Require sync from standby1
```
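A useful consequence of this design: synchronous_commit can be relaxed per transaction instead of globally, so a single low-value write can skip the fsync wait while everything else stays fully durable. A sketch using psycopg2 (the connection string and page_views table are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=testdb")  # placeholder connection string
with conn:                                # commits on clean exit
    with conn.cursor() as cur:
        # SET LOCAL scopes the setting to this transaction only
        cur.execute("SET LOCAL synchronous_commit = off")
        cur.execute("INSERT INTO page_views (url) VALUES (%s)", ("/home",))
# The commit returns without waiting for the WAL fsync: a crash may lose
# this row, but it cannot corrupt the database or weaken the durability
# of any other transaction.
```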
MySQL:

```
# my.cnf - Durability settings

# innodb_flush_log_at_trx_commit: Controls InnoDB durability
innodb_flush_log_at_trx_commit = 1
# Options:
#   1 - Full durability: flush log to disk on every commit (DEFAULT, SAFE)
#   2 - Flush to OS, not to disk (data loss on OS crash)
#   0 - No flush (data loss on any crash)

# sync_binlog: Binary log durability
sync_binlog = 1   # 1 = sync binlog on every commit (SAFE for replication)
                  # 0 = let OS decide when to sync (faster, less safe)

# innodb_doublewrite: Protect against partial page writes
innodb_doublewrite = ON   # Default: ON

# For semi-synchronous replication
rpl_semi_sync_master_enabled = 1
rpl_semi_sync_master_wait_for_slave_count = 1
rpl_semi_sync_master_timeout = 1000   # ms to wait for slave ACK
```

Never disable fsync or set synchronous_commit=off in production for data you care about. The performance gain is not worth the data loss risk. If you need more performance, add replicas, upgrade hardware, or redesign your access patterns. Sacrificing durability is almost never the right answer.
Even with proper database configuration, durability can be compromised at multiple points in the stack. Common pitfalls include:

- Disk write caches that report success while data still sits in volatile cache (no power-loss protection)
- 'Asynchronous commit' settings that acknowledge transactions before the WAL is flushed
- Asynchronous replication, where failover silently discards everything in the replication lag
- fsync disabled or relaxed (synchronous_commit=off, innodb_flush_log_at_trx_commit=0) in pursuit of benchmark numbers

One way to find a weak link is to test the chain directly, as in the script below:
```
# A simple (but scary) durability test
# WARNING: Do this on a TEST system only!

# 1. Initialize, then start a write workload
#    (pgbench's -i initialization mode cannot be combined with run options)
pgbench -i -s 10 testdb
pgbench -c 10 -T 60 testdb &

# 2. While the workload is running, simulate a crash
# Option A: Kill the database processes
kill -9 $(pgrep postgres)

# Option B: Actually cut power (on test hardware)
# echo 1 > /proc/sys/kernel/sysrq && echo o > /proc/sysrq-trigger

# 3. Restart the database and verify
pg_ctl -D "$PGDATA" start
psql -c "SELECT COUNT(*) FROM pgbench_accounts;" testdb

# 4. Check for data corruption (pg_checksums requires the server to be
#    stopped and checksums to be enabled in the cluster)
pg_checksums --check -D "$PGDATA"

# If data is different than expected or corrupted,
# your durability chain has a weak link!
```

Durability is the database's most fundamental promise: what is committed stays committed. Without durability, everything else is meaningless—atomicity, consistency, and isolation mean nothing if committed data can vanish.
You now understand Durability comprehensively—from the storage hierarchy and WAL foundations, through fsync mechanics and commit acknowledgment, to replication strategies and common pitfalls. Next, we'll bring all four ACID properties together to see how they work in practice.