Your bank confirms a $10,000 deposit. You see the confirmation screen—transaction complete, funds available. Then the power goes out. When the bank's systems come back online, the question arises: is your money still there?
This is the essence of Durability—the ACID property that makes commitments permanent. Once a database tells you 'transaction committed,' that commitment is irrevocable. The data will survive power failures, operating system crashes, hardware malfunctions, and even complete server destruction (with proper replication).
Durability is what makes databases trustworthy repositories for critical data—financial records, medical history, legal documents, anything where 'we lost it' is simply not an acceptable answer.
This page explores durability comprehensively: the formal definition and guarantees, the storage hierarchy from RAM to persistent storage, how databases achieve durability through WAL and fsync, the critical role of the commit acknowledgment, durability considerations at different levels (single machine, replicated, distributed), performance trade-offs, and common durability pitfalls in system design.
Durability guarantees that once a transaction is committed, its effects are permanent. The changes will survive any subsequent failure—power loss, software crash, hardware malfunction—until explicitly modified by another committed transaction.
The Formal Definition:
After a transaction commits successfully:

- Its changes are recorded on non-volatile storage, not merely in memory
- The changes survive power loss, software crashes, and hardware malfunctions
- They remain in effect until explicitly modified by another committed transaction
Durability draws a bright line in time: before commit, all changes are tentative and may be lost. After commit, changes are permanent and protected.
Durability is not a silver bullet. It does not protect against:

1. Disk destruction (unless replicated)
2. Data center fires/floods (unless geo-replicated)
3. Logical errors (your application writing wrong data commits that wrong data durably)
4. Malicious actors (committed deletions are durable deletions)

Durability preserves committed state; it cannot judge whether that state is correct.
Understanding durability requires understanding where data lives at different stages of processing. The storage hierarchy represents a trade-off between speed and persistence.
Volatile and Non-Volatile Storage:
| Storage Type | Latency | Survives Power Loss? | Role in Database |
|---|---|---|---|
| CPU Registers | < 1 ns | No ❌ | Active computation only |
| L1/L2/L3 Cache | 1-20 ns | No ❌ | CPU-level caching |
| RAM (DRAM) | 50-100 ns | No ❌ | Buffer pool, query processing |
| NVMe SSD | 10-100 μs | Yes ✓ | Database files, WAL |
| SATA SSD | 50-200 μs | Yes ✓ | Database files, WAL |
| Spinning HDD | 5-15 ms | Yes ✓ | Archival, large sequential writes |
| Network Storage | Variable | Yes ✓ | Distributed durability |
The Critical Observation:
Data in RAM is fast to access but will be lost on power failure. Data on disk survives power failure but is 1,000-100,000x slower to access.
Databases must navigate this trade-off: keeping frequently accessed data in RAM for performance while ensuring that committed data reaches disk for durability.
The Commit Critical Path:
When a transaction commits, the database must ensure that enough information is on non-volatile storage to reconstruct the committed state after a crash. This is the 'commit critical path'—the minimum that must hit disk before acknowledging commit.
```
TRANSACTION COMMIT: From RAM to Durable Storage
═══════════════════════════════════════════════════════════════

Application Layer
        │
        │ SQL: COMMIT
        ▼
┌────────────────────────────────────────────────────────────┐
│ DATABASE ENGINE                                            │
│                                                            │
│ 1. Transaction operations executed in memory               │
│    └── Buffer Pool (RAM): modified pages                   │
│                                                            │
│ 2. WAL records written for each operation                  │
│    └── WAL Buffer (RAM): log entries awaiting write        │
│                                                            │
│ 3. On COMMIT:                                              │
│    └── WAL records MUST reach disk before ACK              │
└────────────────────────────────────────────────────────────┘
        │
        │ fsync / fdatasync / O_DIRECT
        ▼
┌────────────────────────────────────────────────────────────┐
│ NON-VOLATILE STORAGE                                       │
│                                                            │
│ WAL Log File: 0001.log, 0002.log, ...                      │
│   └── Contains all information to recover committed txns   │
│                                                            │
│ Data Files: base/16384/2619                                │
│   └── Actual table data (may lag behind WAL)               │
└────────────────────────────────────────────────────────────┘
        │
        │ Only AFTER fsync completes
        ▼
┌────────────────────────────────────────────────────────────┐
│ COMMIT ACKNOWLEDGMENT                                      │
│ "Transaction Committed"                                    │
│ (Now the promise is made)                                  │
└────────────────────────────────────────────────────────────┘
```

We introduced WAL in the Atomicity section as a mechanism for rollback. But WAL is equally essential for Durability—it's the foundation that ensures committed transactions survive crashes.
The WAL Durability Guarantee:
Before a commit is acknowledged, the WAL record containing the commit must be:

- Written to the WAL file on disk (a sequential append)
- Flushed through the OS page cache to persistent storage via fsync (or an equivalent)
Only then can the database tell the client 'transaction committed.'
Why Write Log Before Data?
Writing the log entry is much faster than updating the actual data pages:

- A WAL write is a small sequential append to a single file; updating data pages means random I/O scattered across the database files.
- Only the change itself is logged, which is far smaller than the full pages it touches.
By requiring only the log to be durable at commit, not the actual data pages, we achieve durability with a minimal performance penalty.
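To make this concrete, here is a minimal, illustrative Python sketch of the write-ahead rule. The `MiniWAL` class and its one-record-per-line text format are invented for this example (real engines log binary redo records); the point is that the commit path fsyncs only the sequential log while the modified data pages stay in RAM:

```python
import os

class MiniWAL:
    """Toy write-ahead log: commit durability comes from the log alone."""

    def __init__(self, path):
        # O_APPEND: every record is a cheap sequential append
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def log(self, record: str):
        os.write(self.fd, (record + "\n").encode())  # still in OS cache

    def commit(self, txid: str):
        self.log(f"{txid}: COMMIT")
        os.fsync(self.fd)       # the ONLY flush needed before acknowledging
        return "Commit OK"      # safe: the log alone can rebuild the pages

wal = MiniWAL("wal.log")
dirty_pages = {"accounts/A": {"balance": 1000}}  # modified only in RAM
wal.log("T1: UPDATE accounts SET balance=1000 WHERE id='A'")
print(wal.commit("T1"))  # durable even though dirty_pages never hit disk
```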
```
SCENARIO: System crash after commit acknowledged

Timeline:
1. Transaction T1 modifies table 'accounts'
2. WAL record written: "T1: UPDATE accounts SET balance=1000 WHERE id='A'"
3. WAL record written: "T1: COMMIT"
4. fsync() of WAL file completes  ← DATA IS NOW DURABLE
5. Client receives "Commit OK"
6. Background: Data page with accounts table still in memory
7. ** CRASH ** (before data page written to disk)
8. System restarts

Recovery:
1. Database reads WAL from last checkpoint
2. Finds T1 COMMIT record → T1 is a "winner"
3. Replays T1's changes using WAL records
4. Data page reconstructed with balance=1000

Result: Despite data pages never reaching disk directly, the
committed transaction is fully recovered. Durability PRESERVED! ✓
```

A properly maintained WAL log contains complete information to rebuild the entire database state. In PostgreSQL, you can ship WAL files to another server and replay them to create an exact replica. This is the foundation of Point-In-Time Recovery (PITR) and streaming replication.
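The recovery pass in this scenario is equally sketchable. This toy `recover` function reuses the invented one-record-per-line format from the sketch above; a real engine would reapply binary page-level redo records rather than printing them:

```python
def recover(wal_path: str):
    """Replay committed ('winner') transactions from a toy WAL file."""
    with open(wal_path) as f:
        records = [line.rstrip("\n") for line in f]

    # Pass 1: winners are transactions whose COMMIT record reached the log
    winners = {r.split(":")[0] for r in records if r.endswith(": COMMIT")}

    # Pass 2: redo winners' changes; uncommitted ('loser') records are skipped
    for r in records:
        txid, _, action = r.partition(": ")
        if txid in winners and action != "COMMIT":
            print(f"REDO {txid}: {action}")  # a real engine reapplies the change

recover("wal.log")
```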
When an application writes data, the data doesn't immediately reach the physical disk platters or flash cells. It passes through multiple layers of caching, each adding a gap between 'written' and 'durable.'
The Write Path Through Caches:
```
write() system call
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│ USER SPACE BUFFER (Application)                             │
│   └── Your application's memory                             │
└─────────────────────────────────────────────────────────────┘
        │
        │ write() copies data to kernel
        ▼
┌─────────────────────────────────────────────────────────────┐
│ KERNEL PAGE CACHE (OS Buffer Cache)                         │
│   └── OS-managed memory cache of file system data           │
│   └── write() returns SUCCESS here ← NOT DURABLE!           │
└─────────────────────────────────────────────────────────────┘
        │
        │ Background pdflush / writeback (or fsync)
        ▼
┌─────────────────────────────────────────────────────────────┐
│ DISK CONTROLLER CACHE (Hardware Write Buffer)               │
│   └── RAM on the disk controller / drive                    │
│   └── 8MB - 256MB typically                                 │
│   └── STILL NOT DURABLE on power failure!                   │
└─────────────────────────────────────────────────────────────┘
        │
        │ Disk firmware commits to media
        ▼
┌─────────────────────────────────────────────────────────────┐
│ PERSISTENT MEDIA (Platters / Flash Cells)                   │
│   └── Finally durable!                                      │
│   └── Survives power failure                                │
└─────────────────────────────────────────────────────────────┘
```

The fsync() System Call:
fsync(fd) is the critical system call that forces all buffered data for a file descriptor to be written through to persistent storage.
Without fsync, data exists only in volatile caches—a power failure loses it.
| Call | Flushes | Use Case |
|---|---|---|
| fsync(fd) | File data + metadata | Full durability for specific file |
| fdatasync(fd) | File data only (not metadata) | Slightly faster; sufficient if metadata unchanged |
| sync() | All files, system-wide | Rarely used by databases; too broad |
| O_DIRECT | Bypasses page cache entirely | Reduces double-buffering; still needs fsync |
| O_SYNC | Each write() syncs immediately | Very slow; effectively fsync on every write |
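A quick way to feel the difference in practice is to time the calls yourself. The sketch below (file name arbitrary) measures the gap between write() returning and fsync() confirming, on whatever disk backs the current directory; absolute numbers vary enormously by hardware:

```python
import os
import time

fd = os.open("durability_probe.dat", os.O_WRONLY | os.O_CREAT, 0o600)
payload = b"x" * 4096

t0 = time.perf_counter()
os.write(fd, payload)    # lands in the kernel page cache only
t1 = time.perf_counter()
os.fsync(fd)             # forces the data through to persistent storage
t2 = time.perf_counter()
os.close(fd)

print(f"write() alone : {(t1 - t0) * 1e6:9.1f} us  (NOT durable yet)")
print(f"write + fsync : {(t2 - t0) * 1e6:9.1f} us  (durable, if the disk is honest)")
```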
Many disks have write caches (volatile RAM) that lie to the operating system. fsync returns 'success' while data still sits in volatile disk cache. On power failure, this data is lost. Enterprise drives have 'power loss protection' (capacitors to flush cache on power loss). Consumer drives often don't. Check your hardware and configure 'Force Unit Access' (FUA) if needed.
The moment a database acknowledges a commit is the moment of truth for durability. This acknowledgment is a promise: no matter what happens next, this transaction's effects will survive.
What Must Happen Before Acknowledgment:

1. All of the transaction's WAL records, including the final COMMIT record, are written to the WAL file.
2. fsync (or an equivalent) is issued on the WAL file.
3. The storage layer confirms that the flush completed.
Only after these steps complete successfully does the database return success to the client.
```
CLIENT                               DATABASE
  │                                     │
  │  INSERT INTO accounts ...           │
  │ ───────────────────────────────────►│
  │                                     │  Execute in buffer pool
  │                                     │  Write WAL: INSERT record
  │                                     │
  │  COMMIT                             │
  │ ───────────────────────────────────►│
  │                                     │  Write WAL: COMMIT record
  │                                     │
  │  ... waiting ...                    │  fsync(wal_file)
  │                                     │  ← Waiting for disk confirmation
  │                                     │
  │                                     │  fsync complete!
  │                                     │
  │  COMMIT                             │
  │◄─────────────────────────────────── │
  │                                     │
  │  THE PROMISE IS NOW MADE            │
  │  Data survives any crash            │
  │  from this point forward            │
  ▼                                     ▼

┌──────────────────────────────────────────────────────────────────┐
│ BEFORE "COMMIT" response:                                        │
│ • Changes may be lost on crash                                   │
│ • Transaction can be rolled back                                 │
│ • No durability guarantee                                        │
├──────────────────────────────────────────────────────────────────┤
│ AFTER "COMMIT" response:                                         │
│ • Changes WILL survive any crash                                 │
│ • Transaction CANNOT be undone (only by new compensating tx)     │
│ • Full durability guarantee                                      │
└──────────────────────────────────────────────────────────────────┘
```

By default, PostgreSQL and most databases wait for WAL fsync before acknowledging commits. This adds latency (typically 1-10ms) but ensures durability. Some databases offer 'asynchronous commit' options that acknowledge before fsync—faster, but data loss is possible on crash. Use with extreme caution.
fsync is expensive: a typical disk takes 1-10 ms to confirm a flush. If every transaction requires its own fsync, throughput is capped at roughly 100-1,000 transactions per second, no matter how fast your CPU is or how much RAM you have.
Group Commit solves this by batching:
Instead of one fsync per transaction, the database accumulates WAL writes from multiple concurrent transactions and fsyncs them together. One fsync confirms durability for potentially hundreds of transactions.
```
WITHOUT GROUP COMMIT (Serial fsync):
────────────────────────────────────────────────────────
Time  │ Action
──────┼─────────────────────────────────────────────────
 0ms  │ T1: COMMIT → WAL write → fsync → ack
 5ms  │ T2: COMMIT → WAL write → fsync → ack
10ms  │ T3: COMMIT → WAL write → fsync → ack
15ms  │ T4: COMMIT → WAL write → fsync → ack
20ms  │ 4 transactions complete

Result: 4 transactions in 20ms = 200 TPS max

WITH GROUP COMMIT (Batched fsync):
────────────────────────────────────────────────────────
Time  │ Action
──────┼─────────────────────────────────────────────────
 0ms  │ T1: COMMIT → WAL write (no fsync yet)
 1ms  │ T2: COMMIT → WAL write (no fsync yet)
 2ms  │ T3: COMMIT → WAL write (no fsync yet)
 3ms  │ T4: COMMIT → WAL write (no fsync yet)
 3ms  │ DB: commit_delay reached / buffer full
 3ms  │ DB: Single fsync for T1,T2,T3,T4 together
 8ms  │ fsync complete → ack all 4 transactions

Result: 4 transactions in 8ms = 500 TPS
(and the ratio improves with more concurrency)
```

Group commit introduces a small latency increase (the time spent waiting for the batch) in exchange for dramatically higher throughput. Under high concurrency, this trade-off is almost always worthwhile. Individual transactions wait slightly longer, but total system capacity multiplies.
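The batching logic itself is small enough to sketch. In this toy Python version (the `GroupCommitLog` class and its 3 ms window are invented for illustration; real engines do this inside the WAL writer), each committer appends its record and then blocks on a shared event that the background flusher sets once per fsync:

```python
import os
import threading
import time

class GroupCommitLog:
    """Toy WAL where one fsync acknowledges a whole batch of commits."""

    def __init__(self, path, window=0.003):          # ~3 ms batching window
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.lock = threading.Lock()
        self.flushed = threading.Event()             # set when current batch is durable
        self.window = window
        threading.Thread(target=self._flusher, daemon=True).start()

    def _flusher(self):
        while True:
            time.sleep(self.window)                  # let a batch accumulate
            with self.lock:
                os.fsync(self.fd)                    # ONE fsync for the whole batch
                done, self.flushed = self.flushed, threading.Event()
            done.set()                               # wake every waiting committer

    def commit(self, record: bytes):
        with self.lock:
            os.write(self.fd, record + b"\n")        # sequential append, no fsync
            waiter = self.flushed                    # the batch we belong to
        waiter.wait()                                # acknowledge only once durable

log = GroupCommitLog("group_wal.log")
workers = [threading.Thread(target=log.commit, args=(f"T{i}: COMMIT".encode(),))
           for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()   # all four transactions acknowledged by a single fsync
```

Note the trade this encodes: no committer returns faster than the batching window, but one fsync now amortizes across every transaction that arrived during it.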
Single-machine durability protects against software crashes and power failures, but what about disk failures? Fire in the data center? Earthquake?
The Replication Durability Spectrum:
True durability requires data to exist on multiple independent failure domains. Replication is the mechanism, but not all replication is equal for durability.
```
SYNCHRONOUS REPLICATION:
─────────────────────────────────────────────────────────────────
Client        Primary                              Replica
  │              │                                    │
  │  COMMIT      │                                    │
  │─────────────►│                                    │
  │              │──── Write WAL ────────────────────►│
  │              │                                    │
  │              │◄───────────────── ACK ─────────────│
  │              │                                    │
  │◄─────────────│  COMMIT OK (durable on both!)      │
  │              │                                    │

If primary dies NOW: No data loss—replica has everything

ASYNCHRONOUS REPLICATION:
─────────────────────────────────────────────────────────────────
Client        Primary                              Replica
  │              │                                    │
  │  COMMIT      │                                    │
  │─────────────►│                                    │
  │              │  Write local WAL                   │
  │◄─────────────│  COMMIT OK                         │
  │              │                                    │
  │              │──── Write WAL (background) ───────►│
  │              │                                    │

If primary dies NOW: Transactions in "replication lag" are LOST
```

With asynchronous replication, the gap between primary and replica represents potentially lost transactions. A typical lag of 100ms means up to 100ms of transactions could be lost if the primary fails. For financial systems, this is often unacceptable. For social media 'likes,' it might be fine.
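Stripped to its essence, the difference between the two modes is only where the client acknowledgment sits relative to the replica's ACK. A stand-in sketch (the `local_fsync` and `replicate` callables are placeholders for real disk and network operations):

```python
def commit(txn, mode, local_fsync, replicate):
    """Acknowledge a commit under either replication mode (illustrative)."""
    local_fsync(txn)           # WAL durable on the primary's own disk
    if mode == "synchronous":
        replicate(txn)         # block until the replica confirms the WAL
        return "COMMIT OK"     # durable on TWO independent failure domains
    # Asynchronous: the replica catches up in the background, so this
    # transaction sits in the replication-lag window until it arrives there.
    return "COMMIT OK"         # durable on ONE failure domain only
```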
Production databases require careful configuration to balance durability guarantees against performance. Here are key settings for major databases:
PostgreSQL:
```
# postgresql.conf - Durability settings

# fsync: Actually write to disk (NEVER turn off in production!)
fsync = on                              # Default: on

# synchronous_commit: When to acknowledge commit
synchronous_commit = on                 # Default: on
# Options:
#   on           - Wait for WAL write to local disk
#   remote_write - Wait for WAL to reach standby OS
#   remote_apply - Wait for WAL to be applied on standby
#   off          - Return immediately (DANGEROUS: data loss possible)

# full_page_writes: Protect against partial page writes
full_page_writes = on                   # Default: on

# wal_sync_method: How to sync WAL to disk
wal_sync_method = fdatasync             # Default: platform-dependent
# Options: fsync, fdatasync, open_sync, open_datasync

# Synchronous replication
synchronous_standby_names = 'standby1'  # Require sync from standby1
```
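A useful consequence of this design: synchronous_commit can be relaxed per transaction instead of globally, so a single low-value write can skip the fsync wait while everything else stays fully durable. A sketch using psycopg2 (the connection string and page_views table are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=testdb")  # placeholder connection string
with conn:                                # commits on clean exit
    with conn.cursor() as cur:
        # SET LOCAL scopes the setting to this transaction only
        cur.execute("SET LOCAL synchronous_commit = off")
        cur.execute("INSERT INTO page_views (url) VALUES (%s)", ("/home",))
# The commit returns without waiting for the WAL fsync: a crash may lose
# this row, but it cannot corrupt the database or weaken the durability
# of any other transaction.
```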
MySQL:

```
# my.cnf - Durability settings

# innodb_flush_log_at_trx_commit: Controls InnoDB durability
innodb_flush_log_at_trx_commit = 1
# Options:
#   1 - Full durability: flush log to disk on every commit (DEFAULT, SAFE)
#   2 - Flush to OS, not to disk (data loss on OS crash)
#   0 - No flush (data loss on any crash)

# sync_binlog: Binary log durability
sync_binlog = 1   # 1 = sync binlog on every commit (SAFE for replication)
                  # 0 = let OS decide when to sync (faster, less safe)

# innodb_doublewrite: Protect against partial page writes
innodb_doublewrite = ON   # Default: ON

# For semi-synchronous replication
rpl_semi_sync_master_enabled = 1
rpl_semi_sync_master_wait_for_slave_count = 1
rpl_semi_sync_master_timeout = 1000   # ms to wait for slave ACK
```

Never disable fsync or set synchronous_commit=off in production for data you care about. The performance gain is not worth the data loss risk. If you need more performance, add replicas, upgrade hardware, or redesign your access patterns. Sacrificing durability is almost never the right answer.
Even with proper database configuration, durability can be compromised at multiple points in the stack. Common pitfalls include:

- Disk write caches that report success while data still sits in volatile cache (no power-loss protection)
- 'Asynchronous commit' settings that acknowledge transactions before the WAL is flushed
- Asynchronous replication, where failover silently discards everything in the replication lag
- fsync disabled or relaxed (synchronous_commit=off, innodb_flush_log_at_trx_commit=0) in pursuit of benchmark numbers

One way to find a weak link is to test the chain directly, as in the script below:
```
# A simple (but scary) durability test
# WARNING: Do this on a TEST system only!

# 1. Initialize, then start a write workload
#    (pgbench's -i initialization mode cannot be combined with run options)
pgbench -i -s 10 testdb
pgbench -c 10 -T 60 testdb &

# 2. While the workload is running, simulate a crash
# Option A: Kill the database processes
kill -9 $(pgrep postgres)

# Option B: Actually cut power (on test hardware)
# echo 1 > /proc/sys/kernel/sysrq && echo o > /proc/sysrq-trigger

# 3. Restart the database and verify
pg_ctl -D "$PGDATA" start
psql -c "SELECT COUNT(*) FROM pgbench_accounts;" testdb

# 4. Check for data corruption (pg_checksums requires the server to be
#    stopped and checksums to be enabled in the cluster)
pg_checksums --check -D "$PGDATA"

# If data is different than expected or corrupted,
# your durability chain has a weak link!
```

Durability is the database's most fundamental promise: what is committed stays committed. Without durability, everything else is meaningless—atomicity, consistency, and isolation mean nothing if committed data can vanish.
You now understand Durability comprehensively—from the storage hierarchy and WAL foundations, through fsync mechanics and commit acknowledgment, to replication strategies and common pitfalls. Next, we'll bring all four ACID properties together to see how they work in practice.