The stock trade executed perfectly. The buyer clicked 'Confirm,' the system deducted funds, recorded the position, and displayed 'Transaction Complete.' The customer took a screenshot and closed their browser. Thirty seconds later, a transformer exploded at the data center. Power vanished. Servers died instantly.
When power returned and the database restarted, was that trade recorded? Did the customer buy the stock or not? If the database says 'no trade exists' while the customer has a screenshot confirming it did, the brokerage faces a serious legal and financial problem.
Durability is the ACID property that answers this question definitively: Once a transaction commits, its effects are permanent. Not 'probably permanent.' Not 'permanent unless there's a power failure.' Absolutely, unconditionally permanent—surviving crashes, restarts, hardware failures, and any other disaster short of physical destruction of all storage media.
By the end of this page, you will understand how databases achieve the seemingly impossible: guaranteeing that data survives failures that occur after the commit but before all data reaches stable storage. You'll learn about write-ahead logging, checkpointing, fsync, and the storage hierarchy that makes durability possible.
Durability is the ACID property that guarantees committed transactions are never lost. It represents the ultimate promise a database makes about data persistence.
Formal Definition:
A transaction exhibits durability if, once the transaction has been committed, its effects are permanently recorded in the database and will survive any subsequent failure—including power outages, operating system crashes, and hardware failures.
The key insight is that durability must survive failures that occur after the commit. The moment the database acknowledges 'COMMIT successful,' it has made an irrevocable promise: this data will be here when you look for it, regardless of what happens next.
Durability means 'data is not lost.' High availability means 'data is accessible.' A durable but crashed system still has the data—it's just temporarily inaccessible until recovery. These are complementary but distinct properties. Durability is about data persistence; availability is about data accessibility.
The Problem Durability Solves:
Computers have a memory hierarchy:
The challenge: for performance, databases keep recent data in RAM. But committing a transaction requires the data to reach persistent storage before acknowledging success. Otherwise, a power failure loses 'committed' data.
| Storage Level | Speed | Survives Power Loss? | Typical Size |
|---|---|---|---|
| CPU Registers | ~1 nanosecond | No | ~1 KB |
| L1/L2/L3 Cache | ~1-10 nanoseconds | No | ~64 MB |
| Main Memory (RAM) | ~100 nanoseconds | No | ~128 GB |
| NVMe SSD | ~10-100 microseconds | Yes* | ~1-8 TB |
| SATA SSD | ~100-500 microseconds | Yes* | ~1-4 TB |
| HDD | ~1-10 milliseconds | Yes | ~10-20 TB |
SSDs have volatile write caches that improve performance but can lose data on power failure. Enterprise SSDs include capacitors to flush the cache to persistent NAND on power loss. Consumer SSDs may not. Additionally, the operating system and filesystem have their own caches. True durability requires bypassing or flushing all these caches—which is what fsync() does.
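To make the cache problem concrete, here is a minimal Python sketch of a durable append (an illustration, not any database's actual code): `os.write()` alone may leave the data in the OS page cache, and only `os.fsync()` blocks until the device confirms persistence.

```python
import os

def durable_append(path: str, record: bytes) -> None:
    """Append a record and force it through the volatile caches to disk."""
    # O_APPEND guarantees each write lands at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)  # data may still sit in the OS page cache
        os.fsync(fd)          # block until the storage device confirms the write
    finally:
        os.close(fd)

durable_append("wal.log", b"LSN 000004 T1 COMMIT\n")
```

Note that even this is not the whole story: for a newly created file, a careful implementation also fsyncs the containing directory so the file's directory entry survives a crash.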
The primary mechanism for achieving durability is Write-Ahead Logging (WAL), also called transaction logging or redo logging. The concept is elegant: before modifying any data page, first write a log record describing the change. This log record is forced to stable storage before acknowledging the commit.
The WAL Protocol:

The protocol imposes two ordering rules:

1. Before a modified data page is written to disk, every log record describing changes to that page must already be on stable storage.
2. Before a commit is acknowledged to the client, all of the transaction's log records, including its COMMIT record, must be on stable storage.
Writing to the log is sequential (append-only), while writing data pages requires random I/O to different disk locations. Sequential writes are much faster, especially on HDDs. By making the commit path sequential (log write only) and deferring random I/O (data page writes) to background processes, WAL achieves both durability and performance.
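This split between the sequential commit path and deferred random writes can be sketched as a toy model (illustrative only, not a real database's internals): the only forced I/O on commit is the log append plus fsync, while data pages change in memory.

```python
import json
import os

class MiniWAL:
    """Toy write-ahead log: the commit path touches only the sequential log."""

    def __init__(self, log_path: str):
        self.log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pages = {}  # in-memory "buffer pool" of data pages
        self.lsn = 0

    def _log(self, record: dict) -> None:
        self.lsn += 1
        record["lsn"] = self.lsn
        os.write(self.log_fd, (json.dumps(record) + "\n").encode())

    def update(self, txn: str, page: str, value) -> None:
        self._log({"txn": txn, "type": "UPDATE", "page": page, "after": value})
        self.pages[page] = value  # modify in memory only; no data-file I/O

    def commit(self, txn: str) -> None:
        self._log({"txn": txn, "type": "COMMIT"})
        os.fsync(self.log_fd)  # the only forced disk I/O on the commit path

wal = MiniWAL("mini.wal")
wal.update("T1", "accounts_page_42", {"id": 1, "balance": 700})
wal.commit("T1")  # durable once fsync returns
```

A background checkpoint process would later write `wal.pages` to the data files; the commit itself never waits for that random I/O.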
Write-Ahead Logging Process:

1. The transaction writes to the WAL first (forced to disk on commit).
2. Data pages are modified in memory only (the buffer pool).
3. On crash, recovery reads the WAL to learn what was committed.
4. A background checkpoint lazily writes dirty pages to the data files (random writes).
5. The WAL is replayed to redo any committed changes not yet in the data files.

The WAL receives fast sequential writes; the data files receive deferred random writes.

What the Log Contains:
Each log record contains sufficient information to either redo or undo a change:
```
-- Conceptual structure of WAL log records
-- (Actual binary format varies by database)

-- Transaction begins
LSN: 000001  TxnID: T1  Type: BEGIN
  Timestamp: 2024-01-15 10:30:00.001

-- Update operation with before/after images
LSN: 000002  TxnID: T1  Type: UPDATE
  PageID: accounts_page_42  Offset: 384  Length: 64
  BeforeImage: {id: 1, balance: 1000, ...}
  AfterImage:  {id: 1, balance: 700, ...}

-- Another update in same transaction
LSN: 000003  TxnID: T1  Type: UPDATE
  PageID: accounts_page_107  Offset: 192  Length: 64
  BeforeImage: {id: 2, balance: 500, ...}
  AfterImage:  {id: 2, balance: 800, ...}

-- Commit record - THE DURABILITY GUARANTEE
-- Once this is on disk, the transaction is committed forever
LSN: 000004  TxnID: T1  Type: COMMIT
  Timestamp: 2024-01-15 10:30:00.047
-- CRITICAL: COMMIT record must be fsync'd before acknowledging to client
```

The exact moment a transaction becomes durable is precisely defined: a transaction is committed when its COMMIT log record reaches stable storage. This is the "commit point," the point of no return.
The Commit Sequence:

1. The client issues COMMIT.
2. The database appends the COMMIT record to the WAL buffer in memory.
3. The WAL buffer is flushed and fsync()'d to stable storage.
4. Only after fsync() returns does the database acknowledge "COMMIT successful" to the client.
5. The modified data pages are written to the data files later, in the background.
fsync() is the most expensive operation in the commit path. It waits for the disk to confirm that data has physically reached persistent storage. For HDDs, this means waiting for the disk platter to rotate to the write position. Even for SSDs, the firmware must confirm the write. This is why high-transaction-rate systems invest heavily in fast storage, battery-backed write caches, and techniques like group commit.
Group Commit: Amortizing fsync() Costs
Calling fsync() for every transaction is expensive. Group commit batches multiple transactions' log flushes into a single fsync(), dramatically improving throughput:
This trades slightly higher latency for significantly higher throughput. A busy system might batch 100+ commits per fsync(), reducing per-transaction I/O overhead by 100x.
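The batching idea can be sketched as follows (a toy model: it counts flushes to show the amortization, whereas a real group commit also blocks each client until the batch containing its commit record has been fsync'd, and flushes on a timeout as well as on a full batch):

```python
import os

class GroupCommitLog:
    """Batch several transactions' commit records into one fsync."""

    def __init__(self, path: str, batch_size: int = 4):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = []     # commit records waiting for the next flush
        self.fsync_count = 0  # how many fsync() calls we actually made

    def commit(self, txn_id: str) -> None:
        self.pending.append(f"COMMIT {txn_id}\n".encode())
        # A real system would also flush after a short timeout (commit_delay);
        # this sketch flushes only when the batch fills.
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        os.write(self.fd, b"".join(self.pending))  # one write for N commits
        os.fsync(self.fd)                          # one fsync for N commits
        self.fsync_count += 1
        self.pending.clear()

log = GroupCommitLog("group.wal", batch_size=4)
for i in range(8):
    log.commit(f"T{i}")
# 8 commits, but only 2 fsync() calls
```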
```sql
-- PostgreSQL: Configure commit behavior for durability vs. performance

-- Full durability (default) - synchronous fsync on every commit
-- Safest, but lowest throughput
SET synchronous_commit = on;

-- Commit returns after WAL is applied on a synchronous standby
-- Provides durability across server failure (if the standby survives)
SET synchronous_commit = remote_apply;

-- Commit returns before fsync completes (DANGEROUS!)
-- Recent commits may be lost on crash, though the database stays consistent
-- Only use when you can tolerate data loss
SET synchronous_commit = off;

-- PostgreSQL group commit: delay to batch commits
-- Increase for throughput at the cost of latency
SET commit_delay = 10;    -- microseconds to wait before fsync
SET commit_siblings = 5;  -- minimum concurrent transactions to trigger the delay

-- MySQL InnoDB: similar concept
-- innodb_flush_log_at_trx_commit = 1 (full durability, default)
-- innodb_flush_log_at_trx_commit = 2 (flush to OS buffer each commit, fsync once/second)
-- innodb_flush_log_at_trx_commit = 0 (no flush, dangerous)
```

If the WAL contains the complete history of all changes, couldn't the log grow forever? And wouldn't recovery require replaying the entire log from the beginning of time? Checkpointing solves both problems.
What is a Checkpoint?
A checkpoint is a point-in-time marker at which:

- All data pages dirtied before the checkpoint are flushed from the buffer pool to the data files
- A checkpoint record is written to the WAL, recording the position from which recovery must begin
After a checkpoint, recovery only needs to replay the log from the checkpoint forward—not from the beginning of time.
Types of Checkpoints:

- Sharp checkpoints pause new writes while every dirty page is flushed at once. Recovery logic is simple, but the I/O spike and pause hurt latency.
- Fuzzy (spread) checkpoints flush dirty pages gradually in the background while transactions continue. Most production databases, including PostgreSQL and InnoDB, take this approach.
```sql
-- PostgreSQL checkpoint configuration
-- (These are server-level settings: use ALTER SYSTEM plus a config reload,
--  not session-level SET)

-- Maximum time between automatic checkpoints
-- Lower = faster recovery, more I/O overhead
-- Higher = longer recovery, less I/O overhead
ALTER SYSTEM SET checkpoint_timeout = '5min';

-- Maximum WAL size before forcing a checkpoint
-- Limits how much log must be replayed on recovery
ALTER SYSTEM SET max_wal_size = '1GB';

-- Spread checkpoint I/O over this fraction of checkpoint_timeout
-- 0.9 = spread writes over 90% of the checkpoint interval
-- Reduces I/O spikes
ALTER SYSTEM SET checkpoint_completion_target = 0.9;

-- Apply the changes
SELECT pg_reload_conf();

-- Force a manual checkpoint (useful before maintenance)
CHECKPOINT;

-- MySQL InnoDB equivalent concepts:
-- innodb_log_file_size - determines recovery replay time
-- innodb_max_dirty_pages_pct - triggers flushing when the buffer pool gets too dirty
-- innodb_adaptive_flushing - adjusts the flush rate based on workload
```

More frequent checkpoints mean faster recovery (less log to replay) but more I/O overhead during normal operation. Less frequent checkpoints reduce normal I/O but extend recovery time. Tune based on your recovery time objective (RTO) and I/O capacity. Most production systems aim for recovery times under 5 minutes.
When a database restarts after a crash, the recovery process ensures that all committed transactions are present in the data files and all uncommitted transactions are removed. This is the ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) recovery protocol used by most modern databases.
The Three Phases of ARIES Recovery:
| Phase | Direction | Objective | Result |
|---|---|---|---|
| Analysis | Forward from checkpoint | Identify work to do | Lists of transactions and dirty pages |
| Redo | Forward from checkpoint | Restore committed changes | Data files match committed state |
| Undo | Backward | Remove uncommitted changes | No trace of aborted transactions |
```
Recovery Example: WAL Contents at Crash
─────────────────────────────────────────────────────────────────────
LSN 1000: CHECKPOINT (all pages clean at this point)
LSN 1001: T1 BEGIN
LSN 1002: T1 UPDATE Page_A (balance: 1000 → 700)
LSN 1003: T2 BEGIN
LSN 1004: T2 UPDATE Page_B (quantity: 50 → 45)
LSN 1005: T1 COMMIT                  ← T1 is committed (durable)
LSN 1006: T2 UPDATE Page_C (status: 'pending' → 'processing')
<<< CRASH >>>                          (T2 never committed)

Analysis Phase:
- T1: Has COMMIT at LSN 1005 → will be redone if necessary
- T2: No COMMIT → will be undone (active at crash)
- Dirty pages: A, B, C potentially need work

Redo Phase (forward from checkpoint):
- LSN 1002: Check Page_A's LSN. If < 1002, redo the update
- LSN 1004: Check Page_B's LSN. If < 1004, redo the update
- LSN 1006: Check Page_C's LSN. If < 1006, redo the update
(We redo ALL operations to bring the data files current, even for T2)

Undo Phase (backward, for uncommitted transactions):
- T2 at LSN 1006: Undo Page_C (restore 'pending')
- T2 at LSN 1004: Undo Page_B (restore 50)
- Write CLR records for each undo operation

Final State:
- Page_A: balance = 700    (T1's committed change preserved)
- Page_B: quantity = 50    (T2's change undone)
- Page_C: status = 'pending' (T2's change undone)
```

We redo ALL operations (even for transactions we'll undo) because the data pages on disk might be in any state—some operations applied, some not. Redo brings everything to a consistent current state. Then undo rolls back uncommitted changes cleanly. This separation simplifies recovery logic significantly.
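The three phases can be sketched in a few lines of Python. This is a toy model of ARIES (real implementations handle CLRs, partial page writes, and checkpoint records), but it runs the crash scenario above end to end:

```python
def recover(log, pages):
    """Toy ARIES-style recovery: analysis, redo, then undo.

    `log` is a list of records like
      {"lsn": 1002, "txn": "T1", "type": "UPDATE",
       "page": "A", "before": 1000, "after": 700}
    `pages` maps page name -> {"lsn": last_applied_lsn, "value": ...}.
    """
    # Analysis: transactions with no COMMIT record are losers.
    committed = {r["txn"] for r in log if r["type"] == "COMMIT"}
    losers = {r["txn"] for r in log if r["type"] == "UPDATE"} - committed

    # Redo (forward): replay every update the page hasn't absorbed yet.
    for r in log:
        if r["type"] == "UPDATE" and pages[r["page"]]["lsn"] < r["lsn"]:
            pages[r["page"]] = {"lsn": r["lsn"], "value": r["after"]}

    # Undo (backward): roll back losers' updates, newest first.
    for r in reversed(log):
        if r["type"] == "UPDATE" and r["txn"] in losers:
            pages[r["page"]]["value"] = r["before"]  # a real system logs a CLR here
    return pages

# The crash scenario above: T1 committed at LSN 1005, T2 never committed.
log = [
    {"lsn": 1002, "txn": "T1", "type": "UPDATE", "page": "A", "before": 1000, "after": 700},
    {"lsn": 1004, "txn": "T2", "type": "UPDATE", "page": "B", "before": 50, "after": 45},
    {"lsn": 1005, "txn": "T1", "type": "COMMIT"},
    {"lsn": 1006, "txn": "T2", "type": "UPDATE", "page": "C", "before": "pending", "after": "processing"},
]
pages = {p: {"lsn": 0, "value": None} for p in "ABC"}
recover(log, pages)
# A = 700 (T1 kept), B = 50 and C = 'pending' (T2 undone)
```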
Single-node durability protects against process crashes and power failures. But what about catastrophic events—disk failure, server destruction, data center fires? Distributed durability extends protection across multiple physical locations.
Synchronous vs. Asynchronous Replication:

- Synchronous: the primary waits for the standby to acknowledge the WAL before confirming the commit. No committed data is lost if the primary fails, at the cost of added commit latency.
- Asynchronous: the primary acknowledges the commit immediately and ships WAL to the standby in the background. Latency is lower, but recently committed transactions can be lost on failover.
```sql
-- PostgreSQL: Configure synchronous replication for durability

-- On the primary: require confirmation from at least one synchronous standby
-- before reporting commit success (server-level setting, so ALTER SYSTEM)
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1, standby2)';
SELECT pg_reload_conf();

-- Commit modes:
-- 'on' (default): wait for WAL to be flushed on the standby
-- 'remote_write': wait for WAL to reach the standby's OS
-- 'remote_apply': wait for WAL to be applied on the standby (strongest)
SET synchronous_commit = remote_apply;

-- This configuration means:
-- 1. Primary writes WAL to local disk
-- 2. Primary sends WAL to the standby
-- 3. Standby writes WAL to disk and applies it
-- 4. Standby acknowledges to the primary
-- 5. Primary acknowledges the commit to the client
--
-- If the primary fails after step 5, the standby has all committed data.
-- The client receives confirmation only after the data exists in two places.
```

Synchronous replication guarantees durability (data exists on multiple nodes) but reduces availability (if the synchronous replica is unreachable, commits block). This is the CAP theorem in action: you can't have perfect durability AND perfect availability across network partitions. Design systems with clear priorities.
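The five-step sequence can be modeled in a few lines of Python. This is a sketch with hypothetical interfaces (`replicate`, `InMemoryStandby` are inventions for illustration, not any real replication API): the commit is acknowledged only after the local fsync and every synchronous standby's confirmation.

```python
import os

def replicated_commit(record: bytes, local_log, standbys) -> bool:
    """Toy synchronous-replication commit path (hypothetical interfaces).

    `local_log` is an open binary file; each standby is any object with a
    `replicate(record) -> bool` method that returns True once the record
    is durable on that node.
    """
    local_log.write(record)
    local_log.flush()
    os.fsync(local_log.fileno())                      # steps 1-2: durable locally
    acks = [s.replicate(record) for s in standbys]    # steps 3-4: ship and wait
    return all(acks)                                  # step 5: ack only if all confirmed

class InMemoryStandby:
    """Stand-in standby that just records what it received."""
    def __init__(self):
        self.log = []
    def replicate(self, record: bytes) -> bool:
        self.log.append(record)
        return True

with open("primary.wal", "ab") as f:
    ok = replicated_commit(b"COMMIT T1\n", f, [InMemoryStandby(), InMemoryStandby()])
```

A real system would ship WAL over the network and time out or block if a synchronous standby is unreachable, which is exactly the durability/availability trade-off described above.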
Even with a correct understanding of durability, several pitfalls can silently compromise data safety:

- Consumer SSDs and RAID controllers with volatile write caches that acknowledge writes before they are persistent
- Disabling fsync or synchronous_commit for performance and forgetting to re-enable it
- Filesystems or virtualization layers that ignore or reorder flush requests
- Relying on a single machine: a destroyed disk defeats single-node durability
Periodically kill database servers ungracefully (kill -9, pull power) while recent transactions are in flight, then verify on restart that all committed transactions are present. Custom scripts can automate this, and tools like diskchecker.pl can verify that the storage stack actually honors flushes. Trust but verify your durability configuration.
```bash
#!/bin/bash
# Simple durability test script

# Insert a record with a unique timestamp
TEST_ID=$(date +%s)
psql -c "INSERT INTO durability_test (id, created_at) VALUES ($TEST_ID, NOW());"
echo "Inserted test record: $TEST_ID"

# Wait for any async replication
sleep 2

# Verify the record exists
COUNT=$(psql -t -c "SELECT COUNT(*) FROM durability_test WHERE id = $TEST_ID;")
if [ "$COUNT" -eq 1 ]; then
  echo "Record confirmed in database"
else
  echo "ERROR: Record not found after insert!"
  exit 1
fi

# Now: manually kill the database server (kill -9 or power off)
# Then: restart and verify the record still exists

# After restart:
# COUNT=$(psql -t -c "SELECT COUNT(*) FROM durability_test WHERE id = $TEST_ID;")
# if [ "$COUNT" -eq 1 ]; then
#   echo "PASS: Record survived crash"
# else
#   echo "FAIL: Record lost - durability compromised!"
# fi
```

We've explored durability as the ACID property that guarantees committed data persists forever. From write-ahead logging to crash recovery to distributed replication, here are the key takeaways:

- Durability means committed transactions survive any failure that occurs after the COMMIT acknowledgment.
- Write-ahead logging makes commits durable with a single sequential, fsync()'d log write, deferring random data-file I/O.
- Checkpointing bounds how much log must be replayed, keeping recovery time manageable.
- ARIES-style recovery redoes committed work and undoes uncommitted work after a crash.
- Synchronous replication extends durability across machines and data centers, at a cost in latency and availability.
What's Next:
With all four ACID properties covered—Atomicity, Consistency, Isolation, and Durability—we now understand the complete guarantee that transactions provide. The final page of this module synthesizes these properties, exploring how they interact, when they conflict, and why ACID as a unified whole is more than the sum of its parts.
You now understand durability as the guarantee that committed data persists forever. You know how WAL, fsync, and checkpointing work together to make this possible, and how distributed replication extends protection across physical boundaries. Next, we explore how all four ACID properties work together as a unified guarantee.