When organizations entrust their most critical data to a database system, they are making a profound commitment. Financial transactions, healthcare records, inventory systems, and countless other mission-critical applications depend absolutely on the guarantee that data, once committed, will not be lost, corrupted, or left in an inconsistent state.
PostgreSQL has earned its reputation as the gold standard for data integrity in the open-source world precisely because its architects refused to compromise on ACID guarantees. Understanding how PostgreSQL achieves these guarantees—not just that it provides them—is essential knowledge for any system designer choosing a database for applications where correctness cannot be traded for convenience.
This page explores the architectural foundations of PostgreSQL's reliability: the Write-Ahead Log (WAL), Multi-Version Concurrency Control (MVCC), transaction isolation mechanisms, and durability guarantees. You will understand not just what these features do, but how they work together to provide the strongest data integrity guarantees available in an open-source RDBMS.
Before diving into PostgreSQL's implementation, let's establish precisely what ACID means and why each property is non-negotiable for serious data management:
Atomicity ensures that a transaction is treated as a single, indivisible unit. Either all operations within the transaction succeed and are committed, or none of them take effect. There is no partial execution—if a transfer from Account A to Account B fails after debiting A, the debit is reversed. Without atomicity, business logic is fundamentally broken.
Consistency guarantees that every transaction moves the database from one valid state to another valid state. All defined constraints—primary keys, foreign keys, check constraints, triggers—are enforced. A transaction cannot leave the database violating its own rules.
Isolation means that concurrent transactions execute as if they were sequential. Transaction A cannot see uncommitted changes from Transaction B. The level of isolation is configurable (Read Committed, Repeatable Read, Serializable), but some degree of isolation is always maintained.
Durability promises that once a transaction is committed, it will survive system failures—power outages, crashes, disk failures (with proper redundancy). Committed data is permanent.
| Property | Guarantee | What Fails Without It | PostgreSQL Mechanism |
|---|---|---|---|
| Atomicity | All-or-nothing execution | Partial updates corrupt data state | WAL + Transaction management |
| Consistency | Valid state transitions only | Constraint violations, data anomalies | Constraint enforcement + Triggers |
| Isolation | Concurrent transactions don't interfere | Dirty reads, phantom reads, lost updates | MVCC + Isolation levels |
| Durability | Committed data survives failures | Data loss on crash or power failure | WAL + fsync + Checkpoints |
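To make atomicity concrete, here is a minimal sketch of the transfer scenario described above, using a hypothetical `accounts` table. If any statement fails, or the client issues ROLLBACK, neither balance change takes effect.

```sql
-- Minimal sketch: an atomic transfer between two rows of a hypothetical accounts table.
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- debit Account A
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- credit Account B

-- If anything fails before this point (constraint violation, crash, explicit ROLLBACK),
-- PostgreSQL discards both updates as a single unit.
COMMIT;
```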
Many NoSQL systems and even some MySQL storage engines (like MyISAM) weaken or entirely abandon ACID properties for performance. This is a deliberate trade-off, not a minor detail. Applications running on non-ACID systems must implement their own consistency logic—a complex, error-prone endeavor that often leads to subtle, hard-to-diagnose data corruption.
PostgreSQL's durability and crash recovery capabilities rest upon a fundamental architectural decision: Write-Ahead Logging (WAL). This mechanism, also known as transaction logging or journaling, is the single most important feature for understanding PostgreSQL's reliability.
The Core Principle:
Before any modification to the actual data files (called 'heap files' in PostgreSQL), the change must first be written to the WAL—a sequential, append-only log. Only after the WAL record is durably written (flushed to disk) is the transaction considered committed.
This seemingly simple rule has profound implications:
```
Transaction Begins:
┌────────────────────────────────────────────────────────────────────┐
│ 1. Application executes INSERT/UPDATE/DELETE                       │
│ 2. PostgreSQL generates WAL record describing the change           │
│ 3. WAL record is written to WAL buffer (in memory)                 │
│ 4. Data page is modified in shared buffers (in memory)             │
└────────────────────────────────────────────────────────────────────┘

COMMIT Requested:
┌────────────────────────────────────────────────────────────────────┐
│ 5. COMMIT WAL record is generated                                  │
│ 6. All WAL records up to and including COMMIT are flushed to disk  │
│ 7. fsync() ensures records are physically on stable storage        │
│ 8. Transaction is confirmed committed to application               │
└────────────────────────────────────────────────────────────────────┘

Data Page Writes (Asynchronous):
┌────────────────────────────────────────────────────────────────────┐
│ 9. Background writer eventually writes dirty pages to data files   │
│ 10. Checkpoint process periodically ensures all pages are flushed  │
│ 11. Old WAL segments can be recycled after checkpoint              │
└────────────────────────────────────────────────────────────────────┘
```

The Performance Insight:
The brilliance of WAL is that it decouples the performance-critical path (commit confirmation) from the expensive operation (random data file writes). By writing sequentially to the WAL, PostgreSQL can confirm transactions quickly while deferring the scattered data writes to background processes.
This design also means you can tune PostgreSQL for your particular durability vs. performance requirements; the key parameters are covered in the durability configuration section later on this page.
PostgreSQL also writes 'full page images' to WAL after each checkpoint for pages that are being modified for the first time since that checkpoint. This prevents torn page (partial write) corruption—if a crash occurs while a page is being written, the full page image in WAL allows complete reconstruction. This is controlled by the full_page_writes parameter, which should remain enabled for data safety.
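As a quick way to observe this machinery from a live session, the sketch below reads the relevant settings and the current WAL write position. The functions shown (pg_current_wal_lsn, pg_walfile_name) exist in PostgreSQL 10 and later.

```sql
-- Inspect WAL-related durability settings and the current WAL position.
SHOW full_page_writes;    -- should be 'on' in production
SHOW wal_level;           -- minimal, replica, or logical
SHOW synchronous_commit;

SELECT pg_current_wal_lsn();                    -- current write position in the WAL
SELECT pg_walfile_name(pg_current_wal_lsn());   -- WAL segment file that contains that position
```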
PostgreSQL achieves isolation without the catastrophic performance penalty of naive locking through Multi-Version Concurrency Control (MVCC). This design allows readers and writers to operate concurrently without blocking each other—a fundamental requirement for high-throughput applications.
How MVCC Works:
In PostgreSQL, each row version (called a 'tuple') contains metadata indicating when it was created and when it was deleted (or marked for deletion). Every transaction is assigned a unique, monotonically increasing transaction ID (XID). When a transaction reads data, it only sees tuples that:

- were created (xmin) by a transaction that had already committed when the reader's snapshot was taken, and
- have not been deleted (xmax is zero), or were deleted by a transaction that had not yet committed as of that snapshot.
This means multiple versions of the same logical row can coexist in the table simultaneously.
```
Tuple Header Fields:
┌────────────────────────────────────────────────────────────────────┐
│ xmin: Transaction ID that created this tuple version               │
│ xmax: Transaction ID that deleted/updated this tuple (0 if live)   │
│ ctid: Physical location on disk (page, offset within page)         │
│ infomask: Bit flags for visibility, null indicators, etc.          │
│ infomask2: Additional flags including HOT update status            │
└────────────────────────────────────────────────────────────────────┘

Example: Row Updated Twice
─────────────────────────────────────────────────────────────────────
Version 1 (Original):      xmin=100, xmax=150   [Value: 'Alice']
Version 2 (First update):  xmin=150, xmax=200   [Value: 'Alicia']
Version 3 (Current):       xmin=200, xmax=0     [Value: 'Alicia M.']

Transaction 175 sees Version 1 (xmin=100 < 175; transaction 150 committed
    after 175's snapshot was taken, so that update is not yet visible and
    Version 2 stays hidden from 175)
Transaction 250 sees Version 3 (xmin=200 < 250, xmax=0 means live)
```
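You can observe this metadata yourself, because PostgreSQL exposes xmin and xmax as hidden system columns on every table. A small sketch, using a throwaway table (the table and values are illustrative):

```sql
-- Watching xmin/xmax change as a row is updated (illustrative table).
CREATE TABLE people (id INT PRIMARY KEY, name TEXT);

INSERT INTO people VALUES (1, 'Alice');
SELECT xmin, xmax, ctid, * FROM people;   -- xmax = 0: this version has not been deleted or updated

UPDATE people SET name = 'Alicia' WHERE id = 1;
SELECT xmin, xmax, ctid, * FROM people;   -- a new tuple version with a fresh xmin and a new ctid;
                                          -- the old version is dead but occupies space until VACUUM
```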
The Visibility Map and VACUUM:

MVCC creates a maintenance challenge: old tuple versions (dead tuples) accumulate in tables. The VACUUM process reclaims this space by:

- removing dead tuples that are no longer visible to any running transaction,
- recording the freed space in the free space map so it can be reused by future inserts and updates,
- updating the visibility map, which marks pages whose tuples are visible to all transactions (this also speeds up index-only scans), and
- 'freezing' very old transaction IDs to defend against XID wraparound.
The autovacuum daemon automates this process, monitoring table activity and running vacuum when thresholds are crossed. Properly tuned autovacuum is critical for PostgreSQL health—neglected tables can suffer from 'bloat' where dead tuples consume massive storage and degrade performance.
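One way to check whether autovacuum is keeping up is to look at the per-table statistics view; a sketch (the table name in the manual VACUUM is a placeholder):

```sql
-- Find the tables with the most dead tuples and see when they were last vacuumed.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Manually reclaim dead tuples in one table and refresh planner statistics:
VACUUM (VERBOSE, ANALYZE) orders;
```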
| Advantage | Trade-off | Mitigation |
|---|---|---|
| Readers never block writers | Multiple row versions consume storage | Regular VACUUM, autovacuum tuning |
| Writers never block readers | Dead tuples accumulate until cleaned | Aggressive autovacuum for high-update tables |
| No read locks on data | XID wraparound risk in very long-running transactions | Avoid extremely long transactions, monitor xid age |
| Consistent snapshots for reporting | Snapshot too old error if data changes faster than queries | Tune old_snapshot_threshold, optimize slow queries |
PostgreSQL's transaction IDs are 32-bit integers, meaning ~4 billion transactions before wraparound. If 'anti-wraparound vacuum' falls too far behind, PostgreSQL will shut down to prevent data corruption. High-transaction-rate systems must be monitored for XID age, and autovacuum must be tuned to keep up. This is a case where understanding internals prevents production disasters.
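A monitoring query along these lines uses age() against the frozen-XID markers; the thresholds you alert on are your own choice, but values approaching two billion are an emergency.

```sql
-- Transaction ID age per database (highest first).
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- The same view per table, to find what is holding freezing back.
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
```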
PostgreSQL accepts all four SQL-standard isolation levels but provides three distinct behaviors (requesting Read Uncommitted gives you Read Committed), with semantic differences that matter for application correctness. Understanding these levels is essential for designing systems that behave correctly under concurrent load.
Read Committed (Default):
Each statement within a transaction sees a snapshot of committed data as of the moment that statement begins. Different statements within the same transaction may see different data if other transactions commit in between.
Implication: If you SELECT a row, then UPDATE it, the row might have been modified by another transaction between your SELECT and UPDATE. The UPDATE will see the latest committed version.
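A common way to close that window is to lock the row when you read it; a brief sketch against the same hypothetical accounts table:

```sql
-- Read-then-update without the race: FOR UPDATE locks the row until COMMIT/ROLLBACK.
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;  -- concurrent writers now wait for us
-- ... application computes the new balance ...
UPDATE accounts SET balance = 850 WHERE id = 1;
COMMIT;
```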
Repeatable Read:
The transaction sees a consistent snapshot of the database as of the moment the first query in the transaction executes. All subsequent queries see this same snapshot, regardless of concurrent commits by other transactions.
Implication: Re-reading the same row returns the same data. But if you try to UPDATE a row that another transaction modified after your snapshot was taken, PostgreSQL raises a serialization failure (you must retry the transaction).
Serializable:
The strongest isolation level. PostgreSQL guarantees that the outcome of concurrent transactions is equivalent to some serial (one-at-a-time) execution order. This prevents all anomalies including phantom reads and write skew.
Implication: More serialization failures will occur, requiring retry logic. But you get the simplest mental model: transactions behave as if they ran alone.
```sql
-- Default: Read Committed
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- Returns 1000
-- Another transaction commits: UPDATE accounts SET balance = 900 WHERE id = 1;
SELECT balance FROM accounts WHERE id = 1;  -- Returns 900 (sees new commit)
COMMIT;

-- Repeatable Read: Consistent snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;  -- Returns 1000
-- Another transaction commits: UPDATE accounts SET balance = 900 WHERE id = 1;
SELECT balance FROM accounts WHERE id = 1;  -- Still returns 1000!
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- ERROR: could not serialize access due to concurrent update
ROLLBACK;  -- Must retry the entire transaction

-- Serializable: Full serializability
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Complex multi-statement logic here
-- PostgreSQL tracks read/write dependencies
-- If a conflict is detected that would cause a non-serializable outcome,
-- one transaction is aborted with serialization_failure
COMMIT;  -- Or ROLLBACK if aborted
```

Read Committed is sufficient for most OLTP applications where individual statements are self-contained. Repeatable Read is valuable for reporting queries that must see consistent data across multiple statements. Serializable is essential for applications with complex invariants that span multiple rows or tables (e.g., double-entry accounting), where you want the database to enforce correctness rather than implementing it in application code.
PostgreSQL provides granular control over durability guarantees, allowing architects to make informed trade-offs between performance and data safety based on application requirements.
Key Durability Parameters:
| Parameter | Options | Impact | Recommendation |
|---|---|---|---|
| synchronous_commit | on, off, local, remote_write, remote_apply | Controls when commit returns to client | on for critical data, off acceptable for logs/metrics |
| fsync | on, off | Controls whether PostgreSQL issues fsync() calls | NEVER set to off in production |
| full_page_writes | on, off | Writes full page images after checkpoint | Keep on to prevent torn page corruption |
| wal_level | minimal, replica, logical | Amount of information in WAL | replica for physical replication, logical for CDC |
| checkpoint_timeout | 30s - 1d | Maximum time between checkpoints | Typically 5-15 minutes; longer intervals mean fewer checkpoints but longer crash recovery |
| max_wal_size | size | WAL size before checkpoint triggered | Balance between checkpoint frequency and recovery time |
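As an illustration of how these parameters are adjusted and verified (the specific values are placeholders, not recommendations), a sketch:

```sql
-- Adjust checkpoint-related settings; both can be reloaded without a restart.
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '4GB';
SELECT pg_reload_conf();

-- Verify the effective durability settings and where each value comes from.
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('synchronous_commit', 'fsync', 'full_page_writes',
               'wal_level', 'checkpoint_timeout', 'max_wal_size');
```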
Synchronous Replication for Enhanced Durability:
For applications requiring zero data loss even if the primary server is destroyed, PostgreSQL supports synchronous replication: a commit does not return to the client until the WAL for that transaction has been safely received (or, with remote_apply, applied) by one or more designated standby servers.
This configuration trades latency for durability—each commit incurs network round-trip time to the synchronous standby. For many applications, this is an acceptable trade-off for the guarantee of zero data loss.
```
# On the primary: postgresql.conf
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on        # or remote_apply for the strongest guarantee

# This means:
#  - Commits wait for at least 1 standby from (standby1, standby2)
#  - FIRST priority: standby1 is preferred, standby2 is backup
#  - If standby1 is unavailable, standby2 becomes synchronous
```

```sql
-- Per-transaction override (useful for less critical operations):
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO logs (message) VALUES ('Non-critical log entry');
COMMIT;
-- This insert commits without waiting for standby confirmation
```

Setting fsync = off dramatically improves write performance because PostgreSQL no longer waits for data to reach stable storage. However, any OS crash, power failure, or storage subsystem issue can corrupt your database irreparably. This setting should only be used for disposable databases, such as test instances or bulk-load staging databases that can simply be rebuilt from source data after a failure. Never use it for data you cannot recreate.
PostgreSQL's crash recovery process demonstrates how WAL, MVCC, and durability configurations work together to provide enterprise-grade reliability.
Recovery Process:
When PostgreSQL starts after an unclean shutdown (crash, power loss, kill -9), it performs recovery:
```
Timeline: Normal Operation to Recovery
═══════════════════════════════════════════════════════════════════════

T=0      Checkpoint completes, all data flushed to disk
         [Database consistent on disk at this point]

T=1-300  Normal operations: commits at T=50, T=100, T=200, T=280
         [WAL contains records for all these transactions]
         [Data files may or may not reflect all changes]

T=300    CRASH (power failure, kernel panic, etc.)
         [In-flight transaction at T=295 was not committed]

T=305    PostgreSQL restarts, begins recovery:
         1. Read pg_control → last checkpoint at T=0
         2. Open WAL from T=0 position
         3. Replay WAL record for T=50 commit  → data now reflects T=50
         4. Replay WAL record for T=100 commit → data now reflects T=100
         5. Replay WAL record for T=200 commit → data now reflects T=200
         6. Replay WAL record for T=280 commit → data now reflects T=280
         7. Reach end of WAL (T=295 transaction has no commit record)
         8. T=295 transaction is automatically rolled back (never committed)
         9. Recovery complete, database consistent

T=310    Database open for connections
         [All committed transactions preserved, uncommitted rolled back]
```

Because WAL is a complete sequential log of all modifications, PostgreSQL can recover to any point in time, not just the crash moment. By combining a base backup with archived WAL, you can restore a database to the state it was in at 2:47:33 PM last Tuesday—useful for recovering from application-level errors like accidental deletes or bad data migrations.
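A minimal sketch of such a point-in-time restore on PostgreSQL 12 or later, assuming WAL archiving was already enabled on the primary and a base backup has been restored into the data directory (the path and target timestamp are placeholders):

```
# postgresql.conf on the server being restored
restore_command = 'cp /archive/%f %p'           # how to fetch archived WAL segments
recovery_target_time = '2024-01-16 14:47:33'    # stop replaying WAL at this moment
recovery_target_action = 'promote'              # open for normal writes once the target is reached

# Create an empty recovery.signal file in the data directory, then start the server;
# PostgreSQL replays archived WAL up to the target time and then promotes.
```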
PostgreSQL's Consistency guarantee (the 'C' in ACID) is enforced through a comprehensive constraint system that prevents invalid data from entering the database. Unlike application-level validation, database constraints are guaranteed to be enforced regardless of which application, script, or manual operation attempts to modify data.
Constraint Types:
| Constraint Type | Purpose | Enforcement | Example |
|---|---|---|---|
| NOT NULL | Column must have a value | Immediate, every INSERT/UPDATE | email VARCHAR(255) NOT NULL |
| UNIQUE | Values must be distinct across rows | Immediate, uses unique index | UNIQUE (email) |
| PRIMARY KEY | Row identifier, implies UNIQUE + NOT NULL | Immediate | id SERIAL PRIMARY KEY |
| FOREIGN KEY | References must exist in parent table | Immediate or deferred | REFERENCES users(id) |
| CHECK | Arbitrary boolean condition | Immediate | CHECK (age >= 0 AND age <= 150) |
| EXCLUSION | No two rows satisfy comparison | Immediate, uses GiST index | EXCLUDE USING gist (room WITH =, during WITH &&) |
```sql
-- Deferrable foreign keys for complex transactions
-- (assumes a customers(id, name) table already exists)
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    total DECIMAL(10,2) NOT NULL,
    customer_id INTEGER NOT NULL REFERENCES customers(id)
        DEFERRABLE INITIALLY DEFERRED
);

-- Allows inserting order before customer within same transaction:
BEGIN;
SET CONSTRAINTS ALL DEFERRED;
INSERT INTO orders (customer_id, total) VALUES (999, 100.00);
    -- Customer 999 doesn't exist yet, but constraint check is deferred
INSERT INTO customers (id, name) VALUES (999, 'New Customer');
COMMIT;  -- Constraint checked here; both rows exist, so succeeds

-- Exclusion constraints for non-overlapping ranges
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for integer equality in a GiST exclusion constraint
CREATE TABLE room_reservations (
    room_id INTEGER NOT NULL,
    during TSRANGE NOT NULL,
    EXCLUDE USING gist (room_id WITH =, during WITH &&)
);

-- This prevents double-booking: two reservations for the same room
-- with overlapping time ranges cannot coexist

-- Complex check constraints
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    price DECIMAL(10,2) NOT NULL,
    discount_price DECIMAL(10,2),
    CHECK (discount_price IS NULL OR discount_price < price),
    CHECK (price > 0)
);

-- Triggers for complex validation beyond declarative constraints
CREATE OR REPLACE FUNCTION validate_order_total() RETURNS TRIGGER AS $$
BEGIN
    IF NEW.total != (SELECT SUM(price * quantity)
                     FROM order_items WHERE order_id = NEW.id) THEN
        RAISE EXCEPTION 'Order total does not match items';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attach the trigger to the table (EXECUTE FUNCTION requires PostgreSQL 11+;
-- older versions use EXECUTE PROCEDURE)
CREATE TRIGGER check_order_total
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION validate_order_total();
```

Well-designed constraints serve as executable documentation of business rules. When a new developer asks 'Can an order have a negative total?' the schema itself answers: CHECK (total >= 0). This is far more reliable than comments or external documentation that can become stale.
We have explored the architectural foundations that make PostgreSQL one of the most reliable database systems available: the Write-Ahead Log that underpins atomicity and durability, MVCC and the VACUUM machinery that keeps it healthy, the three transaction isolation levels and their trade-offs, configurable durability and synchronous replication, WAL-based crash recovery and point-in-time recovery, and the constraint system that enforces consistency.
What's Next:
Now that we understand PostgreSQL's reliability guarantees, the next page explores PostgreSQL's rich feature set—the powerful capabilities that make it far more than just a SQL-compliant database, including advanced data types, full-text search, JSON support, and the procedural language ecosystem.
You now understand the mechanisms that make PostgreSQL one of the most reliable relational database systems available. These ACID guarantees form the foundation upon which all PostgreSQL-based systems are built. Next, we'll explore the rich features that make PostgreSQL exceptionally versatile.