At the heart of MVCC's value proposition is a deceptively simple promise: every read operation within a transaction sees a consistent view of the database. This consistency means that data appears as it existed at a single logical point in time, regardless of concurrent modifications by other transactions.
This guarantee transforms how applications interact with databases. Without read consistency, developers must defensively code around the possibility that data changes between reads—leading to convoluted application logic, potential race conditions, and subtle bugs. With MVCC's read consistency, developers can write straightforward code with confidence that their transaction sees a stable reality.
But how exactly does this work? How does a database system determine which version of each data item to return when multiple versions exist? What happens when a transaction reads, then another transaction commits a change, and the first transaction reads again?
By the end of this page, you will understand how MVCC systems create and use snapshots, the precise algorithms that determine version visibility, how read consistency prevents different anomalies, and the relationship between MVCC and SQL isolation levels. You'll be able to trace exactly how a read operation selects the correct version.
The cornerstone of MVCC read consistency is the snapshot—a logical point-in-time view of the database. Each transaction operates with a snapshot that determines which versions of data items it will see.
What is a Snapshot?
A snapshot is not a physical copy of the database. Rather, it's a small data structure that captures enough information to determine, for any version of any data item, whether that version should be visible to the transaction.
Typically, a snapshot contains the ID of the transaction that created it, the set of transaction IDs that were active (uncommitted) at creation time, and the minimum and maximum transaction IDs bounding that set.

With this information, the visibility algorithm can determine, for any version of any data item, whether its creating (or deleting) transaction had committed before the snapshot was taken, was still in progress, or had not yet started.
```
// Snapshot Structure

class Snapshot {
    // The transaction ID when this snapshot was created
    snapshot_xid: TransactionID

    // Transaction IDs that were active (uncommitted) when snapshot was taken
    // Versions from these transactions are NOT visible
    active_txns: Set<TransactionID>

    // Minimum active transaction ID at snapshot time
    // Any transaction ID less than this has definitely completed
    xmin: TransactionID

    // Maximum transaction ID at snapshot time
    // Any transaction ID greater than this started after the snapshot
    xmax: TransactionID
}

// Snapshot Creation (at transaction begin or first read)
function create_snapshot() -> Snapshot:
    snapshot = new Snapshot()

    // Get current transaction state atomically
    with global_transaction_lock:
        snapshot.snapshot_xid = get_current_transaction_id()
        snapshot.xmax = next_transaction_id()
        snapshot.active_txns = get_all_active_transaction_ids()
        snapshot.xmin = min(snapshot.active_txns) if not empty else snapshot.xmax

    return snapshot

// Example snapshot state at a point in time:
//
// Committed transactions: 100, 101, 102, 104, 105, 106
// Active transactions:    103, 107, 108
// Next transaction ID:    110
//
// Snapshot taken by transaction 109:
// {
//   snapshot_xid: 109,
//   xmin: 103,
//   xmax: 110,
//   active_txns: {103, 107, 108}
// }
//
// Visibility:
// - Txn 100, 101, 102, 104, 105: visible (committed before snapshot)
// - Txn 103: NOT visible (active at snapshot time)
// - Txn 106: visible (committed, not in active set)
// - Txn 107, 108: NOT visible (active at snapshot time)
// - Txn 110+: NOT visible (started after snapshot)
```

When is the Snapshot Taken?
The timing of snapshot creation determines the transaction's semantics. Different isolation levels take snapshots at different times:
Snapshot Isolation / Repeatable Read: Snapshot taken at transaction start (or first query). All reads see the same consistent view throughout.
Read Committed: New snapshot taken at the start of each statement. Each statement may see different committed data.
Serializable (MVCC implementation): Snapshot at transaction start, plus additional conflict detection to ensure serializability.
This snapshot timing is crucial—it's what makes Repeatable Read actually repeatable and what allows Read Committed to see committed changes from concurrent transactions.
| Isolation Level | Snapshot Taken | Behavior |
|---|---|---|
| Read Uncommitted | N/A (not typically used with MVCC) | Reads latest version, even uncommitted |
| Read Committed | Start of each SQL statement | Sees newly committed data between statements |
| Repeatable Read | Start of transaction (first read) | All reads return same data throughout transaction |
| Serializable | Start of transaction + extra tracking | Repeatable + guaranteed serializable ordering |
PostgreSQL's default isolation level is Read Committed, meaning it takes a new snapshot per statement. Many developers incorrectly assume transaction-level consistency. For truly repeatable reads within a transaction, explicitly issue SET TRANSACTION ISOLATION LEVEL REPEATABLE READ (or start the transaction with BEGIN ISOLATION LEVEL REPEATABLE READ).
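The per-statement versus per-transaction snapshot timing can be made concrete with a toy simulation. Plain Python dictionaries stand in for committed state and snapshots here; this is a sketch of the timing rules, not how a real engine stores anything.

```python
# Toy model: READ COMMITTED re-snapshots before every statement,
# REPEATABLE READ reuses the snapshot taken at the first read.

class Transaction:
    def __init__(self, db, isolation):
        self.db = db          # shared "committed state"
        self.isolation = isolation
        self.snapshot = None  # frozen copy of committed state

    def query(self, key):
        # READ COMMITTED: take a fresh snapshot for each statement;
        # REPEATABLE READ: only snapshot once, at the first read.
        if self.isolation == "READ COMMITTED" or self.snapshot is None:
            self.snapshot = dict(self.db)
        return self.snapshot.get(key)

db = {"balance": 1000}
rc = Transaction(db, "READ COMMITTED")
rr = Transaction(db, "REPEATABLE READ")

first_rc, first_rr = rc.query("balance"), rr.query("balance")
db["balance"] = 500   # a concurrent transaction commits an update
second_rc, second_rr = rc.query("balance"), rr.query("balance")

print(first_rc, second_rc)   # 1000 500  -> non-repeatable read allowed
print(first_rr, second_rr)   # 1000 1000 -> repeatable read preserved
```

The asymmetry in the last two lines is exactly the difference the table above describes: same mechanism, different snapshot timing.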
The visibility algorithm is the core logic that determines whether a specific version is visible to a specific transaction. While conceptually simple, production implementations must handle numerous edge cases efficiently.
The Fundamental Questions:
For any version V and transaction T with snapshot S, the visibility check asks:

1. Did the transaction that created V commit before S was taken?
2. Did the transaction that deleted V (if any) commit before S was taken?

If the answer to (1) is YES and (2) is NO, the version is visible.
```
// MVCC Visibility Algorithm (PostgreSQL-style)
// Determines if a tuple version is visible to a snapshot

function HeapTupleSatisfiesMVCC(tuple: HeapTuple, snapshot: Snapshot) -> Boolean:
    xmin = tuple.xmin  // Transaction that created this tuple
    xmax = tuple.xmax  // Transaction that deleted/updated this tuple (0 if current)

    // ========================================
    // STEP 1: Check creator transaction (xmin)
    // ========================================

    // Case 1A: Created by our own transaction
    if xmin == current_transaction_id():
        if tuple.cmin >= current_command_id():
            return false  // Created by a later command in same transaction

        // We created it in an earlier command
        if xmax == 0:
            return true  // We created it, not yet deleted

        // Check if we also deleted it
        if xmax == current_transaction_id():
            if tuple.cmax >= current_command_id():
                return true   // Deleted by a later command - still visible
            else:
                return false  // Deleted by an earlier command

        // Deleted by another transaction (which must have aborted),
        // since concurrent transactions can't delete our uncommitted rows
        return true

    // Case 1B: Created by another transaction
    if not is_xmin_committed(xmin, snapshot):
        // Creator was active or aborted at snapshot time
        return false

    // Creator committed before our snapshot

    // ========================================
    // STEP 2: Check deleter transaction (xmax)
    // ========================================

    if xmax == 0 or xmax == INVALID_TRANSACTION_ID:
        return true  // Not deleted - definitely visible

    // Case 2A: Deleted by our own transaction
    if xmax == current_transaction_id():
        if tuple.cmax >= current_command_id():
            return true   // Deleted by a later command - still visible to us
        else:
            return false  // Deleted by an earlier command - not visible

    // Case 2B: Deleted by another transaction
    if not is_xmax_committed(xmax, snapshot):
        // Deleter was active or aborted at snapshot time
        return true  // Deletion not visible - tuple is visible

    // Deleter committed - tuple is not visible
    return false

// Helper: Check if the xmin transaction is "committed" from the snapshot's perspective
function is_xmin_committed(xmin: TransactionID, snapshot: Snapshot) -> Boolean:
    // Quick check: if xmin < snapshot.xmin, it has definitely completed
    if xmin < snapshot.xmin:
        return true

    // If xmin >= snapshot.xmax, it started after the snapshot
    if xmin >= snapshot.xmax:
        return false

    // xmin lies in the range [snapshot.xmin, snapshot.xmax) - check the active list
    if xmin in snapshot.active_txns:
        return false  // Was active at snapshot time - not visible

    // Check actual commit status in the transaction log
    status = get_transaction_status(xmin)
    return status == COMMITTED

// Similar logic for is_xmax_committed...
```

Special Cases and Edge Conditions:
The algorithm above handles the common cases, but production systems must also handle:
Subtransaction visibility: Nested transactions (savepoints) mean a version's creator may be a subtransaction that was later rolled back to a savepoint, so its changes must be treated as aborted even though the parent transaction continues
Combo CID: When xmin and xmax are from the same transaction, a combined command ID optimizes storage
Hint Bits: Optimization flags stored in the tuple header that cache visibility decisions
FrozenTransactionId: Special marker for very old tuples to prevent transaction ID wraparound
Invalid Transaction ID: Special value indicating no transaction (e.g., xmax = 0 means not deleted)
Visibility checking happens for EVERY tuple examined by a query. In a table scan of 1 million rows, the visibility algorithm runs 1 million times. Optimizations like hint bits (caching commit status in the tuple) are critical—they avoid repeated transaction log lookups for commonly-accessed tuples.
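The core transaction-ID range checks can be condensed into a short runnable sketch. The snapshot values and the simulated transaction log below mirror the worked example from the snapshot section; the names (xid_visible, COMMITTED) are illustrative, not PostgreSQL's, and command IDs, own-transaction cases, and hint bits are omitted.

```python
# Simplified visibility check for a creating transaction ID against a
# snapshot, following the same three-tier test as the pseudocode above.

from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    xmin: int        # oldest transaction still active at snapshot time
    xmax: int        # next transaction ID at snapshot time
    active: frozenset

# Simulated transaction log (stand-in for a real commit-status lookup)
COMMITTED = {100, 101, 102, 104, 105, 106}

def xid_visible(xid: int, snap: Snapshot) -> bool:
    if xid < snap.xmin:
        return True            # completed before any active transaction
    if xid >= snap.xmax:
        return False           # started after the snapshot was taken
    if xid in snap.active:
        return False           # in progress at snapshot time
    return xid in COMMITTED    # ambiguous range: consult the log

# Snapshot taken by transaction 109 in the worked example
snap = Snapshot(xmin=103, xmax=110, active=frozenset({103, 107, 108}))

assert all(xid_visible(x, snap) for x in (100, 101, 102, 104, 105, 106))
assert not any(xid_visible(x, snap) for x in (103, 107, 108, 110, 111))
print("visibility decisions match the worked example")
```

The two cheap range comparisons come first precisely because of the performance concern above: most tuples are resolved without touching the active list or the transaction log at all.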
InnoDB implements a similar concept called a Read View. While the principles are the same as PostgreSQL's snapshot, the implementation details differ due to InnoDB's undo-log-based architecture.
Read View Structure:
The InnoDB read view tracks a low and an up transaction ID limit, the sorted list of transactions that were active when the view was created, and the ID of the transaction that owns the view.
```cpp
// InnoDB Read View Visibility Check (Simplified)

struct ReadView {
    trx_id_t m_low_limit_id;      // Transactions >= this started after read view
    trx_id_t m_up_limit_id;       // Transactions < this completed before read view
    std::vector<trx_id_t> m_ids;  // Active transactions at read view creation
    trx_id_t m_creator_trx_id;    // Transaction that owns this read view
};

// Check if a transaction's changes are visible to this read view
bool changes_visible(const ReadView* view, trx_id_t id) {
    // 1. Our own transaction's changes are always visible
    if (id == view->m_creator_trx_id) {
        return true;
    }

    // 2. Transaction IDs < m_up_limit_id completed before our read view
    if (id < view->m_up_limit_id) {
        return true;
    }

    // 3. Transaction IDs >= m_low_limit_id started after our read view
    if (id >= view->m_low_limit_id) {
        return false;
    }

    // 4. Transaction ID is in the ambiguous range - check active list
    //    Binary search since m_ids is sorted
    return !std::binary_search(view->m_ids.begin(), view->m_ids.end(), id);
}

// Reading a row: may need to traverse the undo log
Record read_row_mvcc(const ReadView* view, const Record* latest) {
    const Record* current = latest;

    while (current != nullptr) {
        trx_id_t creator = current->get_trx_id();

        if (changes_visible(view, creator)) {
            // This version was created by a visible transaction
            // Check if it's a delete marker
            if (current->is_delete_marked()) {
                return NOT_FOUND;  // Row was deleted
            }
            return *current;  // This is the visible version
        }

        // This version is not visible - go back in history
        current = current->get_older_version_from_undo_log();
    }

    // No visible version found - row didn't exist for this read view
    return NOT_FOUND;
}

/*
Example Read View:
  m_low_limit_id:   110
  m_up_limit_id:    103
  m_ids:            [103, 107, 108]  // Active at read view creation
  m_creator_trx_id: 109

Visibility decisions:
  trx_id = 100: visible (100 < 103, completed before read view)
  trx_id = 103: NOT visible (in m_ids, was active)
  trx_id = 105: visible (103 <= 105 < 110, not in m_ids)
  trx_id = 107: NOT visible (in m_ids, was active)
  trx_id = 109: visible (it's our own transaction)
  trx_id = 110: NOT visible (>= m_low_limit_id, started after)
*/
```

MVCC read consistency provides powerful guarantees that directly address the classic concurrency anomalies. Let's trace exactly how MVCC prevents each problem.
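The undo-log walk in read_row_mvcc can also be sketched as runnable code. The class and function names here are illustrative, not InnoDB's actual API, and the read-view numbers reuse the worked example.

```python
# Toy version-chain traversal in the spirit of InnoDB's undo-log walk.
# Each Version links to the older version it replaced.

class Version:
    def __init__(self, trx_id, value, older=None, deleted=False):
        self.trx_id, self.value = trx_id, value
        self.older, self.deleted = older, deleted

def changes_visible(trx_id, creator, low, up, active):
    if trx_id == creator:
        return True            # our own transaction's changes
    if trx_id < up:
        return True            # completed before the read view
    if trx_id >= low:
        return False           # started after the read view
    return trx_id not in active  # ambiguous range: consult active list

def read_row(head, creator, low, up, active):
    v = head
    while v is not None:
        if changes_visible(v.trx_id, creator, low, up, active):
            return None if v.deleted else v.value
        v = v.older            # walk back through undo history
    return None                # row did not exist for this read view

# Chain: active trx 107 updated the row that committed trx 100 created.
old = Version(trx_id=100, value="balance=1000")
head = Version(trx_id=107, value="balance=500", older=old)

# Read view of trx 109: low=110, up=103, active={103, 107, 108}
# Head version (trx 107) is invisible, so the walk returns the old value.
assert read_row(head, 109, 110, 103, {103, 107, 108}) == "balance=1000"
# Trx 107 itself sees its own uncommitted update.
assert read_row(head, 107, 110, 103, {103, 107, 108}) == "balance=500"
```

Note the contrast with PostgreSQL's approach: instead of scanning multiple tuple versions in the heap, the reader starts at the latest record and reconstructs history backward until a visible version appears.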
Preventing Dirty Reads:
A dirty read occurs when a transaction reads uncommitted data from another transaction. MVCC prevents this because an uncommitted writer's transaction ID appears in the reader's active set (or lies beyond its xmax), so the writer's versions fail the visibility check and the reader falls back to the last committed version.
```
Scenario: Dirty Read Prevention with MVCC

Time  T1 (Txn 100)                        T2 (Txn 101)
────────────────────────────────────────────────────────────────
 1    BEGIN
 2    UPDATE accounts SET balance = 500
      WHERE id = 1
      (Creates version with xmin=100)
 3                                        BEGIN
                                          Snapshot: {active: [100], xmax: 102}
 4                                        SELECT balance FROM accounts WHERE id = 1
                                          Version xmin=100:
                                            - Is 100 in active set? YES
                                            - Therefore: NOT VISIBLE
                                          Previous version xmin=99 (committed):
                                            - Is 99 < snapshot.xmin? YES
                                            - Therefore: VISIBLE
                                          Result: Returns OLD balance (e.g., 1000)
                                          ✅ Dirty read prevented!
 5    ROLLBACK
      (T1 aborts - change discarded)
 6                                        SELECT balance FROM accounts WHERE id = 1
                                          Result: Still 1000 (correct - T1 rolled back)

Without MVCC (if T2 saw the uncommitted value):
  Step 4 would return 500 (WRONG!)
  Step 6 would see 1000 back - inconsistent!
```

Preventing Non-Repeatable Reads (at Repeatable Read):
A non-repeatable read occurs when a transaction reads the same row twice and gets different values. MVCC Repeatable Read prevents this by reusing the same snapshot for every read in the transaction, so versions committed by other transactions after the snapshot remain invisible.
```
Scenario: Non-Repeatable Read Prevention at REPEATABLE READ

Time  T1 (Txn 100)                              T2 (Txn 101)
────────────────────────────────────────────────────────────────
 1    BEGIN ISOLATION LEVEL REPEATABLE READ
      Snapshot taken: {active: [], xmax: 101}
 2    SELECT balance FROM accounts WHERE id = 1
      Visible version: balance = 1000
      (xmin=99, committed)
      Result: 1000
 3                                              BEGIN
 4                                              UPDATE accounts SET balance = 500
                                                WHERE id = 1
                                                (Creates new version xmin=101)
 5                                              COMMIT
                                                (Version xmin=101 now committed)
 6    SELECT balance FROM accounts WHERE id = 1
      New version xmin=101:
        - Is 101 >= snapshot.xmax (101)? YES
        - Therefore: NOT VISIBLE (started after our snapshot)
      Old version xmin=99:
        - Is 99 < snapshot.xmax? YES
        - Is 99 in active set? NO
        - Therefore: VISIBLE
      Result: 1000 (same as before!)
      ✅ Repeatable read achieved!
 7    COMMIT

Without MVCC snapshot:
  Step 6 would return 500 (different from step 2!) = Non-repeatable read
```

Note on Read Committed:
At Read Committed isolation, non-repeatable reads ARE allowed by design. MVCC implements this by taking a new snapshot at the start of each statement, so each statement sees whatever has committed since the previous statement ran.
This is not a bug—it's the defined behavior of Read Committed. It allows applications to see recent committed data at the cost of within-transaction consistency.
The same MVCC mechanism provides different guarantees based on snapshot timing. At Read Committed, each statement gets a fresh snapshot—repeatable reads are NOT guaranteed. At Repeatable Read, all statements use the transaction's initial snapshot. Always verify your application's isolation level expectations match the database configuration.
Phantom reads—where a query returns additional rows on re-execution due to concurrent inserts—represent a subtle challenge for MVCC. Standard snapshot isolation provides partial protection against phantoms.
How MVCC Helps with Phantoms:
With snapshot isolation, rows inserted and committed after the snapshot was taken carry transaction IDs at or beyond the snapshot's xmax, so they fail the visibility check and never appear when a query is re-executed.
This seems to prevent phantoms entirely, and for most practical purposes, it does. However, there's a subtle distinction.
```
Scenario: Phantom Prevention with Snapshot Isolation

Time  T1 (Txn 100)                              T2 (Txn 101)
────────────────────────────────────────────────────────────────
 1    BEGIN ISOLATION LEVEL REPEATABLE READ
      Snapshot: {active: [], xmax: 101}
 2    SELECT COUNT(*) FROM accounts
      WHERE balance > 1000
      Finds: Account #1 (balance 1500),
             Account #2 (balance 2000)
      Result: 2 rows
 3                                              BEGIN
 4                                              INSERT INTO accounts (id, balance)
                                                VALUES (3, 3000)
                                                (New row with xmin=101)
 5                                              COMMIT
 6    SELECT COUNT(*) FROM accounts
      WHERE balance > 1000
      Scanning all rows:
        - Account #1: xmin=99 < 101 → VISIBLE, balance 1500 ✓
        - Account #2: xmin=98 < 101 → VISIBLE, balance 2000 ✓
        - Account #3: xmin=101 >= snapshot.xmax → NOT VISIBLE ✗
      Result: 2 rows (same as before!)
      ✅ Phantom read prevented!
```

The Write Skew Exception:
While MVCC prevents phantom reads (same rows returned), it doesn't prevent all related anomalies. The write skew problem can occur when two transactions read overlapping data and then make writes based on those reads.
Consider a constraint: 'At least one doctor must be on call.' Two doctors, both on call, each check if removing themselves would leave at least one other doctor. Both see the other is on call, both remove themselves—and the constraint is violated.
```
Write Skew Anomaly (NOT Prevented by Standard Snapshot Isolation)

Constraint: At least one doctor must be on-call
Initial: doctor_1.on_call = true, doctor_2.on_call = true

Time  T1 (Doctor 1)                         T2 (Doctor 2)
────────────────────────────────────────────────────────────────────
 1    BEGIN REPEATABLE READ                 BEGIN REPEATABLE READ
      Snapshot: xmax=100                    Snapshot: xmax=100
 2    SELECT COUNT(*) FROM doctors
      WHERE on_call = true
      Result: 2 (safe to leave)
 3                                          SELECT COUNT(*) FROM doctors
                                            WHERE on_call = true
                                            Result: 2 (safe to leave)
 4    UPDATE doctors SET on_call = false
      WHERE id = 1
      (doctor_1 thinks doctor_2 will cover)
 5                                          UPDATE doctors SET on_call = false
                                            WHERE id = 2
                                            (doctor_2 thinks doctor_1 will cover)
 6    COMMIT ✓                              COMMIT ✓

Result: Both doctors removed from on-call! Constraint violated! 😱
  - No write-write conflict (different rows)
  - No phantom read (no new rows appeared)
  - But the combined effect violates the application invariant

Solution: Use the SERIALIZABLE isolation level, which would detect
this conflict and abort one transaction.
```

Write skew demonstrates that Snapshot Isolation (what most databases call REPEATABLE READ with MVCC) is NOT fully serializable. For applications requiring true serializability, use the SERIALIZABLE isolation level—which in PostgreSQL employs Serializable Snapshot Isolation (SSI) with additional conflict detection.
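The doctors scenario can be replayed as a toy simulation, with dictionaries standing in for committed rows and per-transaction snapshots (illustrative only; no real database involved):

```python
# Write skew under snapshot isolation: each "transaction" decides based
# on its private snapshot, then writes to a different row. There is no
# write-write conflict, yet the combined result breaks the invariant.

doctors = {1: True, 2: True}   # on_call flags, committed state

snap_t1 = dict(doctors)        # T1's snapshot at transaction start
snap_t2 = dict(doctors)        # T2's snapshot at transaction start

# T1: "two doctors are on call, safe for me to leave" -> update row 1
if sum(snap_t1.values()) >= 2:
    doctors[1] = False         # commits: touches only row 1

# T2: made the same check against its own (now stale) snapshot
if sum(snap_t2.values()) >= 2:
    doctors[2] = False         # commits: touches only row 2

# Both commits succeed on disjoint rows, so first-updater-wins rules
# never fire - and now nobody is on call.
assert sum(doctors.values()) == 0
print("invariant violated: no doctor on call")
```

A serializable implementation such as PostgreSQL's SSI would track the read-write dependency between the two transactions and abort one of them instead of letting both commit.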
Understanding MVCC read consistency empowers you to make better application design decisions. Here are practical patterns and anti-patterns.
Pattern 1: Long-Running Reports
MVCC shines for reporting transactions that scan large amounts of data without blocking OLTP operations. Use Repeatable Read for consistent reports that don't see partial updates from concurrent transactions.
```sql
-- Pattern: Consistent Reporting with MVCC

-- Start a reporting transaction
BEGIN ISOLATION LEVEL REPEATABLE READ;

-- First report query
SELECT department_id,
       SUM(salary) AS total_salary,
       COUNT(*) AS employee_count
FROM employees
GROUP BY department_id;

-- This could take minutes for large tables...

-- Meanwhile, other transactions are updating salaries,
-- hiring employees, etc. - all proceeding without waiting!

-- Second report query - sees SAME data as first
SELECT department_id,
       AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;

-- The totals from the first query will be consistent with the
-- averages from the second query - no phantom employees appeared

COMMIT;

-- Anti-pattern: Using Read Committed for reports
BEGIN;  -- Default is READ COMMITTED in PostgreSQL
SELECT department_id, SUM(salary) FROM employees GROUP BY department_id;
-- ... time passes, other transactions commit ...
SELECT department_id, SUM(salary) FROM employees GROUP BY department_id;
-- Could return different totals! Not suitable for consistent reports.
COMMIT;
```

Pattern 2: Read-Modify-Write with Consistency Checks
When you need to read data, apply business logic, and write based on the read values, consider the consistency requirements carefully.
```sql
-- Pattern: Read-Modify-Write requiring row-level atomicity

-- Option 1: Atomic UPDATE (best for simple cases)
UPDATE accounts SET balance = balance - 100
WHERE id = 1 AND balance >= 100;
-- Single statement is always atomic, no read-then-write gap

-- Option 2: SELECT FOR UPDATE (explicit row locking)
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- Returns 500; row is now locked against concurrent updates
-- Application logic: check balance, calculate, etc.
UPDATE accounts SET balance = 400 WHERE id = 1;
COMMIT;
-- Other transactions waited for our lock

-- Option 3: Optimistic concurrency with MVCC
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance, version FROM accounts WHERE id = 1;
-- Returns: balance=500, version=5
-- Application logic...
UPDATE accounts
SET balance = 400, version = 6
WHERE id = 1 AND version = 5;
-- Check: was the update applied?
-- If version changed, another transaction modified the row
--   - 1 row updated = success
--   - 0 rows updated = conflict, retry transaction
COMMIT;
```

For simple increments/decrements, atomic UPDATE is best. For complex business logic requiring multiple reads before a decision, SELECT FOR UPDATE provides certainty. For high-concurrency scenarios where conflicts are rare, optimistic locking with version columns minimizes blocking while handling conflicts gracefully.
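Option 3's conflict handling is usually wrapped in an application-side retry loop. Here is a minimal sketch, with an in-memory dict standing in for the accounts table and a hypothetical compare_and_update emulating the conditional UPDATE's rows-affected result:

```python
# Optimistic-concurrency retry loop (application-side sketch).

accounts = {1: {"balance": 500, "version": 5}}

def compare_and_update(acct_id, expected_version, new_balance):
    """Emulates: UPDATE ... WHERE id = ? AND version = ?  -> rows updated."""
    row = accounts[acct_id]
    if row["version"] != expected_version:
        return 0                          # another txn changed the row
    row["balance"] = new_balance
    row["version"] = expected_version + 1
    return 1

def withdraw(acct_id, amount, max_retries=3):
    for _ in range(max_retries):
        row = dict(accounts[acct_id])     # SELECT balance, version
        if row["balance"] < amount:
            return False                  # business rule: insufficient funds
        if compare_and_update(acct_id, row["version"],
                              row["balance"] - amount):
            return True                   # 1 row updated: success
        # 0 rows updated: conflict - loop and retry with fresh values
    raise RuntimeError("too many conflicts, giving up")

assert withdraw(1, 100) is True
assert accounts[1] == {"balance": 400, "version": 6}
```

The key design point is that a conflict is detected by the update count, not by blocking: no lock is held between the read and the write, so throughput stays high when conflicts are rare.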
MVCC read consistency transforms concurrent database access from a minefield of anomalies into a predictable, manageable system. The snapshot mechanism provides elegant solutions to problems that plagued lock-based systems.
Looking Ahead:
With read consistency understood, we'll next explore PostgreSQL's MVCC implementation in detail—examining how one of the most popular databases puts these concepts into practice with its heap-based versioning, vacuum mechanisms, and visibility map optimizations.
You now understand how MVCC achieves read consistency through snapshots and visibility algorithms. You can trace visibility decisions, understand isolation level differences, and apply these concepts in practical patterns. Next, we'll see these principles implemented in PostgreSQL.