Loading content...
At the heart of MVCC's value proposition is a deceptively simple promise: every read operation within a transaction sees a consistent view of the database. This consistency means that data appears as it existed at a single logical point in time, regardless of concurrent modifications by other transactions.\n\nThis guarantee transforms how applications interact with databases. Without read consistency, developers must defensively code around the possibility that data changes between reads—leading to convoluted application logic, potential race conditions, and subtle bugs. With MVCC's read consistency, developers can write straightforward code with confidence that their transaction sees a stable reality.\n\nBut how exactly does this work? How does a database system determine which version of each data item to return when multiple versions exist? What happens when a transaction reads, then another transaction commits a change, and the first transaction reads again?
By the end of this page, you will understand how MVCC systems create and use snapshots, the precise algorithms that determine version visibility, how read consistency prevents different anomalies, and the relationship between MVCC and SQL isolation levels. You'll be able to trace exactly how a read operation selects the correct version.
The cornerstone of MVCC read consistency is the snapshot—a logical point-in-time view of the database. Each transaction operates with a snapshot that determines which versions of data items it will see.\n\nWhat is a Snapshot?\n\nA snapshot is not a physical copy of the database. Rather, it's a small data structure that captures enough information to determine, for any version of any data item, whether that version should be visible to the transaction.\n\nTypically, a snapshot contains:\n- The snapshot timestamp or transaction ID at snapshot creation\n- The set of transaction IDs that were in-progress (active) when the snapshot was taken\n- Possibly the oldest active transaction ID (for optimization)\n\nWith this information, the visibility algorithm can determine:\n- Versions created by transactions that committed before the snapshot: visible\n- Versions created by transactions that started after the snapshot: not visible\n- Versions created by in-progress transactions (listed in snapshot): not visible (unless it's own transaction)
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152
// Snapshot Structure class Snapshot { // The transaction ID when this snapshot was created snapshot_xid: TransactionID // Transaction IDs that were active (uncommitted) when snapshot was taken // Versions from these transactions are NOT visible active_txns: Set<TransactionID> // Minimum active transaction ID at snapshot time // Any transaction ID less than this has definitely completed xmin: TransactionID // Maximum transaction ID at snapshot time // Any transaction ID greater than this started after the snapshot xmax: TransactionID} // Snapshot Creation (at transaction begin or first read)function create_snapshot() -> Snapshot: snapshot = new Snapshot() // Get current transaction state atomically with global_transaction_lock: snapshot.snapshot_xid = get_current_transaction_id() snapshot.xmax = next_transaction_id() snapshot.active_txns = get_all_active_transaction_ids() snapshot.xmin = min(snapshot.active_txns) if not empty else snapshot.xmax return snapshot // Example snapshot state at a point in time:// // Committed transactions: 100, 101, 102, 104, 105// Active transactions: 103, 107, 108// Next transaction ID: 110//// Snapshot taken by transaction 109:// {// snapshot_xid: 109,// xmin: 103,// xmax: 110,// active_txns: {103, 107, 108}// }//// Visibility:// - Txn 100, 101, 102, 104, 105: visible (committed before snapshot)// - Txn 103: NOT visible (active at snapshot time)// - Txn 106: visible (committed, not in active set)// - Txn 107, 108: NOT visible (active at snapshot time)// - Txn 110+: NOT visible (started after snapshot)When is the Snapshot Taken?\n\nThe timing of snapshot creation determines the transaction's semantics. Different isolation levels take snapshots at different times:\n\n- Snapshot Isolation / Repeatable Read: Snapshot taken at transaction start (or first query). All reads see the same consistent view throughout.\n\n- Read Committed: New snapshot taken at the start of each statement. Each statement may see different committed data.\n\n- Serializable (MVCC implementation): Snapshot at transaction start, plus additional conflict detection to ensure serializability.\n\nThis snapshot timing is crucial—it's what makes Repeatable Read actually repeatable and what allows Read Committed to see committed changes from concurrent transactions.
| Isolation Level | Snapshot Taken | Behavior |
|---|---|---|
| Read Uncommitted | N/A (not typically used with MVCC) | Reads latest version, even uncommitted |
| Read Committed | Start of each SQL statement | Sees newly committed data between statements |
| Repeatable Read | Start of transaction (first read) | All reads return same data throughout transaction |
| Serializable | Start of transaction + extra tracking | Repeatable + guaranteed serializable ordering |
PostgreSQL's default isolation level is Read Committed, meaning it takes a new snapshot per statement. Many developers incorrectly assume transaction-level consistency. For truly repeatable reads within a transaction, explicitly set 'SET TRANSACTION ISOLATION LEVEL REPEATABLE READ'.
The visibility algorithm is the core logic that determines whether a specific version is visible to a specific transaction. While conceptually simple, production implementations must handle numerous edge cases efficiently.\n\nThe Fundamental Questions:\n\nFor any version V and transaction T with snapshot S, the visibility check asks:\n\n1. Was V created by a transaction that T can see?\n2. Was V deleted (superseded) by a transaction that T can see?\n\nIf the answer to (1) is YES and (2) is NO, the version is visible.
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879
// MVCC Visibility Algorithm (PostgreSQL-style)// Determines if a tuple version is visible to a snapshot function HeapTupleSatisfiesMVCC(tuple: HeapTuple, snapshot: Snapshot) -> Boolean: xmin = tuple.xmin // Transaction that created this tuple xmax = tuple.xmax // Transaction that deleted/updated this tuple (0 if current) // ======================================== // STEP 1: Check creator transaction (xmin) // ======================================== // Case 1A: Created by our own transaction if xmin == current_transaction_id(): if tuple.cmin >= current_command_id(): return false // Created by later command in same transaction // We created it in an earlier command if xmax == 0: return true // We created it, not yet deleted // Check if we also deleted it if xmax == current_transaction_id(): if tuple.cmax >= current_command_id(): return true // Deleted by later command - still visible else: return false // Deleted by earlier command // Deleted by another transaction (which must have aborted) // since concurrent transactions can't delete our uncommitted rows return true // Case 1B: Created by another transaction if not is_xmin_committed(xmin, snapshot): // Creator was active or aborted at snapshot time return false // Creator committed before our snapshot // ======================================== // STEP 2: Check deleter transaction (xmax) // ======================================== if xmax == 0 or xmax == INVALID_TRANSACTION_ID: return true // Not deleted - definitely visible // Case 2A: Deleted by our own transaction if xmax == current_transaction_id(): if tuple.cmax >= current_command_id(): return true // Deleted by later command - still visible to us else: return false // Deleted by earlier command - not visible // Case 2B: Deleted by another transaction if not is_xmax_committed(xmax, snapshot): // Deleter was active or aborted at snapshot time return true // Deletion not visible - tuple is visible // Deleter committed - tuple is not visible return false // Helper: Check if xmin transaction is "committed" from snapshot's perspectivefunction is_xmin_committed(xmin: TransactionID, snapshot: Snapshot) -> Boolean: // Quick check: if xmin < snapshot.xmin, it's definitely committed if xmin < snapshot.xmin: return true // If xmin >= snapshot.xmax, it started after snapshot if xmin >= snapshot.xmax: return false // xmin is in the range [xmin, xmax) - check active list if xmin in snapshot.active_txns: return false // Was active at snapshot time - not visible // Check actual commit status in transaction log status = get_transaction_status(xmin) return status == COMMITTED // Similar logic for is_xmax_committed...Special Cases and Edge Conditions:\n\nThe algorithm above handles the common cases, but production systems must also handle:\n\n1. Subtransaction visibility: Nested transactions (savepoints) create visibility within transactions that temporarily rolled back\n\n2. Combo CID: When xmin and xmax are from the same transaction, a combined command ID optimizes storage\n\n3. Hint Bits: Optimization flags stored in the tuple header that cache visibility decisions\n\n4. FrozenTransactionId: Special marker for very old tuples to prevent transaction ID wraparound\n\n5. Invalid Transaction ID: Special value indicating no transaction (e.g., xmax = 0 means not deleted)
Visibility checking happens for EVERY tuple examined by a query. In a table scan of 1 million rows, the visibility algorithm runs 1 million times. Optimizations like hint bits (caching commit status in the tuple) are critical—they avoid repeated transaction log lookups for commonly-accessed tuples.
InnoDB implements a similar concept called a Read View. While the principles are the same as PostgreSQL's snapshot, the implementation details differ due to InnoDB's undo-log-based architecture.\n\nRead View Structure:\n\nThe InnoDB read view tracks:\n- m_low_limit_id: Any transaction ID ≥ this was not yet started when read view was created\n- m_up_limit_id: Any transaction ID < this had completed (committed or aborted)\n- m_ids: List of active transaction IDs in the range [m_up_limit_id, m_low_limit_id)\n- m_creator_trx_id: The transaction that created this read view
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172
// InnoDB Read View Visibility Check (Simplified) struct ReadView { trx_id_t m_low_limit_id; // Transactions >= this started after read view trx_id_t m_up_limit_id; // Transactions < this completed before read view std::vector<trx_id_t> m_ids; // Active transactions at read view creation trx_id_t m_creator_trx_id; // Transaction that owns this read view}; // Check if a transaction's changes are visible to this read viewbool changes_visible(const ReadView* view, trx_id_t id) { // 1. Our own transaction's changes are always visible if (id == view->m_creator_trx_id) { return true; } // 2. Transaction IDs < m_up_limit_id completed before our read view if (id < view->m_up_limit_id) { return true; } // 3. Transaction IDs >= m_low_limit_id started after our read view if (id >= view->m_low_limit_id) { return false; } // 4. Transaction ID is in the ambiguous range - check active list // Binary search since m_ids is sorted return !std::binary_search(view->m_ids.begin(), view->m_ids.end(), id);} // Reading a row: may need to traverse undo logRecord read_row_mvcc(const ReadView* view, const Record* latest) { Record* current = latest; while (current != nullptr) { trx_id_t creator = current->get_trx_id(); if (changes_visible(view, creator)) { // This version was created by a visible transaction // Check if it's a delete marker if (current->is_delete_marked()) { return NOT_FOUND; // Row was deleted } return *current; // This is the visible version } // This version is not visible - go back in history current = current->get_older_version_from_undo_log(); } // No visible version found - row didn't exist for this read view return NOT_FOUND;} /*Example Read View: m_low_limit_id: 110 m_up_limit_id: 103 m_ids: [103, 107, 108] // Active at read view creation m_creator_trx_id: 109 Visibility decisions: trx_id = 100: visible (100 < 103, completed before read view) trx_id = 103: NOT visible (in m_ids, was active) trx_id = 105: visible (103 <= 105 < 110, not in m_ids) trx_id = 107: NOT visible (in m_ids, was active) trx_id = 109: visible (it's our own transaction) trx_id = 110: NOT visible (>= m_low_limit_id, started after)*/MVCC read consistency provides powerful guarantees that directly address the classic concurrency anomalies. Let's trace exactly how MVCC prevents each problem.\n\nPreventing Dirty Reads:\n\nA dirty read occurs when a transaction reads uncommitted data from another transaction. MVCC prevents this because:\n1. The snapshot tracks which transactions were active (uncommitted) at snapshot time\n2. Versions created by active transactions are explicitly excluded from visibility\n3. Even if the other transaction later commits while our transaction runs, versions from transactions in our snapshot's active set remain invisible
Scenario: Dirty Read Prevention with MVCC Time T1 (Txn 100) T2 (Txn 101)──────────────────────────────────────────────────────────────── 1 BEGIN 2 UPDATE accounts SET balance = 500 WHERE id = 1 (Creates version with xmin=100) 3 BEGIN Snapshot: {active: [100], xmax: 102} 4 SELECT balance FROM accounts WHERE id = 1 Version xmin=100: - Is 100 in active set? YES - Therefore: NOT VISIBLE Previous version xmin=99 (committed): - Is 99 < snapshot.xmin? YES - Therefore: VISIBLE Result: Returns OLD balance (e.g., 1000) ✅ Dirty read prevented! 5 ROLLBACK (T1 aborts - change discarded) 6 SELECT balance FROM accounts WHERE id = 1 Result: Still 1000 (correct - T1 rolled back) Without MVCC (if T2 saw uncommitted value): Step 4 would return 500 (WRONG!) Step 6 would see 1000 back - inconsistent!Preventing Non-Repeatable Reads (at Repeatable Read):\n\nA non-repeatable read occurs when a transaction reads the same row twice and gets different values. MVCC Repeatable Read prevents this by:\n\n1. Taking a snapshot at transaction start (not per-statement)\n2. All reads use the same snapshot throughout the transaction\n3. Even if other transactions commit changes, those commits are after our snapshot and thus invisible
Scenario: Non-Repeatable Read Prevention at REPEATABLE READ Time T1 (Txn 100) T2 (Txn 101)──────────────────────────────────────────────────────────────── 1 BEGIN ISOLATION LEVEL REPEATABLE READ Snapshot taken: {active: [], xmax: 101} 2 SELECT balance FROM accounts WHERE id = 1 Visible version: balance = 1000 (xmin=99, committed) Result: 1000 3 BEGIN 4 UPDATE accounts SET balance = 500 WHERE id = 1 (Creates new version xmin=101) 5 COMMIT (Version xmin=101 now committed) 6 SELECT balance FROM accounts WHERE id = 1 New version xmin=101: - Is 101 >= snapshot.xmax (101)? YES - Therefore: NOT VISIBLE (started after our snapshot) Old version xmin=99: - Is 99 < snapshot.xmax? YES - Is 99 in active set? NO - Therefore: VISIBLE Result: 1000 (same as before!) ✅ Repeatable read achieved! 7 COMMIT Without MVCC snapshot: Step 6 would return 500 (different from step 2!) = Non-repeatable readNote on Read Committed:\n\nAt Read Committed isolation, non-repeatable reads ARE allowed by design. MVCC implements this by taking a new snapshot at each statement:\n\n- First SELECT uses snapshot at time T1\n- T2 commits in between\n- Second SELECT uses new snapshot at time T2 → sees T2's committed changes\n\nThis is not a bug—it's the defined behavior of Read Committed. It allows applications to see recent committed data at the cost of within-transaction consistency.
The same MVCC mechanism provides different guarantees based on snapshot timing. At Read Committed, each statement gets a fresh snapshot—repeatable reads are NOT guaranteed. At Repeatable Read, all statements use the transaction's initial snapshot. Always verify your application's isolation level expectations match the database configuration.
Phantom reads—where a query returns additional rows on re-execution due to concurrent inserts—represent a subtle challenge for MVCC. Standard snapshot isolation provides partial protection against phantoms.\n\nHow MVCC Helps with Phantoms:\n\nWith snapshot isolation:\n- Inserts by concurrent transactions have xmin >= snapshot.xmax\n- Therefore, newly inserted rows are not visible to our snapshot\n- Range queries re-executed within the transaction return the same rows\n\nThis seems to prevent phantoms entirely, and for most practical purposes, it does. However, there's a subtle distinction.
Scenario: Phantom Prevention with Snapshot Isolation Time T1 (Txn 100) T2 (Txn 101)──────────────────────────────────────────────────────────────── 1 BEGIN ISOLATION LEVEL REPEATABLE READ Snapshot: {active: [], xmax: 101} 2 SELECT COUNT(*) FROM accounts WHERE balance > 1000 Finds: Account #1 (balance 1500), Account #2 (balance 2000) Result: 2 rows 3 BEGIN 4 INSERT INTO accounts (id, balance) VALUES (3, 3000) (New row with xmin=101) 5 COMMIT 6 SELECT COUNT(*) FROM accounts WHERE balance > 1000 Scanning all rows: - Account #1: xmin=99 < 101 → VISIBLE, balance 1500 ✓ - Account #2: xmin=98 < 101 → VISIBLE, balance 2000 ✓ - Account #3: xmin=101 >= snapshot.xmax → NOT VISIBLE ✗ Result: 2 rows (same as before!) ✅ Phantom read prevented!The Write Skew Exception:\n\nWhile MVCC prevents phantom reads (same rows returned), it doesn't prevent all related anomalies. The write skew problem can occur when two transactions read overlapping data and then make writes based on those reads.\n\nConsider a constraint: 'At least one doctor must be on call.' Two doctors, both on call, each check if removing themselves would leave at least one other doctor. Both see the other is on call, both remove themselves—and the constraint is violated.
Write Skew Anomaly (NOT Prevented by Standard Snapshot Isolation) Constraint: At least one doctor must be on-callInitial: doctor_1.on_call = true, doctor_2.on_call = true Time T1 (Doctor 1) T2 (Doctor 2)──────────────────────────────────────────────────────────────────── 1 BEGIN REPEATABLE READ BEGIN REPEATABLE READ Snapshot: xmax=100 Snapshot: xmax=100 2 SELECT COUNT(*) FROM doctors WHERE on_call = true Result: 2 (safe to leave) 3 SELECT COUNT(*) FROM doctors WHERE on_call = true Result: 2 (safe to leave) 4 UPDATE doctors SET on_call = false WHERE id = 1 (doctor_1 thinks doctor_2 will cover) 5 UPDATE doctors SET on_call = false WHERE id = 2 (doctor_2 thinks doctor_1 will cover) 6 COMMIT ✓ COMMIT ✓ Result: Both doctors removed from on-call! Constraint violated! 😱 No write-write conflict (different rows)No phantom read (no new rows appeared)But the combined effect violates application invariant Solution: Use SERIALIZABLE isolation level, which would detect this conflict and abort one transaction.Write skew demonstrates that Snapshot Isolation (what most databases call REPEATABLE READ with MVCC) is NOT fully serializable. For applications requiring true serializability, use the SERIALIZABLE isolation level—which in PostgreSQL employs Serializable Snapshot Isolation (SSI) with additional conflict detection.
Understanding MVCC read consistency empowers you to make better application design decisions. Here are practical patterns and anti-patterns.\n\nPattern 1: Long-Running Reports\n\nMVCC shines for reporting transactions that scan large amounts of data without blocking OLTP operations. Use Repeatable Read for consistent reports that don't see partial updates from concurrent transactions.
123456789101112131415161718192021222324252627282930313233343536373839
-- Pattern: Consistent Reporting with MVCC -- Start a reporting transactionBEGIN ISOLATION LEVEL REPEATABLE READ; -- First report querySELECT department_id, SUM(salary) as total_salary, COUNT(*) as employee_countFROM employeesGROUP BY department_id; -- This could take minutes for large tables... -- Meanwhile, other transactions are updating salaries,-- hiring employees, etc. - all proceeding without waiting! -- Second report query - sees SAME data as firstSELECT department_id, AVG(salary) as avg_salaryFROM employeesGROUP BY department_id; -- The totals from first query will be consistent with-- averages from second query - no phantom employees appeared COMMIT; -- Anti-pattern: Using Read Committed for reportsBEGIN; -- Default is READ COMMITTED in PostgreSQL SELECT department_id, SUM(salary) FROM employees GROUP BY department_id;-- ... time passes, other transactions commit ...SELECT department_id, SUM(salary) FROM employees GROUP BY department_id;-- Could return different totals! Not suitable for consistent reports. COMMIT;Pattern 2: Read-Modify-Write with Consistency Checks\n\nWhen you need to read data, apply business logic, and write based on the read values, consider the consistency requirements carefully.
1234567891011121314151617181920212223242526272829303132333435363738
-- Pattern: Read-Modify-Write requiring row-level atomicity -- Option 1: Atomic UPDATE (best for simple cases)UPDATE accounts SET balance = balance - 100 WHERE id = 1 AND balance >= 100;-- Single statement is always atomic, no read-then-write gap -- Option 2: SELECT FOR UPDATE (explicit row locking)BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;-- Returns 500; row is now locked against concurrent updates -- Application logic: check balance, calculate, etc. UPDATE accounts SET balance = 400 WHERE id = 1;COMMIT;-- Other transactions waited for our lock -- Option 3: Optimistic concurrency with MVCCBEGIN ISOLATION LEVEL REPEATABLE READ; SELECT balance, version FROM accounts WHERE id = 1;-- Returns: balance=500, version=5 -- Application logic... UPDATE accounts SET balance = 400, version = 6 WHERE id = 1 AND version = 5; -- Check: was the update applied?-- If version changed, another transaction modified the row-- - 1 row updated = success-- - 0 rows updated = conflict, retry transaction COMMIT;For simple increments/decrements, atomic UPDATE is best. For complex business logic requiring multiple reads before decision, SELECT FOR UPDATE provides certainty. For high-concurrency scenarios where conflicts are rare, optimistic locking with version columns minimizes blocking while handling conflicts gracefully.
MVCC read consistency transforms concurrent database access from a minefield of anomalies into a predictable, manageable system. The snapshot mechanism provides elegant solutions to problems that plagued lock-based systems.
Looking Ahead:\n\nWith read consistency understood, we'll next explore PostgreSQL's MVCC implementation in detail—examining how one of the most popular databases puts these concepts into practice with its heap-based versioning, vacuum mechanisms, and visibility map optimizations.
You now understand how MVCC achieves read consistency through snapshots and visibility algorithms. You can trace visibility decisions, understand isolation level differences, and apply these concepts in practical patterns. Next, we'll see these principles implemented in PostgreSQL.