At the heart of MVCC's value proposition is a deceptively simple promise: every read operation within a transaction sees a consistent view of the database. This consistency means that data appears as it existed at a single logical point in time, regardless of concurrent modifications by other transactions.
This guarantee transforms how applications interact with databases. Without read consistency, developers must defensively code around the possibility that data changes between reads—leading to convoluted application logic, potential race conditions, and subtle bugs. With MVCC's read consistency, developers can write straightforward code with confidence that their transaction sees a stable reality.
But how exactly does this work? How does a database system determine which version of each data item to return when multiple versions exist? What happens when a transaction reads, then another transaction commits a change, and the first transaction reads again?
By the end of this page, you will understand how MVCC systems create and use snapshots, the precise algorithms that determine version visibility, how read consistency prevents different anomalies, and the relationship between MVCC and SQL isolation levels. You'll be able to trace exactly how a read operation selects the correct version.
The cornerstone of MVCC read consistency is the snapshot—a logical point-in-time view of the database. Each transaction operates with a snapshot that determines which versions of data items it will see.
What is a Snapshot?
A snapshot is not a physical copy of the database. Rather, it's a small data structure that captures enough information to determine, for any version of any data item, whether that version should be visible to the transaction.
Typically, a snapshot contains the ID of the transaction that created it, the set of transaction IDs that were active (uncommitted) at creation time, and the minimum and maximum transaction IDs bounding that set.

With this information, the visibility algorithm can determine, for any version of any data item, whether its creating (or deleting) transaction had committed before the snapshot was taken, was still in progress, or had not yet started.
```
// Snapshot Structure

class Snapshot {
    // The transaction ID when this snapshot was created
    snapshot_xid: TransactionID

    // Transaction IDs that were active (uncommitted) when snapshot was taken
    // Versions from these transactions are NOT visible
    active_txns: Set<TransactionID>

    // Minimum active transaction ID at snapshot time
    // Any transaction ID less than this has definitely completed
    xmin: TransactionID

    // Maximum transaction ID at snapshot time
    // Any transaction ID greater than this started after the snapshot
    xmax: TransactionID
}

// Snapshot Creation (at transaction begin or first read)
function create_snapshot() -> Snapshot:
    snapshot = new Snapshot()

    // Get current transaction state atomically
    with global_transaction_lock:
        snapshot.snapshot_xid = get_current_transaction_id()
        snapshot.xmax = next_transaction_id()
        snapshot.active_txns = get_all_active_transaction_ids()
        snapshot.xmin = min(snapshot.active_txns) if not empty else snapshot.xmax

    return snapshot

// Example snapshot state at a point in time:
//
// Committed transactions: 100, 101, 102, 104, 105, 106
// Active transactions:    103, 107, 108
// Next transaction ID:    110
//
// Snapshot taken by transaction 109:
// {
//   snapshot_xid: 109,
//   xmin: 103,
//   xmax: 110,
//   active_txns: {103, 107, 108}
// }
//
// Visibility:
// - Txn 100, 101, 102, 104, 105: visible (committed before snapshot)
// - Txn 103: NOT visible (active at snapshot time)
// - Txn 106: visible (committed, not in active set)
// - Txn 107, 108: NOT visible (active at snapshot time)
// - Txn 110+: NOT visible (started after snapshot)
```

When is the Snapshot Taken?
The timing of snapshot creation determines the transaction's semantics. Different isolation levels take snapshots at different times:
Snapshot Isolation / Repeatable Read: Snapshot taken at transaction start (or first query). All reads see the same consistent view throughout.
Read Committed: New snapshot taken at the start of each statement. Each statement may see different committed data.
Serializable (MVCC implementation): Snapshot at transaction start, plus additional conflict detection to ensure serializability.
This snapshot timing is crucial—it's what makes Repeatable Read actually repeatable and what allows Read Committed to see committed changes from concurrent transactions.
| Isolation Level | Snapshot Taken | Behavior |
|---|---|---|
| Read Uncommitted | N/A (not typically used with MVCC) | Reads latest version, even uncommitted |
| Read Committed | Start of each SQL statement | Sees newly committed data between statements |
| Repeatable Read | Start of transaction (first read) | All reads return same data throughout transaction |
| Serializable | Start of transaction + extra tracking | Repeatable + guaranteed serializable ordering |
PostgreSQL's default isolation level is Read Committed, meaning it takes a new snapshot per statement. Many developers incorrectly assume transaction-level consistency. For truly repeatable reads within a transaction, explicitly issue SET TRANSACTION ISOLATION LEVEL REPEATABLE READ (or start the transaction with BEGIN ISOLATION LEVEL REPEATABLE READ).
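The per-statement versus per-transaction snapshot timing can be made concrete with a toy simulation. Plain Python dictionaries stand in for committed state and snapshots here; this is a sketch of the timing rules, not how a real engine stores anything.

```python
# Toy model: READ COMMITTED re-snapshots before every statement,
# REPEATABLE READ reuses the snapshot taken at the first read.

class Transaction:
    def __init__(self, db, isolation):
        self.db = db          # shared "committed state"
        self.isolation = isolation
        self.snapshot = None  # frozen copy of committed state

    def query(self, key):
        # READ COMMITTED: take a fresh snapshot for each statement;
        # REPEATABLE READ: only snapshot once, at the first read.
        if self.isolation == "READ COMMITTED" or self.snapshot is None:
            self.snapshot = dict(self.db)
        return self.snapshot.get(key)

db = {"balance": 1000}
rc = Transaction(db, "READ COMMITTED")
rr = Transaction(db, "REPEATABLE READ")

first_rc, first_rr = rc.query("balance"), rr.query("balance")
db["balance"] = 500   # a concurrent transaction commits an update
second_rc, second_rr = rc.query("balance"), rr.query("balance")

print(first_rc, second_rc)   # 1000 500  -> non-repeatable read allowed
print(first_rr, second_rr)   # 1000 1000 -> repeatable read preserved
```

The asymmetry in the last two lines is exactly the difference the table above describes: same mechanism, different snapshot timing.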
The visibility algorithm is the core logic that determines whether a specific version is visible to a specific transaction. While conceptually simple, production implementations must handle numerous edge cases efficiently.
The Fundamental Questions:
For any version V and transaction T with snapshot S, the visibility check asks:

1. Did the transaction that created V commit before S was taken?
2. Did the transaction that deleted V (if any) commit before S was taken?

If the answer to (1) is YES and (2) is NO, the version is visible.
```
// MVCC Visibility Algorithm (PostgreSQL-style)
// Determines if a tuple version is visible to a snapshot

function HeapTupleSatisfiesMVCC(tuple: HeapTuple, snapshot: Snapshot) -> Boolean:
    xmin = tuple.xmin  // Transaction that created this tuple
    xmax = tuple.xmax  // Transaction that deleted/updated this tuple (0 if current)

    // ========================================
    // STEP 1: Check creator transaction (xmin)
    // ========================================

    // Case 1A: Created by our own transaction
    if xmin == current_transaction_id():
        if tuple.cmin >= current_command_id():
            return false  // Created by a later command in same transaction

        // We created it in an earlier command
        if xmax == 0:
            return true  // We created it, not yet deleted

        // Check if we also deleted it
        if xmax == current_transaction_id():
            if tuple.cmax >= current_command_id():
                return true   // Deleted by a later command - still visible
            else:
                return false  // Deleted by an earlier command

        // Deleted by another transaction (which must have aborted),
        // since concurrent transactions can't delete our uncommitted rows
        return true

    // Case 1B: Created by another transaction
    if not is_xmin_committed(xmin, snapshot):
        // Creator was active or aborted at snapshot time
        return false

    // Creator committed before our snapshot

    // ========================================
    // STEP 2: Check deleter transaction (xmax)
    // ========================================

    if xmax == 0 or xmax == INVALID_TRANSACTION_ID:
        return true  // Not deleted - definitely visible

    // Case 2A: Deleted by our own transaction
    if xmax == current_transaction_id():
        if tuple.cmax >= current_command_id():
            return true   // Deleted by a later command - still visible to us
        else:
            return false  // Deleted by an earlier command - not visible

    // Case 2B: Deleted by another transaction
    if not is_xmax_committed(xmax, snapshot):
        // Deleter was active or aborted at snapshot time
        return true  // Deletion not visible - tuple is visible

    // Deleter committed - tuple is not visible
    return false

// Helper: Check if the xmin transaction is "committed" from the snapshot's perspective
function is_xmin_committed(xmin: TransactionID, snapshot: Snapshot) -> Boolean:
    // Quick check: if xmin < snapshot.xmin, it has definitely completed
    if xmin < snapshot.xmin:
        return true

    // If xmin >= snapshot.xmax, it started after the snapshot
    if xmin >= snapshot.xmax:
        return false

    // xmin lies in the range [snapshot.xmin, snapshot.xmax) - check the active list
    if xmin in snapshot.active_txns:
        return false  // Was active at snapshot time - not visible

    // Check actual commit status in the transaction log
    status = get_transaction_status(xmin)
    return status == COMMITTED

// Similar logic for is_xmax_committed...
```

Special Cases and Edge Conditions:
The algorithm above handles the common cases, but production systems must also handle:
Subtransaction visibility: Nested transactions (savepoints) mean a version's creator may be a subtransaction that was later rolled back to a savepoint, so its changes must be treated as aborted even though the parent transaction continues
Combo CID: When xmin and xmax are from the same transaction, a combined command ID optimizes storage
Hint Bits: Optimization flags stored in the tuple header that cache visibility decisions
FrozenTransactionId: Special marker for very old tuples to prevent transaction ID wraparound
Invalid Transaction ID: Special value indicating no transaction (e.g., xmax = 0 means not deleted)
Visibility checking happens for EVERY tuple examined by a query. In a table scan of 1 million rows, the visibility algorithm runs 1 million times. Optimizations like hint bits (caching commit status in the tuple) are critical—they avoid repeated transaction log lookups for commonly-accessed tuples.
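The core transaction-ID range checks can be condensed into a short runnable sketch. The snapshot values and the simulated transaction log below mirror the worked example from the snapshot section; the names (xid_visible, COMMITTED) are illustrative, not PostgreSQL's, and command IDs, own-transaction cases, and hint bits are omitted.

```python
# Simplified visibility check for a creating transaction ID against a
# snapshot, following the same three-tier test as the pseudocode above.

from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    xmin: int        # oldest transaction still active at snapshot time
    xmax: int        # next transaction ID at snapshot time
    active: frozenset

# Simulated transaction log (stand-in for a real commit-status lookup)
COMMITTED = {100, 101, 102, 104, 105, 106}

def xid_visible(xid: int, snap: Snapshot) -> bool:
    if xid < snap.xmin:
        return True            # completed before any active transaction
    if xid >= snap.xmax:
        return False           # started after the snapshot was taken
    if xid in snap.active:
        return False           # in progress at snapshot time
    return xid in COMMITTED    # ambiguous range: consult the log

# Snapshot taken by transaction 109 in the worked example
snap = Snapshot(xmin=103, xmax=110, active=frozenset({103, 107, 108}))

assert all(xid_visible(x, snap) for x in (100, 101, 102, 104, 105, 106))
assert not any(xid_visible(x, snap) for x in (103, 107, 108, 110, 111))
print("visibility decisions match the worked example")
```

The two cheap range comparisons come first precisely because of the performance concern above: most tuples are resolved without touching the active list or the transaction log at all.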
InnoDB implements a similar concept called a Read View. While the principles are the same as PostgreSQL's snapshot, the implementation details differ due to InnoDB's undo-log-based architecture.
Read View Structure:
The InnoDB read view tracks a low and an up transaction ID limit, the sorted list of transactions that were active when the view was created, and the ID of the transaction that owns the view.
```cpp
// InnoDB Read View Visibility Check (Simplified)

struct ReadView {
    trx_id_t m_low_limit_id;      // Transactions >= this started after read view
    trx_id_t m_up_limit_id;       // Transactions < this completed before read view
    std::vector<trx_id_t> m_ids;  // Active transactions at read view creation
    trx_id_t m_creator_trx_id;    // Transaction that owns this read view
};

// Check if a transaction's changes are visible to this read view
bool changes_visible(const ReadView* view, trx_id_t id) {
    // 1. Our own transaction's changes are always visible
    if (id == view->m_creator_trx_id) {
        return true;
    }

    // 2. Transaction IDs < m_up_limit_id completed before our read view
    if (id < view->m_up_limit_id) {
        return true;
    }

    // 3. Transaction IDs >= m_low_limit_id started after our read view
    if (id >= view->m_low_limit_id) {
        return false;
    }

    // 4. Transaction ID is in the ambiguous range - check active list
    //    Binary search since m_ids is sorted
    return !std::binary_search(view->m_ids.begin(), view->m_ids.end(), id);
}

// Reading a row: may need to traverse the undo log
Record read_row_mvcc(const ReadView* view, const Record* latest) {
    const Record* current = latest;

    while (current != nullptr) {
        trx_id_t creator = current->get_trx_id();

        if (changes_visible(view, creator)) {
            // This version was created by a visible transaction
            // Check if it's a delete marker
            if (current->is_delete_marked()) {
                return NOT_FOUND;  // Row was deleted
            }
            return *current;  // This is the visible version
        }

        // This version is not visible - go back in history
        current = current->get_older_version_from_undo_log();
    }

    // No visible version found - row didn't exist for this read view
    return NOT_FOUND;
}

/*
Example Read View:
  m_low_limit_id:   110
  m_up_limit_id:    103
  m_ids:            [103, 107, 108]  // Active at read view creation
  m_creator_trx_id: 109

Visibility decisions:
  trx_id = 100: visible (100 < 103, completed before read view)
  trx_id = 103: NOT visible (in m_ids, was active)
  trx_id = 105: visible (103 <= 105 < 110, not in m_ids)
  trx_id = 107: NOT visible (in m_ids, was active)
  trx_id = 109: visible (it's our own transaction)
  trx_id = 110: NOT visible (>= m_low_limit_id, started after)
*/
```

MVCC read consistency provides powerful guarantees that directly address the classic concurrency anomalies. Let's trace exactly how MVCC prevents each problem.
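The undo-log walk in read_row_mvcc can also be sketched as runnable code. The class and function names here are illustrative, not InnoDB's actual API, and the read-view numbers reuse the worked example.

```python
# Toy version-chain traversal in the spirit of InnoDB's undo-log walk.
# Each Version links to the older version it replaced.

class Version:
    def __init__(self, trx_id, value, older=None, deleted=False):
        self.trx_id, self.value = trx_id, value
        self.older, self.deleted = older, deleted

def changes_visible(trx_id, creator, low, up, active):
    if trx_id == creator:
        return True            # our own transaction's changes
    if trx_id < up:
        return True            # completed before the read view
    if trx_id >= low:
        return False           # started after the read view
    return trx_id not in active  # ambiguous range: consult active list

def read_row(head, creator, low, up, active):
    v = head
    while v is not None:
        if changes_visible(v.trx_id, creator, low, up, active):
            return None if v.deleted else v.value
        v = v.older            # walk back through undo history
    return None                # row did not exist for this read view

# Chain: active trx 107 updated the row that committed trx 100 created.
old = Version(trx_id=100, value="balance=1000")
head = Version(trx_id=107, value="balance=500", older=old)

# Read view of trx 109: low=110, up=103, active={103, 107, 108}
# Head version (trx 107) is invisible, so the walk returns the old value.
assert read_row(head, 109, 110, 103, {103, 107, 108}) == "balance=1000"
# Trx 107 itself sees its own uncommitted update.
assert read_row(head, 107, 110, 103, {103, 107, 108}) == "balance=500"
```

Note the contrast with PostgreSQL's approach: instead of scanning multiple tuple versions in the heap, the reader starts at the latest record and reconstructs history backward until a visible version appears.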
Preventing Dirty Reads:
A dirty read occurs when a transaction reads uncommitted data from another transaction. MVCC prevents this because an uncommitted writer's transaction ID appears in the reader's active set (or lies beyond its xmax), so the writer's versions fail the visibility check and the reader falls back to the last committed version.
```
Scenario: Dirty Read Prevention with MVCC

Time  T1 (Txn 100)                        T2 (Txn 101)
────────────────────────────────────────────────────────────────
 1    BEGIN
 2    UPDATE accounts SET balance = 500
      WHERE id = 1
      (Creates version with xmin=100)
 3                                        BEGIN
                                          Snapshot: {active: [100], xmax: 102}
 4                                        SELECT balance FROM accounts WHERE id = 1
                                          Version xmin=100:
                                            - Is 100 in active set? YES
                                            - Therefore: NOT VISIBLE
                                          Previous version xmin=99 (committed):
                                            - Is 99 < snapshot.xmin? YES
                                            - Therefore: VISIBLE
                                          Result: Returns OLD balance (e.g., 1000)
                                          ✅ Dirty read prevented!
 5    ROLLBACK
      (T1 aborts - change discarded)
 6                                        SELECT balance FROM accounts WHERE id = 1
                                          Result: Still 1000 (correct - T1 rolled back)

Without MVCC (if T2 saw the uncommitted value):
  Step 4 would return 500 (WRONG!)
  Step 6 would see 1000 back - inconsistent!
```

Preventing Non-Repeatable Reads (at Repeatable Read):
A non-repeatable read occurs when a transaction reads the same row twice and gets different values. MVCC Repeatable Read prevents this by reusing the same snapshot for every read in the transaction, so versions committed by other transactions after the snapshot remain invisible.
```
Scenario: Non-Repeatable Read Prevention at REPEATABLE READ

Time  T1 (Txn 100)                              T2 (Txn 101)
────────────────────────────────────────────────────────────────
 1    BEGIN ISOLATION LEVEL REPEATABLE READ
      Snapshot taken: {active: [], xmax: 101}
 2    SELECT balance FROM accounts WHERE id = 1
      Visible version: balance = 1000
      (xmin=99, committed)
      Result: 1000
 3                                              BEGIN
 4                                              UPDATE accounts SET balance = 500
                                                WHERE id = 1
                                                (Creates new version xmin=101)
 5                                              COMMIT
                                                (Version xmin=101 now committed)
 6    SELECT balance FROM accounts WHERE id = 1
      New version xmin=101:
        - Is 101 >= snapshot.xmax (101)? YES
        - Therefore: NOT VISIBLE (started after our snapshot)
      Old version xmin=99:
        - Is 99 < snapshot.xmax? YES
        - Is 99 in active set? NO
        - Therefore: VISIBLE
      Result: 1000 (same as before!)
      ✅ Repeatable read achieved!
 7    COMMIT

Without MVCC snapshot:
  Step 6 would return 500 (different from step 2!) = Non-repeatable read
```

Note on Read Committed:
At Read Committed isolation, non-repeatable reads ARE allowed by design. MVCC implements this by taking a new snapshot at the start of each statement, so each statement sees whatever has committed since the previous statement ran.
This is not a bug—it's the defined behavior of Read Committed. It allows applications to see recent committed data at the cost of within-transaction consistency.
The same MVCC mechanism provides different guarantees based on snapshot timing. At Read Committed, each statement gets a fresh snapshot—repeatable reads are NOT guaranteed. At Repeatable Read, all statements use the transaction's initial snapshot. Always verify your application's isolation level expectations match the database configuration.
Phantom reads—where a query returns additional rows on re-execution due to concurrent inserts—represent a subtle challenge for MVCC. Standard snapshot isolation provides partial protection against phantoms.
How MVCC Helps with Phantoms:
With snapshot isolation, rows inserted and committed after the snapshot was taken carry transaction IDs at or beyond the snapshot's xmax, so they fail the visibility check and never appear when a query is re-executed.
This seems to prevent phantoms entirely, and for most practical purposes, it does. However, there's a subtle distinction.
```
Scenario: Phantom Prevention with Snapshot Isolation

Time  T1 (Txn 100)                              T2 (Txn 101)
────────────────────────────────────────────────────────────────
 1    BEGIN ISOLATION LEVEL REPEATABLE READ
      Snapshot: {active: [], xmax: 101}
 2    SELECT COUNT(*) FROM accounts
      WHERE balance > 1000
      Finds: Account #1 (balance 1500),
             Account #2 (balance 2000)
      Result: 2 rows
 3                                              BEGIN
 4                                              INSERT INTO accounts (id, balance)
                                                VALUES (3, 3000)
                                                (New row with xmin=101)
 5                                              COMMIT
 6    SELECT COUNT(*) FROM accounts
      WHERE balance > 1000
      Scanning all rows:
        - Account #1: xmin=99 < 101 → VISIBLE, balance 1500 ✓
        - Account #2: xmin=98 < 101 → VISIBLE, balance 2000 ✓
        - Account #3: xmin=101 >= snapshot.xmax → NOT VISIBLE ✗
      Result: 2 rows (same as before!)
      ✅ Phantom read prevented!
```

The Write Skew Exception:
While MVCC prevents phantom reads (same rows returned), it doesn't prevent all related anomalies. The write skew problem can occur when two transactions read overlapping data and then make writes based on those reads.
Consider a constraint: 'At least one doctor must be on call.' Two doctors, both on call, each check if removing themselves would leave at least one other doctor. Both see the other is on call, both remove themselves—and the constraint is violated.
```
Write Skew Anomaly (NOT Prevented by Standard Snapshot Isolation)

Constraint: At least one doctor must be on-call
Initial: doctor_1.on_call = true, doctor_2.on_call = true

Time  T1 (Doctor 1)                         T2 (Doctor 2)
────────────────────────────────────────────────────────────────────
 1    BEGIN REPEATABLE READ                 BEGIN REPEATABLE READ
      Snapshot: xmax=100                    Snapshot: xmax=100
 2    SELECT COUNT(*) FROM doctors
      WHERE on_call = true
      Result: 2 (safe to leave)
 3                                          SELECT COUNT(*) FROM doctors
                                            WHERE on_call = true
                                            Result: 2 (safe to leave)
 4    UPDATE doctors SET on_call = false
      WHERE id = 1
      (doctor_1 thinks doctor_2 will cover)
 5                                          UPDATE doctors SET on_call = false
                                            WHERE id = 2
                                            (doctor_2 thinks doctor_1 will cover)
 6    COMMIT ✓                              COMMIT ✓

Result: Both doctors removed from on-call! Constraint violated! 😱
  - No write-write conflict (different rows)
  - No phantom read (no new rows appeared)
  - But the combined effect violates the application invariant

Solution: Use the SERIALIZABLE isolation level, which would detect
this conflict and abort one transaction.
```

Write skew demonstrates that Snapshot Isolation (what most databases call REPEATABLE READ with MVCC) is NOT fully serializable. For applications requiring true serializability, use the SERIALIZABLE isolation level—which in PostgreSQL employs Serializable Snapshot Isolation (SSI) with additional conflict detection.
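The doctors scenario can be replayed as a toy simulation, with dictionaries standing in for committed rows and per-transaction snapshots (illustrative only; no real database involved):

```python
# Write skew under snapshot isolation: each "transaction" decides based
# on its private snapshot, then writes to a different row. There is no
# write-write conflict, yet the combined result breaks the invariant.

doctors = {1: True, 2: True}   # on_call flags, committed state

snap_t1 = dict(doctors)        # T1's snapshot at transaction start
snap_t2 = dict(doctors)        # T2's snapshot at transaction start

# T1: "two doctors are on call, safe for me to leave" -> update row 1
if sum(snap_t1.values()) >= 2:
    doctors[1] = False         # commits: touches only row 1

# T2: made the same check against its own (now stale) snapshot
if sum(snap_t2.values()) >= 2:
    doctors[2] = False         # commits: touches only row 2

# Both commits succeed on disjoint rows, so first-updater-wins rules
# never fire - and now nobody is on call.
assert sum(doctors.values()) == 0
print("invariant violated: no doctor on call")
```

A serializable implementation such as PostgreSQL's SSI would track the read-write dependency between the two transactions and abort one of them instead of letting both commit.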
Understanding MVCC read consistency empowers you to make better application design decisions. Here are practical patterns and anti-patterns.
Pattern 1: Long-Running Reports
MVCC shines for reporting transactions that scan large amounts of data without blocking OLTP operations. Use Repeatable Read for consistent reports that don't see partial updates from concurrent transactions.
```sql
-- Pattern: Consistent Reporting with MVCC

-- Start a reporting transaction
BEGIN ISOLATION LEVEL REPEATABLE READ;

-- First report query
SELECT department_id,
       SUM(salary) AS total_salary,
       COUNT(*) AS employee_count
FROM employees
GROUP BY department_id;

-- This could take minutes for large tables...

-- Meanwhile, other transactions are updating salaries,
-- hiring employees, etc. - all proceeding without waiting!

-- Second report query - sees SAME data as first
SELECT department_id,
       AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;

-- The totals from the first query will be consistent with the
-- averages from the second query - no phantom employees appeared

COMMIT;

-- Anti-pattern: Using Read Committed for reports
BEGIN;  -- Default is READ COMMITTED in PostgreSQL
SELECT department_id, SUM(salary) FROM employees GROUP BY department_id;
-- ... time passes, other transactions commit ...
SELECT department_id, SUM(salary) FROM employees GROUP BY department_id;
-- Could return different totals! Not suitable for consistent reports.
COMMIT;
```

Pattern 2: Read-Modify-Write with Consistency Checks
When you need to read data, apply business logic, and write based on the read values, consider the consistency requirements carefully.
```sql
-- Pattern: Read-Modify-Write requiring row-level atomicity

-- Option 1: Atomic UPDATE (best for simple cases)
UPDATE accounts SET balance = balance - 100
WHERE id = 1 AND balance >= 100;
-- Single statement is always atomic, no read-then-write gap

-- Option 2: SELECT FOR UPDATE (explicit row locking)
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- Returns 500; row is now locked against concurrent updates
-- Application logic: check balance, calculate, etc.
UPDATE accounts SET balance = 400 WHERE id = 1;
COMMIT;
-- Other transactions waited for our lock

-- Option 3: Optimistic concurrency with MVCC
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance, version FROM accounts WHERE id = 1;
-- Returns: balance=500, version=5
-- Application logic...
UPDATE accounts
SET balance = 400, version = 6
WHERE id = 1 AND version = 5;
-- Check: was the update applied?
-- If version changed, another transaction modified the row
--   - 1 row updated = success
--   - 0 rows updated = conflict, retry transaction
COMMIT;
```

For simple increments/decrements, atomic UPDATE is best. For complex business logic requiring multiple reads before a decision, SELECT FOR UPDATE provides certainty. For high-concurrency scenarios where conflicts are rare, optimistic locking with version columns minimizes blocking while handling conflicts gracefully.
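Option 3's conflict handling is usually wrapped in an application-side retry loop. Here is a minimal sketch, with an in-memory dict standing in for the accounts table and a hypothetical compare_and_update emulating the conditional UPDATE's rows-affected result:

```python
# Optimistic-concurrency retry loop (application-side sketch).

accounts = {1: {"balance": 500, "version": 5}}

def compare_and_update(acct_id, expected_version, new_balance):
    """Emulates: UPDATE ... WHERE id = ? AND version = ?  -> rows updated."""
    row = accounts[acct_id]
    if row["version"] != expected_version:
        return 0                          # another txn changed the row
    row["balance"] = new_balance
    row["version"] = expected_version + 1
    return 1

def withdraw(acct_id, amount, max_retries=3):
    for _ in range(max_retries):
        row = dict(accounts[acct_id])     # SELECT balance, version
        if row["balance"] < amount:
            return False                  # business rule: insufficient funds
        if compare_and_update(acct_id, row["version"],
                              row["balance"] - amount):
            return True                   # 1 row updated: success
        # 0 rows updated: conflict - loop and retry with fresh values
    raise RuntimeError("too many conflicts, giving up")

assert withdraw(1, 100) is True
assert accounts[1] == {"balance": 400, "version": 6}
```

The key design point is that a conflict is detected by the update count, not by blocking: no lock is held between the read and the write, so throughput stays high when conflicts are rare.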
MVCC read consistency transforms concurrent database access from a minefield of anomalies into a predictable, manageable system. The snapshot mechanism provides elegant solutions to problems that plagued lock-based systems.
Looking Ahead:
With read consistency understood, we'll next explore PostgreSQL's MVCC implementation in detail—examining how one of the most popular databases puts these concepts into practice with its heap-based versioning, vacuum mechanisms, and visibility map optimizations.
You now understand how MVCC achieves read consistency through snapshots and visibility algorithms. You can trace visibility decisions, understand isolation level differences, and apply these concepts in practical patterns. Next, we'll see these principles implemented in PostgreSQL.