Snapshot Isolation - Learning Module

Snapshot Isolation Implementation

From Theory to Reality

Understanding what a snapshot is conceptually is only half the battle. The real engineering challenge lies in implementing snapshot isolation efficiently at scale—handling millions of transactions, maintaining millions of versions, and making visibility decisions in microseconds.

This page takes you inside the implementation of snapshot isolation (SI), revealing the data structures, algorithms, and engineering trade-offs that power databases like PostgreSQL, Oracle, MySQL InnoDB, and SQL Server. We'll examine how snapshots are established, how visibility is determined, how write conflicts are detected, and how the system maintains correctness under extreme concurrency.

What You Will Learn

By the end of this page, you will understand: how transaction IDs and commit timestamps are assigned and used; how snapshot metadata is structured and stored; how visibility checks are performed; how first-committer-wins conflict detection works; and how different databases implement these mechanisms. You'll be able to reason about SI behavior at the implementation level.

Transaction ID Management

The foundation of any SI implementation is a reliable mechanism for identifying and ordering transactions. This is achieved through Transaction IDs (XIDs or TXIDs), which serve as unique identifiers that also encode ordering information.

Requirements for Transaction IDs:

Uniqueness: No two concurrent transactions can share the same ID
Monotonicity: IDs must increase over time (newer transactions get higher IDs)
Efficiency: ID assignment must be fast (cannot be a bottleneck)
Persistence: Committed transaction IDs must survive crashes
Comparability: Must be able to efficiently compare IDs for ordering

PostgreSQL's XID System:

PostgreSQL uses 32-bit unsigned integers as transaction IDs. This creates an interesting challenge: with a maximum of ~4 billion IDs, a busy database could exhaust the ID space. PostgreSQL handles this through XID wraparound protection:

XIDs are compared using modular arithmetic
An XID $A$ is "older" than XID $B$ if $0 < (B - A) \bmod 2^{32} < 2^{31}$
The system prevents any transaction from being more than ~2 billion XIDs old
VACUUM FREEZE converts old XIDs to a special "frozen" state

xid_comparison.c
/* PostgreSQL-style XID comparison using modular arithmetic */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
 
/* 
 * Check if xid1 occurred before xid2.
 * Uses wraparound-safe comparison.
 */
bool TransactionIdPrecedes(TransactionId xid1, TransactionId xid2) {
    /* 
     * xid1 is before xid2 if the difference (xid2 - xid1)
     * is less than half the XID space (2^31).
     * This handles wraparound correctly.
     */
    int32_t diff = (int32_t)(xid2 - xid1);
    return diff > 0;
}
 
/*
 * Check if xid1 occurred before or equals xid2.
 */
bool TransactionIdPrecedesOrEquals(TransactionId xid1, TransactionId xid2) {
    int32_t diff = (int32_t)(xid2 - xid1);
    return diff >= 0;
}
 
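/*
 * Assign the next transaction ID.
 * Simplified: this function takes XidGenLock itself, so callers must not
 * already hold it. PostgreSQL protects the counter with a lightweight
 * lock (LWLock) rather than a spinlock.
 */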
TransactionId GetNewTransactionId(void) {
    TransactionId xid;
    
    SpinLockAcquire(&XidGenLock);
    
    xid = ShmemVariableCache->nextXid;
    ShmemVariableCache->nextXid++;
    
    /* Handle wraparound by skipping special XIDs */
    if (ShmemVariableCache->nextXid < FirstNormalTransactionId)
        ShmemVariableCache->nextXid = FirstNormalTransactionId;
    
    SpinLockRelease(&XidGenLock);
    
    return xid;
}
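
To see why the signed cast matters, here is a minimal sketch that compares two XIDs on either side of the 32-bit wraparound point. The names `xid_t`, `xid_precedes`, and `xid_distance` are illustrative and are not PostgreSQL's own; the sketch also ignores the special reserved XIDs that a real system must treat separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;

/* True if a logically precedes b on the circular 32-bit XID space. */
static bool xid_precedes(xid_t a, xid_t b) {
    /* Unsigned subtraction wraps modulo 2^32. Reinterpreting the result as
     * signed (two's complement assumed) gives a negative number exactly
     * when a is behind b by less than 2^31. */
    return (int32_t)(a - b) < 0;
}

/* Number of XIDs assigned between older and newer. */
static uint32_t xid_distance(xid_t older, xid_t newer) {
    assert(older == newer || xid_precedes(older, newer));
    return newer - older;   /* wraps correctly across the 2^32 boundary */
}
```

With these definitions, XID 4294967290 precedes XID 5 even though it is numerically larger, because only 11 XIDs separate them once the counter wraps. A plain `<` comparison would get this case backwards.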

Oracle's SCN System:

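Oracle uses System Change Numbers (SCNs) instead of simple transaction IDs. An SCN is a monotonically increasing logical timestamp that advances as changes commit. It is not derived from the clock, although Oracle maintains a mapping between SCNs and wall-clock time. Traditionally an SCN is a 48-bit value made of two parts:

A 32-bit base that is incremented as changes commit
A 16-bit wrap counter that is incremented each time the base overflows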

Advantages of SCNs:

Larger space (281 trillion values) eliminates wraparound concerns
Can be correlated with real time for flashback queries
Distributed-system friendly (nodes can coordinate SCN ranges)

MySQL InnoDB's Transaction System:

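InnoDB uses 48-bit (6-byte) transaction IDs. Each clustered index record stores the ID of the last transaction that modified it in a hidden DB_TRX_ID field, alongside a DB_ROLL_PTR field that points to the corresponding undo record. It maintains: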

A global transaction ID counter in shared memory
Per-transaction structures linking to undo logs
A read view (snapshot) structure for each reading transaction

ID Assignment Is a Bottleneck

Transaction ID assignment is a potential serialization point—all transactions must obtain unique IDs from a shared counter. Modern databases minimize this with techniques like batching (claim IDs in groups), sharded counters (different cores claim from different ranges), or lazy assignment (only assign IDs when needed for writes).
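
The batching idea can be sketched with a C11 atomic counter. This is an illustration under assumed names (`GlobalIdCounter`, `LocalIdCache`, `next_txn_id`, `ID_BATCH_SIZE`), not the scheme of any particular database: each thread claims a block of IDs with one atomic operation and then hands them out locally.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define ID_BATCH_SIZE 64u   /* illustrative batch size, not a tuning recommendation */

/* Shared counter: the only point of cross-thread contention. */
typedef struct {
    _Atomic uint64_t next_id;
} GlobalIdCounter;

/* Per-thread cache of a claimed ID range [next, limit). */
typedef struct {
    uint64_t next;
    uint64_t limit;
} LocalIdCache;

/* Return a unique ID, touching the shared counter once per batch. */
static uint64_t next_txn_id(GlobalIdCounter *global, LocalIdCache *local) {
    if (local->next == local->limit) {
        /* Cache exhausted: atomically claim the next block of IDs. */
        local->next = atomic_fetch_add(&global->next_id, ID_BATCH_SIZE);
        local->limit = local->next + ID_BATCH_SIZE;
    }
    assert(local->next < local->limit);
    return local->next++;
}
```

The trade-off is that IDs stay unique but no longer reflect global start order: a transaction served from one cache can receive a lower ID than a transaction that started earlier on another. A system that relies on ID order for visibility would need an additional ordering mechanism, such as separate commit timestamps.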

Snapshot Data Structures

When a transaction establishes its snapshot, it must capture enough information to determine visibility of any data version it might later read. This requires carefully designed data structures that are compact (memory-efficient) yet complete (can answer any visibility question).

PostgreSQL's SnapshotData Structure:

PostgreSQL's snapshot contains three critical pieces of information:

xmin: The lowest XID that was active when the snapshot was taken. All XIDs < xmin are definitely committed (or aborted) and can use fast-path visibility checks.
xmax: The first XID that had not yet been assigned when the snapshot was taken. All XIDs >= xmax are definitely invisible (started after our snapshot).
xip[] (XID In Progress): An array of XIDs that were active at snapshot time, where xmin <= XID < xmax. These transactions' changes are invisible even if they commit later.

snapshot_structure.c
/* PostgreSQL SnapshotData structure (simplified) */
typedef struct SnapshotData {
    TransactionId xmin;      /* All XIDs < xmin are visible if committed */
    TransactionId xmax;      /* All XIDs >= xmax are invisible */
    
    TransactionId *xip;      /* Array of in-progress XIDs */
    uint32 xcnt;             /* Count of in-progress XIDs */
    
    /* For subtransaction handling */
    TransactionId *subxip;   /* In-progress subtransaction XIDs */
    int32 subxcnt;           /* Count of subtransaction XIDs */
    bool suboverflowed;      /* Did subxip array overflow? */
    
    /* Snapshot identity for caching */
    uint32 snapshot_id;      /* Unique identifier for this snapshot */
    
    /* Timestamp for MVCC garbage collection */
    TimestampTz whenTaken;   /* Wall-clock time snapshot was taken */
    
    /* Command counter for statement-level visibility */
    CommandId curcid;        /* Current command ID in owning transaction */
} SnapshotData;
 
/*
 * Take a snapshot of the current transaction state
 */
Snapshot GetTransactionSnapshot(void) {
    Snapshot snapshot = GetSnapshotData();
    
    /*
     * Capture current state:
     * 1. Lock the proc array briefly
     * 2. Record the next-to-be-assigned XID as xmax
     * 3. Scan all running transactions, record their XIDs
     * 4. Find the minimum running XID as xmin
     * 5. Copy running XID list to xip array
     * 6. Release proc array lock
     */
    
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    
    snapshot->xmax = ShmemVariableCache->nextXid;
    snapshot->xmin = snapshot->xmax;  /* Will be lowered below */
    snapshot->xcnt = 0;
    
    for (int i = 0; i < ProcGlobal->allProcCount; i++) {
        PGPROC *proc = &allProcs[i];
        TransactionId xid = proc->xid;
        
        if (!TransactionIdIsValid(xid))
            continue;  /* Not running a transaction */
        
        if (TransactionIdPrecedes(xid, snapshot->xmin))
            snapshot->xmin = xid;
        
        if (TransactionIdPrecedes(xid, snapshot->xmax)) {
            /* This XID is in our [xmin, xmax) range; add to xip */
            snapshot->xip[snapshot->xcnt++] = xid;
        }
    }
    
    LWLockRelease(ProcArrayLock);
    
    return snapshot;
}

Visibility Check Algorithm:

With the snapshot structure in place, visibility checks follow this algorithm:

IS_VISIBLE(tuple_xid, snapshot):
    if tuple_xid < snapshot.xmin:
        // Finished before our snapshot, so the xip search can be skipped.
        // It may still have aborted, so its commit status decides.
        return CHECK_CLOG(tuple_xid)
    
    if tuple_xid >= snapshot.xmax:
        return INVISIBLE  // Started after our snapshot
    
    if tuple_xid IN snapshot.xip:
        return INVISIBLE  // Was running when we took snapshot
    
    // In range [xmin, xmax) but not in xip: finished before the snapshot
    // Visible only if it committed
    return CHECK_CLOG(tuple_xid)
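
The same decision procedure can be written as a small C function. This is a sketch, not PostgreSQL's code: `ToySnapshot`, `XidStatus`, and `xid_visible` are assumed names, and the commit-status lookup is modelled as a plain array indexed by XID. An XID below xmin has finished but may have aborted, so the sketch consults the commit status for it as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;

typedef enum { XID_IN_PROGRESS, XID_COMMITTED, XID_ABORTED } XidStatus;

typedef struct {
    xid_t xmin;         /* every XID before this had finished */
    xid_t xmax;         /* first XID not yet assigned at snapshot time */
    const xid_t *xip;   /* XIDs in [xmin, xmax) that were still running */
    size_t xcnt;        /* number of entries in xip */
} ToySnapshot;

/* Wraparound-safe "a precedes b". */
static bool xid_precedes(xid_t a, xid_t b) {
    return (int32_t)(a - b) < 0;
}

/* True if changes made by xid are visible to the snapshot.
 * status[] is a toy stand-in for the commit log, indexed by XID. */
static bool xid_visible(xid_t xid, const ToySnapshot *snap,
                        const XidStatus *status, size_t nstatus) {
    assert(xid < nstatus);
    if (!xid_precedes(xid, snap->xmax))
        return false;                      /* started after the snapshot */
    if (!xid_precedes(xid, snap->xmin)) {
        for (size_t i = 0; i < snap->xcnt; i++)
            if (snap->xip[i] == xid)
                return false;              /* was running at snapshot time */
    }
    /* Finished before the snapshot: visible only if it committed. */
    return status[xid] == XID_COMMITTED;
}
```

The linear scan over `xip` is fine for a handful of concurrent transactions. With many, a sorted array and binary search would be a natural refinement.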

The CLOG (Commit Log) stores the commit status of each transaction: committed, aborted, or in-progress. It's a compact bitmap with 2 bits per transaction ID.
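
Packing two bits per transaction means four statuses fit in each byte. The sketch below shows the indexing arithmetic; the function names `clog_set` and `clog_get` are illustrative, and the flat in-memory byte array stands in for what is in practice a paged, disk-backed structure.

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_IN_PROGRESS 0x0u
#define STATUS_COMMITTED   0x1u
#define STATUS_ABORTED     0x2u

#define BITS_PER_XACT  2u
#define XACTS_PER_BYTE 4u

/* Store a 2-bit status for xid; each byte holds four statuses. */
static void clog_set(uint8_t *clog, uint32_t xid, uint8_t status) {
    assert(status <= 0x3u);
    uint32_t byte  = xid / XACTS_PER_BYTE;
    uint32_t shift = (xid % XACTS_PER_BYTE) * BITS_PER_XACT;
    /* Clear the old two bits, then OR in the new status. */
    clog[byte] = (uint8_t)((clog[byte] & ~(0x3u << shift)) |
                           ((uint32_t)status << shift));
}

/* Read the 2-bit status for xid. */
static uint8_t clog_get(const uint8_t *clog, uint32_t xid) {
    uint32_t byte  = xid / XACTS_PER_BYTE;
    uint32_t shift = (xid % XACTS_PER_BYTE) * BITS_PER_XACT;
    return (uint8_t)((clog[byte] >> shift) & 0x3u);
}
```

At this density an 8192-byte page holds the status of 32,768 transactions, which is why status lookups for recent transactions usually hit a cached page.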

Memory Considerations

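The xip[] array grows with the number of concurrent transactions, so every snapshot becomes more expensive to build and to search. Subtransaction XIDs are tracked separately in subxip[]. PostgreSQL caches only a bounded number of subtransaction XIDs per backend (64 in current releases); when a transaction exceeds that bound, snapshots are marked as suboverflowed and visibility checks fall back to a slower lookup in pg_subtrans. Keeping concurrent transaction counts and subtransaction use moderate keeps snapshots cheap.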

Version Storage Strategies

MVCC requires storing multiple versions of data. Different databases take fundamentally different approaches to how and where these versions are stored, each with distinct performance characteristics.

Strategy 1: Append-Only / Tuple Versioning (PostgreSQL)

PostgreSQL stores all versions directly in the main table (heap). When a row is updated:

The old tuple is marked as "dead" (xmax set to updating transaction)
A new tuple is created with the new values
Both tuples exist in the table until VACUUM removes the old one
Indexes point to all tuple versions

Advantages:

Updates are fast (just insert new version)
No separate undo log to manage
Simple crash recovery

Disadvantages:

Table bloat (dead tuples accumulate)
Index bloat (entries for all versions)
VACUUM overhead required
Full row copy even for small updates

postgresql_heap_tuple.c
/* PostgreSQL HeapTupleHeader structure (simplified) */
typedef struct HeapTupleHeaderData {
    /* Transaction information */
    TransactionId t_xmin;     /* XID that created this tuple */
    TransactionId t_xmax;     /* XID that deleted/updated this tuple, or 0 */
    
    /* Command ID within the transaction */
    CommandId t_cid;          /* Command ID for intra-transaction visibility */
    
    /* Link to the newer version of this row, if one exists */
    ItemPointerData t_ctid;   /* Own tuple ID, or pointer to the newer version */
    
    /* Infomask bits for status flags */
    uint16 t_infomask;        /* Flags: committed, invalid, etc. */
    uint16 t_infomask2;       /* More flags + number of attributes */
    
    /* Actual data follows... */
} HeapTupleHeaderData;
 
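/*
 * Example visibility check for a heap tuple (simplified: ignores changes
 * made by the reading transaction itself and hint-bit caching).
 *
 * XidVisibleInSnapshot() is a hypothetical helper. It returns true only if
 * the XID committed and that commit is visible to the snapshot, i.e. the
 * IS_VISIBLE logic described earlier. It returns false for aborted or
 * in-progress XIDs. Note that PostgreSQL's real XidInMVCCSnapshot() has
 * the opposite sense: it returns true when the XID is still in progress
 * as far as the snapshot is concerned.
 */
bool HeapTupleSatisfiesMVCC(HeapTuple tuple, Snapshot snapshot) {
    TransactionId xmin = HeapTupleHeaderGetXmin(tuple->t_data);
    TransactionId xmax = HeapTupleHeaderGetXmax(tuple->t_data);
    
    /* Check if creating transaction is visible */
    if (!XidVisibleInSnapshot(xmin, snapshot)) {
        /* xmin not visible = tuple not yet created from snapshot's view */
        return false;
    }
    
    /* Check if tuple has been deleted/updated */
    if (!TransactionIdIsValid(xmax)) {
        /* Not deleted - visible because xmin is visible */
        return true;
    }
    
    /* Tuple was deleted - check if deletion is visible */
    if (XidVisibleInSnapshot(xmax, snapshot)) {
        /* Deletion is visible - tuple is dead */
        return false;
    }
    
    /* Deletion aborted or not visible - tuple is still alive from our view */
    return true;
}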

Strategy 2: Undo Log Versioning (Oracle, MySQL InnoDB)

Oracle and InnoDB store only the current version in the main table. Previous versions are reconstructed from undo logs (rollback segments):

Main table always contains the latest committed (or in-progress) version
Before modifying a row, the old values are written to undo log
To read an old version, start from current and apply undo records backward
Undo logs are automatically purged when no longer needed
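
Reading an old version from an undo chain can be sketched as follows. The structures (`Row`, `UndoRecord`) and the visibility rule are simplifying assumptions: a version is treated as visible when its writer's ID is at most the reader's snapshot ID, whereas real engines use read views or SCNs and must also account for uncommitted writers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One undo record: the value a row held before it was overwritten. */
typedef struct UndoRecord {
    int64_t old_value;               /* pre-image of the row */
    uint64_t old_writer;             /* transaction that wrote the pre-image */
    const struct UndoRecord *prev;   /* next older undo record, or NULL */
} UndoRecord;

typedef struct {
    int64_t value;            /* current value in the main table */
    uint64_t writer;          /* transaction that wrote the current value */
    const UndoRecord *undo;   /* head of the undo chain */
} Row;

/* Reconstruct the row as of snapshot_id by walking undo records backward.
 * Returns false if no visible version can be rebuilt. */
static bool read_as_of(const Row *row, uint64_t snapshot_id, int64_t *out) {
    assert(row != NULL && out != NULL);
    int64_t value = row->value;
    uint64_t writer = row->writer;
    const UndoRecord *u = row->undo;
    while (writer > snapshot_id) {
        if (u == NULL)
            return false;   /* undo already purged, or row did not exist yet */
        value = u->old_value;
        writer = u->old_writer;
        u = u->prev;
    }
    *out = value;
    return true;
}
```

The `false` return corresponds to the situation behind errors such as ORA-01555: the reader needs a version older than anything the undo chain still holds. The loop also shows where the CPU cost comes from, since an old snapshot reading a frequently updated row must walk many undo records.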

Advantages:

Main table stays compact (only current data)
Indexes point only to current tuples
No separate vacuum process needed
Space-efficient for small updates

Disadvantages:

Version reconstruction has CPU cost
Long transactions hold undo space
Complex undo log management
"ORA-01555: Snapshot too old" if undo is recycled

PostgreSQL: Append-Only

•All versions in main table
•Updates create new tuples
•xmin/xmax in tuple header
•VACUUM removes dead tuples
•Table/index bloat possible
•Fast writes, potential read overhead

Oracle/InnoDB: Undo Log

•Current version in main table
•Updates modify in place + undo
•SCN/XID tracks versions
•Automatic undo purge
•Main table stays compact
•Fast reads, version reconstruction cost

Strategy 3: Delta Storage

Some systems (like SQL Server's optimized version store) store deltas—only the changed columns—instead of full row copies. This reduces storage for wide tables with small updates. The trade-off is more complex version reconstruction.

Write Conflict Detection: First-Committer-Wins

While snapshots elegantly handle read operations, writes require additional conflict detection to prevent lost updates. Snapshot isolation implements the First-Committer-Wins (FCW) rule:

The Problem:

Two transactions T1 and T2 both read the same row, make decisions based on that read, and attempt to write. Without conflict detection, one write could overwrite the other—a lost update.

The FCW Solution:

When a transaction attempts to write a row, it checks if the row has been modified by any transaction that committed after the writing transaction's snapshot was taken. If so, the write is rejected (the transaction must abort).

Formal Definition:

Transaction $T_i$ can write data item $x$ only if no other transaction $T_j$ has modified and committed $x$ where:

$T_j$'s commit timestamp > $T_i$'s snapshot timestamp
$T_j \neq T_i$

fcw_conflict_check.pseudo
FUNCTION attempt_write(transaction T, row R, new_value V):
    /* Step 1: Acquire write lock on row (blocking) */
    acquire_exclusive_lock(R)
    
    /* Step 2: Check current row version */
    current_version = get_current_version(R)
    
    /* Step 3: Conflict detection */
    IF current_version.xmax != 0:
        /* Row was updated or deleted by another transaction */
        other_xid = current_version.xmax
        
        IF other_xid is still running:
            /* Wait for other transaction to complete */
            wait_for_commit_or_abort(other_xid)
        
        IF was_committed(other_xid):
            IF commit_timestamp(other_xid) > T.snapshot_timestamp:
                /* CONFLICT: Other transaction committed after our snapshot */
                release_exclusive_lock(R)
                ABORT T with "serialization failure"
    
    /* Step 4: Check if row was created after our snapshot */
    IF current_version.xmin is still running:
        wait_for_commit_or_abort(current_version.xmin)
    
    IF was_committed(current_version.xmin):
        IF commit_timestamp(current_version.xmin) > T.snapshot_timestamp:
            /* Row didn't exist in our snapshot - cannot update */
            release_exclusive_lock(R)
            ABORT T with "serialization failure"
    
    /* Step 5: No conflict - perform the write */
    create_new_version(R, V, T.xid)
    
    release_exclusive_lock(R)
    RETURN success

Conflict Detection Scenarios:

Scenario 1: No Conflict

T1 begins (snapshot at 100), reads row X
T1 writes X
T1 commits at 200
-- T1 succeeds: no concurrent modifier

Scenario 2: Concurrent Write, First Committer Wins

T1 begins (snapshot at 100), reads row X
T2 begins (snapshot at 110), reads row X
T1 writes X, commits at 200
T2 attempts to write X
-- T2 finds X was modified after its snapshot (by T1 at 200 > 110)
-- T2 ABORTS with serialization failure

Scenario 3: Non-Conflicting Concurrent Writes

T1 begins (snapshot at 100), reads row X
T2 begins (snapshot at 110), reads row Y
T1 writes X, commits
T2 writes Y, commits
-- Both succeed: writing different rows
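
The core timestamp comparison behind these scenarios fits in a few lines of C. This sketch deliberately omits row locking and waiting for in-progress writers, and the names (`RowMeta`, `Txn`, `try_write`, `commit_write`) are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t last_commit_ts;   /* commit timestamp of the latest committed write */
} RowMeta;

typedef struct {
    uint64_t snapshot_ts;      /* timestamp at which the snapshot was taken */
    bool aborted;
} Txn;

/* First-committer-wins check: the write is allowed only if nobody
 * committed a change to the row after t's snapshot was taken. */
static bool try_write(Txn *t, const RowMeta *row) {
    assert(!t->aborted);
    if (row->last_commit_ts > t->snapshot_ts) {
        t->aborted = true;     /* serialization failure */
        return false;
    }
    return true;
}

/* Record a successful commit of t's write to the row. */
static void commit_write(const Txn *t, RowMeta *row, uint64_t commit_ts) {
    assert(!t->aborted && commit_ts > t->snapshot_ts);
    row->last_commit_ts = commit_ts;
}
```

Replaying Scenario 2 with this sketch: T1 (snapshot 100) passes the check and commits at 200, after which T2 (snapshot 110) fails the check because 200 > 110. A transaction that starts later, with a snapshot above 200, can write the row again.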

FCW Is Not Enough for Serializability

First-Committer-Wins prevents lost updates but does NOT prevent write skew anomalies. T1 reading X and writing Y, while T2 reads Y and writes X, can both commit under FCW—potentially violating constraints that span X and Y. We'll cover this limitation in the Write Skew page.

Commit Processing

When a transaction under snapshot isolation commits, several things must happen atomically to ensure durability and visibility consistency.

Commit Steps:

1. Final Conflict Check

Before committing, perform a final verification that no conflicts emerged during the transaction (some systems do this incrementally during writes, others do a final pass).

2. Assign Commit Timestamp

Obtain a commit timestamp from the global counter. This timestamp determines when this transaction's changes become visible to future snapshots.

3. Write Commit Record to Log (WAL)

Write a commit record to the write-ahead log. This must be flushed to stable storage before the commit can be acknowledged.

4. Update Transaction Status

Mark the transaction as committed in the commit log (CLOG in PostgreSQL, undo header in Oracle/InnoDB). This makes the transaction's status quickly queryable.

5. Make Changes Visible

After commit, the transaction's modifications are visible to any snapshot taken after the commit timestamp.

commit_protocol.pseudo
FUNCTION commit_transaction(Transaction T):
    /* Step 1: Ensure all writes have been performed */
    IF T.pending_writes NOT EMPTY:
        ERROR "Cannot commit with unflushed writes"
    
    /* Step 2: Acquire commit lock (brief serialization point) */
    acquire_commit_lock()
    
    /* Step 3: Assign commit timestamp */
    T.commit_timestamp = get_next_commit_timestamp()
    
    /* Step 4: Create WAL commit record */
    commit_record = {
        type: XLOG_XACT_COMMIT,
        xid: T.xid,
        timestamp: T.commit_timestamp,
        database: T.database_id,
        subxacts: T.subtransaction_list
    }
    
    /* Step 5: Write and flush WAL */
    lsn = write_wal_record(commit_record)
    flush_wal_to_disk(lsn)  /* CRITICAL: Must sync before acknowledge */
    
    /* Step 6: Update CLOG (commit log) */
    set_transaction_status(T.xid, TRANSACTION_STATUS_COMMITTED)
    
    /* For subtransactions */
    FOR EACH subxid IN T.subtransaction_list:
        set_transaction_status(subxid, TRANSACTION_STATUS_COMMITTED)
    
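    /* Step 7: Update shared memory so that new snapshots see the commit.
       This happens before locks are released, so a waiter that wakes up
       on a released lock never finds T still listed as running. */
    remove_from_running_transactions(T.xid)
    
    /* Step 8: Release commit lock */
    release_commit_lock()
    
    /* Step 9: Release locks held by transaction */
    release_all_locks(T)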
    
    RETURN T.commit_timestamp

PostgreSQL's Commit Sequence:

PostgreSQL's actual commit process involves:

RecordTransactionCommit(): Writes the commit record to WAL
XactLogCommitRecord(): Logs detailed commit information
SyncRepWaitForLSN(): Waits for synchronous replication if configured
TransactionIdCommitTree(): Marks XID as committed in CLOG
ProcArrayEndTransaction(): Removes XID from running transaction array
ResourceOwnerRelease(): Releases resources (locks, files, memory)

Visibility Latency:

There's a brief window between WAL flush and ProcArray update where:

The transaction is durably committed
But other transactions may not yet see it as committed

This is handled by careful ordering and the visibility rules checking both CLOG status and the running transaction array.

Synchronous vs Asynchronous Commit

PostgreSQL exposes this trade-off through the synchronous_commit setting. Setting it to off lets a commit return without waiting for the WAL flush, which makes commits much faster but means a crash can lose the most recently acknowledged transactions; the risk window is at most 3× wal_writer_delay. The visibility semantics remain the same: only durability is affected.

Garbage Collection: Cleaning Up Old Versions

MVCC creates multiple versions, but not all versions need to be kept forever. Once no active snapshot can possibly need an old version, it can be safely removed. This is the job of garbage collection (called VACUUM in PostgreSQL, undo purge in Oracle/InnoDB).

The Horizon Concept:

The oldest active snapshot determines the garbage collection horizon. Any version that was superseded before this horizon is safe to remove because no current or future transaction can see it.

$$\text{GC Horizon} = \min(\text{xmin of all active snapshots})$$

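A version can be reclaimed once the transaction that deleted or replaced it has committed and that transaction's XID precedes the horizon. Every active and future snapshot then sees the deletion, so none of them can still see the old version.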
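
The horizon computation and the reclaim test can be sketched in C as follows. The function names are assumptions, XID 0 is used to mean "never deleted", and the comparison is the wraparound-safe one used for 32-bit XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;

/* Wraparound-safe "a precedes b". */
static bool xid_precedes(xid_t a, xid_t b) {
    return (int32_t)(a - b) < 0;
}

/* Horizon = oldest xmin among active snapshots, or next_xid if none exist. */
static xid_t gc_horizon(const xid_t *snapshot_xmins, size_t n, xid_t next_xid) {
    assert(snapshot_xmins != NULL || n == 0);
    xid_t horizon = next_xid;
    for (size_t i = 0; i < n; i++)
        if (xid_precedes(snapshot_xmins[i], horizon))
            horizon = snapshot_xmins[i];
    return horizon;
}

/* A version is reclaimable once its deleter committed before the horizon. */
static bool reclaimable(xid_t tuple_xmax, bool xmax_committed, xid_t horizon) {
    if (tuple_xmax == 0)
        return false;   /* never deleted or replaced */
    return xmax_committed && xid_precedes(tuple_xmax, horizon);
}
```

With active snapshots whose xmin values are 120, 95, and 140, the horizon is 95. A version deleted by committed transaction 90 can be removed, while one deleted by transaction 97 must stay because the oldest snapshot may still need it.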

PostgreSQL VACUUM:

VACUUM in PostgreSQL performs several tasks:

Remove dead tuples: Tuples whose xmax indicates deletion/update by a committed transaction older than all active snapshots
Freeze old tuples: Convert very old XIDs to FrozenTransactionId to prevent wraparound
Update visibility map: Track pages where all tuples are visible to all transactions
Update free space map: Track available space for new tuples
Truncate trailing empty pages: Reclaim disk space from the end of tables

Autovacuum runs automatically based on table modification statistics, but heavily-updated tables may need manual tuning.

vacuum_logic.pseudo
FUNCTION vacuum_table(table):
    /* Get the oldest active snapshot horizon */
    gc_horizon = get_oldest_xmin_across_all_snapshots()
    
    /* Scan each page of the table */
    FOR page IN table.pages:
        dead_tuples = []
        
        FOR tuple IN page.tuples:
            /* Check if tuple is dead and reclaimable */
            IF tuple.xmax IS VALID AND tuple.xmax < gc_horizon:
                IF was_committed(tuple.xmax):
                    /* This tuple was deleted/updated and is no longer needed */
                    dead_tuples.append(tuple)
            
            /* Check for XID freeze requirement */
            IF tuple.xmin < freeze_threshold:
                IF was_committed(tuple.xmin):
                    tuple.xmin = FrozenTransactionId  /* Prevent wraparound */
        
        /* Remove dead tuples and update free space */
        IF dead_tuples NOT EMPTY:
            compact_page(page, dead_tuples)
            update_free_space_map(page)
        
        /* Update visibility map if all remaining tuples are visible */
        IF all_tuples_visible_to_all(page):
            set_visibility_map_bit(page)
    
    /* Update table statistics */
    update_pg_class_statistics(table)

Long Transactions Block Garbage Collection

If a transaction holds a very old snapshot (e.g., a long-running report), it pins the GC horizon. No version that was deleted or replaced at or after that snapshot's xmin can be removed, even if the row has since been updated a thousand times. This is why long-running transactions cause table bloat in PostgreSQL. In Oracle, where undo space can be recycled regardless, the same long-running query instead risks 'ORA-01555: Snapshot too old'.

Database-Specific Implementations

Each major database implements snapshot isolation with unique engineering choices. Understanding these differences helps when designing applications or troubleshooting performance issues.

Snapshot Isolation Implementation Comparison
Aspect	PostgreSQL	Oracle	MySQL InnoDB	SQL Server
Version Storage	In-heap (append-only)	Undo tablespace	Undo log (rollback segment)	Version store (tempdb)
Version ID	32-bit XID	48-bit SCN	48-bit Transaction ID	14-byte row version
GC Mechanism	VACUUM process	Automatic undo purge	Purge thread	Version cleaner
Snapshot Capture	ProcArray scan + CLOG	SCN at statement/tx start	Read View creation	Version chain traversal
Conflict Detection	xmax check	Interested Transaction List (ITL)	Lock wait + trx comparison	Update conflict on commit
Isolation Level	REPEATABLE READ	READ COMMITTED (default)	REPEATABLE READ (default)	SNAPSHOT (explicit)
Write Skew	Allowed	Allowed	Allowed	Allowed (SNAPSHOT level)

PostgreSQL Specifics:

Uses tuple-level xmin/xmax headers for visibility
REPEATABLE READ is implemented as snapshot isolation
SERIALIZABLE adds SSI (Serializable Snapshot Isolation) checking
HOT updates allow same-page updates to avoid index maintenance
Index-only scans require visibility map confirmation

Oracle Specifics:

READ COMMITTED actually uses statement-level snapshots
SERIALIZABLE uses transaction-level snapshots (closer to SI)
Consistent reads reconstruct from undo automatically
ORA-01555 occurs when undo is overwritten before snapshot completes
ITL (Interested Transaction List) tracks row-level lock holders

MySQL InnoDB Specifics:

REPEATABLE READ is the default and uses snapshots
Read View captures active transactions at snapshot time
Secondary indexes may need version visibility checks
Purge lag can be monitored with SHOW ENGINE INNODB STATUS
Gap locking prevents phantoms (beyond standard SI guarantees)

Matching Isolation to Requirements

When designing applications, consider: Are you okay with READ COMMITTED's statement-level snapshots? Do you need transaction-level consistency (REPEATABLE READ/SI)? Must you prevent write skew (SERIALIZABLE)? Each level has different implementation costs and anomaly prevention.

Summary: The Mechanics of Snapshot Isolation

Implementing snapshot isolation requires careful coordination of transaction identifiers, version management, visibility rules, conflict detection, and garbage collection. Understanding these mechanisms is essential for effective database tuning and application design.

Key Takeaways

•Transaction IDs enable ordering — Monotonically increasing XIDs/SCNs provide a total order for determining visibility and conflict detection
•Snapshots capture active transaction state — The xmin, xmax, and in-progress list enable efficient visibility determination
•Version storage strategies differ — PostgreSQL's append-only vs Oracle/InnoDB's undo log approach have different performance trade-offs
•First-Committer-Wins prevents lost updates — Write conflict detection ensures only one concurrent modifier can commit its changes to a row
•Commit processing is carefully ordered — WAL flush, CLOG update, and ProcArray updates must happen in sequence for durability and correctness
•Garbage collection requires horizon tracking — The oldest active snapshot determines which old versions can be safely removed
•Implementation details matter — Database-specific choices affect performance characteristics, tuning options, and application design

What's Next:

The next page explores the Write Skew Anomaly—the signature problem that snapshot isolation cannot prevent. Understanding write skew is critical for designing correct applications and knowing when stronger isolation levels are needed.

Page Complete

You now understand how snapshot isolation is implemented in real database systems—from transaction ID assignment through version management, visibility checking, conflict detection, commit processing, and garbage collection. This implementation knowledge enables you to reason about database behavior and optimize application designs.

Snapshot Isolation Implementation

From Theory to Reality

What You Will Learn

Transaction ID Management

Requirements for Transaction IDs:

Uniqueness: No two concurrent transactions can share the same ID
Monotonicity: IDs must increase over time (newer transactions get higher IDs)
Efficiency: ID assignment must be fast (cannot be a bottleneck)
Persistence: Committed transaction IDs must survive crashes
Comparability: Must be able to efficiently compare IDs for ordering

PostgreSQL's XID System:

XIDs are compared using modular arithmetic
An XID $A$ is "older" than XID $B$ if $(B - A) < 2^{31}$
The system prevents any transaction from being more than ~2 billion XIDs old
VACUUM FREEZE converts old XIDs to a special "frozen" state

xid_comparison.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
/* PostgreSQL-style XID comparison using modular arithmetic */
typedef uint32_t TransactionId;
 
/* 
 * Check if xid1 occurred before xid2.
 * Uses wraparound-safe comparison.
 */
bool TransactionIdPrecedes(TransactionId xid1, TransactionId xid2) {
    /* 
     * xid1 is before xid2 if the difference (xid2 - xid1)
     * is less than half the XID space (2^31).
     * This handles wraparound correctly.
     */
    int32_t diff = (int32_t)(xid2 - xid1);
    return diff > 0;
}
 
/*
 * Check if xid1 occurred before or equals xid2.
 */
bool TransactionIdPrecedesOrEquals(TransactionId xid1, TransactionId xid2) {
    int32_t diff = (int32_t)(xid2 - xid1);
    return diff >= 0;
}
 
/*
 * Assign the next transaction ID (must hold XID lock)
 */
TransactionId GetNewTransactionId(void) {
    TransactionId xid;
    
    SpinLockAcquire(&XidGenLock);
    
    xid = ShmemVariableCache->nextXid;
    ShmemVariableCache->nextXid++;
    
    /* Handle wraparound by skipping special XIDs */
    if (ShmemVariableCache->nextXid < FirstNormalTransactionId)
        ShmemVariableCache->nextXid = FirstNormalTransactionId;
    
    SpinLockRelease(&XidGenLock);
    
    return xid;
}

Oracle's SCN System:

Oracle uses System Change Numbers (SCNs) instead of simple transaction IDs. SCNs are 48-bit values that combine:

A time-based component (loosely correlated with wall-clock time)
A sequence component (for ordering within the same time period)

Advantages of SCNs:

Larger space (281 trillion values) eliminates wraparound concerns
Can be correlated with real time for flashback queries
Distributed-system friendly (nodes can coordinate SCN ranges)

MySQL InnoDB's Transaction System:

InnoDB uses 48-bit transaction IDs stored in the transaction header. It maintains:

A global transaction ID counter in shared memory
Per-transaction structures linking to undo logs
A read view (snapshot) structure for each reading transaction

ID Assignment Is a Bottleneck

Snapshot Data Structures

PostgreSQL's SnapshotData Structure:

PostgreSQL's snapshot contains three critical pieces of information:

xmin: The lowest XID that was active when the snapshot was taken. All XIDs < xmin are definitely committed (or aborted) and can use fast-path visibility checks.
xmax: The first XID that had not yet been assigned when the snapshot was taken. All XIDs >= xmax are definitely invisible (started after our snapshot).
xip[] (XID In Progress): An array of XIDs that were active at snapshot time, where xmin <= XID < xmax. These transactions' changes are invisible even if they commit later.

snapshot_structure.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
/* PostgreSQL SnapshotData structure (simplified) */
typedef struct SnapshotData {
    TransactionId xmin;      /* All XIDs < xmin are visible if committed */
    TransactionId xmax;      /* All XIDs >= xmax are invisible */
    
    TransactionId *xip;      /* Array of in-progress XIDs */
    uint32 xcnt;             /* Count of in-progress XIDs */
    
    /* For subtransaction handling */
    TransactionId *subxip;   /* In-progress subtransaction XIDs */
    int32 subxcnt;           /* Count of subtransaction XIDs */
    bool suboverflowed;      /* Did subxip array overflow? */
    
    /* Snapshot identity for caching */
    uint32 snapshot_id;      /* Unique identifier for this snapshot */
    
    /* Timestamp for MVCC garbage collection */
    TimestampTz whenTaken;   /* Wall-clock time snapshot was taken */
    
    /* Command counter for statement-level visibility */
    CommandId curcid;        /* Current command ID in owning transaction */
} SnapshotData;
 
/*
 * Take a snapshot of the current transaction state
 */
Snapshot GetTransactionSnapshot(void) {
    Snapshot snapshot = GetSnapshotData();
    
    /*
     * Capture current state:
     * 1. Lock the proc array briefly
     * 2. Record the next-to-be-assigned XID as xmax
     * 3. Scan all running transactions, record their XIDs
     * 4. Find the minimum running XID as xmin
     * 5. Copy running XID list to xip array
     * 6. Release proc array lock
     */
    
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    
    snapshot->xmax = ShmemVariableCache->nextXid;
    snapshot->xmin = snapshot->xmax;  /* Will be lowered below */
    snapshot->xcnt = 0;
    
    for (int i = 0; i < ProcGlobal->allProcCount; i++) {
        PGPROC *proc = &ProcGlobal->allProcs[i];
        TransactionId xid = proc->xid;
        
        if (!TransactionIdIsValid(xid))
            continue;  /* Not running a transaction */
        
        if (TransactionIdPrecedes(xid, snapshot->xmin))
            snapshot->xmin = xid;
        
        if (TransactionIdPrecedes(xid, snapshot->xmax)) {
            /* This XID is in our [xmin, xmax) range; add to xip */
            snapshot->xip[snapshot->xcnt++] = xid;
        }
    }
    
    LWLockRelease(ProcArrayLock);
    
    return snapshot;
}
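
The snapshot-taking steps can also be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not PostgreSQL code: `Snapshot` and `take_snapshot` are hypothetical names, XIDs are plain integers, and wraparound-aware comparison is ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Snapshot:
    xmin: int             # lowest XID still running at snapshot time
    xmax: int             # first XID not yet assigned at snapshot time
    xip: Tuple[int, ...]  # XIDs in [xmin, xmax) that were still running


def take_snapshot(running_xids: Iterable[int], next_xid: int) -> Snapshot:
    """Build a snapshot from the running XIDs and the next XID to be assigned."""
    xip = tuple(sorted(x for x in running_xids if x < next_xid))
    # With no running transactions, xmin collapses to xmax.
    xmin = xip[0] if xip else next_xid
    return Snapshot(xmin=xmin, xmax=next_xid, xip=xip)


snap = take_snapshot(running_xids={105, 102}, next_xid=108)
print(snap)  # Snapshot(xmin=102, xmax=108, xip=(102, 105))
```

Keeping `xip` sorted is a design choice of this sketch; it lets later membership checks use binary search instead of a linear scan.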

Visibility Check Algorithm:

With the snapshot structure in place, visibility checks follow this algorithm:

IS_VISIBLE(tuple_xid, snapshot):
    if tuple_xid < snapshot.xmin:
        // Finished before our snapshot: no xip search needed (fast path),
        // but it may have aborted, so its status must still be checked
        return CHECK_CLOG(tuple_xid)
    
    if tuple_xid >= snapshot.xmax:
        return INVISIBLE  // Started after our snapshot
    
    if tuple_xid IN snapshot.xip:
        return INVISIBLE  // Was running when we took snapshot
    
    // In range [xmin, xmax) but not in xip
    // Need to check if committed
    return CHECK_CLOG(tuple_xid)

The CLOG (Commit Log) stores the commit status of each transaction: committed, aborted, or in-progress. It's a compact bitmap with 2 bits per transaction ID.
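
As a rough illustration of how the visibility rules and a 2-bit-per-XID commit log fit together, here is a small Python sketch. The `CommitLog` class and `is_visible` function are hypothetical names chosen for this example, and XIDs are plain integers without wraparound. Because a transaction below xmin may have aborted, the sketch consults the commit log for those XIDs too.

```python
from bisect import bisect_left
from typing import Sequence

IN_PROGRESS, COMMITTED, ABORTED = 0b00, 0b01, 0b10


class CommitLog:
    """Toy commit log: 2 status bits per XID, so four XIDs per byte."""

    def __init__(self, capacity: int) -> None:
        self._bits = bytearray((capacity + 3) // 4)

    def set_status(self, xid: int, status: int) -> None:
        byte, shift = xid // 4, (xid % 4) * 2
        cleared = self._bits[byte] & ~(0b11 << shift) & 0xFF
        self._bits[byte] = cleared | (status << shift)

    def get_status(self, xid: int) -> int:
        byte, shift = xid // 4, (xid % 4) * 2
        return (self._bits[byte] >> shift) & 0b11


def is_visible(xid: int, xmin: int, xmax: int, xip: Sequence[int], clog: CommitLog) -> bool:
    """Apply the snapshot rules; xip must be sorted."""
    if xid >= xmax:
        return False                      # started after the snapshot
    if xid >= xmin:
        i = bisect_left(xip, xid)
        if i < len(xip) and xip[i] == xid:
            return False                  # was running at snapshot time
    return clog.get_status(xid) == COMMITTED


clog = CommitLog(capacity=16)
for xid, status in [(3, COMMITTED), (5, ABORTED), (7, COMMITTED), (8, COMMITTED)]:
    clog.set_status(xid, status)

# Snapshot with xmin=6, xmax=10, and XIDs 6 and 8 running when it was taken.
results = {xid: is_visible(xid, 6, 10, [6, 8], clog) for xid in (3, 5, 7, 8, 12)}
print(results)  # {3: True, 5: False, 7: True, 8: False, 12: False}
```

XID 8 stays invisible even though it has since committed, because it was still running when the snapshot was taken.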

Memory Considerations

Version Storage Strategies

Strategy 1: Append-Only / Tuple Versioning (PostgreSQL)

PostgreSQL stores all versions directly in the main table (heap). When a row is updated:

The old tuple's xmax is set to the updating transaction's XID, marking it as superseded (it becomes "dead" once no snapshot can still see it)
A new tuple is created with the new values
Both tuples exist in the table until VACUUM removes the old one
Indexes point to all tuple versions

Advantages:

Updates are fast (just insert new version)
No separate undo log to manage
Simple crash recovery

Disadvantages:

Table bloat (dead tuples accumulate)
Index bloat (entries for all versions)
VACUUM overhead required
Full row copy even for small updates

postgresql_heap_tuple.c
/* PostgreSQL HeapTupleHeader structure (simplified) */
typedef struct HeapTupleHeaderData {
    /* Transaction information */
    TransactionId t_xmin;     /* XID that created this tuple */
    TransactionId t_xmax;     /* XID that deleted/updated this tuple, or 0 */
    
    /* Command ID within the transaction */
    CommandId t_cid;          /* Command ID for intra-transaction visibility */
    
    /* Tuple pointer to next version (update chains, including HOT) */
    ItemPointerData t_ctid;   /* Current tuple ID, or pointer to next ver */
    
    /* Infomask bits for status flags */
    uint16 t_infomask;        /* Flags: committed, invalid, etc. */
    uint16 t_infomask2;       /* More flags + number of attributes */
    
    /* Actual data follows... */
} HeapTupleHeaderData;
 
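/*
 * Example visibility check for a heap tuple.
 * XidInSnapshot() is a simplified helper that returns true when the XID
 * committed and is visible to the snapshot. Note that PostgreSQL's real
 * XidInMVCCSnapshot() has the opposite sense: it returns true when the
 * XID was still in progress as of the snapshot.
 */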
bool HeapTupleSatisfiesMVCC(HeapTuple tuple, Snapshot snapshot) {
    TransactionId xmin = HeapTupleHeaderGetXmin(tuple->t_data);
    TransactionId xmax = HeapTupleHeaderGetXmax(tuple->t_data);
    
    /* Check if creating transaction is visible */
    if (!XidInSnapshot(xmin, snapshot)) {
        /* xmin not visible = tuple not yet created from snapshot's view */
        return false;
    }
    
    /* Check if tuple has been deleted/updated */
    if (!TransactionIdIsValid(xmax)) {
        /* Not deleted - visible if xmin is visible */
        return true;
    }
    
    /* Tuple was deleted - check if deletion is visible */
    if (XidInSnapshot(xmax, snapshot)) {
        /* Deletion is visible - tuple is dead */
        return false;
    }
    
    /* Deletion not visible - tuple is still alive from our view */
    return true;
}
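
The append-only strategy can be modeled in a few lines of Python. This is a minimal sketch, assuming a snapshot is represented simply as the set of XIDs it considers committed and visible; `TupleVersion`, `update_row`, and `visible_versions` are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class TupleVersion:
    value: str
    xmin: int                   # XID that created this version
    xmax: Optional[int] = None  # XID that deleted or replaced it, if any


def update_row(heap: List[TupleVersion], old: TupleVersion, new_value: str, xid: int) -> TupleVersion:
    """Append-only update: stamp the old version's xmax and append a new version."""
    old.xmax = xid
    new = TupleVersion(value=new_value, xmin=xid)
    heap.append(new)
    return new


def visible_versions(heap: List[TupleVersion], visible_xids: Set[int]) -> List[TupleVersion]:
    """Versions whose creator is visible and whose deleter (if any) is not."""
    return [v for v in heap
            if v.xmin in visible_xids
            and (v.xmax is None or v.xmax not in visible_xids)]


heap = [TupleVersion("balance=100", xmin=10)]
update_row(heap, heap[0], "balance=80", xid=20)

before = [v.value for v in visible_versions(heap, {10})]      # XID 20 not visible
after = [v.value for v in visible_versions(heap, {10, 20})]   # XID 20 visible
print(before, after)  # ['balance=100'] ['balance=80']
```

Both versions remain in `heap` after the update, which is the source of the table bloat listed above: something has to remove the old version later.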

Strategy 2: Undo Log Versioning (Oracle, MySQL InnoDB)

Oracle and InnoDB store only the current version in the main table. Previous versions are reconstructed from undo logs (rollback segments):

Main table always contains the latest committed (or in-progress) version
Before modifying a row, the old values are written to undo log
To read an old version, start from current and apply undo records backward
Undo logs are automatically purged when no longer needed

Advantages:

Main table stays compact (only current data)
Indexes point only to current tuples
No separate vacuum process needed
Space-efficient for small updates

Disadvantages:

Version reconstruction has CPU cost
Long transactions hold undo space
Complex undo log management
"ORA-01555: Snapshot too old" if undo is recycled

PostgreSQL: Append-Only

• All versions in main table
• Updates create new tuples
• xmin/xmax in tuple header
• VACUUM removes dead tuples
• Table/index bloat possible
• Fast writes, potential read overhead

Oracle/InnoDB: Undo Log

• Current version in main table
• Updates modify in place + undo
• SCN/XID tracks versions
• Automatic undo purge
• Main table stays compact
• Fast reads, version reconstruction cost
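
To illustrate the undo-log strategy, the following Python sketch reconstructs an older version by walking undo records backward from the current row. It is a toy model, not Oracle or InnoDB code: `Row`, `UndoRecord`, `update_in_place`, and `read_as_of` are hypothetical names, a snapshot is modeled as the set of XIDs it can see, and the row is assumed to have been inserted by a transaction visible to every reader.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class UndoRecord:
    prev_value: str                # value before the change
    prev_xid: int                  # XID that wrote prev_value
    older: Optional["UndoRecord"]  # next-older undo record, if any


@dataclass
class Row:
    value: str
    xid: int                       # XID that wrote the current value
    undo: Optional[UndoRecord] = None


def update_in_place(row: Row, new_value: str, xid: int) -> None:
    """Save the old value to the undo chain, then overwrite the row."""
    row.undo = UndoRecord(row.value, row.xid, row.undo)
    row.value, row.xid = new_value, xid


def read_as_of(row: Row, visible_xids: Set[int]) -> str:
    """Walk undo records backward until a version written by a visible XID is found."""
    value, xid, undo = row.value, row.xid, row.undo
    while xid not in visible_xids:
        if undo is None:
            raise LookupError("undo record unavailable (compare ORA-01555)")
        value, xid, undo = undo.prev_value, undo.prev_xid, undo.older
    return value


row = Row("v1", xid=10)
update_in_place(row, "v2", xid=20)
update_in_place(row, "v3", xid=30)

print(read_as_of(row, {10}), read_as_of(row, {10, 20}), read_as_of(row, {10, 20, 30}))
# v1 v2 v3
```

The oldest snapshot pays the most: reading `v1` applies two undo records, which is the reconstruction cost noted above. Dropping the oldest undo record (`row.undo.older = None`) makes the same read fail, mirroring the "snapshot too old" condition.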

Strategy 3: Delta Storage

Write Conflict Detection: First-Committer-Wins

While snapshots elegantly handle read operations, writes require additional conflict detection to prevent lost updates. Snapshot isolation implements the First-Committer-Wins (FCW) rule:

The Problem:

Two transactions T1 and T2 both read the same row, make decisions based on that read, and attempt to write. Without conflict detection, one write could overwrite the other—a lost update.

The FCW Solution:

Formal Definition:

Transaction $T_i$ can write data item $x$ only if no other transaction $T_j$ has modified and committed $x$ where:

$T_j$'s commit timestamp > $T_i$'s snapshot timestamp
$T_j \neq T_i$

fcw_conflict_check.pseudo
FUNCTION attempt_write(transaction T, row R, new_value V):
    /* Step 1: Acquire write lock on row (blocking) */
    acquire_exclusive_lock(R)
    
    /* Step 2: Check current row version */
    current_version = get_current_version(R)
    
    /* Step 3: Conflict detection */
    IF current_version.xmax != 0:
        /* Row was updated or deleted by another transaction */
        other_xid = current_version.xmax
        
        IF other_xid is still running:
            /* Wait for other transaction to complete */
            wait_for_commit_or_abort(other_xid)
        
        IF was_committed(other_xid):
            IF commit_timestamp(other_xid) > T.snapshot_timestamp:
                /* CONFLICT: Other transaction committed after our snapshot */
                release_exclusive_lock(R)
                ABORT T with "serialization failure"
    
    /* Step 4: Check if row was created after our snapshot */
    IF current_version.xmin is still running:
        wait_for_commit_or_abort(current_version.xmin)
    
    IF was_committed(current_version.xmin):
        IF commit_timestamp(current_version.xmin) > T.snapshot_timestamp:
            /* Row didn't exist in our snapshot - cannot update */
            release_exclusive_lock(R)
            ABORT T with "serialization failure"
    
    /* Step 5: No conflict - perform the write */
    create_new_version(R, V, T.xid)
    
    release_exclusive_lock(R)
    RETURN success

Conflict Detection Scenarios:

Scenario 1: No Conflict

T1 begins (snapshot at 100), reads row X
T1 writes X
T1 commits at 200
-- T1 succeeds: no concurrent modifier

Scenario 2: Concurrent Write, First Committer Wins

T1 begins (snapshot at 100), reads row X
T2 begins (snapshot at 110), reads row X
T1 writes X, commits at 200
T2 attempts to write X
-- T2 finds X was modified after its snapshot (by T1 at 200 > 110)
-- T2 ABORTS with serialization failure

Scenario 3: Non-Conflicting Concurrent Writes

T1 begins (snapshot at 100), reads row X
T2 begins (snapshot at 110), reads row Y
T1 writes X, commits
T2 writes Y, commits
-- Both succeed: writing different rows
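
The scenarios above can be replayed with a small Python model. This is a sketch under stated assumptions rather than any database's implementation: `Store` and `SerializationFailure` are hypothetical names, timestamps come from a simple counter, and conflicts are checked both when a write is attempted and again at commit, so that two transactions that both write before either commits are also caught.

```python
class SerializationFailure(Exception):
    """Raised when another transaction committed a write to the row first."""


class Store:
    """Toy store tracking, per row, the commit timestamp of its latest write."""

    def __init__(self) -> None:
        self.clock = 100
        self.last_commit_ts = {}  # row -> commit timestamp of latest committed write

    def begin(self) -> dict:
        self.clock += 10
        return {"snapshot_ts": self.clock, "writes": set()}

    def _check(self, txn: dict, row: str) -> None:
        if self.last_commit_ts.get(row, 0) > txn["snapshot_ts"]:
            raise SerializationFailure(f"row {row!r} was committed after our snapshot")

    def write(self, txn: dict, row: str) -> None:
        self._check(txn, row)
        txn["writes"].add(row)

    def commit(self, txn: dict) -> int:
        for row in txn["writes"]:
            self._check(txn, row)          # final conflict check
        self.clock += 10
        for row in txn["writes"]:
            self.last_commit_ts[row] = self.clock
        return self.clock


def run_scenario_2() -> str:
    store = Store()
    t1, t2 = store.begin(), store.begin()
    store.write(t1, "X")
    store.commit(t1)
    try:
        store.write(t2, "X")
    except SerializationFailure:
        return "T2 aborted"
    return "T2 committed"


print(run_scenario_2())  # T2 aborted
```

Writing to different rows, as in Scenario 3, never trips the check because each row's latest commit timestamp is tracked separately.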

FCW Is Not Enough for Serializability

Commit Processing

When a transaction under snapshot isolation commits, several things must happen atomically to ensure durability and visibility consistency.

Commit Steps:

1. Final Conflict Check

Before committing, perform a final verification that no conflicts emerged during the transaction (some systems do this incrementally during writes, others do a final pass).

2. Assign Commit Timestamp

Obtain a commit timestamp from the global counter. This timestamp determines when this transaction's changes become visible to future snapshots.

3. Write Commit Record to Log (WAL)

Write a commit record to the write-ahead log. This must be flushed to stable storage before the commit can be acknowledged.

4. Update Transaction Status

Mark the transaction as committed in the commit log (CLOG in PostgreSQL, undo header in Oracle/InnoDB). This makes the transaction's status quickly queryable.

5. Make Changes Visible

After commit, the transaction's modifications are visible to any snapshot taken after the commit timestamp.

commit_protocol.pseudo
FUNCTION commit_transaction(Transaction T):
    /* Step 1: Ensure all writes have been performed */
    IF T.pending_writes NOT EMPTY:
        ERROR "Cannot commit with unflushed writes"
    
    /* Step 2: Acquire commit lock (brief serialization point) */
    acquire_commit_lock()
    
    /* Step 3: Assign commit timestamp */
    T.commit_timestamp = get_next_commit_timestamp()
    
    /* Step 4: Create WAL commit record */
    commit_record = {
        type: XLOG_XACT_COMMIT,
        xid: T.xid,
        timestamp: T.commit_timestamp,
        database: T.database_id,
        subxacts: T.subtransaction_list
    }
    
    /* Step 5: Write and flush WAL */
    lsn = write_wal_record(commit_record)
    flush_wal_to_disk(lsn)  /* CRITICAL: Must sync before acknowledge */
    
    /* Step 6: Update CLOG (commit log) */
    set_transaction_status(T.xid, TRANSACTION_STATUS_COMMITTED)
    
    /* For subtransactions */
    FOR EACH subxid IN T.subtransaction_list:
        set_transaction_status(subxid, TRANSACTION_STATUS_COMMITTED)
    
    /* Step 7: Update shared memory so new snapshots see T as committed */
    remove_from_running_transactions(T.xid)
    
    /* Step 8: Release commit lock */
    release_commit_lock()
    
    /* Step 9: Release locks held by transaction (after T has left the
       running set, so lock waiters never see T as still in progress) */
    release_all_locks(T)
    
    RETURN T.commit_timestamp

PostgreSQL's Commit Sequence:

PostgreSQL's actual commit process involves:

RecordTransactionCommit(): Writes the commit record to WAL and flushes it
XactLogCommitRecord(): Called from RecordTransactionCommit() to build and insert the detailed commit record
TransactionIdCommitTree(): Marks the XID (and any subtransaction XIDs) as committed in CLOG
SyncRepWaitForLSN(): Waits for synchronous replication if configured
ProcArrayEndTransaction(): Removes XID from running transaction array
ResourceOwnerRelease(): Releases resources (locks, files, memory)

Visibility Latency:

There's a brief window between WAL flush and ProcArray update where:

The transaction is durably committed
But other transactions may not yet see it as committed

This is handled by careful ordering and the visibility rules checking both CLOG status and the running transaction array.
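
A toy Python model can make the ordering concrete. It is only a sketch: the `Engine` class and its methods are hypothetical, "flushing" is an in-memory list append, and snapshots record just the running set (XIDs assigned after the snapshot are ignored for brevity).

```python
class Engine:
    """Toy engine: WAL record first, then commit status, then leave the running set."""

    def __init__(self) -> None:
        self.wal = []         # stand-in for durable log records, in order
        self.clog = {}        # xid -> "committed"
        self.running = set()  # XIDs currently in progress
        self.next_ts = 1

    def begin(self, xid: int) -> None:
        self.running.add(xid)

    def commit(self, xid: int) -> int:
        ts = self.next_ts
        self.next_ts += 1
        self.wal.append(("COMMIT", xid, ts))  # 1. log the commit (durability)
        self.clog[xid] = "committed"          # 2. record the status
        self.running.discard(xid)             # 3. new snapshots now see it
        return ts

    def snapshot(self) -> frozenset:
        return frozenset(self.running)

    def sees(self, snapshot: frozenset, xid: int) -> bool:
        # Visible only if committed AND not running when the snapshot was taken.
        return xid not in snapshot and self.clog.get(xid) == "committed"


eng = Engine()
eng.begin(7)
early = eng.snapshot()   # taken while XID 7 is still running
eng.commit(7)
late = eng.snapshot()    # taken after XID 7 left the running set

print(eng.sees(early, 7), eng.sees(late, 7))  # False True
```

Checking the running set as well as the commit status is what keeps the early snapshot stable: XID 7 is marked committed, yet the snapshot that saw it running continues to treat it as invisible.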

Synchronous vs Asynchronous Commit

Garbage Collection: Cleaning Up Old Versions

The Horizon Concept:

The oldest active snapshot determines the garbage collection horizon. Any version that was superseded before this horizon is safe to remove because no current or future transaction can see it.

$$\text{GC Horizon} = \min(\text{xmin of all active snapshots})$$

A version can be reclaimed once it has been replaced or deleted by a committed transaction whose XID is below this horizon, because no active or future snapshot can still see it.


PostgreSQL VACUUM:

VACUUM in PostgreSQL performs several tasks:

Remove dead tuples: Tuples whose xmax indicates deletion/update by a committed transaction older than all active snapshots
Freeze old tuples: Convert very old XIDs to FrozenTransactionId to prevent wraparound
Update visibility map: Track pages where all tuples are visible to all transactions
Update free space map: Track available space for new tuples
Truncate trailing empty pages: Reclaim disk space from the end of tables

Autovacuum runs automatically based on table modification statistics, but heavily updated tables may need manual tuning.

vacuum_logic.pseudo
FUNCTION vacuum_table(table):
    /* Get the oldest active snapshot horizon */
    gc_horizon = get_oldest_xmin_across_all_snapshots()
    
    /* Scan each page of the table */
    FOR page IN table.pages:
        dead_tuples = []
        
        FOR tuple IN page.tuples:
            /* Check if tuple is dead and reclaimable */
            IF tuple.xmax IS VALID AND tuple.xmax < gc_horizon:
                IF was_committed(tuple.xmax):
                    /* This tuple was deleted/updated and is no longer needed */
                    dead_tuples.append(tuple)
            
            /* Check for XID freeze requirement */
            IF tuple.xmin < freeze_threshold:
                IF was_committed(tuple.xmin):
                    tuple.xmin = FrozenTransactionId  /* Prevent wraparound */
        
        /* Remove dead tuples and update free space */
        IF dead_tuples NOT EMPTY:
            compact_page(page, dead_tuples)
            update_free_space_map(page)
        
        /* Update visibility map if all remaining tuples are visible */
        IF all_tuples_visible_to_all(page):
            set_visibility_map_bit(page)
    
    /* Update table statistics */
    update_pg_class_statistics(table)
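
The dead-tuple part of this logic can be tried out with a short Python sketch. It is a simplified model, not PostgreSQL's VACUUM: `Tup`, `gc_horizon`, and `vacuum_page` are hypothetical names, XIDs are plain integers, and freezing, the visibility map, and the free space map are left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class Tup:
    value: str
    xmin: int
    xmax: Optional[int] = None  # XID that deleted or replaced this version


def gc_horizon(snapshot_xmins: Iterable[int], next_xid: int) -> int:
    """Oldest xmin among active snapshots; next_xid if there are none."""
    return min(snapshot_xmins, default=next_xid)


def vacuum_page(page: List[Tup], horizon: int, committed: Set[int]) -> Tuple[List[Tup], List[Tup]]:
    """Split a page into (kept, removed) tuples.

    A tuple is removed only if its deleter committed and is below the horizon.
    """
    kept, removed = [], []
    for t in page:
        dead = t.xmax is not None and t.xmax < horizon and t.xmax in committed
        (removed if dead else kept).append(t)
    return kept, removed


page = [Tup("v1", 10, 20), Tup("v2", 20, 40), Tup("v3", 40)]
committed = {10, 20, 40}

kept, removed = vacuum_page(page, gc_horizon([30, 55], next_xid=60), committed)
print([t.value for t in removed])  # ['v1']

# A long-running snapshot with xmin=15 pulls the horizon back and blocks cleanup.
kept2, removed2 = vacuum_page(page, gc_horizon([15, 55], next_xid=60), committed)
print([t.value for t in removed2])  # []
```

The second call shows why a single old snapshot matters: with the horizon at 15, even `v1` (replaced by XID 20) must be retained.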

Long Transactions Block Garbage Collection

Database-Specific Implementations

Each major database implements snapshot isolation with unique engineering choices. Understanding these differences helps when designing applications or troubleshooting performance issues.

Snapshot Isolation Implementation Comparison
Aspect	PostgreSQL	Oracle	MySQL InnoDB	SQL Server
Version Storage	In-heap (append-only)	Undo tablespace	Undo log (rollback segment)	Version store (tempdb)
Version ID	32-bit XID	48-bit SCN	48-bit Transaction ID	14-byte row version
GC Mechanism	VACUUM process	Automatic undo purge	Purge thread	Version cleaner
Snapshot Capture	ProcArray scan + CLOG	SCN at statement/tx start	Read View creation	Version chain traversal
Conflict Detection	xmax check	Interested Transaction List (ITL)	Lock wait + trx comparison	Update conflict on commit
Isolation Level	REPEATABLE READ	READ COMMITTED (default)	REPEATABLE READ (default)	SNAPSHOT (explicit)
Write Skew	Allowed	Allowed	Allowed	Allowed (SNAPSHOT level)

PostgreSQL Specifics:

Uses tuple-level xmin/xmax headers for visibility
REPEATABLE READ is implemented as snapshot isolation
SERIALIZABLE adds SSI (Serializable Snapshot Isolation) checking
HOT updates allow same-page updates to avoid index maintenance
Index-only scans require visibility map confirmation

Oracle Specifics:

READ COMMITTED actually uses statement-level snapshots
SERIALIZABLE uses transaction-level snapshots (closer to SI)
Consistent reads reconstruct from undo automatically
ORA-01555 occurs when undo is overwritten before snapshot completes
ITL (Interested Transaction List) tracks row-level lock holders

MySQL InnoDB Specifics:

REPEATABLE READ is the default and uses snapshots
Read View captures active transactions at snapshot time
Secondary indexes may need version visibility checks
Purge lag can be monitored with SHOW ENGINE INNODB STATUS
Gap locking prevents phantoms (beyond standard SI guarantees)

Matching Isolation to Requirements

Summary: The Mechanics of Snapshot Isolation

Key Takeaways

• Transaction IDs enable ordering: Monotonically increasing XIDs/SCNs provide a total order for determining visibility and detecting conflicts
• Snapshots capture active transaction state: The xmin, xmax, and in-progress list enable efficient visibility determination
• Version storage strategies differ: PostgreSQL's append-only approach and Oracle/InnoDB's undo log approach have different performance trade-offs
• First-Committer-Wins prevents lost updates: Write conflict detection ensures only one concurrent modifier can commit its changes to a row
• Commit processing is carefully ordered: WAL flush, CLOG update, and ProcArray updates must happen in sequence for durability and correctness
• Garbage collection requires horizon tracking: The oldest active snapshot determines which old versions can be safely removed
• Implementation details matter: Database-specific choices affect performance characteristics, tuning options, and application design

What's Next:
