Loading content...
Understanding what a snapshot is conceptually is only half the battle. The real engineering challenge lies in implementing snapshot isolation efficiently at scale—handling millions of transactions, maintaining millions of versions, and making visibility decisions in microseconds.
This page takes you inside the implementation of snapshot isolation (SI), revealing the data structures, algorithms, and engineering trade-offs that power databases like PostgreSQL, Oracle, MySQL InnoDB, and SQL Server. We'll examine how snapshots are established, how visibility is determined, how write conflicts are detected, and how the system maintains correctness under extreme concurrency.
By the end of this page, you will understand: how transaction IDs and commit timestamps are assigned and used; how snapshot metadata is structured and stored; how visibility checks are performed; how first-committer-wins conflict detection works; and how different databases implement these mechanisms. You'll be able to reason about SI behavior at the implementation level.
The foundation of any SI implementation is a reliable mechanism for identifying and ordering transactions. This is achieved through Transaction IDs (XIDs or TXIDs), which serve as unique identifiers that also encode ordering information.
Requirements for Transaction IDs:
PostgreSQL's XID System:
PostgreSQL uses 32-bit unsigned integers as transaction IDs. This creates an interesting challenge: with a maximum of ~4 billion IDs, a busy database could exhaust the ID space. PostgreSQL handles this through XID wraparound protection:
1234567891011121314151617181920212223242526272829303132333435363738394041424344
/* PostgreSQL-style XID comparison using modular arithmetic */typedef uint32_t TransactionId; /* * Check if xid1 occurred before xid2. * Uses wraparound-safe comparison. */bool TransactionIdPrecedes(TransactionId xid1, TransactionId xid2) { /* * xid1 is before xid2 if the difference (xid2 - xid1) * is less than half the XID space (2^31). * This handles wraparound correctly. */ int32_t diff = (int32_t)(xid2 - xid1); return diff > 0;} /* * Check if xid1 occurred before or equals xid2. */bool TransactionIdPrecedesOrEquals(TransactionId xid1, TransactionId xid2) { int32_t diff = (int32_t)(xid2 - xid1); return diff >= 0;} /* * Assign the next transaction ID (must hold XID lock) */TransactionId GetNewTransactionId(void) { TransactionId xid; SpinLockAcquire(&XidGenLock); xid = ShmemVariableCache->nextXid; ShmemVariableCache->nextXid++; /* Handle wraparound by skipping special XIDs */ if (ShmemVariableCache->nextXid < FirstNormalTransactionId) ShmemVariableCache->nextXid = FirstNormalTransactionId; SpinLockRelease(&XidGenLock); return xid;}Oracle's SCN System:
Oracle uses System Change Numbers (SCNs) instead of simple transaction IDs. SCNs are 48-bit values that combine:
Advantages of SCNs:
MySQL InnoDB's Transaction System:
InnoDB uses 48-bit transaction IDs stored in the transaction header. It maintains:
Transaction ID assignment is a potential serialization point—all transactions must obtain unique IDs from a shared counter. Modern databases minimize this with techniques like batching (claim IDs in groups), sharded counters (different cores claim from different ranges), or lazy assignment (only assign IDs when needed for writes).
When a transaction establishes its snapshot, it must capture enough information to determine visibility of any data version it might later read. This requires carefully designed data structures that are compact (memory-efficient) yet complete (can answer any visibility question).
PostgreSQL's SnapshotData Structure:
PostgreSQL's snapshot contains three critical pieces of information:
xmin: The lowest XID that was active when the snapshot was taken. All XIDs < xmin are definitely committed (or aborted) and can use fast-path visibility checks.
xmax: The first XID that had not yet been assigned when the snapshot was taken. All XIDs >= xmax are definitely invisible (started after our snapshot).
xip[] (XID In Progress): An array of XIDs that were active at snapshot time, where xmin <= XID < xmax. These transactions' changes are invisible even if they commit later.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465
/* PostgreSQL SnapshotData structure (simplified) */typedef struct SnapshotData { TransactionId xmin; /* All XIDs < xmin are visible if committed */ TransactionId xmax; /* All XIDs >= xmax are invisible */ TransactionId *xip; /* Array of in-progress XIDs */ uint32 xcnt; /* Count of in-progress XIDs */ /* For subtransaction handling */ TransactionId *subxip; /* In-progress subtransaction XIDs */ int32 subxcnt; /* Count of subtransaction XIDs */ bool suboverflowed; /* Did subxip array overflow? */ /* Snapshot identity for caching */ uint32 snapshot_id; /* Unique identifier for this snapshot */ /* Timestamp for MVCC garbage collection */ TimestampTz whenTaken; /* Wall-clock time snapshot was taken */ /* Command counter for statement-level visibility */ CommandId curcid; /* Current command ID in owning transaction */} SnapshotData; /* * Take a snapshot of the current transaction state */Snapshot GetTransactionSnapshot(void) { Snapshot snapshot = GetSnapshotData(); /* * Capture current state: * 1. Lock the proc array briefly * 2. Record the next-to-be-assigned XID as xmax * 3. Scan all running transactions, record their XIDs * 4. Find the minimum running XID as xmin * 5. Copy running XID list to xip array * 6. Release proc array lock */ LWLockAcquire(ProcArrayLock, LW_SHARED); snapshot->xmax = ShmemVariableCache->nextXid; snapshot->xmin = snapshot->xmax; /* Will be lowered below */ snapshot->xcnt = 0; for (int i = 0; i < ProcGlobal->allProcCount; i++) { PGPROC *proc = &allProcs[i]; TransactionId xid = proc->xid; if (!TransactionIdIsValid(xid)) continue; /* Not running a transaction */ if (TransactionIdPrecedes(xid, snapshot->xmin)) snapshot->xmin = xid; if (TransactionIdPrecedes(xid, snapshot->xmax)) { /* This XID is in our [xmin, xmax) range; add to xip */ snapshot->xip[snapshot->xcnt++] = xid; } } LWLockRelease(ProcArrayLock); return snapshot;}Visibility Check Algorithm:
With the snapshot structure in place, visibility checks follow this algorithm:
IS_VISIBLE(tuple_xid, snapshot):
if tuple_xid < snapshot.xmin:
return COMMITTED_AND_VISIBLE // Old, definitely committed
if tuple_xid >= snapshot.xmax:
return INVISIBLE // Started after our snapshot
if tuple_xid IN snapshot.xip:
return INVISIBLE // Was running when we took snapshot
// In range [xmin, xmax) but not in xip
// Need to check if committed
return CHECK_CLOG(tuple_xid)
The CLOG (Commit Log) stores the commit status of each transaction: committed, aborted, or in-progress. It's a compact bitmap with 2 bits per transaction ID.
The xip[] array can become large with many concurrent transactions. PostgreSQL limits this and uses subxip[] for subtransactions. If limits are exceeded, the system may need to serialize certain operations. Well-tuned databases keep concurrent transaction counts reasonable.
MVCC requires storing multiple versions of data. Different databases take fundamentally different approaches to how and where these versions are stored, each with distinct performance characteristics.
Strategy 1: Append-Only / Tuple Versioning (PostgreSQL)
PostgreSQL stores all versions directly in the main table (heap). When a row is updated:
Advantages:
Disadvantages:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647
/* PostgreSQL HeapTupleHeader structure (simplified) */typedef struct HeapTupleHeaderData { /* Transaction information */ TransactionId t_xmin; /* XID that created this tuple */ TransactionId t_xmax; /* XID that deleted/updated this tuple, or 0 */ /* Command ID within the transaction */ CommandId t_cid; /* Command ID for intra-transaction visibility */ /* Tuple pointer to next version (for HOT chains) */ ItemPointerData t_ctid; /* Current tuple ID, or pointer to next ver */ /* Infomask bits for status flags */ uint16 t_infomask; /* Flags: committed, invalid, etc. */ uint16 t_infomask2; /* More flags + number of attributes */ /* Actual data follows... */} HeapTupleHeaderData; /* * Example visibility check for a heap tuple */bool HeapTupleSatisfiesMVCC(HeapTuple tuple, Snapshot snapshot) { TransactionId xmin = HeapTupleHeaderGetXmin(tuple->t_data); TransactionId xmax = HeapTupleHeaderGetXmax(tuple->t_data); /* Check if creating transaction is visible */ if (!XidInSnapshot(xmin, snapshot)) { /* xmin not visible = tuple not yet created from snapshot's view */ return false; } /* Check if tuple has been deleted/updated */ if (!TransactionIdIsValid(xmax)) { /* Not deleted - visible if xmin is visible */ return true; } /* Tuple was deleted - check if deletion is visible */ if (XidInSnapshot(xmax, snapshot)) { /* Deletion is visible - tuple is dead */ return false; } /* Deletion not visible - tuple is still alive from our view */ return true;}Strategy 2: Undo Log Versioning (Oracle, MySQL InnoDB)
Oracle and InnoDB store only the current version in the main table. Previous versions are reconstructed from undo logs (rollback segments):
Advantages:
Disadvantages:
Some systems (like SQL Server's optimized version store) store deltas—only the changed columns—instead of full row copies. This reduces storage for wide tables with small updates. The trade-off is more complex version reconstruction.
While snapshots elegantly handle read operations, writes require additional conflict detection to prevent lost updates. Snapshot isolation implements the First-Committer-Wins (FCW) rule:
The Problem:
Two transactions T1 and T2 both read the same row, make decisions based on that read, and attempt to write. Without conflict detection, one write could overwrite the other—a lost update.
The FCW Solution:
When a transaction attempts to write a row, it checks if the row has been modified by any transaction that committed after the writing transaction's snapshot was taken. If so, the write is rejected (the transaction must abort).
Formal Definition:
Transaction $T_i$ can write data item $x$ only if no other transaction $T_j$ has modified and committed $x$ where:
12345678910111213141516171819202122232425262728293031323334353637
FUNCTION attempt_write(transaction T, row R, new_value V): /* Step 1: Acquire write lock on row (blocking) */ acquire_exclusive_lock(R) /* Step 2: Check current row version */ current_version = get_current_version(R) /* Step 3: Conflict detection */ IF current_version.xmax != 0: /* Row was updated or deleted by another transaction */ other_xid = current_version.xmax IF other_xid is still running: /* Wait for other transaction to complete */ wait_for_commit_or_abort(other_xid) IF was_committed(other_xid): IF commit_timestamp(other_xid) > T.snapshot_timestamp: /* CONFLICT: Other transaction committed after our snapshot */ release_exclusive_lock(R) ABORT T with "serialization failure" /* Step 4: Check if row was created after our snapshot */ IF current_version.xmin is still running: wait_for_commit_or_abort(current_version.xmin) IF was_committed(current_version.xmin): IF commit_timestamp(current_version.xmin) > T.snapshot_timestamp: /* Row didn't exist in our snapshot - cannot update */ release_exclusive_lock(R) ABORT T with "serialization failure" /* Step 5: No conflict - perform the write */ create_new_version(R, V, T.xid) release_exclusive_lock(R) RETURN successConflict Detection Scenarios:
Scenario 1: No Conflict
T1 begins (snapshot at 100), reads row X
T1 writes X
T1 commits at 200
-- T1 succeeds: no concurrent modifier
Scenario 2: Concurrent Write, First Committer Wins
T1 begins (snapshot at 100), reads row X
T2 begins (snapshot at 110), reads row X
T1 writes X, commits at 200
T2 attempts to write X
-- T2 finds X was modified after its snapshot (by T1 at 200 > 110)
-- T2 ABORTS with serialization failure
Scenario 3: Non-Conflicting Concurrent Writes
T1 begins (snapshot at 100), reads row X
T2 begins (snapshot at 110), reads row Y
T1 writes X, commits
T2 writes Y, commits
-- Both succeed: writing different rows
First-Committer-Wins prevents lost updates but does NOT prevent write skew anomalies. T1 reading X and writing Y, while T2 reads Y and writes X, can both commit under FCW—potentially violating constraints that span X and Y. We'll cover this limitation in the Write Skew page.
When a transaction under snapshot isolation commits, several things must happen atomically to ensure durability and visibility consistency.
Commit Steps:
1. Final Conflict Check
Before committing, perform a final verification that no conflicts emerged during the transaction (some systems do this incrementally during writes, others do a final pass).
2. Assign Commit Timestamp
Obtain a commit timestamp from the global counter. This timestamp determines when this transaction's changes become visible to future snapshots.
3. Write Commit Record to Log (WAL)
Write a commit record to the write-ahead log. This must be flushed to stable storage before the commit can be acknowledged.
4. Update Transaction Status
Mark the transaction as committed in the commit log (CLOG in PostgreSQL, undo header in Oracle/InnoDB). This makes the transaction's status quickly queryable.
5. Make Changes Visible
After commit, the transaction's modifications are visible to any snapshot taken after the commit timestamp.
1234567891011121314151617181920212223242526272829303132333435363738394041
FUNCTION commit_transaction(Transaction T): /* Step 1: Ensure all writes have been performed */ IF T.pending_writes NOT EMPTY: ERROR "Cannot commit with unflushed writes" /* Step 2: Acquire commit lock (brief serialization point) */ acquire_commit_lock() /* Step 3: Assign commit timestamp */ T.commit_timestamp = get_next_commit_timestamp() /* Step 4: Create WAL commit record */ commit_record = { type: XLOG_XACT_COMMIT, xid: T.xid, timestamp: T.commit_timestamp, database: T.database_id, subxacts: T.subtransaction_list } /* Step 5: Write and flush WAL */ lsn = write_wal_record(commit_record) flush_wal_to_disk(lsn) /* CRITICAL: Must sync before acknowledge */ /* Step 6: Update CLOG (commit log) */ set_transaction_status(T.xid, TRANSACTION_STATUS_COMMITTED) /* For subtransactions */ FOR EACH subxid IN T.subtransaction_list: set_transaction_status(subxid, TRANSACTION_STATUS_COMMITTED) /* Step 7: Release locks held by transaction */ release_all_locks(T) /* Step 8: Release commit lock */ release_commit_lock() /* Step 9: Update shared memory (for snapshot efficiency) */ remove_from_running_transactions(T.xid) RETURN T.commit_timestampPostgreSQL's Commit Sequence:
PostgreSQL's actual commit process involves:
Visibility Latency:
There's a brief window between WAL flush and ProcArray update where:
This is handled by careful ordering and the visibility rules checking both CLOG status and the running transaction array.
PostgreSQL offers 'synchronous_commit' configurations. Turning it off skips the WAL flush wait, providing much faster commits but risking up to 3× wal_writer_delay worth of transactions in a crash. The visibility semantics remain the same—only durability is affected.
MVCC creates multiple versions, but not all versions need to be kept forever. Once no active snapshot can possibly need an old version, it can be safely removed. This is the job of garbage collection (called VACUUM in PostgreSQL, undo purge in Oracle/InnoDB).
The Horizon Concept:
The oldest active snapshot determines the garbage collection horizon. Any version that was superseded before this horizon is safe to remove because no current or future transaction can see it.
$$\text{GC Horizon} = \min(\text{xmin of all active snapshots})$$
Versions created by transactions with XIDs less than this horizon AND subsequently replaced by newer versions can be reclaimed.
PostgreSQL VACUUM:
VACUUM in PostgreSQL performs several tasks:
Autovacuum runs automatically based on table modification statistics, but heavily-updated tables may need manual tuning.
12345678910111213141516171819202122232425262728293031
FUNCTION vacuum_table(table): /* Get the oldest active snapshot horizon */ gc_horizon = get_oldest_xmin_across_all_snapshots() /* Scan each page of the table */ FOR page IN table.pages: dead_tuples = [] FOR tuple IN page.tuples: /* Check if tuple is dead and reclaimable */ IF tuple.xmax IS VALID AND tuple.xmax < gc_horizon: IF was_committed(tuple.xmax): /* This tuple was deleted/updated and is no longer needed */ dead_tuples.append(tuple) /* Check for XID freeze requirement */ IF tuple.xmin < freeze_threshold: IF was_committed(tuple.xmin): tuple.xmin = FrozenTransactionId /* Prevent wraparound */ /* Remove dead tuples and update free space */ IF dead_tuples NOT EMPTY: compact_page(page, dead_tuples) update_free_space_map(page) /* Update visibility map if all remaining tuples are visible */ IF all_tuples_visible_to_all(page): set_visibility_map_bit(page) /* Update table statistics */ update_pg_class_statistics(table)If a transaction holds a very old snapshot (e.g., a long-running report), it pins the GC horizon. No versions created after that snapshot's xmin can be removed, even if they've been updated a thousand times. This is why long-running transactions can cause table bloat in PostgreSQL and 'ORA-01555: Snapshot too old' errors in Oracle.
Each major database implements snapshot isolation with unique engineering choices. Understanding these differences helps when designing applications or troubleshooting performance issues.
| Aspect | PostgreSQL | Oracle | MySQL InnoDB | SQL Server |
|---|---|---|---|---|
| Version Storage | In-heap (append-only) | Undo tablespace | Undo log (rollback segment) | Version store (tempdb) |
| Version ID | 32-bit XID | 48-bit SCN | 48-bit Transaction ID | 14-byte row version |
| GC Mechanism | VACUUM process | Automatic undo purge | Purge thread | Version cleaner |
| Snapshot Capture | ProcArray scan + CLOG | SCN at statement/tx start | Read View creation | Version chain traversal |
| Conflict Detection | xmax check | Interested Transaction List (ITL) | Lock wait + trx comparison | Update conflict on commit |
| Isolation Level | REPEATABLE READ | READ COMMITTED (default) | REPEATABLE READ (default) | SNAPSHOT (explicit) |
| Write Skew | Allowed | Allowed | Allowed | Allowed (SNAPSHOT level) |
PostgreSQL Specifics:
Oracle Specifics:
MySQL InnoDB Specifics:
SHOW ENGINE INNODB STATUSWhen designing applications, consider: Are you okay with READ COMMITTED's statement-level snapshots? Do you need transaction-level consistency (REPEATABLE READ/SI)? Must you prevent write skew (SERIALIZABLE)? Each level has different implementation costs and anomaly prevention.
Implementing snapshot isolation requires careful coordination of transaction identifiers, version management, visibility rules, conflict detection, and garbage collection. Understanding these mechanisms is essential for effective database tuning and application design.
What's Next:
The next page explores the Write Skew Anomaly—the signature problem that snapshot isolation cannot prevent. Understanding write skew is critical for designing correct applications and knowing when stronger isolation levels are needed.
You now understand how snapshot isolation is implemented in real database systems—from transaction ID assignment through version management, visibility checking, conflict detection, commit processing, and garbage collection. This implementation knowledge enables you to reason about database behavior and optimize application designs.