The concept of maintaining multiple versions sounds straightforward, but the implementation details determine whether an MVCC system is practical for production use. This page explores the structural and operational aspects of version management: how versions are physically organized, what metadata they carry, how they form chains of history, and how the database eventually reclaims obsolete versions.
Every time you update a row in a PostgreSQL table, modify a document in MongoDB, or run an UPDATE statement through InnoDB, the database is creating, organizing, and managing multiple versions of your data. Understanding this mechanism is essential for diagnosing bloat, tuning garbage collection, and predicting how storage behaves under load.
By the end of this page, you will understand how version chains are structured, what metadata each version carries, how different storage models organize versions, and how garbage collection reclaims obsolete versions. You'll gain insight into the engineering tradeoffs that shape real-world MVCC implementations.
Every version of a data item must carry sufficient metadata for the database to answer two critical questions:

- Which transaction created this version, and had it committed before my snapshot?
- Has this version been superseded, and if so, had the superseding transaction committed before my snapshot?
These questions determine visibility: whether a given transaction should see this version or skip to the next in the chain.
Essential Version Metadata:
While specific implementations vary, most MVCC systems track similar information for each version:
| Metadata Field | Purpose | Example Values |
|---|---|---|
| Xmin / Created-By | Transaction ID that created this version | 10042 (PostgreSQL), TRX_ID (InnoDB) |
| Xmax / Deleted-By | Transaction ID that deleted/superseded this version (0 or NULL if still current) | 10057 or NULL |
| Commit Timestamp | Logical or physical time when creating transaction committed | 2024-01-15T14:30:00 or LSN 0x5F3A |
| Version Pointer | Link to previous or next version in the chain | Heap location or undo segment offset |
| Transaction Status | Whether creating transaction committed, aborted, or in-progress | Committed, Aborted, In-Progress |
| Row Content | The actual data values for this version | Column values, BLOB references, etc. |
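As a mental model, the metadata above can be captured in a small record type. This is a sketch: the field names loosely follow PostgreSQL's, not any engine's actual on-disk layout.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Version:
    xmin: int                                 # transaction ID that created this version
    xmax: Optional[int] = None                # transaction ID that superseded it (None if current)
    commit_ts: Optional[int] = None           # logical commit time of the creator
    next_version: Optional["Version"] = None  # link in the version chain
    row: dict = field(default_factory=dict)   # the actual column values

    def is_current(self) -> bool:
        # A version with no superseder is the current one.
        return self.xmax is None

v = Version(xmin=10042, row={"name": "Alice", "balance": 1000})
```

The two visibility questions map directly onto `xmin`/`xmax` plus the commit status of those transactions, which is exactly what the visibility check below consults.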
The Visibility Challenge:
Determining whether a version is visible requires knowing the status of the transactions that created and potentially deleted it. This creates an interesting bootstrapping problem: the version exists in storage, but its visibility depends on transaction state that may be stored elsewhere.
Databases solve this in different ways:
- PostgreSQL maintains a dedicated commit log, pg_xact (formerly clog), to track transaction commit status; a visibility check requires consulting this structure
- InnoDB consults undo log headers and its list of active transactions
- Oracle checks the transaction table stored in the rollback segment header
```
// Determining Version Visibility
// Given: Transaction T with snapshot timestamp S
// Version V of data item with metadata

function is_version_visible(T: Transaction, V: Version) -> Boolean:
    // Step 1: Check if creator transaction has committed
    creator_status = get_transaction_status(V.xmin)

    if creator_status == ABORTED:
        return false  // Created by aborted transaction - never visible

    if creator_status == IN_PROGRESS:
        if V.xmin == T.id:
            return true   // Transaction sees its own uncommitted writes
        else:
            return false  // Created by another in-progress transaction

    // Creator committed - check if commit was before our snapshot
    if creator_status == COMMITTED:
        if V.xmin_commit_time > T.snapshot_time:
            return false  // Committed after our snapshot started

    // Step 2: Check if version has been superseded
    if V.xmax == 0 or V.xmax == NULL:
        return true  // Current version - not yet superseded

    deleter_status = get_transaction_status(V.xmax)

    if deleter_status == ABORTED:
        return true  // Delete was rolled back - version still valid

    if deleter_status == IN_PROGRESS:
        if V.xmax == T.id:
            return false  // We deleted it ourselves
        else:
            return true   // Another transaction's uncommitted delete - still visible to us

    if deleter_status == COMMITTED:
        if V.xmax_commit_time > T.snapshot_time:
            return true   // Deleted after our snapshot - still visible to us
        else:
            return false  // Deleted before our snapshot - not visible

    return false  // Should not reach here

// Helper: Get transaction commit status
function get_transaction_status(txn_id: TransactionID) -> Status:
    // Different implementations:
    //   PostgreSQL: Check pg_xact (clog)
    //   InnoDB:     Check undo log headers + active transaction list
    //   Oracle:     Check rollback segment transaction table
    return lookup_transaction_status(txn_id)
```

PostgreSQL optimization: rather than consulting pg_xact for every visibility check, PostgreSQL can set "hint bits" directly in the tuple header once a transaction's fate is known. This trades a small amount of write I/O (marking the hint) for significant read savings on subsequent accesses.
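The hint-bit idea can be modeled in a few lines. This is a toy sketch: real hint bits are flags in the tuple's infomask, and `clog_status` here stands in for the pg_xact lookup.

```python
# Toy model of hint-bit caching (names are illustrative, not PostgreSQL's).
clog = {100: "committed", 150: "aborted"}   # authoritative, but "slow" to consult
clog_lookups = 0

def clog_status(txid):
    global clog_lookups
    clog_lookups += 1          # count the expensive lookups
    return clog[txid]

def creator_status(tuple_hdr):
    # Fast path: a previous reader already stamped the hint.
    if "hint_xmin_status" in tuple_hdr:
        return tuple_hdr["hint_xmin_status"]
    status = clog_status(tuple_hdr["xmin"])
    if status in ("committed", "aborted"):       # fate is final -> safe to cache
        tuple_hdr["hint_xmin_status"] = status   # small write, big read savings
    return status

t = {"xmin": 100, "data": {"balance": 1000}}
creator_status(t)   # first read consults the clog
creator_status(t)   # subsequent reads hit the cached hint
```

After the first read, `clog_lookups` stays at 1 no matter how many more readers check this tuple, which is precisely the trade the text describes.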
When multiple versions of the same logical row exist, they must be connected in a way that allows efficient traversal. The version chain is the data structure that links these versions together. Different database systems organize version chains differently, with significant implications for performance.
Key Design Decisions:
| Architecture | Chain Direction | Primary Table Content | Example Systems |
|---|---|---|---|
| Append-Only (O2N) | Oldest → Newest | All versions (newest is latest) | PostgreSQL |
| Append-Only (N2O) | Newest → Oldest | All versions (oldest at head) | Some research systems |
| Undo Log / Delta (N2O) | Newest → Oldest | Only newest version | InnoDB, Oracle, SQL Server |
| Time-Travel Table | Separate index | Current only; history in separate table | Temporal table implementations |
Append-Only / Heap-Based (PostgreSQL Model):
In PostgreSQL's architecture, when a row is updated:
- A new tuple containing the updated values is inserted into the heap (xmin = updating transaction's ID)
- The old tuple is stamped as superseded (xmax = updating transaction's ID)
- The old tuple's forward pointer (ctid) is set to point to the new tuple

This creates a chain where traversal goes from old to new (O2N):
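Traversal of such an O2N chain can be sketched as follows. This is an illustrative Python model using the transaction IDs from this example; it simplifies real snapshot checks by assuming every transaction with an ID at or below the snapshot's has committed.

```python
# Each tuple records its creator (xmin), superseder (xmax, None if current),
# and a forward pointer to the next version (the ctid chain).
chain = [
    {"tid": (1, 1), "xmin": 100, "xmax": 150,  "balance": 1000, "next": (1, 3)},
    {"tid": (1, 3), "xmin": 150, "xmax": 200,  "balance": 900,  "next": (2, 1)},
    {"tid": (2, 1), "xmin": 200, "xmax": None, "balance": 850,  "next": None},
]
by_tid = {t["tid"]: t for t in chain}

def visible_version(head_tid, snapshot_xid):
    """Follow the O2N chain from the oldest version forward, returning the
    version whose creator committed before the snapshot and whose superseder
    (if any) did not. Assumes all xmin/xmax transactions have committed."""
    t = by_tid[head_tid]
    while True:
        created_before = t["xmin"] <= snapshot_xid
        deleted_before = t["xmax"] is not None and t["xmax"] <= snapshot_xid
        if created_before and not deleted_before:
            return t
        if t["next"] is None:
            return None  # no version visible to this snapshot
        t = by_tid[t["next"]]

print(visible_version((1, 1), snapshot_xid=160)["balance"])  # → 900
```

Note the cost this implies: a reader that wants the current version must walk past every older version first, which is one reason the O2N layout pairs with aggressive pruning.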
```
PostgreSQL Heap-Based Version Chain (O2N)

Initial State - Row inserted by Txn 100:

  Heap Page 1
  ┌────────────────────────────────────────────┐
  │ Tuple @ (1,1)                              │
  │ xmin: 100   xmax: 0   ctid: (1,1)          │
  │ data: {name: "Alice", balance: 1000}       │
  └────────────────────────────────────────────┘

After UPDATE by Txn 150 (balance = 900):

  Heap Page 1
  ┌────────────────────────────────────────────┐
  │ Tuple @ (1,1)  [OLD VERSION]               │
  │ xmin: 100   xmax: 150   ctid: (1,3) ──► to new
  │ data: {name: "Alice", balance: 1000}       │
  └────────────────────────────────────────────┘
  ┌────────────────────────────────────────────┐
  │ Tuple @ (1,3)  [NEW VERSION - CURRENT]     │
  │ xmin: 150   xmax: 0   ctid: (1,3)          │
  │ data: {name: "Alice", balance: 900}        │
  └────────────────────────────────────────────┘

After another UPDATE by Txn 200 (balance = 850):

  Heap Page 1                               Heap Page 2
  ┌───────────────────────────────────┐
  │ (1,1)  xmin:100  xmax:150         │
  │ ctid:(1,3)   balance: 1000        │
  └──────────────┬────────────────────┘
                 ▼
  ┌───────────────────────────────────┐     ┌─────────────────────────┐
  │ (1,3)  xmin:150  xmax:200         │ ──► │ (2,1)  [CURRENT]        │
  │ ctid:(2,1)   balance: 900         │     │ xmin:200  xmax:0        │
  └───────────────────────────────────┘     │ balance: 850            │
                                            └─────────────────────────┘

  Version Chain:  (1,1) ──► (1,3) ──► (2,1)
                   100       150       200    (Transaction IDs)
```

Undo Log-Based (InnoDB Model):
InnoDB takes a fundamentally different approach. The main table (clustered index) always contains the newest version. When an update occurs:
- The about-to-be-overwritten values are copied into an undo log record
- The row in the clustered index is updated in place, with DB_TRX_ID set to the updater's ID
- The row's DB_ROLL_PTR is pointed at the new undo record, which in turn links to any earlier undo records

This creates a Newest-to-Oldest (N2O) chain:
```
InnoDB Undo Log-Based Version Chain (N2O)

Initial State - Row inserted by Trx 100:

  Clustered Index (Primary Key)
  ┌────────────────────────────────────────────┐
  │ Row PK=1                                   │
  │ DB_TRX_ID: 100                             │
  │ DB_ROLL_PTR: NULL (no previous version)    │
  │ data: {name: "Alice", balance: 1000}       │
  └────────────────────────────────────────────┘

After UPDATE by Trx 150 (balance = 900):

  Clustered Index
  ┌────────────────────────────────────────────┐
  │ Row PK=1  [CURRENT VERSION]                │
  │ DB_TRX_ID: 150                             │
  │ DB_ROLL_PTR: ──────────────────┐           │
  │ data: {name: "Alice", balance: 900}        │
  └────────────────────────────────│───────────┘
                                   ▼
  Undo Log (Rollback Segment)
  ┌────────────────────────────────────────────┐
  │ Undo Record                                │
  │ TRX_ID: 100                                │
  │ Previous Undo: NULL                        │
  │ Old values: {balance: 1000}                │
  └────────────────────────────────────────────┘

After another UPDATE by Trx 200 (balance = 850):

  Clustered Index
  ┌────────────────────────────────────────────┐
  │ Row PK=1  [CURRENT VERSION]                │
  │ DB_TRX_ID: 200   DB_ROLL_PTR: ──► Undo #2  │
  │ data: {balance: 850}                       │
  └────────────────────────────────────────────┘

  Undo Log
  ┌────────────────────────────────────────────┐
  │ Undo Record #2 (Trx 200's changes)         │
  │ TRX_ID: 150                                │
  │ Previous Undo: ──► Undo #1                 │
  │ Old values: {balance: 900}                 │
  └────────────────────────────────────────────┘
  ┌────────────────────────────────────────────┐
  │ Undo Record #1 (Trx 150's changes)         │
  │ TRX_ID: 100                                │
  │ Previous Undo: NULL                        │
  │ Old values: {balance: 1000}                │
  └────────────────────────────────────────────┘

  Version Chain:  Current ──► Undo#2 ──► Undo#1 ──► (original insert)
                    200         150        100
```

N2O chains (InnoDB) optimize for current-version reads: most queries want the newest data, which is found immediately in the main table. O2N chains (PostgreSQL) may require traversing the chain to find the current version, but index maintenance is more complex in N2O systems.
Secondary indexes create significant complexity in MVCC systems. The fundamental challenge: an index entry must point to a row, but multiple versions of that row exist. Different architectures handle this differently, with major performance implications.
The PostgreSQL Approach: Indexes Point to Any Version
In PostgreSQL, secondary indexes point to heap tuple locations (a TID: block number plus item offset). When a row is updated, the new version is written at a new TID.

This means:

- In the general case, every index on the table needs a new entry pointing at the new location, even indexes on columns that did not change
- Old index entries continue to point at the dead version until VACUUM removes them, so heavy updates bloat every index
- An index lookup may land on a dead tuple, requiring a visibility check against the heap
HOT (Heap-Only Tuples) Optimization:
PostgreSQL's HOT optimization addresses index bloat when the updated columns are not indexed and the new tuple fits on the same heap page.
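HOT's eligibility condition can be sketched as a small predicate. This is illustrative Python; the real check lives inside PostgreSQL's heap-update path and is more involved.

```python
def hot_update_possible(changed_columns, indexed_columns,
                        page_free_bytes, new_tuple_bytes):
    """A HOT update is possible only when no indexed column changed and the
    new tuple fits on the same heap page as the old one (simplified model)."""
    no_indexed_change = not (set(changed_columns) & set(indexed_columns))
    fits_on_same_page = new_tuple_bytes <= page_free_bytes
    return no_indexed_change and fits_on_same_page

# UPDATE users SET balance = 900 ... : 'balance' is not indexed
print(hot_update_possible({"balance"}, {"name"}, page_free_bytes=512, new_tuple_bytes=120))  # True
# Changing the indexed 'name' column disqualifies the update from HOT
print(hot_update_possible({"name"}, {"name"}, page_free_bytes=512, new_tuple_bytes=120))     # False
```

When the predicate fails, PostgreSQL falls back to a regular update, and every index receives a new entry.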
```
PostgreSQL HOT (Heap-Only Tuple) Update

Scenario: Table users (id PK, name indexed, balance NOT indexed)
UPDATE users SET balance = 900 WHERE id = 1;

Without HOT (indexed column change or no room in page):

  Index on 'name'
  "Alice" → (page 1, item 1)   [OLD]
  "Alice" → (page 2, item 5)   [NEW]  ←── New index entry needed!

With HOT (non-indexed column change, room in same page):

  Index on 'name'
  "Alice" → (page 1, item 1)   ← Index unchanged!
                 │
                 ▼
  Heap Page 1
  ┌──────────────────────────────┐
  │ Item 1 (OLD)                 │
  │ xmax: 150                    │
  │ t_ctid: (1, 3) ──────────┐   │
  └──────────────────────────│───┘
                             ▼
  ┌──────────────────────────────┐
  │ Item 3 (NEW - CURRENT)       │
  │ xmax: 0                      │
  │ HEAP_ONLY flag set           │  ← Not directly indexed
  └──────────────────────────────┘

HOT chain pruning: when the page is accessed, old HOT versions
can be reclaimed immediately, without waiting for a full VACUUM.
```

The InnoDB Approach: Indexes Point to Primary Key
InnoDB takes a different approach. Secondary indexes don't store pointers to physical row locations—they store the primary key value of the row.
When reading via a secondary index, InnoDB first finds the primary key value in the secondary index, then performs a second lookup in the clustered index to fetch the row itself; version visibility is resolved through the clustered index record and its undo chain.

This has interesting implications:

- Updates that leave indexed columns unchanged require no secondary index maintenance, so the indexes never bloat with dead row pointers
- Every secondary index read pays for an extra clustered index traversal (the "double lookup")
- Secondary index records are not versioned themselves, so InnoDB may need to visit the clustered index just to decide whether an entry is visible
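The double lookup can be sketched as two dictionary hops. This is an illustrative model, not InnoDB's actual structures:

```python
# Toy model of InnoDB's secondary-index read path: the secondary index maps an
# indexed value to primary keys, and the clustered index maps each primary key
# to the current row (names are illustrative).
clustered_index = {1: {"name": "Alice", "balance": 850}}
name_index = {"Alice": [1]}   # secondary index stores PKs, not row locations

def lookup_by_name(name):
    rows = []
    for pk in name_index.get(name, []):   # hop 1: secondary index -> primary key
        rows.append(clustered_index[pk])  # hop 2: primary key -> current row
    return rows
```

Because the secondary index holds only primary keys, an update that changes `balance` touches the clustered index and the undo log but leaves `name_index` untouched.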
Every version goes through a predictable lifecycle, from creation during a transaction to eventual garbage collection. Understanding this lifecycle is essential for both performance tuning and troubleshooting.
Phase 1: Creation
A version is born when a transaction performs an INSERT or UPDATE:

- The creating transaction's ID is stamped into the version (xmin in PostgreSQL, DB_TRX_ID in InnoDB)
- The superseded marker is initialized to indicate no successor (xmax = 0 or NULL)
- The row content is written: as a new heap tuple in PostgreSQL, or in place with the old values copied to the undo log in InnoDB
At creation time, the version is only visible to the transaction that created it. The version's metadata reflects the in-progress status.
Phase 2: Visibility Window
After the creating transaction commits, the version becomes visible to new transactions. It remains visible until superseded by a newer version.
The version's visibility window is defined by:

- Start: the commit of the creating transaction (xmin's commit)
- End: the commit of the superseding transaction (xmax's commit), if any
Phase 3: Obsolescence
A version becomes obsolete when it's no longer visible to any active or future transaction: the superseding transaction has committed, and the oldest active snapshot in the system began after that commit, so no current or future reader can ever need this version.
Phase 4: Garbage Collection
Once obsolete, the version can be reclaimed. The timing and mechanism depend on the system:
| Phase | PostgreSQL | InnoDB/MySQL |
|---|---|---|
| Creation | Insert new heap tuple | Modify row, copy old to undo log |
| Visibility Start | Creating txn commits, hint bits set | Creating trx commits |
| Obsolescence Detection | xmax committed AND xmax < oldest active snapshot | Read view no longer references this version |
| GC Mechanism | VACUUM process scans heap | Purge background thread |
| GC Trigger | Manual, autovacuum thresholds, or wraparound prevention | Continuous background purge |
| Space Reclamation | Clears dead tuples, updates FSM | Undo log space reused, undo tablespace may shrink |
```
// Version Lifecycle State Machine

enum VersionState {
    IN_PROGRESS,  // Created by active transaction
    COMMITTED,    // Creating transaction committed - visible to new transactions
    OBSOLETE,     // Superseded and no longer needed by any transaction
    RECLAIMED     // Space has been reclaimed by GC
}

class Version {
    state: VersionState = IN_PROGRESS
    created_by: Transaction
    superseded_by: Transaction = null

    // Transition: IN_PROGRESS → COMMITTED
    on_creator_commit():
        this.state = COMMITTED
        // Version now visible to new transactions starting after commit

    // Transition: COMMITTED → OBSOLETE
    on_obsolescence_detected():
        // Happens when:
        //   1. superseded_by has committed, AND
        //   2. oldest_active_snapshot > max(created_by.commit_time, superseded_by.commit_time)
        this.state = OBSOLETE
        // Version can now be garbage collected

    // Transition: OBSOLETE → RECLAIMED
    on_garbage_collection():
        // PostgreSQL: VACUUM removes tuple, updates visibility map, FSM
        // InnoDB: Purge thread frees undo log record
        this.state = RECLAIMED
        free_storage(this)
}

// Determining if version is obsolete
function is_version_obsolete(v: Version) -> Boolean:
    if v.state != COMMITTED:
        return false
    if v.superseded_by == null:
        return false  // Still the current version
    if not v.superseded_by.is_committed():
        return false  // Newer version not yet permanent

    oldest_snapshot = get_oldest_active_snapshot_timestamp()

    // If oldest active snapshot started after the superseder committed,
    // no one can ever need this version again
    return oldest_snapshot > v.superseded_by.commit_time
```

A critical implication: if a transaction holds a snapshot for a long time (e.g., a 3-hour analytical query), ALL versions created after that snapshot started cannot be garbage collected, even if they've been superseded millions of times. This is a common cause of table bloat and performance degradation in production systems.
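The long-snapshot problem can be demonstrated with a tiny simulation. This is an illustrative model: `reclaimable` applies the obsolescence rule (a version may go only once the oldest active snapshot began after its superseder committed).

```python
# Each dead version records when its superseder committed.
dead_versions = [
    {"row": "A", "superseder_commit": 105},
    {"row": "A", "superseder_commit": 250},
    {"row": "A", "superseder_commit": 900},
]

def reclaimable(versions, active_snapshot_starts):
    """Versions that no active snapshot can still see: their superseder
    committed before the OLDEST active snapshot began."""
    oldest = min(active_snapshot_starts)   # one old snapshot dominates everything
    return [v for v in versions if v["superseder_commit"] < oldest]

# While a long analytical query's snapshot (started at t=100) is alive,
# nothing superseded since then can be reclaimed:
print(len(reclaimable(dead_versions, [100, 800, 950])))  # → 0
# Once it finishes, the backlog becomes collectible:
print(len(reclaimable(dead_versions, [800, 950])))       # → 2
```

The `min()` is the whole story: garbage collection progress is gated by the single oldest snapshot, not by the average workload.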
Garbage collection (GC) in MVCC systems is a critical background process that determines long-term system health. Without effective GC, version storage grows unboundedly, leading to storage exhaustion, performance degradation, and eventual system failure.
PostgreSQL VACUUM:
PostgreSQL's VACUUM process is the primary garbage collection mechanism. It operates in several modes:

- Standard VACUUM: marks dead tuple space reusable within the table, without blocking writes
- VACUUM ANALYZE: reclaims space and refreshes planner statistics
- VACUUM FULL: rewrites the table to return space to the operating system, at the cost of an exclusive lock
- Autovacuum: background workers that run standard vacuums automatically when dead-tuple thresholds are crossed
VACUUM must also address transaction ID wraparound: PostgreSQL uses 32-bit transaction IDs that can wrap around after ~4 billion transactions. VACUUM freezes old tuples to prevent visibility issues.
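Wraparound safety rests on circular comparison of 32-bit XIDs. Here is a sketch of the signed-difference trick, modeled on PostgreSQL's TransactionIdPrecedes and simplified to ignore the reserved special XIDs:

```python
def xid_precedes(a: int, b: int) -> bool:
    """Circular comparison of 32-bit transaction IDs: a precedes b if the
    signed 32-bit difference (a - b) is negative. Each XID is therefore
    'older' than the ~2 billion IDs ahead of it and 'newer' than the
    ~2 billion behind it, which is why tuples must be frozen before their
    creator falls roughly 2 billion XIDs into the past."""
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 0x80000000:          # reinterpret as a negative signed 32-bit value
        diff -= 0x100000000
    return diff < 0

print(xid_precedes(100, 200))             # True: plainly older
print(xid_precedes(4_000_000_000, 100))   # True: XID 100 has wrapped past it
```

The second call shows the danger: after wraparound, a very old unfrozen XID suddenly compares as being in the future, which is exactly the visibility corruption freezing prevents.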
```sql
-- PostgreSQL VACUUM Operations

-- Standard VACUUM: reclaims dead tuple space within table
VACUUM my_table;
-- Table size may not shrink - space is marked reusable
-- Fast, doesn't lock the table for writes

-- VACUUM ANALYZE: reclaims space AND updates statistics
VACUUM ANALYZE my_table;
-- Critical for query planner to make good decisions

-- VACUUM FULL: rewrites entire table, returns space to OS
VACUUM FULL my_table;
-- Requires exclusive lock - blocks reads and writes!
-- Use only when significant space reclamation needed

-- Check vacuum stats for a table
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) as dead_ratio,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    vacuum_count,
    autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'my_table';

-- Autovacuum configuration parameters
-- autovacuum_vacuum_threshold = 50 (minimum dead tuples before vacuum)
-- autovacuum_vacuum_scale_factor = 0.2 (fraction of table that must be dead)
-- autovacuum_naptime = 1min (sleep time between runs)

-- A table with 100,000 rows triggers autovacuum when:
-- dead_tuples > 50 + (0.2 * 100,000) = 20,050 dead tuples

-- Monitoring for bloat
SELECT
    current_setting('autovacuum_vacuum_scale_factor') as scale_factor,
    current_setting('autovacuum_vacuum_threshold') as threshold;

-- Check for tables needing vacuum
SELECT
    schemaname || '.' || relname as table_name,
    n_dead_tup as dead_tuples,
    n_live_tup as live_tuples,
    n_dead_tup::float / NULLIF(n_live_tup, 0) * 100 as dead_percentage,
    last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
```

InnoDB Purge:
InnoDB uses a different approach. The purge system runs continuously as a background thread (or multiple threads in recent versions):

- It works through the history list of undo records left behind by committed transactions
- It physically removes rows that were delete-marked, along with their index entries, once no read view can see them
- It frees undo log records that no active read view still needs
Unlike PostgreSQL's VACUUM, InnoDB's purge doesn't affect the main tablespace—it only reclaims undo log space. The main table stays compact because only the current version is stored there.
| Aspect | PostgreSQL VACUUM | InnoDB Purge |
|---|---|---|
| What is cleaned | Dead heap tuples + index entries | Undo log records |
| Main table impact | Creates free space within table pages | No impact (main table always has current only) |
| Triggering | Manual, autovacuum thresholds | Continuous background |
| Blocking | Standard: none; FULL: exclusive lock | None (background) |
| Monitoring | pg_stat_user_tables, pg_stat_progress_vacuum | SHOW ENGINE INNODB STATUS (history length) |
| Failure mode | Table bloat, txid wraparound | Undo tablespace growth, long row lookups |
In PostgreSQL: Monitor n_dead_tup and table size growth. In InnoDB: Watch 'History list length' in SHOW ENGINE INNODB STATUS—values over 1 million indicate purge lag. In both: Identify long-running transactions that prevent GC progress.
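A monitoring script might extract that value directly. The "History list length" line is real InnoDB status output; the status snippet, threshold, and function here are illustrative.

```python
import re

# Abbreviated, hypothetical excerpt of SHOW ENGINE INNODB STATUS output.
STATUS_SNIPPET = """
------------
TRANSACTIONS
------------
Trx id counter 1827140
History list length 1523987
"""

def purge_lagging(status_text: str, threshold: int = 1_000_000) -> bool:
    """Flag purge lag when the undo history list exceeds a threshold
    (the one-million figure is a common rule of thumb; tune for your workload)."""
    m = re.search(r"History list length (\d+)", status_text)
    return m is not None and int(m.group(1)) > threshold

print(purge_lagging(STATUS_SNIPPET))  # → True: ~1.5M unpurged undo records
```

A sustained True here usually traces back to the same root cause as PostgreSQL bloat: a long-running transaction pinning old versions.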
MVCC's version storage creates measurable overhead compared to single-version systems. Understanding this overhead is crucial for capacity planning and architecture decisions.
Per-Version Overhead:
Each version carries metadata beyond the row data itself:
| System | Metadata per Tuple | Overhead Estimate |
|---|---|---|
| PostgreSQL | xmin (4B) + xmax (4B) + cid (4B) + ctid (6B) + infomask fields (4B) + header offset (1B) | ~23 bytes + alignment padding |
| InnoDB | DB_TRX_ID (6B) + DB_ROLL_PTR (7B) + implicit row ID if no PK (6B) | ~13-19 bytes |
| Oracle | ITL (interested transaction list) + row header | Varies by block size |
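A quick back-of-envelope check of what those headers cost relative to row size, using the approximate per-version constants from the table:

```python
def overhead_pct(row_bytes: int, header_bytes: int) -> float:
    """Per-version metadata as a percentage of the total stored size."""
    return 100.0 * header_bytes / (row_bytes + header_bytes)

# A 100-byte row under each system's approximate per-version header:
for system, hdr in [("PostgreSQL", 23), ("InnoDB", 13)]:
    print(f"{system}: {overhead_pct(100, hdr):.1f}% of each stored version is metadata")
```

For wide rows the percentage shrinks toward zero; for narrow rows (a few small columns) metadata can rival the data itself, which is worth remembering when designing skinny, update-heavy tables.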
Version Amplification:
The more significant overhead comes from version amplification—storing multiple complete versions of rows.
Scenario: A table has 1 million rows, each 100 bytes. A batch process updates every row once per day.
PostgreSQL (Append-Only):

- Each daily update writes a complete new 100-byte tuple, so the table temporarily holds two full copies of every row (~200 MB instead of ~100 MB) until VACUUM reclaims the dead versions
- Indexes accumulate entries for the dead tuples as well, unless HOT applies

InnoDB (Undo Log):

- The main table is updated in place and stays at ~100 MB
- The before-images live in the undo log and are purged shortly after the read views that need them are gone
For update-heavy workloads with delays in GC, PostgreSQL can experience significant temporary bloat that InnoDB avoids.
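The scenario's arithmetic, spelled out (the stated 1 million rows at 100 bytes; tuple headers and index bloat ignored for simplicity):

```python
ROWS, ROW_BYTES = 1_000_000, 100
base_mb = ROWS * ROW_BYTES / 1_000_000   # 100 MB of live data

def postgres_table_mb(updates_per_row_since_vacuum: int) -> float:
    """Append-only model: every update adds a full dead tuple to the heap
    until VACUUM runs."""
    return base_mb * (1 + updates_per_row_since_vacuum)

def innodb_table_mb() -> float:
    """Undo-based model: the clustered index holds only the current version;
    before-images accumulate in the undo log instead."""
    return base_mb

print(postgres_table_mb(1), innodb_table_mb())   # → 200.0 100.0
```

The same arithmetic explains why a skipped or delayed vacuum is so costly: two daily batch runs between vacuums put the heap at three times its live size.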
```sql
-- PostgreSQL: Analyzing version storage overhead

-- Calculate table bloat estimate
SELECT
    schemaname,
    relname,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as table_size,
    pg_size_pretty(pg_indexes_size(relid)) as indexes_size,
    n_live_tup,
    n_dead_tup,
    ROUND(
        n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2
    ) as dead_pct,
    -- Estimate actual data size vs allocated
    ROUND(
        (n_live_tup * 100)::numeric /
        (pg_relation_size(relid) / 8192.0 * 200), 2  -- Assume 200 rows/page at fill factor
    ) as fill_efficiency_pct
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(relid) DESC;

-- More accurate bloat calculation using pgstattuple
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT
    table_len,
    tuple_count,
    tuple_len,
    dead_tuple_count,
    dead_tuple_len,
    free_space,
    free_percent
FROM pgstattuple('my_table');

-- InnoDB: Check undo tablespace usage
-- (Run in MySQL/MariaDB)
-- SHOW ENGINE INNODB STATUS;  -- Look for "History list length"
--
-- SELECT
--     TOTAL_EXTENTS * 1048576 / 1024 / 1024 as undo_mb
-- FROM INFORMATION_SCHEMA.FILES
-- WHERE FILE_TYPE = 'UNDO LOG';
```

For PostgreSQL: plan for 2-3x table size under normal operations to account for version bloat between vacuums. For InnoDB: plan undo tablespace for peak transaction length × peak concurrent transactions × average row size. Both: always monitor actual usage patterns in production.
We've explored the intricate mechanics of how MVCC systems create, organize, and eventually reclaim multiple versions of data. This knowledge is essential for understanding system behavior and making informed operational decisions.
Looking Ahead:
With a solid understanding of how multiple versions are managed, we're ready to explore read consistency—how transactions determine exactly which version to read to achieve consistent, repeatable results. This is where MVCC's theoretical guarantees become practical reality.
You now understand the structural foundations of MVCC: version metadata, chain architectures, index interactions, lifecycle phases, garbage collection mechanisms, and storage overhead. Next, we'll see how these versions enable consistent reads across transactions.