The concept of maintaining multiple versions sounds straightforward, but the implementation details determine whether an MVCC system is practical for production use. This page explores the structural and operational aspects of version management: how versions are physically organized, what metadata they carry, how they form chains of history, and how the database eventually reclaims obsolete versions.
Every time you update a row in a PostgreSQL table, modify a document in MongoDB, or run an UPDATE statement through InnoDB, the database is creating, organizing, and managing multiple versions of your data. Understanding this mechanism is essential for diagnosing bloat, tuning garbage collection, and predicting how storage behaves under load.
By the end of this page, you will understand how version chains are structured, what metadata each version carries, how different storage models organize versions, and how garbage collection reclaims obsolete versions. You'll gain insight into the engineering tradeoffs that shape real-world MVCC implementations.
Every version of a data item must carry sufficient metadata for the database to answer two critical questions:

- Which transaction created this version, and had it committed before my snapshot?
- Has this version been superseded, and if so, had the superseding transaction committed before my snapshot?
These questions determine visibility: whether a given transaction should see this version or skip to the next in the chain.
Essential Version Metadata:
While specific implementations vary, most MVCC systems track similar information for each version:
| Metadata Field | Purpose | Example Values |
|---|---|---|
| Xmin / Created-By | Transaction ID that created this version | 10042 (PostgreSQL), TRX_ID (InnoDB) |
| Xmax / Deleted-By | Transaction ID that deleted/superseded this version (0 or NULL if still current) | 10057 or NULL |
| Commit Timestamp | Logical or physical time when creating transaction committed | 2024-01-15T14:30:00 or LSN 0x5F3A |
| Version Pointer | Link to previous or next version in the chain | Heap location or undo segment offset |
| Transaction Status | Whether creating transaction committed, aborted, or in-progress | Committed, Aborted, In-Progress |
| Row Content | The actual data values for this version | Column values, BLOB references, etc. |
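As a mental model, the metadata above can be captured in a small record type. This is a sketch: the field names loosely follow PostgreSQL's, not any engine's actual on-disk layout.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Version:
    xmin: int                                 # transaction ID that created this version
    xmax: Optional[int] = None                # transaction ID that superseded it (None if current)
    commit_ts: Optional[int] = None           # logical commit time of the creator
    next_version: Optional["Version"] = None  # link in the version chain
    row: dict = field(default_factory=dict)   # the actual column values

    def is_current(self) -> bool:
        # A version with no superseder is the current one.
        return self.xmax is None

v = Version(xmin=10042, row={"name": "Alice", "balance": 1000})
```

The two visibility questions map directly onto `xmin`/`xmax` plus the commit status of those transactions, which is exactly what the visibility check below consults.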
The Visibility Challenge:
Determining whether a version is visible requires knowing the status of the transactions that created and potentially deleted it. This creates an interesting bootstrapping problem: the version exists in storage, but its visibility depends on transaction state that may be stored elsewhere.
Databases solve this in different ways:
- PostgreSQL maintains a dedicated commit log, pg_xact (formerly clog), to track transaction commit status; a visibility check requires consulting this structure
- InnoDB consults undo log headers and its list of active transactions
- Oracle checks the transaction table stored in the rollback segment header
```
// Determining Version Visibility
// Given: Transaction T with snapshot timestamp S
// Version V of data item with metadata

function is_version_visible(T: Transaction, V: Version) -> Boolean:
    // Step 1: Check if creator transaction has committed
    creator_status = get_transaction_status(V.xmin)

    if creator_status == ABORTED:
        return false  // Created by aborted transaction - never visible

    if creator_status == IN_PROGRESS:
        if V.xmin == T.id:
            return true   // Transaction sees its own uncommitted writes
        else:
            return false  // Created by another in-progress transaction

    // Creator committed - check if commit was before our snapshot
    if creator_status == COMMITTED:
        if V.xmin_commit_time > T.snapshot_time:
            return false  // Committed after our snapshot started

    // Step 2: Check if version has been superseded
    if V.xmax == 0 or V.xmax == NULL:
        return true  // Current version - not yet superseded

    deleter_status = get_transaction_status(V.xmax)

    if deleter_status == ABORTED:
        return true  // Delete was rolled back - version still valid

    if deleter_status == IN_PROGRESS:
        if V.xmax == T.id:
            return false  // We deleted it ourselves
        else:
            return true   // Another transaction's uncommitted delete - still visible to us

    if deleter_status == COMMITTED:
        if V.xmax_commit_time > T.snapshot_time:
            return true   // Deleted after our snapshot - still visible to us
        else:
            return false  // Deleted before our snapshot - not visible

    return false  // Should not reach here

// Helper: Get transaction commit status
function get_transaction_status(txn_id: TransactionID) -> Status:
    // Different implementations:
    //   PostgreSQL: Check pg_xact (clog)
    //   InnoDB:     Check undo log headers + active transaction list
    //   Oracle:     Check rollback segment transaction table
    return lookup_transaction_status(txn_id)
```

PostgreSQL optimization: rather than consulting pg_xact for every visibility check, PostgreSQL can set "hint bits" directly in the tuple header once a transaction's fate is known. This trades a small amount of write I/O (marking the hint) for significant read savings on subsequent accesses.
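The hint-bit idea can be modeled in a few lines. This is a toy sketch: real hint bits are flags in the tuple's infomask, and `clog_status` here stands in for the pg_xact lookup.

```python
# Toy model of hint-bit caching (names are illustrative, not PostgreSQL's).
clog = {100: "committed", 150: "aborted"}   # authoritative, but "slow" to consult
clog_lookups = 0

def clog_status(txid):
    global clog_lookups
    clog_lookups += 1          # count the expensive lookups
    return clog[txid]

def creator_status(tuple_hdr):
    # Fast path: a previous reader already stamped the hint.
    if "hint_xmin_status" in tuple_hdr:
        return tuple_hdr["hint_xmin_status"]
    status = clog_status(tuple_hdr["xmin"])
    if status in ("committed", "aborted"):       # fate is final -> safe to cache
        tuple_hdr["hint_xmin_status"] = status   # small write, big read savings
    return status

t = {"xmin": 100, "data": {"balance": 1000}}
creator_status(t)   # first read consults the clog
creator_status(t)   # subsequent reads hit the cached hint
```

After the first read, `clog_lookups` stays at 1 no matter how many more readers check this tuple, which is precisely the trade the text describes.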
When multiple versions of the same logical row exist, they must be connected in a way that allows efficient traversal. The version chain is the data structure that links these versions together. Different database systems organize version chains differently, with significant implications for performance.
Key Design Decisions:
| Architecture | Chain Direction | Primary Table Content | Example Systems |
|---|---|---|---|
| Append-Only (O2N) | Oldest → Newest | All versions (newest is latest) | PostgreSQL |
| Append-Only (N2O) | Newest → Oldest | All versions (oldest at head) | Some research systems |
| Undo Log / Delta (N2O) | Newest → Oldest | Only newest version | InnoDB, Oracle, SQL Server |
| Time-Travel Table | Separate index | Current only; history in separate table | Temporal table implementations |
Append-Only / Heap-Based (PostgreSQL Model):
In PostgreSQL's architecture, when a row is updated:
- A new tuple containing the updated values is inserted into the heap (xmin = updating transaction's ID)
- The old tuple is stamped as superseded (xmax = updating transaction's ID)
- The old tuple's forward pointer (ctid) is set to point to the new tuple

This creates a chain where traversal goes from old to new (O2N):
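Traversal of such an O2N chain can be sketched as follows. This is an illustrative Python model using the transaction IDs from this example; it simplifies real snapshot checks by assuming every transaction with an ID at or below the snapshot's has committed.

```python
# Each tuple records its creator (xmin), superseder (xmax, None if current),
# and a forward pointer to the next version (the ctid chain).
chain = [
    {"tid": (1, 1), "xmin": 100, "xmax": 150,  "balance": 1000, "next": (1, 3)},
    {"tid": (1, 3), "xmin": 150, "xmax": 200,  "balance": 900,  "next": (2, 1)},
    {"tid": (2, 1), "xmin": 200, "xmax": None, "balance": 850,  "next": None},
]
by_tid = {t["tid"]: t for t in chain}

def visible_version(head_tid, snapshot_xid):
    """Follow the O2N chain from the oldest version forward, returning the
    version whose creator committed before the snapshot and whose superseder
    (if any) did not. Assumes all xmin/xmax transactions have committed."""
    t = by_tid[head_tid]
    while True:
        created_before = t["xmin"] <= snapshot_xid
        deleted_before = t["xmax"] is not None and t["xmax"] <= snapshot_xid
        if created_before and not deleted_before:
            return t
        if t["next"] is None:
            return None  # no version visible to this snapshot
        t = by_tid[t["next"]]

print(visible_version((1, 1), snapshot_xid=160)["balance"])  # → 900
```

Note the cost this implies: a reader that wants the current version must walk past every older version first, which is one reason the O2N layout pairs with aggressive pruning.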
```
PostgreSQL Heap-Based Version Chain (O2N)

Initial State - Row inserted by Txn 100:

  Heap Page 1
  ┌────────────────────────────────────────────┐
  │ Tuple @ (1,1)                              │
  │ xmin: 100   xmax: 0   ctid: (1,1)          │
  │ data: {name: "Alice", balance: 1000}       │
  └────────────────────────────────────────────┘

After UPDATE by Txn 150 (balance = 900):

  Heap Page 1
  ┌────────────────────────────────────────────┐
  │ Tuple @ (1,1)  [OLD VERSION]               │
  │ xmin: 100   xmax: 150   ctid: (1,3) ──► to new
  │ data: {name: "Alice", balance: 1000}       │
  └────────────────────────────────────────────┘
  ┌────────────────────────────────────────────┐
  │ Tuple @ (1,3)  [NEW VERSION - CURRENT]     │
  │ xmin: 150   xmax: 0   ctid: (1,3)          │
  │ data: {name: "Alice", balance: 900}        │
  └────────────────────────────────────────────┘

After another UPDATE by Txn 200 (balance = 850):

  Heap Page 1                               Heap Page 2
  ┌───────────────────────────────────┐
  │ (1,1)  xmin:100  xmax:150         │
  │ ctid:(1,3)   balance: 1000        │
  └──────────────┬────────────────────┘
                 ▼
  ┌───────────────────────────────────┐     ┌─────────────────────────┐
  │ (1,3)  xmin:150  xmax:200         │ ──► │ (2,1)  [CURRENT]        │
  │ ctid:(2,1)   balance: 900         │     │ xmin:200  xmax:0        │
  └───────────────────────────────────┘     │ balance: 850            │
                                            └─────────────────────────┘

  Version Chain:  (1,1) ──► (1,3) ──► (2,1)
                   100       150       200    (Transaction IDs)
```

Undo Log-Based (InnoDB Model):
InnoDB takes a fundamentally different approach. The main table (clustered index) always contains the newest version. When an update occurs:
- The about-to-be-overwritten values are copied into an undo log record
- The row in the clustered index is updated in place, with DB_TRX_ID set to the updater's ID
- The row's DB_ROLL_PTR is pointed at the new undo record, which in turn links to any earlier undo records

This creates a Newest-to-Oldest (N2O) chain:
```
InnoDB Undo Log-Based Version Chain (N2O)

Initial State - Row inserted by Trx 100:

  Clustered Index (Primary Key)
  ┌────────────────────────────────────────────┐
  │ Row PK=1                                   │
  │ DB_TRX_ID: 100                             │
  │ DB_ROLL_PTR: NULL (no previous version)    │
  │ data: {name: "Alice", balance: 1000}       │
  └────────────────────────────────────────────┘

After UPDATE by Trx 150 (balance = 900):

  Clustered Index
  ┌────────────────────────────────────────────┐
  │ Row PK=1  [CURRENT VERSION]                │
  │ DB_TRX_ID: 150                             │
  │ DB_ROLL_PTR: ──────────────────┐           │
  │ data: {name: "Alice", balance: 900}        │
  └────────────────────────────────│───────────┘
                                   ▼
  Undo Log (Rollback Segment)
  ┌────────────────────────────────────────────┐
  │ Undo Record                                │
  │ TRX_ID: 100                                │
  │ Previous Undo: NULL                        │
  │ Old values: {balance: 1000}                │
  └────────────────────────────────────────────┘

After another UPDATE by Trx 200 (balance = 850):

  Clustered Index
  ┌────────────────────────────────────────────┐
  │ Row PK=1  [CURRENT VERSION]                │
  │ DB_TRX_ID: 200   DB_ROLL_PTR: ──► Undo #2  │
  │ data: {balance: 850}                       │
  └────────────────────────────────────────────┘

  Undo Log
  ┌────────────────────────────────────────────┐
  │ Undo Record #2 (Trx 200's changes)         │
  │ TRX_ID: 150                                │
  │ Previous Undo: ──► Undo #1                 │
  │ Old values: {balance: 900}                 │
  └────────────────────────────────────────────┘
  ┌────────────────────────────────────────────┐
  │ Undo Record #1 (Trx 150's changes)         │
  │ TRX_ID: 100                                │
  │ Previous Undo: NULL                        │
  │ Old values: {balance: 1000}                │
  └────────────────────────────────────────────┘

  Version Chain:  Current ──► Undo#2 ──► Undo#1 ──► (original insert)
                    200         150        100
```

N2O chains (InnoDB) optimize for current-version reads: most queries want the newest data, which is found immediately in the main table. O2N chains (PostgreSQL) may require traversing the chain to find the current version, but index maintenance is more complex in N2O systems.
Secondary indexes create significant complexity in MVCC systems. The fundamental challenge: an index entry must point to a row, but multiple versions of that row exist. Different architectures handle this differently, with major performance implications.
The PostgreSQL Approach: Indexes Point to Any Version
In PostgreSQL, secondary indexes point to heap tuple locations (a TID: block number plus item offset). When a row is updated, the new version is written at a new TID.

This means:

- In the general case, every index on the table needs a new entry pointing at the new location, even indexes on columns that did not change
- Old index entries continue to point at the dead version until VACUUM removes them, so heavy updates bloat every index
- An index lookup may land on a dead tuple, requiring a visibility check against the heap
HOT (Heap-Only Tuples) Optimization:
PostgreSQL's HOT optimization addresses index bloat when the updated columns are not indexed and the new tuple fits on the same heap page.
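HOT's eligibility condition can be sketched as a small predicate. This is illustrative Python; the real check lives inside PostgreSQL's heap-update path and is more involved.

```python
def hot_update_possible(changed_columns, indexed_columns,
                        page_free_bytes, new_tuple_bytes):
    """A HOT update is possible only when no indexed column changed and the
    new tuple fits on the same heap page as the old one (simplified model)."""
    no_indexed_change = not (set(changed_columns) & set(indexed_columns))
    fits_on_same_page = new_tuple_bytes <= page_free_bytes
    return no_indexed_change and fits_on_same_page

# UPDATE users SET balance = 900 ... : 'balance' is not indexed
print(hot_update_possible({"balance"}, {"name"}, page_free_bytes=512, new_tuple_bytes=120))  # True
# Changing the indexed 'name' column disqualifies the update from HOT
print(hot_update_possible({"name"}, {"name"}, page_free_bytes=512, new_tuple_bytes=120))     # False
```

When the predicate fails, PostgreSQL falls back to a regular update, and every index receives a new entry.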
```
PostgreSQL HOT (Heap-Only Tuple) Update

Scenario: Table users (id PK, name indexed, balance NOT indexed)
UPDATE users SET balance = 900 WHERE id = 1;

Without HOT (indexed column change or no room in page):

  Index on 'name'
  "Alice" → (page 1, item 1)   [OLD]
  "Alice" → (page 2, item 5)   [NEW]  ←── New index entry needed!

With HOT (non-indexed column change, room in same page):

  Index on 'name'
  "Alice" → (page 1, item 1)   ← Index unchanged!
                 │
                 ▼
  Heap Page 1
  ┌──────────────────────────────┐
  │ Item 1 (OLD)                 │
  │ xmax: 150                    │
  │ t_ctid: (1, 3) ──────────┐   │
  └──────────────────────────│───┘
                             ▼
  ┌──────────────────────────────┐
  │ Item 3 (NEW - CURRENT)       │
  │ xmax: 0                      │
  │ HEAP_ONLY flag set           │  ← Not directly indexed
  └──────────────────────────────┘

HOT chain pruning: when the page is accessed, old HOT versions
can be reclaimed immediately, without waiting for a full VACUUM.
```

The InnoDB Approach: Indexes Point to Primary Key
InnoDB takes a different approach. Secondary indexes don't store pointers to physical row locations—they store the primary key value of the row.
When reading via a secondary index, InnoDB first finds the primary key value in the secondary index, then performs a second lookup in the clustered index to fetch the row itself; version visibility is resolved through the clustered index record and its undo chain.

This has interesting implications:

- Updates that leave indexed columns unchanged require no secondary index maintenance, so the indexes never bloat with dead row pointers
- Every secondary index read pays for an extra clustered index traversal (the "double lookup")
- Secondary index records are not versioned themselves, so InnoDB may need to visit the clustered index just to decide whether an entry is visible
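The double lookup can be sketched as two dictionary hops. This is an illustrative model, not InnoDB's actual structures:

```python
# Toy model of InnoDB's secondary-index read path: the secondary index maps an
# indexed value to primary keys, and the clustered index maps each primary key
# to the current row (names are illustrative).
clustered_index = {1: {"name": "Alice", "balance": 850}}
name_index = {"Alice": [1]}   # secondary index stores PKs, not row locations

def lookup_by_name(name):
    rows = []
    for pk in name_index.get(name, []):   # hop 1: secondary index -> primary key
        rows.append(clustered_index[pk])  # hop 2: primary key -> current row
    return rows
```

Because the secondary index holds only primary keys, an update that changes `balance` touches the clustered index and the undo log but leaves `name_index` untouched.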
Every version goes through a predictable lifecycle, from creation during a transaction to eventual garbage collection. Understanding this lifecycle is essential for both performance tuning and troubleshooting.
Phase 1: Creation
A version is born when a transaction performs an INSERT or UPDATE:

- The creating transaction's ID is stamped into the version (xmin in PostgreSQL, DB_TRX_ID in InnoDB)
- The superseded marker is initialized to indicate no successor (xmax = 0 or NULL)
- The row content is written: as a new heap tuple in PostgreSQL, or in place with the old values copied to the undo log in InnoDB
At creation time, the version is only visible to the transaction that created it. The version's metadata reflects the in-progress status.
Phase 2: Visibility Window
After the creating transaction commits, the version becomes visible to new transactions. It remains visible until superseded by a newer version.
The version's visibility window is defined by:

- Start: the commit of the creating transaction (xmin's commit)
- End: the commit of the superseding transaction (xmax's commit), if any
Phase 3: Obsolescence
A version becomes obsolete when it's no longer visible to any active or future transaction: the superseding transaction has committed, and the oldest active snapshot in the system began after that commit, so no current or future reader can ever need this version.
Phase 4: Garbage Collection
Once obsolete, the version can be reclaimed. The timing and mechanism depend on the system:
| Phase | PostgreSQL | InnoDB/MySQL |
|---|---|---|
| Creation | Insert new heap tuple | Modify row, copy old to undo log |
| Visibility Start | Creating txn commits, hint bits set | Creating trx commits |
| Obsolescence Detection | xmax committed AND xmax < oldest active snapshot | Read view no longer references this version |
| GC Mechanism | VACUUM process scans heap | Purge background thread |
| GC Trigger | Manual, autovacuum thresholds, or wraparound prevention | Continuous background purge |
| Space Reclamation | Clears dead tuples, updates FSM | Undo log space reused, undo tablespace may shrink |
```
// Version Lifecycle State Machine

enum VersionState {
    IN_PROGRESS,  // Created by active transaction
    COMMITTED,    // Creating transaction committed - visible to new transactions
    OBSOLETE,     // Superseded and no longer needed by any transaction
    RECLAIMED     // Space has been reclaimed by GC
}

class Version {
    state: VersionState = IN_PROGRESS
    created_by: Transaction
    superseded_by: Transaction = null

    // Transition: IN_PROGRESS → COMMITTED
    on_creator_commit():
        this.state = COMMITTED
        // Version now visible to new transactions starting after commit

    // Transition: COMMITTED → OBSOLETE
    on_obsolescence_detected():
        // Happens when:
        //   1. superseded_by has committed, AND
        //   2. oldest_active_snapshot > max(created_by.commit_time, superseded_by.commit_time)
        this.state = OBSOLETE
        // Version can now be garbage collected

    // Transition: OBSOLETE → RECLAIMED
    on_garbage_collection():
        // PostgreSQL: VACUUM removes tuple, updates visibility map, FSM
        // InnoDB: Purge thread frees undo log record
        this.state = RECLAIMED
        free_storage(this)
}

// Determining if version is obsolete
function is_version_obsolete(v: Version) -> Boolean:
    if v.state != COMMITTED:
        return false
    if v.superseded_by == null:
        return false  // Still the current version
    if not v.superseded_by.is_committed():
        return false  // Newer version not yet permanent

    oldest_snapshot = get_oldest_active_snapshot_timestamp()

    // If oldest active snapshot started after the superseder committed,
    // no one can ever need this version again
    return oldest_snapshot > v.superseded_by.commit_time
```

A critical implication: if a transaction holds a snapshot for a long time (e.g., a 3-hour analytical query), ALL versions created after that snapshot started cannot be garbage collected, even if they've been superseded millions of times. This is a common cause of table bloat and performance degradation in production systems.
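The long-snapshot problem can be demonstrated with a tiny simulation. This is an illustrative model: `reclaimable` applies the obsolescence rule (a version may go only once the oldest active snapshot began after its superseder committed).

```python
# Each dead version records when its superseder committed.
dead_versions = [
    {"row": "A", "superseder_commit": 105},
    {"row": "A", "superseder_commit": 250},
    {"row": "A", "superseder_commit": 900},
]

def reclaimable(versions, active_snapshot_starts):
    """Versions that no active snapshot can still see: their superseder
    committed before the OLDEST active snapshot began."""
    oldest = min(active_snapshot_starts)   # one old snapshot dominates everything
    return [v for v in versions if v["superseder_commit"] < oldest]

# While a long analytical query's snapshot (started at t=100) is alive,
# nothing superseded since then can be reclaimed:
print(len(reclaimable(dead_versions, [100, 800, 950])))  # → 0
# Once it finishes, the backlog becomes collectible:
print(len(reclaimable(dead_versions, [800, 950])))       # → 2
```

The `min()` is the whole story: garbage collection progress is gated by the single oldest snapshot, not by the average workload.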
Garbage collection (GC) in MVCC systems is a critical background process that determines long-term system health. Without effective GC, version storage grows unboundedly, leading to storage exhaustion, performance degradation, and eventual system failure.
PostgreSQL VACUUM:
PostgreSQL's VACUUM process is the primary garbage collection mechanism. It operates in several modes:

- Standard VACUUM: marks dead tuple space reusable within the table, without blocking writes
- VACUUM ANALYZE: reclaims space and refreshes planner statistics
- VACUUM FULL: rewrites the table to return space to the operating system, at the cost of an exclusive lock
- Autovacuum: background workers that run standard vacuums automatically when dead-tuple thresholds are crossed
VACUUM must also address transaction ID wraparound: PostgreSQL uses 32-bit transaction IDs that can wrap around after ~4 billion transactions. VACUUM freezes old tuples to prevent visibility issues.
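Wraparound safety rests on circular comparison of 32-bit XIDs. Here is a sketch of the signed-difference trick, modeled on PostgreSQL's TransactionIdPrecedes and simplified to ignore the reserved special XIDs:

```python
def xid_precedes(a: int, b: int) -> bool:
    """Circular comparison of 32-bit transaction IDs: a precedes b if the
    signed 32-bit difference (a - b) is negative. Each XID is therefore
    'older' than the ~2 billion IDs ahead of it and 'newer' than the
    ~2 billion behind it, which is why tuples must be frozen before their
    creator falls roughly 2 billion XIDs into the past."""
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 0x80000000:          # reinterpret as a negative signed 32-bit value
        diff -= 0x100000000
    return diff < 0

print(xid_precedes(100, 200))             # True: plainly older
print(xid_precedes(4_000_000_000, 100))   # True: XID 100 has wrapped past it
```

The second call shows the danger: after wraparound, a very old unfrozen XID suddenly compares as being in the future, which is exactly the visibility corruption freezing prevents.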
```sql
-- PostgreSQL VACUUM Operations

-- Standard VACUUM: reclaims dead tuple space within table
VACUUM my_table;
-- Table size may not shrink - space is marked reusable
-- Fast, doesn't lock the table for writes

-- VACUUM ANALYZE: reclaims space AND updates statistics
VACUUM ANALYZE my_table;
-- Critical for query planner to make good decisions

-- VACUUM FULL: rewrites entire table, returns space to OS
VACUUM FULL my_table;
-- Requires exclusive lock - blocks reads and writes!
-- Use only when significant space reclamation needed

-- Check vacuum stats for a table
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) as dead_ratio,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    vacuum_count,
    autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'my_table';

-- Autovacuum configuration parameters
-- autovacuum_vacuum_threshold = 50 (minimum dead tuples before vacuum)
-- autovacuum_vacuum_scale_factor = 0.2 (fraction of table that must be dead)
-- autovacuum_naptime = 1min (sleep time between runs)

-- A table with 100,000 rows triggers autovacuum when:
-- dead_tuples > 50 + (0.2 * 100,000) = 20,050 dead tuples

-- Monitoring for bloat
SELECT
    current_setting('autovacuum_vacuum_scale_factor') as scale_factor,
    current_setting('autovacuum_vacuum_threshold') as threshold;

-- Check for tables needing vacuum
SELECT
    schemaname || '.' || relname as table_name,
    n_dead_tup as dead_tuples,
    n_live_tup as live_tuples,
    n_dead_tup::float / NULLIF(n_live_tup, 0) * 100 as dead_percentage,
    last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
```

InnoDB Purge:
InnoDB uses a different approach. The purge system runs continuously as a background thread (or multiple threads in recent versions):

- It works through the history list of undo records left behind by committed transactions
- It physically removes rows that were delete-marked, along with their index entries, once no read view can see them
- It frees undo log records that no active read view still needs
Unlike PostgreSQL's VACUUM, InnoDB's purge doesn't affect the main tablespace—it only reclaims undo log space. The main table stays compact because only the current version is stored there.
| Aspect | PostgreSQL VACUUM | InnoDB Purge |
|---|---|---|
| What is cleaned | Dead heap tuples + index entries | Undo log records |
| Main table impact | Creates free space within table pages | No impact (main table always has current only) |
| Triggering | Manual, autovacuum thresholds | Continuous background |
| Blocking | Standard: none; FULL: exclusive lock | None (background) |
| Monitoring | pg_stat_user_tables, pg_stat_progress_vacuum | SHOW ENGINE INNODB STATUS (history length) |
| Failure mode | Table bloat, txid wraparound | Undo tablespace growth, long row lookups |
In PostgreSQL: Monitor n_dead_tup and table size growth. In InnoDB: Watch 'History list length' in SHOW ENGINE INNODB STATUS—values over 1 million indicate purge lag. In both: Identify long-running transactions that prevent GC progress.
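A monitoring script might extract that value directly. The "History list length" line is real InnoDB status output; the status snippet, threshold, and function here are illustrative.

```python
import re

# Abbreviated, hypothetical excerpt of SHOW ENGINE INNODB STATUS output.
STATUS_SNIPPET = """
------------
TRANSACTIONS
------------
Trx id counter 1827140
History list length 1523987
"""

def purge_lagging(status_text: str, threshold: int = 1_000_000) -> bool:
    """Flag purge lag when the undo history list exceeds a threshold
    (the one-million figure is a common rule of thumb; tune for your workload)."""
    m = re.search(r"History list length (\d+)", status_text)
    return m is not None and int(m.group(1)) > threshold

print(purge_lagging(STATUS_SNIPPET))  # → True: ~1.5M unpurged undo records
```

A sustained True here usually traces back to the same root cause as PostgreSQL bloat: a long-running transaction pinning old versions.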
MVCC's version storage creates measurable overhead compared to single-version systems. Understanding this overhead is crucial for capacity planning and architecture decisions.
Per-Version Overhead:
Each version carries metadata beyond the row data itself:
| System | Metadata per Tuple | Overhead Estimate |
|---|---|---|
| PostgreSQL | xmin (4B) + xmax (4B) + cid (4B) + ctid (6B) + infomask fields (4B) + header offset (1B) | ~23 bytes + alignment padding |
| InnoDB | DB_TRX_ID (6B) + DB_ROLL_PTR (7B) + implicit row ID if no PK (6B) | ~13-19 bytes |
| Oracle | ITL (interested transaction list) + row header | Varies by block size |
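A quick back-of-envelope check of what those headers cost relative to row size, using the approximate per-version constants from the table:

```python
def overhead_pct(row_bytes: int, header_bytes: int) -> float:
    """Per-version metadata as a percentage of the total stored size."""
    return 100.0 * header_bytes / (row_bytes + header_bytes)

# A 100-byte row under each system's approximate per-version header:
for system, hdr in [("PostgreSQL", 23), ("InnoDB", 13)]:
    print(f"{system}: {overhead_pct(100, hdr):.1f}% of each stored version is metadata")
```

For wide rows the percentage shrinks toward zero; for narrow rows (a few small columns) metadata can rival the data itself, which is worth remembering when designing skinny, update-heavy tables.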
Version Amplification:
The more significant overhead comes from version amplification—storing multiple complete versions of rows.
Scenario: A table has 1 million rows, each 100 bytes. A batch process updates every row once per day.
PostgreSQL (Append-Only):

- Each daily update writes a complete new 100-byte tuple, so the table temporarily holds two full copies of every row (~200 MB instead of ~100 MB) until VACUUM reclaims the dead versions
- Indexes accumulate entries for the dead tuples as well, unless HOT applies

InnoDB (Undo Log):

- The main table is updated in place and stays at ~100 MB
- The before-images live in the undo log and are purged shortly after the read views that need them are gone
For update-heavy workloads with delays in GC, PostgreSQL can experience significant temporary bloat that InnoDB avoids.
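The scenario's arithmetic, spelled out (the stated 1 million rows at 100 bytes; tuple headers and index bloat ignored for simplicity):

```python
ROWS, ROW_BYTES = 1_000_000, 100
base_mb = ROWS * ROW_BYTES / 1_000_000   # 100 MB of live data

def postgres_table_mb(updates_per_row_since_vacuum: int) -> float:
    """Append-only model: every update adds a full dead tuple to the heap
    until VACUUM runs."""
    return base_mb * (1 + updates_per_row_since_vacuum)

def innodb_table_mb() -> float:
    """Undo-based model: the clustered index holds only the current version;
    before-images accumulate in the undo log instead."""
    return base_mb

print(postgres_table_mb(1), innodb_table_mb())   # → 200.0 100.0
```

The same arithmetic explains why a skipped or delayed vacuum is so costly: two daily batch runs between vacuums put the heap at three times its live size.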
```sql
-- PostgreSQL: Analyzing version storage overhead

-- Calculate table bloat estimate
SELECT
    schemaname,
    relname,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as table_size,
    pg_size_pretty(pg_indexes_size(relid)) as indexes_size,
    n_live_tup,
    n_dead_tup,
    ROUND(
        n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2
    ) as dead_pct,
    -- Estimate actual data size vs allocated
    ROUND(
        (n_live_tup * 100)::numeric /
        (pg_relation_size(relid) / 8192.0 * 200), 2  -- Assume 200 rows/page at fill factor
    ) as fill_efficiency_pct
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(relid) DESC;

-- More accurate bloat calculation using pgstattuple
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT
    table_len,
    tuple_count,
    tuple_len,
    dead_tuple_count,
    dead_tuple_len,
    free_space,
    free_percent
FROM pgstattuple('my_table');

-- InnoDB: Check undo tablespace usage
-- (Run in MySQL/MariaDB)
-- SHOW ENGINE INNODB STATUS;  -- Look for "History list length"
--
-- SELECT
--     TOTAL_EXTENTS * 1048576 / 1024 / 1024 as undo_mb
-- FROM INFORMATION_SCHEMA.FILES
-- WHERE FILE_TYPE = 'UNDO LOG';
```

For PostgreSQL: plan for 2-3x table size under normal operations to account for version bloat between vacuums. For InnoDB: plan undo tablespace for peak transaction length × peak concurrent transactions × average row size. Both: always monitor actual usage patterns in production.
We've explored the intricate mechanics of how MVCC systems create, organize, and eventually reclaim multiple versions of data. This knowledge is essential for understanding system behavior and making informed operational decisions.
Looking Ahead:
With a solid understanding of how multiple versions are managed, we're ready to explore read consistency—how transactions determine exactly which version to read to achieve consistent, repeatable results. This is where MVCC's theoretical guarantees become practical reality.
You now understand the structural foundations of MVCC: version metadata, chain architectures, index interactions, lifecycle phases, garbage collection mechanisms, and storage overhead. Next, we'll see how these versions enable consistent reads across transactions.