PostgreSQL's MVCC implementation is one of the most sophisticated and well-documented in the database world. Its design choices, particularly the decision to store all versions in the main heap rather than in separate undo logs, have far-reaching implications for performance, storage, and operational characteristics.

PostgreSQL's MVCC architecture was established in the mid-1990s during the transition from Postgres95 to PostgreSQL. The design prioritizes simplicity and read performance at the cost of more complex vacuum requirements. Understanding this implementation deeply is essential for anyone operating PostgreSQL at scale.

In this page, we'll examine PostgreSQL's MVCC from the tuple header structure up through the system catalogs and background processes that make it all work.
By the end of this page, you will understand PostgreSQL's heap tuple structure, transaction ID management, hint bits optimization, the complete VACUUM process, transaction ID wraparound prevention, and performance tuning strategies. You'll be able to diagnose MVCC-related performance issues and operate PostgreSQL with confidence.
Every row in a PostgreSQL table is stored as a heap tuple (or simply "tuple"). Each tuple contains not just the row data, but also the MVCC metadata that enables visibility determination.

Tuple Header Components:

The tuple header precedes the actual row data and contains:
| Field | Size | Purpose |
|---|---|---|
| t_xmin | 4 bytes (TransactionId) | Transaction ID that inserted this tuple |
| t_xmax | 4 bytes (TransactionId) | Transaction ID that deleted/updated this tuple (0 if still live) |
| t_cid | 4 bytes (CommandId) | Command ID within inserting transaction (for same-transaction visibility) |
| t_ctid | 6 bytes (ItemPointerData) | Current tuple ID: page number + item offset; points to self or newer version |
| t_infomask2 | 2 bytes | Number of attributes + various flags |
| t_infomask | 2 bytes | Hint bits, null bitmap presence, various flags |
| t_hoff | 1 byte | Offset to tuple data (header length including null bitmap) |
```c
/* PostgreSQL Tuple Header Structure (simplified from htup_details.h) */
typedef struct HeapTupleHeaderData
{
    union
    {
        HeapTupleFields t_heap;
        DatumTupleFields t_datum;
    } t_choice;

    ItemPointerData t_ctid;         /* Current tuple ID (6 bytes) */

    /* Fields below here must match MinimalTupleData */
    uint16      t_infomask2;        /* Number of attributes + flags */
    uint16      t_infomask;         /* Various flags (see below) */
    uint8       t_hoff;             /* Sizeof header incl. bitmap + padding */

    /* NULL bitmap and user data follow after header */
    bits8       t_bits[FLEXIBLE_ARRAY_MEMBER];  /* Null bitmap (variable length) */
    /* Actual row data follows the null bitmap */
} HeapTupleHeaderData;

typedef struct HeapTupleFields
{
    TransactionId t_xmin;           /* Inserting xact ID */
    TransactionId t_xmax;           /* Deleting or locking xact ID */
    union
    {
        CommandId     t_cid;        /* Inserting or deleting command ID */
        TransactionId t_xvac;       /* VACUUM xact ID (for freezing) */
    } t_field3;
} HeapTupleFields;

/* t_infomask flag bits */
#define HEAP_HASNULL            0x0001  /* Has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002  /* Has variable-width attr(s) */
#define HEAP_HASEXTERNAL        0x0004  /* Has external stored attr(s) */
#define HEAP_HASOID             0x0008  /* Has an OID (deprecated) */
#define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-share locker */
#define HEAP_COMBOCID           0x0020  /* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */
#define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax is a locker only */

/* Hint bits - critical for MVCC performance */
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
#define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */

/* Example tuple in memory:
 * +--------------------+
 * | t_xmin = 12345     | <- Inserted by txn 12345
 * | t_xmax = 0         | <- Not deleted (still live)
 * | t_cid = 0          | <- First command in transaction
 * | t_ctid = (4,17)    | <- Page 4, item 17 (points to self)
 * | t_infomask2 = 5    | <- 5 columns
 * | t_infomask = 0x102 | <- Has nulls, xmin committed
 * | t_hoff = 24        | <- Header is 24 bytes
 * +--------------------+
 * | null bitmap        | <- 1 byte: which columns are null
 * | padding            | <- Alignment padding
 * +--------------------+
 * | column data...     | <- Actual row data
 * +--------------------+
 */
```

The tuple header is approximately 23 bytes, plus alignment padding (typically bringing it to 24+ bytes). For tables with many small rows, this overhead is significant: a 10-byte payload effectively becomes 34+ bytes on disk, a 3x+ amplification. This is one reason PostgreSQL benefits from row-oriented designs with reasonably-sized rows.
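You can observe this metadata directly from SQL: every table exposes xmin, xmax, and ctid as hidden system columns, and the pageinspect contrib extension exposes the raw header fields. A minimal sketch, assuming a table named accounts already exists:

```sql
-- Hidden system columns expose the MVCC metadata for each row version
SELECT xmin, xmax, ctid, * FROM accounts LIMIT 3;

-- The pageinspect contrib extension shows raw tuple headers on a page
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp, t_xmin, t_xmax, t_ctid, t_infomask2, t_infomask, t_hoff
FROM heap_page_items(get_raw_page('accounts', 0));  -- page 0 of the heap
```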
PostgreSQL uses 32-bit unsigned integers for transaction IDs (XIDs). This seemingly simple decision has profound implications for system operation.

Transaction ID Lifecycle:

1. Normal XIDs: Assigned from a counter that increments with each new transaction
2. Special XIDs:
   - InvalidTransactionId (0): Represents 'no transaction'
   - BootstrapTransactionId (1): Used during database initialization
   - FrozenTransactionId (2): Represents 'very old, visible to everyone'
   - FirstNormalTransactionId (3): First usable transaction ID

With approximately 4 billion possible XIDs (2³²), and assuming thousands of transactions per second, the XID space can be exhausted in months to years of continuous operation.
```c
/* PostgreSQL Transaction ID Comparison
 *
 * XIDs are compared using 32-bit modular arithmetic.
 * XID space is conceptually circular, like a clock.
 *
 * For two XIDs A and B:
 * - A is "before" B if (int32)(A - B) < 0
 * - A is "after"  B if (int32)(A - B) > 0
 *
 * This means:
 * - We can compare XIDs up to ~2 billion apart
 * - XIDs more than ~2B apart give incorrect comparisons
 * - This is why XID wraparound is dangerous!
 */
#define TransactionIdPrecedes(id1, id2)  ((int32) ((id1) - (id2)) < 0)
#define TransactionIdFollows(id1, id2)   ((int32) ((id1) - (id2)) > 0)

/* Example:
 * Current XID: 4,000,000,000 (near max)
 *
 * Comparing 4,000,000,000 with 100:
 * (int32)(4000000000 - 100) = (int32)3999999900 = -295067396
 *
 * The result is negative, so 4,000,000,000 appears to be
 * "before" 100! The two XIDs are more than 2B apart, so the
 * modular comparison has wrapped and become unreliable.
 *
 * This is why we must freeze old tuples before wraparound.
 */

/* XID "distance" check - used for freeze threshold (illustrative) */
#define TransactionIdIsOlder(id, epoch, current, current_epoch) \
    (((current_epoch) > (epoch)) || \
     ((current_epoch) == (epoch) && TransactionIdPrecedes((id), (current))))

/* The visibility logic fundamentally relies on correct XID ordering.
 * If XID comparison breaks, MVCC visibility breaks!
 *
 * PostgreSQL prevents this with:
 * 1. VACUUM freezing old XIDs to FrozenTransactionId
 * 2. Blocking new transactions if too-old unfrozen XIDs exist
 * 3. Anti-wraparound autovacuum monitoring
 */
```

Transaction Status Tracking (pg_xact):

PostgreSQL tracks transaction commit status in a system called pg_xact (formerly known as clog). This is a bitmap where each transaction gets 2 bits:

- 00: Transaction in progress
- 01: Transaction committed
- 10: Transaction aborted
- 11: Sub-transaction committed

The pg_xact files are stored in $PGDATA/pg_xact/ and are accessed through a small set of in-memory buffers (the SLRU cache) for efficient access.
| Bits | Status | Meaning |
|---|---|---|
| 00 | IN_PROGRESS | Transaction is still running |
| 01 | COMMITTED | Transaction completed successfully |
| 10 | ABORTED | Transaction was rolled back or failed |
| 11 | SUB_COMMITTED | Subtransaction committed (awaiting parent) |
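The commit status that pg_xact records is queryable from SQL. A small sketch (these function names are from PostgreSQL 13+; older releases spell them txid_current() and txid_status(), and the XID shown is illustrative):

```sql
BEGIN;
SELECT pg_current_xact_id();           -- assigns this transaction an XID, e.g. 12345
COMMIT;

-- Look up the recorded status of a recent transaction in pg_xact
SELECT pg_xact_status('12345'::xid8);  -- 'committed', 'aborted', or 'in progress'
```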
If a database approaches XID wraparound without adequate vacuuming, PostgreSQL will refuse to start new write transactions to prevent data corruption. This manifests as errors like 'database is not accepting commands to avoid wraparound data loss.' Prevention is critical—monitor autovacuum_freeze_max_age and ensure VACUUM runs regularly.
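A standard way to watch for this is to track the age of datfrozenxid per database; values approaching autovacuum_freeze_max_age (200 million by default) deserve attention. A sketch:

```sql
-- Age of the oldest unfrozen XID in each database
SELECT datname,
       age(datfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```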
Checking transaction status in pg_xact for every visibility check would be expensive. PostgreSQL optimizes this with hint bits: flags stored directly in the tuple header that cache the commit/abort status of xmin and xmax.

How Hint Bits Work:

When a tuple is first accessed after the creating transaction completes:

1. The visibility check consults pg_xact for xmin status
2. If xmin's transaction committed, set HEAP_XMIN_COMMITTED
3. If xmin's transaction aborted, set HEAP_XMIN_INVALID
4. Similarly for xmax when inspected

Subsequent accesses can skip the pg_xact lookup and use the hint bits directly.
```
// Hint Bits Optimization Logic (pseudocode)
function check_xmin_committed(tuple: HeapTuple) -> Boolean:
    // Fast path: hint bit already set
    if tuple.t_infomask & HEAP_XMIN_COMMITTED:
        return true
    if tuple.t_infomask & HEAP_XMIN_INVALID:
        return false

    // Slow path: must check pg_xact
    xmin = tuple.t_xmin
    status = pg_xact_lookup(xmin)

    // Set hint bit to avoid future lookups
    if status == COMMITTED:
        tuple.t_infomask |= HEAP_XMIN_COMMITTED   // Dirties the page!
        return true
    else if status == ABORTED:
        tuple.t_infomask |= HEAP_XMIN_INVALID     // Dirties the page!
        return false
    else:
        // Still in progress - can't set hint bit yet
        return false

/*
 * Performance Impact Example:
 *
 * Scenario: SELECT * FROM large_table;  (1 million rows)
 *
 * WITHOUT hint bits:
 * - 1M pg_xact lookups per query execution
 * - Even with caching, significant overhead
 *
 * WITH hint bits (after first scan):
 * - First scan: 1M pg_xact lookups, sets 1M hint bits
 * - Subsequent scans: 0 pg_xact lookups!
 * - Just check the bits in the already-fetched tuple header
 *
 * Trade-off:
 * - Setting hint bits dirties pages (even for reads!)
 * - This can increase WAL traffic and checkpoint pressure
 * - But saves far more I/O on repeated accesses
 */
```

Hint Bits and WAL:

A subtle issue: setting hint bits modifies tuple data, which normally requires WAL logging for crash safety. However, hint bits are redundant information (derivable from pg_xact), so PostgreSQL uses a clever optimization:

- Hint bit changes are typically NOT WAL-logged individually
- If the page is dirtied only by hint bits and then evicted, the changes might be lost
- On the next access, hint bits are recalculated from pg_xact
- Full-page writes after a checkpoint capture any accumulated hint bits

The parameter wal_log_hints controls whether hint-only changes generate WAL in some scenarios (useful for replication tools like pg_rewind).
In PostgreSQL, even a pure SELECT can dirty pages by setting hint bits. This is unusual—most databases treat reads as truly read-only. It means read-heavy workloads on recently-committed data may generate unexpected I/O. The benefit (avoiding pg_xact lookups) almost always outweighs this cost.
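You can watch this happen with EXPLAIN (ANALYZE, BUFFERS): the first read after a bulk insert commits reports dirtied buffers, which are the hint-bit writes. A sketch (table name, row count, and the output numbers are illustrative):

```sql
CREATE TABLE hint_demo (id int);
INSERT INTO hint_demo SELECT g FROM generate_series(1, 100000) g;

-- First read after the inserting transaction commits: visibility checks
-- consult pg_xact and set hint bits, dirtying the pages they touch
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
--   Buffers: shared hit=443 dirtied=443    <- a pure read caused writes!

-- Second read: hint bits are already set, so nothing is dirtied
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
--   Buffers: shared hit=443                <- no "dirtied" this time
```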
VACUUM is PostgreSQL's garbage collection mechanism, and understanding it deeply is essential for maintaining healthy PostgreSQL systems. VACUUM performs multiple critical functions:

1. Dead tuple removal: Reclaims space from deleted/updated rows
2. XID freezing: Prevents transaction ID wraparound
3. Visibility map updates: Tracks all-visible pages
4. Free space map updates: Tracks available space for new inserts
5. Statistics updates (with ANALYZE): Updates table statistics for the query planner
```
VACUUM Process Flow (Simplified)

Phase 1: Heap Scan
─────────────────────────────────────────────────────────────
For each page in the table:
  1. Acquire cleanup lock on page (allows concurrent reads)
  2. Scan all tuples on page
  3. For each tuple:
     - Check visibility against OldestXmin (oldest active snapshot)
     - If dead (not visible to any snapshot):
       * Add to dead tuple list
     - If old (xmin < freeze threshold):
       * Consider freezing (replacing xmin with FrozenTransactionId)
  4. If any dead tuples found on page:
     - Add page to list for Phase 3

Phase 2: Index Vacuuming (if indexes exist)
─────────────────────────────────────────────────────────────
For each index on the table:
  1. Scan entire index
  2. Find entries pointing to dead tuple locations
  3. Remove those index entries
Note: This is expensive! It scans each entire index, which is
why vacuuming a table with many dead tuples is costly.

Phase 3: Heap Vacuuming
─────────────────────────────────────────────────────────────
For each page with dead tuples:
  1. Acquire exclusive lock on page
  2. Remove dead tuples from page
  3. Defragment page (optional, for contiguous free space)
  4. Update Free Space Map

Phase 4: Truncation (if enabled)
─────────────────────────────────────────────────────────────
If the last N pages are completely empty:
  1. Acquire exclusive lock on those pages
  2. Truncate the file to remove empty trailing pages
  3. Return space to the filesystem

Phase 5: Cleanup
─────────────────────────────────────────────────────────────
1. Update visibility map for all-visible pages
2. Update the table's relfrozenxid in pg_class
3. Write statistics to pg_stat tables
```

VACUUM vs VACUUM FULL:

Critical distinction:

VACUUM (standard):
- Runs concurrently with reads and writes
- Marks dead tuple space as reusable within the table
- Does NOT return space to the operating system (except trailing pages)
- Relatively fast, should run frequently

VACUUM FULL:
- Acquires an exclusive lock (blocks ALL access)
- Rewrites the entire table to a new file
- Returns all dead space to the operating system
- Very slow, should be rare
- Rewrites the table under a new relfilenode and rebuilds all its indexes
| Aspect | VACUUM | VACUUM FULL |
|---|---|---|
| Locking | ShareUpdateExclusive (allows reads/writes) | AccessExclusive (blocks everything) |
| Speed | Fast (seconds to minutes) | Slow (minutes to hours for large tables) |
| Space Recovery | Within table only | To filesystem |
| File Rewrite | No | Yes (complete rewrite) |
| Concurrent Access | Yes | No |
| When to Use | Regular maintenance | After massive deletes, emergency bloat fix |
| CPU/IO Impact | Moderate | Very high |
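For reference, the common invocations look like this (the table name is illustrative):

```sql
VACUUM orders;                     -- reclaim dead tuples, update VM/FSM
VACUUM (VERBOSE, ANALYZE) orders;  -- also refresh planner statistics, report details
VACUUM (FREEZE) orders;            -- aggressively freeze old XIDs
VACUUM FULL orders;                -- full rewrite; takes an AccessExclusive lock!
```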
In most cases, VACUUM FULL should be avoided in production. Instead: (1) Run regular VACUUM to prevent bloat accumulation, (2) If bloat occurs, use pg_repack for online table rewriting, (3) Consider partitioning for tables with high churn.
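To decide whether a table needs attention, the cumulative statistics views are the first stop, and pg_stat_progress_vacuum shows vacuums currently running. A sketch:

```sql
-- Dead tuple counts and the last (auto)vacuum per table
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Vacuums in flight right now (PostgreSQL 9.6+)
SELECT pid, relid::regclass, phase, heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
```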
Autovacuum is the background daemon that automatically triggers VACUUM (and ANALYZE) operations. Properly configuring autovacuum is perhaps the most important PostgreSQL maintenance task.

Trigger Thresholds:

Autovacuum triggers when dead tuples exceed a threshold:
```sql
-- Autovacuum Trigger Formula

-- VACUUM triggered when:
--   dead_tuples > autovacuum_vacuum_threshold +
--                 (autovacuum_vacuum_scale_factor * n_live_tup)

-- Default values:
--   autovacuum_vacuum_threshold = 50
--   autovacuum_vacuum_scale_factor = 0.2 (20%)

-- Example: Table with 100,000 live tuples
--   Threshold = 50 + (0.2 * 100,000) = 20,050 dead tuples

-- ANALYZE triggered when:
--   changes > autovacuum_analyze_threshold +
--             (autovacuum_analyze_scale_factor * n_live_tup)

-- For large tables, 20% is too high!
-- A table with 100 million rows needs 20 million dead tuples,
-- which could mean GBs of bloat before vacuum triggers.

-- Recommended: Lower scale_factor for large tables
ALTER TABLE huge_table SET (
    autovacuum_vacuum_scale_factor = 0.01,   -- 1% instead of 20%
    autovacuum_analyze_scale_factor = 0.005  -- 0.5%
);

-- Or use flat thresholds for very large tables
ALTER TABLE huge_table SET (
    autovacuum_vacuum_threshold = 10000,
    autovacuum_vacuum_scale_factor = 0.0
);
-- Triggers after 10,000 dead tuples regardless of table size

-- Check current settings for a table
SELECT n.nspname,
       c.relname,
       pg_table_size(c.oid) AS table_size,
       c.reloptions  -- Shows table-specific autovacuum settings
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'my_table';
```

Autovacuum Worker Configuration:

Beyond triggers, controlling autovacuum resource usage is critical:
| Parameter | Default | Purpose | Tuning Advice |
|---|---|---|---|
| autovacuum_max_workers | 3 | Max concurrent autovacuum workers | Increase for large databases (5-8) |
| autovacuum_naptime | 1min | Time between autovacuum checks | Lower for busy systems (10-15s) |
| autovacuum_vacuum_cost_limit | -1 (uses vacuum_cost_limit = 200) | Vacuum I/O throttling limit | Increase for faster vacuums (500-1000) |
| autovacuum_vacuum_cost_delay | 2ms | Sleep time when cost limit reached | Decrease for faster vacuums (0-1ms) |
| vacuum_cost_page_hit | 1 | Cost of vacuuming cached page | Usually leave default |
| vacuum_cost_page_miss | 2 (10 before PG 14) | Cost of vacuuming non-cached page | Usually leave default |
| maintenance_work_mem | 64MB | Memory for vacuum operations | Increase significantly (256MB-1GB) |
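These can be set globally via ALTER SYSTEM. A sketch (the values are illustrative starting points, not recommendations for every workload; note that autovacuum_max_workers only takes effect after a server restart):

```sql
ALTER SYSTEM SET autovacuum_naptime = '15s';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;
ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();  -- apply the reloadable settings above

ALTER SYSTEM SET autovacuum_max_workers = 6;  -- requires a restart
```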
When tables approach transaction ID wraparound (age approaching autovacuum_freeze_max_age, default 200M transactions), autovacuum triggers an aggressive 'anti-wraparound' vacuum that cannot be cancelled. This can cause performance issues if it surprises you. Monitor transaction age: SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind = 'r';
The Visibility Map (VM) is a crucial optimization structure that tracks the visibility status of heap pages. It enables significant performance improvements.

What the Visibility Map Tracks:

For each heap page, the VM stores 2 bits:

- Bit 1 (all-visible): All tuples on this page are visible to all transactions
- Bit 2 (all-frozen): All tuples on this page are frozen (xmin = FrozenTransactionId)
```
Visibility Map Structure

Heap File: my_table (8KB pages)
┌────────────────────────────────────────────────────────────┐
│ Page 0  │ Page 1  │ Page 2  │ Page 3  │ Page 4  │ ...      │
│ Mixed   │ All-Vis │ All-Frz │ Mixed   │ All-Vis │          │
└────────────────────────────────────────────────────────────┘
     ↑         ↑         ↑         ↑         ↑
     │         │         │         │         │
Visibility Map File: my_table_vm
┌──────────────────────────────────────────┐
│ Bits:  00  │  01  │  11  │  00  │  01  │ ...
│       page0  page1  page2  page3  page4  │
└──────────────────────────────────────────┘

Bit meanings:
00 = Mixed:       Page has dead tuples or recent changes, needs full check
01 = All-Visible: All tuples visible, but some not frozen
11 = All-Frozen:  All tuples visible AND frozen

Benefits:

1. Index-Only Scans:
   - If the VM bit is set, no need to visit the heap to check visibility
   - Just return data from the index directly
   - Massive speedup for covering indexes!

2. VACUUM Optimization:
   - Can skip all-frozen pages entirely
   - Can skip all-visible pages (unless freezing is needed)
   - Dramatically faster VACUUM for stable tables

3. VACUUM FREEZE Optimization:
   - Only processes pages without the all-frozen bit
   - Stable data is frozen once, then skipped forever

Example Query Optimization:
─────────────────────────────────────────
Query: SELECT indexed_col FROM table WHERE indexed_col = 'foo'

Without VM (or VM bit not set):
1. Search index → find heap page 5, item 3
2. Fetch heap page 5
3. Check visibility of item 3
4. Return if visible

With VM (bit set for page 5):
1. Search index → find heap page 5, item 3
2. Check VM for page 5 → all-visible!
3. Skip heap fetch, return directly from index  ← Heap I/O eliminated!
```

Index-Only Scans and Covering Indexes:

The visibility map enables index-only scans, where PostgreSQL can satisfy a query entirely from the index without touching the heap, provided the VM confirms all tuples on the referenced pages are visible.
```sql
-- Index-Only Scan Example

-- Create a covering index
CREATE INDEX idx_orders_covering ON orders (customer_id)
    INCLUDE (order_date, total_amount);

-- This query can be satisfied entirely from the index
-- if visibility map bits are set
EXPLAIN ANALYZE
SELECT customer_id, order_date, total_amount
FROM orders
WHERE customer_id = 12345;

/*
Plan with good VM coverage:
  Index Only Scan using idx_orders_covering on orders
    Index Cond: (customer_id = 12345)
    Heap Fetches: 0     <- All pages were all-visible!

Plan with poor VM coverage (after many updates):
  Index Only Scan using idx_orders_covering on orders
    Index Cond: (customer_id = 12345)
    Heap Fetches: 847   <- Had to check heap visibility
*/

-- Check table size and tuple-level statistics
SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS table_size,
       pg_size_pretty(pg_relation_size(NULLIF(reltoastrelid, 0))) AS toast_size,
       pg_stat_get_live_tuples(oid) AS live_tuples,
       pg_stat_get_dead_tuples(oid) AS dead_tuples,
       pg_stat_get_mod_since_analyze(oid) AS mods_since_analyze
FROM pg_class
WHERE relname = 'orders';

-- After VACUUM, VM bits will be set for all-visible pages
VACUUM orders;
-- Re-run the query and observe "Heap Fetches: 0"
```

High "Heap Fetches" counts in Index Only Scans indicate poor visibility map coverage. Run VACUUM more frequently on tables with many updates to maintain VM bits and enable efficient index-only scans. Monitor pg_stat_user_tables.idx_scan vs pg_statio_user_tables.idx_blks_hit for index efficiency.
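To inspect actual VM coverage directly, the pg_visibility contrib extension summarizes how many pages have each bit set. A sketch (output values are illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Counts of all-visible and all-frozen pages for a table
SELECT * FROM pg_visibility_map_summary('orders');
--  all_visible | all_frozen
-- -------------+------------
--         1024 |        512
```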
PostgreSQL's MVCC implementation is elegant in concept but rich in operational detail. Mastering these internals is essential for running PostgreSQL effectively at scale.
Looking Ahead:

With PostgreSQL's implementation understood, we'll conclude the MVCC module by examining the benefits and advantages of MVCC in its entirety, from performance characteristics to modern extensions like Serializable Snapshot Isolation.
You now understand PostgreSQL's MVCC implementation at the system internals level. You can reason about tuple structure, XID management, hint bits, VACUUM operations, and visibility map optimizations. This knowledge is essential for operating PostgreSQL at production scale.