A database may contain millions of pages across thousands of tables and indexes. How does the system know which pages have free space for new inserts? How are new pages allocated as data grows? What happens to pages when records are deleted—do they return to a free pool, or does the space linger as fragmentation?
Page management encompasses all the mechanisms that coordinate page allocation, track free space, and maintain efficient storage utilization. These mechanisms operate largely invisibly, but their effectiveness directly impacts database performance, storage efficiency, and operational complexity.
By the end of this page, you will understand page allocation strategies, free space tracking mechanisms (Free Space Maps), extent-based allocation, page splits and merges in indexes, vacuum and space reclamation, and practical monitoring and tuning approaches. This completes our exploration of page layout fundamentals.
When a table or index needs a new page, the database must decide where the page comes from, how much space to allocate at once, and how to keep the layout contiguous enough for efficient scans.
Allocation Approaches
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Single-page allocation | Allocate one page at a time from free list | No wasted space; simple | Fragmented layout; poor sequential performance |
| Extent allocation | Allocate contiguous groups (e.g., 64KB, 1MB) | Sequential access; less metadata | Potential wasted space in partial extents |
| Pre-allocation | Reserve space before data arrives | Reduces allocation latency | May over-allocate; wasted storage |
| Segment-based | Objects own segments containing extents | Logical grouping; controlled growth | More complex management |
PostgreSQL Allocation Model
PostgreSQL uses a relatively simple model:
Table → Relation file(s) on disk
main fork: user data (heap pages)
fsm fork: Free Space Map
vm fork: Visibility Map
init fork: for unlogged tables
Page allocation:
1. Extend file by one page (8KB)
2. Initialize page header
3. Zero remaining bytes
4. Update FSM to show available space
No explicit extent grouping; pages added sequentially.
OS and filesystem handle physical contiguity.
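The four allocation steps can be sketched in Python. This is a toy layout with just the two free-space pointers, not PostgreSQL's actual 24-byte PageHeaderData:

```python
import struct

PAGE_SIZE = 8192
HEADER_SIZE = 24  # PostgreSQL's page header happens to be 24 bytes

def new_page():
    page = bytearray(PAGE_SIZE)                 # step 3: zeroed body
    # step 2: minimal header -- lower/upper free-space pointers
    struct.pack_into("<HH", page, 0, HEADER_SIZE, PAGE_SIZE)
    return page

def free_space(page):
    lower, upper = struct.unpack_from("<HH", page, 0)
    return upper - lower

fsm = {}                                        # step 4: toy free space map
heap = [new_page()]                             # step 1: extend by one page
fsm[0] = free_space(heap[0])
print(fsm[0])  # 8168 bytes free on a fresh page
```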
InnoDB Extent Model
Tablespace → Multiple files
Segment: collection of extents for one B-tree level
Extent: 64 contiguous pages (1MB for 16KB pages)
Page: 16KB
Allocation levels:
- First 32 pages: single-page allocation
- After: extent allocation for locality
Fragmented pages for small tables (< 32 pages)
Full extents for larger tables (better scan performance)
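A quick sanity check on the arithmetic above, plus the 32-page switchover rule, as a Python sketch (the threshold logic is simplified from InnoDB's actual segment allocation code):

```python
PAGE_SIZE = 16 * 1024          # 16KB InnoDB page
PAGES_PER_EXTENT = 64
EXTENT_SIZE = PAGE_SIZE * PAGES_PER_EXTENT
print(EXTENT_SIZE // (1024 * 1024))  # 1 (MB per extent)

def allocation_unit(pages_in_segment):
    # Small segments grow one page at a time from fragment extents;
    # beyond 32 pages, InnoDB switches to whole-extent allocation.
    return "single page" if pages_in_segment < 32 else "full extent"

print(allocation_unit(10))   # single page
print(allocation_unit(100))  # full extent
```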
For bulk loading, some databases support preallocation hints. PostgreSQL extends relations in multi-page batches under concurrent insert load. InnoDB: innodb_autoextend_increment controls how much the tablespace file grows at a time. Oracle: the ALLOCATE EXTENT clause reserves extents for a segment explicitly. Preallocating reduces extension latency during heavy inserts.
When inserting a row, the database must find a page with sufficient free space. Scanning every page would be O(n)—impractical for large tables. Free Space Maps (FSM) provide efficient space lookup.
PostgreSQL FSM Details
FSM fork: a separate file alongside the main relation file, with an _fsm suffix
Structure: Binary tree stored across FSM pages
- Leaf nodes: 1 byte per heap page
- Internal nodes: max of children
- Root: max free space in entire table
Free space encoding (1 byte = 256 categories):
Category × 32 = approximate free bytes
Category 0: < 32 bytes free
Category 255: > 8160 bytes free (nearly empty)
Search algorithm:
1. Start at root
2. Request: need 500 bytes → category 16 (ceil(500/32))
3. Navigate tree to find leaf with category ≥ 16
4. Return page number
5. Insert may fail if estimate inaccurate → try next
Update policy:
- VACUUM updates FSM from actual page contents
- INSERT/UPDATE may update FSM opportunistically
- Stale FSM can cause unnecessary page extensions
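Here is a minimal Python sketch of the scheme described above: one category byte per heap page, internal nodes holding the max of their children, and a descent that follows any branch whose max satisfies the request. Real FSM pages pack this tree into 8KB blocks, which this toy version ignores, and it assumes a power-of-two page count:

```python
import math

def category(free_bytes):
    # 1 byte per heap page: ~32-byte granularity, capped at 255
    return min(free_bytes // 32, 255)

def build_max_tree(leaves):
    # Each internal node holds the max of its children; root = table-wide max
    levels = [leaves]
    while len(levels[0]) > 1:
        prev = levels[0]
        levels.insert(0, [max(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels  # levels[0] is the root

def find_page(levels, needed_bytes):
    need = math.ceil(needed_bytes / 32)  # e.g. 500 bytes -> category 16
    if levels[0][0] < need:
        return None  # no page has enough space: extend the relation
    idx = 0
    for depth in range(1, len(levels)):
        idx *= 2
        # descend into the first child whose max satisfies the request
        if levels[depth][idx] < need:
            idx += 1
    return idx  # heap page number

free_bytes_per_page = [100, 40, 900, 600, 20, 8000, 300, 50]
levels = build_max_tree([category(b) for b in free_bytes_per_page])
print(find_page(levels, 500))  # 2 (the page with 900 bytes free)
```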
InnoDB Free Space Management
Page types for space management:
- FSP_HDR page: tablespace header + extent descriptors
- XDES pages: additional extent descriptors
- INODE pages: segment descriptors
Extent states:
- FREE: available for allocation
- FREE_FRAG: partially used, pages can be allocated
- FULL_FRAG: all pages in use
- Segment-owned: belongs to specific segment
Per-page free space tracked in page headers.
Segment manages list of pages with free space.
The FSM is an approximation, not guaranteed accurate. After many updates/deletes without VACUUM, the FSM can become stale—showing pages full when they have space, or vice versa. This causes unnecessary file extensions (bloat) or failed insert attempts. Regular VACUUM keeps FSM accurate.
PostgreSQL maintains a Visibility Map (VM) alongside the FSM. While FSM tracks free space, the VM tracks which pages are 'all-visible'—containing only tuples visible to all transactions.
Visibility Map Structure
VM fork: a separate file alongside the main relation file, with a _vm suffix
Format: 2 bits per heap page
Bit 0: All-visible flag
1 = All tuples on page visible to all transactions
0 = Some tuples may be invisible to some transactions
Bit 1: All-frozen flag (PostgreSQL 9.6+)
1 = All tuples are frozen (ancient, never need visibility check)
0 = Some tuples may need freezing
Compactness: 1 VM page tracks 8192 × 8 / 2 = 32,768 heap pages
For 8KB pages: ~256MB of heap per VM page
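The 2-bits-per-page packing can be illustrated with a toy bitmap. Bit positions here are illustrative, not PostgreSQL's exact on-disk encoding:

```python
# Bit 0 = all-visible, bit 1 = all-frozen; four heap pages per VM byte.
PAGE_SIZE = 8192

def set_flags(vm, heap_page, all_visible, all_frozen):
    byte, slot = divmod(heap_page, 4)
    bits = (all_visible << 0) | (all_frozen << 1)
    vm[byte] = (vm[byte] & ~(0b11 << (slot * 2))) | (bits << (slot * 2))

def get_flags(vm, heap_page):
    byte, slot = divmod(heap_page, 4)
    bits = (vm[byte] >> (slot * 2)) & 0b11
    return bool(bits & 0b01), bool(bits & 0b10)  # (all_visible, all_frozen)

vm = bytearray(PAGE_SIZE)                 # one 8KB VM page
set_flags(vm, 12345, all_visible=1, all_frozen=0)
print(get_flags(vm, 12345))               # (True, False)
print(PAGE_SIZE * 8 // 2)                 # 32768 heap pages tracked per VM page
```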
Use Cases
| Operation | How VM Helps | Benefit |
|---|---|---|
| Index-Only Scan | If VM says all-visible, skip heap fetch | Major I/O reduction |
| VACUUM | Skip all-visible pages (no dead tuples) | Faster vacuum |
| Anti-wraparound VACUUM | Skip all-frozen pages | Much faster freeze |
| Heap scans | No direct benefit | VM not consulted |
Index-Only Scan Example
SELECT customer_id FROM orders WHERE status = 'shipped';
-- If there's an index on (status, customer_id):
-- Without VM / not all-visible:
1. Search index → find matching entries with (status, customer_id, TID)
2. For each TID, fetch heap page to check visibility (MVCC)
3. Return customer_id if visible
-- With VM showing pages all-visible:
1. Search index → find matching entries
2. Check VM: is page at TID all-visible?
If yes: trust index data, skip heap fetch
If no: fetch heap page for visibility check
3. Return customer_id
-- Result: Dramatically fewer heap I/Os for recent tables
VM Maintenance
Query pg_class.relallvisible for the count of all-visible pages; compare relallvisible / relpages to see VM coverage. Low coverage indicates recent write activity or a vacuum backlog; high coverage enables efficient index-only scans. Example:
SELECT relname, relallvisible, relpages,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_visible
FROM pg_class
WHERE relname = 'orders';
Index pages, unlike heap pages, maintain sorted order. When a page fills and a new entry must be inserted, the page must split. Conversely, when deletions leave pages too empty, they may merge with siblings.
B-Tree Page Split
Before split: Page P is full, need to insert key 'M'
Page P: [A, B, C, D, E, F, G, H, I, J, K, L, N, O, P, Q, R, S, T]
↑ Insert 'M' here
No room!
Split process:
1. Allocate new page P'
2. Choose split point (typically middle: 'J')
3. Move keys > J to new page P'
4. Insert 'M' into appropriate page (P' since M > J)
5. Add separator key 'J' to parent (with pointer to P')
6. Parent may cascade split if full
After split:
Page P: [A, B, C, D, E, F, G, H, I, J]
Page P': [K, L, M, N, O, P, Q, R, S, T]
Parent: [..., (ptr to P), 'J', (ptr to P'), ...]
Cost of split:
- Write 3 pages (P, P', parent) — potentially more if cascades
- Brief lock escalation
- WAL logging for each page
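The split walk-through above, as a Python sketch. This uses an even split at the midpoint; real B-trees may pick other split points, such as rightmost splits for append-only keys:

```python
import bisect

def split_and_insert(page, new_key):
    mid = (len(page) + 1) // 2
    left, right = page[:mid], page[mid:]          # steps 1-3: move upper half
    separator = left[-1]                          # e.g. 'J'
    target = right if new_key > separator else left
    bisect.insort(target, new_key)                # step 4: place incoming key
    return left, right, separator                 # step 5: separator -> parent

page_p = list("ABCDEFGHIJKLNOPQRST")              # full page, no room for 'M'
left, right, sep = split_and_insert(page_p, "M")
print("".join(left))   # ABCDEFGHIJ
print("".join(right))  # KLMNOPQRST
print(sep)             # J
```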
Merge (Coalesce) Operations
Trigger: Page utilization drops below threshold (often 50%)
Merge process:
1. Identify under-full sibling pages
2. If combined content fits in one page:
a. Copy all entries from one page to another
b. Remove separator from parent
c. Mark emptied page as free
3. Parent may cascade merge if under-full
Alternatives to merge:
- Redistribution: Move some entries from fuller sibling
- Deferred merge: Mark for later batch processing
- No merge: Accept space wastage (common choice!)
Why many databases don't merge aggressively:
- Merge is expensive (page locks, WAL, parent update)
- Workload may re-fill pages (wasted merge effort)
- Dead pages are reclaimed during rebuild operations
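Redistribution, listed among the alternatives above, can be sketched as rebalancing two sibling pages and returning the new separator for the parent:

```python
def redistribute(underfull, sibling):
    # Pool both pages' entries, then split evenly instead of merging
    total = sorted(underfull + sibling)
    mid = len(total) // 2
    left, right = total[:mid], total[mid:]
    separator = left[-1]  # new separator key for the parent
    return left, right, separator

left, right, sep = redistribute([2, 5], [8, 11, 14, 17, 20, 23])
print(left)   # [2, 5, 8, 11]
print(right)  # [14, 17, 20, 23]
print(sep)    # 11
```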
PostgreSQL B-tree indexes don't actively merge pages. Empty leaf pages are recycled into a free list for reuse by future splits. This avoids merge complexity but can leave indexes larger than optimal after heavy deletions. REINDEX rebuilds with optimal packing.
In MVCC databases, deleted rows aren't immediately removed—they may be visible to in-progress transactions. VACUUM is the garbage collection process that reclaims space from dead tuples.
| Variant | Description | Impact | When to Use |
|---|---|---|---|
| VACUUM | Standard: reclaim dead tuple space, update FSM/VM | Can run concurrently with queries | Regular maintenance |
| VACUUM FULL | Rewrites entire table, reclaims space to OS | Exclusive lock, blocks queries | Massive bloat recovery |
| VACUUM FREEZE | Freeze old tuple XIDs to prevent wraparound | Slightly more work than standard | Before XID exhaustion |
| Autovacuum | Background daemon, triggered by thresholds | Automatic, minimal impact | Default, always enabled |
Standard VACUUM Process
Phase 1: Heap Scan
For each page not all-visible in VM:
1. Acquire page lock (share)
2. Identify dead tuples (xmax < oldest active transaction)
3. Mark dead tuples for removal
4. Update page's free space
5. If page now all-visible, update VM
Phase 2: Index Cleanup
For each index on table:
1. Scan index for TIDs of dead tuples
2. Remove index entries pointing to dead tuples
3. Recycle empty index pages to free list
Phase 3: Heap Cleanup
For each page with dead tuples:
1. Acquire page lock (exclusive, brief)
2. Remove dead tuples, update slot directory
3. Optionally compact page
4. Update FSM
Phase 4: Truncation (optional)
If trailing pages are all-empty:
1. Acquire AccessExclusiveLock briefly
2. Truncate file, returning space to OS
3. Update relation size
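The heart of Phase 1 is the visibility test: a tuple whose deleting transaction (xmax) committed before the oldest still-active transaction can no longer be seen by anyone. A toy version, with tuples as hypothetical (xmin, xmax) pairs and commit status assumed:

```python
def dead_tuples(page_tuples, oldest_active_xid):
    # Assumes every xmax shown here belongs to a committed transaction;
    # real VACUUM also consults commit status (clog) before deciding.
    dead = []
    for slot, (xmin, xmax) in enumerate(page_tuples):
        if xmax is not None and xmax < oldest_active_xid:
            dead.append(slot)  # no live transaction can still see it
    return dead

page = [(100, None),   # live row, never deleted
        (101, 150),    # deleted by xid 150 -- dead
        (102, 205)]    # deleted, but deleter may still be visible to someone
print(dead_tuples(page, oldest_active_xid=200))  # [1]
```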
VACUUM FULL rewrites the entire table to a new file, reclaiming all dead space and returning it to the OS. However, it requires an exclusive lock (blocks ALL access) and needs 2x the table space temporarily. For large tables, this can mean hours of downtime. Alternatives: pg_repack extension for online rebuild, or partitioning to drop old partitions.
Autovacuum Tuning
Key parameters:
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
Trigger: vacuum when dead_tuples > threshold + scale_factor × n_live_tup
For 1M row table: vacuum when dead > 50 + 0.2 × 1M = 200,050 dead tuples
For 100 row table: vacuum when dead > 50 + 0.2 × 100 = 70 dead tuples
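The trigger arithmetic above as a small function, with the default values of the two parameters assumed:

```python
def autovacuum_trigger(n_live_tup, threshold=50, scale_factor=0.2):
    # Autovacuum fires once n_dead_tup exceeds this count
    return threshold + scale_factor * n_live_tup

print(autovacuum_trigger(1_000_000))  # 200050.0
print(autovacuum_trigger(100))        # 70.0
```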
Aggressive tables (high churn):
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);
-- Vacuum at 5% instead of 20%
Large tables:
ALTER TABLE huge_archive SET (autovacuum_vacuum_threshold = 10000);
-- Don't vacuum for few scattered deletes
Monitoring:
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
Bloat occurs when dead space accumulates faster than it's reclaimed. A 10GB logical table might occupy 50GB on disk. This wastes storage, slows scans, and increases backup sizes.
Detecting Bloat
-- Quick estimate using pg_stat_user_tables
SELECT
relname,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) as dead_pct,
pg_size_pretty(pg_relation_size(relid)) as table_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
-- More accurate: pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');
-- Returns: dead_tuple_percent, free_percent, etc.
-- Table size vs live-tuple fraction (rough live-data estimate)
SELECT
    relname,
    pg_size_pretty(pg_relation_size(relid)) AS actual_size,
    pg_size_pretty((pg_relation_size(relid) * n_live_tup
                    / nullif(n_live_tup + n_dead_tup, 0))::bigint) AS approx_live_size
FROM pg_stat_user_tables
WHERE n_live_tup + n_dead_tup > 0;
Bloat Prevention Strategies
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Tune autovacuum aggressively | Lower scale factors for high-churn tables | High |
| Monitor long transactions | Set statement_timeout, idle_in_transaction_session_timeout | Critical |
| Hot standby feedback | hot_standby_feedback = on for replicas | Avoids standby query cancellations, but can hold back vacuum on the primary |
| Partition by time | Drop old partitions instead of DELETE | Very high |
| HOT updates | Keep updates to non-indexed columns | Reduces version proliferation |
| Fill factor | Lower fill factor for update-heavy tables | Helps but trades space |
| Regular REINDEX | Rebuild indexes periodically | Required for index bloat |
The pg_repack extension rebuilds tables without exclusive locks (unlike VACUUM FULL). It creates a new table, copies data with triggers capturing changes, then swaps atomically. Excellent for production bloat recovery. Install: CREATE EXTENSION pg_repack; Run: pg_repack --table orders dbname
Different databases approach page management with varying philosophies:
InnoDB Page Management
Purge thread (instead of VACUUM): a background thread removes delete-marked records and trims old undo log entries automatically.
Space reclamation: freed pages return to the tablespace free list; space goes back to the OS only when the table is rebuilt.
OPTIMIZE TABLE: rebuilds the table to defragment it; ALTER TABLE ... ENGINE=InnoDB is equivalent.
Online DDL (5.6+): ALGORITHM=INPLACE, LOCK=NONE allow rebuilds with minimal blocking.
Monitoring:
SELECT * FROM information_schema.INNODB_TABLESPACES;
SELECT * FROM sys.innodb_buffer_stats_by_table;
-- Check table fragmentation
SELECT table_name, data_length, data_free,
round(100 * data_free / (data_length + data_free), 1) as frag_pct
FROM information_schema.tables WHERE table_schema = 'mydb';
PostgreSQL: Explicit VACUUM needed due to append-only heap. InnoDB: In-place updates with undo log; purge thread automatic. Oracle: In-place updates with undo; explicit shrink for reclamation. SQL Server: Ghost cleanup automatic; index maintenance is explicit. Each reflects different MVCC implementation choices.
Page management is the unsung orchestrator that keeps database storage efficient and accessible. The key takeaways: allocation strategies trade contiguity against wasted space; free space maps turn insert placement into a cheap tree lookup instead of a table scan; the visibility map powers index-only scans and faster vacuums; index pages split on overflow but rarely merge; and VACUUM (or its purge-thread equivalents) is what keeps dead space from becoming bloat.
Module Complete:
Congratulations! You've completed the Page Layout module, covering page allocation strategies, free space tracking, extent-based allocation, page splits and merges in indexes, vacuum and space reclamation, and practical monitoring and tuning.
These concepts form the foundation of database storage engineering, enabling you to understand performance characteristics, optimize schemas, and debug storage issues at the deepest level.
You now have a comprehensive understanding of database page layout and management—from the byte-level structure of individual records to the system-wide orchestration of millions of pages. This knowledge is essential for storage optimization, performance tuning, and understanding the true cost of database operations.