A database may contain millions of pages across thousands of tables and indexes. How does the system know which pages have free space for new inserts? How are new pages allocated as data grows? What happens to pages when records are deleted—do they return to a free pool, or does the space linger as fragmentation?
Page management encompasses all the mechanisms that coordinate page allocation, track free space, and maintain efficient storage utilization. These mechanisms operate largely invisibly, but their effectiveness directly impacts database performance, storage efficiency, and operational complexity.
By the end of this page, you will understand page allocation strategies, free space tracking mechanisms (Free Space Maps), extent-based allocation, page splits and merges in indexes, vacuum and space reclamation, and practical monitoring and tuning approaches. This completes our exploration of page layout fundamentals.
When a table or index needs a new page, the database must decide where the page comes from, how much space to allocate at once, and how to keep the layout contiguous enough for efficient scans.
Allocation Approaches
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Single-page allocation | Allocate one page at a time from free list | No wasted space; simple | Fragmented layout; poor sequential performance |
| Extent allocation | Allocate contiguous groups (e.g., 64KB, 1MB) | Sequential access; less metadata | Potential wasted space in partial extents |
| Pre-allocation | Reserve space before data arrives | Reduces allocation latency | May over-allocate; wasted storage |
| Segment-based | Objects own segments containing extents | Logical grouping; controlled growth | More complex management |
PostgreSQL Allocation Model
PostgreSQL uses a relatively simple model:
Table → Relation file(s) on disk
main fork: user data (heap pages)
fsm fork: Free Space Map
vm fork: Visibility Map
init fork: for unlogged tables
Page allocation:
1. Extend file by one page (8KB)
2. Initialize page header
3. Zero remaining bytes
4. Update FSM to show available space
No explicit extent grouping; pages added sequentially.
OS and filesystem handle physical contiguity.
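The four allocation steps can be sketched in Python. This is a toy layout with just the two free-space pointers, not PostgreSQL's actual 24-byte PageHeaderData:

```python
import struct

PAGE_SIZE = 8192
HEADER_SIZE = 24  # PostgreSQL's page header happens to be 24 bytes

def new_page():
    page = bytearray(PAGE_SIZE)                 # step 3: zeroed body
    # step 2: minimal header -- lower/upper free-space pointers
    struct.pack_into("<HH", page, 0, HEADER_SIZE, PAGE_SIZE)
    return page

def free_space(page):
    lower, upper = struct.unpack_from("<HH", page, 0)
    return upper - lower

fsm = {}                                        # step 4: toy free space map
heap = [new_page()]                             # step 1: extend by one page
fsm[0] = free_space(heap[0])
print(fsm[0])  # 8168 bytes free on a fresh page
```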
InnoDB Extent Model
Tablespace → Multiple files
Segment: collection of extents for one B-tree level
Extent: 64 contiguous pages (1MB for 16KB pages)
Page: 16KB
Allocation levels:
- First 32 pages: single-page allocation
- After: extent allocation for locality
Fragmented pages for small tables (< 32 pages)
Full extents for larger tables (better scan performance)
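A quick sanity check on the arithmetic above, plus the 32-page switchover rule, as a Python sketch (the threshold logic is simplified from InnoDB's actual segment allocation code):

```python
PAGE_SIZE = 16 * 1024          # 16KB InnoDB page
PAGES_PER_EXTENT = 64
EXTENT_SIZE = PAGE_SIZE * PAGES_PER_EXTENT
print(EXTENT_SIZE // (1024 * 1024))  # 1 (MB per extent)

def allocation_unit(pages_in_segment):
    # Small segments grow one page at a time from fragment extents;
    # beyond 32 pages, InnoDB switches to whole-extent allocation.
    return "single page" if pages_in_segment < 32 else "full extent"

print(allocation_unit(10))   # single page
print(allocation_unit(100))  # full extent
```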
For bulk loading, some databases support preallocation hints. PostgreSQL extends relations in multi-page batches under concurrent insert load. InnoDB: innodb_autoextend_increment controls how much the tablespace file grows at a time. Oracle: the ALLOCATE EXTENT clause reserves extents for a segment explicitly. Preallocating reduces extension latency during heavy inserts.
When inserting a row, the database must find a page with sufficient free space. Scanning every page would be O(n)—impractical for large tables. Free Space Maps (FSM) provide efficient space lookup.
PostgreSQL FSM Details
FSM fork: a separate file alongside the main relation file, with an _fsm suffix
Structure: Binary tree stored across FSM pages
- Leaf nodes: 1 byte per heap page
- Internal nodes: max of children
- Root: max free space in entire table
Free space encoding (1 byte = 256 categories):
Category × 32 = approximate free bytes
Category 0: < 32 bytes free
Category 255: > 8160 bytes free (nearly empty)
Search algorithm:
1. Start at root
2. Request: need 500 bytes → category 16 (ceil(500/32))
3. Navigate tree to find leaf with category ≥ 16
4. Return page number
5. Insert may fail if estimate inaccurate → try next
Update policy:
- VACUUM updates FSM from actual page contents
- INSERT/UPDATE may update FSM opportunistically
- Stale FSM can cause unnecessary page extensions
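Here is a minimal Python sketch of the scheme described above: one category byte per heap page, internal nodes holding the max of their children, and a descent that follows any branch whose max satisfies the request. Real FSM pages pack this tree into 8KB blocks, which this toy version ignores, and it assumes a power-of-two page count:

```python
import math

def category(free_bytes):
    # 1 byte per heap page: ~32-byte granularity, capped at 255
    return min(free_bytes // 32, 255)

def build_max_tree(leaves):
    # Each internal node holds the max of its children; root = table-wide max
    levels = [leaves]
    while len(levels[0]) > 1:
        prev = levels[0]
        levels.insert(0, [max(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels  # levels[0] is the root

def find_page(levels, needed_bytes):
    need = math.ceil(needed_bytes / 32)  # e.g. 500 bytes -> category 16
    if levels[0][0] < need:
        return None  # no page has enough space: extend the relation
    idx = 0
    for depth in range(1, len(levels)):
        idx *= 2
        # descend into the first child whose max satisfies the request
        if levels[depth][idx] < need:
            idx += 1
    return idx  # heap page number

free_bytes_per_page = [100, 40, 900, 600, 20, 8000, 300, 50]
levels = build_max_tree([category(b) for b in free_bytes_per_page])
print(find_page(levels, 500))  # 2 (the page with 900 bytes free)
```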
InnoDB Free Space Management
Page types for space management:
- FSP_HDR page: tablespace header + extent descriptors
- XDES pages: additional extent descriptors
- INODE pages: segment descriptors
Extent states:
- FREE: available for allocation
- FREE_FRAG: partially used, pages can be allocated
- FULL_FRAG: all pages in use
- Segment-owned: belongs to specific segment
Per-page free space tracked in page headers.
Segment manages list of pages with free space.
The FSM is an approximation, not guaranteed accurate. After many updates/deletes without VACUUM, the FSM can become stale—showing pages full when they have space, or vice versa. This causes unnecessary file extensions (bloat) or failed insert attempts. Regular VACUUM keeps FSM accurate.
PostgreSQL maintains a Visibility Map (VM) alongside the FSM. While FSM tracks free space, the VM tracks which pages are 'all-visible'—containing only tuples visible to all transactions.
Visibility Map Structure
VM fork: a separate file alongside the main relation file, with a _vm suffix
Format: 2 bits per heap page
Bit 0: All-visible flag
1 = All tuples on page visible to all transactions
0 = Some tuples may be invisible to some transactions
Bit 1: All-frozen flag (PostgreSQL 9.6+)
1 = All tuples are frozen (ancient, never need visibility check)
0 = Some tuples may need freezing
Compactness: 1 VM page tracks 8192 × 8 / 2 = 32,768 heap pages
For 8KB pages: ~256MB of heap per VM page
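The 2-bits-per-page packing can be illustrated with a toy bitmap. Bit positions here are illustrative, not PostgreSQL's exact on-disk encoding:

```python
# Bit 0 = all-visible, bit 1 = all-frozen; four heap pages per VM byte.
PAGE_SIZE = 8192

def set_flags(vm, heap_page, all_visible, all_frozen):
    byte, slot = divmod(heap_page, 4)
    bits = (all_visible << 0) | (all_frozen << 1)
    vm[byte] = (vm[byte] & ~(0b11 << (slot * 2))) | (bits << (slot * 2))

def get_flags(vm, heap_page):
    byte, slot = divmod(heap_page, 4)
    bits = (vm[byte] >> (slot * 2)) & 0b11
    return bool(bits & 0b01), bool(bits & 0b10)  # (all_visible, all_frozen)

vm = bytearray(PAGE_SIZE)                 # one 8KB VM page
set_flags(vm, 12345, all_visible=1, all_frozen=0)
print(get_flags(vm, 12345))               # (True, False)
print(PAGE_SIZE * 8 // 2)                 # 32768 heap pages tracked per VM page
```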
Use Cases
| Operation | How VM Helps | Benefit |
|---|---|---|
| Index-Only Scan | If VM says all-visible, skip heap fetch | Major I/O reduction |
| VACUUM | Skip all-visible pages (no dead tuples) | Faster vacuum |
| Anti-wraparound VACUUM | Skip all-frozen pages | Much faster freeze |
| Heap scans | No direct benefit | VM not consulted |
Index-Only Scan Example
SELECT customer_id FROM orders WHERE status = 'shipped';
-- If there's an index on (status, customer_id):
-- Without VM / not all-visible:
1. Search index → find matching entries with (status, customer_id, TID)
2. For each TID, fetch heap page to check visibility (MVCC)
3. Return customer_id if visible
-- With VM showing pages all-visible:
1. Search index → find matching entries
2. Check VM: is page at TID all-visible?
If yes: trust index data, skip heap fetch
If no: fetch heap page for visibility check
3. Return customer_id
-- Result: Dramatically fewer heap I/Os for recent tables
VM Maintenance
Query pg_class.relallvisible for the count of all-visible pages; compare relallvisible / relpages to see VM coverage. Low coverage indicates recent write activity or a vacuum backlog; high coverage enables efficient index-only scans. Example:
SELECT relname, relallvisible, relpages,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_visible
FROM pg_class
WHERE relname = 'orders';
Index pages, unlike heap pages, maintain sorted order. When a page fills and a new entry must be inserted, the page must split. Conversely, when deletions leave pages too empty, they may merge with siblings.
B-Tree Page Split
Before split: Page P is full, need to insert key 'M'
Page P: [A, B, C, D, E, F, G, H, I, J, K, L, N, O, P, Q, R, S, T]
↑ Insert 'M' here
No room!
Split process:
1. Allocate new page P'
2. Choose split point (typically middle: 'J')
3. Move keys > J to new page P'
4. Insert 'M' into appropriate page (P' since M > J)
5. Add separator key 'J' to parent (with pointer to P')
6. Parent may cascade split if full
After split:
Page P: [A, B, C, D, E, F, G, H, I, J]
Page P': [K, L, M, N, O, P, Q, R, S, T]
Parent: [..., (ptr to P), 'J', (ptr to P'), ...]
Cost of split:
- Write 3 pages (P, P', parent) — potentially more if cascades
- Brief lock escalation
- WAL logging for each page
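The split walk-through above, as a Python sketch. This uses an even split at the midpoint; real B-trees may pick other split points, such as rightmost splits for append-only keys:

```python
import bisect

def split_and_insert(page, new_key):
    mid = (len(page) + 1) // 2
    left, right = page[:mid], page[mid:]          # steps 1-3: move upper half
    separator = left[-1]                          # e.g. 'J'
    target = right if new_key > separator else left
    bisect.insort(target, new_key)                # step 4: place incoming key
    return left, right, separator                 # step 5: separator -> parent

page_p = list("ABCDEFGHIJKLNOPQRST")              # full page, no room for 'M'
left, right, sep = split_and_insert(page_p, "M")
print("".join(left))   # ABCDEFGHIJ
print("".join(right))  # KLMNOPQRST
print(sep)             # J
```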
Merge (Coalesce) Operations
Trigger: Page utilization drops below threshold (often 50%)
Merge process:
1. Identify under-full sibling pages
2. If combined content fits in one page:
a. Copy all entries from one page to another
b. Remove separator from parent
c. Mark emptied page as free
3. Parent may cascade merge if under-full
Alternatives to merge:
- Redistribution: Move some entries from fuller sibling
- Deferred merge: Mark for later batch processing
- No merge: Accept space wastage (common choice!)
Why many databases don't merge aggressively:
- Merge is expensive (page locks, WAL, parent update)
- Workload may re-fill pages (wasted merge effort)
- Dead pages are reclaimed during rebuild operations
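Redistribution, listed among the alternatives above, can be sketched as rebalancing two sibling pages and returning the new separator for the parent:

```python
def redistribute(underfull, sibling):
    # Pool both pages' entries, then split evenly instead of merging
    total = sorted(underfull + sibling)
    mid = len(total) // 2
    left, right = total[:mid], total[mid:]
    separator = left[-1]  # new separator key for the parent
    return left, right, separator

left, right, sep = redistribute([2, 5], [8, 11, 14, 17, 20, 23])
print(left)   # [2, 5, 8, 11]
print(right)  # [14, 17, 20, 23]
print(sep)    # 11
```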
PostgreSQL B-tree indexes don't actively merge pages. Empty leaf pages are recycled into a free list for reuse by future splits. This avoids merge complexity but can leave indexes larger than optimal after heavy deletions. REINDEX rebuilds with optimal packing.
In MVCC databases, deleted rows aren't immediately removed—they may be visible to in-progress transactions. VACUUM is the garbage collection process that reclaims space from dead tuples.
| Variant | Description | Impact | When to Use |
|---|---|---|---|
| VACUUM | Standard: reclaim dead tuple space, update FSM/VM | Can run concurrently with queries | Regular maintenance |
| VACUUM FULL | Rewrites entire table, reclaims space to OS | Exclusive lock, blocks queries | Massive bloat recovery |
| VACUUM FREEZE | Freeze old tuple XIDs to prevent wraparound | Slightly more work than standard | Before XID exhaustion |
| Autovacuum | Background daemon, triggered by thresholds | Automatic, minimal impact | Default, always enabled |
Standard VACUUM Process
Phase 1: Heap Scan
For each page not all-visible in VM:
1. Acquire page lock (share)
2. Identify dead tuples (xmax < oldest active transaction)
3. Mark dead tuples for removal
4. Update page's free space
5. If page now all-visible, update VM
Phase 2: Index Cleanup
For each index on table:
1. Scan index for TIDs of dead tuples
2. Remove index entries pointing to dead tuples
3. Recycle empty index pages to free list
Phase 3: Heap Cleanup
For each page with dead tuples:
1. Acquire page lock (exclusive, brief)
2. Remove dead tuples, update slot directory
3. Optionally compact page
4. Update FSM
Phase 4: Truncation (optional)
If trailing pages are all-empty:
1. Acquire AccessExclusiveLock briefly
2. Truncate file, returning space to OS
3. Update relation size
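The heart of Phase 1 is the visibility test: a tuple whose deleting transaction (xmax) committed before the oldest still-active transaction can no longer be seen by anyone. A toy version, with tuples as hypothetical (xmin, xmax) pairs and commit status assumed:

```python
def dead_tuples(page_tuples, oldest_active_xid):
    # Assumes every xmax shown here belongs to a committed transaction;
    # real VACUUM also consults commit status (clog) before deciding.
    dead = []
    for slot, (xmin, xmax) in enumerate(page_tuples):
        if xmax is not None and xmax < oldest_active_xid:
            dead.append(slot)  # no live transaction can still see it
    return dead

page = [(100, None),   # live row, never deleted
        (101, 150),    # deleted by xid 150 -- dead
        (102, 205)]    # deleted, but deleter may still be visible to someone
print(dead_tuples(page, oldest_active_xid=200))  # [1]
```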
VACUUM FULL rewrites the entire table to a new file, reclaiming all dead space and returning it to the OS. However, it requires an exclusive lock (blocks ALL access) and needs 2x the table space temporarily. For large tables, this can mean hours of downtime. Alternatives: pg_repack extension for online rebuild, or partitioning to drop old partitions.
Autovacuum Tuning
Key parameters:
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
Trigger: vacuum when dead_tuples > threshold + scale_factor × n_live_tup
For 1M row table: vacuum when dead > 50 + 0.2 × 1M = 200,050 dead tuples
For 100 row table: vacuum when dead > 50 + 0.2 × 100 = 70 dead tuples
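The trigger arithmetic above as a small function, with the default values of the two parameters assumed:

```python
def autovacuum_trigger(n_live_tup, threshold=50, scale_factor=0.2):
    # Autovacuum fires once n_dead_tup exceeds this count
    return threshold + scale_factor * n_live_tup

print(autovacuum_trigger(1_000_000))  # 200050.0
print(autovacuum_trigger(100))        # 70.0
```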
Aggressive tables (high churn):
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);
-- Vacuum at 5% instead of 20%
Large tables:
ALTER TABLE huge_archive SET (autovacuum_vacuum_threshold = 10000);
-- Don't vacuum for few scattered deletes
Monitoring:
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
Bloat occurs when dead space accumulates faster than it's reclaimed. A 10GB logical table might occupy 50GB on disk. This wastes storage, slows scans, and increases backup sizes.
Detecting Bloat
-- Quick estimate using pg_stat_user_tables
SELECT
relname,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) as dead_pct,
pg_size_pretty(pg_relation_size(relid)) as table_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
-- More accurate: pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');
-- Returns: dead_tuple_percent, free_percent, etc.
-- Table size vs live-tuple fraction (rough live-data estimate)
SELECT
    relname,
    pg_size_pretty(pg_relation_size(relid)) AS actual_size,
    pg_size_pretty((pg_relation_size(relid) * n_live_tup
                    / nullif(n_live_tup + n_dead_tup, 0))::bigint) AS approx_live_size
FROM pg_stat_user_tables
WHERE n_live_tup + n_dead_tup > 0;
Bloat Prevention Strategies
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Tune autovacuum aggressively | Lower scale factors for high-churn tables | High |
| Monitor long transactions | Set statement_timeout, idle_in_transaction_session_timeout | Critical |
| Hot standby feedback | hot_standby_feedback = on for replicas | Avoids standby query cancellations, but can hold back vacuum on the primary |
| Partition by time | Drop old partitions instead of DELETE | Very high |
| HOT updates | Keep updates to non-indexed columns | Reduces version proliferation |
| Fill factor | Lower fill factor for update-heavy tables | Helps but trades space |
| Regular REINDEX | Rebuild indexes periodically | Required for index bloat |
The pg_repack extension rebuilds tables without exclusive locks (unlike VACUUM FULL). It creates a new table, copies data with triggers capturing changes, then swaps atomically. Excellent for production bloat recovery. Install: CREATE EXTENSION pg_repack; Run: pg_repack --table orders dbname
Different databases approach page management with varying philosophies:
InnoDB Page Management
Purge thread (instead of VACUUM): a background thread removes delete-marked records and trims old undo log entries automatically.
Space reclamation: freed pages return to the tablespace free list; space goes back to the OS only when the table is rebuilt.
OPTIMIZE TABLE: rebuilds the table to defragment it; ALTER TABLE ... ENGINE=InnoDB is equivalent.
Online DDL (5.6+): ALGORITHM=INPLACE, LOCK=NONE allow rebuilds with minimal blocking.
Monitoring:
SELECT * FROM information_schema.INNODB_TABLESPACES;
SELECT * FROM sys.innodb_buffer_stats_by_table;
-- Check table fragmentation
SELECT table_name, data_length, data_free,
round(100 * data_free / (data_length + data_free), 1) as frag_pct
FROM information_schema.tables WHERE table_schema = 'mydb';
PostgreSQL: Explicit VACUUM needed due to append-only heap. InnoDB: In-place updates with undo log; purge thread automatic. Oracle: In-place updates with undo; explicit shrink for reclamation. SQL Server: Ghost cleanup automatic; index maintenance is explicit. Each reflects different MVCC implementation choices.
Page management is the unsung orchestrator that keeps database storage efficient and accessible. The key takeaways: allocation strategies trade contiguity against wasted space; free space maps turn insert placement into a cheap tree lookup instead of a table scan; the visibility map powers index-only scans and faster vacuums; index pages split on overflow but rarely merge; and VACUUM (or its purge-thread equivalents) is what keeps dead space from becoming bloat.
Module Complete:
Congratulations! You've completed the Page Layout module, covering page allocation strategies, free space tracking, extent-based allocation, page splits and merges in indexes, vacuum and space reclamation, and practical monitoring and tuning.
These concepts form the foundation of database storage engineering, enabling you to understand performance characteristics, optimize schemas, and debug storage issues at the deepest level.
You now have a comprehensive understanding of database page layout and management—from the byte-level structure of individual records to the system-wide orchestration of millions of pages. This knowledge is essential for storage optimization, performance tuning, and understanding the true cost of database operations.