Loading learning content...
Every piece of data you've ever stored in a database—every row, every column, every index entry—ultimately lives on a page. The page is the fundamental unit of I/O in virtually all database management systems, serving as the atomic container that bridges the gap between persistent storage and in-memory data manipulation.\n\nUnderstanding page structure isn't merely academic detail; it's the foundation upon which all higher-level database operations are built. Query performance, storage efficiency, concurrency control, and recovery mechanisms all depend intimately on how pages are organized. A Principal Engineer working on database internals or optimizing storage-intensive applications must understand these concepts at a deep, intuitive level.
By the end of this page, you will understand the complete anatomy of a database page, including its header, free space management, and how different components interact. You'll grasp why page design decisions have cascading effects throughout the entire database system and how to reason about page-level tradeoffs in real-world scenarios.
Before diving into page structure, we must understand why databases use pages in the first place. This isn't an arbitrary design choice—it's a carefully considered solution to fundamental constraints imposed by hardware and operating systems.\n\nThe Disk I/O Reality\n\nMagnetic disks and even modern SSDs don't read or write individual bytes. They operate in fixed-size blocks, typically 512 bytes or 4KB for disk sectors and larger blocks for SSD pages. Any read or write operation, no matter how small the actual data, transfers an entire block.\n\nConsider what happens without page abstraction:\n- Read 1 byte? Transfer 4KB from disk\n- Write 1 byte? Read 4KB, modify 1 byte, write 4KB back\n- N random single-byte accesses? N × 4KB transfers\n\nThis mismatch between byte-addressable needs and block-addressable reality drove the invention of the page as a management abstraction.
| Characteristic | Magnetic Disk (HDD) | Solid State Drive (SSD) | Implication for Pages |
|---|---|---|---|
| Minimum Transfer Unit | 512B - 4KB sector | 4KB - 16KB page | Pages must align with device block boundaries |
| Random Read Latency | 5-10ms (seek + rotation) | 50-200μs | Larger pages amortize fixed access overhead |
| Sequential Bandwidth | 100-200 MB/s | 500-7000 MB/s | Large sequential page reads are efficient |
| Write Amplification | N/A (rewrite in place) | Significant (erase blocks) | Page size affects SSD wear and performance |
| Parallelism | Single head per platter | Multiple channels | Page I/O can be parallelized on SSDs |
The Buffer Pool Connection\n\nPages serve as the unit of transfer between disk and the buffer pool (main memory cache). When the database needs to access a row, it:\n\n1. Determines which page contains the row (via page ID)\n2. Checks if that page is in the buffer pool\n3. If not, reads the entire page from disk into a buffer frame\n4. Accesses the row within the in-memory page\n5. Eventually, if the page was modified, writes the entire page back to disk\n\nThis page-at-a-time model has profound implications:\n- Multiple rows on the same page share the I/O cost\n- Rows physically near each other (on the same page) have temporal locality benefits\n- Page size directly affects buffer pool hit rates and memory utilization
Larger pages (e.g., 32KB) transfer more data per I/O but may waste bandwidth if only a small portion is needed. Smaller pages (e.g., 4KB) are more granular but incur more I/O operations for large scans. Most databases default to 4KB-16KB pages, matching OS page size and SSD characteristics. PostgreSQL uses 8KB; Oracle defaults to 8KB but allows 2KB-32KB; SQL Server uses 8KB.
A database page is a carefully organized data structure, typically fixed at 4KB, 8KB, or 16KB. While implementations vary across database systems, the fundamental components remain consistent:\n\nThe Three-Region Model\n\nMost database pages follow a conceptual model with three primary regions:
Why This Layout?\n\nThe bidirectional growth pattern (slots growing down, records growing up) is intentional:\n\n1. Maximizes contiguous free space — All unused bytes form a single block, simplifying allocation\n2. Eliminates internal fragmentation during initial insertions — New records and their slot entries simply claim space from opposite ends\n3. Enables efficient compaction — When free space becomes fragmented, records can be compacted toward the bottom without invalidating slot numbers\n4. Supports variable-length records — Unlike fixed-offset schemes, records of any size can coexist on the same page
The page header is the control structure that makes everything else on the page interpretable. Without the header, the remaining bytes would be meaningless. Let's examine each component in detail:
| Field | Size | Purpose | Critical For |
|---|---|---|---|
| Page LSN | 8 bytes | Log Sequence Number of last modification | Recovery (WAL), Checkpointing |
| Checksum | 2-4 bytes | CRC or hash for data integrity verification | Detecting corruption, Silent data loss prevention |
| Flags | 2 bytes | Page state bits (leaf/internal, full, torn) | Page type identification, Space management |
| Lower Pointer | 2 bytes | Offset to start of free space (end of slot dir) | Slot directory boundary |
| Upper Pointer | 2 bytes | Offset to end of free space (start of records) | Record region boundary |
| Special Pointer | 2 bytes | Offset to special/index-specific data | Index pages (B-tree, etc.) |
| Page Version | 2 bytes | Schema/format version for upgrade compatibility | Online schema evolution |
| Prune XID | 4-8 bytes | Oldest transaction that may need to see dead tuples | MVCC, Vacuum operations |
The LSN: Cornerstone of Recovery\n\nThe Page LSN (Log Sequence Number) deserves special attention. This 8-byte field records the LSN of the most recent WAL (Write-Ahead Log) record that modified this page. During crash recovery:\n\n1. Recovery scans the WAL from the checkpoint\n2. For each WAL record, it reads the target page\n3. It compares the WAL record's LSN to the page's LSN\n4. If WAL LSN > Page LSN, the modification is replayed (redo)\n5. If WAL LSN ≤ Page LSN, the page already has this change (skip)\n\nThis mechanism ensures idempotent redo—replaying the same WAL record multiple times yields the same result. The Page LSN makes this possible by indicating the page's current state in the modification timeline.
Checksums detect silent data corruption from failing disks, cosmic rays (bit flips), or firmware bugs. PostgreSQL enables checksums via initdb --data-checksums. MySQL/InnoDB has innodb_checksum_algorithm. Disabling checksums saves ~1-2% CPU but risks undetected corruption. Production systems should ALWAYS enable page checksums.
Lower and Upper Pointers: The Free Space Boundaries\n\nThese two offsets define the free space region:\n\n\n┌─────────────────────────────────────────────────────────┐\n│ Header │ Slot Dir... │ FREE SPACE │ Records... │\n│ (24B) │ │ │ │\n└────────┴──────────────────────┴────────────┴────────────┘\n ↑ ↑ ↑\n 0 LOWER UPPER 8191\n (100) (7800)\n │────── allocated ─────│── free ───│── allocated ──│\n\n\n- Lower points to the first byte after the slot directory\n- Upper points to the first byte of the record region\n- Free Space = Upper - Lower bytes\n\nWhen inserting a record:\n1. Check if (record size + slot entry size) ≤ (Upper - Lower)\n2. If yes: copy record starting at Upper, decrement Upper by record size\n3. Add slot entry at Lower, increment Lower by slot entry size
Not all pages are created equal. Databases use different page types for different purposes, each with specialized structures optimized for their role:
| Page Type | Purpose | Special Characteristics | Examples |
|---|---|---|---|
| Heap Page | Store table row data | Slot directory, variable-length records, MVCC info | PostgreSQL heap, MySQL InnoDB data pages |
| Index Page (Leaf) | Store index entries pointing to data | Sorted entries, sibling pointers, high fanout | B-tree leaf nodes |
| Index Page (Internal) | Navigate index structure | Keys + child page pointers, no data payload | B-tree internal nodes |
| Overflow Page | Store large values that don't fit on main page | Continuation pointers, blob chunks | TOAST pages (PostgreSQL) |
| Free Space Map Page | Track free space across all pages | Bitmap or byte-per-page representation | FSM in PostgreSQL |
| Visibility Map Page | Track all-visible pages for index-only scans | 2 bits per heap page | VM in PostgreSQL |
| Undo Page | Store before-images for transaction rollback | Chain of undo records, transaction linkage | Oracle undo, InnoDB undo logs |
| System Page | Database metadata (schema, statistics) | Catalog entries, system table rows | pg_class entries, InnoDB data dictionary |
Heap Pages vs. Index Pages\n\nThe distinction between heap (data) pages and index pages is fundamental:\n\nHeap Pages:\n- Store actual row data with all column values\n- Records are not necessarily sorted\n- Support in-place updates (with MVCC caveats)\n- Primary concern: space utilization, hot updates\n\nIndex Pages:\n- Store keys and pointers (either to heap or to other index pages)\n- Entries are sorted by key value\n- Leaf pages linked for range scans\n- Primary concern: fanout, balance, split/merge overhead\n\nBoth page types share the basic header + slot directory + record structure, but their internal record formats and management algorithms differ significantly.
The page header's flags field typically encodes the page type. Recovery operations, consistency checks, and space management routines read this flag to apply type-appropriate logic. A corrupted type flag can cause catastrophic misinterpretation of page contents.
Choosing the right page size is one of the most consequential decisions in database design. While most systems provide sensible defaults, understanding the tradeoffs enables informed tuning for specific workloads.
Workload-Specific Recommendations\n\n| Workload Type | Recommended Page Size | Rationale |\n|---------------|----------------------|-----------|\n| OLTP (many point queries) | 4KB - 8KB | Minimize read amplification, reduce lock contention |\n| OLAP (large scans) | 16KB - 32KB | Maximize sequential throughput, better compression |\n| Mixed workload | 8KB - 16KB | Balance between competing concerns |\n| Wide rows (many columns) | 16KB+ | Avoid excessive overflow pages |\n| High concurrency | Smaller pages | Reduce per-page lock conflicts |\n\nModern SSD Considerations\n\nSSDs have changed some traditional assumptions:\n- SSD internal page size is typically 4KB-16KB\n- Write amplification occurs when database page size doesn't align with SSD page size\n- NVMe SSDs can handle smaller random I/Os efficiently, reducing the penalty for smaller pages\n- Some databases (e.g., RocksDB, TiKV) use variable-size blocks rather than fixed pages
Most databases require reinitialization or dump/restore to change page size. PostgreSQL's page size (BLCKSZ) is set at compile time—changing it requires rebuilding PostgreSQL and reinitializing the cluster. Plan page size carefully during initial deployment; it's difficult to change later.
Managing free space within a page is a microcosm of the broader memory management problem. The database must track available space, allocate from it efficiently, and reclaim space from deleted or updated records.
States of Space on a Page\n\nSpace on a database page exists in several states:\n\n1. Allocated (Live) — Currently used by live records\n2. Allocated (Dead) — Used by deleted/obsolete records awaiting cleanup\n3. Free (Contiguous) — Between slot directory and record region\n4. Free (Fragmented) — Holes within the record region from deletions\n\nThe Fragmentation Problem\n\nConsider a page with three 1000-byte records, then the middle one is deleted:\n\n\n Before deletion: After deletion:\n ┌───────────────────┐ ┌───────────────────┐\n │ Header + Slots │ │ Header + Slots │\n ├───────────────────┤ ├───────────────────┤\n │ Free: 5000 bytes │ │ Free: 5000 bytes │\n ├───────────────────┤ ├───────────────────┤\n │ Record C (1000B) │ │ Record C (1000B) │\n │ Record B (1000B) │ │ ████ HOLE 1000B ██│\n │ Record A (1000B) │ │ Record A (1000B) │\n └───────────────────┘ └───────────────────┘\n\n\nAfter deletion:\n- Contiguous free space: 5000 bytes\n- Total free space: 6000 bytes (including the 1000-byte hole)\n- Usable for a 5500-byte record? No! The hole fragments the space.
In MVCC systems (PostgreSQL, MySQL InnoDB), dead tuples accumulate as transactions update or delete rows. Without regular vacuum/purge, pages become bloated with dead data, causing table bloat, slower scans, and eventual transaction ID wraparound issues. Monitor bloat and ensure vacuum keeps pace with modification rates.
Every database operation ultimately translates to a sequence of page-level operations. Understanding these primitives reveals why certain operations are expensive and how the database maintains consistency.
| Operation | Steps | Complexity | Logging Required |
|---|---|---|---|
| Insert Record |
| O(1) for the page operation itself | Full-page image or redo log |
| Delete Record (Mark) |
| O(1) lookup + constant work | Slot modification log |
| Delete Record (Physical) |
| O(n) where n = records after deleted one | Typically deferred to vacuum |
| Update Record (In-Place) |
| O(1) if sizes match, else higher | Before/after images or redo |
| Page Compaction |
| O(n) where n = live records | Full-page image typically |
The Insert Path in Detail\n\n\nINSERT INTO orders (id, customer, amount) VALUES (42, 'Alice', 100.00);\n\n┌──────────────────────────────────────────────────────┐\n│ 1. Executor requests: insert this tuple on table heap │\n└──────────────────────────────────────────────────────┘\n │\n ▼\n┌──────────────────────────────────────────────────────┐\n│ 2. Buffer manager: find page with sufficient space │\n│ - Consult Free Space Map (FSM) │\n│ - FSM says page 47 has ~2KB free │\n│ - Request page 47 into buffer pool │\n└──────────────────────────────────────────────────────┘\n │\n ▼\n┌──────────────────────────────────────────────────────┐\n│ 3. Lock page 47 (exclusive or content lock) │\n│ 4. Verify free space: Upper - Lower ≥ tuple + slot │\n│ 5. Generate WAL record (before modifying page) │\n│ 6. Write tuple at offset = Upper - tupleSize │\n│ 7. Add slot entry: slots[nslots++] = new offset │\n│ 8. Update header: Upper -= tupleSize, Lower += 4 │\n│ 9. Mark page dirty in buffer pool │\n│ 10. Release page lock │\n└──────────────────────────────────────────────────────┘\n\n\nNote how the page insert involves multiple subsystems: executor, buffer manager, lock manager, WAL writer, and the page manipulation itself. Each step is carefully ordered to ensure recoverability and concurrency safety.
We've explored the fundamental building block of database storage—the page. Let's consolidate the key takeaways:
What's Next:\n\nNow that you understand the overall page structure, we'll dive deeper into the slot directory—the indirection layer that enables stable record addressing, efficient updates, and garbage collection. The slot directory is where logical record IDs meet physical byte offsets.
You now understand the fundamental structure of database pages—the atomic containers that make all database operations possible. This knowledge is essential for understanding storage engineering, query optimization, and recovery mechanisms. Next, we'll explore the slot directory in detail.