Loading learning content...
Every piece of data you've ever stored in a database—every row, every column, every index entry—ultimately lives on a page. The page is the fundamental unit of I/O in virtually all database management systems, serving as the atomic container that bridges the gap between persistent storage and in-memory data manipulation.
Understanding page structure isn't merely academic detail; it's the foundation upon which all higher-level database operations are built. Query performance, storage efficiency, concurrency control, and recovery mechanisms all depend intimately on how pages are organized. A Principal Engineer working on database internals or optimizing storage-intensive applications must understand these concepts at a deep, intuitive level.
By the end of this page, you will understand the complete anatomy of a database page, including its header, free space management, and how different components interact. You'll grasp why page design decisions have cascading effects throughout the entire database system and how to reason about page-level tradeoffs in real-world scenarios.
Before diving into page structure, we must understand why databases use pages in the first place. This isn't an arbitrary design choice—it's a carefully considered solution to fundamental constraints imposed by hardware and operating systems.
The Disk I/O Reality
Magnetic disks and even modern SSDs don't read or write individual bytes. They operate in fixed-size blocks, typically 512 bytes or 4KB for disk sectors and larger blocks for SSD pages. Any read or write operation, no matter how small the actual data, transfers an entire block.
Consider what happens without page abstraction:
This mismatch between byte-addressable needs and block-addressable reality drove the invention of the page as a management abstraction.
| Characteristic | Magnetic Disk (HDD) | Solid State Drive (SSD) | Implication for Pages |
|---|---|---|---|
| Minimum Transfer Unit | 512B - 4KB sector | 4KB - 16KB page | Pages must align with device block boundaries |
| Random Read Latency | 5-10ms (seek + rotation) | 50-200μs | Larger pages amortize fixed access overhead |
| Sequential Bandwidth | 100-200 MB/s | 500-7000 MB/s | Large sequential page reads are efficient |
| Write Amplification | N/A (rewrite in place) | Significant (erase blocks) | Page size affects SSD wear and performance |
| Parallelism | Single head per platter | Multiple channels | Page I/O can be parallelized on SSDs |
The Buffer Pool Connection
Pages serve as the unit of transfer between disk and the buffer pool (main memory cache). When the database needs to access a row, it:
This page-at-a-time model has profound implications:
Larger pages (e.g., 32KB) transfer more data per I/O but may waste bandwidth if only a small portion is needed. Smaller pages (e.g., 4KB) are more granular but incur more I/O operations for large scans. Most databases default to 4KB-16KB pages, matching OS page size and SSD characteristics. PostgreSQL uses 8KB; Oracle defaults to 8KB but allows 2KB-32KB; SQL Server uses 8KB.
A database page is a carefully organized data structure, typically fixed at 4KB, 8KB, or 16KB. While implementations vary across database systems, the fundamental components remain consistent:
The Three-Region Model
Most database pages follow a conceptual model with three primary regions:
Why This Layout?
The bidirectional growth pattern (slots growing down, records growing up) is intentional:
The page header is the control structure that makes everything else on the page interpretable. Without the header, the remaining bytes would be meaningless. Let's examine each component in detail:
| Field | Size | Purpose | Critical For |
|---|---|---|---|
| Page LSN | 8 bytes | Log Sequence Number of last modification | Recovery (WAL), Checkpointing |
| Checksum | 2-4 bytes | CRC or hash for data integrity verification | Detecting corruption, Silent data loss prevention |
| Flags | 2 bytes | Page state bits (leaf/internal, full, torn) | Page type identification, Space management |
| Lower Pointer | 2 bytes | Offset to start of free space (end of slot dir) | Slot directory boundary |
| Upper Pointer | 2 bytes | Offset to end of free space (start of records) | Record region boundary |
| Special Pointer | 2 bytes | Offset to special/index-specific data | Index pages (B-tree, etc.) |
| Page Version | 2 bytes | Schema/format version for upgrade compatibility | Online schema evolution |
| Prune XID | 4-8 bytes | Oldest transaction that may need to see dead tuples | MVCC, Vacuum operations |
The LSN: Cornerstone of Recovery
The Page LSN (Log Sequence Number) deserves special attention. This 8-byte field records the LSN of the most recent WAL (Write-Ahead Log) record that modified this page. During crash recovery:
This mechanism ensures idempotent redo—replaying the same WAL record multiple times yields the same result. The Page LSN makes this possible by indicating the page's current state in the modification timeline.
Checksums detect silent data corruption from failing disks, cosmic rays (bit flips), or firmware bugs. PostgreSQL enables checksums via initdb --data-checksums. MySQL/InnoDB has innodb_checksum_algorithm. Disabling checksums saves ~1-2% CPU but risks undetected corruption. Production systems should ALWAYS enable page checksums.
Lower and Upper Pointers: The Free Space Boundaries
These two offsets define the free space region:
┌─────────────────────────────────────────────────────────┐
│ Header │ Slot Dir... │ FREE SPACE │ Records... │
│ (24B) │ │ │ │
└────────┴──────────────────────┴────────────┴────────────┘
↑ ↑ ↑
0 LOWER UPPER 8191
(100) (7800)
│────── allocated ─────│── free ───│── allocated ──│
When inserting a record:
Not all pages are created equal. Databases use different page types for different purposes, each with specialized structures optimized for their role:
| Page Type | Purpose | Special Characteristics | Examples |
|---|---|---|---|
| Heap Page | Store table row data | Slot directory, variable-length records, MVCC info | PostgreSQL heap, MySQL InnoDB data pages |
| Index Page (Leaf) | Store index entries pointing to data | Sorted entries, sibling pointers, high fanout | B-tree leaf nodes |
| Index Page (Internal) | Navigate index structure | Keys + child page pointers, no data payload | B-tree internal nodes |
| Overflow Page | Store large values that don't fit on main page | Continuation pointers, blob chunks | TOAST pages (PostgreSQL) |
| Free Space Map Page | Track free space across all pages | Bitmap or byte-per-page representation | FSM in PostgreSQL |
| Visibility Map Page | Track all-visible pages for index-only scans | 2 bits per heap page | VM in PostgreSQL |
| Undo Page | Store before-images for transaction rollback | Chain of undo records, transaction linkage | Oracle undo, InnoDB undo logs |
| System Page | Database metadata (schema, statistics) | Catalog entries, system table rows | pg_class entries, InnoDB data dictionary |
Heap Pages vs. Index Pages
The distinction between heap (data) pages and index pages is fundamental:
Heap Pages:
Index Pages:
Both page types share the basic header + slot directory + record structure, but their internal record formats and management algorithms differ significantly.
The page header's flags field typically encodes the page type. Recovery operations, consistency checks, and space management routines read this flag to apply type-appropriate logic. A corrupted type flag can cause catastrophic misinterpretation of page contents.
Choosing the right page size is one of the most consequential decisions in database design. While most systems provide sensible defaults, understanding the tradeoffs enables informed tuning for specific workloads.
Workload-Specific Recommendations
| Workload Type | Recommended Page Size | Rationale |
|---|---|---|
| OLTP (many point queries) | 4KB - 8KB | Minimize read amplification, reduce lock contention |
| OLAP (large scans) | 16KB - 32KB | Maximize sequential throughput, better compression |
| Mixed workload | 8KB - 16KB | Balance between competing concerns |
| Wide rows (many columns) | 16KB+ | Avoid excessive overflow pages |
| High concurrency | Smaller pages | Reduce per-page lock conflicts |
Modern SSD Considerations
SSDs have changed some traditional assumptions:
Most databases require reinitialization or dump/restore to change page size. PostgreSQL's page size (BLCKSZ) is set at compile time—changing it requires rebuilding PostgreSQL and reinitializing the cluster. Plan page size carefully during initial deployment; it's difficult to change later.
Managing free space within a page is a microcosm of the broader memory management problem. The database must track available space, allocate from it efficiently, and reclaim space from deleted or updated records.
States of Space on a Page
Space on a database page exists in several states:
The Fragmentation Problem
Consider a page with three 1000-byte records, then the middle one is deleted:
Before deletion: After deletion:
┌───────────────────┐ ┌───────────────────┐
│ Header + Slots │ │ Header + Slots │
├───────────────────┤ ├───────────────────┤
│ Free: 5000 bytes │ │ Free: 5000 bytes │
├───────────────────┤ ├───────────────────┤
│ Record C (1000B) │ │ Record C (1000B) │
│ Record B (1000B) │ │ ████ HOLE 1000B ██│
│ Record A (1000B) │ │ Record A (1000B) │
└───────────────────┘ └───────────────────┘
After deletion:
In MVCC systems (PostgreSQL, MySQL InnoDB), dead tuples accumulate as transactions update or delete rows. Without regular vacuum/purge, pages become bloated with dead data, causing table bloat, slower scans, and eventual transaction ID wraparound issues. Monitor bloat and ensure vacuum keeps pace with modification rates.
Every database operation ultimately translates to a sequence of page-level operations. Understanding these primitives reveals why certain operations are expensive and how the database maintains consistency.
| Operation | Steps | Complexity | Logging Required |
|---|---|---|---|
| Insert Record |
| O(1) for the page operation itself | Full-page image or redo log |
| Delete Record (Mark) |
| O(1) lookup + constant work | Slot modification log |
| Delete Record (Physical) |
| O(n) where n = records after deleted one | Typically deferred to vacuum |
| Update Record (In-Place) |
| O(1) if sizes match, else higher | Before/after images or redo |
| Page Compaction |
| O(n) where n = live records | Full-page image typically |
The Insert Path in Detail
INSERT INTO orders (id, customer, amount) VALUES (42, 'Alice', 100.00);
┌──────────────────────────────────────────────────────┐
│ 1. Executor requests: insert this tuple on table heap │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ 2. Buffer manager: find page with sufficient space │
│ - Consult Free Space Map (FSM) │
│ - FSM says page 47 has ~2KB free │
│ - Request page 47 into buffer pool │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ 3. Lock page 47 (exclusive or content lock) │
│ 4. Verify free space: Upper - Lower ≥ tuple + slot │
│ 5. Generate WAL record (before modifying page) │
│ 6. Write tuple at offset = Upper - tupleSize │
│ 7. Add slot entry: slots[nslots++] = new offset │
│ 8. Update header: Upper -= tupleSize, Lower += 4 │
│ 9. Mark page dirty in buffer pool │
│ 10. Release page lock │
└──────────────────────────────────────────────────────┘
Note how the page insert involves multiple subsystems: executor, buffer manager, lock manager, WAL writer, and the page manipulation itself. Each step is carefully ordered to ensure recoverability and concurrency safety.
We've explored the fundamental building block of database storage—the page. Let's consolidate the key takeaways:
What's Next:
Now that you understand the overall page structure, we'll dive deeper into the slot directory—the indirection layer that enables stable record addressing, efficient updates, and garbage collection. The slot directory is where logical record IDs meet physical byte offsets.
You now understand the fundamental structure of database pages—the atomic containers that make all database operations possible. This knowledge is essential for understanding storage engineering, query optimization, and recovery mechanisms. Next, we'll explore the slot directory in detail.