When a transaction modifies data—inserting a row, updating a value, or deleting a record—the change happens in the buffer pool, not directly on disk. The modified page becomes dirty: its in-memory contents differ from the persistent copy on disk.
Dirty pages represent a fundamental tension in database design. They enable high-performance modifications (a memory write is orders of magnitude faster than a random disk write), but they also represent uncommitted or unpersisted data that could be lost in a crash. Managing dirty pages correctly is essential for achieving both performance and durability.
By the end of this page, you will understand how databases track dirty pages, the various flushing strategies and their trade-offs, the relationship between dirty pages and Write-Ahead Logging, and how production systems balance performance against recovery time objectives.
A page becomes dirty when its in-memory contents are modified after being read from disk. Any operation that changes data on the page sets the dirty bit in the frame descriptor:
Operations that dirty a page include inserting a row, updating a value in place, deleting a record, and the index maintenance those changes trigger.

```cpp
// Dirty bit management in the buffer manager
class BufferPool {
public:
    // Called by the execution engine after modifying a page
    void markDirty(PageId page_id) {
        lock_guard<mutex> guard(pool_latch);
        auto it = page_table.find(page_id);
        if (it == page_table.end()) {
            throw PageNotFoundException("Page not in buffer pool");
        }
        FrameId frame_id = it->second;
        FrameDescriptor& desc = descriptors[frame_id];

        // Set the dirty bit
        desc.is_dirty = true;

        // Record the LSN of the modification for WAL ordering
        // This comes from the log record written before the modification
        desc.page_lsn = current_lsn;

        // Add to dirty page tracking (for checkpoint/background writer)
        dirty_page_list.insert(page_id);
    }

    // Check if a page is dirty
    bool isDirty(PageId page_id) {
        lock_guard<mutex> guard(pool_latch);
        auto it = page_table.find(page_id);
        if (it == page_table.end()) return false;
        return descriptors[it->second].is_dirty;
    }

    // Clear dirty bit after successful flush to disk
    void markClean(PageId page_id) {
        lock_guard<mutex> guard(pool_latch);
        auto it = page_table.find(page_id);
        if (it == page_table.end()) return;
        FrameId frame_id = it->second;
        descriptors[frame_id].is_dirty = false;
        dirty_page_list.erase(page_id);
    }
};
```

The dirty bit lifecycle:
clean (dirty bit false) → modified (true) → flushed (false) → modified again (true)

A page's dirty bit may toggle multiple times between disk reads. The key invariant: if the dirty bit is true, the disk copy is stale.
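The lifecycle above can be sketched with a minimal frame structure (a hypothetical `Frame` with `modify`/`flush` helpers, not a real buffer manager):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal frame: just the dirty bit and the page LSN.
struct Frame {
    uint64_t page_lsn = 0;   // LSN of the most recent modification
    bool is_dirty = false;   // true => the disk copy is stale
};

// Any modification sets the dirty bit and records the modification's LSN.
void modify(Frame& f, uint64_t lsn) {
    f.is_dirty = true;
    f.page_lsn = lsn;
}

// A successful flush (actual disk write elided) is the only thing that clears it.
void flush(Frame& f) {
    f.is_dirty = false;
}
```

Note that `flush` clears the bit without touching `page_lsn`: the LSN keeps advancing across flush cycles, which is what the WAL rule described later checks against.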
The dirty bit indicates that a page differs from disk. It does NOT indicate whether the changes are committed or durable. Durability comes from Write-Ahead Logging: changes are durable once their log records are on stable storage, regardless of whether the data pages are flushed.
Beyond per-frame dirty bits, databases maintain a Dirty Page Table (DPT)—a centralized data structure tracking all dirty pages in the buffer pool. The DPT serves multiple purposes:
```cpp
// Dirty Page Table structure
struct DirtyPageEntry {
    PageId page_id;               // The dirty page's identifier
    LSN recovery_lsn;             // LSN of the first modification since last
                                  // flush; called "recLSN" in ARIES terminology
    Timestamp first_dirtied;      // When the page first became dirty
    size_t modification_count;    // Number of modifications (optional metric)
};

class DirtyPageTable {
private:
    // Map from page_id to dirty page information
    unordered_map<PageId, DirtyPageEntry> entries;
    mutex dpt_latch;

public:
    // Add or update a dirty page entry
    void addDirtyPage(PageId page_id, LSN lsn) {
        lock_guard<mutex> guard(dpt_latch);
        auto it = entries.find(page_id);
        if (it == entries.end()) {
            // First modification since last flush
            DirtyPageEntry entry;
            entry.page_id = page_id;
            entry.recovery_lsn = lsn;   // RecLSN is the first modifying LSN
            entry.first_dirtied = now();
            entry.modification_count = 1;
            entries[page_id] = entry;
        } else {
            // Page already dirty - only update modification count
            // RecLSN stays as the FIRST modification's LSN
            it->second.modification_count++;
        }
    }

    // Remove page from DPT after successful flush
    void removeDirtyPage(PageId page_id) {
        lock_guard<mutex> guard(dpt_latch);
        entries.erase(page_id);
    }

    // Get the minimum recLSN across all dirty pages
    // This is the oldest log position needed for recovery
    LSN getMinRecoveryLSN() {
        lock_guard<mutex> guard(dpt_latch);
        LSN min_lsn = MAX_LSN;
        for (const auto& [page_id, entry] : entries) {
            if (entry.recovery_lsn < min_lsn) {
                min_lsn = entry.recovery_lsn;
            }
        }
        return min_lsn;
    }

    // Get pages to flush, sorted by oldest first
    vector<PageId> getPagesToFlush(size_t limit) {
        lock_guard<mutex> guard(dpt_latch);
        vector<pair<Timestamp, PageId>> candidates;
        for (const auto& [page_id, entry] : entries) {
            candidates.push_back({entry.first_dirtied, page_id});
        }
        // Sort by first_dirtied time (oldest first)
        sort(candidates.begin(), candidates.end());

        vector<PageId> result;
        for (size_t i = 0; i < min(limit, candidates.size()); i++) {
            result.push_back(candidates[i].second);
        }
        return result;
    }
};
```

The recLSN concept:
Each dirty page entry includes a recLSN (recovery LSN)—the LSN of the first log record that dirtied the page since its last flush. This is critical for recovery: the redo phase must begin its log scan at the minimum recLSN across all dirty pages, because no log record older than a page's recLSN can describe a change missing from that page's disk copy.
This is why frequent dirty page flushing reduces recovery time: it advances the minimum recLSN, shortening the log scan required during crash recovery.
During a checkpoint, the database writes the current Dirty Page Table to the log. This snapshot tells recovery exactly which pages were dirty at checkpoint time, eliminating the need to scan the entire log to reconstruct this information.
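The effect of flushing on the recovery point can be seen in a small sketch (a hypothetical map-based table; a real DPT also tracks timestamps and modification counts, as above):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <unordered_map>

using PageId = std::uint64_t;
using LSN    = std::uint64_t;

// Minimum recLSN across all dirty pages: the oldest log position that crash
// recovery's redo scan must start from. An empty table yields MAX_LSN,
// meaning no redo work is pending.
LSN minRecoveryLSN(const std::unordered_map<PageId, LSN>& dirty_pages) {
    LSN min_lsn = std::numeric_limits<LSN>::max();
    for (const auto& [page_id, rec_lsn] : dirty_pages) {
        if (rec_lsn < min_lsn) min_lsn = rec_lsn;
    }
    return min_lsn;
}
```

Flushing the page that holds the minimum recLSN (and erasing its entry) immediately advances the recovery start point and allows the log before it to be truncated.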
Dirty pages must eventually be written to disk. The timing and manner of these writes significantly impact performance, recovery time, and system stability. Databases employ various flushing strategies to balance these concerns.
| Strategy | Description | Trade-offs |
|---|---|---|
| On-demand (eviction) | Flush dirty page only when evicting | Good throughput, unpredictable latency |
| Periodic background | Background thread flushes pages at regular intervals | Smooth I/O, may not keep up with write rate |
| Threshold-based | Flush when dirty page count exceeds threshold | Prevents accumulation, burst I/O possible |
| Oldest-first (age-based) | Prioritize flushing pages dirty longest | Minimizes recovery time, advances log truncation |
| Checkpoint-driven | Flush all dirty pages during checkpoint | Simplest, causes I/O spikes |
Modern databases use a combination of strategies:
```cpp
// Background writer (page cleaner) implementation
class BackgroundWriter {
private:
    BufferPool* buffer_pool;
    DirtyPageTable* dirty_page_table;
    DiskManager* disk_manager;

    // Configuration
    Duration flush_interval = 100ms;
    size_t pages_per_flush = 64;
    double dirty_ratio_high = 0.75;  // Trigger aggressive flushing
    double dirty_ratio_low = 0.50;   // Normal flushing rate

    atomic<bool> running;

public:
    void writerLoop() {
        while (running) {
            // Calculate current dirty ratio
            double dirty_ratio = getDirtyPageRatio();

            // Determine how many pages to flush this cycle
            size_t pages_to_flush;
            if (dirty_ratio > dirty_ratio_high) {
                // Aggressive flushing: 2x normal rate
                pages_to_flush = pages_per_flush * 2;
            } else if (dirty_ratio > dirty_ratio_low) {
                // Normal flushing rate
                pages_to_flush = pages_per_flush;
            } else {
                // Low dirty ratio: minimal flushing
                pages_to_flush = pages_per_flush / 4;
            }

            // Get oldest dirty pages to flush
            vector<PageId> to_flush =
                dirty_page_table->getPagesToFlush(pages_to_flush);

            // Flush each page
            for (PageId page_id : to_flush) {
                flushPage(page_id);
            }

            // Sleep until next cycle
            this_thread::sleep_for(flush_interval);
        }
    }

    void flushPage(PageId page_id) {
        // Get the page from buffer pool
        Page* page = buffer_pool->getPageForFlush(page_id);
        if (page == nullptr) return;  // Page was evicted

        // Ensure WAL records are flushed first
        LSN page_lsn = buffer_pool->getPageLSN(page_id);
        log_manager->flushToLSN(page_lsn);

        // Write to disk
        disk_manager->writePage(page_id, page);

        // Mark page as clean
        buffer_pool->markClean(page_id);
        dirty_page_table->removeDirtyPage(page_id);
    }
};
```

Dirty pages cannot be flushed to disk arbitrarily. The Write-Ahead Logging (WAL) protocol imposes a critical constraint:
Before a dirty page can be written to disk, all log records that describe modifications to that page must first be written to stable storage.
This constraint ensures that if a crash occurs immediately after the page write, recovery can redo the modifications by reading the log. Without WAL, a partial page write could leave the database in an unrecoverable state.
```cpp
// Enforcing the WAL constraint during page flush
void BufferPool::flushDirtyPage(FrameId frame_id) {
    FrameDescriptor& desc = descriptors[frame_id];

    if (!desc.is_dirty) {
        return;  // Nothing to flush
    }

    // CRITICAL: WAL constraint enforcement
    // The page's LSN records the LSN of the most recent modification
    LSN page_lsn = desc.page_lsn;

    // Ensure all log records up to and including page_lsn are on disk
    // This may trigger a log flush if records are still in the log buffer
    log_manager->flushToLSN(page_lsn);

    // Now safe to write the page to disk
    // Even if a crash occurs, recovery can redo from the flushed log
    disk_manager->writePage(desc.page_id, &frames[frame_id]);

    // Update state
    desc.is_dirty = false;
    dirty_page_table->removeDirtyPage(desc.page_id);
}

// Log manager's flushToLSN implementation
void LogManager::flushToLSN(LSN target_lsn) {
    lock_guard<mutex> guard(log_latch);

    // If already flushed to this LSN, nothing to do
    if (target_lsn <= flushed_lsn) {
        return;
    }

    // Flush log buffer to disk up to target_lsn
    size_t bytes_to_flush =
        lsnToBufferOffset(target_lsn) - lsnToBufferOffset(flushed_lsn);

    // Write to log file with fsync for durability
    logFile->write(log_buffer + lsnToBufferOffset(flushed_lsn), bytes_to_flush);
    logFile->fsync();  // Ensure data reaches disk

    flushed_lsn = target_lsn;
}
```

Why WAL is essential:
Consider what happens without WAL: a page holding committed changes might never reach disk before a crash, leaving no record of those changes anywhere; conversely, a page holding uncommitted changes might reach disk with no way to roll them back. Either failure leaves the database in an unrecoverable state.

With WAL: every modification's log record reaches stable storage before the page does, so recovery can always redo committed changes that were lost and undo uncommitted changes that were flushed.
The WAL constraint means that dirty page writes depend on log writes. If the log is slow (e.g., writing to a slow disk), page flushing is also slow. This is why production databases place logs on the fastest available storage, often separate from data files.
Database systems make two fundamental policy decisions about dirty page management:
STEAL Policy: Can the buffer manager flush ("steal") a dirty page modified by an uncommitted transaction?
FORCE Policy: Must all dirty pages modified by a transaction be forced to disk before the transaction commits?
| Policy | Meaning | Requirements |
|---|---|---|
| STEAL | Can flush uncommitted transaction's pages | Requires UNDO capability in recovery |
| NO-STEAL | Cannot flush uncommitted transaction's pages | No UNDO needed; may run out of buffer space |
| FORCE | Must flush all transaction's pages before commit | No REDO needed; high commit latency |
| NO-FORCE | Need not flush pages before commit | Requires REDO capability in recovery |
The four combinations:
STEAL + FORCE: Allows stealing, requires forcing. Needs UNDO (for stolen pages of aborted transactions). No REDO needed (all committed changes are on disk at commit).
NO-STEAL + FORCE: No stealing, requires forcing. No UNDO, no REDO needed. Simple recovery but poor performance.
STEAL + NO-FORCE: Most common in production systems. Allows stealing, doesn't require forcing. Needs both UNDO and REDO. Best performance.
NO-STEAL + NO-FORCE: No stealing, doesn't require forcing. Needs REDO but not UNDO. Limited by buffer space.
STEAL allows the buffer manager to evict any unpinned page, preventing buffer pool exhaustion. NO-FORCE allows commits to complete without waiting for I/O. This combination maximizes flexibility and performance. The cost is recovery complexity: the recovery algorithm (like ARIES) must handle both UNDO and REDO.
```cpp
// STEAL + NO-FORCE implementation

// STEAL: When evicting, we can flush uncommitted pages
void BufferPool::evictPageWithSteal(FrameId frame_id) {
    FrameDescriptor& desc = descriptors[frame_id];
    if (desc.is_dirty) {
        // STEAL policy: we CAN flush uncommitted changes
        // WAL ensures we can undo if the transaction aborts
        flushDirtyPage(frame_id);
    }
    // Continue with eviction...
}

// NO-FORCE: Commit does NOT wait for page flushes
void TransactionManager::commitTransaction(TxnId txn_id) {
    // Step 1: Write COMMIT log record
    LSN commit_lsn = log_manager->writeCommitRecord(txn_id);

    // Step 2: Flush log up to commit record
    // This is the durability guarantee - log is on stable storage
    log_manager->flushToLSN(commit_lsn);

    // Step 3: NO-FORCE - we do NOT flush dirty data pages
    // The log flush guarantees durability; data pages can flush later

    // Step 4: Release locks, cleanup transaction state
    lock_manager->releaseAllLocks(txn_id);
    active_transactions.erase(txn_id);

    // Dirty pages will be flushed by the background writer eventually
    // Recovery will REDO any committed changes not yet on disk
}
```

The dirty page ratio—the proportion of buffer pool pages that are dirty—is a critical metric for database performance and stability. Both extremes cause problems: when the ratio is too high, evictions stall waiting for flushes, checkpoints produce I/O spikes, and recovery time grows; when it is too low, pages are flushed after almost every modification, forfeiting write coalescing and wasting I/O bandwidth.
Optimal dirty page management:
The ideal approach balances these concerns:
Target ratio: Maintain a dirty page ratio that provides write coalescing benefits without risking eviction stalls. Common targets are 25-50%.
Flush prioritization: Prioritize flushing the oldest dirty pages to advance the recovery point and allow log truncation.
Adaptive flushing: Increase flush rate when dirty ratio exceeds threshold; decrease when it's low.
Write coalescing: Allow multiple modifications to the same page before flushing, reducing total I/O.
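The benefit of coalescing can be quantified with a toy model (hypothetical counters, assuming each page flush costs one disk write):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

using PageId = unsigned;

// Write-through: every modification is written to disk immediately.
std::size_t writesWithoutCoalescing(const std::vector<PageId>& mods) {
    return mods.size();
}

// Deferred flushing: a page absorbs any number of modifications in memory
// and costs a single disk write when it is finally flushed.
std::size_t writesWithCoalescing(const std::vector<PageId>& mods) {
    std::unordered_set<PageId> dirty(mods.begin(), mods.end());
    return dirty.size();
}
```

A hot page touched a thousand times between flushes costs one write instead of a thousand, which is exactly why a very low dirty ratio wastes I/O bandwidth.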
In PostgreSQL, monitor pg_stat_bgwriter.buffers_backend (pages flushed by backends, indicating background writer isn't keeping up) and pg_stat_bgwriter.buffers_clean (pages flushed by background writer). In MySQL, check Innodb_buffer_pool_pages_dirty and Innodb_buffer_pool_wait_free (waits for free pages).
| Database | Parameter | Purpose |
|---|---|---|
| PostgreSQL | bgwriter_lru_maxpages | Max pages written per bgwriter round |
| PostgreSQL | bgwriter_delay | Milliseconds between bgwriter rounds |
| MySQL | innodb_max_dirty_pages_pct | Target max dirty page percentage |
| MySQL | innodb_io_capacity | I/O operations per second for flushing |
| Oracle | DB_WRITER_PROCESSES | Number of database writer processes |
Dirty page management directly impacts crash recovery performance. The relationship is straightforward: more unflushed dirty pages means more work during recovery.
How dirty pages affect recovery:
Analysis phase: Reconstructs the Dirty Page Table to identify pages that need redo
Redo phase: Must replay log records for all dirty pages since the oldest recLSN
Undo phase: Uses the log to undo uncommitted transactions whose pages were stolen
The key insight: flushing dirty pages and checkpointing both advance the recovery start point. After a checkpoint, recovery only needs to scan log records written since the checkpoint.
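As a back-of-envelope illustration (hypothetical throughput assumptions: roughly 100 MB/s log scan and 10,000 pages/s redo):

```cpp
#include <cassert>
#include <cstdint>

// Rough recovery-time estimate: time to scan the log from the minimum
// recLSN plus time to redo the dirty pages. Both rates are assumptions
// to be replaced with measured numbers for real hardware.
double estimateRecoverySeconds(std::uint64_t log_bytes_since_min_reclsn,
                               std::uint64_t dirty_pages) {
    const double log_scan_bytes_per_sec = 100.0 * 1024 * 1024;
    const double redo_pages_per_sec     = 10000.0;
    return log_bytes_since_min_reclsn / log_scan_bytes_per_sec
         + dirty_pages / redo_pages_per_sec;
}
```

A gigabyte of log since the oldest recLSN plus 50,000 dirty pages already puts the estimate near 15 seconds; keeping the oldest recLSN fresh is what keeps this number inside an RTO.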
```cpp
// How dirty pages affect recovery time
class RecoveryManager {
    // Analyze dirty page impact on recovery
    RecoveryEstimate estimateRecoveryTime(Checkpoint checkpoint) {
        RecoveryEstimate estimate;

        // Get the dirty page table from the checkpoint
        DirtyPageTable dpt = checkpoint.dirty_page_table;

        // Find the oldest recLSN - this is where redo starts
        LSN oldest_rec_lsn = dpt.getMinRecoveryLSN();

        // Calculate log bytes to scan
        LSN current_lsn = log_manager->getCurrentLSN();
        size_t log_bytes_to_scan = current_lsn - oldest_rec_lsn;

        // Estimate number of pages to redo
        size_t pages_to_redo = dpt.count();

        // Estimate recovery time
        // Log scan rate: ~100 MB/s
        // Page redo rate: ~10,000 pages/s
        // (use floating-point divisors to avoid integer truncation)
        estimate.log_scan_time = log_bytes_to_scan / (100.0 * 1024 * 1024);
        estimate.redo_time = pages_to_redo / 10000.0;
        estimate.total_time = estimate.log_scan_time + estimate.redo_time;

        return estimate;
    }

    // Recommendation: Keep dirty page age bounded
    Duration getOldestDirtyPageAge() {
        vector<DirtyPageEntry> entries = dirty_page_table.getAllEntries();
        Timestamp oldest = now();
        for (const auto& entry : entries) {
            if (entry.first_dirtied < oldest) {
                oldest = entry.first_dirtied;
            }
        }
        return now() - oldest;
    }

    // Alert if dirty pages are too old
    void checkDirtyPageAge() {
        Duration max_age = 5min;  // Target: no page dirty for > 5 minutes
        Duration oldest = getOldestDirtyPageAge();
        if (oldest > max_age) {
            log_warning("Oldest dirty page is {} old; recovery time is elevated",
                        oldest);
        }
    }
};
```

If your RTO is 5 minutes, your dirty page management must ensure recovery can complete in that time. This often means aggressive background flushing and frequent checkpoints. There's a direct trade-off between runtime performance (less I/O) and recovery speed (more I/O).
Dirty page management is central to database performance and reliability. Proper handling ensures efficient use of the buffer pool while maintaining durability and enabling fast recovery.
What's next:
We've examined pages and their states. The next page explores the Buffer Manager: the component that orchestrates all these operations, providing the interface between query execution and the buffer pool.
You now understand dirty page management: tracking, flushing, WAL constraints, and performance implications. This knowledge is essential for database tuning and understanding recovery behavior.