How often should a database perform checkpoints?
If too infrequent: recovery after a crash must replay a long stretch of log, and restarts can take many minutes or even hours.
If too frequent: the database spends a significant share of its I/O budget flushing pages and writing checkpoint records, hurting the throughput and latency of normal transactions.
Finding the optimal checkpoint frequency is a critical tuning decision that balances recovery time requirements against operational efficiency. This page explores the factors that influence checkpoint frequency, provides frameworks for making this decision, and examines adaptive approaches that modern databases use to optimize dynamically.
By the end of this page, you will understand the trade-offs governing checkpoint frequency, how to calculate optimal intervals based on your requirements, the different triggers for checkpoints (time, size, events), and how modern databases adapt checkpoint frequency dynamically.
Checkpoint frequency involves a classic engineering trade-off: investing resources during normal operation to save resources during failure recovery. Let's quantify this trade-off.
The recovery time component:
Recovery time after a crash depends on:
- How much log must be replayed (redo) since the last checkpoint's redo point
- How much uncommitted work must be rolled back (undo)
- Fixed startup costs before replay can begin
For most systems, redo dominates, and redo time is roughly bounded by:
Max Redo Time ≈ (Checkpoint Interval + Flushing Lag) × Log Rate / Redo Rate
Where:
- Checkpoint Interval = time between checkpoints
- Flushing Lag = age of the oldest unflushed dirty page at checkpoint time
- Log Rate = rate at which the workload generates log (WAL)
- Redo Rate = rate at which recovery can replay log
The operational cost component:
Each checkpoint incurs:
- I/O to flush dirty pages to storage
- Checkpoint records written to the log
- A potential latency impact on concurrent transactions while the flush runs
| Checkpoint Interval | Expected Recovery Time | Checkpoint I/O Overhead | Log Storage Needed |
|---|---|---|---|
| 1 minute | ~30-60 seconds | High (60 checkpoints/hour) | Minimal (~1 minute of logs) |
| 5 minutes | ~2-5 minutes | Moderate (12 checkpoints/hour) | Low (~5-10 minutes of logs) |
| 15 minutes | ~5-15 minutes | Low (4 checkpoints/hour) | Moderate (~15-30 minutes of logs) |
| 1 hour | ~20-60 minutes | Very Low (1 checkpoint/hour) | High (~1-2 hours of logs) |
The optimal checkpoint frequency starts with your Recovery Time Objective (RTO). If your RTO is 5 minutes, you need checkpoint and flushing configured to ensure recovery completes within 5 minutes. Work backward from this requirement, then ensure the resulting overhead is acceptable.
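As a minimal sketch of that backward calculation (Python, with made-up rates that you would replace with measurements from your own system), you can invert the redo-time bound to find the largest checkpoint interval that still meets the RTO:

```python
# Sketch: largest checkpoint interval that still meets a recovery time objective.
# Inverts: RTO >= startup + (log_rate * (interval + flushing_lag)) / redo_rate
# All inputs are hypothetical; measure them on your own system.

def max_checkpoint_interval_sec(rto_sec: float,
                                log_rate_mb_per_sec: float,
                                redo_rate_mb_per_sec: float,
                                flushing_lag_sec: float,
                                startup_overhead_sec: float) -> float:
    redo_budget_sec = rto_sec - startup_overhead_sec
    max_log_mb = redo_budget_sec * redo_rate_mb_per_sec  # log we can afford to replay
    return max_log_mb / log_rate_mb_per_sec - flushing_lag_sec

if __name__ == "__main__":
    # 5-minute RTO, 10 MB/s WAL generation, 50 MB/s replay, 60 s lag, 10 s startup
    interval = max_checkpoint_interval_sec(300, 10.0, 50.0, 60.0, 10.0)
    print(f"Checkpoint at least every {interval:.0f} seconds")  # ~1390 s (~23 min)
```

Round the result down, then confirm from the overhead table above that the implied checkpoint frequency is operationally acceptable.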
To choose checkpoint frequency rationally, we need to estimate recovery time. Here's a framework for this calculation.
Components of recovery time:
-- Recovery time estimation model

-- Input parameters (measure or estimate for your system)
log_rate_mb_per_sec: FLOAT = 10.0         -- How fast your system generates WAL
redo_rate_mb_per_sec: FLOAT = 50.0        -- How fast recovery can replay WAL
checkpoint_interval_sec: INT = 300        -- 5 minutes
avg_flushing_lag_sec: INT = 60            -- Oldest dirty page age
avg_uncommitted_txn_size_mb: FLOAT = 0.1  -- Average uncommitted work
avg_active_txn_count: INT = 10            -- Transactions active at any time
startup_overhead_sec: INT = 10            -- Fixed startup costs

-- Derived calculations

-- Log accumulated between redo_lsn and crash (worst case)
max_log_for_redo_mb = log_rate_mb_per_sec × (checkpoint_interval_sec + avg_flushing_lag_sec)
                    = 10 × (300 + 60) = 3600 MB = 3.6 GB

-- Redo time (log volume / redo rate)
redo_time_sec = max_log_for_redo_mb / redo_rate_mb_per_sec
              = 3600 / 50 = 72 seconds

-- Undo time (depends on uncommitted work)
undo_work_mb = avg_uncommitted_txn_size_mb × avg_active_txn_count
             = 0.1 × 10 = 1 MB
undo_time_sec = undo_work_mb / redo_rate_mb_per_sec  -- Similar rate
              = 1 / 50 = 0.02 seconds (negligible)

-- Total recovery time
total_recovery_time_sec = startup_overhead_sec + redo_time_sec + undo_time_sec
                        = 10 + 72 + 0.02 ≈ 82 seconds

-- Result: Expected recovery time is ~82 seconds with 5-minute checkpoints

Key observations from the model:
Redo dominates: In most systems, undo time is negligible compared to redo. Transactions typically commit quickly, so uncommitted work is small.
Flushing lag matters: Even with recent checkpoints, if dirty pages aren't flushed, redo_lsn may be much older than checkpoint time. Flushing lag directly adds to recovery time.
Redo rate depends on I/O: Recovery is typically I/O-bound. Faster storage = faster recovery. SSDs can achieve 10x faster redo than HDDs.
Log rate varies wildly: OLTP systems may generate 100s of MB/sec during peak. Batch loads may generate GB/sec. Idle systems generate almost nothing. Use peak rates for conservative estimates.
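The log-rate input is straightforward to measure. Below is a rough Python sketch, assuming PostgreSQL 10+ and the psycopg2 driver (the DSN is a placeholder), that samples pg_current_wal_lsn() to estimate how fast WAL is being generated:

```python
# Sketch: measure WAL generation rate on PostgreSQL 10+ via psycopg2.
# The DSN is a placeholder; run it during peak load for a conservative number.
import time
import psycopg2

def wal_rate_mb_per_sec(dsn: str, sample_sec: int = 60) -> float:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")
        start_lsn = cur.fetchone()[0]
        time.sleep(sample_sec)
        # pg_wal_lsn_diff() returns the byte distance between two WAL positions
        cur.execute("SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), %s)", (start_lsn,))
        delta_bytes = cur.fetchone()[0]
    return float(delta_bytes) / (1024 * 1024) / sample_sec

if __name__ == "__main__":
    rate = wal_rate_mb_per_sec("dbname=postgres user=postgres host=localhost")
    print(f"Observed WAL generation rate: {rate:.1f} MB/s")
```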
These calculations are estimates. Actual recovery time depends on hardware, data patterns, and transaction mix. The only way to know your true recovery time is to test it—intentionally crash a test system and measure recovery. Do this periodically as your workload evolves.
Checkpoints can be triggered by various conditions. Most systems use a combination of triggers to balance different concerns.
Common checkpoint triggers:
| Trigger Type | When Activated | Primary Purpose | Example Configuration |
|---|---|---|---|
| Time-based | After fixed interval elapses | Bound recovery time to predictable maximum | Every 5 minutes |
| Size-based | After N bytes of log generated | Limit log storage consumption | At 1 GB of new WAL |
| Transaction-count | After N transactions commit | Bound work since last checkpoint | Every 10,000 commits |
| Manual/On-demand | Administrator request | Before maintenance or backup | CHECKPOINT command |
| Shutdown | Database is stopping | Minimize restart recovery | Always on clean shutdown |
| Background writer | Dirty page percentage too high | Prevent buffer pool exhaustion | At 80% dirty pages |
Time-based triggers:
The simplest approach: checkpoint every N minutes. This provides predictable recovery time bounds regardless of workload.
checkpoint_timeout = 5min (PostgreSQL)
Size-based triggers:
Checkpoint when the log reaches a certain size since the last checkpoint. This bounds log storage and is useful when log space is constrained.
max_wal_size = 1GB (PostgreSQL)
innodb_log_file_size = 1G (MySQL)
Combining triggers:
Most systems checkpoint when either condition is met:
Checkpoint if (time since last checkpoint ≥ timeout) OR (log size since last checkpoint ≥ max_size)
This ensures both recovery time and log storage are bounded.
-- Conceptual checkpoint trigger logic in a database system

FUNCTION should_checkpoint_now():
    last_cp = get_last_checkpoint_info()

    -- Time-based trigger
    IF current_time - last_cp.timestamp >= checkpoint_timeout:
        RETURN TRUE, "Time interval exceeded"

    -- Size-based trigger
    current_lsn = get_current_log_position()
    log_since_checkpoint = current_lsn - last_cp.lsn
    IF log_since_checkpoint >= max_wal_size:
        RETURN TRUE, "Log size exceeded"

    -- Transaction count trigger (if configured)
    IF transactions_since_checkpoint >= max_transactions_per_checkpoint:
        RETURN TRUE, "Transaction count exceeded"

    -- Dirty page percentage trigger
    IF dirty_page_percentage >= emergency_dirty_threshold:
        RETURN TRUE, "Dirty page pressure"

    -- Manual trigger
    IF checkpoint_requested_by_admin:
        RETURN TRUE, "Manual request"

    -- No trigger condition met
    RETURN FALSE, NULL

-- Run by background checkpoint scheduler
WHILE database_is_running:
    SLEEP(checkpoint_check_interval)  -- e.g., 1 second
    should, reason = should_checkpoint_now()
    IF should:
        LOG("Starting checkpoint: " + reason)
        perform_checkpoint()

Clean shutdowns always perform a complete checkpoint, flushing all dirty pages. This means restart after clean shutdown requires no recovery at all—the database opens immediately. Always prefer clean shutdowns over kill -9 for this reason.
Different workloads have different checkpoint requirements. Understanding your workload helps choose appropriate settings.
OLTP (Online Transaction Processing) workloads: many short, latency-sensitive transactions generating a steady stream of log.
Recommendation: Use time-based triggers with smooth I/O spreading. Target checkpoint completion in background without noticeable latency impact.
OLAP (Online Analytical Processing) workloads: long-running, read-heavy queries with comparatively little write activity.
Recommendation: Less frequent checkpoints, larger log allowance. Focus on not interrupting long-running queries.
Batch/ETL workloads: bursty, write-heavy jobs that can generate enormous volumes of log in short windows.
Recommendation: Size-based triggers to handle bursts. Consider manual checkpoints between major batch jobs.
| Setting | OLTP Recommendation | OLAP Recommendation | Batch Recommendation |
|---|---|---|---|
| Time trigger | 5-10 minutes | 15-30 minutes | Manual/between batches |
| Size trigger | 500 MB - 1 GB | 2-5 GB | 5-10 GB |
| I/O spreading | High (>0.9) | Medium (0.5-0.7) | Low (aggressive flush) |
| Background flushing | Aggressive | Moderate | Aggressive during quiet |
| Recovery time target | 30-60 seconds | 5-15 minutes | Varies |
Mixed workloads:
Real systems often have mixed workloads—OLTP during the day, batch processing at night. Checkpoint configuration might need to vary:
- Tighter time and size triggers during interactive daytime hours
- Longer intervals and larger log allowances during overnight batch windows
Some systems support changing checkpoint parameters at runtime:
-- PostgreSQL: Change checkpoint timeout dynamically
ALTER SYSTEM SET checkpoint_timeout = '10min';
SELECT pg_reload_conf();
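Building on that runtime-change capability, a mixed day/night system could script the switch. The following is a hypothetical Python sketch, assuming psycopg2 and a role permitted to run ALTER SYSTEM; the profile values are purely illustrative:

```python
# Sketch: switch checkpoint settings between OLTP hours and a batch window.
# Assumes psycopg2 and a role allowed to run ALTER SYSTEM / pg_reload_conf();
# the profile values are illustrative examples, not recommendations.
from datetime import datetime
import psycopg2

DAYTIME = {"checkpoint_timeout": "5min", "max_wal_size": "1GB"}     # OLTP hours
NIGHTTIME = {"checkpoint_timeout": "30min", "max_wal_size": "8GB"}  # batch window

def apply_checkpoint_profile(dsn: str) -> None:
    profile = DAYTIME if 7 <= datetime.now().hour < 22 else NIGHTTIME
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    with conn.cursor() as cur:
        for name, value in profile.items():
            cur.execute(f"ALTER SYSTEM SET {name} = %s", (value,))
        cur.execute("SELECT pg_reload_conf()")  # both settings reload without restart
    conn.close()

if __name__ == "__main__":
    apply_checkpoint_profile("dbname=postgres user=postgres host=localhost")
```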
Workloads evolve. Yesterday's settings may not suit today's traffic patterns. Monitor checkpoint frequency, duration, and impact on transaction latency. Adjust settings quarterly or when workload characteristics change significantly.
A naive checkpoint implementation would immediately flush all dirty pages, creating an I/O storm. Modern databases spread checkpoint I/O over time to avoid disrupting normal operations.
The I/O storm problem:
Imagine a database with 50,000 dirty pages at checkpoint time. Flushing all of them in a burst (roughly 400 MB at 8 KB per page):
- Saturates storage bandwidth for several seconds
- Competes with the I/O that user queries need right now
- Produces visible latency spikes for concurrent transactions
Checkpoint spreading:
Instead of flushing everything at once, spread the I/O over the checkpoint interval:
-- PostgreSQL checkpoint spreading configuration

-- checkpoint_completion_target controls spreading
-- Value between 0 and 1: fraction of interval over which to spread I/O
checkpoint_completion_target = 0.9  -- Spread over 90% of interval

-- Example: 5-minute checkpoint interval
-- Target = 0.9 means spread I/O over 4.5 minutes
-- Leaves 0.5 minutes buffer before next checkpoint

-- How this works:
-- 1. At checkpoint start, calculate total pages to flush
-- 2. Divide by (interval × target) to get pages per second
-- 3. Flush at this steady rate
-- 4. Adjust if falling behind or getting ahead

-- Example calculation:
checkpoint_interval = 300 seconds (5 minutes)
completion_target = 0.9
available_time = 300 × 0.9 = 270 seconds

dirty_pages = 50000
flush_rate = 50000 / 270 = 185 pages/second

-- Result: ~185 pages/second instead of 50000 burst

------------------------------------------------------

-- MySQL/InnoDB uses adaptive flushing
innodb_adaptive_flushing = ON
innodb_adaptive_flushing_lwm = 10  -- Low water mark percentage
innodb_io_capacity = 2000          -- Target IOPS for flushing

-- InnoDB continuously flushes based on redo log fill level
-- More aggressive as log fills up, gentler when log is empty

Adaptive flushing approaches:
Beyond simple spreading, some systems adapt flushing intensity based on current conditions:
Log fill level: As the log fills up (approaching max_wal_size), flush more aggressively to prevent running out of log space.
Dirty page percentage: If too many pages become dirty, increase flushing to prevent buffer pool exhaustion.
I/O bandwidth availability: When user I/O is low (quiet periods), flush more aggressively. Back off during peak load.
Recovery time prediction: Estimate current recovery time based on redo_lsn distance. If exceeding target, flush more aggressively.
SQL Server's indirect checkpoints:
SQL Server takes this further with target-based checkpoints:
ALTER DATABASE [MyDB] SET TARGET_RECOVERY_TIME = 60 SECONDS;
The system automatically adjusts checkpoint and flushing behavior to meet this target, eliminating manual tuning.
Spreading doesn't magically create more I/O capacity. If your system generates 100 MB/s of dirty pages but your disk can only flush 50 MB/s, you'll accumulate dirty pages regardless of spreading. Ensure your storage can handle worst-case flush requirements.
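One way to make that check concrete, sketched in Python with illustrative numbers, is to compare the rate at which the workload dirties pages against the write bandwidth you can dedicate to checkpoint flushing:

```python
# Sketch: steady-state check that storage can absorb checkpoint flushing.
# All numbers are illustrative; measure dirty-page and write rates on your system.

def flush_headroom_mb_per_sec(dirty_rate_mb_per_sec: float,
                              storage_write_mb_per_sec: float,
                              user_io_share: float = 0.5) -> float:
    """Write bandwidth left for checkpoint flushing after reserving a share
    for user I/O. Negative headroom means dirty pages accumulate no matter
    how the checkpoint I/O is spread."""
    checkpoint_budget = storage_write_mb_per_sec * (1.0 - user_io_share)
    return checkpoint_budget - dirty_rate_mb_per_sec

if __name__ == "__main__":
    # The example from the note: 100 MB/s of dirty pages, 50 MB/s of flush capacity
    headroom = flush_headroom_mb_per_sec(100.0, 50.0, user_io_share=0.0)
    print(f"Flush headroom: {headroom:.0f} MB/s")  # -50 MB/s: storage cannot keep up
```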
Effective checkpoint tuning requires monitoring. You need visibility into how often checkpoints occur, how long they take, and what impact they have on the system.
Key metrics to monitor:
-- PostgreSQL checkpoint monitoring

-- Background writer statistics (includes checkpoint info)
SELECT
    checkpoints_timed,      -- Scheduled checkpoints
    checkpoints_req,        -- Requested checkpoints (size/time exceeded)
    checkpoint_write_time,  -- Time spent writing pages (ms)
    checkpoint_sync_time,   -- Time spent syncing pages (ms)
    buffers_checkpoint,     -- Pages written by checkpoints
    buffers_backend         -- Pages written by backends (should be low)
FROM pg_stat_bgwriter;

-- Recent checkpoint information (requires pg_stat_statements or logging)
-- Enable checkpoint logging:
log_checkpoints = on

-- Log output will show:
-- LOG:  checkpoint starting: time
-- LOG:  checkpoint complete: wrote 4567 buffers (2.8%);
--       0 WAL file(s) added, 0 removed, 3 recycled;
--       write=45.234 s, sync=1.032 s, total=47.891 s

------------------------------------------------------

-- MySQL/InnoDB checkpoint monitoring

-- Check InnoDB status
SHOW ENGINE INNODB STATUS\G

-- Key sections to examine:
-- - LOG: shows redo log usage and checkpoint position
-- - BUFFER POOL AND MEMORY: shows dirty page count

-- Specific metrics
SELECT variable_name, variable_value
FROM performance_schema.global_status
WHERE variable_name IN (
    'Innodb_checkpoint_age',          -- Redo log distance
    'Innodb_checkpoint_max_age',      -- Max allowed distance
    'Innodb_pages_flushed',           -- Total pages flushed
    'Innodb_buffer_pool_pages_dirty'  -- Current dirty pages
);

------------------------------------------------------

-- SQL Server checkpoint monitoring

-- Recent checkpoints
SELECT
    database_id,
    last_log_backup_lsn,
    recovery_model_desc,
    log_reuse_wait_desc
FROM sys.databases;

-- DMV for log usage
SELECT * FROM sys.dm_db_log_stats(DB_ID());

Warning signs to watch for:
Checkpoints exceeding interval: If checkpoint duration > interval, checkpoints pile up. This often indicates insufficient I/O capacity or too-low interval setting.
Too many time-triggered checkpoints: If most checkpoints are time-triggered (not size-triggered), you might be checkpointing more often than necessary.
Too many size-triggered checkpoints: If most checkpoints are size-triggered, your time interval is too long for your log generation rate.
Backend writes increasing: In PostgreSQL, if backends (user queries) are doing writes instead of background processes, checkpointing is falling behind.
Recovery time estimate exceeding RTO: If current redo_lsn suggests recovery would exceed your target, increase checkpoint or flushing frequency.
Set up alerts for: checkpoint duration > 80% of interval, recovery time estimate > 80% of RTO, dirty page percentage > threshold. These are leading indicators—address them before they become outages.
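Those checks are simple enough to encode directly. Here is a hedged Python sketch that turns the thresholds into alert conditions, leaving metric collection (pg_stat_bgwriter queries, log parsing, and so on) out of scope:

```python
# Sketch: turn the leading-indicator thresholds into alert conditions.
# Metric collection is intentionally left abstract.

def checkpoint_alerts(checkpoint_duration_sec: float,
                      checkpoint_interval_sec: float,
                      recovery_estimate_sec: float,
                      rto_sec: float,
                      dirty_page_pct: float,
                      dirty_page_threshold_pct: float = 70.0) -> list:
    alerts = []
    if checkpoint_duration_sec > 0.8 * checkpoint_interval_sec:
        alerts.append("checkpoint duration exceeds 80% of interval")
    if recovery_estimate_sec > 0.8 * rto_sec:
        alerts.append("recovery time estimate exceeds 80% of RTO")
    if dirty_page_pct > dirty_page_threshold_pct:
        alerts.append("dirty page percentage above threshold")
    return alerts

if __name__ == "__main__":
    # 5-minute interval, 260 s checkpoint, 5-minute RTO, 200 s estimate, 45% dirty
    print(checkpoint_alerts(260, 300, 200, 300, 45.0))
    # -> ['checkpoint duration exceeds 80% of interval']
```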
Static checkpoint intervals are suboptimal for varying workloads. Modern systems increasingly use adaptive approaches that dynamically adjust checkpoint behavior based on current conditions.
Target-based checkpointing:
Instead of specifying interval and size triggers, specify the desired outcome:
-- SQL Server: Target 60-second recovery
ALTER DATABASE [MyDB] SET TARGET_RECOVERY_TIME = 60 SECONDS;
The system then automatically:
- Tracks dirty pages and the amount of log that would have to be replayed after a crash
- Adjusts background flushing so the estimated recovery time stays within the target
- Issues checkpoints as needed, without manual interval or size tuning
Advantages of target-based approach:
- You state the actual requirement (a recovery-time target) rather than proxy settings
- The database adapts automatically as the workload changes
- There are fewer knobs to tune, and fewer to get wrong
Predictive flushing:
Some systems predict future log generation based on recent history:
if log_rate is increasing:
increase flushing intensity now
(anticipate future checkpoint needs)
if log_rate is decreasing:
decrease flushing intensity
(save I/O for later)
Cost-based checkpointing:
Advanced systems might model checkpoint cost vs. recovery time savings:
checkpoint_benefit = (current_recovery_time - target_recovery_time) × recovery_value
checkpoint_cost = io_overhead + latency_impact
if checkpoint_benefit > checkpoint_cost:
trigger checkpoint
This allows checkpointing decisions to consider the actual business value of faster recovery.
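A toy version of that decision rule might look like the following Python sketch; the cost and value terms are arbitrary placeholders rather than anything a real system exposes:

```python
# Sketch: toy cost-based checkpoint decision. The weights are placeholders for
# business-specific values, not parameters of any real database system.

def should_checkpoint(current_recovery_sec: float,
                      target_recovery_sec: float,
                      recovery_value_per_sec: float,  # cost of each second of downtime
                      io_overhead_cost: float,        # estimated cost of the extra flush I/O
                      latency_impact_cost: float) -> bool:
    benefit = max(0.0, current_recovery_sec - target_recovery_sec) * recovery_value_per_sec
    cost = io_overhead_cost + latency_impact_cost
    return benefit > cost

if __name__ == "__main__":
    # 90 s over target, downtime valued at 2 units/s, checkpoint "costs" 100 units
    print(should_checkpoint(150, 60, 2.0, 80.0, 20.0))  # True: 90 * 2 = 180 > 100
```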
-- Conceptual adaptive checkpoint algorithm

FUNCTION adaptive_checkpoint_loop():
    WHILE database_is_running:
        current_recovery_estimate = estimate_recovery_time()
        target_recovery_time = get_configured_target()

        IF current_recovery_estimate > target_recovery_time × 0.8:
            -- Approaching target; increase flush intensity
            increase_background_flushing(intensity = HIGH)

            IF current_recovery_estimate > target_recovery_time:
                -- Exceeding target; trigger immediate checkpoint
                trigger_checkpoint(reason = "adaptive_recovery_target")

        ELSE IF current_recovery_estimate < target_recovery_time × 0.5:
            -- Well under target; reduce flushing to save I/O
            decrease_background_flushing(intensity = LOW)

        ELSE:
            -- Within comfortable range; maintain steady state
            maintain_background_flushing(intensity = MEDIUM)

        SLEEP(monitoring_interval)  -- Check every few seconds

FUNCTION estimate_recovery_time():
    -- Based on current redo_lsn distance and redo rate
    oldest_recovery_lsn = MIN(recovery_lsn FROM dirty_page_table)
    current_lsn = get_current_log_position()
    log_distance = current_lsn - oldest_recovery_lsn

    redo_rate = measure_recent_redo_rate()  -- From past recovery or benchmarks
    estimated_redo_time = log_distance / redo_rate

    RETURN estimated_redo_time + FIXED_OVERHEAD

Database research continues advancing checkpoint technology. Emerging approaches include machine learning-based prediction, NVRAM-aware checkpointing, and checkpointing optimized for cloud storage. Stay current with your database vendor's checkpoint improvements.
Checkpoint frequency is a crucial tuning parameter that directly impacts both recovery time and operational overhead. The optimal setting depends on your specific requirements, workload, and infrastructure. Let's consolidate the key concepts:
- Checkpoint frequency trades I/O overhead during normal operation against recovery time after a crash.
- Start from your RTO, work backward to an interval, and validate the estimate by actually testing recovery.
- Most systems combine time-based and size-based triggers so that both recovery time and log storage stay bounded.
- Spread checkpoint I/O across the interval to avoid I/O storms and latency spikes.
- Monitor checkpoint frequency, duration, and trigger reasons, and alert on leading indicators before they become outages.
- Prefer adaptive, target-based checkpointing where available; it tunes itself toward a recovery-time goal.
What's next:
Having understood checkpoint frequency optimization, we'll examine recovery with checkpoints—how the recovery process uses checkpoint information to restore the database, and how checkpoint quality affects recovery speed and correctness.
You now understand how to choose and tune checkpoint frequency for your database systems. You can calculate recovery time estimates, configure appropriate triggers, implement I/O spreading, monitor checkpoint behavior, and consider adaptive approaches. Next, we'll see how recovery actually uses checkpoints.