Imagine a database system that has been running continuously for six months, processing millions of transactions daily. Now imagine the power fails. When the system restarts, how long should recovery take?
Without optimization, the answer could be six months of log replay—the database would need to process every single log record generated since its inception to restore consistency. This is clearly unacceptable. No business can tolerate hours, let alone months, of downtime while waiting for recovery.
This is the fundamental problem that checkpoints solve.
Checkpoints provide a mechanism to establish known good points in the database's history—points where we know exactly what state the database was in, what transactions were active, and what data was on disk. During recovery, instead of starting from the beginning of time, we can start from the most recent checkpoint.
The difference is transformative: recovery time drops from months to minutes, or even seconds.
By the end of this page, you will understand what checkpoints are conceptually, why they are essential for practical database systems, the key information captured during checkpointing, and the fundamental constraints that make checkpoint design a critical engineering challenge.
To understand checkpoints, we must first understand the problem they solve. Let's examine what happens during recovery without checkpoints.
The Write-Ahead Log (WAL) provides durability:
Every database modification is first recorded in a persistent log before being applied to the database. This ensures that even if the system crashes immediately after a commit, the committed changes can be recovered by replaying the log.
The problem: logs accumulate indefinitely:
As the database operates, log records accumulate continuously:
A busy database might generate gigabytes of log data per hour. Over months or years, the log can grow to terabytes.
| Database Type | Typical Log Rate | 1 Month Accumulation | 1 Year Accumulation |
|---|---|---|---|
| Small OLTP System | 10 MB/hour | ~7 GB | ~88 GB |
| Medium E-Commerce | 500 MB/hour | ~350 GB | ~4.3 TB |
| Large Financial System | 5 GB/hour | ~3.5 TB | ~43 TB |
| High-Frequency Trading | 50 GB/hour | ~35 TB | ~430 TB |
The recovery nightmare without checkpoints:
During recovery, the database must:
Redo all committed transactions: Replay every log record for every committed transaction to ensure their changes are present in the database.
Undo all uncommitted transactions: Reverse any changes made by transactions that were active at crash time.
Process in order: Log records must be processed sequentially to maintain consistency.
If we have a year's worth of logs and must replay from the beginning, recovery time becomes:
Recovery Time = Log Size / Processing Rate
= 4.3 TB / 100 MB/s (optimistic replay rate)
= ~12 hours
And this is for a medium-sized system. Large systems face days of recovery time—completely unacceptable for production databases where every minute of downtime costs money and reputation.
Modern businesses typically have recovery time objectives (RTOs) measured in minutes, not hours. A financial trading system might have an RTO of 60 seconds. An e-commerce platform might tolerate 5 minutes. No one plans for 12-hour recovery. Checkpoints are not optional—they are essential for meeting real-world availability requirements.
A checkpoint is a synchronization point in the database's operation where we capture a consistent snapshot of the current state. This snapshot establishes a boundary:
Everything before the checkpoint is safely on disk and does not need to be redone during recovery.
Think of checkpoints as bookmarks in the database's history. Instead of reading the entire book from the beginning, we can start reading from the bookmark.
The core insight:
If we know that at time T:

- all committed changes made before T have been written to their final locations on disk, and
- we have an exact record of which transactions were still active at T,

...then during recovery, we only need to process log records after time T. Everything before T is already safely persisted.
Formal definition:
A checkpoint is a record in the transaction log that contains:

- the LSN and timestamp marking the checkpoint's position in the log,
- the list of transactions that were active at checkpoint time (the Active Transaction Table), and
- the set of buffer-pool pages that were modified but not yet flushed to disk (the Dirty Page Table).

When recovery begins, the system locates the most recent checkpoint record and uses its information to:

- determine where in the log the redo pass must begin,
- initialize its transaction and dirty page tables instead of rebuilding them from the entire log history, and
- identify which transactions may need to be undone.
Every log record is assigned a unique, monotonically increasing Log Sequence Number (LSN). Think of the LSN as an address in the log—it tells you exactly where a record is located. Checkpoints reference LSNs to establish precise points in the log timeline.
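As a rough, hypothetical sketch (not any particular engine's API), the log can be modeled as an append-only sequence in which each record receives the next LSN:

```python
# Minimal sketch of LSN assignment: every appended record gets the next
# monotonically increasing position in the log. Names are illustrative,
# not drawn from any real database's API.
class LogManager:
    def __init__(self):
        self.next_lsn = 0
        self.records = []              # a real system appends to a durable file

    def append(self, payload: bytes) -> int:
        lsn = self.next_lsn
        self.records.append((lsn, payload))
        self.next_lsn += 1             # LSNs only ever move forward
        return lsn

log = LogManager()
first = log.append(b"UPDATE accounts SET balance = balance - 100 WHERE id = 1")
second = log.append(b"COMMIT txn 1001")
assert second > first                  # later records always have larger LSNs
```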
A checkpoint record contains precise information required to reconstruct the database state during recovery. Each piece of information serves a specific purpose in the recovery algorithm.
Let's examine each component in detail:
```
-- Conceptual structure of a checkpoint record (pseudo-SQL representation)

-- The checkpoint metadata
CHECKPOINT_RECORD {
    checkpoint_lsn:  LSN       = 12847293,                   -- Position in log
    checkpoint_time: TIMESTAMP = '2024-01-15 14:30:22.847',
    checkpoint_type: ENUM      = 'FUZZY',                    -- NORMAL or FUZZY

    -- Active Transaction Table
    active_transactions: [
        { txn_id: 1001, state: 'ACTIVE',    last_lsn: 12847100 },
        { txn_id: 1005, state: 'ACTIVE',    last_lsn: 12847250 },
        { txn_id: 1007, state: 'PREPARING', last_lsn: 12847280 }
    ],

    -- Dirty Page Table
    dirty_pages: [
        { page_id: 'T1:P42',  recovery_lsn: 12840000 },      -- Table 1, Page 42
        { page_id: 'T1:P43',  recovery_lsn: 12845000 },
        { page_id: 'T2:P17',  recovery_lsn: 12842000 },
        { page_id: 'IDX1:P5', recovery_lsn: 12846000 }       -- Index page
    ],

    -- Recovery pointers
    redo_start_lsn: LSN = 12840000,                           -- Minimum recovery_lsn
    master_record_updated: BOOLEAN = true
}
```

Understanding Recovery LSN:
The Recovery LSN in the Dirty Page Table is crucial for optimization. It represents the earliest log record that might need to be redone for that specific page.
During redo, the recovery system:

- starts its log scan at the smallest Recovery LSN in the Dirty Page Table (the redo start point),
- skips log records for pages that are not in the table, or whose Recovery LSN is later than the record's LSN, and
- reapplies only the changes that might be missing from disk.
This means even with a recent checkpoint, if some dirty page has an old Recovery LSN, redo might need to go further back. The checkpoint's dirty page table provides this visibility.
Without the Dirty Page Table, recovery would have to redo every log record since the checkpoint—even for pages that were already flushed to disk. The DPT allows recovery to skip redo for pages that don't need it, dramatically reducing recovery time for systems with many stable pages.
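As a minimal sketch (illustrative Python, not any engine's actual code, and assuming the dirty page table has already been brought up to date by an analysis pass over the post-checkpoint log), the redo pass can use the table to skip work like this:

```python
# Illustrative redo pass driven by the checkpoint's dirty page table (DPT).
# dirty_pages maps page_id -> recovery_lsn (earliest change possibly missing on disk).
# log_records is an ordered list of (lsn, page_id, change); page_lsn_on_disk tracks
# the LSN already reflected on each page. All names here are hypothetical.
def redo(log_records, dirty_pages, page_lsn_on_disk, apply_change):
    if not dirty_pages:
        return                                      # nothing could be missing from disk
    redo_start = min(dirty_pages.values())          # the checkpoint's redo_start_lsn
    for lsn, page_id, change in log_records:
        if lsn < redo_start:
            continue                                # before any possibly-missing change
        if page_id not in dirty_pages:
            continue                                # page was clean: already on disk
        if lsn < dirty_pages[page_id]:
            continue                                # this page's earlier changes are on disk
        if page_lsn_on_disk[page_id] >= lsn:
            continue                                # change already reflected on the page
        apply_change(page_id, change)               # reapply the logged change
        page_lsn_on_disk[page_id] = lsn
```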
The checkpoint creates a recovery boundary—a point in time before which we have complete certainty about the database state. This boundary provides strong guarantees that simplify recovery:
Guarantee 1: Committed transactions before the checkpoint are durable
Any transaction that committed before the checkpoint began has had its changes written to stable storage. During recovery, we don't need to redo committed work from before the checkpoint.
Guarantee 2: The dirty page table is complete
At checkpoint time, the system captures a complete list of modified pages in the buffer pool. Any page not in this list is either clean (matches disk) or wasn't yet modified.
Guarantee 3: The active transaction list is accurate
Any transaction not in the active transaction list either committed or aborted before the checkpoint. The only transactions that might need undo are those in the list.
The boundary simplifies recovery dramatically:
Without a checkpoint, recovery must:

- scan the log from the very first record,
- reconstruct the set of committed and uncommitted transactions from the entire history, and
- treat every page in the database as a candidate for redo.

With a checkpoint, recovery can:

- begin its log scan at the checkpoint's redo start point,
- initialize the active transaction table and dirty page table directly from the checkpoint record, and
- confine redo and undo work to the log written since the checkpoint.
Mathematical impact on recovery time:
Let:

- L = total log size since database creation (here, 1 TB),
- C = log written since the most recent checkpoint (here, 30 MB), and
- R = log replay rate during recovery (here, 100 MB/s).
Without checkpoint: Recovery Time = L / R = 1 TB / 100 MB/s ≈ 2.8 hours
With checkpoint: Recovery Time = C / R = 30 MB / 100 MB/s ≈ 0.3 seconds
The improvement is not incremental—it's four orders of magnitude.
Checkpoints bound recovery time by the checkpoint interval, not by the database's operational history. A database running for 10 years recovers just as fast as one running for 10 days—as long as both have recent checkpoints.
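To make that bound concrete, here is a back-of-the-envelope sketch (hypothetical numbers, same simplifying assumptions as the formulas above) tying worst-case replay work to the checkpoint interval rather than to uptime:

```python
# Worst-case redo work is roughly the log written since the last checkpoint,
# so recovery time scales with the checkpoint interval, not with uptime.
# Values are illustrative only.
def recovery_time_sec(log_rate_mb_per_sec: float,
                      checkpoint_interval_sec: float,
                      replay_rate_mb_per_sec: float) -> float:
    log_since_checkpoint_mb = log_rate_mb_per_sec * checkpoint_interval_sec
    return log_since_checkpoint_mb / replay_rate_mb_per_sec

# A system generating 1 MB/s of log, checkpointing every 5 minutes, and
# replaying at 100 MB/s faces ~3 seconds of redo—whether it has been
# running for 10 days or 10 years.
print(recovery_time_sec(1.0, 300, 100.0))   # -> 3.0
```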
Not all checkpoints are created equal. Different checkpoint types offer different tradeoffs between completeness, performance impact, and implementation complexity. Understanding these types is essential for database administrators and system designers.
The fundamental tension:
An ideal checkpoint would:

- capture a perfectly consistent snapshot of the database state,
- impose no slowdown or blocking on running transactions, and
- be simple to implement and simple to recover from.
In reality, we cannot achieve all three. Different checkpoint types prioritize different properties.
| Checkpoint Type | System Impact | Consistency | Implementation Complexity | Use Case |
|---|---|---|---|---|
| Consistent (Quiescent) | Very High—blocks all transactions | Perfect—clean database state | Low | Rarely used in production |
| Transaction-Consistent | High—waits for active transactions | Transaction-level consistency | Medium | Infrequent maintenance windows |
| Fuzzy (Non-Quiescent) | Low—minimal transaction blocking | Approximate—requires recovery work | High | Standard production operation |
| Non-Blocking | Very Low—almost invisible | Requires complex recovery | Very High | High-performance OLTP systems |
Almost all modern production databases (PostgreSQL, MySQL/InnoDB, Oracle, SQL Server) use fuzzy checkpoints by default. The consistent checkpoint is primarily of historical and educational interest. We will explore fuzzy checkpoints in detail in a later page.
Checkpoints enable an essential operational benefit: log truncation. Without the ability to discard old log records, logs would grow indefinitely, eventually consuming all storage.
The truncation principle:
Once a checkpoint completes successfully, log records before the checkpoint's redo start point are no longer needed for crash recovery. These records can be:

- archived to cheaper, slower storage (preserving point-in-time recovery), or
- deleted or recycled so their space can be reclaimed.
The truncation boundary:
The safe truncation point is the minimum of:

- the redo start LSN of the most recent completed checkpoint,
- the first LSN of the oldest still-active transaction, and
- the oldest LSN still required by any replica or replication slot.
```sql
-- Conceptual log truncation decision logic

-- Find the safe truncation point
DECLARE @safe_truncation_lsn LSN;

SET @safe_truncation_lsn = (
    SELECT MIN(lsn) FROM (
        -- Option 1: Last checkpoint's redo start
        SELECT redo_start_lsn AS lsn
        FROM sys.checkpoints
        WHERE checkpoint_id = (SELECT MAX(checkpoint_id) FROM sys.checkpoints)

        UNION ALL

        -- Option 2: Oldest active transaction's first LSN
        SELECT MIN(first_lsn) AS lsn
        FROM sys.active_transactions

        UNION ALL

        -- Option 3: Oldest required replication position
        SELECT MIN(confirmed_lsn) AS lsn
        FROM sys.replication_slots
    ) AS boundaries
);

-- Log records before @safe_truncation_lsn can be safely removed
-- Records after must be retained for recovery
```

Practical implications:
More frequent checkpoints = less log retained: Each checkpoint advances the truncation boundary, allowing older logs to be removed sooner.
Long-running transactions block truncation: A transaction running for hours prevents truncation of logs it might need for rollback, even if many checkpoints occur.
Replication can block truncation: If a replica falls behind, the primary must retain logs until the replica catches up.
Storage economics:
Checkpoint frequency directly impacts storage costs: a system that checkpoints every few minutes needs to retain only a few minutes' worth of log between truncations, while one that checkpoints hourly must keep an hour or more of log on disk (plus whatever long-running transactions and lagging replicas still require). For high-throughput systems generating gigabytes per minute, this difference is significant.
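As a quick back-of-the-envelope sketch (hypothetical numbers), the floor on retained log is roughly the log generated during one checkpoint interval, assuming nothing else holds truncation back:

```python
# Rough estimate of the minimum log that must stay on disk between truncations.
# Ignores long-running transactions and replication lag, which only increase it.
def min_retained_log_gb(log_rate_gb_per_min: float, checkpoint_interval_min: float) -> float:
    return log_rate_gb_per_min * checkpoint_interval_min

rate = 1.0  # GB of log per minute (hypothetical high-throughput system)
print(min_retained_log_gb(rate, 5))    # checkpoint every 5 minutes -> ~5 GB retained
print(min_retained_log_gb(rate, 60))   # checkpoint every hour      -> ~60 GB retained
```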
In production systems, log records should typically be archived before truncation. Archived logs enable point-in-time recovery beyond the last checkpoint—recovering to any moment in the past, not just the crash state. Never truncate without archiving unless you're certain PITR isn't needed.
Designing an effective checkpoint mechanism is one of the most challenging problems in database system engineering. The challenge stems from fundamental tensions that cannot be fully resolved—only balanced.
The core tensions:

- Recovery speed vs. runtime overhead: more frequent checkpoints mean faster recovery but more I/O and CPU spent during normal operation.
- Consistency vs. availability: capturing a perfectly consistent snapshot requires pausing work, while letting transactions continue means the captured state needs extra recovery effort.
- I/O smoothness vs. simplicity: flushing all dirty pages at once creates I/O storms, while spreading flushes over time adds implementation complexity.
Engineering decisions required:
When to checkpoint: Time-based? Transaction count-based? Log size-based? Some combination? (A minimal trigger sketch follows this list.)
How to flush dirty pages: All at once (I/O storm)? Gradually in background (complexity)? On-demand only (risk)?
How to capture consistent state: Block all transactions? Use multiversion snapshots? Accept some inconsistency?
How to handle very large databases: With terabytes of dirty pages, flushing all of them takes time. How to bound this?
How to interact with the buffer pool: Checkpoint should ideally work with, not against, buffer replacement policies.
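As a hedged sketch of the first question above, many systems combine a time limit with a log-volume limit, checkpointing when either is reached. The class below is purely illustrative; its names and thresholds are not defaults of any real system:

```python
import time

# Hypothetical hybrid checkpoint trigger: fire on elapsed time OR on log growth,
# whichever comes first. Thresholds are illustrative only.
class CheckpointTrigger:
    def __init__(self, max_interval_sec=300, max_log_bytes=1 << 30):
        self.max_interval_sec = max_interval_sec      # e.g. at least every 5 minutes
        self.max_log_bytes = max_log_bytes            # e.g. after ~1 GiB of new log
        self.last_checkpoint_time = time.monotonic()
        self.log_bytes_since_checkpoint = 0

    def on_log_append(self, nbytes: int) -> None:
        self.log_bytes_since_checkpoint += nbytes

    def should_checkpoint(self) -> bool:
        elapsed = time.monotonic() - self.last_checkpoint_time
        return (elapsed >= self.max_interval_sec
                or self.log_bytes_since_checkpoint >= self.max_log_bytes)

    def checkpoint_completed(self) -> None:
        self.last_checkpoint_time = time.monotonic()
        self.log_bytes_since_checkpoint = 0
```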
These questions have been studied and re-answered over 40 years of database research. Modern systems use sophisticated algorithms (like the checkpoint mechanism in ARIES) that represent the current best understanding of these trade-offs.
The ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) recovery algorithm, developed at IBM in the 1990s, revolutionized checkpoint design. Its fuzzy checkpoint mechanism allows transactions to proceed uninterrupted while capturing a recoverable state. Most modern databases implement variants of ARIES checkpointing.
Checkpoints are foundational to practical database recovery. Without them, recovery time would be bounded by operational history, making databases unusable after prolonged operation. Let's consolidate the key concepts:

- A checkpoint records a known point in the log (its LSN), the transactions active at that moment, and the dirty pages not yet flushed to disk.
- During recovery, the most recent checkpoint bounds how far back redo and undo must look, turning hours of log replay into seconds or minutes.
- Checkpoints define the safe boundary for log truncation, keeping log storage bounded.
- Production systems rely on fuzzy checkpoints, accepting some recovery-time work in exchange for minimal disruption to running transactions.
- Checkpoint design is a balancing act between recovery speed, runtime overhead, and implementation complexity.
What's next:
Now that we understand what checkpoints are conceptually, we'll examine how the checkpoint process actually works—the sequence of operations, the locking requirements, the I/O operations, and the interactions with the buffer pool and transaction manager.
You now understand the fundamental concept of checkpoints—what they are, what problems they solve, what information they capture, and why they are essential for practical database systems. Next, we'll dive into the mechanics of the checkpoint process itself.