Understanding failure types is only the first step. The real question is: What must a recovery system deliver? When the database restarts after a crash, or when a transaction is rolled back after an error, what guarantees should users expect?
Recovery requirements define the contract between the database system and its users. They specify what 'correct recovery' means, what resources are needed to achieve it, and what trade-offs exist between recovery capabilities and system performance.
This page examines recovery requirements comprehensively—the fundamental guarantees, the ACID connection, measurable objectives like RPO and RTO, and the architectural requirements that enable recovery. Understanding these requirements is essential for both designing recovery systems and making informed decisions about database deployment.
By the end of this page, you will understand the guarantees that recovery systems must provide, how these guarantees connect to ACID properties, the measurable recovery objectives (RPO, RTO), the resources required for recovery, and the trade-offs involved in recovery system design.
A correct recovery system must provide two fundamental guarantees after any failure:
The Two Cardinal Guarantees:

1. Durability of committed work: every transaction that committed before the failure must be fully present in the recovered database.
2. Atomicity of uncommitted work: every transaction that had not committed at the moment of failure must leave no trace in the recovered database.

These guarantees map directly to two ACID properties: the first is Durability, the second is Atomicity.
Formal Statement of Recovery Correctness:
Let DB_crash be the database state at the moment of failure, and DB_recovered be the database state after recovery completes. Recovery is correct if and only if:
DB_recovered = ApplyCommitted(DB_initial)
= The state that would result from perfectly executing
all committed transactions on the initial database,
with no trace of uncommitted transactions
This is a powerful requirement: the recovered database must look exactly as if all committed transactions completed normally and all uncommitted transactions never started.
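To make the definition concrete, here is a minimal sketch in Python, with a plain dictionary standing in for the database and a list of transactions tagged committed or uncommitted. The names `apply_committed` and `recovery_is_correct` are illustrative only, not part of any real system.

```python
# Illustrative check of the recovery-correctness definition above.
# A "database" is just a dict; a transaction is a list of (key, value) writes.

def apply_committed(db_initial, transactions):
    """Replay only committed transactions, in order, on the initial state."""
    state = dict(db_initial)
    for txn in transactions:
        if txn["committed"]:
            for key, value in txn["writes"]:
                state[key] = value
    return state

def recovery_is_correct(db_initial, transactions, db_recovered):
    """DB_recovered must equal ApplyCommitted(DB_initial): all committed
    work present, no trace of uncommitted work."""
    return db_recovered == apply_committed(db_initial, transactions)

# Example: T1 committed a transfer of 50; T2 crashed mid-flight.
initial = {"A": 100, "B": 200}
txns = [
    {"committed": True,  "writes": [("A", 50), ("B", 250)]},   # T1
    {"committed": False, "writes": [("A", 0)]},                # T2, uncommitted
]
print(recovery_is_correct(initial, txns, {"A": 50, "B": 250}))  # True
print(recovery_is_correct(initial, txns, {"A": 0,  "B": 250}))  # False: T2 leaked
```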
Why Both Guarantees Matter:
Violating either guarantee has severe consequences:
| Violation | Example Consequence | Business Impact |
|---|---|---|
| Missing committed data | Customer payment recorded as committed but lost after crash | Financial loss, legal liability, lost trust |
| Present uncommitted data | Half-completed transfer leaves money in both accounts | Data corruption, regulatory violations |
| Inconsistent state | Inventory count doesn't match order records | Business process failures, audit failures |
| Constraint violations | Foreign keys reference non-existent records | Application errors, data quality issues |
Some recovery systems provide additional guarantees, such as consistency with application-level invariants (the C in ACID), or recovery to a specific point in time. But the two cardinal guarantees—committed durability and uncommitted atomicity—are the non-negotiable foundation.
Beyond correctness, recovery systems are measured by two key performance objectives:
Recovery Point Objective (RPO):
RPO specifies the maximum acceptable data loss measured in time. It answers: 'How much data can we afford to lose?'
Recovery Time Objective (RTO):
RTO specifies the maximum acceptable downtime. It answers: 'How long can we be offline?'
| RPO/RTO Target | Infrastructure Required | Cost Level | Typical Use Case |
|---|---|---|---|
| RPO=0, RTO=minutes | Synchronous replication, automatic failover | $$$$ | Financial systems, critical infrastructure |
| RPO=minutes, RTO=hour | Async replication, log shipping, hot standby | $$$ | E-commerce, SaaS applications |
| RPO=hours, RTO=hours | Frequent backups, warm standby | $$ | Internal business systems |
| RPO=day, RTO=day | Daily backups, cold standby | $ | Development, archival systems |
The RPO/RTO Matrix:
Different failure types have different RPO/RTO characteristics:
| Failure Type | Typical RPO | Typical RTO | Governing Factor |
|---|---|---|---|
| Transaction Failure | 0 | Milliseconds | Rollback speed |
| System Failure | 0 | Minutes | Crash recovery speed |
| Single Disk Failure (RAID) | 0 | 0 | RAID rebuild (background) |
| Media Failure (all disks) | Depends on backup | Hours | Restore speed |
| Site Disaster | Depends on replication | Minutes-hours | Failover type |
Calculating RPO Requirements:
A practical way to set RPO is cost-benefit. Shortening the RPO (more frequent log backups, replication) is justified while:

(Cost of transactions lost in the RPO window) × (Probability of failure) > (Cost of protection for that window)

Simplified: what's the maximum you'd pay to recover one hour of work? If that exceeds the cost of hourly log backups, implement them.
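As a rough sketch of that rule, the comparison is a few lines of arithmetic. Every dollar figure below is an illustrative assumption, not a benchmark.

```python
# Back-of-the-envelope RPO decision, following the simplified rule above.
# All figures are illustrative assumptions.

value_of_one_hour_of_work = 50_000.0        # cost of losing one hour of transactions ($)
annual_failure_probability = 0.05           # chance of an unrecoverable failure per year
annual_cost_of_hourly_log_backups = 1_200.0 # storage + operations ($ / year)

expected_annual_loss = value_of_one_hour_of_work * annual_failure_probability

if expected_annual_loss > annual_cost_of_hourly_log_backups:
    print("Hourly log backups pay for themselves - implement them.")
else:
    print("A longer RPO (e.g. daily backups) may be acceptable.")
```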
Calculating RTO Requirements:
RTO_needed = Maximum acceptable downtime before unacceptable business impact
Consider:
- Revenue loss per hour of downtime
- SLA penalties
- Customer impact and churn risk
- Regulatory requirements
- Operational dependencies
Many organizations have stated RTO targets that they've never actually tested. The only way to know your real RTO is to practice recovery. A 'one hour RTO' that actually takes six hours in practice is a liability, not a plan.
Recovery doesn't happen by magic—it requires specific resources that must be maintained and protected. The availability and quality of these resources directly determines recovery capability.
Essential Recovery Resources:
| Failure Type | Minimum Required | For Full Recovery | Nice to Have |
|---|---|---|---|
| Transaction Failure | Log (undo info) | Log (undo + redo) | Savepoints |
| System Failure | Log + Data files | Log + Data + Checkpoint | Parallel recovery |
| Media Failure (data) | Backup + Archived logs | All logs to present | Incremental backups |
| Media Failure (log) | Backup | Backup + archived logs | Log on separate storage |
| Site Disaster | Off-site backup | Off-site replica | Geographic replication |
Resource Protection Requirements:
Recovery resources themselves must be protected:
1. Log Protection:
| Threat | Protection Mechanism |
|---|---|
| Disk failure | Store on RAID, separate from data |
| Corruption | Checksums on log records |
| Overflow | Adequate size, monitoring, archival |
| Performance contention | Dedicated fast storage |
2. Backup Protection:
| Threat | Protection Mechanism |
|---|---|
| Media failure at backup site | Multiple backup copies |
| Site disaster | Off-site storage |
| Corruption | Verification, checksums |
| Ransomware | Air-gapped copies |
| Retention expiration | Proper retention policy |
3. Checkpoint Protection:
Checkpoints are transient (each one simply marks a point in time), but the checkpoint process itself must be reliable: an interrupted or failed checkpoint must never leave the log in a state that misleads recovery about where redo has to begin.
The transaction log is the most critical resource for recovery. Protect it more carefully than the data files themselves. Data can be reconstructed from backup + logs. Logs cannot be reconstructed from anything. Separate storage, replication, and fast media for logs are essential investments.
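One of the protections listed above, checksums on log records, can be illustrated with a short sketch. The record layout below is invented for the example and uses CRC32; production systems typically use stronger or hardware-assisted checksums.

```python
import json
import zlib

def write_log_record(record: dict) -> bytes:
    """Serialize a log record and prepend its length and a CRC32 checksum."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    crc = zlib.crc32(payload)
    return len(payload).to_bytes(4, "big") + crc.to_bytes(4, "big") + payload

def read_log_record(buf: bytes) -> dict:
    """Verify the checksum before trusting the record; raise on corruption."""
    length = int.from_bytes(buf[0:4], "big")
    stored_crc = int.from_bytes(buf[4:8], "big")
    payload = buf[8:8 + length]
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("log record failed checksum - possible corruption")
    return json.loads(payload.decode("utf-8"))

rec = {"lsn": 42, "txn": "T7", "op": "UPDATE", "page": 913}
raw = write_log_record(rec)
assert read_log_record(raw) == rec                    # clean read succeeds
corrupted = raw[:-1] + bytes([raw[-1] ^ 0xFF])        # flip one payload byte
try:
    read_log_record(corrupted)
except ValueError as e:
    print(e)                                          # corruption detected
```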
Recovery operations themselves must satisfy certain properties to be reliable. These properties ensure that recovery works correctly even under adverse conditions:
Property 1: Idempotence
Recovery operations must be idempotent—applying them multiple times produces the same result as applying them once.
Why this matters: Recovery might be interrupted (another crash during recovery). When recovery restarts, it may re-execute operations it already performed. Idempotence guarantees this doesn't cause problems.
How it's achieved: Log Sequence Numbers (LSNs) track which operations have been applied. Before applying a redo operation to a page, compare the page's LSN to the log record's LSN. If the page LSN is already >= record LSN, the operation was already applied—skip it.
```
FUNCTION ApplyRedoOperation(log_record, page):
    // Idempotent redo: check if already applied
    IF page.lsn >= log_record.lsn:
        // This operation was already applied to this page
        // Skip it - applying again would be incorrect
        LogDebug("Skipping already-applied redo: " + log_record.lsn)
        RETURN

    // Operation not yet applied - do it now
    SWITCH log_record.type:
        CASE UPDATE:
            ApplyUpdate(page, log_record.after_image)
        CASE INSERT:
            InsertRecord(page, log_record.new_record)
        CASE DELETE:
            DeleteRecord(page, log_record.record_id)

    // Update page LSN to reflect this operation
    page.lsn = log_record.lsn
    LogDebug("Applied redo: " + log_record.lsn)
```

Property 2: Determinism
Recovery must be deterministic—given the same log and data files, recovery always produces the same result.
Why this matters: recovery may run at a different time, on different hardware, or repeatedly after nested crashes. If the same log and data files could produce different results, recovery could be neither trusted nor debugged.

How it's achieved: each log record carries complete information about its operation, and records are processed in a fixed order (by LSN), so no recovery decision depends on timing or on in-memory state that was lost in the crash.
Property 3: Consistency Preservation
Recovery must maintain database consistency at all times: internal structures such as pages and indexes must remain structurally valid at every step of redo and undo, not only when recovery finishes.

How it's achieved: physiological logging describes each change as an operation on a single page, so applying or undoing any individual log record leaves the affected page in a valid state.
| Property | Definition | Mechanism | Failure Mode If Violated |
|---|---|---|---|
| Idempotence | Multiple applications = single application | LSN comparison | Double-application corruption |
| Determinism | Same inputs = same outputs | Complete log information | Non-reproducible recovery |
| Consistency | Invariants preserved throughout | Physiological logging | Corrupt internal structures |
| Atomicity | Operations complete fully or not at all | Log-based undo/redo | Partial operation application |
| Durability | Completed recovery survives subsequent failures | Flush after recovery | Need to re-recover |
Recovery may be crashed and restarted multiple times. This is called 'nested recovery' or 'recovery from recovery.' The idempotence property, plus Compensation Log Records (CLRs) for undo operations, ensures that nested recovery works correctly. Each recovery attempt makes progress and doesn't undo previous progress.
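The sketch below illustrates the CLR idea in simplified form, using a Python list as the log and invented field names such as `undoes` and `restore_value`. Real ARIES CLRs also carry an UndoNext pointer; this sketch only shows why a second, restarted undo pass adds no duplicate compensation records.

```python
# Simplified sketch of undo with Compensation Log Records (CLRs).
# Each CLR records which update has already been undone, so a recovery that is
# itself interrupted never undoes the same change twice.

def undo_transaction(log, loser_txn):
    """Append CLRs for every not-yet-undone update of loser_txn.

    `log` is a list of dicts ordered by LSN; this function may be called
    repeatedly (recovery restarted after another crash) and stays correct.
    """
    # Updates already compensated by existing CLRs.
    compensated = {rec["undoes"] for rec in log if rec["type"] == "CLR"}
    # Walk the transaction's updates backwards, skipping compensated ones.
    updates = [rec for rec in log
               if rec["type"] == "UPDATE" and rec["txn"] == loser_txn]
    for rec in reversed(updates):
        if rec["lsn"] in compensated:
            continue  # already undone in a previous recovery attempt
        log.append({
            "lsn": log[-1]["lsn"] + 1,
            "type": "CLR",
            "txn": loser_txn,
            "undoes": rec["lsn"],
            "restore_value": rec["before_image"],
        })

log = [
    {"lsn": 1, "type": "UPDATE", "txn": "T1", "before_image": 100},
    {"lsn": 2, "type": "UPDATE", "txn": "T1", "before_image": 200},
]
undo_transaction(log, "T1")   # first recovery attempt
undo_transaction(log, "T1")   # "recovery from recovery": no duplicate CLRs
assert sum(r["type"] == "CLR" for r in log) == 2
```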
Recovery systems must balance recovery capability against normal operation performance. Stronger recovery guarantees often impose overhead during normal operation.
Normal Operation Overhead:
Recovery mechanisms impose costs during normal operation:
| Mechanism | Overhead Type | Typical Impact | Trade-off |
|---|---|---|---|
| Logging | Write amplification, I/O | 10-30% | More logging = faster recovery but slower operation |
| Checkpointing | Periodic I/O spike, pause | Variable | More frequent = faster recovery but more overhead |
| Log flushing | Commit latency | 1-10ms per commit | Sync flush = durability, async = speed |
| Backup | I/O, sometimes locks | Variable | Online backup = less disruption, more complexity |
| Replication | Network, slight latency | 1-5ms for sync | Sync = zero RPO, async = better performance |
Recovery Operation Performance:
Recovery itself should be as fast as possible:
Factors Affecting Recovery Time:
- Log Volume: more log records mean longer redo and undo phases
- Random I/O: redo must read scattered pages from disk
- Undo Transactions: every transaction active at crash time must be rolled back
- Page Reads: each dirty page must be read before redo can be applied
- Sequential Processing: traditional recovery is single-threaded
```
Recovery Time Estimation Model
==============================

Given:
- Checkpoint interval: C seconds
- Log generation rate: L MB/second
- Log processing rate: R MB/second (typically 100-500 MB/s)
- Average active transactions at crash: T
- Average transaction undo time: U seconds

Estimated Recovery Time:
========================

1. Log to Replay:     Log_size ≈ C × L (MB)
2. Redo Phase:        Redo_time ≈ Log_size / R (seconds)
3. Undo Phase:        Undo_time ≈ T × U (seconds)
4. Total:             Recovery_time ≈ Redo_time + Undo_time + Overhead

Example Calculation:
--------------------
C = 300 seconds (5-minute checkpoint)
L = 10 MB/second log rate
R = 200 MB/second processing rate
T = 50 active transactions
U = 0.5 seconds average undo

Log_size  = 300 × 10   = 3000 MB
Redo_time = 3000 / 200 = 15 seconds
Undo_time = 50 × 0.5   = 25 seconds
Overhead  ≈ 10 seconds (startup, analysis)

Recovery_time ≈ 15 + 25 + 10 = 50 seconds

To achieve RTO = 30 seconds:
- Reduce checkpoint interval to 2 minutes: Redo_time = 6 seconds
- Reduce active transactions to 20: Undo_time = 10 seconds
- New total (with ~5 seconds overhead): 6 + 10 + 5 = 21 seconds ✓
```

Modern database systems (PostgreSQL 15+, MySQL 8.0+, Oracle) support parallel recovery, processing multiple log records simultaneously. This can reduce recovery time by 70-90% compared to single-threaded recovery. Check if your system supports it and configure appropriately for your hardware.
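The estimation model above can also be written as a small Python helper so you can plug in your own figures. The parameter names are purely descriptive and not tied to any particular system.

```python
def estimate_recovery_time(checkpoint_interval_s, log_rate_mb_s,
                           process_rate_mb_s, active_txns,
                           avg_undo_s, overhead_s=10.0):
    """Recovery time estimate from the model above: redo + undo + overhead."""
    log_to_replay_mb = checkpoint_interval_s * log_rate_mb_s
    redo_s = log_to_replay_mb / process_rate_mb_s
    undo_s = active_txns * avg_undo_s
    return redo_s + undo_s + overhead_s

# Baseline example from the model above: ~50 seconds.
print(estimate_recovery_time(300, 10, 200, 50, 0.5))                  # 50.0

# Tighter checkpoints and fewer in-flight transactions for a 30 s RTO:
print(estimate_recovery_time(120, 10, 200, 20, 0.5, overhead_s=5))    # 21.0
```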
Recovery system design involves fundamental trade-offs. Understanding these trade-offs is essential for making informed decisions about database configuration and deployment:
Trade-off 1: Durability vs. Performance

Flushing the log synchronously at every commit (and replicating synchronously) guarantees zero data loss, but adds latency to each commit; asynchronous flushing and replication improve throughput at the risk of losing the most recent transactions after a failure.
Trade-off 2: Recovery Time vs. Operation Overhead
| Configuration | Normal Operation | Recovery Time |
|---|---|---|
| Infrequent checkpoints | Lower overhead | Longer recovery |
| Frequent checkpoints | Higher overhead | Shorter recovery |
| Large buffer pool | Better cache hit rate | More to redo |
| Small buffer pool | More I/O during operation | Less to redo |
| Aggressive page flushing | More disk writes | Less redo work |
| Lazy page flushing | Fewer disk writes | More redo work |
Trade-off 3: Space vs. Recoverability
| Configuration | Space Usage | Recovery Capability |
|---|---|---|
| Minimal logging | Less log space | Limited recovery options |
| Full logging | More log space | Complete recovery |
| Short log retention | Less archive storage | Limited PITR window |
| Long log retention | More archive storage | Extended PITR window |
| Infrequent backups | Less backup storage | Longer restore time |
| Frequent backups | More backup storage | Faster restore |
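A quick sizing sketch for the space side of this trade-off follows; every figure is an illustrative assumption.

```python
# Illustrative sizing of recovery storage for a chosen PITR window and
# backup retention. Figures are assumptions, not measurements.

daily_log_volume_gb = 40.0        # archived log generated per day
pitr_window_days = 14             # how far back point-in-time recovery must reach
weekly_full_backup_gb = 500.0
backups_retained = 4

archive_space_gb = daily_log_volume_gb * pitr_window_days
backup_space_gb = weekly_full_backup_gb * backups_retained

print(f"Archived logs for {pitr_window_days}-day PITR window: {archive_space_gb:.0f} GB")
print(f"Retained full backups: {backup_space_gb:.0f} GB")
print(f"Total recovery storage: {archive_space_gb + backup_space_gb:.0f} GB")
```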
Trade-off 4: Simplicity vs. Flexibility
Simpler recovery mechanisms are easier to implement correctly and faster to execute, but less flexible:
| Approach | Simplicity | Buffer-Management Constraint |
|---|---|---|
| NO-UNDO/REDO | Simpler | Dirty pages of uncommitted transactions must stay in the buffer pool until commit (no-steal) |
| UNDO/NO-REDO | Simpler | All of a transaction's dirty pages must be forced to disk at commit (force) |
| UNDO/REDO (ARIES) | Complex | No constraint (steal + no-force): maximum flexibility, best performance |
Every recovery configuration is a trade-off. Stronger guarantees cost more in performance or resources. The art is matching the configuration to your actual requirements—not over-provisioning (wasting resources) or under-provisioning (risking unacceptable outcomes).
A recovery system that has never been tested is not trustworthy. Verification and testing are essential requirements for any production recovery system:
Continuous Verification: recovery resources should be checked routinely and automatically—backups restored on a schedule to confirm they are usable, archived logs monitored for gaps, checksums validated—so that problems surface long before a real failure does.
Disaster Recovery Testing:
Full disaster recovery tests should be conducted periodically:
1. Tabletop Exercises: the team walks through the recovery runbook on paper, step by step, without touching any systems.
2. Technical Dry Runs: real backups are restored into an isolated test environment and each step is timed.
3. Full Failover Tests: production traffic is actually switched to the standby (and back) under controlled conditions.
4. Chaos Engineering: failures are deliberately injected—killed processes, lost disks, network partitions—to confirm the system recovers as designed.
```
Recovery Test Checklist
=======================

Pre-Test:
□ Notify stakeholders of test window
□ Ensure test environment is ready
□ Have rollback plan if test fails
□ Document current backup/archive state

Backup Verification:
□ Select backup to restore (recent + older)
□ Restore to test environment
□ Verify database opens successfully
□ Run integrity checks (DBCC, pg_catalog checks)
□ Verify row counts against known values
□ Test application connectivity
□ Document restore time (actual vs RTO)

PITR Test:
□ Restore to specific point in time
□ Verify data as of that timestamp
□ Confirm changes after that time are absent
□ Document process and timing

Failover Test:
□ Verify standby is synchronized
□ Initiate failover
□ Confirm new primary accepts connections
□ Verify data integrity
□ Run application smoke tests
□ Document failover time
□ Perform failback when ready

Post-Test:
□ Document all findings
□ Update procedures based on learnings
□ Track metrics (recovery times, issues found)
□ Schedule next test
```

Testing recovery at 3 AM on Sunday with a tiny database is not the same as recovering at noon on Monday with production load. Test with realistic data sizes, under realistic conditions, and include realistic complications (simultaneous network problems, missing documentation, key personnel unavailable).
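Parts of this checklist lend themselves to automation. The sketch below times a restore, runs verification steps, and compares the result to the RTO target. The restore and check commands are deliberately left as placeholder scripts (`/path/to/...`) because the real invocations depend entirely on your platform; they are assumptions, not real command lines.

```python
import subprocess
import time

# Minimal sketch of automating the "Backup Verification" steps above.
# RESTORE_CMD and CHECK_CMDS are placeholders for whatever your platform uses
# (a vendor restore utility plus integrity and row-count checks).

RESTORE_CMD = ["/path/to/restore-latest-backup.sh"]      # placeholder script
CHECK_CMDS = [
    ["/path/to/run-integrity-checks.sh"],                # placeholder script
    ["/path/to/verify-row-counts.sh"],                   # placeholder script
]
RTO_SECONDS = 3600

def run(cmd):
    """Run one step and fail loudly if it does not succeed."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{' '.join(cmd)} failed:\n{result.stderr}")

start = time.monotonic()
run(RESTORE_CMD)                       # restore to the test environment
restore_seconds = time.monotonic() - start

for cmd in CHECK_CMDS:                 # integrity checks, row counts, smoke tests
    run(cmd)

print(f"Restore took {restore_seconds:.0f}s (RTO target {RTO_SECONDS}s)")
if restore_seconds > RTO_SECONDS:
    print("WARNING: measured restore time exceeds the stated RTO")
```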
Let's consolidate the key concepts covered in this page: the two cardinal guarantees (committed work is durable, uncommitted work leaves no trace), the measurable objectives RPO and RTO, the resources recovery depends on and how to protect them, the properties recovery operations must satisfy, the trade-offs between recovery capability and normal-operation performance, and the need to verify recovery through testing.

Module Complete:

This completes our exploration of failure types and recovery requirements. You now understand the kinds of failures databases face, what correct recovery must guarantee, and what it costs to achieve those guarantees.
In the next module, we'll move from concepts to mechanisms, exploring Recovery Concepts in detail—how durability is actually achieved, how logs are structured, and how the recovery manager orchestrates the recovery process.
Congratulations! You've completed Module 1: Failure Types. You now have a comprehensive understanding of database failures—their types, classification, and the requirements for recovering from them. This foundation prepares you for the detailed study of recovery mechanisms in the following modules.