Understanding failure types is only the first step. The real question is: What must a recovery system deliver? When the database restarts after a crash, or when a transaction is rolled back after an error, what guarantees should users expect?
Recovery requirements define the contract between the database system and its users. They specify what 'correct recovery' means, what resources are needed to achieve it, and what trade-offs exist between recovery capabilities and system performance.
This page examines recovery requirements comprehensively—the fundamental guarantees, the ACID connection, measurable objectives like RPO and RTO, and the architectural requirements that enable recovery. Understanding these requirements is essential for both designing recovery systems and making informed decisions about database deployment.
By the end of this page, you will understand the guarantees that recovery systems must provide, how these guarantees connect to ACID properties, the measurable recovery objectives (RPO, RTO), the resources required for recovery, and the trade-offs involved in recovery system design.
A correct recovery system must provide two fundamental guarantees after any failure:
The Two Cardinal Guarantees:

1. Durability of committed work: every transaction that committed before the failure must be fully present in the recovered database.
2. Atomicity of uncommitted work: every transaction that had not committed at the moment of failure must leave no trace in the recovered database.

These guarantees map directly to two ACID properties: the first is Durability, the second is Atomicity.
Formal Statement of Recovery Correctness:
Let DB_crash be the database state at the moment of failure, and DB_recovered be the database state after recovery completes. Recovery is correct if and only if:
DB_recovered = ApplyCommitted(DB_initial)
= The state that would result from perfectly executing
all committed transactions on the initial database,
with no trace of uncommitted transactions
This is a powerful requirement: the recovered database must look exactly as if all committed transactions completed normally and all uncommitted transactions never started.
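To make the definition concrete, here is a minimal sketch in Python, with a plain dictionary standing in for the database and a list of transactions tagged committed or uncommitted. The names `apply_committed` and `recovery_is_correct` are illustrative only, not part of any real system.

```python
# Illustrative check of the recovery-correctness definition above.
# A "database" is just a dict; a transaction is a list of (key, value) writes.

def apply_committed(db_initial, transactions):
    """Replay only committed transactions, in order, on the initial state."""
    state = dict(db_initial)
    for txn in transactions:
        if txn["committed"]:
            for key, value in txn["writes"]:
                state[key] = value
    return state

def recovery_is_correct(db_initial, transactions, db_recovered):
    """DB_recovered must equal ApplyCommitted(DB_initial): all committed
    work present, no trace of uncommitted work."""
    return db_recovered == apply_committed(db_initial, transactions)

# Example: T1 committed a transfer of 50; T2 crashed mid-flight.
initial = {"A": 100, "B": 200}
txns = [
    {"committed": True,  "writes": [("A", 50), ("B", 250)]},   # T1
    {"committed": False, "writes": [("A", 0)]},                # T2, uncommitted
]
print(recovery_is_correct(initial, txns, {"A": 50, "B": 250}))  # True
print(recovery_is_correct(initial, txns, {"A": 0,  "B": 250}))  # False: T2 leaked
```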
Why Both Guarantees Matter:
Violating either guarantee has severe consequences:
| Violation | Example Consequence | Business Impact |
|---|---|---|
| Missing committed data | Customer payment recorded as committed but lost after crash | Financial loss, legal liability, lost trust |
| Present uncommitted data | Half-completed transfer leaves money in both accounts | Data corruption, regulatory violations |
| Inconsistent state | Inventory count doesn't match order records | Business process failures, audit failures |
| Constraint violations | Foreign keys reference non-existent records | Application errors, data quality issues |
Some recovery systems provide additional guarantees, such as consistency with application-level invariants (the C in ACID), or recovery to a specific point in time. But the two cardinal guarantees—committed durability and uncommitted atomicity—are the non-negotiable foundation.
Beyond correctness, recovery systems are measured by two key performance objectives:
Recovery Point Objective (RPO):
RPO specifies the maximum acceptable data loss measured in time. It answers: 'How much data can we afford to lose?'
Recovery Time Objective (RTO):
RTO specifies the maximum acceptable downtime. It answers: 'How long can we be offline?'
| RPO/RTO Target | Infrastructure Required | Cost Level | Typical Use Case |
|---|---|---|---|
| RPO=0, RTO=minutes | Synchronous replication, automatic failover | $$$$ | Financial systems, critical infrastructure |
| RPO=minutes, RTO=hour | Async replication, log shipping, hot standby | $$$ | E-commerce, SaaS applications |
| RPO=hours, RTO=hours | Frequent backups, warm standby | $$ | Internal business systems |
| RPO=day, RTO=day | Daily backups, cold standby | $ | Development, archival systems |
The RPO/RTO Matrix:
Different failure types have different RPO/RTO characteristics:
| Failure Type | Typical RPO | Typical RTO | Governing Factor |
|---|---|---|---|
| Transaction Failure | 0 | Milliseconds | Rollback speed |
| System Failure | 0 | Minutes | Crash recovery speed |
| Single Disk Failure (RAID) | 0 | 0 | RAID rebuild (background) |
| Media Failure (all disks) | Depends on backup | Hours | Restore speed |
| Site Disaster | Depends on replication | Minutes-hours | Failover type |
Calculating RPO Requirements:
A practical way to set RPO is cost-benefit. Shortening the RPO (more frequent log backups, replication) is justified while:

(Cost of transactions lost in the RPO window) × (Probability of failure) > (Cost of protection for that window)

Simplified: what's the maximum you'd pay to recover one hour of work? If that exceeds the cost of hourly log backups, implement them.
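As a rough sketch of that rule, the comparison is a few lines of arithmetic. Every dollar figure below is an illustrative assumption, not a benchmark.

```python
# Back-of-the-envelope RPO decision, following the simplified rule above.
# All figures are illustrative assumptions.

value_of_one_hour_of_work = 50_000.0        # cost of losing one hour of transactions ($)
annual_failure_probability = 0.05           # chance of an unrecoverable failure per year
annual_cost_of_hourly_log_backups = 1_200.0 # storage + operations ($ / year)

expected_annual_loss = value_of_one_hour_of_work * annual_failure_probability

if expected_annual_loss > annual_cost_of_hourly_log_backups:
    print("Hourly log backups pay for themselves - implement them.")
else:
    print("A longer RPO (e.g. daily backups) may be acceptable.")
```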
Calculating RTO Requirements:
RTO_needed = Maximum acceptable downtime before unacceptable business impact
Consider:
- Revenue loss per hour of downtime
- SLA penalties
- Customer impact and churn risk
- Regulatory requirements
- Operational dependencies
Many organizations have stated RTO targets that they've never actually tested. The only way to know your real RTO is to practice recovery. A 'one hour RTO' that actually takes six hours in practice is a liability, not a plan.
Recovery doesn't happen by magic—it requires specific resources that must be maintained and protected. The availability and quality of these resources directly determines recovery capability.
Essential Recovery Resources:
| Failure Type | Minimum Required | For Full Recovery | Nice to Have |
|---|---|---|---|
| Transaction Failure | Log (undo info) | Log (undo + redo) | Savepoints |
| System Failure | Log + Data files | Log + Data + Checkpoint | Parallel recovery |
| Media Failure (data) | Backup + Archived logs | All logs to present | Incremental backups |
| Media Failure (log) | Backup | Backup + archived logs | Log on separate storage |
| Site Disaster | Off-site backup | Off-site replica | Geographic replication |
Resource Protection Requirements:
Recovery resources themselves must be protected:
1. Log Protection:
| Threat | Protection Mechanism |
|---|---|
| Disk failure | Store on RAID, separate from data |
| Corruption | Checksums on log records |
| Overflow | Adequate size, monitoring, archival |
| Performance contention | Dedicated fast storage |
2. Backup Protection:
| Threat | Protection Mechanism |
|---|---|
| Media failure at backup site | Multiple backup copies |
| Site disaster | Off-site storage |
| Corruption | Verification, checksums |
| Ransomware | Air-gapped copies |
| Retention expiration | Proper retention policy |
3. Checkpoint Protection:
Checkpoints are transient (each one simply marks a point in time), but the checkpoint process itself must be reliable: an interrupted or failed checkpoint must never leave the log in a state that misleads recovery about where redo has to begin.
The transaction log is the most critical resource for recovery. Protect it more carefully than the data files themselves. Data can be reconstructed from backup + logs. Logs cannot be reconstructed from anything. Separate storage, replication, and fast media for logs are essential investments.
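One of the protections listed above, checksums on log records, can be illustrated with a short sketch. The record layout below is invented for the example and uses CRC32; production systems typically use stronger or hardware-assisted checksums.

```python
import json
import zlib

def write_log_record(record: dict) -> bytes:
    """Serialize a log record and prepend its length and a CRC32 checksum."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    crc = zlib.crc32(payload)
    return len(payload).to_bytes(4, "big") + crc.to_bytes(4, "big") + payload

def read_log_record(buf: bytes) -> dict:
    """Verify the checksum before trusting the record; raise on corruption."""
    length = int.from_bytes(buf[0:4], "big")
    stored_crc = int.from_bytes(buf[4:8], "big")
    payload = buf[8:8 + length]
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("log record failed checksum - possible corruption")
    return json.loads(payload.decode("utf-8"))

rec = {"lsn": 42, "txn": "T7", "op": "UPDATE", "page": 913}
raw = write_log_record(rec)
assert read_log_record(raw) == rec                    # clean read succeeds
corrupted = raw[:-1] + bytes([raw[-1] ^ 0xFF])        # flip one payload byte
try:
    read_log_record(corrupted)
except ValueError as e:
    print(e)                                          # corruption detected
```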
Recovery operations themselves must satisfy certain properties to be reliable. These properties ensure that recovery works correctly even under adverse conditions:
Property 1: Idempotence
Recovery operations must be idempotent—applying them multiple times produces the same result as applying them once.
Why this matters: Recovery might be interrupted (another crash during recovery). When recovery restarts, it may re-execute operations it already performed. Idempotence guarantees this doesn't cause problems.
How it's achieved: Log Sequence Numbers (LSNs) track which operations have been applied. Before applying a redo operation to a page, compare the page's LSN to the log record's LSN. If the page LSN is already >= record LSN, the operation was already applied—skip it.
```
FUNCTION ApplyRedoOperation(log_record, page):
    // Idempotent redo: check if already applied
    IF page.lsn >= log_record.lsn:
        // This operation was already applied to this page
        // Skip it - applying again would be incorrect
        LogDebug("Skipping already-applied redo: " + log_record.lsn)
        RETURN

    // Operation not yet applied - do it now
    SWITCH log_record.type:
        CASE UPDATE:
            ApplyUpdate(page, log_record.after_image)
        CASE INSERT:
            InsertRecord(page, log_record.new_record)
        CASE DELETE:
            DeleteRecord(page, log_record.record_id)

    // Update page LSN to reflect this operation
    page.lsn = log_record.lsn
    LogDebug("Applied redo: " + log_record.lsn)
```

Property 2: Determinism
Recovery must be deterministic—given the same log and data files, recovery always produces the same result.
Why this matters: recovery may run at a different time, on different hardware, or repeatedly after nested crashes. If the same log and data files could produce different results, recovery could be neither trusted nor debugged.

How it's achieved: each log record carries complete information about its operation, and records are processed in a fixed order (by LSN), so no recovery decision depends on timing or on in-memory state that was lost in the crash.
Property 3: Consistency Preservation
Recovery must maintain database consistency at all times: internal structures such as pages and indexes must remain structurally valid at every step of redo and undo, not only when recovery finishes.

How it's achieved: physiological logging describes each change as an operation on a single page, so applying or undoing any individual log record leaves the affected page in a valid state.
| Property | Definition | Mechanism | Failure Mode If Violated |
|---|---|---|---|
| Idempotence | Multiple applications = single application | LSN comparison | Double-application corruption |
| Determinism | Same inputs = same outputs | Complete log information | Non-reproducible recovery |
| Consistency | Invariants preserved throughout | Physiological logging | Corrupt internal structures |
| Atomicity | Operations complete fully or not at all | Log-based undo/redo | Partial operation application |
| Durability | Completed recovery survives subsequent failures | Flush after recovery | Need to re-recover |
Recovery may be crashed and restarted multiple times. This is called 'nested recovery' or 'recovery from recovery.' The idempotence property, plus Compensation Log Records (CLRs) for undo operations, ensures that nested recovery works correctly. Each recovery attempt makes progress and doesn't undo previous progress.
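The sketch below illustrates the CLR idea in simplified form, using a Python list as the log and invented field names such as `undoes` and `restore_value`. Real ARIES CLRs also carry an UndoNext pointer; this sketch only shows why a second, restarted undo pass adds no duplicate compensation records.

```python
# Simplified sketch of undo with Compensation Log Records (CLRs).
# Each CLR records which update has already been undone, so a recovery that is
# itself interrupted never undoes the same change twice.

def undo_transaction(log, loser_txn):
    """Append CLRs for every not-yet-undone update of loser_txn.

    `log` is a list of dicts ordered by LSN; this function may be called
    repeatedly (recovery restarted after another crash) and stays correct.
    """
    # Updates already compensated by existing CLRs.
    compensated = {rec["undoes"] for rec in log if rec["type"] == "CLR"}
    # Walk the transaction's updates backwards, skipping compensated ones.
    updates = [rec for rec in log
               if rec["type"] == "UPDATE" and rec["txn"] == loser_txn]
    for rec in reversed(updates):
        if rec["lsn"] in compensated:
            continue  # already undone in a previous recovery attempt
        log.append({
            "lsn": log[-1]["lsn"] + 1,
            "type": "CLR",
            "txn": loser_txn,
            "undoes": rec["lsn"],
            "restore_value": rec["before_image"],
        })

log = [
    {"lsn": 1, "type": "UPDATE", "txn": "T1", "before_image": 100},
    {"lsn": 2, "type": "UPDATE", "txn": "T1", "before_image": 200},
]
undo_transaction(log, "T1")   # first recovery attempt
undo_transaction(log, "T1")   # "recovery from recovery": no duplicate CLRs
assert sum(r["type"] == "CLR" for r in log) == 2
```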
Recovery systems must balance recovery capability against normal operation performance. Stronger recovery guarantees often impose overhead during normal operation.
Normal Operation Overhead:
Recovery mechanisms impose costs during normal operation:
| Mechanism | Overhead Type | Typical Impact | Trade-off |
|---|---|---|---|
| Logging | Write amplification, I/O | 10-30% | More logging = faster recovery but slower operation |
| Checkpointing | Periodic I/O spike, pause | Variable | More frequent = faster recovery but more overhead |
| Log flushing | Commit latency | 1-10ms per commit | Sync flush = durability, async = speed |
| Backup | I/O, sometimes locks | Variable | Online backup = less disruption, more complexity |
| Replication | Network, slight latency | 1-5ms for sync | Sync = zero RPO, async = better performance |
Recovery Operation Performance:
Recovery itself should be as fast as possible:
Factors Affecting Recovery Time:
- Log Volume: more log records mean longer redo and undo phases
- Random I/O: redo must read scattered pages from disk
- Undo Transactions: every transaction active at crash time must be rolled back
- Page Reads: each dirty page must be read before redo can be applied
- Sequential Processing: traditional recovery is single-threaded
```
Recovery Time Estimation Model
==============================

Given:
- Checkpoint interval: C seconds
- Log generation rate: L MB/second
- Log processing rate: R MB/second (typically 100-500 MB/s)
- Average active transactions at crash: T
- Average transaction undo time: U seconds

Estimated Recovery Time:
========================

1. Log to Replay:     Log_size ≈ C × L (MB)
2. Redo Phase:        Redo_time ≈ Log_size / R (seconds)
3. Undo Phase:        Undo_time ≈ T × U (seconds)
4. Total:             Recovery_time ≈ Redo_time + Undo_time + Overhead

Example Calculation:
--------------------
C = 300 seconds (5-minute checkpoint)
L = 10 MB/second log rate
R = 200 MB/second processing rate
T = 50 active transactions
U = 0.5 seconds average undo

Log_size  = 300 × 10   = 3000 MB
Redo_time = 3000 / 200 = 15 seconds
Undo_time = 50 × 0.5   = 25 seconds
Overhead  ≈ 10 seconds (startup, analysis)

Recovery_time ≈ 15 + 25 + 10 = 50 seconds

To achieve RTO = 30 seconds:
- Reduce checkpoint interval to 2 minutes: Redo_time = 6 seconds
- Reduce active transactions to 20: Undo_time = 10 seconds
- New total (with ~5 seconds overhead): 6 + 10 + 5 = 21 seconds ✓
```

Modern database systems (PostgreSQL 15+, MySQL 8.0+, Oracle) support parallel recovery, processing multiple log records simultaneously. This can reduce recovery time by 70-90% compared to single-threaded recovery. Check if your system supports it and configure appropriately for your hardware.
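The estimation model above can also be written as a small Python helper so you can plug in your own figures. The parameter names are purely descriptive and not tied to any particular system.

```python
def estimate_recovery_time(checkpoint_interval_s, log_rate_mb_s,
                           process_rate_mb_s, active_txns,
                           avg_undo_s, overhead_s=10.0):
    """Recovery time estimate from the model above: redo + undo + overhead."""
    log_to_replay_mb = checkpoint_interval_s * log_rate_mb_s
    redo_s = log_to_replay_mb / process_rate_mb_s
    undo_s = active_txns * avg_undo_s
    return redo_s + undo_s + overhead_s

# Baseline example from the model above: ~50 seconds.
print(estimate_recovery_time(300, 10, 200, 50, 0.5))                  # 50.0

# Tighter checkpoints and fewer in-flight transactions for a 30 s RTO:
print(estimate_recovery_time(120, 10, 200, 20, 0.5, overhead_s=5))    # 21.0
```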
Recovery system design involves fundamental trade-offs. Understanding these trade-offs is essential for making informed decisions about database configuration and deployment:
Trade-off 1: Durability vs. Performance

Flushing the log synchronously at every commit (and replicating synchronously) guarantees zero data loss, but adds latency to each commit; asynchronous flushing and replication improve throughput at the risk of losing the most recent transactions after a failure.
Trade-off 2: Recovery Time vs. Operation Overhead
| Configuration | Normal Operation | Recovery Time |
|---|---|---|
| Infrequent checkpoints | Lower overhead | Longer recovery |
| Frequent checkpoints | Higher overhead | Shorter recovery |
| Large buffer pool | Better cache hit rate | More to redo |
| Small buffer pool | More I/O during operation | Less to redo |
| Aggressive page flushing | More disk writes | Less redo work |
| Lazy page flushing | Fewer disk writes | More redo work |
Trade-off 3: Space vs. Recoverability
| Configuration | Space Usage | Recovery Capability |
|---|---|---|
| Minimal logging | Less log space | Limited recovery options |
| Full logging | More log space | Complete recovery |
| Short log retention | Less archive storage | Limited PITR window |
| Long log retention | More archive storage | Extended PITR window |
| Infrequent backups | Less backup storage | Longer restore time |
| Frequent backups | More backup storage | Faster restore |
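A quick sizing sketch for the space side of this trade-off follows; every figure is an illustrative assumption.

```python
# Illustrative sizing of recovery storage for a chosen PITR window and
# backup retention. Figures are assumptions, not measurements.

daily_log_volume_gb = 40.0        # archived log generated per day
pitr_window_days = 14             # how far back point-in-time recovery must reach
weekly_full_backup_gb = 500.0
backups_retained = 4

archive_space_gb = daily_log_volume_gb * pitr_window_days
backup_space_gb = weekly_full_backup_gb * backups_retained

print(f"Archived logs for {pitr_window_days}-day PITR window: {archive_space_gb:.0f} GB")
print(f"Retained full backups: {backup_space_gb:.0f} GB")
print(f"Total recovery storage: {archive_space_gb + backup_space_gb:.0f} GB")
```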
Trade-off 4: Simplicity vs. Flexibility
Simpler recovery mechanisms are easier to implement correctly and faster to execute, but less flexible:
| Approach | Simplicity | Buffer-Management Constraint |
|---|---|---|
| NO-UNDO/REDO | Simpler | Dirty pages of uncommitted transactions must stay in the buffer pool until commit (no-steal) |
| UNDO/NO-REDO | Simpler | All of a transaction's dirty pages must be forced to disk at commit (force) |
| UNDO/REDO (ARIES) | Complex | No constraint (steal + no-force): maximum flexibility, best performance |
Every recovery configuration is a trade-off. Stronger guarantees cost more in performance or resources. The art is matching the configuration to your actual requirements—not over-provisioning (wasting resources) or under-provisioning (risking unacceptable outcomes).
A recovery system that has never been tested is not trustworthy. Verification and testing are essential requirements for any production recovery system:
Continuous Verification: recovery resources should be checked routinely and automatically—backups restored on a schedule to confirm they are usable, archived logs monitored for gaps, checksums validated—so that problems surface long before a real failure does.
Disaster Recovery Testing:
Full disaster recovery tests should be conducted periodically:
1. Tabletop Exercises: the team walks through the recovery runbook on paper, step by step, without touching any systems.
2. Technical Dry Runs: real backups are restored into an isolated test environment and each step is timed.
3. Full Failover Tests: production traffic is actually switched to the standby (and back) under controlled conditions.
4. Chaos Engineering: failures are deliberately injected—killed processes, lost disks, network partitions—to confirm the system recovers as designed.
```
Recovery Test Checklist
=======================

Pre-Test:
□ Notify stakeholders of test window
□ Ensure test environment is ready
□ Have rollback plan if test fails
□ Document current backup/archive state

Backup Verification:
□ Select backup to restore (recent + older)
□ Restore to test environment
□ Verify database opens successfully
□ Run integrity checks (DBCC, pg_catalog checks)
□ Verify row counts against known values
□ Test application connectivity
□ Document restore time (actual vs RTO)

PITR Test:
□ Restore to specific point in time
□ Verify data as of that timestamp
□ Confirm changes after that time are absent
□ Document process and timing

Failover Test:
□ Verify standby is synchronized
□ Initiate failover
□ Confirm new primary accepts connections
□ Verify data integrity
□ Run application smoke tests
□ Document failover time
□ Perform failback when ready

Post-Test:
□ Document all findings
□ Update procedures based on learnings
□ Track metrics (recovery times, issues found)
□ Schedule next test
```

Testing recovery at 3 AM on Sunday with a tiny database is not the same as recovering at noon on Monday with production load. Test with realistic data sizes, under realistic conditions, and include realistic complications (simultaneous network problems, missing documentation, key personnel unavailable).
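Parts of this checklist lend themselves to automation. The sketch below times a restore, runs verification steps, and compares the result to the RTO target. The restore and check commands are deliberately left as placeholder scripts (`/path/to/...`) because the real invocations depend entirely on your platform; they are assumptions, not real command lines.

```python
import subprocess
import time

# Minimal sketch of automating the "Backup Verification" steps above.
# RESTORE_CMD and CHECK_CMDS are placeholders for whatever your platform uses
# (a vendor restore utility plus integrity and row-count checks).

RESTORE_CMD = ["/path/to/restore-latest-backup.sh"]      # placeholder script
CHECK_CMDS = [
    ["/path/to/run-integrity-checks.sh"],                # placeholder script
    ["/path/to/verify-row-counts.sh"],                   # placeholder script
]
RTO_SECONDS = 3600

def run(cmd):
    """Run one step and fail loudly if it does not succeed."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{' '.join(cmd)} failed:\n{result.stderr}")

start = time.monotonic()
run(RESTORE_CMD)                       # restore to the test environment
restore_seconds = time.monotonic() - start

for cmd in CHECK_CMDS:                 # integrity checks, row counts, smoke tests
    run(cmd)

print(f"Restore took {restore_seconds:.0f}s (RTO target {RTO_SECONDS}s)")
if restore_seconds > RTO_SECONDS:
    print("WARNING: measured restore time exceeds the stated RTO")
```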
Let's consolidate the key concepts covered in this page: the two cardinal guarantees (committed work is durable, uncommitted work leaves no trace), the measurable objectives RPO and RTO, the resources recovery depends on and how to protect them, the properties recovery operations must satisfy, the trade-offs between recovery capability and normal-operation performance, and the need to verify recovery through testing.

Module Complete:

This completes our exploration of failure types and recovery requirements. You now understand the kinds of failures databases face, what correct recovery must guarantee, and what it costs to achieve those guarantees.
In the next module, we'll move from concepts to mechanisms, exploring Recovery Concepts in detail—how durability is actually achieved, how logs are structured, and how the recovery manager orchestrates the recovery process.
Congratulations! You've completed Module 1: Failure Types. You now have a comprehensive understanding of database failures—their types, classification, and the requirements for recovering from them. This foundation prepares you for the detailed study of recovery mechanisms in the following modules.