The call comes at 2:47 PM. A critical batch process has corrupted thousands of records in the production database. Customers are seeing incorrect data. The application team has isolated the issue to a deployment that went live at 2:31 PM. Your job: restore the database to 2:30 PM, before the corruption began.
This is where all your PITR preparation either pays dividends or reveals gaps. The next 30 minutes will determine whether you're back online with minimal data loss or facing hours of downtime with potentially devastating business impact.
The recovery process is not just technical execution—it's a structured methodology that minimizes mistakes during high-pressure situations.
This page walks through the complete PITR recovery workflow: the decision-making process, the technical execution steps, and the validation procedures that confirm success. By the end, you'll have a mental checklist for navigating the most stressful scenarios in database operations.
PITR procedures should be practiced regularly in non-production environments. The first time you perform a recovery should never be during an actual emergency. Regular drills ensure the team is prepared and the documentation is accurate.
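A drill can be scripted end to end. The sketch below is one illustrative way to do it, assuming a scratch host with a recent base backup in /backups/basebackup_latest, archived WAL in /archive/wal, and the postgres OS user (all paths, the port, and the ten-minute target are assumptions to adapt):

#!/bin/bash
# PITR drill: recover a throwaway copy of the database to a recent point in time
# and report how long the exercise took. Run as the postgres user.
set -euo pipefail

DRILL_DIR="/data/pitr_drill"
BACKUP_DIR="/backups/basebackup_latest"
ARCHIVE_DIR="/archive/wal"
TARGET=$(date -u -d '10 minutes ago' '+%Y-%m-%d %H:%M:%S+00')   # requires archived WAL up to this point
START=$(date +%s)

rm -rf "$DRILL_DIR" && mkdir -p "$DRILL_DIR" && chmod 700 "$DRILL_DIR"
cp -a "$BACKUP_DIR"/. "$DRILL_DIR"/

# Configure recovery to the drill target (PostgreSQL 12+), on a side port
touch "$DRILL_DIR/recovery.signal"
cat >> "$DRILL_DIR/postgresql.conf" << EOF
restore_command = 'cp $ARCHIVE_DIR/%f %p'
recovery_target_time = '$TARGET'
recovery_target_action = 'promote'
port = 5499
EOF

pg_ctl start -D "$DRILL_DIR" -l "$DRILL_DIR/drill.log" -w -t 600
psql -p 5499 -c "SELECT now(), pg_is_in_recovery();"   # sanity check on the drill copy
echo "Drill recovery completed in $(( $(date +%s) - START )) seconds"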
Before executing any recovery, several critical decisions must be made. Rushing into recovery without proper analysis can result in losing more data than the original incident.
PITR is destructive—it replaces the current database state with a historical state. Alternative approaches may be less disruptive:
Consider alternatives before PITR:
- **Targeted Data Correction**: If specific records are corrupted, surgical UPDATE/DELETE statements may fix the issue without full recovery (a sketch follows this list)
- **Logical Restore from Backup**: Restoring specific tables from a logical backup (pg_dump) may be faster and less risky
- **Flashback Queries**: If the database supports temporal queries, historical data can be accessed without recovery
- **Application-Level Rollback**: Some applications maintain their own audit trails that can reverse specific transactions
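To make the first alternative concrete, here is a minimal sketch of a targeted correction run as a dry run. The table name 'orders' and column 'batch_id' are hypothetical; identify the bad rows for your own schema first:

# Dry run of a surgical correction: count and delete the bad batch inside a
# transaction, then ROLL BACK so nothing changes until the scope is confirmed.
psql production_db << 'SQL'
BEGIN;
SELECT count(*) AS rows_to_remove  FROM orders WHERE batch_id = 'batch_20240115_1431';
DELETE FROM orders WHERE batch_id = 'batch_20240115_1431';
SELECT count(*) AS rows_remaining  FROM orders WHERE batch_id = 'batch_20240115_1431';
ROLLBACK;  -- change to COMMIT once the counts look right
SQL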
The decision matrix below summarizes when PITR is the appropriate tool and when an alternative is preferable:
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Single table, known rows affected | Targeted correction or table restore | Faster, preserves other data |
| Single table, unknown scope | Table restore from backup | Isolates impact, simpler verification |
| Multiple tables, referential integrity | PITR | Ensures consistent cross-table state |
| Unknown scope, possible ongoing damage | PITR | Clean slate from known-good state |
| Malicious activity suspected | PITR | Eliminates any backdoors or hidden damage |
| Partial outage acceptable | Targeted restoration | Minimizes downtime for unaffected functions |
Determining the correct recovery target requires investigation. Getting this wrong means either recovering past the problem (corruption returns) or recovering too far back (unnecessary data loss).
Target Investigation Process:

1. **Establish Timeline**: When was the problem first noticed? When did symptoms actually begin (this may differ)?
2. **Analyze Logs**: Application logs, database logs, and audit tables provide transaction-level detail
3. **Identify Causal Transaction**: What specific transaction or process caused the damage?
4. **Determine Safe Target**: The target should be definitively before the causal event
5. **Add Safety Margin**: A few seconds before the identified bad transaction provides margin for clock skew
-- Example: Finding when a bad batch started
SELECT min(created_at) AS earliest_bad_record
FROM corrupted_table
WHERE is_corrupted = true;
-- Check for the offending transaction while its session is still active
-- (pg_stat_activity shows only current activity; for historical detail,
-- consult the server logs or audit tables)
-- PostgreSQL example:
SELECT xact_start, query
FROM pg_stat_activity
WHERE xact_start >= '2024-01-15 14:30:00'
AND xact_start <= '2024-01-15 14:32:00';
When in doubt about exact timing, recover to 5 minutes before the suspected incident time. This margin accounts for clock skew, transaction duration, and timing uncertainty. It's easier to manually replay 5 minutes of transactions than to repeat a recovery because you went past the problem.
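Applying the margin is simple arithmetic; for example (the incident time below is illustrative):

# Subtract a 5-minute safety margin from the suspected incident time
psql -c "SELECT timestamptz '2024-01-15 14:36:00+00' - interval '5 minutes' AS recovery_target;"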
PITR can be performed in several ways depending on infrastructure and requirements:

- **In-Place Recovery**: Stop the existing server and replace its data directory with the restored backup. Simplest, but the database is unavailable for the entire recovery.
- **Parallel Recovery**: Restore and recover on a separate server or data directory while the original stays untouched, allowing verification before cutover (sketched after this list).
- **Hybrid Approach**: Recover in parallel, verify, then either promote the copy or use it as a source for targeted data repairs on the original.
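A minimal sketch of the parallel approach, assuming enough disk space for a second data directory and the same backup and archive paths used elsewhere on this page (directory, port, and target are illustrative):

# Parallel recovery: recover into a second data directory on another port,
# verify the result, then cut applications over (or copy repaired data back).
RECOVERY_DIR="/data/postgres_recovery"

mkdir -p "$RECOVERY_DIR"
cp -a /backups/basebackup_latest/. "$RECOVERY_DIR"/
chown -R postgres:postgres "$RECOVERY_DIR" && chmod 700 "$RECOVERY_DIR"

touch "$RECOVERY_DIR/recovery.signal"
cat >> "$RECOVERY_DIR/postgresql.conf" << EOF
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_action = 'pause'
port = 5433
EOF

pg_ctl start -D "$RECOVERY_DIR" -l "$RECOVERY_DIR/recovery.log"
# Verify on port 5433 while production (5432) keeps running
psql -p 5433 -c "SELECT pg_last_xact_replay_timestamp();"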
Before beginning the actual recovery, several preparation steps protect against making the situation worse.
The current database state—even if corrupted—may contain valuable information needed later. Always preserve it before recovery.
# Take a final backup of the corrupted state
pg_dump -Fc -f /emergency/pre_recovery_$(date +%Y%m%d_%H%M%S).dump production_db
# If using filesystem backup
pg_basebackup -D /emergency/pre_recovery_basebackup -Fp -Xs -P
# Document current WAL position
psql -c "SELECT pg_current_wal_lsn(), current_timestamp;"
Why preserve the corrupted state?

- Transactions committed after the recovery target can later be extracted from it and reconciled (see the lost-transaction section below)
- It supports forensic analysis of what actually went wrong, especially important if malicious activity is suspected
- It provides evidence for audits and the post-incident review
- If the recovery itself goes wrong, you have not destroyed the only remaining copy of the data
Confirm all required components are available before starting:
#!/bin/bash
# Pre-recovery verification script
echo "=== PITR Pre-Recovery Verification ==="
# 1. Verify base backup exists and is accessible
BACKUP_DIR="/backups/basebackup_latest"
if [ -d "$BACKUP_DIR" ]; then
echo "✓ Base backup found: $BACKUP_DIR"
echo " Size: $(du -sh $BACKUP_DIR | cut -f1)"
else
echo "✗ ERROR: Base backup not found!"
exit 1
fi
# 2. Get backup start LSN (field 4 of "START WAL LOCATION: <lsn> (file <segment>)")
BACKUP_LSN=$(grep "START WAL LOCATION" "$BACKUP_DIR/backup_label" | awk '{print $4}')
echo "  Backup start LSN: $BACKUP_LSN"
# 3. Verify WAL archives from backup LSN to target
TARGET_TIME="2024-01-15 14:30:00"
echo -e "\nVerifying WAL archive continuity to target: $TARGET_TIME"
# Check archive sequence (simplified - a complete check would confirm every
# segment from the starting segment up to the target exists in the archive)
WAL_START=$(grep "START WAL LOCATION" "$BACKUP_DIR/backup_label" | awk '{print $6}' | tr -d ')')
echo "  Starting from segment: $WAL_START"
# 4. Check disk space for recovery
AVAILABLE_GB=$(df -BG /data | tail -1 | awk '{print $4}' | tr -d 'G')
REQUIRED_GB=100 # Estimate based on backup size
if [ $AVAILABLE_GB -gt $REQUIRED_GB ]; then
echo "✓ Sufficient disk space: ${AVAILABLE_GB}GB available"
else
echo "✗ WARNING: Low disk space: ${AVAILABLE_GB}GB available"
fi
# 5. Verify database is stopped (or will be)
if pg_isready -q; then
echo "⚠ Database is currently running - will need to stop"
else
echo "✓ Database is stopped"
fi
echo "\n=== Pre-Recovery Verification Complete ==="
Preventing new transactions during recovery preparation is critical:
-- PostgreSQL: Terminate connections and prevent new ones
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'production' AND pid <> pg_backend_pid();
-- Revoke connect privileges temporarily
REVOKE CONNECT ON DATABASE production FROM PUBLIC;
REVOKE CONNECT ON DATABASE production FROM app_user;
# Network-level: Block database port
iptables -A INPUT -p tcp --dport 5432 -j DROP
# Or pause traffic at the connection pooler (PgBouncer admin console;
# admin user and port depend on your pgbouncer.ini)
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "PAUSE;"
With preparation complete, we execute the recovery procedure. This section provides step-by-step instructions for PITR execution.
# Graceful shutdown with timeout
pg_ctl stop -D /data/postgres -m fast -t 60
# Verify shutdown
if pg_isready; then
echo "ERROR: Database still running!"
pg_ctl stop -D /data/postgres -m immediate
fi
echo "Database stopped at $(date)"
# Option A: Rename current data directory (preserves for forensics)
mv /data/postgres /data/postgres_pre_recovery_$(date +%Y%m%d_%H%M%S)
# Option B: Clear data directory (faster, if backup already taken)
rm -rf /data/postgres/*
# For a tar-format base backup (pg_basebackup -Ft), extract it into the data directory
tar -xf /backups/base_20240115_000000.tar -C /data/postgres
# Or for plain file copy backups
cp -a /backups/base_20240115_000000/* /data/postgres/
# Ensure correct ownership
chown -R postgres:postgres /data/postgres
echo "Base backup restored at $(date)"
Create the recovery configuration specifying the target time:
PostgreSQL 12+ (recovery.signal method):
# Create recovery signal file
touch /data/postgres/recovery.signal
# Add recovery configuration to postgresql.conf
cat >> /data/postgres/postgresql.conf << EOF
# --- PITR Recovery Configuration ---
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_action = 'pause'
recovery_target_inclusive = false
EOF
PostgreSQL 11 and earlier (recovery.conf method):
cat > /data/postgres/recovery.conf << EOF
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_action = 'pause'   # pre-9.5 releases used pause_at_recovery_target = true
EOF
Key Recovery Parameters:
| Parameter | Purpose | Recommended Value |
|---|---|---|
| `restore_command` | How to retrieve archived WAL | Path to archive + decompression |
| `recovery_target_time` | Target timestamp with timezone | ISO format with explicit timezone |
| `recovery_target_action` | What to do when target reached | `pause` for verification |
| `recovery_target_inclusive` | Include transaction at target? | `false` (safer) |
#!/bin/bash
# Complete PITR Recovery Configuration Script

# Variables (customize for your environment)
DATA_DIR="/data/postgres"
ARCHIVE_DIR="/archive/wal"
RECOVERY_TARGET="2024-01-15 14:30:00 America/New_York"
PG_VERSION=$(pg_config --version | grep -oP '\d+' | head -1)

echo "Configuring PITR recovery for PostgreSQL $PG_VERSION"
echo "Recovery target: $RECOVERY_TARGET"

# Determine restore command based on archive format
if ls $ARCHIVE_DIR/*.gz 1> /dev/null 2>&1; then
    RESTORE_CMD="gunzip -c $ARCHIVE_DIR/%f.gz > %p"
elif ls $ARCHIVE_DIR/*.zst 1> /dev/null 2>&1; then
    RESTORE_CMD="zstd -d -c $ARCHIVE_DIR/%f.zst > %p"
else
    RESTORE_CMD="cp $ARCHIVE_DIR/%f %p"
fi

echo "Restore command: $RESTORE_CMD"

# Create recovery configuration
if [ "$PG_VERSION" -ge 12 ]; then
    # PostgreSQL 12+ method
    touch "$DATA_DIR/recovery.signal"
    cat >> "$DATA_DIR/postgresql.conf" << EOF

# === POINT-IN-TIME RECOVERY CONFIGURATION ===
# Added: $(date)
# Target: $RECOVERY_TARGET
restore_command = '$RESTORE_CMD'
recovery_target_time = '$RECOVERY_TARGET'
recovery_target_action = 'pause'
recovery_target_inclusive = false

# Recovery performance tuning
max_wal_senders = 0
hot_standby = on
# === END PITR CONFIGURATION ===
EOF
else
    # PostgreSQL 11 and earlier
    cat > "$DATA_DIR/recovery.conf" << EOF
# Point-in-Time Recovery Configuration
# Created: $(date)
restore_command = '$RESTORE_CMD'
recovery_target_time = '$RECOVERY_TARGET'
recovery_target_action = 'pause'
EOF
fi

echo "Recovery configuration complete"
echo "Start the database to begin recovery"

Start the database to begin the recovery process:
# Start PostgreSQL in recovery mode
pg_ctl start -D /data/postgres -l /var/log/postgres/recovery.log
# Monitor recovery progress
tail -f /var/log/postgres/recovery.log
# Check recovery status
psql -c "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
Recovery Progress Indicators:
-- Monitor WAL replay progress
SELECT
CASE WHEN pg_is_in_recovery() THEN 'Recovering' ELSE 'Complete' END AS status,
pg_last_xact_replay_timestamp() AS replayed_to,
current_timestamp - pg_last_xact_replay_timestamp() AS lag
;
-- Check what's being restored (from logs)
-- Look for: "restored log file ... from archive"
The database will:

1. Start up and detect the recovery configuration (recovery.signal / recovery.conf)
2. Repeatedly call restore_command to fetch archived WAL segments
3. Replay WAL records until the recovery target time is reached
4. Pause and wait for verification (because recovery_target_action = 'pause')

When recovery reaches the target and pauses, verification is critical before finalizing. This is your last opportunity to adjust the target or abort.
-- Verify we're at the expected point
SELECT
pg_is_in_recovery() AS in_recovery,
pg_last_xact_replay_timestamp() AS recovered_to,
pg_is_wal_replay_paused() AS paused;
-- Expected output:
-- in_recovery | recovered_to | paused
-- t | 2024-01-15 14:29:58.123456+00 | t
-- Verify critical tables have expected state
SELECT count(*), max(updated_at) FROM critical_table;
-- Check for presence of known-good records
SELECT * FROM orders WHERE order_id = 'known_good_order_before_incident';
-- Confirm absence of corrupted records
SELECT count(*) FROM corrupted_table WHERE is_corrupted = true;
-- Expected: 0 (no corrupted records yet)
# Run application health checks
./app_health_check.sh --database-only
# Verify key business metrics
psql -f /scripts/verify_business_state.sql
Once you finalize recovery (by promoting the database out of recovery mode), you cannot easily go back. Take time here to verify thoroughly. If something looks wrong, you can still adjust the recovery target and try again.
If verification reveals the target was wrong:
Target Too Late (corruption present):
# Stop the database
pg_ctl stop -D /data/postgres -m fast

# Recovery cannot run backwards - restart the full recovery from the base backup
rm -rf /data/postgres/*
cp -a /backups/base_20240115/* /data/postgres/

# Re-apply the recovery configuration with an earlier target
cat >> /data/postgres/postgresql.conf << EOF
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:25:00+00'
recovery_target_action = 'pause'
recovery_target_inclusive = false
EOF
touch /data/postgres/recovery.signal

pg_ctl start -D /data/postgres
Target Too Early (missing needed data):
# recovery_target_time cannot be changed while the server is running, but WAL
# replay only moves forward, so the target can be moved later without re-restoring
# the base backup: stop the server, edit the target, and start it again -
# recovery resumes and replays up to the new target.
pg_ctl stop -D /data/postgres -m fast
sed -i "s/recovery_target_time = .*/recovery_target_time = '2024-01-15 14:35:00+00'/" \
    /data/postgres/postgresql.conf
pg_ctl start -D /data/postgres
Before finalizing, get explicit confirmation:
-- Document the recovered state
SELECT
'PITR Recovery Verification' AS report_type,
current_timestamp AS verified_at,
pg_last_xact_replay_timestamp() AS recovered_to,
(SELECT count(*) FROM pg_stat_user_tables) AS tables_present,
pg_database_size(current_database()) AS database_size;
Once verification is complete and stakeholders have confirmed the recovered state is correct, finalize the recovery to return to normal operation.
-- For paused recovery, promote to primary
SELECT pg_wal_replay_resume(); -- If paused
-- Wait for promotion to complete
-- Verify no longer in recovery
SELECT pg_is_in_recovery(); -- Should return 'f'
Alternatively, using command line:
# PostgreSQL 12+
pg_ctl promote -D /data/postgres
# Or create trigger file (older method)
touch /data/postgres/promote
1. Verify Normal Operation
-- Confirm database accepts writes
CREATE TABLE recovery_test (id serial, ts timestamp default now());
INSERT INTO recovery_test DEFAULT VALUES;
SELECT * FROM recovery_test;
DROP TABLE recovery_test;
-- Check for any errors in logs
-- tail -100 /var/log/postgres/postgresql.log
2. Re-enable Archiving
-- Verify archive_mode is still enabled
SHOW archive_mode;
SHOW archive_command;
-- Force a WAL switch to verify archiving works
SELECT pg_switch_wal();
-- Check archive status
SELECT * FROM pg_stat_archiver;
3. Re-establish Replication (if applicable)
# Rebuild standby servers from new timeline
pg_basebackup -h recovered-primary -D /data/standby -Xs -P
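A standby that has replayed WAL past the recovery point cannot follow the new timeline and must be rebuilt (or, in some cases, resynchronized with pg_rewind). One straightforward rebuild is pg_basebackup with -R, which writes standby.signal and the primary connection settings automatically; the host, replication user, and service name below are assumptions:

# On the standby: rebuild from the promoted primary on the new timeline
systemctl stop postgresql
rm -rf /data/standby/*
pg_basebackup -h recovered-primary -U replicator -D /data/standby -Xs -P -R
chown -R postgres:postgres /data/standby
systemctl start postgresql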
When recovery completes and the database is promoted, a new timeline is created. This is crucial for understanding the database's history:
Timeline 1 (original)
-------------------------->
                           \
                            \  Timeline 2 (after PITR)
                             \------------------------>
                             ▲
                        Recovery Point
The new timeline:

- Begins at the recovery point where the old timeline was abandoned
- Receives a new timeline ID (the previous ID incremented), which appears in the names of all subsequent WAL segments
- Is documented by a .history file recording the branch point

# Check current timeline
psql -c "SELECT timeline_id FROM pg_control_checkpoint();"
# View timeline history
cat /data/postgres/pg_wal/00000002.history
# Example content:
# 1 0/50000A0 before 2024-01-15 14:30:00+00
Once a new timeline is created, the database's history has permanently diverged. The original timeline still exists in archives but will not be continued. All subsequent operations occur on the new timeline.
Recovery completion is not the end of the process. Several post-recovery tasks ensure the database is properly stabilized and protected.
1. Take a New Base Backup
After PITR, immediately create a new base backup. This:

- Establishes a fresh recovery baseline on the new timeline
- Keeps future restores from depending on the old backup plus a timeline switch across the recovery point
- Shortens the WAL replay needed for any subsequent recovery
# Immediate post-recovery backup
pg_basebackup -D /backups/post_recovery_$(date +%Y%m%d_%H%M%S) \
-Fp -Xs -P -c fast
echo "Post-recovery base backup completed at $(date)"
2. Re-establish Monitoring
# Restart monitoring agents
systemctl restart prometheus-postgres-exporter
systemctl restart pgbadger-collector
# Notify the alerting system that recovery is complete
# (the Alertmanager API expects a JSON array of alerts)
curl -X POST http://alertmanager:9093/api/v1/alerts \
  -d '[{"labels":{"alertname":"PITRRecoveryComplete"}}]'
3. Reconnect Applications
-- Restore connection privileges
GRANT CONNECT ON DATABASE production TO app_user;
GRANT CONNECT ON DATABASE production TO PUBLIC;
-- Verify connections are working
SELECT * FROM pg_stat_activity WHERE datname = 'production';
PITR recovers to a point in time, but transactions after that point are lost. Some of these may be legitimate transactions that need recovery.
Transaction Recovery Options:
- **Replay from Application Logs**: If the application records every write (or publishes events to a queue), post-target transactions can be re-submitted through normal application paths
- **Extract from Corrupted Backup**: The pre-recovery backup taken earlier still contains the lost transactions; valid ones can be extracted and re-applied (see the query below)
- **User Notification**: Where transactions cannot be recovered automatically, affected users are identified and asked to re-enter their work
-- Identify transactions from corrupted backup that might need recovery
-- (Run against the preserved corrupted backup)
SELECT *
FROM corrupted_backup.transactions
WHERE created_at > '2024-01-15 14:30:00'
AND created_at < '2024-01-15 14:45:00'
AND is_valid = true -- Application-specific validation
ORDER BY created_at;
Reconciliation Checklist:

- Establish the exact window of lost transactions (recovery target → time the database was fenced)
- Extract candidate transactions from the preserved pre-recovery backup
- Validate each candidate and exclude anything produced by the faulty process
- Re-apply valid transactions through the application or controlled scripts
- Notify affected users and document what was and was not recovered

A quick comparison between the recovered database and the preserved backup, as sketched below, helps bound the reconciliation work.
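For example (the dump file name, database names, and table are illustrative):

# Restore the preserved pre-recovery dump into a scratch database for comparison
createdb reconciliation_scratch
pg_restore -d reconciliation_scratch /emergency/pre_recovery_20240115_144500.dump

# Compare row counts and latest timestamps for a key table in both databases
psql -d reconciliation_scratch -c \
  "SELECT count(*), max(created_at) FROM orders WHERE created_at > '2024-01-15 14:30:00';"
psql -d production -c \
  "SELECT count(*), max(created_at) FROM orders WHERE created_at > '2024-01-15 14:30:00';"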
While PITR requires human judgment for target selection and verification, the mechanical steps benefit from automation. Well-designed runbooks and scripts reduce errors and recovery time.
A production-ready PITR runbook should include:
1. Decision Tree: Is PITR the right tool, or will a targeted fix or table-level restore do?
2. Pre-Checks: Backup, WAL-archive, and disk-space verification before anything is touched
3. Execution Scripts: Parameterized, tested commands for each mechanical step
4. Verification Procedures: Queries and health checks to run while recovery is paused
5. Rollback Procedures: What to do if the target is wrong or the recovery fails
#!/usr/bin/env python3
"""
Point-in-Time Recovery Automation Framework
Provides guided, semi-automated PITR with human checkpoints
"""

import subprocess
import sys
from datetime import datetime
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger('pitr_recovery')


class RecoveryPhase(Enum):
    PREPARATION = "preparation"
    BACKUP_CURRENT = "backup_current"
    RESTORE_BASE = "restore_base"
    CONFIGURE_RECOVERY = "configure_recovery"
    EXECUTE_RECOVERY = "execute_recovery"
    VERIFY = "verify"
    FINALIZE = "finalize"
    POST_RECOVERY = "post_recovery"


@dataclass
class RecoveryConfig:
    target_time: str
    base_backup_path: Path
    archive_path: Path
    data_dir: Path
    emergency_backup_dir: Path


class PITRRecoveryManager:
    """Manages the complete PITR recovery workflow"""

    def __init__(self, config: RecoveryConfig):
        self.config = config
        self.current_phase = RecoveryPhase.PREPARATION
        self.checkpoints = []

    def log_checkpoint(self, message: str):
        """Record a recovery checkpoint"""
        checkpoint = {
            'timestamp': datetime.now().isoformat(),
            'phase': self.current_phase.value,
            'message': message
        }
        self.checkpoints.append(checkpoint)
        logger.info(f"CHECKPOINT: {message}")

    def require_confirmation(self, prompt: str) -> bool:
        """Require human confirmation before proceeding"""
        print(f"\n{'='*60}")
        print("CONFIRMATION REQUIRED")
        print(f"{'='*60}")
        print(f"\n{prompt}\n")
        response = input("Type 'PROCEED' to continue, anything else to abort: ")
        return response.strip().upper() == "PROCEED"

    def phase_preparation(self) -> bool:
        """Pre-recovery verification phase"""
        self.current_phase = RecoveryPhase.PREPARATION
        logger.info("Starting preparation phase")

        # Verify base backup exists
        if not self.config.base_backup_path.exists():
            logger.error(f"Base backup not found: {self.config.base_backup_path}")
            return False
        self.log_checkpoint(f"Base backup verified: {self.config.base_backup_path}")

        # Verify archive path
        if not self.config.archive_path.exists():
            logger.error(f"Archive path not found: {self.config.archive_path}")
            return False
        self.log_checkpoint("Archive path verified")

        # Check disk space
        # ... (disk space verification logic)

        return self.require_confirmation(
            f"Preparation complete. Ready to proceed with PITR to: "
            f"{self.config.target_time}"
        )

    def phase_backup_current(self) -> bool:
        """Backup current corrupted state"""
        self.current_phase = RecoveryPhase.BACKUP_CURRENT
        logger.info("Backing up current database state")

        backup_name = f"pre_pitr_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        backup_path = self.config.emergency_backup_dir / backup_name
        backup_path.mkdir(parents=True, exist_ok=True)  # ensure the target directory exists

        result = subprocess.run([
            'pg_dump', '-Fc',
            '-f', str(backup_path / 'dump.custom'),
            'production'
        ], capture_output=True, text=True)

        if result.returncode != 0:
            logger.error(f"Backup failed: {result.stderr}")
            return False

        self.log_checkpoint(f"Current state backed up to: {backup_path}")
        return True

    def execute_recovery(self) -> bool:
        """Execute the complete recovery workflow"""
        phases = [
            self.phase_preparation,
            self.phase_backup_current,
            # ... additional phases
        ]

        for phase_func in phases:
            if not phase_func():
                logger.error(f"Recovery aborted at phase: {self.current_phase}")
                return False

        logger.info("Recovery completed successfully")
        return True


if __name__ == "__main__":
    config = RecoveryConfig(
        target_time="2024-01-15 14:30:00",
        base_backup_path=Path("/backups/base_latest"),
        archive_path=Path("/archive/wal"),
        data_dir=Path("/data/postgres"),
        emergency_backup_dir=Path("/emergency")
    )

    manager = PITRRecoveryManager(config)
    success = manager.execute_recovery()
    sys.exit(0 if success else 1)

We've walked through the complete PITR recovery process from incident detection to post-recovery operations. The key points: choose PITR deliberately after weighing the alternatives, investigate the target time carefully, preserve the corrupted state before touching anything, verify thoroughly while recovery is paused, and stabilize the new timeline with a fresh base backup.
What's next:
Now that we understand how to execute PITR, we'll examine Target Time Specification in depth. The next page explores the nuances of specifying recovery targets, including timezone handling, transaction boundary considerations, and techniques for identifying the optimal recovery point during incident investigation.
You now understand the complete PITR recovery process. You've learned the decision framework for choosing PITR, pre-recovery preparation, step-by-step execution, verification procedures, and post-recovery operations. Next, we'll dive deeper into the art and science of specifying the optimal recovery target.