The call comes at 2:47 PM. A critical batch process has corrupted thousands of records in the production database. Customers are seeing incorrect data. The application team has isolated the issue to a deployment that went live at 2:31 PM. Your job: restore the database to 2:30 PM, before the corruption began.
This is where all your PITR preparation either pays dividends or reveals gaps. The next 30 minutes will determine whether you're back online with minimal data loss or facing hours of downtime with potentially devastating business impact.
The recovery process is not just technical execution—it's a structured methodology that minimizes mistakes during high-pressure situations.
This page walks through the complete PITR recovery workflow: the decision-making process, the technical execution steps, and the validation procedures that confirm success. By the end, you'll have a mental checklist for navigating the most stressful scenarios in database operations.
PITR procedures should be practiced regularly in non-production environments. The first time you perform a recovery should never be during an actual emergency. Regular drills ensure the team is prepared and the documentation is accurate.
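A drill can be scripted end to end. The sketch below is one illustrative way to do it, assuming a scratch host with a recent base backup in /backups/basebackup_latest, archived WAL in /archive/wal, and the postgres OS user (all paths, the port, and the ten-minute target are assumptions to adapt):

#!/bin/bash
# PITR drill: recover a throwaway copy of the database to a recent point in time
# and report how long the exercise took. Run as the postgres user.
set -euo pipefail

DRILL_DIR="/data/pitr_drill"
BACKUP_DIR="/backups/basebackup_latest"
ARCHIVE_DIR="/archive/wal"
TARGET=$(date -u -d '10 minutes ago' '+%Y-%m-%d %H:%M:%S+00')   # requires archived WAL up to this point
START=$(date +%s)

rm -rf "$DRILL_DIR" && mkdir -p "$DRILL_DIR" && chmod 700 "$DRILL_DIR"
cp -a "$BACKUP_DIR"/. "$DRILL_DIR"/

# Configure recovery to the drill target (PostgreSQL 12+), on a side port
touch "$DRILL_DIR/recovery.signal"
cat >> "$DRILL_DIR/postgresql.conf" << EOF
restore_command = 'cp $ARCHIVE_DIR/%f %p'
recovery_target_time = '$TARGET'
recovery_target_action = 'promote'
port = 5499
EOF

pg_ctl start -D "$DRILL_DIR" -l "$DRILL_DIR/drill.log" -w -t 600
psql -p 5499 -c "SELECT now(), pg_is_in_recovery();"   # sanity check on the drill copy
echo "Drill recovery completed in $(( $(date +%s) - START )) seconds"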
Before executing any recovery, several critical decisions must be made. Rushing into recovery without proper analysis can result in losing more data than the original incident.
PITR is destructive—it replaces the current database state with a historical state. Alternative approaches may be less disruptive:
Consider alternatives before PITR:
- **Targeted Data Correction**: If specific records are corrupted, surgical UPDATE/DELETE statements may fix the issue without full recovery (a sketch follows this list)
- **Logical Restore from Backup**: Restoring specific tables from a logical backup (pg_dump) may be faster and less risky
- **Flashback Queries**: If the database supports temporal queries, historical data can be accessed without recovery
- **Application-Level Rollback**: Some applications maintain their own audit trails that can reverse specific transactions
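To make the first alternative concrete, here is a minimal sketch of a targeted correction run as a dry run. The table name 'orders' and column 'batch_id' are hypothetical; identify the bad rows for your own schema first:

# Dry run of a surgical correction: count and delete the bad batch inside a
# transaction, then ROLL BACK so nothing changes until the scope is confirmed.
psql production_db << 'SQL'
BEGIN;
SELECT count(*) AS rows_to_remove  FROM orders WHERE batch_id = 'batch_20240115_1431';
DELETE FROM orders WHERE batch_id = 'batch_20240115_1431';
SELECT count(*) AS rows_remaining  FROM orders WHERE batch_id = 'batch_20240115_1431';
ROLLBACK;  -- change to COMMIT once the counts look right
SQL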
The decision matrix below summarizes when PITR is the appropriate tool and when an alternative is preferable:
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Single table, known rows affected | Targeted correction or table restore | Faster, preserves other data |
| Single table, unknown scope | Table restore from backup | Isolates impact, simpler verification |
| Multiple tables, referential integrity | PITR | Ensures consistent cross-table state |
| Unknown scope, possible ongoing damage | PITR | Clean slate from known-good state |
| Malicious activity suspected | PITR | Eliminates any backdoors or hidden damage |
| Partial outage acceptable | Targeted restoration | Minimizes downtime for unaffected functions |
Determining the correct recovery target requires investigation. Getting this wrong means either recovering past the problem (corruption returns) or recovering too far back (unnecessary data loss).
Target Investigation Process:

1. **Establish Timeline**: When was the problem first noticed? When did symptoms actually begin (this may differ)?
2. **Analyze Logs**: Application logs, database logs, and audit tables provide transaction-level detail
3. **Identify Causal Transaction**: What specific transaction or process caused the damage?
4. **Determine Safe Target**: The target should be definitively before the causal event
5. **Add Safety Margin**: A few seconds before the identified bad transaction provides margin for clock skew
-- Example: Finding when a bad batch started
SELECT min(created_at) AS earliest_bad_record
FROM corrupted_table
WHERE is_corrupted = true;
-- Check for the offending transaction while its session is still active
-- (pg_stat_activity shows only current activity; for historical detail,
-- consult the server logs or audit tables)
-- PostgreSQL example:
SELECT xact_start, query
FROM pg_stat_activity
WHERE xact_start >= '2024-01-15 14:30:00'
AND xact_start <= '2024-01-15 14:32:00';
When in doubt about exact timing, recover to 5 minutes before the suspected incident time. This margin accounts for clock skew, transaction duration, and timing uncertainty. It's easier to manually replay 5 minutes of transactions than to repeat a recovery because you went past the problem.
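Applying the margin is simple arithmetic; for example (the incident time below is illustrative):

# Subtract a 5-minute safety margin from the suspected incident time
psql -c "SELECT timestamptz '2024-01-15 14:36:00+00' - interval '5 minutes' AS recovery_target;"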
PITR can be performed in several ways depending on infrastructure and requirements:

- **In-Place Recovery**: Stop the existing server and replace its data directory with the restored backup. Simplest, but the database is unavailable for the entire recovery.
- **Parallel Recovery**: Restore and recover on a separate server or data directory while the original stays untouched, allowing verification before cutover (sketched after this list).
- **Hybrid Approach**: Recover in parallel, verify, then either promote the copy or use it as a source for targeted data repairs on the original.
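A minimal sketch of the parallel approach, assuming enough disk space for a second data directory and the same backup and archive paths used elsewhere on this page (directory, port, and target are illustrative):

# Parallel recovery: recover into a second data directory on another port,
# verify the result, then cut applications over (or copy repaired data back).
RECOVERY_DIR="/data/postgres_recovery"

mkdir -p "$RECOVERY_DIR"
cp -a /backups/basebackup_latest/. "$RECOVERY_DIR"/
chown -R postgres:postgres "$RECOVERY_DIR" && chmod 700 "$RECOVERY_DIR"

touch "$RECOVERY_DIR/recovery.signal"
cat >> "$RECOVERY_DIR/postgresql.conf" << EOF
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_action = 'pause'
port = 5433
EOF

pg_ctl start -D "$RECOVERY_DIR" -l "$RECOVERY_DIR/recovery.log"
# Verify on port 5433 while production (5432) keeps running
psql -p 5433 -c "SELECT pg_last_xact_replay_timestamp();"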
Before beginning the actual recovery, several preparation steps protect against making the situation worse.
The current database state—even if corrupted—may contain valuable information needed later. Always preserve it before recovery.
# Take a final backup of the corrupted state
pg_dump -Fc -f /emergency/pre_recovery_$(date +%Y%m%d_%H%M%S).dump production_db
# If using filesystem backup
pg_basebackup -D /emergency/pre_recovery_basebackup -Fp -Xs -P
# Document current WAL position
psql -c "SELECT pg_current_wal_lsn(), current_timestamp;"
Why preserve the corrupted state?

- Transactions committed after the recovery target can later be extracted from it and reconciled (see the lost-transaction section below)
- It supports forensic analysis of what actually went wrong, especially important if malicious activity is suspected
- It provides evidence for audits and the post-incident review
- If the recovery itself goes wrong, you have not destroyed the only remaining copy of the data
Confirm all required components are available before starting:
#!/bin/bash
# Pre-recovery verification script
echo "=== PITR Pre-Recovery Verification ==="
# 1. Verify base backup exists and is accessible
BACKUP_DIR="/backups/basebackup_latest"
if [ -d "$BACKUP_DIR" ]; then
echo "✓ Base backup found: $BACKUP_DIR"
echo " Size: $(du -sh $BACKUP_DIR | cut -f1)"
else
echo "✗ ERROR: Base backup not found!"
exit 1
fi
# 2. Get backup start LSN (field 4 of "START WAL LOCATION: <lsn> (file <segment>)")
BACKUP_LSN=$(grep "START WAL LOCATION" "$BACKUP_DIR/backup_label" | awk '{print $4}')
echo "  Backup start LSN: $BACKUP_LSN"
# 3. Verify WAL archives from backup LSN to target
TARGET_TIME="2024-01-15 14:30:00"
echo -e "\nVerifying WAL archive continuity to target: $TARGET_TIME"
# Check archive sequence (simplified - a complete check would confirm every
# segment from the starting segment up to the target exists in the archive)
WAL_START=$(grep "START WAL LOCATION" "$BACKUP_DIR/backup_label" | awk '{print $6}' | tr -d ')')
echo "  Starting from segment: $WAL_START"
# 4. Check disk space for recovery
AVAILABLE_GB=$(df -BG /data | tail -1 | awk '{print $4}' | tr -d 'G')
REQUIRED_GB=100 # Estimate based on backup size
if [ $AVAILABLE_GB -gt $REQUIRED_GB ]; then
echo "✓ Sufficient disk space: ${AVAILABLE_GB}GB available"
else
echo "✗ WARNING: Low disk space: ${AVAILABLE_GB}GB available"
fi
# 5. Verify database is stopped (or will be)
if pg_isready -q; then
echo "⚠ Database is currently running - will need to stop"
else
echo "✓ Database is stopped"
fi
echo "\n=== Pre-Recovery Verification Complete ==="
Preventing new transactions during recovery preparation is critical:
-- PostgreSQL: Terminate connections and prevent new ones
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'production' AND pid <> pg_backend_pid();
-- Revoke connect privileges temporarily
REVOKE CONNECT ON DATABASE production FROM PUBLIC;
REVOKE CONNECT ON DATABASE production FROM app_user;
# Network-level: Block database port
iptables -A INPUT -p tcp --dport 5432 -j DROP
# Or pause traffic at the connection pooler (PgBouncer admin console;
# admin user and port depend on your pgbouncer.ini)
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "PAUSE;"
With preparation complete, we execute the recovery procedure. This section provides step-by-step instructions for PITR execution.
# Graceful shutdown with timeout
pg_ctl stop -D /data/postgres -m fast -t 60
# Verify shutdown
if pg_isready; then
echo "ERROR: Database still running!"
pg_ctl stop -D /data/postgres -m immediate
fi
echo "Database stopped at $(date)"
# Option A: Rename current data directory (preserves for forensics)
mv /data/postgres /data/postgres_pre_recovery_$(date +%Y%m%d_%H%M%S)
# Option B: Clear data directory (faster, if backup already taken)
rm -rf /data/postgres/*
# For a tar-format base backup (pg_basebackup -Ft), extract it into the data directory
tar -xf /backups/base_20240115_000000.tar -C /data/postgres
# Or for plain file copy backups
cp -a /backups/base_20240115_000000/* /data/postgres/
# Ensure correct ownership
chown -R postgres:postgres /data/postgres
echo "Base backup restored at $(date)"
Create the recovery configuration specifying the target time:
PostgreSQL 12+ (recovery.signal method):
# Create recovery signal file
touch /data/postgres/recovery.signal
# Add recovery configuration to postgresql.conf
cat >> /data/postgres/postgresql.conf << EOF
# --- PITR Recovery Configuration ---
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_action = 'pause'
recovery_target_inclusive = false
EOF
PostgreSQL 11 and earlier (recovery.conf method):
cat > /data/postgres/recovery.conf << EOF
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_action = 'pause'   # pre-9.5 releases used pause_at_recovery_target = true
EOF
Key Recovery Parameters:
| Parameter | Purpose | Recommended Value |
|---|---|---|
| `restore_command` | How to retrieve archived WAL | Path to archive + decompression |
| `recovery_target_time` | Target timestamp with timezone | ISO format with explicit timezone |
| `recovery_target_action` | What to do when target reached | `pause` for verification |
| `recovery_target_inclusive` | Include transaction at target? | `false` (safer) |
#!/bin/bash
# Complete PITR Recovery Configuration Script

# Variables (customize for your environment)
DATA_DIR="/data/postgres"
ARCHIVE_DIR="/archive/wal"
RECOVERY_TARGET="2024-01-15 14:30:00 America/New_York"
PG_VERSION=$(pg_config --version | grep -oP '\d+' | head -1)

echo "Configuring PITR recovery for PostgreSQL $PG_VERSION"
echo "Recovery target: $RECOVERY_TARGET"

# Determine restore command based on archive format
if ls $ARCHIVE_DIR/*.gz 1> /dev/null 2>&1; then
    RESTORE_CMD="gunzip -c $ARCHIVE_DIR/%f.gz > %p"
elif ls $ARCHIVE_DIR/*.zst 1> /dev/null 2>&1; then
    RESTORE_CMD="zstd -d -c $ARCHIVE_DIR/%f.zst > %p"
else
    RESTORE_CMD="cp $ARCHIVE_DIR/%f %p"
fi

echo "Restore command: $RESTORE_CMD"

# Create recovery configuration
if [ "$PG_VERSION" -ge 12 ]; then
    # PostgreSQL 12+ method
    touch "$DATA_DIR/recovery.signal"
    cat >> "$DATA_DIR/postgresql.conf" << EOF

# === POINT-IN-TIME RECOVERY CONFIGURATION ===
# Added: $(date)
# Target: $RECOVERY_TARGET
restore_command = '$RESTORE_CMD'
recovery_target_time = '$RECOVERY_TARGET'
recovery_target_action = 'pause'
recovery_target_inclusive = false

# Recovery performance tuning
max_wal_senders = 0
hot_standby = on
# === END PITR CONFIGURATION ===
EOF
else
    # PostgreSQL 11 and earlier
    cat > "$DATA_DIR/recovery.conf" << EOF
# Point-in-Time Recovery Configuration
# Created: $(date)
restore_command = '$RESTORE_CMD'
recovery_target_time = '$RECOVERY_TARGET'
recovery_target_action = 'pause'
EOF
fi

echo "Recovery configuration complete"
echo "Start the database to begin recovery"

Start the database to begin the recovery process:
# Start PostgreSQL in recovery mode
pg_ctl start -D /data/postgres -l /var/log/postgres/recovery.log
# Monitor recovery progress
tail -f /var/log/postgres/recovery.log
# Check recovery status
psql -c "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
Recovery Progress Indicators:
-- Monitor WAL replay progress
SELECT
CASE WHEN pg_is_in_recovery() THEN 'Recovering' ELSE 'Complete' END AS status,
pg_last_xact_replay_timestamp() AS replayed_to,
current_timestamp - pg_last_xact_replay_timestamp() AS lag
;
-- Check what's being restored (from logs)
-- Look for: "restored log file ... from archive"
The database will:

1. Start up and detect the recovery configuration (recovery.signal / recovery.conf)
2. Repeatedly call restore_command to fetch archived WAL segments
3. Replay WAL records until the recovery target time is reached
4. Pause and wait for verification (because recovery_target_action = 'pause')

When recovery reaches the target and pauses, verification is critical before finalizing. This is your last opportunity to adjust the target or abort.
-- Verify we're at the expected point
SELECT
pg_is_in_recovery() AS in_recovery,
pg_last_xact_replay_timestamp() AS recovered_to,
pg_is_wal_replay_paused() AS paused;
-- Expected output:
-- in_recovery | recovered_to | paused
-- t | 2024-01-15 14:29:58.123456+00 | t
-- Verify critical tables have expected state
SELECT count(*), max(updated_at) FROM critical_table;
-- Check for presence of known-good records
SELECT * FROM orders WHERE order_id = 'known_good_order_before_incident';
-- Confirm absence of corrupted records
SELECT count(*) FROM corrupted_table WHERE is_corrupted = true;
-- Expected: 0 (no corrupted records yet)
# Run application health checks
./app_health_check.sh --database-only
# Verify key business metrics
psql -f /scripts/verify_business_state.sql
Once you finalize recovery (by promoting the database out of recovery mode), you cannot easily go back. Take time here to verify thoroughly. If something looks wrong, you can still adjust the recovery target and try again.
If verification reveals the target was wrong:
Target Too Late (corruption present):
# Stop the database
pg_ctl stop -D /data/postgres -m fast

# Recovery cannot run backwards - restart the full recovery from the base backup
rm -rf /data/postgres/*
cp -a /backups/base_20240115/* /data/postgres/

# Re-apply the recovery configuration with an earlier target
cat >> /data/postgres/postgresql.conf << EOF
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:25:00+00'
recovery_target_action = 'pause'
recovery_target_inclusive = false
EOF
touch /data/postgres/recovery.signal

pg_ctl start -D /data/postgres
Target Too Early (missing needed data):
# recovery_target_time cannot be changed while the server is running, but WAL
# replay only moves forward, so the target can be moved later without re-restoring
# the base backup: stop the server, edit the target, and start it again -
# recovery resumes and replays up to the new target.
pg_ctl stop -D /data/postgres -m fast
sed -i "s/recovery_target_time = .*/recovery_target_time = '2024-01-15 14:35:00+00'/" \
    /data/postgres/postgresql.conf
pg_ctl start -D /data/postgres
Before finalizing, get explicit confirmation:
-- Document the recovered state
SELECT
'PITR Recovery Verification' AS report_type,
current_timestamp AS verified_at,
pg_last_xact_replay_timestamp() AS recovered_to,
(SELECT count(*) FROM pg_stat_user_tables) AS tables_present,
pg_database_size(current_database()) AS database_size;
Once verification is complete and stakeholders have confirmed the recovered state is correct, finalize the recovery to return to normal operation.
-- For paused recovery, promote to primary
SELECT pg_wal_replay_resume(); -- If paused
-- Wait for promotion to complete
-- Verify no longer in recovery
SELECT pg_is_in_recovery(); -- Should return 'f'
Alternatively, using command line:
# PostgreSQL 12+
pg_ctl promote -D /data/postgres
# Or create trigger file (older method)
touch /data/postgres/promote
1. Verify Normal Operation
-- Confirm database accepts writes
CREATE TABLE recovery_test (id serial, ts timestamp default now());
INSERT INTO recovery_test DEFAULT VALUES;
SELECT * FROM recovery_test;
DROP TABLE recovery_test;
-- Check for any errors in logs
-- tail -100 /var/log/postgres/postgresql.log
2. Re-enable Archiving
-- Verify archive_mode is still enabled
SHOW archive_mode;
SHOW archive_command;
-- Force a WAL switch to verify archiving works
SELECT pg_switch_wal();
-- Check archive status
SELECT * FROM pg_stat_archiver;
3. Re-establish Replication (if applicable)
# Rebuild standby servers from new timeline
pg_basebackup -h recovered-primary -D /data/standby -Xs -P
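A standby that has replayed WAL past the recovery point cannot follow the new timeline and must be rebuilt (or, in some cases, resynchronized with pg_rewind). One straightforward rebuild is pg_basebackup with -R, which writes standby.signal and the primary connection settings automatically; the host, replication user, and service name below are assumptions:

# On the standby: rebuild from the promoted primary on the new timeline
systemctl stop postgresql
rm -rf /data/standby/*
pg_basebackup -h recovered-primary -U replicator -D /data/standby -Xs -P -R
chown -R postgres:postgres /data/standby
systemctl start postgresql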
When recovery completes and the database is promoted, a new timeline is created. This is crucial for understanding the database's history:
Timeline 1 (original)
-------------------------->
                           \
                            \  Timeline 2 (after PITR)
                             \------------------------>
                             ▲
                        Recovery Point
The new timeline:

- Begins at the recovery point where the old timeline was abandoned
- Receives a new timeline ID (the previous ID incremented), which appears in the names of all subsequent WAL segments
- Is documented by a .history file recording the branch point

# Check current timeline
psql -c "SELECT timeline_id FROM pg_control_checkpoint();"
# View timeline history
cat /data/postgres/pg_wal/00000002.history
# Example content:
# 1 0/50000A0 before 2024-01-15 14:30:00+00
Once a new timeline is created, the database's history has permanently diverged. The original timeline still exists in archives but will not be continued. All subsequent operations occur on the new timeline.
Recovery completion is not the end of the process. Several post-recovery tasks ensure the database is properly stabilized and protected.
1. Take a New Base Backup
After PITR, immediately create a new base backup. This:

- Establishes a fresh recovery baseline on the new timeline
- Keeps future restores from depending on the old backup plus a timeline switch across the recovery point
- Shortens the WAL replay needed for any subsequent recovery
# Immediate post-recovery backup
pg_basebackup -D /backups/post_recovery_$(date +%Y%m%d_%H%M%S) \
-Fp -Xs -P -c fast
echo "Post-recovery base backup completed at $(date)"
2. Re-establish Monitoring
# Restart monitoring agents
systemctl restart prometheus-postgres-exporter
systemctl restart pgbadger-collector
# Notify the alerting system that recovery is complete
# (the Alertmanager API expects a JSON array of alerts)
curl -X POST http://alertmanager:9093/api/v1/alerts \
  -d '[{"labels":{"alertname":"PITRRecoveryComplete"}}]'
3. Reconnect Applications
-- Restore connection privileges
GRANT CONNECT ON DATABASE production TO app_user;
GRANT CONNECT ON DATABASE production TO PUBLIC;
-- Verify connections are working
SELECT * FROM pg_stat_activity WHERE datname = 'production';
PITR recovers to a point in time, but transactions after that point are lost. Some of these may be legitimate transactions that need recovery.
Transaction Recovery Options:
- **Replay from Application Logs**: If the application records every write (or publishes events to a queue), post-target transactions can be re-submitted through normal application paths
- **Extract from Corrupted Backup**: The pre-recovery backup taken earlier still contains the lost transactions; valid ones can be extracted and re-applied (see the query below)
- **User Notification**: Where transactions cannot be recovered automatically, affected users are identified and asked to re-enter their work
-- Identify transactions from corrupted backup that might need recovery
-- (Run against the preserved corrupted backup)
SELECT *
FROM corrupted_backup.transactions
WHERE created_at > '2024-01-15 14:30:00'
AND created_at < '2024-01-15 14:45:00'
AND is_valid = true -- Application-specific validation
ORDER BY created_at;
Reconciliation Checklist:

- Establish the exact window of lost transactions (recovery target → time the database was fenced)
- Extract candidate transactions from the preserved pre-recovery backup
- Validate each candidate and exclude anything produced by the faulty process
- Re-apply valid transactions through the application or controlled scripts
- Notify affected users and document what was and was not recovered

A quick comparison between the recovered database and the preserved backup, as sketched below, helps bound the reconciliation work.
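For example (the dump file name, database names, and table are illustrative):

# Restore the preserved pre-recovery dump into a scratch database for comparison
createdb reconciliation_scratch
pg_restore -d reconciliation_scratch /emergency/pre_recovery_20240115_144500.dump

# Compare row counts and latest timestamps for a key table in both databases
psql -d reconciliation_scratch -c \
  "SELECT count(*), max(created_at) FROM orders WHERE created_at > '2024-01-15 14:30:00';"
psql -d production -c \
  "SELECT count(*), max(created_at) FROM orders WHERE created_at > '2024-01-15 14:30:00';"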
While PITR requires human judgment for target selection and verification, the mechanical steps benefit from automation. Well-designed runbooks and scripts reduce errors and recovery time.
A production-ready PITR runbook should include:
1. Decision Tree: Is PITR the right tool, or will a targeted fix or table-level restore do?
2. Pre-Checks: Backup, WAL-archive, and disk-space verification before anything is touched
3. Execution Scripts: Parameterized, tested commands for each mechanical step
4. Verification Procedures: Queries and health checks to run while recovery is paused
5. Rollback Procedures: What to do if the target is wrong or the recovery fails
#!/usr/bin/env python3
"""
Point-in-Time Recovery Automation Framework
Provides guided, semi-automated PITR with human checkpoints
"""

import subprocess
import sys
from datetime import datetime
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger('pitr_recovery')


class RecoveryPhase(Enum):
    PREPARATION = "preparation"
    BACKUP_CURRENT = "backup_current"
    RESTORE_BASE = "restore_base"
    CONFIGURE_RECOVERY = "configure_recovery"
    EXECUTE_RECOVERY = "execute_recovery"
    VERIFY = "verify"
    FINALIZE = "finalize"
    POST_RECOVERY = "post_recovery"


@dataclass
class RecoveryConfig:
    target_time: str
    base_backup_path: Path
    archive_path: Path
    data_dir: Path
    emergency_backup_dir: Path


class PITRRecoveryManager:
    """Manages the complete PITR recovery workflow"""

    def __init__(self, config: RecoveryConfig):
        self.config = config
        self.current_phase = RecoveryPhase.PREPARATION
        self.checkpoints = []

    def log_checkpoint(self, message: str):
        """Record a recovery checkpoint"""
        checkpoint = {
            'timestamp': datetime.now().isoformat(),
            'phase': self.current_phase.value,
            'message': message
        }
        self.checkpoints.append(checkpoint)
        logger.info(f"CHECKPOINT: {message}")

    def require_confirmation(self, prompt: str) -> bool:
        """Require human confirmation before proceeding"""
        print(f"\n{'='*60}")
        print("CONFIRMATION REQUIRED")
        print(f"{'='*60}")
        print(f"\n{prompt}\n")
        response = input("Type 'PROCEED' to continue, anything else to abort: ")
        return response.strip().upper() == "PROCEED"

    def phase_preparation(self) -> bool:
        """Pre-recovery verification phase"""
        self.current_phase = RecoveryPhase.PREPARATION
        logger.info("Starting preparation phase")

        # Verify base backup exists
        if not self.config.base_backup_path.exists():
            logger.error(f"Base backup not found: {self.config.base_backup_path}")
            return False
        self.log_checkpoint(f"Base backup verified: {self.config.base_backup_path}")

        # Verify archive path
        if not self.config.archive_path.exists():
            logger.error(f"Archive path not found: {self.config.archive_path}")
            return False
        self.log_checkpoint("Archive path verified")

        # Check disk space
        # ... (disk space verification logic)

        return self.require_confirmation(
            f"Preparation complete. Ready to proceed with PITR to: "
            f"{self.config.target_time}"
        )

    def phase_backup_current(self) -> bool:
        """Backup current corrupted state"""
        self.current_phase = RecoveryPhase.BACKUP_CURRENT
        logger.info("Backing up current database state")

        backup_name = f"pre_pitr_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        backup_path = self.config.emergency_backup_dir / backup_name
        backup_path.mkdir(parents=True, exist_ok=True)  # ensure the target directory exists

        result = subprocess.run([
            'pg_dump', '-Fc',
            '-f', str(backup_path / 'dump.custom'),
            'production'
        ], capture_output=True, text=True)

        if result.returncode != 0:
            logger.error(f"Backup failed: {result.stderr}")
            return False

        self.log_checkpoint(f"Current state backed up to: {backup_path}")
        return True

    def execute_recovery(self) -> bool:
        """Execute the complete recovery workflow"""
        phases = [
            self.phase_preparation,
            self.phase_backup_current,
            # ... additional phases
        ]

        for phase_func in phases:
            if not phase_func():
                logger.error(f"Recovery aborted at phase: {self.current_phase}")
                return False

        logger.info("Recovery completed successfully")
        return True


if __name__ == "__main__":
    config = RecoveryConfig(
        target_time="2024-01-15 14:30:00",
        base_backup_path=Path("/backups/base_latest"),
        archive_path=Path("/archive/wal"),
        data_dir=Path("/data/postgres"),
        emergency_backup_dir=Path("/emergency")
    )

    manager = PITRRecoveryManager(config)
    success = manager.execute_recovery()
    sys.exit(0 if success else 1)

We've walked through the complete PITR recovery process from incident detection to post-recovery operations. The key points: choose PITR deliberately after weighing the alternatives, investigate the target time carefully, preserve the corrupted state before touching anything, verify thoroughly while recovery is paused, and stabilize the new timeline with a fresh base backup.
What's next:
Now that we understand how to execute PITR, we'll examine Target Time Specification in depth. The next page explores the nuances of specifying recovery targets, including timezone handling, transaction boundary considerations, and techniques for identifying the optimal recovery point during incident investigation.
You now understand the complete PITR recovery process. You've learned the decision framework for choosing PITR, pre-recovery preparation, step-by-step execution, verification procedures, and post-recovery operations. Next, we'll dive deeper into the art and science of specifying the optimal recovery target.