Every database will eventually experience a failure that threatens data. Hardware fails, software crashes, human operators make mistakes, and disasters strike. The question is not if you will need to recover data, but when—and whether you'll be prepared.
Backup and recovery is the DBA's ultimate responsibility. All other duties—performance tuning, security hardening, monitoring—become meaningless if the data is lost forever. A well-designed backup strategy transforms potential catastrophes into recoverable incidents.
Organizations that have lost critical data without adequate backups have faced bankruptcy, regulatory penalties, and permanent reputation damage. Conversely, organizations with robust backup strategies have recovered from ransomware attacks, data center fires, and catastrophic human errors with minimal lasting impact.
By the end of this page, you will understand comprehensive backup and recovery strategies, including backup types and their trade-offs, recovery objectives (RTO/RPO), backup technologies and tools, point-in-time recovery, backup verification, disaster recovery planning, and real-world recovery scenarios. You'll learn to design backup strategies that balance protection, cost, and operational complexity.
Before designing a backup strategy, you must understand what level of recovery the business requires. Two key metrics define recovery requirements:
Recovery Time Objective (RTO):
RTO answers: How long can we be down?
This is the maximum acceptable time from failure to full recovery. An RTO of 1 hour means the business can tolerate at most one hour of database unavailability.
Recovery Point Objective (RPO):
RPO answers: How much data can we afford to lose?
This is the maximum acceptable data loss measured in time. An RPO of 1 hour means we can tolerate losing up to one hour of transactions.
| RPO/RTO | Typical Systems | Backup Strategy | Cost Level |
|---|---|---|---|
| Minutes | Trading systems, payment processing | Synchronous replication, continuous backup, automated failover | Very High |
| 1 Hour | E-commerce, SaaS applications | Frequent log backups, hot standby, tested recovery | High |
| 4 Hours | Business applications, internal systems | Regular backups with transaction logs, warm standby | Medium |
| 24 Hours | Development, analytics, archives | Daily backups, cold storage, manual recovery | Low |
| Days/Weeks | Historical archives, compliance data | Infrequent backups, tape/cold storage | Minimal |
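RPO in particular is easy to monitor directly: if transaction log backups run every 15 minutes, the worst-case data loss is roughly the age of the newest archived log. Below is a minimal sketch of such a check, assuming a PostgreSQL WAL archive directory and a 15-minute target; the path, threshold, and alerting style are illustrative rather than part of any standard tool.

```bash
#!/bin/bash
# Hypothetical RPO check: warn when the newest archived transaction log is
# older than the RPO target. Path and threshold are illustrative.

WAL_ARCHIVE_DIR="/backup/wal_archive"   # assumed archive location
RPO_TARGET_SECONDS=$((15 * 60))         # example RPO target: 15 minutes

# Newest file in the archive directory
newest=$(ls -t "$WAL_ARCHIVE_DIR" | head -n 1)
if [ -z "$newest" ]; then
    echo "ERROR: no archived logs found in $WAL_ARCHIVE_DIR"
    exit 1
fi

# Modification time of the newest archive (BSD stat first, then GNU stat)
mtime=$(stat -f %m "$WAL_ARCHIVE_DIR/$newest" 2>/dev/null || stat -c %Y "$WAL_ARCHIVE_DIR/$newest")
age=$(( $(date +%s) - mtime ))

if [ "$age" -gt "$RPO_TARGET_SECONDS" ]; then
    echo "WARNING: newest archived log is ${age}s old; RPO target is ${RPO_TARGET_SECONDS}s"
    exit 2
fi
echo "OK: newest archived log is ${age}s old (within RPO target)"
```

Run from cron, a check like this turns the RPO from a documented number into an alert the moment log shipping falls behind.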
The Cost-Protection Trade-off:
More aggressive recovery objectives (lower RTO/RPO) require more sophisticated infrastructure: synchronous replication, hot standbys, frequent log backups, and automated failover all add cost and operational complexity.
The key is matching recovery objectives to actual business requirements. Over-engineering backup systems wastes money; under-engineering risks the business.
RTO and RPO are business decisions, not technical ones. The DBA can explain what's achievable at what cost, but business stakeholders must decide acceptable risk levels. Document these decisions and get sign-off—they protect you when trade-offs become visible during incidents.
Different backup types serve different purposes. A robust strategy typically combines multiple types:
Full Backup:
A complete copy of the entire database. Every data file, every table, every row is captured.
Incremental Backup:
Captures only data changed since the last backup (full or incremental).
Differential Backup:
Captures all data changed since the last full backup.
Transaction Log Backup:
Captures transaction log records since the last log backup. Essential for point-in-time recovery.
| Aspect | Detail |
|---|---|
| Purpose | Enable point-in-time recovery (PITR) to any moment |
| Frequency | Every 5-15 minutes for production databases |
| RPO Impact | RPO equals the log backup interval (at most, changes since the last log backup are lost) |
| Chain Requirement | Log chain must be unbroken from full backup forward |
| Storage | Relatively small; essential to archive and protect |
```sql
-- PostgreSQL: Backup Strategy Components

-- 1. FULL BACKUP with pg_basebackup (physical)
-- Run weekly (e.g., Sunday night)
-- pg_basebackup -h localhost -D /backup/full/$(date +%Y%m%d) -U backup_user -P -Ft -z

-- 2. CONTINUOUS WAL ARCHIVING (transaction logs)
-- In postgresql.conf:
archive_mode = on
archive_command = 'cp %p /backup/wal_archive/%f'

-- Or using pgBackRest for more sophisticated management:
-- pgbackrest --stanza=main --type=full backup
-- pgbackrest --stanza=main --type=incr backup

-- 3. SQL Server example:
-- Full backup (weekly)
BACKUP DATABASE [Production] TO DISK = 'E:\Backup\Production_Full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;

-- Differential backup (daily)
BACKUP DATABASE [Production] TO DISK = 'E:\Backup\Production_Diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;

-- Transaction log backup (every 15 minutes)
BACKUP LOG [Production] TO DISK = 'E:\Backup\Production_Log.trn'
WITH COMPRESSION, CHECKSUM;
```

Keep at least 3 copies of critical data, on 2 different types of media, with 1 copy offsite. This protects against single failures (corrupted backup), media failures (disk dies), and site disasters (fire, flood).
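This "3 copies, 2 media types, 1 offsite" guidance is commonly called the 3-2-1 rule. As a rough illustration, the sketch below copies the newest local backup to a NAS mount (second media type) and to object storage (offsite). The staging directory, NAS mount point, and S3 bucket name are hypothetical, and it assumes the AWS CLI is installed and configured.

```bash
#!/bin/bash
# Hypothetical 3-2-1 copy step: local disk holds copy 1, a NAS mount holds
# copy 2 (second media type), object storage holds copy 3 (offsite).
# All paths and the bucket name are illustrative.
set -e

LOCAL_BACKUP_DIR="/backup/full"            # assumed local staging area
NAS_BACKUP_DIR="/mnt/nas/db_backups"       # assumed NAS mount
S3_BUCKET="s3://example-db-backups/full"   # assumed offsite bucket

# Pick the newest backup in the local staging area
latest=$(ls -t "$LOCAL_BACKUP_DIR" | head -n 1)
if [ -z "$latest" ]; then
    echo "ERROR: no local backups found in $LOCAL_BACKUP_DIR"
    exit 1
fi

# Copy 2: second media type
cp -r "$LOCAL_BACKUP_DIR/$latest" "$NAS_BACKUP_DIR/"

# Copy 3: offsite object storage (requires configured AWS CLI credentials)
aws s3 cp "$LOCAL_BACKUP_DIR/$latest" "$S3_BUCKET/$latest/" --recursive

echo "3-2-1 copies complete for $latest"
```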
Backups can be categorized as physical (copying raw data files) or logical (exporting data in a portable format). Each approach has distinct advantages:
```bash
#!/bin/bash
# Backup Method Examples

# ============================================
# PHYSICAL BACKUP - PostgreSQL with pg_basebackup
# ============================================
pg_basebackup \
  --host=db-primary \
  --username=backup_user \
  --pgdata=/backup/base/$(date +%Y%m%d_%H%M%S) \
  --format=tar \
  --gzip \
  --checkpoint=fast \
  --wal-method=stream \
  --progress

# ============================================
# LOGICAL BACKUP - PostgreSQL with pg_dump
# ============================================
# Full database export
pg_dump \
  --host=db-primary \
  --username=backup_user \
  --format=custom \
  --file=/backup/logical/production_$(date +%Y%m%d).dump \
  production_db

# Specific tables only
pg_dump \
  --host=db-primary \
  --table=customers \
  --table=orders \
  --format=custom \
  --file=/backup/logical/critical_tables.dump \
  production_db

# ============================================
# LOGICAL BACKUP - MySQL with mysqldump
# ============================================
mysqldump \
  --host=db-primary \
  --user=backup_user \
  --password \
  --single-transaction \
  --routines \
  --triggers \
  --quick \
  production_db > /backup/logical/production_$(date +%Y%m%d).sql

# ============================================
# PHYSICAL BACKUP - MySQL with Percona XtraBackup
# ============================================
xtrabackup \
  --backup \
  --target-dir=/backup/base/$(date +%Y%m%d) \
  --user=backup_user \
  --password \
  --parallel=4 \
  --compress
```

Best practice combines both approaches: physical backups for fast recovery of the entire database, logical backups for portability, selective restoration, and migration. Logical backups also serve as a cross-check—if both backup types restore successfully, confidence is high.
Point-in-Time Recovery allows restoring a database to any specific moment—not just to backup time, but to any point covered by transaction logs. PITR is essential for recovering from logical errors like accidental deletions or application bugs that corrupt data.
How PITR works: restore the most recent full backup taken before the target time, then replay archived transaction logs up to the chosen moment and stop.
```sql
-- PostgreSQL PITR Recovery Configuration

-- 1. Stop PostgreSQL
-- sudo systemctl stop postgresql

-- 2. Clear data directory (or use new location)
-- rm -rf /var/lib/postgresql/data/*

-- 3. Restore base backup
-- tar -xzf /backup/base/20240115.tar.gz -C /var/lib/postgresql/data/

-- 4. Configure recovery settings
-- postgresql.conf or recovery.conf (version dependent):
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2024-01-15 14:59:00'
recovery_target_action = 'promote'  -- or 'pause' to verify before promoting

-- 5. Create recovery signal file
-- touch /var/lib/postgresql/data/recovery.signal

-- 6. Start PostgreSQL - recovery begins automatically
-- sudo systemctl start postgresql

-- 7. Monitor recovery progress in logs
-- tail -f /var/log/postgresql/postgresql-15-main.log

-- ============================================
-- SQL Server PITR Recovery
-- ============================================
-- Step 1: Restore full backup with NORECOVERY
RESTORE DATABASE [Production]
FROM DISK = 'E:\Backup\Production_Full.bak'
WITH NORECOVERY, REPLACE;

-- Step 2: Restore differential (if applicable)
RESTORE DATABASE [Production]
FROM DISK = 'E:\Backup\Production_Diff.bak'
WITH NORECOVERY;

-- Step 3: Restore logs until target time
RESTORE LOG [Production]
FROM DISK = 'E:\Backup\Production_Log_1.trn'
WITH NORECOVERY;

RESTORE LOG [Production]
FROM DISK = 'E:\Backup\Production_Log_2.trn'
WITH STOPAT = '2024-01-15T14:59:00', RECOVERY;
```

PITR is only possible if transaction logs are preserved. A broken log chain (missing or corrupted log files) limits recovery to the last available point, so archive logs immediately, verify copies, and monitor for gaps.
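Because PITR depends entirely on that unbroken chain, it pays to watch archiving health continuously rather than discovering a gap during recovery. The sketch below queries PostgreSQL's pg_stat_archiver view for archiving failures; the connection details and the choice to connect to the postgres database are assumptions.

```bash
#!/bin/bash
# Hypothetical WAL-archiving health check using the pg_stat_archiver view.
# Connection details (local socket, postgres database) are illustrative.

FAILED=$(psql -d postgres -At -c "SELECT failed_count FROM pg_stat_archiver;")
LAST_FAILED=$(psql -d postgres -At -c "SELECT coalesce(last_failed_wal, 'none') FROM pg_stat_archiver;")

if [ "$FAILED" -gt 0 ]; then
    echo "WARNING: $FAILED WAL archive failures recorded (last failed segment: $LAST_FAILED)"
    echo "A failing archive_command eventually breaks the log chain needed for PITR."
    exit 1
fi

echo "OK: no WAL archive failures recorded"
```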
A backup that hasn't been tested is not a backup—it's a hope. Many organizations have discovered during actual emergencies that their backups were corrupted, incomplete, or unrecoverable. Regular verification and testing are essential.
Verification Levels:
| Level | What It Verifies | How Often | Effort |
|---|---|---|---|
| Completion Check | Backup job finished without errors | Every backup | Automated |
| Checksum Validation | Backup file integrity (no corruption) | Every backup | Automated |
| Restore Test (Automated) | Backup can be restored to a test environment | Weekly | Automated |
| Data Validation | Restored data is correct and complete | Monthly | Semi-automated |
| Full DR Test | Complete recovery to alternate site | Quarterly/Annually | Manual event |
```bash
#!/bin/bash
# Automated Backup Verification Script

set -e  # Exit on error

BACKUP_FILE="/backup/base/production_$(date +%Y%m%d).tar.gz"
RESTORE_DIR="/tmp/backup_test"
VERIFY_LOG="/var/log/backup_verification.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$VERIFY_LOG"
}

# Step 1: Verify backup file exists and has reasonable size
if [ ! -f "$BACKUP_FILE" ]; then
    log "ERROR: Backup file not found: $BACKUP_FILE"
    exit 1
fi

SIZE=$(stat -f%z "$BACKUP_FILE" 2>/dev/null || stat -c%s "$BACKUP_FILE")
if [ "$SIZE" -lt 1000000 ]; then
    log "ERROR: Backup file suspiciously small: $SIZE bytes"
    exit 1
fi
log "Backup file size OK: $SIZE bytes"

# Step 2: Verify checksum/integrity
if ! gzip -t "$BACKUP_FILE" 2>/dev/null; then
    log "ERROR: Backup file failed gzip integrity check"
    exit 1
fi
log "Gzip integrity check passed"

# Step 3: Test extraction
rm -rf "$RESTORE_DIR"
mkdir -p "$RESTORE_DIR"
if ! tar -xzf "$BACKUP_FILE" -C "$RESTORE_DIR"; then
    log "ERROR: Failed to extract backup"
    exit 1
fi
log "Extraction successful"

# Step 4: Attempt database restore to test instance on an alternate port
pg_ctl -D "$RESTORE_DIR" -o "-p 5433" -w start

# Step 5: Verify data integrity
TABLES=$(psql -p 5433 -d production_db -At -c "SELECT count(*) FROM information_schema.tables WHERE table_schema='public'")
if [ "$TABLES" -lt 10 ]; then
    log "WARNING: Fewer tables than expected: $TABLES"
fi

CUSTOMERS=$(psql -p 5433 -d production_db -At -c "SELECT count(*) FROM customers")
log "Restored database has $CUSTOMERS customers"

# Step 6: Cleanup
pg_ctl -D "$RESTORE_DIR" stop
rm -rf "$RESTORE_DIR"

log "=== Backup verification completed successfully ==="
```

During recovery tests, measure how long restoration takes. This validates your RTO assumptions. If your RTO is 1 hour but restoration takes 3 hours, you have a planning gap that must be addressed before a real emergency.
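A lightweight way to capture that timing is to wrap the restore step in a timer and compare the result against the documented RTO. The sketch below is illustrative; the RTO value, backup path, and restore procedure are placeholders for whatever your environment actually uses.

```bash
#!/bin/bash
# Hypothetical restore-timing wrapper: measures how long a test restore takes
# and flags it when the duration exceeds the documented RTO.

RTO_SECONDS=$((60 * 60))   # example RTO: 1 hour

# Placeholder restore procedure: substitute the real steps used in your
# environment (extract base backup, replay logs, start test instance, ...).
run_test_restore() {
    mkdir -p /tmp/restore_timing_test
    tar -xzf /backup/base/latest.tar.gz -C /tmp/restore_timing_test
}

start=$(date +%s)
run_test_restore
end=$(date +%s)
duration=$((end - start))

echo "Test restore took ${duration}s"
if [ "$duration" -gt "$RTO_SECONDS" ]; then
    echo "WARNING: restore duration exceeds RTO of ${RTO_SECONDS}s - revisit the recovery plan"
fi
```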
Backing up databases requires appropriate infrastructure for storage, scheduling, and management.
Storage Options:
| Storage Type | Pros | Cons | Use Case |
|---|---|---|---|
| Local Disk | Fast, simple | Single point of failure, limited capacity | Staging before offsite copy |
| Network Storage (NAS/SAN) | Centralized, redundant | Network dependent, shared bottleneck | Medium-term retention |
| Object Storage (S3, Azure Blob) | Unlimited, durable, geographic options | Latency, egress costs | Long-term, offsite, disaster recovery |
| Tape | Very low cost per GB, offline protection | Slow, manual handling | Archive, air-gapped security |
| Dedicated Backup Appliances | Optimized, deduplication, integration | Cost, vendor lock-in | Enterprise environments |
Backup Management Tools:
Dedicated backup tools provide scheduling, retention management, verification, and reporting:
```ini
# pgBackRest Configuration Example
# /etc/pgbackrest/pgbackrest.conf

[global]
# Repository configuration
# Keep 4 full backups and 2 differentials per full; encrypt backups
repo1-path=/backup/pgbackrest
repo1-retention-full=4
repo1-retention-diff=2
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=SecureBackupKey

# Second repository for offsite (S3); keep 12 full backups in S3
repo2-type=s3
repo2-s3-bucket=mycompany-db-backups
repo2-s3-endpoint=s3.amazonaws.com
repo2-s3-region=us-east-1
repo2-path=/production
repo2-retention-full=12
repo2-s3-key=AKIAIOSFODNN7EXAMPLE
repo2-s3-key-secret=SECRET_KEY_HERE

# Compression and performance (4 parallel processes)
compress-type=zst
compress-level=3
process-max=4

# Logging
log-level-console=info
log-level-file=detail

[main]
# PostgreSQL cluster configuration
pg1-path=/var/lib/postgresql/15/main
pg1-port=5432
pg1-user=postgres
```

Backup traffic can saturate production networks if not managed. Consider dedicated backup networks, bandwidth throttling during peak hours, or off-hours scheduling. Also ensure backup systems themselves are secured—a compromised backup server provides access to all your data.
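Scheduling is typically handled outside pgBackRest, for example from cron. The entries below are an illustrative schedule for the postgres user's crontab, reusing the stanza name from the configuration above; the exact times are assumptions.

```bash
# Illustrative crontab entries (postgres user) for pgBackRest scheduling

# Weekly full backup, Sunday 01:00
0 1 * * 0    pgbackrest --stanza=main --type=full backup

# Daily differential backup, Monday-Saturday 01:00
0 1 * * 1-6  pgbackrest --stanza=main --type=diff backup

# Hourly incremental backups during business hours
0 8-20 * * * pgbackrest --stanza=main --type=incr backup
```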
Backup retention policies balance recovery needs, storage costs, and regulatory requirements.
Typical Retention Periods by Backup Type:
| Backup Type | Short-Term | Medium-Term | Long-Term/Archive |
|---|---|---|---|
| Transaction Logs | 7-14 days | 30 days | Typically not archived |
| Daily Backups | 14-30 days | 90 days | Selected months kept yearly |
| Weekly Full | 4-8 weeks | 12 months | Year-end kept 7+ years |
| Monthly Full | 12 months | 3-5 years | As required by regulation |
Grandfather-Father-Son (GFS) Rotation:
A classic retention scheme that manages backup expiration in three tiers: daily backups (sons) are kept for the shortest period, weekly backups (fathers) longer, and monthly backups (grandfathers) the longest.
After 12 months, monthly backups can rotate out or be retained for yearly archives.
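One simple way to implement GFS is to route each new backup into a daily, weekly, or monthly directory when it is created, then let a retention job expire each tier on its own schedule, as in the script that follows. The sketch below is illustrative; the staging path and routing rules are assumptions, while the tier directories match the retention script.

```bash
#!/bin/bash
# Hypothetical GFS routing: place today's backup into the monthly (grandfather),
# weekly (father), or daily (son) tier based on the date.
# The staging path is illustrative; tier directories match the retention script.

BACKUP_FILE="/backup/staging/production_$(date +%Y%m%d).tar.gz"
BACKUP_ROOT="/backup/postgresql"

day_of_month=$(date +%d)
day_of_week=$(date +%u)   # 1 = Monday ... 7 = Sunday

if [ "$day_of_month" = "01" ]; then
    tier="monthly"   # grandfather: first backup of the month
elif [ "$day_of_week" = "7" ]; then
    tier="weekly"    # father: Sunday backup
else
    tier="daily"     # son: every other day
fi

mkdir -p "$BACKUP_ROOT/$tier"
mv "$BACKUP_FILE" "$BACKUP_ROOT/$tier/"
echo "Backup stored in $tier tier: $BACKUP_ROOT/$tier/$(basename "$BACKUP_FILE")"
```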
```bash
#!/bin/bash
# Backup Retention Management Script

BACKUP_DIR="/backup/postgresql"
LOG_FILE="/var/log/backup_retention.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Remove daily backups older than 14 days
log "Cleaning daily backups older than 14 days..."
find "$BACKUP_DIR/daily" -name "*.tar.gz" -mtime +14 -delete

# Remove weekly backups older than 8 weeks
log "Cleaning weekly backups older than 8 weeks..."
find "$BACKUP_DIR/weekly" -name "*.tar.gz" -mtime +56 -delete

# Remove monthly backups older than 1 year
log "Cleaning monthly backups older than 1 year..."
find "$BACKUP_DIR/monthly" -name "*.tar.gz" -mtime +365 -delete

# Remove transaction logs older than 7 days
log "Cleaning transaction logs older than 7 days..."
find "$BACKUP_DIR/wal" -name "*.xz" -mtime +7 -delete

# Report remaining storage
USED=$(du -sh "$BACKUP_DIR" | cut -f1)
log "Retention cleanup complete. Total backup storage: $USED"

# Alert if storage exceeds threshold
USED_BYTES=$(du -sb "$BACKUP_DIR" | cut -f1)
THRESHOLD=$((500 * 1024 * 1024 * 1024))  # 500GB
if [ "$USED_BYTES" -gt "$THRESHOLD" ]; then
    log "WARNING: Backup storage exceeds 500GB threshold"
fi
```

Retention policies should be documented and approved by stakeholders including legal, compliance, and business units. During audits or legal proceedings, you'll need to demonstrate that retention practices follow documented policy. Changes to retention should follow change management processes.
Disaster recovery (DR) addresses scenarios where the entire primary site becomes unavailable—natural disasters, infrastructure failures, or widespread outages. DR planning ensures business continuity through geographic redundancy and tested procedures.
DR Architecture Options:
| Tier | Architecture | RTO | RPO | Cost |
|---|---|---|---|---|
| Cold Site | Backup restoration to rented/cloud infrastructure | Days | Hours-Days | Low |
| Warm Standby | Periodically synchronized replica at DR site | Hours | Minutes-Hours | Medium |
| Hot Standby | Real-time replicated standby ready for failover | Minutes | Seconds | High |
| Active-Active | Both sites actively serving traffic, data synchronized | Seconds (automatic) | Near-zero | Very High |
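For the warm and hot standby tiers, PostgreSQL streaming replication is one common building block. The sketch below shows a minimal standby bootstrap under stated assumptions: the host name, replication user, and data directory are illustrative, and a real setup also needs a replication role and a pg_hba.conf entry on the primary.

```bash
#!/bin/bash
# Hypothetical standby bootstrap for PostgreSQL streaming replication.
# Host, user, and data directory are illustrative.

STANDBY_DATA=/var/lib/postgresql/15/standby

# 1. Clone the primary. --write-recovery-conf records primary_conninfo and
#    creates standby.signal so the clone starts as a streaming standby.
pg_basebackup \
  --host=db-primary \
  --username=replicator \
  --pgdata="$STANDBY_DATA" \
  --wal-method=stream \
  --write-recovery-conf \
  --checkpoint=fast \
  --progress

# 2. Start the standby; it stays read-only and streams changes from the
#    primary until it is promoted (see the failover runbook below).
pg_ctl -D "$STANDBY_DATA" start

# 3. On the primary, replication lag can be observed with:
#    SELECT client_addr, state, replay_lag FROM pg_stat_replication;
```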
A DR plan covers more than architecture: it needs clear decision authority, communication procedures, documented step-by-step runbooks, and regular testing. The excerpt below shows a sample database failover runbook.
````markdown
# Database Disaster Recovery Runbook

## 1. Initial Assessment (First 5 minutes)
- [ ] Confirm primary site is unavailable (not just monitoring alert)
- [ ] Assess scope: which systems are affected?
- [ ] Notify DR coordinator and on-call manager
- [ ] Check DR site systems are accessible

## 2. Decision Point: Invoke DR? (Within 15 minutes)
- Decision authority: [VP of Engineering] or designated alternate
- Factors to consider:
  - Estimated primary site recovery time
  - Business impact of continued outage
  - Risk of data loss from failover

## 3. Database Failover (If DR invoked)

### 3.1 PostgreSQL Streaming Standby Promotion
```bash
# Verify standby is caught up (lag should be minimal)
psql -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Promote standby to primary
pg_ctl promote -D /var/lib/postgresql/data

# Verify database is accepting writes
psql -c "CREATE TABLE dr_test (id int); DROP TABLE dr_test;"
```

### 3.2 Update Application Configuration
- [ ] Update DNS/load balancer to point to DR database
- [ ] Restart application servers with new connection strings
- [ ] Verify application connectivity

## 4. Post-Failover Validation
- [ ] Confirm all critical applications operational
- [ ] Verify data integrity spot checks
- [ ] Monitor performance and errors
- [ ] Document timeline and actions taken

## 5. Return to Normal Operations (Later)
- When primary site restored, plan failback
- May require full resynchronization
- Schedule maintenance window for failback
````

A disaster recovery plan that has never been tested is fantasy, not planning. Schedule regular DR drills—at least annually, and after any significant infrastructure changes. During tests, identify gaps, update documentation, and improve. The worst time to discover your DR plan doesn't work is during an actual disaster.
Backup and recovery is the ultimate DBA responsibility—the safety net that protects organizations from data loss. A well-designed backup strategy provides confidence that no matter what goes wrong, data can be recovered.
Key Takeaways:
- RTO and RPO are business decisions; document them, get sign-off, and design the backup strategy to meet them.
- Combine backup types: full backups as a baseline, differential or incremental backups for efficiency, and transaction log backups for point-in-time recovery.
- Keep multiple copies on different media, with at least one offsite.
- A backup is only as good as its last successful restore test; verify integrity and time restores regularly.
- Retention policies balance recovery needs, storage cost, and compliance, and should be documented and approved.
- Disaster recovery requires geographic redundancy, a written runbook, and regular drills.
What's Next:
With backup and recovery ensuring data can be restored, the final core DBA responsibility is Capacity Planning—anticipating future needs and ensuring resources scale appropriately as demands grow.
You now understand comprehensive database backup and recovery, from recovery objectives through backup types, PITR, verification, retention, and disaster recovery planning. These skills ensure that data survives whatever failures occur, protecting both the organization and the users who depend on it.