Every database will eventually experience a failure that threatens data. Hardware fails, software crashes, human operators make mistakes, and disasters strike. The question is not if you will need to recover data, but when—and whether you'll be prepared.
Backup and recovery is the DBA's ultimate responsibility. All other duties—performance tuning, security hardening, monitoring—become meaningless if the data is lost forever. A well-designed backup strategy transforms potential catastrophes into recoverable incidents.
Organizations that have lost critical data without adequate backups have faced bankruptcy, regulatory penalties, and permanent reputation damage. Conversely, organizations with robust backup strategies have recovered from ransomware attacks, data center fires, and catastrophic human errors with minimal lasting impact.
By the end of this page, you will understand comprehensive backup and recovery strategies, including backup types and their trade-offs, recovery objectives (RTO/RPO), backup technologies and tools, point-in-time recovery, backup verification, disaster recovery planning, and real-world recovery scenarios. You'll learn to design backup strategies that balance protection, cost, and operational complexity.
Before designing a backup strategy, you must understand what level of recovery the business requires. Two key metrics define recovery requirements:
Recovery Time Objective (RTO):
RTO answers: How long can we be down?
This is the maximum acceptable time from failure to full recovery. An RTO of 1 hour means the business can tolerate at most one hour of database unavailability.
Recovery Point Objective (RPO):
RPO answers: How much data can we afford to lose?
This is the maximum acceptable data loss measured in time. An RPO of 1 hour means we can tolerate losing up to one hour of transactions.
| RPO/RTO | Typical Systems | Backup Strategy | Cost Level |
|---|---|---|---|
| Minutes | Trading systems, payment processing | Synchronous replication, continuous backup, automated failover | Very High |
| 1 Hour | E-commerce, SaaS applications | Frequent log backups, hot standby, tested recovery | High |
| 4 Hours | Business applications, internal systems | Regular backups with transaction logs, warm standby | Medium |
| 24 Hours | Development, analytics, archives | Daily backups, cold storage, manual recovery | Low |
| Days/Weeks | Historical archives, compliance data | Infrequent backups, tape/cold storage | Minimal |
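RPO in particular is easy to monitor directly: if transaction log backups run every 15 minutes, the worst-case data loss is roughly the age of the newest archived log. Below is a minimal sketch of such a check, assuming a PostgreSQL WAL archive directory and a 15-minute target; the path, threshold, and alerting style are illustrative rather than part of any standard tool.

```bash
#!/bin/bash
# Hypothetical RPO check: warn when the newest archived transaction log is
# older than the RPO target. Path and threshold are illustrative.

WAL_ARCHIVE_DIR="/backup/wal_archive"   # assumed archive location
RPO_TARGET_SECONDS=$((15 * 60))         # example RPO target: 15 minutes

# Newest file in the archive directory
newest=$(ls -t "$WAL_ARCHIVE_DIR" | head -n 1)
if [ -z "$newest" ]; then
    echo "ERROR: no archived logs found in $WAL_ARCHIVE_DIR"
    exit 1
fi

# Modification time of the newest archive (BSD stat first, then GNU stat)
mtime=$(stat -f %m "$WAL_ARCHIVE_DIR/$newest" 2>/dev/null || stat -c %Y "$WAL_ARCHIVE_DIR/$newest")
age=$(( $(date +%s) - mtime ))

if [ "$age" -gt "$RPO_TARGET_SECONDS" ]; then
    echo "WARNING: newest archived log is ${age}s old; RPO target is ${RPO_TARGET_SECONDS}s"
    exit 2
fi
echo "OK: newest archived log is ${age}s old (within RPO target)"
```

Run from cron, a check like this turns the RPO from a documented number into an alert the moment log shipping falls behind.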
The Cost-Protection Trade-off:
More aggressive recovery objectives (lower RTO/RPO) require more sophisticated infrastructure: synchronous replication, hot standbys, frequent log backups, and automated failover all add cost and operational complexity.
The key is matching recovery objectives to actual business requirements. Over-engineering backup systems wastes money; under-engineering risks the business.
RTO and RPO are business decisions, not technical ones. The DBA can explain what's achievable at what cost, but business stakeholders must decide acceptable risk levels. Document these decisions and get sign-off—they protect you when trade-offs become visible during incidents.
Different backup types serve different purposes. A robust strategy typically combines multiple types:
Full Backup:
A complete copy of the entire database. Every data file, every table, every row is captured.
Incremental Backup:
Captures only data changed since the last backup (full or incremental).
Differential Backup:
Captures all data changed since the last full backup.
Transaction Log Backup:
Captures transaction log records since the last log backup. Essential for point-in-time recovery.
| Aspect | Detail |
|---|---|
| Purpose | Enable point-in-time recovery (PITR) to any moment |
| Frequency | Every 5-15 minutes for production databases |
| RPO Impact | RPO equals the log backup interval (at most, changes since the last log backup are lost) |
| Chain Requirement | Log chain must be unbroken from full backup forward |
| Storage | Relatively small; essential to archive and protect |
```sql
-- PostgreSQL: Backup Strategy Components

-- 1. FULL BACKUP with pg_basebackup (physical)
-- Run weekly (e.g., Sunday night)
-- pg_basebackup -h localhost -D /backup/full/$(date +%Y%m%d) -U backup_user -P -Ft -z

-- 2. CONTINUOUS WAL ARCHIVING (transaction logs)
-- In postgresql.conf:
archive_mode = on
archive_command = 'cp %p /backup/wal_archive/%f'

-- Or using pgBackRest for more sophisticated management:
-- pgbackrest --stanza=main --type=full backup
-- pgbackrest --stanza=main --type=incr backup

-- 3. SQL Server example:
-- Full backup (weekly)
BACKUP DATABASE [Production] TO DISK = 'E:\Backup\Production_Full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;

-- Differential backup (daily)
BACKUP DATABASE [Production] TO DISK = 'E:\Backup\Production_Diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;

-- Transaction log backup (every 15 minutes)
BACKUP LOG [Production] TO DISK = 'E:\Backup\Production_Log.trn'
WITH COMPRESSION, CHECKSUM;
```

Keep at least 3 copies of critical data, on 2 different types of media, with 1 copy offsite. This protects against single failures (corrupted backup), media failures (disk dies), and site disasters (fire, flood).
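This "3 copies, 2 media types, 1 offsite" guidance is commonly called the 3-2-1 rule. As a rough illustration, the sketch below copies the newest local backup to a NAS mount (second media type) and to object storage (offsite). The staging directory, NAS mount point, and S3 bucket name are hypothetical, and it assumes the AWS CLI is installed and configured.

```bash
#!/bin/bash
# Hypothetical 3-2-1 copy step: local disk holds copy 1, a NAS mount holds
# copy 2 (second media type), object storage holds copy 3 (offsite).
# All paths and the bucket name are illustrative.
set -e

LOCAL_BACKUP_DIR="/backup/full"            # assumed local staging area
NAS_BACKUP_DIR="/mnt/nas/db_backups"       # assumed NAS mount
S3_BUCKET="s3://example-db-backups/full"   # assumed offsite bucket

# Pick the newest backup in the local staging area
latest=$(ls -t "$LOCAL_BACKUP_DIR" | head -n 1)
if [ -z "$latest" ]; then
    echo "ERROR: no local backups found in $LOCAL_BACKUP_DIR"
    exit 1
fi

# Copy 2: second media type
cp -r "$LOCAL_BACKUP_DIR/$latest" "$NAS_BACKUP_DIR/"

# Copy 3: offsite object storage (requires configured AWS CLI credentials)
aws s3 cp "$LOCAL_BACKUP_DIR/$latest" "$S3_BUCKET/$latest/" --recursive

echo "3-2-1 copies complete for $latest"
```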
Backups can be categorized as physical (copying raw data files) or logical (exporting data in a portable format). Each approach has distinct advantages:
```bash
#!/bin/bash
# Backup Method Examples

# ============================================
# PHYSICAL BACKUP - PostgreSQL with pg_basebackup
# ============================================
pg_basebackup \
  --host=db-primary \
  --username=backup_user \
  --pgdata=/backup/base/$(date +%Y%m%d_%H%M%S) \
  --format=tar \
  --gzip \
  --checkpoint=fast \
  --wal-method=stream \
  --progress

# ============================================
# LOGICAL BACKUP - PostgreSQL with pg_dump
# ============================================
# Full database export
pg_dump \
  --host=db-primary \
  --username=backup_user \
  --format=custom \
  --file=/backup/logical/production_$(date +%Y%m%d).dump \
  production_db

# Specific tables only
pg_dump \
  --host=db-primary \
  --table=customers \
  --table=orders \
  --format=custom \
  --file=/backup/logical/critical_tables.dump \
  production_db

# ============================================
# LOGICAL BACKUP - MySQL with mysqldump
# ============================================
mysqldump \
  --host=db-primary \
  --user=backup_user \
  --password \
  --single-transaction \
  --routines \
  --triggers \
  --quick \
  production_db > /backup/logical/production_$(date +%Y%m%d).sql

# ============================================
# PHYSICAL BACKUP - MySQL with Percona XtraBackup
# ============================================
xtrabackup \
  --backup \
  --target-dir=/backup/base/$(date +%Y%m%d) \
  --user=backup_user \
  --password \
  --parallel=4 \
  --compress
```

Best practice combines both approaches: physical backups for fast recovery of the entire database, logical backups for portability, selective restoration, and migration. Logical backups also serve as a cross-check—if both backup types restore successfully, confidence is high.
Point-in-Time Recovery allows restoring a database to any specific moment—not just to backup time, but to any point covered by transaction logs. PITR is essential for recovering from logical errors like accidental deletions or application bugs that corrupt data.
How PITR works: restore the most recent full backup taken before the target time, then replay archived transaction logs up to the chosen moment and stop.
```sql
-- PostgreSQL PITR Recovery Configuration

-- 1. Stop PostgreSQL
-- sudo systemctl stop postgresql

-- 2. Clear data directory (or use new location)
-- rm -rf /var/lib/postgresql/data/*

-- 3. Restore base backup
-- tar -xzf /backup/base/20240115.tar.gz -C /var/lib/postgresql/data/

-- 4. Configure recovery settings
-- postgresql.conf or recovery.conf (version dependent):
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2024-01-15 14:59:00'
recovery_target_action = 'promote'  -- or 'pause' to verify before promoting

-- 5. Create recovery signal file
-- touch /var/lib/postgresql/data/recovery.signal

-- 6. Start PostgreSQL - recovery begins automatically
-- sudo systemctl start postgresql

-- 7. Monitor recovery progress in logs
-- tail -f /var/log/postgresql/postgresql-15-main.log

-- ============================================
-- SQL Server PITR Recovery
-- ============================================
-- Step 1: Restore full backup with NORECOVERY
RESTORE DATABASE [Production]
FROM DISK = 'E:\Backup\Production_Full.bak'
WITH NORECOVERY, REPLACE;

-- Step 2: Restore differential (if applicable)
RESTORE DATABASE [Production]
FROM DISK = 'E:\Backup\Production_Diff.bak'
WITH NORECOVERY;

-- Step 3: Restore logs until target time
RESTORE LOG [Production]
FROM DISK = 'E:\Backup\Production_Log_1.trn'
WITH NORECOVERY;

RESTORE LOG [Production]
FROM DISK = 'E:\Backup\Production_Log_2.trn'
WITH STOPAT = '2024-01-15T14:59:00', RECOVERY;
```

PITR is only possible if transaction logs are preserved. A broken log chain (missing or corrupted log files) limits recovery to the last available point, so archive logs immediately, verify copies, and monitor for gaps.
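Because PITR depends entirely on that unbroken chain, it pays to watch archiving health continuously rather than discovering a gap during recovery. The sketch below queries PostgreSQL's pg_stat_archiver view for archiving failures; the connection details and the choice to connect to the postgres database are assumptions.

```bash
#!/bin/bash
# Hypothetical WAL-archiving health check using the pg_stat_archiver view.
# Connection details (local socket, postgres database) are illustrative.

FAILED=$(psql -d postgres -At -c "SELECT failed_count FROM pg_stat_archiver;")
LAST_FAILED=$(psql -d postgres -At -c "SELECT coalesce(last_failed_wal, 'none') FROM pg_stat_archiver;")

if [ "$FAILED" -gt 0 ]; then
    echo "WARNING: $FAILED WAL archive failures recorded (last failed segment: $LAST_FAILED)"
    echo "A failing archive_command eventually breaks the log chain needed for PITR."
    exit 1
fi

echo "OK: no WAL archive failures recorded"
```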
A backup that hasn't been tested is not a backup—it's a hope. Many organizations have discovered during actual emergencies that their backups were corrupted, incomplete, or unrecoverable. Regular verification and testing are essential.
Verification Levels:
| Level | What It Verifies | How Often | Effort |
|---|---|---|---|
| Completion Check | Backup job finished without errors | Every backup | Automated |
| Checksum Validation | Backup file integrity (no corruption) | Every backup | Automated |
| Restore Test (Automated) | Backup can be restored to a test environment | Weekly | Automated |
| Data Validation | Restored data is correct and complete | Monthly | Semi-automated |
| Full DR Test | Complete recovery to alternate site | Quarterly/Annually | Manual event |
```bash
#!/bin/bash
# Automated Backup Verification Script

set -e  # Exit on error

BACKUP_FILE="/backup/base/production_$(date +%Y%m%d).tar.gz"
RESTORE_DIR="/tmp/backup_test"
VERIFY_LOG="/var/log/backup_verification.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$VERIFY_LOG"
}

# Step 1: Verify backup file exists and has reasonable size
if [ ! -f "$BACKUP_FILE" ]; then
    log "ERROR: Backup file not found: $BACKUP_FILE"
    exit 1
fi

SIZE=$(stat -f%z "$BACKUP_FILE" 2>/dev/null || stat -c%s "$BACKUP_FILE")
if [ "$SIZE" -lt 1000000 ]; then
    log "ERROR: Backup file suspiciously small: $SIZE bytes"
    exit 1
fi
log "Backup file size OK: $SIZE bytes"

# Step 2: Verify checksum/integrity
if ! gzip -t "$BACKUP_FILE" 2>/dev/null; then
    log "ERROR: Backup file failed gzip integrity check"
    exit 1
fi
log "Gzip integrity check passed"

# Step 3: Test extraction
rm -rf "$RESTORE_DIR"
mkdir -p "$RESTORE_DIR"
if ! tar -xzf "$BACKUP_FILE" -C "$RESTORE_DIR"; then
    log "ERROR: Failed to extract backup"
    exit 1
fi
log "Extraction successful"

# Step 4: Attempt database restore to test instance on an alternate port
pg_ctl -D "$RESTORE_DIR" -o "-p 5433" -w start

# Step 5: Verify data integrity
TABLES=$(psql -p 5433 -d production_db -At -c "SELECT count(*) FROM information_schema.tables WHERE table_schema='public'")
if [ "$TABLES" -lt 10 ]; then
    log "WARNING: Fewer tables than expected: $TABLES"
fi

CUSTOMERS=$(psql -p 5433 -d production_db -At -c "SELECT count(*) FROM customers")
log "Restored database has $CUSTOMERS customers"

# Step 6: Cleanup
pg_ctl -D "$RESTORE_DIR" stop
rm -rf "$RESTORE_DIR"

log "=== Backup verification completed successfully ==="
```

During recovery tests, measure how long restoration takes. This validates your RTO assumptions. If your RTO is 1 hour but restoration takes 3 hours, you have a planning gap that must be addressed before a real emergency.
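A lightweight way to capture that timing is to wrap the restore step in a timer and compare the result against the documented RTO. The sketch below is illustrative; the RTO value, backup path, and restore procedure are placeholders for whatever your environment actually uses.

```bash
#!/bin/bash
# Hypothetical restore-timing wrapper: measures how long a test restore takes
# and flags it when the duration exceeds the documented RTO.

RTO_SECONDS=$((60 * 60))   # example RTO: 1 hour

# Placeholder restore procedure: substitute the real steps used in your
# environment (extract base backup, replay logs, start test instance, ...).
run_test_restore() {
    mkdir -p /tmp/restore_timing_test
    tar -xzf /backup/base/latest.tar.gz -C /tmp/restore_timing_test
}

start=$(date +%s)
run_test_restore
end=$(date +%s)
duration=$((end - start))

echo "Test restore took ${duration}s"
if [ "$duration" -gt "$RTO_SECONDS" ]; then
    echo "WARNING: restore duration exceeds RTO of ${RTO_SECONDS}s - revisit the recovery plan"
fi
```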
Backing up databases requires appropriate infrastructure for storage, scheduling, and management.
Storage Options:
| Storage Type | Pros | Cons | Use Case |
|---|---|---|---|
| Local Disk | Fast, simple | Single point of failure, limited capacity | Staging before offsite copy |
| Network Storage (NAS/SAN) | Centralized, redundant | Network dependent, shared bottleneck | Medium-term retention |
| Object Storage (S3, Azure Blob) | Unlimited, durable, geographic options | Latency, egress costs | Long-term, offsite, disaster recovery |
| Tape | Very low cost per GB, offline protection | Slow, manual handling | Archive, air-gapped security |
| Dedicated Backup Appliances | Optimized, deduplication, integration | Cost, vendor lock-in | Enterprise environments |
Backup Management Tools:
Dedicated backup tools provide scheduling, retention management, verification, and reporting:
```ini
# pgBackRest Configuration Example
# /etc/pgbackrest/pgbackrest.conf

[global]
# Repository configuration
# Keep 4 full backups and 2 differentials per full; encrypt backups
repo1-path=/backup/pgbackrest
repo1-retention-full=4
repo1-retention-diff=2
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=SecureBackupKey

# Second repository for offsite (S3); keep 12 full backups in S3
repo2-type=s3
repo2-s3-bucket=mycompany-db-backups
repo2-s3-endpoint=s3.amazonaws.com
repo2-s3-region=us-east-1
repo2-path=/production
repo2-retention-full=12
repo2-s3-key=AKIAIOSFODNN7EXAMPLE
repo2-s3-key-secret=SECRET_KEY_HERE

# Compression and performance (4 parallel processes)
compress-type=zst
compress-level=3
process-max=4

# Logging
log-level-console=info
log-level-file=detail

[main]
# PostgreSQL cluster configuration
pg1-path=/var/lib/postgresql/15/main
pg1-port=5432
pg1-user=postgres
```

Backup traffic can saturate production networks if not managed. Consider dedicated backup networks, bandwidth throttling during peak hours, or off-hours scheduling. Also ensure backup systems themselves are secured—a compromised backup server provides access to all your data.
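Scheduling is typically handled outside pgBackRest, for example from cron. The entries below are an illustrative schedule for the postgres user's crontab, reusing the stanza name from the configuration above; the exact times are assumptions.

```bash
# Illustrative crontab entries (postgres user) for pgBackRest scheduling

# Weekly full backup, Sunday 01:00
0 1 * * 0    pgbackrest --stanza=main --type=full backup

# Daily differential backup, Monday-Saturday 01:00
0 1 * * 1-6  pgbackrest --stanza=main --type=diff backup

# Hourly incremental backups during business hours
0 8-20 * * * pgbackrest --stanza=main --type=incr backup
```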
Backup retention policies balance recovery needs, storage costs, and regulatory requirements.
Typical Retention Periods by Backup Type:
| Backup Type | Short-Term | Medium-Term | Long-Term/Archive |
|---|---|---|---|
| Transaction Logs | 7-14 days | 30 days | Typically not archived |
| Daily Backups | 14-30 days | 90 days | Selected months kept yearly |
| Weekly Full | 4-8 weeks | 12 months | Year-end kept 7+ years |
| Monthly Full | 12 months | 3-5 years | As required by regulation |
Grandfather-Father-Son (GFS) Rotation:
A classic retention scheme that manages backup expiration in three tiers: daily backups (sons) are kept for the shortest period, weekly backups (fathers) longer, and monthly backups (grandfathers) the longest.
After 12 months, monthly backups can rotate out or be retained for yearly archives.
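One simple way to implement GFS is to route each new backup into a daily, weekly, or monthly directory when it is created, then let a retention job expire each tier on its own schedule, as in the script that follows. The sketch below is illustrative; the staging path and routing rules are assumptions, while the tier directories match the retention script.

```bash
#!/bin/bash
# Hypothetical GFS routing: place today's backup into the monthly (grandfather),
# weekly (father), or daily (son) tier based on the date.
# The staging path is illustrative; tier directories match the retention script.

BACKUP_FILE="/backup/staging/production_$(date +%Y%m%d).tar.gz"
BACKUP_ROOT="/backup/postgresql"

day_of_month=$(date +%d)
day_of_week=$(date +%u)   # 1 = Monday ... 7 = Sunday

if [ "$day_of_month" = "01" ]; then
    tier="monthly"   # grandfather: first backup of the month
elif [ "$day_of_week" = "7" ]; then
    tier="weekly"    # father: Sunday backup
else
    tier="daily"     # son: every other day
fi

mkdir -p "$BACKUP_ROOT/$tier"
mv "$BACKUP_FILE" "$BACKUP_ROOT/$tier/"
echo "Backup stored in $tier tier: $BACKUP_ROOT/$tier/$(basename "$BACKUP_FILE")"
```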
```bash
#!/bin/bash
# Backup Retention Management Script

BACKUP_DIR="/backup/postgresql"
LOG_FILE="/var/log/backup_retention.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Remove daily backups older than 14 days
log "Cleaning daily backups older than 14 days..."
find "$BACKUP_DIR/daily" -name "*.tar.gz" -mtime +14 -delete

# Remove weekly backups older than 8 weeks
log "Cleaning weekly backups older than 8 weeks..."
find "$BACKUP_DIR/weekly" -name "*.tar.gz" -mtime +56 -delete

# Remove monthly backups older than 1 year
log "Cleaning monthly backups older than 1 year..."
find "$BACKUP_DIR/monthly" -name "*.tar.gz" -mtime +365 -delete

# Remove transaction logs older than 7 days
log "Cleaning transaction logs older than 7 days..."
find "$BACKUP_DIR/wal" -name "*.xz" -mtime +7 -delete

# Report remaining storage
USED=$(du -sh "$BACKUP_DIR" | cut -f1)
log "Retention cleanup complete. Total backup storage: $USED"

# Alert if storage exceeds threshold
USED_BYTES=$(du -sb "$BACKUP_DIR" | cut -f1)
THRESHOLD=$((500 * 1024 * 1024 * 1024))  # 500GB
if [ "$USED_BYTES" -gt "$THRESHOLD" ]; then
    log "WARNING: Backup storage exceeds 500GB threshold"
fi
```

Retention policies should be documented and approved by stakeholders including legal, compliance, and business units. During audits or legal proceedings, you'll need to demonstrate that retention practices follow documented policy. Changes to retention should follow change management processes.
Disaster recovery (DR) addresses scenarios where the entire primary site becomes unavailable—natural disasters, infrastructure failures, or widespread outages. DR planning ensures business continuity through geographic redundancy and tested procedures.
DR Architecture Options:
| Tier | Architecture | RTO | RPO | Cost |
|---|---|---|---|---|
| Cold Site | Backup restoration to rented/cloud infrastructure | Days | Hours-Days | Low |
| Warm Standby | Periodically synchronized replica at DR site | Hours | Minutes-Hours | Medium |
| Hot Standby | Real-time replicated standby ready for failover | Minutes | Seconds | High |
| Active-Active | Both sites actively serving traffic, data synchronized | Seconds (automatic) | Near-zero | Very High |
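For the warm and hot standby tiers, PostgreSQL streaming replication is one common building block. The sketch below shows a minimal standby bootstrap under stated assumptions: the host name, replication user, and data directory are illustrative, and a real setup also needs a replication role and a pg_hba.conf entry on the primary.

```bash
#!/bin/bash
# Hypothetical standby bootstrap for PostgreSQL streaming replication.
# Host, user, and data directory are illustrative.

STANDBY_DATA=/var/lib/postgresql/15/standby

# 1. Clone the primary. --write-recovery-conf records primary_conninfo and
#    creates standby.signal so the clone starts as a streaming standby.
pg_basebackup \
  --host=db-primary \
  --username=replicator \
  --pgdata="$STANDBY_DATA" \
  --wal-method=stream \
  --write-recovery-conf \
  --checkpoint=fast \
  --progress

# 2. Start the standby; it stays read-only and streams changes from the
#    primary until it is promoted (see the failover runbook below).
pg_ctl -D "$STANDBY_DATA" start

# 3. On the primary, replication lag can be observed with:
#    SELECT client_addr, state, replay_lag FROM pg_stat_replication;
```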
A DR plan covers more than architecture: it needs clear decision authority, communication procedures, documented step-by-step runbooks, and regular testing. The excerpt below shows a sample database failover runbook.
````markdown
# Database Disaster Recovery Runbook

## 1. Initial Assessment (First 5 minutes)
- [ ] Confirm primary site is unavailable (not just monitoring alert)
- [ ] Assess scope: which systems are affected?
- [ ] Notify DR coordinator and on-call manager
- [ ] Check DR site systems are accessible

## 2. Decision Point: Invoke DR? (Within 15 minutes)
- Decision authority: [VP of Engineering] or designated alternate
- Factors to consider:
  - Estimated primary site recovery time
  - Business impact of continued outage
  - Risk of data loss from failover

## 3. Database Failover (If DR invoked)

### 3.1 PostgreSQL Streaming Standby Promotion
```bash
# Verify standby is caught up (lag should be minimal)
psql -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Promote standby to primary
pg_ctl promote -D /var/lib/postgresql/data

# Verify database is accepting writes
psql -c "CREATE TABLE dr_test (id int); DROP TABLE dr_test;"
```

### 3.2 Update Application Configuration
- [ ] Update DNS/load balancer to point to DR database
- [ ] Restart application servers with new connection strings
- [ ] Verify application connectivity

## 4. Post-Failover Validation
- [ ] Confirm all critical applications operational
- [ ] Verify data integrity spot checks
- [ ] Monitor performance and errors
- [ ] Document timeline and actions taken

## 5. Return to Normal Operations (Later)
- When primary site restored, plan failback
- May require full resynchronization
- Schedule maintenance window for failback
````

A disaster recovery plan that has never been tested is fantasy, not planning. Schedule regular DR drills—at least annually, and after any significant infrastructure changes. During tests, identify gaps, update documentation, and improve. The worst time to discover your DR plan doesn't work is during an actual disaster.
Backup and recovery is the ultimate DBA responsibility—the safety net that protects organizations from data loss. A well-designed backup strategy provides confidence that no matter what goes wrong, data can be recovered.
Key Takeaways:
- RTO and RPO are business decisions; document them, get sign-off, and design the backup strategy to meet them.
- Combine backup types: full backups as a baseline, differential or incremental backups for efficiency, and transaction log backups for point-in-time recovery.
- Keep multiple copies on different media, with at least one offsite.
- A backup is only as good as its last successful restore test; verify integrity and time restores regularly.
- Retention policies balance recovery needs, storage cost, and compliance, and should be documented and approved.
- Disaster recovery requires geographic redundancy, a written runbook, and regular drills.
What's Next:
With backup and recovery ensuring data can be restored, the final core DBA responsibility is Capacity Planning—anticipating future needs and ensuring resources scale appropriately as demands grow.
You now understand comprehensive database backup and recovery, from recovery objectives through backup types, PITR, verification, retention, and disaster recovery planning. These skills ensure that data survives whatever failures occur, protecting both the organization and the users who depend on it.