RAID protects against disk failures. Mirroring protects against server failures. But what happens when the entire data center is destroyed?
Fires, floods, earthquakes, power grid failures, and other catastrophic events can render an entire site inoperable. All the local redundancy in the world won't help if every copy of your data is in the same building when disaster strikes.
Remote backup addresses this ultimate failure scenario by maintaining copies of data at geographically distant locations—far enough that a single event cannot affect both sites. This is the final layer of defense in achieving true stable storage.
By the end of this page, you will understand the principles of geographic data protection, the main approaches to remote backup (synchronous replication, asynchronous replication, and periodic backups), how to design a disaster recovery architecture, and the trade-offs between protection, performance, and cost.
Local redundancy—RAID, mirroring, multiple servers—significantly reduces the probability of data loss. But these measures share a common weakness: they exist within a single failure domain.
Failure Domains:
A failure domain is a scope within which failures are correlated: components that share a failure domain can fail together. For example, every server in a rack shares the same power feed and top-of-rack switch, and every rack in a building shares the same cooling, utility power, and exposure to fire or flood.
Effective disaster recovery requires placing data copies in independent failure domains.
Major cloud providers have experienced region-wide outages: AWS US-East-1 in 2017 affected thousands of websites; Google Cloud Australia in 2020 impacted major banks; Azure South Central US in 2018 lasted nearly 24 hours. Companies with data only in the affected region were completely down. Companies with multi-region architectures continued operating.
The Geographic Cost-Protection Tradeoff:
Distance between data copies involves a fundamental trade-off:
| Distance | Protection | Latency Impact | Cost |
|---|---|---|---|
| Same rack | Disk failure | Minimal | Low |
| Same building | Server/rack failure | Low | Moderate |
| Same city (10-50km) | Building disaster | 1-5ms | Moderate |
| Same region (100-500km) | City-wide disaster | 5-20ms | High |
| Cross-continent (1000+ km) | Regional disaster | 50-150ms | Very High |
The additional latency comes from the speed of light—approximately 200,000 km/s through fiber optic cables. At 1000km, round-trip time is at least 10ms, plus switching and processing overhead.
Synchronous remote replication extends the synchronous mirroring concept across geographic distances. Every transaction waits for confirmation from the remote site before committing, guaranteeing zero data loss (RPO=0) even in a complete site failure.
The Latency Challenge:
Synchronous replication across distances is constrained by physics. For a 100km separation:
Fiber optic speed: ~200,000 km/s
Round-trip distance: 200 km
Minimum latency: 200 km ÷ 200,000 km/s = 1 ms
Add network equipment latency: ~0.5-2 ms
Add storage write latency: ~1-5 ms
Total: ~2.5-8 ms per transaction commit
For 500 km, the total is roughly 5-15 ms; for 1000 km, roughly 10-25 ms.
These latencies apply to every commit. A single session committing serially at 10 ms per commit can sustain at most about 100 commits per second; a workload that needs 1000 commits per second must lean on many concurrent sessions and group commit, or commits simply queue up and the rate cannot be sustained.
Most synchronous replication deployments limit distance to 50-200km—far enough to survive building or campus-level disasters, but not so far that commit latency becomes unacceptable. Beyond 200-300km, asynchronous replication is typically necessary.
Implementation Examples:
Oracle Data Guard Maximum Protection Mode:
```sql
-- Oracle Data Guard: Maximum Protection (Synchronous)

-- Primary database configuration
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE PROTECTION;

-- This mode guarantees:
-- 1. No committed transaction can be lost
-- 2. Primary will NOT continue if standby is unreachable
-- 3. Zero data loss in any failure scenario

-- Trade-off: If standby fails, primary stops accepting transactions
-- Use for mission-critical systems where data loss is unacceptable

-- Check protection mode
SELECT protection_mode, protection_level FROM v$database;

-- Monitor sync status
SELECT dest_name, status, error, synchronization_status
FROM v$archive_dest_status
WHERE dest_name = 'LOG_ARCHIVE_DEST_2';

-- For high availability WITH zero data loss, use Maximum Availability mode
-- (falls back to async if standby becomes unreachable)
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;
```
SQL Server Always On Synchronous Commit:
```sql
-- SQL Server Always On: Synchronous Remote Replica

-- Create availability group with sync replica
CREATE AVAILABILITY GROUP MyAG
FOR DATABASE MyDB
REPLICA ON
    'Primary' WITH (
        ENDPOINT_URL = 'TCP://primary:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC
    ),
    'Secondary' WITH (
        ENDPOINT_URL = 'TCP://secondary:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC
    );

-- Verify synchronization state
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ars.synchronization_health_desc,
       ars.connected_state_desc,
       drs.synchronization_state_desc,
       drs.is_commit_participant
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_database_replica_states drs ON ars.replica_id = drs.replica_id;
```
Synchronous replication characteristics:
| Aspect | Characteristic |
|---|---|
| Data Loss (RPO) | Zero - no committed data is ever lost |
| Latency Impact | Significant - limited by distance |
| Practical Distance | 50-200 km typically |
| Availability Impact | May stop if remote is unreachable |
| Use Case | Financial, healthcare, regulatory compliance |
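PostgreSQL offers the same trade-off through synchronous standbys. A minimal sketch of the primary-side settings, assuming a standby registered under the hypothetical name 'dr1':

```
# postgresql.conf on the primary: wait for the named synchronous standby on every commit
synchronous_standby_names = 'dr1'   # 'dr1' is a hypothetical standby name
synchronous_commit = remote_apply   # remote_write or on give weaker but faster guarantees
```

If the synchronous standby becomes unreachable, commits on the primary block until it returns, which is the same availability trade-off as Oracle's Maximum Protection mode above.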
Asynchronous remote replication removes the latency constraint by acknowledging transactions locally, then shipping changes to the remote site in the background. This enables replication across any distance but accepts potential data loss.
How It Works:
Primary Site Remote Site
│ │
│── Transaction commits locally ───▶ │
│ │
│── Send log records (background) ───▶│
│ │
│ (Lag: seconds to minutes) │
│ │
│── Primary fails here ───────────────│
│ │
│ Data in flight is LOST │
Replication Lag:
The gap between what's committed on the primary and what's received at the remote site is called replication lag. Under normal conditions, lag is typically seconds. Under high load or network issues, it can grow to minutes or even hours.
```
-- PostgreSQL: Async replication to remote site

# Primary server: postgresql.conf
wal_level = replica
max_wal_senders = 10
synchronous_commit = local     # Don't wait for remote
wal_keep_size = 10GB           # Keep more WAL for lag tolerance

# Archive WAL to remote storage (belt and suspenders)
archive_mode = on
archive_command = 'aws s3 cp %p s3://dr-bucket/wal/%f'

-- Monitor replication lag on primary
SELECT client_addr, state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
       replay_lag AS replay_lag_time
FROM pg_stat_replication;

-- Alert if lag exceeds threshold
SELECT CASE
         WHEN max(replay_lag) > interval '30 seconds' THEN 'CRITICAL: Replication lag > 30s'
         WHEN max(replay_lag) > interval '10 seconds' THEN 'WARNING: Replication lag > 10s'
         ELSE 'OK'
       END AS status
FROM pg_stat_replication;
```
Your RPO with async replication equals your replication lag at the time of failure. If average lag is 5 seconds, your average data loss in a disaster is 5 seconds of transactions. Monitor lag continuously and alert when it exceeds your RPO target. Many organizations accept RPO of 1-60 seconds for async cross-region replication.
Compression and Bandwidth Optimization:
Cross-region replication consumes significant network bandwidth. Common optimizations include compressing the replication stream or archived log segments, shipping block-level or incremental deltas rather than full copies, batching many small changes into larger transfers, and scheduling bulk transfers for off-peak hours.
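For example, on PostgreSQL two low-effort options are enabling wal_compression (which compresses full-page writes inside the WAL itself) and compressing archived WAL segments before they cross the WAN. A sketch, reusing the hypothetical dr-bucket from the configuration above:

```
# postgresql.conf: shrink the WAL stream at the source
wal_compression = on

# Compress each archived segment before shipping it to the remote bucket
archive_command = 'gzip -c %p | aws s3 cp - s3://dr-bucket/wal/%f.gz'
```

The corresponding restore_command at the DR site would then need to decompress each segment before replay.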
Asynchronous replication characteristics:
| Aspect | Characteristic |
|---|---|
| Data Loss (RPO) | Typically seconds, potentially minutes |
| Latency Impact | None on primary transactions |
| Practical Distance | Unlimited (cross-continent, cross-globe) |
| Availability Impact | None - primary continues if remote fails |
| Use Case | Cross-region DR, read scaling, compliance |
Not all systems require continuous replication. Periodic backup captures point-in-time snapshots and transfers them to remote storage. This is simpler and cheaper than continuous replication but provides coarser recovery granularity.
Backup Types and Remote Storage:
Full Backup
Complete copy of the database. Large but self-contained—can restore from just this backup.
Incremental Backup
Only changes since last backup (full or incremental). Smaller, but requires the full backup plus all incrementals to restore.
Differential Backup
All changes since last full backup. Larger than incremental but requires only full + one differential to restore.
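These three backup types map directly onto tooling options. For instance, pgBackRest (also used in the script below, with the hypothetical stanza name 'main') lets you choose the type per run:

```bash
pgbackrest --stanza=main --type=full backup   # complete, self-contained copy
pgbackrest --stanza=main --type=diff backup   # everything changed since the last full backup
pgbackrest --stanza=main --type=incr backup   # everything changed since the last backup of any type
```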
```bash
#!/bin/bash
# PostgreSQL: Backup to remote cloud storage

# Variables
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backup"
S3_BUCKET="s3://database-backups/prod"

# Option 1: pg_dump to S3 (logical backup)
pg_dump -Fc mydb | gzip | aws s3 cp - ${S3_BUCKET}/logical/${DATE}.dump.gz

# Option 2: pg_basebackup to S3 (physical backup)
pg_basebackup -D - -Ft -Xs -Pv | gzip | aws s3 cp - ${S3_BUCKET}/physical/${DATE}.tar.gz

# Option 3: pgBackRest (incremental, parallel, encrypted)
pgbackrest --stanza=main --type=incr backup
pgbackrest --stanza=main archive-push   # Ship WAL continuously

# Option 4: File-level snapshot + ship
# (LVM snapshot, ZFS snapshot, etc.)
lvcreate --snapshot --name pg_snap --size 10G /dev/vg/pg_data
tar czf - /mnt/pg_snap | aws s3 cp - ${S3_BUCKET}/snapshot/${DATE}.tar.gz
lvremove -f /dev/vg/pg_snap

# Cleanup old backups (retain 30 days)
aws s3 ls ${S3_BUCKET}/ | while read -r line; do
    createDate=$(echo $line | awk '{print $1" "$2}')
    createDateSec=$(date -d "$createDate" +%s)
    olderThanSec=$(date -d "30 days ago" +%s)
    if [[ $createDateSec -lt $olderThanSec ]]; then
        fileName=$(echo $line | awk '{print $4}')
        aws s3 rm ${S3_BUCKET}/$fileName
    fi
done

echo "Backup completed: ${DATE}"
```
Cloud Storage Options for Remote Backup:
| Storage Class | Access Time | Use Case | Relative Cost |
|---|---|---|---|
| Standard (S3, GCS, Azure Blob) | Milliseconds | Frequent restores | 100% |
| Infrequent Access | Milliseconds | Monthly restores | 50-60% |
| Glacier / Archive | Hours | Rare disaster recovery | 10-20% |
| Glacier Deep Archive | 12+ hours | Compliance retention | 5% |
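Note that archive tiers are not directly readable: an archived object must first be restored to an accessible tier, which is what drives the hours-long access times above. A sketch using hypothetical bucket and key names:

```bash
# Request a temporary (7-day) restore of an archived backup object
aws s3api restore-object \
    --bucket database-backups \
    --key prod/monthly/20240101.dump.gz \
    --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'

# Check restore progress; download normally once the restore completes
aws s3api head-object --bucket database-backups --key prod/monthly/20240101.dump.gz
```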
The 3-2-1 backup rule recommends: 3 copies of your data, on 2 different media types, with 1 copy offsite. For databases: primary database, local backup (different storage), remote backup (cloud or DR site). This covers hardware failure, site disaster, and accidental deletion.
Combining Continuous Replication with Periodic Backup:
Many organizations use both: continuous replication to a warm standby for fast failover with minimal data loss, plus periodic backups shipped to independent remote storage for point-in-time recovery.
Continuous replication protects against operational failures. Backups protect against data corruption, ransomware, and accidental deletion (since corrupt/deleted data replicates too).
A complete disaster recovery (DR) architecture combines multiple technologies into a cohesive system that can restore operations after any failure. Let's examine common DR architecture patterns.
Active-Passive (Warm Standby):
┌─────────────────────────────────────────────────────────────┐
│ Primary Site │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │───▶│ Database │───▶│ Storage │ │
│ │ Servers │ │ (Active) │ │ (RAID 10) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │
└──────────────────────────────│───────────────────────────────┘
│ Async/Sync Replication
▼
┌──────────────────────────────────────────────────────────────┐
│ DR Site (Standby) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │ │ Database │ │ Storage │ │
│ │ (Standby) │ │ (Replica) │ │ (RAID 10) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ (Start on failover) │
└──────────────────────────────────────────────────────────────┘
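Failover in this pattern means promoting the DR replica to primary and redirecting clients. A minimal sketch for a PostgreSQL standby (the data directory path is an assumption):

```bash
# Promote the DR-site standby to become the new primary
pg_ctl promote -D /var/lib/postgresql/data
# Equivalent from SQL on recent PostgreSQL versions: psql -c "SELECT pg_promote();"

# Then start the standby application tier and repoint clients
# (DNS failover, load balancer change, or updated connection strings)
```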
Active-Active (Multi-Region):
┌──────────────────────────────────────────────────────────────┐
│ Region A (US-East) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │◀──▶│ Database │◀──┐ │
│ │ Servers │ │ (Read/Write)│ │ │
│ └──────────────┘ └──────────────┘ │ │
│ │ Bi-directional │
└──────────────────────────────────────────│ Replication │
│ │
┌──────────────────────────────────────────│───────────────────┐
│ Region B (US-West) │ │
│ ┌──────────────┐ ┌──────────────┐ │ │
│ │ Application │◀──▶│ Database │◀──┘ │
│ │ Servers │ │ (Read/Write)│ │
│ └──────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
True active-active with writes to both regions is extremely complex. Conflicts can occur when the same data is modified in both regions simultaneously. Solutions include: last-write-wins (data loss), application-level conflict resolution, restricting what each region can write, or using distributed databases designed for this (CockroachDB, Spanner).
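As an illustration of how blunt last-write-wins is, here is the kind of timestamp-guarded upsert a replication apply process might run when a change arrives from the other region (table, columns, and values are hypothetical):

```sql
-- Apply an incoming replicated row only if it is newer than the local version
INSERT INTO accounts (id, balance, updated_at)
VALUES (42, 100.00, TIMESTAMP '2024-01-01 12:00:05')
ON CONFLICT (id) DO UPDATE
   SET balance    = EXCLUDED.balance,
       updated_at = EXCLUDED.updated_at
 WHERE accounts.updated_at < EXCLUDED.updated_at;   -- the older concurrent write is silently dropped
```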
An untested disaster recovery plan is not a plan—it's a hope. Remote backup and DR systems are only valuable if they actually work when needed. Regular testing is essential.
Types of DR Testing:
1. Backup Validation
Verify that backups are complete and restorable:
```bash
#!/bin/bash
# Automated backup validation script

# 1. Download latest backup from remote storage
aws s3 cp s3://database-backups/prod/latest.dump.gz /tmp/restore_test/

# 2. Restore to isolated test database
createdb test_restore   # target database must exist before pg_restore
gunzip -c /tmp/restore_test/latest.dump.gz | pg_restore -d test_restore

# 3. Run validation queries
EXPECTED_ROWS=1000000   # Adjust based on expectations
ACTUAL_ROWS=$(psql test_restore -t -c "SELECT count(*) FROM orders;")

if [ "$ACTUAL_ROWS" -lt "$EXPECTED_ROWS" ]; then
    echo "ERROR: Restored backup has fewer rows than expected!"
    echo "Expected: $EXPECTED_ROWS, Actual: $ACTUAL_ROWS"
    exit 1
fi

# 4. Validate data integrity
psql test_restore -c "SELECT 'OK' FROM users WHERE id = 1;"
psql test_restore -c "\di+"   # Check indexes exist

# 5. Test application connectivity
if timeout 30 ./healthcheck.sh test_restore_url; then
    echo "Application healthcheck passed"
else
    echo "ERROR: Application failed to start with restored data!"
    exit 1
fi

# 6. Cleanup
dropdb test_restore
rm -rf /tmp/restore_test/

echo "Backup validation PASSED: $(date)"
```
2. Replica Consistency Check
Verify that replica data matches primary:
```sql
-- PostgreSQL: Compare checksums between primary and replica
-- Run on the primary, then run the same query on the replica
SELECT md5(string_agg(t::text, '' ORDER BY primary_key)) AS table_hash
FROM mytable t;

-- The hashes should match; repeat for each table you want to verify
```
3. Failover Drill (Non-Production)
Test the entire failover process end to end using a recent clone of production: promote the standby copy, point a test instance of the application at it, run validation checks, and measure how long the switchover takes against your RTO target.
4. Production Failover (Planned)
The ultimate test is to actually fail over production during a planned maintenance window: serve live traffic from the DR site for an agreed period, confirm that monitoring, runbooks, and the team all work under real conditions, and then fail back.
Some organizations adopt chaos engineering practices, intentionally injecting failures in production to verify systems respond correctly. Netflix's Chaos Monkey terminates random instances. Similar approaches can test database failover by killing primary database servers (with appropriate safeguards). This builds confidence that systems work under real conditions.
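A database-flavored version of this, sketched with heavy safeguards (the guard variable and service unit name are hypothetical): abruptly kill the primary's database process and verify that automatic failover takes over.

```bash
# Chaos drill: kill the primary database process and watch failover happen
# Run ONLY in an environment where automatic failover is expected to handle it
if [ "$CHAOS_DRILL_APPROVED" = "yes" ]; then
    sudo systemctl kill --signal=SIGKILL postgresql   # unit name is an assumption
fi
```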
| Test Type | Frequency | Impact | Coverage |
|---|---|---|---|
| Backup restoration | Weekly | None (isolated) | Data recoverability |
| Replica consistency | Daily (automated) | None | Replication integrity |
| Failover drill (non-prod) | Monthly | None | Process and automation |
| Production failover | Quarterly/Annually | Planned downtime | Complete end-to-end |
Remote backup and DR infrastructure can be expensive. Careful design can significantly reduce costs while maintaining protection.
Cost Components:
The main cost drivers are standby compute at the DR site, duplicated storage, cross-region network transfer, and long-term backup retention. Storage costs in particular can be reduced with lifecycle policies that move aging backups to cheaper tiers, as in this S3 lifecycle configuration:
```json
{
  "Rules": [
    {
      "ID": "MoveToInfrequentAccess",
      "Status": "Enabled",
      "Filter": { "Prefix": "database-backups/daily/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ]
    },
    {
      "ID": "MoveToGlacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "database-backups/monthly/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    },
    {
      "ID": "ExpireOldDailyBackups",
      "Status": "Enabled",
      "Filter": { "Prefix": "database-backups/daily/" },
      "Expiration": { "Days": 90 }
    },
    {
      "ID": "ExpireOldWAL",
      "Status": "Enabled",
      "Filter": { "Prefix": "database-backups/wal/" },
      "Expiration": { "Days": 14 }
    }
  ]
}
```
Every cost reduction comes with trade-offs. Smaller DR servers mean slower recovery. Glacier storage means hours before data is accessible. Fewer backups mean larger RPO. Document these trade-offs explicitly so business stakeholders understand what they're accepting.
Remote backup provides the ultimate layer of data protection—survival of site-level disasters. The key ideas to carry forward: place copies in independent failure domains; use synchronous replication when zero data loss justifies its latency and distance limits; use asynchronous replication or periodic backups when it does not; follow the 3-2-1 rule; and test recovery regularly, because an untested plan is only a hope.
What's Next:
We've explored how to protect data through redundancy, but what happens when data is actually lost despite these protections? The final page of this module examines media recovery—the techniques for recovering from storage failures using backups and archived logs.
You now understand the principles of remote backup and disaster recovery architecture. This knowledge enables you to design database systems that can survive site-level disasters and meet organizational recovery objectives.