RAID protects against disk failures. Mirroring protects against server failures. But what happens when the entire data center is destroyed?
Fires, floods, earthquakes, power grid failures, and other catastrophic events can render an entire site inoperable. All the local redundancy in the world won't help if every copy of your data is in the same building when disaster strikes.
Remote backup addresses this ultimate failure scenario by maintaining copies of data at geographically distant locations—far enough that a single event cannot affect both sites. This is the final layer of defense in achieving true stable storage.
By the end of this page, you will understand the principles of geographic data protection, the main approaches to remote backup (synchronous replication, asynchronous replication, and periodic backups), how to design a disaster recovery architecture, and the trade-offs between protection, performance, and cost.
Local redundancy—RAID, mirroring, multiple servers—significantly reduces the probability of data loss. But these measures share a common weakness: they exist within a single failure domain.
Failure Domains:
A failure domain is a scope within which failures are correlated: components that share a failure domain can fail together. For example, every server in a rack shares the same power feed and top-of-rack switch, and every rack in a building shares the same cooling, utility power, and exposure to fire or flood.
Effective disaster recovery requires placing data copies in independent failure domains.
Major cloud providers have experienced region-wide outages: AWS US-East-1 in 2017 affected thousands of websites; Google Cloud Australia in 2020 impacted major banks; Azure South Central US in 2018 lasted nearly 24 hours. Companies with data only in the affected region were completely down. Companies with multi-region architectures continued operating.
The Geographic Cost-Protection Tradeoff:
Distance between data copies involves a fundamental trade-off:
| Distance | Protection | Latency Impact | Cost |
|---|---|---|---|
| Same rack | Disk failure | Minimal | Low |
| Same building | Server/rack failure | Low | Moderate |
| Same city (10-50km) | Building disaster | 1-5ms | Moderate |
| Same region (100-500km) | City-wide disaster | 5-20ms | High |
| Cross-continent (1000+ km) | Regional disaster | 50-150ms | Very High |
The additional latency comes from the speed of light—approximately 200,000 km/s through fiber optic cables. At 1000km, round-trip time is at least 10ms, plus switching and processing overhead.
Synchronous remote replication extends the synchronous mirroring concept across geographic distances. Every transaction waits for confirmation from the remote site before committing, guaranteeing zero data loss (RPO=0) even in a complete site failure.
The Latency Challenge:
Synchronous replication across distances is constrained by physics. For a 100km separation:
Fiber optic speed: ~200,000 km/s
Round-trip distance: 200 km
Minimum latency: 200 km ÷ 200,000 km/s = 1 ms
Add network equipment latency: ~0.5-2 ms
Add storage write latency: ~1-5 ms
Total: ~2.5-8 ms per transaction commit
For 500 km, the total is roughly 5-15 ms; for 1000 km, roughly 10-25 ms.
These latencies apply to every commit. A single session committing serially at 10 ms per commit can sustain at most about 100 commits per second; a workload that needs 1000 commits per second must lean on many concurrent sessions and group commit, or commits simply queue up and the rate cannot be sustained.
Most synchronous replication deployments limit distance to 50-200km—far enough to survive building or campus-level disasters, but not so far that commit latency becomes unacceptable. Beyond 200-300km, asynchronous replication is typically necessary.
Implementation Examples:
Oracle Data Guard Maximum Protection Mode:
```sql
-- Oracle Data Guard: Maximum Protection (Synchronous)

-- Primary database configuration
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE PROTECTION;

-- This mode guarantees:
-- 1. No committed transaction can be lost
-- 2. Primary will NOT continue if standby is unreachable
-- 3. Zero data loss in any failure scenario

-- Trade-off: If standby fails, primary stops accepting transactions
-- Use for mission-critical systems where data loss is unacceptable

-- Check protection mode
SELECT protection_mode, protection_level FROM v$database;

-- Monitor sync status
SELECT dest_name, status, error, synchronization_status
FROM v$archive_dest_status
WHERE dest_name = 'LOG_ARCHIVE_DEST_2';

-- For high availability WITH zero data loss, use Maximum Availability mode
-- (falls back to async if standby becomes unreachable)
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;
```
SQL Server Always On Synchronous Commit:
```sql
-- SQL Server Always On: Synchronous Remote Replica

-- Create availability group with sync replica
CREATE AVAILABILITY GROUP MyAG
FOR DATABASE MyDB
REPLICA ON
    'Primary' WITH (
        ENDPOINT_URL = 'TCP://primary:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC
    ),
    'Secondary' WITH (
        ENDPOINT_URL = 'TCP://secondary:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC
    );

-- Verify synchronization state
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ars.synchronization_health_desc,
       ars.connected_state_desc,
       drs.synchronization_state_desc,
       drs.is_commit_participant
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_database_replica_states drs ON ars.replica_id = drs.replica_id;
```
Synchronous replication characteristics:
| Aspect | Characteristic |
|---|---|
| Data Loss (RPO) | Zero - no committed data is ever lost |
| Latency Impact | Significant - limited by distance |
| Practical Distance | 50-200 km typically |
| Availability Impact | May stop if remote is unreachable |
| Use Case | Financial, healthcare, regulatory compliance |
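PostgreSQL offers the same trade-off through synchronous standbys. A minimal sketch of the primary-side settings, assuming a standby registered under the hypothetical name 'dr1':

```
# postgresql.conf on the primary: wait for the named synchronous standby on every commit
synchronous_standby_names = 'dr1'   # 'dr1' is a hypothetical standby name
synchronous_commit = remote_apply   # remote_write or on give weaker but faster guarantees
```

If the synchronous standby becomes unreachable, commits on the primary block until it returns, which is the same availability trade-off as Oracle's Maximum Protection mode above.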
Asynchronous remote replication removes the latency constraint by acknowledging transactions locally, then shipping changes to the remote site in the background. This enables replication across any distance but accepts potential data loss.
How It Works:
Primary Site Remote Site
│ │
│── Transaction commits locally ───▶ │
│ │
│── Send log records (background) ───▶│
│ │
│ (Lag: seconds to minutes) │
│ │
│── Primary fails here ───────────────│
│ │
│ Data in flight is LOST │
Replication Lag:
The gap between what's committed on the primary and what's received at the remote site is called replication lag. Under normal conditions, lag is typically seconds. Under high load or network issues, it can grow to minutes or even hours.
```
-- PostgreSQL: Async replication to remote site

# Primary server: postgresql.conf
wal_level = replica
max_wal_senders = 10
synchronous_commit = local     # Don't wait for remote
wal_keep_size = 10GB           # Keep more WAL for lag tolerance

# Archive WAL to remote storage (belt and suspenders)
archive_mode = on
archive_command = 'aws s3 cp %p s3://dr-bucket/wal/%f'

-- Monitor replication lag on primary
SELECT client_addr, state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
       replay_lag AS replay_lag_time
FROM pg_stat_replication;

-- Alert if lag exceeds threshold
SELECT CASE
         WHEN max(replay_lag) > interval '30 seconds' THEN 'CRITICAL: Replication lag > 30s'
         WHEN max(replay_lag) > interval '10 seconds' THEN 'WARNING: Replication lag > 10s'
         ELSE 'OK'
       END AS status
FROM pg_stat_replication;
```
Your RPO with async replication equals your replication lag at the time of failure. If average lag is 5 seconds, your average data loss in a disaster is 5 seconds of transactions. Monitor lag continuously and alert when it exceeds your RPO target. Many organizations accept RPO of 1-60 seconds for async cross-region replication.
Compression and Bandwidth Optimization:
Cross-region replication consumes significant network bandwidth. Common optimizations include compressing the replication stream or archived log segments, shipping block-level or incremental deltas rather than full copies, batching many small changes into larger transfers, and scheduling bulk transfers for off-peak hours.
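For example, on PostgreSQL two low-effort options are enabling wal_compression (which compresses full-page writes inside the WAL itself) and compressing archived WAL segments before they cross the WAN. A sketch, reusing the hypothetical dr-bucket from the configuration above:

```
# postgresql.conf: shrink the WAL stream at the source
wal_compression = on

# Compress each archived segment before shipping it to the remote bucket
archive_command = 'gzip -c %p | aws s3 cp - s3://dr-bucket/wal/%f.gz'
```

The corresponding restore_command at the DR site would then need to decompress each segment before replay.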
Asynchronous replication characteristics:
| Aspect | Characteristic |
|---|---|
| Data Loss (RPO) | Typically seconds, potentially minutes |
| Latency Impact | None on primary transactions |
| Practical Distance | Unlimited (cross-continent, cross-globe) |
| Availability Impact | None - primary continues if remote fails |
| Use Case | Cross-region DR, read scaling, compliance |
Not all systems require continuous replication. Periodic backup captures point-in-time snapshots and transfers them to remote storage. This is simpler and cheaper than continuous replication but provides coarser recovery granularity.
Backup Types and Remote Storage:
Full Backup
Complete copy of the database. Large but self-contained—can restore from just this backup.
Incremental Backup
Only changes since last backup (full or incremental). Smaller, but requires the full backup plus all incrementals to restore.
Differential Backup
All changes since last full backup. Larger than incremental but requires only full + one differential to restore.
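These three backup types map directly onto tooling options. For instance, pgBackRest (also used in the script below, with the hypothetical stanza name 'main') lets you choose the type per run:

```bash
pgbackrest --stanza=main --type=full backup   # complete, self-contained copy
pgbackrest --stanza=main --type=diff backup   # everything changed since the last full backup
pgbackrest --stanza=main --type=incr backup   # everything changed since the last backup of any type
```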
```bash
#!/bin/bash
# PostgreSQL: Backup to remote cloud storage

# Variables
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backup"
S3_BUCKET="s3://database-backups/prod"

# Option 1: pg_dump to S3 (logical backup)
pg_dump -Fc mydb | gzip | aws s3 cp - ${S3_BUCKET}/logical/${DATE}.dump.gz

# Option 2: pg_basebackup to S3 (physical backup)
pg_basebackup -D - -Ft -Xs -Pv | gzip | aws s3 cp - ${S3_BUCKET}/physical/${DATE}.tar.gz

# Option 3: pgBackRest (incremental, parallel, encrypted)
pgbackrest --stanza=main --type=incr backup
pgbackrest --stanza=main archive-push   # Ship WAL continuously

# Option 4: File-level snapshot + ship
# (LVM snapshot, ZFS snapshot, etc.)
lvcreate --snapshot --name pg_snap --size 10G /dev/vg/pg_data
tar czf - /mnt/pg_snap | aws s3 cp - ${S3_BUCKET}/snapshot/${DATE}.tar.gz
lvremove -f /dev/vg/pg_snap

# Cleanup old backups (retain 30 days)
aws s3 ls ${S3_BUCKET}/ | while read -r line; do
    createDate=$(echo $line | awk '{print $1" "$2}')
    createDateSec=$(date -d "$createDate" +%s)
    olderThanSec=$(date -d "30 days ago" +%s)
    if [[ $createDateSec -lt $olderThanSec ]]; then
        fileName=$(echo $line | awk '{print $4}')
        aws s3 rm ${S3_BUCKET}/$fileName
    fi
done

echo "Backup completed: ${DATE}"
```
Cloud Storage Options for Remote Backup:
| Storage Class | Access Time | Use Case | Relative Cost |
|---|---|---|---|
| Standard (S3, GCS, Azure Blob) | Milliseconds | Frequent restores | 100% |
| Infrequent Access | Milliseconds | Monthly restores | 50-60% |
| Glacier / Archive | Hours | Rare disaster recovery | 10-20% |
| Glacier Deep Archive | 12+ hours | Compliance retention | 5% |
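Note that archive tiers are not directly readable: an archived object must first be restored to an accessible tier, which is what drives the hours-long access times above. A sketch using hypothetical bucket and key names:

```bash
# Request a temporary (7-day) restore of an archived backup object
aws s3api restore-object \
    --bucket database-backups \
    --key prod/monthly/20240101.dump.gz \
    --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'

# Check restore progress; download normally once the restore completes
aws s3api head-object --bucket database-backups --key prod/monthly/20240101.dump.gz
```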
The 3-2-1 backup rule recommends: 3 copies of your data, on 2 different media types, with 1 copy offsite. For databases: primary database, local backup (different storage), remote backup (cloud or DR site). This covers hardware failure, site disaster, and accidental deletion.
Combining Continuous Replication with Periodic Backup:
Many organizations use both: continuous replication to a warm standby for fast failover with minimal data loss, plus periodic backups shipped to independent remote storage for point-in-time recovery.
Continuous replication protects against operational failures. Backups protect against data corruption, ransomware, and accidental deletion (since corrupt/deleted data replicates too).
A complete disaster recovery (DR) architecture combines multiple technologies into a cohesive system that can restore operations after any failure. Let's examine common DR architecture patterns.
Active-Passive (Warm Standby):
┌─────────────────────────────────────────────────────────────┐
│ Primary Site │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │───▶│ Database │───▶│ Storage │ │
│ │ Servers │ │ (Active) │ │ (RAID 10) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │
└──────────────────────────────│───────────────────────────────┘
│ Async/Sync Replication
▼
┌──────────────────────────────────────────────────────────────┐
│ DR Site (Standby) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │ │ Database │ │ Storage │ │
│ │ (Standby) │ │ (Replica) │ │ (RAID 10) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ (Start on failover) │
└──────────────────────────────────────────────────────────────┘
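Failover in this pattern means promoting the DR replica to primary and redirecting clients. A minimal sketch for a PostgreSQL standby (the data directory path is an assumption):

```bash
# Promote the DR-site standby to become the new primary
pg_ctl promote -D /var/lib/postgresql/data
# Equivalent from SQL on recent PostgreSQL versions: psql -c "SELECT pg_promote();"

# Then start the standby application tier and repoint clients
# (DNS failover, load balancer change, or updated connection strings)
```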
Active-Active (Multi-Region):
┌──────────────────────────────────────────────────────────────┐
│ Region A (US-East) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │◀──▶│ Database │◀──┐ │
│ │ Servers │ │ (Read/Write)│ │ │
│ └──────────────┘ └──────────────┘ │ │
│ │ Bi-directional │
└──────────────────────────────────────────│ Replication │
│ │
┌──────────────────────────────────────────│───────────────────┐
│ Region B (US-West) │ │
│ ┌──────────────┐ ┌──────────────┐ │ │
│ │ Application │◀──▶│ Database │◀──┘ │
│ │ Servers │ │ (Read/Write)│ │
│ └──────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
True active-active with writes to both regions is extremely complex. Conflicts can occur when the same data is modified in both regions simultaneously. Solutions include: last-write-wins (data loss), application-level conflict resolution, restricting what each region can write, or using distributed databases designed for this (CockroachDB, Spanner).
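As an illustration of how blunt last-write-wins is, here is the kind of timestamp-guarded upsert a replication apply process might run when a change arrives from the other region (table, columns, and values are hypothetical):

```sql
-- Apply an incoming replicated row only if it is newer than the local version
INSERT INTO accounts (id, balance, updated_at)
VALUES (42, 100.00, TIMESTAMP '2024-01-01 12:00:05')
ON CONFLICT (id) DO UPDATE
   SET balance    = EXCLUDED.balance,
       updated_at = EXCLUDED.updated_at
 WHERE accounts.updated_at < EXCLUDED.updated_at;   -- the older concurrent write is silently dropped
```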
An untested disaster recovery plan is not a plan—it's a hope. Remote backup and DR systems are only valuable if they actually work when needed. Regular testing is essential.
Types of DR Testing:
1. Backup Validation
Verify that backups are complete and restorable:
```bash
#!/bin/bash
# Automated backup validation script

# 1. Download latest backup from remote storage
aws s3 cp s3://database-backups/prod/latest.dump.gz /tmp/restore_test/

# 2. Restore to isolated test database
createdb test_restore   # target database must exist before pg_restore
gunzip -c /tmp/restore_test/latest.dump.gz | pg_restore -d test_restore

# 3. Run validation queries
EXPECTED_ROWS=1000000   # Adjust based on expectations
ACTUAL_ROWS=$(psql test_restore -t -c "SELECT count(*) FROM orders;")

if [ "$ACTUAL_ROWS" -lt "$EXPECTED_ROWS" ]; then
    echo "ERROR: Restored backup has fewer rows than expected!"
    echo "Expected: $EXPECTED_ROWS, Actual: $ACTUAL_ROWS"
    exit 1
fi

# 4. Validate data integrity
psql test_restore -c "SELECT 'OK' FROM users WHERE id = 1;"
psql test_restore -c "\di+"   # Check indexes exist

# 5. Test application connectivity
if timeout 30 ./healthcheck.sh test_restore_url; then
    echo "Application healthcheck passed"
else
    echo "ERROR: Application failed to start with restored data!"
    exit 1
fi

# 6. Cleanup
dropdb test_restore
rm -rf /tmp/restore_test/

echo "Backup validation PASSED: $(date)"
```
2. Replica Consistency Check
Verify that replica data matches primary:
```sql
-- PostgreSQL: Compare checksums between primary and replica
-- Run on the primary, then run the same query on the replica
SELECT md5(string_agg(t::text, '' ORDER BY primary_key)) AS table_hash
FROM mytable t;

-- The hashes should match; repeat for each table you want to verify
```
3. Failover Drill (Non-Production)
Test the entire failover process end to end using a recent clone of production: promote the standby copy, point a test instance of the application at it, run validation checks, and measure how long the switchover takes against your RTO target.
4. Production Failover (Planned)
The ultimate test is to actually fail over production during a planned maintenance window: serve live traffic from the DR site for an agreed period, confirm that monitoring, runbooks, and the team all work under real conditions, and then fail back.
Some organizations adopt chaos engineering practices, intentionally injecting failures in production to verify systems respond correctly. Netflix's Chaos Monkey terminates random instances. Similar approaches can test database failover by killing primary database servers (with appropriate safeguards). This builds confidence that systems work under real conditions.
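A database-flavored version of this, sketched with heavy safeguards (the guard variable and service unit name are hypothetical): abruptly kill the primary's database process and verify that automatic failover takes over.

```bash
# Chaos drill: kill the primary database process and watch failover happen
# Run ONLY in an environment where automatic failover is expected to handle it
if [ "$CHAOS_DRILL_APPROVED" = "yes" ]; then
    sudo systemctl kill --signal=SIGKILL postgresql   # unit name is an assumption
fi
```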
| Test Type | Frequency | Impact | Coverage |
|---|---|---|---|
| Backup restoration | Weekly | None (isolated) | Data recoverability |
| Replica consistency | Daily (automated) | None | Replication integrity |
| Failover drill (non-prod) | Monthly | None | Process and automation |
| Production failover | Quarterly/Annually | Planned downtime | Complete end-to-end |
Remote backup and DR infrastructure can be expensive. Careful design can significantly reduce costs while maintaining protection.
Cost Components:
The main cost drivers are standby compute at the DR site, duplicated storage, cross-region network transfer, and long-term backup retention. Storage costs in particular can be reduced with lifecycle policies that move aging backups to cheaper tiers, as in this S3 lifecycle configuration:
```json
{
  "Rules": [
    {
      "ID": "MoveToInfrequentAccess",
      "Status": "Enabled",
      "Filter": { "Prefix": "database-backups/daily/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ]
    },
    {
      "ID": "MoveToGlacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "database-backups/monthly/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    },
    {
      "ID": "ExpireOldDailyBackups",
      "Status": "Enabled",
      "Filter": { "Prefix": "database-backups/daily/" },
      "Expiration": { "Days": 90 }
    },
    {
      "ID": "ExpireOldWAL",
      "Status": "Enabled",
      "Filter": { "Prefix": "database-backups/wal/" },
      "Expiration": { "Days": 14 }
    }
  ]
}
```
Every cost reduction comes with trade-offs. Smaller DR servers mean slower recovery. Glacier storage means hours before data is accessible. Fewer backups mean larger RPO. Document these trade-offs explicitly so business stakeholders understand what they're accepting.
Remote backup provides the ultimate layer of data protection—survival of site-level disasters. The key ideas to carry forward: place copies in independent failure domains; use synchronous replication when zero data loss justifies its latency and distance limits; use asynchronous replication or periodic backups when it does not; follow the 3-2-1 rule; and test recovery regularly, because an untested plan is only a hope.
What's Next:
We've explored how to protect data through redundancy, but what happens when data is actually lost despite these protections? The final page of this module examines media recovery—the techniques for recovering from storage failures using backups and archived logs.
You now understand the principles of remote backup and disaster recovery architecture. This knowledge enables you to design database systems that can survive site-level disasters and meet organizational recovery objectives.