On February 28, 2017, Amazon Web Services' S3 service in US-East-1 experienced a major outage lasting nearly four hours. The ripple effects were catastrophic: thousands of websites and applications that depended on S3—or on services that depended on S3—became unavailable. Companies lost millions of dollars. Some businesses discovered their "disaster recovery" plans were hosted in the same region that failed.
In March 2021, a fire at OVHcloud's Strasbourg data center destroyed one building entirely and damaged another. Hundreds of thousands of websites went offline. For some companies, the damage was permanent—their data was lost because backups were stored in the same facility.
These aren't hypothetical scenarios—they're recent history. Disaster recovery (DR) is the discipline of planning for, and recovering from, catastrophic failures that affect entire data centers, regions, or even cloud providers. Database replication is the foundational technology that makes DR possible for your most critical asset: your data.
By the end of this page, you will understand the difference between high availability and disaster recovery, how to define RPO and RTO objectives for your organization, replication strategies that enable disaster recovery, how to design cross-region DR architectures, testing methodologies that validate your DR plan, and real-world DR patterns from major organizations.
High availability (HA) and disaster recovery (DR) are related but distinct concepts. Understanding the difference is essential for proper planning.
High Availability (HA): keeping a service running through component-level failures (a server, disk, or rack) within a single site, typically with automatic failover measured in seconds to minutes and no data loss.
Disaster Recovery (DR): restoring service after a site- or region-level catastrophe (data center, region, or provider failure), typically from separate infrastructure and on a timescale of minutes to hours.
| Aspect | High Availability | Disaster Recovery |
|---|---|---|
| Failure scope | Component (server, disk, rack) | Site (data center, region, provider) |
| Geographic scope | Within a site/region | Across sites/regions |
| Recovery time | Seconds to minutes | Minutes to hours |
| Data loss | Zero (typically) | May accept some loss (RPO) |
| Automation | Fully automatic | Often semi-automated or manual |
| Cost | Moderate (redundant components) | Higher (separate infrastructure) |
| Testing frequency | Continuous/frequent | Periodic (quarterly/annually) |
Why both are necessary:
A system can have excellent HA within a region but zero DR capability: a multi-AZ cluster with automatic failover still disappears entirely when the whole region goes down.
Conversely, a system with DR but poor HA experiences frequent minor outages for simple hardware failures.
Mature organizations implement both:
Your DR plan is only valid if DR resources don't share fate with primary resources. Common mistakes: DR replicas in the same region, DNS hosted by the same provider, runbooks stored in the primary data center, or depending on services that have single-region dependencies. Audit your DR architecture for hidden shared dependencies.
Two metrics define disaster recovery requirements: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). These are the foundation of DR planning and directly influence architecture decisions.
Recovery Point Objective (RPO):
RPO answers: How much data can we afford to lose?
Recovery Time Objective (RTO):
RTO answers: How long can we be down?
The cost curve:
Lowering RPO and RTO targets drives costs up steeply:
| RPO | RTO | Infrastructure Cost | Operational Cost |
|---|---|---|---|
| 24 hours | 24 hours | 💰 (backups only) | Low |
| 1 hour | 4 hours | 💰💰 (warm standby) | Medium |
| 15 minutes | 1 hour | 💰💰💰 (hot standby + async replication) | Medium-High |
| 0 | 5 minutes | 💰💰💰💰 (sync replication + automated failover) | High |
Setting objectives:
RPO and RTO should be set based on business impact analysis:
Different objectives for different data:
Not all data warrants the same RPO/RTO:
RPO and RTO must be documented, approved by business stakeholders, and communicated to engineering. Many outages run longer than necessary because teams don't know their target recovery time, or because they make expensive real-time decisions without understanding business priorities.
Several architectural patterns use database replication to enable disaster recovery, each with different RPO, RTO, and cost characteristics.
Pattern 1: Backup to Remote Storage
Simplest DR approach:
Characteristics:
Pattern 2: Warm Standby
Pre-provisioned infrastructure in DR region:
Characteristics:
Pattern 3: Hot Standby (Active-Passive)
Fully operational DR region:
Characteristics:
Pattern 4: Active-Active Multi-Region
Both regions actively serve traffic:
Characteristics:
| Pattern | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup to Remote | Hours | Hours to days | 💰 | Low |
| Warm Standby | Minutes | 30 min - 1 hour | 💰💰 | Medium |
| Hot Standby | Seconds | Minutes | 💰💰💰 | Medium-High |
| Active-Active | Zero* | Seconds | 💰💰💰💰 | High |

*Near-zero in practice: with asynchronous replication between regions, writes in flight at the moment of failure can still be lost.
Use different DR patterns for different data tiers. Critical transactional data: hot standby with automated failover. User content: warm standby with async replication. Historical analytics: backup to remote storage. This optimizes cost while meeting business requirements.
Implementing cross-region replication for DR requires careful attention to network configuration, monitoring, and operational procedures.
Network considerations:
Replication bandwidth calculation:
For a database with 1,000 writes per second, an average of 500 bytes of changed data per write, and roughly 1.5× replication overhead:
Bandwidth = 1000 × 500 × 1.5 = 750,000 bytes/sec ≈ 6 Mbps
This is the baseline; bursts during traffic spikes can be 10-100× higher. Provision accordingly.
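Rather than relying on estimates alone, you can measure how much WAL an existing primary actually generates and size the cross-region link from that. A minimal sketch in PostgreSQL; the LSN values and the 60-second interval are placeholders, not output from a real system:

```sql
-- Sample the current WAL position, wait a known interval, then sample again;
-- the difference is the number of WAL bytes generated in that window.
SELECT pg_current_wal_lsn() AS lsn_start;   -- e.g. 0/5000000

-- ...wait 60 seconds (rerun manually, or schedule both samples)...

SELECT pg_current_wal_lsn() AS lsn_end;     -- e.g. 0/5600000

-- Turn the two samples into an approximate replication bandwidth in Mbps:
-- bytes generated / interval in seconds * 8 bits per byte / 1e6.
SELECT pg_wal_lsn_diff('0/5600000', '0/5000000') / 60 * 8 / 1e6 AS approx_mbps;
```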
Handling replication lag:
Cross-region replication inherently involves higher latency (50-200+ ms round trip, depending on distance). This affects:
Strategies for managing lag:
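Whichever strategies you adopt, you need continuous visibility into the lag itself, on both ends of the link. The primary-side view (pg_stat_replication) appears in the configuration below; this is a minimal standby-side sketch, assuming a PostgreSQL hot standby:

```sql
-- On the DR standby: how far behind the primary is this server?
SELECT
    pg_last_wal_receive_lsn()  AS received_lsn,   -- last WAL received over the network
    pg_last_wal_replay_lsn()   AS replayed_lsn,   -- last WAL actually applied
    pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                    pg_last_wal_replay_lsn()) AS apply_backlog_bytes,
    now() - pg_last_xact_replay_timestamp()   AS approx_replay_delay;  -- time-based lag; grows on an idle primary
```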
```
-- PostgreSQL configuration for cross-region streaming replication

-- Primary server (primary region) postgresql.conf

-- Replication settings
wal_level = replica              -- Required for streaming replication
max_wal_senders = 10             -- Number of concurrent replication connections
wal_keep_size = 1GB              -- WAL retention for slow replicas
max_replication_slots = 5        -- Replication slots for guaranteed WAL retention

-- Network settings for cross-region
listen_addresses = '*'           -- Accept remote connections
ssl = on                         -- Require TLS for replication
ssl_cert_file = '/etc/ssl/certs/server.crt'
ssl_key_file = '/etc/ssl/private/server.key'

-- Performance tuning for WAN replication
wal_sender_timeout = 60s         -- Longer timeout for high-latency connections
tcp_keepalives_idle = 60         -- TCP keepalive for long-distance connections
tcp_keepalives_interval = 10
tcp_keepalives_count = 6

-- Primary pg_hba.conf (allow replication from DR region)
-- TYPE  DATABASE     USER        ADDRESS          METHOD
-- host  replication  replicator  10.100.0.0/16    scram-sha-256
-- Note: 10.100.0.0/16 is the DR region VPC CIDR

-- DR server (DR region) postgresql.conf for standby

-- Connection to primary
primary_conninfo = 'host=primary.us-east-1.internal port=5432 user=replicator password=******* sslmode=require application_name=dr-standby'
primary_slot_name = 'dr_region_slot'  -- Use replication slot

-- Recovery settings
hot_standby = on                      -- Allow read queries on standby
hot_standby_feedback = on             -- Prevent vacuum from removing needed rows
max_standby_streaming_delay = 30s     -- Allow queries even during lag
recovery_min_apply_delay = 0          -- Apply immediately (or set delay for point-in-time)

-- Archive recovery (belt and suspenders)
restore_command = 'aws s3 cp s3://wal-archive/%f %p'

-- Create replication slot on primary (prevents WAL from being recycled)
SELECT pg_create_physical_replication_slot('dr_region_slot');

-- Monitor replication lag on primary
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS replay_lag_mb
FROM pg_stat_replication;
```

Replication slots prevent WAL recycling until the standby has consumed it. If the DR region becomes unreachable, WAL will accumulate on the primary indefinitely, potentially filling the disk and crashing the primary. Implement monitoring and automated slot management to prevent this failure mode.
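One way to implement that monitoring: track how much WAL each slot is forcing the primary to retain and alert well before the disk fills. A minimal sketch, assuming PostgreSQL 13+ (for safe_wal_size and max_slot_wal_keep_size); the 50 GB threshold is an arbitrary example:

```sql
-- How much WAL is each replication slot holding back on the primary?
SELECT
    slot_name,
    active,                                                    -- false if the standby is disconnected
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes,
    safe_wal_size                                              -- bytes left before the slot is invalidated
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 50 * 1024^3;  -- alert past ~50 GB

-- As an automated backstop, cap slot retention in postgresql.conf, e.g.:
--   max_slot_wal_keep_size = 100GB
```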
When disaster strikes, having documented, tested procedures is the difference between a controlled recovery and chaos. DR failover is significantly more complex than local HA failover.
Pre-failover checklist:
Database failover steps:
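The core database step is promoting the DR standby so it accepts writes. A minimal sketch for the PostgreSQL setup shown earlier (PostgreSQL 12+); checking replay status first is a runbook convention assumed here, not a requirement:

```sql
-- 1. On the standby: record how far replication got, so you know roughly how
--    much data you are accepting as lost (your actual recovery point).
SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();

-- 2. Promote the standby to a read-write primary.
--    wait => true blocks until promotion completes or wait_seconds elapses.
SELECT pg_promote(wait => true, wait_seconds => 60);

-- 3. Confirm the server has left recovery and now accepts writes.
SELECT pg_is_in_recovery();   -- should return false after promotion
```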
Post-failover operations:
Failback planning:
Once the primary region recovers, you need a plan to return to normal operations:
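On the database side, failback usually means turning the recovered old primary into a standby of the promoted DR server, letting it catch up, and then performing a planned switchover. A minimal sketch of the replication side; resynchronizing the old primary itself (pg_rewind or a fresh base backup) happens outside SQL and is assumed here:

```sql
-- On the new primary (the promoted DR server): create a slot for the
-- recovering old primary to stream from, mirroring the original topology.
SELECT pg_create_physical_replication_slot('failback_slot');

-- Once the old primary is attached as a standby, watch it catch up before
-- scheduling the switchover back to the primary region.
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS catchup_remaining_bytes
FROM pg_stat_replication;
```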
DR runbooks become outdated quickly as infrastructure evolves. Establish a regular review cycle (quarterly recommended) and update procedures after any infrastructure changes. Include runbook verification as part of architecture change management.
An untested DR plan is not a plan—it's a hope. Real disasters expose gaps that documentation reviews cannot. Regular, realistic DR testing is essential.
Types of DR tests:
| Test Type | Description | Frequency | Risk Level |
|---|---|---|---|
| Tabletop Exercise | Walk through procedures without execution | Quarterly | None |
| Component Test | Test individual steps (failover, restore) | Monthly | Low |
| Parallel Test | Bring up DR without redirecting traffic | Quarterly | Low |
| Full Failover | Complete failover and operate from DR | Annually | Medium |
| Surprise Drill | Unannounced DR exercise | Annually | High |
What to validate during DR tests:
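On the database side, a few quick checks cover most of what a parallel test needs to prove. A minimal sketch against the PostgreSQL DR standby from earlier; the 15-minute freshness threshold is an arbitrary stand-in for your real RPO:

```sql
-- Is the DR server actually a standby that is receiving WAL right now?
-- (Returns no rows if the WAL receiver process is not running.)
SELECT pg_is_in_recovery() AS is_standby, status, sender_host, sender_port
FROM pg_stat_wal_receiver;

-- Is the replicated data fresh enough to meet the stated RPO?
SELECT now() - pg_last_xact_replay_timestamp()                           AS data_age,
       (now() - pg_last_xact_replay_timestamp()) < interval '15 minutes' AS within_rpo;
```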
Game Day methodology:
The most effective DR test is a "Game Day"—a scheduled, announced exercise where the team executes DR procedures against production (during low-traffic periods):
Netflix's famous "Chaos Monkey" and broader Chaos Engineering practices extend this concept to continuous, smaller-scale failure injection.
Common issues discovered during DR tests: expired credentials, incorrect DNS records, insufficient DR capacity, missing dependencies, outdated runbooks, and slow team response due to unfamiliarity. Better to discover these in a planned test than during an actual disaster.
Understanding real-world disasters helps in designing robust DR strategies.
Scenario 1: Cloud Provider Region Outage (AWS US-East-1, 2017)
What happened:
Lessons:
Scenario 2: Data Center Fire (OVHcloud Strasbourg, 2021)
What happened:
Lessons:
Scenario 3: Ransomware Attack (Multiple Organizations)
What happens:
Lessons:
Scenario 4: Submarine Cable Cut (Multiple Events)
What happens:
Lessons:
| Disaster Type | Examples | DR Strategy |
|---|---|---|
| Infrastructure failure | Hardware crash, disk failure | HA (automatic failover within region) |
| Site failure | Data center fire, flooding, power | Cross-region replication + DR site |
| Regional failure | Cloud region outage, natural disaster | Multi-region active-active or warm standby |
| Cyber attack | Ransomware, data destruction | Offline/immutable backups, air-gapped |
| Network partition | Cable cut, routing failure | Multi-path connectivity, local caching |
| Provider failure | Cloud provider business failure | Multi-cloud strategy (complex) |
No single DR strategy protects against all scenarios. Layer defenses: automatic HA within region, asynchronous replication to DR region, regular backups to separate storage, and periodic offline/immutable copies. Each layer protects against different failure modes.
Disaster recovery protects organizations from catastrophic data loss and extended outages. Let's consolidate the key concepts:
Module complete:
You have now explored all four motivations for database replication: high availability, read scaling, geographic distribution for lower latency, and disaster recovery.
These motivations are not mutually exclusive—a well-designed replication architecture often addresses multiple goals simultaneously. Cross-region replicas, for example, provide read scaling, geographic distribution, AND disaster recovery.
In subsequent modules, we'll explore how to implement replication: leader-follower patterns, leaderless approaches, quorum-based systems, and partitioning strategies that complement replication.
You now have a comprehensive understanding of why database replication matters. You can articulate the business case for replication, define appropriate RPO/RTO objectives, design DR architectures, and implement testing strategies. You're ready to dive into the implementation patterns in the following modules.