On February 28, 2017, Amazon Web Services' S3 service in US-East-1 experienced a major outage lasting nearly four hours. The ripple effects were catastrophic: thousands of websites and applications that depended on S3—or on services that depended on S3—became unavailable. Companies lost millions of dollars. Some businesses discovered their "disaster recovery" plans were hosted in the same region that failed.
In March 2021, a fire at OVHcloud's Strasbourg data center destroyed one building entirely and damaged another. Hundreds of thousands of websites went offline. For some companies, the damage was permanent—their data was lost because backups were stored in the same facility.
These aren't hypothetical scenarios—they're recent history. Disaster recovery (DR) is the discipline of planning for, and recovering from, catastrophic failures that affect entire data centers, regions, or even cloud providers. Database replication is the foundational technology that makes DR possible for your most critical asset: your data.
By the end of this page, you will understand the difference between high availability and disaster recovery, how to define RPO and RTO objectives for your organization, replication strategies that enable disaster recovery, how to design cross-region DR architectures, testing methodologies that validate your DR plan, and real-world DR patterns from major organizations.
High availability (HA) and disaster recovery (DR) are related but distinct concepts. Understanding the difference is essential for proper planning.
High Availability (HA): keeping a service running through component-level failures (a server, disk, or rack) within a single site, typically with automatic failover measured in seconds to minutes and no data loss.
Disaster Recovery (DR): restoring service after a site- or region-level catastrophe (data center, region, or provider failure), typically from separate infrastructure and on a timescale of minutes to hours.
| Aspect | High Availability | Disaster Recovery |
|---|---|---|
| Failure scope | Component (server, disk, rack) | Site (data center, region, provider) |
| Geographic scope | Within a site/region | Across sites/regions |
| Recovery time | Seconds to minutes | Minutes to hours |
| Data loss | Zero (typically) | May accept some loss (RPO) |
| Automation | Fully automatic | Often semi-automated or manual |
| Cost | Moderate (redundant components) | Higher (separate infrastructure) |
| Testing frequency | Continuous/frequent | Periodic (quarterly/annually) |
Why both are necessary:
A system can have excellent HA within a region but zero DR capability: a multi-AZ cluster with automatic failover still disappears entirely when the whole region goes down.
Conversely, a system with DR but poor HA experiences frequent minor outages for simple hardware failures.
Mature organizations implement both:
Your DR plan is only valid if DR resources don't share fate with primary resources. Common mistakes: DR replicas in the same region, DNS hosted by the same provider, runbooks stored in the primary data center, or depending on services that have single-region dependencies. Audit your DR architecture for hidden shared dependencies.
Two metrics define disaster recovery requirements: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). These are the foundation of DR planning and directly influence architecture decisions.
Recovery Point Objective (RPO):
RPO answers: How much data can we afford to lose?
Recovery Time Objective (RTO):
RTO answers: How long can we be down?
The cost curve:
Lowering RPO and RTO targets drives costs up steeply:
| RPO | RTO | Infrastructure Cost | Operational Cost |
|---|---|---|---|
| 24 hours | 24 hours | 💰 (backups only) | Low |
| 1 hour | 4 hours | 💰💰 (warm standby) | Medium |
| 15 minutes | 1 hour | 💰💰💰 (hot standby + async replication) | Medium-High |
| 0 | 5 minutes | 💰💰💰💰 (sync replication + automated failover) | High |
Setting objectives:
RPO and RTO should be set based on business impact analysis:
Different objectives for different data:
Not all data warrants the same RPO/RTO:
RPO and RTO must be documented, approved by business stakeholders, and communicated to engineering. Many outages run longer than necessary because teams don't know their target recovery time, or because they make expensive real-time decisions without understanding business priorities.
Several architectural patterns use database replication to enable disaster recovery, each with different RPO, RTO, and cost characteristics.
Pattern 1: Backup to Remote Storage
Simplest DR approach:
Characteristics:
Pattern 2: Warm Standby
Pre-provisioned infrastructure in DR region:
Characteristics:
Pattern 3: Hot Standby (Active-Passive)
Fully operational DR region:
Characteristics:
Pattern 4: Active-Active Multi-Region
Both regions actively serve traffic:
Characteristics:
| Pattern | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup to Remote | Hours | Hours to days | 💰 | Low |
| Warm Standby | Minutes | 30 min - 1 hour | 💰💰 | Medium |
| Hot Standby | Seconds | Minutes | 💰💰💰 | Medium-High |
| Active-Active | Zero* | Seconds | 💰💰💰💰 | High |

*Near-zero in practice: with asynchronous replication between regions, writes in flight at the moment of failure can still be lost.
Use different DR patterns for different data tiers. Critical transactional data: hot standby with automated failover. User content: warm standby with async replication. Historical analytics: backup to remote storage. This optimizes cost while meeting business requirements.
Implementing cross-region replication for DR requires careful attention to network configuration, monitoring, and operational procedures.
Network considerations:
Replication bandwidth calculation:
For a database with 1,000 writes per second, an average of 500 bytes of changed data per write, and roughly 1.5× replication overhead:
Bandwidth = 1000 × 500 × 1.5 = 750,000 bytes/sec ≈ 6 Mbps
This is the baseline; bursts during traffic spikes can be 10-100× higher. Provision accordingly.
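Rather than relying on estimates alone, you can measure how much WAL an existing primary actually generates and size the cross-region link from that. A minimal sketch in PostgreSQL; the LSN values and the 60-second interval are placeholders, not output from a real system:

```sql
-- Sample the current WAL position, wait a known interval, then sample again;
-- the difference is the number of WAL bytes generated in that window.
SELECT pg_current_wal_lsn() AS lsn_start;   -- e.g. 0/5000000

-- ...wait 60 seconds (rerun manually, or schedule both samples)...

SELECT pg_current_wal_lsn() AS lsn_end;     -- e.g. 0/5600000

-- Turn the two samples into an approximate replication bandwidth in Mbps:
-- bytes generated / interval in seconds * 8 bits per byte / 1e6.
SELECT pg_wal_lsn_diff('0/5600000', '0/5000000') / 60 * 8 / 1e6 AS approx_mbps;
```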
Handling replication lag:
Cross-region replication inherently involves higher latency (50-200+ ms round trip, depending on distance). This affects:
Strategies for managing lag:
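Whichever strategies you adopt, you need continuous visibility into the lag itself, on both ends of the link. The primary-side view (pg_stat_replication) appears in the configuration below; this is a minimal standby-side sketch, assuming a PostgreSQL hot standby:

```sql
-- On the DR standby: how far behind the primary is this server?
SELECT
    pg_last_wal_receive_lsn()  AS received_lsn,   -- last WAL received over the network
    pg_last_wal_replay_lsn()   AS replayed_lsn,   -- last WAL actually applied
    pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                    pg_last_wal_replay_lsn()) AS apply_backlog_bytes,
    now() - pg_last_xact_replay_timestamp()   AS approx_replay_delay;  -- time-based lag; grows on an idle primary
```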
```
-- PostgreSQL configuration for cross-region streaming replication

-- Primary server (primary region) postgresql.conf

-- Replication settings
wal_level = replica              -- Required for streaming replication
max_wal_senders = 10             -- Number of concurrent replication connections
wal_keep_size = 1GB              -- WAL retention for slow replicas
max_replication_slots = 5        -- Replication slots for guaranteed WAL retention

-- Network settings for cross-region
listen_addresses = '*'           -- Accept remote connections
ssl = on                         -- Require TLS for replication
ssl_cert_file = '/etc/ssl/certs/server.crt'
ssl_key_file = '/etc/ssl/private/server.key'

-- Performance tuning for WAN replication
wal_sender_timeout = 60s         -- Longer timeout for high-latency connections
tcp_keepalives_idle = 60         -- TCP keepalive for long-distance connections
tcp_keepalives_interval = 10
tcp_keepalives_count = 6

-- Primary pg_hba.conf (allow replication from DR region)
-- TYPE  DATABASE     USER        ADDRESS          METHOD
-- host  replication  replicator  10.100.0.0/16    scram-sha-256
-- Note: 10.100.0.0/16 is the DR region VPC CIDR

-- DR server (DR region) postgresql.conf for standby

-- Connection to primary
primary_conninfo = 'host=primary.us-east-1.internal port=5432 user=replicator password=******* sslmode=require application_name=dr-standby'
primary_slot_name = 'dr_region_slot'  -- Use replication slot

-- Recovery settings
hot_standby = on                      -- Allow read queries on standby
hot_standby_feedback = on             -- Prevent vacuum from removing needed rows
max_standby_streaming_delay = 30s     -- Allow queries even during lag
recovery_min_apply_delay = 0          -- Apply immediately (or set delay for point-in-time)

-- Archive recovery (belt and suspenders)
restore_command = 'aws s3 cp s3://wal-archive/%f %p'

-- Create replication slot on primary (prevents WAL from being recycled)
SELECT pg_create_physical_replication_slot('dr_region_slot');

-- Monitor replication lag on primary
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS replay_lag_mb
FROM pg_stat_replication;
```

Replication slots prevent WAL recycling until the standby has consumed it. If the DR region becomes unreachable, WAL will accumulate on the primary indefinitely, potentially filling the disk and crashing the primary. Implement monitoring and automated slot management to prevent this failure mode.
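One way to implement that monitoring: track how much WAL each slot is forcing the primary to retain and alert well before the disk fills. A minimal sketch, assuming PostgreSQL 13+ (for safe_wal_size and max_slot_wal_keep_size); the 50 GB threshold is an arbitrary example:

```sql
-- How much WAL is each replication slot holding back on the primary?
SELECT
    slot_name,
    active,                                                    -- false if the standby is disconnected
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes,
    safe_wal_size                                              -- bytes left before the slot is invalidated
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 50 * 1024^3;  -- alert past ~50 GB

-- As an automated backstop, cap slot retention in postgresql.conf, e.g.:
--   max_slot_wal_keep_size = 100GB
```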
When disaster strikes, having documented, tested procedures is the difference between a controlled recovery and chaos. DR failover is significantly more complex than local HA failover.
Pre-failover checklist:
Database failover steps:
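The core database step is promoting the DR standby so it accepts writes. A minimal sketch for the PostgreSQL setup shown earlier (PostgreSQL 12+); checking replay status first is a runbook convention assumed here, not a requirement:

```sql
-- 1. On the standby: record how far replication got, so you know roughly how
--    much data you are accepting as lost (your actual recovery point).
SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();

-- 2. Promote the standby to a read-write primary.
--    wait => true blocks until promotion completes or wait_seconds elapses.
SELECT pg_promote(wait => true, wait_seconds => 60);

-- 3. Confirm the server has left recovery and now accepts writes.
SELECT pg_is_in_recovery();   -- should return false after promotion
```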
Post-failover operations:
Failback planning:
Once the primary region recovers, you need a plan to return to normal operations:
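On the database side, failback usually means turning the recovered old primary into a standby of the promoted DR server, letting it catch up, and then performing a planned switchover. A minimal sketch of the replication side; resynchronizing the old primary itself (pg_rewind or a fresh base backup) happens outside SQL and is assumed here:

```sql
-- On the new primary (the promoted DR server): create a slot for the
-- recovering old primary to stream from, mirroring the original topology.
SELECT pg_create_physical_replication_slot('failback_slot');

-- Once the old primary is attached as a standby, watch it catch up before
-- scheduling the switchover back to the primary region.
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS catchup_remaining_bytes
FROM pg_stat_replication;
```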
DR runbooks become outdated quickly as infrastructure evolves. Establish a regular review cycle (quarterly recommended) and update procedures after any infrastructure changes. Include runbook verification as part of architecture change management.
An untested DR plan is not a plan—it's a hope. Real disasters expose gaps that documentation reviews cannot. Regular, realistic DR testing is essential.
Types of DR tests:
| Test Type | Description | Frequency | Risk Level |
|---|---|---|---|
| Tabletop Exercise | Walk through procedures without execution | Quarterly | None |
| Component Test | Test individual steps (failover, restore) | Monthly | Low |
| Parallel Test | Bring up DR without redirecting traffic | Quarterly | Low |
| Full Failover | Complete failover and operate from DR | Annually | Medium |
| Surprise Drill | Unannounced DR exercise | Annually | High |
What to validate during DR tests:
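On the database side, a few quick checks cover most of what a parallel test needs to prove. A minimal sketch against the PostgreSQL DR standby from earlier; the 15-minute freshness threshold is an arbitrary stand-in for your real RPO:

```sql
-- Is the DR server actually a standby that is receiving WAL right now?
-- (Returns no rows if the WAL receiver process is not running.)
SELECT pg_is_in_recovery() AS is_standby, status, sender_host, sender_port
FROM pg_stat_wal_receiver;

-- Is the replicated data fresh enough to meet the stated RPO?
SELECT now() - pg_last_xact_replay_timestamp()                           AS data_age,
       (now() - pg_last_xact_replay_timestamp()) < interval '15 minutes' AS within_rpo;
```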
Game Day methodology:
The most effective DR test is a "Game Day"—a scheduled, announced exercise where the team executes DR procedures against production (during low-traffic periods):
Netflix's famous "Chaos Monkey" and broader Chaos Engineering practices extend this concept to continuous, smaller-scale failure injection.
Common issues discovered during DR tests: expired credentials, incorrect DNS records, insufficient DR capacity, missing dependencies, outdated runbooks, and slow team response due to unfamiliarity. Better to discover these in a planned test than during an actual disaster.
Understanding real-world disasters helps in designing robust DR strategies.
Scenario 1: Cloud Provider Region Outage (AWS US-East-1, 2017)
What happened:
Lessons:
Scenario 2: Data Center Fire (OVHcloud Strasbourg, 2021)
What happened:
Lessons:
Scenario 3: Ransomware Attack (Multiple Organizations)
What happens:
Lessons:
Scenario 4: Submarine Cable Cut (Multiple Events)
What happens:
Lessons:
| Disaster Type | Examples | DR Strategy |
|---|---|---|
| Infrastructure failure | Hardware crash, disk failure | HA (automatic failover within region) |
| Site failure | Data center fire, flooding, power | Cross-region replication + DR site |
| Regional failure | Cloud region outage, natural disaster | Multi-region active-active or warm standby |
| Cyber attack | Ransomware, data destruction | Offline/immutable backups, air-gapped |
| Network partition | Cable cut, routing failure | Multi-path connectivity, local caching |
| Provider failure | Cloud provider business failure | Multi-cloud strategy (complex) |
No single DR strategy protects against all scenarios. Layer defenses: automatic HA within region, asynchronous replication to DR region, regular backups to separate storage, and periodic offline/immutable copies. Each layer protects against different failure modes.
Disaster recovery protects organizations from catastrophic data loss and extended outages. Let's consolidate the key concepts:
Module complete:
You have now explored all four motivations for database replication: high availability, read scaling, geographic distribution for lower latency, and disaster recovery.
These motivations are not mutually exclusive—a well-designed replication architecture often addresses multiple goals simultaneously. Cross-region replicas, for example, provide read scaling, geographic distribution, AND disaster recovery.
In subsequent modules, we'll explore how to implement replication: leader-follower patterns, leaderless approaches, quorum-based systems, and partitioning strategies that complement replication.
You now have a comprehensive understanding of why database replication matters. You can articulate the business case for replication, define appropriate RPO/RTO objectives, design DR architectures, and implement testing strategies. You're ready to dive into the implementation patterns in the following modules.