When Hurricane Sandy struck the northeastern United States in 2012, it flooded data centers in lower Manhattan, including those of major financial institutions. Companies with geographically distant DR sites recovered within hours. Those without such sites either lost data permanently or spent weeks in emergency recovery mode.
A disaster recovery strategy is only as good as the site it recovers to. Having perfect backups and flawless failover procedures means nothing if there's nowhere to fail over to. DR sites—the physical and virtual infrastructure positioned to take over when the primary site fails—are the destination that makes recovery possible.
This page explores DR site architecture: the types of sites available, how to choose between them, where to locate them, and how to operate them effectively. By understanding DR sites, you'll complete your disaster recovery knowledge with the infrastructure that makes recovery real.
By the end of this page, you will understand the spectrum of DR site options from hot to cold, know how to select appropriate site types based on RTO/RPO requirements, learn geographic and regulatory considerations for site placement, and understand the operational realities of maintaining DR infrastructure.
DR sites exist on a spectrum from fully active (indistinguishable from the primary) to completely inactive (requiring significant work to activate). The appropriate choice depends on your RTO requirements, budget, and operational complexity tolerance.
The Temperature Metaphor:
DR sites are traditionally categorized by "temperature", a metaphor for their readiness level: hot sites run continuously, warm sites need some activation work before going live, and cold sites require extensive setup.
Beyond these traditional categories, modern options include cloud-based DR and mobile sites. The table below compares the full spectrum:
| Type | Typical RTO | Data Readiness | Cost (Relative) | Best For |
|---|---|---|---|---|
| Hot Site | Minutes to 1 hour | Real-time replication | 80-100% of primary | Mission-critical systems |
| Warm Site | Hours to 1 day | Recent backup + logs | 30-50% of primary | Business-critical systems |
| Cold Site | Days to 1 week | Backup restoration | 10-20% of primary | Lower-priority systems |
| Cloud DR | Variable (hours) | Backup + automation | Pay-per-use | Flexible recovery needs |
| Mobile Site | 1-3 days | On-site restoration | Per-deployment | Events, remote locations |
Many organizations operate multiple DR tiers: a hot site within the same metro area for high-frequency failures, plus a warm or cold site in a distant region for catastrophic scenarios. This tiered approach optimizes cost while addressing different risk levels.
A hot site is a fully operational duplicate of your primary data center, running continuously and ready to assume production workloads with minimal or zero switchover time. It represents the highest level of disaster recovery preparedness.
Hot Site Characteristics:
Infrastructure:
Data Synchronization:
Operational State:
Hot Site Configurations:
Active-Passive:
The DR site runs continuously but serves no production traffic. All writes go to the primary; the hot site maintains real-time replicas.
Advantages: Simple application architecture, clear primary/secondary roles
Disadvantages: DR infrastructure sits idle, capacity may be underutilized
Active-Active:
Both sites serve production traffic simultaneously. Each site can handle the other's load if one fails.
Advantages: Better resource utilization, proven capacity under load
Disadvantages: Complex data consistency, requires conflict resolution
Active-Read:
Primary handles writes; DR site serves read queries actively.
Advantages: DR site is working and validated, reduces primary load
Disadvantages: Replication lag may cause stale reads
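As a minimal sketch of the active-read pattern (hostnames, credentials, database, and the `orders` table are hypothetical), an application or routing layer can send reads to the replica only after confirming it is in recovery and its replay lag is within tolerance, falling back to the primary otherwise:

```bash
# Route reads to the DR replica only when it is healthy and fresh enough.
PRIMARY=primary-db.example.com
REPLICA=dr-db.example.com
MAX_LAG_SECONDS=30

# Confirm the DR node really is a streaming replica
IN_RECOVERY=$(psql -h "$REPLICA" -U app -d appdb -At -c "SELECT pg_is_in_recovery();")

# Measure replay lag in seconds (0 if no replay timestamp is available)
LAG=$(psql -h "$REPLICA" -U app -d appdb -At -c \
  "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int, 0);")

if [ "$IN_RECOVERY" = "t" ] && [ "$LAG" -le "$MAX_LAG_SECONDS" ]; then
  # Replica is current enough: serve the read query from the DR site
  psql -h "$REPLICA" -U app -d appdb -c "SELECT COUNT(*) FROM orders;"
else
  # Replica is stale or not replicating: fall back to the primary
  psql -h "$PRIMARY" -U app -d appdb -c "SELECT COUNT(*) FROM orders;"
fi
```

Real read/write splitting is usually done in the application or a proxy layer, but the staleness check is the same idea.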
| Component | Typical Cost Factor | Notes |
|---|---|---|
| Hardware/Infrastructure | 80-100% of primary | Must match or exceed primary capacity |
| Software Licensing | 100% of primary | Running systems require full licenses |
| Network Connectivity | High bandwidth required | Replication traffic + failover capacity |
| Facility Costs | Full data center costs | Power, cooling, physical security |
| Operations Staff | Partial to full staffing | At least monitoring, possibly full ops |
| Total Relative Cost | 80-120% of primary | May exceed primary if provisioned to a higher specification |
A hot site represents significant investment sitting largely idle. Consider using DR resources productively: run development/staging environments, execute batch processing jobs, serve read-heavy analytics workloads, or run lower-priority services with the understanding they'll be displaced during DR activation. This 'warm-running' approach validates infrastructure while extracting value.
Warm and cold sites trade recovery speed for cost savings. They're appropriate for systems where longer RTO is acceptable or where budget constraints preclude hot site investment.
Warm Sites:
A warm site has infrastructure in place but requires some activation work before becoming operational.
Typical Configuration:
Activation Process:
Typical RTO: 4-24 hours depending on data volume and automation level
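A sketch of the database portion of a warm-site activation, assuming base backups and WAL archives are shipped to object storage (the bucket name, paths, and data directory are hypothetical). It follows standard PostgreSQL point-in-time recovery: restore the latest base backup, then replay archived WAL until the archive is exhausted.

```bash
#!/bin/bash
set -euo pipefail

PGDATA=/var/lib/postgresql/15/main
BUCKET=s3://example-dr-backups

# 1. Pull the most recent base backup to the warm site
aws s3 cp "$BUCKET/base/latest/base.tar.gz" /tmp/base.tar.gz

# 2. Lay down a fresh data directory
systemctl stop postgresql || true
rm -rf "$PGDATA" && mkdir -p "$PGDATA" && chmod 700 "$PGDATA"
tar -xzf /tmp/base.tar.gz -C "$PGDATA"

# 3. Configure WAL replay from the archive and request recovery
cat >> "$PGDATA/postgresql.auto.conf" <<EOF
restore_command = 'aws s3 cp $BUCKET/wal/%f %p'
EOF
touch "$PGDATA/recovery.signal"

# 4. Start PostgreSQL; it replays WAL to the end of the archive,
#    then exits recovery and accepts read/write traffic
chown -R postgres:postgres "$PGDATA"
systemctl start postgresql
```

The same runbook should cover application servers, DNS changes, and validation steps; the data restore is usually the longest-running piece and drives the 4-24 hour RTO.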
Cold Sites:
A cold site provides physical space and basic infrastructure but requires extensive setup before use.
Typical Configuration:
Activation Process:
Typical RTO: 1-7 days depending on complexity
Cold Site Variations:
Cold site activation is often underestimated. In a real disaster, you're working with stressed staff, potentially unfamiliar equipment, and time pressure. What takes 2 days in a controlled test may take a week during an actual disaster. Build significant buffer into cold site RTO estimates, and test activation regularly to validate assumptions.
Cloud computing has transformed DR site economics. Instead of maintaining idle infrastructure, organizations can provision DR capacity on-demand, paying only during actual recovery. This "DR-as-a-Service" model has made robust DR accessible to organizations that couldn't previously afford hot sites.
Cloud DR Approaches:
Pilot Light:
Minimal always-on footprint with rapid scale-up capability.
Typical RTO: 1-4 hours
Cost: 10-20% of primary (ongoing) + burst capacity during DR
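A sketch of pilot-light activation using the AWS CLI (identifiers, regions, and instance sizes are hypothetical): the always-on footprint is a small cross-region read replica and a dormant application tier, which are resized, promoted, and scaled out during DR.

```bash
#!/bin/bash
set -euo pipefail

REGION=us-west-2
DB=prod-database-dr-replica
ASG=prod-app-dr

# 1. Resize the replica up to production capacity
aws rds modify-db-instance \
  --region "$REGION" \
  --db-instance-identifier "$DB" \
  --db-instance-class db.r6g.xlarge \
  --apply-immediately
aws rds wait db-instance-available --region "$REGION" --db-instance-identifier "$DB"

# 2. Promote the read replica to a standalone, writable database
aws rds promote-read-replica --region "$REGION" --db-instance-identifier "$DB"
aws rds wait db-instance-available --region "$REGION" --db-instance-identifier "$DB"

# 3. Scale out the dormant application tier
aws autoscaling update-auto-scaling-group \
  --region "$REGION" \
  --auto-scaling-group-name "$ASG" \
  --min-size 4 --desired-capacity 8
```

The scale-up and promotion steps dominate the 1-4 hour RTO, which is why they should be scripted and rehearsed rather than performed by hand.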
Warm Standby:
Scaled-down but functional replica running continuously.
Typical RTO: 30 minutes to 2 hours
Cost: 30-50% of primary
Multi-Region Active-Active:
Full deployment in multiple cloud regions, all serving traffic.
Typical RTO: Near-zero (traffic rerouting only)
Cost: 100%+ (multiple full deployments)
Backup and Restore:
No infrastructure running; full deployment from backups on-demand.
Typical RTO: 4-24 hours
Cost: Storage only (until a DR event)
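A sketch of the backup-and-restore approach, assuming snapshots are already copied to the DR region and the surrounding environment is defined in Terraform (the snapshot name, identifiers, and the `region` variable are hypothetical). The Terraform example that follows shows the complementary pilot-light infrastructure where a small replica stays running instead.

```bash
#!/bin/bash
set -euo pipefail

REGION=us-west-2
SNAPSHOT=prod-database-nightly-copy

# 1. Stand up the surrounding infrastructure from code
terraform init
terraform apply -auto-approve -var "region=$REGION"

# 2. Restore the database from the most recent copied snapshot
aws rds restore-db-instance-from-db-snapshot \
  --region "$REGION" \
  --db-instance-identifier prod-database-restored \
  --db-snapshot-identifier "$SNAPSHOT" \
  --db-instance-class db.r6g.xlarge
aws rds wait db-instance-available --region "$REGION" \
  --db-instance-identifier prod-database-restored
```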
```hcl
# Terraform Example: Cloud DR for PostgreSQL Database
# Pilot Light approach with cross-region replication

# Primary RDS instance in us-east-1
resource "aws_db_instance" "primary" {
  identifier        = "prod-database-primary"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 500

  multi_az                = true
  backup_retention_period = 7
  backup_window           = "03:00-04:00"

  # Enable automatic backups and replication
  storage_encrypted = true
  kms_key_id        = aws_kms_key.database.arn

  # Performance configuration
  performance_insights_enabled = true

  tags = {
    Environment = "production"
    Role        = "primary"
  }
}

# Cross-region read replica in us-west-2 (DR region)
resource "aws_db_instance" "dr_replica" {
  provider            = aws.west
  identifier          = "prod-database-dr-replica"
  replicate_source_db = aws_db_instance.primary.arn

  # Can be smaller for cost savings during normal operation
  # but must be able to scale up during DR
  instance_class = "db.r6g.large"

  # DR replica settings
  multi_az          = false # Single AZ for cost, switch to multi during DR
  storage_encrypted = true

  tags = {
    Environment = "production"
    Role        = "dr-replica"
  }
}

# Automated failover using Route53 health checks
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_db_instance.primary.address
  port              = 5432
  type              = "TCP"
  request_interval  = 30
  failure_threshold = 3

  tags = {
    Name = "primary-db-health"
  }
}

resource "aws_route53_record" "database" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "db.example.com"
  type    = "CNAME"

  # Failover routing policy
  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
  ttl             = 60
  records         = [aws_db_instance.primary.address]
}

resource "aws_route53_record" "database_dr" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "db.example.com"
  type    = "CNAME"

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
  ttl            = 60
  records        = [aws_db_instance.dr_replica.address]
}
```

Cloud DR requires infrastructure defined as code (Terraform, CloudFormation, Pulumi). Without it, deploying your full environment during a crisis becomes a manual, error-prone process. With IaC, you can validate your DR deployment regularly, ensure it matches your primary environment, and deploy it automatically when needed.
The physical location of your DR site is a strategic decision that affects recovery capability, performance, and regulatory compliance. Distance from the primary site involves fundamental tradeoffs.
Distance Considerations:
Too Close (< 50 km):
Both sites may be affected by the same regional disaster (earthquake, flood, hurricane, power grid failure).
Risks:
When appropriate:
Optimal Distance (100-500 km):
Beyond most single-event disaster zones but close enough for reasonable replication performance.
Advantages:
Typical examples:
Very Distant (> 1000 km / Different Continent):
Maximum geographic diversification for catastrophic scenario protection.
Advantages:
Challenges:
Site Selection Factors:
| Factor | Closer is Better | Farther is Better |
|---|---|---|
| Replication lag | ✓ | |
| Same disaster zone risk | ✓ | |
| Staff accessibility | ✓ | |
| Network cost | ✓ | |
| Political diversification | ✓ | |
| Utility independence | ✓ |
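To make the replication-lag row concrete, a rough speed-of-light estimate (assuming roughly 200,000 km/s signal propagation in fiber and ignoring routing and equipment overhead, so real numbers will be worse) shows how distance translates into a per-commit cost for synchronous replication:

```bash
# Back-of-the-envelope replication latency by distance.
# Synchronous replication pays at least one round trip per commit.
for KM in 50 100 500 1000 5000; do
  awk -v d="$KM" 'BEGIN {
    one_way_ms = d / 200.0            # ~200 km per millisecond in fiber
    rtt_ms     = 2 * one_way_ms
    printf "%5d km: ~%.1f ms one-way, ~%.1f ms added per synchronous commit\n",
           d, one_way_ms, rtt_ms
  }'
done
```

At a few hundred kilometers the added commit latency is single-digit milliseconds, which many workloads tolerate; at intercontinental distances synchronous replication usually becomes impractical and asynchronous replication (with its RPO implications) takes over.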
Regulatory and Compliance Considerations:
Data sovereignty laws may restrict where data can be stored or processed:
GDPR (EU):
Data Residency Requirements:
Industry Regulations:
Multi-Region Strategy:
Many organizations operate multiple DR tiers:
Review historical disaster data for potential DR site locations. The same locations that have experienced earthquakes, hurricanes, floods, or other disasters will likely experience them again. Avoid placing DR sites in high-risk zones, even if they're otherwise attractive options.
Operating a DR site requires ongoing attention to ensure it's ready when needed. A neglected DR site often fails when called upon, negating the entire DR investment.
Operational Challenges:
Configuration Drift:
Over time, the primary site evolves but DR site updates lag behind:
Mitigation:
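One practical mitigation is to script the comparison and run it on a schedule rather than relying on memory. A minimal sketch (hostnames, credentials, and the sampled settings are hypothetical) that diffs key PostgreSQL parameters between primary and DR:

```bash
#!/bin/bash
# Detect configuration drift between primary and DR database settings.
QUERY="SELECT name, setting FROM pg_settings
       WHERE name IN ('max_connections','shared_buffers','work_mem','wal_level')
       ORDER BY name;"

psql -h primary-db.example.com -U monitor -d postgres -At -c "$QUERY" > /tmp/primary_settings.txt
psql -h dr-db.example.com      -U monitor -d postgres -At -c "$QUERY" > /tmp/dr_settings.txt

if diff -u /tmp/primary_settings.txt /tmp/dr_settings.txt; then
  echo "✅ No configuration drift detected in sampled settings"
else
  echo "❌ Configuration drift detected - review the diff above"
fi
```

The same approach extends to OS packages, extension versions, and infrastructure code: anything that can drift should be compared automatically.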
Replication Health:
Replication can fail silently, leaving DR data stale:
Mitigation:
Capacity Planning:
DR site must be able to handle full production load:
Mitigation:
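One way to exercise DR capacity without a full failover is a periodic read-only load test against the replica with pgbench (connection details and load figures below are hypothetical, and the run assumes the standard pgbench tables were initialized on the primary with `pgbench -i` and have replicated across):

```bash
# Read-only load test against the DR replica.
#   -S  select-only built-in workload (safe against a read replica)
#   -c  concurrent clients, sized to production peak
#   -j  worker threads
#   -T  duration in seconds
#   -P  progress report interval in seconds
pgbench \
  -h dr-db.example.com -U monitor \
  -S \
  -c 100 -j 8 \
  -T 600 \
  -P 30 \
  postgres
```

A replica that cannot sustain production-level read throughput is a strong hint that it will not sustain production writes after promotion either.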
Staff Readiness:
People may not remember procedures when disaster strikes:
Mitigation:
```bash
#!/bin/bash
# DR Site Health Check Script
# Run regularly to validate DR readiness

echo "=== DR Site Health Check Report ==="
echo "Date: $(date)"
echo ""

# 1. Replication Status
echo "--- Replication Status ---"
REPL_LAG=$(psql -h dr-db.example.com -U monitor -d postgres -t -c \
  "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int;")

if [ "$REPL_LAG" -lt 60 ]; then
  echo "✅ Replication lag: ${REPL_LAG}s (Healthy)"
elif [ "$REPL_LAG" -lt 300 ]; then
  echo "⚠️ Replication lag: ${REPL_LAG}s (Warning)"
else
  echo "❌ Replication lag: ${REPL_LAG}s (Critical)"
fi

# 2. Data Consistency Check (sample tables)
echo ""
echo "--- Data Consistency Check ---"
PRIMARY_COUNT=$(psql -h primary-db.example.com -t -c "SELECT COUNT(*) FROM users;")
DR_COUNT=$(psql -h dr-db.example.com -t -c "SELECT COUNT(*) FROM users;")
DIFF=$((PRIMARY_COUNT - DR_COUNT))

if [ "$DIFF" -lt 100 ]; then
  echo "✅ User count difference: $DIFF (Acceptable)"
else
  echo "❌ User count difference: $DIFF (Data divergence detected)"
fi

# 3. Configuration Comparison
echo ""
echo "--- Configuration Check ---"
PRIMARY_VERSION=$(psql -h primary-db.example.com -t -c "SHOW server_version;")
DR_VERSION=$(psql -h dr-db.example.com -t -c "SHOW server_version;")

if [ "$PRIMARY_VERSION" == "$DR_VERSION" ]; then
  echo "✅ PostgreSQL versions match: $PRIMARY_VERSION"
else
  echo "❌ Version mismatch! Primary: $PRIMARY_VERSION, DR: $DR_VERSION"
fi

# 4. Disk Space Check
echo ""
echo "--- DR Disk Space ---"
DISK_USAGE=$(ssh dr-db.example.com "df -h /var/lib/postgresql | tail -1 | awk '{print \$5}'")
echo "DR database disk usage: $DISK_USAGE"

# 5. Last Failover Test
echo ""
echo "--- Testing History ---"
LAST_TEST=$(cat /var/log/dr/last_test_date.txt 2>/dev/null || echo "Unknown")
echo "Last full failover test: $LAST_TEST"

# Calculate days since last test
if [ "$LAST_TEST" != "Unknown" ]; then
  DAYS_SINCE=$(( ( $(date +%s) - $(date -d "$LAST_TEST" +%s) ) / 86400 ))
  if [ "$DAYS_SINCE" -gt 180 ]; then
    echo "❌ Warning: $DAYS_SINCE days since last test (>180 days)"
  else
    echo "✅ $DAYS_SINCE days since last test"
  fi
fi

echo ""
echo "=== End of Health Check ==="
```

Manual DR site maintenance is unsustainable. Automate health checks, configuration synchronization, replication monitoring, and capacity alerts. Schedule automated reports that highlight DR readiness issues before they become critical. The goal is continuous validation, not periodic audits.
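Scheduling is the simplest part to automate. A sketch of crontab entries (paths and the mail address are hypothetical) that run the health check daily and mail a weekly summary:

```bash
# m h dom mon dow  command
0 6 * * *  /usr/local/bin/dr_health_check.sh > /var/log/dr/health_$(date +\%F).log 2>&1
30 6 * * 1 mail -s "Weekly DR readiness report" ops@example.com < /var/log/dr/health_$(date +\%F).log
```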
DR sites represent significant investment—often approaching the cost of the primary infrastructure. Optimizing this cost while maintaining acceptable protection levels is essential for sustainable DR programs.
Cost Optimization Strategies:
1. Right-Size for Purpose
Not every system needs the same DR tier:
| System Type | Suggested DR Tier | Rationale |
|---|---|---|
| Core transaction systems | Hot | Business-critical, zero RPO |
| Customer-facing applications | Warm/Hot | High impact but some RTO acceptable |
| Internal applications | Warm | Important but longer RTO acceptable |
| Development/test systems | Cold/None | Can be rebuilt, not mission-critical |
| Archives/reporting | Cold | Data preserved in backups |
2. Cloud Economics
Leverage cloud pricing models:
3. Resource Sharing
Maximize utilization of DR infrastructure:
4. Graduated Protection
Implement protection that scales with criticality:
5. Regular Cost Reviews
DR costs tend to grow uncontrolled without governance:
| Category | Hot Site | Cloud Pilot Light | Backup Only |
|---|---|---|---|
| Infrastructure | $500K | $80K | $20K |
| Networking | $50K | $30K | $5K |
| Software licenses | $200K | $50K | Included in backup |
| Operations staff | $150K | $50K | $20K |
| Testing/validation | $50K | $30K | $10K |
| Total annual | $950K | $240K | $55K |
| RTO capability | 15 minutes | 2-4 hours | 24-72 hours |
When justifying DR costs, calculate the expected annual loss without DR: (Probability of disaster) × (Impact of disaster without DR). If your region has 5% annual probability of a major outage and impact would be $10M, expected loss is $500K/year. Any DR solution costing less than this provides positive ROI—before considering reputational damage, regulatory penalties, and other hard-to-quantify losses.
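A worked version of that expected-loss comparison, using the illustrative probability, impact, and annual costs from the table above (a finer model would also discount the loss by each option's residual downtime and data loss):

```bash
awk 'BEGIN {
  p_disaster    = 0.05        # 5% annual probability of a major outage
  impact        = 10000000    # $10M impact without DR
  expected_loss = p_disaster * impact

  printf "Expected annual loss without DR: $%d\n\n", expected_loss

  # Annual costs taken from the comparison table above
  split("HotSite:950000 PilotLight:240000 BackupOnly:55000", options, " ")
  for (i in options) {
    split(options[i], kv, ":")
    printf "%-11s annual cost $%7d -> net benefit vs. no DR: $%d\n",
           kv[1], kv[2], expected_loss - kv[2]
  }
}'
```

On these illustrative numbers, the pilot-light and backup-only options pay for themselves on expected loss alone, while the hot site needs the harder-to-quantify factors (reputation, penalties, contractual obligations) or a higher disaster probability to justify its cost.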
DR sites are the infrastructure foundation that makes disaster recovery possible. Let's consolidate the key takeaways:
Module Complete:
Congratulations! You've completed the Disaster Recovery module. You now understand the complete DR lifecycle:
With this knowledge, you can design, implement, and operate disaster recovery solutions that ensure your database systems survive any disaster scenario.
You've mastered the essential components of disaster recovery for database systems. From planning through execution, from RTO/RPO targets through DR site infrastructure, you now possess the knowledge to build and maintain robust DR capabilities. Remember: DR is a process, not a project. Continuous validation, testing, and improvement ensure that when disaster strikes, your systems—and your organization—will survive. Next, explore Backup Best Practices to round out your knowledge of database resilience and recoverability.