When Hurricane Sandy struck the northeastern United States in 2012, it flooded data centers in lower Manhattan, including those of major financial institutions. Companies with geographically distant DR sites recovered within hours. Those without such sites either lost data permanently or spent weeks in emergency recovery mode.
A disaster recovery strategy is only as good as the site it recovers to. Having perfect backups and flawless failover procedures means nothing if there's nowhere to fail over to. DR sites—the physical and virtual infrastructure positioned to take over when the primary site fails—are the destination that makes recovery possible.
This page explores DR site architecture: the types of sites available, how to choose between them, where to locate them, and how to operate them effectively. By understanding DR sites, you'll complete your disaster recovery knowledge with the infrastructure that makes recovery real.
By the end of this page, you will understand the spectrum of DR site options from hot to cold, know how to select appropriate site types based on RTO/RPO requirements, learn geographic and regulatory considerations for site placement, and understand the operational realities of maintaining DR infrastructure.
DR sites exist on a spectrum from fully active (indistinguishable from the primary) to completely inactive (requiring significant work to activate). The appropriate choice depends on your RTO requirements, budget, and operational complexity tolerance.
The Temperature Metaphor:
DR sites are traditionally categorized by "temperature", a metaphor for their readiness level: hot sites run continuously, warm sites need some activation work before going live, and cold sites require extensive setup.
Beyond these traditional categories, modern options include cloud-based DR and mobile sites. The table below compares the full spectrum:
| Type | Typical RTO | Data Readiness | Cost (Relative) | Best For |
|---|---|---|---|---|
| Hot Site | Minutes to 1 hour | Real-time replication | 80-100% of primary | Mission-critical systems |
| Warm Site | Hours to 1 day | Recent backup + logs | 30-50% of primary | Business-critical systems |
| Cold Site | Days to 1 week | Backup restoration | 10-20% of primary | Lower-priority systems |
| Cloud DR | Variable (hours) | Backup + automation | Pay-per-use | Flexible recovery needs |
| Mobile Site | 1-3 days | On-site restoration | Per-deployment | Events, remote locations |
Many organizations operate multiple DR tiers: a hot site within the same metro area for high-frequency failures, plus a warm or cold site in a distant region for catastrophic scenarios. This tiered approach optimizes cost while addressing different risk levels.
A hot site is a fully operational duplicate of your primary data center, running continuously and ready to assume production workloads with minimal or zero switchover time. It represents the highest level of disaster recovery preparedness.
Hot Site Characteristics:
Infrastructure:
Data Synchronization:
Operational State:
Hot Site Configurations:
Active-Passive:
The DR site runs continuously but serves no production traffic. All writes go to the primary; the hot site maintains real-time replicas.
Advantages: Simple application architecture, clear primary/secondary roles
Disadvantages: DR infrastructure sits idle, capacity may be underutilized
Active-Active:
Both sites serve production traffic simultaneously. Each site can handle the other's load if one fails.
Advantages: Better resource utilization, proven capacity under load
Disadvantages: Complex data consistency, requires conflict resolution
Active-Read:
Primary handles writes; DR site serves read queries actively.
Advantages: DR site is working and validated, reduces primary load
Disadvantages: Replication lag may cause stale reads
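As a minimal sketch of the active-read pattern (hostnames, credentials, database, and the `orders` table are hypothetical), an application or routing layer can send reads to the replica only after confirming it is in recovery and its replay lag is within tolerance, falling back to the primary otherwise:

```bash
# Route reads to the DR replica only when it is healthy and fresh enough.
PRIMARY=primary-db.example.com
REPLICA=dr-db.example.com
MAX_LAG_SECONDS=30

# Confirm the DR node really is a streaming replica
IN_RECOVERY=$(psql -h "$REPLICA" -U app -d appdb -At -c "SELECT pg_is_in_recovery();")

# Measure replay lag in seconds (0 if no replay timestamp is available)
LAG=$(psql -h "$REPLICA" -U app -d appdb -At -c \
  "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int, 0);")

if [ "$IN_RECOVERY" = "t" ] && [ "$LAG" -le "$MAX_LAG_SECONDS" ]; then
  # Replica is current enough: serve the read query from the DR site
  psql -h "$REPLICA" -U app -d appdb -c "SELECT COUNT(*) FROM orders;"
else
  # Replica is stale or not replicating: fall back to the primary
  psql -h "$PRIMARY" -U app -d appdb -c "SELECT COUNT(*) FROM orders;"
fi
```

Real read/write splitting is usually done in the application or a proxy layer, but the staleness check is the same idea.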
| Component | Typical Cost Factor | Notes |
|---|---|---|
| Hardware/Infrastructure | 80-100% of primary | Must match or exceed primary capacity |
| Software Licensing | 100% of primary | Running systems require full licenses |
| Network Connectivity | High bandwidth required | Replication traffic + failover capacity |
| Facility Costs | Full data center costs | Power, cooling, physical security |
| Operations Staff | Partial to full staffing | At least monitoring, possibly full ops |
| Total Relative Cost | 80-120% of primary | May exceed primary if provisioned to a higher specification |
A hot site represents significant investment sitting largely idle. Consider using DR resources productively: run development/staging environments, execute batch processing jobs, serve read-heavy analytics workloads, or run lower-priority services with the understanding they'll be displaced during DR activation. This 'warm-running' approach validates infrastructure while extracting value.
Warm and cold sites trade recovery speed for cost savings. They're appropriate for systems where longer RTO is acceptable or where budget constraints preclude hot site investment.
Warm Sites:
A warm site has infrastructure in place but requires some activation work before becoming operational.
Typical Configuration:
Activation Process:
Typical RTO: 4-24 hours depending on data volume and automation level
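A sketch of the database portion of a warm-site activation, assuming base backups and WAL archives are shipped to object storage (the bucket name, paths, and data directory are hypothetical). It follows standard PostgreSQL point-in-time recovery: restore the latest base backup, then replay archived WAL until the archive is exhausted.

```bash
#!/bin/bash
set -euo pipefail

PGDATA=/var/lib/postgresql/15/main
BUCKET=s3://example-dr-backups

# 1. Pull the most recent base backup to the warm site
aws s3 cp "$BUCKET/base/latest/base.tar.gz" /tmp/base.tar.gz

# 2. Lay down a fresh data directory
systemctl stop postgresql || true
rm -rf "$PGDATA" && mkdir -p "$PGDATA" && chmod 700 "$PGDATA"
tar -xzf /tmp/base.tar.gz -C "$PGDATA"

# 3. Configure WAL replay from the archive and request recovery
cat >> "$PGDATA/postgresql.auto.conf" <<EOF
restore_command = 'aws s3 cp $BUCKET/wal/%f %p'
EOF
touch "$PGDATA/recovery.signal"

# 4. Start PostgreSQL; it replays WAL to the end of the archive,
#    then exits recovery and accepts read/write traffic
chown -R postgres:postgres "$PGDATA"
systemctl start postgresql
```

The same runbook should cover application servers, DNS changes, and validation steps; the data restore is usually the longest-running piece and drives the 4-24 hour RTO.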
Cold Sites:
A cold site provides physical space and basic infrastructure but requires extensive setup before use.
Typical Configuration:
Activation Process:
Typical RTO: 1-7 days depending on complexity
Cold Site Variations:
Cold site activation is often underestimated. In a real disaster, you're working with stressed staff, potentially unfamiliar equipment, and time pressure. What takes 2 days in a controlled test may take a week during an actual disaster. Build significant buffer into cold site RTO estimates, and test activation regularly to validate assumptions.
Cloud computing has transformed DR site economics. Instead of maintaining idle infrastructure, organizations can provision DR capacity on-demand, paying only during actual recovery. This "DR-as-a-Service" model has made robust DR accessible to organizations that couldn't previously afford hot sites.
Cloud DR Approaches:
Pilot Light:
Minimal always-on footprint with rapid scale-up capability.
Typical RTO: 1-4 hours
Cost: 10-20% of primary (ongoing) + burst capacity during DR
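A sketch of pilot-light activation using the AWS CLI (identifiers, regions, and instance sizes are hypothetical): the always-on footprint is a small cross-region read replica and a dormant application tier, which are resized, promoted, and scaled out during DR.

```bash
#!/bin/bash
set -euo pipefail

REGION=us-west-2
DB=prod-database-dr-replica
ASG=prod-app-dr

# 1. Resize the replica up to production capacity
aws rds modify-db-instance \
  --region "$REGION" \
  --db-instance-identifier "$DB" \
  --db-instance-class db.r6g.xlarge \
  --apply-immediately
aws rds wait db-instance-available --region "$REGION" --db-instance-identifier "$DB"

# 2. Promote the read replica to a standalone, writable database
aws rds promote-read-replica --region "$REGION" --db-instance-identifier "$DB"
aws rds wait db-instance-available --region "$REGION" --db-instance-identifier "$DB"

# 3. Scale out the dormant application tier
aws autoscaling update-auto-scaling-group \
  --region "$REGION" \
  --auto-scaling-group-name "$ASG" \
  --min-size 4 --desired-capacity 8
```

The scale-up and promotion steps dominate the 1-4 hour RTO, which is why they should be scripted and rehearsed rather than performed by hand.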
Warm Standby:
Scaled-down but functional replica running continuously.
Typical RTO: 30 minutes to 2 hours
Cost: 30-50% of primary
Multi-Region Active-Active:
Full deployment in multiple cloud regions, all serving traffic.
Typical RTO: Near-zero (traffic rerouting only)
Cost: 100%+ (multiple full deployments)
Backup and Restore:
No infrastructure running; full deployment from backups on-demand.
Typical RTO: 4-24 hours
Cost: Storage only (until a DR event)
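A sketch of the backup-and-restore approach, assuming snapshots are already copied to the DR region and the surrounding environment is defined in Terraform (the snapshot name, identifiers, and the `region` variable are hypothetical). The Terraform example that follows shows the complementary pilot-light infrastructure where a small replica stays running instead.

```bash
#!/bin/bash
set -euo pipefail

REGION=us-west-2
SNAPSHOT=prod-database-nightly-copy

# 1. Stand up the surrounding infrastructure from code
terraform init
terraform apply -auto-approve -var "region=$REGION"

# 2. Restore the database from the most recent copied snapshot
aws rds restore-db-instance-from-db-snapshot \
  --region "$REGION" \
  --db-instance-identifier prod-database-restored \
  --db-snapshot-identifier "$SNAPSHOT" \
  --db-instance-class db.r6g.xlarge
aws rds wait db-instance-available --region "$REGION" \
  --db-instance-identifier prod-database-restored
```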
```hcl
# Terraform Example: Cloud DR for PostgreSQL Database
# Pilot Light approach with cross-region replication

# Primary RDS instance in us-east-1
resource "aws_db_instance" "primary" {
  identifier        = "prod-database-primary"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 500

  multi_az                = true
  backup_retention_period = 7
  backup_window           = "03:00-04:00"

  # Enable automatic backups and replication
  storage_encrypted = true
  kms_key_id        = aws_kms_key.database.arn

  # Performance configuration
  performance_insights_enabled = true

  tags = {
    Environment = "production"
    Role        = "primary"
  }
}

# Cross-region read replica in us-west-2 (DR region)
resource "aws_db_instance" "dr_replica" {
  provider            = aws.west
  identifier          = "prod-database-dr-replica"
  replicate_source_db = aws_db_instance.primary.arn

  # Can be smaller for cost savings during normal operation
  # but must be able to scale up during DR
  instance_class = "db.r6g.large"

  # DR replica settings
  multi_az          = false # Single AZ for cost, switch to multi during DR
  storage_encrypted = true

  tags = {
    Environment = "production"
    Role        = "dr-replica"
  }
}

# Automated failover using Route53 health checks
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_db_instance.primary.address
  port              = 5432
  type              = "TCP"
  request_interval  = 30
  failure_threshold = 3

  tags = {
    Name = "primary-db-health"
  }
}

resource "aws_route53_record" "database" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "db.example.com"
  type    = "CNAME"

  # Failover routing policy
  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
  ttl             = 60
  records         = [aws_db_instance.primary.address]
}

resource "aws_route53_record" "database_dr" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "db.example.com"
  type    = "CNAME"

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
  ttl            = 60
  records        = [aws_db_instance.dr_replica.address]
}
```

Cloud DR requires infrastructure defined as code (Terraform, CloudFormation, Pulumi). Without it, deploying your full environment during a crisis becomes a manual, error-prone process. With IaC, you can validate your DR deployment regularly, ensure it matches your primary environment, and deploy it automatically when needed.
The physical location of your DR site is a strategic decision that affects recovery capability, performance, and regulatory compliance. Distance from the primary site involves fundamental tradeoffs.
Distance Considerations:
Too Close (< 50 km):
Both sites may be affected by the same regional disaster (earthquake, flood, hurricane, power grid failure).
Risks:
When appropriate:
Optimal Distance (100-500 km):
Beyond most single-event disaster zones but close enough for reasonable replication performance.
Advantages:
Typical examples:
Very Distant (> 1000 km / Different Continent):
Maximum geographic diversification for catastrophic scenario protection.
Advantages:
Challenges:
Site Selection Factors:
| Factor | Closer is Better | Farther is Better |
|---|---|---|
| Replication lag | ✓ | |
| Same disaster zone risk | ✓ | |
| Staff accessibility | ✓ | |
| Network cost | ✓ | |
| Political diversification | ✓ | |
| Utility independence | ✓ |
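To make the replication-lag row concrete, a rough speed-of-light estimate (assuming roughly 200,000 km/s signal propagation in fiber and ignoring routing and equipment overhead, so real numbers will be worse) shows how distance translates into a per-commit cost for synchronous replication:

```bash
# Back-of-the-envelope replication latency by distance.
# Synchronous replication pays at least one round trip per commit.
for KM in 50 100 500 1000 5000; do
  awk -v d="$KM" 'BEGIN {
    one_way_ms = d / 200.0            # ~200 km per millisecond in fiber
    rtt_ms     = 2 * one_way_ms
    printf "%5d km: ~%.1f ms one-way, ~%.1f ms added per synchronous commit\n",
           d, one_way_ms, rtt_ms
  }'
done
```

At a few hundred kilometers the added commit latency is single-digit milliseconds, which many workloads tolerate; at intercontinental distances synchronous replication usually becomes impractical and asynchronous replication (with its RPO implications) takes over.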
Regulatory and Compliance Considerations:
Data sovereignty laws may restrict where data can be stored or processed:
GDPR (EU):
Data Residency Requirements:
Industry Regulations:
Multi-Region Strategy:
Many organizations operate multiple DR tiers:
Review historical disaster data for potential DR site locations. The same locations that have experienced earthquakes, hurricanes, floods, or other disasters will likely experience them again. Avoid placing DR sites in high-risk zones, even if they're otherwise attractive options.
Operating a DR site requires ongoing attention to ensure it's ready when needed. A neglected DR site often fails when called upon, negating the entire DR investment.
Operational Challenges:
Configuration Drift:
Over time, the primary site evolves but DR site updates lag behind:
Mitigation:
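One practical mitigation is to script the comparison and run it on a schedule rather than relying on memory. A minimal sketch (hostnames, credentials, and the sampled settings are hypothetical) that diffs key PostgreSQL parameters between primary and DR:

```bash
#!/bin/bash
# Detect configuration drift between primary and DR database settings.
QUERY="SELECT name, setting FROM pg_settings
       WHERE name IN ('max_connections','shared_buffers','work_mem','wal_level')
       ORDER BY name;"

psql -h primary-db.example.com -U monitor -d postgres -At -c "$QUERY" > /tmp/primary_settings.txt
psql -h dr-db.example.com      -U monitor -d postgres -At -c "$QUERY" > /tmp/dr_settings.txt

if diff -u /tmp/primary_settings.txt /tmp/dr_settings.txt; then
  echo "✅ No configuration drift detected in sampled settings"
else
  echo "❌ Configuration drift detected - review the diff above"
fi
```

The same approach extends to OS packages, extension versions, and infrastructure code: anything that can drift should be compared automatically.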
Replication Health:
Replication can fail silently, leaving DR data stale:
Mitigation:
Capacity Planning:
DR site must be able to handle full production load:
Mitigation:
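One way to exercise DR capacity without a full failover is a periodic read-only load test against the replica with pgbench (connection details and load figures below are hypothetical, and the run assumes the standard pgbench tables were initialized on the primary with `pgbench -i` and have replicated across):

```bash
# Read-only load test against the DR replica.
#   -S  select-only built-in workload (safe against a read replica)
#   -c  concurrent clients, sized to production peak
#   -j  worker threads
#   -T  duration in seconds
#   -P  progress report interval in seconds
pgbench \
  -h dr-db.example.com -U monitor \
  -S \
  -c 100 -j 8 \
  -T 600 \
  -P 30 \
  postgres
```

A replica that cannot sustain production-level read throughput is a strong hint that it will not sustain production writes after promotion either.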
Staff Readiness:
People may not remember procedures when disaster strikes:
Mitigation:
```bash
#!/bin/bash
# DR Site Health Check Script
# Run regularly to validate DR readiness

echo "=== DR Site Health Check Report ==="
echo "Date: $(date)"
echo ""

# 1. Replication Status
echo "--- Replication Status ---"
REPL_LAG=$(psql -h dr-db.example.com -U monitor -d postgres -t -c \
  "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int;")

if [ "$REPL_LAG" -lt 60 ]; then
  echo "✅ Replication lag: ${REPL_LAG}s (Healthy)"
elif [ "$REPL_LAG" -lt 300 ]; then
  echo "⚠️ Replication lag: ${REPL_LAG}s (Warning)"
else
  echo "❌ Replication lag: ${REPL_LAG}s (Critical)"
fi

# 2. Data Consistency Check (sample tables)
echo ""
echo "--- Data Consistency Check ---"
PRIMARY_COUNT=$(psql -h primary-db.example.com -t -c "SELECT COUNT(*) FROM users;")
DR_COUNT=$(psql -h dr-db.example.com -t -c "SELECT COUNT(*) FROM users;")
DIFF=$((PRIMARY_COUNT - DR_COUNT))

if [ "$DIFF" -lt 100 ]; then
  echo "✅ User count difference: $DIFF (Acceptable)"
else
  echo "❌ User count difference: $DIFF (Data divergence detected)"
fi

# 3. Configuration Comparison
echo ""
echo "--- Configuration Check ---"
PRIMARY_VERSION=$(psql -h primary-db.example.com -t -c "SHOW server_version;")
DR_VERSION=$(psql -h dr-db.example.com -t -c "SHOW server_version;")

if [ "$PRIMARY_VERSION" == "$DR_VERSION" ]; then
  echo "✅ PostgreSQL versions match: $PRIMARY_VERSION"
else
  echo "❌ Version mismatch! Primary: $PRIMARY_VERSION, DR: $DR_VERSION"
fi

# 4. Disk Space Check
echo ""
echo "--- DR Disk Space ---"
DISK_USAGE=$(ssh dr-db.example.com "df -h /var/lib/postgresql | tail -1 | awk '{print \$5}'")
echo "DR database disk usage: $DISK_USAGE"

# 5. Last Failover Test
echo ""
echo "--- Testing History ---"
LAST_TEST=$(cat /var/log/dr/last_test_date.txt 2>/dev/null || echo "Unknown")
echo "Last full failover test: $LAST_TEST"

# Calculate days since last test
if [ "$LAST_TEST" != "Unknown" ]; then
  DAYS_SINCE=$(( ( $(date +%s) - $(date -d "$LAST_TEST" +%s) ) / 86400 ))
  if [ "$DAYS_SINCE" -gt 180 ]; then
    echo "❌ Warning: $DAYS_SINCE days since last test (>180 days)"
  else
    echo "✅ $DAYS_SINCE days since last test"
  fi
fi

echo ""
echo "=== End of Health Check ==="
```

Manual DR site maintenance is unsustainable. Automate health checks, configuration synchronization, replication monitoring, and capacity alerts. Schedule automated reports that highlight DR readiness issues before they become critical. The goal is continuous validation, not periodic audits.
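Scheduling is the simplest part to automate. A sketch of crontab entries (paths and the mail address are hypothetical) that run the health check daily and mail a weekly summary:

```bash
# m h dom mon dow  command
0 6 * * *  /usr/local/bin/dr_health_check.sh > /var/log/dr/health_$(date +\%F).log 2>&1
30 6 * * 1 mail -s "Weekly DR readiness report" ops@example.com < /var/log/dr/health_$(date +\%F).log
```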
DR sites represent significant investment—often approaching the cost of the primary infrastructure. Optimizing this cost while maintaining acceptable protection levels is essential for sustainable DR programs.
Cost Optimization Strategies:
1. Right-Size for Purpose
Not every system needs the same DR tier:
| System Type | Suggested DR Tier | Rationale |
|---|---|---|
| Core transaction systems | Hot | Business-critical, zero RPO |
| Customer-facing applications | Warm/Hot | High impact but some RTO acceptable |
| Internal applications | Warm | Important but longer RTO acceptable |
| Development/test systems | Cold/None | Can be rebuilt, not mission-critical |
| Archives/reporting | Cold | Data preserved in backups |
2. Cloud Economics
Leverage cloud pricing models:
3. Resource Sharing
Maximize utilization of DR infrastructure:
4. Graduated Protection
Implement protection that scales with criticality:
5. Regular Cost Reviews
DR costs tend to grow uncontrolled without governance:
| Category | Hot Site | Cloud Pilot Light | Backup Only |
|---|---|---|---|
| Infrastructure | $500K | $80K | $20K |
| Networking | $50K | $30K | $5K |
| Software licenses | $200K | $50K | Included in backup |
| Operations staff | $150K | $50K | $20K |
| Testing/validation | $50K | $30K | $10K |
| Total annual | $950K | $240K | $55K |
| RTO capability | 15 minutes | 2-4 hours | 24-72 hours |
When justifying DR costs, calculate the expected annual loss without DR: (Probability of disaster) × (Impact of disaster without DR). If your region has 5% annual probability of a major outage and impact would be $10M, expected loss is $500K/year. Any DR solution costing less than this provides positive ROI—before considering reputational damage, regulatory penalties, and other hard-to-quantify losses.
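A worked version of that expected-loss comparison, using the illustrative probability, impact, and annual costs from the table above (a finer model would also discount the loss by each option's residual downtime and data loss):

```bash
awk 'BEGIN {
  p_disaster    = 0.05        # 5% annual probability of a major outage
  impact        = 10000000    # $10M impact without DR
  expected_loss = p_disaster * impact

  printf "Expected annual loss without DR: $%d\n\n", expected_loss

  # Annual costs taken from the comparison table above
  split("HotSite:950000 PilotLight:240000 BackupOnly:55000", options, " ")
  for (i in options) {
    split(options[i], kv, ":")
    printf "%-11s annual cost $%7d -> net benefit vs. no DR: $%d\n",
           kv[1], kv[2], expected_loss - kv[2]
  }
}'
```

On these illustrative numbers, the pilot-light and backup-only options pay for themselves on expected loss alone, while the hot site needs the harder-to-quantify factors (reputation, penalties, contractual obligations) or a higher disaster probability to justify its cost.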
DR sites are the infrastructure foundation that makes disaster recovery possible. Let's consolidate the key takeaways:
Module Complete:
Congratulations! You've completed the Disaster Recovery module. You now understand the complete DR lifecycle:
With this knowledge, you can design, implement, and operate disaster recovery solutions that ensure your database systems survive any disaster scenario.
You've mastered the essential components of disaster recovery for database systems. From planning through execution, from RTO/RPO targets through DR site infrastructure, you now possess the knowledge to build and maintain robust DR capabilities. Remember: DR is a process, not a project. Continuous validation, testing, and improvement ensure that when disaster strikes, your systems—and your organization—will survive. Next, explore Backup Best Practices to round out your knowledge of database resilience and recoverability.