On February 28, 2017, a typo in a command took down Amazon S3 in the US-East-1 region for nearly four hours. The impact was staggering: thousands of websites and applications became unavailable, including major services like Slack, Quora, and Business Insider. Companies lost millions in revenue, and the internet—for many users—seemed broken.
The companies that survived unscathed shared one characteristic: geographic redundancy. They had deployed their systems across multiple AWS regions, so when US-East-1 failed, traffic automatically shifted to other regions where their services continued running.
Geographic redundancy extends the redundancy concept beyond individual servers to entire physical locations. It protects against disasters that affect whole datacenters or regions: power grid failures, natural disasters, fiber cuts, or even geopolitical events. For systems requiring true high availability, geographic redundancy isn't optional—it's essential.
But geographic distribution introduces profound challenges: data consistency across continents, latency management, cost multiplication, and operational complexity. This page provides a comprehensive framework for designing, implementing, and operating geographically redundant systems.
By the end of this page, you will understand geographic redundancy architecture patterns, data replication strategies across regions, traffic routing mechanisms, failure detection and failover at the regional level, and the operational practices required to maintain geographically distributed systems.
Geographic redundancy exists at multiple levels, each providing different protection against different failure modes:
Availability Zones (AZs)
Cloud providers divide regions into availability zones—physically separate datacenters with independent power, cooling, and networking, connected by high-bandwidth, low-latency links.
Regions
Geographically separated collections of availability zones, typically hundreds or thousands of kilometers apart.
Continents/Global
Distribution across multiple continents for maximum geographic isolation.
| Level | Typical Latency | Protected Against | Cost Impact | Complexity |
|---|---|---|---|---|
| Multi-AZ | 1-5ms | Datacenter failure | Moderate (+50-100%) | Low |
| Multi-Region | 50-150ms | Regional disaster | High (+200-400%) | High |
| Multi-Continent | 150-300ms | Continental events | Very High (+400%+) | Very High |
Choosing Your Level:
The appropriate level depends on your availability requirements and constraints:
Most organizations should implement multi-AZ by default and add multi-region for critical systems. Multi-continent is typically only justified for the largest global platforms.
Multi-AZ redundancy provides most of the availability benefit at a fraction of the complexity cost. Modern cloud services (RDS Multi-AZ, ECS/EKS, managed load balancers) make multi-AZ almost transparent. Master this before tackling multi-region.
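The availability gain from redundancy can be estimated with a back-of-the-envelope calculation. The sketch below assumes fully independent failures — real regions share some dependencies (control planes, DNS, deploy pipelines), so treat the result as an upper bound:

```typescript
// Combined availability of N redundant, independently failing deployments.
// Assumption: failures are uncorrelated, which overstates real-world gains.
function combinedAvailability(perDeployment: number, deployments: number): number {
  const perDeploymentDowntime = 1 - perDeployment; // probability one copy is down
  // System is down only when ALL copies are down simultaneously.
  return 1 - Math.pow(perDeploymentDowntime, deployments);
}

// A single deployment at 99.9% allows ~8.77 hours of downtime per year.
// Two independent deployments: 1 - (0.001)^2 = 0.999999 ("six nines").
console.log(combinedAvailability(0.999, 2));
```

This is why multi-AZ buys so much: even two copies turn three nines into (theoretically) six, and each additional copy gives diminishing returns relative to its cost.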
Several architectural patterns exist for geographic redundancy, each with distinct tradeoffs:
1. Active-Passive (Pilot Light)
One region handles all traffic; secondary region maintains minimal infrastructure ready for activation.
2. Active-Passive (Warm Standby)
Secondary region runs reduced capacity, fully functional, ready for immediate promotion.
3. Active-Active
Both regions handle production traffic simultaneously, each capable of absorbing the other's load.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                           GLOBAL DNS / GSLB                             │
│                 (Routes users to nearest healthy region)                │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │
            ┌───────────────────────┴───────────────────────┐
            │                                               │
            ▼                                               ▼
┌───────────────────────────────┐         ┌───────────────────────────────┐
│        REGION: US-EAST        │         │        REGION: EU-WEST        │
│  ┌─────────────────────────┐  │         │  ┌─────────────────────────┐  │
│  │      Load Balancer      │  │         │  │      Load Balancer      │  │
│  └───────────┬─────────────┘  │         │  └───────────┬─────────────┘  │
│              │                │         │              │                │
│  ┌───────────┴─────────────┐  │         │  ┌───────────┴─────────────┐  │
│  │    App Servers (N+1)    │  │         │  │    App Servers (N+1)    │  │
│  │   ┌───┐  ┌───┐  ┌───┐   │  │         │  │   ┌───┐  ┌───┐  ┌───┐   │  │
│  │   │ A │  │ B │  │ C │   │  │         │  │   │ D │  │ E │  │ F │   │  │
│  │   └───┘  └───┘  └───┘   │  │         │  │   └───┘  └───┘  └───┘   │  │
│  └───────────┬─────────────┘  │         │  └───────────┬─────────────┘  │
│              │                │         │              │                │
│  ┌───────────┴─────────────┐  │         │  ┌───────────┴─────────────┐  │
│  │    Database Cluster     ├──┼─────────┼─►│    Database Cluster     │  │
│  │    (Primary/Leader)     │  │  Async  │  │  (Secondary/Follower)   │  │
│  └─────────────────────────┘  │  Repl   │  └─────────────────────────┘  │
└───────────────────────────────┘         └───────────────────────────────┘
```
Choose active-passive when cost is the primary concern and an RTO of minutes is acceptable. Choose active-active when users are geographically distributed, RTO must be measured in seconds, or you need to utilize capacity in both regions during normal operations.
Data replication across geographic regions faces the fundamental constraint of physics: the speed of light. A round-trip from US-East to EU-West takes approximately 80-100ms minimum. This latency fundamentally shapes replication strategies.
Asynchronous Replication
Changes commit locally and replicate to remote regions in the background. Writes stay fast, but the replication lag becomes a data-loss window if the primary region fails.
Synchronous Replication
Changes wait for acknowledgment from the remote region before completing. No data is lost on failover, but every write pays the full cross-region round trip.
Semi-Synchronous Replication
Changes wait for at least one replica (which can be in the same region) before completing, balancing durability against write latency.
Conflict-Free Replication (CRDTs)
Data structures designed so that concurrent modifications merge automatically and deterministically, allowing every region to accept writes without coordination.
| Strategy | Write Latency | Data Loss Risk | Complexity | Best Use Case |
|---|---|---|---|---|
| Asynchronous | Lowest (local) | Seconds of data | Low | Most applications |
| Synchronous | Highest (+RTT) | None | Low | Financial/critical data |
| Semi-Synchronous | Moderate | Minimal | Medium | Balanced requirements |
| CRDTs | Lowest (local) | None (by design) | High | Counters, sets, specific types |
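To make the CRDT row concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and merging takes the per-region maximum, so concurrent updates from any number of regions converge without coordination:

```typescript
// G-Counter CRDT sketch. Region IDs and the API shape are illustrative.
class GCounter {
  private counts: Map<string, number> = new Map();

  constructor(private regionId: string) {}

  // A region only ever increments its own slot.
  increment(by: number = 1): void {
    const current = this.counts.get(this.regionId) ?? 0;
    this.counts.set(this.regionId, current + by);
  }

  // Total across all regions' slots.
  value(): number {
    let total = 0;
    for (const n of this.counts.values()) total += n;
    return total;
  }

  // Merge is commutative, associative, and idempotent — replicas can
  // exchange state in any order, any number of times, and still converge.
  merge(other: GCounter): void {
    for (const [region, n] of other.counts) {
      this.counts.set(region, Math.max(this.counts.get(region) ?? 0, n));
    }
  }
}
```

If US-East increments by 3 and EU-West concurrently increments by 2, merging in either direction yields 5 on both replicas — no conflict to resolve. The limitation is equally important: only certain data shapes (counters, sets, registers with defined merge semantics) fit this model.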
Practical Replication Patterns:
Primary-Secondary (Master-Slave)
One region owns all writes; the others replicate and serve reads. Simple to reason about, but failover requires promoting a secondary to primary.
Multi-Primary with Conflict Resolution
All regions accept writes; conflicts are resolved by policy, such as last-write-wins or application-specific merge logic.
Sharded by Region
Different data is owned by different regions — for example, each region owns its local users' records — so cross-region writes become the exception rather than the rule.
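For the multi-primary pattern, the most common policy is last-write-wins (LWW). The sketch below is hypothetical: each write carries a timestamp, the newer write wins, and the region ID breaks ties deterministically so every region resolves the conflict identically. Note that LWW silently discards the losing write — acceptable for data like profile fields, dangerous for data like account balances:

```typescript
// Hypothetical LWW conflict resolution for multi-primary replication.
interface VersionedValue {
  value: string;
  timestamp: number; // e.g. hybrid logical clock, or wall clock in ms
  region: string;    // origin region, used as a deterministic tie-breaker
}

function resolveLWW(a: VersionedValue, b: VersionedValue): VersionedValue {
  if (a.timestamp !== b.timestamp) {
    // Newer write wins; the older write is silently discarded.
    return a.timestamp > b.timestamp ? a : b;
  }
  // Equal timestamps: pick a winner by region ID so all replicas agree.
  return a.region > b.region ? a : b;
}
```

Because the tie-breaker is a pure function of the two versions, every region converges to the same value regardless of the order in which it observes the writes.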
Replication lag is your RPO in real-time. If your secondary region is 30 seconds behind and the primary fails, you lose 30 seconds of data. Alarm on lag thresholds, and treat growing lag as an urgent issue—it indicates replication can't keep up with write volume.
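The alarming described above can be sketched as a small classifier. Thresholds, names, and the trend heuristic here are illustrative, not taken from any specific monitoring system:

```typescript
// Replication-lag alarming sketch: lag IS the real-time RPO.
interface LagSample {
  lagSeconds: number;
  takenAt: Date;
}

function classifyLag(
  samples: LagSample[],     // chronological order, most recent last
  rpoTargetSeconds: number
): 'ok' | 'warning' | 'critical' {
  if (samples.length === 0) return 'critical'; // no data is itself an alert
  const latest = samples[samples.length - 1].lagSeconds;

  // RPO target already breached: losing the primary now loses too much data.
  if (latest > rpoTargetSeconds) return 'critical';

  // Monotonically growing lag means replication can't keep up with write
  // volume — warn before the target is actually breached.
  const growing =
    samples.length >= 3 &&
    samples.every((s, i) => i === 0 || s.lagSeconds >= samples[i - 1].lagSeconds) &&
    latest > samples[0].lagSeconds;

  if (growing || latest > rpoTargetSeconds * 0.5) return 'warning';
  return 'ok';
}
```

The key design choice is treating *trend* as a first-class signal: a flat 5 seconds of lag is routine, while 5 → 10 → 14 seconds under the same write load is an incident in progress.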
Routing traffic across regions and executing failover when regions fail requires specialized infrastructure and careful planning.
Global Traffic Management Options:
1. GeoDNS
DNS servers return different IP addresses based on the client's geographic location. Simple and widely supported, but limited by DNS caching and resolver accuracy.
2. Global Server Load Balancing (GSLB)
DNS-based load balancing augmented with health checks and intelligent routing, so unhealthy regions are automatically removed from responses.
3. Anycast
The same IP address is announced from multiple locations; BGP routes each client to the topologically nearest one. Failover is near-instant, but routing is controlled by the network rather than the application.
4. CDN-Based Routing
The CDN edge handles routing decisions based on configured rules and origin health, keeping failover logic entirely away from clients.
```typescript
interface RegionHealth {
  region: string;
  healthy: boolean;
  latency: number;
  lastCheck: Date;
  consecutiveFailures: number;
}

interface FailoverConfig {
  failureThreshold: number;    // Consecutive failures before failover
  recoveryThreshold: number;   // Consecutive successes before recovery
  healthCheckInterval: number; // Seconds between checks
  minHealthyRegions: number;   // Minimum regions before degraded mode
}

class RegionalFailoverManager {
  private regionHealth: Map<string, RegionHealth> = new Map();

  constructor(
    private regions: string[],
    private config: FailoverConfig
  ) {
    this.initializeHealthTracking();
  }

  selectRegion(userLocation: string): string {
    const healthyRegions = this.getHealthyRegions();

    if (healthyRegions.length === 0) {
      throw new Error('No healthy regions available');
    }

    if (healthyRegions.length < this.config.minHealthyRegions) {
      this.alertDegradedMode(healthyRegions.length);
    }

    // Return nearest healthy region
    return this.findNearestRegion(userLocation, healthyRegions);
  }

  recordHealthCheck(region: string, success: boolean, latency: number) {
    const health = this.regionHealth.get(region)!;

    if (success) {
      health.consecutiveFailures = 0;
      health.latency = latency;

      // Check for recovery
      if (!health.healthy) {
        // Require multiple successful checks before marking healthy
        // (implementation detail: track consecutive successes)
        health.healthy = true;
        this.logRecovery(region);
      }
    } else {
      health.consecutiveFailures++;

      if (health.consecutiveFailures >= this.config.failureThreshold) {
        health.healthy = false;
        this.triggerFailover(region);
      }
    }

    health.lastCheck = new Date();
  }

  private triggerFailover(failedRegion: string) {
    console.log(`Initiating failover from ${failedRegion}`);
    // Update DNS, notify GSLB, log event
    // Alert operations team
  }
}
```
Lower TTLs enable faster failover but increase DNS query load. A common strategy: use 60-second TTLs for production endpoints (fast failover) and longer TTLs for stable infrastructure (reduced load). Some advanced setups use DNS prefetching and client-side caching to decouple TTL from failover speed.
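These parameters combine into a worst-case failover time budget. The sketch below adds health-check detection time to DNS propagation and TTL; a client that resolved DNS just before the failure holds the stale answer for a full TTL after routing is updated. All numbers are illustrative:

```typescript
// Back-of-the-envelope worst-case time from regional failure to the last
// client reaching the healthy region. Components are assumptions, not a
// complete model (it ignores resolvers that disrespect TTLs, for example).
function worstCaseFailoverSeconds(
  healthCheckIntervalSec: number,
  failureThreshold: number,        // consecutive failed checks required
  dnsUpdatePropagationSec: number, // provider time to publish the change
  dnsTtlSec: number                // stale-answer window for cached clients
): number {
  const detectionSec = healthCheckIntervalSec * failureThreshold;
  return detectionSec + dnsUpdatePropagationSec + dnsTtlSec;
}

// e.g. 10s checks x 3 failures + 30s propagation + 60s TTL = 120s worst case
console.log(worstCaseFailoverSeconds(10, 3, 30, 60));
```

Working the budget backward is the useful direction: if your RTO is 60 seconds, a 60-second TTL alone already consumes the entire budget, which is exactly why anycast or client-side failover appears in stricter designs.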
Understanding different regional failure modes helps design appropriate responses:
Complete Region Outage
The entire region becomes unreachable (power, network, or cloud provider failure). This is the clearest case: fail everything over.
Partial Region Degradation
Some services in the region fail while others continue. This is harder to handle — you must decide whether to evacuate the whole region or route around the failing services individually.
Network Partition
The region is reachable from some locations but not others. Different observers disagree about its health, so failover decisions need a quorum of vantage points or an external arbiter.
Performance Degradation
The region is operational but slow or experiencing elevated error rates. Often the trickiest call: premature failover adds risk, while waiting prolongs user pain.
When a failed region recovers, don't immediately shift all traffic back. The returning region may be cold (caches empty, connections dropped) and vulnerable to a surge. Gradually increase traffic to the recovered region while monitoring for issues. Rushing recovery can cause a second outage.
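The gradual failback described above can be sketched as a weighted ramp with promotion gates. The step schedule and thresholds here are illustrative assumptions, not a standard:

```typescript
// Gradual failback sketch: step the recovered region's routing weight up,
// soaking at each stage, instead of flipping 100% of traffic back at once.
function failbackWeights(steps: number[] = [1, 5, 10, 25, 50, 100]) {
  // Percent of traffic to the recovered region at each stage;
  // the healthy region absorbs the remainder.
  return steps.map(recoveredPct => ({
    recovered: recoveredPct,
    healthy: 100 - recoveredPct,
  }));
}

// Promote to the next stage only if the recovered region stays within
// bounds during the soak period. Thresholds are illustrative.
function shouldPromote(errorRate: number, p99LatencyMs: number): boolean {
  const ERROR_BUDGET = 0.01;  // 1% error rate
  const LATENCY_BUDGET = 500; // ms at p99
  return errorRate <= ERROR_BUDGET && p99LatencyMs <= LATENCY_BUDGET;
}
```

The 1% first step matters most: it sends enough traffic to warm caches and re-establish connection pools while keeping the blast radius negligible if the region is not actually ready.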
Operating geographically distributed systems requires specialized practices and tooling:
Multi-Region Deployment
Deployments must coordinate across regions safely: roll out to one region at a time, bake, and verify before proceeding, so a bad release can never take down every region at once.
Cross-Region Monitoring
Monitoring must itself be geographically distributed — a monitoring stack hosted only in the failed region is blind exactly when you need it most.
Testing Geographic Failover
Regular testing ensures failover actually works; untested failover paths tend to fail at the worst possible moment.
Cost Management
Multi-region deployments multiply infrastructure costs, with cross-region data transfer and idle standby capacity typically dominating the added spend.
| Practice | Frequency | Purpose | Owner |
|---|---|---|---|
| Regional failover drill | Quarterly | Validate failover works | SRE/Platform |
| Cross-region capacity test | Monthly | Verify traffic absorption | SRE/Platform |
| Replication lag review | Weekly | Track RPO compliance | Data Engineering |
| Cost analysis | Monthly | Optimize multi-region spend | FinOps |
| Deployment pipeline review | Quarterly | Ensure safe cross-region deploys | Platform |
Regional failures are high-stress, low-frequency events. Detailed runbooks with clear decision trees are essential. Include: who can authorize failover, how to execute it, what to monitor during transition, and how to handle edge cases. Practice using these runbooks during drills.
Netflix
Netflix operates across multiple AWS regions in an active-active architecture, regularly evacuating entire regions in chaos-engineering exercises to prove that failover works.
Google
Google's Spanner provides a globally consistent database, using tightly synchronized clocks (TrueTime) to offer strong consistency across continents.
Stripe
Stripe implements multi-region redundancy for payment reliability, where even a few seconds of lost writes are unacceptable.
Cloudflare
Cloudflare's network is inherently global: anycast routing spreads traffic across hundreds of points of presence, so the loss of any single location is absorbed automatically.
Major companies publish post-mortems of regional failures. Study these to understand real-world failure modes and responses. Netflix's chaos engineering blog, AWS post-incident reports, and Google Cloud's incident analyses provide invaluable learning opportunities.
Geographic redundancy protects against failures that affect entire regions—from datacenter fires to cloud provider outages. While complex and costly, it's essential for systems requiring the highest availability levels.
Next Steps:
Geographic redundancy protects against location-level failures. But what about individual component failures within a system? In the next page, we'll explore component redundancy—the patterns for eliminating single points of failure at the component level within a single deployment.
You now understand geographic redundancy comprehensively—from multi-AZ basics through multi-region architecture, data replication strategies, traffic routing, and operational practices. This pattern is essential for any system requiring survival of regional disasters.