On February 28, 2017, a typo in a command took down Amazon S3 in the US-East-1 region for nearly four hours. The impact was staggering: thousands of websites and applications became unavailable, including major services like Slack, Quora, and Business Insider. Companies lost millions in revenue, and the internet—for many users—seemed broken.
The companies that survived unscathed shared one characteristic: geographic redundancy. They had deployed their systems across multiple AWS regions, so when US-East-1 failed, traffic automatically shifted to other regions where their services continued running.
Geographic redundancy extends the redundancy concept beyond individual servers to entire physical locations. It protects against disasters that affect whole datacenters or regions: power grid failures, natural disasters, fiber cuts, or even geopolitical events. For systems requiring true high availability, geographic redundancy isn't optional—it's essential.
But geographic distribution introduces profound challenges: data consistency across continents, latency management, cost multiplication, and operational complexity. This page provides a comprehensive framework for designing, implementing, and operating geographically redundant systems.
By the end of this page, you will understand geographic redundancy architecture patterns, data replication strategies across regions, traffic routing mechanisms, failure detection and failover at the regional level, and the operational practices required to maintain geographically distributed systems.
Geographic redundancy exists at multiple levels, each providing different protection against different failure modes:
Availability Zones (AZs)
Cloud providers divide regions into availability zones—physically separate datacenters with independent power, cooling, and networking, connected by high-bandwidth, low-latency links.
Regions
Geographically separated collections of availability zones, typically hundreds or thousands of kilometers apart.
Continents/Global
Distribution across multiple continents for maximum geographic isolation.
| Level | Typical Latency | Protected Against | Cost Impact | Complexity |
|---|---|---|---|---|
| Multi-AZ | 1-5ms | Datacenter failure | Moderate (+50-100%) | Low |
| Multi-Region | 50-150ms | Regional disaster | High (+200-400%) | High |
| Multi-Continent | 150-300ms | Continental events | Very High (+400%+) | Very High |
Choosing Your Level:
The appropriate level depends on your availability requirements and constraints:
Most organizations should implement multi-AZ by default and add multi-region for critical systems. Multi-continent is typically only justified for the largest global platforms.
Multi-AZ redundancy provides most of the availability benefit at a fraction of the complexity cost. Modern cloud services (RDS Multi-AZ, ECS/EKS, managed load balancers) make multi-AZ almost transparent. Master this before tackling multi-region.
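The availability gain from redundancy can be estimated with a back-of-the-envelope calculation. The sketch below assumes fully independent failures — real regions share some dependencies (control planes, DNS, deploy pipelines), so treat the result as an upper bound:

```typescript
// Combined availability of N redundant, independently failing deployments.
// Assumption: failures are uncorrelated, which overstates real-world gains.
function combinedAvailability(perDeployment: number, deployments: number): number {
  const perDeploymentDowntime = 1 - perDeployment; // probability one copy is down
  // System is down only when ALL copies are down simultaneously.
  return 1 - Math.pow(perDeploymentDowntime, deployments);
}

// A single deployment at 99.9% allows ~8.77 hours of downtime per year.
// Two independent deployments: 1 - (0.001)^2 = 0.999999 ("six nines").
console.log(combinedAvailability(0.999, 2));
```

This is why multi-AZ buys so much: even two copies turn three nines into (theoretically) six, and each additional copy gives diminishing returns relative to its cost.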
Several architectural patterns exist for geographic redundancy, each with distinct tradeoffs:
1. Active-Passive (Pilot Light)
One region handles all traffic; secondary region maintains minimal infrastructure ready for activation.
2. Active-Passive (Warm Standby)
Secondary region runs reduced capacity, fully functional, ready for immediate promotion.
3. Active-Active
Both regions handle production traffic simultaneously, each capable of absorbing the other's load.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                           GLOBAL DNS / GSLB                             │
│                 (Routes users to nearest healthy region)                │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │
            ┌───────────────────────┴───────────────────────┐
            │                                               │
            ▼                                               ▼
┌───────────────────────────────┐         ┌───────────────────────────────┐
│        REGION: US-EAST        │         │        REGION: EU-WEST        │
│  ┌─────────────────────────┐  │         │  ┌─────────────────────────┐  │
│  │      Load Balancer      │  │         │  │      Load Balancer      │  │
│  └───────────┬─────────────┘  │         │  └───────────┬─────────────┘  │
│              │                │         │              │                │
│  ┌───────────┴─────────────┐  │         │  ┌───────────┴─────────────┐  │
│  │    App Servers (N+1)    │  │         │  │    App Servers (N+1)    │  │
│  │   ┌───┐  ┌───┐  ┌───┐   │  │         │  │   ┌───┐  ┌───┐  ┌───┐   │  │
│  │   │ A │  │ B │  │ C │   │  │         │  │   │ D │  │ E │  │ F │   │  │
│  │   └───┘  └───┘  └───┘   │  │         │  │   └───┘  └───┘  └───┘   │  │
│  └───────────┬─────────────┘  │         │  └───────────┬─────────────┘  │
│              │                │         │              │                │
│  ┌───────────┴─────────────┐  │         │  ┌───────────┴─────────────┐  │
│  │    Database Cluster     ├──┼─────────┼─►│    Database Cluster     │  │
│  │    (Primary/Leader)     │  │  Async  │  │  (Secondary/Follower)   │  │
│  └─────────────────────────┘  │  Repl   │  └─────────────────────────┘  │
└───────────────────────────────┘         └───────────────────────────────┘
```
Choose active-passive when cost is the primary concern and an RTO of minutes is acceptable. Choose active-active when users are geographically distributed, RTO must be measured in seconds, or you need to utilize capacity in both regions during normal operations.
Data replication across geographic regions faces the fundamental constraint of physics: the speed of light. A round-trip from US-East to EU-West takes approximately 80-100ms minimum. This latency fundamentally shapes replication strategies.
Asynchronous Replication
Changes commit locally and replicate to remote regions in the background. Writes stay fast, but the replication lag becomes a data-loss window if the primary region fails.
Synchronous Replication
Changes wait for acknowledgment from the remote region before completing. No data is lost on failover, but every write pays the full cross-region round trip.
Semi-Synchronous Replication
Changes wait for at least one replica (which can be in the same region) before completing, balancing durability against write latency.
Conflict-Free Replication (CRDTs)
Data structures designed so that concurrent modifications merge automatically and deterministically, allowing every region to accept writes without coordination.
| Strategy | Write Latency | Data Loss Risk | Complexity | Best Use Case |
|---|---|---|---|---|
| Asynchronous | Lowest (local) | Seconds of data | Low | Most applications |
| Synchronous | Highest (+RTT) | None | Low | Financial/critical data |
| Semi-Synchronous | Moderate | Minimal | Medium | Balanced requirements |
| CRDTs | Lowest (local) | None (by design) | High | Counters, sets, specific types |
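To make the CRDT row concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and merging takes the per-region maximum, so concurrent updates from any number of regions converge without coordination:

```typescript
// G-Counter CRDT sketch. Region IDs and the API shape are illustrative.
class GCounter {
  private counts: Map<string, number> = new Map();

  constructor(private regionId: string) {}

  // A region only ever increments its own slot.
  increment(by: number = 1): void {
    const current = this.counts.get(this.regionId) ?? 0;
    this.counts.set(this.regionId, current + by);
  }

  // Total across all regions' slots.
  value(): number {
    let total = 0;
    for (const n of this.counts.values()) total += n;
    return total;
  }

  // Merge is commutative, associative, and idempotent — replicas can
  // exchange state in any order, any number of times, and still converge.
  merge(other: GCounter): void {
    for (const [region, n] of other.counts) {
      this.counts.set(region, Math.max(this.counts.get(region) ?? 0, n));
    }
  }
}
```

If US-East increments by 3 and EU-West concurrently increments by 2, merging in either direction yields 5 on both replicas — no conflict to resolve. The limitation is equally important: only certain data shapes (counters, sets, registers with defined merge semantics) fit this model.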
Practical Replication Patterns:
Primary-Secondary (Master-Slave)
One region owns all writes; the others replicate and serve reads. Simple to reason about, but failover requires promoting a secondary to primary.
Multi-Primary with Conflict Resolution
All regions accept writes; conflicts are resolved by policy, such as last-write-wins or application-specific merge logic.
Sharded by Region
Different data is owned by different regions — for example, each region owns its local users' records — so cross-region writes become the exception rather than the rule.
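For the multi-primary pattern, the most common policy is last-write-wins (LWW). The sketch below is hypothetical: each write carries a timestamp, the newer write wins, and the region ID breaks ties deterministically so every region resolves the conflict identically. Note that LWW silently discards the losing write — acceptable for data like profile fields, dangerous for data like account balances:

```typescript
// Hypothetical LWW conflict resolution for multi-primary replication.
interface VersionedValue {
  value: string;
  timestamp: number; // e.g. hybrid logical clock, or wall clock in ms
  region: string;    // origin region, used as a deterministic tie-breaker
}

function resolveLWW(a: VersionedValue, b: VersionedValue): VersionedValue {
  if (a.timestamp !== b.timestamp) {
    // Newer write wins; the older write is silently discarded.
    return a.timestamp > b.timestamp ? a : b;
  }
  // Equal timestamps: pick a winner by region ID so all replicas agree.
  return a.region > b.region ? a : b;
}
```

Because the tie-breaker is a pure function of the two versions, every region converges to the same value regardless of the order in which it observes the writes.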
Replication lag is your RPO in real-time. If your secondary region is 30 seconds behind and the primary fails, you lose 30 seconds of data. Alarm on lag thresholds, and treat growing lag as an urgent issue—it indicates replication can't keep up with write volume.
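The alarming described above can be sketched as a small classifier. Thresholds, names, and the trend heuristic here are illustrative, not taken from any specific monitoring system:

```typescript
// Replication-lag alarming sketch: lag IS the real-time RPO.
interface LagSample {
  lagSeconds: number;
  takenAt: Date;
}

function classifyLag(
  samples: LagSample[],     // chronological order, most recent last
  rpoTargetSeconds: number
): 'ok' | 'warning' | 'critical' {
  if (samples.length === 0) return 'critical'; // no data is itself an alert
  const latest = samples[samples.length - 1].lagSeconds;

  // RPO target already breached: losing the primary now loses too much data.
  if (latest > rpoTargetSeconds) return 'critical';

  // Monotonically growing lag means replication can't keep up with write
  // volume — warn before the target is actually breached.
  const growing =
    samples.length >= 3 &&
    samples.every((s, i) => i === 0 || s.lagSeconds >= samples[i - 1].lagSeconds) &&
    latest > samples[0].lagSeconds;

  if (growing || latest > rpoTargetSeconds * 0.5) return 'warning';
  return 'ok';
}
```

The key design choice is treating *trend* as a first-class signal: a flat 5 seconds of lag is routine, while 5 → 10 → 14 seconds under the same write load is an incident in progress.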
Routing traffic across regions and executing failover when regions fail requires specialized infrastructure and careful planning.
Global Traffic Management Options:
1. GeoDNS
DNS servers return different IP addresses based on the client's geographic location. Simple and widely supported, but limited by DNS caching and resolver accuracy.
2. Global Server Load Balancing (GSLB)
DNS-based load balancing augmented with health checks and intelligent routing, so unhealthy regions are automatically removed from responses.
3. Anycast
The same IP address is announced from multiple locations; BGP routes each client to the topologically nearest one. Failover is near-instant, but routing is controlled by the network rather than the application.
4. CDN-Based Routing
The CDN edge handles routing decisions based on configured rules and origin health, keeping failover logic entirely away from clients.
```typescript
interface RegionHealth {
  region: string;
  healthy: boolean;
  latency: number;
  lastCheck: Date;
  consecutiveFailures: number;
}

interface FailoverConfig {
  failureThreshold: number;    // Consecutive failures before failover
  recoveryThreshold: number;   // Consecutive successes before recovery
  healthCheckInterval: number; // Seconds between checks
  minHealthyRegions: number;   // Minimum regions before degraded mode
}

class RegionalFailoverManager {
  private regionHealth: Map<string, RegionHealth> = new Map();

  constructor(
    private regions: string[],
    private config: FailoverConfig
  ) {
    this.initializeHealthTracking();
  }

  selectRegion(userLocation: string): string {
    const healthyRegions = this.getHealthyRegions();

    if (healthyRegions.length === 0) {
      throw new Error('No healthy regions available');
    }

    if (healthyRegions.length < this.config.minHealthyRegions) {
      this.alertDegradedMode(healthyRegions.length);
    }

    // Return nearest healthy region
    return this.findNearestRegion(userLocation, healthyRegions);
  }

  recordHealthCheck(region: string, success: boolean, latency: number) {
    const health = this.regionHealth.get(region)!;

    if (success) {
      health.consecutiveFailures = 0;
      health.latency = latency;

      // Check for recovery
      if (!health.healthy) {
        // Require multiple successful checks before marking healthy
        // (implementation detail: track consecutive successes)
        health.healthy = true;
        this.logRecovery(region);
      }
    } else {
      health.consecutiveFailures++;

      if (health.consecutiveFailures >= this.config.failureThreshold) {
        health.healthy = false;
        this.triggerFailover(region);
      }
    }

    health.lastCheck = new Date();
  }

  private triggerFailover(failedRegion: string) {
    console.log(`Initiating failover from ${failedRegion}`);
    // Update DNS, notify GSLB, log event
    // Alert operations team
  }
}
```
Lower TTLs enable faster failover but increase DNS query load. A common strategy: use 60-second TTLs for production endpoints (fast failover) and longer TTLs for stable infrastructure (reduced load). Some advanced setups use DNS prefetching and client-side caching to decouple TTL from failover speed.
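These parameters combine into a worst-case failover time budget. The sketch below adds health-check detection time to DNS propagation and TTL; a client that resolved DNS just before the failure holds the stale answer for a full TTL after routing is updated. All numbers are illustrative:

```typescript
// Back-of-the-envelope worst-case time from regional failure to the last
// client reaching the healthy region. Components are assumptions, not a
// complete model (it ignores resolvers that disrespect TTLs, for example).
function worstCaseFailoverSeconds(
  healthCheckIntervalSec: number,
  failureThreshold: number,        // consecutive failed checks required
  dnsUpdatePropagationSec: number, // provider time to publish the change
  dnsTtlSec: number                // stale-answer window for cached clients
): number {
  const detectionSec = healthCheckIntervalSec * failureThreshold;
  return detectionSec + dnsUpdatePropagationSec + dnsTtlSec;
}

// e.g. 10s checks x 3 failures + 30s propagation + 60s TTL = 120s worst case
console.log(worstCaseFailoverSeconds(10, 3, 30, 60));
```

Working the budget backward is the useful direction: if your RTO is 60 seconds, a 60-second TTL alone already consumes the entire budget, which is exactly why anycast or client-side failover appears in stricter designs.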
Understanding different regional failure modes helps design appropriate responses:
Complete Region Outage
The entire region becomes unreachable (power, network, or cloud provider failure). This is the clearest case: fail everything over.
Partial Region Degradation
Some services in the region fail while others continue. This is harder to handle — you must decide whether to evacuate the whole region or route around the failing services individually.
Network Partition
The region is reachable from some locations but not others. Different observers disagree about its health, so failover decisions need a quorum of vantage points or an external arbiter.
Performance Degradation
The region is operational but slow or experiencing elevated error rates. Often the trickiest call: premature failover adds risk, while waiting prolongs user pain.
When a failed region recovers, don't immediately shift all traffic back. The returning region may be cold (caches empty, connections dropped) and vulnerable to a surge. Gradually increase traffic to the recovered region while monitoring for issues. Rushing recovery can cause a second outage.
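The gradual failback described above can be sketched as a weighted ramp with promotion gates. The step schedule and thresholds here are illustrative assumptions, not a standard:

```typescript
// Gradual failback sketch: step the recovered region's routing weight up,
// soaking at each stage, instead of flipping 100% of traffic back at once.
function failbackWeights(steps: number[] = [1, 5, 10, 25, 50, 100]) {
  // Percent of traffic to the recovered region at each stage;
  // the healthy region absorbs the remainder.
  return steps.map(recoveredPct => ({
    recovered: recoveredPct,
    healthy: 100 - recoveredPct,
  }));
}

// Promote to the next stage only if the recovered region stays within
// bounds during the soak period. Thresholds are illustrative.
function shouldPromote(errorRate: number, p99LatencyMs: number): boolean {
  const ERROR_BUDGET = 0.01;  // 1% error rate
  const LATENCY_BUDGET = 500; // ms at p99
  return errorRate <= ERROR_BUDGET && p99LatencyMs <= LATENCY_BUDGET;
}
```

The 1% first step matters most: it sends enough traffic to warm caches and re-establish connection pools while keeping the blast radius negligible if the region is not actually ready.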
Operating geographically distributed systems requires specialized practices and tooling:
Multi-Region Deployment
Deployments must coordinate across regions safely: roll out to one region at a time, bake, and verify before proceeding, so a bad release can never take down every region at once.
Cross-Region Monitoring
Monitoring must itself be geographically distributed — a monitoring stack hosted only in the failed region is blind exactly when you need it most.
Testing Geographic Failover
Regular testing ensures failover actually works; untested failover paths tend to fail at the worst possible moment.
Cost Management
Multi-region deployments multiply infrastructure costs, with cross-region data transfer and idle standby capacity typically dominating the added spend.
| Practice | Frequency | Purpose | Owner |
|---|---|---|---|
| Regional failover drill | Quarterly | Validate failover works | SRE/Platform |
| Cross-region capacity test | Monthly | Verify traffic absorption | SRE/Platform |
| Replication lag review | Weekly | Track RPO compliance | Data Engineering |
| Cost analysis | Monthly | Optimize multi-region spend | FinOps |
| Deployment pipeline review | Quarterly | Ensure safe cross-region deploys | Platform |
Regional failures are high-stress, low-frequency events. Detailed runbooks with clear decision trees are essential. Include: who can authorize failover, how to execute it, what to monitor during transition, and how to handle edge cases. Practice using these runbooks during drills.
Netflix
Netflix operates across multiple AWS regions in an active-active architecture, regularly evacuating entire regions in chaos-engineering exercises to prove that failover works.
Google
Google's Spanner provides a globally consistent database, using tightly synchronized clocks (TrueTime) to offer strong consistency across continents.
Stripe
Stripe implements multi-region redundancy for payment reliability, where even a few seconds of lost writes are unacceptable.
Cloudflare
Cloudflare's network is inherently global: anycast routing spreads traffic across hundreds of points of presence, so the loss of any single location is absorbed automatically.
Major companies publish post-mortems of regional failures. Study these to understand real-world failure modes and responses. Netflix's chaos engineering blog, AWS post-incident reports, and Google Cloud's incident analyses provide invaluable learning opportunities.
Geographic redundancy protects against failures that affect entire regions—from datacenter fires to cloud provider outages. While complex and costly, it's essential for systems requiring the highest availability levels.
Next Steps:
Geographic redundancy protects against location-level failures. But what about individual component failures within a system? In the next page, we'll explore component redundancy—the patterns for eliminating single points of failure at the component level within a single deployment.
You now understand geographic redundancy comprehensively—from multi-AZ basics through multi-region architecture, data replication strategies, traffic routing, and operational practices. This pattern is essential for any system requiring survival of regional disasters.