Multi-AZ deployment protects against failures within a region—a server rack loses power, a data center has a network issue, a cooling system fails. But what happens when an entire region becomes unavailable?
In February 2017, Amazon S3 had a major outage in US-East-1 that lasted approximately four hours. The ripple effects were staggering: websites failed to load images, applications couldn't read configuration files, and even the AWS health dashboard was affected, because it too depended on S3 in that region. Services designed with single-region assumptions experienced cascading failures.
The lesson was clear: for critical systems, single-region deployments have an implicit SLA ceiling. No matter how well you've designed your multi-AZ architecture, a regional event—whether infrastructure failure, natural disaster, or operational incident—can take you completely offline.
Cross-region deployments are how you break through that ceiling. They're complex, expensive, and require careful design—but for systems where downtime has severe business consequences, they're essential.
By the end of this page, you will understand the patterns for cross-region deployment—from basic disaster recovery to full active-active multi-region architectures. You'll learn how to design global traffic routing, manage data replication across regions, handle the challenges of distributed data, and make informed decisions about the right level of regional redundancy for your systems.
Cross-region deployments address two fundamental needs: disaster recovery and global performance. Understanding which need (or both) drives your requirements shapes your architectural decisions.
Disaster Recovery (DR) Motivation:
Disaster recovery is about surviving catastrophic events that affect an entire cloud region: infrastructure failures, natural disasters, or large-scale operational incidents.
DR-focused architectures optimize for recovery time and data protection, often accepting some performance or operational overhead.
Global Performance Motivation:
Global performance is about serving users worldwide with acceptable latency, since every request served from a distant region pays the full cross-region round-trip time before any processing begins.
Performance-focused architectures optimize for latency and user experience, accepting the complexity of distributed data management.
| Aspect | DR-Focused | Performance-Focused |
|---|---|---|
| Primary Goal | Survive regional failure | Minimize user latency |
| Secondary Region Activity | Passive/warm standby | Active, serving traffic |
| Data Replication | Async OK (some data loss acceptable) | Often requires sync or near-sync |
| Traffic Routing | Failover on primary failure | Always routes to nearest region |
| Cost Profile | Lower (standby resources underutilized) | Higher (full capacity everywhere) |
| Complexity | Moderate (failover mechanisms) | High (distributed data, conflict resolution) |
When Is Cross-Region Necessary?
Not every system needs cross-region deployment; it adds significant cost and complexity. Reserve it for systems where regulatory requirements, strict availability targets, a global user base, or high downtime costs justify the investment.
The Cost Reality:
Cross-region deployment roughly doubles your infrastructure cost (or more for active-active) and significantly increases operational complexity. This is a deliberate trade-off between cost/complexity and availability/performance.
Cross-region architecture is not a badge of honor—it's a calculated trade-off. If your business can tolerate multi-hour regional outages (which are rare), single-region with good multi-AZ design may be the right choice. The complexity of multi-region creates new failure modes and operational burdens.
Disaster recovery architectures exist on a spectrum from simple (cheap, slow recovery) to sophisticated (expensive, fast recovery). The AWS Well-Architected Framework defines four primary patterns.
Pattern 1: Backup and Restore

Description: Data is regularly backed up to another region. Infrastructure is provisioned from scratch during recovery.
```
Primary Region                   DR Region
┌─────────────┐                ┌─────────────┐
│             │                │             │
│    Full     │                │   Backups   │
│ Production  │───backup──────→│  (S3/EBS)   │
│    Stack    │                │             │
│             │                │  No active  │
│             │                │   compute   │
└─────────────┘                └─────────────┘
```
During Disaster:
1. Provision infrastructure in DR region
2. Restore data from backups
3. Update DNS to point to DR
4. Validate and resume service
| Metric | Typical Value |
|---|---|
| RTO (Recovery Time) | Hours to days |
| RPO (Data Loss) | Hours (last backup) |
| Cost | Lowest (storage only in DR) |
| Complexity | Low |
Best For: Non-critical systems, development environments, cost-constrained projects
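The backup step of this pattern is easy to automate. Below is a minimal sketch using boto3 to copy an EBS snapshot into the DR region; the snapshot ID and region names are illustrative, and the client is injectable so the logic can be exercised without AWS:

```python
def copy_snapshot_to_dr(snapshot_id, source_region="us-east-1",
                        dr_region="us-west-2", ec2_dr=None):
    """Copy an EBS snapshot into the DR region so data survives a
    regional outage. Returns the new snapshot ID in the DR region."""
    if ec2_dr is None:
        import boto3  # real AWS client only when none is injected
        # Note: the copy request must be issued in the *destination* region.
        ec2_dr = boto3.client("ec2", region_name=dr_region)
    resp = ec2_dr.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return resp["SnapshotId"]
```

Run on a schedule (hourly, daily): your RPO is then roughly the backup interval.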
Pattern 2: Pilot Light

Description: Core infrastructure (databases, critical configuration) runs in the DR region but at minimal scale. Compute is scaled up during failover.
```
Primary Region                   DR Region
┌─────────────┐                ┌─────────────┐
│             │                │             │
│    Full     │                │   Minimal   │
│ Production  │──replicate────→│  Database   │
│    Stack    │                │  (running)  │
│             │                │             │
│  - Web x20  │                │  No compute │
│  - App x20  │                │  (ASG at 0) │
│  - DB Multi │                │             │
└─────────────┘                └─────────────┘
```
During Disaster:
1. Scale out compute in DR (ASG min → desired)
2. Promote DR database if async replication
3. Update DNS to point to DR
4. Validate and resume service
| Metric | Typical Value |
|---|---|
| RTO (Recovery Time) | Minutes to hours |
| RPO (Data Loss) | Near-zero (sync) to minutes (async) |
| Cost | Low (DB running, compute minimal) |
| Complexity | Moderate |
Best For: Business-critical systems where hours of downtime is acceptable, budget-conscious enterprises
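Step 1 of the pilot-light failover can be scripted. A sketch, assuming boto3 and a pre-created Auto Scaling group sitting at zero; the group name and capacities are placeholders, and the client is injectable for testing:

```python
def activate_pilot_light(asg_name, min_size=4, desired=20,
                         dr_region="us-west-2", asg=None):
    """Failover step 1: raise the DR Auto Scaling group from zero
    to production capacity. Returns the requested desired count."""
    if asg is None:
        import boto3  # real AWS client only when none is injected
        asg = boto3.client("autoscaling", region_name=dr_region)
    asg.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=min_size,           # floor so scale-in can't undo failover
        DesiredCapacity=desired,    # jump straight to production capacity
    )
    return desired
```

The DNS update and database promotion would follow as separate, equally scripted steps.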
Pattern 3: Warm Standby

Description: A fully functional but scaled-down version of production runs in the DR region. It can serve traffic immediately, then scale up.
```
Primary Region                   DR Region
┌─────────────┐                ┌─────────────┐
│             │                │             │
│    Full     │                │   Reduced   │
│ Production  │──replicate────→│ Production  │
│    Stack    │                │    Stack    │
│             │                │             │
│  - Web x20  │                │  - Web x2   │
│  - App x20  │                │  - App x2   │
│  - DB Multi │                │  - DB Read  │
└─────────────┘                └─────────────┘
```
During Disaster:
1. Scale out to full capacity (already running)
2. Promote DR database
3. Update DNS to point to DR
4. Resume full service quickly
| Metric | Typical Value |
|---|---|
| RTO (Recovery Time) | Minutes |
| RPO (Data Loss) | Near-zero to seconds |
| Cost | Medium (reduced capacity running) |
| Complexity | Moderate-High |
Best For: Applications requiring <30 minute recovery, significant business impact from downtime
Pattern 4: Active-Active (Multi-Site)

Description: Both regions run full production workloads. Users are routed to the nearest healthy region, and failover is automatic.
```
Primary Region                   Secondary Region
┌─────────────┐                ┌─────────────┐
│             │                │             │
│    Full     │←──replicate───→│    Full     │
│ Production  │ (bidirectional)│ Production  │
│    Stack    │                │    Stack    │
│             │                │             │
│  - Web x20  │                │  - Web x20  │
│  - App x20  │                │  - App x20  │
│  - DB Multi │                │  - DB Multi │
└──────┬──────┘                └──────┬──────┘
       │                              │
       └──────────────┬───────────────┘
                      │
               ┌──────┴──────┐
               │ Global LB / │
               │  Route 53   │
               └─────────────┘
```
Normal Operation:
- Both regions serve traffic
- Data syncs bidirectionally
- Global routing sends users to nearest region
During Disaster:
- Automatic (DNS/health checks reroute)
- Or simply absorb extra traffic in healthy region
| Metric | Typical Value |
|---|---|
| RTO (Recovery Time) | Near-zero (automatic) |
| RPO (Data Loss) | Near-zero (if sync replication) |
| Cost | Highest (2× infrastructure) |
| Complexity | Highest |
Best For: Mission-critical systems, global user base, zero-tolerance for downtime
Don't over-engineer DR. If your RTO is 24 hours, Backup and Restore may be sufficient. If your RTO is 5 minutes, you need Warm Standby or Active-Active. Match the pattern to your business requirements—every step up the ladder multiplies cost and complexity.
Cross-region deployments require a mechanism to route users to the appropriate region. This routing layer must be globally distributed (not tied to any single region) and resilient to regional failures.
DNS-Based Global Routing:
DNS is the most common approach for global traffic routing. Services like AWS Route 53, Google Cloud DNS, and Azure Traffic Manager provide global DNS with sophisticated routing policies.
Routing Policies:
| Policy | How It Works | Use Case |
|---|---|---|
| Simple | Single endpoint, no intelligence | Not for multi-region |
| Failover | Primary → Secondary on health check failure | Active-Passive DR |
| Geolocation | Route based on user's geographic location | Compliance, content localization |
| Geoproximity | Route to nearest region with bias control | Performance + capacity management |
| Latency-Based | Route to lowest-latency region | Performance optimization |
| Weighted | Percentage split across regions | Gradual migration, canary |
| Multivalue Answer | Return multiple IPs, client chooses | Client-side load balancing |
Latency-Based Routing Example (Route 53):
```
            User in Europe
                   │
                   ▼
            ┌─────────────┐
            │  Route 53   │  (Query: www.example.com)
            │             │
            │   Latency   │
            │   Routing   │
            └──────┬──────┘
                   │
     ┌─────────────┼─────────────┐
     │             │             │
┌────┴────┐   ┌────┴────┐   ┌────┴────┐
│ US-East │   │ EU-West │   │AP-Tokyo │
│  90ms   │   │ 20ms ✓  │   │  250ms  │
└─────────┘   └────┬────┘   └─────────┘
                   │
            Returns EU-West
              endpoint IP
```
Route 53 measures latency from many global vantage points and uses this data to route users to the region with lowest expected latency.
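Creating these records can be scripted with boto3's Route 53 API. A hedged sketch; the hosted-zone ID, domain name, and endpoint IP are placeholders, and the client is injectable for testing:

```python
def upsert_latency_record(zone_id, name, aws_region, ip, r53=None):
    """UPSERT one latency-routed A record. Call once per region with
    the same name but a unique SetIdentifier per record."""
    if r53 is None:
        import boto3  # real AWS client only when none is injected
        r53 = boto3.client("route53")
    return r53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": aws_region,  # unique per record set
                "Region": aws_region,         # enables latency routing
                "TTL": 60,                    # short TTL for fast failover
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )
```

Calling it once per region (us-east-1, eu-west-1, ap-northeast-1) with that region's endpoint IP yields the routing shown above.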
Health Checks Integration:
Global routing must integrate with health checks so that traffic is never sent to a failed region: probe a real application endpoint rather than just a TCP port, check from multiple locations, and let the check's status gate which DNS answers are returned.
DNS TTL Considerations:
| TTL | Failover Speed | DNS Query Volume | Use Case |
|---|---|---|---|
| 60 seconds | Fast (≤1 min) | High | Active-active, fast failover |
| 300 seconds | Moderate (≤5 min) | Medium | Balanced approach |
| 3600 seconds | Slow (≤1 hour) | Low | Stable, rarely changing |
Beyond DNS: Global Load Balancing:
DNS routing has limitations: resolvers and clients cache answers (sometimes beyond the TTL), and DNS cannot make per-request routing decisions. For more sophisticated routing, use global load balancers:
AWS Global Accelerator:
```
                     User
                       │
                       │  (connects to anycast IP)
                       ▼
┌─────────────────────────────────────────────┐
│           AWS Global Accelerator            │
│    (Edge locations worldwide - anycast)     │
└──────────────────────┬──────────────────────┘
                       │
         ┌─────────────┼─────────────┐
         │             │             │
    ┌────┴────┐   ┌────┴────┐   ┌────┴────┐
    │ US-East │   │ EU-West │   │AP-Tokyo │
    │ NLB/ALB │   │ NLB/ALB │   │ NLB/ALB │
    └─────────┘   └─────────┘   └─────────┘
```
Cloudflare / CDN-Based Routing: CDN providers such as Cloudflare front your origins with a global anycast network, so traffic enters at the nearest edge and can be steered across regional origins with health-checked load balancing.
Whatever mechanism you choose for global routing (Route 53, Global Accelerator, Cloudflare), it must itself be globally distributed. If your routing layer depends on a single region, your cross-region deployment fails at the first step. Verify your routing layer's SLA and failure independence.
Data replication across regions is the most challenging aspect of cross-region architecture. Unlike AZ-to-AZ replication (1-2ms latency), cross-region replication must handle latencies of 50-300ms—making synchronous replication often impractical.
The Fundamental Trade-off:
The CAP theorem applies forcefully to cross-region replication. With 100ms+ latency between regions, you must choose:
Strong Consistency: Every read sees the latest write.
Eventual Consistency: Reads may return stale data temporarily.
| Pattern | Latency Impact | Data Loss Risk | Consistency | Use Case |
|---|---|---|---|---|
| Sync Replication | +100-300ms writes | None | Strong | Financial transactions |
| Async Replication | None | Seconds of lag | Eventual | Most applications |
| Conflict-free (CRDT) | None | None (merged) | Eventual | Collaborative apps |
| Write-primary, Read-local | Local reads fast | Read lag | Read-eventual | Global read-heavy apps |
| Region-sharded | None (local) | None (local) | Strong per-shard | Data with geographic affinity |
Database-Specific Cross-Region Patterns:
1. Amazon Aurora Global Database:
```
Primary Region                   Secondary Region(s)
┌─────────────┐                ┌─────────────┐
│   Aurora    │                │   Aurora    │
│  Primary    │──async rep────→│  Secondary  │
│  Cluster    │     <1 sec     │  Cluster    │
│             │                │             │
│ Read/Write  │                │  Read Only  │
└─────────────┘                └─────────────┘
```
2. DynamoDB Global Tables:
```
Region A                         Region B
┌─────────────┐                ┌─────────────┐
│  DynamoDB   │                │  DynamoDB   │
│   Table     │←─multi-master─→│   Table     │
│             │  replication   │             │
│ Read/Write  │                │ Read/Write  │
└──────┬──────┘                └──────┬──────┘
       │                              │
       └──────────────────────────────┘
        Last-writer-wins for conflicts
```
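Attaching a replica region to an existing table is a single API call under Global Tables version 2019.11.21. A sketch, assuming boto3; the table and region names are placeholders, and the client is injectable for testing:

```python
def add_replica_region(table_name, new_region, ddb=None):
    """Attach a replica region to an existing DynamoDB table;
    DynamoDB then replicates writes in both directions."""
    if ddb is None:
        import boto3  # real AWS client only when none is injected
        ddb = boto3.client("dynamodb")
    return ddb.update_table(
        TableName=table_name,
        ReplicaUpdates=[{"Create": {"RegionName": new_region}}],
    )
```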
3. CockroachDB / Spanner-style:
```
┌────────────────────────────────────────────────────────┐
│              Global CockroachDB Cluster                │
│                                                        │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐          │
│   │Region A │     │Region B │     │Region C │          │
│   │  Nodes  │←───→│  Nodes  │←───→│  Nodes  │          │
│   │         │     │         │     │         │          │
│   └─────────┘     └─────────┘     └─────────┘          │
│                                                        │
│        Consensus-based replication (Raft)              │
│        Configurable local vs. global reads             │
└────────────────────────────────────────────────────────┘
```
In multi-master replication, concurrent writes to the same record in different regions will conflict. You must have a conflict resolution strategy: last-writer-wins, merge, application-specific logic, or avoid the scenario through design (region-sharding, single-writer patterns).
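A last-writer-wins resolver fits in a few lines. The essential property is a deterministic total order over versions: timestamp first, with the writing region's name (an illustrative field) as tie-breaker, so every replica converges to the same winner:

```python
def lww_merge(a, b):
    """Return the winning version of a record under last-writer-wins.
    Ties on timestamp break on region name so all replicas agree."""
    key_a = (a["updated_at"], a["region"])
    key_b = (b["updated_at"], b["region"])
    return a if key_a > key_b else b

# Concurrent writes to the same cart in two regions:
us = {"cart": ["book"], "updated_at": 100, "region": "us-east-1"}
eu = {"cart": ["pen"],  "updated_at": 105, "region": "eu-west-1"}
winner = lww_merge(us, eu)           # the EU write is newer, so it wins
assert winner is lww_merge(eu, us)   # argument order is irrelevant
```

Note that the US write is silently discarded; that silent loss is the price of last-writer-wins, and why merge or application-specific strategies exist.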
Active-active multi-region is the holy grail of availability—both regions serve production traffic simultaneously, and users experience no interruption during regional failures. But it's also the most complex pattern, with subtle pitfalls that can undermine its benefits.
Core Design Principles:
1. Statelessness at the Application Layer
Application servers should be completely stateless. All state (sessions, carts, user data) must be externalized to data stores designed for multi-region access.
2. Region-Local Reads, Global Writes
For many applications, the best pattern is to serve all reads from a region-local replica while routing every write to a single primary region.
This limits eventual consistency to read paths while maintaining write consistency.
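The pattern can be sketched as a thin routing wrapper. The dict-like endpoints below are stand-ins for real database clients, not a real driver API:

```python
class RegionAwareStore:
    """Sketch of 'local reads, global writes': reads hit the in-region
    replica (fast, possibly stale); writes go to the single primary
    (consistent, but pays cross-region latency on every write)."""

    def __init__(self, local_replica, primary):
        self.local = local_replica   # read-only replica in this region
        self.primary = primary       # the one global write endpoint

    def read(self, key):
        # May lag the primary by the replication delay (eventual).
        return self.local.get(key)

    def write(self, key, value):
        # All writes funnel to one region, avoiding write conflicts.
        self.primary[key] = value
```

The window between a write landing on the primary and appearing in the local replica is exactly the eventual-consistency exposure this design confines to the read path.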
3. Data Segmentation Strategies
User-Region Affinity:
Assign users to a 'home' region. All their data lives in that region, with async backup to others:
```
User Home Region: EU-West
┌──────────────────────────────────────┐
│  User A's data (primary copy)        │
│  - Profile, settings, content        │
│  - Fast reads and writes             │
└──────────────────┬───────────────────┘
                   │
             async backup
                   │
                   ▼
┌──────────────────────────────────────┐
│  US-East (backup)                    │
│  User A's data (replica)             │
│  - Used for DR only                  │
└──────────────────────────────────────┘
```
Geo-Partitioned Data:
Some data has natural geographic affinity: local transactions, regional inventory, and records subject to data-residency rules can simply stay in their home region.
Global vs. Regional Data:
Separate data into categories:
| Data Type | Replication Strategy | Example |
|---|---|---|
| Global reference | Replicate everywhere, read-only | Product catalog |
| User-owned | Primary at user's home region | User profiles |
| Regional | Stays in region | Local transactions |
| Global mutable | Multi-master with conflict resolution | Inventory counts |
Reference Architecture: Active-Active E-Commerce:
```
                    Global DNS (Route 53)
          Latency-based routing + health checks
                            │
            ┌───────────────┼───────────────┐
            │               │               │
     ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
     │   US-East   │ │   EU-West   │ │  AP-Tokyo   │
     │   Region    │ │   Region    │ │   Region    │
     ├─────────────┤ ├─────────────┤ ├─────────────┤
     │     ALB     │ │     ALB     │ │     ALB     │
     │  Web (ASG)  │ │  Web (ASG)  │ │  Web (ASG)  │
     │  App (ASG)  │ │  App (ASG)  │ │  App (ASG)  │
     │    Cache    │ │    Cache    │ │    Cache    │
     │             │ │             │ │             │
     │┌───────────┐│ │┌───────────┐│ │┌───────────┐│
     ││  Aurora   ││ ││  Aurora   ││ ││  Aurora   ││
     ││ Global DB ││ ││ Secondary ││ ││ Secondary ││
     ││ (Primary) ││ ││ (Read Rep)││ ││ (Read Rep)││
     │└───────────┘│ │└───────────┘│ │└───────────┘│
     └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
            │               │               │
            └───────────────┼───────────────┘
                            │
               Aurora Storage Replication
```
Behavior:
- Reads: Local Aurora replica (fast)
- Writes: Route to US-East primary (adds ~100ms)
- Failover: Promote EU or AP if US-East down
- Product catalog: Read from local DynamoDB Global Table
- User cart: DynamoDB Global Table (eventually consistent)
Don't try to build full active-active on day one. Start with active-passive DR, verify it works, then evolve toward active-active. Each step adds confidence while limiting complexity. Many 'active-active' systems actually still use primary-secondary patterns for critical write paths to maintain consistency.
Even with active-active architectures, you need well-defined failover procedures for scenarios where a region cannot serve traffic. Failover must be scripted and rehearsed so it is faster and more reliable than ad-hoc manual intervention.
Automated vs. Manual Failover: automated failover minimizes RTO but risks false positives (failing over on a transient blip), while manual failover adds human judgment at the cost of response time. A common compromise is to automate detection and preparation but keep a human approval step for the final traffic shift.
Failover Runbook Components:
1. Detection and Declaration
2. Traffic Rerouting
3. Data Promotion (if active-passive)
4. Capacity Verification
5. Communication
6. Recovery/Failback
The Failback Challenge:
Failover gets the attention, but failback is often harder:
Data resynchronization: While primary was down, secondary accumulated writes. Those must be replicated back.
Conflict resolution: If primary was partially available (split-brain), conflicting writes may exist.
Gradual reintroduction: Don't flip 100% traffic back instantly. Use weighted routing to shift gradually.
Verification at each step: Confirm data consistency, application health, and performance before full restoration.
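Gradual reintroduction is naturally expressed as a weight schedule for the DNS records. A minimal sketch; the step percentages are illustrative:

```python
def failback_schedule(steps=(10, 25, 50, 100)):
    """Yield (primary_weight, secondary_weight) pairs for weighted
    DNS records: apply one pair, verify health and replication lag,
    wait out the TTL, then move to the next step."""
    for pct in steps:
        yield pct, 100 - pct

for primary_w, secondary_w in failback_schedule():
    # In practice: UPSERT both weighted records with these weights,
    # then run the verification checks before continuing.
    print(f"primary={primary_w}% secondary={secondary_w}%")
```

Any failed check stops the loop, leaving traffic safely weighted toward the still-healthy secondary.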
Runbook Testing:
Your failover runbook is only as good as your last test. Schedule regular DR drills:
| Drill Type | Frequency | Scope | Risk |
|---|---|---|---|
| Tabletop | Monthly | Paper exercise, discuss steps | None |
| Failover Test (Staging) | Quarterly | Full failover in non-prod | Low |
| Failover Test (Prod) | Annually | Full failover with real traffic | Medium |
| Game Day | Bi-annually | Unannounced (to some) drill | Medium |
A failover procedure that has never been tested in a realistic scenario is an untested assumption. Many organizations discover their 'DR plan' doesn't work during an actual disaster—when it's too late. Regular drills are not optional for critical systems.
Cross-region deployments significantly increase infrastructure costs. Understanding these cost drivers helps you make informed trade-offs and optimize where possible.
Major Cost Categories:
| Cost Category | Driver | Optimization Strategies |
|---|---|---|
| Compute | 2× instances (active-active), 1.2-1.5× (warm standby) | Right-size DR, use spot for non-critical DR workloads |
| Storage | Replicated storage across regions | Tier data (replicate hot, archive cold) |
| Data Transfer | $0.02-0.09/GB for cross-region | Compress, batch, limit unnecessary replication |
| Database | Multi-region database services (Aurora Global, DynamoDB GT) | Choose appropriate tier, optimize read/write patterns |
| Load Balancing | Regional LBs + global routing | Use DNS routing where possible vs. Global Accelerator |
| Monitoring | Multi-region observability | Sample in secondary regions |
Data Transfer: The Hidden Cost:
Cross-region data transfer often becomes the dominant cost in multi-region architectures:
```
Scenario: Active-Active with Database Replication

Database writes: 1,000 writes/sec, 5KB average
  = 5 MB/sec = 432 GB/day

Cross-region replication (bidirectional):
  = 432 GB × 2 directions × 30 days = ~26 TB/month

At $0.02/GB inter-region:
  = ~$520/month just for DB replication

Add application traffic, cache sync, logs:
  easily $2,000-5,000/month in transfer alone
```
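The arithmetic above generalizes to a small estimator (decimal units, 1 GB = 10^6 KB, matching the scenario's figures):

```python
def monthly_replication_cost(writes_per_sec, avg_kb, price_per_gb=0.02,
                             directions=2, days=30):
    """Monthly cross-region transfer cost for database replication."""
    gb_per_day = writes_per_sec * avg_kb * 86_400 / 1_000_000  # KB -> GB
    return gb_per_day * directions * days * price_per_gb

# The scenario above: 1,000 writes/sec at 5KB average, bidirectional.
cost = monthly_replication_cost(1_000, 5)
print(f"${cost:,.0f}/month")  # the "~$520/month" figure
```

Varying `directions` (1 for one-way DR replication) or `price_per_gb` per region pair makes the sensitivity of the bill to each assumption obvious.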
Optimization Strategies:
Replicate Less: Not all data needs multi-region replication
Compress Transfer: Compression reduces bytes transferred
Batch Operations: Aggregate small transfers
Lazy Replication: Defer non-critical replication
ROI Calculation Framework:
To justify cross-region costs, quantify the value of improved availability:
Cost of Downtime (per hour):
- Lost revenue (transactions/hour × avg value)
- Productivity loss (employees × hourly cost)
- Recovery costs (engineering time, overtime)
- Reputation damage (hard to quantify, but real)
- Regulatory penalties (if applicable)
Example Calculation:
- Revenue: $500K/hour (e-commerce during peak)
- Productivity: 1,000 employees × $50/hour = $50K/hour
- Recovery: 10 engineers × $100/hour = $1K/hour
Downtime cost: ~$551K/hour
Cross-Region Investment:
- Additional infrastructure: $100K/month
- Reduces expected downtime: 4 hours/year → 15 minutes/year
Value: 3.75 hours saved × $551K = ~$2M/year
Cost: $100K × 12 = $1.2M/year
ROI: Positive, ~$800K/year net benefit
For systems with lower downtime costs, the equation may not favor cross-region. Run the numbers for your specific scenario.
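The framework reduces to simple arithmetic. Plugging in the example's numbers reproduces its figures (the exact net is $866,250/year, which the prose rounds to roughly $800K):

```python
def downtime_cost_per_hour(lost_revenue, productivity_loss, recovery_cost):
    """Sum the quantifiable hourly costs of an outage."""
    return lost_revenue + productivity_loss + recovery_cost

def annual_dr_net_benefit(cost_per_hour, hours_saved_per_year,
                          monthly_infra_cost):
    """Value of downtime avoided minus the added infrastructure cost."""
    return cost_per_hour * hours_saved_per_year - monthly_infra_cost * 12

# The example: $500K revenue + $50K productivity + $1K recovery per hour,
# 3.75 hours of downtime avoided per year, $100K/month extra infrastructure.
hourly = downtime_cost_per_hour(500_000, 50_000, 1_000)
net = annual_dr_net_benefit(hourly, 3.75, 100_000)
print(f"net benefit: ${net:,.0f}/year")
```

If `net` comes out negative for your numbers, the honest conclusion is that a good multi-AZ design is the better investment.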
Begin with a cost-effective DR pattern (Pilot Light or Warm Standby), then optimize based on actual traffic patterns. You'll quickly discover which data needs aggressive replication versus lazy sync. Premature optimization in multi-region leads to unnecessary complexity and cost.
Cross-region deployments extend your availability guarantees beyond any single cloud region, but they bring significant complexity in data management, traffic routing, and operational procedures. Success requires careful design, thorough testing, and ongoing refinement.
What's Next:
With region and availability zone fundamentals complete, we'll conclude this module with Latency Considerations—a deep dive into how physical distance, network paths, and protocol choices affect user-perceived performance. You'll learn to measure, reason about, and optimize latency in globally distributed systems.
You now understand cross-region deployment patterns from basic DR to full active-active architectures. You can design global traffic routing, implement data replication strategies, plan failover procedures, and evaluate the cost-benefit trade-offs. Next, we'll explore the latency implications that tie all these geographic concepts together.