Once you've decided to deploy across multiple regions, you face a fundamental architectural choice: should all regions actively serve traffic (active-active), or should some regions wait in reserve for failures (active-passive)?
This decision profoundly impacts your system's complexity, consistency model, operational characteristics, and cost structure. There's no universally correct answer—the right choice depends on your specific requirements, organizational capabilities, and acceptable trade-offs.
On this page, we examine both patterns in depth: not just what they are, but when each is appropriate and how to implement them effectively.
By the end of this page, you'll understand the architecture and characteristics of active-passive deployments, the architecture and challenges of active-active deployments, detailed trade-off analysis between the patterns, failover and failback procedures for each, and criteria for choosing the appropriate pattern.
In an active-passive architecture (also called primary-secondary, master-slave, or standby), one region handles all production traffic while one or more secondary regions maintain synchronized copies of data and infrastructure, ready to take over if the primary fails.
Primary Region:
Secondary Region(s):
The secondary region's readiness level significantly affects both cost and recovery time:
Hot Standby:
Warm Standby:
Cold Standby:
| Characteristic | Hot Standby | Warm Standby | Cold Standby |
|---|---|---|---|
| Infrastructure Cost | ~80-100% of primary | ~30-50% of primary | ~5-15% of primary |
| Failover Time | 1-10 minutes | 10-60 minutes | 1-4 hours |
| Data Lag (typical) | Seconds | Seconds to minutes | Minutes to hours |
| Testing Requirements | Moderate | Moderate | Extensive |
| Operational Complexity | Medium | Medium-High | High during failover |
| Appropriate For | Low RTO requirements | Balanced cost/speed | Cost optimization, high RTO tolerance |
Simplicity:
Consistency:
Cost Efficiency:
Easier Operations:
Wasted Capacity:
Latency for Distant Users:
Failover Complexity:
Data Loss Risk:
Many active-passive deployments serve read traffic from the passive region even during normal operation. This reduces read latency for nearby users while keeping the write path simple. It's a stepping stone toward active-active without its full complexity.
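The read-from-passive arrangement can be sketched as a small routing function. This is a minimal illustration, not a production router: the `Region` type, the region names, and the latency figures are all hypothetical, and a real deployment would usually make this decision in a load balancer or DNS layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    is_primary: bool
    latency_ms: float  # hypothetical measured latency from the client


def choose_region(regions, operation):
    """Send writes to the single primary; send reads to the nearest region."""
    if operation == "write":
        primaries = [r for r in regions if r.is_primary]
        if len(primaries) != 1:
            raise ValueError("expected exactly one primary region")
        return primaries[0]
    if operation == "read":
        # Any region may serve reads, so pick the lowest-latency one.
        return min(regions, key=lambda r: r.latency_ms)
    raise ValueError(f"unknown operation: {operation!r}")


# Hypothetical view from a client located in Europe.
regions = [
    Region("us-east", is_primary=True, latency_ms=140.0),
    Region("eu-west", is_primary=False, latency_ms=20.0),
]
```

Reads served this way can be stale by the replication lag, so callers that need to see their own latest write may still have to read from the primary.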
Failover is the critical process of promoting the passive region to active when the primary fails. Getting this right is fundamental to the value proposition of multi-region deployment.
Automatic Failover:
Manual Failover:
Hybrid:
A typical active-passive failover follows this sequence:
DNS TTL: DNS-based failover is limited by TTL (Time To Live) settings. If the DNS TTL is 1 hour, clients that have cached the old record may keep trying to reach the failed primary for up to an hour after failover. Common mitigations:
Split Brain: If the primary is network-partitioned (not actually down), both regions might believe they're primary:
Mitigation requires fencing: ensuring the old primary cannot process writes after failover:
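One widely used fencing technique is the fencing token: every promotion increments an epoch number, and the storage layer rejects writes that carry an older epoch. The sketch below assumes a single storage layer (or arbiter) that both regions write through; the `FencedStore` class and its method names are hypothetical, and real systems typically obtain the epoch from a consensus service.

```python
class FencedStore:
    """Storage layer that rejects writes carrying a stale epoch (fencing token)."""

    def __init__(self):
        self.current_epoch = 0
        self.data = {}

    def promote(self):
        """Called on failover: bump the epoch and hand it to the new primary."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch, key, value):
        """Accept the write only if the caller holds the current epoch."""
        if epoch < self.current_epoch:
            raise PermissionError(
                f"stale epoch {epoch}; current epoch is {self.current_epoch}"
            )
        self.data[key] = value


store = FencedStore()
old_primary_epoch = store.promote()      # original primary holds epoch 1
store.write(old_primary_epoch, "order:1", "created")

new_primary_epoch = store.promote()      # failover: new primary holds epoch 2
store.write(new_primary_epoch, "order:1", "paid")

try:
    # The partitioned old primary wakes up and tries to write with epoch 1.
    store.write(old_primary_epoch, "order:1", "cancelled")
except PermissionError as exc:
    rejected = str(exc)
```

The old primary does not need to know it was demoted; its writes are refused at the point where they would do damage.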
Data Loss Window: Asynchronous replication means some transactions committed in the primary may not yet have replicated to the secondary:
After failover, you eventually want to return to the original topology (failback):
Failback is often more complex than failover because there's now divergent data to reconcile.
Untested failover is unreliable failover. Schedule regular failover drills (quarterly at minimum). Document every step. Time the process. Identify and fix gaps. A failover that's never been tested has unknown behavior when needed.
In an active-active architecture, all regions simultaneously serve production traffic. Each region can process both read and write operations, maintaining independent but synchronized copies of data.
All Regions:
Fully Symmetric Active-Active:
Regional Active-Active:
Active-Active with Leader:
Active-active introduces fundamental distributed systems challenges:
| Challenge | Description | Mitigation Approaches |
|---|---|---|
| Concurrent Writes | Same record modified in multiple regions simultaneously | Conflict resolution (LWW, vector clocks, CRDTs, application-level) |
| Replication Lag | Changes not immediately visible across regions | Accept eventual consistency, provide read-your-writes guarantees |
| Transaction Spanning Regions | ACID transactions across geographic distance | Avoid or accept performance penalty |
| Unique Constraints | Enforcing uniqueness across regions | Pre-partitioned ID spaces, coordination services |
| Referential Integrity | Foreign key relationships across regions | Application-level enforcement, eventual checking |
| Ordering Guarantees | Event ordering across regions | Vector clocks, causal consistency |
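Vector clocks appear twice in the table above, so a small sketch may help. Each region keeps a counter per region; comparing two clocks tells you whether one version happened before the other or whether they are concurrent (a true conflict). The function names and the dict-based representation are illustrative choices, not a specific library's API.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts of region -> counter.

    Returns "before" (a happened before b), "after", "equal", or "concurrent".
    """
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: the writes conflict


def merge(a, b):
    """Combine two clocks by taking the per-region maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


# Two regions update the same record without having seen each other's write.
us_version = {"us": 2, "eu": 0}
eu_version = {"us": 1, "eu": 1}
relation = compare(us_version, eu_version)
```

Here `relation` is `"concurrent"`, which is the signal to invoke a conflict resolution strategy rather than simply overwriting one version with the other.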
Low Latency Globally:
No Wasted Capacity:
Seamless Failover:
Blast Radius Isolation:
Complexity:
Eventual Consistency:
Data Integrity Challenges:
Higher Cost:
Active-active isn't just an infrastructure pattern—it's an application architecture commitment. Every feature must be designed with eventual consistency in mind. Every developer must understand distributed systems implications. Without organizational buy-in, active-active implementations accumulate bugs and eventually fail.
When the same data is modified in multiple regions before replication occurs, a conflict exists. Resolving these conflicts correctly is fundamental to active-active integrity.
Conflicts are inevitable when:
Last-Write-Wins (LWW):
Highest-Source-Wins:
Custom Merge:
CRDTs (Conflict-free Replicated Data Types):
User Resolution:
| Strategy | Data Loss | Complexity | Use Cases |
|---|---|---|---|
| Last-Write-Wins | Yes (earlier writes) | Low | Frequently updated, low-value data; user preferences |
| Custom Merge | No (if designed well) | High | Domain-specific data with known merge semantics |
| CRDTs | No | Medium (limited data types) | Counters, sets, collaborative editing |
| User Resolution | No | Medium | Documents, content creation, infrequent updates |
| Prevent Conflicts | N/A | High (architecture) | Critical data where conflicts are unacceptable |
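To make the "Data Loss" column concrete, the sketch below contrasts a last-write-wins register with a grow-only counter (G-Counter), one of the simplest CRDTs. Both classes are minimal illustrations with hypothetical names; the LWW version also assumes reasonably synchronized clocks across regions, which is a significant assumption in practice.

```python
class LWWRegister:
    """Last-write-wins register: the highest (timestamp, region) stamp wins."""

    def __init__(self):
        self.value = None
        self.stamp = (0, "")  # (timestamp, region); region name breaks ties

    def set(self, value, timestamp, region):
        if (timestamp, region) > self.stamp:
            self.value, self.stamp = value, (timestamp, region)

    def merge(self, other):
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp


class GCounter:
    """Grow-only counter CRDT: one slot per region, merged by per-slot max."""

    def __init__(self):
        self.counts = {}

    def increment(self, region, n=1):
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other):
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())


# LWW: two regions set a theme preference concurrently; one write is discarded.
us_pref, eu_pref = LWWRegister(), LWWRegister()
us_pref.set("dark", timestamp=100, region="us")
eu_pref.set("light", timestamp=101, region="eu")
us_pref.merge(eu_pref)
eu_pref.merge(us_pref)

# G-Counter: two regions count likes concurrently; no increment is lost.
us_likes, eu_likes = GCounter(), GCounter()
us_likes.increment("us", 3)
eu_likes.increment("eu", 2)
us_likes.merge(eu_likes)
eu_likes.merge(us_likes)
```

After merging, both registers hold `"light"` (the `"dark"` write is gone), while both counters report 5. The counter loses nothing because each region only ever advances its own slot.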
Often the best conflict resolution is conflict prevention:
Data Partitioning: Assign data to specific regions based on user or tenant:
Coordination Services: Use distributed locks or consensus for critical operations:
Append-Only Data Models: Design data as append-only logs:
Single-Regional Operations: Route certain operation types to a single region:
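The data partitioning idea can be sketched as a deterministic "home region" per tenant: writes for a tenant are executed only in its home region, so two regions never modify the same tenant's data concurrently. The region names and function names below are hypothetical, and hashing is only one possible assignment scheme.

```python
import hashlib

REGIONS = ["us-east", "eu-west", "ap-south"]  # hypothetical region names


def home_region(tenant_id, regions=REGIONS):
    """Map a tenant to exactly one region using a stable hash."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def route_write(tenant_id, local_region):
    """Decide whether to handle a write locally or forward it to the owner."""
    owner = home_region(tenant_id)
    if owner == local_region:
        return ("local", owner)
    return ("forward", owner)


decision = route_write("tenant-42", local_region="eu-west")
```

Modulo hashing reshuffles tenants whenever the region list changes, so a real system would more likely store an explicit tenant-to-region assignment and treat moving a tenant as a deliberate migration.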
E-commerce Inventory:
Social Media Counter:
User Profile:
Instrument your system to track conflict frequency and resolution outcomes. Conflicts that seem rare in development may be common in production. Understanding your actual conflict patterns helps optimize resolution strategies.
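A minimal in-process sketch of such instrumentation is shown below. The `ConflictStats` class and its method names are hypothetical; in a real system these counts would normally be exported to whatever metrics system you already run rather than held in memory.

```python
from collections import Counter


class ConflictStats:
    """Track writes and conflicts per entity type and resolution strategy."""

    def __init__(self):
        self.writes = Counter()     # keyed by entity_type
        self.conflicts = Counter()  # keyed by (entity_type, strategy)

    def record_write(self, entity_type):
        self.writes[entity_type] += 1

    def record_conflict(self, entity_type, strategy):
        self.conflicts[(entity_type, strategy)] += 1

    def conflict_rate(self, entity_type):
        """Fraction of writes to this entity type that hit a conflict."""
        total = self.writes[entity_type]
        if total == 0:
            return 0.0
        hits = sum(
            count
            for (etype, _strategy), count in self.conflicts.items()
            if etype == entity_type
        )
        return hits / total


stats = ConflictStats()
for _ in range(4):
    stats.record_write("user_profile")
stats.record_conflict("user_profile", "last_write_wins")
rate = stats.conflict_rate("user_profile")
```

Breaking the counts down by resolution strategy shows not only how often conflicts occur but also how often a lossy strategy such as last-write-wins is actually discarding data.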
Let's systematically compare active-passive and active-active across key dimensions:
| Cost Category | Active-Passive (Hot) | Active-Passive (Warm) | Active-Active |
|---|---|---|---|
| Compute | 1.8-2.0x | 1.3-1.5x | 2.0-3.0x |
| Storage | 1.5-2.0x | 1.5-2.0x | 2.0-3.0x |
| Networking | 1.3-1.5x (replication) | 1.2-1.4x | 2.0-4.0x (bidirectional + user traffic) |
| Engineering | 1.5-2.0x | 1.5-2.0x | 2.5-4.0x |
| Operations | 1.3-1.5x | 1.3-1.5x | 2.0-3.0x |
| Total (typical) | 1.6-1.8x | 1.4-1.6x | 2.5-3.5x |
| Operation | Active-Passive | Active-Active |
|---|---|---|
| Regular Deployments | Deploy to primary, sync to passive | Staged rollout across regions, rollback per-region |
| Database Migrations | Migrate primary, replicate changes | Careful coordination, backward-compatible changes only |
| Incident Response | Focus on primary, failover as last resort | Redirect traffic, investigate without downtime |
| Capacity Planning | Plan primary for peak, passive for recovery | Plan each region for partial absorption of others |
| Testing | Test primary, verify passive sync | Test all regions, test cross-region scenarios |
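The capacity planning row for active-active can be made concrete with a little arithmetic. Assuming traffic is spread evenly across regions and redistributes evenly after a failure (a simplifying assumption), each region must be sized so that the survivors can carry the full peak. The function names are illustrative.

```python
def required_capacity_per_region(total_peak_rps, regions, tolerated_failures=1):
    """Capacity each region needs so survivors can absorb the full peak load."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return total_peak_rps / survivors


def normal_utilization(regions, tolerated_failures=1):
    """Fraction of provisioned capacity in use at peak when all regions are up."""
    return (regions - tolerated_failures) / regions


# Hypothetical global peak of 30,000 requests per second.
two_region = required_capacity_per_region(30_000, regions=2)
three_region = required_capacity_per_region(30_000, regions=3)
```

With two regions, each must handle the entire 30,000 rps peak and runs at 50% utilization in normal operation; with three regions, each needs 15,000 rps and runs at about 67%. Adding regions lowers the headroom each one has to hold in reserve.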
Active-active provides benefits in availability and latency but costs more and is significantly more complex. Active-passive is simpler and cheaper but sacrifices latency and has non-zero RTO. Your specific requirements determine which trade-off is appropriate.
With the detailed analysis above, let's develop a decision framework:
Strong consistency is critical:
Team expertise is limited:
Cost is constrained:
RTO requirements are relaxed:
Low latency is required globally:
Near-zero RTO is required:
Eventual consistency is acceptable:
Team can handle complexity:
Read Active-Active, Write Active-Passive:
Active-Active with Leader Election:
Feature-Based Patterns:
These hybrid approaches often provide better trade-off profiles than pure active-passive or pure active-active.
Don't start with active-active unless requirements demand it. Begin with active-passive, master its operations and failover procedures, then evolve toward active-active as latency or availability requirements increase. Each stage builds operational muscle memory for the next.
We've deeply examined the two fundamental multi-region patterns. Let's consolidate the key insights:
What's next:
Both patterns depend on data replication across regions. The next page dives deep into data replication: synchronous vs asynchronous replication, consistency models, replication lag management, and strategies for keeping data synchronized across geographic distances.
You now understand the two fundamental multi-region deployment patterns: active-passive and active-active. You can evaluate their trade-offs across availability, latency, consistency, cost, and operations, and you have a framework for choosing the appropriate pattern for your context. Next, we'll explore the data replication strategies that underpin both patterns.