Geo-Distributed Architecture

Level: Advanced

Duration: 90 mins

Topic: Geo-Distributed Architecture

Active-Passive vs Active-Active

The Fundamental Multi-Region Choice

Once you've decided to deploy across multiple regions, you face a fundamental architectural choice: should all regions actively serve traffic (active-active), or should some regions wait in reserve for failures (active-passive)?

This decision profoundly impacts your system's complexity, consistency model, operational characteristics, and cost structure. There's no universally correct answer—the right choice depends on your specific requirements, organizational capabilities, and acceptable trade-offs.

In this page, we'll deeply examine both patterns, understanding not just what they are but when each is appropriate and how to implement them effectively.

What You Will Learn

By the end of this page, you'll understand the architecture and characteristics of active-passive deployments, the architecture and challenges of active-active deployments, detailed trade-off analysis between the patterns, failover and failback procedures for each, and criteria for choosing the appropriate pattern.

Active-Passive Architecture

In an active-passive architecture (also called primary-secondary, master-slave, or standby), one region handles all production traffic while one or more secondary regions maintain synchronized copies of data and infrastructure, ready to take over if the primary fails.

Anatomy of Active-Passive

Primary Region:

Receives 100% of production traffic
Hosts the authoritative database
All write operations occur here
Full production infrastructure running at scale

Secondary Region(s):

No production traffic under normal operation
Maintains replicated copy of data (typically asynchronous)
Infrastructure may be running at full capacity (hot standby), at reduced capacity (warm standby), or mostly not running at all (cold standby)
Regular testing ensures failover capability

Standby Types

The secondary region's readiness level significantly affects both cost and recovery time:

Hot Standby:

Full infrastructure running in passive region
Data replicated with minimal lag
Application servers running but not serving traffic
Failover possible in minutes
Most expensive passive option

Warm Standby:

Core infrastructure running (databases, critical services)
Non-critical services scaled down or stopped
Data replication active with acceptable lag
Failover requires some scaling/startup (10-60 minutes)
Moderate cost

Cold Standby:

Minimal infrastructure running
Data replicated to storage (may have significant lag)
Infrastructure must be provisioned during failover
Failover time measured in hours
Lowest passive cost

Standby Types Comparison
Characteristic	Hot Standby	Warm Standby	Cold Standby
Infrastructure Cost	~80-100% of primary	~30-50% of primary	~5-15% of primary
Failover Time	1-10 minutes	10-60 minutes	1-4 hours
Data Lag (typical)	Seconds	Seconds to minutes	Minutes to hours
Testing Requirements	Moderate	Moderate	Extensive
Operational Complexity	Medium	Medium-High	High during failover
Appropriate For	Low RTO requirements	Balanced cost/speed	Cost optimization, high RTO tolerance

Advantages of Active-Passive

Simplicity:

Single source of truth for all data
No conflict resolution required
Transactions work normally in primary
Simpler mental model for developers

Consistency:

Strong consistency by default
Replication lag does not affect users, because all reads and writes are served by the primary
No split-brain scenarios during normal operation

Cost Efficiency:

Passive region can run at reduced capacity
No cross-region user traffic during normal operation
Data transfer costs limited to replication

Easier Operations:

Clear ownership model (primary is authoritative)
Simpler monitoring (focus on primary, verification of secondary)
Straightforward capacity planning

Disadvantages of Active-Passive

Wasted Capacity:

Secondary infrastructure sits idle most of the time
Still paying for capacity that's rarely used
Cold standby addresses this but increases RTO

Latency for Distant Users:

All users connect to primary region
Users far from primary experience geography-based latency
CDN can help with static content but not dynamic operations

Failover Complexity:

Failover is by definition an exceptional event
Exceptional events are harder to test and validate
Staff may be unfamiliar with failover procedures when needed

Data Loss Risk:

Asynchronous replication means potential data loss during failover (RPO > 0)
Synchronous replication is expensive and adds latency
Trade-off between RPO and performance

Consider Read Traffic Offloading

Many active-passive deployments use the passive region for read traffic even during normal operation. This provides latency benefits for reads while maintaining simplicity for writes. It's a stepping stone toward active-active without full complexity.

Active-Passive Failover

Failover is the critical process of promoting the passive region to active when the primary fails. Getting this right is fundamental to the value proposition of multi-region deployment.

Failover Triggers

Automatic Failover:

Health checks detect primary unavailability
Automated systems initiate failover sequence
Fastest response time but highest risk of false positives

Manual Failover:

Operators detect the issue and decide to fail over
Human validation prevents false positive triggering
Slower response but more considered

Hybrid:

Automated detection with human approval gate
Balance between speed and safety
Common for critical systems
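
One way to reduce the false-positive risk of automatic triggering is to require agreement among several independent probes over several consecutive rounds before proposing failover. The sketch below is a minimal illustration under that assumption; the class name, probe names, and thresholds are hypothetical, and a hybrid setup would pass a positive result to a human approval step rather than act on it directly.

```python
from collections import deque
from typing import Deque, Dict


class FailureDetector:
    """Proposes failover only after repeated agreement among probes."""

    def __init__(self, min_signals: int = 2, consecutive: int = 3):
        self.min_signals = min_signals    # probes that must report "unhealthy"
        self.consecutive = consecutive    # rounds in a row that must agree
        self.history: Deque[bool] = deque(maxlen=consecutive)

    def observe(self, probes: Dict[str, bool]) -> bool:
        """probes maps a probe name to True if it saw the primary as healthy.

        Returns True when failover should be proposed.
        """
        unhealthy_votes = sum(1 for healthy in probes.values() if not healthy)
        self.history.append(unhealthy_votes >= self.min_signals)
        return len(self.history) == self.consecutive and all(self.history)
```

A single healthy round resets the streak, so a transient network blip does not trigger promotion.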

The Failover Sequence

A typical active-passive failover follows this sequence:

Failover Sequence Steps

•Detection: Monitoring systems detect primary region unavailability. Multiple independent signals confirm failure is real (not transient network issue).
•Decision: Automated system or human operator decides to initiate failover based on severity and expected duration of outage.
•Drain (if possible): Attempt to drain in-flight requests from primary. May not be possible if primary is completely unavailable.
•Stop Replication: Halt data replication to secondary to establish consistent point-in-time state.
•Promote Secondary: Promote secondary database to primary role. This is typically the riskiest step.
•Scale Infrastructure: If warm/cold standby, scale up compute, cache, and other infrastructure in secondary region.
•Verify Health: Run health checks on newly promoted primary to confirm services are functional.
•Update Traffic Routing: Update DNS, load balancers, or service discovery to route traffic to new primary.
•Enable Traffic: Start accepting production traffic in new primary region.
•Post-Failover Verification: Monitor for issues, verify functionality with synthetic tests.
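
The sequence above can be encoded as an ordered runbook that halts at the first failing step, so traffic is never routed to a region whose promotion or health verification did not succeed. This is a minimal sketch: the step names and the callables behind them are hypothetical placeholders for whatever tooling actually performs each step.

```python
from typing import Callable, List, Optional, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns (completed step names, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None
```

Recording which steps completed gives operators a precise starting point for manual recovery if the runbook stops partway.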

Critical Failover Considerations

DNS TTL: DNS-based failover is limited by TTL (Time To Live) settings. If the DNS TTL is 1 hour, clients and resolvers that cached the old record may keep trying to reach the failed primary for up to an hour after failover. Common mitigations:

Keep TTL low (60-300 seconds) for production endpoints
Use anycast or other non-DNS routing for faster failover
Implement client-side retry with alternate endpoints
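
Client-side retry with alternate endpoints can be sketched as follows. The function name and the endpoint strings are hypothetical, and `call` stands in for whatever request function the client already uses; it is assumed to raise `ConnectionError` when an endpoint is unreachable.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(endpoints: Sequence[str], call: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return call(endpoint)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise ConnectionError("all endpoints failed") from last_error
```

Because the client holds the alternate endpoint itself, it does not have to wait for a cached DNS record to expire.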

Split Brain: If the primary is network-partitioned (not actually down), both regions might believe they're primary:

Old primary still processing requests from cached DNS
New primary accepting requests after failover
Both writing to their local databases
Severe data inconsistency results

Mitigation requires fencing: ensuring the old primary cannot process writes after failover:

STONITH (Shoot The Other Node In The Head): Forcibly power off old primary
Lease-based: Primary requires actively renewed lease to process writes
Quorum-based: Write requires acknowledgment from majority of replicas
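
The lease-based approach can be illustrated with a small sketch. It assumes that, in a real system, `renew` would succeed only after contacting a lease authority that a partitioned primary cannot reach; here renewal is a local call, and the class names are hypothetical.

```python
import time
from typing import Any, Callable, Dict


class WriteLease:
    """A time-limited permission to accept writes."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.expires_at = float("-inf")  # no lease held initially

    def renew(self) -> None:
        # Assumption: a real implementation renews via a lease authority.
        self.expires_at = self.clock() + self.ttl

    def is_valid(self) -> bool:
        return self.clock() < self.expires_at


class FencedStore:
    """Refuses writes once the lease has lapsed."""

    def __init__(self, lease: WriteLease):
        self.lease = lease
        self.data: Dict[str, Any] = {}

    def write(self, key: str, value: Any) -> None:
        if not self.lease.is_valid():
            raise PermissionError("lease expired: refusing write")
        self.data[key] = value
```

For this to fence safely, the new primary should wait at least one lease TTL (plus a margin for clock error) before accepting writes.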

Data Loss Window: Asynchronous replication means some committed transactions in primary may not have replicated to secondary:

Identify unreplicated transactions during failover
Communicate potential data loss to affected users
Have process to reconcile if old primary recovers
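
If the old primary's log is still readable after the incident, identifying unreplicated transactions reduces to comparing log positions. The sketch below assumes each log entry carries a monotonically increasing sequence number and that the replica recorded the last position it applied before promotion; the entry shape is hypothetical.

```python
from typing import List, Tuple

# (log sequence number, description); a hypothetical entry shape
LogEntry = Tuple[int, str]


def unreplicated_entries(primary_log: List[LogEntry],
                         replica_applied_lsn: int) -> List[LogEntry]:
    """Return primary entries the replica had not applied at promotion time."""
    return [entry for entry in primary_log if entry[0] > replica_applied_lsn]
```

The returned entries are the candidates for user notification and for reconciliation once the old primary recovers.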

Failback Considerations

After failover, you eventually want to return to the original topology (failback):

Recover original primary: Bring original primary back online
Replicate new data: Sync changes made during failover to original primary
Resolve conflicts: Handle any transactions from split-brain period
Reverse failover: Promote original primary, demote temporary primary

Failback is often more complex than failover because there's now divergent data to reconcile.

Test Failover Regularly

Untested failover is unreliable failover. Schedule regular failover drills (quarterly at minimum). Document every step. Time the process. Identify and fix gaps. A failover that's never been tested has unknown behavior when needed.

Active-Active Architecture

In an active-active architecture, all regions simultaneously serve production traffic. Each region can process both read and write operations, maintaining independent but synchronized copies of data.

Anatomy of Active-Active

All Regions:

Receive production traffic (typically routed by proximity)
Maintain local copy of data
Process both reads and writes
Replicate changes to other regions (typically asynchronous)
Operate independently if network partitioned from other regions

Active-Active Variants

Fully Symmetric Active-Active:

All regions handle all operations equally
Data replicated bidirectionally between all regions
Most complex but most resilient

Regional Active-Active:

Users assigned to "home" region
Home region is authoritative for that user's data
Cross-region operations route to home region
Simpler consistency model at cost of some latency

Active-Active with Leader:

All regions active for read/write
One region designated as coordination point for certain operations
Hybrid of active-active and active-passive benefits

Why Active-Active Is Harder

Active-active introduces fundamental distributed systems challenges:

Active-Active Challenges
Challenge	Description	Mitigation Approaches
Concurrent Writes	Same record modified in multiple regions simultaneously	Conflict resolution (LWW, vector clocks, CRDTs, application-level)
Replication Lag	Changes not immediately visible across regions	Acceptable eventual consistency, read-your-writes guarantees
Transaction Spanning Regions	ACID transactions across geographic distance	Avoid or accept performance penalty
Unique Constraints	Enforcing uniqueness across regions	Pre-partitioned ID spaces, coordination services
Referential Integrity	Foreign key relationships across regions	Application-level enforcement, eventual checking
Ordering Guarantees	Event ordering across regions	Vector clocks, causal consistency
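
Vector clocks, listed above for both concurrent writes and ordering, let a region tell whether one version causally precedes another or whether the two were written concurrently. A minimal comparison sketch, assuming each clock maps a region name to a per-region update counter (the region names are illustrative):

```python
from typing import Dict

VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    regions = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in regions)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in regions)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: a genuine conflict
```

Only the "concurrent" case needs a conflict resolution strategy; in the other cases the later version can simply replace the earlier one.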

Advantages of Active-Active

Low Latency Globally:

Users connect to nearby region
Latency is regional, not cross-continental
Better user experience everywhere

No Wasted Capacity:

All infrastructure is serving production traffic
Better ROI on multi-region investment
Natural load distribution

Seamless Failover:

No failover process needed (regions are already handling traffic)
Failed region's traffic simply shifts to other regions
RTO approaches zero for properly implemented active-active

Blast Radius Isolation:

Regional issues affect only that region's users
Bad deployments can be rolled back before global exposure
Natural A/B testing infrastructure

Disadvantages of Active-Active

Complexity:

Dramatically more complex than active-passive
Conflict resolution logic pervades application
Distributed systems expertise required

Eventual Consistency:

Users may see stale data (typically seconds, but potentially longer)
Application must handle consistency anomalies
Some features may be difficult to implement correctly

Data Integrity Challenges:

Unique constraints, foreign keys, and invariants are harder to maintain
Application-level enforcement often required
Risk of subtle bugs that manifest rarely
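
For the specific case of unique identifiers, one application-level technique is to pre-partition the ID space so that regions can never generate the same value. The sketch below interleaves IDs by region index; the class name is hypothetical, and a real generator would also need to persist its counter across restarts.

```python
import itertools


class RegionIdGenerator:
    """Yields IDs congruent to region_index modulo region_count."""

    def __init__(self, region_index: int, region_count: int):
        if not 0 <= region_index < region_count:
            raise ValueError("region_index must be in [0, region_count)")
        # Region i produces i, i + N, i + 2N, ... for N regions.
        self._counter = itertools.count(region_index, region_count)

    def next_id(self) -> int:
        return next(self._counter)
```

This handles generated keys only; uniqueness of user-chosen values such as usernames still needs coordination or after-the-fact checking.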

Higher Cost:

Full infrastructure in all regions
Cross-region data transfer costs
Significant engineering investment

Active-Active Requires Organizational Commitment

Active-active isn't just an infrastructure pattern—it's an application architecture commitment. Every feature must be designed with eventual consistency in mind. Every developer must understand distributed systems implications. Without organizational buy-in, active-active implementations accumulate bugs and eventually fail.

Conflict Resolution in Active-Active

When the same data is modified in multiple regions before replication occurs, a conflict exists. Resolving these conflicts correctly is fundamental to active-active integrity.

Why Conflicts Occur

Conflicts are inevitable when:

Two users modify the same record in different regions
Replication lag exceeds modification frequency
Network partition separates regions temporarily

Conflict Resolution Strategies

Last-Write-Wins (LWW):

Attach timestamp to each write
Latest timestamp wins during conflict
Simple but loses data (earlier writes are discarded)
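Requires closely synchronized clocks (NTP-synchronized clocks still drift, so a write that actually happened later can carry an earlier timestamp and be discarded)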

Highest-Source-Wins:

Assign precedence to regions (e.g., Region A > Region B)
Use precedence when timestamps equal
Simple but creates implicit primary
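
Last-write-wins with a region-precedence tie-break can be sketched in a few lines. The region names, the precedence table, and the `Versioned` type are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical precedence: a higher number wins when timestamps are equal.
REGION_PRECEDENCE = {"region-a": 2, "region-b": 1}


@dataclass(frozen=True)
class Versioned:
    value: Any
    timestamp_ms: int
    region: str


def lww_merge(x: Versioned, y: Versioned) -> Versioned:
    """Pick the later write; break timestamp ties by region precedence."""
    def rank(v: Versioned):
        return (v.timestamp_ms, REGION_PRECEDENCE[v.region])
    return x if rank(x) >= rank(y) else y
```

Because every region applies the same ranking, all replicas pick the same winner regardless of the order in which they see the two writes.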

Custom Merge:

Application-specific logic merges conflicting values
Can preserve information from both writes
Requires careful design per data type
Example: Shopping cart - merge by taking union of items
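
The shopping cart example might look like the sketch below, which assumes a cart is a mapping from SKU to quantity and keeps the larger quantity when both sides contain the same item. That quantity rule is one possible choice, not the only one.

```python
from typing import Dict


def merge_carts(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Union of items; for items in both carts, keep the larger quantity."""
    return {sku: max(a.get(sku, 0), b.get(sku, 0)) for sku in a.keys() | b.keys()}
```

A known limitation of a plain union is that an item removed in one region can reappear after the merge, so removals need separate tracking if they must be preserved.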

CRDTs (Conflict-free Replicated Data Types):

Data structures designed to merge automatically
No conflict resolution needed—merges are deterministic
Limited to specific data types (counters, sets, maps, etc.)
Growing ecosystem of CRDT implementations
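
A grow-only counter (G-Counter) is one of the simplest CRDTs and shows why merges need no conflict resolution: each region increments only its own slot, and merging takes the per-region maximum. A minimal sketch with illustrative region names:

```python
from typing import Dict


class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""

    def __init__(self, region: str):
        self.region = region
        self.counts: Dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Max is commutative, associative, and idempotent, so merges can
        # arrive in any order and be repeated without changing the result.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
```

Two replicas that exchange state in either direction, any number of times, end up reporting the same total.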

User Resolution:

Present conflicts to users for manual resolution
Preserves all information and human judgment
Poor user experience for frequent conflicts
Appropriate for document editing, rarely for real-time systems

Conflict Resolution Strategy Comparison
Strategy	Data Loss	Complexity	Use Cases
Last-Write-Wins	Yes (earlier writes)	Low	Frequently updated, low-value data; user preferences
Custom Merge	No (if designed well)	High	Domain-specific data with known merge semantics
CRDTs	No	Medium (limited data types)	Counters, sets, collaborative editing
User Resolution	No	Medium	Documents, content creation, infrequent updates
Prevent Conflicts	N/A	High (architecture)	Critical data where conflicts are unacceptable

Preventing Conflicts

Often the best conflict resolution is conflict prevention:

Data Partitioning: Assign data to specific regions based on user or tenant:

User's home region owns their data
Cross-region operations route to home region
Conflicts impossible for home-region data
Trade-off: latency for cross-region access
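
Home-region routing can be sketched as a lookup followed by a local-or-forward decision. The directory contents and region names below are hypothetical; in practice the mapping would live in a replicated lookup service.

```python
from typing import Dict, Tuple

# Hypothetical directory mapping each user to the region that owns their data.
HOME_REGION: Dict[str, str] = {"alice": "us-east", "bob": "eu-west"}


def route_write(user_id: str, local_region: str) -> Tuple[str, str]:
    """Return ('local', region) or ('forward', home region) for a write."""
    home = HOME_REGION[user_id]
    if home == local_region:
        return ("local", home)
    return ("forward", home)  # pay cross-region latency, avoid conflicts
```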

Coordination Services: Use distributed locks or consensus for critical operations:

Acquire lock before write
Prevents concurrent modification
Trade-off: performance and availability impact

Append-Only Data Models: Design data as append-only logs:

Conflicts become concurrent events in log
Resolution happens at read time
Natural for event sourcing architectures
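
With an append-only model, merging two regions' logs is a set union followed by a deterministic sort, and the current state is computed by folding over the merged log at read time. The event shape below (timestamp, region, per-region sequence number, delta) is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True, order=True)
class Event:
    timestamp_ms: int
    region: str
    seq: int      # per-region sequence number, makes each event unique
    delta: int    # the change this event records


def merge_logs(*logs: Iterable[Event]) -> List[Event]:
    """Union the logs (dropping duplicates) and sort deterministically."""
    return sorted(set().union(*logs))


def balance(events: Iterable[Event]) -> int:
    """Read-time resolution: fold the merged events into a current value."""
    return sum(event.delta for event in events)
```

Since no event ever overwrites another, receiving the same log twice or in a different order yields the same merged result.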

Single-Regional Operations: Route certain operation types to single region:

Writes to region A, reads from any region
Simpler consistency at cost of write latency
Hybrid of active-active and active-passive

Real-World Conflict Examples

E-commerce Inventory:

Conflict: Two regions sell last item simultaneously
Resolution: Oversell and compensate (common in e-commerce)
Better: Reserve inventory through coordination

Social Media Counter:

Conflict: Like count updated in multiple regions
Resolution: CRDT counter (a grow-only counter if likes are never removed; a PN-counter if unlikes must be supported)
Result: Counts converge without data loss

User Profile:

Conflict: User updates profile in two regions
Resolution: Last-write-wins (usually acceptable)
Better: Field-level LWW to preserve more data
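
Field-level LWW keeps a timestamp per field rather than per record, so edits to different fields in different regions both survive. A minimal sketch, assuming each profile maps a field name to a `(value, timestamp_ms)` pair:

```python
from typing import Any, Dict, Tuple

Profile = Dict[str, Tuple[Any, int]]  # field -> (value, timestamp_ms)


def merge_profiles(a: Profile, b: Profile) -> Profile:
    """For every field, keep whichever side wrote it most recently."""
    merged: Profile = {}
    for field in a.keys() | b.keys():
        candidates = [p[field] for p in (a, b) if field in p]
        merged[field] = max(candidates, key=lambda pair: pair[1])
    return merged
```

Equal timestamps on the same field would still need a deterministic tie-break, such as region precedence.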

Design for Conflict Visibility

Instrument your system to track conflict frequency and resolution outcomes. Conflicts that seem rare in development may be common in production. Understanding your actual conflict patterns helps optimize resolution strategies.

Detailed Trade-off Analysis

Let's systematically compare active-passive and active-active across key dimensions:

Availability

Active-Passive Availability

•RTO: Minutes (hot) to hours (cold)
•Failover is exceptional event
•Failover may fail (untested paths)
•Single-region SLA during normal operation
•Multi-region SLA requires successful failover

Active-Active Availability

•RTO: Near-zero (existing traffic routing)
•No failover process—regions already serving
•Traffic shift automatic and continuous
•Combined multi-region capacity always available
•Achievable 99.99%+ with proper implementation
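
The arithmetic behind the 99.99%+ figure can be sketched under two strong assumptions: regional failures are independent, and traffic shifts to surviving regions instantly and completely. Real deployments fall short of both, so treat the result as an upper bound.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one region is up, assuming independence."""
    return 1 - (1 - per_region) ** regions
```

Under these assumptions, two regions at 99.9% each give roughly 99.9999%; correlated failures (shared dependencies, bad global deployments) are what pull real numbers down.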

Latency

Active-Passive Latency

•All users connect to primary region
•Distant users experience geographic latency
•CDN helps only for static/cached content
•Consistent latency per user location
•Can use passive for reads to improve

Active-Active Latency

•Users connect to nearest region
•Latency is regional (tens of milliseconds, not hundreds)
•All operations benefit, not just cached
•Latency varies by user location
•Cross-region ops may have higher latency

Consistency

Active-Passive Consistency

•Strong consistency in primary
•ACID transactions work normally
•No conflict resolution needed
•Passive may have replication lag (for reads)
•Post-failover: potential data loss

Active-Active Consistency

•Eventual consistency by default
•Cross-region transactions expensive/complex
•Conflict resolution required
•Replication lag affects all operations
•Read-your-writes possible but adds complexity

Cost

Cost Comparison (Relative to Single Region)
Cost Category	Active-Passive (Hot)	Active-Passive (Warm)	Active-Active
Compute	1.8-2.0x	1.3-1.5x	2.0-3.0x
Storage	1.5-2.0x	1.5-2.0x	2.0-3.0x
Networking	1.3-1.5x (replication)	1.2-1.4x	2.0-4.0x (bidirectional + user traffic)
Engineering	1.5-2.0x	1.5-2.0x	2.5-4.0x
Operations	1.3-1.5x	1.3-1.5x	2.0-3.0x
Total (typical)	1.6-1.8x	1.4-1.6x	2.5-3.5x

Operational Complexity

Operational Comparison
Operation	Active-Passive	Active-Active
Regular Deployments	Deploy to primary, sync to passive	Staged rollout across regions, rollback per-region
Database Migrations	Migrate primary, replicate changes	Careful coordination, backward-compatible changes only
Incident Response	Focus on primary, failover as last resort	Redirect traffic, investigate without downtime
Capacity Planning	Plan primary for peak, passive for recovery	Plan each region for partial absorption of others
Testing	Test primary, verify passive sync	Test all regions, test cross-region scenarios
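
The capacity-planning row for active-active can be made concrete with a little arithmetic. The sketch assumes load is spread evenly across regions and that the surviving regions must carry the full global peak when some regions are lost; the function name and figures are illustrative.

```python
def per_region_capacity(total_peak: float, regions: int,
                        tolerated_failures: int = 1) -> float:
    """Capacity each region needs so survivors can carry the global peak."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return total_peak / survivors
```

For example, with a global peak of 9,000 requests per second across three regions, each region normally serves 3,000 but must be sized for 4,500 to absorb the loss of one region.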

No Free Lunch

Active-active provides benefits in availability and latency but costs more and is significantly more complex. Active-passive is simpler and cheaper but sacrifices latency and has non-zero RTO. Your specific requirements determine which trade-off is appropriate.

Choosing the Right Pattern

With the detailed analysis above, let's develop a decision framework:

Choose Active-Passive When:

Strong consistency is critical:

Financial transactions requiring ACID guarantees
Systems where conflicts could cause compliance issues
Applications where eventual consistency would confuse users

Team expertise is limited:

Team lacks distributed systems experience
Cannot invest in training/hiring for active-active complexity
Operational maturity isn't sufficient for active-active

Cost is constrained:

Budget doesn't support full multi-region capacity
Primary goal is disaster recovery, not latency
Return on active-active doesn't justify investment

RTO requirements are relaxed:

Minutes to hours of recovery time is acceptable
Business can tolerate brief outages during failover
SLA requirements are 99.9% or lower
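
The link between SLA targets and tolerable failover time becomes clearer when the SLA is converted to a yearly downtime budget. The helper below is a simple calculation, assuming a 365-day year.

```python
def downtime_budget_minutes_per_year(sla_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability SLA."""
    minutes_per_year = 365 * 24 * 60
    return (1 - sla_percent / 100) * minutes_per_year
```

A 99.9% SLA allows about 525.6 minutes (roughly 8.8 hours) per year, which leaves room for an occasional failover lasting minutes. A 99.99% SLA allows about 52.6 minutes, so a single 10-minute failover would consume close to a fifth of the annual budget.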

Choose Active-Active When:

Low latency is required globally:

Real-time applications (gaming, communication, trading)
User experience directly impacts business metrics
Competing with geo-distributed alternatives

Near-zero RTO is required:

SLA requirements of 99.99% or higher
Cost of any downtime exceeds multi-region investment
Continuous availability is core to value proposition

Eventual consistency is acceptable:

Application can be designed for eventual consistency
Conflicts can be resolved sensibly
Users won't be confused by consistency delays

Team can handle complexity:

Strong distributed systems expertise
Operational maturity to manage complexity
Engineering culture values thorough testing and observability

Common Evolution Path

•Single Region: Start here. Build product, achieve product-market fit.
•Single Region + CDN: Add CDN for static content latency. Simple, high ROI.
•Active-Passive (Cold): Add disaster recovery. Minimal cost, RTO measured in hours.
•Active-Passive (Warm): Improve RTO to tens of minutes. Moderate cost increase.
•Active-Passive + Read Replicas: Serve reads from passive. Latency improvement.
•Active-Active (Regional): Users assigned to regions. Partial active-active.
•Active-Active (Full): All regions serve all operations. Maximum complexity.

Hybrid Approaches

Read Active-Active, Write Active-Passive:

Reads served from any region
Writes routed to primary region
Simpler consistency for writes, low latency for reads
Works well when reads dominate traffic (common)
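
The routing rule for this hybrid can be sketched as a single decision per request. The sketch assumes an HTTP API in which `GET` and `HEAD` are the only read operations; the function and region names are hypothetical.

```python
READ_METHODS = {"GET", "HEAD"}  # assumption: these methods never modify data


def choose_region(method: str, local_region: str, primary_region: str) -> str:
    """Serve reads locally; send every write to the primary region."""
    if method.upper() in READ_METHODS:
        return local_region
    return primary_region
```

Reads served this way may lag the primary, so a user who has just written may need to be routed to the primary briefly to see their own change.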

Active-Active with Leader Election:

Most operations are active-active
Coordinated operations (ID generation, critical transactions) go through leader
Balance of latency and consistency

Feature-Based Patterns:

Different features use different patterns
Real-time features: active-active
Transactional features: active-passive
Complexity: only where needed

These hybrid approaches often provide better trade-off profiles than pure active-passive or pure active-active.

Start Simpler, Evolve as Needed

Don't start with active-active unless requirements demand it. Begin with active-passive, master its operations and failover procedures, then evolve toward active-active as latency or availability requirements increase. Each stage builds operational muscle memory for the next.

Summary: Two Paths, Many Trade-offs

We've deeply examined the two fundamental multi-region patterns. Let's consolidate the key insights:

Key Takeaways

•Active-passive offers simplicity: Strong consistency, simpler operations, lower cost, but higher RTO and global latency.
•Active-active offers performance: Low latency globally, near-zero RTO, but significant complexity and eventual consistency challenges.
•Standby types matter: Hot, warm, and cold standby offer different cost/RTO trade-offs within active-passive.
•Conflict resolution is fundamental: Active-active requires careful thinking about concurrent modifications and their resolution.
•Hybrid approaches are common: Read active-active with write active-passive, or feature-based patterns, often provide optimal trade-offs.
•The right choice depends on context: Requirements, team capabilities, budget, and product category all factor into the decision.

What's next:

Both patterns depend on data replication across regions. The next page dives deep into data replication: synchronous vs asynchronous replication, consistency models, replication lag management, and strategies for keeping data synchronized across geographic distances.

Page Complete

You now understand the two fundamental multi-region deployment patterns: active-passive and active-active. You can evaluate their trade-offs across availability, latency, consistency, cost, and operations, and you have a framework for choosing the appropriate pattern for your context. Next, we'll explore the data replication strategies that underpin both patterns.

3 / 5

Loading learning content...

System Design (HLD)Geo-Distributed Architecture

Geo-Distributed Architecture

LevelAdvanced

Duration90 mins

TopicGeo-Distributed Architecture

3 / 5

Active-Passive vs Active-Active

The Fundamental Multi-Region Choice

In this page, we'll deeply examine both patterns, understanding not just what they are but when each is appropriate and how to implement them effectively.

What You Will Learn

Active-Passive Architecture

Anatomy of Active-Passive

Primary Region:

Receives 100% of production traffic
Hosts the authoritative database
All write operations occur here
Full production infrastructure running at scale

Secondary Region(s):

No production traffic under normal operation
Maintains replicated copy of data (typically asynchronous)
Infrastructure may be running at reduced capacity (hot standby) or not running at all (cold standby)
Regular testing ensures failover capability

Standby Types

The secondary region's readiness level significantly affects both cost and recovery time:

Hot Standby:

Full infrastructure running in passive region
Data replicated with minimal lag
Application servers running but not serving traffic
Failover possible in minutes
Most expensive passive option

Warm Standby:

Core infrastructure running (databases, critical services)
Non-critical services scaled down or stopped
Data replication active with acceptable lag
Failover requires some scaling/startup (10-30 minutes)
Moderate cost

Cold Standby:

Minimal infrastructure running
Data replicated to storage (may have significant lag)
Infrastructure must be provisioned during failover
Failover time measured in hours
Lowest passive cost

Standby Types Comparison
Characteristic	Hot Standby	Warm Standby	Cold Standby
Infrastructure Cost	~80-100% of primary	~30-50% of primary	~5-15% of primary
Failover Time	1-10 minutes	10-60 minutes	1-4 hours
Data Lag (typical)	Seconds	Seconds to minutes	Minutes to hours
Testing Requirements	Moderate	Moderate	Extensive
Operational Complexity	Medium	Medium-High	High during failover
Appropriate For	Low RTO requirements	Balanced cost/speed	Cost optimization, high RTO tolerance

Advantages of Active-Passive

Simplicity:

Single source of truth for all data
No conflict resolution required
Transactions work normally in primary
Simpler mental model for developers

Consistency:

Strong consistency by default
No replication lag affects user experience (in primary)
No split-brain scenarios during normal operation

Cost Efficiency:

Passive region can run at reduced capacity
No cross-region traffic during normal operation
Data transfer costs limited to replication

Easier Operations:

Clear ownership model (primary is authoritative)
Simpler monitoring (focus on primary, verification of secondary)
Straightforward capacity planning

Disadvantages of Active-Passive

Wasted Capacity:

Secondary infrastructure sits idle most of the time
Still paying for capacity that's rarely used
Cold standby addresses this but increases RTO

Latency for Distant Users:

All users connect to primary region
Users far from primary experience geography-based latency
CDN can help with static content but not dynamic operations

Failover Complexity:

Failover is by definition an exceptional event
Exceptional events are harder to test and validate
Staff may be unfamiliar with failover procedures when needed

Data Loss Risk:

Asynchronous replication means potential data loss during failover (RPO > 0)
Synchronous replication is expensive and adds latency
Trade-off between RPO and performance

Consider Read Traffic Offloading

Active-Passive Failover

Failover is the critical process of promoting the passive region to active when the primary fails. Getting this right is fundamental to the value proposition of multi-region deployment.

Failover Triggers

Automatic Failover:

Health checks detect primary unavailability
Automated systems initiate failover sequence
Fastest response time but highest risk of false positives

Manual Failover:

Operators detect issue and decide to failover
Human validation prevents false positive triggering
Slower response but more considered

Hybrid:

Automated detection with human approval gate
Balance between speed and safety
Common for critical systems

The Failover Sequence

A typical active-passive failover follows this sequence:

Failover Sequence Steps

•Detection: Monitoring systems detect primary region unavailability. Multiple independent signals confirm failure is real (not transient network issue).
•Decision: Automated system or human operator decides to initiate failover based on severity and expected duration of outage.
•Drain (if possible): Attempt to drain in-flight requests from primary. May not be possible if primary is completely unavailable.
•Stop Replication: Halt data replication to secondary to establish consistent point-in-time state.
•Promote Secondary: Promote secondary database to primary role. This is typically the riskiest step.
•Scale Infrastructure: If warm/cold standby, scale up compute, cache, and other infrastructure in secondary region.
•Verify Health: Run health checks on newly promoted primary to confirm services are functional.
•Update Traffic Routing: Update DNS, load balancers, or service discovery to route traffic to new primary.
•Enable Traffic: Start accepting production traffic in new primary region.
•Post-Failover Verification: Monitor for issues, verify functionality with synthetic tests.

Critical Failover Considerations

Keep TTL low (60-300 seconds) for production endpoints
Use anycast or other non-DNS routing for faster failover
Implement client-side retry with alternate endpoints

Split Brain: If the primary is network-partitioned (not actually down), both regions might believe they're primary:

Old primary still processing requests from cached DNS
New primary accepting requests after failover
Both writing to their local databases
Severe data inconsistency results

Mitigation requires fencing: ensuring the old primary cannot process writes after failover:

STONITH (Shoot The Other Node In The Head): Forcibly power off old primary
Lease-based: Primary requires actively renewed lease to process writes
Quorum-based: Write requires acknowledgment from majority of replicas

Data Loss Window: Asynchronous replication means some committed transactions in primary may not have replicated to secondary:

Identify unreplicated transactions during failover
Communicate potential data loss to affected users
Have process to reconcile if old primary recovers

Failback Considerations

After failover, you eventually want to return to the original topology (failback):

Recover original primary: Bring original primary back online
Replicate new data: Sync changes made during failover to original primary
Resolve conflicts: Handle any transactions from split-brain period
Reverse failover: Promote original primary, demote temporary primary

Failback is often more complex than failover because there's now divergent data to reconcile.

Test Failover Regularly

Active-Active Architecture

In an active-active architecture, all regions simultaneously serve production traffic. Each region can process both read and write operations, maintaining independent but synchronized copies of data.

Anatomy of Active-Active

All Regions:

Receive production traffic (typically routed by proximity)
Maintain local copy of data
Process both reads and writes
Replicate changes to other regions (typically asynchronous)
Operate independently if network partitioned from other regions

Active-Active Variants

Fully Symmetric Active-Active:

All regions handle all operations equally
Data replicated bidirectionally between all regions
Most complex but most resilient

Regional Active-Active:

Users assigned to "home" region
Home region is authoritative for that user's data
Cross-region operations route to home region
Simpler consistency model at cost of some latency

Active-Active with Leader:

All regions active for read/write
One region designated as coordination point for certain operations
Hybrid of active-active and active-passive benefits

Why Active-Active Is Harder

Active-active introduces fundamental distributed systems challenges:

Active-Active Challenges
Challenge	Description	Mitigation Approaches
Concurrent Writes	Same record modified in multiple regions simultaneously	Conflict resolution (LWW, vector clocks, CRDTs, application-level)
Replication Lag	Changes not immediately visible across regions	Acceptable eventual consistency, read-your-writes guarantees
Transaction Spanning Regions	ACID transactions across geographic distance	Avoid or accept performance penalty
Unique Constraints	Enforcing uniqueness across regions	Pre-partitioned ID spaces, coordination services
Referential Integrity	Foreign key relationships across regions	Application-level enforcement, eventual checking
Ordering Guarantees	Event ordering across regions	Vector clocks, causal consistency
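
To make the "pre-partitioned ID spaces" mitigation concrete, here is a minimal sketch in which each region draws IDs from its own arithmetic sequence, so no two regions can ever issue the same value without coordinating. The class name `RegionIdAllocator` and the striding scheme are illustrative assumptions; real systems often embed a region identifier in the ID bits instead.

```python
import itertools


class RegionIdAllocator:
    """Issues IDs region_index, region_index + region_count, ... (disjoint per region)."""

    def __init__(self, region_index: int, region_count: int):
        if not 0 <= region_index < region_count:
            raise ValueError("region_index must be in [0, region_count)")
        # Each region starts at its own offset and steps by the number of regions.
        self._counter = itertools.count(region_index, region_count)

    def next_id(self) -> int:
        return next(self._counter)
```

One design consequence: `region_count` is baked into every ID already issued, so it is worth reserving slots for regions you may add later.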

Advantages of Active-Active

Low Latency Globally:

Users connect to nearby region
Latency is regional, not cross-continental
Better user experience everywhere

No Wasted Capacity:

All infrastructure is serving production traffic
Better ROI on multi-region investment
Natural load distribution

Seamless Failover:

No failover process needed (regions are already handling traffic)
Failed region's traffic simply shifts to other regions
RTO approaches zero for properly implemented active-active

Blast Radius Isolation:

Regional issues affect only that region's users
Bad deployments can be rolled back before global exposure
Natural A/B testing infrastructure

Disadvantages of Active-Active

Complexity:

Dramatically more complex than active-passive
Conflict resolution logic pervades application
Distributed systems expertise required

Eventual Consistency:

Users may see stale data (typically seconds, but potentially longer)
Application must handle consistency anomalies
Some features may be difficult to implement correctly

Data Integrity Challenges:

Unique constraints, foreign keys, and invariants are harder to maintain
Application-level enforcement often required
Risk of subtle bugs that manifest rarely

Higher Cost:

Full infrastructure in all regions
Cross-region data transfer costs
Significant engineering investment

Active-Active Requires Organizational Commitment

Conflict Resolution in Active-Active

When the same data is modified in multiple regions before replication occurs, a conflict exists. Resolving these conflicts correctly is fundamental to active-active integrity.

Why Conflicts Occur

Conflicts are inevitable when:

Two users modify the same record in different regions
Replication lag exceeds the interval between modifications to the same record
Network partition separates regions temporarily

Conflict Resolution Strategies

Last-Write-Wins (LWW):

Attach timestamp to each write
Latest timestamp wins during conflict
Simple but loses data (earlier writes are discarded)
Requires synchronized clocks (NTP can drift, so a skewed clock can make an older write win)

Highest-Source-Wins:

Assign precedence to regions (e.g., Region A > Region B)
Fall back to region precedence when timestamps are equal
Simple but creates implicit primary
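
The two strategies above combine naturally: compare timestamps first, and fall back to region precedence on a tie. The sketch below assumes each write carries a millisecond timestamp and the name of its originating region; `REGION_PRECEDENCE`, `Versioned`, and `resolve_lww` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical precedence table: higher number wins a timestamp tie.
REGION_PRECEDENCE = {"region-a": 2, "region-b": 1}


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int
    region: str


def resolve_lww(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins, with region precedence as a deterministic tie-breaker."""
    def key(v: Versioned) -> tuple[int, int]:
        return (v.timestamp_ms, REGION_PRECEDENCE[v.region])

    return a if key(a) >= key(b) else b
```

Because every region applies the same rule to the same pair of writes, all regions converge on the same winner regardless of the order in which the writes arrive. The losing write is still discarded.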

Custom Merge:

Application-specific logic merges conflicting values
Can preserve information from both writes
Requires careful design per data type
Example: Shopping cart - merge by taking union of items
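
A minimal sketch of the shopping cart example, assuming a cart is a mapping from SKU to quantity and that the larger quantity should win when both regions hold the same item. The function name `merge_carts` and the max-quantity rule are assumptions made for illustration.

```python
def merge_carts(cart_a: dict[str, int], cart_b: dict[str, int]) -> dict[str, int]:
    """Union of items from both carts; for shared items keep the larger quantity."""
    merged = dict(cart_a)
    for sku, qty in cart_b.items():
        merged[sku] = max(merged.get(sku, 0), qty)
    return merged
```

Note the limitation of a plain union: an item removed in one region reappears if the other region still holds it. Handling removals correctly needs extra bookkeeping such as tombstones.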

CRDTs (Conflict-free Replicated Data Types):

Data structures designed to merge automatically
No conflict resolution needed—merges are deterministic
Limited to specific data types (counters, sets, maps, etc.)
Growing ecosystem of CRDT implementations
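
To show why CRDT merges need no conflict handling, here is a minimal sketch of a grow-only counter (G-Counter). Each region increments only its own slot, and merging takes the per-region maximum. This is an illustrative implementation, not the API of any specific CRDT library.

```python
class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise maximum."""

    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # max() is commutative, associative, and idempotent, so replicas
        # converge no matter how often or in what order they merge.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)
```

Supporting decrements requires a different structure, typically a pair of grow-only counters (a PN-Counter).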

User Resolution:

Present conflicts to users for manual resolution
Preserves all information and human judgment
Poor user experience for frequent conflicts
Appropriate for document editing, rarely for real-time systems

Conflict Resolution Strategy Comparison
Strategy	Data Loss	Complexity	Use Cases
Last-Write-Wins	Yes (earlier writes)	Low	Frequently updated, low-value data; user preferences
Custom Merge	No (if designed well)	High	Domain-specific data with known merge semantics
CRDTs	No	Medium (limited data types)	Counters, sets, collaborative editing
User Resolution	No	Medium	Documents, content creation, infrequent updates
Prevent Conflicts	N/A	High (architecture)	Critical data where conflicts are unacceptable

Preventing Conflicts

Often the best conflict resolution is conflict prevention:

Data Partitioning: Assign data to specific regions based on user or tenant:

User's home region owns their data
Cross-region operations route to home region
Conflicts impossible for home-region data
Trade-off: latency for cross-region access
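
The routing decision behind data partitioning can be sketched in a few lines. This assumes a directory that maps each user to a home region; the `HOME_REGION` table, the region names, and `route_write` are hypothetical.

```python
# Hypothetical user-to-home-region directory.
HOME_REGION = {"alice": "us-east", "bob": "eu-west"}


def route_write(user_id: str, local_region: str) -> tuple[str, bool]:
    """Return (region that must process the write, whether it is a cross-region hop)."""
    home = HOME_REGION[user_id]
    return home, home != local_region
```

Every write for a given user lands in one region, which is why conflicts cannot occur for that data; the boolean marks the requests that pay the cross-region latency.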

Coordination Services: Use distributed locks or consensus for critical operations:

Acquire lock before write
Prevents concurrent modification
Trade-off: performance and availability impact

Append-Only Data Models: Design data as append-only logs:

Conflicts become concurrent events in log
Resolution happens at read time
Natural for event sourcing architectures
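
A minimal sketch of the append-only idea, assuming each event is identified by its originating region plus a per-region sequence number and that state is derived by folding over the merged log at read time. The `Event` fields and the `merge_logs` and `balance` helpers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp_ms: int
    region: str
    seq: int      # per-region sequence number; (region, seq) identifies the event
    delta: int    # e.g. a change to an account balance


def merge_logs(*logs: list[Event]) -> list[Event]:
    """Union the logs, drop duplicates already replicated, and order deterministically."""
    unique = {(e.region, e.seq): e for log in logs for e in log}
    return sorted(unique.values(), key=lambda e: (e.timestamp_ms, e.region, e.seq))


def balance(events: list[Event]) -> int:
    """Resolution at read time: derive current state from the full event history."""
    return sum(e.delta for e in events)
```

Nothing is overwritten, so concurrent writes from two regions both survive in the log and the reader decides how to interpret them.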

Single-Regional Operations: Route certain operation types to single region:

Writes to region A, reads from any region
Simpler consistency at cost of write latency
Hybrid of active-active and active-passive

Real-World Conflict Examples

E-commerce Inventory:

Conflict: Two regions sell last item simultaneously
Resolution: Oversell and compensate (common in e-commerce)
Better: Reserve inventory through coordination

Social Media Counter:

Conflict: Like count updated in multiple regions
Resolution: CRDT counter (grow-only counter; a PN-counter if unlikes must be supported)
Result: Counts converge without data loss

User Profile:

Conflict: User updates profile in two regions
Resolution: Last-write-wins (usually acceptable)
Better: Field-level LWW to preserve more data
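
Field-level LWW for the user profile case can be sketched as follows, assuming each field stores a `(timestamp_ms, value)` pair. The function name `merge_profiles` and this storage layout are assumptions for illustration.

```python
Profile = dict[str, tuple[int, str]]  # field name -> (timestamp_ms, value)


def merge_profiles(a: Profile, b: Profile) -> Profile:
    """Keep, for every field independently, the most recently written value."""
    merged = dict(a)
    for field, stamped in b.items():
        # Comparing the (timestamp, value) tuples gives a deterministic tie-break.
        if field not in merged or stamped > merged[field]:
            merged[field] = stamped
    return merged
```

With record-level LWW, a newer bio edit in one region would discard a name change made in another; here both edits survive because they touch different fields.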

Design for Conflict Visibility

Detailed Trade-off Analysis

Let's systematically compare active-passive and active-active across key dimensions:

Availability

Active-Passive Availability

•RTO: Minutes (hot) to hours (cold)
•Failover is exceptional event
•Failover may fail (untested paths)
•Single-region SLA during normal operation
•Multi-region SLA requires successful failover

Active-Active Availability

•RTO: Near-zero (existing traffic routing)
•No failover process—regions already serving
•Traffic shift automatic and continuous
•Combined multi-region capacity always available
•Achievable 99.99%+ with proper implementation
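
To put the availability figures in perspective, the arithmetic below converts an availability target into allowed downtime per year and estimates combined availability across regions. The combined figure assumes regions fail independently and that any surviving region can carry all traffic, which is optimistic in practice. The function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability (e.g. 0.999)."""
    return (1 - availability) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one region is up, assuming independent failures."""
    return 1 - (1 - per_region) ** regions
```

For example, 99.9% allows about 525.6 minutes of downtime per year, while 99.99% allows about 52.6 minutes, which leaves little room for a manual failover.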

Latency

Active-Passive Latency

•All users connect to primary region
•Distant users experience geographic latency
•CDN helps only for static/cached content
•Consistent latency per user location
•Can serve reads from the passive region to reduce read latency

Active-Active Latency

•Users connect to nearest region
•Latency is regional (tens of milliseconds, not hundreds)
•All operations benefit, not just cached
•Latency varies by user location
•Cross-region ops may have higher latency

Consistency

Active-Passive Consistency

•Strong consistency in primary
•ACID transactions work normally
•No conflict resolution needed
•Passive may have replication lag (for reads)
•Post-failover: potential data loss

Active-Active Consistency

•Eventual consistency by default
•Cross-region transactions expensive/complex
•Conflict resolution required
•Replication lag affects all operations
•Read-your-writes possible but adds complexity
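
One common way to provide read-your-writes is sketched below. It assumes each session remembers the log sequence number (LSN) of its most recent write and that the local replica exposes the LSN it has applied so far. The function name `choose_read_region` and its parameters are hypothetical.

```python
def choose_read_region(
    session_last_write_lsn: int,
    local_replica_lsn: int,
    local_region: str,
    write_region: str,
) -> str:
    """Read locally only if the local replica has caught up with this session's writes."""
    if local_replica_lsn >= session_last_write_lsn:
        return local_region
    # The local replica is still behind: read from the region that accepted the write.
    return write_region
```

The added complexity is visible even here: the session has to carry state, and reads occasionally pay cross-region latency while replication catches up.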

Cost

Cost Comparison (Relative to Single Region)
Cost Category	Active-Passive (Hot)	Active-Passive (Warm)	Active-Active
Compute	1.8-2.0x	1.3-1.5x	2.0-3.0x
Storage	1.5-2.0x	1.5-2.0x	2.0-3.0x
Networking	1.3-1.5x (replication)	1.2-1.4x	2.0-4.0x (bidirectional + user traffic)
Engineering	1.5-2.0x	1.5-2.0x	2.5-4.0x
Operations	1.3-1.5x	1.3-1.5x	2.0-3.0x
Total (typical)	1.6-1.8x	1.4-1.6x	2.5-3.5x

Operational Complexity

Operational Comparison
Operation	Active-Passive	Active-Active
Regular Deployments	Deploy to primary, sync to passive	Staged rollout across regions, rollback per-region
Database Migrations	Migrate primary, replicate changes	Careful coordination, backward-compatible changes only
Incident Response	Focus on primary, failover as last resort	Redirect traffic, investigate without downtime
Capacity Planning	Plan primary for peak, passive for recovery	Plan each region for partial absorption of others
Testing	Test primary, verify passive sync	Test all regions, test cross-region scenarios
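
The capacity planning row for active-active hides a simple calculation. Assuming regions of equal capacity, evenly spread traffic, and even redistribution after a failure, each region's normal utilization must stay at or below (N - f) / N to absorb the loss of f out of N regions. The function name is illustrative.

```python
def max_safe_utilization(regions: int, tolerated_failures: int = 1) -> float:
    """Highest steady-state utilization that still survives the given region losses."""
    if not 0 <= tolerated_failures < regions:
        raise ValueError("tolerated_failures must be in [0, regions)")
    return (regions - tolerated_failures) / regions
```

With two regions each must idle at 50% or less; with four regions the ceiling rises to 75%, which is one reason adding regions can improve the cost efficiency of active-active.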

No Free Lunch

Choosing the Right Pattern

With the detailed analysis above, let's develop a decision framework:

Choose Active-Passive When:

Strong consistency is critical:

Financial transactions requiring ACID guarantees
Systems where conflicts could cause compliance issues
Applications where eventual consistency would confuse users

Team expertise is limited:

Team lacks distributed systems experience
Cannot invest in training/hiring for active-active complexity
Operational maturity isn't sufficient for active-active

Cost is constrained:

Budget doesn't support full multi-region capacity
Primary goal is disaster recovery, not latency
Return on active-active doesn't justify investment

RTO requirements are relaxed:

Minutes to hours of recovery time is acceptable
Business can tolerate brief outages during failover
SLA requirements are 99.9% or lower

Choose Active-Active When:

Low latency is required globally:

Real-time applications (gaming, communication, trading)
User experience directly impacts business metrics
Competing with geo-distributed alternatives

Near-zero RTO is required:

SLA requirements of 99.99% or higher
Cost of any downtime exceeds multi-region investment
Continuous availability is core to value proposition

Eventual consistency is acceptable:

Application can be designed for eventual consistency
Conflicts can be resolved sensibly
Users won't be confused by consistency delays

Team can handle complexity:

Strong distributed systems expertise
Operational maturity to manage complexity
Engineering culture values thorough testing and observability

Common Evolution Path

•Single Region: Start here. Build product, achieve product-market fit.
•Single Region + CDN: Add CDN for static content latency. Simple, high ROI.
•Active-Passive (Cold): Add disaster recovery. Minimal cost, hours RTO.
•Active-Passive (Warm): Improve RTO to minutes. Moderate cost increase.
•Active-Passive + Read Replicas: Serve reads from passive. Latency improvement.
•Active-Active (Regional): Users assigned to regions. Partial active-active.
•Active-Active (Full): All regions serve all operations. Maximum complexity.

Hybrid Approaches

Read Active-Active, Write Active-Passive:

Reads served from any region
Writes routed to primary region
Simpler consistency for writes, low latency for reads
Works well when reads dominate traffic (common)

Active-Active with Leader Election:

Most operations are active-active
Coordinated operations (ID generation, critical transactions) go through leader
Balance of latency and consistency

Feature-Based Patterns:

Different features use different patterns
Real-time features: active-active
Transactional features: active-passive
Complexity: only where needed

These hybrid approaches often provide better trade-off profiles than pure active-passive or pure active-active.

Start Simpler, Evolve as Needed

Summary: Two Paths, Many Trade-offs

We've deeply examined the two fundamental multi-region patterns. Let's consolidate the key insights:

Key Takeaways

•Active-passive offers simplicity: Strong consistency, simpler operations, lower cost, but higher RTO and higher latency for users far from the primary region.
•Active-active offers performance: Low latency globally, near-zero RTO, but significant complexity and eventual consistency challenges.
•Standby types matter: Hot, warm, and cold standby offer different cost/RTO trade-offs within active-passive.
•Conflict resolution is fundamental: Active-active requires careful thinking about concurrent modifications and their resolution.
•Hybrid approaches are common: Read active-active with write active-passive, or feature-based patterns, often provide optimal trade-offs.
•The right choice depends on context: Requirements, team capabilities, budget, and product category all factor into the decision.
