When Netflix's primary US-East region experiences turbulence, users barely notice. Videos continue streaming, queues keep loading, and recommendations appear seamlessly. Behind the scenes, traffic has shifted to a standby region that was patiently waiting for exactly this moment. This is active-passive multi-region architecture in action—the most common and accessible path to regional resilience.
Active-passive architecture designates one region as the primary (active), handling all production traffic, while one or more secondary (passive) regions maintain synchronized copies of data and minimal infrastructure, ready to assume operations when the primary fails. This pattern provides disaster recovery capabilities without the complexity of serving traffic from multiple regions simultaneously.
For organizations taking their first steps beyond single-region deployment, active-passive represents the optimal balance between improved resilience and manageable complexity. It's the foundation upon which more sophisticated architectures can be built.
By the end of this page, you will understand how to architect active-passive multi-region systems, implement effective database replication strategies, design failover procedures that minimize downtime, and operate these systems reliably in production.
An active-passive multi-region deployment consists of several interconnected components, each serving a specific role in maintaining readiness and enabling failover.
The Primary Region
The primary region operates as a complete, self-contained system handling 100% of production traffic:
The Secondary Region
The secondary region maintains a shadow of the primary, ready to assume operations:
The Connection Layer
Critical infrastructure bridges the regions:
Standby Strategies: Cold, Warm, and Hot
The secondary region can operate at different readiness levels, each with distinct cost and recovery time tradeoffs:
Cold Standby (Pilot Light)
Warm Standby
Hot Standby
| Strategy | Ongoing Cost | RTO | Use Cases | Complexity |
|---|---|---|---|---|
| Cold Standby | +10-15% | 30-60 min | Cost-sensitive, lower SLAs | Low |
| Warm Standby | +25-40% | 5-15 min | Balanced cost/recovery | Medium |
| Hot Standby | +60-90% | 1-5 min | Critical applications | Medium-High |
Database replication is the heart of active-passive architecture. The replication strategy determines your Recovery Point Objective (RPO)—how much data you might lose during failover—and significantly impacts primary region performance.
Asynchronous Replication
The most common approach for cross-region replication, asynchronous replication doesn't wait for confirmation from the secondary before acknowledging writes:
For most applications, asynchronous replication represents the right tradeoff. The performance penalty of synchronous replication across geographic distances is prohibitive, and a few seconds of potential data loss is acceptable.
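Because the data lost in an asynchronous failover equals the replication lag at the moment of failure, the honest RPO figure is a high percentile of observed lag, not the average. A hedged sketch (the function name is illustrative; in practice the samples would come from `pg_stat_replication`):

```python
import math


def estimated_rpo_seconds(lag_samples: list[float], percentile: float = 0.99) -> float:
    """Estimate effective RPO as a high percentile of observed replication lag.

    Averages hide burst behaviour, and failures correlate with load spikes,
    so the tail of the lag distribution is the figure to plan against.
    Uses the nearest-rank method, which is adequate for capacity planning.
    """
    if not lag_samples:
        raise ValueError("need at least one lag sample")
    ordered = sorted(lag_samples)
    rank = math.ceil(percentile * len(ordered))
    return ordered[max(0, min(len(ordered), rank) - 1)]
```

Feeding this a day of one-second lag samples gives a defensible answer to "how much data could we actually lose?", which is the question stakeholders will ask.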
```sql
-- Primary server configuration (postgresql.conf)
-- Enable WAL shipping for replication

-- Set replication parameters
wal_level = replica          -- Enable replication-level logging
max_wal_senders = 10         -- Max concurrent replication connections
wal_keep_size = 4GB          -- Retain WAL for replicas that fall behind
synchronous_commit = local   -- Don't wait for remote confirmation (async)
archive_mode = on            -- Enable WAL archiving for PITR
archive_command = 'aws s3 cp %p s3://wal-archive/%f'  -- Archive to S3

-- Network configuration for replication (pg_hba.conf)
-- Allow replication connections from secondary region
host replication replicator 10.1.0.0/16 scram-sha-256
host replication replicator 10.2.0.0/16 scram-sha-256

-- Create replication slot for reliable streaming
SELECT pg_create_physical_replication_slot('secondary_region_slot');

-- Secondary server initialization
-- Run on secondary to create replica from primary
pg_basebackup -h primary.us-east.internal \
  -D /var/lib/postgresql/data \
  -U replicator \
  -v -P \
  --wal-method=stream \
  --slot=secondary_region_slot

-- Secondary server configuration (postgresql.conf)
hot_standby = on           -- Allow read queries on replica
hot_standby_feedback = on  -- Inform primary of replica queries
primary_conninfo = 'host=primary.us-east.internal port=5432 user=replicator application_name=us_west_replica'
primary_slot_name = 'secondary_region_slot'
```

Synchronous Replication
For applications that cannot tolerate any data loss (RPO = 0), synchronous replication ensures every write is confirmed by both primary and secondary before acknowledgment:
Synchronous replication is appropriate only for specific critical data paths—typically financial transactions or other data where any loss is unacceptable. Even then, consider using synchronous replication only for that specific data rather than all writes.
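PostgreSQL supports this selectivity directly: `synchronous_commit` can be set per transaction with `SET LOCAL`, so only critical writes wait for the standby (this assumes `synchronous_standby_names` is configured on the primary). A sketch against a DB-API-style connection; the context-manager name and wiring are illustrative:

```python
from contextlib import contextmanager


@contextmanager
def synchronous_transaction(conn, level: str = "remote_apply"):
    """Run one transaction with elevated durability, leaving the session
    default (e.g. synchronous_commit = local) untouched for other writes.

    SET LOCAL lasts only until COMMIT/ROLLBACK, so only this transaction
    waits for the remote standby's acknowledgment.
    """
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute(f"SET LOCAL synchronous_commit = '{level}'")
        yield cur
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
```

A payments write would run inside `synchronous_transaction(conn)` while ordinary activity-log writes keep the fast asynchronous path.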
Semi-Synchronous Replication
A hybrid approach waits for acknowledgment from at least one replica (which could be local or remote):
Monitoring Replication Health
Replication lag must be continuously monitored. During normal operations, lag should be under one second. Alerts should trigger when lag exceeds acceptable thresholds:
```python
"""Replication Lag Monitoring for Active-Passive Multi-Region

This module monitors database replication lag and triggers alerts
when lag exceeds acceptable thresholds. Critical for understanding
actual RPO and failover readiness.
"""
import logging
from dataclasses import dataclass
from datetime import datetime

import psycopg2


@dataclass
class ReplicationStatus:
    """Status of replication between primary and replica."""
    replica_name: str
    replication_lag_bytes: int
    replication_lag_seconds: float
    replay_lag_seconds: float
    state: str  # 'streaming', 'catchup', 'startup'
    is_healthy: bool
    last_updated: datetime


class ReplicationMonitor:
    """Monitor replication health between primary and secondary regions."""

    # Alert thresholds
    LAG_WARNING_SECONDS = 5.0
    LAG_CRITICAL_SECONDS = 30.0
    LAG_EMERGENCY_SECONDS = 300.0

    def __init__(self, primary_dsn: str, replica_dsn: str):
        self.primary_dsn = primary_dsn
        self.replica_dsn = replica_dsn
        self.logger = logging.getLogger(__name__)

    def get_primary_replication_status(self) -> list[ReplicationStatus]:
        """
        Query primary for status of all replicas.

        Uses pg_stat_replication to get streaming replication info.
        """
        query = """
            SELECT
                application_name,
                state,
                sent_lsn - write_lsn AS pending_bytes,
                COALESCE(EXTRACT(EPOCH FROM write_lag), 0) AS write_lag_seconds,
                COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS replay_lag_seconds
            FROM pg_stat_replication
            WHERE state = 'streaming';
        """
        statuses = []
        with psycopg2.connect(self.primary_dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(query)
                for row in cur.fetchall():
                    is_healthy = (
                        row[3] < self.LAG_CRITICAL_SECONDS
                        and row[4] < self.LAG_CRITICAL_SECONDS
                    )
                    statuses.append(ReplicationStatus(
                        replica_name=row[0],
                        replication_lag_bytes=row[2] or 0,
                        replication_lag_seconds=row[3],
                        replay_lag_seconds=row[4],
                        state=row[1],
                        is_healthy=is_healthy,
                        last_updated=datetime.now()
                    ))
        return statuses

    def get_replica_recovery_status(self) -> dict:
        """
        Query replica for its view of replication status.

        Provides ground-truth from the replica's perspective.
        """
        query = """
            SELECT
                pg_is_in_recovery() AS is_replica,
                pg_last_wal_receive_lsn() AS last_received,
                pg_last_wal_replay_lsn() AS last_replayed,
                pg_last_xact_replay_timestamp() AS last_replay_time,
                EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
                    AS replay_lag_seconds
        """
        with psycopg2.connect(self.replica_dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(query)
                row = cur.fetchone()
                return {
                    "is_replica": row[0],
                    "last_received_lsn": row[1],
                    "last_replayed_lsn": row[2],
                    "last_replay_time": row[3],
                    "replay_lag_seconds": row[4] or 0
                }

    def evaluate_failover_readiness(self) -> dict:
        """
        Evaluate whether the system is ready for a safe failover.

        Returns readiness status and estimated data loss.
        """
        primary_status = self.get_primary_replication_status()
        replica_status = self.get_replica_recovery_status()

        # Find the cross-region replica
        cross_region_replica = next(
            (s for s in primary_status if 'us_west' in s.replica_name),
            None
        )
        if not cross_region_replica:
            return {
                "ready": False,
                "reason": "No cross-region replica found",
                "estimated_data_loss_seconds": None
            }

        lag_seconds = max(
            cross_region_replica.replication_lag_seconds,
            cross_region_replica.replay_lag_seconds
        )

        # Determine readiness level
        if lag_seconds > self.LAG_EMERGENCY_SECONDS:
            readiness = "NOT_READY"
            reason = f"Replication lag ({lag_seconds:.1f}s) exceeds emergency threshold"
        elif lag_seconds > self.LAG_CRITICAL_SECONDS:
            readiness = "DEGRADED"
            reason = f"Elevated replication lag ({lag_seconds:.1f}s)"
        elif lag_seconds > self.LAG_WARNING_SECONDS:
            readiness = "READY_WITH_WARNINGS"
            reason = f"Minor replication lag ({lag_seconds:.1f}s)"
        else:
            readiness = "FULLY_READY"
            reason = "Replication healthy"

        return {
            "ready": readiness in ["FULLY_READY", "READY_WITH_WARNINGS"],
            "readiness_level": readiness,
            "reason": reason,
            "estimated_data_loss_seconds": lag_seconds,
            "replica_state": cross_region_replica.state,
            "replica_healthy": cross_region_replica.is_healthy
        }

    def emit_metrics(self) -> None:
        """Emit replication metrics for monitoring systems."""
        for status in self.get_primary_replication_status():
            # Emit to your metrics system (Prometheus, CloudWatch, etc.)
            self.logger.info(
                f'replication_lag_seconds{{replica="{status.replica_name}"}} '
                f'{status.replication_lag_seconds}'
            )
            self.logger.info(
                f'replay_lag_seconds{{replica="{status.replica_name}"}} '
                f'{status.replay_lag_seconds}'
            )
            self.logger.info(
                f'replication_healthy{{replica="{status.replica_name}"}} '
                f'{1 if status.is_healthy else 0}'
            )
```

Replication lag can spike dramatically during burst write workloads.
A secondary that's normally seconds behind may fall minutes behind during a traffic surge—precisely when you're most likely to need failover. Size your replication infrastructure for peak load, not average load.
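The dynamic behind this warning is simple queueing: backlog accumulates whenever the write rate exceeds replication throughput, and drains only at the surplus capacity afterwards. A toy model (the function name and all numbers are illustrative):

```python
def simulate_lag(write_mb_per_s: list[float], replication_mb_per_s: float) -> list[float]:
    """Backlog (MB) of un-replicated WAL after each second of a workload trace.

    Whenever writes outpace the replication link the backlog grows; it then
    drains only at the *difference* between link capacity and the ongoing
    write rate, which is why a burst leaves the standby behind long after
    traffic returns to normal.
    """
    backlog, history = 0.0, []
    for rate in write_mb_per_s:
        backlog = max(0.0, backlog + rate - replication_mb_per_s)
        history.append(backlog)
    return history
```

For example, a 60-second burst at 80 MB/s over a 50 MB/s link accumulates 1,800 MB of backlog; if steady traffic then writes 20 MB/s, the backlog drains at only 30 MB/s and takes a full additional minute to clear.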
Failover is the moment your multi-region investment proves its value. A well-executed failover is invisible to users; a poorly executed one compounds the original problem. Success requires meticulous planning, regular practice, and battle-tested automation.
Types of Failover
Planned Failover (Graceful)
Unplanned Failover (Emergency)
The Failover Sequence
A complete failover involves multiple coordinated steps:
```python
"""Failover Orchestrator for Active-Passive Multi-Region

This module coordinates the complex sequence of operations required
for a controlled failover from primary to secondary region.
"""
import asyncio
import logging
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class FailoverType(Enum):
    PLANNED = "planned"      # Graceful, with full preparation
    EMERGENCY = "emergency"  # Immediate, potentially with data loss


class FailoverPhase(Enum):
    INITIALIZED = "initialized"
    PRIMARY_ISOLATED = "primary_isolated"
    REPLICATION_FINALIZED = "replication_finalized"
    DATABASE_PROMOTED = "database_promoted"
    APPLICATIONS_CONFIGURED = "applications_configured"
    TRAFFIC_REROUTED = "traffic_rerouted"
    CAPACITY_SCALED = "capacity_scaled"
    VERIFIED = "verified"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLED_BACK = "rolled_back"


@dataclass
class FailoverState:
    """Tracks the current state of a failover operation."""
    failover_id: str
    failover_type: FailoverType
    initiated_at: datetime
    initiated_by: str
    current_phase: FailoverPhase
    source_region: str
    target_region: str
    estimated_data_loss_seconds: float
    notes: list[str]


class FailoverOrchestrator:
    """
    Coordinates multi-region failover operations.

    Implements the complete failover sequence with proper error
    handling, rollback capabilities, and audit logging.
    """

    # Timeout for each phase (seconds)
    PHASE_TIMEOUTS = {
        FailoverPhase.PRIMARY_ISOLATED: 60,
        FailoverPhase.REPLICATION_FINALIZED: 300,
        FailoverPhase.DATABASE_PROMOTED: 120,
        FailoverPhase.APPLICATIONS_CONFIGURED: 60,
        FailoverPhase.TRAFFIC_REROUTED: 30,
        FailoverPhase.CAPACITY_SCALED: 300,
        FailoverPhase.VERIFIED: 120,
    }

    def __init__(
        self,
        database_manager,        # Handles DB promotion
        load_balancer_manager,   # Handles traffic routing
        infrastructure_manager,  # Handles scaling
        monitoring_client,       # For health verification
        notification_service     # For stakeholder alerts
    ):
        self.db = database_manager
        self.lb = load_balancer_manager
        self.infra = infrastructure_manager
        self.monitoring = monitoring_client
        self.notifications = notification_service
        self.logger = logging.getLogger(__name__)
        self.state: Optional[FailoverState] = None

    async def execute_failover(
        self,
        source_region: str,
        target_region: str,
        failover_type: FailoverType,
        operator: str,
        skip_confirmation: bool = False
    ) -> FailoverState:
        """
        Execute a complete failover from source to target region.

        Args:
            source_region: The current primary region
            target_region: The region to promote
            failover_type: PLANNED or EMERGENCY
            operator: ID of person/system initiating failover
            skip_confirmation: For automation, skip human confirmation
        """
        # Initialize failover state
        self.state = FailoverState(
            failover_id=self._generate_failover_id(),
            failover_type=failover_type,
            initiated_at=datetime.now(),
            initiated_by=operator,
            current_phase=FailoverPhase.INITIALIZED,
            source_region=source_region,
            target_region=target_region,
            estimated_data_loss_seconds=0,
            notes=[]
        )

        self.logger.warning(
            f"FAILOVER INITIATED: {self.state.failover_id} "
            f"({failover_type.value}) {source_region} -> {target_region}"
        )

        try:
            # Phase 1: Pre-flight checks
            await self._preflight_checks()

            # Phase 2: Isolate primary (if reachable and planned)
            if failover_type == FailoverType.PLANNED:
                await self._isolate_primary()
            else:
                self._log("Skipping primary isolation (emergency failover)")

            # Phase 3: Finalize replication (wait for sync if possible)
            await self._finalize_replication(failover_type)

            # Phase 4: Promote database
            await self._promote_database()

            # Point of no return checkpoint
            if not skip_confirmation and failover_type == FailoverType.PLANNED:
                self._log("DATABASE PROMOTED - Awaiting confirmation to continue")
                # In real implementation, wait for human confirmation

            # Phase 5: Configure applications
            await self._configure_applications()

            # Phase 6: Reroute traffic
            await self._reroute_traffic()

            # Phase 7: Scale capacity
            await self._scale_capacity()

            # Phase 8: Verify health
            await self._verify_health()

            # Complete
            self.state.current_phase = FailoverPhase.COMPLETED
            self._log("FAILOVER COMPLETED SUCCESSFULLY")

            # Notify stakeholders
            await self.notifications.send_failover_complete(self.state)

        except Exception as e:
            self.state.current_phase = FailoverPhase.FAILED
            self.state.notes.append(f"FAILURE: {str(e)}")
            self.logger.error(f"Failover failed: {e}")
            await self.notifications.send_failover_failed(self.state, str(e))
            raise

        return self.state

    async def _preflight_checks(self) -> None:
        """Verify system is ready for failover."""
        self._log("Running pre-flight checks...")

        # Check target region health
        target_health = await self.monitoring.check_region_health(
            self.state.target_region
        )
        if not target_health.is_healthy:
            raise FailoverError(
                f"Target region {self.state.target_region} is not healthy"
            )

        # Check replication status
        replication = await self.db.get_replication_status(
            self.state.source_region,
            self.state.target_region
        )
        self.state.estimated_data_loss_seconds = replication.lag_seconds

        if replication.lag_seconds > 300:
            self._log(
                f"WARNING: High replication lag ({replication.lag_seconds}s). "
                f"Estimated data loss: {replication.lag_seconds} seconds"
            )

        self._log(f"Pre-flight checks passed. Lag: {replication.lag_seconds}s")

    async def _isolate_primary(self) -> None:
        """Stop accepting new connections at primary."""
        self._log("Isolating primary region...")
        self.state.current_phase = FailoverPhase.PRIMARY_ISOLATED

        # Disable load balancer backends in primary
        await self.lb.disable_backends(self.state.source_region)

        # Wait for connection draining
        await asyncio.sleep(10)  # Allow existing requests to complete

        self._log("Primary region isolated")

    async def _finalize_replication(self, failover_type: FailoverType) -> None:
        """Wait for replication to catch up if possible."""
        self._log("Finalizing replication...")
        self.state.current_phase = FailoverPhase.REPLICATION_FINALIZED

        if failover_type == FailoverType.PLANNED:
            # Wait for full sync (with timeout)
            timeout = self.PHASE_TIMEOUTS[FailoverPhase.REPLICATION_FINALIZED]
            await self.db.wait_for_replication_sync(
                self.state.source_region,
                self.state.target_region,
                timeout_seconds=timeout
            )
            self.state.estimated_data_loss_seconds = 0
            self._log("Replication fully synchronized (zero data loss)")
        else:
            # Emergency: accept current lag state
            current_lag = await self.db.get_replication_lag_seconds(
                self.state.source_region,
                self.state.target_region
            )
            self.state.estimated_data_loss_seconds = current_lag
            self._log(f"Proceeding with {current_lag}s replication lag")

    async def _promote_database(self) -> None:
        """Promote replica to primary status."""
        self._log("PROMOTING DATABASE - Point of no return")
        self.state.current_phase = FailoverPhase.DATABASE_PROMOTED

        await self.db.promote_replica(
            self.state.target_region,
            new_role="primary"
        )

        # Verify promotion succeeded
        is_primary = await self.db.verify_is_primary(self.state.target_region)
        if not is_primary:
            raise FailoverError("Database promotion verification failed")

        self._log("Database promoted successfully")

    async def _configure_applications(self) -> None:
        """Update application configurations for new primary."""
        self._log("Configuring applications...")
        self.state.current_phase = FailoverPhase.APPLICATIONS_CONFIGURED

        # Update connection strings, feature flags, etc.
        await self.infra.update_application_configs(
            region=self.state.target_region,
            database_endpoint=await self.db.get_primary_endpoint(),
            enable_background_jobs=True
        )

        self._log("Applications configured")

    async def _reroute_traffic(self) -> None:
        """Update global load balancer to route to new primary."""
        self._log("Rerouting traffic...")
        self.state.current_phase = FailoverPhase.TRAFFIC_REROUTED

        await self.lb.set_primary_region(self.state.target_region)

        self._log("Traffic rerouted to new primary")

    async def _scale_capacity(self) -> None:
        """Scale up target region to handle full load."""
        self._log("Scaling capacity...")
        self.state.current_phase = FailoverPhase.CAPACITY_SCALED

        await self.infra.scale_to_production(self.state.target_region)

        self._log("Capacity scaled")

    async def _verify_health(self) -> None:
        """Verify new primary is healthy and serving traffic."""
        self._log("Verifying health...")
        self.state.current_phase = FailoverPhase.VERIFIED

        # Wait for health checks to stabilize
        await asyncio.sleep(30)

        health = await self.monitoring.comprehensive_health_check(
            self.state.target_region
        )
        if not health.is_healthy:
            raise FailoverError(f"Health verification failed: {health.issues}")

        self._log(f"Health verified: {health.summary}")

    def _log(self, message: str) -> None:
        """Log and record message in state."""
        self.logger.info(f"[{self.state.failover_id}] {message}")
        self.state.notes.append(f"{datetime.now().isoformat()}: {message}")

    def _generate_failover_id(self) -> str:
        """Generate unique failover identifier."""
        return f"FO-{datetime.now().strftime('%Y%m%d-%H%M%S')}"


class FailoverError(Exception):
    """Exception raised during failover operations."""
    pass
```

Every failover—planned or emergency—should follow an exhaustively documented runbook. During a 3 AM incident, engineers don't have time to reason through steps. The runbook should include exact commands, expected outputs, verification steps, and rollback procedures. Test the runbook regularly, not just the automation.
One of the most consequential decisions in active-passive architecture is the degree of failover automation. This isn't a binary choice—organizations operate across a spectrum from fully manual to fully automated, with most landing somewhere in between.
The Case for Manual Failover
Manual failover requires human operators to evaluate the situation and initiate the failover procedure:
The Case for Automated Failover
Automated failover triggers without human intervention when defined conditions are met:
The Dangerous Middle Ground
Many organizations claim "automated failover" but actually have "automated alerting with manual execution." This captures the worst of both worlds—engineers racing to execute manual steps under time pressure. If you're going to automate, automate the execution too.
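Automating execution also means automating restraint: the trigger should require sustained failure, not a single bad probe, before invoking the orchestrator. A sketch of such a debounced trigger (the class name and threshold are illustrative, not from this page):

```python
class FailoverTrigger:
    """Fire a failover callback only after N consecutive failed health probes.

    A single failed probe is usually noise; sustained failure is the signal.
    Fires at most once, on the assumption that humans re-arm it after review.
    """

    def __init__(self, execute_failover, required_failures: int = 5):
        self.execute_failover = execute_failover
        self.required_failures = required_failures
        self.consecutive_failures = 0
        self.fired = False

    def observe(self, probe_healthy: bool) -> None:
        if probe_healthy:
            self.consecutive_failures = 0  # any success resets the counter
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.required_failures and not self.fired:
            self.fired = True
            self.execute_failover()
```

With 30-second probes and a threshold of 5, the trigger tolerates up to two minutes of flapping before committing to a failover, trading a little RTO for far fewer false positives.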
Recommended Approach: Graduated Automation
Most organizations benefit from graduated automation that increases over time as confidence grows:
Level 1: Fully Manual (Starting Point)
Level 2: Assisted Manual
Level 3: Semi-Automated
Level 4: Fully Automated
Regardless of automation level, humans should always be able to:
Before enabling automated failover, imagine the worst possible time for a false-positive failover to occur—during a critical product launch, a major customer demo, or quarter-end processing. Would the automation's behavior be obviously wrong? If so, add safeguards. Automated systems lack judgment about organizational context.
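One concrete safeguard is a blackout calendar the automation must consult before acting: inside a declared high-stakes window it pages a human instead of executing. A hedged sketch (class name and window format are illustrative):

```python
from datetime import datetime


class BlackoutGuard:
    """Suppress automated failover during declared high-stakes windows.

    During a blackout the automation should alert and wait for a human
    decision rather than execute on its own.
    """

    def __init__(self, windows: list[tuple[datetime, datetime, str]]):
        self.windows = windows  # (start, end, reason)

    def may_auto_failover(self, now: datetime) -> tuple[bool, str]:
        for start, end, reason in self.windows:
            if start <= now < end:
                return False, f"blackout: {reason} (page a human instead)"
        return True, "no blackout active"
```

The orchestrator's automated entry point would call `may_auto_failover(datetime.now())` first and fall back to paging when it returns `False`; manual, human-initiated failover remains available throughout.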
Failback—returning operations from the secondary region back to the original primary after an incident is resolved—is often more complex than the initial failover. Data has diverged, configurations have changed, and the original primary may require careful restoration.
Why Failback Is Harder
Failback Options
Option 1: Role Reversal (Former primary becomes secondary)
Instead of failing back, simply adopt the new topology:
Option 2: Full Failback
Restore the original topology:
Option 3: Gradual Migration
Incrementally shift traffic back rather than cutover:
```bash
#!/bin/bash
# Failback Procedure: US-West (current primary) -> US-East (original primary)
#
# Prerequisites:
# - Original primary (US-East) is recovered and healthy
# - No active incidents
# - Change approval obtained
# - Maintenance window scheduled

set -euo pipefail

CURRENT_PRIMARY="us-west"
TARGET_REGION="us-east"
LOGFILE="/var/log/failback-$(date +%Y%m%d-%H%M%S).log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOGFILE"
}

checkpoint() {
    log "CHECKPOINT: $1"
    read -p "Press enter to continue or Ctrl+C to abort..."
}

# 1. Pre-flight verification
log "=== Starting failback pre-flight checks ==="

log "Checking target region (US-East) health..."
if ! health-check.sh "$TARGET_REGION"; then
    log "ERROR: Target region is not healthy. Aborting."
    exit 1
fi

log "Checking current primary (US-West) health..."
if ! health-check.sh "$CURRENT_PRIMARY"; then
    log "ERROR: Current primary is not healthy. Aborting."
    exit 1
fi

log "Checking replication status..."
REPLICATION_STATUS=$(psql -h db-primary.us-west.internal -c \
    "SELECT state, replay_lag FROM pg_stat_replication WHERE application_name = 'us_east_replica';")
log "Replication status: $REPLICATION_STATUS"

checkpoint "Pre-flight checks complete. Ready to proceed?"

# 2. Configure US-East database as replica of US-West
log "=== Configuring US-East as replica ==="

log "Stopping application servers in US-East to prevent stale data access..."
kubectl --context=us-east scale deployment/api-service --replicas=0

log "Reconfiguring US-East database as replica..."
# This requires the database to re-sync from US-West
# For PostgreSQL, this typically involves pg_basebackup or pg_rewind

ssh db-admin@db1.us-east.internal << 'EOF'
sudo systemctl stop postgresql

# Use pg_rewind if WAL is available, otherwise full resync
if pg_rewind --target-pgdata=/var/lib/postgresql/data \
    --source-server='host=db-primary.us-west.internal port=5432'; then
    echo "pg_rewind successful - incremental sync"
else
    echo "pg_rewind failed - performing full resync"
    rm -rf /var/lib/postgresql/data/*
    pg_basebackup -h db-primary.us-west.internal -D /var/lib/postgresql/data \
        -U replicator -v -P --wal-method=stream
fi

# Configure as standby
touch /var/lib/postgresql/data/standby.signal
cat >> /var/lib/postgresql/data/postgresql.auto.conf << CONF
primary_conninfo = 'host=db-primary.us-west.internal port=5432 user=replicator'
CONF

sudo systemctl start postgresql
EOF

log "Waiting for US-East replica to sync..."
until psql -h db1.us-east.internal -c "SELECT pg_is_in_recovery();" | grep -q 't'; do
    sleep 5
    log "Waiting for replica to start..."
done

checkpoint "US-East database synchronized as replica. Ready for traffic migration?"

# 3. Migrate traffic gradually
log "=== Beginning gradual traffic migration ==="

log "Setting traffic split: 90% US-West, 10% US-East..."
gcloud compute url-maps update global-lb \
    --backend-service-weight=us-west-backend=90,us-east-backend=10

log "Monitoring for 10 minutes..."
sleep 600

# Check error rates
ERROR_RATE=$(prometheus-query "sum(rate(http_errors_total{region='us-east'}[10m])) / sum(rate(http_requests_total{region='us-east'}[10m]))")
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    log "ERROR: High error rate in US-East ($ERROR_RATE). Aborting."
    gcloud compute url-maps update global-lb \
        --backend-service-weight=us-west-backend=100,us-east-backend=0
    exit 1
fi

log "Error rate acceptable ($ERROR_RATE). Increasing to 50%..."
gcloud compute url-maps update global-lb \
    --backend-service-weight=us-west-backend=50,us-east-backend=50

checkpoint "Traffic at 50/50. Monitor for issues before continuing."

log "Finalizing: 100% to US-East..."
# The rest of the failover procedure continues...

log "=== Failback complete. US-East is now primary. ==="
```

Apply the same rigor to failback as to failover: use runbooks, have rollback plans, schedule during low-traffic periods, and notify stakeholders. Many outages have been caused by rushed failback procedures after a successfully handled primary incident.
Running active-passive multi-region in production requires ongoing operational practices that maintain failover readiness over time.
Regular Failover Testing
The only way to know failover works is to actually fail over. Organizations should conduct regular failover exercises:
Without regular testing, failover becomes a theoretical capability that fails when actually needed. Replication breaks, runbooks become outdated, and engineers lose familiarity with procedures.
Chaos Engineering for Active-Passive
Beyond planned testing, inject failures in controlled ways to validate resilience:
Keeping the Secondary Warm
Passive regions tend to accumulate drift over time:
Mitigate drift through:
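One mitigation that is easy to automate is a scheduled configuration diff between regions, alerting on any key that differs. A minimal sketch, assuming configuration snapshots are pulled from your IaC state or cloud APIs (the helper name is illustrative):

```python
def config_drift(primary: dict, secondary: dict) -> dict[str, tuple]:
    """Return keys whose values differ between regions.

    Keys present in one region but missing from the other are reported
    with None on the missing side, since a setting that exists only in
    the primary is exactly the kind of drift that breaks failover.
    """
    drift = {}
    for key in primary.keys() | secondary.keys():
        a, b = primary.get(key), secondary.get(key)
        if a != b:
            drift[key] = (a, b)
    return drift
```

Run weekly (per the checklist below), an empty result is the expected state; any non-empty result is a ticket.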
| Task | Frequency | Purpose | Responsibility |
|---|---|---|---|
| Verify replication lag < threshold | Continuous | RPO maintenance | Automated monitoring |
| Test read queries on replica | Daily | Data accessibility | Automated probe |
| Compare primary/secondary configs | Weekly | Detect drift | GitOps/IaC |
| Mini-failover (DB promotion test) | Monthly | Procedure validation | SRE team |
| Full failover exercise | Quarterly | Complete validation | Engineering org |
| Runbook review and update | After each failover | Documentation accuracy | On-call engineers |
| Capacity planning review | Quarterly | Ensure secondary can handle load | Platform team |
Cost Optimization Strategies
Active-passive architectures offer cost optimization opportunities that active-active does not:
These optimizations must be balanced against RTO requirements—aggressive cost reduction increases time to achieve full capacity during failover.
Organizations often deploy active-passive, test it successfully, then stop regular testing. Over 6-12 months, small drifts accumulate until failover no longer works. Treat failover readiness as a continuous investment, not a one-time project. The maintenance tax is real and must be budgeted.
We've comprehensively explored active-passive multi-region architecture. Let's consolidate the key principles:
What's Next
Active-passive architecture optimizes for disaster recovery but leaves latency opportunities on the table—all traffic still routes to a single region during normal operations. The next page explores active-active multi-region architecture, where all regions serve traffic simultaneously, providing both disaster recovery and latency optimization at the cost of significantly increased complexity.
You now have a comprehensive understanding of active-passive multi-region architecture—the components, procedures, automation considerations, and operational practices required for reliable disaster recovery. This pattern serves as the foundation before advancing to the more complex active-active architectures covered next.