When Netflix's primary US-East region experiences turbulence, users barely notice. Videos continue streaming, queues keep loading, and recommendations appear seamlessly. Behind the scenes, traffic has shifted to a standby region that was patiently waiting for exactly this moment. This is active-passive multi-region architecture in action—the most common and accessible path to regional resilience.
Active-passive architecture designates one region as the primary (active), handling all production traffic, while one or more secondary (passive) regions maintain synchronized copies of data and minimal infrastructure, ready to assume operations when the primary fails. This pattern provides disaster recovery capabilities without the complexity of serving traffic from multiple regions simultaneously.
For organizations taking their first steps beyond single-region deployment, active-passive represents the optimal balance between improved resilience and manageable complexity. It's the foundation upon which more sophisticated architectures can be built.
By the end of this page, you will understand how to architect active-passive multi-region systems, implement effective database replication strategies, design failover procedures that minimize downtime, and operate these systems reliably in production.
An active-passive multi-region deployment consists of several interconnected components, each serving a specific role in maintaining readiness and enabling failover.
The Primary Region
The primary region operates as a complete, self-contained system handling 100% of production traffic:
The Secondary Region
The secondary region maintains a shadow of the primary, ready to assume operations:
The Connection Layer
Critical infrastructure bridges the regions:
Standby Strategies: Cold, Warm, and Hot
The secondary region can operate at different readiness levels, each with distinct cost and recovery time tradeoffs:
Cold Standby (Pilot Light)
Warm Standby
Hot Standby
| Strategy | Ongoing Cost | RTO | Use Cases | Complexity |
|---|---|---|---|---|
| Cold Standby | +10-15% | 30-60 min | Cost-sensitive, lower SLAs | Low |
| Warm Standby | +25-40% | 5-15 min | Balanced cost/recovery | Medium |
| Hot Standby | +60-90% | 1-5 min | Critical applications | Medium-High |
Database replication is the heart of active-passive architecture. The replication strategy determines your Recovery Point Objective (RPO)—how much data you might lose during failover—and significantly impacts primary region performance.
Asynchronous Replication
The most common approach for cross-region replication, asynchronous replication doesn't wait for confirmation from the secondary before acknowledging writes:
For most applications, asynchronous replication represents the right tradeoff. The performance penalty of synchronous replication across geographic distances is prohibitive, and a few seconds of potential data loss is acceptable.
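Because the data lost in an asynchronous failover equals the replication lag at the moment of failure, the honest RPO figure is a high percentile of observed lag, not the average. A hedged sketch (the function name is illustrative; in practice the samples would come from `pg_stat_replication`):

```python
import math


def estimated_rpo_seconds(lag_samples: list[float], percentile: float = 0.99) -> float:
    """Estimate effective RPO as a high percentile of observed replication lag.

    Averages hide burst behaviour, and failures correlate with load spikes,
    so the tail of the lag distribution is the figure to plan against.
    Uses the nearest-rank method, which is adequate for capacity planning.
    """
    if not lag_samples:
        raise ValueError("need at least one lag sample")
    ordered = sorted(lag_samples)
    rank = math.ceil(percentile * len(ordered))
    return ordered[max(0, min(len(ordered), rank) - 1)]
```

Feeding this a day of one-second lag samples gives a defensible answer to "how much data could we actually lose?", which is the question stakeholders will ask.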
```sql
-- Primary server configuration (postgresql.conf)
-- Enable WAL shipping for replication

-- Set replication parameters
wal_level = replica          -- Enable replication-level logging
max_wal_senders = 10         -- Max concurrent replication connections
wal_keep_size = 4GB          -- Retain WAL for replicas that fall behind
synchronous_commit = local   -- Don't wait for remote confirmation (async)
archive_mode = on            -- Enable WAL archiving for PITR
archive_command = 'aws s3 cp %p s3://wal-archive/%f'  -- Archive to S3

-- Network configuration for replication (pg_hba.conf)
-- Allow replication connections from secondary region
host replication replicator 10.1.0.0/16 scram-sha-256
host replication replicator 10.2.0.0/16 scram-sha-256

-- Create replication slot for reliable streaming
SELECT pg_create_physical_replication_slot('secondary_region_slot');

-- Secondary server initialization
-- Run on secondary to create replica from primary
pg_basebackup -h primary.us-east.internal \
  -D /var/lib/postgresql/data \
  -U replicator \
  -v -P \
  --wal-method=stream \
  --slot=secondary_region_slot

-- Secondary server configuration (postgresql.conf)
hot_standby = on           -- Allow read queries on replica
hot_standby_feedback = on  -- Inform primary of replica queries
primary_conninfo = 'host=primary.us-east.internal port=5432 user=replicator application_name=us_west_replica'
primary_slot_name = 'secondary_region_slot'
```

Synchronous Replication
For applications that cannot tolerate any data loss (RPO = 0), synchronous replication ensures every write is confirmed by both primary and secondary before acknowledgment:
Synchronous replication is appropriate only for specific critical data paths—typically financial transactions or other data where any loss is unacceptable. Even then, consider using synchronous replication only for that specific data rather than all writes.
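PostgreSQL supports this selectivity directly: `synchronous_commit` can be set per transaction with `SET LOCAL`, so only critical writes wait for the standby (this assumes `synchronous_standby_names` is configured on the primary). A sketch against a DB-API-style connection; the context-manager name and wiring are illustrative:

```python
from contextlib import contextmanager


@contextmanager
def synchronous_transaction(conn, level: str = "remote_apply"):
    """Run one transaction with elevated durability, leaving the session
    default (e.g. synchronous_commit = local) untouched for other writes.

    SET LOCAL lasts only until COMMIT/ROLLBACK, so only this transaction
    waits for the remote standby's acknowledgment.
    """
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute(f"SET LOCAL synchronous_commit = '{level}'")
        yield cur
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
```

A payments write would run inside `synchronous_transaction(conn)` while ordinary activity-log writes keep the fast asynchronous path.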
Semi-Synchronous Replication
A hybrid approach waits for acknowledgment from at least one replica (which could be local or remote):
Monitoring Replication Health
Replication lag must be continuously monitored. During normal operations, lag should be under one second. Alerts should trigger when lag exceeds acceptable thresholds:
```python
"""Replication Lag Monitoring for Active-Passive Multi-Region

This module monitors database replication lag and triggers alerts
when lag exceeds acceptable thresholds. Critical for understanding
actual RPO and failover readiness.
"""
import logging
from dataclasses import dataclass
from datetime import datetime

import psycopg2


@dataclass
class ReplicationStatus:
    """Status of replication between primary and replica."""
    replica_name: str
    replication_lag_bytes: int
    replication_lag_seconds: float
    replay_lag_seconds: float
    state: str  # 'streaming', 'catchup', 'startup'
    is_healthy: bool
    last_updated: datetime


class ReplicationMonitor:
    """Monitor replication health between primary and secondary regions."""

    # Alert thresholds
    LAG_WARNING_SECONDS = 5.0
    LAG_CRITICAL_SECONDS = 30.0
    LAG_EMERGENCY_SECONDS = 300.0

    def __init__(self, primary_dsn: str, replica_dsn: str):
        self.primary_dsn = primary_dsn
        self.replica_dsn = replica_dsn
        self.logger = logging.getLogger(__name__)

    def get_primary_replication_status(self) -> list[ReplicationStatus]:
        """
        Query primary for status of all replicas.

        Uses pg_stat_replication to get streaming replication info.
        """
        query = """
            SELECT
                application_name,
                state,
                sent_lsn - write_lsn AS pending_bytes,
                COALESCE(EXTRACT(EPOCH FROM write_lag), 0) AS write_lag_seconds,
                COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS replay_lag_seconds
            FROM pg_stat_replication
            WHERE state = 'streaming';
        """
        statuses = []
        with psycopg2.connect(self.primary_dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(query)
                for row in cur.fetchall():
                    is_healthy = (
                        row[3] < self.LAG_CRITICAL_SECONDS
                        and row[4] < self.LAG_CRITICAL_SECONDS
                    )
                    statuses.append(ReplicationStatus(
                        replica_name=row[0],
                        replication_lag_bytes=row[2] or 0,
                        replication_lag_seconds=row[3],
                        replay_lag_seconds=row[4],
                        state=row[1],
                        is_healthy=is_healthy,
                        last_updated=datetime.now()
                    ))
        return statuses

    def get_replica_recovery_status(self) -> dict:
        """
        Query replica for its view of replication status.

        Provides ground-truth from the replica's perspective.
        """
        query = """
            SELECT
                pg_is_in_recovery() AS is_replica,
                pg_last_wal_receive_lsn() AS last_received,
                pg_last_wal_replay_lsn() AS last_replayed,
                pg_last_xact_replay_timestamp() AS last_replay_time,
                EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
                    AS replay_lag_seconds
        """
        with psycopg2.connect(self.replica_dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(query)
                row = cur.fetchone()
                return {
                    "is_replica": row[0],
                    "last_received_lsn": row[1],
                    "last_replayed_lsn": row[2],
                    "last_replay_time": row[3],
                    "replay_lag_seconds": row[4] or 0
                }

    def evaluate_failover_readiness(self) -> dict:
        """
        Evaluate whether the system is ready for a safe failover.

        Returns readiness status and estimated data loss.
        """
        primary_status = self.get_primary_replication_status()
        replica_status = self.get_replica_recovery_status()

        # Find the cross-region replica
        cross_region_replica = next(
            (s for s in primary_status if 'us_west' in s.replica_name),
            None
        )
        if not cross_region_replica:
            return {
                "ready": False,
                "reason": "No cross-region replica found",
                "estimated_data_loss_seconds": None
            }

        lag_seconds = max(
            cross_region_replica.replication_lag_seconds,
            cross_region_replica.replay_lag_seconds
        )

        # Determine readiness level
        if lag_seconds > self.LAG_EMERGENCY_SECONDS:
            readiness = "NOT_READY"
            reason = f"Replication lag ({lag_seconds:.1f}s) exceeds emergency threshold"
        elif lag_seconds > self.LAG_CRITICAL_SECONDS:
            readiness = "DEGRADED"
            reason = f"Elevated replication lag ({lag_seconds:.1f}s)"
        elif lag_seconds > self.LAG_WARNING_SECONDS:
            readiness = "READY_WITH_WARNINGS"
            reason = f"Minor replication lag ({lag_seconds:.1f}s)"
        else:
            readiness = "FULLY_READY"
            reason = "Replication healthy"

        return {
            "ready": readiness in ["FULLY_READY", "READY_WITH_WARNINGS"],
            "readiness_level": readiness,
            "reason": reason,
            "estimated_data_loss_seconds": lag_seconds,
            "replica_state": cross_region_replica.state,
            "replica_healthy": cross_region_replica.is_healthy
        }

    def emit_metrics(self) -> None:
        """Emit replication metrics for monitoring systems."""
        for status in self.get_primary_replication_status():
            # Emit to your metrics system (Prometheus, CloudWatch, etc.)
            self.logger.info(
                f'replication_lag_seconds{{replica="{status.replica_name}"}} '
                f'{status.replication_lag_seconds}'
            )
            self.logger.info(
                f'replay_lag_seconds{{replica="{status.replica_name}"}} '
                f'{status.replay_lag_seconds}'
            )
            self.logger.info(
                f'replication_healthy{{replica="{status.replica_name}"}} '
                f'{1 if status.is_healthy else 0}'
            )
```

Replication lag can spike dramatically during burst write workloads.
A secondary that's normally seconds behind may fall minutes behind during a traffic surge—precisely when you're most likely to need failover. Size your replication infrastructure for peak load, not average load.
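The dynamic behind this warning is simple queueing: backlog accumulates whenever the write rate exceeds replication throughput, and drains only at the surplus capacity afterwards. A toy model (the function name and all numbers are illustrative):

```python
def simulate_lag(write_mb_per_s: list[float], replication_mb_per_s: float) -> list[float]:
    """Backlog (MB) of un-replicated WAL after each second of a workload trace.

    Whenever writes outpace the replication link the backlog grows; it then
    drains only at the *difference* between link capacity and the ongoing
    write rate, which is why a burst leaves the standby behind long after
    traffic returns to normal.
    """
    backlog, history = 0.0, []
    for rate in write_mb_per_s:
        backlog = max(0.0, backlog + rate - replication_mb_per_s)
        history.append(backlog)
    return history
```

For example, a 60-second burst at 80 MB/s over a 50 MB/s link accumulates 1,800 MB of backlog; if steady traffic then writes 20 MB/s, the backlog drains at only 30 MB/s and takes a full additional minute to clear.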
Failover is the moment your multi-region investment proves its value. A well-executed failover is invisible to users; a poorly executed one compounds the original problem. Success requires meticulous planning, regular practice, and battle-tested automation.
Types of Failover
Planned Failover (Graceful)
Unplanned Failover (Emergency)
The Failover Sequence
A complete failover involves multiple coordinated steps:
```python
"""Failover Orchestrator for Active-Passive Multi-Region

This module coordinates the complex sequence of operations required
for a controlled failover from primary to secondary region.
"""
import asyncio
import logging
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class FailoverType(Enum):
    PLANNED = "planned"      # Graceful, with full preparation
    EMERGENCY = "emergency"  # Immediate, potentially with data loss


class FailoverPhase(Enum):
    INITIALIZED = "initialized"
    PRIMARY_ISOLATED = "primary_isolated"
    REPLICATION_FINALIZED = "replication_finalized"
    DATABASE_PROMOTED = "database_promoted"
    APPLICATIONS_CONFIGURED = "applications_configured"
    TRAFFIC_REROUTED = "traffic_rerouted"
    CAPACITY_SCALED = "capacity_scaled"
    VERIFIED = "verified"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLED_BACK = "rolled_back"


@dataclass
class FailoverState:
    """Tracks the current state of a failover operation."""
    failover_id: str
    failover_type: FailoverType
    initiated_at: datetime
    initiated_by: str
    current_phase: FailoverPhase
    source_region: str
    target_region: str
    estimated_data_loss_seconds: float
    notes: list[str]


class FailoverOrchestrator:
    """
    Coordinates multi-region failover operations.

    Implements the complete failover sequence with proper error
    handling, rollback capabilities, and audit logging.
    """

    # Timeout for each phase (seconds)
    PHASE_TIMEOUTS = {
        FailoverPhase.PRIMARY_ISOLATED: 60,
        FailoverPhase.REPLICATION_FINALIZED: 300,
        FailoverPhase.DATABASE_PROMOTED: 120,
        FailoverPhase.APPLICATIONS_CONFIGURED: 60,
        FailoverPhase.TRAFFIC_REROUTED: 30,
        FailoverPhase.CAPACITY_SCALED: 300,
        FailoverPhase.VERIFIED: 120,
    }

    def __init__(
        self,
        database_manager,        # Handles DB promotion
        load_balancer_manager,   # Handles traffic routing
        infrastructure_manager,  # Handles scaling
        monitoring_client,       # For health verification
        notification_service     # For stakeholder alerts
    ):
        self.db = database_manager
        self.lb = load_balancer_manager
        self.infra = infrastructure_manager
        self.monitoring = monitoring_client
        self.notifications = notification_service
        self.logger = logging.getLogger(__name__)
        self.state: Optional[FailoverState] = None

    async def execute_failover(
        self,
        source_region: str,
        target_region: str,
        failover_type: FailoverType,
        operator: str,
        skip_confirmation: bool = False
    ) -> FailoverState:
        """
        Execute a complete failover from source to target region.

        Args:
            source_region: The current primary region
            target_region: The region to promote
            failover_type: PLANNED or EMERGENCY
            operator: ID of person/system initiating failover
            skip_confirmation: For automation, skip human confirmation
        """
        # Initialize failover state
        self.state = FailoverState(
            failover_id=self._generate_failover_id(),
            failover_type=failover_type,
            initiated_at=datetime.now(),
            initiated_by=operator,
            current_phase=FailoverPhase.INITIALIZED,
            source_region=source_region,
            target_region=target_region,
            estimated_data_loss_seconds=0,
            notes=[]
        )

        self.logger.warning(
            f"FAILOVER INITIATED: {self.state.failover_id} "
            f"({failover_type.value}) {source_region} -> {target_region}"
        )

        try:
            # Phase 1: Pre-flight checks
            await self._preflight_checks()

            # Phase 2: Isolate primary (if reachable and planned)
            if failover_type == FailoverType.PLANNED:
                await self._isolate_primary()
            else:
                self._log("Skipping primary isolation (emergency failover)")

            # Phase 3: Finalize replication (wait for sync if possible)
            await self._finalize_replication(failover_type)

            # Phase 4: Promote database
            await self._promote_database()

            # Point of no return checkpoint
            if not skip_confirmation and failover_type == FailoverType.PLANNED:
                self._log("DATABASE PROMOTED - Awaiting confirmation to continue")
                # In real implementation, wait for human confirmation

            # Phase 5: Configure applications
            await self._configure_applications()

            # Phase 6: Reroute traffic
            await self._reroute_traffic()

            # Phase 7: Scale capacity
            await self._scale_capacity()

            # Phase 8: Verify health
            await self._verify_health()

            # Complete
            self.state.current_phase = FailoverPhase.COMPLETED
            self._log("FAILOVER COMPLETED SUCCESSFULLY")

            # Notify stakeholders
            await self.notifications.send_failover_complete(self.state)

        except Exception as e:
            self.state.current_phase = FailoverPhase.FAILED
            self.state.notes.append(f"FAILURE: {str(e)}")
            self.logger.error(f"Failover failed: {e}")
            await self.notifications.send_failover_failed(self.state, str(e))
            raise

        return self.state

    async def _preflight_checks(self) -> None:
        """Verify system is ready for failover."""
        self._log("Running pre-flight checks...")

        # Check target region health
        target_health = await self.monitoring.check_region_health(
            self.state.target_region
        )
        if not target_health.is_healthy:
            raise FailoverError(
                f"Target region {self.state.target_region} is not healthy"
            )

        # Check replication status
        replication = await self.db.get_replication_status(
            self.state.source_region,
            self.state.target_region
        )
        self.state.estimated_data_loss_seconds = replication.lag_seconds

        if replication.lag_seconds > 300:
            self._log(
                f"WARNING: High replication lag ({replication.lag_seconds}s). "
                f"Estimated data loss: {replication.lag_seconds} seconds"
            )

        self._log(f"Pre-flight checks passed. Lag: {replication.lag_seconds}s")

    async def _isolate_primary(self) -> None:
        """Stop accepting new connections at primary."""
        self._log("Isolating primary region...")
        self.state.current_phase = FailoverPhase.PRIMARY_ISOLATED

        # Disable load balancer backends in primary
        await self.lb.disable_backends(self.state.source_region)

        # Wait for connection draining
        await asyncio.sleep(10)  # Allow existing requests to complete

        self._log("Primary region isolated")

    async def _finalize_replication(self, failover_type: FailoverType) -> None:
        """Wait for replication to catch up if possible."""
        self._log("Finalizing replication...")
        self.state.current_phase = FailoverPhase.REPLICATION_FINALIZED

        if failover_type == FailoverType.PLANNED:
            # Wait for full sync (with timeout)
            timeout = self.PHASE_TIMEOUTS[FailoverPhase.REPLICATION_FINALIZED]
            await self.db.wait_for_replication_sync(
                self.state.source_region,
                self.state.target_region,
                timeout_seconds=timeout
            )
            self.state.estimated_data_loss_seconds = 0
            self._log("Replication fully synchronized (zero data loss)")
        else:
            # Emergency: accept current lag state
            current_lag = await self.db.get_replication_lag_seconds(
                self.state.source_region,
                self.state.target_region
            )
            self.state.estimated_data_loss_seconds = current_lag
            self._log(f"Proceeding with {current_lag}s replication lag")

    async def _promote_database(self) -> None:
        """Promote replica to primary status."""
        self._log("PROMOTING DATABASE - Point of no return")
        self.state.current_phase = FailoverPhase.DATABASE_PROMOTED

        await self.db.promote_replica(
            self.state.target_region,
            new_role="primary"
        )

        # Verify promotion succeeded
        is_primary = await self.db.verify_is_primary(self.state.target_region)
        if not is_primary:
            raise FailoverError("Database promotion verification failed")

        self._log("Database promoted successfully")

    async def _configure_applications(self) -> None:
        """Update application configurations for new primary."""
        self._log("Configuring applications...")
        self.state.current_phase = FailoverPhase.APPLICATIONS_CONFIGURED

        # Update connection strings, feature flags, etc.
        await self.infra.update_application_configs(
            region=self.state.target_region,
            database_endpoint=await self.db.get_primary_endpoint(),
            enable_background_jobs=True
        )

        self._log("Applications configured")

    async def _reroute_traffic(self) -> None:
        """Update global load balancer to route to new primary."""
        self._log("Rerouting traffic...")
        self.state.current_phase = FailoverPhase.TRAFFIC_REROUTED

        await self.lb.set_primary_region(self.state.target_region)

        self._log("Traffic rerouted to new primary")

    async def _scale_capacity(self) -> None:
        """Scale up target region to handle full load."""
        self._log("Scaling capacity...")
        self.state.current_phase = FailoverPhase.CAPACITY_SCALED

        await self.infra.scale_to_production(self.state.target_region)

        self._log("Capacity scaled")

    async def _verify_health(self) -> None:
        """Verify new primary is healthy and serving traffic."""
        self._log("Verifying health...")
        self.state.current_phase = FailoverPhase.VERIFIED

        # Wait for health checks to stabilize
        await asyncio.sleep(30)

        health = await self.monitoring.comprehensive_health_check(
            self.state.target_region
        )
        if not health.is_healthy:
            raise FailoverError(f"Health verification failed: {health.issues}")

        self._log(f"Health verified: {health.summary}")

    def _log(self, message: str) -> None:
        """Log and record message in state."""
        self.logger.info(f"[{self.state.failover_id}] {message}")
        self.state.notes.append(f"{datetime.now().isoformat()}: {message}")

    def _generate_failover_id(self) -> str:
        """Generate unique failover identifier."""
        return f"FO-{datetime.now().strftime('%Y%m%d-%H%M%S')}"


class FailoverError(Exception):
    """Exception raised during failover operations."""
    pass
```

Every failover—planned or emergency—should follow an exhaustively documented runbook. During a 3 AM incident, engineers don't have time to reason through steps. The runbook should include exact commands, expected outputs, verification steps, and rollback procedures. Test the runbook regularly, not just the automation.
One of the most consequential decisions in active-passive architecture is the degree of failover automation. This isn't a binary choice—organizations operate across a spectrum from fully manual to fully automated, with most landing somewhere in between.
The Case for Manual Failover
Manual failover requires human operators to evaluate the situation and initiate the failover procedure:
The Case for Automated Failover
Automated failover triggers without human intervention when defined conditions are met:
The Dangerous Middle Ground
Many organizations claim "automated failover" but actually have "automated alerting with manual execution." This captures the worst of both worlds—engineers racing to execute manual steps under time pressure. If you're going to automate, automate the execution too.
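Automating execution also means automating restraint: the trigger should require sustained failure, not a single bad probe, before invoking the orchestrator. A sketch of such a debounced trigger (the class name and threshold are illustrative, not from this page):

```python
class FailoverTrigger:
    """Fire a failover callback only after N consecutive failed health probes.

    A single failed probe is usually noise; sustained failure is the signal.
    Fires at most once, on the assumption that humans re-arm it after review.
    """

    def __init__(self, execute_failover, required_failures: int = 5):
        self.execute_failover = execute_failover
        self.required_failures = required_failures
        self.consecutive_failures = 0
        self.fired = False

    def observe(self, probe_healthy: bool) -> None:
        if probe_healthy:
            self.consecutive_failures = 0  # any success resets the counter
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.required_failures and not self.fired:
            self.fired = True
            self.execute_failover()
```

With 30-second probes and a threshold of 5, the trigger tolerates up to two minutes of flapping before committing to a failover, trading a little RTO for far fewer false positives.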
Recommended Approach: Graduated Automation
Most organizations benefit from graduated automation that increases over time as confidence grows:
Level 1: Fully Manual (Starting Point)
Level 2: Assisted Manual
Level 3: Semi-Automated
Level 4: Fully Automated
Regardless of automation level, humans should always be able to:
Before enabling automated failover, imagine the worst possible time for a false-positive failover to occur—during a critical product launch, a major customer demo, or quarter-end processing. Would the automation's behavior be obviously wrong? If so, add safeguards. Automated systems lack judgment about organizational context.
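One concrete safeguard is a blackout calendar the automation must consult before acting: inside a declared high-stakes window it pages a human instead of executing. A hedged sketch (class name and window format are illustrative):

```python
from datetime import datetime


class BlackoutGuard:
    """Suppress automated failover during declared high-stakes windows.

    During a blackout the automation should alert and wait for a human
    decision rather than execute on its own.
    """

    def __init__(self, windows: list[tuple[datetime, datetime, str]]):
        self.windows = windows  # (start, end, reason)

    def may_auto_failover(self, now: datetime) -> tuple[bool, str]:
        for start, end, reason in self.windows:
            if start <= now < end:
                return False, f"blackout: {reason} (page a human instead)"
        return True, "no blackout active"
```

The orchestrator's automated entry point would call `may_auto_failover(datetime.now())` first and fall back to paging when it returns `False`; manual, human-initiated failover remains available throughout.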
Failback—returning operations from the secondary region back to the original primary after an incident is resolved—is often more complex than the initial failover. Data has diverged, configurations have changed, and the original primary may require careful restoration.
Why Failback Is Harder
Failback Options
Option 1: Role Reversal (Former primary becomes secondary)
Instead of failing back, simply adopt the new topology:
Option 2: Full Failback
Restore the original topology:
Option 3: Gradual Migration
Incrementally shift traffic back rather than cutover:
```bash
#!/bin/bash
# Failback Procedure: US-West (current primary) -> US-East (original primary)
#
# Prerequisites:
# - Original primary (US-East) is recovered and healthy
# - No active incidents
# - Change approval obtained
# - Maintenance window scheduled

set -euo pipefail

CURRENT_PRIMARY="us-west"
TARGET_REGION="us-east"
LOGFILE="/var/log/failback-$(date +%Y%m%d-%H%M%S).log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOGFILE"
}

checkpoint() {
    log "CHECKPOINT: $1"
    read -p "Press enter to continue or Ctrl+C to abort..."
}

# 1. Pre-flight verification
log "=== Starting failback pre-flight checks ==="

log "Checking target region (US-East) health..."
if ! health-check.sh "$TARGET_REGION"; then
    log "ERROR: Target region is not healthy. Aborting."
    exit 1
fi

log "Checking current primary (US-West) health..."
if ! health-check.sh "$CURRENT_PRIMARY"; then
    log "ERROR: Current primary is not healthy. Aborting."
    exit 1
fi

log "Checking replication status..."
REPLICATION_STATUS=$(psql -h db-primary.us-west.internal -c \
    "SELECT state, replay_lag FROM pg_stat_replication WHERE application_name = 'us_east_replica';")
log "Replication status: $REPLICATION_STATUS"

checkpoint "Pre-flight checks complete. Ready to proceed?"

# 2. Configure US-East database as replica of US-West
log "=== Configuring US-East as replica ==="

log "Stopping application servers in US-East to prevent stale data access..."
kubectl --context=us-east scale deployment/api-service --replicas=0

log "Reconfiguring US-East database as replica..."
# This requires the database to re-sync from US-West
# For PostgreSQL, this typically involves pg_basebackup or pg_rewind

ssh db-admin@db1.us-east.internal << 'EOF'
sudo systemctl stop postgresql

# Use pg_rewind if WAL is available, otherwise full resync
if pg_rewind --target-pgdata=/var/lib/postgresql/data \
    --source-server='host=db-primary.us-west.internal port=5432'; then
    echo "pg_rewind successful - incremental sync"
else
    echo "pg_rewind failed - performing full resync"
    rm -rf /var/lib/postgresql/data/*
    pg_basebackup -h db-primary.us-west.internal -D /var/lib/postgresql/data \
        -U replicator -v -P --wal-method=stream
fi

# Configure as standby
touch /var/lib/postgresql/data/standby.signal
cat >> /var/lib/postgresql/data/postgresql.auto.conf << CONF
primary_conninfo = 'host=db-primary.us-west.internal port=5432 user=replicator'
CONF

sudo systemctl start postgresql
EOF

log "Waiting for US-East replica to sync..."
until psql -h db1.us-east.internal -c "SELECT pg_is_in_recovery();" | grep -q 't'; do
    sleep 5
    log "Waiting for replica to start..."
done

checkpoint "US-East database synchronized as replica. Ready for traffic migration?"

# 3. Migrate traffic gradually
log "=== Beginning gradual traffic migration ==="

log "Setting traffic split: 90% US-West, 10% US-East..."
gcloud compute url-maps update global-lb \
    --backend-service-weight=us-west-backend=90,us-east-backend=10

log "Monitoring for 10 minutes..."
sleep 600

# Check error rates
ERROR_RATE=$(prometheus-query "sum(rate(http_errors_total{region='us-east'}[10m])) / sum(rate(http_requests_total{region='us-east'}[10m]))")
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    log "ERROR: High error rate in US-East ($ERROR_RATE). Aborting."
    gcloud compute url-maps update global-lb \
        --backend-service-weight=us-west-backend=100,us-east-backend=0
    exit 1
fi

log "Error rate acceptable ($ERROR_RATE). Increasing to 50%..."
gcloud compute url-maps update global-lb \
    --backend-service-weight=us-west-backend=50,us-east-backend=50

checkpoint "Traffic at 50/50. Monitor for issues before continuing."

log "Finalizing: 100% to US-East..."
# The rest of the failover procedure continues...

log "=== Failback complete. US-East is now primary. ==="
```

Apply the same rigor to failback as to failover: use runbooks, have rollback plans, schedule during low-traffic periods, and notify stakeholders. Many outages have been caused by rushed failback procedures after a successfully handled primary incident.
Running active-passive multi-region in production requires ongoing operational practices that maintain failover readiness over time.
Regular Failover Testing
The only way to know failover works is to actually fail over. Organizations should conduct regular failover exercises:
Without regular testing, failover becomes a theoretical capability that fails when actually needed. Replication breaks, runbooks become outdated, and engineers lose familiarity with procedures.
Chaos Engineering for Active-Passive
Beyond planned testing, inject failures in controlled ways to validate resilience:
Keeping the Secondary Warm
Passive regions tend to accumulate drift over time:
Mitigate drift through:
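One mitigation that is easy to automate is a scheduled configuration diff between regions, alerting on any key that differs. A minimal sketch, assuming configuration snapshots are pulled from your IaC state or cloud APIs (the helper name is illustrative):

```python
def config_drift(primary: dict, secondary: dict) -> dict[str, tuple]:
    """Return keys whose values differ between regions.

    Keys present in one region but missing from the other are reported
    with None on the missing side, since a setting that exists only in
    the primary is exactly the kind of drift that breaks failover.
    """
    drift = {}
    for key in primary.keys() | secondary.keys():
        a, b = primary.get(key), secondary.get(key)
        if a != b:
            drift[key] = (a, b)
    return drift
```

Run weekly (per the checklist below), an empty result is the expected state; any non-empty result is a ticket.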
| Task | Frequency | Purpose | Responsibility |
|---|---|---|---|
| Verify replication lag < threshold | Continuous | RPO maintenance | Automated monitoring |
| Test read queries on replica | Daily | Data accessibility | Automated probe |
| Compare primary/secondary configs | Weekly | Detect drift | GitOps/IaC |
| Mini-failover (DB promotion test) | Monthly | Procedure validation | SRE team |
| Full failover exercise | Quarterly | Complete validation | Engineering org |
| Runbook review and update | After each failover | Documentation accuracy | On-call engineers |
| Capacity planning review | Quarterly | Ensure secondary can handle load | Platform team |
Cost Optimization Strategies
Active-passive architectures offer cost optimization opportunities that active-active does not:
These optimizations must be balanced against RTO requirements—aggressive cost reduction increases time to achieve full capacity during failover.
Organizations often deploy active-passive, test it successfully, then stop regular testing. Over 6-12 months, small drifts accumulate until failover no longer works. Treat failover readiness as a continuous investment, not a one-time project. The maintenance tax is real and must be budgeted.
We've comprehensively explored active-passive multi-region architecture. Let's consolidate the key principles:
What's Next
Active-passive architecture optimizes for disaster recovery but leaves latency opportunities on the table—all traffic still routes to a single region during normal operations. The next page explores active-active multi-region architecture, where all regions serve traffic simultaneously, providing both disaster recovery and latency optimization at the cost of significantly increased complexity.
You now have a comprehensive understanding of active-passive multi-region architecture—the components, procedures, automation considerations, and operational practices required for reliable disaster recovery. This pattern serves as the foundation before advancing to the more complex active-active architectures covered next.