At 2:47 AM, monitoring alerts flood your phone. The primary database server is unresponsive. Applications are returning errors. Customers are complaining on social media. The business is losing thousands of dollars per minute.
This is failover time—the moment when months of careful DR planning either pays off or proves inadequate. The next few minutes will determine whether your organization experiences a brief blip or a prolonged catastrophe.
Failover is the most critical operation in disaster recovery. It's the transition from a failed primary system to a standby, performed under pressure with incomplete information. Getting it wrong can make a bad situation catastrophic. Getting it right restores service and limits damage.
This page prepares you for that moment. You'll learn how to design failover systems, when to trigger failover, how to execute it safely, and how to recover afterward.
By the end of this page, you will understand manual and automatic failover approaches, know how to design failover decision criteria, execute failover procedures safely, avoid common failover mistakes, and plan for failback after the primary is restored.
Failover is the process of transitioning database operations from a failed (or failing) primary server to a standby replica. It's distinct from "switchover," which is a planned, controlled transition.
Failover vs. Switchover:
| Aspect | Failover | Switchover |
|---|---|---|
| Trigger | Unplanned failure | Planned maintenance |
| Time pressure | High—business impact ongoing | Low—scheduled window |
| Data state | May be inconsistent | Cleanly synchronized |
| Rollback | Complex or impossible | Usually straightforward |
| Testing opportunity | None—this is real | Pre-verified procedures |
Failover Goals:
Every failover aims to achieve:
Failover Components:
Successful failover requires coordinated action across multiple components:
Database Layer:
Network Layer:
Application Layer:
Monitoring Layer:
The worst failover scenario is 'split-brain': both the original primary and the promoted standby accept writes simultaneously. This creates irreconcilable data divergence. Failover design must include mechanisms to ensure only one server is accepting writes at any time—typically through fencing, VIP management, or application-level coordination.
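A minimal monitoring sketch of that rule, assuming two nodes, password-less psql access, and placeholder hostnames, simply counts how many servers currently accept writes:

```bash
#!/bin/bash
# Split-brain detector (illustrative sketch; hostnames and credentials are placeholders).
# Counts how many nodes report themselves as writable primaries.
NODES="db1.example.com db2.example.com"
writable=0

for node in $NODES; do
  # pg_is_in_recovery() returns 'f' on a primary, 't' on a standby
  state=$(psql -h "$node" -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null)
  if [ "$state" = "f" ]; then
    writable=$((writable + 1))
    echo "WRITABLE: $node"
  fi
done

if [ "$writable" -gt 1 ]; then
  echo "ALERT: possible split-brain - $writable nodes are accepting writes" >&2
  exit 1
fi
```

Running this from an independent monitoring host on a short interval gives early warning that fencing failed, although it detects split-brain rather than preventing it.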
The decision between manual and automatic failover involves tradeoffs between speed and safety. Neither is universally better—the right choice depends on your environment and risk tolerance.
Automatic Failover:
Failover triggered and executed by software without human intervention.
Advantages:
Disadvantages:
Best for:
Manual Failover:
Failover requires explicit human decision and often manual execution.
Advantages:
Disadvantages:
Best for:
Hybrid Approach:
Many organizations use a hybrid: automated detection and preparation, but a human decision to execute. This captures the benefits of both:
Consider progressive automation: start with manual failover, automate detection and preparation, then automate execution for lower-tier systems, and finally automate critical systems only after extensive testing and confidence-building. This builds organizational experience while managing risk.
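A minimal sketch of that hybrid pattern, where detection and evidence-gathering are automated but a human gives the final go, might look like the following; the hostnames, alerting webhook, and data directory are assumptions:

```bash
#!/bin/bash
# Hybrid failover sketch: automated detection and preparation, manual execution.
# Hostnames, the webhook URL, and PGDATA are placeholders for illustration.
PRIMARY="primary-db.example.com"
STANDBY="standby-db.example.com"
PGDATA="/var/lib/postgresql/data"

if ! pg_isready -h "$PRIMARY" -p 5432 -t 5 >/dev/null; then
  # Automated part: raise the alarm and stage the evidence, but do not promote
  curl -s -X POST -d '{"text":"Primary DB unreachable - failover decision needed"}' \
       "https://alerts.example.com/webhook"   # hypothetical alerting endpoint
  echo "Standby replay position: $(psql -h "$STANDBY" -U postgres -tAc 'SELECT pg_last_wal_replay_lsn();')"

  # Manual part: an operator confirms before the standby is promoted
  read -r -p "Promote $STANDBY to primary? (yes/no) " answer
  if [ "$answer" = "yes" ]; then
    ssh "$STANDBY" "pg_ctl promote -D $PGDATA -w"
  fi
fi
```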
Before failover can occur, you must detect that the primary has failed and decide that failover is the appropriate response. Both steps are harder than they appear.
Detection Challenges:
1. Distinguishing Failure from Transient Issues
Not every problem indicates a permanent failure:
Failing over prematurely in response to a recoverable issue is costly.
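One common guard is to require several consecutive failed probes before declaring the primary down, so a brief spike or restart does not trigger failover. A minimal sketch, with assumed threshold values and a placeholder hostname:

```bash
#!/bin/bash
# Declare the primary failed only after N consecutive failed probes.
# THRESHOLD and INTERVAL are illustrative values, not recommendations.
PRIMARY="primary-db.example.com"
THRESHOLD=3      # consecutive failures required
INTERVAL=10      # seconds between probes
failures=0

while true; do
  if pg_isready -h "$PRIMARY" -p 5432 -t 5 >/dev/null; then
    failures=0                      # any success resets the count
  else
    failures=$((failures + 1))
    echo "probe failed ($failures/$THRESHOLD)"
    if [ "$failures" -ge "$THRESHOLD" ]; then
      echo "Primary considered FAILED after $failures consecutive probes"
      break    # hand off to the failover decision process
    fi
  fi
  sleep "$INTERVAL"
done
```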
2. Detecting the Right Failures
Databases can fail in many ways:
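A probe can at least separate some of these modes. The sketch below leans on pg_isready's documented exit codes plus a follow-up query, with a placeholder hostname:

```bash
#!/bin/bash
# Distinguish failure modes instead of treating "down" as a single state.
# pg_isready exit codes: 0 = accepting connections, 1 = rejecting (e.g., starting up),
# 2 = no response, 3 = no attempt made (bad parameters).
PRIMARY="primary-db.example.com"

pg_isready -h "$PRIMARY" -p 5432 -t 5
case $? in
  0) # The server answers, but it may still be degraded or read-only
     psql -h "$PRIMARY" -U postgres -tAc "SELECT pg_is_in_recovery();"
     ;;
  1) echo "Server up but rejecting connections (possibly restarting)" ;;
  2) echo "No response - host, network, or instance failure" ;;
  3) echo "Check could not even be attempted - verify monitoring configuration" ;;
esac
```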
3. Confirming Primary is Truly Failed
The standby's view of the primary may be incorrect:
Detection Strategies:
Multi-Point Detection:
Never rely on a single check. Require multiple independent confirmations:
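For instance, a minimal sketch with three independent probes (ICMP, connection check, and an actual query) against a placeholder hostname:

```bash
#!/bin/bash
# Multi-point detection sketch: all three independent checks must fail before
# the primary is treated as down. Host and port are placeholders.
PRIMARY="primary-db.example.com"
failed=0

# Check 1: host reachability (ICMP)
ping -c 2 -W 2 "$PRIMARY" >/dev/null 2>&1            || failed=$((failed + 1))
# Check 2: PostgreSQL accepting connections
pg_isready -h "$PRIMARY" -p 5432 -t 5 >/dev/null     || failed=$((failed + 1))
# Check 3: an actual query round-trip
psql -h "$PRIMARY" -U postgres -tAc "SELECT 1;" >/dev/null 2>&1 || failed=$((failed + 1))

if [ "$failed" -eq 3 ]; then
  echo "All independent checks failed - escalate for a failover decision"
else
  echo "$failed/3 checks failed - investigate before failing over"
fi
```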
Quorum Systems:
In clustered environments, use quorum voting:
External Witness:
Use a third site to arbitrate:
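A sketch of a witness check, run from a third site that has independent network paths to both database servers (hostnames are placeholders):

```bash
#!/bin/bash
# External-witness sketch: probe both sides from a neutral third site and
# render a verdict. Hostnames are placeholders.
PRIMARY="primary-db.example.com"
STANDBY="standby-db.example.com"

primary_ok=false
standby_ok=false
pg_isready -h "$PRIMARY" -p 5432 -t 5 >/dev/null && primary_ok=true
pg_isready -h "$STANDBY" -p 5432 -t 5 >/dev/null && standby_ok=true

if ! $primary_ok && $standby_ok; then
  echo "Verdict: primary unreachable from the third site - failover is reasonable"
elif $primary_ok && $standby_ok; then
  echo "Verdict: both reachable - suspect a partition between the DB sites; do not promote"
else
  echo "Verdict: inconclusive - escalate to a human"
fi
```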
| Scenario | Primary Status | Standby Status | Decision |
|---|---|---|---|
| Clean primary failure | Crashed, unrecoverable | Healthy, current | Failover immediately |
| Primary overloaded | Slow but responding | Healthy, current | Investigate first |
| Network partition | Unknown from standby | Healthy, lagging | Use witness/quorum |
| Standby lagging significantly | Failed | Behind by hours | Assess data loss first |
| Both degraded | Unhealthy | Unhealthy | Emergency intervention |
| Storage failure on primary | Hanging | Healthy, current | Failover after verification |
A network partition can make the primary seem failed when it's actually healthy and serving clients on its side of the partition. If you promote the standby, you now have two primaries accepting writes—a split-brain disaster. Always verify primary failure through multiple independent paths, and use fencing to ensure the old primary cannot continue accepting writes.
Once the decision to failover is made, execution must be precise and coordinated. The sequence of operations matters—doing things out of order can cause additional problems.
Failover Execution Phases:
Phase 1: Preparation (1-2 minutes)
Phase 2: Fencing/Isolation (30 seconds - 2 minutes)
Phase 3: Promotion (30 seconds - 2 minutes)
```sql
-- PostgreSQL Failover Procedure

-- PHASE 1: Preparation
-- Check standby health and replication position
SELECT pg_is_in_recovery() AS is_standby,
       pg_last_wal_receive_lsn() AS last_received_lsn,
       pg_last_wal_replay_lsn() AS last_replayed_lsn,
       pg_last_xact_replay_timestamp() AS last_replay_time;

-- Record the position for data loss assessment
-- Note: Save this information before proceeding

-- PHASE 2: Fencing (if primary might still be accessible)
-- Option A: STONITH - power off primary server via IPMI/iLO
-- $ ipmitool -I lanplus -H primary-ipmi -U admin -P password chassis power off

-- Option B: Network isolation - remove primary from network
-- $ ssh network-switch "interface gi0/1; shutdown"

-- Option C: Stop primary PostgreSQL (if accessible)
-- $ ssh primary-db "pg_ctl stop -D /var/lib/postgresql/data -m immediate"

-- PHASE 3: Promote Standby
-- Option A: Using pg_ctl (traditional)
-- $ pg_ctl promote -D /var/lib/postgresql/data

-- Option B: Using SQL (PostgreSQL 12+)
SELECT pg_promote();

-- Option C: Using pg_ctl with wait (ensures completion)
-- $ pg_ctl promote -D /var/lib/postgresql/data -w

-- PHASE 4: Verify Promotion
-- Check that database is no longer in recovery mode
SELECT pg_is_in_recovery();
-- Expected: false

-- Verify database accepts writes
CREATE TABLE failover_test (id int);
DROP TABLE failover_test;

-- Check for any recovery conflicts
SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';


-- MySQL/MariaDB Failover Procedure

-- PHASE 1: Check replica status
SHOW REPLICA STATUS\G
-- Note: Seconds_Behind_Master, Exec_Master_Log_Pos

-- PHASE 2: Fencing (stop old primary if accessible)
-- On old primary:
FLUSH TABLES WITH READ LOCK;
SET GLOBAL read_only = ON;

-- PHASE 3: Stop replica and enable writes
STOP REPLICA;
RESET REPLICA ALL;  -- Clears replica configuration
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;

-- PHASE 4: Verify
SELECT @@read_only, @@super_read_only;
-- Expected: 0, 0

-- Test write capability
CREATE TABLE failover_test (id int);
DROP TABLE failover_test;
```

Phase 4: Network Transition (1-5 minutes)
Phase 5: Application Recovery (2-10 minutes)
Phase 6: Validation (5-15 minutes)
Virtual IPs (VIPs) provide faster failover than DNS because there's no TTL or caching delay. However, VIPs typically work only within a single network segment, limiting their use for geographic DR. DNS provides cross-site flexibility but introduces propagation delays. Many organizations use VIPs for local/metro failover and DNS for remote DR site activation.
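As a rough sketch of the VIP approach on Linux, assuming root SSH access and placeholder address, interface, and hostnames (in practice keepalived, Pacemaker, or Patroni callbacks perform this step):

```bash
#!/bin/bash
# Move a virtual IP from the old primary to the new one (requires root on both hosts).
# The address, prefix, interface, and hostnames are placeholders.
VIP="10.0.0.50/24"
IFACE="eth0"

# Release the VIP on the old primary, if it is still reachable
ssh old-primary "ip addr del $VIP dev $IFACE" || true

# Claim the VIP on the new primary and refresh neighbors' ARP caches
ssh new-primary "ip addr add $VIP dev $IFACE && arping -c 3 -U -I $IFACE ${VIP%/*}"
```

The gratuitous ARP at the end is what lets clients on the same segment learn the new MAC address immediately instead of waiting for their ARP caches to expire.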
Several tools and frameworks can automate database failover, ranging from database-integrated solutions to external orchestration systems.
PostgreSQL Failover Solutions:
Patroni:
A template for high availability PostgreSQL using distributed consensus (etcd, Consul, or ZooKeeper).
repmgr:
Replication manager for PostgreSQL with built-in failover.
pg_auto_failover:
Microsoft's automated failover solution for PostgreSQL.
MySQL Failover Solutions:
MySQL InnoDB Cluster:
Built-in HA solution using Group Replication.
Orchestrator:
Topology manager and failover tool for MySQL.
MHA (Master High Availability):
Mature failover solution for MySQL.
ProxySQL:
Not a failover tool itself, but essential for routing.
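As an illustration of that routing role, the sketch below repoints a ProxySQL writer hostgroup at the newly promoted primary through ProxySQL's admin interface; the hostgroup number, admin credentials, port, and hostnames are assumptions for illustration:

```bash
#!/bin/bash
# Repoint ProxySQL's writer hostgroup after failover (sketch).
# Hostgroup id, admin credentials, and hostnames are placeholders.
NEW_PRIMARY="standby-db.example.com"
WRITER_HG=10

mysql -h 127.0.0.1 -P 6032 -u admin -padmin <<SQL
DELETE FROM mysql_servers WHERE hostgroup_id = $WRITER_HG;
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
  VALUES ($WRITER_HG, '$NEW_PRIMARY', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SQL
```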
```yaml
# Patroni Configuration Example
# Provides automated PostgreSQL failover with consensus

scope: prod-db-cluster
namespace: /postgres/
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1.example.com:8008

# Distributed consensus store (pick one)
etcd:
  hosts:
    - etcd1.example.com:2379
    - etcd2.example.com:2379
    - etcd3.example.com:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB - prevents failover to very lagged replica

    # Synchronous replication settings
    synchronous_mode: true
    synchronous_mode_strict: false

    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 4GB
        wal_level: replica
        hot_standby: 'on'
        max_wal_senders: 10
        max_replication_slots: 10
        synchronous_commit: 'on'

  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1.example.com:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: replicator_password
    superuser:
      username: postgres
      password: postgres_password

  # Callback scripts for failover events
  callbacks:
    on_start: /scripts/on_start.sh
    on_stop: /scripts/on_stop.sh
    on_restart: /scripts/on_restart.sh
    on_role_change: /scripts/on_role_change.sh

# Tags for replica selection during failover
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: true
  nosync: false
```

Cloud database services (AWS RDS, Azure SQL, Google Cloud SQL) provide built-in automatic failover. While this simplifies operations, understand the failover behavior: How is failure detected? What's the typical failover time? How is split-brain prevented? What visibility do you have? Cloud automation is convenient but shouldn't be a black box for DR-critical systems.
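One simple, provider-agnostic way to observe a managed failover from the outside is to watch the endpoint's DNS answer change during a test; the endpoint name below is a placeholder:

```bash
#!/bin/bash
# Poll a managed database endpoint's DNS answer to see when failover repoints it.
# The endpoint name is a placeholder; stop the loop with Ctrl-C.
ENDPOINT="mydb.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com"

while true; do
  echo "$(date -u +%H:%M:%S)  ->  $(dig +short "$ENDPOINT" | tr '\n' ' ')"
  sleep 5
done
```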
A failover procedure that has never been tested is a hypothesis, not a capability. Regular testing is essential to validate that failover works as expected and that the team can execute it under pressure.
Types of Failover Tests:
1. Tabletop Exercise
Discussion-based walkthrough of failover procedures.
Process:
Benefits: Low risk, educates team, identifies documentation gaps
Limitations: Doesn't validate technical functionality
2. Simulation Test
Execute failover in a non-production environment.
Process:
Benefits: Validates technical steps, measures timing
Limitations: Test environment may differ from production
3. Controlled Production Failover
Perform failover with production systems during low-impact period.
Process:
Benefits: Tests real production systems and scale
Limitations: Requires maintenance window, carries risk
4. Chaos Engineering / Game Days
Introduce real failures in production to test response.
Process:
Benefits: Most realistic test, builds confidence
Limitations: Highest risk, requires mature operations
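If the organization runs game days, fault injection can start very simply; the sketch below blocks the database port with iptables for a fixed window, then removes the rule. The port, duration, and choice of target host are assumptions, and this should only run with explicit sign-off and safety controls:

```bash
#!/bin/bash
# Game-day fault injection sketch (requires root on the target host).
# Blocks inbound traffic to the database port for a fixed window, then cleans up.
DB_PORT=5432
DURATION=300    # seconds the fault stays active

echo "Injecting fault: dropping inbound traffic on port $DB_PORT"
iptables -A INPUT -p tcp --dport "$DB_PORT" -j DROP

sleep "$DURATION"

echo "Removing fault"
iptables -D INPUT -p tcp --dport "$DB_PORT" -j DROP
```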
| Test Type | Frequency | Participants | Scope |
|---|---|---|---|
| Tabletop exercise | Quarterly | All DR team members | Full procedure walkthrough |
| Simulation test | Monthly | DBA + App team | Technical validation |
| Controlled production failover | Semi-annually | Full incident team | End-to-end validation |
| Chaos engineering | Quarterly (after maturity) | On-call team | Response validation |
```markdown
# Failover Test Execution Checklist

## Pre-Test
- [ ] Failover runbook reviewed and updated
- [ ] All team members briefed on their roles
- [ ] Stakeholders notified of test window
- [ ] Rollback procedure verified
- [ ] Monitoring dashboards open
- [ ] Communication channel established (Slack, bridge line)
- [ ] Current replication lag recorded: ________
- [ ] Current transaction counts recorded: ________

## During Test
- [ ] Time test started: ________
- [ ] Failure injection/simulation initiated
- [ ] Detection alert received: ________ (elapsed: ____)
- [ ] Failover decision made: ________ (elapsed: ____)
- [ ] Fencing completed: ________ (elapsed: ____)
- [ ] Promotion completed: ________ (elapsed: ____)
- [ ] Network cutover completed: ________ (elapsed: ____)
- [ ] Application connectivity verified: ________ (elapsed: ____)
- [ ] Service confirmed restored: ________ (elapsed: ____)
- [ ] **Total RTO achieved**: ________

## Validation
- [ ] Write operations successful on new primary
- [ ] Read operations successful
- [ ] No data corruption detected
- [ ] Application functionality verified
- [ ] Monitoring updated to reflect new topology
- [ ] Data loss assessment: ________ transactions lost

## Post-Test
- [ ] Test results documented
- [ ] Issues encountered logged
- [ ] Improvement actions identified
- [ ] Runbook updated if needed
- [ ] Stakeholders notified of completion
- [ ] Post-mortem scheduled if significant issues
```

The most realistic failover test happens when you're not expecting it. Consider scheduling 'surprise' drills where only a few senior staff know the timing. This tests the on-call response, not a prepared team. Of course, ensure safety controls are in place and stakeholders are informed of the practice.
After failover, the original primary eventually needs to rejoin the cluster, either as a standby for the new primary or, later, by resuming its role as primary. This process is called failback and requires careful planning.
Failback Options:
Option 1: Former Primary Becomes Standby
The most common approach: reconfigure the failed primary as a replica of the new primary.
Advantages:
Process:
Option 2: Planned Switchover Back to Original Primary
Perform a controlled switchover back to the original (now recovered) server.
Advantages:
Risks:
Failback Considerations:
Data Divergence:
After failover, the old primary and new primary may have divergent data:
Reconciliation may be required before failback.
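A rough pre-failback check, assuming SSH access to the stopped old primary and placeholder hostnames and paths, is to compare the old primary's last recorded WAL position against the new primary's current timeline and position:

```bash
#!/bin/bash
# Estimate divergence before choosing between pg_rewind and a full rebuild.
# Hostnames and the data directory are placeholders.
OLD_PGDATA="/var/lib/postgresql/data"
NEW_PRIMARY="new-primary-db.example.com"

# Last checkpoint/REDO positions recorded by the old (stopped) primary
ssh old-primary "pg_controldata $OLD_PGDATA" | grep -E 'Latest checkpoint location|REDO location'

# Current timeline and WAL position on the new primary
psql -h "$NEW_PRIMARY" -U postgres -tAc "SELECT timeline_id, pg_current_wal_lsn() FROM pg_control_checkpoint();"
```

Transactions committed on the old primary after the point of divergence never replicated; pg_rewind will discard them, so if they matter they must be reconciled by other means before failback.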
pg_rewind (PostgreSQL):
PostgreSQL provides pg_rewind to resynchronize a diverged former primary without full rebuild:
```bash
# On old primary, after it's accessible again
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server="host=new-primary user=postgres"
```
This rewinds the old primary to the point of divergence, then applies changes from the new primary.
Full Rebuild:
If resynchronization is not possible (no pg_rewind, severe divergence), the former primary must be rebuilt from scratch:
This takes longer but guarantees clean state.
```bash
#!/bin/bash
# PostgreSQL Failback Procedure
# Reconfigure former primary as standby

set -e

OLD_PRIMARY="old-primary-db.example.com"
NEW_PRIMARY="new-primary-db.example.com"
PGDATA="/var/lib/postgresql/data"
REPL_USER="replicator"

echo "=== PostgreSQL Failback Procedure ==="
echo "Old Primary: $OLD_PRIMARY -> Will become new standby"
echo "New Primary: $NEW_PRIMARY -> Will remain primary"

# Step 1: Ensure old primary is stopped
echo "Step 1: Stopping old primary..."
ssh $OLD_PRIMARY "pg_ctl stop -D $PGDATA -m fast" || true

# Step 2: Check if pg_rewind is possible (run as the if-condition so set -e does not abort)
echo "Step 2: Checking rewind feasibility..."
if ssh $OLD_PRIMARY "pg_rewind --target-pgdata=$PGDATA --source-server='host=$NEW_PRIMARY user=postgres' --dry-run"; then
    echo "pg_rewind is feasible. Proceeding..."

    # Step 3: Execute pg_rewind
    echo "Step 3: Executing pg_rewind..."
    ssh $OLD_PRIMARY "pg_rewind --target-pgdata=$PGDATA --source-server='host=$NEW_PRIMARY user=postgres'"
else
    echo "pg_rewind not feasible. Full rebuild required."

    # Step 3 Alternative: Full rebuild from backup
    echo "Step 3: Rebuilding from backup..."
    ssh $OLD_PRIMARY "rm -rf $PGDATA/*"
    ssh $OLD_PRIMARY "pg_basebackup -h $NEW_PRIMARY -U $REPL_USER -D $PGDATA -P -R --slot=old_primary_slot"
fi

# Step 4: Configure as standby
echo "Step 4: Configuring standby..."
ssh $OLD_PRIMARY "touch $PGDATA/standby.signal"

# Ensure recovery settings
ssh $OLD_PRIMARY "cat >> $PGDATA/postgresql.auto.conf << EOF
primary_conninfo = 'host=$NEW_PRIMARY user=$REPL_USER application_name=old_primary'
primary_slot_name = 'old_primary_slot'
EOF"

# Step 5: Start standby
echo "Step 5: Starting standby..."
ssh $OLD_PRIMARY "pg_ctl start -D $PGDATA"

# Step 6: Verify replication
echo "Step 6: Verifying replication..."
sleep 10
ssh $NEW_PRIMARY "psql -c \"SELECT client_addr, state, sync_state FROM pg_stat_replication;\""

echo "=== Failback Complete ==="
echo "Old primary is now standby, replicating from new primary."
```

After a stressful failover event, there's often pressure to 'get back to normal' by failing back to the original primary. Resist this urge. The environment is stable on the new primary—stay there. Take time to properly diagnose why the original primary failed, repair it thoroughly, and reintegrate it as a standby. Only consider switching back during a planned maintenance window after the former primary has proven stable as a replica.
Failover is the culmination of disaster recovery preparation—the moment when planning meets reality. Let's consolidate the key takeaways:
What's next:
With failover mechanisms in place, we need infrastructure to fail to. Next, we'll explore DR Sites—the physical and virtual facilities that provide the destination for failover, from hot sites mirroring production to cold sites that can be activated when disaster strikes.
You now understand the full lifecycle of database failover, from detection through execution to failback. Failover is the most critical operation in DR—getting it right restores service and limits damage. Getting it wrong can make a bad situation catastrophic. With proper design, automation, testing, and team preparation, you can execute failover confidently when the moment arrives. Next, we'll explore the DR site infrastructure that enables failover.