At 2:47 AM, monitoring alerts flood your phone. The primary database server is unresponsive. Applications are returning errors. Customers are complaining on social media. The business is losing thousands of dollars per minute.
This is failover time—the moment when months of careful DR planning either pays off or proves inadequate. The next few minutes will determine whether your organization experiences a brief blip or a prolonged catastrophe.
Failover is the most critical operation in disaster recovery. It's the transition from a failed primary system to a standby, performed under pressure with incomplete information. Getting it wrong can make a bad situation catastrophic. Getting it right restores service and limits damage.
This page prepares you for that moment. You'll learn how to design failover systems, when to trigger failover, how to execute it safely, and how to recover afterward.
By the end of this page, you will understand manual and automatic failover approaches, know how to design failover decision criteria, execute failover procedures safely, avoid common failover mistakes, and plan for failback after the primary is restored.
Failover is the process of transitioning database operations from a failed (or failing) primary server to a standby replica. It's distinct from "switchover," which is a planned, controlled transition.
Failover vs. Switchover:
| Aspect | Failover | Switchover |
|---|---|---|
| Trigger | Unplanned failure | Planned maintenance |
| Time pressure | High—business impact ongoing | Low—scheduled window |
| Data state | May be inconsistent | Cleanly synchronized |
| Rollback | Complex or impossible | Usually straightforward |
| Testing opportunity | None—this is real | Pre-verified procedures |
Failover Goals:
Every failover aims to achieve:
Failover Components:
Successful failover requires coordinated action across multiple components:
Database Layer:
Network Layer:
Application Layer:
Monitoring Layer:
The worst failover scenario is 'split-brain': both the original primary and the promoted standby accept writes simultaneously. This creates irreconcilable data divergence. Failover design must include mechanisms to ensure only one server is accepting writes at any time—typically through fencing, VIP management, or application-level coordination.
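A minimal monitoring sketch of that rule, assuming two nodes, password-less psql access, and placeholder hostnames, simply counts how many servers currently accept writes:

```bash
#!/bin/bash
# Split-brain detector (illustrative sketch; hostnames and credentials are placeholders).
# Counts how many nodes report themselves as writable primaries.
NODES="db1.example.com db2.example.com"
writable=0

for node in $NODES; do
  # pg_is_in_recovery() returns 'f' on a primary, 't' on a standby
  state=$(psql -h "$node" -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null)
  if [ "$state" = "f" ]; then
    writable=$((writable + 1))
    echo "WRITABLE: $node"
  fi
done

if [ "$writable" -gt 1 ]; then
  echo "ALERT: possible split-brain - $writable nodes are accepting writes" >&2
  exit 1
fi
```

Running this from an independent monitoring host on a short interval gives early warning that fencing failed, although it detects split-brain rather than preventing it.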
The decision between manual and automatic failover involves tradeoffs between speed and safety. Neither is universally better—the right choice depends on your environment and risk tolerance.
Automatic Failover:
Failover triggered and executed by software without human intervention.
Advantages:
Disadvantages:
Best for:
Manual Failover:
Failover requires explicit human decision and often manual execution.
Advantages:
Disadvantages:
Best for:
Hybrid Approach:
Many organizations use a hybrid: automated detection and preparation, but a human decision to execute. This captures the benefits of both:
Consider progressive automation: start with manual failover, automate detection and preparation, then automate execution for lower-tier systems, and finally automate critical systems only after extensive testing and confidence-building. This builds organizational experience while managing risk.
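A minimal sketch of that hybrid pattern, where detection and evidence-gathering are automated but a human gives the final go, might look like the following; the hostnames, alerting webhook, and data directory are assumptions:

```bash
#!/bin/bash
# Hybrid failover sketch: automated detection and preparation, manual execution.
# Hostnames, the webhook URL, and PGDATA are placeholders for illustration.
PRIMARY="primary-db.example.com"
STANDBY="standby-db.example.com"
PGDATA="/var/lib/postgresql/data"

if ! pg_isready -h "$PRIMARY" -p 5432 -t 5 >/dev/null; then
  # Automated part: raise the alarm and stage the evidence, but do not promote
  curl -s -X POST -d '{"text":"Primary DB unreachable - failover decision needed"}' \
       "https://alerts.example.com/webhook"   # hypothetical alerting endpoint
  echo "Standby replay position: $(psql -h "$STANDBY" -U postgres -tAc 'SELECT pg_last_wal_replay_lsn();')"

  # Manual part: an operator confirms before the standby is promoted
  read -r -p "Promote $STANDBY to primary? (yes/no) " answer
  if [ "$answer" = "yes" ]; then
    ssh "$STANDBY" "pg_ctl promote -D $PGDATA -w"
  fi
fi
```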
Before failover can occur, you must detect that the primary has failed and decide that failover is the appropriate response. Both steps are harder than they appear.
Detection Challenges:
1. Distinguishing Failure from Transient Issues
Not every problem indicates a permanent failure:
Failing over prematurely in response to a recoverable issue is costly.
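One common guard is to require several consecutive failed probes before declaring the primary down, so a brief spike or restart does not trigger failover. A minimal sketch, with assumed threshold values and a placeholder hostname:

```bash
#!/bin/bash
# Declare the primary failed only after N consecutive failed probes.
# THRESHOLD and INTERVAL are illustrative values, not recommendations.
PRIMARY="primary-db.example.com"
THRESHOLD=3      # consecutive failures required
INTERVAL=10      # seconds between probes
failures=0

while true; do
  if pg_isready -h "$PRIMARY" -p 5432 -t 5 >/dev/null; then
    failures=0                      # any success resets the count
  else
    failures=$((failures + 1))
    echo "probe failed ($failures/$THRESHOLD)"
    if [ "$failures" -ge "$THRESHOLD" ]; then
      echo "Primary considered FAILED after $failures consecutive probes"
      break    # hand off to the failover decision process
    fi
  fi
  sleep "$INTERVAL"
done
```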
2. Detecting the Right Failures
Databases can fail in many ways:
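A probe can at least separate some of these modes. The sketch below leans on pg_isready's documented exit codes plus a follow-up query, with a placeholder hostname:

```bash
#!/bin/bash
# Distinguish failure modes instead of treating "down" as a single state.
# pg_isready exit codes: 0 = accepting connections, 1 = rejecting (e.g., starting up),
# 2 = no response, 3 = no attempt made (bad parameters).
PRIMARY="primary-db.example.com"

pg_isready -h "$PRIMARY" -p 5432 -t 5
case $? in
  0) # The server answers, but it may still be degraded or read-only
     psql -h "$PRIMARY" -U postgres -tAc "SELECT pg_is_in_recovery();"
     ;;
  1) echo "Server up but rejecting connections (possibly restarting)" ;;
  2) echo "No response - host, network, or instance failure" ;;
  3) echo "Check could not even be attempted - verify monitoring configuration" ;;
esac
```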
3. Confirming Primary is Truly Failed
The standby's view of the primary may be incorrect:
Detection Strategies:
Multi-Point Detection:
Never rely on a single check. Require multiple independent confirmations:
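For instance, a minimal sketch with three independent probes (ICMP, connection check, and an actual query) against a placeholder hostname:

```bash
#!/bin/bash
# Multi-point detection sketch: all three independent checks must fail before
# the primary is treated as down. Host and port are placeholders.
PRIMARY="primary-db.example.com"
failed=0

# Check 1: host reachability (ICMP)
ping -c 2 -W 2 "$PRIMARY" >/dev/null 2>&1            || failed=$((failed + 1))
# Check 2: PostgreSQL accepting connections
pg_isready -h "$PRIMARY" -p 5432 -t 5 >/dev/null     || failed=$((failed + 1))
# Check 3: an actual query round-trip
psql -h "$PRIMARY" -U postgres -tAc "SELECT 1;" >/dev/null 2>&1 || failed=$((failed + 1))

if [ "$failed" -eq 3 ]; then
  echo "All independent checks failed - escalate for a failover decision"
else
  echo "$failed/3 checks failed - investigate before failing over"
fi
```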
Quorum Systems:
In clustered environments, use quorum voting:
External Witness:
Use a third site to arbitrate:
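A sketch of a witness check, run from a third site that has independent network paths to both database servers (hostnames are placeholders):

```bash
#!/bin/bash
# External-witness sketch: probe both sides from a neutral third site and
# render a verdict. Hostnames are placeholders.
PRIMARY="primary-db.example.com"
STANDBY="standby-db.example.com"

primary_ok=false
standby_ok=false
pg_isready -h "$PRIMARY" -p 5432 -t 5 >/dev/null && primary_ok=true
pg_isready -h "$STANDBY" -p 5432 -t 5 >/dev/null && standby_ok=true

if ! $primary_ok && $standby_ok; then
  echo "Verdict: primary unreachable from the third site - failover is reasonable"
elif $primary_ok && $standby_ok; then
  echo "Verdict: both reachable - suspect a partition between the DB sites; do not promote"
else
  echo "Verdict: inconclusive - escalate to a human"
fi
```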
| Scenario | Primary Status | Standby Status | Decision |
|---|---|---|---|
| Clean primary failure | Crashed, unrecoverable | Healthy, current | Failover immediately |
| Primary overloaded | Slow but responding | Healthy, current | Investigate first |
| Network partition | Unknown from standby | Healthy, lagging | Use witness/quorum |
| Standby lagging significantly | Failed | Behind by hours | Assess data loss first |
| Both degraded | Unhealthy | Unhealthy | Emergency intervention |
| Storage failure on primary | Hanging | Healthy, current | Failover after verification |
A network partition can make the primary seem failed when it's actually healthy and serving clients on its side of the partition. If you promote the standby, you now have two primaries accepting writes—a split-brain disaster. Always verify primary failure through multiple independent paths, and use fencing to ensure the old primary cannot continue accepting writes.
Once the decision to failover is made, execution must be precise and coordinated. The sequence of operations matters—doing things out of order can cause additional problems.
Failover Execution Phases:
Phase 1: Preparation (1-2 minutes)
Phase 2: Fencing/Isolation (30 seconds - 2 minutes)
Phase 3: Promotion (30 seconds - 2 minutes)
```sql
-- PostgreSQL Failover Procedure

-- PHASE 1: Preparation
-- Check standby health and replication position
SELECT pg_is_in_recovery() AS is_standby,
       pg_last_wal_receive_lsn() AS last_received_lsn,
       pg_last_wal_replay_lsn() AS last_replayed_lsn,
       pg_last_xact_replay_timestamp() AS last_replay_time;

-- Record the position for data loss assessment
-- Note: Save this information before proceeding

-- PHASE 2: Fencing (if primary might still be accessible)
-- Option A: STONITH - power off primary server via IPMI/iLO
-- $ ipmitool -I lanplus -H primary-ipmi -U admin -P password chassis power off

-- Option B: Network isolation - remove primary from network
-- $ ssh network-switch "interface gi0/1; shutdown"

-- Option C: Stop primary PostgreSQL (if accessible)
-- $ ssh primary-db "pg_ctl stop -D /var/lib/postgresql/data -m immediate"

-- PHASE 3: Promote Standby
-- Option A: Using pg_ctl (traditional)
-- $ pg_ctl promote -D /var/lib/postgresql/data

-- Option B: Using SQL (PostgreSQL 12+)
SELECT pg_promote();

-- Option C: Using pg_ctl with wait (ensures completion)
-- $ pg_ctl promote -D /var/lib/postgresql/data -w

-- PHASE 4: Verify Promotion
-- Check that database is no longer in recovery mode
SELECT pg_is_in_recovery();
-- Expected: false

-- Verify database accepts writes
CREATE TABLE failover_test (id int);
DROP TABLE failover_test;

-- Check for any recovery conflicts
SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';


-- MySQL/MariaDB Failover Procedure

-- PHASE 1: Check replica status
SHOW REPLICA STATUS\G
-- Note: Seconds_Behind_Master, Exec_Master_Log_Pos

-- PHASE 2: Fencing (stop old primary if accessible)
-- On old primary:
FLUSH TABLES WITH READ LOCK;
SET GLOBAL read_only = ON;

-- PHASE 3: Stop replica and enable writes
STOP REPLICA;
RESET REPLICA ALL;  -- Clears replica configuration
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;

-- PHASE 4: Verify
SELECT @@read_only, @@super_read_only;
-- Expected: 0, 0

-- Test write capability
CREATE TABLE failover_test (id int);
DROP TABLE failover_test;
```

Phase 4: Network Transition (1-5 minutes)
Phase 5: Application Recovery (2-10 minutes)
Phase 6: Validation (5-15 minutes)
Virtual IPs (VIPs) provide faster failover than DNS because there's no TTL or caching delay. However, VIPs typically work only within a single network segment, limiting their use for geographic DR. DNS provides cross-site flexibility but introduces propagation delays. Many organizations use VIPs for local/metro failover and DNS for remote DR site activation.
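As a rough sketch of the VIP approach on Linux, assuming root SSH access and placeholder address, interface, and hostnames (in practice keepalived, Pacemaker, or Patroni callbacks perform this step):

```bash
#!/bin/bash
# Move a virtual IP from the old primary to the new one (requires root on both hosts).
# The address, prefix, interface, and hostnames are placeholders.
VIP="10.0.0.50/24"
IFACE="eth0"

# Release the VIP on the old primary, if it is still reachable
ssh old-primary "ip addr del $VIP dev $IFACE" || true

# Claim the VIP on the new primary and refresh neighbors' ARP caches
ssh new-primary "ip addr add $VIP dev $IFACE && arping -c 3 -U -I $IFACE ${VIP%/*}"
```

The gratuitous ARP at the end is what lets clients on the same segment learn the new MAC address immediately instead of waiting for their ARP caches to expire.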
Several tools and frameworks can automate database failover, ranging from database-integrated solutions to external orchestration systems.
PostgreSQL Failover Solutions:
Patroni:
A template for high availability PostgreSQL using distributed consensus (etcd, Consul, or ZooKeeper).
repmgr:
Replication manager for PostgreSQL with built-in failover.
pg_auto_failover:
Microsoft's automated failover solution for PostgreSQL.
MySQL Failover Solutions:
MySQL InnoDB Cluster:
Built-in HA solution using Group Replication.
Orchestrator:
Topology manager and failover tool for MySQL.
MHA (Master High Availability):
Mature failover solution for MySQL.
ProxySQL:
Not a failover tool itself, but essential for routing.
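As an illustration of that routing role, the sketch below repoints a ProxySQL writer hostgroup at the newly promoted primary through ProxySQL's admin interface; the hostgroup number, admin credentials, port, and hostnames are assumptions for illustration:

```bash
#!/bin/bash
# Repoint ProxySQL's writer hostgroup after failover (sketch).
# Hostgroup id, admin credentials, and hostnames are placeholders.
NEW_PRIMARY="standby-db.example.com"
WRITER_HG=10

mysql -h 127.0.0.1 -P 6032 -u admin -padmin <<SQL
DELETE FROM mysql_servers WHERE hostgroup_id = $WRITER_HG;
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
  VALUES ($WRITER_HG, '$NEW_PRIMARY', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SQL
```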
```yaml
# Patroni Configuration Example
# Provides automated PostgreSQL failover with consensus

scope: prod-db-cluster
namespace: /postgres/
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1.example.com:8008

# Distributed consensus store (pick one)
etcd:
  hosts:
    - etcd1.example.com:2379
    - etcd2.example.com:2379
    - etcd3.example.com:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB - prevents failover to very lagged replica

    # Synchronous replication settings
    synchronous_mode: true
    synchronous_mode_strict: false

    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 4GB
        wal_level: replica
        hot_standby: 'on'
        max_wal_senders: 10
        max_replication_slots: 10
        synchronous_commit: 'on'

  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1.example.com:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: replicator_password
    superuser:
      username: postgres
      password: postgres_password

  # Callback scripts for failover events
  callbacks:
    on_start: /scripts/on_start.sh
    on_stop: /scripts/on_stop.sh
    on_restart: /scripts/on_restart.sh
    on_role_change: /scripts/on_role_change.sh

# Tags for replica selection during failover
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: true
  nosync: false
```

Cloud database services (AWS RDS, Azure SQL, Google Cloud SQL) provide built-in automatic failover. While this simplifies operations, understand the failover behavior: How is failure detected? What's the typical failover time? How is split-brain prevented? What visibility do you have? Cloud automation is convenient but shouldn't be a black box for DR-critical systems.
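One simple, provider-agnostic way to observe a managed failover from the outside is to watch the endpoint's DNS answer change during a test; the endpoint name below is a placeholder:

```bash
#!/bin/bash
# Poll a managed database endpoint's DNS answer to see when failover repoints it.
# The endpoint name is a placeholder; stop the loop with Ctrl-C.
ENDPOINT="mydb.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com"

while true; do
  echo "$(date -u +%H:%M:%S)  ->  $(dig +short "$ENDPOINT" | tr '\n' ' ')"
  sleep 5
done
```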
A failover procedure that has never been tested is a hypothesis, not a capability. Regular testing is essential to validate that failover works as expected and that the team can execute it under pressure.
Types of Failover Tests:
1. Tabletop Exercise
Discussion-based walkthrough of failover procedures.
Process:
Benefits: Low risk, educates team, identifies documentation gaps
Limitations: Doesn't validate technical functionality
2. Simulation Test
Execute failover in a non-production environment.
Process:
Benefits: Validates technical steps, measures timing
Limitations: Test environment may differ from production
3. Controlled Production Failover
Perform failover with production systems during low-impact period.
Process:
Benefits: Tests real production systems and scale
Limitations: Requires maintenance window, carries risk
4. Chaos Engineering / Game Days
Introduce real failures in production to test response.
Process:
Benefits: Most realistic test, builds confidence
Limitations: Highest risk, requires mature operations
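If the organization runs game days, fault injection can start very simply; the sketch below blocks the database port with iptables for a fixed window, then removes the rule. The port, duration, and choice of target host are assumptions, and this should only run with explicit sign-off and safety controls:

```bash
#!/bin/bash
# Game-day fault injection sketch (requires root on the target host).
# Blocks inbound traffic to the database port for a fixed window, then cleans up.
DB_PORT=5432
DURATION=300    # seconds the fault stays active

echo "Injecting fault: dropping inbound traffic on port $DB_PORT"
iptables -A INPUT -p tcp --dport "$DB_PORT" -j DROP

sleep "$DURATION"

echo "Removing fault"
iptables -D INPUT -p tcp --dport "$DB_PORT" -j DROP
```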
| Test Type | Frequency | Participants | Scope |
|---|---|---|---|
| Tabletop exercise | Quarterly | All DR team members | Full procedure walkthrough |
| Simulation test | Monthly | DBA + App team | Technical validation |
| Controlled production failover | Semi-annually | Full incident team | End-to-end validation |
| Chaos engineering | Quarterly (after maturity) | On-call team | Response validation |
```markdown
# Failover Test Execution Checklist

## Pre-Test
- [ ] Failover runbook reviewed and updated
- [ ] All team members briefed on their roles
- [ ] Stakeholders notified of test window
- [ ] Rollback procedure verified
- [ ] Monitoring dashboards open
- [ ] Communication channel established (Slack, bridge line)
- [ ] Current replication lag recorded: ________
- [ ] Current transaction counts recorded: ________

## During Test
- [ ] Time test started: ________
- [ ] Failure injection/simulation initiated
- [ ] Detection alert received: ________ (elapsed: ____)
- [ ] Failover decision made: ________ (elapsed: ____)
- [ ] Fencing completed: ________ (elapsed: ____)
- [ ] Promotion completed: ________ (elapsed: ____)
- [ ] Network cutover completed: ________ (elapsed: ____)
- [ ] Application connectivity verified: ________ (elapsed: ____)
- [ ] Service confirmed restored: ________ (elapsed: ____)
- [ ] **Total RTO achieved**: ________

## Validation
- [ ] Write operations successful on new primary
- [ ] Read operations successful
- [ ] No data corruption detected
- [ ] Application functionality verified
- [ ] Monitoring updated to reflect new topology
- [ ] Data loss assessment: ________ transactions lost

## Post-Test
- [ ] Test results documented
- [ ] Issues encountered logged
- [ ] Improvement actions identified
- [ ] Runbook updated if needed
- [ ] Stakeholders notified of completion
- [ ] Post-mortem scheduled if significant issues
```

The most realistic failover test happens when you're not expecting it. Consider scheduling 'surprise' drills where only a few senior staff know the timing. This tests the on-call response, not a prepared team. Of course, ensure safety controls are in place and stakeholders are informed of the practice.
After failover, the original primary eventually needs to rejoin the cluster, either as a standby for the new primary or, later, by resuming its role as primary. This process is called failback and requires careful planning.
Failback Options:
Option 1: Former Primary Becomes Standby
The most common approach: reconfigure the failed primary as a replica of the new primary.
Advantages:
Process:
Option 2: Planned Switchover Back to Original Primary
Perform a controlled switchover back to the original (now recovered) server.
Advantages:
Risks:
Failback Considerations:
Data Divergence:
After failover, the old primary and new primary may have divergent data:
Reconciliation may be required before failback.
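A rough pre-failback check, assuming SSH access to the stopped old primary and placeholder hostnames and paths, is to compare the old primary's last recorded WAL position against the new primary's current timeline and position:

```bash
#!/bin/bash
# Estimate divergence before choosing between pg_rewind and a full rebuild.
# Hostnames and the data directory are placeholders.
OLD_PGDATA="/var/lib/postgresql/data"
NEW_PRIMARY="new-primary-db.example.com"

# Last checkpoint/REDO positions recorded by the old (stopped) primary
ssh old-primary "pg_controldata $OLD_PGDATA" | grep -E 'Latest checkpoint location|REDO location'

# Current timeline and WAL position on the new primary
psql -h "$NEW_PRIMARY" -U postgres -tAc "SELECT timeline_id, pg_current_wal_lsn() FROM pg_control_checkpoint();"
```

Transactions committed on the old primary after the point of divergence never replicated; pg_rewind will discard them, so if they matter they must be reconciled by other means before failback.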
pg_rewind (PostgreSQL):
PostgreSQL provides pg_rewind to resynchronize a diverged former primary without full rebuild:
```bash
# On old primary, after it's accessible again
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server="host=new-primary user=postgres"
```
This rewinds the old primary to the point of divergence, then applies changes from the new primary.
Full Rebuild:
If resynchronization is not possible (no pg_rewind, severe divergence), the former primary must be rebuilt from scratch:
This takes longer but guarantees clean state.
```bash
#!/bin/bash
# PostgreSQL Failback Procedure
# Reconfigure former primary as standby

set -e

OLD_PRIMARY="old-primary-db.example.com"
NEW_PRIMARY="new-primary-db.example.com"
PGDATA="/var/lib/postgresql/data"
REPL_USER="replicator"

echo "=== PostgreSQL Failback Procedure ==="
echo "Old Primary: $OLD_PRIMARY -> Will become new standby"
echo "New Primary: $NEW_PRIMARY -> Will remain primary"

# Step 1: Ensure old primary is stopped
echo "Step 1: Stopping old primary..."
ssh $OLD_PRIMARY "pg_ctl stop -D $PGDATA -m fast" || true

# Step 2: Check if pg_rewind is possible (run as the if-condition so set -e does not abort)
echo "Step 2: Checking rewind feasibility..."
if ssh $OLD_PRIMARY "pg_rewind --target-pgdata=$PGDATA --source-server='host=$NEW_PRIMARY user=postgres' --dry-run"; then
    echo "pg_rewind is feasible. Proceeding..."

    # Step 3: Execute pg_rewind
    echo "Step 3: Executing pg_rewind..."
    ssh $OLD_PRIMARY "pg_rewind --target-pgdata=$PGDATA --source-server='host=$NEW_PRIMARY user=postgres'"
else
    echo "pg_rewind not feasible. Full rebuild required."

    # Step 3 Alternative: Full rebuild from backup
    echo "Step 3: Rebuilding from backup..."
    ssh $OLD_PRIMARY "rm -rf $PGDATA/*"
    ssh $OLD_PRIMARY "pg_basebackup -h $NEW_PRIMARY -U $REPL_USER -D $PGDATA -P -R --slot=old_primary_slot"
fi

# Step 4: Configure as standby
echo "Step 4: Configuring standby..."
ssh $OLD_PRIMARY "touch $PGDATA/standby.signal"

# Ensure recovery settings
ssh $OLD_PRIMARY "cat >> $PGDATA/postgresql.auto.conf << EOF
primary_conninfo = 'host=$NEW_PRIMARY user=$REPL_USER application_name=old_primary'
primary_slot_name = 'old_primary_slot'
EOF"

# Step 5: Start standby
echo "Step 5: Starting standby..."
ssh $OLD_PRIMARY "pg_ctl start -D $PGDATA"

# Step 6: Verify replication
echo "Step 6: Verifying replication..."
sleep 10
ssh $NEW_PRIMARY "psql -c \"SELECT client_addr, state, sync_state FROM pg_stat_replication;\""

echo "=== Failback Complete ==="
echo "Old primary is now standby, replicating from new primary."
```

After a stressful failover event, there's often pressure to 'get back to normal' by failing back to the original primary. Resist this urge. The environment is stable on the new primary—stay there. Take time to properly diagnose why the original primary failed, repair it thoroughly, and reintegrate it as a standby. Only consider switching back during a planned maintenance window after the former primary has proven stable as a replica.
Failover is the culmination of disaster recovery preparation—the moment when planning meets reality. Let's consolidate the key takeaways:
What's next:
With failover mechanisms in place, we need infrastructure to fail to. Next, we'll explore DR Sites—the physical and virtual facilities that provide the destination for failover, from hot sites mirroring production to cold sites that can be activated when disaster strikes.
You now understand the full lifecycle of database failover, from detection through execution to failback. Failover is the most critical operation in DR—getting it right restores service and limits damage. Getting it wrong can make a bad situation catastrophic. With proper design, automation, testing, and team preparation, you can execute failover confidently when the moment arrives. Next, we'll explore the DR site infrastructure that enables failover.