It's 4:00 AM. After two hours of incident response, your team has successfully failed over from the crashed primary to the standby. The service is back online. Customers are happy (or at least, no longer complaining). The on-call engineer breathes a sigh of relief and pours a well-deserved cup of coffee.
But the story isn't over. The original primary is now repaired—the failed disk has been replaced, the crashed process restarted, the network issue resolved. A natural question emerges: Should we switch back to the original primary?
This question seems simple, but it hides deep complexity. Failback—the process of returning to the original configuration—is often more dangerous than the original failover. The original primary's data is stale. Replication needs to be re-established. Another traffic disruption is required. And the original cause of failure might recur.
This page provides the comprehensive knowledge to navigate failback decisions wisely and execute them safely.
By the end of this page, you will understand: whether failback is actually necessary, the prerequisites for safe failback, step-by-step failback procedures, common failback patterns and anti-patterns, and how to minimize disruption during failback.
Failback is the process of restoring the original primary-standby configuration after a failover event: the repaired original primary resumes the primary role, and the node that took over during the incident returns to standby.
Failback vs. Failover:
While conceptually similar to failover, failback has distinct characteristics:
| Aspect | Failover | Failback |
|---|---|---|
| Trigger | Failure detected | Original system restored |
| Urgency | High (service is down) | Low (service is working) |
| Data state | Standby is synchronized | Original may be stale |
| Risk tolerance | Accept some risk for recovery | Prioritize safety over speed |
| Time pressure | Every second costs money | No rush—already operational |
| Planning time | Minimal (reactive) | Ample (proactive) |
The Fundamental Question: Do You Need Failback?
Before planning failback execution, challenge the assumption that failback is necessary. Modern architectures often don't require returning to the original primary:
Arguments Against Failback:
Symmetry: If your architecture is truly symmetric (both nodes identical, same hardware, same capacity), the current configuration is just as good as the original. Why introduce risk?
Additional Downtime: Failback requires another traffic transition, potentially causing another disruption to users.
Recurrence Risk: The original primary failed for a reason. Is that reason truly resolved? Premature failback may trigger another failure.
Engineering Cost: Failback requires planning, execution, and monitoring—engineering time that could be spent on other priorities.
Arguments For Failback:
Capacity Differences: The original primary may have more resources (CPU, memory, storage) than the standby.
Licensing Constraints: Some software licenses may require running on specific hardware.
Data Locality: The original primary may be in a preferred geographic location (closer to users, in a specific compliance zone).
Operational Clarity: Having a consistent 'primary' and 'standby' designation simplifies operations and documentation.
In truly symmetric architectures, adopt a role-fluidity mindset: whichever node is currently primary IS the primary. Don't fail back automatically. Only fail back when there's a compelling reason (capacity, licensing, geography) that the current configuration doesn't satisfy.
Before executing failback, a series of conditions must be verified. Proceeding without these prerequisites risks data loss, extended downtime, or repeating the original failure.
Prerequisite 1: Root Cause Resolution
The original failure must be understood and resolved. If the primary failed due to a disk problem, is the disk replaced and tested? If it was a software bug, is the bug fixed? If the cause is unknown, failback is premature—you may just fail again.
Questions to answer: What exactly failed, and why? Has the fix been applied, tested, and verified? Could the same failure recur under production load?
Prerequisite 2: Data Synchronization
The original primary's data is stale—it stopped receiving writes when it failed. Before it can become primary again, it must be synchronized with the current state:
Option A: Replay from Current Primary
If the original primary was only down briefly and its data isn't corrupted, it may be able to receive replication from the current primary and catch up.
Option B: Full Resync (Base Backup)
If the original primary was down for a long time, or its data is suspect, a full resynchronization may be needed. This involves creating a fresh replica from the current primary.
Synchronization Verification:
```sql
-- PostgreSQL: verify replication is caught up
SELECT
  pg_last_wal_receive_lsn(),
  pg_last_wal_replay_lsn(),
  pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;
```

```sql
-- MySQL: verify the replica is caught up
-- (On MySQL 8.0.22+, use SHOW REPLICA STATUS and Seconds_Behind_Source)
SHOW SLAVE STATUS\G
-- Check Seconds_Behind_Master = 0
```
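When scripting these checks, the `pg_last_wal_*_lsn()` functions return LSNs as strings like `16/B374D848` (two hex words). A small client-side sketch for turning them into a byte lag; the helper names here are my own, not part of any driver API:

```typescript
// Convert a PostgreSQL LSN string ("hi/lo" in hex) to an absolute byte offset.
function lsnToBytes(lsn: string): bigint {
  const [hi, lo] = lsn.split('/');
  return (BigInt(`0x${hi}`) << 32n) + BigInt(`0x${lo}`);
}

// Replication lag in bytes: how far replay trails behind receive.
function lagBytes(receiveLsn: string, replayLsn: string): bigint {
  return lsnToBytes(receiveLsn) - lsnToBytes(replayLsn);
}
```

A lag of `0n` before proceeding is the condition the checklist below insists on.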
Prerequisite 3: Testing the Original Primary
Before directing production traffic to the original primary, verify it's actually working:
Do not skip this testing. A 'repaired' primary that fails under production load is worse than not failing back at all.
Even after satisfying all prerequisites, consider a mandatory waiting period (soak time) before failback. Running the repaired primary as standby for 24-48 hours validates that the fix is durable. If it fails again during this period, you've saved yourself another failback-then-immediate-failover cycle.
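The soak-time gate can be expressed as a simple predicate. This is a sketch under assumptions (24-hour default soak, an incident list already filtered to the repaired node; all names are illustrative):

```typescript
interface NodeIncident {
  at: Date; // when the incident occurred on the repaired node
}

// Failback is allowed only after the repaired node has run cleanly
// as a standby for the full soak period.
function soakComplete(
  repairedAt: Date,
  now: Date,
  incidentsOnNode: NodeIncident[],
  soakHours = 24,
): boolean {
  const soakMs = soakHours * 3600 * 1000;
  if (now.getTime() - repairedAt.getTime() < soakMs) return false; // still soaking
  // Any incident since the repair means the fix isn't proven durable.
  return !incidentsOnNode.some(i => i.at.getTime() > repairedAt.getTime());
}
```

An incident during the soak window doesn't just block failback; in practice it should also reset the clock once the new issue is resolved.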
Several patterns exist for executing failback, each with different tradeoffs between safety, speed, and complexity.
Pattern 1: Immediate Failback (Not Recommended)
As soon as the original primary is restored, immediately switch back.
Advantages: Fastest return to original configuration
Disadvantages: the original primary's data may still be stale, the fix is unverified under production load, and users suffer a second unplanned disruption
Verdict: Avoid this pattern except in extraordinary circumstances (e.g., license violations from running on standby).
Pattern 2: Synchronized Failback (Recommended)
Re-establish the original primary as a standby replica, let it synchronize completely, and monitor it through a stability period. Then execute a planned, controlled role swap.
Execution Steps:
1. Reintegrate the original primary as a replica of the current primary.
2. Wait for replication to fully catch up (zero lag).
3. Monitor through a 24-48 hour stability period.
4. During a maintenance window, swap roles: demote the current primary, promote the original.
5. Verify the new configuration and continue monitoring.
Advantages: Maximum safety, verified fix, minimal risk
Disadvantages: Takes 24-48+ hours, requires patience
# Synchronized Failback Procedure

## Phase 1: Preparation (Day 0)

### 1.1 Root Cause Verification
- [ ] Root cause documented in incident ticket: _____________
- [ ] Fix applied and verified: _____________
- [ ] Post-fix testing completed: _____________

### 1.2 Reintegration as Replica
```bash
# On original primary (now becoming replica)
# Stop any primary processes that shouldn't run
systemctl stop application-primary-mode

# Configure replication from current primary
pg_basebackup -h current-primary -D /var/lib/postgresql/data -P -U replicator

# Or if data is intact and gap is small:
pg_ctl start  # Will automatically connect to current primary for streaming
```

### 1.3 Synchronization Verification
```sql
-- Execute on original primary (now replica)
SELECT
  pg_last_wal_receive_lsn(),
  pg_last_wal_replay_lsn(),
  now() - pg_last_xact_replay_timestamp() AS replication_delay;

-- Must show zero or near-zero delay before proceeding
```

## Phase 2: Stability Period (Day 1-2)

### 2.1 Monitoring
- [ ] Replica health check passing continuously
- [ ] Replication lag staying at zero
- [ ] No errors in logs
- [ ] Resource utilization normal

### 2.2 Criteria for Proceeding
- [ ] 24+ hours with no issues
- [ ] No new alerts on this node
- [ ] Replication never exceeded 1s lag

## Phase 3: Failback Execution (Day 2+)

### 3.1 Pre-Failback Checks
- [ ] Current time is within maintenance window: _____________
- [ ] On-call team aware and available
- [ ] Rollback procedure documented and ready
- [ ] Stakeholders notified

### 3.2 Execution
```bash
# Step 1: Block new writes to current primary
kubectl exec current-primary -- psql -c "ALTER SYSTEM SET default_transaction_read_only = on;"
kubectl exec current-primary -- psql -c "SELECT pg_reload_conf();"

# Step 2: Wait for replication to fully catch up
# (Should be instant if already synchronized)
sleep 5

# Step 3: Verify original primary is 100% synchronized
# Query replication status on current primary
kubectl exec current-primary -- psql -c "SELECT * FROM pg_stat_replication;"

# Step 4: Stop current primary
kubectl exec current-primary -- pg_ctl stop -m fast

# Step 5: Promote original primary
kubectl exec original-primary -- pg_ctl promote

# Step 6: Update routing
kubectl patch service db-primary ...

# Step 7: Verify new configuration
```

### 3.3 Post-Failback Verification
- [ ] Original primary accepting writes
- [ ] Application health checks passing
- [ ] No errors in application logs
- [ ] Current (now standby) replicating from new primary

## Phase 4: Normalization
- [ ] Monitor for 24 hours
- [ ] Close maintenance window
- [ ] Update documentation if needed
- [ ] Conduct brief post-failback review

Notice Step 1 in the execution: blocking new writes before the swap. This ensures traffic drains gracefully rather than being abruptly cut. Applications complete in-flight transactions. Connection pools drain naturally. This transforms failback from a 'hard cutover' to a 'graceful transition.'
Even a well-planned failback introduces some disruption. Here are techniques to minimize impact:
Technique 1: Read-Only Transition Period
Before the full failback, put the current primary in read-only mode. This guarantees the original primary misses no writes during the swap, lets reads continue uninterrupted, and turns the cutover into a graceful transition rather than a hard cut.
Applications that handle read-only mode gracefully will buffer writes or show appropriate UI messages.
Technique 2: Traffic Draining
Gradually reduce traffic to the current primary before cutting over: stop admitting new connections, let existing connections complete naturally, then close any stragglers.
This prevents the 'thundering herd' of connection resets.
```typescript
class GracefulFailback {
  async drainBeforeFailback(currentPrimary: Node, newPrimary: Node): Promise<void> {
    console.log('Starting graceful drain before failback');

    // Step 1: Mark current primary as draining
    await currentPrimary.setDraining(true);

    // Step 2: Update load balancer to stop new connections
    await loadBalancer.removeBackend(currentPrimary);

    // Step 3: Wait for connection pool to drain
    const drainTimeout = 60000; // 60 seconds max
    const startTime = Date.now();
    while (Date.now() - startTime < drainTimeout) {
      const activeConnections = await currentPrimary.getActiveConnections();
      console.log(`Active connections: ${activeConnections}`);
      if (activeConnections === 0) {
        console.log('All connections drained');
        break;
      }
      await sleep(1000);
    }

    // Step 4: Force-close any remaining connections
    const remaining = await currentPrimary.getActiveConnections();
    if (remaining > 0) {
      console.log(`Force-closing ${remaining} lingering connections`);
      await currentPrimary.forceCloseConnections();
    }

    // Step 5: Final replication sync check
    await this.ensureFullySync(newPrimary);

    // Step 6: Execute the actual failback
    await this.executeSwap(currentPrimary, newPrimary);

    // Step 7: Add new primary to load balancer
    await loadBalancer.addBackend(newPrimary);

    console.log('Graceful failback complete');
  }

  private async ensureFullySync(newPrimary: Node): Promise<void> {
    const maxWait = 30000;
    const start = Date.now();
    while (Date.now() - start < maxWait) {
      const lagBytes = await newPrimary.getReplicationLag();
      if (lagBytes === 0) {
        console.log('New primary fully synchronized');
        return;
      }
      console.log(`Waiting for sync, lag: ${lagBytes} bytes`);
      await sleep(500);
    }
    throw new Error('New primary failed to synchronize within timeout');
  }
}
```

Technique 3: Blue-Green Failback
For application-layer failback, use blue-green deployment principles:
This gives you instant rollback: if the green deployment has issues, flip back to blue.
Technique 4: Canary Failback
For large-scale systems, failback gradually:
If issues appear at any stage, roll back to the previous stage.
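The stage machine behind a canary failback can be sketched in a few lines. The specific percentages below are an illustrative assumption, not a prescription:

```typescript
// Canary failback stages: percent of traffic on the restored primary.
const STAGES = [1, 5, 25, 50, 100];

// Advance one stage when healthy; drop back one stage on any issue.
function nextStage(currentPct: number, healthy: boolean): number {
  const i = STAGES.indexOf(currentPct);
  if (!healthy) {
    return i <= 0 ? 0 : STAGES[i - 1]; // roll back to the previous stage (or fully off)
  }
  return i < 0 ? STAGES[0] : STAGES[Math.min(i + 1, STAGES.length - 1)];
}
```

Keeping the rollback step symmetric with the advance step (one stage at a time, down to zero) avoids the indeterminate states described later in the 'failback cascade' warning.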
Technique 5: Scheduled Maintenance Windows
Don't fail back during peak hours. Schedule for low-traffic periods (nights, weekends, off-peak regional hours), announced maintenance windows, and times when the full on-call team is available.
The ultimate disruption minimization is eliminating failback entirely. If your standby is identical to your primary (same capacity, same geographic region, same configuration), promote it permanently. The 'old primary' becomes the new standby. Reconfigure replication direction and you're done. No failback disruption at all.
Failback can fail in ways that are worse than the original incident. Understanding common failures helps you prevent them.
Case Study: GitHub's 2018 Failback Incident
In October 2018, GitHub experienced a significant outage. After failing over to their secondary datacenter, they began failback. However, the replication of certain data types hadn't fully propagated. The failback resulted in inconsistencies between different data stores.
The resolution required restoring from backups, reconciling the unreplicated writes between datacenters, and running in a degraded state for more than 24 hours.
Lessons: verify that every data store (not just the primary database) is fully synchronized before failback, and treat failback as a planned operation with its own go/no-go criteria rather than a formality.
The most dangerous failure is the 'failback cascade': Failback fails → Emergency failover back to previous config → But that fails too (it was already demoted) → System in undefined state → Extended outage while operators sort out the mess. Prevent this by having clear rollback procedures and never leaving nodes in indeterminate states.
Just as failover can be automatic or manual, so can failback. The decision factors are different than for failover.
The Case Against Automatic Failback:
Unlike failover (where speed is critical because service is down), failback occurs when service is already working. There's no urgency. This shifts the balance heavily toward safety:
| Factor | Favors Automatic | Favors Manual |
|---|---|---|
| Root cause reliability | Well-understood, self-healing | Complex, requires investigation |
| Failure frequency | Frequent transient failures | Rare serious failures |
| System criticality | Lower criticality, tolerant users | High criticality, sensitive data |
| Operator availability | No 24/7 coverage | Reliable on-call rotation |
| Regulatory requirements | None | Require human approval |
| Recovery verification | Automated tests sufficient | Human judgment needed |
When Automatic Failback Is Appropriate:
Scenario 1: Known Transient Failures
If your system experiences predictable transient failures (e.g., brief network blips, garbage collection pauses that look like failures), automatic failback after a stability period can reduce operator burden.
Configuration:
```yaml
failback:
  mode: automatic
  stability_period: 15m       # Wait 15 min after recovery
  min_replication_lag: 0      # Must be fully synchronized
  max_failbacks_per_hour: 2   # Prevent flapping
  notification: always        # Alert operators even when automatic
```
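The flapping guard implied by `max_failbacks_per_hour` is a sliding-window rate limiter. A minimal sketch (class and method names are illustrative, not from any particular failover tool):

```typescript
// Sliding-window limiter: allow at most N failbacks per hour.
class FailbackLimiter {
  private history: number[] = []; // epoch-ms timestamps of recent failbacks

  constructor(private readonly maxPerHour: number) {}

  // Returns true (and records the attempt) if a failback is allowed now.
  tryFailback(nowMs: number): boolean {
    const oneHourAgo = nowMs - 3_600_000;
    this.history = this.history.filter(t => t > oneHourAgo);
    if (this.history.length >= this.maxPerHour) return false;
    this.history.push(nowMs);
    return true;
  }
}
```

When the limit trips, the right response is to stop automation and page a human: repeated failbacks within an hour almost always mean the root cause is not actually fixed.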
Scenario 2: Stateless Services
Load-balanced stateless services often automatically 'fail back' as repaired instances rejoin the pool. This is safe because instances are interchangeable, no instance holds unique state, and a misbehaving instance can simply be ejected from the pool again.
The Recommended Default: Manual Failback
For stateful systems (databases, message queues, caches with persistence), manual failback is almost always preferred:
The small delay of human involvement is vastly outweighed by the safety benefits.
A middle-ground approach: automatically detect when the original primary is recovered and synchronized, then alert operators that failback is possible. The human reviews the situation and decides whether to proceed. This captures the benefit of monitoring automation while preserving human judgment.
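That middle ground reduces to a readiness predicate that triggers an alert instead of an action. A sketch under assumptions (a 15-minute default stability window; all names are illustrative):

```typescript
interface StandbyStatus {
  recovered: boolean;          // root cause fixed, node back online as a replica
  replicationLagBytes: number; // 0 means fully synchronized
  stableForMs: number;         // how long it has run without issues
}

// Automation detects readiness and alerts; a human decides whether to fail back.
function failbackAlertNeeded(
  status: StandbyStatus,
  minStableMs = 15 * 60_000,
): boolean {
  return (
    status.recovered &&
    status.replicationLagBytes === 0 &&
    status.stableForMs >= minStableMs
  );
}
```

The key design choice is that the function's output feeds a notification channel, never a promotion command.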
Multi-region architectures add complexity to failback decisions. The considerations extend beyond technical synchronization to include user experience, latency, and cost.
The Geographic Dimension:
After failing from US-East (primary) to US-West (standby), users in the eastern US now experience cross-country latency for every request. This is a strong argument for failback—but it must be balanced against risk.
Cross-Region Failback Challenges: replication over WAN links is slower and more fragile, full resynchronization of large datasets can take days, DNS and GeoDNS changes propagate gradually, and cross-region data transfer adds cost.
Multi-Region Failback Strategy:
Phase 1: Regional Stability Verification (Hours-Days)
Don't rush failback after a regional outage. Wait for the provider's official all-clear on the affected region, a sustained period of stable infrastructure metrics, and confirmation that dependent services in that region have also recovered.
Phase 2: Gradual Geographic Traffic Migration
Use GeoDNS or anycast to gradually shift traffic:
Day 1: 10% of original region's users → original region
Day 2: 25% of original region's users → original region
Day 3: 50% → 100%
This limits impact if the original region has residual issues.
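The ramp above can be encoded as a weight schedule for GeoDNS. A sketch assuming the final jump to 100% lands on day 4 (the original schedule leaves that ambiguous):

```typescript
// Percent of the original region's users routed back to the original region.
function originalRegionShare(day: number): number {
  const ramp: Record<number, number> = { 1: 10, 2: 25, 3: 50 };
  if (day >= 4) return 100;
  return ramp[day] ?? 0;
}
```

In practice each day's advance should be gated on health checks from the previous stage, not just the calendar.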
Phase 3: Replication Direction Swap
Once traffic is fully migrated, swap replication direction: demote the interim primary's database to a replica, promote the original region's database, and verify that changes now flow from the original region outward.
The 'Pilot Light' Return:
For cost optimization, consider returning the original region to a 'pilot light' state initially: keep just enough capacity running to receive replication and pass health checks, with the rest of the fleet scaled down or off.
Scale to full capacity only when failback is imminent.
In true active-active multi-region architectures, both regions serve traffic simultaneously. 'Failover' means one region absorbs the full load. 'Failback' means restoring the load balance. Since both regions are always active and synchronized, failback is just traffic distribution adjustment—much simpler than traditional failback.
Failback is often treated as an afterthought—the cleanup phase after the exciting incident response. But poorly executed failback has caused more extended outages than many original failures. Treat failback with the seriousness it deserves. The key principles: question whether failback is needed at all; never fail back before the root cause is resolved and data is fully synchronized; prefer slow, verified transitions over fast, risky ones; and always have a rehearsed rollback path.
Module Complete:
With this page, you've completed the comprehensive exploration of Failover Strategies. You now understand how failures are detected, how failover decisions are made and executed, and how (and whether) to fail back safely.
These skills are foundational to building and operating highly available systems. Apply them with the patience and diligence they require, and your systems will serve users reliably through inevitable failures.
Congratulations! You've mastered the complete lifecycle of failover: from detecting failures and deciding to act, through the failover execution itself, to the eventual return to normal operations. These techniques—applied thoughtfully—enable the high availability that modern systems demand.