Your primary database has failed. Whether from hardware failure, cascading software bugs, network partition, or human error, the single source of truth for your data is suddenly unavailable. Writes are failing. Depending on your architecture, reads may be failing too. Every second of downtime costs money, reputation, and user trust.
This is the moment replica promotion is designed for. Among your read replicas—previously read-only copies—one must be selected, prepared, and promoted to become the new primary. This promoted replica will accept writes, become the new source of truth, and the remaining replicas will reconfigure to follow it.
Replica promotion is perhaps the most critical operation in database administration. Done correctly, it restores service quickly with minimal data loss. Done incorrectly, it can cause data corruption, split-brain scenarios, or extended outages. This page provides the deep understanding needed to design, implement, and operate failover systems with confidence.
By the end of this page, you will understand failover detection mechanisms, the mechanics of replica promotion across different databases, how to prevent data loss during promotion, strategies for automated versus manual failover, and patterns for application-level failover handling.
Before promotion can occur, the system must detect that failover is necessary. This detection must be accurate (avoid false positives that cause unnecessary failovers) and fast (minimize downtime). Balancing these requirements is challenging.
Types of failures to detect:
| Failure Type | Symptoms | Detection Method | Detection Challenge |
|---|---|---|---|
| Hardware crash | Complete unresponsiveness, connection failures | Connection timeout, ping failure | Fast and reliable to detect |
| Process crash | Connection refused, database not running | Connection failure, process monitoring | Quick detection via health checks |
| Disk failure | I/O errors, corrupted responses | Health check queries failing | May partially respond initially |
| Network partition | Some nodes reachable, others not | Asymmetric failure detection | False positives if detector is isolated |
| Performance degradation | Responses slow, queries timeout | Latency thresholds, timeout rates | Distinguishing transient from permanent |
| Replication break | Replicas falling behind, replication errors | Replication monitoring | May not require promotion |
Detection mechanisms:
- Health checks send periodic probes to the primary.
- Consensus-based detection uses multiple monitors that must agree on a failure before acting.
- Replicas as witnesses lets the replicas themselves report on primary health.
```typescript
// Failover detection with consensus

interface HealthCheckResult {
  monitor: string;
  primaryReachable: boolean;
  querySucceeded: boolean;
  latencyMs: number;
  timestamp: Date;
}

interface FailoverDecision {
  shouldFailover: boolean;
  reason: string;
  votes: { monitor: string; vote: boolean }[];
}

class FailoverDetector {
  private monitors: string[];
  private healthCheckIntervalMs: number;
  private failureThreshold: number;  // Consecutive failures before vote
  private quorumRequirement: number; // Minimum monitors that must agree
  private failureCounts: Map<string, number> = new Map();

  constructor(config: {
    monitors: string[];
    healthCheckIntervalMs: number;
    failureThreshold: number;
  }) {
    this.monitors = config.monitors;
    this.healthCheckIntervalMs = config.healthCheckIntervalMs;
    this.failureThreshold = config.failureThreshold;
    this.quorumRequirement = Math.floor(this.monitors.length / 2) + 1;
  }

  async checkHealth(monitorId: string, primary: DatabaseConnection): Promise<HealthCheckResult> {
    const startTime = Date.now();

    try {
      // Test 1: Can we connect?
      await primary.connect();

      // Test 2: Can we run a query?
      await primary.query('SELECT 1');

      // Test 3: Can we write? (use heartbeat table)
      await primary.query(`
        INSERT INTO _healthcheck (monitor_id, timestamp)
        VALUES ($1, NOW())
        ON CONFLICT (monitor_id) DO UPDATE SET timestamp = NOW()
      `, [monitorId]);

      this.failureCounts.set(monitorId, 0);

      return {
        monitor: monitorId,
        primaryReachable: true,
        querySucceeded: true,
        latencyMs: Date.now() - startTime,
        timestamp: new Date(),
      };
    } catch (error) {
      const failures = (this.failureCounts.get(monitorId) ?? 0) + 1;
      this.failureCounts.set(monitorId, failures);

      return {
        monitor: monitorId,
        primaryReachable: false,
        querySucceeded: false,
        latencyMs: Date.now() - startTime,
        timestamp: new Date(),
      };
    }
  }

  collectFailoverVotes(): FailoverDecision {
    const votes: { monitor: string; vote: boolean }[] = [];

    for (const monitor of this.monitors) {
      const failures = this.failureCounts.get(monitor) ?? 0;
      const vote = failures >= this.failureThreshold;
      votes.push({ monitor, vote });
    }

    const yesVotes = votes.filter(v => v.vote).length;
    const shouldFailover = yesVotes >= this.quorumRequirement;

    return {
      shouldFailover,
      reason: shouldFailover
        ? `Quorum reached: ${yesVotes}/${this.monitors.length} monitors detect failure`
        : `Quorum not reached: ${yesVotes}/${this.monitors.length} (need ${this.quorumRequirement})`,
      votes,
    };
  }
}
```

A major risk is 'split-brain': a network partition causes monitors to incorrectly believe the primary is down and promote a replica, while the original primary is still running and accepting writes. Now two nodes believe they are primary, leading to divergent data. Prevention requires: 1) Fencing the old primary (STONITH—Shoot The Other Node In The Head), 2) Quorum-based decisions, 3) VIP/DNS failover that prevents writes to the old primary.
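To make the fencing requirement concrete, here is a minimal sketch of a guard that refuses to promote until the old primary is verifiably fenced. The `fence` and `promote` callbacks are assumptions standing in for your infrastructure's actual mechanisms (firewall rules, a cloud power-off API, a STONITH device):

```typescript
// Sketch: never promote until the old primary is verifiably fenced.
// `fence` and `promote` are assumed callbacks into your infrastructure;
// they are not part of any specific library.

type FenceResult = { fenced: boolean; method: string };

async function fenceThenPromote(
  fence: () => Promise<FenceResult>,
  promote: () => Promise<void>
): Promise<string> {
  const result = await fence();
  if (!result.fenced) {
    // Refusing to promote is safer than risking split-brain:
    // two writable primaries diverge and cannot be trivially merged.
    throw new Error(`Fencing failed (${result.method}); promotion aborted`);
  }
  await promote();
  return `Promoted after fencing via ${result.method}`;
}
```

The key property is ordering: promotion only runs after fencing is confirmed, so at no point can two writable primaries coexist.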
Once failover is decided, the selected replica must be promoted. The promotion process differs by database but follows a general pattern:
PostgreSQL promotion process:
PostgreSQL provides several methods for promoting a standby to primary:
```bash
#!/bin/bash
# PostgreSQL replica promotion

# Method 1: pg_ctl promote (requires shell access)
# This is the traditional method
pg_ctl promote -D /var/lib/postgresql/data

# Method 2: promote trigger file (configured in recovery.conf/postgresql.conf)
# Configured via: promote_trigger_file = '/tmp/promote_trigger'
touch /tmp/promote_trigger

# Method 3: pg_promote() function (PostgreSQL 12+, requires superuser)
# Can be called via SQL
psql -c "SELECT pg_promote();"

# Verification: Check if no longer in recovery
psql -c "SELECT pg_is_in_recovery();"
# Should return 'f' (false) after promotion

# After promotion, update pg_hba.conf to allow replication connections
# and configure remaining replicas to follow new primary

# On other replicas, update primary_conninfo:
# primary_conninfo = 'host=new-primary.internal user=replicator ...'
# Then restart PostgreSQL or signal with pg_ctl reload
```

PostgreSQL uses 'timelines' to distinguish between different histories after promotion. When a standby is promoted, it starts a new timeline. Other standbys must be reconfigured to follow this new timeline. recovery_target_timeline = 'latest' in standby configuration helps automate this.
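Promotion does not complete instantly, so orchestration code typically polls `SELECT pg_is_in_recovery()` until it returns false. A minimal sketch, with the query wrapped in an assumed `queryInRecovery` callback so it is independent of any particular client library:

```typescript
// Sketch: poll until the promoted node reports it has left recovery.
// `queryInRecovery` is an assumed callback that runs
// `SELECT pg_is_in_recovery()` and returns the boolean result.

async function waitForPromotion(
  queryInRecovery: () => Promise<boolean>,
  timeoutMs: number,
  pollIntervalMs: number
): Promise<number> {
  const deadline = Date.now() + timeoutMs;
  let polls = 0;
  while (Date.now() < deadline) {
    polls++;
    if (!(await queryInRecovery())) {
      return polls; // Node has left recovery: it is now a primary
    }
    await new Promise(r => setTimeout(r, pollIntervalMs));
  }
  throw new Error(`Node still in recovery after ${timeoutMs}ms`);
}
```

A timeout here matters: if promotion never completes, the orchestrator should fail loudly rather than route writes to a node that is still read-only.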
When multiple replicas exist, choosing which to promote is critical. The wrong choice can result in unnecessary data loss or a new primary that's undersized for the role.
Selection criteria:

- Replication position: the replica with the most WAL applied loses the least data.
- Synchronous status: a synchronously replicated standby guarantees no committed data loss.
- Health: the candidate must itself be healthy enough to serve as primary.
- Capacity: CPU and memory should be adequate for the primary's write load.
- Location: a replica in the preferred region avoids cross-region latency for writers.
```typescript
// Replica selection logic for promotion

interface ReplicaCandidate {
  id: string;
  host: string;
  walPosition: bigint;    // PostgreSQL LSN or MySQL GTID position
  isSynchronous: boolean; // Was it synchronously replicated?
  lagMs: number;          // Replication lag before failure
  cpuCores: number;
  memoryGb: number;
  region: string;
  lastHealthCheck: Date;
  healthScore: number;    // 0-100
}

interface SelectionResult {
  selectedReplica: ReplicaCandidate;
  reason: string;
  alternativeCandidates: ReplicaCandidate[];
  estimatedDataLoss: string;
}

class PromotionCandidateSelector {
  private preferredRegion: string;
  private minimumHealthScore: number = 80;

  constructor(preferredRegion: string) {
    this.preferredRegion = preferredRegion;
  }

  select(candidates: ReplicaCandidate[], primaryLastPosition: bigint): SelectionResult {
    if (candidates.length === 0) {
      throw new Error('No replica candidates available for promotion');
    }

    // Filter unhealthy candidates
    const healthy = candidates.filter(c => c.healthScore >= this.minimumHealthScore);

    if (healthy.length === 0) {
      console.warn('No healthy candidates; using best available');
      return this.selectBest(candidates, primaryLastPosition);
    }

    return this.selectBest(healthy, primaryLastPosition);
  }

  private selectBest(candidates: ReplicaCandidate[], primaryLastPosition: bigint): SelectionResult {
    // Score each candidate
    const scored = candidates.map(c => ({
      candidate: c,
      score: this.computeScore(c, primaryLastPosition),
    }));

    // Sort by score descending
    scored.sort((a, b) => b.score - a.score);

    const selected = scored[0].candidate;
    const dataLoss = primaryLastPosition - selected.walPosition;

    return {
      selectedReplica: selected,
      reason: this.explainSelection(selected, scored[0].score),
      alternativeCandidates: scored.slice(1).map(s => s.candidate),
      estimatedDataLoss: this.formatDataLoss(dataLoss),
    };
  }

  private computeScore(candidate: ReplicaCandidate, primaryLastPosition: bigint): number {
    let score = 0;

    // Replication position (most important) - up to 50 points
    const positionRatio = Number(candidate.walPosition) / Number(primaryLastPosition);
    score += Math.min(50, positionRatio * 50);

    // Synchronous bonus - 20 points
    if (candidate.isSynchronous) {
      score += 20;
    }

    // Health score - up to 15 points
    score += (candidate.healthScore / 100) * 15;

    // Capacity (normalized) - up to 10 points
    const capacityScore = Math.min(10, (candidate.cpuCores * candidate.memoryGb) / 100);
    score += capacityScore;

    // Region preference - 5 points
    if (candidate.region === this.preferredRegion) {
      score += 5;
    }

    return score;
  }

  private explainSelection(candidate: ReplicaCandidate, score: number): string {
    const reasons: string[] = [];
    if (candidate.isSynchronous) {
      reasons.push('synchronously replicated (no data loss)');
    }
    reasons.push(`replication position ${candidate.walPosition}`);
    reasons.push(`health score ${candidate.healthScore}`);
    if (candidate.region === this.preferredRegion) {
      reasons.push('preferred region');
    }
    return `Selected ${candidate.id}: ${reasons.join(', ')}. Total score: ${score.toFixed(1)}`;
  }

  private formatDataLoss(bytesLoss: bigint): string {
    if (bytesLoss <= 0n) return 'None (replica fully synchronized)';
    if (bytesLoss < 1024n) return `~${bytesLoss} bytes`;
    if (bytesLoss < 1024n * 1024n) return `~${Number(bytesLoss / 1024n)} KB`;
    return `~${Number(bytesLoss / 1024n / 1024n)} MB`;
  }
}
```

With asynchronous replication, some committed transactions on the primary may not have reached any replica before failure. This data is lost. The only prevention is synchronous replication (which impacts write latency) or accepting the business risk of potential loss.
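Estimating data loss in bytes means subtracting WAL positions. PostgreSQL reports LSNs as strings like `16/B374D848`: two hexadecimal halves, the high and low 32 bits of a 64-bit position. A small sketch of the conversion needed before the kind of lag arithmetic used in `estimatedDataLoss`:

```typescript
// Sketch: convert a PostgreSQL LSN string ("high/low" in hex) to a
// 64-bit byte position, so two positions can be subtracted.

function lsnToBytes(lsn: string): bigint {
  const parts = lsn.split('/');
  if (parts.length !== 2) {
    throw new Error(`Malformed LSN: ${lsn}`);
  }
  const high = BigInt(`0x${parts[0]}`);
  const low = BigInt(`0x${parts[1]}`);
  return (high << 32n) + low;
}

// Positive result: the replica is behind by that many WAL bytes.
function lsnLagBytes(primaryLsn: string, replicaLsn: string): bigint {
  return lsnToBytes(primaryLsn) - lsnToBytes(replicaLsn);
}
```

Note that WAL byte lag is an upper bound on useful data lost; the missing bytes include internal records as well as row changes.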
Should failover happen automatically or require human decision? Both approaches have merits, and many organizations use a hybrid model.
Hybrid approaches:

- Automatic detection, manual confirmation: monitors detect the failure and prepare the candidate, but an operator must approve before promotion executes.
- Automatic for known patterns, manual for unknowns: unambiguous failures (for example, a hard crash confirmed by quorum) fail over automatically, while ambiguous signals page an operator instead.
- Automatic with escape hatch: failover proceeds automatically after a countdown unless an operator cancels it in time.
```typescript
// Hybrid failover controller with approval workflow

type FailoverApprovalStatus = 'pending' | 'approved' | 'rejected' | 'timeout';

interface FailoverRequest {
  id: string;
  detectedAt: Date;
  reason: string;
  candidateReplica: string;
  status: FailoverApprovalStatus;
  approvedBy?: string;
  executedAt?: Date;
}

class HybridFailoverController {
  private readonly autoApprovalTimeoutMs: number;
  private readonly requiresApproval: boolean;
  private pendingFailover: FailoverRequest | null = null;

  constructor(config: {
    autoApprovalTimeoutMs: number; // 0 = never auto-approve
    requiresApproval: boolean;
  }) {
    this.autoApprovalTimeoutMs = config.autoApprovalTimeoutMs;
    this.requiresApproval = config.requiresApproval;
  }

  async initiateFailover(reason: string, candidate: string): Promise<FailoverRequest> {
    const request: FailoverRequest = {
      id: crypto.randomUUID(),
      detectedAt: new Date(),
      reason,
      candidateReplica: candidate,
      status: 'pending',
    };

    this.pendingFailover = request;

    // Send alerts
    await this.sendAlerts(request);

    if (!this.requiresApproval) {
      // Immediate automatic failover
      return this.executeFailover(request);
    }

    if (this.autoApprovalTimeoutMs > 0) {
      // Start timeout for auto-approval
      this.startAutoApprovalTimer(request);
    }

    return request;
  }

  async approveFailover(requestId: string, approver: string): Promise<FailoverRequest> {
    if (!this.pendingFailover || this.pendingFailover.id !== requestId) {
      throw new Error('No matching pending failover request');
    }
    this.pendingFailover.status = 'approved';
    this.pendingFailover.approvedBy = approver;
    return this.executeFailover(this.pendingFailover);
  }

  async rejectFailover(requestId: string, rejector: string): Promise<void> {
    if (!this.pendingFailover || this.pendingFailover.id !== requestId) {
      throw new Error('No matching pending failover request');
    }
    this.pendingFailover.status = 'rejected';
    console.log(`Failover rejected by ${rejector}`);
    await this.sendNotification(`Failover REJECTED by ${rejector}. Manual intervention required.`);
    this.pendingFailover = null;
  }

  private startAutoApprovalTimer(request: FailoverRequest): void {
    setTimeout(async () => {
      if (this.pendingFailover?.id === request.id && request.status === 'pending') {
        console.warn(`Auto-approving failover after ${this.autoApprovalTimeoutMs}ms timeout`);
        request.status = 'timeout';
        await this.executeFailover(request);
      }
    }, this.autoApprovalTimeoutMs);
  }

  private async executeFailover(request: FailoverRequest): Promise<FailoverRequest> {
    console.log(`Executing failover to ${request.candidateReplica}`);
    try {
      // 1. Fence old primary (prevent split-brain)
      await this.fenceOldPrimary();

      // 2. Promote candidate
      await this.promoteReplica(request.candidateReplica);

      // 3. Reconfigure other replicas
      await this.reconfigureReplicas(request.candidateReplica);

      // 4. Update routing
      await this.updateRouting(request.candidateReplica);

      request.executedAt = new Date();
      request.status = request.status === 'pending' ? 'approved' : request.status;

      await this.sendNotification(`Failover COMPLETE to ${request.candidateReplica}`);
    } catch (error) {
      await this.sendNotification(`Failover FAILED: ${error}`);
      throw error;
    }

    this.pendingFailover = null;
    return request;
  }

  private async fenceOldPrimary(): Promise<void> {
    // Implementations vary: revoke network access, kill VM, etc.
  }

  private async promoteReplica(replicaId: string): Promise<void> {
    // Database-specific promotion
  }

  private async reconfigureReplicas(newPrimaryId: string): Promise<void> {
    // Point remaining replicas to new primary
  }

  private async updateRouting(newPrimaryId: string): Promise<void> {
    // Update DNS, VIP, proxy config, etc.
  }

  private async sendAlerts(request: FailoverRequest): Promise<void> {
    // PagerDuty, Slack, email, etc.
  }

  private async sendNotification(message: string): Promise<void> {
    // Notification channels
  }
}
```

Database failover doesn't happen in isolation—applications must handle the transition gracefully. Connection pools hold stale connections. In-flight queries need retrying. Routing must update to the new primary.
```typescript
// Application-level database failover handling

interface DatabaseConfig {
  primaryEndpoint: string;
  replicaEndpoints: string[];
  dnsRefreshMs: number;
}

class FailoverAwareConnection {
  private primaryPool: ConnectionPool;
  private replicaPools: ConnectionPool[];
  private config: DatabaseConfig;
  private lastKnownPrimary: string;

  constructor(config: DatabaseConfig) {
    this.config = config;
    this.startDnsRefresh();
  }

  // Execute with automatic retry on failover
  async executeWithRetry<T>(
    query: string,
    params: unknown[],
    options: {
      isWrite: boolean;
      maxRetries: number;
      baseDelayMs: number;
    }
  ): Promise<T> {
    let lastError: Error | null = null;

    for (let attempt = 0; attempt < options.maxRetries; attempt++) {
      try {
        const pool = options.isWrite ? this.primaryPool : this.selectReplicaPool();
        return await pool.query(query, params);
      } catch (error) {
        lastError = error as Error;

        if (this.isRetryableError(error)) {
          const delay = options.baseDelayMs * Math.pow(2, attempt);
          console.warn(`Query failed (attempt ${attempt + 1}), retrying in ${delay}ms`);
          await this.sleep(delay);

          // Force connection pool refresh
          await this.refreshConnectionPools();
          continue;
        }

        throw error; // Non-retryable error
      }
    }

    throw new Error(`Query failed after ${options.maxRetries} retries: ${lastError?.message}`);
  }

  private isRetryableError(error: unknown): boolean {
    const message = (error as Error).message?.toLowerCase() ?? '';

    // Connection-related errors are retryable during failover
    const retryablePatterns = [
      'connection refused',
      'connection reset',
      'connection terminated',
      'cannot connect',
      'server closed',
      'read only', // Trying to write to replica
      'not primary',
      'timeout',
    ];

    return retryablePatterns.some(p => message.includes(p));
  }

  private startDnsRefresh(): void {
    setInterval(async () => {
      try {
        const endpoints = await this.resolveDnsEndpoints();
        if (endpoints.primary !== this.lastKnownPrimary) {
          console.log(`Primary endpoint changed: ${this.lastKnownPrimary} -> ${endpoints.primary}`);
          await this.refreshConnectionPools();
          this.lastKnownPrimary = endpoints.primary;
        }
      } catch (error) {
        console.error('DNS refresh failed', error);
      }
    }, this.config.dnsRefreshMs);
  }

  private async resolveDnsEndpoints(): Promise<{ primary: string; replicas: string[] }> {
    // Resolve DNS to get current endpoints
    // Implementation depends on DNS setup
    return { primary: '', replicas: [] };
  }

  private async refreshConnectionPools(): Promise<void> {
    // Close existing connections and create new pools
    await this.primaryPool?.end();
    for (const pool of this.replicaPools) {
      await pool?.end();
    }

    // Recreate with current endpoints
    this.primaryPool = new ConnectionPool(await this.getCurrentPrimaryEndpoint());
    this.replicaPools = await this.createReplicaPools();
  }

  private async getCurrentPrimaryEndpoint(): Promise<string> {
    // Prefer the freshly resolved endpoint; fall back to configuration
    const endpoints = await this.resolveDnsEndpoints();
    return endpoints.primary || this.config.primaryEndpoint;
  }

  private async createReplicaPools(): Promise<ConnectionPool[]> {
    return this.config.replicaEndpoints.map(e => new ConnectionPool(e));
  }

  private selectReplicaPool(): ConnectionPool {
    // Round-robin or other selection
    return this.replicaPools[Math.floor(Math.random() * this.replicaPools.length)];
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
```

Low DNS TTLs (30-60 seconds) enable faster failover but increase DNS query load. High TTLs (minutes to hours) reduce DNS load but delay failover visibility. Many organizations use low TTLs for database endpoints specifically. Also ensure application DNS caching respects TTLs—some runtimes cache indefinitely by default.
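The caching caveat can be addressed in application code with a lookup cache that never serves an entry past its advertised TTL. A sketch with the resolver and clock as injected assumptions, so it is independent of any particular DNS library:

```typescript
// Sketch: a lookup cache that never serves an entry past its TTL.
// `resolve` and `now` are assumed injection points (for example,
// a wrapper around your DNS library and Date.now).

interface CachedEntry { value: string; expiresAt: number }

class TtlDnsCache {
  private entries = new Map<string, CachedEntry>();

  constructor(
    private resolve: (host: string) => Promise<{ value: string; ttlSeconds: number }>,
    private now: () => number = Date.now
  ) {}

  async lookup(host: string): Promise<string> {
    const cached = this.entries.get(host);
    if (cached && cached.expiresAt > this.now()) {
      return cached.value; // Still fresh: no DNS query issued
    }
    const fresh = await this.resolve(host);
    this.entries.set(host, {
      value: fresh.value,
      expiresAt: this.now() + fresh.ttlSeconds * 1000,
    });
    return fresh.value;
  }
}
```

Injecting the clock also makes TTL behavior testable without waiting in real time.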
Promotion is not the end—it's the beginning of recovery. Several operations must follow to restore full redundancy and prepare for future failures.
Rebuilding the old primary:
The failed primary should not be simply restarted and rejoined. Its data may be divergent (transactions committed after the promotion candidate's last position). Common approaches:
- Complete rebuild: provision a new replica using pg_basebackup (PostgreSQL) or a fresh snapshot from the new primary. Slowest, but the safest approach.
- Rewind (PostgreSQL): pg_rewind can rewind a divergent data directory back to the point where the timelines forked, then replay WAL from the new primary. Faster than a rebuild, but it requires wal_log_hints = on or data checksums to have been enabled, and a consistently shut-down cluster.
- GTID-based rejoin (MySQL): with GTIDs, the old primary can potentially resync by discarding local-only transactions and replaying from the new primary. Requires careful verification.
If the old primary accepted writes after the promotion (split-brain scenario), blindly rejoining could introduce duplicate or conflicting data. Always verify data consistency and use rebuild/rewind techniques rather than simple restart.
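The choice between rebuild and rewind can be encoded as a simple guard. This is a sketch under assumed inputs: whether split-brain writes need auditing before they would be discarded, and whether pg_rewind's preconditions (a consistent data directory plus wal_log_hints or data checksums) hold:

```typescript
// Sketch: choose a rejoin strategy for a failed primary.
// The state fields are assumptions gathered by operators or tooling,
// not values any database reports directly.

interface OldPrimaryState {
  mustAuditDivergentWrites: boolean; // split-brain writes need review first
  cleanlyShutDown: boolean;          // pg_rewind wants a consistent data directory
  walLogHintsOrChecksums: boolean;   // pg_rewind precondition
}

function chooseRejoinStrategy(state: OldPrimaryState): 'rewind' | 'rebuild' {
  // pg_rewind discards diverged transactions; if they must be reviewed,
  // dump them first and rebuild instead.
  if (state.mustAuditDivergentWrites) return 'rebuild';
  if (state.cleanlyShutDown && state.walLogHintsOrChecksums) return 'rewind';
  return 'rebuild';
}
```

Either path ends the same way: the old primary rejoins strictly as a replica of the new primary, never as a second writer.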
Replica promotion is the critical operation that converts a read replica into a primary, restoring write capability after primary failure. Done correctly, it minimizes downtime and data loss.
Module complete:
This completes the Read Replicas module. You've learned how to offload read traffic, handle replication lag, balance loads across replicas, maintain consistency, and orchestrate failover through replica promotion. These patterns are foundational for building scalable, highly-available SQL database architectures.
You now have comprehensive knowledge of read replica architectures—from basic traffic offloading through sophisticated failover automation. This knowledge enables you to design, implement, and operate database systems that scale to handle substantial read loads while maintaining high availability through effective redundancy and failover strategies.