At 2:47 AM on a Saturday night, a hard drive in your production database server fails. The RAID controller catches the failure and begins rebuilding—but six minutes later, a second drive in the same array fails (a correlated failure, often caused by drives from the same manufacturing batch reaching end-of-life simultaneously). The database goes offline. Your application returns errors to every user. Revenue stops. On-call engineers scramble.
This scenario—or variations involving power failures, network partitions, kernel panics, or human error—occurs every day across the industry. The question isn't whether your database will experience failure, but when and how your system will respond.
High availability (HA) is the discipline of designing systems that continue operating despite component failures. In the context of databases, this means ensuring data remains accessible even when individual database servers fail. Replication is the foundational mechanism that makes database high availability possible.
By the end of this page, you will understand why single-server databases are fundamentally unreliable, how replication enables high availability, the mathematics of availability (nines), different failover mechanisms and their trade-offs, and how to design database architectures that survive real-world failure scenarios.
Every component in a computing system will eventually fail. Understanding failure modes is the first step toward designing for resilience.
Hardware failure rates:
Commercial hardware experiences predictable failure rates, typically measured as Annual Failure Rate (AFR):
| Component | Annual Failure Rate (AFR) | Mean Time Between Failures (MTBF) |
|---|---|---|
| Hard Disk Drive (HDD) | 2-4% | 25-50 years (per drive) |
| Solid State Drive (SSD) | 0.5-2% | 50-200 years (per drive) |
| Server (complete) | 5-10% | 10-20 years (per server) |
| Power Supply Unit (PSU) | 1-3% | 33-100 years (per PSU) |
| Memory (DIMM module) | 0.1-0.5% | 200-1000 years (per module) |
| Network Switch | 2-5% | 20-50 years (per switch) |
These numbers may seem comfortable for individual components, but the math changes dramatically at scale:
The fleet effect:
With 100 servers, each with a 5% annual failure rate, you can expect roughly 5 server failures per year; the probability that at least one server fails during the year is about 99.4% (that is, 1 - 0.95^100).
With 1,000 servers, expect roughly 50 failures per year, or about one failure every week.
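To make the fleet effect concrete, here is a minimal sketch of that arithmetic, assuming every server fails independently at the stated AFR (an optimistic assumption, as the note below on correlated failures explains):

```python
# Fleet-effect arithmetic: assumes every server fails independently at the same AFR.
def fleet_failure_stats(num_servers: int, annual_failure_rate: float) -> tuple[float, float]:
    expected_failures = num_servers * annual_failure_rate
    # Probability that at least one server in the fleet fails during the year.
    p_at_least_one = 1 - (1 - annual_failure_rate) ** num_servers
    return expected_failures, p_at_least_one

for fleet_size in (1, 100, 1_000):
    expected, p_any = fleet_failure_stats(fleet_size, annual_failure_rate=0.05)
    print(f"{fleet_size:>5} servers: ~{expected:.1f} failures/year, "
          f"P(at least one failure) = {p_any:.1%}")
```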
At scale, failure is not an exception—it's a constant reality. Amazon, Google, and other hyperscalers report that hardware failures occur every few minutes across their fleets.
Beyond hardware:
Hardware is only one failure mode. Production systems also contend with software bugs and bad releases, kernel panics, network partitions, power outages, and human error.
Individual failure rates underestimate real-world risk. Failures often correlate: a power outage affects multiple servers; a bad software release breaks every instance; drives from the same batch fail within days of each other. Design must account for common-mode failures, not just independent random failures.
Availability is precisely defined as the proportion of time a system is operational and accessible. It's commonly expressed as a percentage or as "nines."
Availability formula:
Availability = Uptime / (Uptime + Downtime)
Or equivalently:
Availability = MTBF / (MTBF + MTTR)
Where MTBF is the Mean Time Between Failures (the average time the system runs before failing) and MTTR is the Mean Time To Repair (the average time from failure to restored service).
The nines of availability:
| Availability | Common Name | Downtime Per Year | Downtime Per Month | Downtime Per Week |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | Three nines | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.3 seconds | 6.05 seconds |
| 99.9999% | Six nines | 31.5 seconds | 2.63 seconds | 0.6 seconds |
What different businesses require: internal tools and batch pipelines can often tolerate two nines; customer-facing SaaS products typically target three nines; e-commerce and financial applications commonly aim for four nines; payment networks, telecoms, and emergency systems push for five nines or more.
The cost curve:
Each additional nine is exponentially harder (and more expensive) to achieve. Going from 99% to 99.9% might require adding read replicas and automated failover. Going from 99.99% to 99.999% might require multi-region active-active deployment, custom tooling, and 24/7 dedicated SRE teams.
You can improve availability by either increasing MTBF (more reliable components, redundancy) or decreasing MTTR (faster detection, automated recovery). Replication addresses both: redundant replicas increase effective MTBF, and automated failover decreases MTTR from hours to seconds.
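As a rough illustration of those two levers, the sketch below applies the availability formula to hypothetical MTBF and MTTR figures (the specific numbers are assumptions, not measurements) to show how much faster recovery alone buys:

```python
HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(avail: float) -> float:
    return (1 - avail) * HOURS_PER_YEAR

# Hypothetical server: one failure per year (MTBF ~8,760 h), 9 hours of manual recovery.
manual = availability(mtbf_hours=8_760, mttr_hours=9)
# Same failure rate, but automated failover restores service in about one minute.
automated = availability(mtbf_hours=8_760, mttr_hours=1 / 60)

for label, a in (("manual recovery", manual), ("automated failover", automated)):
    print(f"{label:>18}: {a:.5%} available, "
          f"~{annual_downtime_hours(a) * 60:.1f} min downtime/year")
```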
A single-server database has an inherent availability ceiling: when that server fails, the database is unavailable until repair. Replication fundamentally changes this equation by ensuring data exists on multiple independent servers.
The single-server problem:
Consider a database server with 99.9% availability on its own: roughly one failure per year, with around 8.8 hours needed to detect the problem, repair or replace the hardware, and restore service.
This server will experience approximately 8.76 hours of downtime per year—unacceptable for most production applications.
Replication as redundancy:
With one primary and one synchronous replica, each at 99.9% availability and failing independently, the database is down only when both servers are down at once: 0.001 × 0.001 = 0.000001 of the time, a theoretical availability of 99.9999% (about 31 seconds of downtime per year).
In practice, the improvement isn't quite this dramatic due to correlated failures and failover transition time, but replication typically adds 1-2 nines of availability.
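Here is a minimal sketch of that redundancy math, assuming every copy fails independently with the same availability; as the paragraph above notes, correlated failures make this an optimistic upper bound:

```python
def redundant_availability(single_node_availability: float, replicas: int) -> float:
    """The system is down only if every independent copy is down simultaneously."""
    p_all_down = (1 - single_node_availability) ** replicas
    return 1 - p_all_down

SECONDS_PER_YEAR = 365 * 24 * 3600
for copies in (1, 2, 3):
    a = redundant_availability(0.999, copies)
    downtime_seconds = (1 - a) * SECONDS_PER_YEAR
    print(f"{copies} independent copies: {a:.6%} available, "
          f"~{downtime_seconds:,.0f} s downtime/year")
```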
Key requirements for HA through replication: at least one up-to-date copy of the data on independent hardware (ideally in a separate failure domain), a mechanism to detect primary failure, a failover process that promotes a replica, and a way to redirect clients to the new primary.
Synchronous vs. asynchronous replication for HA:
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Data durability | No data loss on failover | Potential data loss (transactions not yet replicated) |
| Write latency | Higher (waits for replica ack) | Lower (immediate commit) |
| Availability during network issues | May block writes if replica unreachable | Writes continue; lag increases |
| Failover complexity | Any replica can become primary | Must select replica with most current data |
Most production systems use asynchronous replication for performance, accepting small potential data loss during rare failover events. Critical systems (banking, healthcare) may use synchronous replication despite the latency cost.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. With synchronous replication, RPO = 0. With async, RPO = replication lag at failure time. Recovery Time Objective (RTO): Maximum acceptable time to restore service. Automated failover achieves RTO of 30-60 seconds; manual failover may take 5-30 minutes.
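As an illustration of how replication lag turns into RPO, the sketch below uses hypothetical numbers for write rate and lag (both are assumptions, not measurements from any particular system):

```python
def estimate_async_rpo(replication_lag_seconds: float, writes_per_second: float):
    """Rough worst-case data loss if the primary fails right now under async replication."""
    lost_transactions = replication_lag_seconds * writes_per_second
    return replication_lag_seconds, lost_transactions

# Hypothetical workload: 500 writes/s with 2 seconds of replication lag at failure time.
rpo_seconds, lost = estimate_async_rpo(replication_lag_seconds=2.0, writes_per_second=500)
print(f"Async: RPO ~ {rpo_seconds:.1f} s, up to ~{lost:.0f} transactions lost on failover")
print("Sync:  RPO = 0 s (every committed transaction already exists on a replica)")
```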
When the primary database fails, the system must detect the failure and transition to a replica. This process—failover—is the critical path in high availability architecture.
Failure detection:
Before failover can occur, the system must determine that the primary is truly unavailable. This is deceptively difficult: a slow or overloaded server looks the same as a dead one, and a network partition can make a healthy primary appear failed to some observers but not others.
Common detection mechanisms: periodic heartbeats or health checks with a timeout, quorum-based agreement among multiple monitors (so a single partitioned observer cannot trigger failover), and leader leases held in a coordination service such as etcd or ZooKeeper.
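A minimal sketch of the quorum idea, assuming several independent monitors that each report whether they can reach the primary (the names and data structures here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class MonitorReport:
    monitor_id: str
    primary_reachable: bool

def primary_considered_failed(reports: list[MonitorReport]) -> bool:
    """Declare failure only if a strict majority of monitors cannot reach the primary.

    Requiring a quorum avoids failing over just because one monitor is
    partitioned away from a perfectly healthy primary.
    """
    unreachable = sum(1 for r in reports if not r.primary_reachable)
    return unreachable > len(reports) // 2

reports = [
    MonitorReport("monitor-a", primary_reachable=False),
    MonitorReport("monitor-b", primary_reachable=False),
    MonitorReport("monitor-c", primary_reachable=True),
]
print("Trigger failover:", primary_considered_failed(reports))  # True: 2 of 3 agree
```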
Types of failover: manual (an operator investigates and promotes a replica by hand), automatic (tooling detects the failure and promotes a replica without human intervention), and semi-automatic (tooling prepares the failover but waits for a human to approve it).
Split-brain: The nightmare scenario:
Split-brain occurs when both the old primary and a promoted replica believe they are the primary, accepting writes independently. This creates divergent data that may be impossible to reconcile.
Preventing split-brain: fence the old primary so it cannot accept writes (for example by cutting its power or network access, often called STONITH), require a quorum before any promotion, and use expiring leader leases so an isolated primary demotes itself.
Failover steps (typical automatic failover):
1. Detect missed heartbeats from the primary.
2. Confirm the failure with a quorum of observers.
3. Fence the old primary so it can no longer accept writes.
4. Select the replica with the most recent replicated data (see the sketch below).
5. Promote that replica to primary.
6. Repoint the remaining replicas at the new primary.
7. Update routing (DNS, proxy, or service discovery) so clients reconnect.
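A minimal sketch of step 4, choosing the promotion candidate by replication position; the cap on acceptable lag loosely mirrors the maximum_lag_on_failover setting in the Patroni configuration shown later, and the data structures here are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Replica:
    name: str
    replicated_bytes: int   # How far this replica has replayed the primary's log
    healthy: bool

def choose_promotion_candidate(
    replicas: list[Replica],
    primary_written_bytes: int,
    max_lag_bytes: int = 1_048_576,
) -> Optional[Replica]:
    """Pick the healthy replica with the most data, rejecting any that lag too far behind."""
    candidates = [
        r for r in replicas
        if r.healthy and primary_written_bytes - r.replicated_bytes <= max_lag_bytes
    ]
    return max(candidates, key=lambda r: r.replicated_bytes, default=None)

replicas = [
    Replica("replica-1", replicated_bytes=9_999_000, healthy=True),
    Replica("replica-2", replicated_bytes=8_000_000, healthy=True),   # lags too far behind
    Replica("replica-3", replicated_bytes=10_000_000, healthy=False), # unhealthy
]
best = choose_promotion_candidate(replicas, primary_written_bytes=10_000_000)
print("Promote:", best.name if best else "no safe candidate; escalate to an operator")
```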
Untested failover is no failover at all. Regularly practice failover in production (during low-traffic periods) to ensure mechanisms work, team knows procedures, and recovery time meets objectives. Many outages extend hours because failover was assumed to work but had never been tested.
Several architectural patterns provide high availability for databases, each with different trade-offs in complexity, cost, and availability level.
Active-Passive (Primary-Standby):
The simplest HA architecture: a single primary handles all writes, one or more standbys continuously replicate from it, and when the primary fails a standby is promoted to take over.
Pros: Simple; low resource overhead (standbys can be smaller)
Cons: Standby resources underutilized; promotion takes time
Availability: Typically 99.9% to 99.99%
Active-Active (Multi-Primary):
More complex, higher availability: two or more nodes accept writes simultaneously, changes replicate in both directions, and conflicting writes to the same data must be detected and resolved.
Pros: No failover delay (every node is already active); higher throughput
Cons: Complex conflict resolution; typically requires application awareness
Availability: Can achieve 99.99% to 99.999%
Shared-Nothing Clusters:
Distributed databases (Cassandra, CockroachDB, Spanner): data is automatically partitioned and replicated across many nodes, any node can serve requests, and the loss of individual nodes is absorbed by quorum or consensus protocols rather than a discrete failover event.
Pros: Inherent HA; scales horizontally; no discrete failover process
Cons: More complex data model; consistency guarantees vary by system (some are eventually consistent)
Availability: Can achieve 99.999%+ with proper deployment
| Architecture | Write Nodes | Failover Time | Complexity | Best For |
|---|---|---|---|---|
| Active-Passive | 1 | 30-60 seconds (auto) | Low | Traditional RDBMS, simpler apps |
| Active-Active | 2+ | None (instant) | High | Global apps, maximum uptime |
| Shared-Nothing | All nodes | None (inherent) | Medium-High | Large-scale, cloud-native apps |
Cloud-managed HA:
Major cloud providers offer managed databases with built-in HA, such as Amazon RDS Multi-AZ deployments and Aurora, Google Cloud SQL's high-availability configuration, and Azure's zone-redundant database offerings.
Trade-off: Managed HA reduces operational burden but limits customization and may have higher cost.
Understanding how major platforms implement database HA provides practical insights for your own architectures.
Pattern 1: PostgreSQL with Patroni
Patroni is an open-source tool for PostgreSQL HA.
Architecture: each PostgreSQL node runs a Patroni agent that manages the local database, leader election happens through a distributed configuration store such as etcd, Consul, or ZooKeeper, and a REST API on each node exposes role and health information that load balancers (for example HAProxy) can use to route traffic to the current primary.
Pattern 2: MySQL with Orchestrator
Orchestrator is a MySQL HA and replication management tool.
Key features: it discovers and visualizes the replication topology through a web UI and API, detects primary failure holistically by checking both the primary and what its replicas observe, and performs automated or operator-initiated failover, promoting a suitable replica and reorganizing the rest of the topology beneath it.
Pattern 3: Amazon Aurora
Aurora separates compute from storage for unique HA characteristics.
Why it's notable: the storage layer keeps six copies of the data across three availability zones and repairs itself automatically, the database instances are comparatively stateless compute that can be replaced quickly, and replicas read from the same shared storage volume, so failover does not wait for data to be copied and usually completes in well under a minute.
```yaml
# Example Patroni configuration for PostgreSQL HA
# This configuration creates a 3-node HA cluster

scope: postgres-cluster      # Cluster name
namespace: /db/              # etcd namespace
name: node1                  # This node's name

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd3:
  hosts:
    - etcd1:2379
    - etcd2:2379
    - etcd3:2379

bootstrap:
  dcs:
    ttl: 30                           # Leader key TTL in seconds
    loop_wait: 10                     # Seconds between status checks
    retry_timeout: 10                 # Timeout for API calls
    maximum_lag_on_failover: 1048576  # Max lag (bytes) for failover candidate
    postgresql:
      use_pg_rewind: true             # Use pg_rewind for fast replica resync
      use_slots: true                 # Use replication slots
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
  initdb:
    - encoding: UTF8
    - data-checksums
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/postgresql/data
  bin_dir: /usr/lib/postgresql/15/bin
  authentication:
    replication:
      username: replicator
      password: secret_replication_pass
    superuser:
      username: postgres
      password: secret_superuser_pass
  parameters:
    synchronous_commit: "on"
    synchronous_standby_names: "*"    # Sync to at least one replica
```

Begin with managed database HA (RDS Multi-AZ, Cloud SQL HA) if possible—it handles 90% of HA needs with minimal operational burden. Only build custom HA with Patroni/Orchestrator when you need specific control, cost optimization at scale, or features managed services don't provide.
High availability isn't just about database architecture—it requires holistic system design that anticipates and gracefully handles failures at every layer.
Application-level considerations: connection strings should point at a virtual endpoint (DNS name, proxy, or service-discovery entry) rather than a specific server, connections and queries need sensible timeouts, writes should be safe to retry (idempotent where possible), and the application should degrade gracefully rather than crash when the database is briefly unreachable.
Connection handling during failover:
During failover, applications experience: dropped connections and connection errors, a short window (seconds to minutes) in which writes fail, and possibly stale reads if traffic is temporarily served by lagging replicas.
Best practices: retry failed operations with exponential backoff and jitter, keep DNS TTLs low or use a connection proxy that tracks the current primary, validate connections taken from the pool, and surface clear errors to users instead of hanging indefinitely. A minimal retry sketch follows.
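A minimal retry-with-backoff sketch; the helper name and limits are illustrative, and the commented usage line assumes a psycopg-style connect call rather than any specific required API:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(operation: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 0.5, max_delay: float = 10.0) -> T:
    """Retry a transient-failure-prone operation with exponential backoff and jitter.

    During a failover, the first few attempts typically fail while a replica is
    being promoted; backing off avoids hammering the database as it recovers.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise RuntimeError("unreachable")

# Usage (illustrative): wrap whatever call your driver uses to open a connection, e.g.
# conn = with_retries(lambda: psycopg.connect("host=db-primary.internal dbname=app"))
```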
Testing resilience:
The only way to truly validate HA is to test in production. Development and staging environments cannot replicate production network conditions, load patterns, and timing. Start with clearly announced tests during maintenance windows, then graduate to unannounced tests as confidence grows.
High availability transforms databases from single points of failure into resilient systems that survive component failures. Let's consolidate the key concepts:
- At scale, component failure is a constant, and failures often correlate.
- Availability = MTBF / (MTBF + MTTR); you improve it by failing less often or recovering faster.
- Replication provides the redundancy that raises effective MTBF; automated failover cuts MTTR.
- Failover depends on reliable failure detection and on preventing split-brain.
- Active-passive, active-active, and shared-nothing architectures trade complexity for additional nines.
- Failover that has never been tested cannot be trusted.
What's next:
High availability keeps your database running when local failures occur. But what if you need to serve users across continents? The next page explores Geographic Distribution—how replication enables low-latency access for globally distributed users and provides protection against regional disasters.
You now understand high availability as a fundamental motivation for database replication. You can calculate availability requirements, design failover mechanisms, choose appropriate HA architectures, and validate resilience through testing. Next, we explore geographic distribution for global-scale applications.