At 2:47 AM on a Saturday night, a hard drive in your production database server fails. The RAID controller catches the failure and begins rebuilding—but six minutes later, a second drive in the same array fails (a correlated failure, often caused by drives from the same manufacturing batch reaching end-of-life simultaneously). The database goes offline. Your application returns errors to every user. Revenue stops. On-call engineers scramble.
This scenario—or variations involving power failures, network partitions, kernel panics, or human error—occurs every day across the industry. The question isn't whether your database will experience failure, but when and how your system will respond.
High availability (HA) is the discipline of designing systems that continue operating despite component failures. In the context of databases, this means ensuring data remains accessible even when individual database servers fail. Replication is the foundational mechanism that makes database high availability possible.
By the end of this page, you will understand why single-server databases are fundamentally unreliable, how replication enables high availability, the mathematics of availability (nines), different failover mechanisms and their trade-offs, and how to design database architectures that survive real-world failure scenarios.
Every component in a computing system will eventually fail. Understanding failure modes is the first step toward designing for resilience.
Hardware failure rates:
Commercial hardware experiences predictable failure rates, typically measured as Annual Failure Rate (AFR):
| Component | Annual Failure Rate (AFR) | Mean Time Between Failures (MTBF) |
|---|---|---|
| Hard Disk Drive (HDD) | 2-4% | 25-50 years (per drive) |
| Solid State Drive (SSD) | 0.5-2% | 50-200 years (per drive) |
| Server (complete) | 5-10% | 10-20 years (per server) |
| Power Supply Unit (PSU) | 1-3% | 33-100 years (per PSU) |
| Memory (DIMM module) | 0.1-0.5% | 200-1000 years (per module) |
| Network Switch | 2-5% | 20-50 years (per switch) |
These numbers may seem comfortable for individual components, but the math changes dramatically at scale:
The fleet effect:
With 100 servers, each with a 5% annual failure rate, you can expect roughly 5 server failures per year; the probability that at least one server fails during the year is about 99.4% (that is, 1 - 0.95^100).
With 1,000 servers, expect roughly 50 failures per year, or about one failure every week.
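To make the fleet effect concrete, here is a minimal sketch of that arithmetic, assuming every server fails independently at the stated AFR (an optimistic assumption, as the note below on correlated failures explains):

```python
# Fleet-effect arithmetic: assumes every server fails independently at the same AFR.
def fleet_failure_stats(num_servers: int, annual_failure_rate: float) -> tuple[float, float]:
    expected_failures = num_servers * annual_failure_rate
    # Probability that at least one server in the fleet fails during the year.
    p_at_least_one = 1 - (1 - annual_failure_rate) ** num_servers
    return expected_failures, p_at_least_one

for fleet_size in (1, 100, 1_000):
    expected, p_any = fleet_failure_stats(fleet_size, annual_failure_rate=0.05)
    print(f"{fleet_size:>5} servers: ~{expected:.1f} failures/year, "
          f"P(at least one failure) = {p_any:.1%}")
```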
At scale, failure is not an exception—it's a constant reality. Amazon, Google, and other hyperscalers report that hardware failures occur every few minutes across their fleets.
Beyond hardware:
Hardware is only one failure mode. Production systems also contend with software bugs and bad releases, kernel panics, network partitions, power outages, and human error.
Individual failure rates underestimate real-world risk. Failures often correlate: a power outage affects multiple servers; a bad software release breaks every instance; drives from the same batch fail within days of each other. Design must account for common-mode failures, not just independent random failures.
Availability is precisely defined as the proportion of time a system is operational and accessible. It's commonly expressed as a percentage or as "nines."
Availability formula:
Availability = Uptime / (Uptime + Downtime)
Or equivalently:
Availability = MTBF / (MTBF + MTTR)
Where MTBF is the Mean Time Between Failures (the average time the system runs before failing) and MTTR is the Mean Time To Repair (the average time from failure to restored service).
The nines of availability:
| Availability | Common Name | Downtime Per Year | Downtime Per Month | Downtime Per Week |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | Three nines | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.3 seconds | 6.05 seconds |
| 99.9999% | Six nines | 31.5 seconds | 2.63 seconds | 0.6 seconds |
What different businesses require: internal tools and batch pipelines can often tolerate two nines; customer-facing SaaS products typically target three nines; e-commerce and financial applications commonly aim for four nines; payment networks, telecoms, and emergency systems push for five nines or more.
The cost curve:
Each additional nine is exponentially harder (and more expensive) to achieve. Going from 99% to 99.9% might require adding read replicas and automated failover. Going from 99.99% to 99.999% might require multi-region active-active deployment, custom tooling, and 24/7 dedicated SRE teams.
You can improve availability by either increasing MTBF (more reliable components, redundancy) or decreasing MTTR (faster detection, automated recovery). Replication addresses both: redundant replicas increase effective MTBF, and automated failover decreases MTTR from hours to seconds.
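As a rough illustration of those two levers, the sketch below applies the availability formula to hypothetical MTBF and MTTR figures (the specific numbers are assumptions, not measurements) to show how much faster recovery alone buys:

```python
HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(avail: float) -> float:
    return (1 - avail) * HOURS_PER_YEAR

# Hypothetical server: one failure per year (MTBF ~8,760 h), 9 hours of manual recovery.
manual = availability(mtbf_hours=8_760, mttr_hours=9)
# Same failure rate, but automated failover restores service in about one minute.
automated = availability(mtbf_hours=8_760, mttr_hours=1 / 60)

for label, a in (("manual recovery", manual), ("automated failover", automated)):
    print(f"{label:>18}: {a:.5%} available, "
          f"~{annual_downtime_hours(a) * 60:.1f} min downtime/year")
```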
A single-server database has an inherent availability ceiling: when that server fails, the database is unavailable until repair. Replication fundamentally changes this equation by ensuring data exists on multiple independent servers.
The single-server problem:
Consider a database server with 99.9% availability on its own: roughly one failure per year, with around 8.8 hours needed to detect the problem, repair or replace the hardware, and restore service.
This server will experience approximately 8.76 hours of downtime per year—unacceptable for most production applications.
Replication as redundancy:
With one primary and one synchronous replica, each at 99.9% availability and failing independently, the database is down only when both servers are down at once: 0.001 × 0.001 = 0.000001 of the time, a theoretical availability of 99.9999% (about 31 seconds of downtime per year).
In practice, the improvement isn't quite this dramatic due to correlated failures and failover transition time, but replication typically adds 1-2 nines of availability.
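Here is a minimal sketch of that redundancy math, assuming every copy fails independently with the same availability; as the paragraph above notes, correlated failures make this an optimistic upper bound:

```python
def redundant_availability(single_node_availability: float, replicas: int) -> float:
    """The system is down only if every independent copy is down simultaneously."""
    p_all_down = (1 - single_node_availability) ** replicas
    return 1 - p_all_down

SECONDS_PER_YEAR = 365 * 24 * 3600
for copies in (1, 2, 3):
    a = redundant_availability(0.999, copies)
    downtime_seconds = (1 - a) * SECONDS_PER_YEAR
    print(f"{copies} independent copies: {a:.6%} available, "
          f"~{downtime_seconds:,.0f} s downtime/year")
```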
Key requirements for HA through replication: at least one up-to-date copy of the data on independent hardware (ideally in a separate failure domain), a mechanism to detect primary failure, a failover process that promotes a replica, and a way to redirect clients to the new primary.
Synchronous vs. asynchronous replication for HA:
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Data durability | No data loss on failover | Potential data loss (transactions not yet replicated) |
| Write latency | Higher (waits for replica ack) | Lower (immediate commit) |
| Availability during network issues | May block writes if replica unreachable | Writes continue; lag increases |
| Failover complexity | Any replica can become primary | Must select replica with most current data |
Most production systems use asynchronous replication for performance, accepting small potential data loss during rare failover events. Critical systems (banking, healthcare) may use synchronous replication despite the latency cost.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. With synchronous replication, RPO = 0. With async, RPO = replication lag at failure time. Recovery Time Objective (RTO): Maximum acceptable time to restore service. Automated failover achieves RTO of 30-60 seconds; manual failover may take 5-30 minutes.
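As an illustration of how replication lag turns into RPO, the sketch below uses hypothetical numbers for write rate and lag (both are assumptions, not measurements from any particular system):

```python
def estimate_async_rpo(replication_lag_seconds: float, writes_per_second: float):
    """Rough worst-case data loss if the primary fails right now under async replication."""
    lost_transactions = replication_lag_seconds * writes_per_second
    return replication_lag_seconds, lost_transactions

# Hypothetical workload: 500 writes/s with 2 seconds of replication lag at failure time.
rpo_seconds, lost = estimate_async_rpo(replication_lag_seconds=2.0, writes_per_second=500)
print(f"Async: RPO ~ {rpo_seconds:.1f} s, up to ~{lost:.0f} transactions lost on failover")
print("Sync:  RPO = 0 s (every committed transaction already exists on a replica)")
```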
When the primary database fails, the system must detect the failure and transition to a replica. This process—failover—is the critical path in high availability architecture.
Failure detection:
Before failover can occur, the system must determine that the primary is truly unavailable. This is deceptively difficult: a slow or overloaded server looks the same as a dead one, and a network partition can make a healthy primary appear failed to some observers but not others.
Common detection mechanisms: periodic heartbeats or health checks with a timeout, quorum-based agreement among multiple monitors (so a single partitioned observer cannot trigger failover), and leader leases held in a coordination service such as etcd or ZooKeeper.
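A minimal sketch of the quorum idea, assuming several independent monitors that each report whether they can reach the primary (the names and data structures here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class MonitorReport:
    monitor_id: str
    primary_reachable: bool

def primary_considered_failed(reports: list[MonitorReport]) -> bool:
    """Declare failure only if a strict majority of monitors cannot reach the primary.

    Requiring a quorum avoids failing over just because one monitor is
    partitioned away from a perfectly healthy primary.
    """
    unreachable = sum(1 for r in reports if not r.primary_reachable)
    return unreachable > len(reports) // 2

reports = [
    MonitorReport("monitor-a", primary_reachable=False),
    MonitorReport("monitor-b", primary_reachable=False),
    MonitorReport("monitor-c", primary_reachable=True),
]
print("Trigger failover:", primary_considered_failed(reports))  # True: 2 of 3 agree
```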
Types of failover: manual (an operator investigates and promotes a replica by hand), automatic (tooling detects the failure and promotes a replica without human intervention), and semi-automatic (tooling prepares the failover but waits for a human to approve it).
Split-brain: The nightmare scenario:
Split-brain occurs when both the old primary and a promoted replica believe they are the primary, accepting writes independently. This creates divergent data that may be impossible to reconcile.
Preventing split-brain: fence the old primary so it cannot accept writes (for example by cutting its power or network access, often called STONITH), require a quorum before any promotion, and use expiring leader leases so an isolated primary demotes itself.
Failover steps (typical automatic failover):
1. Detect missed heartbeats from the primary.
2. Confirm the failure with a quorum of observers.
3. Fence the old primary so it can no longer accept writes.
4. Select the replica with the most recent replicated data (see the sketch below).
5. Promote that replica to primary.
6. Repoint the remaining replicas at the new primary.
7. Update routing (DNS, proxy, or service discovery) so clients reconnect.
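A minimal sketch of step 4, choosing the promotion candidate by replication position; the cap on acceptable lag loosely mirrors the maximum_lag_on_failover setting in the Patroni configuration shown later, and the data structures here are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Replica:
    name: str
    replicated_bytes: int   # How far this replica has replayed the primary's log
    healthy: bool

def choose_promotion_candidate(
    replicas: list[Replica],
    primary_written_bytes: int,
    max_lag_bytes: int = 1_048_576,
) -> Optional[Replica]:
    """Pick the healthy replica with the most data, rejecting any that lag too far behind."""
    candidates = [
        r for r in replicas
        if r.healthy and primary_written_bytes - r.replicated_bytes <= max_lag_bytes
    ]
    return max(candidates, key=lambda r: r.replicated_bytes, default=None)

replicas = [
    Replica("replica-1", replicated_bytes=9_999_000, healthy=True),
    Replica("replica-2", replicated_bytes=8_000_000, healthy=True),   # lags too far behind
    Replica("replica-3", replicated_bytes=10_000_000, healthy=False), # unhealthy
]
best = choose_promotion_candidate(replicas, primary_written_bytes=10_000_000)
print("Promote:", best.name if best else "no safe candidate; escalate to an operator")
```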
Untested failover is no failover at all. Regularly practice failover in production (during low-traffic periods) to ensure mechanisms work, team knows procedures, and recovery time meets objectives. Many outages extend hours because failover was assumed to work but had never been tested.
Several architectural patterns provide high availability for databases, each with different trade-offs in complexity, cost, and availability level.
Active-Passive (Primary-Standby):
The simplest HA architecture: a single primary handles all writes, one or more standbys continuously replicate from it, and when the primary fails a standby is promoted to take over.
Pros: Simple; low resource overhead (standbys can be smaller)
Cons: Standby resources underutilized; promotion takes time
Availability: Typically 99.9% to 99.99%
Active-Active (Multi-Primary):
More complex, higher availability: two or more nodes accept writes simultaneously, changes replicate in both directions, and conflicting writes to the same data must be detected and resolved.
Pros: No failover delay (every node is already active); higher throughput
Cons: Complex conflict resolution; typically requires application awareness
Availability: Can achieve 99.99% to 99.999%
Shared-Nothing Clusters:
Distributed databases (Cassandra, CockroachDB, Spanner): data is automatically partitioned and replicated across many nodes, any node can serve requests, and the loss of individual nodes is absorbed by quorum or consensus protocols rather than a discrete failover event.
Pros: Inherent HA; scales horizontally; no discrete failover process
Cons: More complex data model; consistency guarantees vary by system (some are eventually consistent)
Availability: Can achieve 99.999%+ with proper deployment
| Architecture | Write Nodes | Failover Time | Complexity | Best For |
|---|---|---|---|---|
| Active-Passive | 1 | 30-60 seconds (auto) | Low | Traditional RDBMS, simpler apps |
| Active-Active | 2+ | None (instant) | High | Global apps, maximum uptime |
| Shared-Nothing | All nodes | None (inherent) | Medium-High | Large-scale, cloud-native apps |
Cloud-managed HA:
Major cloud providers offer managed databases with built-in HA, such as Amazon RDS Multi-AZ deployments and Aurora, Google Cloud SQL's high-availability configuration, and Azure's zone-redundant database offerings.
Trade-off: Managed HA reduces operational burden but limits customization and may have higher cost.
Understanding how major platforms implement database HA provides practical insights for your own architectures.
Pattern 1: PostgreSQL with Patroni
Patroni is an open-source tool for PostgreSQL HA.
Architecture: each PostgreSQL node runs a Patroni agent that manages the local database, leader election happens through a distributed configuration store such as etcd, Consul, or ZooKeeper, and a REST API on each node exposes role and health information that load balancers (for example HAProxy) can use to route traffic to the current primary.
Pattern 2: MySQL with Orchestrator
Orchestrator is a MySQL HA and replication management tool.
Key features: it discovers and visualizes the replication topology through a web UI and API, detects primary failure holistically by checking both the primary and what its replicas observe, and performs automated or operator-initiated failover, promoting a suitable replica and reorganizing the rest of the topology beneath it.
Pattern 3: Amazon Aurora
Aurora separates compute from storage for unique HA characteristics.
Why it's notable: the storage layer keeps six copies of the data across three availability zones and repairs itself automatically, the database instances are comparatively stateless compute that can be replaced quickly, and replicas read from the same shared storage volume, so failover does not wait for data to be copied and usually completes in well under a minute.
```yaml
# Example Patroni configuration for PostgreSQL HA
# This configuration creates a 3-node HA cluster

scope: postgres-cluster      # Cluster name
namespace: /db/              # etcd namespace
name: node1                  # This node's name

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd3:
  hosts:
    - etcd1:2379
    - etcd2:2379
    - etcd3:2379

bootstrap:
  dcs:
    ttl: 30                           # Leader key TTL in seconds
    loop_wait: 10                     # Seconds between status checks
    retry_timeout: 10                 # Timeout for API calls
    maximum_lag_on_failover: 1048576  # Max lag (bytes) for failover candidate
    postgresql:
      use_pg_rewind: true             # Use pg_rewind for fast replica resync
      use_slots: true                 # Use replication slots
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
  initdb:
    - encoding: UTF8
    - data-checksums
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/postgresql/data
  bin_dir: /usr/lib/postgresql/15/bin
  authentication:
    replication:
      username: replicator
      password: secret_replication_pass
    superuser:
      username: postgres
      password: secret_superuser_pass
  parameters:
    synchronous_commit: "on"
    synchronous_standby_names: "*"    # Sync to at least one replica
```

Begin with managed database HA (RDS Multi-AZ, Cloud SQL HA) if possible—it handles 90% of HA needs with minimal operational burden. Only build custom HA with Patroni/Orchestrator when you need specific control, cost optimization at scale, or features managed services don't provide.
High availability isn't just about database architecture—it requires holistic system design that anticipates and gracefully handles failures at every layer.
Application-level considerations: connection strings should point at a virtual endpoint (DNS name, proxy, or service-discovery entry) rather than a specific server, connections and queries need sensible timeouts, writes should be safe to retry (idempotent where possible), and the application should degrade gracefully rather than crash when the database is briefly unreachable.
Connection handling during failover:
During failover, applications experience: dropped connections and connection errors, a short window (seconds to minutes) in which writes fail, and possibly stale reads if traffic is temporarily served by lagging replicas.
Best practices: retry failed operations with exponential backoff and jitter, keep DNS TTLs low or use a connection proxy that tracks the current primary, validate connections taken from the pool, and surface clear errors to users instead of hanging indefinitely. A minimal retry sketch follows.
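A minimal retry-with-backoff sketch; the helper name and limits are illustrative, and the commented usage line assumes a psycopg-style connect call rather than any specific required API:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(operation: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 0.5, max_delay: float = 10.0) -> T:
    """Retry a transient-failure-prone operation with exponential backoff and jitter.

    During a failover, the first few attempts typically fail while a replica is
    being promoted; backing off avoids hammering the database as it recovers.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise RuntimeError("unreachable")

# Usage (illustrative): wrap whatever call your driver uses to open a connection, e.g.
# conn = with_retries(lambda: psycopg.connect("host=db-primary.internal dbname=app"))
```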
Testing resilience:
The only way to truly validate HA is to test in production. Development and staging environments cannot replicate production network conditions, load patterns, and timing. Start with clearly announced tests during maintenance windows, then graduate to unannounced tests as confidence grows.
High availability transforms databases from single points of failure into resilient systems that survive component failures. Let's consolidate the key concepts:
- At scale, component failure is a constant, and failures often correlate.
- Availability = MTBF / (MTBF + MTTR); you improve it by failing less often or recovering faster.
- Replication provides the redundancy that raises effective MTBF; automated failover cuts MTTR.
- Failover depends on reliable failure detection and on preventing split-brain.
- Active-passive, active-active, and shared-nothing architectures trade complexity for additional nines.
- Failover that has never been tested cannot be trusted.
What's next:
High availability keeps your database running when local failures occur. But what if you need to serve users across continents? The next page explores Geographic Distribution—how replication enables low-latency access for globally distributed users and provides protection against regional disasters.
You now understand high availability as a fundamental motivation for database replication. You can calculate availability requirements, design failover mechanisms, choose appropriate HA architectures, and validate resilience through testing. Next, we explore geographic distribution for global-scale applications.