When GitLab suffered a major production database incident in 2017, an engineer troubleshooting replication lag accidentally wiped the data directory on the production PostgreSQL primary, and the backup mechanisms that should have contained the damage turned out to be broken. The episode is a reminder of what replication does and does not do: a standby mirrors the primary faithfully, destructive commands included, so replication alone cannot protect you from mistakes.
This incident illustrates a fundamental truth about database replication: it's a powerful tool that requires deep understanding. Replication can save your business or destroy your recovery options, depending on how it's designed and operated.
Replication is the technology that enables databases to maintain synchronized copies across multiple servers and locations. It's the foundation of both high availability and disaster recovery. Understanding replication modes, architectures, and failure scenarios is essential for any database professional.
By the end of this page, you will understand the spectrum of replication technologies, from synchronous to asynchronous modes. You'll learn how to design replication architectures for disaster recovery, handle replication failures, and monitor replication health.
Database replication is the process of copying and maintaining database objects across multiple servers. In the context of disaster recovery, replication serves as a real-time or near-real-time data protection mechanism that enables rapid failover.
Core Concepts:
Primary (Master): The database server that accepts write operations. In most replication topologies, there is exactly one primary at any given time.
Replica (Slave/Standby): Servers that receive copied data from the primary. They may be read-only or capable of accepting writes (in multi-primary scenarios).
Replication Stream: The flow of data changes from primary to replica. This typically consists of transaction log entries (WAL in PostgreSQL, binary log in MySQL).
Replication Lag: The time difference between when a change is committed on the primary and when it's applied on the replica. This is the core metric for DR readiness.
Types of Replicated Data:
Physical Replication: Replicates the actual data blocks or transaction log entries. The replica is a byte-for-byte copy of the primary at the storage level.
Logical Replication: Replicates the logical SQL operations (INSERT, UPDATE, DELETE). The replica applies these operations to achieve the same logical state.
Transactional Replication: Replicates individual transactions to ensure consistency. Transactions are applied atomically on replicas.
For disaster recovery, physical replication is generally preferred because it provides exact copies with lower overhead. Logical replication is better suited for data migration, cross-version replication, or selective table-level replication. Many organizations use both: physical replication for DR and logical replication for reporting replicas.
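To make the logical option concrete, here is a minimal PostgreSQL logical replication sketch. The publication and subscription names, table names, and connection string are illustrative placeholders, not part of the original material.

```sql
-- On the primary: publish only the tables the reporting replica needs.
CREATE PUBLICATION dr_reporting_pub FOR TABLE orders, customers;

-- On the logical replica: subscribe to the publication.
-- Initial table data is copied automatically, then changes stream in.
CREATE SUBSCRIPTION dr_reporting_sub
    CONNECTION 'host=primary-db port=5432 dbname=appdb user=repl_user password=XXX'
    PUBLICATION dr_reporting_pub;

-- Check subscription health on the replica.
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;
```

Note that only the listed tables are replicated, which is exactly why logical replication suits selective reporting replicas better than full-site DR.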
Synchronous replication ensures that every committed transaction exists on at least one replica before the commit is acknowledged to the application. This provides the strongest data protection but comes with performance implications.
How Synchronous Replication Works:
1. The application issues COMMIT.
2. The primary writes the transaction to its local log (WAL/binlog) and streams it to the synchronous replica.
3. The replica writes the log data to durable storage (and, depending on configuration, applies it) and sends an acknowledgment.
4. Only after the acknowledgment arrives does the primary confirm the commit to the application.
RPO Guarantee:
With synchronous replication, RPO is effectively zero for committed transactions. If the primary fails immediately after acknowledging a commit, the transaction is guaranteed to exist on the replica.
```ini
# PostgreSQL Synchronous Replication Configuration

# Primary server configuration (postgresql.conf)
# ============================================

# Enable WAL shipping for replication
wal_level = replica

# Maximum number of concurrent connections from standby servers
max_wal_senders = 5

# Ensure synchronous behavior - wait for replica acknowledgment
synchronous_commit = on

# Define synchronous standby servers
# 'FIRST 1' means wait for first replica to acknowledge
# Names must match application_name in replica connection
synchronous_standby_names = 'FIRST 1 (dr_site_1, dr_site_2)'

# Keep enough WAL for replicas to catch up after network issues
wal_keep_size = 1GB

# Hot standby allows read queries on replicas
hot_standby = on

# Primary pg_hba.conf - allow replication connections
# TYPE  DATABASE     USER       ADDRESS          METHOD
host    replication  repl_user  192.168.1.0/24   scram-sha-256
host    replication  repl_user  10.0.0.0/8       scram-sha-256

# Replica recovery configuration (standby.signal file exists)
# Create empty file to indicate standby mode:
# touch /var/lib/postgresql/data/standby.signal

# Replica postgresql.conf
# ============================================
primary_conninfo = 'host=primary-db port=5432 user=repl_user password=XXX application_name=dr_site_1'
hot_standby = on
primary_slot_name = 'dr_site_1_slot'
```

Performance Implications:
Synchronous replication adds latency to every transaction:
| Component | Typical Latency |
|---|---|
| Local disk write | 0.5-2 ms |
| Network roundtrip (same datacenter) | 0.5-2 ms |
| Network roundtrip (100 km) | 1-2 ms |
| Network roundtrip (1000 km) | 10-15 ms |
| Replica disk write | 0.5-2 ms |
| Total added per transaction | 2-20+ ms |
For a database processing 1000 transactions/second, this latency directly limits throughput: with 10 ms of added commit latency, a single session can complete at most about 100 commits per second, so sustaining 1000 transactions/second requires spreading commits across many concurrent sessions.
Failure Scenarios:
Replica Failure: If the synchronous replica fails, the primary faces a choice:
- Block all commits until a synchronous standby is available again, preserving the zero-RPO guarantee at the cost of availability.
- Continue committing without synchronous acknowledgment, preserving availability but temporarily giving up the zero-RPO guarantee.
With multiple candidates configured (as in the 'FIRST 1 (dr_site_1, dr_site_2)' example above), the primary simply waits on the next standby in the list instead.
Network Partition: If the network between primary and replica fails:
- Commits stall waiting for an acknowledgment that cannot arrive, and applications see hanging transactions.
- The primary cannot distinguish a dead replica from an unreachable one, so the same block-or-degrade choice applies.
- Operations teams need a documented policy, and often automation, for when to relax the synchronous requirement.
Synchronous replication becomes impractical over long distances due to physics. Light travels approximately 200 km per millisecond through fiber optic cable. A DR site 1000 km away adds 10+ ms per transaction—potentially unacceptable for high-frequency transaction systems. This fundamental constraint drives architectural decisions about DR site location.
Asynchronous replication commits transactions on the primary without waiting for replica acknowledgment. This provides better performance but introduces potential data loss during failover.
How Asynchronous Replication Works:
1. The application issues COMMIT.
2. The primary writes the transaction to its local log (WAL/binlog).
3. The primary immediately acknowledges the commit to the application.
4. The log data is shipped to the replica in the background.
5. The replica applies the changes as they arrive.
Steps 4 and 5 happen independently of the commit acknowledgment. The replica may be seconds to minutes behind the primary at any given moment.
RPO Characteristics:
With asynchronous replication, RPO equals the replication lag at the moment of failure. If the primary fails when the replica is 30 seconds behind, approximately 30 seconds of transactions are lost.
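To see what that means operationally, here is a minimal replica-side check in PostgreSQL comparing current replay lag against a 30-second RPO target; the target value is illustrative.

```sql
-- On a PostgreSQL replica: how much committed data would be lost
-- if the primary failed right now?
-- Caveat: pg_last_xact_replay_timestamp() is NULL until a transaction
-- has been replayed, and apparent lag grows while the primary is idle;
-- a heartbeat table (covered later) avoids both issues.
SELECT
    EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds,
    EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) > 30 AS rpo_breached;
```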
```sql
-- MySQL Asynchronous Replication Setup

-- On Primary Server:
-- 1. Enable binary logging in my.cnf
-- [mysqld]
-- server-id = 1
-- log-bin = mysql-bin
-- binlog_format = ROW
-- gtid_mode = ON
-- enforce_gtid_consistency = ON

-- 2. Create replication user
CREATE USER 'repl_user'@'%' IDENTIFIED BY 'secure_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';
FLUSH PRIVILEGES;

-- 3. Get primary status for initial sync
SHOW MASTER STATUS;
-- Note: File and Position for replica configuration

-- On Replica Server:
-- my.cnf configuration
-- [mysqld]
-- server-id = 2
-- relay-log = relay-log
-- read_only = ON
-- gtid_mode = ON
-- enforce_gtid_consistency = ON

-- Configure replication source (MySQL 8.0+)
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'primary-db.example.com',
    SOURCE_USER = 'repl_user',
    SOURCE_PASSWORD = 'secure_password',
    SOURCE_AUTO_POSITION = 1,     -- Use GTID-based replication
    SOURCE_CONNECT_RETRY = 10,    -- Retry interval in seconds
    SOURCE_RETRY_COUNT = 86400;   -- Retry for 24 hours

-- Start replication
START REPLICA;

-- Check replication status
SHOW REPLICA STATUS\G

-- Key fields to monitor (MySQL 8.0.22+ naming):
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: <number> (replication lag)
-- Last_Error: <any errors>

-- Approximate replication lag from the heartbeat timestamp
-- (only updated when the source is idle, so treat as a rough signal)
SELECT TIMESTAMPDIFF(SECOND, LAST_HEARTBEAT_TIMESTAMP, NOW()) AS lag_seconds
FROM performance_schema.replication_connection_status;
```

Advantages of Asynchronous Replication:
- No added commit latency; primary performance is unaffected by replicas.
- Works over any distance, since the primary never waits on the network.
- Replica outages or slowdowns never block writes on the primary.
Managing Replication Lag:
Replication lag is the critical metric for async replication. Factors that increase lag:
| Factor | Impact |
|---|---|
| Network bandwidth | Insufficient bandwidth can't keep up with write rate |
| Network latency | Higher latency means slower transmission |
| Replica CPU | Slow replica can't apply changes fast enough |
| Replica I/O | Disk bottlenecks slow change application |
| Large transactions | Big transactions take longer to transfer/apply |
| Complex queries | Heavy read load on replica slows replication |
Lag Management Strategies:
- Provision network bandwidth for peak write bursts, not average write rate.
- Use parallel apply workers on the replica where the DBMS supports them.
- Break very large transactions into smaller batches.
- Keep heavy reporting queries off replicas that serve DR.
- Alert on lag before it breaches the RPO target, as described below.
Set multiple alert thresholds for replication lag: Warning at 50% of RPO target, Critical at 80% of RPO target, and Emergency at 100% of RPO target. This provides escalating visibility as lag approaches unacceptable levels.
Semi-synchronous replication provides a middle ground between the data protection of synchronous replication and the performance of asynchronous replication. It's particularly popular in MySQL environments.
How Semi-Synchronous Works:
1. The application issues COMMIT.
2. The primary writes the transaction to its binary log and sends it to the replica.
3. The replica writes the log data to its relay log and acknowledges receipt.
4. The primary completes the commit and confirms it to the application.
5. The replica applies the changes later, independently of the commit.
The key difference from full synchronous: the replica only needs to acknowledge receipt of the log data, not application of the changes.
Fallback Behavior:
Semi-synchronous replication includes intelligent fallback:
- If no replica acknowledges within the configured timeout (rpl_semi_sync_master_timeout below), the primary falls back to asynchronous operation rather than blocking commits.
- When a replica reconnects and catches up, semi-synchronous operation resumes automatically.
```sql
-- MySQL Semi-Synchronous Replication Configuration

-- On Primary Server:
-- Install semi-sync plugin (if not already installed)
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';

-- Enable semi-synchronous mode
SET GLOBAL rpl_semi_sync_master_enabled = 1;

-- Timeout before falling back to async (milliseconds)
SET GLOBAL rpl_semi_sync_master_timeout = 10000;  -- 10 seconds

-- Require at least one replica for semi-sync
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1;

-- Wait point configuration
-- AFTER_SYNC: Wait after binlog sync, before storage commit
-- AFTER_COMMIT: Wait after storage commit (legacy)
SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';

-- On Replica Server:
-- Install semi-sync plugin
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';

-- Enable semi-synchronous mode
SET GLOBAL rpl_semi_sync_slave_enabled = 1;

-- Restart replication to activate
STOP REPLICA;
START REPLICA;

-- Monitoring semi-sync status
SHOW STATUS LIKE 'Rpl_semi_sync%';

-- Key metrics:
-- Rpl_semi_sync_master_status: ON (semi-sync active)
-- Rpl_semi_sync_master_no_tx: Count of async fallback transactions
-- Rpl_semi_sync_master_yes_tx: Count of semi-sync transactions
-- Rpl_semi_sync_master_clients: Connected semi-sync replicas
-- Rpl_semi_sync_master_wait_sessions: Sessions waiting for ack

-- Calculate semi-sync success rate
SELECT Rpl_semi_sync_master_yes_tx /
       (Rpl_semi_sync_master_yes_tx + Rpl_semi_sync_master_no_tx) * 100
       AS semi_sync_success_rate
FROM (
    SELECT VARIABLE_VALUE AS Rpl_semi_sync_master_yes_tx
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Rpl_semi_sync_master_yes_tx'
) yes_tx,
(
    SELECT VARIABLE_VALUE AS Rpl_semi_sync_master_no_tx
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Rpl_semi_sync_master_no_tx'
) no_tx;
```

| Characteristic | Synchronous | Semi-Synchronous | Asynchronous |
|---|---|---|---|
| Data protection | Zero RPO | Near-zero RPO | Lag-dependent RPO |
| Performance impact | High latency | Moderate latency | No impact |
| Replica failure handling | Blocks or fails | Falls back to async | No impact |
| Geographic range | < 100 km practical | < 500 km practical | Any distance |
| Complexity | Highest | Medium | Lowest |
| Best for | Financial, critical | Important systems | Geographically distant DR |
In MySQL semi-sync, the AFTER_SYNC wait point provides stronger guarantees than AFTER_COMMIT. With AFTER_SYNC, the primary waits for replica acknowledgment before committing to the storage engine, ensuring the replica always has a copy of any transaction the application sees as committed. This is the recommended setting for DR scenarios.
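A quick way to verify a running primary, reusing the plugin and variable names from the configuration above:

```sql
-- Confirm which wait point is in effect
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_wait_point';

-- Confirm semi-sync is actually active right now;
-- after a fallback to async this shows OFF until a replica catches up.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status';
```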
The arrangement of primary and replica servers—the topology—significantly affects DR capabilities, complexity, and failure modes.
Topology 1: Primary-Standby (Simple)
The simplest topology: one primary, one passive standby.
[Primary] ──replication──> [Standby]
Topology 2: Primary with Multiple Standbys
One primary replicating to multiple standby servers.
┌──> [Standby-1 (sync, local)]
[Primary] ──────────┼──> [Standby-2 (async, DR site)]
└──> [Standby-3 (async, reporting)]
Topology 3: Cascading Replication
Primary replicates to intermediate, which replicates to downstream.
[Primary] ──sync──> [Intermediate] ──async──> [DR Standby]
└──> [Reporting Replica]
Topology 4: Multi-Primary (Active-Active)
Multiple servers accept writes, replicating to each other.
[Primary-A] <──replication──> [Primary-B]
│ │
└────────> Applications <─────┘
Topology 5: Ring Replication
Multiple servers in a ring, each replicating to the next.
[Server-A] ──> [Server-B] ──> [Server-C] ──> [Server-A]
A common enterprise pattern: synchronous replication to a standby in the same metro area (low latency, zero RPO) plus asynchronous replication to a remote site (any distance, non-zero RPO). This provides both fast failover capability and geographic disaster protection.
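A minimal sketch of that hybrid pattern in PostgreSQL terms, building on the earlier configuration; the standby names metro_standby and dr_remote are hypothetical.

```sql
-- Run on the primary. Only the metro standby is listed as synchronous;
-- the remote DR replica connects with its own application_name and,
-- being absent from this list, replicates asynchronously.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (metro_standby)';
SELECT pg_reload_conf();

-- Verify each standby's mode on the primary:
-- metro_standby should show sync_state = 'sync', dr_remote 'async'.
SELECT application_name, sync_state FROM pg_stat_replication;
```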
Replication is a complex distributed system, and failures are inevitable. Understanding failure modes and recovery procedures is essential for reliable DR operations.
Common Failure Modes:
1. Network Disconnection
The connection between primary and replica is interrupted.
Symptoms:
- Replication lag grows steadily while the primary keeps accepting writes.
- Connection errors and reconnect attempts appear in the database logs.
- Monitoring shows the replica's last-received position frozen.
Recovery:
- Most replication implementations reconnect and resume automatically once the network returns.
- Verify the primary retained enough WAL/binlog to cover the outage; if not, the replica must be rebuilt.
- After reconnection, watch the lag drain back to normal before trusting the replica for failover (a quick check is sketched below).
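A minimal post-outage check, assuming the MySQL replica from the earlier setup; a sketch, not a full runbook:

```sql
-- Confirm both replication threads restarted after the disconnection
SHOW REPLICA STATUS\G
-- Expect: Replica_IO_Running: Yes, Replica_SQL_Running: Yes

-- If the IO thread stayed down, restart it manually
START REPLICA IO_THREAD;

-- With GTID auto-positioning enabled (SOURCE_AUTO_POSITION = 1),
-- the replica requests exactly the transactions it is missing,
-- so no manual position bookkeeping is needed after the outage.
```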
2. Replica Disk Failure
Storage failure on the replica server.
Symptoms:
- I/O errors in the replica's logs; the replica may crash or stop applying changes.
- Replication state shows the apply (SQL) thread stopped.
Recovery:
- Replace the failed storage, then restore the replica from a fresh base backup of the primary.
- Reattach it to the replication stream and let it catch up fully before counting it toward DR capacity.
3. Replication Conflict
In logical or multi-primary replication, conflicting changes.
Symptoms:
- Replication halts with a duplicate-key, missing-row, or similar apply error.
- Lag grows while the error persists.
Recovery:
- Inspect the failing change and resolve the conflict deliberately: fix the offending row, or skip the event only if you can prove it is safe.
- In multi-primary setups, rely on the system's conflict-resolution policy and fix the application pattern that produced the conflict.
4. Data Divergence
Replica data differs from primary unexpectedly.
Symptoms:
- Checksums or row counts differ between primary and replica.
- Queries against the replica return stale or incorrect results despite a healthy replication status.
Recovery:
- Identify the divergent tables (for MySQL, tools such as Percona's pt-table-checksum exist for this).
- Rebuild the affected tables, or the whole replica, from the primary; never fail over to a replica known to have diverged.
5. Replication Position Loss
Replica loses track of its position in the replication stream.
Symptoms:
- The replica cannot find the WAL segment or binlog file it needs; errors such as "requested WAL segment has already been removed" appear.
- Replication cannot resume from the recorded position.
Recovery:
- If log archives cover the gap, replay from the archive and then resume streaming.
- Otherwise, rebuild the replica from a fresh copy of the primary, as sketched below.
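One way to rebuild, sketched for MySQL 8.0.17+ using the clone plugin (an assumption; PostgreSQL users would instead take a new base backup with pg_basebackup):

```sql
-- On the broken replica: re-clone the data set from the primary.
-- Assumes the clone plugin is available on both servers, the donor
-- user has BACKUP_ADMIN, and the executing user has CLONE_ADMIN.
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
SET GLOBAL clone_valid_donor_list = 'primary-db.example.com:3306';
CLONE INSTANCE FROM 'repl_user'@'primary-db.example.com':3306
    IDENTIFIED BY 'secure_password';
-- The server restarts automatically after the clone completes
-- (when running under a monitoring process such as systemd).

-- Then reattach with GTID auto-positioning and restart replication.
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'primary-db.example.com',
    SOURCE_USER = 'repl_user',
    SOURCE_PASSWORD = 'secure_password',
    SOURCE_AUTO_POSITION = 1;
START REPLICA;
```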
```sql
-- PostgreSQL Replication Health Check

-- Check replication status on primary
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    sync_state,
    -- Calculate lag in bytes
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
    -- Calculate approximate lag in seconds
    EXTRACT(EPOCH FROM (now() - reply_time)) AS lag_seconds
FROM pg_stat_replication;

-- Check replication slots on primary
SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    -- Check how much WAL is retained for this slot
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;

-- On replica: check recovery status
SELECT
    pg_is_in_recovery() AS is_replica,
    pg_last_wal_receive_lsn() AS last_received,
    pg_last_wal_replay_lsn() AS last_replayed,
    pg_last_xact_replay_timestamp() AS last_replay_time,
    -- Calculate replay lag
    NOW() - pg_last_xact_replay_timestamp() AS replay_lag;

-- MySQL Replica Health Check
SHOW REPLICA STATUS\G

-- Key fields to check:
-- Replica_IO_Running: Yes/No
-- Replica_SQL_Running: Yes/No
-- Seconds_Behind_Source: lag
-- Last_Error: any errors
-- Executed_Gtid_Set: GTID position

-- SHOW output cannot be queried with SELECT ... FROM; use
-- performance_schema for a scriptable thread-state check instead.
SELECT
    CASE
        WHEN io.SERVICE_STATE <> 'ON' THEN 'IO thread stopped'
        WHEN sql_t.SERVICE_STATE <> 'ON' THEN 'SQL thread stopped'
        ELSE 'Threads running'
    END AS replication_status
FROM performance_schema.replication_connection_status io,
     performance_schema.replication_applier_status sql_t;
```

Always enable WAL/binlog archiving alongside streaming replication. If the primary drops old WAL segments before the replica retrieves them, the replica becomes unrecoverable without a full resync. WAL archiving provides a safety net for extended disconnections or slow replicas.
Effective replication monitoring is essential for disaster recovery confidence. You need to know—at all times—whether your replicas are healthy and current enough to meet RPO targets.
Critical Metrics:
1. Replication Lag
The most important metric. Measures how far behind the replica is.
| Measurement Method | Accuracy | Use Case |
|---|---|---|
| Time since last transaction | High | Transaction-based workloads |
| LSN/GTID difference | Byte-accurate | High-volume systems |
| Heartbeat table | Custom accuracy | Complex topologies |
2. Replication State
Is replication actively running?
3. Replication Throughput
How fast is data replicating?
Monitoring Implementation:
Automated Collection:
Deploy monitoring agents that continuously collect replication metrics:
- Lag in seconds and bytes, per replica
- Replication thread and connection state
- Throughput (bytes or transactions applied per second)
- Log/slot retention headroom on the primary
Alerting Strategy:
Define alert thresholds aligned with RPO:
| Alert Level | Threshold | Action |
|---|---|---|
| Warning | Lag > 50% of RPO | Investigate, notify team |
| Critical | Lag > 80% of RPO | Escalate, prepare for action |
| Emergency | Lag > RPO | Page on-call, potential DR situation |
| Emergency | Replication stopped or erroring | Page on-call, high priority |
Dashboard Components:
Build a replication dashboard showing:
- Current lag for every replica, charted against the RPO target
- Replication state (running/stopped/error) per stream
- Throughput and historical lag trends
- WAL/binlog retention headroom on the primary
- Recent replication errors and alert history
```yaml
# Prometheus Alerting Rules for Replication Monitoring

groups:
  - name: replication_alerts
    rules:
      # PostgreSQL Replication Lag Alert
      - alert: PostgreSQLReplicationLagWarning
        expr: pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL replication lag is high"
          description: "Replica {{ $labels.instance }} is {{ $value }}s behind"

      - alert: PostgreSQLReplicationLagCritical
        expr: pg_replication_lag_seconds > 180
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication lag is critical"
          description: "Replica {{ $labels.instance }} is {{ $value }}s behind - RPO at risk"

      - alert: PostgreSQLReplicationStopped
        expr: pg_replication_is_replica == 1 and pg_replication_lag_seconds == -1
        for: 1m
        labels:
          severity: emergency
        annotations:
          summary: "PostgreSQL replication has stopped"
          description: "Replica {{ $labels.instance }} is not receiving updates"

      # MySQL Replication Alerts
      - alert: MySQLReplicationStopped
        expr: mysql_slave_status_slave_io_running == 0 or mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: emergency
        annotations:
          summary: "MySQL replication has stopped"
          description: "Replica {{ $labels.instance }} replication thread stopped"

      - alert: MySQLReplicationLagWarning
        expr: mysql_slave_status_seconds_behind_master > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL replication lag warning"
          description: "Replica {{ $labels.instance }} is {{ $value }}s behind master"

      - alert: MySQLReplicationError
        expr: mysql_slave_status_last_errno != 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication error"
          description: "Replica {{ $labels.instance }} has error: {{ $labels.last_error }}"
```

For accurate lag measurement across all DBMS types, implement a heartbeat table: a small table on the primary that receives a timestamp update every second. Query this table on the replica and compare timestamps. This provides a universally applicable, transport-agnostic lag measurement that includes all replication delays.
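A minimal sketch of such a heartbeat table; the table name and the scheduler that drives the update (cron, MySQL events, pg_cron, etc.) are assumptions, and the approach presumes primary and replica clocks are synchronized (e.g., via NTP).

```sql
-- On the primary: a one-row table, refreshed every second by a scheduler.
CREATE TABLE replication_heartbeat (
    id INT PRIMARY KEY,    -- always 1; single-row table
    ts TIMESTAMP NOT NULL  -- last heartbeat written on the primary
);
INSERT INTO replication_heartbeat (id, ts) VALUES (1, CURRENT_TIMESTAMP);

-- Run once per second via the scheduler of your choice:
UPDATE replication_heartbeat SET ts = CURRENT_TIMESTAMP WHERE id = 1;

-- On the replica: lag is simply the age of the replicated heartbeat.
-- (MySQL flavor shown; in PostgreSQL use
--  EXTRACT(EPOCH FROM now() - ts) instead.)
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS lag_seconds
FROM replication_heartbeat
WHERE id = 1;
```

Because the heartbeat rides the normal replication stream, the measurement captures every source of delay (network, relay, apply) regardless of the replication technology underneath.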
Replication is the technology that makes modern disaster recovery possible. Let's consolidate the key takeaways:
- Synchronous replication delivers zero RPO for committed transactions, but adds commit latency and is only practical over short distances.
- Asynchronous replication adds no commit overhead and works over any distance; its RPO equals the replication lag at the moment of failure.
- Semi-synchronous replication is a pragmatic middle ground: the replica acknowledges receipt rather than application, with automatic fallback to async.
- Topology matters: a synchronous metro standby plus an asynchronous remote replica is a common pattern that balances RPO and geography.
- Replication copies mistakes as faithfully as data; pair it with WAL/binlog archiving and backups.
- Monitor replication lag continuously against RPO targets, with escalating alert thresholds.
What's next:
With data protection mechanisms in place through replication, we need procedures to switch operations from the failed primary to the standby. Next, we'll explore Failover—the processes, automation, and decision-making that enable seamless transition to DR infrastructure.
You now understand the full spectrum of database replication technologies and their role in disaster recovery. Replication is the mechanism that keeps your DR site current and ready for activation. With properly configured and monitored replication, you can meet stringent RPO targets and enable fast failover when disaster strikes. Next, we'll explore how to actually perform failover when the moment comes.