When GitLab suffered a major production database incident in 2017, an engineer troubleshooting replication lag accidentally wiped the data directory on the production PostgreSQL primary, and the backup mechanisms that should have contained the damage turned out to be broken. The episode is a reminder of what replication does and does not do: a standby mirrors the primary faithfully, destructive commands included, so replication alone cannot protect you from mistakes.
This incident illustrates a fundamental truth about database replication: it's a powerful tool that requires deep understanding. Replication can save your business or destroy your recovery options, depending on how it's designed and operated.
Replication is the technology that enables databases to maintain synchronized copies across multiple servers and locations. It's the foundation of both high availability and disaster recovery. Understanding replication modes, architectures, and failure scenarios is essential for any database professional.
By the end of this page, you will understand the spectrum of replication technologies, from synchronous to asynchronous modes. You'll learn how to design replication architectures for disaster recovery, handle replication failures, and monitor replication health.
Database replication is the process of copying and maintaining database objects across multiple servers. In the context of disaster recovery, replication serves as a real-time or near-real-time data protection mechanism that enables rapid failover.
Core Concepts:
Primary (Master): The database server that accepts write operations. In most replication topologies, there is exactly one primary at any given time.
Replica (Slave/Standby): Servers that receive copied data from the primary. They may be read-only or capable of accepting writes (in multi-primary scenarios).
Replication Stream: The flow of data changes from primary to replica. This typically consists of transaction log entries (WAL in PostgreSQL, binary log in MySQL).
Replication Lag: The time difference between when a change is committed on the primary and when it's applied on the replica. This is the core metric for DR readiness.
Types of Replicated Data:
Physical Replication: Replicates the actual data blocks or transaction log entries. The replica is a byte-for-byte copy of the primary at the storage level.
Logical Replication: Replicates the logical SQL operations (INSERT, UPDATE, DELETE). The replica applies these operations to achieve the same logical state.
Transactional Replication: Replicates individual transactions to ensure consistency. Transactions are applied atomically on replicas.
For disaster recovery, physical replication is generally preferred because it provides exact copies with lower overhead. Logical replication is better suited for data migration, cross-version replication, or selective table-level replication. Many organizations use both: physical replication for DR and logical replication for reporting replicas.
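To make the logical option concrete, here is a minimal PostgreSQL logical replication sketch. The publication and subscription names, table names, and connection string are illustrative placeholders, not part of the original material.

```sql
-- On the primary: publish only the tables the reporting replica needs.
CREATE PUBLICATION dr_reporting_pub FOR TABLE orders, customers;

-- On the logical replica: subscribe to the publication.
-- Initial table data is copied automatically, then changes stream in.
CREATE SUBSCRIPTION dr_reporting_sub
    CONNECTION 'host=primary-db port=5432 dbname=appdb user=repl_user password=XXX'
    PUBLICATION dr_reporting_pub;

-- Check subscription health on the replica.
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;
```

Note that only the listed tables are replicated, which is exactly why logical replication suits selective reporting replicas better than full-site DR.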
Synchronous replication ensures that every committed transaction exists on at least one replica before the commit is acknowledged to the application. This provides the strongest data protection but comes with performance implications.
How Synchronous Replication Works:
1. The application issues COMMIT.
2. The primary writes the transaction to its local log (WAL/binlog) and streams it to the synchronous replica.
3. The replica writes the log data to durable storage (and, depending on configuration, applies it) and sends an acknowledgment.
4. Only after the acknowledgment arrives does the primary confirm the commit to the application.
RPO Guarantee:
With synchronous replication, RPO is effectively zero for committed transactions. If the primary fails immediately after acknowledging a commit, the transaction is guaranteed to exist on the replica.
```ini
# PostgreSQL Synchronous Replication Configuration

# Primary server configuration (postgresql.conf)
# ============================================

# Enable WAL shipping for replication
wal_level = replica

# Maximum number of concurrent connections from standby servers
max_wal_senders = 5

# Ensure synchronous behavior - wait for replica acknowledgment
synchronous_commit = on

# Define synchronous standby servers
# 'FIRST 1' means wait for first replica to acknowledge
# Names must match application_name in replica connection
synchronous_standby_names = 'FIRST 1 (dr_site_1, dr_site_2)'

# Keep enough WAL for replicas to catch up after network issues
wal_keep_size = 1GB

# Hot standby allows read queries on replicas
hot_standby = on

# Primary pg_hba.conf - allow replication connections
# TYPE  DATABASE     USER       ADDRESS          METHOD
host    replication  repl_user  192.168.1.0/24   scram-sha-256
host    replication  repl_user  10.0.0.0/8       scram-sha-256

# Replica recovery configuration (standby.signal file exists)
# Create empty file to indicate standby mode:
# touch /var/lib/postgresql/data/standby.signal

# Replica postgresql.conf
# ============================================
primary_conninfo = 'host=primary-db port=5432 user=repl_user password=XXX application_name=dr_site_1'
hot_standby = on
primary_slot_name = 'dr_site_1_slot'
```

Performance Implications:
Synchronous replication adds latency to every transaction:
| Component | Typical Latency |
|---|---|
| Local disk write | 0.5-2 ms |
| Network roundtrip (same datacenter) | 0.5-2 ms |
| Network roundtrip (100 km) | 1-2 ms |
| Network roundtrip (1000 km) | 10-15 ms |
| Replica disk write | 0.5-2 ms |
| Total added per transaction | 2-20+ ms |
For a database processing 1000 transactions/second, this latency directly limits throughput: with 10 ms of added commit latency, a single session can complete at most about 100 commits per second, so sustaining 1000 transactions/second requires spreading commits across many concurrent sessions.
Failure Scenarios:
Replica Failure: If the synchronous replica fails, the primary faces a choice:
- Block all commits until a synchronous standby is available again, preserving the zero-RPO guarantee at the cost of availability.
- Continue committing without synchronous acknowledgment, preserving availability but temporarily giving up the zero-RPO guarantee.
With multiple candidates configured (as in the 'FIRST 1 (dr_site_1, dr_site_2)' example above), the primary simply waits on the next standby in the list instead.
Network Partition: If the network between primary and replica fails:
- Commits stall waiting for an acknowledgment that cannot arrive, and applications see hanging transactions.
- The primary cannot distinguish a dead replica from an unreachable one, so the same block-or-degrade choice applies.
- Operations teams need a documented policy, and often automation, for when to relax the synchronous requirement.
Synchronous replication becomes impractical over long distances due to physics. Light travels approximately 200 km per millisecond through fiber optic cable. A DR site 1000 km away adds 10+ ms per transaction—potentially unacceptable for high-frequency transaction systems. This fundamental constraint drives architectural decisions about DR site location.
Asynchronous replication commits transactions on the primary without waiting for replica acknowledgment. This provides better performance but introduces potential data loss during failover.
How Asynchronous Replication Works:
1. The application issues COMMIT.
2. The primary writes the transaction to its local log (WAL/binlog).
3. The primary immediately acknowledges the commit to the application.
4. The log data is shipped to the replica in the background.
5. The replica applies the changes as they arrive.
Steps 4 and 5 happen independently of the commit acknowledgment. The replica may be seconds to minutes behind the primary at any given moment.
RPO Characteristics:
With asynchronous replication, RPO equals the replication lag at the moment of failure. If the primary fails when the replica is 30 seconds behind, approximately 30 seconds of transactions are lost.
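To see what that means operationally, here is a minimal replica-side check in PostgreSQL comparing current replay lag against a 30-second RPO target; the target value is illustrative.

```sql
-- On a PostgreSQL replica: how much committed data would be lost
-- if the primary failed right now?
-- Caveat: pg_last_xact_replay_timestamp() is NULL until a transaction
-- has been replayed, and apparent lag grows while the primary is idle;
-- a heartbeat table (covered later) avoids both issues.
SELECT
    EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds,
    EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) > 30 AS rpo_breached;
```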
```sql
-- MySQL Asynchronous Replication Setup

-- On Primary Server:
-- 1. Enable binary logging in my.cnf
-- [mysqld]
-- server-id = 1
-- log-bin = mysql-bin
-- binlog_format = ROW
-- gtid_mode = ON
-- enforce_gtid_consistency = ON

-- 2. Create replication user
CREATE USER 'repl_user'@'%' IDENTIFIED BY 'secure_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';
FLUSH PRIVILEGES;

-- 3. Get primary status for initial sync
SHOW MASTER STATUS;
-- Note: File and Position for replica configuration

-- On Replica Server:
-- my.cnf configuration
-- [mysqld]
-- server-id = 2
-- relay-log = relay-log
-- read_only = ON
-- gtid_mode = ON
-- enforce_gtid_consistency = ON

-- Configure replication source (MySQL 8.0+)
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'primary-db.example.com',
    SOURCE_USER = 'repl_user',
    SOURCE_PASSWORD = 'secure_password',
    SOURCE_AUTO_POSITION = 1,     -- Use GTID-based replication
    SOURCE_CONNECT_RETRY = 10,    -- Retry interval in seconds
    SOURCE_RETRY_COUNT = 86400;   -- Retry for 24 hours

-- Start replication
START REPLICA;

-- Check replication status
SHOW REPLICA STATUS\G

-- Key fields to monitor (MySQL 8.0.22+ naming):
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: <number> (replication lag)
-- Last_Error: <any errors>

-- Approximate replication lag from the heartbeat timestamp
-- (only updated when the source is idle, so treat as a rough signal)
SELECT TIMESTAMPDIFF(SECOND, LAST_HEARTBEAT_TIMESTAMP, NOW()) AS lag_seconds
FROM performance_schema.replication_connection_status;
```

Advantages of Asynchronous Replication:
- No added commit latency; primary performance is unaffected by replicas.
- Works over any distance, since the primary never waits on the network.
- Replica outages or slowdowns never block writes on the primary.
Managing Replication Lag:
Replication lag is the critical metric for async replication. Factors that increase lag:
| Factor | Impact |
|---|---|
| Network bandwidth | Insufficient bandwidth can't keep up with write rate |
| Network latency | Higher latency means slower transmission |
| Replica CPU | Slow replica can't apply changes fast enough |
| Replica I/O | Disk bottlenecks slow change application |
| Large transactions | Big transactions take longer to transfer/apply |
| Complex queries | Heavy read load on replica slows replication |
Lag Management Strategies:
- Provision network bandwidth for peak write bursts, not average write rate.
- Use parallel apply workers on the replica where the DBMS supports them.
- Break very large transactions into smaller batches.
- Keep heavy reporting queries off replicas that serve DR.
- Alert on lag before it breaches the RPO target, as described below.
Set multiple alert thresholds for replication lag: Warning at 50% of RPO target, Critical at 80% of RPO target, and Emergency at 100% of RPO target. This provides escalating visibility as lag approaches unacceptable levels.
Semi-synchronous replication provides a middle ground between the data protection of synchronous replication and the performance of asynchronous replication. It's particularly popular in MySQL environments.
How Semi-Synchronous Works:
1. The application issues COMMIT.
2. The primary writes the transaction to its binary log and sends it to the replica.
3. The replica writes the log data to its relay log and acknowledges receipt.
4. The primary completes the commit and confirms it to the application.
5. The replica applies the changes later, independently of the commit.
The key difference from full synchronous: the replica only needs to acknowledge receipt of the log data, not application of the changes.
Fallback Behavior:
Semi-synchronous replication includes intelligent fallback:
- If no replica acknowledges within the configured timeout (rpl_semi_sync_master_timeout below), the primary falls back to asynchronous operation rather than blocking commits.
- When a replica reconnects and catches up, semi-synchronous operation resumes automatically.
```sql
-- MySQL Semi-Synchronous Replication Configuration

-- On Primary Server:
-- Install semi-sync plugin (if not already installed)
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';

-- Enable semi-synchronous mode
SET GLOBAL rpl_semi_sync_master_enabled = 1;

-- Timeout before falling back to async (milliseconds)
SET GLOBAL rpl_semi_sync_master_timeout = 10000;  -- 10 seconds

-- Require at least one replica for semi-sync
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1;

-- Wait point configuration
-- AFTER_SYNC: Wait after binlog sync, before storage commit
-- AFTER_COMMIT: Wait after storage commit (legacy)
SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';

-- On Replica Server:
-- Install semi-sync plugin
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';

-- Enable semi-synchronous mode
SET GLOBAL rpl_semi_sync_slave_enabled = 1;

-- Restart replication to activate
STOP REPLICA;
START REPLICA;

-- Monitoring semi-sync status
SHOW STATUS LIKE 'Rpl_semi_sync%';

-- Key metrics:
-- Rpl_semi_sync_master_status: ON (semi-sync active)
-- Rpl_semi_sync_master_no_tx: Count of async fallback transactions
-- Rpl_semi_sync_master_yes_tx: Count of semi-sync transactions
-- Rpl_semi_sync_master_clients: Connected semi-sync replicas
-- Rpl_semi_sync_master_wait_sessions: Sessions waiting for ack

-- Calculate semi-sync success rate
SELECT Rpl_semi_sync_master_yes_tx /
       (Rpl_semi_sync_master_yes_tx + Rpl_semi_sync_master_no_tx) * 100
       AS semi_sync_success_rate
FROM (
    SELECT VARIABLE_VALUE AS Rpl_semi_sync_master_yes_tx
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Rpl_semi_sync_master_yes_tx'
) yes_tx,
(
    SELECT VARIABLE_VALUE AS Rpl_semi_sync_master_no_tx
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Rpl_semi_sync_master_no_tx'
) no_tx;
```

| Characteristic | Synchronous | Semi-Synchronous | Asynchronous |
|---|---|---|---|
| Data protection | Zero RPO | Near-zero RPO | Lag-dependent RPO |
| Performance impact | High latency | Moderate latency | No impact |
| Replica failure handling | Blocks or fails | Falls back to async | No impact |
| Geographic range | < 100 km practical | < 500 km practical | Any distance |
| Complexity | Highest | Medium | Lowest |
| Best for | Financial, critical | Important systems | Geographically distant DR |
In MySQL semi-sync, the AFTER_SYNC wait point provides stronger guarantees than AFTER_COMMIT. With AFTER_SYNC, the primary waits for replica acknowledgment before committing to the storage engine, ensuring the replica always has a copy of any transaction the application sees as committed. This is the recommended setting for DR scenarios.
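A quick way to verify a running primary, reusing the plugin and variable names from the configuration above:

```sql
-- Confirm which wait point is in effect
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_wait_point';

-- Confirm semi-sync is actually active right now;
-- after a fallback to async this shows OFF until a replica catches up.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status';
```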
The arrangement of primary and replica servers—the topology—significantly affects DR capabilities, complexity, and failure modes.
Topology 1: Primary-Standby (Simple)
The simplest topology: one primary, one passive standby.
[Primary] ──replication──> [Standby]
Topology 2: Primary with Multiple Standbys
One primary replicating to multiple standby servers.
┌──> [Standby-1 (sync, local)]
[Primary] ──────────┼──> [Standby-2 (async, DR site)]
└──> [Standby-3 (async, reporting)]
Topology 3: Cascading Replication
Primary replicates to intermediate, which replicates to downstream.
[Primary] ──sync──> [Intermediate] ──async──> [DR Standby]
└──> [Reporting Replica]
Topology 4: Multi-Primary (Active-Active)
Multiple servers accept writes, replicating to each other.
[Primary-A] <──replication──> [Primary-B]
│ │
└────────> Applications <─────┘
Topology 5: Ring Replication
Multiple servers in a ring, each replicating to the next.
[Server-A] ──> [Server-B] ──> [Server-C] ──> [Server-A]
A common enterprise pattern: synchronous replication to a standby in the same metro area (low latency, zero RPO) plus asynchronous replication to a remote site (any distance, non-zero RPO). This provides both fast failover capability and geographic disaster protection.
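A minimal sketch of that hybrid pattern in PostgreSQL terms, building on the earlier configuration; the standby names metro_standby and dr_remote are hypothetical.

```sql
-- Run on the primary. Only the metro standby is listed as synchronous;
-- the remote DR replica connects with its own application_name and,
-- being absent from this list, replicates asynchronously.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (metro_standby)';
SELECT pg_reload_conf();

-- Verify each standby's mode on the primary:
-- metro_standby should show sync_state = 'sync', dr_remote 'async'.
SELECT application_name, sync_state FROM pg_stat_replication;
```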
Replication is a complex distributed system, and failures are inevitable. Understanding failure modes and recovery procedures is essential for reliable DR operations.
Common Failure Modes:
1. Network Disconnection
The connection between primary and replica is interrupted.
Symptoms:
- Replication lag grows steadily while the primary keeps accepting writes.
- Connection errors and reconnect attempts appear in the database logs.
- Monitoring shows the replica's last-received position frozen.
Recovery:
- Most replication implementations reconnect and resume automatically once the network returns.
- Verify the primary retained enough WAL/binlog to cover the outage; if not, the replica must be rebuilt.
- After reconnection, watch the lag drain back to normal before trusting the replica for failover (a quick check is sketched below).
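A minimal post-outage check, assuming the MySQL replica from the earlier setup; a sketch, not a full runbook:

```sql
-- Confirm both replication threads restarted after the disconnection
SHOW REPLICA STATUS\G
-- Expect: Replica_IO_Running: Yes, Replica_SQL_Running: Yes

-- If the IO thread stayed down, restart it manually
START REPLICA IO_THREAD;

-- With GTID auto-positioning enabled (SOURCE_AUTO_POSITION = 1),
-- the replica requests exactly the transactions it is missing,
-- so no manual position bookkeeping is needed after the outage.
```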
2. Replica Disk Failure
Storage failure on the replica server.
Symptoms:
- I/O errors in the replica's logs; the replica may crash or stop applying changes.
- Replication state shows the apply (SQL) thread stopped.
Recovery:
- Replace the failed storage, then restore the replica from a fresh base backup of the primary.
- Reattach it to the replication stream and let it catch up fully before counting it toward DR capacity.
3. Replication Conflict
In logical or multi-primary replication, conflicting changes.
Symptoms:
- Replication halts with a duplicate-key, missing-row, or similar apply error.
- Lag grows while the error persists.
Recovery:
- Inspect the failing change and resolve the conflict deliberately: fix the offending row, or skip the event only if you can prove it is safe.
- In multi-primary setups, rely on the system's conflict-resolution policy and fix the application pattern that produced the conflict.
4. Data Divergence
Replica data differs from primary unexpectedly.
Symptoms:
- Checksums or row counts differ between primary and replica.
- Queries against the replica return stale or incorrect results despite a healthy replication status.
Recovery:
- Identify the divergent tables (for MySQL, tools such as Percona's pt-table-checksum exist for this).
- Rebuild the affected tables, or the whole replica, from the primary; never fail over to a replica known to have diverged.
5. Replication Position Loss
Replica loses track of its position in the replication stream.
Symptoms:
- The replica cannot find the WAL segment or binlog file it needs; errors such as "requested WAL segment has already been removed" appear.
- Replication cannot resume from the recorded position.
Recovery:
- If log archives cover the gap, replay from the archive and then resume streaming.
- Otherwise, rebuild the replica from a fresh copy of the primary, as sketched below.
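One way to rebuild, sketched for MySQL 8.0.17+ using the clone plugin (an assumption; PostgreSQL users would instead take a new base backup with pg_basebackup):

```sql
-- On the broken replica: re-clone the data set from the primary.
-- Assumes the clone plugin is available on both servers, the donor
-- user has BACKUP_ADMIN, and the executing user has CLONE_ADMIN.
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
SET GLOBAL clone_valid_donor_list = 'primary-db.example.com:3306';
CLONE INSTANCE FROM 'repl_user'@'primary-db.example.com':3306
    IDENTIFIED BY 'secure_password';
-- The server restarts automatically after the clone completes
-- (when running under a monitoring process such as systemd).

-- Then reattach with GTID auto-positioning and restart replication.
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'primary-db.example.com',
    SOURCE_USER = 'repl_user',
    SOURCE_PASSWORD = 'secure_password',
    SOURCE_AUTO_POSITION = 1;
START REPLICA;
```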
```sql
-- PostgreSQL Replication Health Check

-- Check replication status on primary
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    sync_state,
    -- Calculate lag in bytes
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
    -- Calculate approximate lag in seconds
    EXTRACT(EPOCH FROM (now() - reply_time)) AS lag_seconds
FROM pg_stat_replication;

-- Check replication slots on primary
SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    -- Check how much WAL is retained for this slot
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;

-- On replica: check recovery status
SELECT
    pg_is_in_recovery() AS is_replica,
    pg_last_wal_receive_lsn() AS last_received,
    pg_last_wal_replay_lsn() AS last_replayed,
    pg_last_xact_replay_timestamp() AS last_replay_time,
    -- Calculate replay lag
    NOW() - pg_last_xact_replay_timestamp() AS replay_lag;

-- MySQL Replica Health Check
SHOW REPLICA STATUS\G

-- Key fields to check:
-- Replica_IO_Running: Yes/No
-- Replica_SQL_Running: Yes/No
-- Seconds_Behind_Source: lag
-- Last_Error: any errors
-- Executed_Gtid_Set: GTID position

-- SHOW output cannot be queried with SELECT ... FROM; use
-- performance_schema for a scriptable thread-state check instead.
SELECT
    CASE
        WHEN io.SERVICE_STATE <> 'ON' THEN 'IO thread stopped'
        WHEN sql_t.SERVICE_STATE <> 'ON' THEN 'SQL thread stopped'
        ELSE 'Threads running'
    END AS replication_status
FROM performance_schema.replication_connection_status io,
     performance_schema.replication_applier_status sql_t;
```

Always enable WAL/binlog archiving alongside streaming replication. If the primary drops old WAL segments before the replica retrieves them, the replica becomes unrecoverable without a full resync. WAL archiving provides a safety net for extended disconnections or slow replicas.
Effective replication monitoring is essential for disaster recovery confidence. You need to know—at all times—whether your replicas are healthy and current enough to meet RPO targets.
Critical Metrics:
1. Replication Lag
The most important metric. Measures how far behind the replica is.
| Measurement Method | Accuracy | Use Case |
|---|---|---|
| Time since last transaction | High | Transaction-based workloads |
| LSN/GTID difference | Byte-accurate | High-volume systems |
| Heartbeat table | Custom accuracy | Complex topologies |
2. Replication State
Is replication actively running?
3. Replication Throughput
How fast is data replicating?
Monitoring Implementation:
Automated Collection:
Deploy monitoring agents that continuously collect replication metrics:
- Lag in seconds and bytes, per replica
- Replication thread and connection state
- Throughput (bytes or transactions applied per second)
- Log/slot retention headroom on the primary
Alerting Strategy:
Define alert thresholds aligned with RPO:
| Alert Level | Threshold | Action |
|---|---|---|
| Warning | Lag > 50% of RPO | Investigate, notify team |
| Critical | Lag > 80% of RPO | Escalate, prepare for action |
| Emergency | Lag > RPO | Page on-call, potential DR situation |
| Emergency | Replication stopped or erroring | Page on-call, high priority |
Dashboard Components:
Build a replication dashboard showing:
- Current lag for every replica, charted against the RPO target
- Replication state (running/stopped/error) per stream
- Throughput and historical lag trends
- WAL/binlog retention headroom on the primary
- Recent replication errors and alert history
```yaml
# Prometheus Alerting Rules for Replication Monitoring

groups:
  - name: replication_alerts
    rules:
      # PostgreSQL Replication Lag Alert
      - alert: PostgreSQLReplicationLagWarning
        expr: pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL replication lag is high"
          description: "Replica {{ $labels.instance }} is {{ $value }}s behind"

      - alert: PostgreSQLReplicationLagCritical
        expr: pg_replication_lag_seconds > 180
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication lag is critical"
          description: "Replica {{ $labels.instance }} is {{ $value }}s behind - RPO at risk"

      - alert: PostgreSQLReplicationStopped
        expr: pg_replication_is_replica == 1 and pg_replication_lag_seconds == -1
        for: 1m
        labels:
          severity: emergency
        annotations:
          summary: "PostgreSQL replication has stopped"
          description: "Replica {{ $labels.instance }} is not receiving updates"

      # MySQL Replication Alerts
      - alert: MySQLReplicationStopped
        expr: mysql_slave_status_slave_io_running == 0 or mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: emergency
        annotations:
          summary: "MySQL replication has stopped"
          description: "Replica {{ $labels.instance }} replication thread stopped"

      - alert: MySQLReplicationLagWarning
        expr: mysql_slave_status_seconds_behind_master > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL replication lag warning"
          description: "Replica {{ $labels.instance }} is {{ $value }}s behind master"

      - alert: MySQLReplicationError
        expr: mysql_slave_status_last_errno != 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication error"
          description: "Replica {{ $labels.instance }} has error: {{ $labels.last_error }}"
```

For accurate lag measurement across all DBMS types, implement a heartbeat table: a small table on the primary that receives a timestamp update every second. Query this table on the replica and compare timestamps. This provides a universally applicable, transport-agnostic lag measurement that includes all replication delays.
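A minimal sketch of such a heartbeat table; the table name and the scheduler that drives the update (cron, MySQL events, pg_cron, etc.) are assumptions, and the approach presumes primary and replica clocks are synchronized (e.g., via NTP).

```sql
-- On the primary: a one-row table, refreshed every second by a scheduler.
CREATE TABLE replication_heartbeat (
    id INT PRIMARY KEY,    -- always 1; single-row table
    ts TIMESTAMP NOT NULL  -- last heartbeat written on the primary
);
INSERT INTO replication_heartbeat (id, ts) VALUES (1, CURRENT_TIMESTAMP);

-- Run once per second via the scheduler of your choice:
UPDATE replication_heartbeat SET ts = CURRENT_TIMESTAMP WHERE id = 1;

-- On the replica: lag is simply the age of the replicated heartbeat.
-- (MySQL flavor shown; in PostgreSQL use
--  EXTRACT(EPOCH FROM now() - ts) instead.)
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS lag_seconds
FROM replication_heartbeat
WHERE id = 1;
```

Because the heartbeat rides the normal replication stream, the measurement captures every source of delay (network, relay, apply) regardless of the replication technology underneath.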
Replication is the technology that makes modern disaster recovery possible. Let's consolidate the key takeaways:
- Synchronous replication delivers zero RPO for committed transactions, but adds commit latency and is only practical over short distances.
- Asynchronous replication adds no commit overhead and works over any distance; its RPO equals the replication lag at the moment of failure.
- Semi-synchronous replication is a pragmatic middle ground: the replica acknowledges receipt rather than application, with automatic fallback to async.
- Topology matters: a synchronous metro standby plus an asynchronous remote replica is a common pattern that balances RPO and geography.
- Replication copies mistakes as faithfully as data; pair it with WAL/binlog archiving and backups.
- Monitor replication lag continuously against RPO targets, with escalating alert thresholds.
What's next:
With data protection mechanisms in place through replication, we need procedures to switch operations from the failed primary to the standby. Next, we'll explore Failover—the processes, automation, and decision-making that enable seamless transition to DR infrastructure.
You now understand the full spectrum of database replication technologies and their role in disaster recovery. Replication is the mechanism that keeps your DR site current and ready for activation. With properly configured and monitored replication, you can meet stringent RPO targets and enable fast failover when disaster strikes. Next, we'll explore how to actually perform failover when the moment comes.