When a user in Singapore updates their profile photo, that change must propagate to databases in Frankfurt and Virginia—ideally within seconds, reliably every time, without losing any updates along the way. This isn't a network copy operation; it's a precisely coordinated dance of serialization, transmission, conflict detection, and application that occurs billions of times daily across the internet.
Data replication is the circulatory system of multi-region architecture. Without reliable replication, regions become isolated islands; with it, they become a unified global system. The quality of your replication determines your Recovery Point Objective (RPO), your consistency guarantees, and ultimately your users' experience.
This page explores the mechanisms that make cross-region data synchronization possible—from low-level replication protocols to high-level architectural patterns—giving you the knowledge to design replication strategies that meet your system's specific requirements.
By the end of this page, you will be able to reason about replication models and their tradeoffs, configure database-specific replication, design replication topologies for different scenarios, and operate replication reliably in production.
At its core, database replication captures changes from a source system and applies them to one or more target systems. The mechanism for this varies by database, but the fundamental concepts are universal.
Write-Ahead Logging (WAL)
Most modern databases use Write-Ahead Logging: before any change is applied to the actual data, it is written to a sequential log. The log serves two purposes: it enables crash recovery, because committed changes can be replayed after a failure, and it provides an ordered record of every change that can be shipped to other systems.
For cross-region replication, this log becomes the source of truth. Changes are serialized into the log, transmitted to remote regions, and replayed against remote databases.
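To make this concrete, PostgreSQL exposes logical decoding of the WAL through SQL. A minimal sketch using the built-in test_decoding plugin (slot and table names are illustrative; requires wal_level = logical):

```sql
-- Create a logical replication slot that decodes WAL into readable change events
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Make a change...
INSERT INTO users (id, email) VALUES (42, 'a@example.com');

-- ...and read it back from the log: every committed change appears here,
-- in commit order, ready to be shipped to another region
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Clean up (an unused slot retains WAL indefinitely)
SELECT pg_drop_replication_slot('demo_slot');
```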
Log-Based vs. Statement-Based Replication
Statement-Based Replication
The primary records the SQL statements it executed and replicas re-execute them. This is compact, but non-deterministic statements (now(), random(), triggers with side effects) can produce different results on each replica, so it is rarely used on its own for cross-region replication.
Log-Based Replication
The primary ships the low-level changes recorded in its write-ahead log (or row-based binlog), so replicas apply exactly the changes the primary applied. This is deterministic and efficient, and it is the foundation for the mechanisms discussed below.
Logical vs Physical Replication
Physical Replication
Ships the write-ahead log itself, byte for byte, producing a block-level identical copy of the source. The replica must run the same database engine and major version, and it replicates everything in the cluster.
Logical Replication
Decodes the log into row-level change events (inserts, updates, deletes) that can be filtered, transformed, and applied to targets with different schemas or even different database systems.
Cross-region replication typically uses physical replication for disaster recovery and read replicas, and logical replication or CDC when it needs multi-region writes, data filtering, or heterogeneous targets. The table below compares the two approaches:
| Characteristic | Physical Replication | Logical Replication |
|---|---|---|
| Data granularity | Byte/page level | Row/column level |
| Schema flexibility | Identical schemas only | Can differ (with constraints) |
| Data filtering | Not supported | Table/column selection possible |
| Performance overhead | Lower | Higher (encoding/decoding) |
| Conflict detection | Not applicable | Supported |
| Multi-master support | No | Yes |
| Cross-database replication | No | Yes (with CDC tools) |
| Typical use case | DR, read replicas | Active-active, CDC, ETL |
The most consequential replication decision is the synchronization model: should the primary wait for replicas to acknowledge before confirming a write? This choice fundamentally shapes your system's consistency, performance, and availability characteristics.
Asynchronous Replication
The primary acknowledges writes to clients immediately after local persistence, without waiting for replica confirmation. Replicas catch up as fast as the network and their processing allows.
```
Client → Primary (write) → Primary (local commit) → Client (acknowledged)
                                    ↓ (async, background)
                            Replica (eventually receives)
```
Characteristics:

- Write latency is unaffected by replica distance or health
- Acknowledged writes can be lost if the primary fails before replicas catch up (non-zero RPO)
- Replicas may serve stale reads while they lag
- A slow or disconnected replica never blocks the primary
Synchronous Replication
The primary waits for one or more replicas (the required number is configurable) to acknowledge before confirming the write to the client.
```
Client → Primary (write) → Primary (local commit) → Replica (receives)
Client ← Primary (acknowledged)                   ← Replica (acknowledges)
```
Characteristics:

- Acknowledged writes are durable on at least one replica (zero RPO for acknowledged writes)
- Every write pays the network round trip to the farthest required replica
- If the required replicas are unreachable, writes stall or fail unless the configuration falls back to asynchronous
- Depending on the level, a write may be durable on the replica yet not visible to reads there until it is applied
```sql
-- PostgreSQL Synchronous Replication Configuration
-- Demonstrates trade-offs between consistency and performance

-- PRIMARY SERVER CONFIGURATION (postgresql.conf)
-- synchronous_commit controls how much durability each commit waits for.
-- The remote levels only take effect when synchronous_standby_names is set.

-- Option 1: Fully asynchronous - best performance, recent commits can be lost on a crash
synchronous_commit = off

-- Option 2: Local durability only
-- Confirmed after the local WAL flush (durable locally, standbys not waited on)
synchronous_commit = local

-- Option 3: Synchronous with remote write
-- Confirmed after the standby's OS has received the WAL (not yet flushed)
synchronous_commit = remote_write

-- Option 4: Synchronous with remote flush (the default, 'on')
-- Confirmed after the standby flushes the WAL (durable on replica)
synchronous_commit = on

-- Option 5: Synchronous with remote apply
-- Confirmed after the standby has applied the change (visible on replica)
synchronous_commit = remote_apply

-- List of synchronous standbys
-- Format: FIRST N (standby1, standby2, ...) or ANY N (standby1, standby2, ...)
-- FIRST N: wait for the first N standbys in list order
-- ANY N: wait for any N standbys

-- Wait for at least 1 of these to confirm (any order)
synchronous_standby_names = 'ANY 1 (us_west_replica, eu_west_replica)'

-- Wait for the first 2 in list order
-- synchronous_standby_names = 'FIRST 2 (local_sync, us_west_replica)'

-- For cross-region: often synchronous locally, asynchronous to remote regions.
-- This provides local HA without the remote latency penalty.

-- EXAMPLE: Synchronous within region, async across regions
-- primary → local_sync (synchronous, ~1ms latency)
-- primary → us_west (asynchronous, no latency impact)
synchronous_standby_names = 'FIRST 1 (local_sync)'

-- MONITORING: Check replication lag
SELECT
    application_name,
    client_addr,
    state,
    sync_state,   -- 'sync', 'async', 'quorum', 'potential'
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,    -- time from send to write on the standby
    flush_lag,    -- time from send to flush on the standby
    replay_lag    -- time from send to replay (visible on the standby)
FROM pg_stat_replication;

-- MONITORING: Compute replication delay in bytes
SELECT
    application_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes
FROM pg_stat_replication;
```

Semi-Synchronous Replication
A hybrid approach that provides durability guarantees without blocking on the slowest replica:
MySQL's semi-synchronous replication implements this: wait for at least one replica to acknowledge, with configurable timeout before fallback.
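For illustration, enabling MySQL semi-synchronous replication looks roughly like this (these are the classic rpl_semi_sync_master/slave plugin names; MySQL 8.0.26+ also ships source/replica-named equivalents):

```sql
-- On the source: load the plugin and require at least one replica acknowledgement
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms; fall back to async after 1 second

-- On each participating replica
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;

-- Verify whether semi-synchronous mode is currently active on the source
SHOW STATUS LIKE 'Rpl_semi_sync_master_status';
```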
The Cross-Region Latency Problem
For cross-region replication, synchronous waiting imposes physics-limited penalties: round trips between the US East Coast and Western Europe run on the order of 80 ms, and between the US and Asia-Pacific on the order of 170 ms or more. No protocol optimization can push below the speed-of-light floor.
Every synchronous cross-region write incurs this round trip: for a database handling 1,000 writes per second, each writer is blocked for 80-170 ms per commit. This dramatically reduces write throughput and inflates application response times.
Practical Guidance:

- Use synchronous replication within a region (or to a nearby standby) for high availability without a meaningful latency penalty
- Replicate asynchronously across regions, and rely on monitoring to keep lag within your RPO
- Reserve cross-region synchronous commits for low-volume, genuinely critical writes
- Consider semi-synchronous or quorum (ANY n) configurations to avoid blocking on the slowest replica
Even synchronous replication has a lag: the transaction is confirmed when the replica receives and acknowledges, but the replica may not have applied it yet (depending on configuration). For reads to see the write, you may need 'remote_apply' level, which adds even more latency.
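PostgreSQL also lets you choose the durability level per transaction, which makes this tradeoff practical to tune: keep the server default modest and opt only critical writes into remote_apply. A minimal sketch (table names hypothetical; assumes synchronous_standby_names is configured as shown earlier):

```sql
-- Critical write: wait until the synchronous standby has applied the change,
-- so an immediate read against that replica will see it
BEGIN;
SET LOCAL synchronous_commit = remote_apply;
UPDATE accounts SET balance = balance - 100 WHERE id = 7;
COMMIT;

-- Low-value write: don't even wait for the local WAL flush
SET synchronous_commit = off;
INSERT INTO page_view_log (user_id, url, viewed_at) VALUES (42, '/home', now());
```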
How replicas connect to each other—the replication topology—significantly impacts performance, failure resilience, and operational complexity.
Single-Source (Primary-Replica)
The simplest topology: one primary, multiple replicas. All writes go to the primary; replicas receive changes from the primary.
Cascading Replication
Replicas can replicate from other replicas rather than directly from the primary. This reduces load on the primary and can optimize cross-region traffic.
```
Primary (US-East)
├── Regional Replica (US-West)
│   ├── Local Replica (US-West AZ-1)
│   └── Local Replica (US-West AZ-2)
└── Regional Replica (EU-West)
    └── Local Replica (EU-West AZ-1)
```
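In PostgreSQL, a cascading standby is configured like any other standby; the only difference is that its primary_conninfo points at an upstream replica instead of the primary. A minimal sketch for the topology above (hostnames and slot names are hypothetical):

```
# postgresql.conf on the local replica in US-West AZ-1
# (plus an empty standby.signal file in its data directory)
primary_conninfo = 'host=us-west-regional-replica.internal port=5432 user=replicator application_name=us_west_az1'
primary_slot_name = 'us_west_az1'
hot_standby = on

# On the upstream regional replica (US-West): allow downstream standbys to connect
# and create the slot they will use:
#   max_wal_senders = 10
#   SELECT pg_create_physical_replication_slot('us_west_az1');
```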
Multi-Master (Circular/Mesh)
For active-active systems, multiple primaries accept writes and replicate to each other. Two variations:
Circular Replication
Each node replicates to the next in a ring (A → B → C → A). Simple to configure, but a change needs multiple hops to reach every node, and a single broken link stalls propagation for everything downstream.
Mesh Replication
Every node replicates directly to every other node. Propagation takes one hop and no single link failure halts the ring, but the number of replication links grows quadratically and every node must participate in conflict detection and resolution.
Hub-and-Spoke for Global Systems
A common production pattern combines topologies: a small set of regional hub databases replicate to one another (multi-master), while each hub fans out to ordinary read replicas within its own region.
This allows:

- Writes to be accepted in or near every region through the local hub
- A small conflict-resolution surface, since only the hubs replicate to each other
- Inexpensive read scaling inside each region without extra cross-region traffic
| Topology | Use Case | Pros | Cons |
|---|---|---|---|
| Single-source (star) | Active-passive DR | Simple, clear primary | Primary is a write bottleneck |
| Cascading | Large-scale DR | Reduced primary load | Added latency, cascade failure risk |
| Multi-master | Active-active global | Any-region writes | Conflict resolution required |
| Hub-spoke | Hybrid active-active | Balanced complexity | Region affinity required |
While database-native replication works between identical database systems, cross-region architectures often require more flexibility. Change Data Capture (CDC) extracts changes from databases and makes them available as streams that can be processed, transformed, and applied to various targets.
What CDC Provides Beyond Native Replication:

- Heterogeneous targets: changes can be applied to a different database engine, a search index, a cache, or a data warehouse
- Filtering and transformation of the change stream before it is applied
- Fan-out: one captured stream can feed many independent consumers
- A durable, replayable buffer (typically Kafka) that decouples the source database from its consumers
Popular CDC Systems:
Debezium: Open-source CDC platform built on Kafka Connect
AWS Database Migration Service (DMS): Managed CDC for AWS
Striim, Attunity, GoldenGate: Enterprise CDC platforms
{ "name": "cross-region-cdc-connector", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "primary-db.us-east.internal", "database.port": "5432", "database.user": "cdc_user", "database.password": "${secrets.CDC_PASSWORD}", "database.dbname": "production", "// Logical decoding slot": "Maintains position for exactly-once delivery", "slot.name": "debezium_cross_region", "publication.name": "cross_region_publication", "plugin.name": "pgoutput", "// Table selection": "Only replicate user-facing tables", "table.include.list": "public.users,public.orders,public.products", "// Column filtering": "Exclude sensitive columns from replication", "column.exclude.list": "public.users.password_hash,public.users.ssn", "// Kafka topic routing": "Region prefix for multi-region consumers", "topic.prefix": "us-east", "// Transformation": "Add metadata for cross-region sync", "transforms": "addRegion,unwrap", "transforms.addRegion.type": "org.apache.kafka.connect.transforms.InsertField$Value", "transforms.addRegion.static.field": "source_region", "transforms.addRegion.static.value": "us-east", "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState", "transforms.unwrap.drop.tombstones": "false", "transforms.unwrap.add.fields": "op,ts_ms", "// Snapshot configuration": "Initial sync strategy", "snapshot.mode": "initial", "snapshot.locking.mode": "none", "// Error handling": "DLQ for failed records", "errors.tolerance": "all", "errors.deadletterqueue.topic.name": "cdc-errors-us-east", "// Heartbeat": "Ensure slot advances even with no changes", "heartbeat.interval.ms": "30000", "heartbeat.action.query": "UPDATE heartbeat SET ts = now()" }}CDC Architecture for Cross-Region Sync
```
Region A                    Cross-Region                     Region B

Database → Debezium → Global Kafka Cluster → Kafka Consumer → Database
                      (ordered, durable       (applies changes with
                       event stream)           conflict resolution)
```
Key Components:

- A Debezium (or similar) connector in the source region that reads the database log and publishes change events
- A durable, ordered event stream (the Kafka cluster) that carries events across regions
- An apply service (Kafka consumer) in each target region that writes changes to the local database, handling ordering, retries, and conflict resolution
Cross-Region Kafka Considerations:
For CDC to work across regions, the Kafka cluster must either span regions or be replicated across them. The usual options are a single stretched cluster, which is operationally simple but whose replica placement and acknowledgement settings must account for inter-region latency, or independent per-region clusters linked by a replicator such as MirrorMaker 2, which keeps producers and consumers local at the cost of an extra replication hop.
Exactly-Once Semantics
CDC systems can approach exactly-once delivery by combining:

- Source position tracking (for example, a PostgreSQL replication slot) so no change is skipped or read twice
- Idempotent or transactional producers when publishing events to Kafka
- Idempotent apply on the target, typically upserts keyed on the primary key, so redelivered events converge to the same state
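The last point is usually decisive: if the target applies events idempotently, occasional redelivery is harmless. A sketch of a last-writer-wins upsert in PostgreSQL (table and column names hypothetical; updated_at is assumed to be carried in the change event):

```sql
-- Applying a change event: insert if new, otherwise update only if the event is newer
INSERT INTO users (id, email, updated_at)
VALUES (42, 'new@example.com', '2024-05-01T12:00:00Z')
ON CONFLICT (id) DO UPDATE
SET email      = EXCLUDED.email,
    updated_at = EXCLUDED.updated_at
WHERE users.updated_at < EXCLUDED.updated_at;  -- ignore stale or redelivered events
```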
Once you have a change stream, it's valuable beyond replication: feed it to Elasticsearch for search, to analytics systems for real-time dashboards, to cache invalidation services. CDC becomes your system's event backbone.
Cross-region replication performance directly impacts RPO and user experience. Several factors affect how quickly changes propagate:
Factors Affecting Replication Lag
The main contributors are write volume on the primary, network bandwidth and round-trip time between regions, apply speed on the replica (often single-threaded), and large transactions that must be transmitted and replayed as a unit.
Optimization Strategies
Network Optimization
Compress the replication stream, use private interconnects or dedicated links instead of the public internet, and place replicas to minimize round-trip time.
Parallel Apply
Modern databases can apply replicated changes in parallel when the changes don't conflict. MySQL's multi-threaded replication is the clearest example:

```ini
# MySQL replica: apply transactions in parallel
slave_parallel_workers = 8           # replica_parallel_workers in MySQL 8.0.26+
slave_parallel_type = LOGICAL_CLOCK  # parallelize transactions that committed concurrently
```

PostgreSQL's physical streaming replication replays WAL in a single process (settings such as max_parallel_workers govern parallel queries, not recovery), so a physical replica's apply speed is bounded by one CPU core.
Batch and Buffer
Group many small changes into larger batches before sending them across the region link. Batching amortizes per-message and per-round-trip overhead at the cost of slightly higher latency for individual changes; most replication streams and CDC connectors expose batch-size and flush-interval settings for this.
"""Replication Performance Monitor Comprehensive monitoring for cross-region database replication,tracking lag, throughput, and health metrics."""from dataclasses import dataclassfrom datetime import datetime, timedeltafrom typing import Optional, Dict, Listimport psycopg2from prometheus_client import Gauge, Histogram, Counter # Prometheus metricsREPLICATION_LAG = Gauge( 'db_replication_lag_seconds', 'Replication lag in seconds', ['source_region', 'target_region', 'replica_name']) REPLICATION_LAG_BYTES = Gauge( 'db_replication_lag_bytes', 'Replication lag in bytes', ['source_region', 'target_region', 'replica_name']) REPLICATION_THROUGHPUT = Gauge( 'db_replication_throughput_bytes_per_sec', 'Replication throughput', ['source_region', 'target_region']) REPLICATION_LATENCY = Histogram( 'db_replication_latency_seconds', 'End-to-end replication latency', ['source_region', 'target_region'], buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]) @dataclassclass ReplicationMetrics: """Metrics for a single replication stream.""" source_region: str target_region: str replica_name: str lag_seconds: float lag_bytes: int throughput_bytes_per_sec: float state: str # 'streaming', 'catchup', 'stopped' last_received_time: datetime last_applied_time: datetime class ReplicationMonitor: """ Monitor replication health across all cross-region streams. """ # Alert thresholds LAG_WARNING_SECONDS = 10 LAG_CRITICAL_SECONDS = 60 THROUGHPUT_WARNING_BYTES = 1024 * 1024 # 1 MB/s minimum expected def __init__(self, connections: Dict[str, str]): """ Initialize with database connections per region. connections: {'us-east': 'connection_string', 'eu-west': '...'} """ self.connections = connections self.previous_lsn: Dict[str, str] = {} self.previous_time: Dict[str, datetime] = {} def collect_all_metrics(self) -> List[ReplicationMetrics]: """Collect replication metrics from all configured regions.""" all_metrics = [] for source_region, conn_string in self.connections.items(): try: metrics = self._collect_from_primary(source_region, conn_string) all_metrics.extend(metrics) self._update_prometheus(metrics) except Exception as e: print(f"Error collecting from {source_region}: {e}") return all_metrics def _collect_from_primary( self, source_region: str, conn_string: str ) -> List[ReplicationMetrics]: """Collect metrics from a primary database about its replicas.""" metrics = [] with psycopg2.connect(conn_string) as conn: with conn.cursor() as cur: # Get replication status for all standbys cur.execute(""" SELECT application_name, client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, EXTRACT(EPOCH FROM write_lag) AS write_lag_sec, EXTRACT(EPOCH FROM flush_lag) AS flush_lag_sec, EXTRACT(EPOCH FROM replay_lag) AS replay_lag_sec, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication """) for row in cur.fetchall(): replica_name = row[0] target_region = self._infer_region(replica_name) # Calculate throughput key = f"{source_region}:{replica_name}" throughput = self._calculate_throughput( key, row[3], row[10] # sent_lsn, lag_bytes ) metrics.append(ReplicationMetrics( source_region=source_region, target_region=target_region, replica_name=replica_name, lag_seconds=row[9] or 0, # replay_lag_sec lag_bytes=row[10] or 0, throughput_bytes_per_sec=throughput, state=row[2], last_received_time=datetime.now(), # Approximate last_applied_time=datetime.now() - timedelta( seconds=row[9] or 0 ) )) return metrics def _calculate_throughput( self, key: str, current_lsn: str, current_lag: int ) -> float: """Calculate 
replication throughput based on LSN progression.""" now = datetime.now() if key not in self.previous_lsn: self.previous_lsn[key] = current_lsn self.previous_time[key] = now return 0.0 # Calculate bytes replicated since last check # This is simplified; real implementation would parse LSNs time_delta = (now - self.previous_time[key]).total_seconds() if time_delta < 1: return 0.0 # Update for next iteration self.previous_lsn[key] = current_lsn self.previous_time[key] = now # Return approximate throughput # In reality, you'd calculate the LSN difference return abs(current_lag) / time_delta if time_delta > 0 else 0 def _infer_region(self, replica_name: str) -> str: """Infer target region from replica application name.""" region_mapping = { 'us_west': 'us-west', 'us_east': 'us-east', 'eu_west': 'eu-west', 'ap_northeast': 'ap-northeast' } for key, value in region_mapping.items(): if key in replica_name.lower(): return value return 'unknown' def _update_prometheus(self, metrics: List[ReplicationMetrics]) -> None: """Update Prometheus metrics.""" for m in metrics: REPLICATION_LAG.labels( source_region=m.source_region, target_region=m.target_region, replica_name=m.replica_name ).set(m.lag_seconds) REPLICATION_LAG_BYTES.labels( source_region=m.source_region, target_region=m.target_region, replica_name=m.replica_name ).set(m.lag_bytes) REPLICATION_THROUGHPUT.labels( source_region=m.source_region, target_region=m.target_region ).set(m.throughput_bytes_per_sec) def check_health(self, metrics: List[ReplicationMetrics]) -> Dict[str, any]: """Evaluate replication health and generate alerts.""" issues = [] for m in metrics: if m.state != 'streaming': issues.append({ 'severity': 'critical', 'message': f'Replica {m.replica_name} is not streaming (state: {m.state})' }) elif m.lag_seconds > self.LAG_CRITICAL_SECONDS: issues.append({ 'severity': 'critical', 'message': f'Replica {m.replica_name} lag ({m.lag_seconds:.1f}s) exceeds critical threshold' }) elif m.lag_seconds > self.LAG_WARNING_SECONDS: issues.append({ 'severity': 'warning', 'message': f'Replica {m.replica_name} lag ({m.lag_seconds:.1f}s) exceeds warning threshold' }) return { 'healthy': len([i for i in issues if i['severity'] == 'critical']) == 0, 'issues': issues, 'checked_at': datetime.now().isoformat() }Measuring End-to-End Latency
Replication lag metrics from databases measure queue depth, not what users experience. To measure true end-to-end latency, write a heartbeat row containing the current timestamp on the primary at a fixed interval, read it back on each replica, and compare it with the replica's clock; the difference is the real propagation delay, including capture, transmission, and apply.
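A minimal SQL sketch of this approach, consistent with the heartbeat query used in the configurations above (clock synchronization via NTP is assumed):

```sql
-- One-time setup; the table is replicated like any other
CREATE TABLE heartbeat (id int PRIMARY KEY, ts timestamptz NOT NULL);
INSERT INTO heartbeat (id, ts) VALUES (1, now());

-- On the primary, every few seconds (cron job or the CDC heartbeat.action.query)
UPDATE heartbeat SET ts = now() WHERE id = 1;

-- On each replica: end-to-end propagation delay as users would experience it
SELECT now() - ts AS end_to_end_lag FROM heartbeat WHERE id = 1;
```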
Handling Replication Storms
Large batch operations (data migrations, bulk updates, mass deletes) can overwhelm replication: a single statement that touches millions of rows generates a flood of WAL that must cross the region link and be replayed on every replica, pushing lag far beyond your RPO. Break such work into small batches, pause between batches, schedule it off-peak, and watch replication lag while it runs.
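A sketch of a batched backfill for a hypothetical orders table; each run generates a bounded amount of WAL, and you pause between runs whenever lag rises:

```sql
-- Repeat until it reports "UPDATE 0"; sleep between runs and check
-- pg_stat_replication lag before continuing
UPDATE orders
SET status = 'archived'
WHERE id IN (
    SELECT id
    FROM orders
    WHERE status = 'completed'
      AND created_at < now() - interval '1 year'
    LIMIT 10000
);
```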
When replicas fall behind during peak load, they don't automatically catch up when load decreases—apply continues at normal speed while the backlog clears. A 30-second lag spike during peak hour may take 10 minutes to fully recover. Monitor not just current lag, but lag trend.
Running cross-region replication in production requires careful operational practices beyond initial setup.
Schema Changes and Replication
Schema changes are particularly dangerous with replication: physical replication replays DDL automatically, so a long, locking ALTER stalls the primary and every replica behind it, while logical replication and most CDC pipelines do not replicate DDL at all, meaning the schema must be changed on both sides in a compatible order. Prefer additive, backward-compatible changes, avoid full-table rewrites during peak traffic, and run heavy backfills as throttled batch jobs.
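For illustration, using the users table from the CDC example (the new column is hypothetical): additive changes replicate cheaply, while full-table rewrites do not:

```sql
-- Safe: additive, nullable column; a metadata-only change with negligible WAL
ALTER TABLE users ADD COLUMN preferred_locale text;

-- Risky: a full-table rewrite (e.g. changing a column's type) regenerates every
-- row as WAL, which must cross the region link and be replayed by every replica
-- ALTER TABLE users ALTER COLUMN id TYPE bigint;

-- Backfill the new column in batches (as in the earlier backfill example),
-- then tighten constraints once the backfill is complete
ALTER TABLE users ALTER COLUMN preferred_locale SET NOT NULL;
```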
Network Failures and Recovery
When cross-region network connectivity fails:

- The primary keeps accepting writes while WAL (or binlog) accumulates for the disconnected replica; watch disk usage, since a backlog pinned by a replication slot can fill the primary's storage
- Replicas serve increasingly stale reads, and your effective RPO grows for the duration of the outage
- On reconnection, expect a catch-up surge that can saturate the link and the replica's apply capacity
- If the outage outlasts log retention (WAL recycled, slot dropped, binlog purged), incremental catch-up is no longer possible and a full resync is required
Full Resync Procedures
When replication breaks beyond recovery:

1. Stop the broken replication stream and record where it failed for later analysis
2. Re-seed the replica from a fresh base backup or storage snapshot of the source
3. Re-establish replication from the backup's log position and let the replica catch up
4. Verify consistency (row counts, checksums, heartbeat convergence) before returning the replica to service

Budget for the transfer time and egress cost of copying a full dataset across regions; for large databases a throttled resync can take many hours, during which that region runs without an up-to-date replica.
Security Considerations
Cross-region replication transmits your data across networks:

- Encrypt replication traffic in transit (TLS for streaming replication and for Kafka)
- Prefer private interconnects or VPN tunnels over the public internet
- Give replication its own credentials with the minimum required privileges, and rotate them like any other secret
- Exclude or mask sensitive columns the remote region doesn't need (as with column.exclude.list in the CDC configuration above)
- Confirm that replicating data into another jurisdiction is compatible with your data-residency and compliance obligations
Cost Management
Cross-region data transfer incurs cloud charges:

- Inter-region egress is billed per gigabyte, so estimate monthly replication volume from write throughput and average change size
- Compress the stream and filter out tables, columns, and indexes the remote region doesn't need
- Prefer topologies (such as cascading) that send each change across an expensive region pair only once
- Track replication transfer as its own cost line so growth doesn't surprise you
Replication can fail silently: the connection looks healthy, but changes aren't applying. Always have synthetic writes (heartbeats) that you monitor on replicas. If heartbeat lag exceeds threshold, alert immediately—you may have invisible data divergence.
We've explored the mechanisms that synchronize data across geographic regions. The key principles:

- Replication is driven by the write-ahead log: physical replication ships it verbatim, while logical replication and CDC decode it into portable change events
- Choose synchronous versus asynchronous replication from your RPO and latency budget; most systems are synchronous within a region and asynchronous across regions
- Topology (single-source, cascading, multi-master, hub-and-spoke) determines primary load, failure blast radius, and whether you must resolve conflicts
- CDC turns the change stream into reusable infrastructure for heterogeneous targets, filtering, and fan-out
- Measure lag end-to-end with heartbeats, not just queue depth, and watch lag trends as well as current values
- Operational concerns (schema changes, network partitions, resyncs, security, transfer cost) deserve as much design attention as the steady state
What's Next
With data replication ensuring consistency across regions, the final piece is directing user traffic to the appropriate region. The next page explores traffic routing—DNS, global load balancers, and the strategies that ensure users reach the right region with minimal latency.
You now understand the mechanisms, configurations, and operational practices for cross-region data replication. This knowledge enables you to design replication strategies that meet your system's RPO requirements while maintaining acceptable performance.