Data replication is the beating heart of geo-distributed systems. Whether you're implementing active-passive for disaster recovery or active-active for global low latency, the ability to keep data synchronized across regions determines your system's consistency, availability, and durability characteristics.
This is also where distributed systems become genuinely difficult. The CAP theorem reminds us that network partitions are inevitable, and we must choose between consistency and availability. The PACELC theorem adds that even without partitions, we trade off between latency and consistency. Data replication is where these theoretical constraints become operational realities.
In this page, we'll explore the full spectrum of replication strategies, understanding not just their mechanics but when to apply each approach.
By the end of this page, you'll understand synchronous vs asynchronous replication trade-offs, how replication lag affects user experience and application design, consistency models and their implications, database-specific replication mechanisms, and strategies for handling replication failures and recovery.
The fundamental choice in data replication is whether writes must be confirmed on remote replicas before being acknowledged to clients (synchronous) or if writes can be acknowledged immediately and replicated in the background (asynchronous).
Mechanism: A write is not considered complete until it has been durably stored on all (or a quorum of) replicas, including those in remote regions.
Client → Primary → [Write to local storage]
                 → [Replicate to Region B, wait for ack]
                 → [Replicate to Region C, wait for ack]
                 ← [All acks received, acknowledge to client]
Characteristics:
Mechanism: A write is acknowledged after local storage, with replication occurring in the background.
Client → Primary → [Write to local storage]
                 ← [Immediately acknowledge to client]
                 → [Background: Replicate to Region B]
                 → [Background: Replicate to Region C]
Characteristics:
| Characteristic | Synchronous | Asynchronous |
|---|---|---|
| Write Latency | High (includes network RTT) | Low (local only) |
| Read Consistency | Strong (always current) | Eventual (may be stale) |
| RPO (Data Loss) | Zero | Seconds to minutes |
| Availability | Lower (writes blocked if replicas unreachable) | Higher (independent of replicas) |
| Throughput | Limited by slowest replica | Limited by local resources |
| Failure Handling | Complex (blocked writes) | Simple (queue and retry) |
| Use Cases | Financial, critical data | Most web applications |
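To make the two acknowledgment paths concrete, here is a minimal Python sketch. It is illustrative only: the replicas are plain dicts standing in for remote regions, and the cross-region round trip is simulated with a fixed 100 ms sleep.

```python
import queue
import threading
import time

class Primary:
    """Toy primary that replicates either synchronously or asynchronously."""

    def __init__(self, replicas, synchronous=True):
        self.storage = {}
        self.replicas = replicas                 # dicts standing in for remote regions
        self.synchronous = synchronous
        self._backlog = queue.Queue()            # pending changes for async replication
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, key, value):
        self.storage[key] = value                # 1. durable local write
        if self.synchronous:
            for replica in self.replicas:        # 2a. wait for every remote ack,
                self._ship(replica, key, value)  #     paying one simulated RTT each
            return "ack"                         # client waited for all replicas
        self._backlog.put((key, value))          # 2b. hand off to background replication
        return "ack"                             # client only paid local latency

    def _ship(self, replica, key, value):
        time.sleep(0.1)                          # stand-in for ~100 ms cross-region RTT
        replica[key] = value

    def _drain(self):
        while True:                              # background replication loop
            key, value = self._backlog.get()
            for replica in self.replicas:
                self._ship(replica, key, value)
```

With synchronous=True a write pays roughly one simulated round trip per replica before returning; with synchronous=False it returns immediately and the replicas converge a moment later, which is exactly the RPO-versus-latency trade summarized in the table above.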
A hybrid approach that balances latency and durability:
Mechanism: Write acknowledged after confirmation from primary plus at least one remote replica (not all).
Client → Primary → [Write to local storage]
                 → [Replicate to Region B, wait for ack] ← First ack
                 ← [Acknowledge to client]
                 → [Background: Replicate to Region C]
Characteristics:
Mechanism: Write acknowledged after W replicas confirm, reads require R replicas to respond. Configured such that W + R > N (total replicas).
Example (N=5, W=3, R=3):
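A minimal sketch of why these numbers work: with N=5, W=3, R=3, any write set and read set must share at least one replica (3 + 3 > 5), so a read that returns the newest timestamp among its answers always observes the latest acknowledged write. The in-memory replicas and timestamp-based versioning below are illustrative assumptions, not a real quorum protocol.

```python
import random
import time

N, W, R = 5, 3, 3   # W + R = 6 > N = 5, so write and read sets must overlap

# Each "replica" is a dict of key -> (timestamp, value); purely illustrative.
replicas = [{} for _ in range(N)]

def quorum_write(key, value):
    stamped = (time.time(), value)
    for i in random.sample(range(N), W):   # any W replicas acknowledge the write
        replicas[i][key] = stamped

def quorum_read(key):
    chosen = random.sample(range(N), R)    # any R replicas answer the read
    answers = [replicas[i][key] for i in chosen if key in replicas[i]]
    # Because W + R > N, at least one answering replica holds the latest write,
    # so returning the value with the newest timestamp observes that write.
    return max(answers)[1] if answers else None

quorum_write("user:42", "alice@example.com")
print(quorum_read("user:42"))   # sees the write even though only 3 of 5 replicas have it
```

Raising W makes writes slower but lets R shrink; raising R does the reverse. That tension is the essence of the tuning trade-offs listed next.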
Tuning Trade-offs:
For cross-region replication, asynchronous is the default choice for most applications. The latency penalty of waiting for cross-continental round trips (100-300ms per write) is prohibitive for interactive applications. Use synchronous replication only when zero data loss is absolutely required and you can accept the latency and availability implications.
Asynchronous replication introduces replication lag: the delay between a write being committed on the primary and becoming visible on replicas. Understanding lag is crucial for designing systems that behave correctly despite it.
Network Latency: Minimum lag is bound by network round-trip time. Cross-continental links have 100-300ms latency.
Replication Transport: Time to serialize, transmit, and deserialize changes. Depends on change volume and network bandwidth.
Replica Apply Time: Time for replica to apply changes locally. Can spike during complex operations (large transactions, schema changes).
Queue Depth: If the replica falls behind (changes arrive faster than they can be applied), a queue builds up and lag keeps growing until the queue drains.
Same-region replicas:
Cross-region replicas:
| Lag Duration | User Experience Impact | Application Design Implications |
|---|---|---|
| <100ms | Imperceptible | Most operations work normally |
| 100ms-1s | Refreshing sees old data briefly | Read-your-writes consistency needed after writes |
| 1-5s | Noticeable inconsistency | UI must handle stale reads explicitly |
| 5-30s | Significant confusion possible | Operations may need to route to primary |
| >30s | Features may appear broken | Degraded mode, explicit warnings needed |
The most common consistency pattern needed with replication lag is read-your-writes: after a user performs a write, their subsequent reads should see that write.
Implementation Approaches:
1. Sticky Sessions to Primary:
2. Version Tracking (see the sketch after this list):
3. MVCC and Timestamps:
4. Accept Inconsistency:
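A minimal sketch of the version-tracking approach (option 2 above). The primary, replica, and session objects are hypothetical stand-ins; all that matters is that the primary reports a monotonically increasing commit position (an LSN, GTID, or version counter) and the replica reports how far it has applied.

```python
# Hypothetical interfaces: primary.commit() returns the commit position of the
# write, replica.applied_position() reports how far the replica has caught up,
# and session is any per-user store (cookie, token, or server-side session).

def write_and_remember(primary, session, key, value):
    position = primary.commit(key, value)        # position of this write in the log
    session["min_read_position"] = position      # remember it with the user's session
    return position

def read_your_writes(primary, replica, session, key):
    needed = session.get("min_read_position", 0)
    if replica.applied_position() >= needed:     # replica already has the user's write
        return replica.get(key)
    return primary.get(key)                      # otherwise fall back to the primary
```

Some systems offer an equivalent primitive that blocks a replica read until the replica has applied at least a given position, which achieves the same effect without falling back to the primary.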
Replication lag is a critical operational metric:
Key Metrics:
Alerting Thresholds (typical):
Lag spikes often indicate capacity issues, network problems, or expensive operations (large transactions, schema changes).
If replica apply rate is slower than primary write rate, lag grows indefinitely. This happens during: high write loads, expensive transactions, replica hardware issues, network bottlenecks. Monitor lag trends, not just current values. Catching a growing trend early prevents cascading problems.
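As one concrete example of alerting on lag, here is a sketch against PostgreSQL streaming replication, which reports per-replica lag in the pg_stat_replication view on the primary (PostgreSQL 10 and later). The connection parameters and thresholds are placeholders.

```python
import psycopg2  # assumes the psycopg2 PostgreSQL driver is installed

WARN_SECONDS = 10       # example thresholds; align them with your lag tolerance
CRITICAL_SECONDS = 60

# Placeholder connection details for the primary.
conn = psycopg2.connect(host="primary.internal", dbname="app", user="monitor")

with conn.cursor() as cur:
    # replay_lag is an interval column; EXTRACT(EPOCH ...) converts it to seconds.
    cur.execute("""
        SELECT application_name,
               COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_seconds
        FROM pg_stat_replication
    """)
    for name, lag_seconds in cur.fetchall():
        if lag_seconds >= CRITICAL_SECONDS:
            print(f"CRITICAL: replica {name} is {lag_seconds:.0f}s behind")
        elif lag_seconds >= WARN_SECONDS:
            print(f"WARNING: replica {name} is {lag_seconds:.0f}s behind")
```

Feeding the same numbers into a time-series system lets you alert on the trend rather than just the instantaneous value, which is what catches a replica that is slowly falling behind.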
Consistency models define what guarantees the system provides about the relationship between writes and subsequent reads. Choosing the right consistency model is a fundamental design decision.
Definition: After a write completes, all subsequent reads (from any client, through any replica) return that write or a more recent one.
Implications:
When to Use:
Definition: If no new writes occur, all replicas will eventually return the same value. No guarantee about when this happens.
Implications:
When to Use:
Definition: If operation B causally depends on operation A (e.g., B reads result of A, or B follows A in same session), then any process that sees B will also see A.
Implications:
When to Use:
| Model | Latency Impact | Availability Impact | Consistency Guarantee | Implementation Complexity |
|---|---|---|---|---|
| Strong | High (sync replication) | Lower (partition sensitive) | All reads see latest write | Medium |
| Bounded Staleness | Medium | Medium | Reads within time/version bound | Medium-High |
| Session (Read-Your-Writes) | Low | High | Session sees own writes | Medium |
| Causal | Low | High | Causally related ops ordered | High |
| Eventual | Lowest | Highest | Eventually converges | Low |
Definition: Reads are guaranteed to be no more than X seconds (or N versions) behind writes.
Implications:
Configuring Staleness Bounds:
Definition: A client will always see their own writes. Other clients may see stale data.
Implications:
Implementation Patterns:
Many databases offer tunable consistency, allowing different operations to use different levels:
Cassandra: ONE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM
DynamoDB: Eventually consistent reads, strongly consistent reads
CosmosDB: Strong, Bounded Staleness, Session, Consistent Prefix, Eventual
CockroachDB: Serializable (default), follower reads (stale allowed)
This allows applications to use strong consistency where needed and eventual consistency where acceptable.
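For example, the Python driver for Cassandra sets the consistency level per statement, so one request can demand LOCAL_QUORUM while another settles for ONE. The contact point, keyspace, and table names below are hypothetical.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])      # hypothetical contact point
session = cluster.connect("shop")    # hypothetical keyspace

# Critical read: require a quorum of replicas in the local datacenter.
balance_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
print(session.execute(balance_read, ("acct-42",)).one())

# Non-critical read: any single replica will do; stale data is acceptable.
feed_read = SimpleStatement(
    "SELECT * FROM activity_feed WHERE user_id = %s LIMIT 20",
    consistency_level=ConsistencyLevel.ONE,
)
for row in session.execute(feed_read, ("acct-42",)):
    print(row)
```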
Stronger consistency is not always better—it comes with latency and availability costs. Analyze each operation: Does this read need to see the absolute latest data, or is eventual consistency acceptable? Applying strong consistency everywhere is expensive; applying it only where needed is efficient.
How replicas are connected affects replication latency, bandwidth usage, and failure handling. Understanding topology options helps optimize for your specific needs.
Structure:
Primary → Replica 1
        → Replica 2
        → Replica 3
Characteristics:
Use Cases:
Structure:
Primary A ←→ Primary B ←→ Primary C
Characteristics:
Use Cases:
Structure:
Primary → Replica 1 → Replica 2 → Replica 3
Characteristics:
Use Cases:
| Topology | Write Capacity | Read Scaling | Lag Distribution | Failure Handling |
|---|---|---|---|---|
| Primary-Replica | Single primary | Excellent | Uniform | Simple (failover to replica) |
| Multi-Primary | All nodes | Excellent | Uniform | Complex (conflict resolution) |
| Cascading | Single primary | Excellent | Increases along chain | Medium (chain repair) |
| Ring | All nodes | Excellent | Varies | Complex (ring repair) |
| Mesh | All nodes | Excellent | Low | Most complex (full mesh) |
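The "conflict resolution" burden the table attributes to multi-primary, ring, and mesh topologies can be handled as simply (and as lossily) as last-writer-wins. A toy sketch, assuming every write carries a timestamp plus an originating region for deterministic tie-breaking; production systems often reach for vector clocks or CRDTs instead so that concurrent writes are not silently discarded.

```python
# Last-writer-wins resolution for a multi-primary conflict. Each version is a
# dict like {"value": ..., "ts": epoch_seconds, "region": "us-east"}; the region
# only matters when two writes carry the exact same timestamp.

def resolve(local, remote):
    return max(local, remote, key=lambda v: (v["ts"], v["region"]))

us_write = {"value": "shipped",   "ts": 1714000000.120, "region": "us-east"}
eu_write = {"value": "cancelled", "ts": 1714000000.250, "region": "eu-west"}

print(resolve(us_write, eu_write)["value"])   # "cancelled" (the later write wins)
```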
Two Regions:
Region A (Primary) → Region B (Replica)
Simplest geo-distribution. Active-passive with clear failover target.
Three Regions (Chain):
US-East → EU-West → APAC-Tokyo
Reduces transcontinental replication hops. APAC has highest lag.
Three Regions (Star):
EU-West ← US-East → APAC-Tokyo
US-East replicates to both. More bandwidth from US-East, lower lag globally.
Three Regions (Mesh):
US-East ←→ EU-West
US-East ←→ APAC-Tokyo
EU-West ←→ APAC-Tokyo
Full mesh. Lowest lag, highest complexity and bandwidth. Required for true active-active.
Cross-region replication consumes significant bandwidth:
Typical Consumption:
Cost Implications:
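A back-of-the-envelope sketch for sizing this. Every figure below (write rate, change size, number of cross-region links, and the per-GB transfer price) is an assumption to replace with your own numbers.

```python
# Rough estimate of cross-region replication volume and transfer cost.
writes_per_second = 2_000      # sustained write rate on the primary (assumed)
bytes_per_change = 1_500       # average replicated change incl. overhead (assumed)
cross_region_links = 2         # e.g., US-East -> EU-West and US-East -> APAC (assumed)
egress_cost_per_gb = 0.02      # assumed inter-region transfer price, USD per GB

bytes_per_month = writes_per_second * bytes_per_change * cross_region_links * 86_400 * 30
gb_per_month = bytes_per_month / 1e9

print(f"{gb_per_month:,.0f} GB/month, roughly ${gb_per_month * egress_cost_per_gb:,.0f}/month")
```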
Topology choice cascades through your architecture: failover procedures, consistency guarantees, monitoring requirements, cost structure. Choose topology based on your specific requirements for write distribution, lag tolerance, and operational complexity appetite.
Different database systems implement replication differently. Understanding your database's approach is essential for correct configuration and operation.
Streaming Replication:
Synchronous behavior controlled via synchronous_commit and synchronous_standby_names
Logical Replication:
Cross-Region Considerations:
Set wal_keep_size to buffer WAL during network issues
Binary Log Replication:
Group Replication:
Cross-Region Considerations:
| Database | Replication Type | Cross-Region Suitability | Key Configuration |
|---|---|---|---|
| PostgreSQL | Streaming (physical) | Good | synchronous_commit, recovery_target_timeline |
| PostgreSQL | Logical | Good | publication/subscription |
| MySQL | Binary log | Good | gtid_mode, semi_sync settings |
| MySQL | Group Replication | Medium (latency sensitive) | group_replication_consistency |
| MongoDB | Oplog replication | Good | write concern, read preference |
| Cassandra | Peer-to-peer gossip | Excellent | NetworkTopologyStrategy, DC-aware |
| CockroachDB | Raft consensus | Good (built for geo) | zone configurations, follower reads |
| Spanner | TrueTime + Paxos | Excellent (designed for geo) | Regional/multi-regional configuration |
Replica Set:
Cross-Region Considerations:
Write concern majority with cross-region secondaries means cross-region write latency
Peer-to-Peer:
Cross-Region Considerations:
LOCAL_QUORUM for regional consistency with async cross-region replication
EACH_QUORUM for synchronous cross-region writes (high latency)
Distributed Consensus:
Cross-Region Considerations:
Database replication documentation often uses nuanced language about guarantees. 'Durable' doesn't always mean 'replicated'. 'Committed' may not mean 'globally visible'. Read documentation carefully and test failure scenarios to understand actual behavior.
Replication failures are inevitable in geo-distributed systems. Networks partition, replicas fall behind, and divergent data must be reconciled. Designing for these failures is essential.
Network Partition:
Replica Falling Behind:
Data Corruption:
Schema Divergence:
Preserve Replication Logs:
Detect Problems Early:
Automated Recovery:
Test Failure Scenarios:
During primary failure with asynchronous replication, unreplicated data may be lost:
Option 1: Accept Loss
Option 2: Wait for Primary
Option 3: Attempt Recovery
The right choice depends on your RPO requirements and the nature of the failure.
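A toy sketch of the "attempt recovery" option: after fencing the failed primary, diff whatever data survived on it against the promoted primary and surface writes that never replicated, for operators to re-apply or discard. Real tooling works at the replication-log level (WAL positions, GTID sets) rather than row by row; the key-to-(version, value) snapshots here are purely illustrative.

```python
def find_unreplicated(old_primary_snapshot, new_primary_snapshot):
    """Return writes present on the fenced old primary but missing downstream."""
    orphans = {}
    for key, (version, value) in old_primary_snapshot.items():
        current = new_primary_snapshot.get(key)
        if current is None or current[0] < version:   # this write never made it across
            orphans[key] = value
    return orphans

old = {"order:1": (7, "paid"), "order:2": (3, "shipped")}   # fenced old primary
new = {"order:1": (7, "paid"), "order:2": (2, "packed")}    # promoted new primary

print(find_unreplicated(old, new))   # {'order:2': 'shipped'} was lost in the failover
```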
Replication failures happen at the worst times—peak load, during deployments, when key engineers are unavailable. Document procedures thoroughly. Automate what can be safely automated. Ensure multiple team members can handle recovery. Don't let replication recovery be single-person tribal knowledge.
We've explored data replication strategies for geo-distributed systems in depth. Let's consolidate the key insights:
What's next:
With data replication covered, we now turn to the user-facing side of geo-distribution: latency optimization. The next page explores techniques for minimizing user-perceived latency including edge caching, connection optimization, and geographic traffic routing.
You now understand the fundamentals of data replication for geo-distributed systems: synchronous vs asynchronous approaches, managing replication lag, consistency models, replication topologies, database-specific mechanisms, and failure handling. Next, we'll explore techniques for optimizing latency across your geo-distributed architecture.