When a user posts a photo on Instagram, the system must acknowledge that upload in milliseconds—not seconds. When Netflix records that you paused a video at the 47-minute mark, they can't afford to wait for global replication before responding. Yet this data must eventually reach replicas worldwide for availability and disaster recovery.
Asynchronous replication solves this by decoupling the write acknowledgment from replication. The primary immediately confirms the write to the client, then asynchronously transmits changes to replicas in the background. The result: minimal write latency, maximum throughput, and replicas that may temporarily lag behind.
This trade-off—accepting temporary inconsistency for performance—powers the vast majority of internet-scale systems. But it comes with profound implications for data visibility, failover procedures, and application design that every database engineer must understand.
By the end of this page, you will understand asynchronous replication mechanics, the nature and implications of replication lag, consistency models for async systems, data loss risks during failover, and strategies for building robust applications on asynchronously replicated databases.
Asynchronous replication is a replication mode where the primary node confirms transactions to clients immediately after local persistence, without waiting for any replica acknowledgment. Replication to secondary nodes happens in the background, independently of the transaction commit path.
The Asynchronous Replication Protocol:
Client initiates write: The client sends a write operation to the primary.
Primary commits locally: The primary writes to its Write-Ahead Log, flushes to disk, and immediately returns success to the client.
Background replication: A separate replication process continuously streams committed WAL records to connected replicas.
Replica acknowledgment (optional): Replicas may acknowledge receipt, but this does not affect the original transaction.
Replica application: Replicas apply received WAL records to their local data files, eventually reaching consistency with the primary.
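The commit-then-ship flow above can be sketched as a toy in-memory system (the class name, the queue-based WAL, and the 10 ms delay are all invented for illustration; a real database streams WAL bytes, not key-value pairs):

```python
import threading
import queue
import time

class AsyncReplicatedStore:
    """Toy model of async replication: commit returns after the local
    write; a background thread ships records to the replica."""

    def __init__(self):
        self.primary = {}          # primary's data files
        self.replica = {}          # replica's data files
        self.wal = queue.Queue()   # stand-in for the WAL stream
        self._shipper = threading.Thread(target=self._replicate, daemon=True)
        self._shipper.start()

    def write(self, key, value):
        # Step 2: commit locally, then acknowledge immediately.
        self.primary[key] = value
        self.wal.put((key, value))  # Step 3: replication is off the commit path
        return "committed"          # client sees success before the replica does

    def _replicate(self):
        # Step 5: the replica applies WAL records in the background.
        while True:
            key, value = self.wal.get()
            time.sleep(0.01)        # simulated network + apply delay (lag)
            self.replica[key] = value
            self.wal.task_done()

store = AsyncReplicatedStore()
store.write("user:1", "alice")       # acknowledged instantly
stale = store.replica.get("user:1")  # likely None: the replica lags behind
store.wal.join()                     # wait for replication to drain
fresh = store.replica.get("user:1")  # now "alice": eventual consistency
```

Note that the client's acknowledgment never waits on the replica; the gap between `stale` and `fresh` is exactly the replication lag discussed below.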
Key Characteristics:
Asynchronous replication implements 'eventual consistency'—the guarantee that, in the absence of further updates, all replicas will eventually converge to the same state. The 'eventually' can range from milliseconds (same-datacenter) to seconds (cross-region), depending on network conditions and load.
Replication lag is the central concept of asynchronous replication. It measures how far behind a replica is compared to the primary. This lag is not merely a technical metric—it directly affects what data users see and how applications must be designed.
What Causes Replication Lag:
1. Network Round-Trip Time: WAL records must traverse the network to reach replicas. Cross-region networks add 50-200ms per hop.
2. Replica Write Throughput: If the replica's disk I/O is slower than the primary's, or if the replica is processing complex apply operations, it falls behind.
3. Write Bursts: Spikes in write volume on the primary generate WAL faster than replicas can apply it. Lag increases during the burst, then recovers.
4. Long-Running Transactions: Transactions holding locks on the replica (e.g., user queries on a read replica) can block WAL application, causing lag.
5. Resource Contention: CPU, memory, or I/O contention on replicas slows WAL processing.
6. Query Conflicts: In PostgreSQL, queries on replicas can conflict with WAL application. The database must choose: cancel queries or delay replication.
| Cause | Typical Impact | Mitigation Strategy |
|---|---|---|
| Network latency | 50-200ms (cross-region) | Geo-distributed primary placement; dedicated replication networks |
| Replica disk I/O | Variable (seconds to minutes) | Use SSDs; match replica hardware to primary; separate WAL disk |
| Write bursts | Temporary spikes (seconds) | Rate limiting; larger replication buffer; more replica capacity |
| Long queries on replica | Duration of query | hot_standby_feedback (PostgreSQL); query timeouts |
| Replica CPU saturation | Seconds to minutes | Dedicated replica resources; run heavy analytics on separate, dedicated replicas |
| Transaction log volume | Linear with write rate | Archive old WAL; configure replication slots carefully |
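The write-burst row above has a simple dynamic behind it: lag (here measured as backlogged WAL records) grows whenever the primary generates WAL faster than the replica applies it, and drains afterwards. A few lines of arithmetic model this (the rates are invented; real lag also depends on record sizes and I/O):

```python
def simulate_lag(write_rates, apply_rate):
    """Model the replica's backlog second by second: it grows when the
    write rate exceeds the apply rate, and drains when it drops below."""
    backlog = 0
    history = []
    for rate in write_rates:
        backlog = max(0, backlog + rate - apply_rate)
        history.append(backlog)
    return history

# Replica applies 1,000 records/s; a 3-second burst writes 2,500/s.
workload = [500, 500, 2500, 2500, 2500, 500, 500, 500, 500]
print(simulate_lag(workload, apply_rate=1000))
# Backlog climbs by 1,500/s during the burst, then drains at only 500/s.
```

The asymmetry is the important part: a 3-second burst here takes 9 seconds to fully drain, which is why "typical" lag figures understate worst-case behavior.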
```sql
-- PostgreSQL: Monitor replication lag from primary
SELECT
    client_addr,
    application_name,
    state,
    -- Byte-based lag
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)  AS flush_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
    -- Human-readable size
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
    -- Time-based lag (most intuitive)
    replay_lag,
    write_lag,
    flush_lag
FROM pg_stat_replication;
```

```sql
-- MySQL: Monitor replication lag from replica
SHOW SLAVE STATUS\G
-- Key field: Seconds_Behind_Master
-- (MySQL 8.0.22+ prefers SHOW REPLICA STATUS)
```

```javascript
// MongoDB: Check replica set member lag
rs.printSecondaryReplicationInfo();
```

Replication lag creates specific consistency anomalies that applications must handle. These are not bugs—they are the inherent consequences of asynchronous replication that you must design for:
Practical Example: The E-Commerce Cart Problem
Consider an e-commerce application with async replication: a user adds an item to their cart (a write handled by the primary), then immediately opens the cart page, which is served by a read replica. If the replica has not yet replayed the write, the cart appears empty, so the user adds the item again.
This is a classic read-your-writes violation. The application's architecture created a poor user experience despite both writes succeeding.
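A minimal in-memory sketch of this anomaly (the cart key and items are invented for illustration; the two dicts stand in for primary and lagging replica):

```python
# Simulated primary and lagging replica, in sync before the write.
primary = {"cart:42": ["keyboard"]}
replica = {"cart:42": ["keyboard"]}

def add_to_cart(item):
    """Write path: goes to the primary and is acknowledged immediately.
    Replication of this change has NOT reached the replica yet."""
    primary["cart:42"] = primary["cart:42"] + [item]

def view_cart():
    """Read path: routed to a replica for scalability."""
    return replica["cart:42"]

add_to_cart("mouse")   # write acknowledged by the primary
print(view_cart())     # ['keyboard']: the user's own write is invisible
```

The write genuinely succeeded; the read was simply answered by a node that had not yet seen it.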
Replication lag isn't constant. A replica that's 50ms behind right now might be 5 seconds behind during a write spike. Designing for 'typical' lag is insufficient—you must design for worst-case lag during normal operations. Some databases provide mechanisms to read with guaranteed freshness, trading latency for consistency.
| Strategy | Mechanism | Trade-off |
|---|---|---|
| Read from primary | Route writes and subsequent reads to primary | Reduced read scalability; primary overload risk |
| Sticky sessions | Route user to same replica consistently | Load balancing complexity; replica failure disruption |
| Timestamp-based | Include write timestamp; wait for replica to catch up | Added latency; application complexity |
| Optimistic UI | Immediately show expected result; correct if wrong | Complex client logic; potential confusion |
| Version vectors | Track write versions; reject stale reads | Significant implementation complexity |
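A minimal sketch of the timestamp-based strategy from the table, assuming a known worst-case lag bound (the class name, `max_lag` parameter, and routing labels are invented for illustration):

```python
import time

class FreshnessRouter:
    """Timestamp-based read-your-writes routing: after a user writes,
    send that user's reads to the primary until the replica has had
    time to catch up (max_lag is an assumed worst-case lag bound)."""

    def __init__(self, max_lag=2.0):
        self.max_lag = max_lag
        self.last_write = {}  # user_id -> monotonic time of last write

    def record_write(self, user_id):
        self.last_write[user_id] = time.monotonic()

    def choose_target(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.max_lag:
            return "primary"   # replica may not have this user's write yet
        return "replica"       # any prior write is older than worst-case lag

router = FreshnessRouter(max_lag=2.0)
router.record_write("user:7")
print(router.choose_target("user:7"))   # primary: recent write
print(router.choose_target("user:9"))   # replica: no recent write
```

The trade-off noted in the table is visible here: correctness depends on `max_lag` actually bounding real lag, so this pattern pairs naturally with lag monitoring and alerting.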
The most significant risk of asynchronous replication is data loss during unplanned failover. When the primary fails, any writes that were committed locally but not yet replicated to the new primary are permanently lost.
Quantifying the Risk:
The amount of data at risk equals the replication lag at the moment of failure.
Example Calculation: suppose the primary sustains 100 writes per second and the replica is 2 seconds behind when the primary fails. Failing over loses roughly 100 × 2 = 200 committed-but-unreplicated transactions.
For many applications (social media, logging, analytics), losing 200 transactions during a rare failure event is acceptable. For financial applications, losing even one transaction is catastrophic.
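This risk window is simple enough to encode directly (the rates below are illustrative, not measured):

```python
def transactions_at_risk(writes_per_second, lag_seconds):
    """Writes committed on the primary but not yet replicated:
    the data lost if the primary fails at this moment."""
    return int(writes_per_second * lag_seconds)

# 100 writes/s with 2 s of replication lag at the moment of failure:
print(transactions_at_risk(100, 2.0))  # 200 committed-but-lost transactions
```

Plugging in your own peak write rate and worst-case observed lag gives a concrete loss budget to weigh against business requirements.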
In async replication, the client received a commit confirmation for Transaction T1 and believes the data is persisted. After failover, that data is gone: the client's assumption ('acknowledged = durable') is violated. This is not a bug—it's the fundamental trade-off of async replication. Business logic must account for this possibility.
Asynchronous replication is the default mode for most database systems due to its performance advantages. Here's how major databases implement it:
PostgreSQL Streaming Replication (Async Mode):
PostgreSQL replicates asynchronously by default: whenever synchronous_standby_names is left unconfigured, the primary streams WAL continuously to connected replicas without waiting for their acknowledgment.
Key Configuration (primary):
- `max_wal_senders`: Maximum number of concurrent replication connections
- `wal_keep_size`: Minimum WAL to retain for replicas that disconnect
- `archive_mode` / `archive_command`: WAL archiving for gap recovery

Key Configuration (replica):
- `primary_conninfo`: Connection string to the primary
- `hot_standby`: Allow read queries on the replica
- `max_standby_streaming_delay`: How long to delay WAL application in favor of running queries
```sql
-- Primary configuration (postgresql.conf)
max_wal_senders = 10
wal_level = replica
wal_keep_size = 1GB  -- Retain WAL for slow replicas

-- Ensure NO synchronous_standby_names (async mode)
-- SHOW synchronous_standby_names; should return empty

-- Create replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';

-- Monitor async replicas
SELECT
    application_name,
    client_addr,
    state,
    sync_state,  -- 'async' for asynchronous
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag,
    replay_lag AS time_lag
FROM pg_stat_replication;

-- Check archive status (for disaster recovery)
SELECT * FROM pg_stat_archiver;
```

Building robust applications on asynchronously replicated databases requires specific design patterns. These patterns acknowledge replication lag and work with it rather than against it:
```python
import psycopg2

class ConnectionRouter:
    """Routes queries to primary or replica based on freshness requirements."""

    def __init__(self, primary_dsn, replica_dsn):
        self.primary = psycopg2.connect(primary_dsn)
        self.replica = psycopg2.connect(replica_dsn)
        self.last_write_lsn = None

    def execute_write(self, query, params=None):
        """Execute write on primary, track LSN for read-your-writes."""
        cursor = self.primary.cursor()
        cursor.execute(query, params)
        self.primary.commit()

        # Get current WAL position after write
        cursor.execute("SELECT pg_current_wal_lsn()")
        self.last_write_lsn = cursor.fetchone()[0]
        return cursor

    def execute_read(self, query, params=None, require_fresh=False):
        """
        Execute read on replica unless fresh data is required.
        If require_fresh=True or a recent write occurred, use the primary.
        """
        if require_fresh or self.last_write_lsn:
            # Use primary for consistency (note: last_write_lsn is never
            # cleared in this sketch, so all reads after the first write
            # stay on the primary; a real router would expire it)
            cursor = self.primary.cursor()
        else:
            # Safe to use replica
            cursor = self.replica.cursor()
        cursor.execute(query, params)
        return cursor.fetchall()

    def read_with_lsn_guarantee(self, query, params, min_lsn):
        """Read from the replica only if it has replayed past min_lsn."""
        cursor = self.replica.cursor()

        # Check whether the replica has caught up to min_lsn (PostgreSQL 10+)
        cursor.execute(
            "SELECT pg_last_wal_replay_lsn() >= %s",
            (min_lsn,)
        )
        is_caught_up = cursor.fetchone()[0]

        if not is_caught_up:
            # Fall back to primary
            cursor = self.primary.cursor()

        cursor.execute(query, params)
        return cursor.fetchall()
```

Most applications have a small set of critical flows requiring read-your-writes (account balance checks, order confirmations) and a large set of flows tolerant of lag (browsing, search results, dashboards). Identify the critical 20% and provide strong consistency for those. Let the other 80% enjoy the scalability of replicas.
Asynchronous replication's primary advantage is performance. Understanding these benefits quantitatively helps justify the consistency trade-offs:
| Metric | Synchronous | Asynchronous | Async Advantage |
|---|---|---|---|
| Write Latency (same DC) | 3-5ms | 1-2ms | 2-3x lower latency |
| Write Latency (cross-region) | 100-200ms | 1-2ms | 50-100x lower latency |
| Write Throughput | Limited by slowest replica | Limited by primary capacity | No replica bottleneck |
| Availability During Replica Failure | Degraded or blocked | Unaffected | Full write availability |
| Read Scalability | Limited by replica lag visibility | Full replica pool utilization | Maximum read scaling |
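The cross-region rows above follow directly from round-trip cost: a synchronous commit pays the RTT on every write, an async commit does not. A tiny calculation makes the gap concrete (the 2 ms local commit time is an assumed figure):

```python
def max_serial_write_rate(local_commit_ms, replication_rtt_ms, synchronous):
    """Upper bound on sequential writes/s from a single client:
    a synchronous commit adds the cross-region round trip to every
    write; an asynchronous commit pays only the local commit cost."""
    latency_ms = local_commit_ms + (replication_rtt_ms if synchronous else 0)
    return 1000 / latency_ms

# US-East primary, EU-West replica, 100 ms RTT, 2 ms local commit:
print(max_serial_write_rate(2, 100, synchronous=True))   # ~9.8 writes/s
print(max_serial_write_rate(2, 100, synchronous=False))  # 500 writes/s
```

This single-client bound is why the table reports a 50-100x latency advantage: pipelining and concurrency raise absolute throughput in both modes, but the per-write latency gap remains.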
Cross-Region Use Case:
Consider an application with primary in US-East and disaster recovery replica in EU-West (100ms RTT).
With Synchronous Replication: every commit waits at least one 100ms round trip to EU-West, so write latency exceeds 100ms for every user, and writes stall entirely if the transatlantic link degrades.
With Asynchronous Replication: commits complete in 1-2ms against local storage, and changes stream to EU-West in the background. The cost is a window of unreplicated writes, typically well under a second, that would be lost if US-East failed.
For most applications, the performance benefit outweighs the rare data loss risk. The key is understanding and documenting this trade-off.
Asynchronous replicas can be scaled independently to handle read traffic. Unlike synchronous replicas (which add latency per additional node), async replicas are 'free' from the write path perspective. Add as many as needed for read capacity and availability—the only cost is storage and network bandwidth.
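A round-robin pool sketch of this scaling property (the class and node names are invented; in production this routing usually lives in a proxy or connection pooler such as PgBouncer or HAProxy):

```python
import itertools

class ReplicaPool:
    """Read-scaling sketch: writes always target the primary, while
    reads rotate across however many async replicas exist. Adding a
    replica widens the read path without touching the write path."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._cycle = itertools.cycle(replicas)

    def for_write(self):
        return self.primary

    def for_read(self):
        return next(self._cycle)

pool = ReplicaPool("pg-primary", ["replica-1", "replica-2", "replica-3"])
reads = [pool.for_read() for _ in range(4)]
print(reads)             # ['replica-1', 'replica-2', 'replica-3', 'replica-1']
print(pool.for_write())  # 'pg-primary'
```

Because no replica sits on the commit path, growing this pool leaves write latency untouched; the consistency caveats from earlier sections still apply to every read it serves.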
We've explored asynchronous replication comprehensively: its mechanics, the nature of replication lag, consistency challenges, data loss risks, and application design patterns.
What's Next:
We've now covered both synchronous and asynchronous replication in depth. In the next page, we'll explore Replication Strategies—the higher-level patterns for organizing replicas including primary-replica, primary-primary (multi-master), chain replication, and consensus-based replication. We'll examine when to use each strategy and how they combine synchronous and asynchronous mechanisms.
You now understand asynchronous replication's mechanics, trade-offs, and application implications. It prioritizes performance and availability at the cost of potential data loss during failures and consistency anomalies during normal operation. For systems where milliseconds of latency matter more than absolute durability, async replication is the correct choice. Next, we'll explore replication strategies.