When a user posts a photo on Instagram, the system must acknowledge that upload in milliseconds—not seconds. When Netflix records that you paused a video at the 47-minute mark, they can't afford to wait for global replication before responding. Yet this data must eventually reach replicas worldwide for availability and disaster recovery.
Asynchronous replication solves this by decoupling the write acknowledgment from replication. The primary immediately confirms the write to the client, then asynchronously transmits changes to replicas in the background. The result: minimal write latency, maximum throughput, and replicas that may temporarily lag behind.
This trade-off—accepting temporary inconsistency for performance—powers the vast majority of internet-scale systems. But it comes with profound implications for data visibility, failover procedures, and application design that every database engineer must understand.
By the end of this page, you will understand asynchronous replication mechanics, the nature and implications of replication lag, consistency models for async systems, data loss risks during failover, and strategies for building robust applications on asynchronously replicated databases.
Asynchronous replication is a replication mode where the primary node confirms transactions to clients immediately after local persistence, without waiting for any replica acknowledgment. Replication to secondary nodes happens in the background, independently of the transaction commit path.
The Asynchronous Replication Protocol:
Client initiates write: The client sends a write operation to the primary.
Primary commits locally: The primary writes to its Write-Ahead Log, flushes to disk, and immediately returns success to the client.
Background replication: A separate replication process continuously streams committed WAL records to connected replicas.
Replica acknowledgment (optional): Replicas may acknowledge receipt, but this does not affect the original transaction.
Replica application: Replicas apply received WAL records to their local data files, eventually reaching consistency with the primary.
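The commit-then-ship flow above can be sketched as a toy in-memory system (the class name, the queue-based WAL, and the 10 ms delay are all invented for illustration; a real database streams WAL bytes, not key-value pairs):

```python
import threading
import queue
import time

class AsyncReplicatedStore:
    """Toy model of async replication: commit returns after the local
    write; a background thread ships records to the replica."""

    def __init__(self):
        self.primary = {}          # primary's data files
        self.replica = {}          # replica's data files
        self.wal = queue.Queue()   # stand-in for the WAL stream
        self._shipper = threading.Thread(target=self._replicate, daemon=True)
        self._shipper.start()

    def write(self, key, value):
        # Step 2: commit locally, then acknowledge immediately.
        self.primary[key] = value
        self.wal.put((key, value))  # Step 3: replication is off the commit path
        return "committed"          # client sees success before the replica does

    def _replicate(self):
        # Step 5: the replica applies WAL records in the background.
        while True:
            key, value = self.wal.get()
            time.sleep(0.01)        # simulated network + apply delay (lag)
            self.replica[key] = value
            self.wal.task_done()

store = AsyncReplicatedStore()
store.write("user:1", "alice")       # acknowledged instantly
stale = store.replica.get("user:1")  # likely None: the replica lags behind
store.wal.join()                     # wait for replication to drain
fresh = store.replica.get("user:1")  # now "alice": eventual consistency
```

Note that the client's acknowledgment never waits on the replica; the gap between `stale` and `fresh` is exactly the replication lag discussed below.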
Key Characteristics:
Asynchronous replication implements 'eventual consistency'—the guarantee that, in the absence of further updates, all replicas will eventually converge to the same state. The 'eventually' can range from milliseconds (same-datacenter) to seconds (cross-region), depending on network conditions and load.
Replication lag is the central concept of asynchronous replication. It measures how far behind a replica is compared to the primary. This lag is not merely a technical metric—it directly affects what data users see and how applications must be designed.
What Causes Replication Lag:
1. Network Round-Trip Time: WAL records must traverse the network to reach replicas. Cross-region networks add 50-200ms per hop.
2. Replica Write Throughput: If the replica's disk I/O is slower than the primary's, or if the replica is processing complex apply operations, it falls behind.
3. Write Bursts: Spikes in write volume on the primary generate WAL faster than replicas can apply it. Lag increases during the burst, then recovers.
4. Long-Running Transactions: Transactions holding locks on the replica (e.g., user queries on a read replica) can block WAL application, causing lag.
5. Resource Contention: CPU, memory, or I/O contention on replicas slows WAL processing.
6. Query Conflicts: In PostgreSQL, queries on replicas can conflict with WAL application. The database must choose: cancel queries or delay replication.
| Cause | Typical Impact | Mitigation Strategy |
|---|---|---|
| Network latency | 50-200ms (cross-region) | Geo-distributed primary placement; dedicated replication networks |
| Replica disk I/O | Variable (seconds to minutes) | Use SSDs; match replica hardware to primary; separate WAL disk |
| Write bursts | Temporary spikes (seconds) | Rate limiting; larger replication buffer; more replica capacity |
| Long queries on replica | Duration of query | hot_standby_feedback (PostgreSQL); query timeouts |
| Replica CPU saturation | Seconds to minutes | Dedicated replica resources; run heavy analytics on separate, dedicated replicas |
| Transaction log volume | Linear with write rate | Archive old WAL; configure replication slots carefully |
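The write-burst row above has a simple dynamic behind it: lag (here measured as backlogged WAL records) grows whenever the primary generates WAL faster than the replica applies it, and drains afterwards. A few lines of arithmetic model this (the rates are invented; real lag also depends on record sizes and I/O):

```python
def simulate_lag(write_rates, apply_rate):
    """Model the replica's backlog second by second: it grows when the
    write rate exceeds the apply rate, and drains when it drops below."""
    backlog = 0
    history = []
    for rate in write_rates:
        backlog = max(0, backlog + rate - apply_rate)
        history.append(backlog)
    return history

# Replica applies 1,000 records/s; a 3-second burst writes 2,500/s.
workload = [500, 500, 2500, 2500, 2500, 500, 500, 500, 500]
print(simulate_lag(workload, apply_rate=1000))
# Backlog climbs by 1,500/s during the burst, then drains at only 500/s.
```

The asymmetry is the important part: a 3-second burst here takes 9 seconds to fully drain, which is why "typical" lag figures understate worst-case behavior.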
```sql
-- PostgreSQL: Monitor replication lag from primary
SELECT
    client_addr,
    application_name,
    state,
    -- Byte-based lag
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)  AS flush_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
    -- Human-readable size
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
    -- Time-based lag (most intuitive)
    replay_lag,
    write_lag,
    flush_lag
FROM pg_stat_replication;
```

```sql
-- MySQL: Monitor replication lag from replica
SHOW SLAVE STATUS\G
-- Key field: Seconds_Behind_Master
-- (MySQL 8.0.22+ prefers SHOW REPLICA STATUS)
```

```javascript
// MongoDB: Check replica set member lag
rs.printSecondaryReplicationInfo();
```

Replication lag creates specific consistency anomalies that applications must handle. These are not bugs—they are the inherent consequences of asynchronous replication that you must design for:
Practical Example: The E-Commerce Cart Problem
Consider an e-commerce application with async replication: a user adds an item to their cart (a write handled by the primary), then immediately opens the cart page, which is served by a read replica. If the replica has not yet replayed the write, the cart appears empty, so the user adds the item again.
This is a classic read-your-writes violation. The application's architecture created a poor user experience despite both writes succeeding.
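A minimal in-memory sketch of this anomaly (the cart key and items are invented for illustration; the two dicts stand in for primary and lagging replica):

```python
# Simulated primary and lagging replica, in sync before the write.
primary = {"cart:42": ["keyboard"]}
replica = {"cart:42": ["keyboard"]}

def add_to_cart(item):
    """Write path: goes to the primary and is acknowledged immediately.
    Replication of this change has NOT reached the replica yet."""
    primary["cart:42"] = primary["cart:42"] + [item]

def view_cart():
    """Read path: routed to a replica for scalability."""
    return replica["cart:42"]

add_to_cart("mouse")   # write acknowledged by the primary
print(view_cart())     # ['keyboard']: the user's own write is invisible
```

The write genuinely succeeded; the read was simply answered by a node that had not yet seen it.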
Replication lag isn't constant. A replica that's 50ms behind right now might be 5 seconds behind during a write spike. Designing for 'typical' lag is insufficient—you must design for worst-case lag during normal operations. Some databases provide mechanisms to read with guaranteed freshness, trading latency for consistency.
| Strategy | Mechanism | Trade-off |
|---|---|---|
| Read from primary | Route writes and subsequent reads to primary | Reduced read scalability; primary overload risk |
| Sticky sessions | Route user to same replica consistently | Load balancing complexity; replica failure disruption |
| Timestamp-based | Include write timestamp; wait for replica to catch up | Added latency; application complexity |
| Optimistic UI | Immediately show expected result; correct if wrong | Complex client logic; potential confusion |
| Version vectors | Track write versions; reject stale reads | Significant implementation complexity |
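A minimal sketch of the timestamp-based strategy from the table, assuming a known worst-case lag bound (the class name, `max_lag` parameter, and routing labels are invented for illustration):

```python
import time

class FreshnessRouter:
    """Timestamp-based read-your-writes routing: after a user writes,
    send that user's reads to the primary until the replica has had
    time to catch up (max_lag is an assumed worst-case lag bound)."""

    def __init__(self, max_lag=2.0):
        self.max_lag = max_lag
        self.last_write = {}  # user_id -> monotonic time of last write

    def record_write(self, user_id):
        self.last_write[user_id] = time.monotonic()

    def choose_target(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.max_lag:
            return "primary"   # replica may not have this user's write yet
        return "replica"       # any prior write is older than worst-case lag

router = FreshnessRouter(max_lag=2.0)
router.record_write("user:7")
print(router.choose_target("user:7"))   # primary: recent write
print(router.choose_target("user:9"))   # replica: no recent write
```

The trade-off noted in the table is visible here: correctness depends on `max_lag` actually bounding real lag, so this pattern pairs naturally with lag monitoring and alerting.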
The most significant risk of asynchronous replication is data loss during unplanned failover. When the primary fails, any writes that were committed locally but not yet replicated to the new primary are permanently lost.
Quantifying the Risk:
The amount of data at risk equals the replication lag at the moment of failure.
Example Calculation: suppose the primary sustains 100 writes per second and the replica is 2 seconds behind when the primary fails. Failing over loses roughly 100 × 2 = 200 committed-but-unreplicated transactions.
For many applications (social media, logging, analytics), losing 200 transactions during a rare failure event is acceptable. For financial applications, losing even one transaction is catastrophic.
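This risk window is simple enough to encode directly (the rates below are illustrative, not measured):

```python
def transactions_at_risk(writes_per_second, lag_seconds):
    """Writes committed on the primary but not yet replicated:
    the data lost if the primary fails at this moment."""
    return int(writes_per_second * lag_seconds)

# 100 writes/s with 2 s of replication lag at the moment of failure:
print(transactions_at_risk(100, 2.0))  # 200 committed-but-lost transactions
```

Plugging in your own peak write rate and worst-case observed lag gives a concrete loss budget to weigh against business requirements.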
In async replication, the client received a commit confirmation for Transaction T1 and believes the data is persisted. After failover, that data is gone: the client's assumption ('acknowledged = durable') is violated. This is not a bug—it's the fundamental trade-off of async replication. Business logic must account for this possibility.
Asynchronous replication is the default mode for most database systems due to its performance advantages. Here's how major databases implement it:
PostgreSQL Streaming Replication (Async Mode):
PostgreSQL replicates asynchronously by default: whenever synchronous_standby_names is left unconfigured, the primary streams WAL continuously to connected replicas without waiting for their acknowledgment.
Key Configuration (primary):
- `max_wal_senders`: Maximum number of concurrent replication connections
- `wal_keep_size`: Minimum WAL to retain for replicas that disconnect
- `archive_mode` / `archive_command`: WAL archiving for gap recovery

Key Configuration (replica):
- `primary_conninfo`: Connection string to the primary
- `hot_standby`: Allow read queries on the replica
- `max_standby_streaming_delay`: How long to delay WAL application in favor of running queries
```sql
-- Primary configuration (postgresql.conf)
max_wal_senders = 10
wal_level = replica
wal_keep_size = 1GB  -- Retain WAL for slow replicas

-- Ensure NO synchronous_standby_names (async mode)
-- SHOW synchronous_standby_names; should return empty

-- Create replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';

-- Monitor async replicas
SELECT
    application_name,
    client_addr,
    state,
    sync_state,  -- 'async' for asynchronous
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag,
    replay_lag AS time_lag
FROM pg_stat_replication;

-- Check archive status (for disaster recovery)
SELECT * FROM pg_stat_archiver;
```

Building robust applications on asynchronously replicated databases requires specific design patterns. These patterns acknowledge replication lag and work with it rather than against it:
```python
import psycopg2

class ConnectionRouter:
    """Routes queries to primary or replica based on freshness requirements."""

    def __init__(self, primary_dsn, replica_dsn):
        self.primary = psycopg2.connect(primary_dsn)
        self.replica = psycopg2.connect(replica_dsn)
        self.last_write_lsn = None

    def execute_write(self, query, params=None):
        """Execute write on primary, track LSN for read-your-writes."""
        cursor = self.primary.cursor()
        cursor.execute(query, params)
        self.primary.commit()

        # Get current WAL position after write
        cursor.execute("SELECT pg_current_wal_lsn()")
        self.last_write_lsn = cursor.fetchone()[0]
        return cursor

    def execute_read(self, query, params=None, require_fresh=False):
        """
        Execute read on replica unless fresh data is required.
        If require_fresh=True or a recent write occurred, use the primary.
        """
        if require_fresh or self.last_write_lsn:
            # Use primary for consistency (note: last_write_lsn is never
            # cleared in this sketch, so all reads after the first write
            # stay on the primary; a real router would expire it)
            cursor = self.primary.cursor()
        else:
            # Safe to use replica
            cursor = self.replica.cursor()
        cursor.execute(query, params)
        return cursor.fetchall()

    def read_with_lsn_guarantee(self, query, params, min_lsn):
        """Read from the replica only if it has replayed past min_lsn."""
        cursor = self.replica.cursor()

        # Check whether the replica has caught up to min_lsn (PostgreSQL 10+)
        cursor.execute(
            "SELECT pg_last_wal_replay_lsn() >= %s",
            (min_lsn,)
        )
        is_caught_up = cursor.fetchone()[0]

        if not is_caught_up:
            # Fall back to primary
            cursor = self.primary.cursor()

        cursor.execute(query, params)
        return cursor.fetchall()
```

Most applications have a small set of critical flows requiring read-your-writes (account balance checks, order confirmations) and a large set of flows tolerant of lag (browsing, search results, dashboards). Identify the critical 20% and provide strong consistency for those. Let the other 80% enjoy the scalability of replicas.
Asynchronous replication's primary advantage is performance. Understanding these benefits quantitatively helps justify the consistency trade-offs:
| Metric | Synchronous | Asynchronous | Async Advantage |
|---|---|---|---|
| Write Latency (same DC) | 3-5ms | 1-2ms | 2-3x lower latency |
| Write Latency (cross-region) | 100-200ms | 1-2ms | 50-100x lower latency |
| Write Throughput | Limited by slowest replica | Limited by primary capacity | No replica bottleneck |
| Availability During Replica Failure | Degraded or blocked | Unaffected | Full write availability |
| Read Scalability | Limited by replica lag visibility | Full replica pool utilization | Maximum read scaling |
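The cross-region rows above follow directly from round-trip cost: a synchronous commit pays the RTT on every write, an async commit does not. A tiny calculation makes the gap concrete (the 2 ms local commit time is an assumed figure):

```python
def max_serial_write_rate(local_commit_ms, replication_rtt_ms, synchronous):
    """Upper bound on sequential writes/s from a single client:
    a synchronous commit adds the cross-region round trip to every
    write; an asynchronous commit pays only the local commit cost."""
    latency_ms = local_commit_ms + (replication_rtt_ms if synchronous else 0)
    return 1000 / latency_ms

# US-East primary, EU-West replica, 100 ms RTT, 2 ms local commit:
print(max_serial_write_rate(2, 100, synchronous=True))   # ~9.8 writes/s
print(max_serial_write_rate(2, 100, synchronous=False))  # 500 writes/s
```

This single-client bound is why the table reports a 50-100x latency advantage: pipelining and concurrency raise absolute throughput in both modes, but the per-write latency gap remains.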
Cross-Region Use Case:
Consider an application with primary in US-East and disaster recovery replica in EU-West (100ms RTT).
With Synchronous Replication: every commit waits at least one 100ms round trip to EU-West, so write latency exceeds 100ms for every user, and writes stall entirely if the transatlantic link degrades.
With Asynchronous Replication: commits complete in 1-2ms against local storage, and changes stream to EU-West in the background. The cost is a window of unreplicated writes, typically well under a second, that would be lost if US-East failed.
For most applications, the performance benefit outweighs the rare data loss risk. The key is understanding and documenting this trade-off.
Asynchronous replicas can be scaled independently to handle read traffic. Unlike synchronous replicas (which add latency per additional node), async replicas are 'free' from the write path perspective. Add as many as needed for read capacity and availability—the only cost is storage and network bandwidth.
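A round-robin pool sketch of this scaling property (the class and node names are invented; in production this routing usually lives in a proxy or connection pooler such as PgBouncer or HAProxy):

```python
import itertools

class ReplicaPool:
    """Read-scaling sketch: writes always target the primary, while
    reads rotate across however many async replicas exist. Adding a
    replica widens the read path without touching the write path."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._cycle = itertools.cycle(replicas)

    def for_write(self):
        return self.primary

    def for_read(self):
        return next(self._cycle)

pool = ReplicaPool("pg-primary", ["replica-1", "replica-2", "replica-3"])
reads = [pool.for_read() for _ in range(4)]
print(reads)             # ['replica-1', 'replica-2', 'replica-3', 'replica-1']
print(pool.for_write())  # 'pg-primary'
```

Because no replica sits on the commit path, growing this pool leaves write latency untouched; the consistency caveats from earlier sections still apply to every read it serves.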
We've explored asynchronous replication comprehensively: its mechanics, the nature of replication lag, consistency challenges, data loss risks, and application design patterns.
What's Next:
We've now covered both synchronous and asynchronous replication in depth. In the next page, we'll explore Replication Strategies—the higher-level patterns for organizing replicas including primary-replica, primary-primary (multi-master), chain replication, and consensus-based replication. We'll examine when to use each strategy and how they combine synchronous and asynchronous mechanisms.
You now understand asynchronous replication's mechanics, trade-offs, and application implications. It prioritizes performance and availability at the cost of potential data loss during failures and consistency anomalies during normal operation. For systems where milliseconds of latency matter more than absolute durability, async replication is the correct choice. Next, we'll explore replication strategies.