Imagine a financial application processes a wire transfer of $1,000,000. The transaction is written to the leader database and the client receives a success response. One second later, the leader's server experiences a total hardware failure.
Question: Is that million-dollar transfer safe?
The answer depends entirely on one configuration choice: synchronous versus asynchronous replication.
This single configuration option, whether the leader waits for followers before acknowledging a commit, is one of the most consequential decisions in database architecture. It directly trades durability guarantees against write performance.
By the end of this page, you will deeply understand synchronous replication (strong durability, higher latency), asynchronous replication (lower latency, potential data loss), semi-synchronous modes, the mathematics of the tradeoff, and how to choose the right approach for different scenarios.
To understand synchronous vs. asynchronous replication, we must first understand the timeline of a write commit. The critical question is: at what point do we tell the client their write succeeded?
Figure: timeline of a write commit across the client, the leader, a synchronous follower, and an asynchronous follower.

SYNCHRONOUS: Client sees success AFTER follower acknowledged.
ASYNCHRONOUS: Client sees success BEFORE follower acknowledged.

The window between (2) WAL write and (4) follower ACK is the "replication window". If the leader fails during this window, async replication may lose data.

The Five Stages of a Write:

(1) The client sends the write to the leader.
(2) The leader writes it to its WAL, making it durable locally.
(3) The leader streams the change to the followers.
(4) The synchronous follower acknowledges (ACK) to the leader.
(5) The leader reports SUCCESS to the client. In synchronous mode the leader waits for the follower ACK before this step; the asynchronous follower catches up eventually.
The key decision: Does stage 5 wait for stage 4?
In asynchronous mode, there's a window between 'leader wrote to WAL' and 'follower received the data.' If the leader fails during this window, the write is lost—even though the client was told it succeeded. This is not a theoretical concern; it happens in production.
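The replication window can be made concrete with a small model. The sketch below is a hypothetical simulation, not any database's API: `wal_ms`, `follower_ack_ms`, and `leader_crash_ms` are assumed timings, and it simplifies by treating the moment the follower holds the write and the moment its ACK reaches the leader as the same instant.

```python
from dataclasses import dataclass


@dataclass
class WriteOutcome:
    client_saw_success: bool   # did the client receive SUCCESS before the crash?
    survives_failover: bool    # does the follower hold the write after the crash?


def simulate_write(mode: str, wal_ms: float, follower_ack_ms: float,
                   leader_crash_ms: float) -> WriteOutcome:
    """Model one write; all times are milliseconds after the client's request.

    mode            -- "sync" or "async" (labels invented for this sketch)
    wal_ms          -- when the leader's WAL write becomes durable (stage 2)
    follower_ack_ms -- when the follower holds the write and ACKs (stage 4)
    leader_crash_ms -- when the leader suffers a total failure
    """
    if mode not in ("sync", "async"):
        raise ValueError(f"unknown mode: {mode}")
    # Stage 5: async answers after the local WAL write, sync after the follower ACK.
    success_ms = wal_ms if mode == "async" else max(wal_ms, follower_ack_ms)
    client_saw_success = success_ms < leader_crash_ms
    # After failover, only data the follower already received still exists.
    survives_failover = follower_ack_ms < leader_crash_ms
    return WriteOutcome(client_saw_success, survives_failover)


# Leader crashes at 4 ms; the follower would only have had the write at 5 ms.
print(simulate_write("async", wal_ms=1, follower_ack_ms=5, leader_crash_ms=4))
print(simulate_write("sync", wal_ms=1, follower_ack_ms=5, leader_crash_ms=4))
```

In this model the dangerous case is `client_saw_success=True` with `survives_failover=False`. Only the asynchronous mode can produce it; in synchronous mode the client is simply never told the write succeeded.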
Synchronous replication ensures that every committed transaction exists on at least two nodes (the leader and one or more followers) before the client is told the transaction succeeded.
This provides a powerful guarantee: no committed data can be lost due to a single node failure.
| Level | Follower Action Before ACK | Durability Guarantee | Latency Impact |
|---|---|---|---|
| remote_write (PostgreSQL) | Written to OS buffer | Survives a follower PostgreSQL crash, but not a follower OS crash | Low (~1-2ms) |
| on (PostgreSQL) | fsync'd to disk | Survives follower crash immediately | Medium (~5-10ms) |
| remote_apply (PostgreSQL) | fsync'd to disk and replayed | Immediately readable on follower | High (~10-50ms) |
| on (MySQL semi-sync) | Received into relay log | Survives follower crash after apply | Low-Medium |
Synchronous Replication Configurations:
Single Synchronous Follower: The leader designates one follower as synchronous. All commits wait for this one follower. If the synchronous follower fails, the leader either waits forever (availability loss) or falls back to async (durability risk).
Multiple Synchronous Followers: The leader can wait for N out of M followers. PostgreSQL's synchronous_standby_names supports FIRST N (priority-based: wait for the N highest-priority standbys, in list order) or ANY N (quorum-based: wait for any N standbys from the list). Listing more candidates than N provides fault tolerance: one sync follower can fail without blocking commits.
Quorum-Based Sync: Advanced systems use quorum writes: commit when a majority (e.g., 2 of 3) acknowledge. This balances durability with availability—no single follower is critical.
Every synchronous commit adds at least one round-trip time to a follower. For same-datacenter replication, this might be 1-5ms. For cross-region replication, it could be 50-200ms. This latency adds to every single write transaction.
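To see how the choice of followers drives this latency, here is a minimal sketch. It is a simplified model rather than a real driver or database call: the function name and parameters are invented for illustration, and it assumes the local commit time and the follower round trip simply add up.

```python
def sync_commit_latency_ms(local_commit_ms, follower_rtts_ms, required_acks, mode="any"):
    """Estimate commit latency when the leader waits for follower ACKs.

    follower_rtts_ms -- round-trip time to each follower, in priority order
    required_acks    -- how many follower ACKs the leader waits for
    mode             -- "any":   the fastest `required_acks` followers count (quorum)
                        "first": the first `required_acks` followers in the list count
    """
    if not 0 <= required_acks <= len(follower_rtts_ms):
        raise ValueError("required_acks must be between 0 and the number of followers")
    if required_acks == 0:
        return local_commit_ms  # asynchronous: no follower wait at all
    if mode == "any":
        waited = sorted(follower_rtts_ms)[:required_acks]
    elif mode == "first":
        waited = list(follower_rtts_ms)[:required_acks]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # The commit completes when the slowest of the awaited followers has answered.
    return local_commit_ms + max(waited)


# Followers in priority order: cross-region (40 ms), same datacenter (1 ms), cross-continent (150 ms)
rtts = [40.0, 1.0, 150.0]
print(sync_commit_latency_ms(1.0, rtts, required_acks=1, mode="first"))  # waits for the 40 ms follower
print(sync_commit_latency_ms(1.0, rtts, required_acks=1, mode="any"))    # waits for the 1 ms follower
```

Under these assumptions, a quorum of one is only as slow as the nearest follower, while a priority list is as slow as whichever followers are listed first.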
Asynchronous replication decouples the client response from follower replication. The leader commits as soon as its own WAL write completes, without waiting for followers.
This provides the lowest possible write latency—but with a trade-off in durability guarantees.
| Time | Event |
|---|---|
| t=0ms | Leader receives write |
| t=1ms | Leader writes to WAL |
| t=2ms | Client sees SUCCESS |
| t=3ms | Leader streams to follower (in progress) |
| t=4ms | Leader CRASHES |
| t=5ms | Follower has NOT received the write |

RESULT: Client thinks transaction succeeded, but it's LOST. The follower will be promoted to leader, missing this transaction. Any external side effects (emails sent, payments initiated) cannot be undone.

The Risk is Real:
Asynchronous data loss isn't just theoretical. Consider these real-world scenarios:
Quantifying the Risk:
The data loss window equals the replication lag. With typical async setups:
For a system doing 1000 writes/second, 100ms of lag means up to 100 potentially lost transactions per failure.
Most database installations default to asynchronous replication because it's simpler and faster. This is often fine for development or non-critical data, but production systems with durability requirements should explicitly configure synchronous replication.
Pure synchronous and pure asynchronous represent two extremes. Most production systems operate somewhere in between, using semi-synchronous or hybrid modes that balance durability against availability and performance.
PostgreSQL's synchronous_commit accepts on, remote_write, remote_apply, local (wait only for the local fsync), and off. Different transactions can have different guarantees.

| Mode | Durability | Latency | Availability | Complexity |
|---|---|---|---|---|
| Fully Async | Recent writes may be lost on leader failure | Lowest (leader-only) | Highest (no dependencies) | Simple |
| Semi-Sync (1 ACK) | Safe with 1 follower | Medium (+network RTT) | Blocked if sync follower down | Medium |
| Fully Sync (N ACKs) | Safe with N followers | Highest (+slowest follower) | Blocked if any required follower down | Complex |
| Quorum (Majority) | Safe as long as a majority of nodes survives | Medium (+majority RTT) | Tolerates minority failures | Complex |
Per-Transaction Control:
Advanced databases allow durability level to be set per-transaction, not just globally. This enables intelligent trade-offs:
```sql
-- Critical financial transaction: wait until the synchronous standby has applied the commit
BEGIN;
SET LOCAL synchronous_commit = 'remote_apply';  -- applies to this transaction only
INSERT INTO transfers (amount, from_account, to_account) VALUES (1000000, 'A', 'B');
COMMIT;

-- Low-priority logging: async is fine (do not wait for the WAL flush, locally or on standbys)
BEGIN;
SET LOCAL synchronous_commit = 'off';
INSERT INTO audit_log (event, timestamp) VALUES ('user_login', NOW());
COMMIT;
```
This approach provides the best of both worlds: critical data gets strong guarantees, bulk/logging operations remain fast.
When a synchronous follower times out, the system must choose: wait forever (availability loss) or fall back to async (potential durability loss). Most production systems choose fallback with alerting. Understand your system's behavior before it matters.
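The two timeout behaviors can be sketched as a small decision function. This is an illustrative model only: `TimeoutPolicy`, `CommitResult`, and `resolve_commit` are hypothetical names, not settings of any particular database, and a follower that is down is represented by `ack_delay_ms=None`.

```python
import math
from enum import Enum
from typing import NamedTuple, Optional


class TimeoutPolicy(Enum):
    BLOCK = "block"                 # keep waiting for the follower
    FALLBACK_TO_ASYNC = "fallback"  # commit on the leader only and raise an alert


class CommitResult(NamedTuple):
    status: str        # "committed" or "blocked"
    waited_ms: float   # how long the client waited
    replicated: bool   # True if a follower held the write when the client was answered
    alert: bool        # True if operators should be notified


def resolve_commit(ack_delay_ms: Optional[float], timeout_ms: float,
                   policy: TimeoutPolicy) -> CommitResult:
    """Decide the outcome of one synchronous commit under a timeout policy."""
    if ack_delay_ms is not None and ack_delay_ms <= timeout_ms:
        return CommitResult("committed", ack_delay_ms, True, False)
    if policy is TimeoutPolicy.FALLBACK_TO_ASYNC:
        # Durability is silently weaker from here on, so the alert matters.
        return CommitResult("committed", timeout_ms, False, True)
    if ack_delay_ms is None:
        # Follower is down and the policy forbids fallback: the write never completes.
        return CommitResult("blocked", math.inf, False, True)
    # Follower is slow but alive: the commit eventually succeeds, late.
    return CommitResult("committed", ack_delay_ms, True, True)


print(resolve_commit(None, 100.0, TimeoutPolicy.BLOCK))
print(resolve_commit(None, 100.0, TimeoutPolicy.FALLBACK_TO_ASYNC))
```

Both failure branches set `alert=True`: whichever policy is chosen, the degraded state should be visible to operators rather than discovered after a leader failure.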
Choosing between sync and async isn't just intuition—we can quantify the trade-offs mathematically. Understanding the numbers helps make informed decisions.
Latency Impact:
Synchronous replication adds at least one round-trip time (RTT) to each transaction. Let's calculate the impact:
| Scenario | Network RTT | Baseline Commit | Sync Commit | Overhead |
|---|---|---|---|---|
| Same rack | 0.1ms | 1ms | 1.1ms | +10% |
| Same datacenter | 1ms | 1ms | 2ms | +100% |
| Cross-region (US) | 40ms | 1ms | 41ms | +4000% |
| Cross-continent | 150ms | 1ms | 151ms | +15000% |
For write-heavy workloads that issue commits one after another on a connection, synchronous cross-region replication can reduce throughput by 10-100x, because every commit waits a full round trip.
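The overhead figures follow from simple arithmetic, sketched below. The function name is invented for illustration, and the model assumes the sync commit time is the baseline commit plus one network round trip, with commits issued serially on a single connection.

```python
def sync_overhead(baseline_commit_ms, network_rtt_ms):
    """Return (sync_commit_ms, overhead_percent, serial_commits_per_sec)."""
    sync_commit_ms = baseline_commit_ms + network_rtt_ms
    overhead_percent = 100.0 * network_rtt_ms / baseline_commit_ms
    # One connection committing back to back can finish this many commits per second.
    serial_commits_per_sec = 1000.0 / sync_commit_ms
    return sync_commit_ms, overhead_percent, serial_commits_per_sec


for name, rtt_ms in [("Same rack", 0.1), ("Same datacenter", 1.0),
                     ("Cross-region (US)", 40.0), ("Cross-continent", 150.0)]:
    commit_ms, pct, per_sec = sync_overhead(1.0, rtt_ms)
    print(f"{name}: {commit_ms:.1f} ms per commit, +{pct:.0f}%, "
          f"about {per_sec:.0f} serial commits/sec")
```

With a 1 ms baseline and a 40 ms round trip, a single connection drops from 1000 serial commits per second to roughly 24. Connections that commit concurrently can overlap their waits, so aggregate throughput may fall by less than this per-connection figure.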
Data Loss Quantification:
Asynchronous replication risks losing data in the 'replication window.' We can estimate expected data loss:
Expected Data Loss Per Failure = Replication Lag × Write Rate
| Replication Lag | Write Rate | Transactions at Risk |
|---|---|---|
| 10ms | 100 writes/sec | ~1 transaction |
| 100ms | 100 writes/sec | ~10 transactions |
| 1 second | 1000 writes/sec | ~1000 transactions |
| 10 seconds | 1000 writes/sec | ~10000 transactions |
Annualized Risk:
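If you have one leader failure per year with 100ms lag and 1000 writes/sec, the expected loss is 0.1 s × 1000 writes/sec = 100 transactions per failure, or about 100 transactions per year.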
For financial systems, losing 100 transactions might mean regulatory violations. For analytics, it might be irrelevant.
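The same estimate can be written as a short calculation. This is a back-of-the-envelope sketch with hypothetical function names; it assumes a steady write rate and a constant replication lag, which real systems rarely have.

```python
def transactions_at_risk(replication_lag_ms, writes_per_sec):
    """Expected data loss per failure = replication lag x write rate."""
    return replication_lag_ms * writes_per_sec / 1000.0


def annualized_loss(replication_lag_ms, writes_per_sec, failures_per_year):
    """Expected number of lost transactions per year."""
    return transactions_at_risk(replication_lag_ms, writes_per_sec) * failures_per_year


for lag_ms, rate in [(10, 100), (100, 100), (1000, 1000), (10000, 1000)]:
    print(f"lag={lag_ms} ms, {rate} writes/sec: "
          f"~{transactions_at_risk(lag_ms, rate):.0f} transactions at risk")

print(annualized_loss(100, 1000, failures_per_year=1))
```

Because lag tends to spike under load, the lag observed at the moment of failure may be far higher than the average, so it is reasonable to plug in a worst-case lag rather than a typical one.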
In practice, most systems fall into clear categories. Financial/payment systems: synchronous is non-negotiable. Analytics/logging: async is fine. The edge cases where math matters are rare, but understanding the framework helps justify decisions.
With a deep understanding of sync and async trade-offs, let's provide concrete guidance for different scenarios.
| Use Case | Recommended Mode | Rationale |
|---|---|---|
| Banking/Payments | Synchronous (quorum) | Zero data loss tolerance; regulatory requirements |
| E-commerce Orders | Semi-sync (at least 1) | Orders are valuable; some latency acceptable |
| Social Media Posts | Async with monitoring | High volume; eventual consistency acceptable |
| Analytics/Logging | Fully Async | Reconstructable; volume too high for sync overhead |
| Cross-Region DR | Async (different from HA) | Latency prohibitive; DR is last resort anyway |
| Same-DC HA Replica | Synchronous or Semi-sync | Low latency; HA requires durability |
Hybrid Strategies:
Same-DC Sync + Cross-Region Async: Maintain a synchronous replica in the same data center for HA and a second replica asynchronously in a remote region for DR. This provides fast failover with strong durability locally, plus protection against datacenter-level disasters.
Per-Table Configuration: Some databases allow replication configuration per table. Critical tables (accounts, orders) replicate synchronously; high-volume tables (events, logs) replicate asynchronously.
Application-Level Routing: The application knows which writes are critical. It can specify durability requirements per transaction, routing critical writes through a synchronous path.
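As one possible shape for application-level routing, the sketch below wraps a list of SQL statements in a transaction whose durability level depends on how critical the write is. The write classes and the mapping are assumptions made for illustration; only the `synchronous_commit` values themselves are PostgreSQL settings, and the helper builds SQL text without talking to a database.

```python
# Hypothetical mapping from application write class to a PostgreSQL synchronous_commit level.
SYNC_COMMIT_BY_CLASS = {
    "critical": "remote_apply",  # e.g. transfers: wait until the standby has applied the commit
    "standard": "on",            # wait for the synchronous standby to flush
    "bulk": "off",               # e.g. logs: do not wait for the flush at all
}


def wrap_transaction(statements, write_class):
    """Return the SQL statements for one transaction with a per-transaction durability level."""
    try:
        level = SYNC_COMMIT_BY_CLASS[write_class]
    except KeyError:
        raise ValueError(f"unknown write class: {write_class}") from None
    # SET LOCAL limits the setting to this transaction.
    return ["BEGIN;", f"SET LOCAL synchronous_commit = '{level}';", *statements, "COMMIT;"]


for line in wrap_transaction(
        ["INSERT INTO audit_log (event) VALUES ('user_login');"], "bulk"):
    print(line)
```

Rejecting unknown write classes, rather than defaulting them to a weaker level, keeps a typo from quietly downgrading a critical write.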
It's easier to relax durability guarantees (switch from sync to async) than to explain lost data to stakeholders. Start with synchronous for critical data paths; optimize to async only when you've proven the need and accepted the risk.
Different databases implement synchronous/asynchronous replication with varying configurations and semantics. Let's examine the major implementations.
PostgreSQL:

Commit levels (synchronous_commit): off, local, remote_write, remote_apply, on.
Standby selection (synchronous_standby_names): FIRST 1 (standby1, standby2), ANY 2 (standby1, standby2, standby3).
Per-transaction override: SET LOCAL synchronous_commit = 'remote_apply'; within a transaction.
Quorum: the ANY N syntax provides quorum-based acknowledgment.

```
# In postgresql.conf (leader)
synchronous_commit = on    # require sync for commits
synchronous_standby_names = 'FIRST 1 (replica1, replica2)'    # replica1 is synchronous; replica2 takes over if replica1 is unavailable

# Or for quorum mode (2 of 3 must acknowledge)
# synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)'
```

```sql
-- Check synchronous status
SELECT application_name, sync_state, sync_priority
FROM pg_stat_replication;

-- Per-transaction override (for less critical writes)
BEGIN;
SET LOCAL synchronous_commit = 'local';  -- Don't wait for followers
INSERT INTO logs (event) VALUES ('user_clicked');
COMMIT;
```

MongoDB write concern: {w: 1}, {w: 'majority'}, {w: 3}, {w: 'tag'}.

PostgreSQL calls it 'synchronous_commit,' MySQL calls it 'semi-sync,' MongoDB uses 'write concern.' The concepts are the same: how many replicas must acknowledge before the write is considered durable. Learn the terminology for your specific database.
We've explored one of the most fundamental trade-offs in database replication: when to wait for followers during commit. Let's consolidate the essential insights:
What's Next:
With writes flowing through the leader and replicating (synchronously or asynchronously) to followers, we must address a critical scenario: what happens when the leader fails? The next page explores failover handling—how systems detect leader failure, elect a new leader, and transition without data loss or prolonged downtime.
You now understand the fundamental durability-latency trade-off in database replication. You can analyze when synchronous replication is worth the latency cost and when asynchronous replication's risks are acceptable. Next, we'll explore how systems handle the inevitable: leader failure and failover.