A user updates their profile picture. They click 'Save,' see a success message, and immediately navigate to their profile to admire their new photo. But they see the old picture.
Confused, they refresh. Still the old picture. They refresh again—finally, the new picture appears.
What happened? The user wrote to the leader, but their subsequent read hit a follower that hadn't yet received the update. This is replication lag—the time delay between when a change is committed on the leader and when it's visible on followers.
Replication lag is an inherent property of leader-follower systems. It cannot be eliminated (without synchronous replication's latency costs), only managed. Understanding its causes, measuring it accurately, and designing applications to handle it gracefully are essential for building reliable distributed systems.
By the end of this page, you will understand why replication lag exists, how to measure it accurately, the various causes and how to address them, the impact on application behavior, read-your-writes consistency patterns, and strategies for minimizing lag's impact on user experience.
Replication lag is the delay between a write being committed on the leader and that same write being applied and visible on a follower. It's an inherent consequence of asynchronous replication—by definition, if we don't wait for followers before acknowledging a write, followers will be behind.
Lag can be measured in two ways:

- Time-based lag: how far behind the follower is in wall-clock time, typically the difference between the current time and the commit timestamp of the last transaction the follower has applied.
- Position-based lag: how far behind the follower is in the replication log, measured in bytes or log entries between the leader's current position and the follower's last applied position.

Both perspectives are useful. Time-based lag tells you how stale reads from the follower are. Position-based lag tells you how much data still needs to transfer and apply.
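As a concrete illustration, here is a minimal sketch that computes both measures, assuming PostgreSQL, node-postgres clients connected to the leader and one follower, and reasonably synchronized clocks. The helper name and shape are invented for this example.

```typescript
import { Client } from 'pg';

interface LagSample {
  secondsBehind: number; // time-based lag
  bytesBehind: number;   // position-based lag
}

// Hypothetical helper: measures both kinds of lag for one follower.
async function measureLag(leader: Client, follower: Client): Promise<LagSample> {
  // Time-based: how old is the newest transaction the follower has applied?
  const timeRes = await follower.query(
    `SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS seconds_behind`
  );

  // Position-based: how many WAL bytes separate the leader from the follower?
  const leaderPos = await leader.query(`SELECT pg_current_wal_lsn() AS lsn`);
  const diffRes = await follower.query(
    `SELECT pg_wal_lsn_diff($1::pg_lsn, pg_last_wal_replay_lsn()) AS bytes_behind`,
    [leaderPos.rows[0].lsn]
  );

  return {
    secondsBehind: Number(timeRes.rows[0].seconds_behind),
    bytesBehind: Number(diffRes.rows[0].bytes_behind),
  };
}
```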
```
LEADER TIMELINE:
═══════════════════════════════════════════════════════════════════════════════
 Write Events:  W1    W2    W3    W4    W5    W6    W7    W8    W9    W10
                │     │     │     │     │     │     │     │     │     │
 Leader Time:   0ms   100ms 200ms 300ms 400ms 500ms 600ms 700ms 800ms 900ms
                                                                        ▲
                                                                  CURRENT TIME

FOLLOWER TIMELINE (with ~300ms lag):
═══════════════════════════════════════════════════════════════════════════════
 Apply Events:  W1    W2    W3    W4    W5    W6    W7
                │     │     │     │     │     │     │
 Follower Time: 100ms 200ms 300ms 400ms 500ms 600ms 700ms ─────────────── now
                                                    ▲
                                              LAST APPLIED

 LAG = 900ms (current time) - 600ms (W7's leader timestamp) = 300ms

 Transactions W8, W9, W10 exist on leader but NOT YET on follower
 A read hitting this follower won't see W8, W9, W10
```

Don't confuse replication lag with network latency. Network latency is how long it takes to send data. Replication lag includes network latency PLUS apply time PLUS any queuing. A follower with 10ms network latency might have 5 seconds of replication lag if it's struggling to apply entries fast enough.
Replication lag can spike for many reasons. Understanding the root causes helps you diagnose and resolve lag issues effectively.
| Cause | Mechanism | Typical Symptoms | Resolution |
|---|---|---|---|
| High Write Volume | Leader produces entries faster than follower can apply | Lag grows under load, recovers when load drops | Scale follower hardware, parallelize apply |
| Large Transactions | Single transaction takes long to apply | Periodic lag spikes correlating with big batches | Break up large transactions, use smaller batches |
| Schema Changes (DDL) | DDL locks tables during apply | Lag spike during schema changes | Run DDL during low traffic, use online DDL tools |
| Slow Disk I/O | Follower storage can't keep up with write rate | High disk utilization, apply waits on I/O | Faster storage (SSD), optimize I/O scheduler |
| Network Congestion | Replication stream is throttled | Receive lag high, apply lag normal | Increase bandwidth, prioritize replication traffic |
| Read Query Contention | Heavy read load on follower competes with apply | Lag varies with read traffic patterns | Load balance reads across multiple followers |
| Apply Thread Bottleneck | Single-threaded apply can't parallelize | CPU single core at 100%, others idle | Enable parallel apply (if supported), faster CPU |
| Long-Running Queries | Apply blocked by long read queries (on some DBs) | Lag correlates with slow query duration | Kill long queries, configure query cancellation |
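For the "Large Transactions" cause above, a common remedy is to rewrite one huge statement as a loop of small batches, so each replicated transaction applies quickly on followers. A rough sketch, assuming node-postgres and a hypothetical `events` table:

```typescript
import { Client } from 'pg';

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Instead of one giant `DELETE FROM events WHERE created_at < $1` (a single huge
// transaction that followers must replay in one go), delete in small batches.
async function purgeOldEvents(db: Client, cutoff: string, batchSize = 1000): Promise<number> {
  let totalDeleted = 0;
  for (;;) {
    const res = await db.query(
      `DELETE FROM events
       WHERE id IN (SELECT id FROM events WHERE created_at < $1 LIMIT $2)`,
      [cutoff, batchSize]
    );
    const deleted = res.rowCount ?? 0;
    totalDeleted += deleted;
    if (deleted < batchSize) break; // nothing (or little) left to delete
    await sleep(100);               // brief pause lets followers keep up
  }
  return totalDeleted;
}
```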
Deep Dive: The Apply Bottleneck
Historically, database replication applied changes single-threaded: one transaction at a time, in order. This matches the leader's write order but limits throughput to what one CPU core can handle.
Modern databases increasingly support parallel apply, in which multiple worker threads apply non-conflicting transactions concurrently. Parallel apply dramatically improves throughput but must preserve ordering for conflicting operations (same row/document). Implementing this correctly is complex.
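The core idea can be sketched in a few lines: route each transaction to a worker based on the keys it touches, so transactions on the same row stay ordered while unrelated transactions apply concurrently. This is a deliberately simplified illustration with invented names, not any particular database's implementation (real systems track conflicts far more precisely); the diagram that follows shows the same idea pictorially.

```typescript
type Tx = { id: number; keys: string[]; apply: () => Promise<void> };

// Route a transaction to a worker by hashing the first key it touches, so
// transactions on the same row always land on the same worker (and stay ordered).
function workerFor(tx: Tx, workerCount: number): number {
  const key = tx.keys[0] ?? '';
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % workerCount;
}

async function parallelApply(txs: Tx[], workerCount = 4): Promise<void> {
  // One FIFO queue per worker preserves per-key ordering.
  const queues: Tx[][] = Array.from({ length: workerCount }, () => []);
  for (const tx of txs) queues[workerFor(tx, workerCount)].push(tx);

  // Each worker applies its queue sequentially; the workers run concurrently.
  await Promise.all(
    queues.map(async (queue) => {
      for (const tx of queue) await tx.apply();
    })
  );
}
```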
```
SINGLE-THREADED APPLY:
───────────────────────────────────────────────────────────────────────────────
 Incoming:  [Tx1] [Tx2] [Tx3] [Tx4] [Tx5] [Tx6] [Tx7] [Tx8]
               |
               v
 Apply:     [Tx1]─────[Tx2]─────[Tx3]─────[Tx4]─────[Tx5]...
            ────────────────────────────────────────────────▶ time

 Throughput limited to: (1 transaction / apply_time)

PARALLEL APPLY (4 threads):
───────────────────────────────────────────────────────────────────────────────
 Incoming:  [Tx1] [Tx2] [Tx3] [Tx4] [Tx5] [Tx6] [Tx7] [Tx8]
               |
               v  (distribute non-conflicting transactions)
 Thread 1:  [Tx1]─────────────[Tx5]──────────────...
 Thread 2:  [Tx2]─────────────[Tx6]──────────────...
 Thread 3:  [Tx3]─────────────[Tx7]──────────────...
 Thread 4:  [Tx4]─────────────[Tx8]──────────────...
            ────────────────────────────────────────────────▶ time

 Throughput up to: 4 × (1 transaction / apply_time)

 CONSTRAINT: Tx1 and Tx5 must not conflict (touch same rows)
```

Schema changes (ALTER TABLE, CREATE INDEX) are often the worst lag culprits. They typically require table-level locks during apply and can take minutes or hours on large tables. Plan DDL operations carefully, use online DDL tools when available, and monitor lag closely during schema changes.
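Since DDL and large backfills are common lag culprits, one practical guard is to pause a batched migration whenever follower lag exceeds a budget. A sketch under assumed names, using the PostgreSQL `replay_lag` column from `pg_stat_replication` (covered in the measurement section below):

```typescript
import { Client } from 'pg';

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical guard for a batched backfill or online schema change: before each
// batch, check the worst follower lag on the leader and wait until it drops.
async function waitForLagBelow(leader: Client, maxLagSeconds: number): Promise<void> {
  for (;;) {
    const res = await leader.query(
      `SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0) AS worst_lag
       FROM pg_stat_replication`
    );
    if (Number(res.rows[0].worst_lag) < maxLagSeconds) return;
    await sleep(1000); // back off while followers catch up
  }
}

// Usage inside a migration loop (runBatch is an assumed application function):
// while (hasMoreRows) {
//   await waitForLagBelow(leader, 5); // keep followers within ~5 seconds
//   await runBatch();
// }
```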
Accurate lag measurement is essential for monitoring, alerting, and debugging. Different databases expose lag metrics in different ways.
PostgreSQL:

```sql
-- On the leader: view replication status for all followers
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  -- Time-based lag (most useful for applications)
  write_lag,   -- Time since follower wrote to disk
  flush_lag,   -- Time since follower fsynced
  replay_lag,  -- Time since follower applied (most important!)
  -- Position-based lag
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes_behind
FROM pg_stat_replication;

-- On the follower: check if in recovery and last apply time
SELECT
  pg_is_in_recovery() AS is_replica,
  pg_last_wal_receive_lsn() AS last_received,
  pg_last_wal_replay_lsn() AS last_applied,
  pg_last_xact_replay_timestamp() AS last_apply_time,
  NOW() - pg_last_xact_replay_timestamp() AS lag_time;
```
MySQL:

```sql
-- On the replica: check Seconds_Behind_Source
SHOW REPLICA STATUS\G

-- Key fields:
-- Seconds_Behind_Source: Estimated lag in seconds
-- Relay_Log_Space: Size of relay logs (received but not applied)
-- Read_Source_Log_Pos: Position read from source
-- Exec_Source_Log_Pos: Position applied locally

-- Or query performance_schema for more detail
SELECT
  CHANNEL_NAME,
  LAST_APPLIED_TRANSACTION,
  LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP,
  TIMESTAMPDIFF(SECOND,
    LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
    NOW()) AS lag_seconds
FROM performance_schema.replication_applier_status_by_worker;
```

Don't rely solely on database-exposed metrics. Use external monitoring (Prometheus + Grafana, Datadog, etc.) that can track lag historically, correlate it with other systems, and alert even when the database itself is unreachable. Write synthetic transactions to verify end-to-end replication health.
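The "synthetic transactions" suggestion can be as simple as a heartbeat table: a monitor writes the current timestamp to the leader on a fixed interval and reads it back from each follower; the difference is end-to-end lag. A minimal sketch, assuming a `heartbeats` table you create yourself, node-postgres clients, and reasonably synchronized clocks:

```typescript
import { Client } from 'pg';

// Assumed schema (created once):
//   CREATE TABLE heartbeats (id int PRIMARY KEY, ts timestamptz NOT NULL);

// Writer: runs against the leader every second or so.
async function writeHeartbeat(leader: Client): Promise<void> {
  await leader.query(
    `INSERT INTO heartbeats (id, ts) VALUES (1, now())
     ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts`
  );
}

// Checker: runs against each follower; measures end-to-end replication lag.
async function readHeartbeatLag(follower: Client): Promise<number> {
  const res = await follower.query(
    `SELECT EXTRACT(EPOCH FROM (now() - ts)) AS lag_seconds
     FROM heartbeats WHERE id = 1`
  );
  return Number(res.rows[0]?.lag_seconds ?? Infinity);
}
```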
Replication lag creates consistency anomalies that applications must handle. Understanding these patterns helps you design resilient applications and avoid confusing bugs.
```
USER ACTION                          SYSTEM STATE
─────────────────────────────────────────────────────────────────────────────
 1. User updates profile             LEADER:   profile.bio = "New bio!"
    POST /profile/update             Response: 200 OK
                                     FOLLOWER: profile.bio = "Old bio"
                                               (not yet replicated)

 2. User views profile               LEADER:   profile.bio = "New bio!"
    GET /profile                     FOLLOWER: profile.bio = "Old bio"
    (load balanced to follower)                ← Read returns stale data!

    User sees:   "Old bio"
    User thinks: "My update didn't save!"

 3. User refreshes again             FOLLOWER: profile.bio = "New bio!"
    (finally replicated)                       (after lag catches up)

    User sees: "New bio!"

 USER EXPERIENCE: Confusing! Did my save work? Why did it take multiple refreshes?
```

The Impact on Application Design:
Applications that naively assume reads reflect recent writes will exhibit confusing bugs under replication lag. Common failure patterns include users not seeing their own updates, reads that appear to move backward in time when successive requests hit followers with different amounts of lag, and related records appearing out of order (for example, a reply becoming visible before the comment it responds to).
Development and test environments often have single-node databases with no replication lag. The application 'works' in testing but breaks in production. Always test with realistic replication lag—either use a test cluster with lag or simulate lag at the application level.
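One way to simulate lag at the application level, as the tip suggests, is a test double that delays the visibility of writes on the replica path. A rough sketch with invented interfaces:

```typescript
type Row = Record<string, unknown>;

// Test double: the "leader" view sees writes immediately; the "replica" view
// only exposes writes older than the configured artificial lag.
class LaggingReplicaStore {
  private log: { key: string; value: Row; committedAt: number }[] = [];

  constructor(private lagMs: number) {}

  write(key: string, value: Row): void {
    this.log.push({ key, value, committedAt: Date.now() });
  }

  // Leader read: always returns the latest value.
  read(key: string): Row | undefined {
    return [...this.log].reverse().find((e) => e.key === key)?.value;
  }

  // Replica read: ignores entries committed within the last `lagMs` milliseconds.
  readFromReplica(key: string): Row | undefined {
    const cutoff = Date.now() - this.lagMs;
    return [...this.log]
      .reverse()
      .find((e) => e.key === key && e.committedAt <= cutoff)?.value;
  }
}

// In a test: write a profile, then assert your code copes while
// readFromReplica() still returns the old value for the next few hundred ms.
```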
The most common consistency requirement is read-your-writes: a user should always see the effects of their own writes, even if reading from a follower. Several patterns achieve this guarantee.
```typescript
// Small helper used by the polling loop below
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// After a write, capture the leader's current position
async function updateProfile(userId: string, newBio: string): Promise<WriteResult> {
  await db.leader.query(
    'UPDATE profiles SET bio = $1 WHERE user_id = $2',
    [newBio, userId]
  );

  // Get the current WAL position from the leader
  const result = await db.leader.query('SELECT pg_current_wal_lsn() AS lsn');
  const writePosition = result.rows[0].lsn;

  return { success: true, consistencyToken: writePosition };
}

// On subsequent reads, ensure the follower has caught up
async function getProfile(
  userId: string,
  consistencyToken?: string
): Promise<Profile> {
  const replica = selectReplica();

  if (consistencyToken) {
    try {
      // Wait for the replica to catch up to the write position
      await waitForReplicaPosition(replica, consistencyToken, {
        timeout: 5000,   // Give up after 5 seconds
        pollInterval: 50
      });
    } catch {
      // Replica did not catch up in time: fall back to reading from the leader
      return await db.leader.query(
        'SELECT * FROM profiles WHERE user_id = $1',
        [userId]
      );
    }
  }

  return await replica.query(
    'SELECT * FROM profiles WHERE user_id = $1',
    [userId]
  );
}

async function waitForReplicaPosition(
  replica: Connection,
  targetLsn: string,
  options: { timeout: number, pollInterval: number }
): Promise<void> {
  const startTime = Date.now();

  while (Date.now() - startTime < options.timeout) {
    const result = await replica.query(
      'SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn AS caught_up',
      [targetLsn]
    );

    if (result.rows[0].caught_up) {
      return; // Replica is current enough
    }

    await sleep(options.pollInterval);
  }

  throw new Error('Replica did not catch up in time');
}
```

The consistency token must travel with the user's session. Store it in a cookie, include it in API responses, or maintain it in client-side state. The client passes it back on subsequent requests so the server can enforce consistency.
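For example, a web tier might round-trip the token in a response header or cookie. The sketch below assumes Express-style handlers and the updateProfile/getProfile functions above; the header name 'x-consistency-token' is invented for illustration.

```typescript
import express from 'express';

const app = express();
app.use(express.json());

// Write path: return the consistency token so the client can echo it back later.
app.post('/profile', async (req, res) => {
  const { userId, bio } = req.body;
  const result = await updateProfile(userId, bio);
  // Could equally be set as a cookie; the header name here is made up.
  res.setHeader('x-consistency-token', result.consistencyToken);
  res.json({ ok: true });
});

// Read path: honor the token if the client supplies one.
app.get('/profile/:userId', async (req, res) => {
  const token = req.header('x-consistency-token');
  const profile = await getProfile(req.params.userId, token ?? undefined);
  res.json(profile);
});
```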
While some lag is inherent to asynchronous replication, many causes of excessive lag can be addressed through infrastructure and configuration improvements.
| Strategy | How It Helps | Trade-offs |
|---|---|---|
| Faster Follower Storage | Faster apply I/O reduces apply time | Cost of premium storage (NVMe SSDs) |
| More Follower CPU/Memory | Faster query processing, larger buffer pool | Higher infrastructure costs |
| Parallel Apply | Multiple threads apply non-conflicting transactions | Complexity; not all DBs support; ordering constraints |
| Reduce Follower Read Load | Less competition between apply and reads | Need more followers or adjust read routing |
| Optimize Large Transactions | Smaller batches apply faster | Application changes; more round trips |
| Network Priority/QoS | Ensure replication traffic isn't starved | Network configuration complexity |
| Dedicated Replication Network | Separate replication from client traffic | Additional network infrastructure |
| Compression | Less data to transfer for wide-area replication | CPU overhead; not helpful for local |
Relevant tuning parameters vary by database:

- PostgreSQL: wal_receiver_status_interval, max_standby_streaming_delay, recovery_min_apply_delay
- MySQL: replica_parallel_workers, replica_parallel_type = LOGICAL_CLOCK, replica_preserve_commit_order
- MongoDB: replWriterThreadCount, replBatchLimitBytes, storage engine settings
```
                 IS LAG CONSISTENTLY HIGH?
                           │
                 ┌─────────┴─────────┐
                YES                  NO (spiky)
                 │                    │
                 ▼                    ▼
        Check:                Check:
        - Follower hardware   - Large transactions?
        - Apply rate vs       - DDL operations?
          leader write rate   - Slow queries blocking?
                 │            - Traffic bursts?
                 ▼                    │
        If apply rate <               ▼
        write rate:           Address root cause:
        Follower can't keep   - Break up transactions
        up—scale hardware     - Schedule DDL off-peak
        or enable parallel    - Kill long queries
        apply                 - Scale for bursts
                 │
                 ▼
        If apply rate ≈ write rate
        but still lag:
                 │
                 ▼
        Check network:
        - Bandwidth saturated?
        - Latency spikes?
        - Packet loss?
```

Not all applications require minimal lag. Analytics dashboards can tolerate minutes of lag. Batch processing jobs can use dedicated followers with intentionally delayed replication. Match your lag tolerance to actual requirements—over-engineering for sub-second lag on a reporting replica wastes resources.
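Matching lag tolerance to requirements often turns into a routing rule: user-facing reads go to replicas within a tight lag budget (falling back to the leader otherwise), while reporting queries accept much more. A sketch with invented types; the thresholds are illustrative, not recommendations.

```typescript
interface Replica {
  name: string;
  role: 'user-facing' | 'reporting';
  lagSeconds: number; // fed by your monitoring pipeline
  query<T>(sql: string, params?: unknown[]): Promise<T>;
}

// Per-role lag budgets (example values only).
const LAG_BUDGET_SECONDS: Record<Replica['role'], number> = {
  'user-facing': 1,
  'reporting': 300,
};

// Pick the least-lagged replica of the requested role that is within budget;
// returning undefined signals "fall back to the leader".
function pickReplica(replicas: Replica[], role: Replica['role']): Replica | undefined {
  return replicas
    .filter((r) => r.role === role && r.lagSeconds <= LAG_BUDGET_SECONDS[role])
    .sort((a, b) => a.lagSeconds - b.lagSeconds)[0];
}
```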
While we've focused on minimizing lag, there's a valuable pattern that intentionally maintains lag: delayed replicas (also called time-delayed or lag replicas). These are followers configured to apply changes only after a specified delay.
If someone accidentally runs DELETE FROM users or DROP TABLE orders, the delayed replica still has the data for the configured delay period (e.g., 1 hour). You can fail over to it or extract the missing data.
```
-- PostgreSQL: Configure a standby with intentional delay
-- In recovery.conf or postgresql.auto.conf on the standby:
recovery_min_apply_delay = '1h'  -- 1 hour delay

-- MySQL: Configure replica delay
-- On the replica:
STOP REPLICA;
CHANGE REPLICATION SOURCE TO SOURCE_DELAY = 3600;  -- 1 hour in seconds
START REPLICA;

-- Verify delay is active
SHOW REPLICA STATUS\G
-- Look for: SQL_Delay: 3600

// MongoDB: Use a slaveDelay member
rs.add({
  host: "delayed-replica:27017",
  priority: 0,      // Never become primary
  hidden: true,     // Not visible to clients
  slaveDelay: 3600  // 1 hour delay (deprecated name; use secondaryDelaySecs in newer versions)
});
```

Delayed Replica Design Considerations:
Delayed replicas complement, not replace, regular backups and point-in-time recovery. They provide a faster recovery path for recent accidents but don't protect against all failure modes. Use both strategies together for comprehensive protection.
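To make the recovery path concrete: when a destructive statement is noticed within the delay window, the first step is to pause apply on the delayed replica before it replays the mistake, then copy the affected data out. A hedged sketch for PostgreSQL (pg_wal_replay_pause() and pg_wal_replay_resume() are standby functions; the surrounding orchestration and names are invented, and the table name is assumed to be trusted input):

```typescript
import { Client } from 'pg';

// Freeze the delayed replica so the destructive statement is never applied there,
// then extract the still-intact rows for restoration on the primary.
async function rescueFromDelayedReplica(delayedReplica: Client, table: string) {
  // Stop WAL replay immediately; the replica keeps serving reads of its current state.
  await delayedReplica.query('SELECT pg_wal_replay_pause()');

  // Copy out the data that was deleted upstream. For large tables you would
  // normally use pg_dump or COPY; a plain SELECT keeps the sketch short.
  const rescued = await delayedReplica.query(`SELECT * FROM ${table}`);
  return rescued.rows;

  // Once restoration is complete, resume replay with:
  //   SELECT pg_wal_replay_resume();
}
```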
We've completed our deep exploration of replication lag—the final topic in leader-follower replication. Let's consolidate the key insights from this page and reflect on the entire module.
Across five pages, we've thoroughly examined the leader-follower replication model:
Single Leader Accepts Writes — All writes flow through one node, creating a total ordering and eliminating write conflicts.
Followers Replicate from Leader — Log-based replication (physical or logical) propagates changes to followers, which apply them to maintain synchronized copies.
Synchronous vs. Asynchronous — The fundamental trade-off between durability guarantees (sync) and write latency (async), with semi-synchronous modes offering balance.
Failover Handling — Detecting failures, electing new leaders, preventing split-brain, and managing the transition without data loss.
Replication Lag — Understanding, measuring, and mitigating the delay between leader and follower states.
Leader-follower replication is the dominant model in production databases because it provides a good balance of consistency, availability, and operability. It's not the only model—leaderless and multi-leader approaches offer different trade-offs—but it's the one you'll encounter most often and the foundation for understanding more advanced replication topologies.
Congratulations! You've mastered leader-follower replication—the backbone of most production database systems. You understand how writes flow through the leader, how followers maintain synchronized copies, the trade-offs between synchronous and asynchronous modes, how failover works, and how to manage replication lag. This knowledge applies across PostgreSQL, MySQL, MongoDB, and virtually every major database system.