Imagine you're an engineer at a global collaboration platform—something like Google Docs or Notion. Users across Tokyo, London, and New York are simultaneously editing the same document. With single-leader replication, every keystroke from Tokyo must travel to the leader in Virginia, be processed, and replicate back—introducing latency that transforms real-time collaboration into a frustrating experience.
Multi-leader replication emerges as a solution to this fundamental constraint. Instead of funneling all writes through a single node, what if multiple nodes—each strategically positioned—could independently accept writes? Each datacenter, each region, or even each client could have its own leader, processing writes locally with sub-millisecond latency.
But this architectural freedom comes at a profound cost: the same data might be modified simultaneously in multiple places, creating conflicts that don't exist in single-leader systems. Understanding multi-leader replication means understanding both its liberating possibilities and its inherent complexities.
By the end of this page, you will deeply understand: (1) Why single-leader replication becomes insufficient at global scale, (2) The fundamental architecture of multi-leader systems, (3) How writes are processed and propagated across multiple leaders, (4) The topologies that connect leaders together, and (5) The trade-offs you accept when choosing multi-leader replication.
Before we architect multi-leader systems, we must deeply understand why single-leader replication becomes insufficient. This isn't merely about scale—it's about fundamental physics and user experience.
The Speed of Light Problem:
Data travels through fiber optic cables at approximately two-thirds the speed of light—roughly 200,000 km/s. A round trip between Tokyo and Virginia (approximately 11,000 km each way) therefore takes at least 110ms even in theory. In practice, routing, processing, and network overhead extend this to 150-250ms.
For a single keystroke in a collaborative document, the keystroke must travel from Tokyo to the leader in Virginia, be validated and applied there, and the acknowledgment must travel back before the editor can confirm the change. Total perceived latency: 150-250ms per keystroke. This is noticeable and frustrating for real-time applications.
| Route | Physical Distance | Theoretical Minimum RTT | Typical Real-World RTT |
|---|---|---|---|
| New York ↔ London | 5,500 km | 55ms | 70-90ms |
| New York ↔ Tokyo | 11,000 km | 110ms | 150-200ms |
| London ↔ Sydney | 17,000 km | 170ms | 250-300ms |
| São Paulo ↔ Singapore | 16,000 km | 160ms | 280-350ms |
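The arithmetic behind these numbers is simple. Below is a minimal sketch, assuming light propagates through fiber at roughly 200,000 km/s; real links add routing and processing overhead on top of this theoretical floor.

```typescript
// Theoretical minimum round-trip time over fiber, ignoring routing and processing.
const FIBER_SPEED_KM_PER_S = 200_000; // ~2/3 the speed of light in a vacuum

function minRoundTripMs(distanceKm: number): number {
  return (2 * distanceKm / FIBER_SPEED_KM_PER_S) * 1000;
}

console.log(minRoundTripMs(5_500));  // New York <-> London: 55ms
console.log(minRoundTripMs(11_000)); // New York <-> Tokyo: 110ms
console.log(minRoundTripMs(17_000)); // London <-> Sydney: 170ms
```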
The Availability Problem:
With single-leader replication, the leader is a single point of failure for write operations. If the leader becomes unavailable (hardware failure, a network partition, planned maintenance, a datacenter outage), then all writes globally are blocked until either the leader recovers or a new leader is elected. Failover can take seconds to minutes depending on detection mechanisms and consensus protocols. During this window, users experience write unavailability.
The Datacenter Operations Problem:
Organizations operating globally face operational constraints: datacenters must be taken offline for planned maintenance, entire datacenters occasionally fail, and these events should not block writes for users everywhere else. Single-leader replication, by design, cannot address these requirements while maintaining consistency.
Multi-leader replication trades write latency and availability for complexity and potential conflicts. Before adopting it, ask: Is the latency improvement critical for my use case? Can my application tolerate and resolve conflicts? The answer isn't always yes.
Multi-leader replication (also known as multi-master or active-active replication) allows multiple nodes to accept write operations independently. Each leader processes writes locally, then asynchronously propagates those writes to other leaders.
The Core Architecture:
In a multi-leader setup, each datacenter (or region) runs its own leader. Every leader accepts writes from nearby clients, applies them locally, and serves its own followers; the leaders then exchange their writes with one another asynchronously, each effectively acting as a follower of the others.
Key Characteristics:
1. Write Locality: Users write to their nearest leader. A user in Munich writes to the Frankfurt leader; a user in San Francisco writes to the Virginia leader. This reduces write latency from 150-300ms (cross-continental) to 10-30ms (regional).
2. Asynchronous Replication Between Leaders: Leader-to-leader replication is asynchronous. The Frankfurt leader doesn't wait for Virginia or Tokyo to acknowledge before confirming a write to the user. This is critical—synchronous cross-continental replication would reintroduce the latency we're trying to avoid.
3. Conflict Possibility: Because leaders accept writes independently and asynchronously, the same data might be modified at multiple leaders before those modifications propagate. This creates write conflicts that must be detected and resolved.
4. Eventual Convergence: Despite conflicts, all leaders must eventually converge to the same state. The system must have mechanisms to ensure that, given time without new writes, all replicas become identical.
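To make characteristics 1 and 2 concrete, here is a minimal sketch of a leader that applies writes locally, acknowledges immediately, and ships them to its peers in the background. All names (Leader, acceptWrite, receiveReplicated) are illustrative, not any particular database's API, and conflict handling is deliberately omitted.

```typescript
interface Write {
  key: string;
  value: string;
  originLeader: string;
  sequence: number; // position in the origin leader's replication log
}

class Leader {
  private log: Write[] = [];
  private store = new Map<string, string>();
  private peers: Leader[] = [];

  constructor(private id: string) {}

  addPeer(peer: Leader): void {
    this.peers.push(peer);
  }

  // Characteristics 1 & 2: apply locally, acknowledge immediately,
  // and ship the write to other leaders asynchronously afterwards.
  acceptWrite(key: string, value: string): void {
    const write: Write = { key, value, originLeader: this.id, sequence: this.log.length };
    this.store.set(key, value);
    this.log.push(write);
    // The user is acknowledged at this point; propagation happens in the background.
    setTimeout(() => this.peers.forEach(peer => peer.receiveReplicated(write)), 0);
  }

  // Characteristics 3 & 4: incoming writes may conflict with local ones.
  // A real system would run conflict detection here before applying.
  receiveReplicated(write: Write): void {
    this.store.set(write.key, write.value); // naive apply, no conflict handling yet
  }

  read(key: string): string | undefined {
    return this.store.get(key);
  }
}

// Usage: two leaders replicating to each other, write accepted locally in Tokyo.
const tokyo = new Leader('tokyo');
const frankfurt = new Leader('frankfurt');
tokyo.addPeer(frankfurt);
frankfurt.addPeer(tokyo);
tokyo.acceptWrite('balance:alice', '5000'); // acknowledged locally, replicated asynchronously
```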
You'll encounter various terms: multi-leader, multi-master, active-active, and multi-primary. These all describe the same fundamental architecture. 'Active-active' emphasizes that all leaders actively accept writes, as opposed to 'active-passive' where only one node accepts writes.
Understanding how writes flow through a multi-leader system is essential for reasoning about conflicts and consistency. Let's trace a write operation from user request to global propagation.
The Write Lifecycle in Multi-Leader Systems: a write is submitted to the user's nearest leader, validated and applied there, appended to that leader's replication log, and acknowledged to the user. Only then is it asynchronously propagated to the other leaders, which apply it, check for conflicts, and pass it on to their own followers. The timeline below traces this step by step.
Replication Lag and Its Implications:
The time between a write being acknowledged at one leader and becoming visible at another is called replication lag. In multi-leader systems, this lag varies with the network distance between leaders, how writes are batched for transmission, and the load on the sending and receiving leaders. Typical inter-leader replication lag ranges from 100ms to several seconds. During this window, a read served by one leader can return a different value than a read served by another, and concurrent writes to the same record can conflict without either leader knowing it yet.
```
Timeline: Multi-Leader Write Propagation

T=0ms    User in Tokyo submits UPDATE balance = 5000 WHERE user_id = 'alice'
T=2ms    Request received by Tokyo Leader
T=5ms    Tokyo Leader validates and applies write locally
T=7ms    Write appended to Tokyo's replication log
T=10ms   User receives acknowledgment (write complete from user's view)

--- Asynchronous propagation begins ---

T=15ms   Replication batch prepared for transmission
T=20ms   Batch transmitted to Frankfurt and Virginia Leaders

T=180ms  Frankfurt Leader receives batch (160ms network latency)
T=185ms  Frankfurt applies write, checks for conflicts
T=190ms  Frankfurt's followers begin replicating the change

T=220ms  Virginia Leader receives batch (200ms network latency)
T=225ms  Virginia applies write, checks for conflicts
T=230ms  Virginia's followers begin replicating the change

T=250ms  Write is visible globally across all leaders and followers
```

The asynchronous nature of inter-leader replication means you cannot provide 'read-your-writes' consistency across leaders by default. Users should be routed consistently to the same leader, or applications must implement session affinity, to avoid reading their own stale data.
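One common mitigation is session affinity: pin each user (or session) to a 'home' leader and send both their reads and writes there. A minimal sketch, assuming a hypothetical list of leader endpoints and a simple stable hash; a real router would also weigh geographic proximity and persist the assignment in the session.

```typescript
// Hypothetical leader endpoints; in practice these come from service discovery.
const LEADERS = ['tokyo.db.example.com', 'frankfurt.db.example.com', 'virginia.db.example.com'];

// Stable, non-cryptographic hash so the same user always maps to the same leader.
function stableHash(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Route both reads and writes for a user to the same leader, giving that user
// read-your-writes behavior despite asynchronous inter-leader replication.
function homeLeaderFor(userId: string): string {
  return LEADERS[stableHash(userId) % LEADERS.length];
}

console.log(homeLeaderFor('alice')); // Always the same endpoint for 'alice'
```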
When multiple leaders must exchange writes, the topology—how leaders are connected—becomes a critical architectural decision. Different topologies offer different trade-offs in fault tolerance, latency, and complexity.
The Three Primary Topologies: circular (ring), where each leader forwards writes to one neighbor around a loop; star, where one designated leader relays writes between all the others; and all-to-all (mesh), where every leader sends its writes directly to every other leader.
All-to-All (Mesh) Topology:
The most robust and commonly used topology for production multi-leader deployments.
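To make the topologies concrete, the sketch below computes which peers a leader forwards writes to under each topology. The leader names and the peersFor helper are illustrative, not drawn from any specific system.

```typescript
type Topology = 'all-to-all' | 'ring' | 'star';

function peersFor(leader: string, leaders: string[], topology: Topology, hub: string = leaders[0]): string[] {
  if (topology === 'all-to-all') {
    // Every leader sends its writes directly to every other leader.
    return leaders.filter(l => l !== leader);
  }
  if (topology === 'ring') {
    // Each leader forwards writes (its own plus relayed ones) to the next leader in the loop.
    return [leaders[(leaders.indexOf(leader) + 1) % leaders.length]];
  }
  // Star: non-hub leaders talk only to the hub, which relays to everyone else.
  return leader === hub ? leaders.filter(l => l !== hub) : [hub];
}

const leaders = ['tokyo', 'frankfurt', 'virginia'];
console.log(peersFor('frankfurt', leaders, 'all-to-all')); // ['tokyo', 'virginia']
console.log(peersFor('frankfurt', leaders, 'ring'));       // ['virginia'] (next in the loop)
console.log(peersFor('frankfurt', leaders, 'star'));       // ['tokyo'] (the hub)
```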
Handling Message Ordering in All-to-All:
In all-to-all topologies, a critical problem arises: writes can arrive out of order. Consider a client that creates a record at Leader A and immediately updates it, producing two writes that Leader A sends to Leaders B and C. Due to network variability, Leader B might receive them in order (create, then update), while Leader C receives them out of order (the update arrives before the create). Applying an update to a non-existent record fails or causes undefined behavior.
Solutions attach causal metadata to each write, such as version vectors or per-record sequence numbers, so a leader can recognize that an incoming write depends on one it has not yet seen and hold it back until that earlier write arrives, as the sketch below illustrates.
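Here is a minimal sketch of that hold-back idea, assuming each replicated write carries the ID of the write it causally depends on; the field names and structure are illustrative.

```typescript
interface ReplicatedWrite {
  id: string;
  dependsOn: string | null; // ID of the write this one causally follows, if any
  apply: () => void;
}

const applied = new Set<string>();
const waiting = new Map<string, ReplicatedWrite[]>(); // dependency ID -> writes blocked on it

function receive(write: ReplicatedWrite): void {
  if (write.dependsOn !== null && !applied.has(write.dependsOn)) {
    // The create (or earlier update) hasn't arrived yet: buffer instead of failing.
    const queue = waiting.get(write.dependsOn) ?? [];
    queue.push(write);
    waiting.set(write.dependsOn, queue);
    return;
  }
  write.apply();
  applied.add(write.id);
  // Release any writes that were waiting on this one.
  for (const blocked of waiting.get(write.id) ?? []) {
    receive(blocked);
  }
  waiting.delete(write.id);
}
```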
Most production multi-leader systems (MySQL Group Replication, PostgreSQL BDR, CockroachDB, Spanner) use all-to-all topology with sophisticated conflict detection and ordering mechanisms. The added complexity is justified by the fault tolerance and latency benefits.
Unlike single-leader systems where the leader serializes all writes, multi-leader systems can have concurrent, conflicting writes. Detecting these conflicts is the first step toward resolving them.
What Constitutes a Conflict?
A conflict occurs when two or more leaders independently modify the same data in ways that cannot be trivially merged. The classic example: two users edit the same record at the same moment from different regions, so Leader A applies one new value while Leader B applies a different one. When these writes propagate, each leader receives a remote change to a record it has just modified locally, and the leaders now hold different values for the same record. Which value should win? This is the conflict that must be detected and resolved.
Conflict Detection Mechanisms:
1. Version Vectors / Vector Clocks:
Each record carries a version vector, such as [Leader_A: 3, Leader_B: 5, Leader_C: 2], indicating the latest version from each leader that has been incorporated. When a write arrives, the system compares version vectors: if the incoming vector dominates the local one (every component is greater than or equal), the write is simply newer and is applied; if the local vector dominates, the incoming write is stale and is ignored; if neither dominates, the writes were concurrent and a conflict is flagged.
2. Conflict-Free Reads for Conflict Detection:
Some systems defer conflict detection until read time. Multiple concurrent writes are stored as siblings. When a client reads, it receives all siblings and must resolve them (application-level resolution).
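A small sketch of the sibling approach, assuming a shopping-cart-style value where set union is an acceptable application-level merge; the in-memory store and API shown are illustrative, not a real client library.

```typescript
type VersionVector = { [leaderId: string]: number };

interface Sibling<T> {
  value: T;
  version: VersionVector; // tells the client which writes each sibling has seen
}

// The store keeps every concurrent value ("sibling") for a key instead of picking a winner.
const cartSiblings = new Map<string, Sibling<Set<string>>[]>();

// On read, the client receives all siblings and must merge them itself.
// For a shopping cart, a reasonable application-level merge is the set union.
function readCart(key: string): Set<string> {
  const merged = new Set<string>();
  for (const sibling of cartSiblings.get(key) ?? []) {
    sibling.value.forEach(item => merged.add(item));
  }
  return merged;
}
```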
3. Row-Level Change Tracking:
Simpler systems track only the last-modified timestamp or sequence number. They cannot detect true concurrency—only which write appeared 'later'. This enables Last-Write-Wins but cannot support more sophisticated resolution.
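A minimal sketch of Last-Write-Wins on top of such timestamps; note that the losing write is silently discarded, which is exactly the data-loss risk discussed later on this page. Contrast this with the version-vector comparison shown next, which can actually tell concurrent writes apart.

```typescript
interface TimestampedWrite<T> {
  value: T;
  timestampMs: number; // last-modified time recorded by the originating leader
  leaderId: string;    // used only to break exact timestamp ties deterministically
}

function lastWriteWins<T>(a: TimestampedWrite<T>, b: TimestampedWrite<T>): TimestampedWrite<T> {
  if (a.timestampMs !== b.timestampMs) {
    // Whichever write carries the later timestamp wins; clock skew between
    // leaders can therefore discard a write that was actually made later.
    return a.timestampMs > b.timestampMs ? a : b;
  }
  // Deterministic tie-break so every leader converges on the same value.
  return a.leaderId > b.leaderId ? a : b;
}
```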
```typescript
interface VersionVector {
  [leaderId: string]: number;
}

interface VersionedRecord<T> {
  data: T;
  version: VersionVector;
}

function detectConflict<T>(
  local: VersionedRecord<T>,
  incoming: VersionedRecord<T>
): 'incoming-wins' | 'local-wins' | 'conflict' {
  const localDominates = doesDominate(local.version, incoming.version);
  const incomingDominates = doesDominate(incoming.version, local.version);

  if (incomingDominates && !localDominates) {
    return 'incoming-wins'; // Simple update, apply incoming
  }
  if (localDominates && !incomingDominates) {
    return 'local-wins'; // Stale write, ignore incoming
  }
  return 'conflict'; // Concurrent modifications, resolve!
}

function doesDominate(a: VersionVector, b: VersionVector): boolean {
  // a dominates b if every component of a >= corresponding component of b
  const allKeys = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const key of allKeys) {
    if ((a[key] || 0) < (b[key] || 0)) {
      return false;
    }
  }
  return true;
}

// Example: Concurrent conflict
const localRecord: VersionedRecord<string> = {
  data: 'Modified at Leader A',
  version: { 'leader-a': 5, 'leader-b': 3 }
};

const incomingWrite: VersionedRecord<string> = {
  data: 'Modified at Leader B',
  version: { 'leader-a': 4, 'leader-b': 4 }
};

console.log(detectConflict(localRecord, incomingWrite));
// Output: 'conflict'
// B has higher component than A (4 > 3), A has higher component than B (5 > 4)
```

Detecting conflicts is mechanical—comparing versions is algorithmic. The hard part is resolution: determining what the 'correct' value should be when two users simultaneously made incompatible changes. We'll explore resolution strategies in depth in subsequent pages.
Multi-leader replication is not a universal solution. It introduces significant complexity and is appropriate only for specific use cases. Understanding these trade-offs is essential for making informed architectural decisions.
| Dimension | Single-Leader | Multi-Leader |
|---|---|---|
| Write Latency | High for distant users (100-300ms) | Low for all users (10-30ms) |
| Write Availability | Blocked during leader failover | Available if any leader is reachable |
| Consistency Model | Strongly consistent writes | Eventually consistent, conflicts possible |
| Operational Complexity | Simple, well-understood | Complex conflict resolution required |
| Application Complexity | Standard CRUD patterns work | Must handle/design for conflicts |
| Read-Your-Writes | Guaranteed naturally | Requires session affinity |
| Debugging | Straightforward linear history | Complex branching history |
When Multi-Leader Is Appropriate: globally distributed applications where write latency for far-away users matters, deployments that must keep accepting writes through a datacenter outage or planned maintenance, and workloads where concurrent writes to the same records are rare or can be merged sensibly by the application.
When Multi-Leader Is Problematic: workloads that depend on strong consistency or uniqueness guarantees (account balances, inventory counts, unique usernames), applications where concurrent writes to the same records are frequent, and teams that cannot invest in designing, testing, and monitoring conflict resolution.
Even with automated conflict resolution (like Last-Write-Wins), conflicts represent data loss. One user's write is discarded. Systems that seem to 'work' might be silently losing data under concurrent writes. Before choosing multi-leader, quantify your expected conflict rate and the business impact of each conflicted write.
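As a purely illustrative way to quantify that risk, the sketch below estimates the chance that a given write conflicts, under the simplifying assumption that writes to a hot key arrive as a Poisson process and that any other write landing within one replication-lag window of it (at a different leader) would conflict.

```typescript
// Rough, assumption-laden model: writes to one hot key arrive as a Poisson
// process at `writesPerSecond`, and any other write within +/- one replication
// lag of a given write is treated as a potential conflict.
function estimatedConflictProbability(
  writesPerSecond: number,       // total write rate for the key across all leaders
  replicationLagSeconds: number  // typical inter-leader replication lag
): number {
  return 1 - Math.exp(-2 * writesPerSecond * replicationLagSeconds);
}

// Example: 0.5 writes/sec to a hot key with 300ms inter-leader lag
console.log(estimatedConflictProbability(0.5, 0.3).toFixed(2)); // ~0.26, roughly 1 in 4 writes
```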
We've established the foundational understanding of multi-leader replication. To consolidate the key concepts: multiple leaders accept writes independently and replicate to one another asynchronously; this yields low local write latency and write availability at the cost of conflicts and eventual (rather than immediate) consistency; the topology (circular, star, or all-to-all) determines how writes propagate between leaders; and mechanisms such as version vectors are needed to distinguish truly concurrent writes from simple updates.
What's Next:
Now that we understand how multi-leader systems are architected, we'll explore the specific use cases where multi-leader replication shines. The next page examines multi-datacenter deployments—the primary production scenario for multi-leader systems—and how organizations like Netflix, Google, and Uber architect their global data platforms.
You now understand the fundamental architecture of multi-leader replication—why it exists, how writes flow through the system, and the topologies that connect leaders. Next, we'll see these concepts applied to real-world multi-datacenter deployments.