In leader-follower replication, writes have a clear destination: the leader. This simplicity comes with constraints—the leader becomes a bottleneck, a single point of failure, and a source of geographic latency.
Leaderless systems take a radically different approach: any node can accept a write at any time. There's no designated entry point, no special node that must be available, no single location that all clients must reach.
This architectural freedom unlocks remarkable capabilities—but it requires sophisticated coordination mechanisms to ensure data integrity. On this page, we'll examine exactly how writes work in leaderless systems, from client interaction to replica coordination.
By the end of this page, you will understand the mechanics of distributed writes: how clients interact with multiple replicas, the role of coordinators, how nodes discover each other, and how the system maintains coherence when writes can arrive at any node in the cluster.
When a client wants to write data in a leaderless system, the process is fundamentally different from traditional databases. Instead of a direct connection to a single authoritative server, the client must interact with multiple replicas to achieve durability.
Two primary write strategies:
Leaderless systems typically implement one of two approaches for handling writes:
1. Client-Coordinated Writes
The client itself takes responsibility for sending the write to multiple replicas. The client library understands the cluster topology and directly communicates with the appropriate nodes.
Client Application
↓
[Client Library] ← Knows which replicas store which keys
↓ ↓ ↓
Node A Node B Node C ← Client writes to all in parallel
2. Coordinator-Based Writes
The client sends the write to any node, which then acts as a coordinator. The coordinator forwards the write to all appropriate replicas.
Client Application
↓
Node B ← Acts as coordinator for this request
↓ ↓ ↓
Node A Node B Node C ← Coordinator forwards to replicas
| Aspect | Client-Coordinated | Coordinator-Based |
|---|---|---|
| Client complexity | Higher — must understand topology | Lower — just connect to any node |
| Network hops | Fewer — direct to replicas | More — extra hop through coordinator |
| Hotspot potential | Lower — clients spread load naturally | Higher — popular nodes may become coordinators |
| Failure handling | Client must retry appropriately | Coordinator handles retry logic |
| Topology changes | Clients must stay updated | Clients unaware of topology |
| Used by | Cassandra (with DataStax drivers) | Riak, Voldemort, early Dynamo |
Modern systems often use hybrid approaches. Cassandra's drivers, for example, maintain awareness of the cluster topology (like client-coordinated) but send requests to a coordinator node that handles replication (like coordinator-based). This combines the benefits of both approaches: smart routing with simplified client logic.
In a leaderless system, we must answer a fundamental question: for any given key, which nodes should store replicas of its data? The answer typically involves consistent hashing combined with a replication factor.
Consistent hashing for replica placement:
Consistent hashing creates a logical ring where both nodes and keys are mapped to positions:
Example with replication factor 3:
Key hashes to
position X
↓
┌──────────────┐
│ │
Node A Node B ← 1st replica
│ │
│ Ring │
│ │
Node D Node C ← 2nd and 3rd replicas
│ │
└──────────────┘
Key X is stored on: Node B, Node C, Node D
Virtual nodes (vnodes) for better distribution:
Simple consistent hashing can create uneven load distribution. If nodes are randomly placed on the ring, some nodes may claim larger portions than others. Virtual nodes solve this by assigning multiple positions to each physical node, so each node's share of the ring averages out.
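A minimal sketch of consistent hashing with virtual nodes: each physical node claims several ring positions, and replicas for a key are the first RF distinct nodes found walking clockwise from the key's hash. The MD5-based ring, node names, and `HashRing` API are illustrative, not any particular database's implementation:

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    # Map any string to a position on a 2^32 ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node claims several positions ("virtual nodes")
        # so load spreads evenly even with few physical nodes.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def replicas(self, key: str, rf: int = 3):
        # Walk clockwise from the key's position, collecting distinct
        # physical nodes until we have RF of them.
        start = bisect_right(self.positions, _hash(key))
        owners = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == rf:
                break
        return owners

ring = HashRing(["A", "B", "C", "D"])
print(ring.replicas("user:42"))   # three distinct nodes, stable for this key
```

Because the walk skips virtual nodes belonging to a physical node already chosen, the RF copies always land on distinct machines.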
Replication factor (RF) determines how many copies of data exist. Consistency level (CL) determines how many replicas must participate in a read or write. These are independent settings: RF=3 with write CL=1 means 3 replicas exist but writes succeed after 1 acknowledgment. Understanding this distinction is crucial for configuring leaderless systems.
When a write arrives at a coordinator node (or when a client sends writes directly to replicas), a carefully orchestrated sequence of operations ensures durability and consistency. Let's trace through this process in detail.
Phase 1: Identify Target Replicas
The coordinator first determines which nodes should receive the write by hashing the key and walking the ring to find its replicas, as described in the replica placement discussion above.
Phase 2: Dispatch Writes in Parallel
The coordinator sends write requests to all identified replicas simultaneously:
Coordinator dispatches:
→ Node A: "Write key=X, value=V, timestamp=T"
→ Node B: "Write key=X, value=V, timestamp=T"
→ Node C: "Write key=X, value=V, timestamp=T"
(All three dispatched in parallel)
Phase 3: Await Acknowledgments
The coordinator waits for responses based on the configured write consistency level:
| Consistency Level | Replicas Required | Trade-off | Use Case |
|---|---|---|---|
| ONE | 1 of N | Fast but risky — data exists on single node until anti-entropy | Logging, metrics, fire-and-forget writes |
| TWO | 2 of N | Moderate durability, still fast | Session data, caches |
| QUORUM | ⌊N/2⌋ + 1 | Guarantees overlap with quorum reads for consistency | Most production workloads |
| LOCAL_QUORUM | Quorum in local DC | Fast writes with local durability | Multi-DC deployments |
| EACH_QUORUM | Quorum in each DC | Strong multi-DC guarantees, highest latency | Finance, compliance data |
| ALL | N of N | Maximum durability but any failure blocks write | Critical data requiring certainty |
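The quorum arithmetic in the table can be checked in a couple of lines (a sketch; the function names are ours, not a library API):

```python
def quorum(n: int) -> int:
    # Majority of N replicas: floor(N/2) + 1
    return n // 2 + 1

def overlaps(n: int, w: int, r: int) -> bool:
    # A write quorum and a read quorum intersect in at least one replica
    # whenever W + R > N, so a quorum read sees the latest quorum write.
    return w + r > n

n = 3
w = r = quorum(n)
print(w, overlaps(n, w, r))   # 2 True
```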
Phase 4: Handling Partial Failures
What happens when some replicas acknowledge but others don't?
The coordinator must make a decision based on the consistency level:
Scenario: Write to RF=3, CL=QUORUM, Node C is slow/unreachable
→ Node A: ACK received in 2ms ✓
→ Node B: ACK received in 3ms ✓
→ Node C: Timeout after 500ms ✗
Result: Write succeeds (2/3 = quorum met)
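The dispatch-and-await flow of Phases 2-4 can be sketched with a thread pool. The replica RPC is simulated (`send_write`, the node names, and the timings are all hypothetical), but the shape is the same: fan out in parallel, count acknowledgments, and treat replicas that miss the timeout as failures for this request:

```python
import concurrent.futures as cf
import time

def send_write(replica, key, value, ts):
    # Simulated replica RPC: node "C" is slow and misses the timeout.
    if replica == "C":
        time.sleep(0.5)
    return f"{replica}:ACK"

def coordinate_write(replicas, key, value, ts, required_acks=2, timeout=0.2):
    # Phase 2: dispatch to every replica in parallel.
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(send_write, r, key, value, ts) for r in replicas]
    acks = 0
    try:
        # Phase 3: count acknowledgments as they arrive.
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.result().endswith("ACK"):
                acks += 1
            if acks >= required_acks:
                break
    except cf.TimeoutError:
        # Phase 4: slow or unreachable replicas simply don't count.
        pass
    pool.shutdown(wait=False)  # don't block on stragglers; repair covers them
    return acks >= required_acks

print(coordinate_write(["A", "B", "C"], "X", "V", ts=1000))  # True: A and B ack in time
```

Note that the coordinator returns success as soon as the quorum is met; the straggler's write, if it ever lands, is harmless, and if it never lands, hinted handoff or repair fills the gap.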
But what about Node C? It missed the write. Leaderless systems have multiple mechanisms to ensure Node C eventually gets the data:
When a node acknowledges a write, it typically means the data has been written to a commit log (durable storage) but may not yet be in the main data structures. This ensures the write survives a crash even if it hasn't been fully processed. The durability guarantee comes from the commit log, not from memory or memtables.
Without a single leader to serialize operations, leaderless systems need alternative mechanisms to establish ordering. Timestamps become the primary tool for determining which write "wins" when conflicts occur.
The challenge of distributed time:
In a distributed system, clocks are never perfectly synchronized. A write from Node A at "12:00:00.000" might actually occur after a write from Node B timestamped "12:00:00.001" due to clock skew. This creates significant challenges:
Timestamp strategies in practice:
1. Wall-Clock Timestamps (Last-Write-Wins)
Simplest approach: use the system clock and let the highest timestamp win.
Write 1: key=X, value=A, timestamp=1000
Write 2: key=X, value=B, timestamp=1001
Result: B wins (higher timestamp)
Pros: Simple, fast, deterministic.
Cons: Clock skew can cause earlier writes to win; writes can be silently lost.
2. Logical Clocks (Lamport Timestamps)
Maintain a counter that increments on each operation and on message receipt:
Logical clock rules:
1. Increment before each local event
2. On send: include current clock value
3. On receive: clock = max(local, received) + 1
Pros: Respects causality (if A→B, then A's timestamp < B's timestamp).
Cons: Does not detect concurrent events (multiple events can have the same logical clock value).
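The three rules above translate directly into a small class (an illustrative sketch, not a library API):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Rule 1: increment before each local event.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Rule 2: stamp outgoing messages with the freshly incremented clock.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Rule 3: jump past any timestamp we observe.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.send()          # a's clock: 1
t2 = b.receive(t1)     # b's clock: 2, causally after a's send
t3 = b.send()          # b's clock: 3
print(t1, t2, t3)      # 1 2 3
```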
3. Vector Clocks / Version Vectors
Each node maintains a counter for every node in the system:
[Node1=5, Node2=3, Node3=7]
Can detect causality AND concurrency
If V1 < V2 in all components: V1 happened before V2
If V1 and V2 are incomparable: concurrent events
Pros: Detects both causality and concurrency.
Cons: Metadata grows with the number of nodes; complex to manage.
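The component-wise comparison can be sketched in a few lines (the function and node names are illustrative):

```python
def compare(v1: dict, v2: dict) -> str:
    # Compare two vector clocks component by component.
    # Returns "before", "after", "equal", or "concurrent".
    nodes = set(v1) | set(v2)
    less = any(v1.get(n, 0) < v2.get(n, 0) for n in nodes)
    greater = any(v1.get(n, 0) > v2.get(n, 0) for n in nodes)
    if less and greater:
        return "concurrent"   # incomparable: neither happened before the other
    if less:
        return "before"
    if greater:
        return "after"
    return "equal"

print(compare({"n1": 5, "n2": 3}, {"n1": 6, "n2": 3}))  # before
print(compare({"n1": 5, "n2": 3}, {"n1": 4, "n2": 4}))  # concurrent
```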
Cassandra uses wall-clock timestamps for conflict resolution (last-write-wins), but adds a tiebreaker: if timestamps are identical, the write with the lexicographically larger value wins. This ensures deterministic resolution across all replicas without coordination.
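That resolution rule can be mimicked in a few lines (a sketch of the rule, not Cassandra's actual code):

```python
def lww_resolve(a, b):
    # a and b are (timestamp, value) pairs. The highest timestamp wins;
    # on a tie, the lexicographically larger value wins, so every replica
    # resolves the same conflict identically without coordination.
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if a[1] > b[1] else b

print(lww_resolve((1000, "A"), (1001, "B")))  # (1001, 'B') — higher timestamp
print(lww_resolve((1000, "A"), (1000, "B")))  # (1000, 'B') — tie, larger value
```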
In coordinator-based write architectures, any node can become a coordinator for any request. But how does the client choose which node to contact? And how does the system prevent coordinator hotspots?
Client-side coordinator selection:
Token-aware routing deep dive:
Modern leaderless database clients (like the DataStax drivers for Cassandra) use token-aware routing by default:
Benefits: requests land directly on a replica that owns the key, avoiding the extra coordinator hop and its added latency.
Challenges: the client must fetch the cluster's token map, keep it current as nodes join and leave, and fall back to other nodes when the preferred replica is down.
| Strategy | Latency | Load Balance | Complexity | Failure Handling |
|---|---|---|---|---|
| Round-robin | Variable | Even | Low | Must skip failed nodes |
| Token-aware | Optimal | Even | High | Falls back to other replicas |
| Latency-aware | Best observed | May be uneven | Medium | Automatically avoids slow nodes |
| DC-aware | Low for local DC | Even per DC | Medium | Cross-DC fallback possible |
| Combined (Token + Latency + DC) | Optimal | Even | High | Most resilient |
It's important not to confuse coordinators with leaders. A coordinator handles a specific request but has no special authority. The next request might use a completely different coordinator. There's no election, no failover—just ephemeral coordination for individual operations.
One of the key performance characteristics of leaderless systems is that writes to multiple replicas happen in parallel, not sequentially. This has significant implications for latency and throughput.
Parallel vs. sequential write latency:
In a leader-follower system with synchronous replication:
Total latency = Leader write + Follower 1 + Follower 2 (sequential)
= 2ms + 2ms + 2ms = 6ms minimum
In a leaderless system:
Total latency = max(Replica A, Replica B, Replica C) (parallel)
= max(2ms, 3ms, 2ms) = 3ms for ALL acks
= 2ms for quorum (2 of 3)
Parallel writes typically have lower latency because the replica operations proceed concurrently: total latency is bounded by the slowest required acknowledgment rather than the sum of all replica latencies.
Latency distribution characteristics:
Because leaderless writes wait for multiple replicas, write latency is determined by the slowest required replica. This creates interesting latency distribution characteristics:
p50 latency: often similar to single-node latency (typical case: all replicas responsive).
p99 latency: higher than leader-based systems (waiting on slow replicas).
Tail latency mitigation:
Several strategies reduce the impact of slow replicas: speculative (hedged) requests that duplicate a slow request to another replica and take the first response, latency-aware routing that steers traffic away from chronically slow nodes, and lower consistency levels for latency-sensitive writes.
The optimal configuration depends on your workload. Write-heavy workloads with tolerance for inconsistency can use CL=ONE for maximum throughput. Read-heavy workloads that need consistency should use CL=QUORUM for both reads and writes. There's no universally correct setting—only trade-offs.
A key advantage of leaderless systems is graceful handling of node failures. Unlike leader-based systems where the leader's failure blocks all writes, leaderless systems continue operating as long as enough replicas are available.
Failure scenarios and responses:
| Scenario | Available Nodes | Write Outcome | Data State |
|---|---|---|---|
| No failures | A, B, C (3) | Success (all receive) | Consistent across all replicas |
| One node down | A, B (2) | Success (quorum met) | Missing from C, repaired later |
| Two nodes down | A (1) | FAILURE (below quorum) | Write rejected to preserve consistency |
| Node slow, not down | A (fast), B (fast), C (slow) | Success (quorum met quickly) | C receives eventually |
| Intermittent failure | A, B (stable), C (flapping) | Success (quorum from stable) | C may miss some writes |
Hinted handoff in detail:
When a replica is unavailable during a write, the coordinator can store a "hint" — a marker that the write should be sent to that replica when it recovers:
1. Write arrives for key X (replicas: A, B, C)
2. Node C is unreachable
3. Coordinator writes to A and B successfully
4. Coordinator stores hint: "When C is available, send key=X, value=V"
5. Later, C becomes reachable
6. Coordinator (or any node holding hints) sends pending writes to C
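The bookkeeping in steps 4-6 can be sketched as a small hint store (the `HintStore` API is hypothetical):

```python
from collections import defaultdict

class HintStore:
    """Minimal sketch of hinted handoff bookkeeping (illustrative API)."""
    def __init__(self):
        self.hints = defaultdict(list)   # target node -> missed writes

    def record(self, target, key, value, ts):
        # Step 4: a replica missed the write, so stash it for later replay.
        self.hints[target].append((key, value, ts))

    def replay(self, target, send):
        # Steps 5-6: the target is reachable again, so deliver every
        # pending write and then discard the hints.
        for hint in self.hints.pop(target, []):
            send(target, *hint)

store = HintStore()
store.record("C", "X", "V", 1000)   # C was down during the write
delivered = []
store.replay("C", lambda node, k, v, t: delivered.append((node, k, v, t)))
print(delivered)   # [('C', 'X', 'V', 1000)]
```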
Hinted handoff considerations: hints are only kept for a limited window (Cassandra's default is three hours), a hinted write does not count toward the consistency level, and a node that stays down longer than the window must rely on anti-entropy repair to catch up.
With RF=3 and CL=QUORUM, losing 2 of 3 replicas for a key range makes writes to that range impossible. The system degrades gracefully—other key ranges remain writable—but affected data is unavailable. This is the fundamental availability trade-off of quorum-based systems.
We've explored the mechanics of how leaderless systems handle writes distributed across multiple nodes. Let's consolidate the key insights: writes fan out to multiple replicas in parallel, the consistency level determines how many acknowledgments constitute success, timestamps order conflicting writes, and hinted handoff plus repair bring lagging replicas back in sync.
What's next:
Now that we understand how writes flow through leaderless systems, we'll examine the Dynamo-style approach in detail. The next page explores the specific innovations from Amazon's Dynamo paper that became the template for modern leaderless databases.
You now understand the mechanics of distributed writes in leaderless systems—from coordinator selection to parallel replication to failure handling. The following pages will dive deeper into Dynamo-style protocols, quorum semantics, and conflict resolution strategies.