In leader-follower replication, writes have a clear destination: the leader. This simplicity comes with constraints—the leader becomes a bottleneck, a single point of failure, and a source of geographic latency.
Leaderless systems take a radically different approach: any node can accept a write at any time. There's no designated entry point, no special node that must be available, no single location that all clients must reach.
This architectural freedom unlocks remarkable capabilities—but it requires sophisticated coordination mechanisms to ensure data integrity. On this page, we'll examine exactly how writes work in leaderless systems, from client interaction to replica coordination.
By the end of this page, you will understand the mechanics of distributed writes: how clients interact with multiple replicas, the role of coordinators, how nodes discover each other, and how the system maintains coherence when writes can arrive at any node in the cluster.
When a client wants to write data in a leaderless system, the process is fundamentally different from traditional databases. Instead of a direct connection to a single authoritative server, the client must interact with multiple replicas to achieve durability.
Two primary write strategies:
Leaderless systems typically implement one of two approaches for handling writes:
1. Client-Coordinated Writes
The client itself takes responsibility for sending the write to multiple replicas. The client library understands the cluster topology and directly communicates with the appropriate nodes.
Client Application
↓
[Client Library] ← Knows which replicas store which keys
↓ ↓ ↓
Node A Node B Node C ← Client writes to all in parallel
2. Coordinator-Based Writes
The client sends the write to any node, which then acts as a coordinator. The coordinator forwards the write to all appropriate replicas.
Client Application
↓
Node B ← Acts as coordinator for this request
↓ ↓ ↓
Node A Node B Node C ← Coordinator forwards to replicas
| Aspect | Client-Coordinated | Coordinator-Based |
|---|---|---|
| Client complexity | Higher — must understand topology | Lower — just connect to any node |
| Network hops | Fewer — direct to replicas | More — extra hop through coordinator |
| Hotspot potential | Lower — clients spread load naturally | Higher — popular nodes may become coordinators |
| Failure handling | Client must retry appropriately | Coordinator handles retry logic |
| Topology changes | Clients must stay updated | Clients unaware of topology |
| Used by | Cassandra (with DataStax drivers) | Riak, Voldemort, early Dynamo |
Modern systems often use hybrid approaches. Cassandra's drivers, for example, maintain awareness of the cluster topology (like client-coordinated) but send requests to a coordinator node that handles replication (like coordinator-based). This combines the benefits of both approaches: smart routing with simplified client logic.
In a leaderless system, we must answer a fundamental question: for any given key, which nodes should store replicas of its data? The answer typically involves consistent hashing combined with a replication factor.
Consistent hashing for replica placement:
Consistent hashing creates a logical ring where both nodes and keys are mapped to positions:
Example with replication factor 3:
Key hashes to
position X
↓
┌──────────────┐
│ │
Node A Node B ← 1st replica
│ │
│ Ring │
│ │
Node D Node C ← 2nd and 3rd replicas
│ │
└──────────────┘
Key X is stored on: Node B, Node C, Node D
Virtual nodes (vnodes) for better distribution:
Simple consistent hashing can create uneven load distribution. If nodes are randomly placed on the ring, some nodes may claim larger portions than others. Virtual nodes solve this by assigning multiple positions to each physical node, so each node's share of the ring averages out.
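A minimal sketch of consistent hashing with virtual nodes: each physical node claims several ring positions, and replicas for a key are the first RF distinct nodes found walking clockwise from the key's hash. The MD5-based ring, node names, and `HashRing` API are illustrative, not any particular database's implementation:

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    # Map any string to a position on a 2^32 ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node claims several positions ("virtual nodes")
        # so load spreads evenly even with few physical nodes.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def replicas(self, key: str, rf: int = 3):
        # Walk clockwise from the key's position, collecting distinct
        # physical nodes until we have RF of them.
        start = bisect_right(self.positions, _hash(key))
        owners = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == rf:
                break
        return owners

ring = HashRing(["A", "B", "C", "D"])
print(ring.replicas("user:42"))   # three distinct nodes, stable for this key
```

Because the walk skips virtual nodes belonging to a physical node already chosen, the RF copies always land on distinct machines.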
Replication factor (RF) determines how many copies of data exist. Consistency level (CL) determines how many replicas must participate in a read or write. These are independent settings: RF=3 with write CL=1 means 3 replicas exist but writes succeed after 1 acknowledgment. Understanding this distinction is crucial for configuring leaderless systems.
When a write arrives at a coordinator node (or when a client sends writes directly to replicas), a carefully orchestrated sequence of operations ensures durability and consistency. Let's trace through this process in detail.
Phase 1: Identify Target Replicas
The coordinator first determines which nodes should receive the write by hashing the key and walking the ring to find its replicas, as described in the replica placement discussion above.
Phase 2: Dispatch Writes in Parallel
The coordinator sends write requests to all identified replicas simultaneously:
Coordinator dispatches:
→ Node A: "Write key=X, value=V, timestamp=T"
→ Node B: "Write key=X, value=V, timestamp=T"
→ Node C: "Write key=X, value=V, timestamp=T"
(All three dispatched in parallel)
Phase 3: Await Acknowledgments
The coordinator waits for responses based on the configured write consistency level:
| Consistency Level | Replicas Required | Trade-off | Use Case |
|---|---|---|---|
| ONE | 1 of N | Fast but risky — data exists on single node until anti-entropy | Logging, metrics, fire-and-forget writes |
| TWO | 2 of N | Moderate durability, still fast | Session data, caches |
| QUORUM | ⌊N/2⌋ + 1 | Guarantees overlap with quorum reads for consistency | Most production workloads |
| LOCAL_QUORUM | Quorum in local DC | Fast writes with local durability | Multi-DC deployments |
| EACH_QUORUM | Quorum in each DC | Strong multi-DC guarantees, highest latency | Finance, compliance data |
| ALL | N of N | Maximum durability but any failure blocks write | Critical data requiring certainty |
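The quorum arithmetic in the table can be checked in a couple of lines (a sketch; the function names are ours, not a library API):

```python
def quorum(n: int) -> int:
    # Majority of N replicas: floor(N/2) + 1
    return n // 2 + 1

def overlaps(n: int, w: int, r: int) -> bool:
    # A write quorum and a read quorum intersect in at least one replica
    # whenever W + R > N, so a quorum read sees the latest quorum write.
    return w + r > n

n = 3
w = r = quorum(n)
print(w, overlaps(n, w, r))   # 2 True
```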
Phase 4: Handling Partial Failures
What happens when some replicas acknowledge but others don't?
The coordinator must make a decision based on the consistency level:
Scenario: Write to RF=3, CL=QUORUM, Node C is slow/unreachable
→ Node A: ACK received in 2ms ✓
→ Node B: ACK received in 3ms ✓
→ Node C: Timeout after 500ms ✗
Result: Write succeeds (2/3 = quorum met)
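The dispatch-and-await flow of Phases 2-4 can be sketched with a thread pool. The replica RPC is simulated (`send_write`, the node names, and the timings are all hypothetical), but the shape is the same: fan out in parallel, count acknowledgments, and treat replicas that miss the timeout as failures for this request:

```python
import concurrent.futures as cf
import time

def send_write(replica, key, value, ts):
    # Simulated replica RPC: node "C" is slow and misses the timeout.
    if replica == "C":
        time.sleep(0.5)
    return f"{replica}:ACK"

def coordinate_write(replicas, key, value, ts, required_acks=2, timeout=0.2):
    # Phase 2: dispatch to every replica in parallel.
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(send_write, r, key, value, ts) for r in replicas]
    acks = 0
    try:
        # Phase 3: count acknowledgments as they arrive.
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.result().endswith("ACK"):
                acks += 1
            if acks >= required_acks:
                break
    except cf.TimeoutError:
        # Phase 4: slow or unreachable replicas simply don't count.
        pass
    pool.shutdown(wait=False)  # don't block on stragglers; repair covers them
    return acks >= required_acks

print(coordinate_write(["A", "B", "C"], "X", "V", ts=1000))  # True: A and B ack in time
```

Note that the coordinator returns success as soon as the quorum is met; the straggler's write, if it ever lands, is harmless, and if it never lands, hinted handoff or repair fills the gap.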
But what about Node C? It missed the write. Leaderless systems have multiple mechanisms to ensure Node C eventually gets the data:
When a node acknowledges a write, it typically means the data has been written to a commit log (durable storage) but may not yet be in the main data structures. This ensures the write survives a crash even if it hasn't been fully processed. The durability guarantee comes from the commit log, not from memory or memtables.
Without a single leader to serialize operations, leaderless systems need alternative mechanisms to establish ordering. Timestamps become the primary tool for determining which write "wins" when conflicts occur.
The challenge of distributed time:
In a distributed system, clocks are never perfectly synchronized. A write from Node A at "12:00:00.000" might actually occur after a write from Node B timestamped "12:00:00.001" due to clock skew. This creates significant challenges:
Timestamp strategies in practice:
1. Wall-Clock Timestamps (Last-Write-Wins)
Simplest approach: use the system clock and let the highest timestamp win.
Write 1: key=X, value=A, timestamp=1000
Write 2: key=X, value=B, timestamp=1001
Result: B wins (higher timestamp)
Pros: Simple, fast, deterministic.
Cons: Clock skew can cause earlier writes to win; writes can be silently lost.
2. Logical Clocks (Lamport Timestamps)
Maintain a counter that increments on each operation and on message receipt:
Logical clock rules:
1. Increment before each local event
2. On send: include current clock value
3. On receive: clock = max(local, received) + 1
Pros: Respects causality (if A→B, then A's timestamp < B's timestamp).
Cons: Does not detect concurrent events (multiple events can have the same logical clock value).
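The three rules above translate directly into a small class (an illustrative sketch, not a library API):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Rule 1: increment before each local event.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Rule 2: stamp outgoing messages with the freshly incremented clock.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Rule 3: jump past any timestamp we observe.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.send()          # a's clock: 1
t2 = b.receive(t1)     # b's clock: 2, causally after a's send
t3 = b.send()          # b's clock: 3
print(t1, t2, t3)      # 1 2 3
```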
3. Vector Clocks / Version Vectors
Each node maintains a counter for every node in the system:
[Node1=5, Node2=3, Node3=7]
Can detect causality AND concurrency
If V1 < V2 in all components: V1 happened before V2
If V1 and V2 are incomparable: concurrent events
Pros: Detects both causality and concurrency.
Cons: Metadata grows with the number of nodes; complex to manage.
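The component-wise comparison can be sketched in a few lines (the function and node names are illustrative):

```python
def compare(v1: dict, v2: dict) -> str:
    # Compare two vector clocks component by component.
    # Returns "before", "after", "equal", or "concurrent".
    nodes = set(v1) | set(v2)
    less = any(v1.get(n, 0) < v2.get(n, 0) for n in nodes)
    greater = any(v1.get(n, 0) > v2.get(n, 0) for n in nodes)
    if less and greater:
        return "concurrent"   # incomparable: neither happened before the other
    if less:
        return "before"
    if greater:
        return "after"
    return "equal"

print(compare({"n1": 5, "n2": 3}, {"n1": 6, "n2": 3}))  # before
print(compare({"n1": 5, "n2": 3}, {"n1": 4, "n2": 4}))  # concurrent
```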
Cassandra uses wall-clock timestamps for conflict resolution (last-write-wins), but adds a tiebreaker: if timestamps are identical, the write with the lexicographically larger value wins. This ensures deterministic resolution across all replicas without coordination.
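That resolution rule can be mimicked in a few lines (a sketch of the rule, not Cassandra's actual code):

```python
def lww_resolve(a, b):
    # a and b are (timestamp, value) pairs. The highest timestamp wins;
    # on a tie, the lexicographically larger value wins, so every replica
    # resolves the same conflict identically without coordination.
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if a[1] > b[1] else b

print(lww_resolve((1000, "A"), (1001, "B")))  # (1001, 'B') — higher timestamp
print(lww_resolve((1000, "A"), (1000, "B")))  # (1000, 'B') — tie, larger value
```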
In coordinator-based write architectures, any node can become a coordinator for any request. But how does the client choose which node to contact? And how does the system prevent coordinator hotspots?
Client-side coordinator selection:
Token-aware routing deep dive:
Modern leaderless database clients (like the DataStax drivers for Cassandra) use token-aware routing by default:
Benefits: requests land directly on a replica that owns the key, avoiding the extra coordinator hop and its added latency.
Challenges: the client must fetch the cluster's token map, keep it current as nodes join and leave, and fall back to other nodes when the preferred replica is down.
| Strategy | Latency | Load Balance | Complexity | Failure Handling |
|---|---|---|---|---|
| Round-robin | Variable | Even | Low | Must skip failed nodes |
| Token-aware | Optimal | Even | High | Falls back to other replicas |
| Latency-aware | Best observed | May be uneven | Medium | Automatically avoids slow nodes |
| DC-aware | Low for local DC | Even per DC | Medium | Cross-DC fallback possible |
| Combined (Token + Latency + DC) | Optimal | Even | High | Most resilient |
It's important not to confuse coordinators with leaders. A coordinator handles a specific request but has no special authority. The next request might use a completely different coordinator. There's no election, no failover—just ephemeral coordination for individual operations.
One of the key performance characteristics of leaderless systems is that writes to multiple replicas happen in parallel, not sequentially. This has significant implications for latency and throughput.
Parallel vs. sequential write latency:
In a leader-follower system with synchronous replication:
Total latency = Leader write + Follower 1 + Follower 2 (sequential)
= 2ms + 2ms + 2ms = 6ms minimum
In a leaderless system:
Total latency = max(Replica A, Replica B, Replica C) (parallel)
= max(2ms, 3ms, 2ms) = 3ms for ALL acks
= 2ms for quorum (2 of 3)
Parallel writes typically have lower latency because the replica operations proceed concurrently: total latency is bounded by the slowest required acknowledgment rather than the sum of all replica latencies.
Latency distribution characteristics:
Because leaderless writes wait for multiple replicas, write latency is determined by the slowest required replica. This creates interesting latency distribution characteristics:
p50 latency: often similar to single-node latency (typical case: all replicas responsive).
p99 latency: higher than leader-based systems (waiting on slow replicas).
Tail latency mitigation:
Several strategies reduce the impact of slow replicas: speculative (hedged) requests that duplicate a slow request to another replica and take the first response, latency-aware routing that steers traffic away from chronically slow nodes, and lower consistency levels for latency-sensitive writes.
The optimal configuration depends on your workload. Write-heavy workloads with tolerance for inconsistency can use CL=ONE for maximum throughput. Read-heavy workloads that need consistency should use CL=QUORUM for both reads and writes. There's no universally correct setting—only trade-offs.
A key advantage of leaderless systems is graceful handling of node failures. Unlike leader-based systems where the leader's failure blocks all writes, leaderless systems continue operating as long as enough replicas are available.
Failure scenarios and responses:
| Scenario | Available Nodes | Write Outcome | Data State |
|---|---|---|---|
| No failures | A, B, C (3) | Success (all receive) | Consistent across all replicas |
| One node down | A, B (2) | Success (quorum met) | Missing from C, repaired later |
| Two nodes down | A (1) | FAILURE (below quorum) | Write rejected to preserve consistency |
| Node slow, not down | A (fast), B (fast), C (slow) | Success (quorum met quickly) | C receives eventually |
| Intermittent failure | A, B (stable), C (flapping) | Success (quorum from stable) | C may miss some writes |
Hinted handoff in detail:
When a replica is unavailable during a write, the coordinator can store a "hint" — a marker that the write should be sent to that replica when it recovers:
1. Write arrives for key X (replicas: A, B, C)
2. Node C is unreachable
3. Coordinator writes to A and B successfully
4. Coordinator stores hint: "When C is available, send key=X, value=V"
5. Later, C becomes reachable
6. Coordinator (or any node holding hints) sends pending writes to C
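The bookkeeping in steps 4-6 can be sketched as a small hint store (the `HintStore` API is hypothetical):

```python
from collections import defaultdict

class HintStore:
    """Minimal sketch of hinted handoff bookkeeping (illustrative API)."""
    def __init__(self):
        self.hints = defaultdict(list)   # target node -> missed writes

    def record(self, target, key, value, ts):
        # Step 4: a replica missed the write, so stash it for later replay.
        self.hints[target].append((key, value, ts))

    def replay(self, target, send):
        # Steps 5-6: the target is reachable again, so deliver every
        # pending write and then discard the hints.
        for hint in self.hints.pop(target, []):
            send(target, *hint)

store = HintStore()
store.record("C", "X", "V", 1000)   # C was down during the write
delivered = []
store.replay("C", lambda node, k, v, t: delivered.append((node, k, v, t)))
print(delivered)   # [('C', 'X', 'V', 1000)]
```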
Hinted handoff considerations: hints are only kept for a limited window (Cassandra's default is three hours), a hinted write does not count toward the consistency level, and a node that stays down longer than the window must rely on anti-entropy repair to catch up.
With RF=3 and CL=QUORUM, losing 2 of 3 replicas for a key range makes writes to that range impossible. The system degrades gracefully—other key ranges remain writable—but affected data is unavailable. This is the fundamental availability trade-off of quorum-based systems.
We've explored the mechanics of how leaderless systems handle writes distributed across multiple nodes. Let's consolidate the key insights: writes fan out to multiple replicas in parallel, the consistency level determines how many acknowledgments constitute success, timestamps order conflicting writes, and hinted handoff plus repair bring lagging replicas back in sync.
What's next:
Now that we understand how writes flow through leaderless systems, we'll examine the Dynamo-style approach in detail. The next page explores the specific innovations from Amazon's Dynamo paper that became the template for modern leaderless databases.
You now understand the mechanics of distributed writes in leaderless systems—from coordinator selection to parallel replication to failure handling. The following pages will dive deeper into Dynamo-style protocols, quorum semantics, and conflict resolution strategies.