On the previous page, we established that CRDTs achieve conflict-free replication through mathematical properties—specifically, merge operations that form a join-semilattice. But how replicas communicate their state is a fundamental design decision with profound implications for bandwidth, latency, reliability, and implementation complexity.
The CRDT world divides into two camps: State-based CRDTs (also called Convergent Replicated Data Types, or CvRDTs) and Operation-based CRDTs (Commutative Replicated Data Types, or CmRDTs). Each approach has distinct characteristics that make it suitable for different scenarios.
Understanding both approaches deeply—their mechanics, trade-offs, and hybrid evolutions—is essential for any engineer designing distributed data systems.
By the end of this page, you will understand the internal mechanics of both state-based and operation-based CRDTs, implement basic examples of each, grasp their fundamental trade-offs, and recognize when to choose each approach. You'll also learn about delta-state CRDTs—a modern hybrid that combines the best of both worlds.
State-based CRDTs are conceptually simpler and more robust. The fundamental idea: replicas periodically send their entire state to other replicas, and the receiving replica merges the incoming state with its own.
Core Requirements:

- The set of possible states forms a join-semilattice: the merge function is commutative, associative, and idempotent.
- Local updates are inflations: they only move the state upward in the lattice, never below its current position.
- The network need only provide eventual delivery—messages may be lost, duplicated, or reordered.
The Protocol:
1. Replica performs local update: state = update(state, operation)
2. Periodically, replica sends its state to peers: broadcast(state)
3. On receiving remote state: state = merge(state, remote_state)
4. Repeat
Because merge is idempotent, it doesn't matter if the same state is received multiple times. Because merge is commutative and associative, it doesn't matter in what order states arrive or how they're grouped.
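To make the protocol concrete, here is a minimal TypeScript sketch of the state-based sync loop. The `CvRDT` interface and `syncRound` helper are illustrative names, not from any particular library:

```typescript
// A generic shape for any state-based CRDT: all that's required is a merge
// that is commutative, associative, and idempotent (a lattice join).
interface CvRDT<S> {
  state: S;
  merge(remote: S): void; // state = join(state, remote)
}

// One anti-entropy round: exchange states with each peer and merge both ways.
// Losing a round, repeating one, or running rounds out of order is harmless,
// because merge tolerates duplication and reordering.
function syncRound<S>(replica: CvRDT<S>, peers: CvRDT<S>[]): void {
  for (const peer of peers) {
    peer.merge(replica.state); // peer absorbs our state
    replica.merge(peer.state); // we absorb the peer's state
  }
}
```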
```typescript
/**
 * G-Counter: A State-based CRDT for grow-only counting
 *
 * Each replica maintains a vector of counts, one per known replica.
 * Merge takes the maximum of each component.
 */
interface GCounter {
  // Map from replica ID to that replica's count
  counts: Map<string, number>;
  replicaId: string;
}

function createGCounter(replicaId: string): GCounter {
  return {
    counts: new Map([[replicaId, 0]]),
    replicaId,
  };
}

// Local operation: increment this replica's count
function increment(counter: GCounter): GCounter {
  const current = counter.counts.get(counter.replicaId) ?? 0;
  const newCounts = new Map(counter.counts);
  newCounts.set(counter.replicaId, current + 1);
  return { ...counter, counts: newCounts };
}

// Query: the value is the sum of all replica counts
function value(counter: GCounter): number {
  return Array.from(counter.counts.values()).reduce((a, b) => a + b, 0);
}

// Merge: take the maximum of each replica's count
function merge(local: GCounter, remote: GCounter): GCounter {
  const merged = new Map(local.counts);
  for (const [replicaId, count] of remote.counts) {
    const localCount = merged.get(replicaId) ?? 0;
    merged.set(replicaId, Math.max(localCount, count));
  }
  return { ...local, counts: merged };
}

// Demonstrate convergence
function demonstrateConvergence() {
  // Two replicas start from the same state
  let replicaA = createGCounter("A");
  let replicaB = createGCounter("B");

  // Concurrent increments
  replicaA = increment(increment(increment(replicaA))); // A: 3, B: 0
  replicaB = increment(increment(replicaB)); // A: 0, B: 2

  console.log("Before merge:");
  console.log("  Replica A value:", value(replicaA)); // 3
  console.log("  Replica B value:", value(replicaB)); // 2

  // Merge in any order — result is the same
  const mergedAB = merge(replicaA, replicaB);
  const mergedBA = merge(replicaB, replicaA);

  console.log("After merge:");
  console.log("  A←B value:", value(mergedAB)); // 5
  console.log("  B←A value:", value(mergedBA)); // 5

  // Both replicas now see the same value: 5
}
```

Why G-Counter Works:
The G-Counter forms a lattice where:

- the partial order compares count vectors componentwise (one state is below another when every replica's count in it is less than or equal to the corresponding count),
- the join (merge) is the componentwise maximum,
- increment moves a state strictly upward in the order.
This design handles the fundamental issue of distributed counting: with a single shared number starting at 3, two concurrent increments each produce 4, and a max-merge would keep 4 when the true total is 5. By tracking each replica's contribution separately, we can merge without losing information.
The G-Counter's structure is identical to a vector clock. This is no coincidence—vector clocks are themselves CRDTs! Understanding that max over vectors forms a lattice unlocks many CRDT designs. Sets, registers, maps, and more can be built on variations of this pattern.
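As a quick illustration of that connection, here is a vector clock merge in the same style as the G-Counter above (a sketch; `VectorClock` and `mergeClocks` are illustrative names):

```typescript
// A vector clock is a map from replica ID to a counter, and its merge is
// componentwise max: exactly the G-Counter's merge, so vector clocks form
// the same join-semilattice.
type VectorClock = Map<string, number>;

function mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
  const merged = new Map(a);
  for (const [id, tick] of b) {
    merged.set(id, Math.max(merged.get(id) ?? 0, tick));
  }
  return merged;
}

const c1: VectorClock = new Map([["A", 3], ["B", 1]]);
const c2: VectorClock = new Map([["B", 2], ["C", 4]]);
console.log(Object.fromEntries(mergeClocks(c1, c2))); // { A: 3, B: 2, C: 4 }
```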
Operation-based CRDTs take a different approach: instead of sending state, replicas broadcast the operations they perform. Each replica applies received operations to its local state.
Core Requirements:

- Concurrent operations must commute: applying them in either order produces the same state.
- Every operation must reach every replica exactly once, or operations must be made idempotent.
- Some data types additionally require causal delivery of operations.
The Protocol:
1. Replica generates operation for local update
2. Replica applies operation locally: state = apply(state, op)
3. Replica broadcasts operation to all peers: broadcast(op)
4. On receiving remote operation: state = apply(state, op)
5. Message layer ensures all operations reach all replicas
The key difference: we're sending small operations rather than full state. This can be dramatically more bandwidth-efficient for large data structures.
```typescript
/**
 * Op-based Counter: Operations are broadcast instead of state
 *
 * Each replica maintains just a single number.
 * The increment operation commutes trivially.
 */

// The operation type
interface IncrementOp {
  type: "increment";
  amount: number;
  // Unique ID for idempotency (optional if network guarantees once)
  opId: string;
}

interface OpBasedCounter {
  value: number;
  // Track applied operations for idempotency
  appliedOps: Set<string>;
}

function createOpCounter(): OpBasedCounter {
  return { value: 0, appliedOps: new Set() };
}

// Generate an increment operation
function generateIncrement(amount: number = 1): IncrementOp {
  return {
    type: "increment",
    amount,
    opId: crypto.randomUUID(),
  };
}

// Apply an operation (idempotently)
function applyOp(counter: OpBasedCounter, op: IncrementOp): OpBasedCounter {
  // Idempotency check: skip if already applied
  if (counter.appliedOps.has(op.opId)) {
    return counter;
  }
  return {
    value: counter.value + op.amount,
    appliedOps: new Set([...counter.appliedOps, op.opId]),
  };
}

// Simulate distributed operation with message reordering
function demonstrateCommutativity() {
  // Generate operations from different "replicas"
  const op1 = generateIncrement(3); // From replica A
  const op2 = generateIncrement(2); // From replica B
  const op3 = generateIncrement(5); // From replica C

  // Replica X receives: op1, op2, op3
  let replicaX = createOpCounter();
  replicaX = applyOp(replicaX, op1);
  replicaX = applyOp(replicaX, op2);
  replicaX = applyOp(replicaX, op3);

  // Replica Y receives: op3, op1, op2 (different order)
  let replicaY = createOpCounter();
  replicaY = applyOp(replicaY, op3);
  replicaY = applyOp(replicaY, op1);
  replicaY = applyOp(replicaY, op2);

  // Replica Z receives some ops twice (network duplication)
  let replicaZ = createOpCounter();
  replicaZ = applyOp(replicaZ, op2);
  replicaZ = applyOp(replicaZ, op1);
  replicaZ = applyOp(replicaZ, op2); // Duplicate!
  replicaZ = applyOp(replicaZ, op3);
  replicaZ = applyOp(replicaZ, op1); // Duplicate!

  console.log("Replica X value:", replicaX.value); // 10
  console.log("Replica Y value:", replicaY.value); // 10
  console.log("Replica Z value:", replicaZ.value); // 10

  // All replicas converge despite order/duplication differences
}
```

Why Op-based Works:
For an op-based CRDT, the operations themselves must have special properties:
Commutativity of concurrent ops: If operations A and B are concurrent (neither happened before the other), then apply(apply(s, A), B) = apply(apply(s, B), A)
Causal delivery may be required: Some op-based CRDTs require that if operation B depends on A (A happened before B), then all replicas receive A before B. This is "causal broadcast."
For a simple counter, increment commutes trivially: 5+3+2 = 5+2+3. But for more complex operations (like list insertion), ensuring commutativity requires careful design.
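To see why, consider index-based list insertion, which does not commute. This sketch (hypothetical, not one of the page's examples) shows two concurrent front-insertions producing different lists depending on arrival order:

```typescript
// Why naive list insertion is not a valid op-based CRDT operation:
// the same two concurrent inserts, applied in different orders, diverge.
function insertAt(list: string[], index: number, item: string): string[] {
  const copy = [...list];
  copy.splice(index, 0, item);
  return copy;
}

const base = ["b"];
const opA = { index: 0, item: "a" }; // replica A inserts "a" at the front
const opB = { index: 0, item: "x" }; // replica B concurrently inserts "x" at the front

const orderAB = insertAt(insertAt(base, opA.index, opA.item), opB.index, opB.item);
const orderBA = insertAt(insertAt(base, opB.index, opB.item), opA.index, opA.item);

console.log(orderAB); // ["x", "a", "b"]
console.log(orderBA); // ["a", "x", "b"]
// Sequence CRDTs avoid this by giving each element a unique, totally ordered
// position identifier instead of a mutable integer index.
```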
Op-based CRDTs have a strict requirement: every operation MUST eventually be delivered to every replica. If an operation is lost, replicas diverge permanently. This is harder to guarantee than it sounds, especially across unreliable networks. State-based CRDTs are more forgiving—a missed state update can be recovered with the next sync.
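A common way to meet this requirement is an acknowledgment-and-retry layer on the sender, paired with the opId-based idempotency already shown in `applyOp`. The sketch below is a minimal illustration under assumed names (`ReliableBroadcaster`, `send`, `onAck` are not from a real library):

```typescript
// At-least-once delivery for op-based CRDTs: the sender buffers each
// operation until every peer has acknowledged it, and resends on a timer.
// Receivers dedupe via opId (as applyOp above already does), which makes
// the inevitable redeliveries harmless.
class ReliableBroadcaster {
  // opId -> the op plus the peers that have not yet acked it
  private pending = new Map<string, { op: IncrementOp; unacked: Set<string> }>();

  constructor(
    private peers: string[],
    private send: (peer: string, op: IncrementOp) => void
  ) {}

  // Broadcast an op and remember it until all peers ack
  broadcast(op: IncrementOp): void {
    this.pending.set(op.opId, { op, unacked: new Set(this.peers) });
    for (const peer of this.peers) this.send(peer, op);
  }

  // Called when a peer confirms receipt of an op
  onAck(peer: string, opId: string): void {
    const entry = this.pending.get(opId);
    if (!entry) return;
    entry.unacked.delete(peer);
    if (entry.unacked.size === 0) this.pending.delete(opId);
  }

  // Run periodically: re-send anything a peer still hasn't acked
  retryUnacked(): void {
    for (const { op, unacked } of this.pending.values()) {
      for (const peer of unacked) this.send(peer, op);
    }
  }
}
```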
The choice between state-based and operation-based CRDTs involves fundamental trade-offs across multiple dimensions. Let's analyze them systematically:
| Dimension | State-based (CvRDT) | Operation-based (CmRDT) |
|---|---|---|
| Message Content | Full replica state | Individual operations |
| Message Size | O(state size) — can be large | O(operation size) — typically small |
| Network Requirement | Eventual delivery (can lose/duplicate) | Reliable broadcast (no loss) |
| Duplicate Handling | Automatic (merge is idempotent) | Needs explicit idempotency or exactly-once |
| Bandwidth (steady state) | Higher (repeated full state) | Lower (only deltas) |
| Bandwidth (recovery) | Lower (one sync catches up) | Higher (replay all missed ops) |
| Implementation Complexity | Moderate (design merge function) | Higher (design commutative ops) |
| Ordering Requirements | None | May need causal ordering for some types |
| Garbage Collection | State can be compacted | May accumulate operation log |
Practical Example: Collaborative Whiteboard
Consider a shared whiteboard where users draw shapes:
State-based approach:

- Each sync sends the entire board: every shape ever drawn.
- Message size grows with the board; a long-lived board can mean megabytes per sync.
- Robust: a lost sync is simply repaired by the next one.
Operation-based approach:

- Each user action ("add rectangle at (x, y)") is broadcast as a small operation.
- Message size stays roughly constant no matter how large the board grows.
- Fragile: a lost draw operation means permanent divergence unless delivery is reliable.
For this scenario, operation-based is clearly better—assuming you can guarantee reliable delivery (most collaboration frameworks do via WebSocket + acknowledgments + retries).
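A rough back-of-the-envelope calculation shows the gap. The figures below (10,000 shapes, ~100 bytes per shape, 10 operations per second) are illustrative assumptions, not measurements:

```typescript
// Whiteboard bandwidth, state-based vs op-based (illustrative numbers only)
const shapeCount = 10_000;
const bytesPerShape = 100;
const opsPerSecond = 10; // drawing actions across all users

const statePerSync = shapeCount * bytesPerShape; // ~1 MB per full-state sync
const opStreamRate = opsPerSecond * bytesPerShape; // ~1 KB/s of operations

console.log(`Full state sync: ~${(statePerSync / 1e6).toFixed(1)} MB each time`);
console.log(`Op-based stream: ~${opStreamRate} bytes/second`);
// As the board grows, state-based cost grows with it; op-based cost tracks
// only the rate of new edits.
```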
The dichotomy between state-based and operation-based CRDTs troubled researchers. State-based was more robust but wasteful with bandwidth. Operation-based was efficient but fragile. Could we have both?
Delta-state CRDTs (introduced by Almeida et al.) answer yes. The key insight: we can compute and transmit only the delta (difference) between states, using state-based merge semantics.
Core Idea:

- Each local update also produces a delta: a small state capturing just that change.
- Deltas accumulate in a buffer and are shipped instead of the full state.
- Receivers apply deltas with the ordinary state-based merge, so all the lattice guarantees still hold.
```typescript
/**
 * Delta-State G-Counter
 *
 * Combines efficiency of op-based with robustness of state-based.
 * Deltas are just the changed portion of state.
 */

interface DeltaGCounter {
  counts: Map<string, number>;
  replicaId: string;
  // Track unsent changes since last sync
  pendingDelta: Map<string, number>;
}

function createDeltaGCounter(replicaId: string): DeltaGCounter {
  return {
    counts: new Map([[replicaId, 0]]),
    replicaId,
    pendingDelta: new Map(),
  };
}

// Increment: update local state AND track the delta
function incrementDelta(counter: DeltaGCounter): DeltaGCounter {
  const current = counter.counts.get(counter.replicaId) ?? 0;
  const newValue = current + 1;
  const newCounts = new Map(counter.counts);
  newCounts.set(counter.replicaId, newValue);
  // Delta only contains our change
  const newDelta = new Map(counter.pendingDelta);
  newDelta.set(counter.replicaId, newValue);
  return { ...counter, counts: newCounts, pendingDelta: newDelta };
}

// Extract delta for transmission (and clear pending)
function extractDelta(counter: DeltaGCounter): {
  counter: DeltaGCounter;
  delta: Map<string, number>;
} {
  const delta = new Map(counter.pendingDelta);
  return {
    counter: { ...counter, pendingDelta: new Map() },
    delta,
  };
}

// Merge a received delta (same as full state merge, but delta is small)
function mergeDelta(
  counter: DeltaGCounter,
  delta: Map<string, number>
): DeltaGCounter {
  const merged = new Map(counter.counts);
  for (const [replicaId, count] of delta) {
    const current = merged.get(replicaId) ?? 0;
    merged.set(replicaId, Math.max(current, count));
  }
  return { ...counter, counts: merged };
}

// Full state sync (fallback if deltas are lost)
function fullMerge(local: DeltaGCounter, remote: DeltaGCounter): DeltaGCounter {
  const merged = new Map(local.counts);
  for (const [replicaId, count] of remote.counts) {
    const current = merged.get(replicaId) ?? 0;
    merged.set(replicaId, Math.max(current, count));
  }
  return { ...local, counts: merged };
}

// Demonstration
function demonstrateDelta() {
  let replicaA = createDeltaGCounter("A");
  let replicaB = createDeltaGCounter("B");

  // A does 3 increments
  replicaA = incrementDelta(replicaA);
  replicaA = incrementDelta(replicaA);
  replicaA = incrementDelta(replicaA);

  // B does 2 increments
  replicaB = incrementDelta(replicaB);
  replicaB = incrementDelta(replicaB);

  // Extract deltas (would be sent over network)
  const { counter: newA, delta: deltaA } = extractDelta(replicaA);
  const { counter: newB, delta: deltaB } = extractDelta(replicaB);
  replicaA = newA;
  replicaB = newB;

  console.log("Delta from A:", Object.fromEntries(deltaA)); // {A: 3}
  console.log("Delta from B:", Object.fromEntries(deltaB)); // {B: 2}

  // Merge deltas (small messages, robust merge)
  replicaA = mergeDelta(replicaA, deltaB);
  replicaB = mergeDelta(replicaB, deltaA);

  // Both now have the same state
  console.log("A counts:", Object.fromEntries(replicaA.counts)); // {A: 3, B: 2}
  console.log("B counts:", Object.fromEntries(replicaB.counts)); // {A: 3, B: 2}
}
```

Key Properties of Delta-State CRDTs:
Deltas are themselves states — They're valid elements of the same lattice. This means multiple deltas can be combined before transmission (delta compaction); see the sketch after this list.
Merge is still idempotent — Receiving the same delta twice is safe. This eliminates the need for exactly-once delivery.
Graceful degradation — If deltas are lost, the system doesn't diverge. The next full state sync repairs everything.
Anti-entropy compatible — Can run background full-state gossip to catch any missed updates, while using deltas for real-time sync.
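The first property above is what enables compaction. A minimal sketch, reusing the `Map<string, number>` delta shape from the `DeltaGCounter` code (the `compactDeltas` name is illustrative):

```typescript
// Delta compaction: because deltas are states in the same lattice, two
// pending deltas can be joined (componentwise max) into one before sending.
function compactDeltas(
  d1: Map<string, number>,
  d2: Map<string, number>
): Map<string, number> {
  const joined = new Map(d1);
  for (const [replicaId, count] of d2) {
    joined.set(replicaId, Math.max(joined.get(replicaId) ?? 0, count));
  }
  return joined;
}

// e.g. joining {A: 2} with {A: 3, B: 1} yields {A: 3, B: 1}:
// one message now carries the effect of both deltas.
```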
Bandwidth Analysis:

- Steady state: each sync carries only what changed, so cost is O(size of delta), comparable to op-based.
- Recovery: a replica that missed deltas can be repaired with a single full-state merge, comparable to state-based.
- Duplicates and reordering cost nothing extra: delta merge is still idempotent and commutative.
Most modern CRDT libraries (Automerge, Yjs, Riak's CRDTs) use delta-state or hybrid approaches. The pure state-based vs op-based distinction is more theoretical than practical. Real systems optimize with deltas, batching, compression, and selective full-syncs for repair.
Moving from theory to production requires addressing practical concerns: network protocols, persistence, garbage collection, and more. Here are patterns used by production CRDT systems:
```
## Hybrid Synchronization Architecture

REPLICA NODE

  Local Ops (user input) ──▶ Delta Generator ──▶ Outbound Queue (batched deltas)
                                                         │
                                                         ▼
                                                  Network Layer
                                             (WebSocket / gRPC / AMQP)
                                                         │
                                                         ▼
  CRDT State (current) ◀── Merge Function ◀── Inbound Queue (incoming sync)
          │
          ▼
  PERSISTENCE LAYER
    Checkpoints (periodic) · Delta Log (recent ops) · Vector Clocks

## Synchronization Channels

1. REAL-TIME CHANNEL (Delta-based)
   - Low latency, small messages
   - Deltas batched every 50-100ms
   - Best-effort delivery OK (anti-entropy repairs)

2. RECOVERY CHANNEL (State-based)
   - Used when replica is far behind
   - Full state or checkpoint + delta log
   - Triggered by: long disconnection, new replica join

3. ANTI-ENTROPY CHANNEL (Background)
   - Periodic full-state comparison via Merkle trees
   - Detects and repairs any divergence
   - Low priority, opportunistic
```

Production systems tune sync frequency based on use case. Real-time collaboration: 50-100ms batches. Mobile offline sync: on-connect plus periodic background. IoT sensors: batched every few seconds to minutes. The CRDT math is the same; the sync layer adapts to requirements.
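As a concrete illustration of the real-time channel's batching, here is a minimal sketch (all names hypothetical; `compactDeltas` is the joining helper from the earlier delta sketch) of an outbound queue that compacts pending deltas and flushes them on a timer:

```typescript
// Batches outgoing deltas: rapid local edits are joined into one pending
// delta, which is flushed at most once per interval (e.g., every 100ms).
class DeltaBatcher {
  private pending: Map<string, number> = new Map();
  private timer: ReturnType<typeof setInterval>;

  constructor(
    flushIntervalMs: number,
    private send: (delta: Map<string, number>) => void
  ) {
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
  }

  // Join a new delta into the pending batch (no unbounded queue growth)
  enqueue(delta: Map<string, number>): void {
    this.pending = compactDeltas(this.pending, delta);
  }

  private flush(): void {
    if (this.pending.size === 0) return; // nothing to send this interval
    this.send(this.pending);
    this.pending = new Map();
  }

  stop(): void {
    clearInterval(this.timer);
  }
}

// Usage sketch (network.broadcast is hypothetical):
// const batcher = new DeltaBatcher(100, delta => network.broadcast(delta));
```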
Given the trade-offs, how do you decide? Here's a systematic framework for choosing between pure state-based, pure operation-based, or hybrid delta-state:
```
## CRDT Synchronization Decision Framework

START: What are your network guarantees?

  UNRELIABLE (UDP, lossy)                  → STATE-BASED (most robust)
  MOSTLY RELIABLE (TCP, rare loss)         → DELTA-STATE (best balance)
  RELIABLE + ORDERED (guaranteed delivery) → consider OP-BASED (most efficient)

THEN: Consider state size × update rate

  Small state + rare updates     → STATE-BASED fine
                                   (user preferences, feature flags)
  Large state + frequent updates → DELTA or OP-BASED
                                   (documents, shared whiteboards)
  Any size + offline-first       → DELTA-STATE
                                   (mobile apps, edge computing)

RECOMMENDATION FOR MOST CASES: Delta-State
  - Bandwidth efficient like op-based
  - Robust like state-based
  - Graceful recovery from network issues
  - Most modern libraries use this approach
```

Practical Recommendations:
For new projects: Start with delta-state (or a library that implements it). It's the most balanced approach and handles the widest range of scenarios.
For unreliable networks: State-based with periodic full sync. The robustness is worth the bandwidth cost.
For optimized hot paths: Pure op-based if you can guarantee reliable delivery. The efficiency gains matter when scaling to millions of operations.
For hybrid architectures: Use op-based for real-time with state-based anti-entropy for repair. This is what most production systems do.
We've now deeply explored the two fundamental approaches to CRDT synchronization and their practical hybrid evolution. Let's consolidate:

- State-based (CvRDT): ship full state; the idempotent, commutative, associative merge tolerates loss, duplication, and reordering, at the cost of bandwidth.
- Operation-based (CmRDT): ship operations; efficient on the wire, but correctness depends on reliable (and sometimes causal) delivery of every operation.
- Delta-state: ship small deltas that are themselves states; op-based efficiency with state-based robustness, which is why most modern libraries use it.
What's Next:
Now that we understand how CRDTs synchronize, the next page explores the Common CRDT Types—the practical building blocks you'll use in production. We'll examine counters, registers, sets, maps, and sequences in detail, understanding the design choices and trade-offs behind each.
You now understand the mechanics, trade-offs, and practical patterns for CRDT synchronization. Whether state-based, operation-based, or the hybrid delta approach—you can reason about bandwidth, reliability, and implementation complexity. Next, we'll use this knowledge to explore specific CRDT data structures.