On the previous page, we established that CRDTs achieve conflict-free replication through mathematical properties—specifically, merge operations that form a join-semilattice. But how replicas communicate their state is a fundamental design decision with profound implications for bandwidth, latency, reliability, and implementation complexity.
The CRDT world divides into two camps: State-based CRDTs (also called Convergent Replicated Data Types, or CvRDTs) and Operation-based CRDTs (Commutative Replicated Data Types, or CmRDTs). Each approach has distinct characteristics that make it suitable for different scenarios.
Understanding both approaches deeply—their mechanics, trade-offs, and hybrid evolutions—is essential for any engineer designing distributed data systems.
By the end of this page, you will understand the internal mechanics of both state-based and operation-based CRDTs, implement basic examples of each, grasp their fundamental trade-offs, and recognize when to choose each approach. You'll also learn about delta-state CRDTs—a modern hybrid that combines the best of both worlds.
State-based CRDTs are conceptually simpler and more robust. The fundamental idea: replicas periodically send their entire state to other replicas, and the receiving replica merges the incoming state with its own.
Core Requirements:

- The set of possible states forms a join-semilattice: the merge function is commutative, associative, and idempotent.
- Local updates are inflations: they only move the state upward in the lattice, never below its current position.
- The network need only provide eventual delivery—messages may be lost, duplicated, or reordered.
The Protocol:
1. Replica performs local update: state = update(state, operation)
2. Periodically, replica sends its state to peers: broadcast(state)
3. On receiving remote state: state = merge(state, remote_state)
4. Repeat
Because merge is idempotent, it doesn't matter if the same state is received multiple times. Because merge is commutative and associative, it doesn't matter in what order states arrive or how they're grouped.
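To make the protocol concrete, here is a minimal TypeScript sketch of the state-based sync loop. The `CvRDT` interface and `syncRound` helper are illustrative names, not from any particular library:

```typescript
// A generic shape for any state-based CRDT: all that's required is a merge
// that is commutative, associative, and idempotent (a lattice join).
interface CvRDT<S> {
  state: S;
  merge(remote: S): void; // state = join(state, remote)
}

// One anti-entropy round: exchange states with each peer and merge both ways.
// Losing a round, repeating one, or running rounds out of order is harmless,
// because merge tolerates duplication and reordering.
function syncRound<S>(replica: CvRDT<S>, peers: CvRDT<S>[]): void {
  for (const peer of peers) {
    peer.merge(replica.state); // peer absorbs our state
    replica.merge(peer.state); // we absorb the peer's state
  }
}
```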
```typescript
/**
 * G-Counter: A State-based CRDT for grow-only counting
 *
 * Each replica maintains a vector of counts, one per known replica.
 * Merge takes the maximum of each component.
 */
interface GCounter {
  // Map from replica ID to that replica's count
  counts: Map<string, number>;
  replicaId: string;
}

function createGCounter(replicaId: string): GCounter {
  return {
    counts: new Map([[replicaId, 0]]),
    replicaId,
  };
}

// Local operation: increment this replica's count
function increment(counter: GCounter): GCounter {
  const current = counter.counts.get(counter.replicaId) ?? 0;
  const newCounts = new Map(counter.counts);
  newCounts.set(counter.replicaId, current + 1);
  return { ...counter, counts: newCounts };
}

// Query: the value is the sum of all replica counts
function value(counter: GCounter): number {
  return Array.from(counter.counts.values()).reduce((a, b) => a + b, 0);
}

// Merge: take the maximum of each replica's count
function merge(local: GCounter, remote: GCounter): GCounter {
  const merged = new Map(local.counts);
  for (const [replicaId, count] of remote.counts) {
    const localCount = merged.get(replicaId) ?? 0;
    merged.set(replicaId, Math.max(localCount, count));
  }
  return { ...local, counts: merged };
}

// Demonstrate convergence
function demonstrateConvergence() {
  // Two replicas start from the same state
  let replicaA = createGCounter("A");
  let replicaB = createGCounter("B");

  // Concurrent increments
  replicaA = increment(increment(increment(replicaA))); // A: 3, B: 0
  replicaB = increment(increment(replicaB)); // A: 0, B: 2

  console.log("Before merge:");
  console.log("  Replica A value:", value(replicaA)); // 3
  console.log("  Replica B value:", value(replicaB)); // 2

  // Merge in any order — result is the same
  const mergedAB = merge(replicaA, replicaB);
  const mergedBA = merge(replicaB, replicaA);

  console.log("After merge:");
  console.log("  A←B value:", value(mergedAB)); // 5
  console.log("  B←A value:", value(mergedBA)); // 5

  // Both replicas now see the same value: 5
}
```

Why G-Counter Works:
The G-Counter forms a lattice where:

- the partial order compares count vectors componentwise (one state is below another when every replica's count in it is less than or equal to the corresponding count),
- the join (merge) is the componentwise maximum,
- increment moves a state strictly upward in the order.
This design handles the fundamental issue of distributed counting: with a single shared number starting at 3, two concurrent increments each produce 4, and a max-merge would keep 4 when the true total is 5. By tracking each replica's contribution separately, we can merge without losing information.
The G-Counter's structure is identical to a vector clock. This is no coincidence—vector clocks are themselves CRDTs! Understanding that max over vectors forms a lattice unlocks many CRDT designs. Sets, registers, maps, and more can be built on variations of this pattern.
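As a quick illustration of that connection, here is a vector clock merge in the same style as the G-Counter above (a sketch; `VectorClock` and `mergeClocks` are illustrative names):

```typescript
// A vector clock is a map from replica ID to a counter, and its merge is
// componentwise max: exactly the G-Counter's merge, so vector clocks form
// the same join-semilattice.
type VectorClock = Map<string, number>;

function mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
  const merged = new Map(a);
  for (const [id, tick] of b) {
    merged.set(id, Math.max(merged.get(id) ?? 0, tick));
  }
  return merged;
}

const c1: VectorClock = new Map([["A", 3], ["B", 1]]);
const c2: VectorClock = new Map([["B", 2], ["C", 4]]);
console.log(Object.fromEntries(mergeClocks(c1, c2))); // { A: 3, B: 2, C: 4 }
```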
Operation-based CRDTs take a different approach: instead of sending state, replicas broadcast the operations they perform. Each replica applies received operations to its local state.
Core Requirements:

- Concurrent operations must commute: applying them in either order produces the same state.
- Every operation must reach every replica exactly once, or operations must be made idempotent.
- Some data types additionally require causal delivery of operations.
The Protocol:
1. Replica generates operation for local update
2. Replica applies operation locally: state = apply(state, op)
3. Replica broadcasts operation to all peers: broadcast(op)
4. On receiving remote operation: state = apply(state, op)
5. Message layer ensures all operations reach all replicas
The key difference: we're sending small operations rather than full state. This can be dramatically more bandwidth-efficient for large data structures.
```typescript
/**
 * Op-based Counter: Operations are broadcast instead of state
 *
 * Each replica maintains just a single number.
 * The increment operation commutes trivially.
 */

// The operation type
interface IncrementOp {
  type: "increment";
  amount: number;
  // Unique ID for idempotency (optional if network guarantees once)
  opId: string;
}

interface OpBasedCounter {
  value: number;
  // Track applied operations for idempotency
  appliedOps: Set<string>;
}

function createOpCounter(): OpBasedCounter {
  return { value: 0, appliedOps: new Set() };
}

// Generate an increment operation
function generateIncrement(amount: number = 1): IncrementOp {
  return {
    type: "increment",
    amount,
    opId: crypto.randomUUID(),
  };
}

// Apply an operation (idempotently)
function applyOp(counter: OpBasedCounter, op: IncrementOp): OpBasedCounter {
  // Idempotency check: skip if already applied
  if (counter.appliedOps.has(op.opId)) {
    return counter;
  }
  return {
    value: counter.value + op.amount,
    appliedOps: new Set([...counter.appliedOps, op.opId]),
  };
}

// Simulate distributed operation with message reordering
function demonstrateCommutativity() {
  // Generate operations from different "replicas"
  const op1 = generateIncrement(3); // From replica A
  const op2 = generateIncrement(2); // From replica B
  const op3 = generateIncrement(5); // From replica C

  // Replica X receives: op1, op2, op3
  let replicaX = createOpCounter();
  replicaX = applyOp(replicaX, op1);
  replicaX = applyOp(replicaX, op2);
  replicaX = applyOp(replicaX, op3);

  // Replica Y receives: op3, op1, op2 (different order)
  let replicaY = createOpCounter();
  replicaY = applyOp(replicaY, op3);
  replicaY = applyOp(replicaY, op1);
  replicaY = applyOp(replicaY, op2);

  // Replica Z receives some ops twice (network duplication)
  let replicaZ = createOpCounter();
  replicaZ = applyOp(replicaZ, op2);
  replicaZ = applyOp(replicaZ, op1);
  replicaZ = applyOp(replicaZ, op2); // Duplicate!
  replicaZ = applyOp(replicaZ, op3);
  replicaZ = applyOp(replicaZ, op1); // Duplicate!

  console.log("Replica X value:", replicaX.value); // 10
  console.log("Replica Y value:", replicaY.value); // 10
  console.log("Replica Z value:", replicaZ.value); // 10

  // All replicas converge despite order/duplication differences
}
```

Why Op-based Works:
For an op-based CRDT, the operations themselves must have special properties:
Commutativity of concurrent ops: If operations A and B are concurrent (neither happened before the other), then apply(apply(s, A), B) = apply(apply(s, B), A)
Causal delivery may be required: Some op-based CRDTs require that if operation B depends on A (A happened before B), then all replicas receive A before B. This is "causal broadcast."
For a simple counter, increment commutes trivially: 5+3+2 = 5+2+3. But for more complex operations (like list insertion), ensuring commutativity requires careful design.
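To see why, consider index-based list insertion, which does not commute. This sketch (hypothetical, not one of the page's examples) shows two concurrent front-insertions producing different lists depending on arrival order:

```typescript
// Why naive list insertion is not a valid op-based CRDT operation:
// the same two concurrent inserts, applied in different orders, diverge.
function insertAt(list: string[], index: number, item: string): string[] {
  const copy = [...list];
  copy.splice(index, 0, item);
  return copy;
}

const base = ["b"];
const opA = { index: 0, item: "a" }; // replica A inserts "a" at the front
const opB = { index: 0, item: "x" }; // replica B concurrently inserts "x" at the front

const orderAB = insertAt(insertAt(base, opA.index, opA.item), opB.index, opB.item);
const orderBA = insertAt(insertAt(base, opB.index, opB.item), opA.index, opA.item);

console.log(orderAB); // ["x", "a", "b"]
console.log(orderBA); // ["a", "x", "b"]
// Sequence CRDTs avoid this by giving each element a unique, totally ordered
// position identifier instead of a mutable integer index.
```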
Op-based CRDTs have a strict requirement: every operation MUST eventually be delivered to every replica. If an operation is lost, replicas diverge permanently. This is harder to guarantee than it sounds, especially across unreliable networks. State-based CRDTs are more forgiving—a missed state update can be recovered with the next sync.
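A common way to meet this requirement is an acknowledgment-and-retry layer on the sender, paired with the opId-based idempotency already shown in `applyOp`. The sketch below is a minimal illustration under assumed names (`ReliableBroadcaster`, `send`, `onAck` are not from a real library):

```typescript
// At-least-once delivery for op-based CRDTs: the sender buffers each
// operation until every peer has acknowledged it, and resends on a timer.
// Receivers dedupe via opId (as applyOp above already does), which makes
// the inevitable redeliveries harmless.
class ReliableBroadcaster {
  // opId -> the op plus the peers that have not yet acked it
  private pending = new Map<string, { op: IncrementOp; unacked: Set<string> }>();

  constructor(
    private peers: string[],
    private send: (peer: string, op: IncrementOp) => void
  ) {}

  // Broadcast an op and remember it until all peers ack
  broadcast(op: IncrementOp): void {
    this.pending.set(op.opId, { op, unacked: new Set(this.peers) });
    for (const peer of this.peers) this.send(peer, op);
  }

  // Called when a peer confirms receipt of an op
  onAck(peer: string, opId: string): void {
    const entry = this.pending.get(opId);
    if (!entry) return;
    entry.unacked.delete(peer);
    if (entry.unacked.size === 0) this.pending.delete(opId);
  }

  // Run periodically: re-send anything a peer still hasn't acked
  retryUnacked(): void {
    for (const { op, unacked } of this.pending.values()) {
      for (const peer of unacked) this.send(peer, op);
    }
  }
}
```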
The choice between state-based and operation-based CRDTs involves fundamental trade-offs across multiple dimensions. Let's analyze them systematically:
| Dimension | State-based (CvRDT) | Operation-based (CmRDT) |
|---|---|---|
| Message Content | Full replica state | Individual operations |
| Message Size | O(state size) — can be large | O(operation size) — typically small |
| Network Requirement | Eventual delivery (can lose/duplicate) | Reliable broadcast (no loss) |
| Duplicate Handling | Automatic (merge is idempotent) | Needs explicit idempotency or exactly-once |
| Bandwidth (steady state) | Higher (repeated full state) | Lower (only deltas) |
| Bandwidth (recovery) | Lower (one sync catches up) | Higher (replay all missed ops) |
| Implementation Complexity | Moderate (design merge function) | Higher (design commutative ops) |
| Ordering Requirements | None | May need causal ordering for some types |
| Garbage Collection | State can be compacted | May accumulate operation log |
Practical Example: Collaborative Whiteboard
Consider a shared whiteboard where users draw shapes:
State-based approach:

- Each sync sends the entire board: every shape ever drawn.
- Message size grows with the board; a long-lived board can mean megabytes per sync.
- Robust: a lost sync is simply repaired by the next one.
Operation-based approach:

- Each user action ("add rectangle at (x, y)") is broadcast as a small operation.
- Message size stays roughly constant no matter how large the board grows.
- Fragile: a lost draw operation means permanent divergence unless delivery is reliable.
For this scenario, operation-based is clearly better—assuming you can guarantee reliable delivery (most collaboration frameworks do via WebSocket + acknowledgments + retries).
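A rough back-of-the-envelope calculation shows the gap. The figures below (10,000 shapes, ~100 bytes per shape, 10 operations per second) are illustrative assumptions, not measurements:

```typescript
// Whiteboard bandwidth, state-based vs op-based (illustrative numbers only)
const shapeCount = 10_000;
const bytesPerShape = 100;
const opsPerSecond = 10; // drawing actions across all users

const statePerSync = shapeCount * bytesPerShape; // ~1 MB per full-state sync
const opStreamRate = opsPerSecond * bytesPerShape; // ~1 KB/s of operations

console.log(`Full state sync: ~${(statePerSync / 1e6).toFixed(1)} MB each time`);
console.log(`Op-based stream: ~${opStreamRate} bytes/second`);
// As the board grows, state-based cost grows with it; op-based cost tracks
// only the rate of new edits.
```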
The dichotomy between state-based and operation-based CRDTs troubled researchers. State-based was more robust but wasteful with bandwidth. Operation-based was efficient but fragile. Could we have both?
Delta-state CRDTs (introduced by Almeida et al.) answer yes. The key insight: we can compute and transmit only the delta (difference) between states, using state-based merge semantics.
Core Idea:

- Each local update also produces a delta: a small state capturing just that change.
- Deltas accumulate in a buffer and are shipped instead of the full state.
- Receivers apply deltas with the ordinary state-based merge, so all the lattice guarantees still hold.
```typescript
/**
 * Delta-State G-Counter
 *
 * Combines efficiency of op-based with robustness of state-based.
 * Deltas are just the changed portion of state.
 */

interface DeltaGCounter {
  counts: Map<string, number>;
  replicaId: string;
  // Track unsent changes since last sync
  pendingDelta: Map<string, number>;
}

function createDeltaGCounter(replicaId: string): DeltaGCounter {
  return {
    counts: new Map([[replicaId, 0]]),
    replicaId,
    pendingDelta: new Map(),
  };
}

// Increment: update local state AND track the delta
function incrementDelta(counter: DeltaGCounter): DeltaGCounter {
  const current = counter.counts.get(counter.replicaId) ?? 0;
  const newValue = current + 1;
  const newCounts = new Map(counter.counts);
  newCounts.set(counter.replicaId, newValue);
  // Delta only contains our change
  const newDelta = new Map(counter.pendingDelta);
  newDelta.set(counter.replicaId, newValue);
  return { ...counter, counts: newCounts, pendingDelta: newDelta };
}

// Extract delta for transmission (and clear pending)
function extractDelta(counter: DeltaGCounter): {
  counter: DeltaGCounter;
  delta: Map<string, number>;
} {
  const delta = new Map(counter.pendingDelta);
  return {
    counter: { ...counter, pendingDelta: new Map() },
    delta,
  };
}

// Merge a received delta (same as full state merge, but delta is small)
function mergeDelta(
  counter: DeltaGCounter,
  delta: Map<string, number>
): DeltaGCounter {
  const merged = new Map(counter.counts);
  for (const [replicaId, count] of delta) {
    const current = merged.get(replicaId) ?? 0;
    merged.set(replicaId, Math.max(current, count));
  }
  return { ...counter, counts: merged };
}

// Full state sync (fallback if deltas are lost)
function fullMerge(local: DeltaGCounter, remote: DeltaGCounter): DeltaGCounter {
  const merged = new Map(local.counts);
  for (const [replicaId, count] of remote.counts) {
    const current = merged.get(replicaId) ?? 0;
    merged.set(replicaId, Math.max(current, count));
  }
  return { ...local, counts: merged };
}

// Demonstration
function demonstrateDelta() {
  let replicaA = createDeltaGCounter("A");
  let replicaB = createDeltaGCounter("B");

  // A does 3 increments
  replicaA = incrementDelta(replicaA);
  replicaA = incrementDelta(replicaA);
  replicaA = incrementDelta(replicaA);

  // B does 2 increments
  replicaB = incrementDelta(replicaB);
  replicaB = incrementDelta(replicaB);

  // Extract deltas (would be sent over network)
  const { counter: newA, delta: deltaA } = extractDelta(replicaA);
  const { counter: newB, delta: deltaB } = extractDelta(replicaB);
  replicaA = newA;
  replicaB = newB;

  console.log("Delta from A:", Object.fromEntries(deltaA)); // {A: 3}
  console.log("Delta from B:", Object.fromEntries(deltaB)); // {B: 2}

  // Merge deltas (small messages, robust merge)
  replicaA = mergeDelta(replicaA, deltaB);
  replicaB = mergeDelta(replicaB, deltaA);

  // Both now have the same state
  console.log("A counts:", Object.fromEntries(replicaA.counts)); // {A: 3, B: 2}
  console.log("B counts:", Object.fromEntries(replicaB.counts)); // {A: 3, B: 2}
}
```

Key Properties of Delta-State CRDTs:
Deltas are themselves states — They're valid elements of the same lattice. This means multiple deltas can be combined before transmission (delta compaction); see the sketch after this list.
Merge is still idempotent — Receiving the same delta twice is safe. This eliminates the need for exactly-once delivery.
Graceful degradation — If deltas are lost, the system doesn't diverge. The next full state sync repairs everything.
Anti-entropy compatible — Can run background full-state gossip to catch any missed updates, while using deltas for real-time sync.
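The first property above is what enables compaction. A minimal sketch, reusing the `Map<string, number>` delta shape from the `DeltaGCounter` code (the `compactDeltas` name is illustrative):

```typescript
// Delta compaction: because deltas are states in the same lattice, two
// pending deltas can be joined (componentwise max) into one before sending.
function compactDeltas(
  d1: Map<string, number>,
  d2: Map<string, number>
): Map<string, number> {
  const joined = new Map(d1);
  for (const [replicaId, count] of d2) {
    joined.set(replicaId, Math.max(joined.get(replicaId) ?? 0, count));
  }
  return joined;
}

// e.g. joining {A: 2} with {A: 3, B: 1} yields {A: 3, B: 1}:
// one message now carries the effect of both deltas.
```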
Bandwidth Analysis:

- Steady state: each sync carries only what changed, so cost is O(size of delta), comparable to op-based.
- Recovery: a replica that missed deltas can be repaired with a single full-state merge, comparable to state-based.
- Duplicates and reordering cost nothing extra: delta merge is still idempotent and commutative.
Most modern CRDT libraries (Automerge, Yjs, Riak's CRDTs) use delta-state or hybrid approaches. The pure state-based vs op-based distinction is more theoretical than practical. Real systems optimize with deltas, batching, compression, and selective full-syncs for repair.
Moving from theory to production requires addressing practical concerns: network protocols, persistence, garbage collection, and more. Here are patterns used by production CRDT systems:
```
## Hybrid Synchronization Architecture

REPLICA NODE

  Local Ops (user input) ──▶ Delta Generator ──▶ Outbound Queue (batched deltas)
                                                         │
                                                         ▼
                                                  Network Layer
                                             (WebSocket / gRPC / AMQP)
                                                         │
                                                         ▼
  CRDT State (current) ◀── Merge Function ◀── Inbound Queue (incoming sync)
          │
          ▼
  PERSISTENCE LAYER
    Checkpoints (periodic) · Delta Log (recent ops) · Vector Clocks

## Synchronization Channels

1. REAL-TIME CHANNEL (Delta-based)
   - Low latency, small messages
   - Deltas batched every 50-100ms
   - Best-effort delivery OK (anti-entropy repairs)

2. RECOVERY CHANNEL (State-based)
   - Used when replica is far behind
   - Full state or checkpoint + delta log
   - Triggered by: long disconnection, new replica join

3. ANTI-ENTROPY CHANNEL (Background)
   - Periodic full-state comparison via Merkle trees
   - Detects and repairs any divergence
   - Low priority, opportunistic
```

Production systems tune sync frequency based on use case. Real-time collaboration: 50-100ms batches. Mobile offline sync: on-connect plus periodic background. IoT sensors: batched every few seconds to minutes. The CRDT math is the same; the sync layer adapts to requirements.
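As a concrete illustration of the real-time channel's batching, here is a minimal sketch (all names hypothetical; `compactDeltas` is the joining helper from the earlier delta sketch) of an outbound queue that compacts pending deltas and flushes them on a timer:

```typescript
// Batches outgoing deltas: rapid local edits are joined into one pending
// delta, which is flushed at most once per interval (e.g., every 100ms).
class DeltaBatcher {
  private pending: Map<string, number> = new Map();
  private timer: ReturnType<typeof setInterval>;

  constructor(
    flushIntervalMs: number,
    private send: (delta: Map<string, number>) => void
  ) {
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
  }

  // Join a new delta into the pending batch (no unbounded queue growth)
  enqueue(delta: Map<string, number>): void {
    this.pending = compactDeltas(this.pending, delta);
  }

  private flush(): void {
    if (this.pending.size === 0) return; // nothing to send this interval
    this.send(this.pending);
    this.pending = new Map();
  }

  stop(): void {
    clearInterval(this.timer);
  }
}

// Usage sketch (network.broadcast is hypothetical):
// const batcher = new DeltaBatcher(100, delta => network.broadcast(delta));
```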
Given the trade-offs, how do you decide? Here's a systematic framework for choosing between pure state-based, pure operation-based, or hybrid delta-state:
```
## CRDT Synchronization Decision Framework

START: What are your network guarantees?

  UNRELIABLE (UDP, lossy)                  → STATE-BASED (most robust)
  MOSTLY RELIABLE (TCP, rare loss)         → DELTA-STATE (best balance)
  RELIABLE + ORDERED (guaranteed delivery) → consider OP-BASED (most efficient)

THEN: Consider state size × update rate

  Small state + rare updates     → STATE-BASED fine
                                   (user preferences, feature flags)
  Large state + frequent updates → DELTA or OP-BASED
                                   (documents, shared whiteboards)
  Any size + offline-first       → DELTA-STATE
                                   (mobile apps, edge computing)

RECOMMENDATION FOR MOST CASES: Delta-State
  - Bandwidth efficient like op-based
  - Robust like state-based
  - Graceful recovery from network issues
  - Most modern libraries use this approach
```

Practical Recommendations:
For new projects: Start with delta-state (or a library that implements it). It's the most balanced approach and handles the widest range of scenarios.
For unreliable networks: State-based with periodic full sync. The robustness is worth the bandwidth cost.
For optimized hot paths: Pure op-based if you can guarantee reliable delivery. The efficiency gains matter when scaling to millions of operations.
For hybrid architectures: Use op-based for real-time with state-based anti-entropy for repair. This is what most production systems do.
We've now deeply explored the two fundamental approaches to CRDT synchronization and their practical hybrid evolution. Let's consolidate:

- State-based (CvRDT): ship full state; the idempotent, commutative, associative merge tolerates loss, duplication, and reordering, at the cost of bandwidth.
- Operation-based (CmRDT): ship operations; efficient on the wire, but correctness depends on reliable (and sometimes causal) delivery of every operation.
- Delta-state: ship small deltas that are themselves states; op-based efficiency with state-based robustness, which is why most modern libraries use it.
What's Next:
Now that we understand how CRDTs synchronize, the next page explores the Common CRDT Types—the practical building blocks you'll use in production. We'll examine counters, registers, sets, maps, and sequences in detail, understanding the design choices and trade-offs behind each.
You now understand the mechanics, trade-offs, and practical patterns for CRDT synchronization. Whether state-based, operation-based, or the hybrid delta approach—you can reason about bandwidth, reliability, and implementation complexity. Next, we'll use this knowledge to explore specific CRDT data structures.