The CAP theorem might suggest an all-or-nothing choice between consistency and availability, but real-world systems operate on a continuum. Expert system designers don't simply choose "CP" or "AP"—they carefully tune their systems to find the sweet spot that provides maximum consistency while meeting availability requirements.
This page explores the techniques that let systems navigate this continuum deliberately rather than settling for a coarse CP-or-AP label.
These are the techniques that separate adequate distributed systems from excellent ones.
By the end of this page, you will understand the practical techniques for tuning consistency levels, including quorum configurations, consistency level hierarchies, and hybrid approaches that maximize both properties within the constraints of CAP.
Consistency is not binary—there's a rich hierarchy of consistency models, each with different trade-offs for latency and availability. Understanding this spectrum is essential for effective tuning.
| Model | Guarantee | Latency Cost | Availability During Partition |
|---|---|---|---|
| Strict Serializability | Real-time ordering + serializable transactions | Highest | Unavailable |
| Linearizability | Operations appear instantaneous at some point | Very High | Unavailable |
| Sequential Consistency | Operations from each client ordered; global order exists | High | Unavailable |
| Causal Consistency | Causally related operations ordered; concurrent may differ | Medium | Partially Available |
| Read-Your-Writes | Client sees own writes immediately | Low | Available |
| Monotonic Reads | Once a value is seen, older values are never seen | Low | Available |
| Monotonic Writes | Writes from a client are ordered | Low | Available |
| Eventual Consistency | All replicas converge eventually | Lowest | Fully Available |
The insight is that you don't always need linearizability. Many applications function correctly with weaker guarantees that still provide useful properties:
Causal Consistency ensures that if operation A causally precedes operation B, all observers see A before B. This is often sufficient for collaborative applications without the full cost of linearizability.
Read-Your-Writes ensures users see their own updates immediately, even if they don't see others' updates instantly. This matches user intuition for personal data.
Monotonic Reads prevents time-traveling reads where a user refreshes and sees older data. This is critical for coherent user experiences.
The key insight: By choosing the weakest consistency model that satisfies your requirements, you minimize the availability and latency penalty. Don't pay for consistency you don't need.
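Causal consistency depends on knowing which operations are causally related, which is typically tracked with vector clocks. As an illustrative sketch (the type and function names here are ours, not from any particular library), the core comparison looks like this:

```typescript
// Minimal vector clock: map from node ID to a logical counter.
type VectorClock = Record<string, number>;

// Merge two clocks by taking the per-node maximum.
function mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
  const result: VectorClock = { ...a };
  for (const [node, count] of Object.entries(b)) {
    result[node] = Math.max(result[node] ?? 0, count);
  }
  return result;
}

// a happened-before b iff every entry of a is <= the matching entry
// of b, and at least one entry is strictly smaller.
function happensBefore(a: VectorClock, b: VectorClock): boolean {
  let strictlyLess = false;
  for (const node of Object.keys({ ...a, ...b })) {
    const av = a[node] ?? 0;
    const bv = b[node] ?? 0;
    if (av > bv) return false;
    if (av < bv) strictlyLess = true;
  }
  return strictlyLess;
}

// Two clocks are concurrent when neither happened before the other.
function concurrent(a: VectorClock, b: VectorClock): boolean {
  return !happensBefore(a, b) && !happensBefore(b, a);
}
```

A replica enforcing causal consistency delays applying an update until everything that happened-before it (per the clock) has been applied; concurrent updates can be applied in either order.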
```
CONSISTENCY MODEL COMPARISON
════════════════════════════════════════════════════════════════

LINEARIZABILITY (Top of spectrum):
─────────────────────────────────────────
Client A: ───────────── write(x=1) ─────────────────────────────>
Client B: ──────────────────────────── read(x) = 1 ─────────────>

Guarantee: After write completes, ALL subsequent reads see it.
Cost:      Requires coordination across all replicas.
Use when:  Financial transactions, distributed locks.

CAUSAL CONSISTENCY (Middle):
─────────────────────────────────────────
Client A: ─── write(x=1) ──── write(y=2) ───────────────────────>
Client B: ───────────────────────────── read(y=2) ─── read(x=1) ─>

Guarantee: If B saw y=2 (which was written after x=1), B will never
           subsequently see x=<undefined>.
Cost:      Track causality (vector clocks), but no global ordering.
Use when:  Collaborative editing, social feeds.

EVENTUAL CONSISTENCY (Bottom):
─────────────────────────────────────────
Client A: ─── write(x=1) ───────────────────────────────────────>
Client B: ──────────── read(x) = undefined ──── read(x) = 1 ────>
                       (stale)                  (caught up)

Guarantee: If no new writes, eventually all reads return x=1.
Cost:      Minimal coordination, maximum availability.
Use when:  Caches, analytics, content catalogs.
```

Before tuning, determine the actual consistency requirements. Ask: "What anomaly would this weaker model permit, and would users or business logic notice?" If a weaker model permits anomalies that don't matter, use it.
Quorum-based consistency is the most common tuning mechanism in distributed databases. By adjusting the number of nodes that must participate in reads and writes, you can dial between consistency and availability.
R + W > N → Strong Consistency
Where:
- N = the number of replicas that store each piece of data
- W = the number of replicas that must acknowledge a write before it succeeds
- R = the number of replicas that must respond to a read
When R + W > N, every read contacts at least one replica that participated in the most recent write. This guarantees seeing the latest value (absent concurrent writes).
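To make the overlap argument concrete, here is a toy in-memory simulation (ours, not a real client). Writes go to the first W replicas, and reads deliberately query the R replicas least likely to overlap; when R + W > N the two sets must still intersect, so the read finds the newest version anyway:

```typescript
// Each replica stores a (value, timestamp) pair; higher timestamp is newer.
interface Versioned { value: string; timestamp: number }

class QuorumStore {
  private replicas: Array<Map<string, Versioned>>;

  constructor(private n: number, private w: number, private r: number) {
    this.replicas = Array.from({ length: n }, () => new Map());
  }

  // Write to the first W replicas (a real system accepts any W acks).
  write(key: string, value: string, timestamp: number): void {
    for (let i = 0; i < this.w; i++) {
      this.replicas[i].set(key, { value, timestamp });
    }
  }

  // Read from the LAST R replicas (the worst case for overlap) and
  // return the freshest version seen. If R + W > N, at least one
  // contacted replica participated in the latest write.
  read(key: string): string | undefined {
    let best: Versioned | undefined;
    for (let i = this.n - this.r; i < this.n; i++) {
      const v = this.replicas[i].get(key);
      if (v && (!best || v.timestamp > best.timestamp)) best = v;
    }
    return best?.value;
  }
}
```

With N=5, W=3, R=3 the worst-case read set {3,4,5} still intersects the write set {1,2,3}; with W=2, R=2 the sets can be disjoint, which is exactly the eventual-consistency row in the table below.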
| Configuration (N = 5) | Write Availability | Read Availability | Consistency | Best For |
|---|---|---|---|---|
| W=1, R=5 | Survives 4 failures | Requires all 5 | Strong (if no concurrent writes) | Write-heavy, reads can wait |
| W=5, R=1 | Requires all 5 | Survives 4 failures | Strong | Read-heavy, writes can wait |
| W=3, R=3 (QUORUM) | Survives 2 failures | Survives 2 failures | Strong | Balanced workloads |
| W=2, R=2 | Survives 3 failures | Survives 3 failures | Eventual (R+W=4 <= N=5) | High availability, eventual consistency |
| W=1, R=1 | Survives 4 failures | Survives 4 failures | Eventual | Maximum availability |
Strategy 1: Default to QUORUM, override when needed
Most operations use QUORUM (majority) for both reads and writes. This provides strong consistency for the common case. Specific operations that need higher availability can use weaker consistency.
```sql
-- Default: Strong consistency
SELECT * FROM orders WHERE id = ? USING CONSISTENCY QUORUM;
INSERT INTO orders (...) VALUES (...) USING CONSISTENCY QUORUM;

-- Override for analytics: Higher availability, eventual consistency
SELECT COUNT(*) FROM orders WHERE status = 'pending' USING CONSISTENCY ONE;
```
Strategy 2: Write strong, read flexible
Write to a quorum to ensure durability, but allow reads at different consistency levels based on the caller's needs.
```sql
-- Always write with durability guarantee
INSERT INTO inventory (product_id, quantity) VALUES (?, ?) USING CONSISTENCY QUORUM;

-- Strong read for checkout (must not oversell)
SELECT quantity FROM inventory WHERE product_id = ? USING CONSISTENCY QUORUM;

-- Weak read for display (slight staleness OK)
SELECT quantity FROM inventory WHERE product_id = ? USING CONSISTENCY ONE;
```
Strategy 3: LOCAL_QUORUM for multi-datacenter
In multi-region deployments, QUORUM across all datacenters adds significant latency. LOCAL_QUORUM provides quorum within a single datacenter—strong consistency within the region while avoiding cross-region latency.
```sql
-- Cassandra quorum tuning examples

-- Replication factor per datacenter
CREATE KEYSPACE ecommerce WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,
  'eu-west': 3
};

-- LOCAL_QUORUM: Majority within local DC (fast)
-- Provides strong consistency within region
-- Does NOT guarantee consistency across regions immediately
INSERT INTO orders (id, user_id, items)
VALUES (?, ?, ?)
USING CONSISTENCY LOCAL_QUORUM;

SELECT * FROM orders WHERE id = ?
USING CONSISTENCY LOCAL_QUORUM;

-- EACH_QUORUM: Majority in EACH datacenter (slow but globally strong)
-- Use only when global consistency is required (rare)
INSERT INTO global_config (key, value)
VALUES (?, ?)
USING CONSISTENCY EACH_QUORUM;

-- ONE: Single replica response (fast but weak)
-- For read-heavy, latency-sensitive, staleness-tolerant operations
SELECT * FROM product_catalog WHERE product_id = ?
USING CONSISTENCY ONE;

-- ALL: Every replica must respond (slow, low availability)
-- For critical operations where you cannot tolerate any replica lag
INSERT INTO financial_transactions (tx_id, amount, account)
VALUES (?, ?, ?)
USING CONSISTENCY ALL;
-- Caution: Fails if ANY replica is unavailable
```

Quorum guarantees that reads see the latest completed write. During concurrent writes, clients may see different values until one write "wins." For true linearizability, you need consensus protocols (Paxos/Raft) or compare-and-swap operations, not just quorum.
Session guarantees provide consistency properties for a single client's session without requiring global consistency. This is often the sweet spot—users get a consistent view of their own data while the system maintains high availability.
Technique 1: Sticky Sessions
Route a client's requests to the same replica for the duration of their session. This naturally provides read-your-writes and monotonic reads within that session.
Pros: Simple, built into most load balancers. Cons: Failover breaks guarantees; creates load imbalance.
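As an illustrative sketch (real load balancers typically hash a cookie or source IP, and production systems use consistent hashing to limit remapping when replicas change), pinning a session to a replica can be as simple as hashing the session ID:

```typescript
// Deterministically route a session to one of N replicas by hashing
// its session ID. The same session always lands on the same replica,
// so its reads naturally observe its own writes -- until a failover
// or replica-list change remaps it, which breaks the guarantee.
function routeSession(sessionId: string, replicas: string[]): string {
  let hash = 0;
  for (let i = 0; i < sessionId.length; i++) {
    hash = (hash * 31 + sessionId.charCodeAt(i)) >>> 0; // 32-bit rolling hash
  }
  return replicas[hash % replicas.length];
}
```

Note the modulo routing here is the naive form: adding or removing a replica remaps most sessions, which is exactly the failover weakness called out above.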
Technique 2: Read from Write Replica (Temporary)
After a write, route the client's reads to the replica that received the write—just for a short window (e.g., 5 seconds) until replication catches up.
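A minimal sketch of this pinning logic (the class name and the 5-second window are illustrative assumptions; a real system would size the window from observed replication lag):

```typescript
// After a session writes, pin its reads to the replica that took the
// write for a fixed window; once replication has presumably caught up,
// fall back to normal load-balanced routing.
const PIN_WINDOW_MS = 5_000; // assumed replication catch-up window

class WritePinningRouter {
  // session ID -> replica that took the last write, and when
  private pins = new Map<string, { replica: string; writtenAt: number }>();

  constructor(
    private pickAnyReplica: () => string,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  recordWrite(sessionId: string, replica: string): void {
    this.pins.set(sessionId, { replica, writtenAt: this.now() });
  }

  routeRead(sessionId: string): string {
    const pin = this.pins.get(sessionId);
    if (pin && this.now() - pin.writtenAt < PIN_WINDOW_MS) {
      return pin.replica; // inside the window: read where we wrote
    }
    this.pins.delete(sessionId); // window expired: any replica will do
    return this.pickAnyReplica();
  }
}
```

The trade-off: during the window, the write replica absorbs extra read load, and if that replica fails the guarantee is lost for pinned sessions.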
Technique 3: Tokens and Version Vectors
Include the version (timestamp or vector clock) of the client's last write in subsequent requests. Replicas only serve reads if they have caught up to that version.
```typescript
// Implementing Read-Your-Writes with session tokens
// (db client and WriteResult/ReadResult/ReadRequest types are assumed
// to be defined elsewhere)

interface SessionContext {
  // Track the latest write timestamp for this session
  lastWriteTimestamp: number | null;
  // Track the last read version for monotonic reads
  lastReadVersion: string | null;
}

class ConsistentClient {
  private session: SessionContext = {
    lastWriteTimestamp: null,
    lastReadVersion: null,
  };

  async write(table: string, key: string, value: any): Promise<WriteResult> {
    const result = await this.db.write(table, key, value);
    // Track our write timestamp for read-your-writes
    this.session.lastWriteTimestamp = result.timestamp;
    return result;
  }

  async read(table: string, key: string): Promise<ReadResult> {
    // Include session context in read request
    const result = await this.db.read(table, key, {
      // Read-your-writes: Wait for our last write to be visible
      minTimestamp: this.session.lastWriteTimestamp,
      // Monotonic reads: Don't return anything older than we've seen
      minVersion: this.session.lastReadVersion,
    });

    // Update session for future monotonic reads
    if (result.version > (this.session.lastReadVersion || '')) {
      this.session.lastReadVersion = result.version;
    }
    return result;
  }
}

// Server-side implementation of session-aware reads
class ReplicaNode {
  async handleRead(request: ReadRequest): Promise<ReadResult> {
    const { table, key, minTimestamp, minVersion } = request;

    // Check if we've replicated up to the required point
    const ourVersion = await this.getLocalVersion(table, key);

    if (minTimestamp && ourVersion.timestamp < minTimestamp) {
      // Option 1: Wait briefly for replication to catch up
      const caughtUp = await this.waitForReplication(table, key, minTimestamp, {
        timeout: 100, // 100ms max wait
      });
      if (!caughtUp) {
        // Option 2: Forward to a replica that has the data
        return await this.forwardToUpToDateReplica(request);
      }
    }

    if (minVersion && ourVersion.version < minVersion) {
      // Same logic for version-based monotonic reads
      return await this.forwardToUpToDateReplica(request);
    }

    // We're up to date; serve the read locally
    return await this.localRead(table, key);
  }
}
```

Session guarantees often provide exactly what users expect at a fraction of the cost of global strong consistency. A user editing their profile doesn't need everyone to see the change instantly—they just need to see it themselves immediately. Don't over-engineer global consistency when session guarantees suffice.
Static consistency configurations are limiting. Advanced systems adapt their consistency behavior based on current conditions—tightening during normal operation and relaxing when availability is threatened.
- Normal Operation: Use strong consistency (quorum, synchronous replication).
- Degraded Operation: When nodes fail or partitions occur, automatically relax to weaker consistency to maintain availability.
- Recovery: After conditions improve, strengthen consistency and reconcile any divergence.
| System State | Consistency Mode | Trade-off |
|---|---|---|
| All replicas healthy | Strong (QUORUM or ALL) | Full consistency, full availability |
| Minority unavailable | Strong (QUORUM) | Consistency preserved, slightly reduced availability |
| Majority unavailable | Weak (ONE) + tracking | Best-effort availability, track divergence |
| Partition detected | Local consistency only | Available within partition, mark for reconciliation |
| Partition healed | Reconciliation mode | Merge divergent data, restore consistency |
Step 1: Detect Degradation
Monitor replica health, replication lag, and network connectivity. Establish thresholds for triggering adaptation.
Step 2: Graceful Degradation
When thresholds are crossed:
- Reduce consistency levels for less critical operations first (e.g., QUORUM → LOCAL_QUORUM → ONE)
- Preserve strong consistency for the most critical operations as long as possible
- Record the state transition and alert operators
Step 3: Track Divergence
When operating in degraded mode:
- Record which keys received writes that may conflict with writes elsewhere
- Store version information (timestamps or vector clocks) alongside the divergent values
- Emit metrics so operators can see how much divergence is accumulating
Step 4: Reconcile on Recovery
When conditions improve:
- Restore stronger consistency levels
- Merge divergent versions, invoking conflict resolution (last-write-wins or application-specific) where versions conflict
- Clear divergence tracking once reconciliation completes
```typescript
// Adaptive consistency controller

enum HealthState {
  HEALTHY = 'healthy',
  DEGRADED = 'degraded',
  CRITICAL = 'critical',
  PARTITIONED = 'partitioned',
}

interface AdaptiveConfig {
  healthyThreshold: number;  // Min healthy replicas for HEALTHY state
  degradedThreshold: number; // Min for DEGRADED (vs CRITICAL)
  recoveryDelay: number;     // Time to wait before upgrading state
}

class AdaptiveConsistencyManager {
  private currentState: HealthState = HealthState.HEALTHY;
  private divergedKeys: Set<string> = new Set();

  constructor(
    private replicaMonitor: ReplicaMonitor,
    private config: AdaptiveConfig,
  ) {
    // Continuously monitor and adapt
    this.replicaMonitor.on('stateChange', () => this.evaluateState());
  }

  evaluateState(): HealthState {
    const healthyReplicas = this.replicaMonitor.getHealthyReplicaCount();
    const isPartitioned = this.replicaMonitor.detectPartition();

    if (isPartitioned) {
      this.transitionTo(HealthState.PARTITIONED);
    } else if (healthyReplicas >= this.config.healthyThreshold) {
      this.transitionTo(HealthState.HEALTHY);
    } else if (healthyReplicas >= this.config.degradedThreshold) {
      this.transitionTo(HealthState.DEGRADED);
    } else {
      this.transitionTo(HealthState.CRITICAL);
    }
    return this.currentState;
  }

  getConsistencyLevel(
    operation: string,
    criticality: 'low' | 'medium' | 'high',
  ): ConsistencyLevel {
    switch (this.currentState) {
      case HealthState.HEALTHY:
        // Full consistency for all operations
        return ConsistencyLevel.QUORUM;

      case HealthState.DEGRADED:
        // High criticality keeps quorum; lower can reduce
        if (criticality === 'high') return ConsistencyLevel.QUORUM;
        if (criticality === 'medium') return ConsistencyLevel.LOCAL_QUORUM;
        return ConsistencyLevel.ONE;

      case HealthState.CRITICAL:
        // Only highest criticality keeps quorum
        if (criticality === 'high') return ConsistencyLevel.LOCAL_QUORUM;
        return ConsistencyLevel.ONE;

      case HealthState.PARTITIONED:
        // Operate locally, track divergence
        return ConsistencyLevel.LOCAL_ONE;
    }
  }

  async recordDivergence(key: string, value: any, vectorClock: VectorClock): Promise<void> {
    // Track that this key may have conflicting values
    this.divergedKeys.add(key);

    // Store the conflicting version for later reconciliation
    await this.conflictStore.store({
      key,
      value,
      vectorClock,
      partition: this.replicaMonitor.getCurrentPartitionId(),
      timestamp: Date.now(),
    });

    this.metrics.increment('divergence.keys_tracked');
  }

  async reconcile(): Promise<ReconciliationResult> {
    const results: ReconciliationResult = {
      total: this.divergedKeys.size,
      merged: 0,
      conflicts: 0,
      errors: 0,
    };

    for (const key of this.divergedKeys) {
      try {
        const versions = await this.conflictStore.getVersions(key);
        const merged = await this.mergeStrategy.merge(versions);

        if (merged.hadConflict) {
          results.conflicts++;
          // Apply conflict resolution (LWW, app-specific, etc.)
          await this.applyResolution(key, merged.resolved);
        } else {
          results.merged++;
          // Simple merge (versions were compatible)
          await this.applyMerge(key, merged.value);
        }
      } catch (error) {
        results.errors++;
        this.logger.error(`Reconciliation failed for key: ${key}`, error);
      }
    }

    // Clear tracking after reconciliation
    this.divergedKeys.clear();
    return results;
  }
}
```

Adaptive consistency adds significant operational complexity. You need monitoring, state machines, reconciliation logic, and extensive testing. Use it only when static consistency is truly inadequate—and start simple with per-query consistency overrides before building full adaptive systems.
While CAP prevents having perfect consistency AND perfect availability during partitions, several techniques minimize the trade-off severity.
Conflict-free Replicated Data Types (CRDTs) are data structures mathematically guaranteed to merge without conflicts. They allow fully available writes with automatic convergence.
G-Counter (Grow-only counter): Each node maintains its own counter. The total is the sum of all node counters. Increments never conflict.
PN-Counter (Positive-Negative counter): Two G-Counters: one for increments, one for decrements. Value = positive - negative.
G-Set (Grow-only set): Elements can be added but never removed. Set is union of all node sets.
OR-Set (Observed-Remove set): Adds and removes tracked with unique tags. Removes only remove tags seen by that node.
LWW-Register (Last-Write-Wins register): Each write has a timestamp. Highest timestamp wins. Simple but may lose writes.
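To make the PN-Counter design above concrete, here is a minimal sketch (class and method names are ours): two grow-only maps per replica, merged by taking the per-node maximum of each, so concurrent updates never conflict.

```typescript
// PN-Counter: a grow-only increment map plus a grow-only decrement map.
// Value = total increments - total decrements. Merging takes the
// per-node maximum of each map, which is commutative, associative,
// and idempotent -- so replicas converge in any merge order.
class PNCounter {
  private inc = new Map<string, number>();
  private dec = new Map<string, number>();

  constructor(private nodeId: string) {}

  increment(by: number = 1): void {
    this.inc.set(this.nodeId, (this.inc.get(this.nodeId) ?? 0) + by);
  }

  decrement(by: number = 1): void {
    this.dec.set(this.nodeId, (this.dec.get(this.nodeId) ?? 0) + by);
  }

  value(): number {
    const sum = (m: Map<string, number>) =>
      Array.from(m.values()).reduce((a, b) => a + b, 0);
    return sum(this.inc) - sum(this.dec);
  }

  merge(other: PNCounter): void {
    for (const [node, n] of other.inc) {
      this.inc.set(node, Math.max(this.inc.get(node) ?? 0, n));
    }
    for (const [node, n] of other.dec) {
      this.dec.set(node, Math.max(this.dec.get(node) ?? 0, n));
    }
  }
}
```

One caveat: a PN-Counter cannot enforce invariants like "never below zero" under concurrency, since two replicas may each decrement while partitioned.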
```typescript
// CRDT implementation examples

// G-Counter: Grow-only counter
class GCounter {
  private counts: Map<string, number> = new Map();

  constructor(private nodeId: string) {
    this.counts.set(nodeId, 0);
  }

  increment(by: number = 1): void {
    const current = this.counts.get(this.nodeId) || 0;
    this.counts.set(this.nodeId, current + by);
  }

  value(): number {
    return Array.from(this.counts.values()).reduce((sum, n) => sum + n, 0);
  }

  merge(other: GCounter): void {
    // Take max of each node's count
    for (const [nodeId, count] of other.counts) {
      const existing = this.counts.get(nodeId) || 0;
      this.counts.set(nodeId, Math.max(existing, count));
    }
  }
}

// OR-Set: Observed-Remove Set (supports add and remove)
class ORSet<T> {
  // Map from element to set of unique add tags
  private elements: Map<T, Set<string>> = new Map();

  constructor(private nodeId: string) {}

  private generateTag(): string {
    return `${this.nodeId}:${Date.now()}:${Math.random()}`;
  }

  add(element: T): void {
    if (!this.elements.has(element)) {
      this.elements.set(element, new Set());
    }
    this.elements.get(element)!.add(this.generateTag());
  }

  remove(element: T): void {
    // Simplification: drop all locally observed tags. A full OR-Set
    // keeps tombstones for removed tags so the removal survives a
    // merge with a replica that still carries them; here, such a
    // merge would re-add the element.
    this.elements.delete(element);
  }

  has(element: T): boolean {
    const tags = this.elements.get(element);
    return tags !== undefined && tags.size > 0;
  }

  values(): T[] {
    return Array.from(this.elements.entries())
      .filter(([_, tags]) => tags.size > 0)
      .map(([element, _]) => element);
  }

  merge(other: ORSet<T>): void {
    // Union of all elements and their tags
    for (const [element, tags] of other.elements) {
      if (!this.elements.has(element)) {
        this.elements.set(element, new Set());
      }
      for (const tag of tags) {
        this.elements.get(element)!.add(tag);
      }
    }
  }
}

// Usage: Real-time collaborative shopping cart
const cart1 = new ORSet<string>('node-1');
const cart2 = new ORSet<string>('node-2');

// User A adds items on node 1
cart1.add('apple');
cart1.add('banana');

// User B adds items on node 2 (concurrent, during partition)
cart2.add('cherry');
cart2.add('banana'); // Same item, different tag

// After partition heals, merge
cart1.merge(cart2);
cart2.merge(cart1);

// Both carts now have: apple, banana, cherry
// No conflicts, no lost items!
```

CRDTs are powerful but not universal. They work well for accumulating data (counters, sets, append-only logs) but poorly for general-purpose mutable state. Evaluate whether your data semantics fit CRDT patterns before adopting them.
You can't tune what you can't measure. Effective consistency tuning requires robust observability into how your system behaves under various conditions.
Synthetic Probes: Periodically write a known probe value with strong consistency, then read it back from each replica to measure staleness and replication lag directly.
Production Instrumentation: Record the consistency level used for each operation, along with replication lag, stale-read rates, partition events, and conflict rates.
```typescript
// Consistency monitoring implementation

interface ConsistencyMetrics {
  replicationLagMs: Histogram;
  quorumLatencyMs: Histogram;
  consistencyLevel: Counter;
  staleReadRate: Gauge;
  partitionEvents: Counter;
  conflictRate: Counter;
}

class ConsistencyMonitor {
  private metrics: ConsistencyMetrics;

  // Synthetic probe to measure staleness
  async measureStaleness(): Promise<void> {
    const probeKey = '__consistency_probe__';
    const probeValue = `${Date.now()}`;

    // Write with strong consistency
    await this.db.write(probeKey, probeValue, { consistency: 'QUORUM' });

    // Read from each replica
    const results = await Promise.all(
      this.replicas.map(async (replica) => {
        const value = await replica.localRead(probeKey);
        return {
          replica: replica.id,
          value,
          isStale: value !== probeValue,
          staleness: value ? parseInt(probeValue) - parseInt(value) : null,
        };
      })
    );

    // Record metrics
    for (const result of results) {
      if (result.isStale) {
        this.metrics.staleReadRate.inc({ replica: result.replica });
        this.metrics.replicationLagMs.observe(
          result.staleness || 0,
          { replica: result.replica }
        );
      }
    }
  }

  // Track consistency level usage
  recordOperation(operation: string, consistencyLevel: string): void {
    this.metrics.consistencyLevel.inc({
      operation,
      level: consistencyLevel,
    });
  }

  // Alert on concerning patterns
  async evaluateHealth(): Promise<HealthStatus> {
    const lag = await this.metrics.replicationLagMs.getP99();
    const staleRate = await this.metrics.staleReadRate.getValue();

    if (lag > CRITICAL_LAG_MS || staleRate > CRITICAL_STALE_RATE) {
      this.alert({
        severity: 'critical',
        message: `Consistency degradation: lag=${lag}ms, staleRate=${staleRate}`,
      });
      return HealthStatus.CRITICAL;
    }

    if (lag > WARNING_LAG_MS || staleRate > WARNING_STALE_RATE) {
      return HealthStatus.WARNING;
    }

    return HealthStatus.HEALTHY;
  }
}
```

Create dedicated dashboards for consistency metrics, separate from general system health. When investigating an incident, you should be able to quickly see: What was the replication lag at the time? Were any replicas unreachable? Did consistency levels change? What was the stale read rate?
We've explored the techniques that allow systems to navigate the consistency-availability spectrum intelligently. Let's consolidate the key insights.
With the technical toolkit established, the next page explores how to align consistency choices with business requirements. We'll examine how to translate business priorities into technical consistency decisions, and how to communicate trade-offs to non-technical stakeholders.
You now have a comprehensive toolkit for tuning consistency and availability. These techniques—quorum configuration, session guarantees, adaptive consistency, CRDTs, and observability—are the practical application of CAP theory in production systems.