Imagine you're transferring $10,000 between two bank accounts. You initiate the transfer on your phone while sitting at a café. A millisecond later, your spouse checks the account balance on their laptop at home. What should they see?
If the system shows the old balance—money in both accounts simultaneously—you've witnessed a consistency violation. If it shows money in neither account, that's equally problematic. The only acceptable answer is that your spouse sees either the complete pre-transfer state or the complete post-transfer state, never an intermediate or phantom state.
This scenario illustrates why strong consistency exists: some operations are so critical that any deviation from a single, globally-agreed-upon view of data is unacceptable. Strong consistency provides the gold standard guarantee—every read returns the most recent write, as if there were only one copy of the data and all operations happened sequentially.
By the end of this page, you will understand what strong consistency truly means, how linearizability provides the theoretical foundation, why real-time ordering matters, and the mechanisms—like consensus protocols—that make strong consistency achievable. You'll also understand the inherent costs and trade-offs that make strong consistency a carefully considered choice, not a default.
Strong consistency is a guarantee that once a write completes, all subsequent reads will return that written value. There is no window where different clients can observe different data—the system behaves as if there were a single, authoritative copy of every piece of data, even when that data is physically replicated across multiple nodes, data centers, or continents.
Formally, strong consistency provides the following guarantees:
1. Total Order of Operations: All operations (reads and writes) can be placed into a single, global, sequential order that respects real-time. If operation A completes before operation B starts, then A appears before B in this global order.
2. Reads Reflect Latest Writes: Every read operation returns the value from the most recent write operation in this global order. There are no stale reads.
3. Agreement Across All Observers: All clients, regardless of which replica they contact, observe the same sequence of states. There is no disagreement about what the current state is.
| Guarantee | Meaning | Violation Example |
|---|---|---|
| Recency | Reads always return the latest committed write | Client reads version 5 while version 6 is already committed |
| Total Ordering | All operations form a single global timeline | Operation A completes before B in real-time, but B's effects are observed first |
| Durability | Acknowledged writes survive failures | Write acknowledged but lost during subsequent node crash |
| Atomicity | Operations complete entirely or not at all | Partial update visible—some fields updated, others not |
Strong consistency creates an abstraction where a distributed system behaves like a single machine. From the client's perspective, it doesn't matter that data is replicated—every interaction feels like talking to one authoritative server. This illusion is extraordinarily valuable for application developers, but extraordinarily expensive to maintain.
Linearizability is the formal term for what we informally call strong consistency. Introduced by Maurice Herlihy and Jeannette Wing in 1990, linearizability provides a rigorous definition for correctness in concurrent systems.
A system is linearizable if every operation appears to take effect instantaneously at some point between its invocation and its response. This point is called the linearization point.
Consider the timeline of an operation:
```
├── Operation Invocation ──────────────●────────────────── Response ──┤
                                       ↑
                               Linearization Point
                            (effect takes place here)
```
Key properties of linearizability:
1. Real-Time Ordering: If operation A's response precedes operation B's invocation in real (wall-clock) time, then A must be ordered before B in the linearization. This real-time requirement is what distinguishes linearizability from serializability, which permits orderings that ignore wall-clock time.
2. Sequential Specification: The effects of all operations must be explainable by some sequential execution of those operations, where each operation appears atomic.
3. External Consistency: Observations by any external entity are consistent with a single global order. This includes clients, monitoring systems, or any observer outside the system.
```python
# Example: Testing for Linearizability
# Client A and Client B interact with a register
# Initial value: x = 0

# Timeline:
# ─────────────────────────────────────────────────────────────────
# Time        0    1    2    3    4    5    6    7    8
# ─────────────────────────────────────────────────────────────────
# Client A:   [──── write(x=1) ────]
#                                  ↑ response at time 3
#
# Client B:             [──── read(x) ────]
#                       ↑ invocation at time 2
#                                         ↑ response at time 5
# ─────────────────────────────────────────────────────────────────

# QUESTION: What can Client B's read return?

# Analysis:
# - Client A's write spans time [0, 3]
# - Client B's read spans time [2, 5]
# - These operations OVERLAP (time 2-3)

# In a linearizable system:
# - If write's linearization point is before read's: read returns 1
# - If read's linearization point is before write's: read returns 0
# - BOTH are valid! (because operations overlap)

# What WOULD violate linearizability:
# - If write completes (response at time 3)
# - Then read STARTS (invocation at time 4)
# - Read returns 0
# This violates linearizability because there's no possible
# linearization order where read comes before write.


class LinearizabilityChecker:
    """
    Simplified linearizability checker for educational purposes.
    Real checkers (like Jepsen's Knossos) are far more complex.
    """

    def __init__(self):
        self.history = []  # List of (op_type, op_id, value, start, end)

    def add_operation(self, op_type: str, op_id: str, value,
                      start_time: float, end_time: float):
        """Record an operation with its time bounds."""
        self.history.append({
            'type': op_type,
            'id': op_id,
            'value': value,
            'start': start_time,
            'end': end_time
        })

    def check_linearizable(self) -> bool:
        """
        Attempt to find a linearization of the history.
        Returns True if a valid linearization exists.
        This is NP-complete in general, so we use heuristics
        for real-world checking.
        """
        # Sort by end time (partial order)
        sorted_ops = sorted(self.history, key=lambda x: x['end'])

        # Try to find valid linearization points
        # (Simplified: actual algorithm is much more complex)
        for op in sorted_ops:
            # Each operation must have a linearization point
            # within its [start, end] interval
            if not self._can_linearize(op):
                return False
        return True

    def _can_linearize(self, op) -> bool:
        """Check if operation can be assigned a valid linearization point."""
        # Implementation would check all ordering constraints
        # This is where the NP-complete complexity lives
        return True  # Placeholder
```

Linearizability vs Serializability:
These terms are often confused, but they differ fundamentally:
| Property | Linearizability | Serializability |
|---|---|---|
| Scope | Individual operations | Transactions (groups of operations) |
| Time Sensitivity | Respects real-time | Does not require real-time ordering |
| Use Case | Single-object operations | Multi-object transactions |
| Guarantee | Recent value | Isolation of concurrent transactions |
Strict Serializability combines both: transactions must be serializable AND respect real-time ordering. This is what systems like Google Spanner provide.
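To see the difference concretely, the sketch below (illustrative values only, no particular database assumed) shows a history that is serializable but not strictly serializable: the serial order T2 before T1 explains every result, yet T1 had already committed before T2 even started, so that order contradicts real time.

```python
# A history that is serializable but NOT strictly serializable.
# x starts at 0; T1 writes x=1 and commits at t=3; T2 starts at t=5
# and reads x -> 0.
#
# Serializable?          Yes: the serial order T2, T1 explains the results
#                        (T2 reads the initial 0, then T1 writes 1).
# Strictly serializable? No: T1 committed before T2 began, so a
#                        real-time-respecting order must put T1 first,
#                        and then T2's read should have returned 1.

def respects_real_time(serial_order, start_times, commit_times):
    """A serial order respects real time if it never places transaction B
    after transaction A when B committed before A even started."""
    for i, a in enumerate(serial_order):
        for b in serial_order[i + 1:]:
            if commit_times[b] < start_times[a]:
                return False  # b finished before a began, yet is ordered after a
    return True

starts = {"T1": 0, "T2": 5}
commits = {"T1": 3, "T2": 7}
print(respects_real_time(["T2", "T1"], starts, commits))  # False: violates real time
print(respects_real_time(["T1", "T2"], starts, commits))  # True:  the only strict order
```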
Strong consistency doesn't happen automatically in distributed systems—it must be explicitly engineered. Several mechanisms can provide linearizability, each with different trade-offs:
Approach 1: Single Leader (Primary-Based)
The simplest approach: route all writes through a single leader node. The leader serializes all operations, assigning each a sequence number. Reads can either go to the leader (guaranteeing freshness) or to followers after confirming they're caught up.
Pros: Simple to implement, low coordination overhead for writes.
Cons: Leader is a single point of failure; leader changes are complex.
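A minimal sketch of the single-leader approach, assuming an in-memory register and synchronous replication to all followers (the class and method names are illustrative, not any particular database's API):

```python
class Follower:
    """Toy follower that applies entries in the order the leader assigns."""
    def __init__(self):
        self.data = {}

    def apply(self, seq: int, key: str, value):
        self.data[key] = (seq, value)


class SingleLeaderRegister:
    """Minimal single-leader replication sketch (not production code)."""
    def __init__(self, followers):
        self.followers = followers
        self.sequence = 0   # leader assigns a total order to writes
        self.data = {}

    def write(self, key: str, value) -> int:
        # The leader serializes all writes by assigning sequence numbers.
        self.sequence += 1
        self.data[key] = (self.sequence, value)
        # Replicate synchronously before acknowledging, so an acknowledged
        # write survives if a caught-up follower is later promoted.
        for follower in self.followers:
            follower.apply(self.sequence, key, value)
        return self.sequence

    def read(self, key: str):
        # Reads served by the leader always reflect the latest acknowledged write.
        entry = self.data.get(key)
        return entry[1] if entry else None


leader = SingleLeaderRegister([Follower(), Follower()])
leader.write("balance", 100)
print(leader.read("balance"))  # 100
```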
Approach 2: Consensus Protocols (Raft, Paxos, Viewstamped Replication)
Consensus protocols allow a cluster to agree on a sequence of operations despite node failures. A majority (quorum) of nodes must agree before an operation is committed.
Pros: Tolerates minority failures, no single point of failure after leader election.
Cons: Higher latency (majority must respond), leader election pauses availability.
```python
# Consensus-Based Strong Consistency
# Simplified view of how Raft provides linearizability

class RaftNode:
    """
    Simplified Raft node demonstrating strong consistency.
    Real implementations are significantly more complex.
    """

    def __init__(self, node_id: str, cluster_nodes: list):
        self.node_id = node_id
        self.cluster = cluster_nodes
        self.state = "follower"   # follower, candidate, leader
        self.current_term = 0
        self.log = []             # List of (term, index, command)
        self.commit_index = 0
        self.state_machine = {}   # Actual data storage

    def write(self, key: str, value: any) -> bool:
        """
        Strong consistency write operation.
        Only leader can accept writes.
        """
        if self.state != "leader":
            # Redirect to leader
            return self._forward_to_leader("write", key, value)

        # Step 1: Append to local log (not committed yet)
        entry = {
            'term': self.current_term,
            'index': len(self.log),
            'command': {'type': 'write', 'key': key, 'value': value}
        }
        self.log.append(entry)

        # Step 2: Replicate to followers
        # Must get acknowledgment from MAJORITY before committing
        acks = 1  # Count self
        quorum = len(self.cluster) // 2 + 1

        for node in self.cluster:
            if node != self.node_id:
                # Send AppendEntries RPC
                if self._send_append_entries(node, entry):
                    acks += 1

        # Step 3: If majority acknowledged, commit
        if acks >= quorum:
            self.commit_index = entry['index']
            # Apply to state machine
            self.state_machine[key] = value
            # Notify followers to commit
            self._broadcast_commit(entry['index'])
            return True

        # Failed to reach quorum - DO NOT acknowledge to client
        return False

    def read(self, key: str) -> any:
        """
        Linearizable read operation.
        Multiple strategies possible.
        """
        if self.state == "leader":
            # Option A: Read from leader
            # Must confirm we're still leader (quorum check)
            if not self._confirm_leadership():
                return self._forward_to_leader("read", key)
            # Safe to read - we're confirmed leader
            return self.state_machine.get(key)
        else:
            # Option B: Forward to leader
            return self._forward_to_leader("read", key)
            # Option C: Read from follower with ReadIndex protocol
            # (requires asking leader for current commit index)

    def _confirm_leadership(self) -> bool:
        """
        Heartbeat to quorum to confirm we're still leader.
        Prevents stale reads after network partition.
        """
        responses = 1  # Count self
        quorum = len(self.cluster) // 2 + 1
        for node in self.cluster:
            if node != self.node_id:
                # Send heartbeat
                if self._send_heartbeat(node):
                    responses += 1
        return responses >= quorum

    def _send_append_entries(self, node: str, entry: dict) -> bool:
        """Send log entry to follower node."""
        # Network call - may fail, timeout, etc.
        # Real implementation handles retries, term checking, etc.
        pass

    def _forward_to_leader(self, operation: str, *args) -> any:
        """Forward operation to current leader."""
        # Real implementation discovers leader via cluster state
        pass
```

Approach 3: Quorum Systems with Read Repair
In systems like DynamoDB or Cassandra with strongly consistent operations enabled, reads and writes go to multiple nodes simultaneously. With N replicas, writing to W nodes and reading from R nodes where W + R > N guarantees that every read quorum overlaps the most recent successful write quorum; combined with read repair (bringing stale replicas up to date when a read detects divergence), this yields strongly consistent reads.
Example: With 3 replicas (N=3), write to 2 nodes (W=2), read from 2 nodes (R=2). Since 2 + 2 = 4 > 3, at least one node in every read will have the latest write.
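The overlap argument can be checked mechanically. This small sketch enumerates every possible write quorum and read quorum and confirms that with W + R > N each pair shares at least one replica:

```python
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Check that every possible write quorum intersects every read quorum."""
    replicas = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(replicas, w)
               for rq in combinations(replicas, r))

print(quorums_always_overlap(n=3, w=2, r=2))  # True:  2 + 2 > 3, overlap guaranteed
print(quorums_always_overlap(n=3, w=1, r=1))  # False: 1 + 1 <= 3, a read can miss the write
```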
Approach 4: Synchronized Clocks (TrueTime)
Google Spanner uses synchronized atomic clocks to bound clock uncertainty. Transactions can be ordered globally using timestamps without per-transaction coordination, as long as operations wait for clock uncertainty to pass. This enables external consistency at global scale.
Strong consistency is not free. The CAP theorem and the PACELC framework formalize the fundamental trade-offs:
CAP Theorem: In the presence of a network Partition (P), a distributed system must choose between Consistency (C), where every read sees the most recent write, and Availability (A), where every request receives a response.
Strong consistency systems choose C over A. During network partitions, strongly consistent systems become unavailable rather than risk returning stale data.
PACELC Extension: Even when there is no partition (Else), there is a trade-off between Latency (L) and Consistency (C).
Strong consistency requires coordination—waiting for quorums, confirming leadership, propagating updates. This coordination adds latency.
| Scenario | Weak Consistency Latency | Strong Consistency Latency | Overhead |
|---|---|---|---|
| Single datacenter read | ~1ms (local) | ~5-10ms (leader roundtrip) | 5-10x |
| Cross-datacenter read | ~1ms (local replica) | ~50-100ms (remote leader) | 50-100x |
| Write operation | ~5ms (async replication) | ~20-50ms (synchronous quorum) | 4-10x |
| Global transaction | N/A | ~150-300ms (global consensus) | N/A |
When a strongly consistent system loses contact with a majority of nodes, it MUST stop accepting writes. This is not a bug—it's the correctness guarantee in action. Accepting writes without consensus would risk split-brain scenarios where different partitions accept conflicting updates. For mission-critical systems, this unavailability is preferable to inconsistency.
Throughput Implications:
Strong consistency also limits throughput:
Serialization bottleneck: All operations must be ordered against each other. Even if you have 100 database nodes, all writes effectively pass through a single logical serialization point.
Cross-shard coordination: In sharded systems, transactions spanning multiple shards require distributed coordination (2PC, 3PC, or Saga patterns), multiplying latency (a minimal two-phase commit sketch appears below).
Replication overhead: Every write must be replicated to a quorum before acknowledgment, consuming network bandwidth and disk I/O on multiple nodes.
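As a rough illustration of why cross-shard transactions multiply latency, here is a bare-bones two-phase commit coordinator; the participant interface with prepare/commit/abort is an assumption for the sketch, not taken from any specific system. Every transaction pays one round trip to all involved shards to prepare and a second to commit or abort, on top of each shard's own replication work.

```python
class TwoPhaseCommitCoordinator:
    """Bare-bones 2PC sketch: two network round trips per cross-shard transaction."""

    def __init__(self, participants: dict):
        # Maps shard name -> participant object exposing prepare()/commit()/abort()
        self.participants = participants

    def execute(self, transaction_id: str, writes_by_shard: dict) -> bool:
        # Phase 1: every involved shard must promise it can commit (round trip #1).
        prepared = []
        for shard, writes in writes_by_shard.items():
            participant = self.participants[shard]
            if participant.prepare(transaction_id, writes):
                prepared.append(participant)
            else:
                # Any refusal aborts the whole transaction everywhere.
                for p in prepared:
                    p.abort(transaction_id)
                return False

        # Phase 2: all shards promised, so commit everywhere (round trip #2).
        for p in prepared:
            p.commit(transaction_id)
        return True
```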
Typical throughput differences:
| Configuration | Eventual Consistency | Strong Consistency |
|---|---|---|
| Single-shard writes | 50K ops/sec | 10K ops/sec |
| Multi-shard transactions | 30K ops/sec | 1-5K ops/sec |
| Global distribution | 100K+ ops/sec | 5-10K ops/sec |
Despite the costs, many applications genuinely require strong consistency. The decision framework centers on the cost of inconsistency:
Question: What happens if two clients briefly see different values?
If the answer involves money, safety, security, or correctness violations—strong consistency is likely necessary.
Most real-world systems use strong consistency selectively. Critical operations (payments, inventory decrements) use strongly consistent paths, while high-volume, less-critical operations (feeds, counters) use eventual consistency. This hybrid approach optimizes for both correctness and performance.
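One way to express this selectivity in application code is to let each call site declare the consistency level it needs, so only critical paths pay the coordination cost. The client API below is purely hypothetical, a sketch rather than any real database's interface:

```python
class Replica:
    """Toy replica: may lag behind the leader."""
    def __init__(self):
        self.data = {}

    def local_read(self, key):
        return self.data.get(key)


class Leader(Replica):
    """Toy leader: holds the authoritative, most recent state."""
    def linearizable_read(self, key):
        # A real system would confirm leadership or commit index first.
        return self.data.get(key)


class StoreClient:
    """Hypothetical client exposing a per-operation consistency choice."""
    def __init__(self, leader, replicas):
        self.leader = leader
        self.replicas = replicas

    def read(self, key, consistency="eventual"):
        if consistency == "strong":
            # Pay the coordination cost only where correctness demands it.
            return self.leader.linearizable_read(key)
        # Cheap, possibly stale read for feeds, counters, caches, etc.
        return self.replicas[0].local_read(key)


client = StoreClient(Leader(), [Replica()])
client.read("account:1234:balance", consistency="strong")  # critical path
client.read("post:9876:likes")                             # stale value acceptable
```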
Several production systems offer strong consistency guarantees, each with different approaches:
| System | Consistency Mechanism | Scope | Trade-offs |
|---|---|---|---|
| Google Spanner | TrueTime + Paxos | Global, multi-region | Requires synchronized clocks; higher latency for commits |
| CockroachDB | Raft consensus | Multi-region distributed SQL | Write latency increases with geographic distance |
| etcd | Raft consensus | Configuration/coordination | Not designed for high-throughput data workloads |
| ZooKeeper | Zab protocol | Coordination service | Limited to coordination use cases; not a general database |
| FoundationDB | Optimistic concurrency + Paxos | Ordered key-value store | Transaction size limits; learning curve |
| TiDB | Raft + 2PC | MySQL-compatible distributed SQL | Cross-shard transactions add latency |
| Amazon Aurora | Quorum writes + storage fleet | Regional MySQL/PostgreSQL | Single-region only for strong consistency |
```python
# How Google Spanner Achieves Global Strong Consistency
# Using TrueTime for External Consistency

import time


class TrueTime:
    """
    Simplified model of Google's TrueTime API.
    Real TrueTime uses GPS + atomic clocks in datacenters.
    """

    def now(self) -> tuple:
        """
        Returns [earliest, latest] - the interval within which
        the true current time definitely falls.

        Real TrueTime typically has ~7ms uncertainty (ε).
        GPS/atomic clocks bound this uncertainty.
        """
        current_time = self._get_clock_reading()
        epsilon = self._get_clock_uncertainty()  # Usually ~7ms
        return (current_time - epsilon, current_time + epsilon)

    def after(self, timestamp: float) -> bool:
        """Returns True if timestamp is definitely in the past."""
        earliest, _ = self.now()
        return earliest > timestamp

    def before(self, timestamp: float) -> bool:
        """Returns True if timestamp is definitely in the future."""
        _, latest = self.now()
        return latest < timestamp


class SpannerTransaction:
    """
    Simplified model of Spanner commit protocol.
    Demonstrates how TrueTime enables external consistency.
    """

    def __init__(self, truetime: TrueTime):
        self.truetime = truetime
        self.read_timestamp = None
        self.commit_timestamp = None

    def commit(self, writes: list) -> bool:
        """
        Commit transaction with external consistency guarantee.

        KEY INSIGHT: After commit returns, any transaction that
        starts will see this transaction's writes.
        """
        # Step 1: Acquire locks on all written keys (via Paxos groups)
        if not self._acquire_locks(writes):
            return False

        # Step 2: Choose commit timestamp
        # Must be >= any timestamp seen during transaction
        # Must be >= TrueTime.now().latest
        _, latest = self.truetime.now()
        self.commit_timestamp = max(
            self._max_timestamp_seen(),
            latest
        )

        # Step 3: THE CRITICAL WAIT
        # We must wait until we're sure the commit timestamp
        # is in the past at ALL nodes worldwide.
        #
        # Wait until: TrueTime.now().earliest > commit_timestamp
        #
        # This typically takes ~2ε (~14ms with 7ms uncertainty)
        while not self.truetime.after(self.commit_timestamp):
            # Wait for clock uncertainty to pass
            time.sleep(0.001)  # 1ms sleep

        # Step 4: Now safe to commit
        # Any future transaction will have a start timestamp
        # greater than our commit timestamp (external consistency!)
        self._apply_writes(writes)
        self._release_locks(writes)
        return True

    def _max_timestamp_seen(self) -> float:
        """Return maximum timestamp observed during reads."""
        # Implementation tracks timestamps of all read values
        pass


# WHY THIS WORKS:
#
# Transaction A commits with timestamp T_a
# After commit returns, TrueTime.now().earliest > T_a (guaranteed by wait)
#
# Transaction B starts after Transaction A returns
# Transaction B's start timestamp T_b comes from TrueTime.now().latest
#
# Since B started after A committed:
#   T_b >= B's TrueTime.now().latest
#   A's TrueTime.now().earliest > T_a (commit waited for this)
#
# Real time ordering ensures:
#   B's TrueTime.now().latest >= A's TrueTime.now().earliest
#   Therefore: T_b > T_a
#
# Since B's timestamp > A's timestamp, and Spanner orders by timestamp,
# B will always see A's writes. External consistency achieved!
```

TrueTime represents a fundamental insight: if you can bound clock uncertainty precisely, you can trade that small wait time for global coordination-free ordering. The ~7-10ms commit wait is much faster than traditional cross-datacenter consensus round-trips (~100-300ms). This is why Spanner can provide external consistency at global scale with reasonable latency.
Providing linearizable writes is only half the battle. Ensuring reads return the latest data requires careful implementation. Several patterns exist:
Pattern 1: Read from Leader
The simplest approach—all reads go to the leader node, which has authoritative state.

Pattern 2: Quorum Reads
Read from a majority of replicas, use the response with the highest version.

Pattern 3: ReadIndex Protocol (Raft)
Follower asks the leader for the current commit index, waits until the local log reaches that index.

Pattern 4: Lease-Based Reads
Leader holds a time-bounded lease; it can serve reads without a quorum while the lease is valid.
```python
# Implementing Linearizable Reads
# Multiple patterns with different trade-offs

import time


class LinearizableReader:
    """
    Demonstrates different approaches to ensuring read linearizability.
    """

    def __init__(self, cluster: list, leader_lease_duration: float = 10.0):
        self.cluster = cluster
        self.lease_duration = leader_lease_duration
        self.leader_lease_start = None
        self.is_leader = False

    # ═══════════════════════════════════════════════════════════════
    # PATTERN 1: Read from leader with leadership confirmation
    # ═══════════════════════════════════════════════════════════════

    def quorum_read(self, key: str) -> any:
        """
        Confirm leadership via quorum before reading.
        Most straightforward linearizable read.

        Latency: ~2 RTT (1 for quorum check, 1 for actual read)
        """
        if not self.is_leader:
            return self._forward_to_leader(key)

        # Confirm we're still leader
        # (Could be stale after network partition)
        if not self._confirm_leadership_via_heartbeat():
            # We're not leader anymore - redirect
            return self._forward_to_leader(key)

        # Safe to read from local state
        return self._local_read(key)

    def _confirm_leadership_via_heartbeat(self) -> bool:
        """Send heartbeats to majority to confirm leadership."""
        acks = 1  # Count self
        quorum = len(self.cluster) // 2 + 1
        for node in self.cluster:
            if self._send_heartbeat(node):
                acks += 1
        return acks >= quorum

    # ═══════════════════════════════════════════════════════════════
    # PATTERN 2: ReadIndex Protocol (Raft-style)
    # ═══════════════════════════════════════════════════════════════

    def read_index_read(self, key: str) -> any:
        """
        Follower-based linearizable read using ReadIndex protocol.
        Allows distributing read load across followers.

        Steps:
        1. Ask leader for current committed index
        2. Wait until local log reaches that index
        3. Read from local state

        Latency: ~1 RTT to leader + local wait
        """
        # Step 1: Get committed index from leader
        committed_index = self._get_committed_index_from_leader()
        if committed_index is None:
            # Leader unreachable - fail or retry
            raise Exception("Cannot reach leader for ReadIndex")

        # Step 2: Wait for local log to catch up
        self._wait_for_local_commit(committed_index)

        # Step 3: Now safe to read locally
        return self._local_read(key)

    def _get_committed_index_from_leader(self) -> int:
        """Query leader for current commit index."""
        # Leader must first confirm its own leadership
        # via quorum (can piggyback on heartbeat)
        pass

    def _wait_for_local_commit(self, target_index: int):
        """Wait until local log is committed up to target."""
        while self._get_local_commit_index() < target_index:
            time.sleep(0.001)  # 1ms polling

    # ═══════════════════════════════════════════════════════════════
    # PATTERN 3: Lease-based reads
    # ═══════════════════════════════════════════════════════════════

    def lease_read(self, key: str) -> any:
        """
        Fast reads when leader holds valid lease.
        Requires synchronized clocks (bounded drift).

        Latency: ~0 RTT when lease valid, ~1-2 RTT to renew
        """
        if not self.is_leader:
            return self._forward_to_leader(key)

        # Check if lease is still valid
        if self._lease_is_valid():
            # Fast path: no network required
            return self._local_read(key)

        # Lease expired - need to renew
        if self._renew_lease():
            return self._local_read(key)

        # Failed to renew - no longer leader
        return self._forward_to_leader(key)

    def _lease_is_valid(self) -> bool:
        """
        Check if current lease is valid.

        CRITICAL: Lease must expire BEFORE followers might
        elect a new leader. Must account for clock drift!
        """
        if self.leader_lease_start is None:
            return False

        current_time = time.time()
        elapsed = current_time - self.leader_lease_start

        # Use conservative estimate (account for clock drift)
        max_clock_drift = 0.1  # 100ms assumed max drift
        safe_lease_duration = self.lease_duration - max_clock_drift

        return elapsed < safe_lease_duration

    # ═══════════════════════════════════════════════════════════════
    # PATTERN 4: Strict quorum reads (Dynamo-style)
    # ═══════════════════════════════════════════════════════════════

    def quorum_read_dynamo_style(self, key: str) -> any:
        """
        Read from majority of replicas, return highest version.
        For linearizability with N=3: R=2, W=2 (R + W > N)

        Latency: parallelized, but wait for majority
        """
        n_replicas = 3
        read_quorum = 2

        responses = []
        for replica in self._get_replicas(key):
            response = self._read_from_replica(replica, key)
            if response:
                responses.append(response)

        if len(responses) < read_quorum:
            raise Exception("Failed to reach read quorum")

        # Return value with highest version
        return max(responses, key=lambda r: r['version'])['value']
```

A common production bug: configuring reads to go to followers without proper freshness checks. The read returns successfully, but with stale data. This creates subtle, hard-to-reproduce bugs where behavior depends on replication lag. Always verify your read path actually provides linearizability if that's what your application requires.
Claiming strong consistency and actually providing it are different things. Many distributed databases have been discovered to violate their consistency guarantees under specific failure scenarios. Testing is essential.
Jepsen Testing:
Kyle Kingsbury's Jepsen project has become the gold standard for distributed systems testing. Jepsen drives concurrent operations against a real cluster, injects faults such as network partitions, clock skew, and process crashes, records the complete operation history, and then checks that history against the system's claimed consistency model.
Systems that have failed Jepsen tests include: MongoDB, Cassandra, RethinkDB, CockroachDB (earlier versions), and many others. These failures often exposed real bugs and led to fixes.
Testing Strategies:
```python
# Simplified Consistency Testing Framework

import threading
import time
from dataclasses import dataclass
from typing import List, Optional
import random


@dataclass
class Operation:
    """Record of a single operation."""
    client_id: str
    op_type: str                 # 'write' or 'read'
    key: str
    input_value: Optional[any]   # Value written (for writes)
    output_value: Optional[any]  # Value read (for reads)
    start_time: float
    end_time: float
    success: bool


class ConsistencyTester:
    """
    Tests whether a distributed system provides linearizability.
    """

    def __init__(self, client_factory, num_clients: int = 10):
        self.client_factory = client_factory
        self.num_clients = num_clients
        self.operations: List[Operation] = []
        self.lock = threading.Lock()

    def run_test(self, duration_seconds: float = 30.0,
                 failure_probability: float = 0.1) -> bool:
        """
        Run concurrent operations and check linearizability.

        Args:
            duration_seconds: How long to run the test
            failure_probability: Probability of inducing failure per second

        Returns:
            True if history is linearizable, False otherwise
        """
        threads = []
        stop_event = threading.Event()

        # Start client threads
        for i in range(self.num_clients):
            client = self.client_factory()
            t = threading.Thread(
                target=self._client_worker,
                args=(f"client_{i}", client, stop_event)
            )
            threads.append(t)
            t.start()

        # Start failure injector
        failure_thread = threading.Thread(
            target=self._failure_injector,
            args=(stop_event, failure_probability)
        )
        failure_thread.start()

        # Run for duration
        time.sleep(duration_seconds)
        stop_event.set()

        # Wait for threads
        for t in threads:
            t.join()
        failure_thread.join()

        # Check linearizability
        return self._check_linearizability()

    def _client_worker(self, client_id: str, client, stop_event):
        """Worker thread performing random operations."""
        while not stop_event.is_set():
            key = f"key_{random.randint(0, 9)}"

            if random.random() < 0.5:
                # Write operation
                value = random.randint(1, 1000)
                start = time.time()
                try:
                    client.write(key, value)
                    success = True
                except:
                    success = False
                end = time.time()
                self._record_operation(
                    client_id, 'write', key, value, None, start, end, success
                )
            else:
                # Read operation
                start = time.time()
                try:
                    result = client.read(key)
                    success = True
                except:
                    result = None
                    success = False
                end = time.time()
                self._record_operation(
                    client_id, 'read', key, None, result, start, end, success
                )

            # Small delay between operations
            time.sleep(random.uniform(0.001, 0.01))

    def _record_operation(self, client_id, op_type, key, input_val,
                          output_val, start, end, success):
        """Thread-safe operation recording."""
        op = Operation(
            client_id=client_id,
            op_type=op_type,
            key=key,
            input_value=input_val,
            output_value=output_val,
            start_time=start,
            end_time=end,
            success=success
        )
        with self.lock:
            self.operations.append(op)

    def _failure_injector(self, stop_event, probability):
        """Periodically inject failures into the cluster."""
        while not stop_event.is_set():
            if random.random() < probability:
                failure_type = random.choice([
                    'network_partition',
                    'node_crash',
                    'node_restart',
                    'slow_network'
                ])
                self._inject_failure(failure_type)
            time.sleep(1.0)

    def _inject_failure(self, failure_type: str):
        """Inject a specific failure into the cluster."""
        print(f"[FAILURE] Injecting: {failure_type}")
        # Implementation would use iptables, kill signals, etc.
        pass

    def _check_linearizability(self) -> bool:
        """
        Check if recorded history is linearizable.
        This is NP-complete in general, but tractable for small histories.
        Real tools like Knossos use sophisticated algorithms.
        """
        successful_ops = [op for op in self.operations if op.success]
        print(f"Checking linearizability of {len(successful_ops)} operations")

        # Group by key (each key's operations must be independently linearizable)
        by_key = {}
        for op in successful_ops:
            by_key.setdefault(op.key, []).append(op)

        for key, ops in by_key.items():
            if not self._check_key_linearizable(ops):
                print(f"[VIOLATION] Key '{key}' is not linearizable!")
                return False

        print("[PASS] History is linearizable")
        return True

    def _check_key_linearizable(self, ops: List[Operation]) -> bool:
        """Check linearizability for single key's operations."""
        # Simplified check - real algorithms are much more complex
        # This handles basic sequential consistency checking
        ops_sorted = sorted(ops, key=lambda x: x.start_time)

        # Track what values could be current at each point
        current_value = None  # or some initial value

        for op in ops_sorted:
            if op.op_type == 'write':
                current_value = op.input_value
            elif op.op_type == 'read':
                # In a simple check, read should return current_value
                # Real linearizability allows for concurrent operations
                # This is just for illustration
                if op.output_value != current_value:
                    # Could be violation or concurrent operation
                    # Full algorithm would explore all orderings
                    pass

        return True  # Placeholder
```

If your system claims strong consistency, run Jepsen tests against it. Many companies have been embarrassed by Jepsen reports revealing consistency violations. Better to find issues in testing than have customers discover them in production. Jepsen is open source and well-documented.
We've covered the full landscape of strong consistency—from theoretical foundations to practical implementation. The key takeaways: strong consistency means linearizability, a single global order of operations that respects real time; it is achieved through coordination mechanisms such as single leaders, consensus protocols, strict quorums, and bounded-uncertainty clocks like TrueTime; that coordination costs latency, throughput, and availability during partitions; and because of those costs, production systems apply it selectively and verify it with tools like Jepsen.
What's Next:
Strong consistency represents one end of the consistency spectrum. The next page explores the opposite end: Eventual Consistency. We'll examine how systems can remain available and performant by relaxing consistency guarantees, the mathematical foundations in the form of convergence, and how applications can work effectively with eventually consistent data.
You now understand strong consistency at a deep, implementable level—from the theoretical foundation of linearizability through practical mechanisms like consensus protocols and TrueTime, to the real-world costs and testing approaches. This knowledge forms the baseline against which all other consistency models are compared.