The database market has exploded. Where once there were only a handful of relational databases, today there are hundreds of distributed data stores—each optimized for different workloads, each making different CAP trade-offs.
For a practitioner, this abundance creates both opportunity and confusion. Which database is right for your use case? How do you evaluate the claims made by vendors? When they say 'highly available' or 'strongly consistent,' what do they actually mean?
In this final page of the CAP theorem module, we'll develop a systematic classification of distributed systems based on their CAP behavior. We'll examine major production databases, understand the design decisions that led to their CAP choices, and learn to cut through marketing language to understand what a system actually provides.
By the end of this page, you will understand how to classify distributed systems by CAP behavior, the design rationales behind major production databases, how to evaluate database claims and cut through marketing hype, and when to choose specific database types for specific use cases.
CP systems prioritize consistency over availability during network partitions. When a partition occurs, nodes that cannot reach a quorum will refuse to serve requests rather than risk returning inconsistent data.
Design Philosophy: it is better to refuse a request than to answer it incorrectly.
Key Characteristics of CP Systems:
| System | Consistency Model | Partition Behavior | Primary Use Case | Notable Features |
|---|---|---|---|---|
| Google Spanner | External consistency | Minority partitions unavailable | Global OLTP | TrueTime, GPS/atomic clocks |
| CockroachDB | Serializable | Quorum required for writes | Distributed SQL | PostgreSQL-compatible, geo-partitioning |
| ZooKeeper | Linearizable | Must reach leader quorum | Coordination | ZAB consensus, ephemeral nodes |
| etcd | Linearizable | Raft quorum required | Config/service discovery | Raft consensus, Kubernetes backbone |
| HBase | Strong (row-level) | Region server failures handled | Wide-column analytics | HDFS-based, Bigtable model |
| MongoDB (w/majority) | Linearizable (optional) | Primary election needed | General purpose | Tunable, default is eventual |
| Consul | Linearizable (Raft) | Quorum required | Service mesh | Health checking, KV store |
| FoundationDB | Serializable ACID | Coordinators must be reachable | Apple's infrastructure | Layered architecture |
Deep Dive: Google Spanner
Spanner is perhaps the most ambitious CP system ever built. It provides external consistency (even stronger than linearizability) across a globally distributed database.
How Spanner Achieves Global Consistency:
TrueTime API: Spanner uses a globally synchronized clock based on GPS receivers and atomic clocks in every datacenter. This clock doesn't give you 'the exact time'—it gives you an uncertainty interval: "the current time is between [earliest, latest]."
Wait-Out Uncertainty: When committing a transaction, Spanner assigns a timestamp and then waits for the uncertainty interval to pass. This ensures that if transaction A commits with timestamp T, any transaction that starts after A completes will have a timestamp > T.
Result: Transactions are totally ordered in a way that respects real-time causality across the entire planet. If you commit in New York and then read in Tokyo, you'll see your write.
Cost: Every commit incurs latency equal to the clock uncertainty (typically 1-7ms). This is the price for global consistency.
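Commit-wait is the key mechanism here, and it can be sketched with a toy uncertainty-interval clock. Everything below is illustrative: `ToyTrueTime`, its fixed 5 ms epsilon, and `commit_wait` are invented names for this sketch, not Spanner's actual API.

```python
import time

class ToyTrueTime:
    """Toy clock returning an uncertainty interval, TrueTime-style.

    Real TrueTime derives its uncertainty bound from GPS/atomic clock
    infrastructure; here we simply assume a fixed 5 ms bound.
    """
    EPSILON = 0.005  # assumed clock uncertainty, in seconds

    def now(self):
        t = time.monotonic()
        return (t - self.EPSILON, t + self.EPSILON)  # [earliest, latest]

def commit_wait(tt: ToyTrueTime) -> float:
    """Pick a commit timestamp, then wait out the uncertainty.

    After this returns, every clock's 'earliest' exceeds the chosen
    timestamp, so any transaction that starts later is guaranteed
    to receive a strictly larger timestamp.
    """
    _, latest = tt.now()
    commit_ts = latest                  # choose the latest possible "now"
    while tt.now()[0] <= commit_ts:     # wait until earliest > commit_ts
        time.sleep(0.001)
    return commit_ts

tt = ToyTrueTime()
t1 = commit_wait(tt)
t2 = commit_wait(tt)   # starts after t1's commit completed
assert t2 > t1         # real-time order is reflected in timestamp order
```

The wait costs roughly twice the uncertainty bound per commit, which is exactly the "price for global consistency" described above.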
```python
class CPSystemDesignPatterns:
    """
    Common patterns used by CP systems to achieve strong consistency
    while maintaining reasonable availability when possible.
    """

    def consensus_based_replication(self):
        """
        Pattern: Use a consensus protocol (Raft, Paxos) for all writes.

        Systems: etcd, ZooKeeper, CockroachDB

        How it works:
        1. Client sends write to any node
        2. Node forwards to leader (or is leader)
        3. Leader proposes write to followers
        4. Majority must acknowledge (quorum)
        5. Leader commits and responds to client

        Partition behavior:
        - Partition with majority continues operating
        - Partition with minority refuses writes
        - Reads can also require quorum (linearizable) or not (stale possible)
        """
        pass

    def primary_with_synchronous_replication(self):
        """
        Pattern: Single primary handles all writes, synchronously replicates.

        Systems: PostgreSQL (sync replication), MySQL (semi-sync)

        How it works:
        1. All writes go to primary
        2. Primary writes locally AND waits for replica ack
        3. Only then acknowledges to client
        4. If replica unreachable, write blocks or fails

        Partition behavior:
        - If primary loses connection to replicas, writes block
        - Can be configured to timeout or continue (trading consistency)
        - Failover requires careful handling to avoid split-brain
        """
        pass

    def two_phase_commit(self):
        """
        Pattern: Coordinate distributed transactions across nodes.

        Systems: Spanner, CockroachDB (internal), traditional XA

        How it works:
        1. Coordinator sends "prepare" to all participants
        2. Participants vote yes/no
        3. If all yes, coordinator sends "commit"
        4. If any no, coordinator sends "abort"

        Partition behavior:
        - If coordinator or any participant unreachable during prepare: abort
        - If coordinator fails after prepare-yes: participants block
        - This is why 2PC has availability concerns

        Spanner mitigation:
        - Uses Paxos *groups* as participants, not single nodes
        - A Paxos group can survive minority failures
        - Coordinator is also replicated for durability
        """
        pass

    def hybrid_clock_ordering(self):
        """
        Pattern: Use hybrid logical clocks for transaction ordering.

        Systems: CockroachDB, YugabyteDB

        How it works:
        - Each node maintains a Hybrid Logical Clock (HLC)
        - HLC combines physical time with a logical counter
        - Provides ordering even when physical clocks drift
        - Transactions stamped with HLC values

        Partition behavior:
        - Partitioned nodes' clocks may drift
        - When partition heals, clocks resynchronize
        - Transactions from partition period are correctly ordered retroactively

        Trade-off:
        - Works without GPS/atomic clocks (unlike Spanner)
        - But may require occasional clock-wait for consistency
        """
        pass


# Example: Evaluating a CP system for an application
#
# Scenario: Building a banking application
#
# Requirements:
# - Account balances must be accurate
# - Transfers must be atomic (no double-spending)
# - Customers would rather see "temporarily unavailable" than wrong balance
#
# Evaluation of CP systems:
#
# | System        | Pros                           | Cons                      |
# |---------------|--------------------------------|---------------------------|
# | Spanner       | Gold standard for consistency  | Requires GCP, expensive   |
# | CockroachDB   | Open source, PostgreSQL compat | Less mature than Spanner  |
# | PostgreSQL    | Proven, extensive ecosystem    | Scaling requires sharding |
# | FoundationDB  | Apple-proven at scale          | Requires custom layers    |
#
# Decision: CockroachDB for new systems (modern, scalable, consistent)
#           PostgreSQL for simpler needs (proven, well-understood)
#           Spanner if already on GCP and need global scale
```

Choose CP systems when: (1) Data correctness is more important than data availability, (2) You're handling financial transactions, inventory, or safety-critical data, (3) Users expect strong guarantees and can tolerate occasional errors, (4) Conflict resolution would be impractical or dangerous, (5) Regulatory requirements mandate strong consistency.
AP systems prioritize availability over consistency during network partitions. They continue serving requests from all nodes, even if this means returning stale data or accepting writes that may conflict.
Design Philosophy: it is better to give a possibly stale answer than no answer at all.
Key Characteristics of AP Systems:
| System | Consistency Model | Partition Behavior | Primary Use Case | Conflict Resolution |
|---|---|---|---|---|
| Cassandra | Eventual (tunable) | All nodes continue serving | High-write time-series | Last-Write-Wins (LWW) |
| DynamoDB | Eventual (tunable) | All partitions available | Web apps, gaming | LWW or conditional writes |
| Riak | Eventual | All nodes available | IoT, session stores | Vector clocks, siblings |
| CouchDB | Eventual | Multi-master sync | Offline-first apps | Revision trees, conflicts |
| Voldemort | Eventual | All nodes available | LinkedIn (historical) | Vector clocks |
| Amazon S3 | Eventual (mostly) | Always available | Object storage | LWW |
| DNS | Eventual | Local caches serve | Name resolution | TTL-based expiration |
| Memcached | N/A (cache) | Cache hit or miss | Caching | No persistence |
Deep Dive: Apache Cassandra
Cassandra is the archetypal AP system for high-write workloads. It's designed to handle massive write throughput across multiple datacenters with no single point of failure.
Cassandra's Design Choices:
Masterless Architecture: Every node is equal. Any node can accept reads and writes. There's no leader election, no failover procedure—just a ring of nodes.
Consistent Hashing: Data is partitioned across nodes using consistent hashing. Each key maps to a set of replica nodes based on a replication factor.
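The idea can be sketched as a minimal hash ring. This is an illustration only: `HashRing` is a made-up class, and production partitioners (including Cassandra's) add virtual nodes, configurable token ranges, and rack awareness.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring mapping keys to replica nodes."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Place each node on the ring at a position derived from its name
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key: str):
        """Walk clockwise from the key's position, collecting RF distinct nodes."""
        tokens = [t for t, _ in self.ring]
        i = bisect.bisect(tokens, self._hash(key)) % len(self.ring)
        result = []
        while len(result) < min(self.rf, len(self.ring)):
            node = self.ring[i][1]
            if node not in result:
                result.append(node)
            i = (i + 1) % len(self.ring)
        return result

ring = HashRing(['node-a', 'node-b', 'node-c', 'node-d', 'node-e'])
print(ring.replicas('user:42'))   # three replica nodes for this key
```

Because each key's replicas are determined by position on the ring, adding or removing a node only remaps the keys adjacent to it, not the whole keyspace.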
Tunable Consistency: For each operation, the client specifies a consistency level (ONE, QUORUM, ALL, etc.). This lets you balance speed vs. consistency per-operation.
Last-Write-Wins (LWW): Concurrent writes to the same key are resolved by timestamp. The write with the highest timestamp wins. This can lose writes but is simple and predictable.
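A minimal sketch of LWW merging, assuming each replica stores `{key: (timestamp, value)}` pairs (a simplification: Cassandra actually resolves per column, with microsecond timestamps supplied by clients or coordinators):

```python
def lww_merge(replica_a: dict, replica_b: dict) -> dict:
    """Merge two replicas' {key: (timestamp, value)} maps: highest timestamp wins."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two replicas accepted conflicting writes during a partition:
a = {'avatar': (1700000001, 'cat.png')}
b = {'avatar': (1700000005, 'dog.png')}

merged = lww_merge(a, b)
print(merged['avatar'][1])  # 'dog.png' - the later write wins; 'cat.png' is silently lost
```

The merge is order-independent, which is what makes it predictable, but the losing write vanishes without a trace: exactly the trade-off described above.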
```python
class APSystemDesignPatterns:
    """
    Common patterns used by AP systems to maintain availability
    while providing eventual consistency.
    """

    def gossip_based_replication(self):
        """
        Pattern: Nodes gossip state changes to random peers.

        Systems: Cassandra, Riak, DynamoDB (internal)

        How it works:
        1. Node A receives write
        2. A acknowledges to client immediately (or after local ack)
        3. A gossips update to random nodes periodically
        4. Those nodes gossip to other random nodes
        5. Eventually, all nodes have the update

        Partition behavior:
        - Gossip within a partition continues normally
        - Updates queue up for unreachable nodes
        - When partition heals, gossip catches up stragglers
        - Conflicts (concurrent writes) resolved by LWW or vector clocks

        Trade-off:
        + Very high availability
        + No coordination bottleneck
        - Eventual consistency only
        - Conflict resolution needed
        """
        pass

    def hinted_handoff(self):
        """
        Pattern: Store hints for unreachable nodes to deliver later.

        Systems: Cassandra, Riak, DynamoDB

        How it works:
        1. Write targets nodes A, B, C (replication factor = 3)
        2. Node C is unreachable
        3. Node A stores a "hint" containing the write for C
        4. When C comes back, A delivers the hint
        5. C applies the update

        Partition behavior:
        - Writes don't fail just because one replica is down
        - Hints accumulate during partitions
        - After partition heals, hints are replayed

        Trade-off:
        + Improves availability (don't fail if minority of replicas down)
        + Data eventually reaches all replicas
        - Hints can accumulate, causing a burst when partition heals
        - If hinting node dies, hints are lost
        """
        pass

    def anti_entropy_repair(self):
        """
        Pattern: Background process compares and repairs divergent replicas.

        Systems: Cassandra, Riak, DynamoDB

        How it works:
        1. Periodically, nodes compare data using Merkle trees
        2. Merkle tree: hierarchical hash of data, efficient comparison
        3. If trees differ, exchange and reconcile differing portions
        4. Conflicts resolved by configured strategy (LWW, etc.)

        Partition behavior:
        - During partition, replicas can diverge
        - Anti-entropy can't run across partition
        - After healing, anti-entropy detects and repairs divergence

        Trade-off:
        + Repairs inconsistencies over time
        + Catches missed updates (failed hints, node recovery)
        - Resource intensive (CPU, network, disk)
        - May not catch all issues quickly
        """
        pass

    def read_repair(self):
        """
        Pattern: Repair stale replicas during read operations.

        Systems: Cassandra, Riak

        How it works:
        1. Read contacts multiple replicas (for QUORUM, majority)
        2. Compare responses - find most recent value
        3. Return most recent to client
        4. Asynchronously update stale replicas with correct value

        Partition behavior:
        - Works within a partition
        - After healing, reads trigger repairs across former partitions

        Trade-off:
        + Read path naturally repairs inconsistencies
        + Hot keys get repaired more often
        - Cold keys may stay inconsistent longer
        - Adds latency to digest reads
        """
        pass

    def conflict_resolution_strategies(self):
        """
        Various strategies for resolving concurrent writes in AP systems.
        """
        # Last-Write-Wins (LWW)
        # - Simple: highest timestamp wins
        # - Danger: can lose concurrent writes
        # - Use when: losing updates is acceptable

        # Vector Clocks
        # - Track causality, detect true conflicts
        # - Return all conflicting versions ("siblings")
        # - Application must choose winner
        # - Use when: application can merge or user can choose

        # CRDTs (Conflict-free Replicated Data Types)
        # - Data structures designed to merge automatically
        # - E.g., G-Counter (only increments, sum to merge)
        # - E.g., OR-Set (observed-remove set, union to merge)
        # - Use when: data type fits CRDT semantics

        # Application-Level Resolution
        # - Store all versions, let application logic decide
        # - Most flexible but most complex
        # - Use when: domain-specific merge logic required
        pass


# Example: Evaluating an AP system for an application
#
# Scenario: Building a social media "likes" counter
#
# Requirements:
# - Must always accept "like" requests
# - Counts don't need to be perfectly accurate
# - Minor discrepancies are acceptable
# - Global scale, low latency required
#
# Evaluation of AP systems:
#
# | System        | Pros                          | Cons                       |
# |---------------|-------------------------------|----------------------------|
# | Cassandra     | Proven, high write throughput | LWW can lose counts        |
# | Riak          | Built-in CRDTs for counters   | Smaller community          |
# | DynamoDB      | Managed, auto-scaling         | Vendor lock-in             |
# | Redis Cluster | Ultra-low latency             | Complex cluster management |
#
# Decision: Cassandra with counter columns (CRDT-like behavior internally)
#           OR DynamoDB if already on AWS and want ease of management
```

Choose AP systems when: (1) Availability is business-critical (revenue, user experience), (2) Temporary staleness is acceptable (social feeds, caches, analytics), (3) Conflicts can be automatically resolved or are rare, (4) Global distribution requires low-latency writes everywhere, (5) Write throughput is extremely high and coordination overhead is unacceptable.
The strict CP vs. AP dichotomy doesn't capture the full picture. Many modern systems offer tunable consistency, allowing different operations to use different consistency levels. Others provide hybrid approaches that combine CP and AP characteristics.
Tunable Consistency Systems:
These systems (Cassandra, DynamoDB, and MongoDB among them) let you choose a consistency level per operation.
The Power of Tunability:
With tunable consistency, you don't have to commit your entire system to one CAP choice. You can demand strong guarantees for critical operations and relax them where speed matters more, choosing from a spectrum of levels:
| Level | Guarantee | Latency | Throughput | Analogy |
|---|---|---|---|---|
| Strong | Linearizable | Highest | Lowest | Like a single-copy database |
| Bounded Staleness | Lag by at most K versions or T time | High | Low | Controlled staleness |
| Session | Consistent within a session | Medium | Medium | User sees own writes |
| Consistent Prefix | Reads never see out-of-order writes | Low | High | No going backwards |
| Eventual | No ordering guarantees | Lowest | Highest | Total freedom |
Deep Dive: MongoDB's Consistency Controls
MongoDB provides fine-grained consistency control through write concerns and read concerns:
Write Concern: How many replicas must acknowledge a write before it succeeds
- `w: 1` — Primary only (fast, but single point of failure for that write)
- `w: majority` — Majority of replicas (durable across failures)
- `w: 0` — Fire and forget (fastest, no durability guarantee)
- `j: true` — Wait for journal sync (disk durability)

Read Concern: What data can be returned

- `local` — Return data from this node (may not be replicated)
- `majority` — Return data acknowledged by majority (durable)
- `linearizable` — Return the most recent data, with linearizable guarantee
- `snapshot` — Return data from a snapshot for multi-document transactions
```python
# MongoDB: Using write and read concerns for different operations

from datetime import datetime

from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient('mongodb://cluster/')
db = client['myapp']

# ==========================================
# STRONG CONSISTENCY: Financial operations
# ==========================================

def transfer_money(from_account: str, to_account: str, amount: float):
    """
    Money transfer: Use majority write concern and linearizable read.
    This ensures the write is durable on a majority of nodes, and any
    subsequent read reflects this write.
    """
    accounts = db.get_collection(
        'accounts',
        write_concern=WriteConcern(w='majority', j=True),
        read_concern=ReadConcern('linearizable')
    )

    with client.start_session() as session:
        with session.start_transaction():
            # Debit from source
            accounts.update_one(
                {'_id': from_account},
                {'$inc': {'balance': -amount}},
                session=session
            )
            # Credit to destination
            accounts.update_one(
                {'_id': to_account},
                {'$inc': {'balance': amount}},
                session=session
            )
            # Transaction commits with majority write concern

# ==========================================
# SESSION CONSISTENCY: User profile updates
# ==========================================

def update_profile(user_id: str, updates: dict):
    """
    Profile update: User should see their own changes immediately.
    Use causal consistency - a session always reads its own writes.
    """
    profiles = db.get_collection(
        'profiles',
        write_concern=WriteConcern(w='majority'),   # Durable
        read_concern=ReadConcern('majority')        # Read committed data
    )

    with client.start_session(causal_consistency=True) as session:
        # Update profile
        profiles.update_one(
            {'_id': user_id},
            {'$set': updates},
            session=session
        )
        # Read back will see the update (causal consistency)
        return profiles.find_one({'_id': user_id}, session=session)

# ==========================================
# EVENTUAL CONSISTENCY: Analytics events
# ==========================================

def log_event(event_type: str, data: dict):
    """
    Analytics: Fire-and-forget, eventually consistent.
    We don't need to wait for replication or even disk sync.
    Maximum throughput, minimal latency.
    """
    events = db.get_collection(
        'events',
        write_concern=WriteConcern(w=0),     # Don't wait at all
        read_concern=ReadConcern('local')    # Read local data
    )

    events.insert_one({
        'type': event_type,
        'data': data,
        'timestamp': datetime.utcnow()
    })
    # Returns immediately, write happens asynchronously

# ==========================================
# HYBRID: Shopping cart
# ==========================================

def add_to_cart(user_id: str, product_id: str, quantity: int):
    """
    Shopping cart: Fast writes, eventual reads are fine.
    Use low write concern for speed; cart contents aren't critical -
    the user can refresh if needed.
    """
    carts = db.get_collection(
        'carts',
        write_concern=WriteConcern(w=1),     # Just primary
        read_concern=ReadConcern('local')
    )

    carts.update_one(
        {'_id': user_id},
        {'$push': {'items': {'product_id': product_id, 'quantity': quantity}}},
        upsert=True
    )

def checkout(user_id: str):
    """
    Checkout: Now we need strong consistency!
    When converting cart to order, switch to majority concern.
    """
    carts = db.get_collection('carts')
    orders = db.get_collection(
        'orders',
        write_concern=WriteConcern(w='majority', j=True),
        read_concern=ReadConcern('majority')
    )

    with client.start_session() as session:
        with session.start_transaction():
            cart = carts.find_one({'_id': user_id}, session=session)

            # Create durable order
            orders.insert_one({
                'user_id': user_id,
                'items': cart['items'],
                'status': 'placed',
                'created_at': datetime.utcnow()
            }, session=session)

            # Clear cart
            carts.delete_one({'_id': user_id}, session=session)
            # Transaction committed with majority - order is durable
```

Tunable consistency is the pragmatic answer to CAP's rigid constraints. Instead of choosing one consistency level for your entire system, you choose operation by operation based on actual requirements. This lets you optimize each use case—fast analytics logging, consistent financial transactions, available shopping carts—within a single database system.
Database vendors often make bold claims about consistency and availability. Cutting through marketing language requires understanding what these claims actually mean.
Red Flag Phrases:
- "Highly available AND strongly consistent" (during a partition, you cannot have both for the same data)
- "100% availability" (no real system achieves this; networks and hardware fail)
- "Immediately consistent across all nodes" (replication takes nonzero time)
- "No single point of failure" (says nothing about behavior during partitions)
Jepsen analyses have found real consistency issues in many popular systems:
| System | Issues Found | Current Status | Key Lessons |
|---|---|---|---|
| MongoDB | Stale reads, rollbacks | Many fixed | Default settings weren't safe |
| Cassandra | LWT issues, timestamp collisions | Mostly fixed | Lightweight transactions tricky |
| CockroachDB | Serialization anomalies | Fixed | Even careful designs have bugs |
| Elasticsearch | Split-brain scenarios | Improved | Search ≠ database |
| Redis Cluster | Data loss during partition | Known limitation | Not designed for durability |
| etcd | Minimal issues | Good | Simple system, well-tested |
| PostgreSQL | N/A (single node) | N/A | Jepsen tests distributed systems |
The Jepsen Standard:
Kyle Kingsbury's Jepsen project has become the de facto standard for verifying distributed database claims. Jepsen subjects a database to network partitions, process crashes, and clock skew while generating concurrent operations, then checks whether the observed history is consistent with the guarantees the vendor claims.
Databases that have been Jepsen-tested and passed (or fixed issues) are generally more trustworthy. Databases that claim strong consistency but haven't been Jepsen-tested should be treated with skepticism.
Due Diligence Approach:
Every distributed database has had consistency bugs. Even the most mature, well-tested systems. Don't assume that because a database is popular, its consistency claims are accurate. Read the documentation carefully, check for Jepsen results, and test your critical paths yourself.
While CAP remains the foundational framework, modern distributed systems have developed sophisticated approaches that push the boundaries of what's possible:
Spanner and TrueTime:
Google's Spanner sidesteps some CAP limitations by investing in physical infrastructure: GPS receivers and atomic clocks in every datacenter bound clock uncertainty to a few milliseconds, enabling externally consistent commits worldwide.
Trade-off: Requires significant infrastructure investment. Can't be replicated without similar hardware.
CRDTs (Conflict-free Replicated Data Types):
CRDTs are data structures designed to merge automatically without conflicts: counters, sets, maps, flags, and registers all have CRDT formulations.
Trade-off: CRDTs only work for certain data types. Arbitrary JSON or complex objects don't have natural CRDT implementations.
```python
class CRDTExamples:
    """
    Conflict-free Replicated Data Types enable AP systems to merge
    concurrent updates automatically without conflicts.
    """

    class GCounter:
        """
        Grow-only Counter: Can only increment.
        Each node maintains its own count; total is sum of all.

        Use case: Page view counts, like counts
        """
        def __init__(self, node_id: str):
            self.node_id = node_id
            self.counts = {}  # node_id -> count

        def increment(self):
            self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

        def value(self) -> int:
            return sum(self.counts.values())

        def merge(self, other: 'GCounter'):
            """
            Merge with another GCounter by taking max per node.
            This is commutative, associative, and idempotent.
            """
            for node_id, count in other.counts.items():
                self.counts[node_id] = max(
                    self.counts.get(node_id, 0),
                    count
                )

        # Example:
        # Node A: {A: 5}  total = 5
        # Node B: {B: 3}  total = 3
        # After merge: {A: 5, B: 3}  total = 8
        # Merge order doesn't matter - same result!

    class PNCounter:
        """
        Positive-Negative Counter: Can increment and decrement.
        Maintains two G-Counters: one for increments, one for decrements.

        Use case: Stock levels, upvotes/downvotes
        """
        def __init__(self, node_id: str):
            self.node_id = node_id
            self.positive = {}
            self.negative = {}

        def increment(self):
            self.positive[self.node_id] = self.positive.get(self.node_id, 0) + 1

        def decrement(self):
            self.negative[self.node_id] = self.negative.get(self.node_id, 0) + 1

        def value(self) -> int:
            return sum(self.positive.values()) - sum(self.negative.values())

        def merge(self, other: 'PNCounter'):
            for node_id, count in other.positive.items():
                self.positive[node_id] = max(self.positive.get(node_id, 0), count)
            for node_id, count in other.negative.items():
                self.negative[node_id] = max(self.negative.get(node_id, 0), count)

    class ORSet:
        """
        Observed-Remove Set: Supports both add and remove.
        Each add gets a unique tag; remove removes observed tags.

        Use case: Shopping cart items, group membership
        """
        def __init__(self, node_id: str):
            self.node_id = node_id
            self.elements = {}  # element -> set of (tag, present)
            self.counter = 0

        def add(self, element):
            self.counter += 1
            tag = f"{self.node_id}:{self.counter}"
            if element not in self.elements:
                self.elements[element] = set()
            self.elements[element].add((tag, True))

        def remove(self, element):
            # Mark all observed tags as removed
            if element in self.elements:
                self.elements[element] = {
                    (tag, False) for tag, _ in self.elements[element]
                }

        def lookup(self, element) -> bool:
            if element not in self.elements:
                return False
            # Present if any tag is still "present"
            return any(present for _, present in self.elements[element])

        def merge(self, other: 'ORSet'):
            for element, tags in other.elements.items():
                if element not in self.elements:
                    self.elements[element] = set()
                # Handle concurrent add/remove per tag
                for tag, present in tags:
                    existing = [(t, p) for t, p in self.elements[element] if t == tag]
                    if not existing:
                        self.elements[element].add((tag, present))
                    else:
                        # If either replica has marked the tag removed, it's removed
                        old_present = existing[0][1]
                        new_present = present and old_present
                        self.elements[element].discard((tag, True))
                        self.elements[element].discard((tag, False))
                        self.elements[element].add((tag, new_present))


# Using CRDTs in Riak:
# Riak natively supports CRDTs (called Riak Data Types):
# - Counters (PN-Counters)
# - Sets (OR-Sets)
# - Maps (nested CRDTs)
# - Flags (boolean)
# - Registers (LWW)

# Example Riak counter:
# curl -XPOST 'http://riak:8098/types/counters/buckets/likes/datatypes/post-123' \
#   -H 'Content-Type: application/json' \
#   -d '{"increment": 1}'
#
# Multiple nodes can increment concurrently - they'll merge correctly!
```

NewSQL: Attempting to Have It All
NewSQL systems like CockroachDB, YugabyteDB, and TiDB attempt to provide SQL semantics and ACID transactions together with the horizontal scalability of NoSQL systems. They achieve this through clever engineering: data is split into ranges, each replicated by its own Raft or Paxos group, with hybrid logical clocks ordering transactions across ranges.
Trade-off: They're still bound by CAP. During partitions, minority partitions become unavailable. The 'magic' is in making partitions rare and recovery fast, not in escaping CAP.
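The quorum arithmetic behind minority-partition unavailability is easy to check; `range_available` below is a toy model of a consensus-replicated range, not any particular database's implementation:

```python
def range_available(replicas_alive: int, replication_factor: int = 3) -> bool:
    """A Raft-replicated range stays writable only with a majority of replicas."""
    return replicas_alive > replication_factor // 2

# A 3-way replicated range tolerates one lost replica, not two:
assert range_available(3) and range_available(2)
assert not range_available(1)

# A 5-way replicated range tolerates two:
assert range_available(3, replication_factor=5)
assert not range_available(2, replication_factor=5)
```

Raising the replication factor buys more fault tolerance per range at the cost of more replication traffic; either way, a partition that strands a minority of replicas makes that range unavailable for writes.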
The Harvest/Yield Model:
An alternative framework proposed by Eric Brewer himself (with Armando Fox):
Instead of binary C and A, you trade between partial answers (reduced harvest) and some failures (reduced yield). This models graceful degradation: a search that returns results from 9 of 10 shards is more useful than one that fails completely.
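With invented numbers, the trade can be made concrete: a 10-shard search service that loses one shard can keep yield at 100% by accepting 90% harvest, or keep harvest at 100% by failing every query that touches the dead shard.

```python
def harvest(shards_responding: int, shards_total: int) -> float:
    """Fraction of the complete answer reflected in a response."""
    return shards_responding / shards_total

def yield_fraction(requests_completed: int, requests_attempted: int) -> float:
    """Fraction of requests that complete at all."""
    return requests_completed / requests_attempted

# Strict mode: fail any query touching the unreachable shard.
strict = {'harvest': 1.0, 'yield': yield_fraction(0, 100)}

# Degraded mode: answer from the 9 reachable shards.
degraded = {'harvest': harvest(9, 10), 'yield': yield_fraction(100, 100)}

print(strict)    # {'harvest': 1.0, 'yield': 0.0}
print(degraded)  # {'harvest': 0.9, 'yield': 1.0}
```

The degraded mode is what most search and feed systems actually ship: partial answers beat no answers.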
CAP describes fundamental trade-offs, but it doesn't limit what you can build. Spanner achieves global consistency through hardware investment. CRDTs eliminate conflicts through math. NewSQL systems push the boundaries of what's practical. The lesson is not 'CAP makes X impossible' but 'CAP tells us what's physically possible, and clever engineering can get us remarkably close to all three properties most of the time.'
Let's consolidate everything into a practical decision-making guide. When choosing a distributed database, follow this framework:
| Use Case | Recommended | Why | Alternative |
|---|---|---|---|
| Banking/Financial | CockroachDB, Spanner | ACID, strong consistency | PostgreSQL (single region) |
| E-commerce catalog | DynamoDB, MongoDB | Read-heavy, eventual OK | Cassandra |
| Social media feeds | Cassandra, Redis | High write volume, AP | DynamoDB |
| Real-time gaming | Redis, DynamoDB | Ultra-low latency | Cassandra |
| Analytics/OLAP | BigQuery, Redshift | Not transactional | Snowflake |
| IoT/Time-series | TimescaleDB, InfluxDB | Optimized for time data | Cassandra |
| Session storage | Redis, DynamoDB | Fast, ephemeral | Memcached |
| Configuration | etcd, ZooKeeper | Strong consistency, coordination | Consul |
| Full-text search | Elasticsearch | Search-optimized | Algolia |
| Global user data | Spanner, CockroachDB | SQL with global reach | DynamoDB Global Tables |
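The table above can be encoded as a rough rule of thumb. This is entirely illustrative: `suggest_database` is a toy helper, and real selection involves far more inputs (latency budgets, operational expertise, cost, existing cloud commitments).

```python
def suggest_database(needs_strong_consistency: bool,
                     write_heavy: bool,
                     global_scale: bool) -> str:
    """Toy decision helper mirroring the use-case table; not a substitute for evaluation."""
    if needs_strong_consistency:
        # CP territory: banking, inventory, configuration
        return 'Spanner/CockroachDB' if global_scale else 'PostgreSQL/CockroachDB'
    if write_heavy:
        # AP territory: feeds, time-series, telemetry
        return 'Cassandra/DynamoDB'
    # Read-heavy, staleness-tolerant workloads
    return 'DynamoDB/MongoDB'

print(suggest_database(needs_strong_consistency=True,
                       write_heavy=False,
                       global_scale=True))   # Spanner/CockroachDB
```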
There's no universally 'best' database. The best database is the one that matches YOUR specific requirements: your consistency needs, latency constraints, operational capabilities, and team expertise. CAP gives you the framework to understand what's possible; the rest is engineering judgment.
Congratulations! You've completed a comprehensive exploration of the CAP theorem—from its theoretical foundations to practical system selection. Let's consolidate everything we've learned across this module:
The Meta-Lesson:
CAP isn't a constraint to work around—it's a fundamental truth of distributed systems. Trying to violate CAP leads to systems that fail unpredictably. Embracing CAP leads to systems that fail gracefully and predictably.
The art of distributed systems design is choosing the right trade-off for each workload, making failure modes explicit, and degrading gracefully when partitions inevitably occur.
With this foundation, you're equipped to design, evaluate, and operate distributed databases with a clear understanding of their inherent trade-offs.
You now have a deep, comprehensive understanding of the CAP theorem—the foundational principle of distributed database design. You can analyze systems' consistency and availability properties, make informed trade-off decisions, evaluate vendor claims, and choose the right database for any use case. This knowledge is essential for any engineer building or operating distributed systems.