Availability vs Consistency - Learning Module

Choosing CP vs AP

The Architecture-Defining Decision

Now that we understand the CAP theorem at a theoretical level, we face the practical question every distributed systems architect must answer: When a network partition occurs, should my system prioritize consistency or availability?

This decision is not purely technical. It has profound implications for user experience, business operations, legal compliance, and operational complexity. The wrong choice can result in:

CP when AP was needed: Frustrated users who can't access the system during minor network issues.
AP when CP was needed: Data corruption, financial discrepancies, or regulatory violations.

This page provides a rigorous framework for making this decision. We'll examine the criteria that drive CP vs AP choices, analyze how leading companies have made these decisions, and develop practical heuristics you can apply to your own systems.

What You Will Master

By the end of this page, you will have a systematic approach for determining whether a system—or specific operations within a system—should prioritize consistency or availability. You'll understand the business and technical factors that drive these decisions in real-world systems.

When to Choose CP (Consistency over Availability)

CP systems sacrifice availability during partitions to ensure that all operations observe a consistent view of the data. This means some requests will fail or block when partitions occur, but the data integrity is guaranteed.

The Core CP Question

Ask: "If two nodes have different views of the data during a partition, and both accept writes, would the resulting inconsistency cause unacceptable harm?"

If yes → Choose CP.

Strong Indicators for CP

•Financial transactions — Bank balances, payment processing, and fund transfers cannot tolerate inconsistency. Double-spending, negative balances, or lost transactions have severe consequences. A blocked transaction is annoying; a duplicated or missing transaction is catastrophic.
•Inventory management — Overselling products due to inconsistent inventory counts damages customer trust and operational efficiency. Airlines can't sell the same seat twice; warehouses can't ship products they don't have.
•Unique resource allocation — Usernames, email addresses, ticket numbers, reservation IDs—any system where items must be globally unique requires CP semantics. Two users claiming the same username is unacceptable.
•Coordination and leadership — Distributed locks, leader election, and configuration management require consensus. Two nodes believing they are the leader causes split-brain disasters.
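•Regulatory compliance — Healthcare records, financial auditing, and legal documents often carry regulatory requirements for data integrity and auditability. Regulations such as HIPAA, SOX, and GDPR do not prescribe a consistency model, but their accuracy and audit obligations are usually much easier to meet with strong consistency for the affected data.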
•Sequential processing — Systems where the order of operations matters (event sourcing, transactional outbox) need consistent ordering. Out-of-order or missing events break downstream consumers.

Case Study: Banking Systems

Consider a simple bank account with a $100 balance. Two users attempt withdrawals simultaneously during a network partition:

AP System (Dangerous for Banking):

Node A: Processes $80 withdrawal → Balance = $20
Node B: Processes $60 withdrawal → Balance = $40
Partition heals → The true balance is -$40 (overdraft), yet each node recorded a positive balance.
Bank has disbursed $140 from a $100 account.

CP System (Correct for Banking):

Node A: Attempts $80 withdrawal → Cannot reach quorum → Returns error "Service temporarily unavailable."
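Node B: Same behavior, since neither side of this two-node split holds a majority. (In a larger cluster, the side that still holds a majority would keep serving; only the minority side would reject requests.)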
Partition heals → First successful request processes, second may be rejected if balance insufficient.
Bank maintains integrity at the cost of temporary unavailability.

The business accepts: Some withdrawal requests failing during network issues is acceptable. Uncontrolled overdrafts are not.
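
The two outcomes above can be replayed in a few lines. This is a minimal sketch, not a real banking protocol: `Replica`, `withdrawAP`, and `withdrawCP` are hypothetical names, and the quorum rule is assumed to be a simple majority of the cluster.

```typescript
// Hypothetical in-memory model of the case study; all names are illustrative.
type WithdrawResult =
  | { ok: true; balance: number }
  | { ok: false; reason: string };

interface Replica {
  balance: number;
}

// AP behaviour: a replica decides using only its own local copy.
function withdrawAP(replica: Replica, amount: number): WithdrawResult {
  if (replica.balance < amount) return { ok: false, reason: "insufficient funds" };
  replica.balance -= amount;
  return { ok: true, balance: replica.balance };
}

// CP behaviour (assumed majority quorum): refuse unless most replicas are reachable.
function withdrawCP(reachable: Replica[], clusterSize: number, amount: number): WithdrawResult {
  const quorum = Math.floor(clusterSize / 2) + 1;
  const first = reachable[0];
  if (first === undefined || reachable.length < quorum) {
    return { ok: false, reason: "service temporarily unavailable" };
  }
  if (first.balance < amount) return { ok: false, reason: "insufficient funds" };
  for (const r of reachable) r.balance -= amount;
  return { ok: true, balance: first.balance };
}

// AP during the partition: both withdrawals succeed against stale local balances.
const apA: Replica = { balance: 100 };
const apB: Replica = { balance: 100 };
const apFirst = withdrawAP(apA, 80);  // accepted, node A now records 20
const apSecond = withdrawAP(apB, 60); // accepted, node B now records 40

// CP during the partition: each side reaches 1 of 2 replicas, below the quorum of 2.
const cpA: Replica = { balance: 100 };
const cpB: Replica = { balance: 100 };
const cpDuring = withdrawCP([cpA], 2, 80);       // rejected as unavailable
// After the partition heals, both replicas are reachable again.
const cpHealed1 = withdrawCP([cpA, cpB], 2, 80); // accepted, balance 20
const cpHealed2 = withdrawCP([cpA, cpB], 2, 60); // rejected, insufficient funds
```

In the AP run, $140 leaves a $100 account; in the CP run, one request fails during the partition and the account never goes negative.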

CP Is About Preventing Harm

Choose CP when the consequences of inconsistency are more severe than the consequences of temporary unavailability. If wrong data is worse than no data, CP is your answer.

When to Choose AP (Availability over Consistency)

AP systems continue serving requests during partitions, accepting that different nodes may have temporarily divergent views of the data. Conflicts are resolved after the partition heals.

The Core AP Question

Ask: "If users receive stale or conflicting data during a partition, can the system recover gracefully when the partition heals? Is temporary inconsistency tolerable?"

If yes → Choose AP.

Strong Indicators for AP

•Shopping carts — Users expect to add items regardless of network conditions. Conflicts (same item added twice) can be merged on checkout. Losing the entire cart during a partition is worse than having a slightly inconsistent one.
•Social media feeds — Seeing a post from 30 seconds ago instead of 10 seconds ago is imperceptible to users. Feeds that refuse to load during minor network issues are frustrating.
•Content publishing — Blog posts, comments, and user-generated content can tolerate eventual consistency. A comment appearing 5 seconds later than expected is not a crisis.
•Caching layers — By definition, caches serve potentially stale data. Cache availability is critical; cache freshness is best-effort.
•Analytics and metrics — Dashboards and reports often aggregate data over time. A 1% variance in numbers due to replication lag is acceptable for most analytics use cases.
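•Collaborative editing (with CRDTs or operational transformation) — Editors keep accepting local changes while disconnected and merge them on reconnect; Google Docs does this with operational transformation rather than CRDTs. The availability of editing trumps immediate consistency across editors.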
•DNS and service discovery — Stale DNS records are better than no DNS resolution. The internet is built on eventually consistent data.

Case Study: Amazon's Shopping Cart

Amazon's Dynamo paper famously advocated for AP architecture in shopping carts. The reasoning:

The problem with CP for carts:

During a partition, users cannot add items to their cart.
User frustration leads to abandoned sessions.
Revenue loss is immediate and measurable.

The AP solution:

Users can always add items, even during partitions.
Each node accepts additions independently.
On checkout (or partition heal), carts are merged.
Merge strategy: Union of all items (never lose an addition).
Result: A cart might contain duplicates, or an item the user had removed may reappear; both are easily handled at checkout.

The business trade-off: Having an extra item in a cart (user removes it) is far less costly than a user unable to add items and leaving the site. The conversion rate impact of unavailability dwarfs the minor friction of occasional cart merges.
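
A union merge of this kind can be sketched as follows. The cart shape (product ID mapped to quantity) and the rule of keeping the larger quantity for a product seen on both sides are assumptions made for illustration; the text above only requires that no addition is lost.

```typescript
// Hypothetical cart shape: product ID -> quantity.
type Cart = Record<string, number>;

// Union merge: every product seen on either side survives.
// For a product present on both sides, keep the larger quantity so that an
// addition already replicated to both sides is not counted twice.
function mergeCarts(a: Cart, b: Cart): Cart {
  const merged: Cart = { ...a };
  for (const productId of Object.keys(b)) {
    merged[productId] = Math.max(merged[productId] ?? 0, b[productId] ?? 0);
  }
  return merged;
}

// Two divergent copies of the same user's cart, one per side of the partition.
const cartOnNodeA: Cart = { book: 1, pen: 2 };
const cartOnNodeB: Cart = { pen: 3, lamp: 1 };
const mergedCart = mergeCarts(cartOnNodeA, cartOnNodeB);
// mergedCart holds book: 1, pen: 3, lamp: 1
```

Because this merge only keeps or raises quantities, a removal made on one side can be undone by the merge. That is the price of never losing an addition.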

Conflict Resolution in AP Systems

Choosing AP requires a conflict resolution strategy for when the partition heals:

Strategy	Description	Best For
Last-Write-Wins (LWW)	Most recent timestamp wins	Simple, low-conflict data
First-Write-Wins	Original value preserved	Immutable-ish data
Merge	Combine conflicting values	Sets, counters, accumulators
Application-specific	Custom logic	Complex domain rules
CRDTs	Mathematically guaranteed merge	Collaborative apps

AP Is About User Experience

Choose AP when unavailability causes more user harm than inconsistency. If a degraded experience is better than no experience, and conflicts can be resolved later, AP is your answer.

A Systematic Decision Framework

Rather than relying on intuition, use a structured framework to evaluate CP vs AP trade-offs. This framework examines multiple dimensions of the problem.

CP vs AP Decision Matrix
Factor	Favors CP	Favors AP
Cost of Inconsistency	Financial loss, regulatory violation, data corruption	Minor user inconvenience, self-correcting errors
Cost of Unavailability	Users can wait or retry	Revenue loss, user abandonment, SLA violations
Conflict Resolution	Complex or impossible to merge	Simple merge strategy exists (LWW, CRDTs)
Data Criticality	Source of truth, authoritative records	Derived data, caches, aggregations
User Expectations	"My money must be correct"	"I want it to work even if imperfect"
Recovery Path	No good way to fix bad data	Conflicts detectable and resolvable
Partition Frequency	Rare (can absorb brief outages)	Frequent (must operate through them)
Read/Write Ratio	Write-heavy (conflicts likely)	Read-heavy (stale reads acceptable)

The 5-Question Framework

For any system or operation, answer these questions:

1. What happens if data is inconsistent during a partition?

Scenario: Node A has version X, Node B has version Y, both are serving requests.
Is this acceptable for minutes? Hours? Ever?

2. What happens if the system is unavailable during a partition?

Scenario: Users receive errors or cannot complete operations.
Is this acceptable for minutes? Hours?

3. Can conflicts be detected and resolved after the partition heals?

Is there a clear "winner" (timestamp, version vector)?
Can conflicting values be merged (sets, CRDTs)?
Will conflicts require human intervention?

4. What are the business and regulatory constraints?

Are there SLAs for availability?
Are there compliance requirements for data accuracy?
What are the financial implications of each failure mode?

5. What do users expect?

Will users understand temporary errors better than stale data, or vice versa?
What experience do competitors provide?

decision-pseudocode.txt

CP vs AP DECISION FLOWCHART
════════════════════════════════════════════════════════════════
 
                    ┌────────────────────────────────┐
                    │ Can inconsistency cause        │
                    │ financial/legal/safety harm?   │
                    └───────────────┬────────────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    ▼                               ▼
                   YES                              NO
                    │                               │
                    ▼                               ▼
            ┌──────────────┐               ┌───────────────────────┐
            │ Strong CP    │               │ Is there a natural    │
            │ Required     │               │ conflict resolution?  │
            └──────────────┘               └───────────┬───────────┘
                                                       │
                                           ┌───────────┴───────────┐
                                           ▼                       ▼
                                          YES                      NO
                                           │                       │
                                           ▼                       ▼
                                   ┌───────────────┐       ┌──────────────────┐
                                   │ Does downtime │       │ Lean toward CP;  │
                                   │ cause harm?   │       │ manual conflicts │
                                   └───────┬───────┘       │ are expensive    │
                                           │               └──────────────────┘
                               ┌───────────┴───────────┐
                               ▼                       ▼
                              YES                      NO
                               │                       │
                               ▼                       ▼
                       ┌──────────────┐        ┌──────────────┐
                       │ AP Preferred │        │ CP Preferred │
                       │ with merge   │        │              │
                       └──────────────┘        └──────────────┘
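
The flowchart can also be written as a small function, which is handy when reviewing many operations at once. This is a sketch under assumptions: the `OperationProfile` fields and the `recommend` function are hypothetical names that mirror the three questions in the chart, and the answers still have to come from human judgment.

```typescript
// The four leaves of the flowchart above.
type Recommendation =
  | "Strong CP required"
  | "AP preferred with merge"
  | "CP preferred"
  | "Lean toward CP";

// Hypothetical answers to the three questions in the flowchart.
interface OperationProfile {
  inconsistencyCausesHarm: boolean;      // financial, legal, or safety harm
  hasNaturalConflictResolution: boolean; // e.g. set union, counters, CRDTs
  downtimeCausesHarm: boolean;
}

function recommend(p: OperationProfile): Recommendation {
  if (p.inconsistencyCausesHarm) return "Strong CP required";
  // Without a natural merge, conflicts need manual handling, which is expensive.
  if (!p.hasNaturalConflictResolution) return "Lean toward CP";
  return p.downtimeCausesHarm ? "AP preferred with merge" : "CP preferred";
}

// Example profiles taken from earlier sections of this page.
const payment = recommend({
  inconsistencyCausesHarm: true,
  hasNaturalConflictResolution: false,
  downtimeCausesHarm: true,
});
const shoppingCart = recommend({
  inconsistencyCausesHarm: false,
  hasNaturalConflictResolution: true,
  downtimeCausesHarm: true,
});
```

Here `payment` evaluates to "Strong CP required" and `shoppingCart` to "AP preferred with merge", matching the banking and cart case studies.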

Most Systems Are Hybrid

Few systems are purely CP or AP. A typical e-commerce platform might use CP for payments and inventory, AP for product catalogs and reviews, and different consistency levels for different read operations. The framework applies per operation or data type, not globally.

Real-World Case Studies

Let's examine how major companies have made CP vs AP decisions and the reasoning behind their choices.

Case Study: Stripe (Payment Processing)

•Domain: Financial transactions, payment processing
•Choice: CP (strong consistency)
•Reasoning: A double-charge or a lost payment is unacceptable. Users and merchants have zero tolerance for financial discrepancies, and financial reporting and audit obligations require accurate records.
•Implementation: Synchronous replication, distributed transactions, payment state machines with exactly-once semantics.
•Trade-off accepted: Brief unavailability during partition is acceptable; users retry or wait. A payment failure is preferable to a payment error.

Case Study: Twitter (Feed and Timelines)

•Domain: Social media, timeline generation
•Choice: AP (eventual consistency)
•Reasoning: Timeline freshness is best-effort; seeing a tweet 5 seconds late is imperceptible. Users expect the app to always load, even during network issues.
•Implementation: Manhattan (Twitter's distributed database), asynchronous fan-out, cached timelines.
•Trade-off accepted: Some users may see tweets out of order or slightly stale. This is vastly preferable to the app refusing to load.

Case Study: Google Spanner (Financial Services)

•Domain: Global scale transactional database (used by Google and enterprise)
•Choice: CP (externally consistent)
•Reasoning: Spanner was designed for use cases requiring ACID transactions across globally distributed data. Its primary users (Google's AdWords, enterprise banking) cannot tolerate inconsistency.
•Implementation: TrueTime (GPS/atomic clock synchronization), Paxos consensus per shard, synchronous cross-region commits.
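•Trade-off accepted: Higher write latency (every commit waits out the TrueTime uncertainty bound, typically single-digit milliseconds, in addition to cross-region Paxos round trips) and partial unavailability during partitions. Worth it for strong guarantees.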

Case Study: Netflix (Content Catalog)

•Domain: Streaming, content metadata, personalization
•Choice: AP (eventual consistency, with exceptions)
•Reasoning: Users must always be able to browse and stream. Showing slightly stale recommendations or an outdated thumbnail is invisible to users; failing to load the app is visible.
•Implementation: EVCache (memcached-based caching), Cassandra for persistent storage, eventual consistency with aggressive TTLs.
•Exception: Billing and entitlements use stronger consistency—you shouldn't be able to stream content you haven't paid for, and subscriptions must be accurately tracked.

Pattern: Hybrid Consistency Within One System

Notice that Netflix uses different consistency models for different data:

Data Type	Consistency	Rationale
Content catalog	Eventual	Static data, cached heavily
Personalization	Eventual	Stale recommendations still work
Playback session	Session-level	User shouldn't lose progress mid-movie
Billing/Entitlements	Strong	Must not stream unpaid content
Account credentials	Strong	Login must be accurate

This is the mature approach: analyzing each data domain and applying the appropriate consistency model, rather than one-size-fits-all.

Study Real Architectures

Read engineering blogs from Netflix, Uber, Airbnb, and other companies. They publish detailed explanations of their consistency choices, providing invaluable insight into how these decisions are made at scale.

Mixed Consistency Strategies

Most production systems don't fit neatly into CP or AP—they employ mixed strategies that vary consistency by operation, data type, or client requirements.

Strategy 1: Read-Your-Writes at Client Level

For many applications, users expect to see their own writes immediately, but don't need to see others' writes instantly.

Implementation:

After a write, route that client's subsequent reads to the same replica (or a replica with the update).
Use session tokens or user IDs for routing.
Other users may see eventually consistent data, but each user has a consistent view of their own actions.

// Client-side read-your-writes.
// Sketch only: `db` (a replica-aware client) and `session` (per-user session
// state) are assumed to be provided by the application.
async function updateProfile(userId, data) {
  const result = await db.write('profiles', userId, data);

  // Remember the timestamp of this client's latest write
  session.lastWriteTimestamp = result.timestamp;

  return result;
}

async function getProfile(userId) {
  // Only accept a replica that has caught up to this client's last write
  return db.read('profiles', userId, {
    minTimestamp: session.lastWriteTimestamp ?? 0
  });
}

Strategy 2: Consistency Per Table or Collection

Different data naturally has different consistency requirements.

Implementation:

Configure consistency at the table level.
Some tables use synchronous replication; others use async.
Application code queries without knowing the underlying consistency.
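
A per-table policy of this kind might look like the sketch below. The table names, the `tablePolicy` map, and the helper functions are all hypothetical, and treating "synchronous" as "wait for every replica" is a simplifying assumption; in practice this configuration usually lives in the datastore rather than in application code.

```typescript
type Replication = "sync" | "async";

// Hypothetical policy: which tables replicate synchronously.
const tablePolicy: Record<string, Replication> = {
  payments: "sync",
  inventory: "sync",
  product_catalog: "async",
  reviews: "async",
};

// Tables missing from the policy fall back to the safer synchronous mode.
function replicationFor(table: string): Replication {
  return tablePolicy[table] ?? "sync";
}

// Replica acknowledgements to wait for before confirming a write.
// Simplification: "sync" waits for all replicas, "async" only for the local one.
function acksRequired(table: string, replicaCount: number): number {
  return replicationFor(table) === "sync" ? replicaCount : 1;
}

const paymentAcks = acksRequired("payments", 3); // 3: wait for every replica
const reviewAcks = acksRequired("reviews", 3);   // 1: replicate in the background
```

The calling code passes only a table name, so the consistency decision stays in one place and can be changed without touching callers.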

Strategy 3: Consistency Per Query

Databases like Cassandra and DynamoDB allow consistency levels per query.

Implementation:

Critical operations specify strong consistency.
Bulk reads or analytics use eventual consistency.
The same data can be accessed with different guarantees depending on the use case.

mixed-consistency-example.ts
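// Example: E-commerce platform with mixed consistency.
// Sketch only: `cache`, `catalog`, `inventory`, `reviews`, and `db` stand in for
// application-provided clients; their option names are illustrative.
 
class ProductService {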
  // Catalog data: AP (eventually consistent), cached aggressively
  async getProductDetails(productId: string): Promise<Product> {
    return await cache.getOrFetch(
      `product:${productId}`,
      () => catalog.get(productId, { consistency: 'eventual' }),
      { ttl: 60 } // 1-minute cache is fine
    );
  }
 
  // Inventory check: CP (strongly consistent) for reservation
  async reserveInventory(productId: string, quantity: number): Promise<boolean> {
    return await inventory.reserve(productId, quantity, {
      consistency: 'strong', // Must not oversell
      timeout: 5000          // Accept brief unavailability
    });
  }
 
  // Reviews: AP (eventually consistent), user-generated content
  async getReviews(productId: string): Promise<Review[]> {
    return await reviews.list(productId, {
      consistency: 'eventual' // Stale reviews are fine
    });
  }
 
  // Add review: Read-your-writes for author
  async addReview(productId: string, authorId: string, review: ReviewInput): Promise<Review> {
    const result = await reviews.create({
      productId,
      authorId,
      ...review
    }, { consistency: 'quorum' }); // Read-your-writes holds only if the author's follow-up read is also a quorum read
 
    // Clear author's cache so they see their own review
    await cache.invalidate(`reviews:${productId}:author:${authorId}`);
    
    return result;
  }
}
 
// Order processing: CP (strongly consistent, transactional)
class OrderService {
  async placeOrder(userId: string, cart: Cart): Promise<Order> {
    return await db.transaction(async (tx) => {
      // Check inventory (strong read)
      for (const item of cart.items) {
        const available = await tx.query(
          'SELECT quantity FROM inventory WHERE product_id = $1 FOR UPDATE',
          [item.productId]
        );
        if (available.quantity < item.quantity) {
          throw new InsufficientInventoryError(item.productId);
        }
      }
 
      // Decrement inventory
      for (const item of cart.items) {
        await tx.query(
          'UPDATE inventory SET quantity = quantity - $1 WHERE product_id = $2',
          [item.quantity, item.productId]
        );
      }
 
      // Create order
      const order = await tx.query(
        'INSERT INTO orders (user_id, items, total) VALUES ($1, $2, $3) RETURNING *',
        [userId, cart.items, cart.total]
      );
 
      return order;
    }, { isolation: 'serializable' }); // Maximum consistency for orders
  }
}

Consistency Hygiene

Document your consistency choices explicitly. Future engineers (and future you) need to understand why each operation uses its particular consistency level. Code comments, architecture decision records (ADRs), and runbooks should capture this.

Implementation Considerations

Having decided on CP or AP for a given operation, implementation choices determine how well your system realizes that intent.

Implementing CP

•Use consensus protocols (Paxos, Raft)
•Synchronous replication (wait for acks)
•Quorum reads and writes (R + W > N)
•Distributed transactions (two-phase commit; sagas, by contrast, provide only eventual consistency through compensating actions)
•Leader-based architectures
•Serializable isolation levels

Implementing AP

•Asynchronous replication
•Vector clocks or version vectors
•CRDTs for automatic merge
•Last-Write-Wins resolution, with tombstones to propagate deletes
•Read repair and anti-entropy
•Multi-leader or leaderless topologies

CP Implementation: Quorum-Based Systems

The most common CP implementation uses quorum reads and writes.

Rule: If R + W > N, where:

N = total replicas
W = replicas that must acknowledge writes
R = replicas that must respond to reads

Then reads are guaranteed to see the latest write (assuming no concurrent writes).

Example with N=3:

W=2, R=2: Any read will contact at least one replica with the latest write.
W=3, R=1: Every replica has the latest write; any read suffices.
W=1, R=3: Must read all replicas but write is fast.

Trade-off: Higher W improves durability but slows writes and reduces write availability. Higher R slows reads but allows lower W.
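
The rule and its trade-off are easy to check mechanically. The sketch below uses hypothetical helper names; the fault-tolerance figures assume that an operation simply needs W (or R) replicas to be reachable.

```typescript
interface QuorumConfig {
  n: number; // total replicas
  w: number; // acknowledgements required for a write
  r: number; // responses required for a read
}

// True when every read quorum shares at least one replica with every write quorum.
function readsOverlapWrites({ n, w, r }: QuorumConfig): boolean {
  return r + w > n;
}

// How many replicas can be unreachable while writes (or reads) still succeed.
function writeFaultTolerance({ n, w }: QuorumConfig): number {
  return n - w;
}
function readFaultTolerance({ n, r }: QuorumConfig): number {
  return n - r;
}

const balanced: QuorumConfig = { n: 3, w: 2, r: 2 };   // overlap, tolerates 1 failure each way
const writeAll: QuorumConfig = { n: 3, w: 3, r: 1 };   // overlap, but any failure blocks writes
const fastButWeak: QuorumConfig = { n: 3, w: 1, r: 1 }; // no overlap: stale reads possible
```

With N=3, `balanced` is the only one of these three that keeps the overlap guarantee while still tolerating one unreachable replica for both reads and writes.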

AP Implementation: Conflict Resolution

AP systems need strategies for merging divergent data:

Last-Write-Wins (LWW):

Simple: highest timestamp wins.
Risk: Clock skew causes "wrong" winner.
Use when: Loss of conflicting write is acceptable.
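
A minimal LWW merge, and the clock-skew risk it carries, can be sketched like this. The `Versioned` shape and the node-ID tie-break are assumptions for illustration, not a specific database's format.

```typescript
// Hypothetical versioned value: wall-clock timestamp plus the writing node's ID.
interface Versioned<T> {
  value: T;
  timestamp: number; // milliseconds on the writer's local clock
  nodeId: string;
}

// Highest timestamp wins; ties are broken by node ID so every replica picks the same winner.
function lwwMerge<T>(a: Versioned<T>, b: Versioned<T>): Versioned<T> {
  if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
  return a.nodeId > b.nodeId ? a : b;
}

// Node B writes later in real time, but its clock runs behind node A's,
// so its write carries the smaller timestamp.
const writtenFirstOnA: Versioned<string> = { value: "old address", timestamp: 10000, nodeId: "A" };
const writtenLaterOnB: Versioned<string> = { value: "new address", timestamp: 9000, nodeId: "B" };

const lwwWinner = lwwMerge(writtenFirstOnA, writtenLaterOnB);
// lwwWinner.value is "old address": the newer write is silently discarded.
```

The merge is deterministic and order-independent, which is why LWW is popular, but the example shows why it suits only data where losing one of the conflicting writes is acceptable.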

Vector Clocks:

Track causality, detect true conflicts.
Presents conflicts to application or user for resolution.
Use when: Conflicts are rare and require explicit handling.
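
The causality check at the heart of vector clocks can be sketched as below. The `VectorClock` representation (node ID mapped to a counter) and the function names are assumptions; production systems add pruning and other details that are omitted here.

```typescript
// Hypothetical representation: node ID -> number of updates seen from that node.
type VectorClock = Record<string, number>;
type Ordering = "before" | "after" | "equal" | "concurrent";

// Record one more local update at `node`.
function tick(clock: VectorClock, node: string): VectorClock {
  return { ...clock, [node]: (clock[node] ?? 0) + 1 };
}

// Compare two clocks entry by entry; missing entries count as 0.
function compareClocks(a: VectorClock, b: VectorClock): Ordering {
  let aBehind = false;
  let bBehind = false;
  for (const node of Object.keys(a).concat(Object.keys(b))) {
    const x = a[node] ?? 0;
    const y = b[node] ?? 0;
    if (x < y) aBehind = true;
    if (x > y) bBehind = true;
  }
  if (aBehind && bBehind) return "concurrent"; // a true conflict
  if (aBehind) return "before";
  if (bBehind) return "after";
  return "equal";
}

const baseClock: VectorClock = { A: 1 };
const editedOnA = tick(baseClock, "A"); // { A: 2 }
const editedOnB = tick(baseClock, "B"); // { A: 1, B: 1 }

const causal = compareClocks(baseClock, editedOnA);   // "before": safe to overwrite
const conflict = compareClocks(editedOnA, editedOnB); // "concurrent": needs resolution
```

Only the "concurrent" result needs to be surfaced to the application or user; "before" and "after" can be resolved automatically by keeping the later version.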

CRDTs (Conflict-free Replicated Data Types):

Mathematically guaranteed to merge without conflicts.
Types: G-Counter, PN-Counter, G-Set, OR-Set, LWW-Register.
Use when: Data naturally fits CRDT semantics (counters, sets).
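
As an illustration of the simplest type in that list, here is a sketch of a G-Counter (grow-only counter). The representation and function names are assumptions; the key idea is that each node increments only its own slot and merging takes the per-node maximum.

```typescript
// Hypothetical representation: node ID -> increments performed at that node.
type GCounter = Record<string, number>;

// A node may only increase its own slot (`by` must be non-negative).
function gIncrement(counter: GCounter, nodeId: string, by = 1): GCounter {
  return { ...counter, [nodeId]: (counter[nodeId] ?? 0) + by };
}

// Merge takes the per-node maximum, so it is commutative, associative, and idempotent.
function gMerge(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const id of Object.keys(b)) {
    merged[id] = Math.max(merged[id] ?? 0, b[id] ?? 0);
  }
  return merged;
}

// The counter's value is the sum over all nodes.
function gValue(counter: GCounter): number {
  return Object.keys(counter).reduce((sum, id) => sum + (counter[id] ?? 0), 0);
}

// Two replicas count page views independently during a partition.
const viewsOnA = gIncrement(gIncrement({}, "A"), "A"); // { A: 2 }
const viewsOnB = gIncrement({}, "B", 3);               // { B: 3 }
const mergedViews = gMerge(viewsOnA, viewsOnB);
const totalViews = gValue(mergedViews);                // 5
```

Merging the same state twice, or in the opposite order, yields the same total, which is what lets replicas exchange state in any order after a partition heals.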

Test Partition Behavior

Don't assume your system behaves correctly during partitions—test it. Use chaos engineering tools to inject network partitions and verify that your CP system properly rejects requests or your AP system properly merges conflicts. Untested partition behavior is unproven partition behavior.

Summary: Making the CP vs AP Decision

We've established a rigorous framework for one of distributed systems' most critical decisions. Let's consolidate the key insights.

Key Takeaways

•CP when inconsistency causes harm — Financial transactions, inventory, unique identifiers, and coordination require strong consistency. Temporary unavailability is preferable to data corruption.
•AP when unavailability causes harm — User-facing applications, caches, content delivery, and social features prioritize availability. Stale data is preferable to no data.
•Use the 5-question framework — Systematically evaluate inconsistency cost, unavailability cost, conflict resolution ability, regulatory constraints, and user expectations.
•Apply consistency per operation, not per system — A single application typically needs both CP and AP operations. Analyze each data flow independently.
•Mixed strategies are normal — Read-your-writes, per-table consistency, and per-query consistency allow nuanced trade-offs that pure CP or AP cannot.
•Implementation matters — The decision is meaningless without proper implementation. Quorum systems, CRDTs, and conflict resolution strategies must match your consistency goals.

What's Next

Having established when to choose CP vs AP, the next page explores how to tune consistency for availability—the techniques and patterns that allow systems to provide both strong consistency and good availability, even if they can't have both perfectly.

Page Complete

You now have a systematic framework for making CP vs AP decisions. This framework—evaluating costs, conflict resolution, user expectations, and regulatory requirements—will serve you across all distributed systems you design or evaluate.

                       ┌──────────────┐        ┌──────────────┐
                       │ AP Preferred │        │ CP Preferred │
                       │ with merge   │        │              │
                       └──────────────┘        └──────────────┘
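
The flowchart above can also be read as a small decision function. The following is a minimal sketch, not a definitive rule engine: the `OperationProfile` shape, its field names, and the two sample operations are assumptions introduced for illustration.

```typescript
// Hypothetical input shape; the field names are illustrative assumptions.
interface OperationProfile {
  inconsistencyCausesSevereHarm: boolean; // financial, legal, or safety harm
  hasNaturalConflictResolution: boolean;  // e.g. LWW or a CRDT fits the data
  downtimeCausesHarm: boolean;            // revenue loss, abandonment, SLA breach
}

type Recommendation =
  | "Strong CP required"
  | "Lean toward CP"
  | "AP preferred with merge"
  | "CP preferred";

// Walks the flowchart top to bottom and returns the leaf it lands on.
function recommend(profile: OperationProfile): Recommendation {
  if (profile.inconsistencyCausesSevereHarm) {
    return "Strong CP required";
  }
  if (!profile.hasNaturalConflictResolution) {
    return "Lean toward CP"; // manual conflict resolution is expensive
  }
  return profile.downtimeCausesHarm ? "AP preferred with merge" : "CP preferred";
}

// Hypothetical operations, classified by hand for the example.
const paymentCapture = recommend({
  inconsistencyCausesSevereHarm: true,
  hasNaturalConflictResolution: false,
  downtimeCausesHarm: true,
});
const likeCounter = recommend({
  inconsistencyCausesSevereHarm: false,
  hasNaturalConflictResolution: true,
  downtimeCausesHarm: true,
});

console.log(paymentCapture); // "Strong CP required"
console.log(likeCounter);    // "AP preferred with merge"
```

The three booleans are deliberately coarse. In practice each one is the outcome of the 5-question analysis, applied per operation rather than per system.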

Most Systems Are Hybrid

Real-World Case Studies

Let's examine how major companies have made CP vs AP decisions and the reasoning behind their choices.

Case Study: Stripe (Payment Processing)

•Domain: Financial transactions, payment processing
•Choice: CP (strong consistency)
•Reasoning: A double-charge or a lost payment is unacceptable. Users and merchants have zero tolerance for financial discrepancies, and compliance and audit obligations (PCI-DSS among them) depend on accurate, traceable records.
•Implementation: Synchronous replication, distributed transactions, payment state machines, and idempotency keys so that retried requests take effect only once.
•Trade-off accepted: Brief unavailability during partition is acceptable; users retry or wait. A payment failure is preferable to a payment error.

Case Study: Twitter (Feed and Timelines)

•Domain: Social media, timeline generation
•Choice: AP (eventual consistency)
•Reasoning: Timeline freshness is best-effort; seeing a tweet 5 seconds late is imperceptible. Users expect the app to always load, even during network issues.
•Implementation: Manhattan (Twitter's distributed database), asynchronous fan-out, cached timelines.
•Trade-off accepted: Some users may see tweets out of order or slightly stale. This is vastly preferable to the app refusing to load.

Case Study: Google Spanner (Globally Distributed Database)

•Domain: Global-scale transactional database (used inside Google and offered to enterprises)
•Choice: CP (externally consistent)
•Reasoning: Spanner was designed for use cases requiring ACID transactions across globally distributed data. Its primary users (Google's AdWords, enterprise banking) cannot tolerate inconsistency.
•Implementation: TrueTime (GPS/atomic clock synchronization), Paxos consensus per shard, synchronous cross-region commits.
•Trade-off accepted: Higher write latency (each commit waits out TrueTime's clock uncertainty, roughly 7-10 ms in the originally published design) and partial unavailability during partitions. Worth it for strong guarantees.

Case Study: Netflix (Content Catalog)

•Domain: Streaming, content metadata, personalization
•Choice: AP (eventual consistency, with exceptions)
•Reasoning: Users must always be able to browse and stream. Showing slightly stale recommendations or an outdated thumbnail is invisible to users; failing to load the app is visible.
•Implementation: EVCache (memcached-based caching), Cassandra for persistent storage, eventual consistency with aggressive TTLs.
•Exception: Billing and entitlements use stronger consistency—you shouldn't be able to stream content you haven't paid for, and subscriptions must be accurately tracked.

Pattern: Hybrid Consistency Within One System

Notice that Netflix uses different consistency models for different data:

Data Type	Consistency	Rationale
Content catalog	Eventual	Static data, cached heavily
Personalization	Eventual	Stale recommendations still work
Playback session	Session-level	User shouldn't lose progress mid-movie
Billing/Entitlements	Strong	Must not stream unpaid content
Account credentials	Strong	Login must be accurate

This is the mature approach: analyzing each data domain and applying the appropriate consistency model, rather than one-size-fits-all.

Study Real Architectures

Mixed Consistency Strategies

Most production systems don't fit neatly into CP or AP—they employ mixed strategies that vary consistency by operation, data type, or client requirements.

Strategy 1: Read-Your-Writes at Client Level

For many applications, users expect to see their own writes immediately, but don't need to see others' writes instantly.

Implementation:

After a write, route that client's subsequent reads to the same replica (or a replica with the update).
Use session tokens or user IDs for routing.
Other users may see eventually consistent data, but each user has a consistent view of their own actions.

// Client-side read-your-writes
// (db and session are hypothetical application objects, not a specific library API)
async function updateProfile(userId, data) {
  const result = await db.write('profiles', userId, data);
  
  // Store the write timestamp in session
  session.lastWriteTimestamp = result.timestamp;
  
  return result;
}

async function getProfile(userId) {
  return await db.read('profiles', userId, {
    // Read from replica that has caught up to our last write
    minTimestamp: session.lastWriteTimestamp || 0
  });
}
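
On the serving side, the `minTimestamp` hint has to be turned into a replica choice. The sketch below shows one way that could look, assuming each replica reports the timestamp of the newest write it has applied. The `Replica` shape and `pickReplica` helper are hypothetical names for illustration, not part of any particular database client.

```typescript
// Hypothetical replica descriptor: appliedTimestamp is the newest write this
// replica has applied, on the same clock as the session's write timestamp.
interface Replica {
  name: string;
  appliedTimestamp: number;
}

// Returns the first replica that already reflects the session's last write,
// or undefined when none has caught up (the caller can then wait, retry,
// or fall back to the primary).
function pickReplica(replicas: Replica[], minTimestamp: number): Replica | undefined {
  for (const replica of replicas) {
    if (replica.appliedTimestamp >= minTimestamp) {
      return replica;
    }
  }
  return undefined;
}

const replicas: Replica[] = [
  { name: "replica-a", appliedTimestamp: 100 },
  { name: "replica-b", appliedTimestamp: 140 },
];

// The session last wrote at timestamp 120, so replica-a is too stale.
const chosen = pickReplica(replicas, 120);
console.log(chosen ? chosen.name : "none caught up"); // "replica-b"
```

Returning `undefined` instead of a stale replica is the important design choice: serving the stale replica would silently break the read-your-writes guarantee.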

Strategy 2: Consistency Per Table or Collection

Different data naturally has different consistency requirements.

Implementation:

Configure consistency at the table level.
Some tables use synchronous replication; others use async.
Application code queries without knowing the underlying consistency.
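
A minimal sketch of table-level configuration follows. The table names, the `tableConsistency` map, and the `replicationModeFor` helper are assumptions for illustration; real databases expose this through their own schema or replication settings.

```typescript
type Consistency = "strong" | "eventual";

// Hypothetical policy: one consistency level per table.
const tableConsistency: Record<string, Consistency> = {
  orders: "strong",
  inventory: "strong",
  product_catalog: "eventual",
  reviews: "eventual",
};

// Maps a table's policy to the replication mode the storage layer should use.
// Unknown tables are rejected rather than silently defaulting to a weak mode.
function replicationModeFor(table: string): "synchronous" | "asynchronous" {
  if (!Object.prototype.hasOwnProperty.call(tableConsistency, table)) {
    throw new Error(`No consistency policy configured for table: ${table}`);
  }
  return tableConsistency[table] === "strong" ? "synchronous" : "asynchronous";
}

console.log(replicationModeFor("orders"));  // "synchronous"
console.log(replicationModeFor("reviews")); // "asynchronous"
```

Keeping the policy in one place means application code can query tables without repeating the consistency decision at every call site.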

Strategy 3: Consistency Per Query

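Databases like Cassandra and DynamoDB let the consistency level be chosen per request: Cassandra accepts a consistency level (such as ONE, QUORUM, or ALL) on each statement, and DynamoDB reads take a ConsistentRead flag.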

Implementation:

Critical operations specify strong consistency.
Bulk reads or analytics use eventual consistency.
The same data can be accessed with different guarantees depending on the use case.

mixed-consistency-example.ts
// Example: E-commerce platform with mixed consistency
// (cache, catalog, inventory, reviews, and db are hypothetical clients)
 
class ProductService {
  // Catalog data: AP (eventually consistent), cached aggressively
  async getProductDetails(productId: string): Promise<Product> {
    return await cache.getOrFetch(
      `product:${productId}`,
      () => catalog.get(productId, { consistency: 'eventual' }),
      { ttl: 60 } // 1-minute cache is fine
    );
  }
 
  // Inventory check: CP (strongly consistent) for reservation
  async reserveInventory(productId: string, quantity: number): Promise<boolean> {
    return await inventory.reserve(productId, quantity, {
      consistency: 'strong', // Must not oversell
      timeout: 5000          // Accept brief unavailability
    });
  }
 
  // Reviews: AP (eventually consistent), user-generated content
  async getReviews(productId: string): Promise<Review[]> {
    return await reviews.list(productId, {
      consistency: 'eventual' // Stale reviews are fine
    });
  }
 
  // Add review: Read-your-writes for author
  async addReview(productId: string, authorId: string, review: ReviewInput): Promise<Review> {
    const result = await reviews.create({
      productId,
      authorId,
      ...review
    }, { consistency: 'quorum' }); // Quorum write; the author's follow-up read must also use quorum to be sure of seeing it
 
    // Clear author's cache so they see their own review
    await cache.invalidate(`reviews:${productId}:author:${authorId}`);
    
    return result;
  }
}
 
// Order processing: CP (strongly consistent, transactional)
class OrderService {
  async placeOrder(userId: string, cart: Cart): Promise<Order> {
    return await db.transaction(async (tx) => {
      // Check inventory (strong read)
      for (const item of cart.items) {
        const available = await tx.query(
          'SELECT quantity FROM inventory WHERE product_id = $1 FOR UPDATE',
          [item.productId]
        );
        if (available.quantity < item.quantity) {
          throw new InsufficientInventoryError(item.productId);
        }
      }
 
      // Decrement inventory
      for (const item of cart.items) {
        await tx.query(
          'UPDATE inventory SET quantity = quantity - $1 WHERE product_id = $2',
          [item.quantity, item.productId]
        );
      }
 
      // Create order
      const order = await tx.query(
        'INSERT INTO orders (user_id, items, total) VALUES ($1, $2, $3) RETURNING *',
        [userId, cart.items, cart.total]
      );
 
      return order;
    }, { isolation: 'serializable' }); // Maximum consistency for orders
  }
}

Consistency Hygiene

Implementation Considerations

Having decided on CP or AP for a given operation, implementation choices determine how well your system realizes that intent.

Implementing CP

•Use consensus protocols (Paxos, Raft)
•Synchronous replication (wait for acks)
•Quorum reads and writes (R + W > N)
•Distributed transactions with atomic commit (2PC); sagas are not a fit here because they are only eventually consistent
•Leader-based architectures
•Serializable isolation levels

Implementing AP

•Asynchronous replication
•Vector clocks or version vectors
•CRDTs for automatic merge
•Last-Write-Wins (LWW) resolution, with tombstones to propagate deletes
•Read repair and anti-entropy
•Multi-leader or leaderless topologies

CP Implementation: Quorum-Based Systems

A common building block for CP systems is quorum reads and writes.

Rule: If R + W > N, where:

N = total replicas
W = replicas that must acknowledge writes
R = replicas that must respond to reads

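Then every read quorum overlaps every write quorum, so a read reaches at least one replica holding the latest acknowledged write and can pick it out by version or timestamp (assuming no concurrent writes and strict, not sloppy, quorums).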

Example with N=3:

W=2, R=2: Any read will contact at least one replica with the latest write.
W=3, R=1: Every replica has the latest write; any read suffices.
W=1, R=3: Writes are fast, but every read must reach all three replicas, so a single unreachable replica blocks reads.

Trade-off: Higher W improves durability but slows writes and reduces write availability. Higher R slows reads but allows lower W.
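
The arithmetic behind these configurations can be checked mechanically. The sketch below is illustrative only: `analyzeQuorum` and the `QuorumAnalysis` shape are hypothetical helpers, and "failures tolerated" here simply counts how many replicas can be unreachable while a quorum is still reachable.

```typescript
interface QuorumAnalysis {
  overlaps: boolean;               // true when R + W > N
  writeFailuresTolerated: number;  // replicas that may be down while writes still succeed
  readFailuresTolerated: number;   // replicas that may be down while reads still succeed
}

function analyzeQuorum(n: number, w: number, r: number): QuorumAnalysis {
  if (n < 1 || w < 1 || r < 1 || w > n || r > n) {
    throw new Error("Require 1 <= W <= N and 1 <= R <= N");
  }
  return {
    overlaps: r + w > n,
    writeFailuresTolerated: n - w,
    readFailuresTolerated: n - r,
  };
}

// The three N=3 examples from the text, plus a non-overlapping configuration.
const quorumConfigs: number[][] = [
  [3, 2, 2],
  [3, 3, 1],
  [3, 1, 3],
  [3, 1, 1],
];
for (const [n, w, r] of quorumConfigs) {
  console.log(`N=${n} W=${w} R=${r}`, analyzeQuorum(n, w, r));
}
```

With N=3, only W=2, R=2 tolerates one unreachable replica for both reads and writes, which is one reason majority quorums are a popular default.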

AP Implementation: Conflict Resolution

AP systems need strategies for merging divergent data:

Last-Write-Wins (LWW):

Simple: highest timestamp wins.
Risk: Clock skew causes "wrong" winner.
Use when: Loss of conflicting write is acceptable.
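
To make the clock-skew risk concrete, here is a minimal LWW merge sketch. The `LwwValue` shape, the node-ID tie-break, and the sample timestamps are assumptions chosen for illustration.

```typescript
interface LwwValue<T> {
  value: T;
  timestamp: number; // wall-clock time assigned by the writing node
  nodeId: string;    // used only to break timestamp ties deterministically
}

// Keeps the write with the highest timestamp; ties go to the larger nodeId
// so that every replica picks the same winner regardless of merge order.
function mergeLww<T>(a: LwwValue<T>, b: LwwValue<T>): LwwValue<T> {
  if (a.timestamp !== b.timestamp) {
    return a.timestamp > b.timestamp ? a : b;
  }
  return a.nodeId >= b.nodeId ? a : b;
}

// Node A's clock runs fast. Node B's write actually happened later in real
// time, but it carries a smaller timestamp and is silently discarded.
const writeFromA: LwwValue<string> = { value: "old title", timestamp: 1000, nodeId: "A" };
const writeFromB: LwwValue<string> = { value: "new title", timestamp: 990, nodeId: "B" };

const lwwWinner = mergeLww(writeFromA, writeFromB);
console.log(lwwWinner.value); // "old title"
```

The merge is deterministic and order-independent, which is what makes LWW simple to operate. The cost is that the losing write disappears without any signal to the application.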

Vector Clocks:

Track causality and detect true (concurrent) conflicts.
Present conflicts to the application or user for resolution.
Use when: Conflicts are rare and require explicit handling.
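
Conflict detection comes down to comparing two version vectors entry by entry. The following is a sketch under simple assumptions: vectors are plain objects mapping a node ID to a counter, and a missing entry counts as zero. The `compareVectors` name and `Ordering` labels are illustrative.

```typescript
type VersionVector = Record<string, number>;
type Ordering = "equal" | "before" | "after" | "concurrent";

// Compares a against b: "after" means a has seen everything b has and more;
// "concurrent" means each side has updates the other lacks (a true conflict).
function compareVectors(a: VersionVector, b: VersionVector): Ordering {
  let aAhead = false;
  let bAhead = false;
  const nodeIds = Object.keys(a).concat(Object.keys(b));
  for (const nodeId of nodeIds) {
    const aCount = a[nodeId] || 0;
    const bCount = b[nodeId] || 0;
    if (aCount > bCount) aAhead = true;
    if (bCount > aCount) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent";
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// One version descends from the other: safe to overwrite, no conflict.
console.log(compareVectors({ A: 2, B: 1 }, { A: 1, B: 1 })); // "after"
// Both sides accepted writes during a partition: surface for resolution.
console.log(compareVectors({ A: 2 }, { A: 1, B: 1 }));       // "concurrent"
```

Only the "concurrent" result needs to reach the application or user. The other three outcomes can be resolved automatically by keeping the newer version.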

CRDTs (Conflict-free Replicated Data Types):

Mathematically guaranteed to merge without conflicts.
Types: G-Counter, PN-Counter, G-Set, OR-Set, LWW-Register.
Use when: Data naturally fits CRDT semantics (counters, sets).
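
As an illustration of why CRDT merges never conflict, here is a minimal G-Counter (grow-only counter) sketch. The function names are hypothetical, and the state is modeled as a plain object with one slot per node.

```typescript
// Each node increments only its own slot; the counter's value is the sum.
type GCounter = Record<string, number>;

function incrementCounter(counter: GCounter, nodeId: string, by: number = 1): GCounter {
  if (by < 0) {
    throw new Error("A G-Counter can only grow");
  }
  const next: GCounter = { ...counter };
  next[nodeId] = (next[nodeId] || 0) + by;
  return next;
}

// Merge takes the per-node maximum, which is commutative, associative,
// and idempotent, so replicas converge regardless of merge order.
function mergeCounters(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const nodeId of Object.keys(b)) {
    merged[nodeId] = Math.max(merged[nodeId] || 0, b[nodeId]);
  }
  return merged;
}

function counterValue(counter: GCounter): number {
  let total = 0;
  for (const nodeId of Object.keys(counter)) {
    total += counter[nodeId];
  }
  return total;
}

// Two replicas accept increments independently during a partition.
const counterOnA = incrementCounter(incrementCounter({}, "A"), "A"); // { A: 2 }
const counterOnB = incrementCounter({}, "B", 3);                     // { B: 3 }

const healed = mergeCounters(counterOnA, counterOnB);
console.log(counterValue(healed)); // 5
```

A PN-Counter extends the same idea by pairing two G-Counters, one for increments and one for decrements.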

Test Partition Behavior

Summary: Making the CP vs AP Decision

We've established a rigorous framework for one of distributed systems' most critical decisions. Let's consolidate the key insights.

Key Takeaways

•CP when inconsistency causes harm — Financial transactions, inventory, unique identifiers, and coordination require strong consistency. Temporary unavailability is preferable to data corruption.
•AP when unavailability causes harm — User-facing applications, caches, content delivery, and social features prioritize availability. Stale data is preferable to no data.
•Use the 5-question framework — Systematically evaluate inconsistency cost, unavailability cost, conflict resolution ability, regulatory constraints, and user expectations.
•Apply consistency per operation, not per system — A single application typically needs both CP and AP operations. Analyze each data flow independently.
•Mixed strategies are normal — Read-your-writes, per-table consistency, and per-query consistency allow nuanced trade-offs that pure CP or AP cannot.
•Implementation matters — The decision is meaningless without proper implementation. Quorum systems, CRDTs, and conflict resolution strategies must match your consistency goals.
