CAP Theorem: Understanding Distributed System Trade-offs

Level: Intermediate

Duration: 90 mins

Topic: CAP Theorem

5 / 5

Choosing Between CP and AP — The Final Decision Guide

The Decision That Shapes Everything

You've mastered the theory of CAP. You understand consistency, availability, and partition tolerance. You know that during network partitions, you must choose between C and A. But when you're standing at the whiteboard in a design session, the question becomes very concrete: Should this system be CP or AP?

This decision has profound implications. It affects your technology choices, your failure modes, your user experience during outages, and your operational complexity. Making this choice incorrectly leads to either:

Wrong CP: Users can't complete critical actions during network issues that happen regularly
Wrong AP: Silent data corruption, lost transactions, inconsistent state that's never detected

This page gives you the specific tools and frameworks to make this decision correctly.

What You Will Learn

By the end of this page, you will have concrete decision criteria for choosing between CP and AP, understand when hybrid approaches are appropriate, know how to evaluate trade-offs in domain-specific contexts, and be able to confidently justify your CAP choices in design reviews and interviews.

The Core Decision Criteria

The CP vs AP decision ultimately reduces to one fundamental question:

What is the worse outcome: showing incorrect data or showing nothing at all?

This question has different answers in different contexts. Let's formalize the criteria:

Criterion 1: Cost of Inconsistency

What happens if your system shows stale or conflicting data?

Low Cost	High Cost
User sees yesterday's analytics	User sees wrong account balance
Social feed is slightly stale	Two users book same seat
Cached product description is outdated	Inventory goes negative
Session preference not propagated	Two services have different config

If the cost of inconsistency is High → Lean toward CP

Criterion 2: Cost of Unavailability

What happens if users can't complete their action?

Low Cost	High Cost
Can't update profile right now	Can't place emergency order
Report generation delayed	Can't log in at all
Export fails, will retry	Checkout abandonment during sale
Admin action deferred	IoT sensor can't report critical alert

If the cost of unavailability is High → Lean toward AP

Criterion 3: Conflict Resolution Complexity

If you choose AP and conflicts occur, how hard are they to resolve?

Easy to Resolve	Hard to Resolve
Counters (sum them)	Unique constraints (who wins?)
Add-only sets (union)	Overwrites where both matter
LWW where order doesn't matter	Ordered sequences (whose order?)
Idempotent operations	Stateful workflows

If conflicts are hard to resolve → Lean toward CP

Criterion 4: Detection and Recovery Capability

Can you detect and recover from inconsistency after the fact?

Detectable & Recoverable	Silent & Permanent
Audit logs enable reconciliation	Data just overwrites
Compensating transactions possible	Actions are irreversible
Business process allows fixes	Real-world effects immediate
Users can surface issues	Users don't know about problem

If inconsistency is undetectable or unrecoverable → CP is mandatory

The Ultimate Test

Ask yourself: "If I'm on call at 3 AM and facing a partition, which page would I rather receive: 'Users seeing errors' or 'Potential data inconsistency'?" The answer reveals your true priority. This gut check often clarifies ambiguous cases.

The CP/AP Decision Framework

Use this systematic framework to make and document your CP/AP decisions:

Step 1: Define the Data in Question

Be specific. Don't say "the database." Identify:

What entities are we storing?
What operations are performed?
What are the access patterns?
What are the consistency requirements per operation?

Step 2: Enumerate the Failure Scenarios

List what can go wrong during partitions:

Which nodes might be partitioned?
What operations could be in flight?
What state might diverge?
How long might partitions last?

Step 3: Evaluate Each Criterion (1-5 scale)

| Criterion | Score 1 (Low/Easy) | Score 5 (High/Hard) |
|-----------|--------------------|---------------------|
| Cost of Inconsistency | Cosmetic issue | Financial/safety loss |
| Cost of Unavailability | Minor inconvenience | Critical blocking |
| Conflict Resolution | Automatic merge | Impossible to merge |
| Detection/Recovery | Easily detected, fixed | Silent corruption |

Step 4: Apply the Decision Matrix

DECISION MATRIX: Score each factor 1-5 and sum
 
┌──────────────────────────────────────────────────────────────────┐
│  FACTOR                              │ CP Points │ AP Points     │
├──────────────────────────────────────────────────────────────────┤
│  Cost of Inconsistency               │  +Score   │    -         │
│  Cost of Unavailability              │    -      │  +Score      │
│  Conflict Resolution Difficulty      │  +Score   │    -         │
│  Detection/Recovery Difficulty       │  +Score   │    -         │
├──────────────────────────────────────────────────────────────────┤
│  TOTAL                               │  Sum(CP)  │  Sum(AP)     │
└──────────────────────────────────────────────────────────────────┘
 
INTERPRETATION:
- CP Points > AP Points by 3+: Strong CP
- CP Points > AP Points by 1-2: CP with AP fallback modes
- Roughly equal: Hybrid approach, analyze per-operation
- AP Points > CP Points by 1-2: AP with CP for critical paths
- AP Points > CP Points by 3+: Strong AP
 
EXAMPLE: User Session Store
┌───────────────────────────────────────────────────────────────┐
│  Cost of Inconsistency: 2                                     │
│  (User sees old prefs - mildly annoying)                      │
│                                                               │
│  Cost of Unavailability: 5                                    │
│  (User can't log in - blocks everything)                      │
│                                                               │
│  Conflict Resolution: 1                                       │
│  (LWW works fine for sessions)                                │
│                                                               │
│  Detection/Recovery: 2                                        │
│  (User will notice and fix preferences)                       │
├───────────────────────────────────────────────────────────────┤
│  CP Points: 2 + 1 + 2 = 5                                     │
│  AP Points: 5                                                 │
│                                                               │
│  RESULT: Roughly equal, but availability critical             │
│  DECISION: AP with sticky sessions for read-your-writes       │
└───────────────────────────────────────────────────────────────┘
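The matrix and its interpretation rules can be expressed as a small function. The following is a minimal Python sketch; the names `CapScores` and `interpret` are illustrative, and treating "roughly equal" as a point difference of exactly zero is an assumption, since the thresholds above leave that case open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapScores:
    """Scores on the 1-5 scale from Step 3 (hypothetical helper type)."""
    inconsistency_cost: int
    unavailability_cost: int
    conflict_resolution: int
    detection_recovery: int

    def cp_points(self) -> int:
        # Three of the four factors argue for CP.
        return (self.inconsistency_cost
                + self.conflict_resolution
                + self.detection_recovery)

    def ap_points(self) -> int:
        # Only the cost of unavailability argues for AP.
        return self.unavailability_cost


def interpret(scores: CapScores) -> str:
    """Map the point difference onto the interpretation bands."""
    diff = scores.cp_points() - scores.ap_points()
    if diff >= 3:
        return "Strong CP"
    if diff >= 1:
        return "CP with AP fallback modes"
    if diff == 0:  # assumption: "roughly equal" means a tie
        return "Hybrid approach, analyze per-operation"
    if diff >= -2:
        return "AP with CP for critical paths"
    return "Strong AP"


session_store = CapScores(
    inconsistency_cost=2,
    unavailability_cost=5,
    conflict_resolution=1,
    detection_recovery=2,
)
print(session_store.cp_points(), session_store.ap_points(), interpret(session_store))
# prints: 5 5 Hybrid approach, analyze per-operation
```

Note that the matrix has three CP-leaning factors and one AP-leaning factor, so the sums are a conversation starter rather than a verdict. The session store example shows this: the tie is broken by judgment about how critical availability is.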

Step 5: Document the Decision

Create an Architecture Decision Record (ADR):

# ADR-042: Session Store CAP Strategy

## Status: Accepted

## Context
Our session store handles user authentication state across 
multiple data centers. Users expect to remain logged in 
regardless of network conditions.

## Decision
We will use an AP-leaning strategy (Redis Cluster with asynchronous replication) for session storage.

## Scoring
- Cost of Inconsistency: 2/5
- Cost of Unavailability: 5/5
- Conflict Resolution: 1/5
- Detection/Recovery: 2/5

## Consequences
- Positive: High availability during partitions
- Positive: Low latency for session lookups
- Negative: Rare cases of seeing old session data
- Mitigation: Sticky sessions for read-your-writes guarantee

This documentation ensures your decision is understood, defensible, and reviewable.

Decisions Should Be Per-Component

Don't make a single CP/AP decision for your entire system. Each data type, service, or feature may have different requirements. A well-designed system has explicit CP/AP decisions for each component, documented and justified.

Domain-Specific Guidance

Different domains have established patterns based on their unique requirements. Use these as starting points, not absolute rules.

Financial Services:

Component	Recommendation	Rationale
Account balances	CP	Incorrect balance = regulatory violation
Transaction history	CP	Must be accurate for auditing
Exchange rates	AP	Slightly stale rate is acceptable
Fraud alerts	AP	Better to false-alarm than miss
User preferences	AP	Low stakes, high volume

E-Commerce:

Component	Recommendation	Rationale
Order placement	CP	Must not lose/duplicate orders
Inventory (reserve)	CP	Overselling has real costs
Product catalog	AP	Stale prices/descriptions OK briefly
Shopping cart	AP	Must always be available to add items
Session/auth	AP	Users must be able to browse
Search results	AP	Stale product list is acceptable
Reviews/ratings	AP	Eventual consistency is fine

Industry-Specific CAP Guidance
Industry	Typical CP Data	Typical AP Data	Hybrid Areas
Healthcare	Patient records, prescriptions, allergies	Appointment scheduling, wait times	Billing (CP for charges, AP for estimates)
Gaming	Player inventory, currency, rankings	Chat, social features, analytics	Matchmaking (CP to finalize, AP to search)
Social Media	Authentication, privacy settings	Feeds, content, reactions, DMs	Follower counts (AP but reconciled)
IoT/Telemetry	Device config, firmware versions	Sensor readings, metrics, logs	Alert thresholds (depends on criticality)
Travel/Booking	Reservations, payments, tickets	Search, pricing, availability display	Inventory (CP to book, AP to display)
Logistics	Shipping records, tracking updates	Fleet location, capacity estimates	Route assignment (CP when committing)

Healthcare: A Critical Case Study

Healthcare illustrates the stakes of CP/AP decisions:

Medication allergies → Strong CP: If a patient's allergy record is inconsistent, they might receive a medication that causes anaphylaxis. Unavailability ("Cannot access allergy data") is far safer than inconsistency. Systems should block prescription decisions if allergy data is uncertain.

Appointment scheduling → Balanced: Double-booking an appointment is bad but not dangerous. Being unable to schedule appointments during an outage is operationally problematic. Most systems lean AP with conflict resolution (call patient if double-booked).

Radiology results → CP for diagnosis, AP for viewing: Oncologists making treatment decisions need the authoritative, consistent result. But for a patient to view their own results? Slightly stale is probably fine, and availability matters.

Domain Knowledge Is Critical

These are guidelines, not rules. A healthcare startup might have different trade-offs than a hospital. An upstart e-commerce site might prioritize availability over inventory accuracy to maximize sales. Understand your specific context, user expectations, and regulatory environment.

Hybrid Strategies: When Neither Pure CP nor AP Fits

Often, neither a pure CP nor pure AP strategy is optimal. Hybrid approaches provide the best of both worlds—with added complexity.

Hybrid Strategy 1: Tiered Consistency

Different tiers of data get different treatment:

┌─────────────────────────────────────────────────────────────────┐
│  TIER 1 (Critical): CP                                          │
│  - User authentication tokens                                   │
│  - Payment authorizations                                       │
│  - Inventory commitments                                        │
│  → Use: Consensus-based system (etcd, PostgreSQL sync rep)      │
│                                                                 │
│  TIER 2 (Important): CP with degraded reads                     │
│  - Order history                                                │
│  - Account settings                                             │
│  - Notification preferences                                     │
│  → Use: Read replicas with staleness bounds                     │
│                                                                 │
│  TIER 3 (Eventual OK): AP                                       │
│  - Session data                                                 │
│  - Activity logs                                                │
│  - Analytics events                                             │
│  → Use: Async replication, eventual consistency                 │
└─────────────────────────────────────────────────────────────────┘

Hybrid Strategy 2: Write CP, Read AP

Writes go through a strongly consistent path; reads can use eventually consistent replicas:

All writes go to a CP primary (ensures correctness)
Reads can hit AP replicas (fast, available)
Critical reads (after important writes) go to primary
Non-critical reads accept potential staleness
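The routing rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production router: `ReadWriteRouter` is a hypothetical name, plain dictionaries stand in for the primary and the replica, and "critical reads after important writes" is approximated by pinning a key to the primary for `pin_seconds` after it was last written.

```python
import time


class ReadWriteRouter:
    """Sends writes to the primary; sends reads to a replica unless freshness matters."""

    def __init__(self, primary: dict, replica: dict, pin_seconds: float = 5.0):
        self.primary = primary          # stand-in for the CP primary
        self.replica = replica          # stand-in for an eventually consistent replica
        self.pin_seconds = pin_seconds  # assumed bound on replication lag
        self._last_write = {}           # key -> time of the most recent write

    def write(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self.primary[key] = value       # all writes take the consistent path
        self._last_write[key] = now

    def read(self, key, critical=False, now=None):
        now = time.monotonic() if now is None else now
        last = self._last_write.get(key, float("-inf"))
        recently_written = (now - last) < self.pin_seconds
        # Critical or just-written keys are read from the primary;
        # everything else accepts potential staleness.
        source = self.primary if (critical or recently_written) else self.replica
        return source.get(key)


primary, replica = {}, {"profile": "old"}   # replica has not caught up yet
router = ReadWriteRouter(primary, replica)
router.write("profile", "new", now=100.0)
print(router.read("profile", now=101.0))                 # new (pinned to primary)
print(router.read("profile", now=200.0))                 # old (stale replica read)
print(router.read("profile", critical=True, now=200.0))  # new (forced to primary)
```

The pin window only helps if it exceeds real replication lag, which is why replicas with explicit staleness bounds pair well with this pattern.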

Hybrid Approach Patterns

•Graceful degradation: CP during normal operation, fall back to AP (stale reads) during severe partitions
•Optimistic locking with AP storage: Accept writes optimistically, detect conflicts asynchronously, notify users of resolutions
•Reservation pattern: Use AP for 'soft' reservations, CP for 'hard' commitments
•Eventual materialized views: CP source of truth, AP denormalized views for queries
•Local-first with sync: Clients work offline (AP locally), sync through CP backend when connected
•Feature flags by consistency: Disable non-critical features during partitions to preserve consistency budget
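To make the first pattern in this list concrete, here is a minimal sketch of graceful degradation in Python. All names (`PrimaryUnavailable`, `read_with_fallback`) are hypothetical, and the cache is a plain dictionary. The key design point is that a degraded read is explicitly flagged as stale so callers can decide whether to show it.

```python
class PrimaryUnavailable(Exception):
    """Raised by the (hypothetical) primary client when it cannot be reached."""


def read_with_fallback(key, read_primary, cache):
    """Return (value, is_stale). Prefer the primary; fall back to the last known value."""
    try:
        value = read_primary(key)
    except PrimaryUnavailable:
        if key in cache:
            return cache[key], True   # degraded mode: possibly stale
        raise                         # nothing cached: stay unavailable
    cache[key] = value                # remember the last consistent read
    return value, False


def healthy_primary(key):
    return {"settings": "v1"}[key]


def partitioned_primary(key):
    raise PrimaryUnavailable(key)


cache = {}
print(read_with_fallback("settings", healthy_primary, cache))      # ('v1', False)
print(read_with_fallback("settings", partitioned_primary, cache))  # ('v1', True)
```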

Hybrid Strategy 3: The Reservation Pattern

This elegant pattern provides CP guarantees with AP-like availability:

Phase 1: Soft Reservation (AP)

User adds item to cart
System "soft reserves" inventory (eventual, optimistic)
Multiple users might soft-reserve the same item
Available, fast, no consistency guarantee

Phase 2: Hard Commitment (CP)

User proceeds to checkout
System attempts "hard reservation" with CP storage
If inventory is actually available: succeed
If conflicting (someone else committed first): fail gracefully
Consistent, may fail, but failure is handled

Result: High availability for browsing/carting (most users), consistency only required at commitment (fewer users, can tolerate occasional failure).
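The two phases can be sketched as follows. This is a single-process Python illustration, assuming a `threading.Lock` as a stand-in for the CP store (in a real deployment the hard commit would be a conditional write or transaction against a consistent database). The class and method names are hypothetical.

```python
import threading


class Inventory:
    def __init__(self, stock: int):
        self._stock = stock
        self._lock = threading.Lock()  # stand-in for a consistent (CP) write path
        self.soft_holds = set()        # advisory only; may exceed actual stock

    def soft_reserve(self, user: str) -> bool:
        """Phase 1 (AP): always succeeds, gives no guarantee."""
        self.soft_holds.add(user)
        return True

    def hard_commit(self, user: str) -> bool:
        """Phase 2 (CP): succeeds only if stock is really available."""
        with self._lock:
            if self._stock <= 0:
                return False           # someone else committed first
            self._stock -= 1
            self.soft_holds.discard(user)
            return True


inventory = Inventory(stock=1)
inventory.soft_reserve("alice")
inventory.soft_reserve("bob")          # both soft-reserve the last item
print(inventory.hard_commit("alice"))  # True
print(inventory.hard_commit("bob"))    # False: handled as a graceful failure
```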

Hybrid Strategy 4: Local-First with Server Reconciliation

Emerging pattern for mobile and collaborative applications:

Client writes to local storage (instant, always available)
Background sync attempts to replicate to server
Server uses CP logic to detect and resolve conflicts
Conflicts are resolved and synced back to client
User sees eventual consistency but always has a usable app
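A minimal sketch of this flow, assuming version-based conflict detection and a "server copy wins" resolution policy (both are assumptions chosen for brevity; real applications often merge instead). `Server` and `Client` are hypothetical in-memory stand-ins.

```python
class Server:
    """Authoritative store: accepts a write only if it was based on the current version."""

    def __init__(self):
        self.docs = {}  # key -> (version, value)

    def push(self, key, base_version, value):
        current_version, current_value = self.docs.get(key, (0, None))
        if base_version != current_version:
            return "conflict", current_version, current_value
        self.docs[key] = (current_version + 1, value)
        return "ok", current_version + 1, value


class Client:
    def __init__(self):
        self.local = {}    # key -> (base_version, value); always writable
        self.pending = []  # keys with unsynced local changes

    def write(self, key, value):
        base_version, _ = self.local.get(key, (0, None))
        self.local[key] = (base_version, value)  # instant, works offline
        if key not in self.pending:
            self.pending.append(key)

    def sync(self, server):
        """Push pending writes; on conflict adopt the server copy. Returns conflicted keys."""
        conflicts = []
        for key in list(self.pending):
            base_version, value = self.local[key]
            status, version, server_value = server.push(key, base_version, value)
            if status == "conflict":
                conflicts.append(key)
            self.local[key] = (version, server_value)
            self.pending.remove(key)
        return conflicts


server, phone, laptop = Server(), Client(), Client()
phone.write("note", "from phone")    # both devices edit while offline
laptop.write("note", "from laptop")
print(phone.sync(server))            # []
print(laptop.sync(server))           # ['note']
print(laptop.local["note"])          # (1, 'from phone')
```

The returned list of conflicted keys is where an application would notify the user or attempt a merge rather than silently discarding the local edit.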

Hybrids Add Complexity

Every hybrid strategy adds operational and cognitive complexity. You need to understand which operations are in which tier. Testing is more complex. Failure modes are more numerous. Only use hybrids when the value justifies the complexity. Simple systems with clear trade-offs are often better than clever systems with hidden failure modes.

Case Studies: CP/AP Decisions in the Wild

Let's examine how real systems have made and justified their CP/AP decisions.

Case Study 1: Amazon Shopping Cart (AP)

Amazon famously chose AP for shopping carts (described in the Dynamo paper):

Context: Users add items to carts; carts maintain state across sessions and devices.

Decision: AP (eventually consistent, vector clocks for conflict detection)

Reasoning:

Cost of Unavailability: Very high. Cart unavailability = lost sales.
Cost of Inconsistency: Low-medium. As the Dynamo paper describes, merging divergent carts never loses an added item, but a deleted item can resurface; the user simply removes it again.
Conflict Resolution: Easy. Cart merging is straightforward (union of items).
Detection/Recovery: Easy. User sees and corrects cart contents.

Outcome: Carts stay readable and writable through node failures and network partitions. An occasional resurfaced or duplicated item is a minor annoyance that users can fix. Sales are preserved.
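The "union of items" merge can be illustrated with a short Python sketch. The cart representation here (item name mapped to quantity, keeping the larger quantity when both sides have the item) is a simplifying assumption for illustration and is not taken from the Dynamo paper.

```python
def merge_carts(a: dict, b: dict) -> dict:
    """Union of two divergent cart versions; on overlap keep the larger quantity."""
    merged = dict(a)
    for item, quantity in b.items():
        merged[item] = max(merged.get(item, 0), quantity)
    return merged


# Two replicas diverged during a partition:
replica_a = {"book": 1, "lamp": 1}  # never saw the later changes
replica_b = {"book": 1, "pen": 2}   # user removed the lamp and added pens here

print(merge_carts(replica_a, replica_b))
# {'book': 1, 'lamp': 1, 'pen': 2}
```

No addition is lost, but the lamp that was removed on one side comes back after the merge. This is the kind of anomaly the user notices and corrects, which is why the cost of inconsistency is rated low for carts.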

Case Study 2: Google Spanner (CP)

Google designed Spanner for global strong consistency:

Context: Finance-grade transactional system for AdWords, Play Store, etc.

Decision: CP (synchronous Paxos replication, TrueTime for ordering)

Reasoning:

Cost of Inconsistency: Extreme. Double-charging advertisers, incorrect royalty payments = lawsuits.
Cost of Unavailability: High but manageable. Can tolerate brief write unavailability.
Conflict Resolution: Impossible. Financial transactions cannot be merged.
Detection/Recovery: Hard. Detecting incorrect financial state is complex.

Outcome: Global strong consistency with 99.999% availability. Achieves both through massive engineering investment (GPS/atomic clock time sync, multi-region Paxos). Latency cost is accepted.

Case Study 3: Social Media Like Counts (AP)

CONTEXT: Post like counts across global infrastructure
 
REQUIREMENTS:
- Billions of likes per day
- Must always be able to like a post
- Exact count is less important than directional accuracy
- Users expect sub-second response
 
DECISION: AP with eventual consistency
 
ARCHITECTURE:
┌─────────────────────────────────────────────────────────────────┐
│  User clicks "Like"                                             │
│       ↓                                                         │
│  Request goes to nearest datacenter                             │
│       ↓                                                         │
│  Like is written to local shard (fast, always succeeds)         │
│       ↓                                                         │
│  Background async: Replicate to other DCs                       │
│       ↓                                                         │
│  Periodic: Count aggregation across shards                      │
└─────────────────────────────────────────────────────────────────┘
 
DURING PARTITION:
- Likes in DC1 not visible in DC2 immediately
- User who liked in DC1 sees their like
- User in DC2 sees count without DC1 likes
- When partition heals: counts converge
 
WHY IT WORKS:
- Cost of inconsistency (wrong like count): Minimal
- Cost of unavailability (can't like): User frustration, engagement loss
- Detection: Easy (counts eventually converge)
- User expectation: "Likes are fast and always work"
 
TRADE-OFF ACCEPTED:
- Same post might show different like counts in different regions
- Within seconds to minutes, all views converge
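One well-known data structure with exactly this convergence behavior is a grow-only counter (G-Counter CRDT): each datacenter increments only its own slot, and merging takes the per-slot maximum. The Python sketch below is illustrative only. It is not a claim about how any particular platform implements likes, and it handles increments only (supporting "unlike" would need a second counter for decrements).

```python
class GCounter:
    """Grow-only counter: one slot per datacenter, merged by per-slot maximum."""

    def __init__(self, dc_id: str):
        self.dc_id = dc_id
        self.counts = {}  # dc_id -> likes accepted by that datacenter

    def increment(self, n: int = 1) -> None:
        # A datacenter only ever writes to its own slot, so local writes never conflict.
        self.counts[self.dc_id] = self.counts.get(self.dc_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Max per slot is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for dc, count in other.counts.items():
            self.counts[dc] = max(self.counts.get(dc, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


dc1, dc2 = GCounter("dc1"), GCounter("dc2")
dc1.increment(3)                 # likes accepted in DC1 during the partition
dc2.increment(2)                 # likes accepted in DC2 during the partition
print(dc1.value(), dc2.value())  # 3 2  (regions disagree)

dc1.merge(dc2)                   # partition heals, state is exchanged
dc2.merge(dc1)
print(dc1.value(), dc2.value())  # 5 5  (counts converge)
```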

Case Study 4: Stripe Payments (CP with Idempotency)

Context: Payment processing across global merchant base.

Decision: CP for actual charges, with idempotency keys to handle retries.

Reasoning:

Double-charging a customer: Unacceptable
Losing a charge: Unacceptable
Brief unavailability: Acceptable (merchants will retry)

Solution:

CP storage for payment records
Idempotency keys ensure retries don't double-charge
Clear error responses enable intelligent retry logic
During partitions: return errors, not silent failures

Developer Experience:

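import stripe

stripe.api_key = "sk_test_..."  # placeholder: your secret test key

stripe.Charge.create(
  amount=1000,                    # smallest currency unit: 1000 = $10.00
  currency='usd',
  source='tok_visa',              # Stripe test token; a charge needs a payment source
  idempotency_key='order-12345',  # retries with this key reuse the first result
)

If the request times out, the developer can safely retry with the same idempotency key. Stripe returns the result of the original request instead of creating a second charge, so from the merchant's perspective the charge happens exactly once. Stripe's documentation notes that keys may be pruned once they are at least 24 hours old, so retries should happen within that window.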
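On the server side, the idea behind idempotency keys can be sketched as "store the first response under the key and replay it on retries". The Python below is a toy in-memory illustration with hypothetical names; it is not Stripe's implementation. In a real service the lookup and the insert would have to be one atomic operation against a consistent (CP) store.

```python
class PaymentService:
    def __init__(self):
        self._responses = {}  # idempotency_key -> response from the first attempt
        self.charges = []     # charges actually created

    def create_charge(self, amount: int, currency: str, idempotency_key: str) -> dict:
        # Replay the stored response if this key was already processed.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        charge = {
            "id": f"ch_{len(self.charges) + 1}",
            "amount": amount,
            "currency": currency,
        }
        self.charges.append(charge)
        self._responses[idempotency_key] = charge
        return charge


service = PaymentService()
first = service.create_charge(1000, "usd", idempotency_key="order-12345")
retry = service.create_charge(1000, "usd", idempotency_key="order-12345")  # e.g. after a timeout
print(first == retry, len(service.charges))  # True 1
```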

Learn from the Giants

These companies have made their decisions through years of production experience and billions of dollars of engineering. Study their published papers and blog posts. Many are available: Amazon's Dynamo paper, Google's Spanner paper, Facebook/Meta's TAO paper. These are treasure troves of practical distributed systems wisdom.

Communicating Your CAP Decision

Whether in a design review, interview, or architecture document, you need to communicate CAP decisions clearly and persuasively.

The CAP Decision Communication Template:

1. State the decision clearly: "For [component], we've chosen a [CP/AP] strategy using [technology]."

2. Explain the primary driver: "The primary driver is [cost of inconsistency / cost of unavailability] because [business reason]."

3. Acknowledge the trade-off: "This means that during partitions, [what will happen]."

4. Describe the mitigation: "We mitigate this by [specific strategy]."

5. Justify with concrete scenarios: "For example, if [partition scenario], then [system behavior], which is preferable to [alternative]."

Example: Presenting an AP Decision

"For our session store, we've chosen an AP strategy using Redis Cluster.

The primary driver is availability because users who can't log in can't use any feature of our platform—it's a total blocker.

This means during partitions, users might briefly see outdated session data or preferences.

We mitigate this with sticky sessions to ensure users read from the node that received their writes, providing read-your-writes consistency.

For example, if our East and West datacenters are partitioned, a user on the East coast who updates their preferences will see those updates immediately. A user who later connects to West might see the old preferences until the partition heals—but they can still use the system."

Common Pushback and Responses

•'Why not use a strongly consistent database for everything?' → 'CP has latency and availability costs. For high-volume, tolerant data, we'd be paying those costs for no benefit. Our scoring shows inconsistency cost is low here.'
•'What if data diverges during a partition?' → 'Our conflict resolution strategy is [X]. We've tested it with simulated partitions. In the worst case, [specific outcome], which is preferable to unavailability.'
•'Isn't eventual consistency dangerous?' → 'For critical data like [X], we use CP. For this data, we've analyzed the inconsistency window and consequences. It's low-risk and recoverable.'
•'How do you know partitions will happen?' → 'Studies from Google, Amazon, and academia show partitions are inevitable in production. We've also seen [specific internal incidents]. Designing for partitions is prudent engineering.'
•'Can't we use a hybrid approach?' → 'We considered it, but the added complexity isn't justified here. [For other component], we do use a hybrid because [reason].'

In System Design Interviews:

Interviewers expect you to discuss CAP trade-offs. Use this structure:

Identify data types - "We have user data, transaction data, and analytics data."
Associate requirements - "Transactions need strong consistency; analytics can be eventual."
Make explicit choices - "For the order database, I'll use PostgreSQL with synchronous replication."
Explain partition behavior - "During partitions, writes will block or fail, but we'll never have inconsistent orders."
Acknowledge trade-offs - "This means we might see errors during network issues, but the UX for that is a retry message."

This demonstrates both theoretical understanding and practical application—exactly what interviewers want to see.

Confidence Through Preparation

The ability to confidently discuss CAP trade-offs comes from understanding the criteria, having examples ready, and having thought through objections. Prepare before design reviews. Run through likely questions. Your confidence signals competence.

Testing Your CAP Strategy

A CAP strategy that isn't tested is a CAP hope. Validate your choices with concrete experiments.

Testing CP Systems:

Partition minority of nodes
- Expect: Minority refuses operations, returns errors
- Verify: No inconsistent data written during partition
Partition during active write
- Expect: Write times out or returns error
- Verify: After partition heals, data is consistent (either write succeeded or didn't)
Rapid leader failover
- Expect: Brief unavailability during election
- Verify: No split-brain, single leader elected
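The behavior these tests look for comes from a majority-quorum rule: a node accepts a write only if it can reach more than half of the cluster, so at most one side of any partition can make progress. A minimal Python sketch, with hypothetical names and a dictionary standing in for replicated state:

```python
class Unavailable(Exception):
    """Raised when this side of a partition cannot reach a majority."""


def has_quorum(reachable: int, cluster_size: int) -> bool:
    # Strict majority: two disjoint sides can never both satisfy this.
    return reachable > cluster_size // 2


def cp_write(store: dict, key, value, reachable: int, cluster_size: int) -> None:
    """Apply the write only when a majority of nodes is reachable."""
    if not has_quorum(reachable, cluster_size):
        raise Unavailable(f"only {reachable}/{cluster_size} nodes reachable")
    store[key] = value


# A 5-node cluster split 3/2:
majority_side, minority_side = {}, {}
cp_write(majority_side, "leader", "node-1", reachable=3, cluster_size=5)
try:
    cp_write(minority_side, "leader", "node-4", reachable=2, cluster_size=5)
except Unavailable as exc:
    print("minority refused:", exc)

print(majority_side, minority_side)  # {'leader': 'node-1'} {}
```

A partition test passes when the minority side's state is unchanged after the attempt, which is what the empty dictionary shows here.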

Testing AP Systems:

Partition and write to both sides
- Expect: Both writes succeed
- Verify: After healing, conflict is detected and resolved correctly
Read from partitioned replica
- Expect: Reads succeed but may return stale data
- Verify: Eventually, reads converge to correct value
Concurrent conflicting writes
- Expect: Both succeed during operation
- Verify: Resolution strategy (LWW, merge, etc.) produces expected result

SCENARIO 1: CP Database Under Partition
═══════════════════════════════════════════════════════════════════
 
Setup:
  - 3-node PostgreSQL with synchronous replication
  - Network partition between primary and both secondaries
 
Test Steps:
  1. Inject partition (iptables, toxiproxy, etc.)
  2. Attempt INSERT from client connected to primary
  
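Expected Result:
  - INSERT blocks waiting for a synchronous standby ack
    (PostgreSQL waits until a standby confirms or the client gives up)
  - If the client cancels or times out, the transaction has still
    committed locally on the primary; the outcome is ambiguous, not failed
  
Verification:
  - Client treats the timed-out INSERT as "outcome unknown" and re-checks
  - After the partition heals, all three nodes converge on the same data
    (the row may be present on every node)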
 
SCENARIO 2: AP Database Conflict Resolution
═══════════════════════════════════════════════════════════════════
 
Setup:
  - 3-node Cassandra cluster
  - RF=3, ONE consistency level
 
Test Steps:
  1. Partition node-3 from nodes 1-2
  2. Write key=X, value="from-partition-A" to node-1
  3. Write key=X, value="from-partition-B" to node-3
     (using a different client connected to node-3; with RF=3 on
      three nodes, every node holds a replica of key X)
  4. Heal partition
  5. Read key=X with ALL consistency
 
Expected Result:
  - Both writes succeed during partition
  - After healing: value is "from-partition-B" if its timestamp is higher
  - LWW applied correctly
 
Verification:
  - Read from all nodes shows same final value
  - Conflict was silently resolved (no error)
  - Audit log shows both writes (if logging enabled)
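Scenario 2 can be rehearsed without a cluster by simulating last-write-wins (LWW) in memory, which is useful for checking the expected outcome before running the real test. The Python sketch below is a simplified model, not Cassandra's implementation: `Replica` and `heal` are hypothetical names, timestamps are plain integers, and timestamp ties are ignored.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass
class Replica:
    value: Optional[str] = None
    timestamp: int = 0

    def apply(self, value: Optional[str], timestamp: int) -> None:
        # Last-write-wins: keep whichever write carries the higher timestamp.
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp


def heal(replicas) -> None:
    """Exchange state between every ordered pair of replicas after the partition ends."""
    for target, source in permutations(replicas, 2):
        target.apply(source.value, source.timestamp)


node1, node2, node3 = Replica(), Replica(), Replica()

# Partition: {node1, node2} on one side, {node3} on the other.
for node in (node1, node2):
    node.apply("from-partition-A", timestamp=100)
node3.apply("from-partition-B", timestamp=200)

heal([node1, node2, node3])
print([node.value for node in (node1, node2, node3)])
# ['from-partition-B', 'from-partition-B', 'from-partition-B']
```

The write from partition A is discarded without any error, which is the "silently resolved" behavior the verification step calls out.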

Chaos Engineering for CAP Validation:

Integrate CAP testing into your regular chaos engineering practice:

1. Scheduled partition tests: Run automated partition injection weekly/monthly. Compare actual behavior to expected behavior.

2. Game days: Simulate major partition events with full team participation. Practice incident response.

3. Continuous chaos: Tools like LitmusChaos and Gremlin can inject network faults continuously in staging/production. Chaos Monkey covers instance termination rather than network partitions, so it complements these tools but does not replace them.

4. Jepsen testing: For custom distributed systems, invest in Jepsen-style testing that verifies linearizability or your expected consistency model.

Metrics to Capture:

Error rate during injected partitions
Latency percentiles during partitions
Conflict rate in AP systems
Recovery time after partition heals
Data consistency verification results
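The first two metrics can be computed directly from request samples recorded while a partition is injected. A minimal Python sketch using only the standard library; the sample format (latency in milliseconds plus a success flag) and the function name are assumptions for illustration.

```python
import statistics


def partition_metrics(samples):
    """samples: list of (latency_ms, succeeded) tuples recorded during an injected partition."""
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, succeeded in samples if not succeeded)
    # quantiles(n=100) returns 99 cut points; index 49 approximates p50, index 98 p99.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "error_rate": errors / len(samples),
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
    }


# Illustrative data: 90 fast successes, 10 slow failures (timeouts).
samples = [(10.0, True)] * 90 + [(500.0, False)] * 10
metrics = partition_metrics(samples)
print(metrics["error_rate"])  # 0.1
```

Comparing these numbers against a baseline run without the partition makes the cost of the chosen strategy visible: a CP system should show a higher error rate, while an AP system should show errors near zero and a non-zero conflict rate instead.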

Production Is the Only Real Test

Staging environments rarely replicate production network topology, scale, or traffic patterns. Test in staging to catch obvious bugs, but also test in production (carefully, with safeguards). A system that passes all staging tests but fails its first production partition hasn't been tested.

Summary: Making the CP/AP Decision

We've provided a comprehensive guide to making CP vs AP decisions in real systems. Let's consolidate the essential framework.

Key Takeaways

•The core question is: wrong data or no response? Evaluate the cost of inconsistency vs the cost of unavailability for your specific data and use case.
•Use the four criteria: Cost of inconsistency, cost of unavailability, conflict resolution difficulty, and detection/recovery capability. Score each to guide your decision.
•Make decisions per-component, not per-system: Different data within the same system often needs different CAP strategies.
•Domain knowledge matters: Financial data almost always needs CP. Social feeds almost always work with AP. Know your domain's established patterns.
•Hybrid strategies exist for complex cases: Tiered consistency, write-CP/read-AP, reservation patterns, and local-first architectures can provide nuanced trade-offs.
•Document your decisions: Use ADRs or similar formats. Future you and future team members need to understand why choices were made.
•Communicate with the template: State decision → Primary driver → Trade-off → Mitigation → Concrete scenario.
•Test your assumptions: Inject partitions, verify behavior matches design. Untested CAP strategies are just hopes.

The CAP Theorem Mastery Checkpoint:

You've now completed the CAP theorem module. You should be able to:

✅ Define consistency, availability, and partition tolerance precisely
✅ Explain why partition tolerance is effectively mandatory
✅ Describe how CP and AP systems behave during partitions
✅ Apply a decision framework to real-world data requirements
✅ Recognize domain-specific patterns and established practices
✅ Design hybrid strategies for complex requirements
✅ Communicate and defend CAP decisions to stakeholders
✅ Test and validate CAP strategies in production

This knowledge forms the foundation for understanding all distributed system design. Every scalability strategy, every database choice, every caching decision connects back to these fundamentals.

Module Complete

Congratulations! You've mastered the CAP theorem—from formal definitions through practical application. This knowledge is foundational to distributed systems design and will inform every architectural decision you make. The next module on PACELC will extend these concepts to include latency trade-offs during normal operation.

5 / 5

Loading learning content...

System Design (HLD)CAP Theorem

CAP Theorem: Understanding Distributed System Trade-offs

LevelIntermediate

Duration90 mins

TopicCAP Theorem

5 / 5

Choosing Between CP and AP — The Final Decision Guide

The Decision That Shapes Everything

Wrong CP: Users can't complete critical actions during network issues that happen regularly
Wrong AP: Silent data corruption, lost transactions, inconsistent state that's never detected

This page gives you the specific tools and frameworks to make this decision correctly.

What You Will Learn

The Core Decision Criteria

The CP vs AP decision ultimately reduces to one fundamental question:

What is the worse outcome: showing incorrect data or showing nothing at all?

This question has different answers in different contexts. Let's formalize the criteria:

Criterion 1: Cost of Inconsistency

What happens if your system shows stale or conflicting data?

Low Cost	High Cost
User sees yesterday's analytics	User sees wrong account balance
Social feed is slightly stale	Two users book same seat
Cached product description is outdated	Inventory goes negative
Session preference not propagated	Two services have different config

If the cost of inconsistency is High → Lean toward CP

Criterion 2: Cost of Unavailability

What happens if users can't complete their action?

Low Cost	High Cost
Can't update profile right now	Can't place emergency order
Report generation delayed	Can't log in at all
Export fails, will retry	Checkout abandonment during sale
Admin action deferred	IoT sensor can't report critical alert

If the cost of unavailability is High → Lean toward AP

Criterion 3: Conflict Resolution Complexity

If you choose AP and conflicts occur, how hard are they to resolve?

Easy to Resolve	Hard to Resolve
Counters (sum them)	Unique constraints (who wins?)
Add-only sets (union)	Overwrites where both matter
LWW where order doesn't matter	Ordered sequences (whose order?)
Idempotent operations	Stateful workflows

If conflicts are hard to resolve → Lean toward CP

Criterion 4: Detection and Recovery Capability

Can you detect and recover from inconsistency after the fact?

Detectable & Recoverable	Silent & Permanent
Audit logs enable reconciliation	Data just overwrites
Compensating transactions possible	Actions are irreversible
Business process allows fixes	Real-world effects immediate
Users can surface issues	Users don't know about problem

If inconsistency is undetectable or unrecoverable → CP is mandatory

The Ultimate Test

The CP/AP Decision Framework

Use this systematic framework to make and document your CP/AP decisions:

Step 1: Define the Data in Question

Be specific. Don't say "the database." Identify:

What entities are we storing?
What operations are performed?
What are the access patterns?
What are the consistency requirements per operation?

Step 2: Enumerate the Failure Scenarios

List what can go wrong during partitions:

Which nodes might be partitioned?
What operations could be in flight?
What state might diverge?
How long might partitions last?

Step 3: Evaluate Each Criterion (1-5 scale)

Step 4: Apply the Decision Matrix

DECISION MATRIX: Score each factor 1-5 and sum
 
┌──────────────────────────────────────────────────────────────────┐
│  FACTOR                              │ CP Points │ AP Points     │
├──────────────────────────────────────────────────────────────────┤
│  Cost of Inconsistency               │  +Score   │    -         │
│  Cost of Unavailability              │    -      │  +Score      │
│  Conflict Resolution Difficulty      │  +Score   │    -         │
│  Detection/Recovery Difficulty       │  +Score   │    -         │
├──────────────────────────────────────────────────────────────────┤
│  TOTAL                               │  Sum(CP)  │  Sum(AP)     │
└──────────────────────────────────────────────────────────────────┘
 
INTERPRETATION:
- CP Points > AP Points by 3+: Strong CP
- CP Points > AP Points by 1-2: CP with AP fallback modes
- Roughly equal: Hybrid approach, analyze per-operation
- AP Points > CP Points by 1-2: AP with CP for critical paths
- AP Points > CP Points by 3+: Strong AP
 
EXAMPLE: User Session Store
┌───────────────────────────────────────────────────────────────┐
│  Cost of Inconsistency: 2                                     │
│  (User sees old prefs - mildly annoying)                      │
│                                                               │
│  Cost of Unavailability: 5                                    │
│  (User can't log in - blocks everything)                      │
│                                                               │
│  Conflict Resolution: 1                                       │
│  (LWW works fine for sessions)                                │
│                                                               │
│  Detection/Recovery: 2                                        │
│  (User will notice and fix preferences)                       │
├───────────────────────────────────────────────────────────────┤
│  CP Points: 2 + 1 + 2 = 5                                     │
│  AP Points: 5                                                 │
│                                                               │
│  RESULT: Roughly equal, but availability critical             │
│  DECISION: AP with sticky sessions for read-your-writes       │
└───────────────────────────────────────────────────────────────┘
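
The matrix above is mechanical enough to encode, which makes scores easy to keep next to an ADR. This is a minimal sketch under two assumptions: the names (`CapScores`, `interpret`) are hypothetical, and "roughly equal" is taken to mean a difference of exactly zero.

```python
from dataclasses import dataclass

@dataclass
class CapScores:
    """Scores for one component, each on the 1-5 scale (hypothetical type)."""
    inconsistency: int
    unavailability: int
    conflict_resolution: int
    detection_recovery: int

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

def cp_points(s: CapScores) -> int:
    # Three factors contribute to CP, as in the matrix.
    return s.inconsistency + s.conflict_resolution + s.detection_recovery

def ap_points(s: CapScores) -> int:
    # Only the cost of unavailability contributes to AP.
    return s.unavailability

def interpret(s: CapScores) -> str:
    diff = cp_points(s) - ap_points(s)
    if diff >= 3:
        return "Strong CP"
    if diff >= 1:
        return "CP with AP fallback modes"
    if diff == 0:  # assumption: "roughly equal" means a tie
        return "Hybrid approach, analyze per-operation"
    if diff >= -2:
        return "AP with CP for critical paths"
    return "Strong AP"

session_store = CapScores(inconsistency=2, unavailability=5,
                          conflict_resolution=1, detection_recovery=2)
verdict = interpret(session_store)
```

For the session store this yields a tie (5 vs 5), matching the worked example. The final call toward AP there comes from the judgment that availability scored the maximum, not from the arithmetic; the score is an input to the decision, not a substitute for it.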

Step 5: Document the Decision

Create an Architecture Decision Record (ADR):

# ADR-042: Session Store CAP Strategy

## Status: Accepted

## Context
Our session store handles user authentication state across 
multiple data centers. Users expect to remain logged in 
regardless of network conditions.

## Decision
We will use an AP strategy (Redis Cluster) for session storage.

## Scoring
- Cost of Inconsistency: 2/5
- Cost of Unavailability: 5/5
- Conflict Resolution: 1/5
- Detection/Recovery: 2/5

## Consequences
- Positive: High availability during partitions
- Positive: Low latency for session lookups
- Negative: Rare cases of seeing old session data
- Mitigation: Sticky sessions for read-your-writes guarantee

This documentation ensures your decision is understood, defensible, and reviewable.

Decisions Should Be Per-Component

Domain-Specific Guidance

Different domains have established patterns based on their unique requirements. Use these as starting points, not absolute rules.

Financial Services:

Component	Recommendation	Rationale
Account balances	CP	Incorrect balance = regulatory violation
Transaction history	CP	Must be accurate for auditing
Exchange rates	AP	Slightly stale rate is acceptable
Fraud alerts	AP	Better to false-alarm than miss
User preferences	AP	Low stakes, high volume

E-Commerce:

Component	Recommendation	Rationale
Order placement	CP	Must not lose/duplicate orders
Inventory (reserve)	CP	Overselling has real costs
Product catalog	AP	Stale prices/descriptions OK briefly
Shopping cart	AP	Must always be available to add items
Session/auth	AP	Users must be able to browse
Search results	AP	Stale product list is acceptable
Reviews/ratings	AP	Eventual consistency is fine

Industry-Specific CAP Guidance
Industry	Typical CP Data	Typical AP Data	Hybrid Areas
Healthcare	Patient records, prescriptions, allergies	Appointment scheduling, wait times	Billing (CP for charges, AP for estimates)
Gaming	Player inventory, currency, rankings	Chat, social features, analytics	Matchmaking (CP to finalize, AP to search)
Social Media	Authentication, privacy settings	Feeds, content, reactions, DMs	Follower counts (AP but reconciled)
IoT/Telemetry	Device config, firmware versions	Sensor readings, metrics, logs	Alert thresholds (depends on criticality)
Travel/Booking	Reservations, payments, tickets	Search, pricing, availability display	Inventory (CP to book, AP to display)
Logistics	Shipping records, tracking updates	Fleet location, capacity estimates	Route assignment (CP when committing)

Healthcare: A Critical Case Study

Healthcare illustrates the stakes of CP/AP decisions:

Domain Knowledge Is Critical

Hybrid Strategies: When Neither Pure CP nor AP Fits

Often, neither a pure CP nor pure AP strategy is optimal. Hybrid approaches apply each strategy where it fits best, at the cost of added complexity.

Hybrid Strategy 1: Tiered Consistency

Different tiers of data get different treatment:

┌─────────────────────────────────────────────────────────────────┐
│  TIER 1 (Critical): CP                                          │
│  - User authentication tokens                                   │
│  - Payment authorizations                                       │
│  - Inventory commitments                                        │
│  → Use: Consensus-based system (etcd, PostgreSQL sync rep)      │
│                                                                 │
│  TIER 2 (Important): CP with degraded reads                     │
│  - Order history                                                │
│  - Account settings                                             │
│  - Notification preferences                                     │
│  → Use: Read replicas with staleness bounds                     │
│                                                                 │
│  TIER 3 (Eventual OK): AP                                       │
│  - Session data                                                 │
│  - Activity logs                                                │
│  - Analytics events                                             │
│  → Use: Async replication, eventual consistency                 │
└─────────────────────────────────────────────────────────────────┘

Hybrid Strategy 2: Write CP, Read AP

Writes go through a strongly consistent path; reads can use eventually consistent replicas:

All writes go to a CP primary (ensures correctness)
Reads can hit AP replicas (fast, available)
Critical reads (after important writes) go to primary
Non-critical reads accept potential staleness
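
The routing rule above can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `Node`, `ReadWriteRouter`, and the explicit `replicate()` step are hypothetical stand-ins for a real primary, its read replicas, and asynchronous replication.

```python
from typing import Dict, List, Optional

class Node:
    """Hypothetical in-memory stand-in for a database node."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}

class ReadWriteRouter:
    def __init__(self, primary: Node, replicas: List[Node]) -> None:
        self.primary = primary
        self.replicas = replicas
        self._next = 0  # round-robin cursor over replicas

    def write(self, key: str, value: str) -> None:
        # All writes go to the primary (the strongly consistent path).
        self.primary.data[key] = value

    def read(self, key: str, critical: bool = False) -> Optional[str]:
        # Critical reads bypass replicas so they always see the latest write.
        if critical or not self.replicas:
            return self.primary.data.get(key)
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.data.get(key)  # may be stale

    def replicate(self) -> None:
        # Simulates asynchronous replication catching up.
        for replica in self.replicas:
            replica.data.update(self.primary.data)

router = ReadWriteRouter(Node("primary"), [Node("replica-1"), Node("replica-2")])
router.write("order:1", "paid")
stale_read = router.read("order:1")                    # replica has not caught up
critical_read = router.read("order:1", critical=True)  # served by the primary
router.replicate()
fresh_read = router.read("order:1")
```

The design choice worth noting is that the caller, not the storage layer, decides which reads are critical; that keeps the consistency cost visible at each call site.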

Hybrid Approach Patterns

•Graceful degradation: CP during normal operation, fall back to AP (stale reads) during severe partitions
•Optimistic locking with AP storage: Accept writes optimistically, detect conflicts asynchronously, notify users of resolutions
•Reservation pattern: Use AP for 'soft' reservations, CP for 'hard' commitments
•Eventual materialized views: CP source of truth, AP denormalized views for queries
•Local-first with sync: Clients work offline (AP locally), sync through CP backend when connected
•Feature flags by consistency: Disable non-critical features during partitions to preserve consistency budget

Hybrid Strategy 3: The Reservation Pattern

This pattern keeps CP guarantees at the point of commitment while offering AP-like availability everywhere else:

Phase 1: Soft Reservation (AP)

User adds item to cart
System "soft reserves" inventory (eventual, optimistic)
Multiple users might soft-reserve the same item
Available, fast, no consistency guarantee

Phase 2: Hard Commitment (CP)

User proceeds to checkout
System attempts "hard reservation" with CP storage
If inventory is actually available: succeed
If conflicting (someone else committed first): fail gracefully
Consistent, may fail, but failure is handled

Result: High availability for browsing/carting (most users), consistency only required at commitment (fewer users, can tolerate occasional failure).
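
A compact way to see the two phases side by side is the sketch below. It is an in-memory model with hypothetical names (`Inventory`, `soft_reserve`, `hard_commit`); the lock stands in for whatever atomic conditional update a real CP store would provide.

```python
import threading

class Inventory:
    """Hypothetical model of one item's stock."""
    def __init__(self, stock: int) -> None:
        self._stock = stock
        self._lock = threading.Lock()  # stand-in for a CP store's atomic check-and-set
        self.soft_reservations = 0     # advisory only; may exceed real stock

    def soft_reserve(self) -> bool:
        """Phase 1 (AP-style): always succeeds, no consistency check."""
        self.soft_reservations += 1
        return True

    def hard_commit(self) -> bool:
        """Phase 2 (CP-style): decrement stock only if some remains."""
        with self._lock:
            if self._stock > 0:
                self._stock -= 1
                return True
            return False  # someone else committed first; caller fails gracefully

last_item = Inventory(stock=1)
alice_in_cart = last_item.soft_reserve()
bob_in_cart = last_item.soft_reserve()   # both carts hold the same last item
alice_checkout = last_item.hard_commit()
bob_checkout = last_item.hard_commit()
```

Both users can add the item to their cart, but only the first checkout succeeds; the second gets a clean failure it can show as "out of stock" rather than an oversold order.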

Hybrid Strategy 4: Local-First with Server Reconciliation

Emerging pattern for mobile and collaborative applications:

Client writes to local storage (instant, always available)
Background sync attempts to replicate to server
Server uses CP logic to detect and resolve conflicts
Conflicts are resolved and synced back to client
User sees eventual consistency but always has a usable app
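
The sync loop described above can be sketched with a version check on the server. Everything here is an assumption made for illustration: the `Server` and `Client` classes, the per-key version counter, and the "server wins" policy on conflict. Real local-first systems often use richer merges such as CRDTs.

```python
from typing import Dict, List, Optional, Tuple

class Server:
    """Hypothetical backend: accepts a write only if the client saw the latest version."""
    def __init__(self) -> None:
        self.store: Dict[str, Tuple[int, Optional[str]]] = {}  # key -> (version, value)

    def push(self, key: str, base_version: int, value: str) -> Tuple[bool, int, Optional[str]]:
        version, current = self.store.get(key, (0, None))
        if base_version != version:
            return False, version, current  # conflict: return the server's state
        self.store[key] = (version + 1, value)
        return True, version + 1, value

class Client:
    def __init__(self, server: Server) -> None:
        self.server = server
        self.local: Dict[str, Tuple[int, Optional[str]]] = {}  # key -> (base version, value)
        self.pending: List[str] = []  # keys with unsynced local writes

    def write(self, key: str, value: str) -> None:
        # Local write: instant and always available, even offline.
        base, _ = self.local.get(key, (0, None))
        self.local[key] = (base, value)
        if key not in self.pending:
            self.pending.append(key)

    def sync(self) -> List[str]:
        """Push pending writes; return the keys that hit a conflict."""
        conflicts = []
        for key in list(self.pending):
            base, value = self.local[key]
            ok, version, resolved = self.server.push(key, base, value)
            self.local[key] = (version, resolved)  # adopt the server's resolved state
            if not ok:
                conflicts.append(key)
            self.pending.remove(key)
        return conflicts

server = Server()
a, b = Client(server), Client(server)
a.write("title", "from A")  # both clients edit the same key while offline
b.write("title", "from B")
a_conflicts = a.sync()
b_conflicts = b.sync()
```

Returning the conflicting keys lets the application tell the user that an edit was superseded instead of dropping it silently.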

Hybrids Add Complexity

Case Studies: CP/AP Decisions in the Wild

Let's examine how real systems have made and justified their CP/AP decisions.

Case Study 1: Amazon Shopping Cart (AP)

Amazon famously chose AP for shopping carts (described in the Dynamo paper):

Context: Users add items to carts; carts maintain state across sessions and devices.

Decision: AP (eventually consistent, vector clocks for conflict detection)

Reasoning:

Cost of Unavailability: Very high. Cart unavailability = lost sales.
Cost of Inconsistency: Low-medium. Per the Dynamo paper, an add-to-cart operation is never lost after merging, but deleted items can resurface; the user simply removes them again.
Conflict Resolution: Easy. Cart merging is straightforward (union of items).
Detection/Recovery: Easy. User sees and corrects cart contents.
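
To make "vector clocks for conflict detection" concrete, here is a simplified sketch of comparing two cart versions and merging them when they are concurrent. The function names and the cart representation (a set of item names) are assumptions for illustration; the real system's reconciliation is more involved.

```python
from typing import Dict, Set, Tuple

VectorClock = Dict[str, int]  # node name -> number of updates coordinated by that node

def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if version a has seen every update that version b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: VectorClock, b: VectorClock) -> str:
    a_ge, b_ge = descends(a, b), descends(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    return "concurrent"  # neither saw the other's update: a real conflict

def merge_carts(cart_a: Set[str], clock_a: VectorClock,
                cart_b: Set[str], clock_b: VectorClock) -> Tuple[Set[str], VectorClock]:
    relation = compare(clock_a, clock_b)
    if relation in ("equal", "a_newer"):
        return cart_a, clock_a
    if relation == "b_newer":
        return cart_b, clock_b
    # Concurrent versions: union the items, take the per-node maximum clock.
    merged_clock = {n: max(clock_a.get(n, 0), clock_b.get(n, 0))
                    for n in clock_a.keys() | clock_b.keys()}
    return cart_a | cart_b, merged_clock

# Two replicas accepted different writes on top of the same base version.
cart_y, clock_y = {"book", "pen"}, {"sx": 1, "sy": 1}
cart_z, clock_z = {"book", "lamp"}, {"sx": 1, "sz": 1}
relation = compare(clock_y, clock_z)
merged_cart, merged_clock = merge_carts(cart_y, clock_y, cart_z, clock_z)
```

A union merge cannot tell a removed item from one that was never added, so a removal made on only one side of a partition is undone by the merge.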

Outcome: Carts remain available even during node failures, network partitions, and data center outages. An occasional stale or resurfaced item is a minor annoyance that users can fix. Sales are preserved.

Case Study 2: Google Spanner (CP)

Google designed Spanner for global strong consistency:

Context: Finance-grade transactional system for AdWords, Play Store, etc.

Decision: CP (synchronous Paxos replication, TrueTime for ordering)

Reasoning:

Cost of Inconsistency: Extreme. Double-charging advertisers, incorrect royalty payments = lawsuits.
Cost of Unavailability: High but manageable. Can tolerate brief write unavailability.
Conflict Resolution: Impossible. Financial transactions cannot be merged.
Detection/Recovery: Hard. Detecting incorrect financial state is complex.

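Outcome: Global strong consistency with very high availability (Cloud Spanner's multi-region configurations carry a 99.999% availability SLA). Spanner is still a CP system: during a partition, the side without a Paxos majority stops serving writes. It reaches high availability in practice by making partitions rare through massive engineering investment (GPS/atomic clock time sync, multi-region Paxos, Google's private network). Latency cost is accepted.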

Case Study 3: Global Like Counts (AP)

CONTEXT: Post like counts across global infrastructure
 
REQUIREMENTS:
- Billions of likes per day
- Must always be able to like a post
- Exact count is less important than directional accuracy
- Users expect sub-second response
 
DECISION: AP with eventual consistency
 
ARCHITECTURE:
┌─────────────────────────────────────────────────────────────────┐
│  User clicks "Like"                                             │
│       ↓                                                         │
│  Request goes to nearest datacenter                             │
│       ↓                                                         │
│  Like is written to local shard (fast, always succeeds)         │
│       ↓                                                         │
│  Background async: Replicate to other DCs                       │
│       ↓                                                         │
│  Periodic: Count aggregation across shards                      │
└─────────────────────────────────────────────────────────────────┘
 
DURING PARTITION:
- Likes in DC1 not visible in DC2 immediately
- User who liked in DC1 sees their like
- User in DC2 sees count without DC1 likes
- When partition heals: counts converge
 
WHY IT WORKS:
- Cost of inconsistency (wrong like count): Minimal
- Cost of unavailability (can't like): User frustration, engagement loss
- Detection: Easy (counts eventually converge)
- User expectation: "Likes are fast and always work"
 
TRADE-OFF ACCEPTED:
- Same post might show different like counts in different regions
- Within seconds to minutes, all views converge

Case Study 4: Stripe Payments (CP with Idempotency)

Context: Payment processing across global merchant base.

Decision: CP for actual charges, with idempotency keys to handle retries.

Reasoning:

Double-charging a customer: Unacceptable
Losing a charge: Unacceptable
Brief unavailability: Acceptable (merchants will retry)

Solution:

CP storage for payment records
Idempotency keys ensure retries don't double-charge
Clear error responses enable intelligent retry logic
During partitions: return errors, not silent failures

Developer Experience:

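import stripe

stripe.api_key = "sk_test_..."  # placeholder: your test secret key

stripe.Charge.create(
  amount=1000,                    # in the smallest currency unit (cents)
  currency='usd',
  source='tok_visa',              # Stripe test card token
  idempotency_key='order-12345',  # retries with this key never create a second charge
)

If the request times out, the developer can safely retry with the same idempotency key. Stripe returns the result of the original request instead of creating a second charge, so retrying until a response arrives charges the customer exactly once.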
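
On the server side, an idempotency key amounts to "look up before you act". The sketch below shows the idea with an in-memory table; `PaymentService` and its fields are hypothetical and are not a description of how Stripe implements this.

```python
import threading
from typing import Dict

class PaymentService:
    """Hypothetical service illustrating idempotency-key handling."""
    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # idempotency key -> stored response
        self._lock = threading.Lock()        # stand-in for a CP store's atomicity
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount: int) -> dict:
        with self._lock:
            # A repeated key replays the stored response instead of charging again.
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            self.charges_made += 1
            result = {"charge_id": f"ch_{self.charges_made}", "amount": amount}
            self._results[idempotency_key] = result
            return result

service = PaymentService()
first = service.charge("order-12345", 1000)
retry = service.charge("order-12345", 1000)  # e.g. the client timed out and retried
other = service.charge("order-67890", 2500)
```

The lookup and the charge must happen atomically, which is why the key table belongs in the same strongly consistent store as the payment records.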

Learn from the Giants

Communicating Your CAP Decision

Whether in a design review, interview, or architecture document, you need to communicate CAP decisions clearly and persuasively.

The CAP Decision Communication Template:

1. State the decision clearly: "For [component], we've chosen a [CP/AP] strategy using [technology]."

2. Explain the primary driver: "The primary driver is [cost of inconsistency / cost of unavailability] because [business reason]."

3. Acknowledge the trade-off: "This means that during partitions, [what will happen]."

4. Describe the mitigation: "We mitigate this by [specific strategy]."

5. Justify with concrete scenarios: "For example, if [partition scenario], then [system behavior], which is preferable to [alternative]."

Example: Presenting an AP Decision

"For our session store, we've chosen an AP strategy using Redis Cluster.

The primary driver is availability because users who can't log in can't use any feature of our platform—it's a total blocker.

This means during partitions, users might briefly see outdated session data or preferences.

We mitigate this with sticky sessions to ensure users read from the node that received their writes, providing read-your-writes consistency.

For example, if our East and West datacenters are partitioned, a user on the East coast who updates their preferences will see those updates immediately. A user who later connects to West might see the old preferences until the partition heals—but they can still use the system."

Common Pushback and Responses

•'Why not use a strongly consistent database for everything?' → 'CP has latency and availability costs. For high-volume, tolerant data, we'd be paying those costs for no benefit. Our scoring shows inconsistency cost is low here.'
•'What if data diverges during a partition?' → 'Our conflict resolution strategy is [X]. We've tested it with simulated partitions. In the worst case, [specific outcome], which is preferable to unavailability.'
•'Isn't eventual consistency dangerous?' → 'For critical data like [X], we use CP. For this data, we've analyzed the inconsistency window and consequences. It's low-risk and recoverable.'
•'How do you know partitions will happen?' → 'Studies from Google, Amazon, and academia show partitions are inevitable in production. We've also seen [specific internal incidents]. Designing for partitions is prudent engineering.'
•'Can't we use a hybrid approach?' → 'We considered it, but the added complexity isn't justified here. [For other component], we do use a hybrid because [reason].'

In System Design Interviews:

Interviewers expect you to discuss CAP trade-offs. Use this structure:

Identify data types - "We have user data, transaction data, and analytics data."
Associate requirements - "Transactions need strong consistency; analytics can be eventual."
Make explicit choices - "For the order database, I'll use PostgreSQL with synchronous replication."
Explain partition behavior - "During partitions, writes will block or fail, but we'll never have inconsistent orders."
Acknowledge trade-offs - "This means we might see errors during network issues, but the UX for that is a retry message."

This demonstrates both theoretical understanding and practical application—exactly what interviewers want to see.

Confidence Through Preparation

Testing Your CAP Strategy

A CAP strategy that isn't tested is a CAP hope. Validate your choices with concrete experiments.

Testing CP Systems:

Partition minority of nodes
- Expect: Minority refuses operations, returns errors
- Verify: No inconsistent data written during partition
Partition during active write
- Expect: Write times out or returns error
- Verify: After partition heals, data is consistent (either write succeeded or didn't)
Rapid leader failover
- Expect: Brief unavailability during election
- Verify: No split-brain, single leader elected
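
Before running these checks against real infrastructure, it can help to pin down the expected behavior in a toy model. The sketch below simulates a majority-quorum write rule; `QuorumCluster` and its methods are hypothetical, and `reachable` models the nodes on the client's side of the partition.

```python
from typing import Dict, Iterable, Set

class QuorumCluster:
    """Toy model: a write succeeds only if a majority of nodes is reachable."""
    def __init__(self, node_names: Iterable[str]) -> None:
        self.nodes: Dict[str, Dict[str, str]] = {name: {} for name in node_names}
        self.reachable: Set[str] = set(self.nodes)

    def partition(self, isolated: Iterable[str]) -> None:
        self.reachable -= set(isolated)

    def heal(self) -> None:
        self.reachable = set(self.nodes)

    def write(self, key: str, value: str) -> None:
        majority = len(self.nodes) // 2 + 1
        if len(self.reachable) < majority:
            # CP behavior: refuse rather than accept a write that could diverge.
            raise RuntimeError("no quorum: write rejected")
        for name in self.reachable:
            self.nodes[name][key] = value

cluster = QuorumCluster(["n1", "n2", "n3"])
cluster.partition(["n2", "n3"])  # the client can only reach n1 (the minority)
try:
    cluster.write("balance", "100")
    rejected = False
except RuntimeError:
    rejected = True
untouched = all("balance" not in data for data in cluster.nodes.values())
cluster.heal()
cluster.write("balance", "100")
```

The two assertions that matter mirror the checklist: the minority side returns an error, and no node holds a partial write afterwards.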

Testing AP Systems:

Partition and write to both sides
- Expect: Both writes succeed
- Verify: After healing, conflict is detected and resolved correctly
Read from partitioned replica
- Expect: Reads succeed but may return stale data
- Verify: Eventually, reads converge to correct value
Concurrent conflicting writes
- Expect: Both succeed during operation
- Verify: Resolution strategy (LWW, merge, etc.) produces expected result

SCENARIO 1: CP Database Under Partition
═══════════════════════════════════════════════════════════════════
 
Setup:
  - 3-node PostgreSQL with synchronous replication
  - Network partition between primary and both secondaries
 
Test Steps:
  1. Inject partition (iptables, toxiproxy, etc.)
  2. Attempt INSERT from client connected to primary
  
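Expected Result:
  - INSERT hangs waiting for the synchronous standby ack
  - On client timeout or cancel: client sees an error or warning,
    and the outcome is ambiguous from the client's point of view
  
Verification:
  - PostgreSQL commits locally before waiting for the standby, so the
    row may exist on the primary even though the client saw a failure
  - After the partition heals, all three nodes have consistent data
    (the row is present on all of them or on none)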
 
SCENARIO 2: AP Database Conflict Resolution
═══════════════════════════════════════════════════════════════════
 
Setup:
  - 3-node Cassandra cluster
  - RF=3, ONE consistency level
 
Test Steps:
  1. Partition node-3 from nodes 1-2
  2. Write key=X, value="from-partition-A" to node-1
  3. Write key=X, value="from-partition-B" to node-3
     (using a different client; with RF=3 on 3 nodes, every node holds a replica of X)
  4. Heal partition
  5. Read key=X with ALL consistency
 
Expected Result:
  - Both writes succeed during partition
  - After healing: value is "from-partition-B" if its timestamp is higher
  - LWW applied correctly
 
Verification:
  - Read from all nodes shows same final value
  - Conflict was silently resolved (no error)
  - Audit log shows both writes (if logging enabled)
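
The expected outcome of Scenario 2 can be rehearsed in a small simulation before touching a real cluster. This sketch models last-write-wins replicas with explicit timestamps; `Replica` and `anti_entropy` are hypothetical names, and breaking timestamp ties by comparing values is an assumption made so the result is deterministic.

```python
from typing import Dict, List, Optional, Tuple

class Replica:
    """Toy LWW replica: keeps the (timestamp, value) pair that compares highest."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, Tuple[int, str]] = {}

    def write(self, key: str, value: str, timestamp: int) -> None:
        current = self.data.get(key)
        if current is None or (timestamp, value) > current:
            self.data[key] = (timestamp, value)

    def read(self, key: str) -> Optional[str]:
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas: List[Replica]) -> None:
    """After the partition heals, every replica offers its entries to every other."""
    for src in replicas:
        for dst in replicas:
            if src is not dst:
                for key, (timestamp, value) in list(src.data.items()):
                    dst.write(key, value, timestamp)

n1, n2, n3 = Replica("node-1"), Replica("node-2"), Replica("node-3")

# During the partition: nodes 1-2 form one side, node-3 the other.
n1.write("X", "from-partition-A", timestamp=100)
n2.write("X", "from-partition-A", timestamp=100)  # replicated within side A
n3.write("X", "from-partition-B", timestamp=200)
during = (n1.read("X"), n3.read("X"))

anti_entropy([n1, n2, n3])  # heal
after = [r.read("X") for r in (n1, n2, n3)]
```

The write from side A is discarded without any error, which is exactly the "silently resolved" behavior the verification step asks you to confirm and accept.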

Chaos Engineering for CAP Validation:

Integrate CAP testing into your regular chaos engineering practice:

1. Scheduled partition tests: Run automated partition injection weekly/monthly. Compare actual behavior to expected behavior.

2. Game days: Simulate major partition events with full team participation. Practice incident response.

3. Continuous chaos: Tools like Chaos Mesh, LitmusChaos, and Gremlin can inject network faults continuously in staging/production. (Chaos Monkey terminates instances rather than partitioning the network, so it exercises node failure, not partitions.)

4. Jepsen testing: For custom distributed systems, invest in Jepsen-style testing that verifies linearizability or your expected consistency model.

Metrics to Capture:

Error rate during injected partitions
Latency percentiles during partitions
Conflict rate in AP systems
Recovery time after partition heals
Data consistency verification results

Production Is the Only Real Test

Summary: Making the CP/AP Decision

We've provided a comprehensive guide to making CP vs AP decisions in real systems. Let's consolidate the essential framework.

Key Takeaways

•The core question is: wrong data or no response? Evaluate the cost of inconsistency vs the cost of unavailability for your specific data and use case.
•Use the four criteria: Cost of inconsistency, cost of unavailability, conflict resolution difficulty, and detection/recovery capability. Score each to guide your decision.
•Make decisions per-component, not per-system: Different data within the same system often needs different CAP strategies.
•Domain knowledge matters: Financial data almost always needs CP. Social feeds almost always work with AP. Know your domain's established patterns.
•Hybrid strategies exist for complex cases: Tiered consistency, write-CP/read-AP, reservation patterns, and local-first architectures can provide nuanced trade-offs.
•Document your decisions: Use ADRs or similar formats. Future you and future team members need to understand why choices were made.
•Communicate with the template: State decision → Primary driver → Trade-off → Mitigation → Concrete scenario.
•Test your assumptions: Inject partitions, verify behavior matches design. Untested CAP strategies are just hopes.

The CAP Theorem Mastery Checkpoint:

You've now completed the CAP theorem module. You should be able to:

This knowledge forms the foundation for understanding all distributed system design. Every scalability strategy, every database choice, every caching decision connects back to these fundamentals.
