You've mastered the theory of CAP. You understand consistency, availability, and partition tolerance. You know that during network partitions, you must choose between C and A. But when you're standing at the whiteboard in a design session, the question becomes very concrete: Should this system be CP or AP?
This decision has profound implications. It affects your technology choices, your failure modes, your user experience during outages, and your operational complexity. Choosing incorrectly leaves you with either a system that rejects requests it could safely have served, or a system that serves data it should not have.
This page gives you the specific tools and frameworks to make this decision correctly.
By the end of this page, you will have concrete decision criteria for choosing between CP and AP, understand when hybrid approaches are appropriate, know how to evaluate trade-offs in domain-specific contexts, and be able to confidently justify your CAP choices in design reviews and interviews.
The CP vs AP decision ultimately reduces to one fundamental question:
What is the worse outcome: showing incorrect data or showing nothing at all?
This question has different answers in different contexts. Let's formalize the criteria:
Criterion 1: Cost of Inconsistency
What happens if your system shows stale or conflicting data?
| Low Cost | High Cost |
|---|---|
| User sees yesterday's analytics | User sees wrong account balance |
| Social feed is slightly stale | Two users book same seat |
| Cached product description is outdated | Inventory goes negative |
| Session preference not propagated | Two services have different config |
If the cost of inconsistency is High → Lean toward CP
Criterion 2: Cost of Unavailability
What happens if users can't complete their action?
| Low Cost | High Cost |
|---|---|
| Can't update profile right now | Can't place emergency order |
| Report generation delayed | Can't log in at all |
| Export fails, will retry | Checkout abandonment during sale |
| Admin action deferred | IoT sensor can't report critical alert |
If the cost of unavailability is High → Lean toward AP
Criterion 3: Conflict Resolution Complexity
If you choose AP and conflicts occur, how hard are they to resolve?
| Easy to Resolve | Hard to Resolve |
|---|---|
| Counters (sum them) | Unique constraints (who wins?) |
| Add-only sets (union) | Overwrites where both matter |
| LWW where order doesn't matter | Ordered sequences (whose order?) |
| Idempotent operations | Stateful workflows |
If conflicts are hard to resolve → Lean toward CP
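The "easy to resolve" column can be made concrete with a few merge functions. This is a minimal sketch, not a production CRDT library: the function names are illustrative, the counter is assumed to be grow-only and tracked per replica, and the LWW register assumes each write carries a timestamp.

```python
from typing import Dict, Set, Tuple


def merge_counters(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Merge two grow-only counters stored as per-replica increment totals.

    Each replica only increments its own slot, so the per-slot maximum is
    the most recent value that either side has seen.
    """
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def counter_value(counter: Dict[str, int]) -> int:
    """The visible count is the sum over all replica slots."""
    return sum(counter.values())


def merge_sets(a: Set[str], b: Set[str]) -> Set[str]:
    """Add-only sets merge by union; nothing is ever lost."""
    return a | b


def merge_lww(a: Tuple[float, str], b: Tuple[float, str]) -> Tuple[float, str]:
    """Last-writer-wins on (timestamp, value) pairs.

    Ties on timestamp fall back to comparing the value, so both replicas
    pick the same winner regardless of merge order.
    """
    return max(a, b)


# Two replicas diverge during a partition, then merge.
east = {"east": 3, "west": 1}
west = {"east": 2, "west": 4}
merged = merge_counters(east, west)
total = counter_value(merged)
```

All three merges are commutative, which is what makes them safe for AP: replicas can exchange state in any order and still converge. Unique constraints and ordered sequences have no such merge, which is why they push toward CP.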
Criterion 4: Detection and Recovery Capability
Can you detect and recover from inconsistency after the fact?
| Detectable & Recoverable | Silent & Permanent |
|---|---|
| Audit logs enable reconciliation | Data just overwrites |
| Compensating transactions possible | Actions are irreversible |
| Business process allows fixes | Real-world effects immediate |
| Users can surface issues | Users don't know about problem |
If inconsistency is undetectable or unrecoverable → CP is mandatory
Ask yourself: "If I'm on call at 3 AM and facing a partition, which page would I rather receive: 'Users seeing errors' or 'Potential data inconsistency'?" The answer reveals your true priority. This gut check often clarifies ambiguous cases.
Use this systematic framework to make and document your CP/AP decisions:
Step 1: Define the Data in Question
Be specific. Don't say "the database." Identify:
Step 2: Enumerate the Failure Scenarios
List what can go wrong during partitions:
Step 3: Evaluate Each Criterion (1-5 scale)
| Criterion | Score 1 (Low/Easy) | Score 5 (High/Hard) |
|---|---|---|
| Cost of Inconsistency | Cosmetic issue | Financial/safety loss |
| Cost of Unavailability | Minor inconvenience | Critical blocking |
| Conflict Resolution | Automatic merge | Impossible to merge |
| Detection/Recovery | Easily detected, fixed | Silent corruption |
Step 4: Apply the Decision Matrix
DECISION MATRIX: Score each factor 1-5 and sum

┌──────────────────────────────────────────────────────────────────┐
│ FACTOR                          │ CP Points      │ AP Points     │
├──────────────────────────────────────────────────────────────────┤
│ Cost of Inconsistency           │ +Score         │ -             │
│ Cost of Unavailability          │ -              │ +Score        │
│ Conflict Resolution Difficulty  │ +Score         │ -             │
│ Detection/Recovery Difficulty   │ +Score         │ -             │
├──────────────────────────────────────────────────────────────────┤
│ TOTAL                           │ Sum(CP)        │ Sum(AP)       │
└──────────────────────────────────────────────────────────────────┘

INTERPRETATION:
- CP Points > AP Points by 3+: Strong CP
- CP Points > AP Points by 1-2: CP with AP fallback modes
- Roughly equal: Hybrid approach, analyze per-operation
- AP Points > CP Points by 1-2: AP with CP for critical paths
- AP Points > CP Points by 3+: Strong AP

EXAMPLE: User Session Store
┌───────────────────────────────────────────────────────────────┐
│ Cost of Inconsistency: 2                                      │
│   (User sees old prefs - mildly annoying)                     │
│                                                               │
│ Cost of Unavailability: 5                                     │
│   (User can't log in - blocks everything)                     │
│                                                               │
│ Conflict Resolution: 1                                        │
│   (LWW works fine for sessions)                               │
│                                                               │
│ Detection/Recovery: 2                                         │
│   (User will notice and fix preferences)                      │
├───────────────────────────────────────────────────────────────┤
│ CP Points: 2 + 1 + 2 = 5                                      │
│ AP Points: 5                                                  │
│                                                               │
│ RESULT: Roughly equal, but availability critical              │
│ DECISION: AP with sticky sessions for read-your-writes        │
└───────────────────────────────────────────────────────────────┘

Step 5: Document the Decision
Create an Architecture Decision Record (ADR):
# ADR-042: Session Store CAP Strategy
## Status: Accepted
## Context
Our session store handles user authentication state across
multiple data centers. Users expect to remain logged in
regardless of network conditions.
## Decision
We will use an AP strategy (Redis Cluster) for session storage.
## Scoring
- Cost of Inconsistency: 2/5
- Cost of Unavailability: 5/5
- Conflict Resolution: 1/5
- Detection/Recovery: 2/5
## Consequences
- Positive: High availability during partitions
- Positive: Low latency for session lookups
- Negative: Rare cases of seeing old session data
- Mitigation: Sticky sessions for read-your-writes guarantee
This documentation ensures your decision is understood, defensible, and reviewable.
Don't make a single CP/AP decision for your entire system. Each data type, service, or feature may have different requirements. A well-designed system has explicit CP/AP decisions for each component, documented and justified.
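Scoring each component separately is easy to automate. The sketch below applies the Step 4 decision matrix to one set of scores; the class and function names are illustrative, and "roughly equal" is assumed to mean a difference of exactly zero, which the text does not pin down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapScores:
    """Scores on the 1-5 scale from Step 3 (1 = low/easy, 5 = high/hard)."""
    inconsistency_cost: int
    unavailability_cost: int
    conflict_resolution: int
    detection_recovery: int

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def recommend(scores: CapScores) -> str:
    """Apply the decision matrix: three factors feed CP, one feeds AP."""
    cp_points = (
        scores.inconsistency_cost
        + scores.conflict_resolution
        + scores.detection_recovery
    )
    ap_points = scores.unavailability_cost
    diff = cp_points - ap_points
    if diff >= 3:
        return "Strong CP"
    if diff >= 1:
        return "CP with AP fallback modes"
    if diff == 0:  # assumption: "roughly equal" treated as an exact tie
        return "Hybrid approach, analyze per-operation"
    if diff >= -2:
        return "AP with CP for critical paths"
    return "Strong AP"


# One entry per component, as the text recommends.
components = {
    "session_store": CapScores(2, 5, 1, 2),     # the worked example above
    "account_balances": CapScores(5, 3, 5, 5),  # hypothetical scores
}
decisions = {name: recommend(scores) for name, scores in components.items()}
```

The output is a starting point for discussion, not a verdict: the session store example lands on "hybrid", and the final AP decision still comes from judging that availability dominates.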
Different domains have established patterns based on their unique requirements. Use these as starting points, not absolute rules.
Financial Services:
| Component | Recommendation | Rationale |
|---|---|---|
| Account balances | CP | Incorrect balance = regulatory violation |
| Transaction history | CP | Must be accurate for auditing |
| Exchange rates | AP | Slightly stale rate is acceptable |
| Fraud alerts | AP | Better to false-alarm than miss |
| User preferences | AP | Low stakes, high volume |
E-Commerce:
| Component | Recommendation | Rationale |
|---|---|---|
| Order placement | CP | Must not lose/duplicate orders |
| Inventory (reserve) | CP | Overselling has real costs |
| Product catalog | AP | Stale prices/descriptions OK briefly |
| Shopping cart | AP | Must always be available to add items |
| Session/auth | AP | Users must be able to browse |
| Search results | AP | Stale product list is acceptable |
| Reviews/ratings | AP | Eventual consistency is fine |
| Industry | Typical CP Data | Typical AP Data | Hybrid Areas |
|---|---|---|---|
| Healthcare | Patient records, prescriptions, allergies | Appointment scheduling, wait times | Billing (CP for charges, AP for estimates) |
| Gaming | Player inventory, currency, rankings | Chat, social features, analytics | Matchmaking (CP to finalize, AP to search) |
| Social Media | Authentication, privacy settings | Feeds, content, reactions, DMs | Follower counts (AP but reconciled) |
| IoT/Telemetry | Device config, firmware versions | Sensor readings, metrics, logs | Alert thresholds (depends on criticality) |
| Travel/Booking | Reservations, payments, tickets | Search, pricing, availability display | Inventory (CP to book, AP to display) |
| Logistics | Shipping records, tracking updates | Fleet location, capacity estimates | Route assignment (CP when committing) |
Healthcare: A Critical Case Study
Healthcare illustrates the stakes of CP/AP decisions:
Medication allergies → Strong CP: If a patient's allergy record is inconsistent, they might receive a medication that causes anaphylaxis. Unavailability ("Cannot access allergy data") is far safer than inconsistency. Systems should block prescription decisions if allergy data is uncertain.
Appointment scheduling → Balanced: Double-booking an appointment is bad but not dangerous. Being unable to schedule appointments during an outage is operationally problematic. Most systems lean AP with conflict resolution (call the patient if double-booked).
Radiology results → CP for diagnosis, AP for viewing: Oncologists making treatment decisions need the authoritative, consistent result. For patients viewing their own results, slightly stale data is probably fine, and availability matters.
These are guidelines, not rules. A healthcare startup might have different trade-offs than a hospital. An upstart e-commerce site might prioritize availability over inventory accuracy to maximize sales. Understand your specific context, user expectations, and regulatory environment.
Often, neither a pure CP nor a pure AP strategy is optimal. Hybrid approaches can combine the strengths of both, at the cost of added complexity.
Hybrid Strategy 1: Tiered Consistency
Different tiers of data get different treatment:
┌─────────────────────────────────────────────────────────────────┐
│ TIER 1 (Critical): CP │
│ - User authentication tokens │
│ - Payment authorizations │
│ - Inventory commitments │
│ → Use: Consensus-based system (etcd, PostgreSQL sync rep) │
│ │
│ TIER 2 (Important): CP with degraded reads │
│ - Order history │
│ - Account settings │
│ - Notification preferences │
│ → Use: Read replicas with staleness bounds │
│ │
│ TIER 3 (Eventual OK): AP │
│ - Session data │
│ - Activity logs │
│ - Analytics events │
│ → Use: Async replication, eventual consistency │
└─────────────────────────────────────────────────────────────────┘
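One way to keep the tiers explicit is a single lookup table that every service consults. The sketch below is illustrative only: the data-type keys, the `can_serve` function, and the choice to treat unknown data as Tier 1 are assumptions, not part of any particular framework.

```python
from enum import Enum


class Tier(Enum):
    CRITICAL = 1   # Tier 1: CP
    IMPORTANT = 2  # Tier 2: CP with degraded reads
    EVENTUAL = 3   # Tier 3: AP


TIER_BY_DATA_TYPE = {
    "auth_token": Tier.CRITICAL,
    "payment_authorization": Tier.CRITICAL,
    "inventory_commitment": Tier.CRITICAL,
    "order_history": Tier.IMPORTANT,
    "account_settings": Tier.IMPORTANT,
    "notification_preferences": Tier.IMPORTANT,
    "session_data": Tier.EVENTUAL,
    "activity_log": Tier.EVENTUAL,
    "analytics_event": Tier.EVENTUAL,
}


def can_serve(data_type: str, operation: str, has_quorum: bool) -> bool:
    """Decide whether a node may serve a request given its view of the cluster.

    has_quorum is False when the node is on the minority side of a partition.
    """
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation}")
    # Assumption: unclassified data gets the strictest treatment.
    tier = TIER_BY_DATA_TYPE.get(data_type, Tier.CRITICAL)
    if has_quorum or tier is Tier.EVENTUAL:
        return True
    # Tier 2 keeps serving (possibly stale) reads but refuses writes.
    return tier is Tier.IMPORTANT and operation == "read"
```

Defaulting unknown data to the strictest tier means a forgotten classification shows up as an error during a partition rather than as silent inconsistency.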
Hybrid Strategy 2: Write CP, Read AP
Writes go through a strongly consistent path; reads can use eventually consistent replicas:
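A minimal in-memory sketch of this split follows. The `Primary` and `Replica` classes and the `min_version` token are assumptions made for illustration: a real system would use a consensus-backed primary and asynchronous replication, and the token here is one common way to give a client read-your-writes on top of stale replicas.

```python
from typing import Dict, Optional


class Primary:
    """Stands in for the strongly consistent write path."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.version = 0

    def write(self, key: str, value: str) -> int:
        self.version += 1
        self.data[key] = value
        return self.version  # the client keeps this as its read token


class Replica:
    """Stands in for an eventually consistent read replica."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.applied_version = 0

    def sync_from(self, primary: Primary) -> None:
        # Real replication is asynchronous; here it is triggered by hand.
        self.data = dict(primary.data)
        self.applied_version = primary.version


def read(key: str, primary: Primary, replica: Replica, min_version: int = 0) -> Optional[str]:
    """Serve from the replica unless the caller needs a newer version."""
    if replica.applied_version >= min_version:
        return replica.data.get(key)
    return primary.data.get(key)


primary, replica = Primary(), Replica()
primary.write("profile", "v1")
replica.sync_from(primary)
token = primary.write("profile", "v2")  # replica has not caught up yet

anonymous_view = read("profile", primary, replica)               # stale is acceptable
author_view = read("profile", primary, replica, min_version=token)  # must see own write
```

Only the reader who made the write pays the cost of going to the primary; everyone else continues to read from the replica.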
Hybrid Strategy 3: The Reservation Pattern
This elegant pattern provides CP guarantees with AP-like availability:
Phase 1: Soft Reservation (AP)
Phase 2: Hard Commitment (CP)
Result: High availability for browsing/carting (most users), consistency only required at commitment (fewer users, can tolerate occasional failure).
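The two phases can be sketched as follows. This is an assumption-laden illustration rather than a reference design: the soft hold is modeled as an unchecked, expiring entry, and a `threading.Lock` stands in for whatever atomic check-and-decrement the CP store provides.

```python
import threading
import time
from typing import Dict, Optional


class Inventory:
    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._soft_holds: Dict[str, float] = {}  # cart_id -> expiry time
        self._lock = threading.Lock()            # placeholder for a CP store

    def soft_reserve(self, cart_id: str, ttl_seconds: float, now: Optional[float] = None) -> None:
        """Phase 1 (AP): always accepted, no stock check, expires on its own."""
        now = time.monotonic() if now is None else now
        self._soft_holds[cart_id] = now + ttl_seconds

    def hard_commit(self, cart_id: str, now: Optional[float] = None) -> bool:
        """Phase 2 (CP): succeeds only if the hold is live and stock remains."""
        now = time.monotonic() if now is None else now
        with self._lock:
            expiry = self._soft_holds.pop(cart_id, None)
            if expiry is None or expiry < now:
                return False  # no hold, or the hold expired
            if self.stock <= 0:
                return False  # someone else committed the last unit
            self.stock -= 1
            return True


inventory = Inventory(stock=1)
inventory.soft_reserve("cart-a", ttl_seconds=600, now=0.0)
inventory.soft_reserve("cart-b", ttl_seconds=600, now=0.0)  # both holds accepted
first = inventory.hard_commit("cart-a", now=10.0)
second = inventory.hard_commit("cart-b", now=11.0)
```

Both shoppers can add the item to their carts, but only one checkout succeeds; the other receives a clear failure at commit time instead of an oversold order.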
Hybrid Strategy 4: Local-First with Server Reconciliation
Emerging pattern for mobile and collaborative applications:
Every hybrid strategy adds operational and cognitive complexity. You need to understand which operations are in which tier. Testing is more complex. Failure modes are more numerous. Only use hybrids when the value justifies the complexity. Simple systems with clear trade-offs are often better than clever systems with hidden failure modes.
Let's examine how real systems have made and justified their CP/AP decisions.
Case Study 1: Amazon Shopping Cart (AP)
Amazon famously chose AP for shopping carts (described in the Dynamo paper):
Context: Users add items to carts; carts maintain state across sessions and devices.
Decision: AP (eventually consistent, vector clocks for conflict detection)
Reasoning:
Outcome: Carts remain writable during node and network failures. The visible cost is an occasional merge anomaly, such as a deleted item resurfacing in the cart, which is a minor annoyance that users can fix. Sales are preserved.
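To illustrate how vector clocks detect conflicting cart versions, here is a small sketch. It is not Amazon's implementation: the function names are hypothetical, carts are modeled as plain sets of item IDs, and concurrent versions are merged by union.

```python
from typing import Dict, Set, Tuple

VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither version descends from the other
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge_carts(
    cart_a: Set[str], clock_a: VectorClock,
    cart_b: Set[str], clock_b: VectorClock,
) -> Tuple[Set[str], VectorClock]:
    """Keep the newer cart when one exists; union the carts when they conflict."""
    order = compare(clock_a, clock_b)
    if order in ("after", "equal"):
        return set(cart_a), dict(clock_a)
    if order == "before":
        return set(cart_b), dict(clock_b)
    merged_clock = {
        n: max(clock_a.get(n, 0), clock_b.get(n, 0))
        for n in clock_a.keys() | clock_b.keys()
    }
    return cart_a | cart_b, merged_clock


# Two replicas accept writes to the same cart during a partition.
cart, clock = merge_carts(
    {"book", "lamp"}, {"n1": 2, "n2": 1},
    {"book", "pen"}, {"n1": 1, "n2": 2},
)
```

Union never drops an added item, but it cannot distinguish "not yet added" from "deleted on the other side", which is where resurfacing items come from.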
Case Study 2: Google Spanner (CP)
Google designed Spanner for global strong consistency:
Context: Finance-grade transactional system for AdWords, Play Store, etc.
Decision: CP (synchronous Paxos replication, TrueTime for ordering)
Reasoning:
Outcome: Global strong consistency with 99.999% availability. Achieves both through massive engineering investment (GPS/atomic clock time sync, multi-region Paxos). Latency cost is accepted.
Case Study 3: Like Counts (AP)

CONTEXT: Post like counts across global infrastructure

REQUIREMENTS:
- Billions of likes per day
- Must always be able to like a post
- Exact count is less important than directional accuracy
- Users expect sub-second response

DECISION: AP with eventual consistency

ARCHITECTURE:
┌─────────────────────────────────────────────────────────────────┐
│ User clicks "Like"                                              │
│         ↓                                                       │
│ Request goes to nearest datacenter                              │
│         ↓                                                       │
│ Like is written to local shard (fast, always succeeds)          │
│         ↓                                                       │
│ Background async: Replicate to other DCs                        │
│         ↓                                                       │
│ Periodic: Count aggregation across shards                       │
└─────────────────────────────────────────────────────────────────┘

DURING PARTITION:
- Likes in DC1 not visible in DC2 immediately
- User who liked in DC1 sees their like
- User in DC2 sees count without DC1 likes
- When partition heals: counts converge

WHY IT WORKS:
- Cost of inconsistency (wrong like count): Minimal
- Cost of unavailability (can't like): User frustration, engagement loss
- Detection: Easy (counts eventually converge)
- User expectation: "Likes are fast and always work"

TRADE-OFF ACCEPTED:
- Same post might show different like counts in different regions
- Within seconds to minutes, all views converge

Case Study 4: Stripe Payments (CP with Idempotency)
Context: Payment processing across global merchant base.
Decision: CP for actual charges, with idempotency keys to handle retries.
Reasoning:
Solution:
Developer Experience:
import stripe

stripe.Charge.create(
    amount=1000,
    currency='usd',
    idempotency_key='order-12345',  # Retries with the same key reuse the first result
)
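If the request times out, the developer can safely retry with the same idempotency key. Stripe returns the result of the original request instead of creating a second charge, so from the customer's point of view the charge happens exactly once.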
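The server side of this idea can be sketched in a few lines. This is a generic illustration, not Stripe's implementation: `IdempotentProcessor` is a hypothetical class, results are kept in memory forever, and a real service would persist them in a consistent store and expire old keys.

```python
import threading
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Run each operation at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def execute(self, key: str, operation: Callable[[], Any]) -> Any:
        # Holding the lock across the operation keeps two concurrent
        # retries from both running it; simple, but serializes all calls.
        with self._lock:
            if key in self._results:
                return self._results[key]  # replay the stored outcome
            result = operation()
            self._results[key] = result
            return result


charges_made = []


def charge_card() -> str:
    charges_made.append(1000)
    return f"charge-{len(charges_made)}"


processor = IdempotentProcessor()
first = processor.execute("order-12345", charge_card)
retry = processor.execute("order-12345", charge_card)  # e.g. after a client timeout
other = processor.execute("order-67890", charge_card)
```

The retry returns the first result without touching the card again, which is what lets a CP system fail a request during a partition and still be safe to retry afterward.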
These companies have made their decisions through years of production experience and billions of dollars of engineering. Study their published papers and blog posts. Many are available: Amazon's Dynamo paper, Google's Spanner paper, Facebook/Meta's TAO paper. These are treasure troves of practical distributed systems wisdom.
Whether in a design review, interview, or architecture document, you need to communicate CAP decisions clearly and persuasively.
The CAP Decision Communication Template:
1. State the decision clearly: "For [component], we've chosen a [CP/AP] strategy using [technology]."
2. Explain the primary driver: "The primary driver is [cost of inconsistency / cost of unavailability] because [business reason]."
3. Acknowledge the trade-off: "This means that during partitions, [what will happen]."
4. Describe the mitigation: "We mitigate this by [specific strategy]."
5. Justify with concrete scenarios: "For example, if [partition scenario], then [system behavior], which is preferable to [alternative]."
Example: Presenting an AP Decision
"For our session store, we've chosen an AP strategy using Redis Cluster.
The primary driver is availability because users who can't log in can't use any feature of our platform—it's a total blocker.
This means during partitions, users might briefly see outdated session data or preferences.
We mitigate this with sticky sessions to ensure users read from the node that received their writes, providing read-your-writes consistency.
For example, if our East and West datacenters are partitioned, a user on the East coast who updates their preferences will see those updates immediately. A user who later connects to West might see the old preferences until the partition heals—but they can still use the system."
In System Design Interviews:
Interviewers expect you to discuss CAP trade-offs. Use this structure:
This demonstrates both theoretical understanding and practical application—exactly what interviewers want to see.
The ability to confidently discuss CAP trade-offs comes from understanding the criteria, having examples ready, and having thought through objections. Prepare before design reviews. Run through likely questions. Your confidence signals competence.
A CAP strategy that isn't tested is a CAP hope. Validate your choices with concrete experiments.
Testing CP Systems:
Partition minority of nodes
Partition during active write
Rapid leader failover
Testing AP Systems:
Partition and write to both sides
Read from partitioned replica
Concurrent conflicting writes
SCENARIO 1: CP Database Under Partition
═══════════════════════════════════════════════════════════════════

Setup:
  - 3-node PostgreSQL with synchronous replication
  - Network partition between primary and both secondaries

Test Steps:
  1. Inject partition (iptables, toxiproxy, etc.)
  2. Attempt INSERT from client connected to primary

Expected Result:
  - INSERT hangs or times out (waiting for sync ack)
  - After timeout: client sees an error or a cancellation warning

Verification:
  - All three nodes have consistent data after partition heals
  - The row is present on all nodes or on none (PostgreSQL commits
    locally before waiting for the ack, so a cancelled wait can
    still leave the row committed; check for it explicitly)

SCENARIO 2: AP Database Conflict Resolution
═══════════════════════════════════════════════════════════════════

Setup:
  - 3-node Cassandra cluster
  - RF=3, ONE consistency level

Test Steps:
  1. Partition node-3 from nodes 1-2
  2. Write key=X, value="from-partition-A" to node-1
  3. Write key=X, value="from-partition-B" to node-3
     (using different client, assuming key maps to node-3)
  4. Heal partition
  5. Read key=X with ALL consistency

Expected Result:
  - Both writes succeed during partition
  - After healing: value is "from-partition-B" if its timestamp is higher
  - LWW applied correctly

Verification:
  - Read from all nodes shows same final value
  - Conflict was silently resolved (no error)
  - Audit log shows both writes (if logging enabled)

Chaos Engineering for CAP Validation:
Integrate CAP testing into your regular chaos engineering practice:
1. Scheduled partition tests: Run automated partition injection weekly or monthly. Compare actual behavior to expected behavior.
2. Game days: Simulate major partition events with full team participation. Practice incident response.
3. Continuous chaos: Tools such as LitmusChaos and Gremlin can inject network faults continuously in staging or production. Chaos Monkey complements them by terminating instances at random.
4. Jepsen testing: For custom distributed systems, invest in Jepsen-style testing that verifies linearizability or your expected consistency model.
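Before investing in real fault injection, the expected behavior of an AP scenario can be rehearsed against a toy model. The sketch below simulates three last-writer-wins replicas in memory; `Replica` and `heal` are hypothetical names, timestamps are supplied by the test rather than by a clock, and none of this exercises a real database.

```python
from typing import Dict, Optional, Tuple


class Replica:
    """In-memory replica that resolves conflicts by last-writer-wins."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, Tuple[int, str]] = {}  # key -> (timestamp, value)

    def write(self, key: str, value: str, timestamp: int) -> None:
        current = self.store.get(key)
        # Keep the entry with the higher (timestamp, value) pair.
        if current is None or (timestamp, value) > current:
            self.store[key] = (timestamp, value)

    def read(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        return entry[1] if entry else None


def heal(*replicas: Replica) -> None:
    """Anti-entropy after a partition: every replica sees every entry."""
    entries = [(k, ts, v) for r in replicas for k, (ts, v) in r.store.items()]
    for replica in replicas:
        for key, timestamp, value in entries:
            replica.write(key, value, timestamp)


node1, node2, node3 = Replica("node-1"), Replica("node-2"), Replica("node-3")

# Partition: nodes 1-2 on side A, node 3 alone on side B.
for node in (node1, node2):
    node.write("X", "from-partition-A", timestamp=1)
node3.write("X", "from-partition-B", timestamp=2)

during_partition = (node1.read("X"), node3.read("X"))
heal(node1, node2, node3)
after_heal = [node.read("X") for node in (node1, node2, node3)]
```

The assertions worth carrying over to a real test are the same ones this model makes obvious: both writes are accepted during the partition, and every replica returns the same value once it heals.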
Metrics to Capture:
Staging environments rarely replicate production network topology, scale, or traffic patterns. Test in staging to catch obvious bugs, but also test in production (carefully, with safeguards). A system that passes all staging tests but fails its first production partition hasn't been tested.
We've provided a comprehensive guide to making CP vs AP decisions in real systems. Let's consolidate the essential framework.
The CAP Theorem Mastery Checkpoint:
You've now completed the CAP theorem module. You should be able to:
✅ Define consistency, availability, and partition tolerance precisely
✅ Explain why partition tolerance is effectively mandatory
✅ Describe how CP and AP systems behave during partitions
✅ Apply a decision framework to real-world data requirements
✅ Recognize domain-specific patterns and established practices
✅ Design hybrid strategies for complex requirements
✅ Communicate and defend CAP decisions to stakeholders
✅ Test and validate CAP strategies in production
This knowledge forms the foundation for understanding all distributed system design. Every scalability strategy, every database choice, every caching decision connects back to these fundamentals.
Congratulations! You've mastered the CAP theorem—from formal definitions through practical application. This knowledge is foundational to distributed systems design and will inform every architectural decision you make. The next module on PACELC will extend these concepts to include latency trade-offs during normal operation.