Imagine a distributed system serving millions of requests per day. Network partitions—when nodes cannot communicate with each other—might occur a handful of times per year, lasting minutes or hours. That's perhaps 0.01% of operational time.
The other 99.99% of the time, the network is functioning normally. All nodes can communicate. No partitions exist. And yet, during this vast majority of operational time, the system is continuously making critical trade-off decisions that affect every single request.
This is the domain of the "Else" clause in PACELC—and it's where your system lives almost all of the time.
CAP theorem has nothing to say about this state. It addresses the exceptional condition (partition) but remains silent on the ordinary condition (normal operation). PACELC's greatest contribution is providing a framework for understanding and reasoning about system behavior in the "Else" state.
By the end of this page, you will understand what architectural choices distributed systems make during normal operation, how the latency vs. consistency trade-off manifests in concrete system behaviors, and why the 'Else' clause of PACELC often matters more for your application's user experience than its partition behavior.
To understand the 'Else' clause, we must first define what "no partition" actually means in practice. It's more nuanced than simply "the network is working."
Defining Normal Operation:
In PACELC terminology, "no partition" (the Else condition) means:
All nodes are reachable — Every node in the distributed system can communicate with every other node, either directly or through intermediaries.
Message delivery is reliable — Messages sent between nodes are delivered within expected timeouts. There may be latency variation, but messages don't disappear.
No indefinite delays — While network latency exists (and varies), requests complete within bounded time. The system isn't experiencing the unbounded delays characteristic of partitions.
This doesn't mean the network is perfect—just that it's functional within parameters the system considers "normal."
One subtlety: distributed systems can't always distinguish between 'slow network' and 'partition in progress.' When latency spikes, is it congestion or a partition forming? Systems typically use timeout thresholds—if communication fails beyond a threshold, assume partition. But this means there's a gray zone where the system's behavior is transitioning between Else and Partition modes.
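As a toy illustration of that timeout heuristic, the sketch below (not taken from any particular system; the threshold and peer names are assumptions) records the last heartbeat seen from each peer and suspects a partition once the silence exceeds a threshold:

```python
import time

# Illustrative threshold: beyond this much silence, suspect a partition.
HEARTBEAT_TIMEOUT_SECONDS = 5.0

class PartitionDetector:
    """Tracks the last heartbeat seen from each peer."""

    def __init__(self, peers):
        now = time.monotonic()
        self.last_heartbeat = {peer: now for peer in peers}

    def record_heartbeat(self, peer):
        self.last_heartbeat[peer] = time.monotonic()

    def suspected_partitioned(self):
        """Peers silent longer than the timeout are *suspected* unreachable.
        A slow network and a real partition look identical from here."""
        now = time.monotonic()
        return [
            peer for peer, seen in self.last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT_SECONDS
        ]

# Usage sketch with hypothetical peer names:
detector = PartitionDetector(["node-b", "node-c"])
detector.record_heartbeat("node-b")
print(detector.suspected_partitioned())  # [] right after the heartbeats
```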
During normal operation, distributed systems face a fundamental architectural choice that affects every read and write operation: How much coordination should occur before acknowledging an operation?
This question manifests differently for reads and writes, but the underlying trade-off is the same: more coordination means stronger consistency but higher latency; less coordination means lower latency but weaker consistency.
The Write Path Trade-off:
When a client writes data to a distributed system, the system must decide:
| Strategy | Coordination Level | Latency | Consistency | Durability |
|---|---|---|---|---|
| Acknowledge after local write | None | Lowest (~1ms) | Weakest | Risk of data loss if node fails |
| Acknowledge after async replication starts | Minimal | Low (~1-5ms) | Weak | Slight improvement in durability |
| Acknowledge after W replicas confirm | Moderate (quorum) | Medium (~10-100ms) | Strong if R+W>N | Good durability |
| Acknowledge after all replicas confirm | Maximum | Highest (~100ms+) | Strongest | Maximum durability |
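To make the rows above concrete, here is a minimal simulation of the write path under illustrative, assumed replica latencies: the coordinator responds once the required number of acknowledgments (one, a quorum, or all) has arrived, so the strategy directly sets the client-visible latency. The read path in the next table mirrors this with R in place of W.

```python
# Hypothetical one-way latencies (ms) from the coordinator to each of 3 replicas.
REPLICA_LATENCY_MS = {"local": 1, "same-region": 4, "cross-region": 80}

STRATEGY_ACKS = {
    "local_only": 1,   # acknowledge after local write
    "quorum": 2,       # W=2 of N=3 replicas
    "all": 3,          # wait for every replica
}

def write_latency_ms(strategy: str) -> int:
    """Client-visible latency: time until the required number of
    acknowledgments (the W-th fastest replica) has arrived."""
    acks_needed = STRATEGY_ACKS[strategy]
    one_way = sorted(REPLICA_LATENCY_MS.values())
    return 2 * one_way[acks_needed - 1]   # request out + ack back, simplified

for strategy in STRATEGY_ACKS:
    print(f"{strategy:>10}: ~{write_latency_ms(strategy)}ms")
# local_only: ~2ms, quorum: ~8ms, all: ~160ms (illustrative numbers only)
```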
The Read Path Trade-off:
Similarly, when a client reads data:
| Strategy | Coordination Level | Latency | Consistency Guarantee |
|---|---|---|---|
| Read from any replica | None | Lowest | May return stale data; eventual consistency |
| Read from leader/primary only | Low | Low-Medium | Consistent if no failover; read-your-writes typically guaranteed |
| Read from R replicas, return latest | Moderate | Medium | Strong consistency if R+W>N |
| Read with confirmation from majority | High | High | Linearizable reads; no stale data possible |
The Latency Penalty is Real:
Consider a distributed database with three replicas across three data centers:
Scenario 1: EL (Eventual, Low Latency) — the primary acknowledges the write as soon as its local copy is durable, without waiting for the other data centers. The client sees only local latency, a few milliseconds.
Scenario 2: EC (Strong Consistency) — the primary waits for replicas in the other data centers to confirm before acknowledging. The client pays for cross-region round trips, on the order of tens of milliseconds.
The difference can easily be 15x latency for the same logical operation—and it applies to every single request, not just during partitions.
The 'Else' clause behavior is primarily determined by a system's replication strategy. Different strategies place systems at different points on the EL ↔ EC spectrum.
Synchronous Replication (EC Behavior):
In synchronous replication, writes are not acknowledged until a specified number of replicas have durably stored the data. The system blocks until consensus is achieved.
```
Synchronous Replication Flow:

1. Client sends write to Primary
2. Primary writes to local storage
3. Primary forwards write to Secondary replicas
4. Primary waits for acknowledgments:
   - ALL mode: Wait for all secondaries
   - QUORUM mode: Wait for majority
5. Only after acknowledgments, respond to client

Timeline (Quorum with 3 replicas, geographically distributed):

T=0ms:  Client sends write
T=1ms:  Primary receives, starts local write
T=2ms:  Primary sends to secondaries, completes local write
T=3ms:  Near secondary receives write
T=5ms:  Near secondary acknowledges → quorum achieved (primary + near secondary)
T=6ms:  Primary responds to client
T=80ms: Far secondary receives write
T=82ms: Far secondary acknowledges (not needed for the quorum response)

Total latency: ~6ms (waiting for the nearest secondary)
Consistency: Strong (any subsequent read sees this write)
```
Asynchronous Replication (EL Behavior):
In asynchronous replication, writes are acknowledged as soon as the primary (or local replica) has stored the data. Replication to other nodes happens in the background.
```
Asynchronous Replication Flow:

1. Client sends write to Primary
2. Primary writes to local storage
3. Primary immediately responds to client: "Success"
4. Primary queues write for async replication
5. Background process sends to secondaries (eventually)

Timeline:

T=0ms:   Client sends write
T=1ms:   Primary receives, starts local write
T=2ms:   Primary completes local write
T=3ms:   Primary responds to client: "Success"
         ... background replication continues ...
T=10ms:  Near secondary receives and applies
T=100ms: Far secondary receives and applies

Total latency: ~3ms (only local write)
Consistency: Eventual (reads from secondaries may return stale data for ~100ms)
```
The period between primary acknowledgment and secondary replication is called 'replication lag.' During this window, reading from the primary returns the latest data, but reading from any secondary returns stale data. In a 3-region deployment with async replication, this window can be 50-200ms for distant replicas—long enough for users to notice inconsistencies.
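The lag window itself is easy to demonstrate. The toy class below (an assumption for illustration, not any real database's API) acknowledges writes at the primary immediately and applies them to a secondary only after a delay, so secondary reads inside that window return stale data:

```python
import threading
import time

class AsyncReplicatedRegister:
    """Toy single-value store: the primary applies writes immediately,
    the secondary applies them after a simulated replication delay."""

    def __init__(self, replication_delay_s: float = 0.1):
        self.primary_value = None
        self.secondary_value = None
        self.delay = replication_delay_s

    def write(self, value):
        self.primary_value = value          # acknowledged right away (EL)
        threading.Timer(self.delay, self._replicate, args=(value,)).start()

    def _replicate(self, value):
        self.secondary_value = value        # background replication

    def read_primary(self):
        return self.primary_value

    def read_secondary(self):
        return self.secondary_value         # may be stale during the lag window

store = AsyncReplicatedRegister(replication_delay_s=0.1)
store.write("v2")
print(store.read_primary())    # 'v2' immediately
print(store.read_secondary())  # None (stale) inside the ~100ms lag window
time.sleep(0.2)
print(store.read_secondary())  # 'v2' once replication catches up
```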
Semi-Synchronous Replication (Hybrid Behavior):
Many production systems implement hybrid strategies that attempt to balance the EL and EC extremes:
| Strategy | Latency Impact | Consistency | Durability | Use Case |
|---|---|---|---|---|
| Fully Synchronous | Highest (bounded by slowest replica) | Linearizable | Maximum | Financial transactions, leader election |
| Quorum Synchronous | Medium (bounded by majority) | Strong | High | Most database operations needing consistency |
| Semi-Synchronous | Low-Medium | Read-your-writes | Good | E-commerce, content management |
| Fully Asynchronous | Lowest (local only) | Eventual | Minimum | Analytics, caching, session stores |
Let's examine how specific distributed systems behave during normal operation, revealing their 'Else' clause choices:
Cassandra (EL Behavior):
Cassandra defaults to eventual consistency with tunable consistency levels. In normal operation:
```sql
-- Cassandra Consistency Levels

-- EL behavior (default):
CONSISTENCY ONE
INSERT INTO users (id, name) VALUES (1, 'Alice');
-- Acknowledges after ANY single replica confirms
-- Latency: ~2-5ms
-- Risk: May lose data if that replica fails before replication

-- Moving toward EC:
CONSISTENCY QUORUM
SELECT * FROM users WHERE id = 1;
-- Reads from majority of replicas, returns most recent
-- Latency: ~10-30ms (depends on replica locations)
-- Guarantee: Strongly consistent if writes also use QUORUM

-- Full EC:
CONSISTENCY ALL
INSERT INTO users (id, name) VALUES (2, 'Bob');
-- Waits for ALL replicas
-- Latency: ~50-200ms (bounded by slowest replica)
-- Guarantee: Linearizable, maximum durability
```
DynamoDB (Configurable EL/EC):
DynamoDB allows per-operation consistency selection, making it both EL and EC capable:
```javascript
// DynamoDB Normal Operation Behaviors

// EL behavior - Eventually Consistent Read (default)
const eventuallyConsistentRead = await dynamodb.get({
  TableName: 'users',
  Key: { userId: '12345' }
  // ConsistentRead defaults to false
}).promise();
// Latency: ~5-10ms
// May return slightly stale data (typically <1 second old)
// Cost: 0.5 RCU per 4KB

// EC behavior - Strongly Consistent Read
const stronglyConsistentRead = await dynamodb.get({
  TableName: 'users',
  Key: { userId: '12345' },
  ConsistentRead: true // Explicit strong consistency
}).promise();
// Latency: ~10-20ms (approximately 2x eventually consistent)
// Always returns most recent write
// Cost: 1.0 RCU per 4KB (2x cost)

// Writes are always synchronously replicated within region
// Global Tables add eventual consistency across regions
```
PostgreSQL with Streaming Replication (EC Behavior):
PostgreSQL offers synchronous streaming replication for strong consistency:
```sql
-- PostgreSQL Synchronous Replication Configuration

-- In postgresql.conf (primary):
-- synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
-- synchronous_commit = on

-- Normal operation write path:
BEGIN;
INSERT INTO orders (id, total) VALUES (1001, 99.99);
COMMIT;
-- The COMMIT blocks until:
-- 1. WAL written to primary's disk
-- 2. WAL received by at least one synchronous standby
-- Only then does the client receive confirmation

-- Latency implications:
-- Same-datacenter standby: ~2-5ms additional latency
-- Cross-region standby: ~20-100ms additional latency

-- Tradeoff configuration:
-- synchronous_commit = off → EL behavior, ~1ms commits
-- synchronous_commit = on → EC behavior, ~5-100ms commits
-- synchronous_commit = remote_apply → Strongest EC, highest latency
```
MongoDB (EC with Tuning Options):
MongoDB's write concern and read concern settings determine its normal operation behavior:
```javascript
// MongoDB Write Concern Examples

// EL-leaning: Acknowledge after primary only
db.collection.insertOne(
  { item: "journal", qty: 25 },
  { writeConcern: { w: 1 } }
);
// Fastest write, risk of data loss if primary fails before replication

// EC-leaning: Acknowledge after majority
db.collection.insertOne(
  { item: "notebook", qty: 50 },
  { writeConcern: { w: "majority" } }
);
// Waits for majority of replica set to acknowledge
// Latency: ~5-30ms depending on replica locations

// Full EC with journaling
db.collection.insertOne(
  { item: "paper", qty: 100 },
  { writeConcern: { w: "majority", j: true } }
);
// Waits for majority AND journal flush
// Maximum durability, highest latency

// Read Concern for EC behavior
db.collection.find({ item: "journal" })
  .readConcern("majority");
// Only returns data acknowledged by majority
// Prevents reading data that might be rolled back
```
The 'C' in the Else clause (EC vs EL) isn't binary—it represents a rich spectrum of consistency models. Understanding this spectrum is crucial for making informed architectural decisions.
The Consistency Hierarchy:
From strongest to weakest, the main consistency models are: strict serializability, linearizability, sequential consistency, causal consistency, read-your-writes, monotonic reads, and eventual consistency.
The Latency Cost of Each Level:
Moving up the consistency hierarchy requires more coordination, imposing latency costs:
| Consistency Level | Coordination Required | Typical Latency Overhead | When to Use |
|---|---|---|---|
| Eventual | None (async replication) | 0ms additional | Caching, analytics, content that can be stale |
| Monotonic Reads | Session affinity or version tracking | 0-2ms | User sessions, browsing history |
| Read-Your-Writes | Sticky routing or version vectors | 0-5ms | User profile updates, shopping carts |
| Causal | Dependency tracking, vector clocks | 2-10ms | Social media feeds, collaborative editing |
| Sequential | Total order broadcast | 10-50ms | Audit logs, financial ledgers |
| Linearizability | Consensus protocol per operation | 20-100ms | Primary key lookups, critical state |
| Strict Serializability | Global time coordination | 50-200ms | Distributed transactions, inventory management |
Most applications need different consistency levels for different operations. User authentication might need linearizability (can't risk stale passwords). Product catalog browsing works fine with eventual consistency (a few seconds of staleness is acceptable). The trick is identifying which operations need what, and using a system that supports per-operation configuration.
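One way to put this into practice is a per-operation policy map that a data-access layer consults before issuing each request. The operation names, levels, and read function below are hypothetical:

```python
from enum import Enum

class Consistency(Enum):
    EVENTUAL = "eventual"
    READ_YOUR_WRITES = "read_your_writes"
    LINEARIZABLE = "linearizable"

# Hypothetical per-operation policy, mirroring the examples in the text.
CONSISTENCY_POLICY = {
    "browse_catalog":   Consistency.EVENTUAL,          # staleness is acceptable
    "view_own_profile": Consistency.READ_YOUR_WRITES,  # user must see own updates
    "authenticate":     Consistency.LINEARIZABLE,      # stale credentials are unacceptable
}

def read(operation: str, key: str) -> str:
    """Sketch of a data-access layer picking the consistency level per operation."""
    level = CONSISTENCY_POLICY.get(operation, Consistency.EVENTUAL)
    # A real implementation would pass `level` to the database client here,
    # e.g. choosing a replica vs. the leader, or setting a strong-read flag.
    return f"read {key!r} at {level.value} consistency"

print(read("authenticate", "user:12345"))
```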
Systems choosing EL (Else Latency) behavior—prioritizing low latency over strong consistency during normal operation—employ specific patterns to manage the consequences of eventual consistency:
Pattern 1: Conflict-Free Replicated Data Types (CRDTs)
CRDTs are data structures designed to be replicated across nodes and merged without coordination. They guarantee eventual convergence regardless of update order.
```
CRDT Examples:

1. G-Counter (Grow-only Counter)
   - Each node maintains its own count
   - Merge: element-wise max of the per-node counts; value = sum of all counts
   - Use case: Page view counters, like counts

   Node A: {A: 5, B: 0, C: 0}
   Node B: {A: 0, B: 3, C: 0}
   Merged: {A: 5, B: 3, C: 0} → Total: 8

2. LWW-Register (Last-Write-Wins Register)
   - Each update tagged with timestamp
   - Merge: Keep value with highest timestamp
   - Use case: User profile fields

   Node A: {value: "Alice", ts: 100}
   Node B: {value: "Alicia", ts: 105}
   Merged: {value: "Alicia", ts: 105}

3. OR-Set (Observed-Remove Set)
   - Adds tracked with unique tags
   - Removes only remove observed adds
   - Use case: Shopping cart items
```
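The G-Counter above translates directly into code. This is a minimal sketch of the standard formulation (per-node counts, merge by element-wise max); the node names are illustrative:

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot;
    merging takes the element-wise maximum, so merges commute and converge."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {node_id: 0}

    def increment(self, amount: int = 1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter"):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas counting page views independently (node names are illustrative):
a, b = GCounter("node-a"), GCounter("node-b")
for _ in range(5):
    a.increment()
for _ in range(3):
    b.increment()
a.merge(b)        # no coordination needed; merge order doesn't matter
print(a.value())  # 8
```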
Pattern 2: Read Repair
Systems like Cassandra and DynamoDB use read repair to heal inconsistencies. During a read:
1. The coordinator queries the requested key on several replicas.
2. It compares the returned versions (for example, by write timestamp).
3. The most recent value is returned to the client.
4. Replicas that returned stale versions are updated in the background with the latest value.
This approach spends a little extra read latency (consulting multiple replicas) to improve consistency over time, without any write-time coordination.
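A minimal sketch of the idea, assuming timestamped values and in-memory replicas (this is not Cassandra's or DynamoDB's actual implementation): query several replicas, return the newest version, and push it back to any replica that fell behind.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: int   # e.g., a write timestamp or version number

def read_with_repair(key: str, replicas: list[dict]) -> str:
    """Read `key` from every replica in the contacted set, return the newest
    value, and repair any replica that returned an older (or missing) version."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    newest = max((v for _, v in responses if v is not None),
                 key=lambda v: v.timestamp)
    for replica, seen in responses:
        if seen is None or seen.timestamp < newest.timestamp:
            replica[key] = newest            # background repair write
    return newest.value

# Illustrative replicas where r2 missed the latest write:
r1 = {"user:1": Versioned("Alicia", timestamp=105)}
r2 = {"user:1": Versioned("Alice", timestamp=100)}
print(read_with_repair("user:1", [r1, r2]))  # 'Alicia'; r2 is repaired as a side effect
```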
Pattern 3: Anti-Entropy Protocols
Background processes continuously compare data across replicas and reconcile differences. Unlike read repair (triggered by reads), anti-entropy runs proactively:
```
Anti-Entropy Process (Merkle Tree Based):

1. Each node maintains a Merkle tree of its data
   - Leaf nodes: Hash of data ranges
   - Parent nodes: Hash of children
   - Root: Single hash representing all data

2. Periodically, nodes exchange root hashes
   - If roots match: Data is identical, done
   - If roots differ: Drill down the tree to find the divergence

3. Exchange only differing data ranges
   - Efficient: Only transfers what's different
   - Eventually: All replicas converge

Cassandra's nodetool repair uses this approach:
$ nodetool repair -full keyspace_name
```
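A toy version of the Merkle comparison step, with assumed key ranges and data: hash each range, combine the hashes into a root, and drill down only when the roots differ. Real implementations differ in tree construction and transfer details.

```python
import hashlib

def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(leaf_hashes: list[str]) -> str:
    """Combine leaf hashes pairwise until a single root hash remains."""
    level = leaf_hashes
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last hash if count is odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def leaf_hashes(store: dict, ranges: list[list[str]]) -> list[str]:
    """One hash per key range; a range hash covers all its key/value pairs."""
    return [h("".join(f"{k}={store.get(k, '')}" for k in r)) for r in ranges]

# Two replicas that diverge only in the second key range (illustrative data):
ranges = [["a", "b"], ["c", "d"]]
replica1 = {"a": "1", "b": "2", "c": "3", "d": "4"}
replica2 = {"a": "1", "b": "2", "c": "3", "d": "9"}

h1, h2 = leaf_hashes(replica1, ranges), leaf_hashes(replica2, ranges)
if merkle_root(h1) != merkle_root(h2):        # roots differ: drill down
    stale = [i for i, (x, y) in enumerate(zip(h1, h2)) if x != y]
    print("ranges to reconcile:", stale)       # [1] -> transfer only that range
```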
Pattern 4: Vector Clocks for Causality
To provide causal consistency (stronger than eventual, weaker than linearizable), systems use vector clocks to track happens-before relationships, as in the sketch and worked trace below:
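Here is a minimal vector-clock sketch for three assumed nodes A, B, and C: increment your own slot on a local write, merge by element-wise max when you receive another clock, and flag a conflict when neither clock dominates the other.

```python
def increment(clock: dict, node: str) -> dict:
    """Local event at `node`: bump that node's slot."""
    return {**clock, node: clock.get(node, 0) + 1}

def merge(a: dict, b: dict) -> dict:
    """Element-wise max: the clock after observing both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def dominates(a: dict, b: dict) -> bool:
    """True if `a` has seen everything `b` has (a >= b in every slot)."""
    return all(a.get(n, 0) >= v for n, v in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a happens-after b"
    if dominates(b, a):
        return "b happens-after a"
    return "concurrent (conflict: application must resolve)"

# Reproducing the trace below: B builds on A's write, C writes concurrently.
vc_b = {"A": 1, "B": 1}        # B saw A's write, then wrote
vc_c = {"C": 1}                # C never saw A's write
print(compare(vc_b, vc_c))     # concurrent (conflict: application must resolve)
```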
```
Vector Clock Example (3 nodes: A, B, C):

1. Initial state: All clocks at [0, 0, 0]

2. Node A writes value "x": clock becomes [1, 0, 0]
   - Stored as: {value: "x", vc: [1, 0, 0]}

3. Node B reads from Node A, modifies to "y"
   - B's clock: max([1,0,0], [0,0,0]) + [0,1,0] = [1, 1, 0]
   - Stored as: {value: "y", vc: [1, 1, 0]}

4. Node C concurrently writes "z" (didn't see A's write)
   - C's clock: [0, 0, 1]
   - Stored as: {value: "z", vc: [0, 0, 1]}

5. Merge at Node A:
   - [1, 1, 0] vs [0, 0, 1]: Neither dominates → CONFLICT
   - Application must resolve (LWW, merge function, or user choice)
```
EL systems don't ignore consistency—they defer and distribute the consistency work. Instead of blocking writes for synchronous consensus, they use background processes, CRDTs, and read-time repairs to achieve eventual convergence. The user experiences low latency upfront, and the system works toward consistency behind the scenes.
Systems choosing EC (Else Consistency) behavior—prioritizing strong consistency during normal operation—employ different mechanisms to achieve coordination while minimizing latency:
Pattern 1: Leader-Based Replication
A single leader (primary) handles all writes, ensuring serialization. Replicas follow the leader's WAL (Write-Ahead Log):
```
Leader-Based Replication Flow:

                   ┌─────────────┐
Client Write ───▶  │   Leader    │ ───Replicate──▶ Follower 1
      ▲            │  (Primary)  │ ───Replicate──▶ Follower 2
      │            └─────────────┘ ───Replicate──▶ Follower 3
      │                   │
      └───────────────────┘
        Acknowledgment (after replication threshold met)

Configurations:
- Async ack: After leader writes (EL behavior)
- Sync 1: After leader + any 1 follower (semi-EC)
- Sync majority: After leader + N/2 followers (EC)
- Sync all: After all nodes (strongest EC)

Examples:
- PostgreSQL streaming replication
- MySQL InnoDB Cluster
- MongoDB replica sets
```
Pattern 2: Quorum Systems
Reads and writes require acknowledgment from a quorum (subset) of replicas. When R + W > N, consistency is guaranteed:
```
Quorum Calculation (N=5 replicas):

Strong Consistency requirement: R + W > N

Configuration 1: R=3, W=3
- Write reaches 3 of 5: ✓ (majority)
- Read consults 3 of 5: ✓ (majority)
- At least 1 overlap guaranteed: ✓ (3+3-5=1)
- Result: Strongly consistent

Configuration 2: R=1, W=5
- Write reaches all 5: ✓
- Read needs only 1: Fast!
- Overlap guaranteed: ✓
- Result: Fast reads, slow writes

Configuration 3: R=5, W=1
- Write reaches 1: Fast!
- Read needs all 5: Slower
- Overlap guaranteed: ✓
- Result: Fast writes, slow reads

Latency Impact:
- Write latency: Time for the Wth fastest replica to acknowledge
- Read latency: Time for the Rth fastest replica to respond
```
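The latency rule at the end of that breakdown—write latency is set by the Wth fastest replica, read latency by the Rth fastest—can be checked with a few lines. The per-replica latencies below are illustrative assumptions:

```python
# Illustrative per-replica round-trip latencies (ms) for N=5 replicas.
replica_rtt_ms = [2, 3, 5, 40, 85]

def kth_fastest(latencies: list[float], k: int) -> float:
    """Time until the k-th acknowledgment arrives = k-th smallest latency."""
    return sorted(latencies)[k - 1]

def is_strongly_consistent(r: int, w: int, n: int) -> bool:
    return r + w > n            # read and write quorums must overlap

n = len(replica_rtt_ms)
for r, w in [(3, 3), (1, 5), (5, 1)]:
    print(f"R={r}, W={w}: write ~{kth_fastest(replica_rtt_ms, w)}ms, "
          f"read ~{kth_fastest(replica_rtt_ms, r)}ms, "
          f"strong={is_strongly_consistent(r, w, n)}")
```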
Pattern 3: Consensus Protocols (Raft, Paxos)
For distributed systems without a fixed, pre-configured leader, consensus protocols elect a leader and ensure all nodes agree on the order of operations:
```
Raft Consensus Flow (simplified):

1. Client sends write to Leader
2. Leader appends to local log (not committed)
3. Leader sends AppendEntries RPC to all Followers
4. Followers append to logs, respond with success
5. Leader receives majority acknowledgment
6. Leader commits entry, applies to state machine
7. Leader responds to client with success
8. Leader's next heartbeat tells Followers to commit

Latency: ~2 RTT (Client→Leader→Followers→Leader→Client)
Plus: Local disk writes at each step

Optimizations:
- Pipelining: Start the next operation before the previous completes
- Batching: Group multiple operations per AppendEntries
- Leases: Allow reads from the leader without a consensus round

Systems using Raft: etcd, CockroachDB, TiDB
```
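As a highly simplified sketch of step 5 only—the commit rule, with no leader election, log matching, or failure handling—the leader counts itself plus acknowledging followers and commits once that total reaches a majority of the cluster:

```python
def majority(cluster_size: int) -> int:
    """Smallest group that constitutes a majority of the cluster."""
    return cluster_size // 2 + 1

def can_commit(cluster_size: int, follower_acks: int) -> bool:
    """The leader counts itself plus acknowledging followers; the entry
    commits once that total reaches a majority."""
    return 1 + follower_acks >= majority(cluster_size)

# 5-node cluster: leader + 2 follower acks = 3 of 5 -> committed.
print(can_commit(cluster_size=5, follower_acks=2))   # True
print(can_commit(cluster_size=5, follower_acks=1))   # False (only 2 of 5)
```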
Pattern 4: Synchronized Time (TrueTime)
Google Spanner uses atomic clocks and GPS receivers to bound clock uncertainty, enabling both strong consistency and reasonable latency:
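The commit-wait rule described in the walkthrough below can be modeled with a toy sketch, assuming a fixed, symmetric uncertainty bound (real TrueTime derives its bound from atomic clocks and GPS; the value here is an illustrative assumption):

```python
import time

EPSILON_S = 0.007   # illustrative clock-uncertainty bound (~7ms)

def tt_now():
    """Toy TrueTime: the true time is assumed to lie within [earliest, latest]."""
    t = time.time()
    return (t - EPSILON_S, t + EPSILON_S)   # (earliest, latest)

def commit_wait():
    """Spanner-style commit wait: choose s = TT.now().latest, then block
    until TT.now().earliest > s, so the timestamp is safely in the past."""
    _, s = tt_now()                  # commit timestamp
    while tt_now()[0] <= s:          # wait out the uncertainty window
        time.sleep(0.001)
    return s

start = time.time()
commit_wait()
# Roughly 2x epsilon in this symmetric toy model.
print(f"commit wait took ~{(time.time() - start) * 1000:.1f}ms")
```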
```
Google TrueTime API:

TT.now() returns: [earliest, latest]
- Actual time is guaranteed to be within this interval
- Uncertainty bound: typically 1-7 milliseconds

Spanner Transaction Commit:
1. Pick commit timestamp s = TT.now().latest
2. Wait until TT.now().earliest > s ("commit wait")
3. Now safe to release the transaction

Why this works:
- If transaction T1 commits before T2 starts, T1's timestamp is
  guaranteed < T2's timestamp
- External consistency achieved without cross-node coordination!

Cost: ~7ms commit wait in the worst case
Benefit: Globally consistent reads without consensus overhead
```
EC systems invest latency upfront to guarantee consistency. By coordinating during writes (via consensus, quorums, or synchronized time), they ensure that all subsequent reads—from any node—return correct, up-to-date data. The latency penalty is paid once per write, providing the benefit of simple reasoning about system state.
We've explored the 'Else' clause of PACELC—the behavior of distributed systems during normal operation. Let's consolidate the key insights:
- Partitions are rare; the 'Else' state is where your system spends nearly all of its time, so the EL vs EC choice affects every request.
- A system's replication strategy—synchronous, quorum-based, semi-synchronous, or asynchronous—largely determines where it sits on the EL ↔ EC spectrum.
- Consistency is a spectrum, not a binary choice; each step up the hierarchy buys stronger guarantees at a measurable latency cost.
- EL systems defer consistency work to CRDTs, read repair, and anti-entropy; EC systems pay the coordination cost up front through quorums, consensus, or synchronized time.
The Practical Implication:
When evaluating distributed systems, don't just ask "Is this CP or AP?" Ask:
- During normal operation, does it favor low latency (EL) or strong consistency (EC)?
- What does its default replication and consistency configuration actually guarantee, and what does stronger consistency cost in latency?
- Can consistency be tuned per operation, so different parts of your workload can make different trade-offs?
These questions address the 'Else' clause—the behavior that affects your users every single time they interact with your system.
What's next:
The following page explores the latency vs. consistency trade-off in technical depth—with specific numbers, formulas, and strategies for optimizing within the constraints PACELC reveals.
You now understand what happens during normal operation in distributed systems—the 'Else' clause of PACELC. You've seen how different replication strategies, consistency models, and coordination patterns create the EL vs EC spectrum. Next, we'll dive deep into the latency vs consistency trade-off with concrete analysis.