Amazon famously reported that every 100 milliseconds of latency cost it 1% in sales. In high-frequency trading, firms spend billions on infrastructure to shave microseconds off transaction times. At Google, adding 500ms to search result delivery caused a 20% drop in traffic.
These aren't abstract numbers—they represent the profound impact of latency on user behavior and business outcomes. When we choose linearizability, we're accepting latency penalties that can range from 10x to 100x compared to eventually consistent alternatives. Understanding these performance implications and how to mitigate them is essential for building systems that are both correct and usable.
By the end of this page, you will understand how linearizability affects end-to-end latency and user experience, tail latency challenges and their outsized impact, throughput degradation under load, performance optimization strategies, and how to design systems that minimize the performance penalty of strong consistency.
Human perception is remarkably sensitive to latency. Understanding these thresholds helps us evaluate whether linearizability's latency penalty is acceptable for a given use case.
The perception hierarchy (rough thresholds from usability research):
- Under ~100ms: feels instantaneous; the interface appears to react directly to input
- 100-300ms: a slight but perceptible delay
- 300ms-1s: noticeable lag; users sense the system is working
- 1-10s: flow is broken; users start to context-switch
- Over ~10s: attention is lost; users assume something is wrong
Let's map common linearizable operations to these thresholds:
| Scenario | Typical Latency | User Perception | Acceptable? |
|---|---|---|---|
| Same-datacenter consensus | 3-10ms | Instantaneous | ✓ Excellent |
| Same-region, multi-AZ | 10-30ms | Instantaneous | ✓ Good |
| Cross-coast US (NY↔SF) | 60-100ms | Noticeable | ~ Marginal |
| Cross-Atlantic (US↔EU) | 80-150ms | Noticeable delay | ✗ Poor for interactive |
| Global (US↔Asia) | 150-300ms | Obvious delay | ✗ Unacceptable for interactive |
| Multi-region quorum write | 200-400ms | Disruptive | ✗ Only for background ops |
Web applications rarely make a single database call. A typical page load involves:
User Request
→ Authentication check (linearizable) : 50ms
→ User profile fetch (linearizable) : 50ms
→ Permission check (linearizable) : 50ms
→ Business data query (linearizable) : 50ms
→ Audit log write (linearizable) : 50ms
Total : 250ms (just DB calls)
+ Application processing : 50ms
+ Network round-trip to user : 20ms
Grand Total: 320ms
With 5 linearizable operations, latency compounds rapidly. The same application with eventual consistency might complete in 80ms total—4x faster.
The impact varies dramatically by workload type: interactive, user-facing flows feel every millisecond of this overhead, while background jobs, batch pipelines, and asynchronous workflows can usually absorb the same latency without anyone noticing.
Every user interaction has a latency budget. Before choosing linearizability, calculate your total budget and allocate appropriately. If you have 300ms total and linearizability consumes 250ms of it, you've left no room for application logic, external APIs, or network variability.
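To make the budget explicit, here is a minimal sketch (call names and numbers are hypothetical, mirroring the breakdown above) that sums per-call latencies against an end-to-end target:

# Hypothetical latency budget check; call names and numbers are illustrative only.
BUDGET_MS = 300

linearizable_calls_ms = {
    "auth_check": 50,
    "profile_fetch": 50,
    "permission_check": 50,
    "business_query": 50,
    "audit_write": 50,
}

spent = sum(linearizable_calls_ms.values())
remaining = BUDGET_MS - spent
print(f"Linearizable calls: {spent}ms, remaining for everything else: {remaining}ms")
# With 250ms spent on coordination, only 50ms is left for application logic,
# external APIs, and network variability.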
Average latency tells an incomplete story. In distributed systems, tail latency (P99, P99.9) often matters more—and linearizability dramatically amplifies tail latency.
Consider a web request that queries 10 services in parallel:
If each service has P99 = 100ms:
P(at least one > 100ms) = 1 - (0.99)^10 = 9.6%
Almost 10% of user requests experience the worst-case latency!
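A quick way to see how fan-out amplifies the tail is to compute the probability that at least one of N parallel calls lands above its P99, using the same 1% assumption as above:

# Probability that a fan-out request hits at least one P99-slow backend call.
def p_any_slow(fan_out, per_call_tail_fraction=0.01):
    return 1 - (1 - per_call_tail_fraction) ** fan_out

for n in (1, 5, 10, 50, 100):
    print(f"fan-out {n:3d}: {p_any_slow(n):5.1%} of requests see a tail-latency call")
# fan-out 10 → ~9.6%; fan-out 100 → ~63%. The wider the fan-out, the more the tail dominates.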
With linearizability, tail latency gets worse because every operation waits on a quorum (so one slow node inside the quorum slows the whole request), leader elections and GC pauses stall entire consensus groups, and retries pile up behind every hiccup:
| Percentile | Linearizable (Multi-Region) | Eventually Consistent | Ratio |
|---|---|---|---|
| P50 (median) | 80ms | 5ms | 16x |
| P90 | 120ms | 10ms | 12x |
| P99 | 250ms | 30ms | 8x |
| P99.9 | 800ms | 100ms | 8x |
| P99.99 | 3,000ms | 500ms | 6x |
With linearizable reads requiring consensus, tail latency is bounded by the slowest quorum member:
# Simplified model of quorum read latency
def quorum_latency(node_latencies, quorum_size):
    # The request completes when the quorum_size-th fastest node has responded
    sorted_latencies = sorted(node_latencies)
    return sorted_latencies[quorum_size - 1]

# Example: 5 nodes, need 3 for a quorum
healthy = [5, 8, 12, 150, 180]   # ms
print(quorum_latency(healthy, 3))   # 12ms: we wait for the 3rd-fastest node

# But what if the usually fast third node is having a bad day?
degraded = [5, 8, 200, 150, 180]  # ms
print(quorum_latency(degraded, 3))  # 150ms: a slow node now sits inside the quorum
Once more nodes are slow than the quorum can exclude (here, three of the five), at least one slow node must be part of the quorum, and the whole operation waits for it.
During system instability, tail latency explodes:
| Event | Duration | Latency Impact |
|---|---|---|
| Leader election | 2-5s | All operations block or timeout |
| Follower reconnection | 1-10s | Quorum may be marginal |
| Disk GC pause | 50-500ms | Affects consensus round |
| Network micro-partition | Variable | May trigger election |
| Clock jump | Variable | Lease-based reads may fail |
If your SLA promises 'average latency < 100ms,' you're masking real user experience. A system with 10ms average but 5-second P99 feels broken to 1% of users. With linearizability, those tail latencies are larger and harder to control. Define SLAs at P99 or P99.9 to capture real user experience.
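Percentiles are cheap to compute from raw samples, so there is little excuse for average-only SLAs. A minimal sketch using synthetic, heavy-tailed latencies (illustrative only; substitute real measurements):

import numpy as np

rng = np.random.default_rng(42)
latencies_ms = rng.lognormal(mean=2.5, sigma=1.2, size=100_000)  # synthetic heavy-tailed data

print(f"mean: {latencies_ms.mean():.1f} ms")   # the average hides the tail
for p in (50, 90, 99, 99.9):
    print(f"P{p}: {np.percentile(latencies_ms, p):.1f} ms")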
Linearizable systems exhibit non-linear throughput degradation as load increases. Understanding this behavior is critical for capacity planning.
Eventually consistent systems often show linear throughput scaling until hardware limits:
[Eventually Consistent]

 Throughput                     ┌────────── HW Limit
     │              ●───●───●───●───●
     │             /
     │            ●
     │           /
     │          ●
     │         /
     │        ●
     └────────────────────────────────►
                                     Load
Linearizable systems hit saturation earlier due to coordination:
[Linearizable]

 Throughput           ┌────────── Consensus Limit
     │           ●────●
     │          /      ╲
     │         ●        ╲──────── Degradation under overload
     │        /
     │       ●
     │      /
     │     ●
     └────────────────────────────────►
                                     Load
Let's model throughput for a Raft cluster handling key-value writes:
Baseline capacity (5-node cluster, single region) is roughly 12,800 writes per second. Observed behavior as load increases:
| Load (% capacity) | Throughput | P50 Latency | P99 Latency | Error Rate |
|---|---|---|---|---|
| 10% | 1,280 ops/s | 5ms | 15ms | 0% |
| 30% | 3,840 ops/s | 6ms | 20ms | 0% |
| 50% | 6,400 ops/s | 8ms | 35ms | 0% |
| 70% | 8,960 ops/s | 15ms | 80ms | 0.1% |
| 90% | 11,520 ops/s | 40ms | 500ms | 2% |
| 100% | 12,000 ops/s | 150ms | 2,000ms | 10% |
| 120% | 10,000 ops/s ↓ | 500ms | 5,000ms | 30% |
Notice the non-linear degradation: at 90% load, latency spikes 8x from baseline. At overload (120%), throughput actually decreases as the system spends more time handling retries and timeouts than processing new requests.
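The shape of this curve follows directly from queueing theory. A toy M/M/1-style sketch (purely illustrative, not calibrated to any real system) shows why latency explodes as utilization approaches saturation:

# Toy M/M/1 model: mean latency = service_time / (1 - utilization).
SERVICE_TIME_MS = 5.0  # assumed per-operation consensus service time

def mean_latency_ms(utilization):
    if utilization >= 1.0:
        return float("inf")  # past saturation the queue grows without bound
    return SERVICE_TIME_MS / (1.0 - utilization)

for load in (0.10, 0.30, 0.50, 0.70, 0.90, 0.99):
    print(f"{load:4.0%} load → ~{mean_latency_ms(load):6.1f} ms mean latency")
# Latency doubles by 50% load and is ~10x baseline at 90%, echoing the table above.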
To prevent catastrophic degradation, linearizable systems need aggressive load management:
Strategies for load management:
1. Client-side rate limiting
- Reject requests before they hit the database
- Token bucket per client/tenant
2. Queue depth limits
- Fast-reject when pending operations exceed threshold
- Return 'overloaded' error rather than timeout
3. Adaptive concurrency control
- Reduce in-flight requests when latency increases
- AIMD (Additive Increase, Multiplicative Decrease); see the sketch after this list
4. Priority queues
- Critical operations (leader election, health checks) get priority
- User operations shed first under pressure
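As referenced in strategy 3, here is a minimal AIMD concurrency-limiter sketch; the constants (initial limit, backoff factor, latency target) are illustrative assumptions:

# Minimal AIMD concurrency limiter; every constant here is an illustrative assumption.
class AIMDLimiter:
    def __init__(self, initial_limit=100, min_limit=10, max_limit=1000):
        self.limit = initial_limit          # max in-flight requests allowed
        self.min_limit = min_limit
        self.max_limit = max_limit

    def on_response(self, latency_ms, target_latency_ms=50):
        if latency_ms > target_latency_ms:
            # Multiplicative decrease: back off quickly when latency degrades.
            self.limit = max(self.min_limit, int(self.limit * 0.7))
        else:
            # Additive increase: probe for extra capacity slowly.
            self.limit = min(self.max_limit, self.limit + 1)

limiter = AIMDLimiter()
for observed_ms in (20, 30, 120, 200, 40):
    limiter.on_response(observed_ms)
    print(f"observed {observed_ms:3d}ms → in-flight limit now {limiter.limit}")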
For linearizable systems, plan for 50-60% of theoretical maximum capacity. This headroom prevents latency spikes during load variance and leaves room for consensus overhead during leader elections or temporary node slowdowns. Operating consistently above 70% is a recipe for cascading failures.
While linearizability has inherent costs, several strategies can minimize the performance impact.
The most effective optimization is to limit where linearizability is applied:
Hybrid Architecture:
┌─────────────────────────────────────────────────────────┐
│                    Application Layer                     │
└─────────────────┬───────────────────────┬───────────────┘
                  │                       │
        ┌─────────▼────────┐    ┌─────────▼─────────┐
        │   Linearizable   │    │    Eventually     │
        │      Store       │    │    Consistent     │
        │  (Coordination)  │    │      Store        │
        │  - Locks         │    │  - User sessions  │
        │  - Counters      │    │  - Caches         │
        │  - Config        │    │  - Analytics      │
        └──────────────────┘    └───────────────────┘
Examples: etcd or ZooKeeper for locks, leader election, and configuration, paired with an eventually consistent store such as Cassandra or DynamoDB for sessions, caches, and analytics data.
Amortize consensus cost across multiple operations:
Batching at the client:
# Instead of:
for item in items:
    db.write(item)          # Each write incurs its own consensus round

# Batch writes:
db.write_batch(items)       # Single consensus round for the whole batch
Server-side batching:
Leader collects operations for a short window (1-5ms)
Single consensus round for entire batch
Latency increases slightly, throughput increases dramatically
Pipelining:
Start consensus(batch1)
Start consensus(batch2) -- don't wait for batch1
Start consensus(batch3)
...
Receive batch1 acknowledgment
Receive batch2 acknowledgment
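A compact asyncio sketch of that pipelining idea; the 50ms sleep is a stand-in for a real consensus round, not an actual replication call:

import asyncio

async def consensus_round(batch_id):
    await asyncio.sleep(0.05)          # stand-in for one ~50ms replication round-trip
    return batch_id

async def pipelined_commit(num_batches=5):
    # Start every round immediately instead of awaiting each one in turn,
    # then process acknowledgments in submission order.
    rounds = [asyncio.create_task(consensus_round(i)) for i in range(num_batches)]
    for r in rounds:
        print(f"batch {await r} committed")

asyncio.run(pipelined_commit())        # ~50ms wall-clock instead of ~250ms sequential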
| Batch Size | Ops/Second | Per-Op Latency | Throughput Gain |
|---|---|---|---|
| 1 (no batching) | 1,000 | 5ms | 1x (baseline) |
| 10 | 8,000 | 12ms | 8x |
| 50 | 25,000 | 25ms | 25x |
| 100 | 40,000 | 40ms | 40x |
| 500 | 60,000 | 100ms | 60x |
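The server-side window described above can be sketched as a simple accumulator that flushes when the batch fills up or the window expires; the window length, batch cap, and commit_batch function are assumptions for illustration:

import time

def commit_batch(ops):
    # Hypothetical single consensus round covering the whole batch.
    print(f"committing {len(ops)} ops in one consensus round")

class MicroBatcher:
    """Collects writes for a short window, then commits them together."""
    def __init__(self, max_batch=100, window_ms=2):
        self.max_batch = max_batch
        self.window_s = window_ms / 1000.0
        self.pending = []
        self.window_start = None

    def submit(self, op):
        if not self.pending:
            self.window_start = time.monotonic()
        self.pending.append(op)
        window_elapsed = time.monotonic() - self.window_start
        if len(self.pending) >= self.max_batch or window_elapsed >= self.window_s:
            self.flush()

    def flush(self):
        if self.pending:
            commit_batch(self.pending)
            self.pending = []

batcher = MicroBatcher()
for i in range(250):
    batcher.submit(f"op-{i}")
batcher.flush()  # commit any remainder at shutdown or on a timer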
Avoid consensus for reads by using leader leases:
Leader heartbeat:
1. Leader sends heartbeat to followers
2. Followers acknowledge with bounded timestamp
3. Leader knows it's still leader for lease_duration
Read serving:
- While lease is valid, leader serves reads locally
- No consensus required for reads
- Latency drops from consensus_rtt (~100ms) to local_read (~1ms)
Risk:
- Clock skew can violate linearizability
- Requires bounded clock drift (NTP or better)
Trade-off: Lease-based reads reduce latency 10-100x but introduce clock skew risks. Systems like etcd and ZooKeeper use this with careful clock management.
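A minimal sketch of the validity check a leader might run before serving a local read; the lease bookkeeping and drift bound are assumptions for illustration:

import time

MAX_CLOCK_DRIFT_S = 0.010   # assumed worst-case drift relative to followers (10ms)

def can_serve_local_read(lease_expiry):
    # Serve locally only if the lease is still valid even under worst-case drift.
    return time.monotonic() + MAX_CLOCK_DRIFT_S < lease_expiry

# On each heartbeat acknowledged by a quorum, the leader extends its lease:
lease_expiry = time.monotonic() + 2.0   # e.g. a 2-second lease

if can_serve_local_read(lease_expiry):
    print("serving read locally (no consensus round)")
else:
    print("lease expired or about to: fall back to a quorum read")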
Optimistically process while consensus is in-flight:
Client perspective:
1. Send write to leader
2. Leader acknowledges speculatively (before consensus)
3. Client continues with next operation
4. Leader confirms or rolls back after consensus
Benefit: Reduced perceived latency
Risk: Must handle rollback gracefully in application
Protocols like Multi-Paxos amortize leader election across many operations. Once a leader is elected, subsequent operations require only one round-trip (accept) instead of two (prepare + accept). This optimization is standard in production systems like Spanner and CockroachDB.
Beyond point optimizations, architectural patterns can fundamentally improve linearizable system performance.
Route operations to replicas in the same region:
┌─────────────────────────────────────────────────────────┐
│ Global Routing Layer │
└───────────┬────────────────────────────┬────────────────┘
│ │
┌───────▼───────┐ ┌───────▼───────┐
│ US Clients │ │ EU Clients │
└───────┬───────┘ └───────┬───────┘
│ │
┌───────▼───────┐ ┌───────▼───────┐
│ US Leader │◄──────────►│ EU Leader │
│ Raft Group │ (async │ Raft Group │
│ (shards │ cross- │ (shards │
│ 1-100) │ region) │ 101-200) │
└───────────────┘ └───────────────┘
By partitioning data geographically, most operations stay local: a US user's shards reach consensus entirely within US datacenters, EU shards stay within EU datacenters, and only the rare cross-region operation pays intercontinental latency.
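A toy routing function matching the diagram above; the shard ranges and region names are assumptions:

# Hypothetical shard → home-region mapping mirroring the diagram above.
SHARD_HOME_REGION = {range(1, 101): "us", range(101, 201): "eu"}

def home_region(shard_id):
    for shard_range, region in SHARD_HOME_REGION.items():
        if shard_id in shard_range:
            return region
    raise ValueError(f"unknown shard {shard_id}")

def route_write(client_region, shard_id):
    region = home_region(shard_id)
    # Same-region writes reach consensus locally (~10ms); cross-region writes pay WAN latency.
    kind = "local" if region == client_region else "cross-region"
    return f"{kind} write via the {region} Raft group"

print(route_write("us", 42))    # local write via the us Raft group
print(route_write("us", 150))   # cross-region write via the eu Raft group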
Reduce replication cost with lightweight witnesses:
Traditional 5-node:
- 5 full replicas (all data, all compute)
- Cost: 5x storage, 5x CPU
Witness pattern:
- 3 full replicas + 2 witness replicas
- Witnesses: store only log metadata, not full data
- Participate in voting but not in read serving
- Cost: 3x storage, 3.5x CPU
Benefit: Same fault tolerance, 40% cost reduction
If bounded staleness is acceptable, read from followers:
Follower read protocol:
1. Client requests read with max_staleness = 100ms
2. Follower checks: applied_index >= (leader_index - staleness_window)
3. If yes: serve read locally
4. If no: either wait for replication or forward to leader
Benefit: Reads from any replica, distributed load
Trade-off: Reads may be up to max_staleness behind
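A sketch of the follower-side check from step 2; the field names and staleness bound are chosen for illustration:

# Follower-side bounded-staleness check; field names are illustrative.
def can_serve_stale_read(applied_index, leader_commit_index, staleness_window):
    # Serve locally if this follower is within the allowed staleness window (in log entries).
    return applied_index >= leader_commit_index - staleness_window

def handle_read(applied_index, leader_commit_index, staleness_window=100):
    if can_serve_stale_read(applied_index, leader_commit_index, staleness_window):
        return f"served locally at index {applied_index}"
    return "too stale: wait for replication or forward to the leader"

print(handle_read(applied_index=9_950, leader_commit_index=10_000))  # served locally
print(handle_read(applied_index=9_400, leader_commit_index=10_000))  # too stale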
Align consensus groups with access patterns to minimize cross-group operations:
// Anti-pattern: Random sharding
// Order ID: 12345 → Shard 45 (hash-based)
// User's other orders spread across many shards
// Query: "Get all orders for user X" → touches many shards

// Better: Entity-based sharding
// All of User X's data → Shard (hash(user_id))
// Query: "Get all orders for user X" → single shard
// Single consensus group handles all user-specific operations

// Sharding strategy:
shard_id = hash(entity_id) % num_shards

// Where entity_id is:
// - user_id for user-centric applications
// - tenant_id for multi-tenant SaaS
// - device_id for IoT platforms
// - geographic_region for location-based services

The best performance optimization for linearizable systems is avoiding coordination. Design your data model so that related data lives together, operations are shard-local, and cross-partition operations are rare exceptions rather than the common case.
Optimizing performance requires measuring it correctly. Linearizable systems have unique benchmarking and monitoring requirements.
Latency metrics (per operation type): P50, P99, and P99.9 for reads, writes, and compare-and-swap operations, broken down by region and by whether a read was served by the leader or a follower.
Throughput metrics: operations per second by operation type, achieved batch sizes, and the fraction of requests rejected by load shedding.
Consensus health metrics: replication lag per follower, leader election frequency, heartbeat round-trip times, and pending-operation queue depth.
# Key monitoring dashboard for linearizable systems

panels:
  # Latency Overview
  - title: "Write Latency Distribution"
    type: heatmap
    query: histogram_quantile({0.5, 0.9, 0.99, 0.999}, write_latency_bucket)

  - title: "Read Latency by Region"
    type: time_series
    query: histogram_quantile(0.99, read_latency_bucket) by (region)

  # Throughput
  - title: "Operations per Second"
    type: time_series
    query: rate(operations_total[1m]) by (operation_type)

  # Consensus Health
  - title: "Replication Lag"
    type: gauge
    query: max(leader_commit_index - follower_applied_index) by (node)
    thresholds: [100, 1000, 10000]  # entries behind

  - title: "Leader Elections"
    type: time_series
    query: increase(leader_elections_total[5m])
    alert_threshold: 1  # Any election is notable

  # Queue Health
  - title: "Pending Operations Queue Depth"
    type: time_series
    query: pending_operations_count
    alert_threshold: 1000

  # Error Rates
  - title: "Consensus Failures"
    type: time_series
    query: rate(consensus_failures_total[5m]) by (error_type)

1. Measure under realistic conditions: production-like data volumes, key distributions, concurrency, and real cross-region network latencies.
2. Test failure scenarios: leader failover, slow followers, and network partitions, not just the steady state.
3. Report the full picture: P50, P99, and P99.9 latencies alongside throughput, never averages alone.
4. Compare fairly: identical durability settings, replication factor, and hardware for every system under test.
Vendor benchmarks often show linearizable databases performing just 2-3x slower than eventually consistent alternatives. In reality, the gap may be 10-50x because benchmarks don't capture: network variability, garbage collection pauses, disk contention, and concurrent background operations like snapshotting.
We've explored the performance implications of linearizability in depth. The costs are real, but understanding them enables informed optimization and architectural decisions.
What's Next:
Now that we understand the performance implications, the next page explores when strong consistency is required—identifying use cases where linearizability isn't optional, and the consequences of choosing weaker guarantees.
You now understand the performance implications of linearizability: latency impacts on user experience, tail latency amplification, throughput degradation patterns, and optimization strategies. This knowledge enables you to design systems that are both correct and performant.