Understanding linearizability is valuable, but building systems on a solid foundation requires knowing which production systems actually provide these guarantees—and under what conditions. Not all "strongly consistent" databases are created equal, and the devil is in the details.
Some systems provide linearizability by default; others require specific configurations. Some maintain guarantees during network partitions (by sacrificing availability); others silently degrade. Some have been rigorously tested with tools like Jepsen; others rely on correctness claims that haven't been independently verified.
This page explores the major categories of systems providing strong consistency, their architectural approaches, and practical guidance for choosing among them.
By the end of this page, you will understand the major categories of linearizable systems (coordination services, NewSQL databases, cloud services), how each achieves linearizability, the trade-offs between different approaches, and how to evaluate systems' consistency claims. You'll be equipped to choose the right foundation for your distributed applications.
Coordination services are specialized systems designed specifically to provide linearizable operations for distributed coordination tasks: leader election, distributed locking, service discovery, and configuration management.
Architecture: a replicated ensemble of servers in which a leader sequences every write through the ZAB atomic broadcast protocol; a quorum of followers must acknowledge each update, while reads are served locally by whichever server the client is connected to.
Consistency Model:
```java
// Linearizable read in ZooKeeper: sync() makes the connected server
// catch up with the leader before the read executes.
zk.sync("/important/path", (rc, path, ctx) -> {
    try {
        byte[] data = zk.getData("/important/path", false, null);
        // This read is linearizable
    } catch (KeeperException | InterruptedException e) {
        // handle failure / retry
    }
}, null);

// Non-linearizable (but faster) read, served from local server state
byte[] staleData = zk.getData("/important/path", false, null);
// May return stale data!
```
Performance Characteristics: every write is quorum-committed through the leader, so write throughput is bounded (on the order of tens of thousands of operations per second for a small ensemble), while locally served reads scale with the number of followers at the cost of possible staleness. The major coordination services compare as follows:
| System | Protocol | Linearizable Reads | Best For | Jepsen Status |
|---|---|---|---|---|
| ZooKeeper | ZAB | With sync() | Java ecosystem, Kafka, Hadoop | Tested, issues found/fixed |
| etcd | Raft | Yes (default) | Kubernetes, cloud-native | Tested, generally solid |
| Consul | Raft | Yes (consistent mode) | Service mesh, HashiCorp stack | Tested, issues found/fixed |
| Chubby | Paxos | Yes | Google internal (not public) | Google internal testing |
Architecture: a Raft-replicated key-value store; all writes go through the Raft leader and commit once a quorum of members has persisted them. Every member holds a full copy of the data, exposed through an MVCC store with watch support.
Consistency Model:
```go
// etcd provides linearizable reads by default
resp, err := client.Get(ctx, "my-key")
// This read is linearizable

// Optional: serializable read (faster, may be stale)
resp, err = client.Get(ctx, "my-key", clientv3.WithSerializable())
// May return stale data
```
Why etcd Is Popular: linearizable reads by default, a simple gRPC API, built-in watch and lease primitives, and its role as the backing store for Kubernetes.
ZooKeeper, etcd, and Consul are designed for small amounts of critical data: configuration, leader information, and service registration. They are not built for high-volume application data; plan for a few thousand keys at most. For application data requiring linearizability, use a linearizable database.
NewSQL databases combine the scalability of NoSQL with the strong consistency and SQL interface of traditional relational databases. They use consensus protocols to provide linearizability (or strict serializability) while sharding data across many nodes.
Architecture: data is split into shards, each replicated across zones by its own Paxos group; multi-shard transactions coordinate via two-phase commit on top of Paxos, and commit timestamps come from TrueTime, Google's GPS- and atomic-clock-backed time service.
Consistency Model: external consistency (strict serializability). If transaction T1 commits before T2 starts, T2 is guaranteed to observe T1's effects, regardless of which regions the transactions touch.
The TrueTime Magic:
```text
TrueTime.now() returns an interval: [earliest, latest]
Uncertainty: typically 5-10 ms

Commit wait:
1. Assign commit timestamp T = TrueTime.now().latest
2. Wait until TrueTime.now().earliest > T
3. Transaction is now durable and globally ordered

Result: all transactions have a globally consistent ordering,
without requiring global consensus for reads
```
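To make the commit-wait rule concrete, here is a toy sketch in Go. The `Interval` type and `now` function are invented stand-ins for a TrueTime-like clock API; the point is only that waiting out the uncertainty guarantees the chosen timestamp lies in the past for every observer before the commit is acknowledged.

```go
package main

import (
	"fmt"
	"time"
)

// Interval is a hypothetical TrueTime-style reading: real time is
// guaranteed to lie somewhere in [Earliest, Latest].
type Interval struct {
	Earliest, Latest time.Time
}

// now simulates a clock with a fixed uncertainty bound epsilon.
func now(epsilon time.Duration) Interval {
	t := time.Now()
	return Interval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
}

// commitWait assigns a commit timestamp, then blocks until that
// timestamp is definitely in the past for all observers.
func commitWait(epsilon time.Duration) time.Time {
	commitTS := now(epsilon).Latest // step 1: pessimistic timestamp
	for !now(epsilon).Earliest.After(commitTS) {
		time.Sleep(time.Millisecond) // step 2: wait out the uncertainty
	}
	return commitTS // step 3: safe to acknowledge the commit
}

func main() {
	start := time.Now()
	ts := commitWait(5 * time.Millisecond)
	fmt.Printf("committed at %v after waiting %v\n", ts, time.Since(start))
}
```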
Performance: commit wait adds roughly the clock uncertainty (the 5-10 ms above) to every write, but in exchange snapshot reads at a chosen timestamp can be served by any sufficiently up-to-date replica without cross-region coordination.
Architecture: data is divided into ranges, each replicated with Raft; nodes are symmetric (any node can coordinate a query), the interface is the PostgreSQL wire protocol, and timestamps come from hybrid logical clocks (HLC) rather than specialized hardware.
Consistency Model: serializable isolation by default for all transactions, with linearizability for operations on a single range; unlike Spanner, CockroachDB does not promise strict serializability across independent transactions.
HLC vs TrueTime Trade-off:
TrueTime (Spanner):
- Requires specialized hardware (GPS, atomic clocks)
- Bounded uncertainty enables external consistency
- Commit wait adds latency but avoids remote coordination for reads
HLC (CockroachDB):
- Works on commodity hardware
- Cannot guarantee external consistency without coordination
- May require remote reads to ensure consistency
- Slightly higher read latency in some scenarios
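For intuition about the commodity-hardware side, here is a minimal hybrid logical clock sketch in Go, following the standard HLC update rules (take the max of physical time, the local timestamp, and any received timestamp; bump a logical counter to break ties). This is an illustrative toy, not CockroachDB's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HLC is a minimal hybrid logical clock: a physical component (wall
// clock, ms) plus a logical counter to order events within one tick.
type HLC struct {
	mu      sync.Mutex
	wall    int64 // highest physical timestamp seen (ms)
	logical int32 // tie-breaking counter
}

func physicalNow() int64 { return time.Now().UnixMilli() }

// Now advances the clock for a local or send event.
func (c *HLC) Now() (int64, int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if pt := physicalNow(); pt > c.wall {
		c.wall, c.logical = pt, 0
	} else {
		c.logical++ // same tick: order by counter
	}
	return c.wall, c.logical
}

// Update merges a timestamp received from another node, so the local
// clock never moves backward relative to messages it has seen.
func (c *HLC) Update(wall int64, logical int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	pt := physicalNow()
	switch {
	case pt > c.wall && pt > wall:
		c.wall, c.logical = pt, 0
	case wall > c.wall:
		c.wall, c.logical = wall, logical+1
	case c.wall > wall:
		c.logical++
	default: // equal wall components: merge counters
		if logical > c.logical {
			c.logical = logical
		}
		c.logical++
	}
}

func main() {
	var c HLC
	w, l := c.Now()
	c.Update(w+10, 0) // message from a node whose clock runs ahead
	w2, l2 := c.Now()
	fmt.Println(w, l, w2, l2) // timestamps only ever increase
}
```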
| Database | Consistency | Clock Mechanism | SQL Compatibility | Deployment |
|---|---|---|---|---|
| Spanner | Strict serializable | TrueTime | ANSI SQL | GCP only |
| CockroachDB | Serializable | HLC | PostgreSQL | Self-hosted or cloud |
| TiDB | Snapshot isolation | TSO (timestamp oracle) | MySQL | Self-hosted or cloud |
| YugabyteDB | Serializable | Hybrid clocks | PostgreSQL | Self-hosted or cloud |
| FoundationDB | Strict serializable | Centralized sequencer | KV only (layers) | Self-hosted |
Many NewSQL databases support multiple isolation levels. The default isn't always the strongest! Always verify your configuration provides the consistency level you need. For example, CockroachDB defaults to serializable, but PostgreSQL defaults to read committed.
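One practical way to check, sketched in Go against PostgreSQL (the connection string and database name are placeholders): query the session's actual isolation level rather than trusting the documentation.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver
)

func main() {
	// Placeholder DSN; adjust for your deployment.
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Ask what isolation level this session actually runs at.
	var level string
	if err := db.QueryRow("SHOW transaction_isolation").Scan(&level); err != nil {
		log.Fatal(err)
	}
	fmt.Println("default isolation:", level) // e.g. "read committed"

	// Raise the database default if you need serializable everywhere.
	if _, err := db.Exec(
		"ALTER DATABASE app SET default_transaction_isolation = 'serializable'",
	); err != nil {
		log.Fatal(err)
	}
}
```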
FoundationDB deserves special attention as it represents a unique approach: provide a simple, rigorously correct linearizable key-value store, then build higher-level abstractions as "layers" on top.
Core components: a sequencer that assigns commit versions, proxies that accept client writes, resolvers that detect conflicts, log servers that make transactions durable, and storage servers that serve reads.
The Sequencer Model:
Unlike Raft/Paxos, where each operation goes through consensus, FoundationDB routes every commit through a centralized sequencer (sketched in code after the steps below):
1. Client starts transaction, reads from storage servers
2. Client sends writes to proxies
3. Proxy requests commit timestamp from sequencer
4. Sequencer assigns monotonically increasing timestamp
5. Resolvers check for conflicts against committed transactions
6. If no conflicts, log servers persist the transaction
7. Transaction is committed with guaranteed ordering
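A toy sketch of steps 3-7 in Go: a single sequencer hands out monotonically increasing versions, and a resolver aborts any transaction whose reads were overwritten after its read version. The `Sequencer` and `Resolver` names are illustrative, not FoundationDB's API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Sequencer hands out strictly increasing commit versions; because a
// single process assigns them, ordering needs no per-commit consensus.
type Sequencer struct{ version atomic.Int64 }

func (s *Sequencer) Next() int64 { return s.version.Add(1) }

// Resolver remembers, per key, the last version that wrote it, and
// rejects transactions whose reads have been overwritten since.
type Resolver struct {
	mu        sync.Mutex
	lastWrite map[string]int64
}

type Txn struct {
	ReadVersion int64
	Reads       []string
	Writes      map[string]string
}

func (r *Resolver) TryCommit(t Txn, commitVersion int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, k := range t.Reads {
		if r.lastWrite[k] > t.ReadVersion {
			return false // conflict: key changed after we read it
		}
	}
	for k := range t.Writes {
		r.lastWrite[k] = commitVersion
	}
	return true // no conflicts: log servers would persist it here
}

func main() {
	seq := &Sequencer{}
	res := &Resolver{lastWrite: map[string]int64{}}

	t1 := Txn{ReadVersion: seq.Next(), Reads: []string{"x"},
		Writes: map[string]string{"x": "1"}}
	fmt.Println(res.TryCommit(t1, seq.Next())) // true

	// t2 read "x" before t1 committed, so it must abort and retry.
	t2 := Txn{ReadVersion: 1, Reads: []string{"x"},
		Writes: map[string]string{"y": "2"}}
	fmt.Println(res.TryCommit(t2, seq.Next())) // false
}
```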
Advantages: ordering requires one round trip to a single process instead of a consensus round per operation, which yields very high commit throughput and a simple, centralized conflict-detection pipeline.
Handling sequencer failure: the sequencer orders transactions but holds no durable state, so when it fails the cluster runs a recovery that recruits a new transaction subsystem (including a new sequencer); in-flight transactions abort and clients transparently retry.
FoundationDB provides strict serializability for key-value operations. Higher-level abstractions are built as layers:
```text
┌─────────────────────────────────────────────────┐
│                Application Layer                │
├─────────────────────────────────────────────────┤
│ Document Layer │ SQL Layer │ Graph Layer │ ...  │
├─────────────────────────────────────────────────┤
│                  Record Layer                   │
│  (structured records, indexes, transactions)    │
├─────────────────────────────────────────────────┤
│                FoundationDB Core                │
│  (ordered key-value, strict serializability)    │
└─────────────────────────────────────────────────┘
```
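To see what a layer means in practice, here is a toy sketch in Go: a record and its secondary-index entry are encoded as keys under different prefixes of one ordered key-value space, so a single transactional write keeps them consistent. The in-memory `KV` type stands in for FoundationDB's core; all names are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// KV is a toy ordered key-value store standing in for the FDB core.
type KV struct{ m map[string]string }

func NewKV() *KV { return &KV{m: map[string]string{}} }

func (kv *KV) Set(k, v string) { kv.m[k] = v }

// Scan returns keys with the given prefix in order, like a range read.
func (kv *KV) Scan(prefix string) []string {
	var keys []string
	for k := range kv.m {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// The "layer": store a user record and a secondary-index entry under
// different key prefixes; in real FoundationDB both writes would sit
// inside one strictly serializable transaction.
func PutUser(kv *KV, id, email string) {
	kv.Set("user/"+id, email)         // primary record
	kv.Set("email/"+email+"/"+id, "") // index entry
}

func main() {
	kv := NewKV()
	PutUser(kv, "42", "alice@example.com")
	// Look up by email via a prefix scan over the index space.
	fmt.Println(kv.Scan("email/alice@example.com/"))
}
```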
Notable users: Apple (CloudKit) and Snowflake (its metadata store) both run FoundationDB at very large scale.
FoundationDB pioneered rigorous simulation testing:
Deterministic simulation:
- Run the entire cluster in a single-threaded simulator
- Control all randomness (network, disk, clocks)
- Inject failures systematically
- Replay bugs deterministically

The result: millions of simulated hours of operation, bugs found before they reach production, and high confidence in correctness claims.
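The essential trick is seeding every source of nondeterminism so that a failing run can be replayed exactly. A minimal illustration in Go (the fault-injection logic is invented for the example):

```go
package main

import (
	"fmt"
	"math/rand"
)

// step simulates one unit of work; with probability p the injected
// fault fires (a dropped message, a disk error, a clock jump...).
func step(rng *rand.Rand, p float64) bool { return rng.Float64() < p }

// simulate runs a whole "cluster" single-threaded from one seed, so
// any run, including a failing one, can be replayed bit-for-bit.
func simulate(seed int64, steps int) (faults int) {
	rng := rand.New(rand.NewSource(seed))
	for i := 0; i < steps; i++ {
		if step(rng, 0.01) {
			faults++
		}
	}
	return faults
}

func main() {
	// The same seed always yields the same fault schedule:
	fmt.Println(simulate(42, 10000)) // deterministic
	fmt.Println(simulate(42, 10000)) // identical result
}
```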
FoundationDB's philosophy is that a small, correct core is better than a feature-rich, complex one. By providing only linearizable key-value operations, FoundationDB can be exhaustively tested. Higher-level features (SQL, documents) are built as layers that inherit the core's correctness guarantees.
Major cloud providers offer managed services with strong consistency guarantees, often building on the technologies discussed above.
DynamoDB:
```python
# DynamoDB with strong consistency
response = dynamodb.get_item(
    TableName='users',
    Key={'user_id': {'S': 'alice'}},
    ConsistentRead=True  # Linearizable read
)

# DynamoDB transaction (serializable)
response = dynamodb.transact_write_items(
    TransactItems=[
        {'Put': {...}},
        {'Update': {...}},
        {'Delete': {...}},
    ]
)  # All-or-nothing, serializable
```
Aurora: MySQL- and PostgreSQL-compatible engines on a shared, distributed storage layer. Reads through the writer endpoint are strongly consistent; read replicas replicate asynchronously and may return stale data.
Amazon QLDB: a ledger database with an immutable, cryptographically verifiable journal; transactions use optimistic concurrency control with serializable isolation.
| Service | Provider | Default Consistency | Strong Consistency Option |
|---|---|---|---|
| Spanner | GCP | Strict serializable | Default (can't be weakened) |
| Cloud SQL | GCP | Serializable | Default for single primary |
| DynamoDB | AWS | Eventual | ConsistentRead=true |
| Aurora | AWS | Read committed | Writer endpoint, no read replicas |
| Cosmos DB | Azure | Config-dependent | Strong consistency level |
| Firestore | GCP | Linearizable | Default |
Cosmos DB offers five consistency levels, from strongest to weakest:
1. Strong (Linearizable)
- Reads guaranteed to return most recent committed write
- Single region only for writes
- Highest latency
2. Bounded Staleness
- Reads lag behind writes by at most K versions or T time
- Configurable K and T
3. Session
- Monotonic reads/writes within a session
- Different sessions may see different orders
4. Consistent Prefix
- Reads never see out-of-order writes
- May be arbitrarily stale
5. Eventual
- No ordering guarantees
- Highest throughput
Important: Strong consistency in Cosmos DB restricts writes to a single region, eliminating the multi-region write capability. This is a direct manifestation of the CAP theorem.
Firestore provides linearizable consistency by default:
```javascript
// All Firestore operations are linearizable
await db.collection('users').doc('alice').set({name: 'Alice'});

// This read will see the write above
const doc = await db.collection('users').doc('alice').get();
```
Firestore uses a similar architecture to Spanner, providing external consistency for all operations.
Cloud providers use terms like 'strong consistency' inconsistently. Some mean linearizable, others mean read-your-writes. Always verify: (1) What operations are covered? (2) What happens during partitions? (3) Is it default or opt-in? (4) Are there regional restrictions?
Not all consistency claims are equally trustworthy. Here's how to evaluate whether a system actually provides the guarantees it claims.
Watch out for: "strong consistency" that on inspection means only read-your-writes or session guarantees; benchmark numbers measured under weaker settings than the documented guarantees; and guarantees that quietly apply only within a single region, session, or partition.
For any system claiming linearizability:
1. Which operations are linearizable?
- All operations? Writes only? Specific APIs?
2. What happens during network partitions?
- Operations fail? Silent degradation? Best-effort?
3. What's the default configuration?
- Is linearizability default or opt-in?
4. Has it been independently tested?
- Jepsen report? Academic analysis? Internal testing?
5. What's the latency and availability trade-off?
- How much latency does strong consistency add?
- What's the availability target?
Jepsen (jepsen.io) has become the industry standard for testing distributed systems' consistency claims:
What Jepsen does: drives concurrent operations against a real cluster while injecting network partitions, process crashes, and clock skew, then analyzes the recorded operation history against formal consistency models to find violations.
Selected Jepsen findings:
| System | Claim | Finding | Impact |
|---|---|---|---|
| MongoDB | Linearizable with majority read concern | Violated under network partitions | Fixed in later versions |
| Redis (Redlock) | Distributed lock | Unsafe under timing assumptions | Known limitation |
| etcd | Linearizable | Generally correct, minor issues found | Issues fixed |
| CockroachDB | Serializable | Generally correct, edge cases found | Issues fixed |
| PostgreSQL | Serializable | Correct | Validated |
Key insight: Many systems that claim linearizability have had violations discovered by Jepsen. Check for a Jepsen report and whether issues were fixed.
Even with verified systems, always understand the specific configuration you're using. Run your own consistency tests if the operation is critical. A system may be linearizable in general but misconfigured in your specific deployment.
With the landscape understood, here's how to choose the right linearizable system for your needs.
| Use Case | Recommended System(s) | Rationale |
|---|---|---|
| Kubernetes cluster state | etcd | Native integration, well-tested, purpose-built |
| Leader election (general) | etcd, ZooKeeper, Consul | Mature, well-understood, low overhead |
| Distributed locks | etcd, Redis (single-node) | Coordination services handle this natively |
| Service discovery | Consul, etcd | Built-in features, health checking |
| OLTP with SQL | CockroachDB, Spanner, TiDB | SQL interface, scalable, transactions |
| Financial transactions | Spanner, FoundationDB | Strictest guarantees, proven at scale |
| Global distribution | Spanner, CockroachDB | Multi-region consensus built-in |
| Apple-scale workloads | FoundationDB | Proven, simulation-tested |
| Simple key-value | FoundationDB | Minimal overhead, strict guarantees |
| Serverless/managed | Firestore, Spanner, Cosmos DB | No operational overhead |
1. Self-Hosted vs Managed
Self-hosted (etcd, CockroachDB, FoundationDB):
+ Full control, no vendor lock-in
+ Potentially lower cost at scale
- Operational complexity
- Must handle upgrades, monitoring, recovery
Managed (Spanner, Cosmos DB, Firestore):
+ Zero operational overhead
+ Provider handles consistency verification
- Vendor lock-in
- Higher cost, especially at scale
2. Generality vs Specialization
General-purpose (CockroachDB, Spanner):
+ SQL interface, broad use cases
+ Feature-rich (transactions, indexes, schemas)
- Higher overhead for simple use cases
Specialized (etcd, ZooKeeper):
+ Low overhead for coordination tasks
+ Purpose-built APIs
- Not suitable for application data
- Limited data model
3. Geographic Distribution
Single-region:
- Most systems work well
- Lower latency
- Simpler operations
Multi-region:
- Spanner: Best-in-class (TrueTime)
- CockroachDB: Good (HLC)
- Most others: Significant latency penalty
For many applications, a single PostgreSQL instance with synchronous replication provides linearizable guarantees with minimal complexity. Only move to distributed NewSQL when you've outgrown PostgreSQL's capacity—typically 10,000+ writes/second or terabytes of data.
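If you take this route, the relevant knobs are PostgreSQL's synchronous replication settings; a minimal postgresql.conf sketch (standby names are placeholders), with linearizable reads served from the primary:

```
# postgresql.conf on the primary
synchronous_standby_names = 'ANY 1 (standby1, standby2)'  # quorum commit
synchronous_commit = on   # COMMIT waits until a standby has flushed the WAL
```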
We've surveyed the landscape of systems providing strong consistency, from specialized coordination services to globally distributed databases. Each approach has trade-offs, and the right choice depends on your specific requirements.
Across five pages, you've developed a deep understanding of strong consistency in distributed systems: what linearizability means, why it costs latency and availability, when applications truly need it, and which production systems deliver it.
You're now equipped to make informed decisions about consistency requirements in your distributed systems—knowing when to pay the linearizability tax and when weaker consistency suffices.
Congratulations! You've mastered strong consistency in distributed systems. You understand linearizability deeply, know its costs and when it's required, and can evaluate real-world systems' consistency claims. This knowledge is essential for building correct, reliable distributed applications.