Named after the creature that survives nuclear apocalypse, CockroachDB was designed to survive any infrastructure failure. Founded by ex-Google engineers who worked on Spanner, CockroachDB brings globally-distributed, serializable transactions to organizations without requiring Google-scale infrastructure or vendor lock-in.
CockroachDB is source-available, PostgreSQL wire-compatible, and designed to run on commodity hardware in any cloud or on-premises.
This accessibility has made CockroachDB one of the most widely adopted NewSQL databases, used by companies like Netflix, Bose, SpaceX, and thousands of startups requiring reliable distributed databases.
By the end of this page, you will understand CockroachDB's architecture, how it achieves Spanner-like consistency without specialized hardware (using hybrid logical clocks), its PostgreSQL compatibility layer, data distribution mechanisms, and when to choose CockroachDB for your applications.
CockroachDB was founded in 2015 by Spencer Kimball, Peter Mattis, and Ben Darnell—all former Google engineers with deep distributed systems experience.
The Founding Insight
The founders believed that Spanner's innovations shouldn't require Google's specialized infrastructure. If the core architectural principles could work with commodity hardware and without atomic clocks, every organization could benefit from globally-distributed ACID transactions.
Key Design Goals
The founders set out to make data easy: scale horizontally by simply adding nodes, survive machine, datacenter, and regional failures without manual intervention, guarantee serializable ACID transactions, and expose everything through familiar SQL.
The Name
The name 'CockroachDB' reflects the resilience goal: cockroaches famously survive conditions that would kill other species. The database is designed to survive node failures, network partitions, and even regional outages while maintaining correctness.
Open Source Model
CockroachDB uses the Business Source License (BSL): the source code is publicly available and free for most uses, each release automatically converts to a permissive open-source license after three years, and the main restriction is offering CockroachDB itself as a commercial database-as-a-service.
This model balances open development with sustainable business.
By choosing PostgreSQL wire compatibility, CockroachDB inherits decades of tooling investment. Applications can often migrate by changing only the connection string. Popular ORMs (SQLAlchemy, Hibernate, ActiveRecord), drivers (psycopg2, JDBC, Go pq), and tools (pgAdmin, DBeaver) work directly.
CockroachDB uses a layered, symmetric architecture where every node is equal and can serve any request.
The Layered Stack
CockroachDB's architecture consists of five major layers: SQL (parsing, planning, and optimizing queries), Transactional (ACID semantics across arbitrary keys), Distribution (presenting a single sorted key space divided into ranges), Replication (Raft consensus per range), and Storage (a per-node embedded key-value store).
Symmetric Architecture
Unlike systems with dedicated master/coordinator nodes, every CockroachDB node is identical: any node can accept client connections, plan and execute SQL, route requests to the relevant data, and store replicas. There are no special master, coordinator, or metadata nodes to provision or fail over.
This symmetry simplifies operations: add nodes, remove nodes, and CockroachDB automatically rebalances.
Ranges: The Unit of Distribution
CockroachDB divides data into ranges: contiguous, sorted spans of the key space. A table starts as a single range, splits automatically as it grows (512MB by default), and each range is replicated (three ways by default) with its own Raft consensus group. The sketch below shows the basic key-to-range lookup.
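To make that concrete, here is a minimal Go sketch, not CockroachDB's actual code, of locating the range that owns a key by binary-searching sorted range descriptors; the `RangeDescriptor` type and node IDs are illustrative assumptions.

```go
package main

import (
    "fmt"
    "sort"
)

// RangeDescriptor is a simplified model of a range: a contiguous,
// sorted span of the key space [StartKey, EndKey).
type RangeDescriptor struct {
    StartKey string
    EndKey   string
    Replicas []int // node IDs holding replicas (3 by default)
}

// findRange locates the range owning key via binary search over range
// start keys. CockroachDB performs this lookup against meta ranges;
// this sketch assumes all descriptors fit in memory.
func findRange(ranges []RangeDescriptor, key string) RangeDescriptor {
    // ranges must be sorted by StartKey and cover the key space.
    i := sort.Search(len(ranges), func(i int) bool {
        return ranges[i].StartKey > key
    })
    return ranges[i-1] // last range whose StartKey <= key
}

func main() {
    ranges := []RangeDescriptor{
        {StartKey: "", EndKey: "g", Replicas: []int{1, 2, 3}},
        {StartKey: "g", EndKey: "p", Replicas: []int{2, 3, 4}},
        {StartKey: "p", EndKey: "\xff", Replicas: []int{1, 3, 5}},
    }
    r := findRange(ranges, "k")
    fmt.Printf("key %q -> range [%q, %q) on nodes %v\n", "k", r.StartKey, r.EndKey, r.Replicas)
}
```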
CockroachDB's key innovation over Spanner is achieving similar consistency guarantees without specialized time hardware. It uses Hybrid Logical Clocks (HLC) instead of TrueTime.
HLC Structure
A Hybrid Logical Clock timestamp has two components:
```go
// Hybrid Logical Clock timestamp
type Timestamp struct {
    // Physical part: wall clock time in nanoseconds.
    // Advances with real time, subject to clock skew.
    WallTime int64

    // Logical part: counter for ordering events at the same wall time.
    // Ensures causal ordering even with clock skew.
    Logical int32
}

// Compare two timestamps
func (a Timestamp) Less(b Timestamp) bool {
    if a.WallTime != b.WallTime {
        return a.WallTime < b.WallTime
    }
    return a.Logical < b.Logical
}

// Example timestamps:
// T1 = {WallTime: 1705000000000000000, Logical: 0}
// T2 = {WallTime: 1705000000000000000, Logical: 1}
// T3 = {WallTime: 1705000000000000001, Logical: 0}
// Ordering: T1 < T2 < T3
```

How HLC Works
HLC provides three key guarantees:
Physical time tracking: The WallTime component tracks real time, ensuring timestamps don't drift too far from actual time.
Causal ordering: If event A happens-before event B (causally), then timestamp(A) < timestamp(B). The Logical counter ensures this even when WallTime is equal.
Bounded uncertainty: CockroachDB knows the maximum clock skew between nodes (configured by max-offset, typically 500ms). This enables uncertainty windows for transaction ordering.
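The causal-ordering guarantee is easiest to see in code. Below is a compact Go sketch of the HLC update rules following the published algorithm (Kulkarni et al.), not CockroachDB's exact implementation: `Now` handles local/send events, and `Update` merges a remote timestamp so the result always orders after both clocks.

```go
package main

import (
    "fmt"
    "time"
)

// A minimal hybrid logical clock (single-threaded for simplicity).
type HLC struct {
    WallTime int64 // highest physical time observed, in nanoseconds
    Logical  int32 // ties broken by a logical counter
}

// Now advances the clock for a local or send event.
func (c *HLC) Now() HLC {
    pt := time.Now().UnixNano()
    if pt > c.WallTime {
        c.WallTime, c.Logical = pt, 0
    } else {
        c.Logical++ // physical clock hasn't advanced; use the counter
    }
    return *c
}

// Update merges a timestamp received from another node, guaranteeing
// that our next timestamp orders after both clocks (causality).
func (c *HLC) Update(remote HLC) HLC {
    pt := time.Now().UnixNano()
    switch {
    case pt > c.WallTime && pt > remote.WallTime:
        c.WallTime, c.Logical = pt, 0
    case remote.WallTime > c.WallTime:
        c.WallTime, c.Logical = remote.WallTime, remote.Logical+1
    case c.WallTime > remote.WallTime:
        c.Logical++
    default: // equal wall times: advance past both logical counters
        if remote.Logical > c.Logical {
            c.Logical = remote.Logical
        }
        c.Logical++
    }
    return *c
}

func main() {
    var a, b HLC
    t1 := a.Now()      // event on node A
    t2 := b.Update(t1) // message carrying t1 received on node B
    fmt.Println(t1, t2) // t2 is guaranteed to order after t1
}
```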
HLC vs TrueTime Trade-offs
| Aspect | TrueTime (Spanner) | HLC (CockroachDB) |
|---|---|---|
| Hardware required | GPS, atomic clocks | Standard servers |
| Clock uncertainty | ~1-7ms (GPS-bounded) | ~250-500ms (configured) |
| Commit latency impact | Wait ~7ms per commit | No commit wait; uncertain reads restart |
| External consistency | Guaranteed | Serializable (slightly weaker) |
| Infrastructure flexibility | Google Cloud only | Any cloud, on-prem, Kubernetes |
| Cost | Included in Cloud Spanner | No hardware overhead |
Handling Uncertainty in CockroachDB
Since CockroachDB can't bound clock uncertainty as tightly as Spanner, it handles potential conflicts differently:
Uncertainty Windows: when a transaction reads a value whose timestamp T is above its own read timestamp but within its uncertainty window (read timestamp + max-offset), it cannot tell whether that write happened before or after it started, so the reader must restart at a timestamp above T.

Transaction Restart: rather than waiting out uncertainty on every commit as Spanner does, CockroachDB restarts only the transactions that actually hit an uncertain read, as the pseudocode below shows:
```
// CockroachDB read with uncertainty handling
function read(key, txnTimestamp, maxOffset):
    value, valueTimestamp = storage.get(key)
    uncertaintyLimit = txnTimestamp + maxOffset

    if valueTimestamp > txnTimestamp:
        if valueTimestamp <= uncertaintyLimit:
            // Value is in the uncertainty window.
            // We can't tell if it was written before or after we started.
            throw ReadWithinUncertaintyIntervalError{
                readTimestamp: txnTimestamp,
                existingTimestamp: valueTimestamp,
                // Transaction will restart with timestamp > valueTimestamp
            }
        else:
            // Value is definitely in the future; don't read it.
            // Return an older version or none.
            return getPreviousVersion(key, txnTimestamp)

    return value  // Safe to read
```

In practice, uncertainty-related restarts are rare (<1% of transactions) because clock skew between properly-configured nodes is small. Using NTP with good time sources, the actual skew is often <10ms even with a 500ms configured maximum. The 500ms is a safety bound, not the common case.
CockroachDB implements serializable snapshot isolation (SSI) with innovations that make distributed transactions efficient.
The Write Intent Mechanism
Unlike traditional databases that use locks, CockroachDB uses write intents—provisional writes that mark data as being modified by an in-progress transaction:
```
// Write Intent Structure
WriteIntent {
    key:       []byte             // The key being written
    value:     []byte             // The provisional value
    timestamp: Timestamp          // Transaction timestamp
    txnRecord: TxnRecordPointer   // Link to transaction record
    status:    PENDING | COMMITTED | ABORTED
}

// Write path
function write(key, value, txn):
    intent = WriteIntent{
        key: key,
        value: value,
        timestamp: txn.timestamp,
        txnRecord: txn.recordLocation,
        status: PENDING
    }
    storage.put(key, intent)

// Read path with conflict handling
function read(key, txn):
    result = storage.get(key)
    if result.isIntent():
        intent = result
        if intent.txnRecord.id == txn.id:
            // Reading our own write
            return intent.value
        else:
            // Another transaction's intent - handle conflict
            return handleConflict(intent, txn)
    return result.value
```

Conflict Resolution
When a transaction encounters another transaction's write intent, CockroachDB performs contention management: it consults the intent's transaction record; if that transaction is already COMMITTED or ABORTED, the intent is resolved in place and the read proceeds; if it is still PENDING, the encountering transaction either waits in a queue or attempts to push the writer based on relative priorities. A simplified sketch follows.
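Here is a heavily simplified Go sketch of that decision, an illustration under assumed types (`TxnRecord`, plain integer priorities), not CockroachDB's real concurrency-control code:

```go
package main

import "fmt"

type TxnStatus int

const (
    PENDING TxnStatus = iota
    COMMITTED
    ABORTED
)

// TxnRecord is the authoritative record of a transaction's status.
type TxnRecord struct {
    ID       int
    Status   TxnStatus
    Priority int
}

// handleConflict models what happens when a reader hits another
// transaction's write intent. The real logic also dispatches to a
// lock wait-queue and can push timestamps instead of aborting.
func handleConflict(writer, reader *TxnRecord) string {
    switch writer.Status {
    case COMMITTED:
        return "resolve intent to a committed value, then read it"
    case ABORTED:
        return "remove the orphaned intent, then read the older version"
    default: // PENDING: try to push the writer
        if reader.Priority > writer.Priority {
            writer.Status = ABORTED // push succeeds: abort the writer
            return "writer pushed aside; reader proceeds"
        }
        return "reader waits in the writer's wait-queue"
    }
}

func main() {
    writer := &TxnRecord{ID: 1, Status: PENDING, Priority: 1}
    reader := &TxnRecord{ID: 2, Priority: 2}
    fmt.Println(handleConflict(writer, reader))
}
```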
Parallel Commits
CockroachDB's parallel commits optimization reduces commit latency:
```
// Traditional 2PC (2 consensus rounds)
function traditionalCommit(txn):
    // Round 1: Write transaction record as STAGING
    writeTransactionRecord(txn.id, STAGING)

    // Round 2: After intents replicated, flip to COMMITTED
    wait for intent replication
    writeTransactionRecord(txn.id, COMMITTED)
    // Then async: resolve intents to regular values

// Parallel Commits (1 consensus round)
function parallelCommit(txn):
    // Write transaction record AND intents in parallel
    parallel {
        writeTransactionRecord(txn.id, STAGING, inFlightWrites=[...])
        writeIntent(key1, value1, txn)
        writeIntent(key2, value2, txn)
        // ... all intents
    }

    // Transaction is considered committed when:
    // - Transaction record shows STAGING
    // - ALL in-flight writes succeeded
    // Readers can verify this by checking all listed intents

    // Async cleanup: flip STAGING -> COMMITTED, resolve intents
```

| Isolation Level | Anomalies Prevented | CockroachDB Support |
|---|---|---|
| Serializable | All (strongest) | Default, recommended |
| Read Committed | Dirty reads | Supported (23.2+) |
| Snapshot Isolation | Write skew possible | Via serializable with caveats |
| Read Uncommitted | None | Not supported |
Unlike most databases that default to weaker isolation, CockroachDB defaults to serializable isolation. This eliminates entire classes of concurrency bugs but may require application awareness of transaction restarts. The automatic retry mechanism handles most cases transparently.
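Since applications must be prepared for restart errors, a client-side retry loop is standard practice. Below is a minimal Go sketch using database/sql with the lib/pq driver; the connection string, table, and amounts are placeholders, not a real cluster:

```go
package main

import (
    "database/sql"
    "errors"
    "fmt"
    "log"

    "github.com/lib/pq" // any PostgreSQL driver works; pq is one option
)

// runTransaction retries fn whenever CockroachDB asks the client to
// retry (SQLSTATE 40001, serialization_failure).
func runTransaction(db *sql.DB, fn func(*sql.Tx) error) error {
    for {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        err = fn(tx)
        if err == nil {
            err = tx.Commit()
        }
        if err == nil {
            return nil
        }
        tx.Rollback()

        var pqErr *pq.Error
        if errors.As(err, &pqErr) && pqErr.Code == "40001" {
            continue // restart the transaction at a newer timestamp
        }
        return err // any other error goes back to the caller
    }
}

func main() {
    db, err := sql.Open("postgres",
        "postgresql://user:pass@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = runTransaction(db, func(tx *sql.Tx) error {
        _, err := tx.Exec(
            "UPDATE accounts SET balance = balance - $1 WHERE id = $2",
            100, 1)
        return err
    })
    fmt.Println("transfer result:", err)
}
```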
CockroachDB implements the PostgreSQL wire protocol and a large subset of PostgreSQL SQL syntax, enabling seamless migration for many applications.
Wire Protocol Compatibility
CockroachDB speaks PostgreSQL wire protocol v3:
```python
# Python connection example (identical to PostgreSQL)
import psycopg2

# Connection string format
conn = psycopg2.connect(
    host="your-cluster.cockroachlabs.cloud",
    port=26257,
    database="defaultdb",
    user="your_username",
    password="your_password",
    sslmode="verify-full"
)

# Standard PostgreSQL queries work
cursor = conn.cursor()
cursor.execute("""
    SELECT id, name, balance
    FROM accounts
    WHERE balance > %s
    ORDER BY balance DESC
    LIMIT 10
""", (10000,))

for row in cursor.fetchall():
    print(row)

conn.close()
```

SQL Compatibility
CockroachDB supports most PostgreSQL SQL features:
```sql
-- Standard PostgreSQL syntax works
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email STRING UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT current_timestamp(),
    metadata JSONB
);

-- Window functions
SELECT
    id,
    amount,
    SUM(amount) OVER (
        PARTITION BY user_id
        ORDER BY created_at
        ROWS UNBOUNDED PRECEDING
    ) AS running_total
FROM transactions;

-- CTE with recursion
WITH RECURSIVE org_tree AS (
    SELECT id, name, parent_id, 1 AS depth
    FROM organizations WHERE parent_id IS NULL
    UNION ALL
    SELECT o.id, o.name, o.parent_id, ot.depth + 1
    FROM organizations o
    JOIN org_tree ot ON o.parent_id = ot.id
)
SELECT * FROM org_tree;

-- JSONB queries
SELECT * FROM users WHERE metadata @> '{"role": "admin"}';
```

Cockroach Labs provides a migration assessment tool that analyzes your PostgreSQL schema and queries to identify compatibility issues before migration. For most CRUD applications with standard SQL, migration is straightforward.
CockroachDB provides sophisticated multi-region capabilities for global applications, balancing latency, availability, and consistency.
Topology Patterns
CockroachDB supports three multi-region table locality patterns:
| Locality | Behavior | Read Latency | Write Latency | Best For |
|---|---|---|---|---|
| REGIONAL BY TABLE | Table pinned to home region | Fast in home region | Fast in home region | Regional data, compliance |
| REGIONAL BY ROW | Each row in its home region | Fast for local rows | Fast for local writes | User data, geo-sharding |
| GLOBAL | Optimized global reads | Fast everywhere | Slower (all regions) | Reference data, config |
```sql
-- Configure database for multi-region
ALTER DATABASE myapp SET PRIMARY REGION "us-east1";
ALTER DATABASE myapp ADD REGION "us-west1";
ALTER DATABASE myapp ADD REGION "eu-west1";

-- Regional by table: stays in primary region
CREATE TABLE config (
    key STRING PRIMARY KEY,
    value STRING
) LOCALITY REGIONAL BY TABLE IN PRIMARY REGION;

-- Regional by row: each row has a home region
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email STRING,
    crdb_region crdb_internal_region NOT NULL DEFAULT 'us-east1'
) LOCALITY REGIONAL BY ROW;

-- Insert user with specific region
INSERT INTO users (id, email, crdb_region)
VALUES (gen_random_uuid(), 'alice@example.com', 'eu-west1');

-- Global table: fast reads everywhere
CREATE TABLE currency_rates (
    code STRING PRIMARY KEY,
    usd_rate DECIMAL
) LOCALITY GLOBAL;
```

Survival Goals
CockroachDB can be configured with different survival goals: ZONE survivability (the default) keeps the database fully available through the loss of an availability zone, while REGION survivability (`ALTER DATABASE myapp SURVIVE REGION FAILURE`) tolerates losing an entire region at the cost of higher write latency, and requires at least three regions.
Global Tables Deep Dive
Global tables are optimized for read-heavy reference data: CockroachDB maintains replicas in every region and performs writes at slightly future timestamps, so every region can serve consistent, low-latency local reads while writes pay the cost of cross-region coordination.
Cross-region writes add ~100-300ms latency due to Raft consensus across regions. Regional tables with locality-aware workloads achieve single-digit millisecond latency. Plan schema design around access patterns: keep frequently co-accessed data in the same region.
CockroachDB minimizes operational burden through self-managing features.
Online Schema Changes
Unlike traditional databases that require downtime for DDL operations, CockroachDB performs schema changes online: DDL statements run as asynchronous background jobs that roll the change out through compatible intermediate states (following the protocol described in Google's F1 paper), so tables remain fully readable and writable throughout.
Automatic Operations
CockroachDB handles many tasks automatically:
| Operation | CockroachDB | Traditional RDBMS |
|---|---|---|
| Replication | Automatic, configurable factor | Manual setup and monitoring |
| Failover | Automatic via Raft | Manual or HA proxy setup |
| Rebalancing | Automatic based on load/size | Manual shard management |
| Range splitting | Automatic at 512MB | Manual partition management |
| Recovery | Automatic from healthy replicas | Point-in-time restore |
| Upgrades | Rolling upgrade, no downtime | Scheduled maintenance windows |
Built-in Observability
CockroachDB includes a rich admin UI and metrics infrastructure: the built-in DB Console (served on each node's HTTP port) visualizes cluster health, query performance, and hot ranges, and every node exports Prometheus-format metrics alongside structured logs. A quick way to confirm this is to scrape the metrics endpoint, as sketched below.
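A small Go sketch of scraping a node's Prometheus endpoint; it assumes a local insecure node with its HTTP port on 8080, as in the deployment examples later on:

```go
package main

import (
    "bufio"
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Each CockroachDB node serves Prometheus-format metrics on its
    // HTTP port; localhost:8080 assumes a local single-node cluster.
    resp, err := http.Get("http://localhost:8080/_status/vars")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Print the first few metric lines as a sanity check.
    scanner := bufio.NewScanner(resp.Body)
    for i := 0; scanner.Scan() && i < 10; i++ {
        fmt.Println(scanner.Text())
    }
}
```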
CockroachDB's operational simplicity shines in 'Day 2' scenarios. Adding nodes is: start new node, point at cluster, done. Removing nodes: drain and decommission. Upgrading: rolling restart with new binary. These operations are routine, not events.
CockroachDB offers flexibility in how you deploy and manage your database.
Self-Hosted Deployments
Run CockroachDB yourself on bare metal, VMs, or Kubernetes (via the official operator or Helm chart), on any cloud or on-premises, using the binaries or container images Cockroach Labs publishes.
Managed Services (CockroachDB Cloud)
| Tier | Best For | Pricing Model | Features |
|---|---|---|---|
| Serverless | Development, small workloads | Pay-per-request | Auto-scaling, 5GB free, instant start |
| Dedicated | Production workloads | Fixed node hours | Dedicated resources, SLA, multi-region |
| Self-Hosted + Support | Enterprise on-prem | License fee | Full control + vendor support |
```bash
# Local development (single-node)
cockroach start-single-node --insecure

# 3-node local cluster (for testing; run each node in its own terminal)
cockroach start --insecure --store=node1 --listen-addr=localhost:26257 --http-addr=localhost:8080 --join=localhost:26257,localhost:26258,localhost:26259
cockroach start --insecure --store=node2 --listen-addr=localhost:26258 --http-addr=localhost:8081 --join=localhost:26257,localhost:26258,localhost:26259
cockroach start --insecure --store=node3 --listen-addr=localhost:26259 --http-addr=localhost:8082 --join=localhost:26257,localhost:26258,localhost:26259

cockroach init --insecure --host=localhost:26257

# Production Kubernetes (using operator)
kubectl apply -f https://raw.githubusercontent.com/cockroachdb/cockroach-operator/master/install/crds.yaml
kubectl apply -f cockroachdb-cluster.yaml
```

CockroachDB Cloud runs on AWS, GCP, and Azure. Unlike Cloud Spanner (GCP-only), you can deploy CockroachDB on any major cloud or on-premises, enabling multi-cloud and hybrid cloud strategies.
CockroachDB demonstrates that Spanner's innovations are achievable without Google-scale infrastructure. Let's consolidate the key insights and compare with alternatives.
| Feature | CockroachDB | Cloud Spanner | PostgreSQL |
|---|---|---|---|
| Horizontal scale | ✓ Yes | ✓ Yes | ✗ Vertical only |
| Distributed ACID | ✓ Serializable | ✓ External consistency | ✓ Single-node |
| Multi-region | ✓ Yes | ✓ Yes (cloud) | Manual replication |
| SQL compatibility | PostgreSQL | GoogleSQL | PostgreSQL |
| Self-hosted option | ✓ Yes | ✗ No (GCP only) | ✓ Yes |
| Licensing | BSL (open core) | Proprietary | PostgreSQL License |
| Minimum cost | $0 (self-hosted) | ~$650/month | $0 |
What's Next
With understanding of both Spanner and CockroachDB, we'll conclude with NewSQL Use Cases—identifying when NewSQL is the right choice and when traditional SQL or NoSQL better fits your requirements.
You now understand CockroachDB's architecture, how it achieves Spanner-like capabilities without specialized hardware, and its PostgreSQL compatibility. CockroachDB represents the democratization of distributed SQL—bringing NewSQL benefits to organizations of all sizes.