For decades, scaling SQL databases meant one thing: buying bigger hardware. When that ran out, you faced an impossible choice—abandon SQL for NoSQL systems that scaled but lost relational semantics, or implement complex application-level sharding that fragmented your data model.
CockroachDB offers a third path: Distributed SQL. Your application issues standard SQL queries—SELECT, INSERT, UPDATE, DELETE, complex JOINs, subqueries, window functions—and CockroachDB executes them transparently across a cluster of nodes. Add more nodes to scale. Remove nodes to save costs. The SQL interface remains unchanged.
This isn't simple query routing to shards. CockroachDB implements a sophisticated distributed query execution engine (DistSQL) that parallelizes work across nodes, pushes computation to where data resides, and optimizes network transfer. A query that would take seconds on a single node can complete in milliseconds when executed across dozens of nodes in parallel.
Understanding how Distributed SQL works is essential for:
- Designing schemas that scale horizontally
- Writing queries that exploit cluster parallelism
- Diagnosing performance problems in distributed plans
- Planning capacity as data and traffic grow
By the end of this page, you will understand how CockroachDB processes SQL queries across a distributed cluster. You'll learn about the SQL layer architecture, the DistSQL execution engine, query routing, distributed join strategies, and how to write queries that leverage CockroachDB's parallelism effectively.
CockroachDB's SQL layer transforms human-readable SQL into operations on the underlying distributed key-value store. This transformation happens through a series of well-defined stages.
Stage 1: Parsing
The parser converts SQL text into an Abstract Syntax Tree (AST). CockroachDB supports the PostgreSQL SQL dialect, so queries written for PostgreSQL typically parse without modification:
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.region = 'us-east'
GROUP BY u.name
HAVING COUNT(o.id) > 5
ORDER BY order_count DESC
LIMIT 10;
The parser validates syntax, identifies tables and columns, and produces a tree representation of the query.
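To see the syntax-validation step in isolation, feed the parser something invalid. A minimal illustration (error text abridged; exact wording varies by version):

SELECT name FORM users;   -- misspelled FROM
-- ERROR: at or near "users": syntax error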
Stage 2: Semantic Analysis
The analyzer resolves names—mapping table names to actual tables, column names to schema definitions, and verifying that the query is semantically valid:
- Does the users table exist?
- Does users have a region column?
- Is u.id = o.user_id comparing compatible types?

Stage 3: Logical Planning
The planner creates a logical execution plan—a tree of operations that, if executed, would produce the correct result:
Limit(10)
└── Sort(order_count DESC)
└── Filter(count > 5)
└── GroupBy(u.name)
└── HashJoin(u.id = o.user_id)
├── Filter(u.region = 'us-east')
│ └── Scan(users)
└── Scan(orders)
Stage 4: Optimization
The cost-based optimizer evaluates multiple possible execution plans and selects the one with lowest estimated cost. It considers:
- Available indexes and whether they cover the query
- Table statistics (row counts, distinct values) for cardinality estimates
- Alternative join orderings and join algorithms
- The physical location of data in the cluster
SQL QUERY PROCESSING PIPELINE

INPUT QUERY:
  SELECT u.name, COUNT(o.id) AS order_count
  FROM users u
  JOIN orders o ON u.id = o.user_id
  WHERE u.region = 'us-east'
  GROUP BY u.name
  HAVING COUNT(o.id) > 5
  ORDER BY order_count DESC
  LIMIT 10;

STAGE 1: PARSING → Abstract Syntax Tree
  SELECT
  ├── Columns: [u.name, COUNT(o.id)]
  ├── FROM: [users AS u, orders AS o]
  ├── JOIN: ON u.id = o.user_id
  ├── WHERE: u.region = 'us-east'
  ├── GROUP BY: [u.name]
  ├── HAVING: COUNT(o.id) > 5
  ├── ORDER BY: [order_count DESC]
  └── LIMIT: 10

STAGE 2: SEMANTIC ANALYSIS → Resolved Names
  users     → database.public.users  (id: 52, columns: [id, name, ...])
  orders    → database.public.orders (id: 57, columns: [id, user_id, ...])
  u.region  → users.region   (column index: 3, type: STRING)
  u.id      → users.id       (column index: 0, type: INT, PRIMARY KEY)
  o.user_id → orders.user_id (column index: 1, type: INT, FK)

STAGE 3: LOGICAL PLAN → Operation Tree
  Limit(10)
  └── Sort(order_count DESC)
      └── Filter(count > 5)          ← HAVING clause
          └── GroupBy(u.name)
              └── HashJoin(u.id = o.user_id)
                  ├── Filter(u.region = 'us-east')
                  │   └── Scan(users)
                  └── Scan(orders)

STAGE 4: OPTIMIZATION → Cost-Based Selection
  Optimizer evaluates alternatives:
    Option A: Hash Join                   Cost: 5,234
    Option B: Merge Join (requires sort)  Cost: 12,891
    Option C: Lookup Join                 Cost: 3,122  ← SELECTED

  Predicate pushdown applied:
    - Filter(region='us-east') pushed to users scan
    - Uses index: users@users_region_idx

  Final optimized plan:
    Limit(10)
    └── Sort(order_count DESC)
        └── Filter(count > 5)
            └── GroupBy(u.name)
                └── LookupJoin(orders ON user_id)
                    └── IndexScan(users@users_region_idx, region='us-east')

STAGE 5: PHYSICAL PLANNING → DistSQL Distribution
  (Covered in the next section.)

Stage 5: Physical Planning
After optimization produces a logical plan, the physical planner decides where each operation executes. For a single-node query, everything runs locally. For distributed queries, the planner creates a DistSQL plan that:
- Assigns processors to the nodes that hold the relevant ranges
- Connects those processors with streams that move rows between nodes
- Pushes filters and partial aggregations down to where the data resides
This is where CockroachDB's distributed nature becomes visible.
Use EXPLAIN ANALYZE to see how CockroachDB executes your query: EXPLAIN ANALYZE SELECT .... This shows the operations performed, execution times, and data flow between nodes. For distributed plans, use EXPLAIN (DISTSQL) SELECT ... to see a graphical representation of the DistSQL plan.
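Applied to the running example (the same query works with all three variants):

-- Plan only, no execution:
EXPLAIN SELECT u.name, COUNT(o.id) FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.region = 'us-east' GROUP BY u.name;

-- Executes the query and reports actual row counts, timings, and data flow:
EXPLAIN ANALYZE SELECT u.name, COUNT(o.id) FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.region = 'us-east' GROUP BY u.name;

-- Emits a URL that renders a diagram of the DistSQL plan:
EXPLAIN (DISTSQL) SELECT u.name, COUNT(o.id) FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.region = 'us-east' GROUP BY u.name;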
DistSQL is CockroachDB's mechanism for executing SQL queries across multiple nodes in parallel. It transforms the single-node execution plan into a distributed plan that leverages the entire cluster.
The Core Principle: Move Computation to Data
In distributed databases, the primary performance bottleneck is network transfer. Moving large amounts of data between nodes is slow and expensive. DistSQL minimizes this by:
- Pushing filters, projections, and partial aggregations to the nodes that hold the data
- Sending only the much smaller intermediate results across the network
- Executing independent work on all nodes in parallel
DistSQL Plan Components:
A DistSQL plan consists of processors connected by streams: processors (TableReaders, Joiners, Aggregators, Sorters) do the work, while streams carry rows between them, either in memory on a single node or over the network between nodes:
DISTSQL EXECUTION FLOW

Query:
  SELECT region, COUNT(*) FROM orders WHERE status = 'shipped' GROUP BY region;

Assumption: orders table split across 4 nodes
  (Node 1 = US-East, Node 2 = US-West, Node 3 = EU-West, Node 4 = Asia)

NON-DISTRIBUTED (Single Node) Execution:
  Gateway Node:
    1. Fetch ALL rows from ALL nodes (network transfer)
    2. Filter status = 'shipped' locally
    3. GROUP BY region locally
    4. COUNT(*) locally
    5. Return result
  Problem: transfers 100% of the data across the network before filtering.

DISTSQL (Distributed) Execution:

Phase 1: Local processing (parallel on all nodes)
  Each node runs TableReader → Filter(shipped) → Aggregator(partial):
    Node 1 (US-East): us-east: 500, us-west: 50
    Node 2 (US-West): us-west: 300, us-east: 25
    Node 3 (EU-West): eu-west: 400, asia: 10
    Node 4 (Asia):    asia: 250,    eu-west: 15
  Only the partial aggregates cross the network.

Phase 2: Final aggregation (gateway node)
  Receive partial aggregates from all nodes:
    us-east: 500 + 25 = 525
    us-west: 300 + 50 = 350
    eu-west: 400 + 15 = 415
    asia:    250 + 10 = 260
  Return final result to client.

COMPARISON:
                        Non-Distributed      DistSQL
  Data transferred      10 million rows      8 partial aggregates
  Processing time       Sequential           Parallel (4x faster)
  Network bottleneck    Yes                  No
  Scalability           Limited              Linear with nodes

When DistSQL Activates:
Not all queries use DistSQL. CockroachDB's optimizer decides based on:
- Estimated row counts: point lookups (e.g., SELECT * FROM table WHERE pk = 123) don't benefit from parallelism
- How many ranges, and therefore how many nodes, the query touches
- Whether the operators involved (filters, aggregations, joins) can be distributed

DistSQL Flow Types:
DistSQL plans use different flow patterns depending on the query:
- Gather: each node processes its local data and streams results to the gateway (scans, filters, partial aggregation)
- Shuffle: rows are hash-partitioned on a key and redistributed between nodes (distributed joins, final aggregation)
- Broadcast: one small input is mirrored to every participating node
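A quick way to check whether a query was distributed at all is the distribution field in plain EXPLAIN output. A sketch (exact layout varies by version):

EXPLAIN SELECT region, COUNT(*) FROM orders
WHERE status = 'shipped' GROUP BY region;
--   distribution: full      ← executed across multiple nodes ("local" = gateway only)
--   vectorized: true
--   ... (plan tree follows)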
CockroachDB's web console (usually at port 8080) includes a DistSQL visualization. Run EXPLAIN (DISTSQL) SELECT ... and click the generated link to see a graphical representation of processors, data flow, and which nodes participate in query execution.
When your application connects to CockroachDB, it connects to a specific node—the gateway node for that connection. Understanding how queries are routed from gateway nodes to data is essential for optimizing performance.
Every Node is a Gateway
Unlike databases with dedicated query coordinators, every CockroachDB node can serve as a gateway. When you issue a query:
1. The gateway node parses, plans, and optimizes the query
2. It looks up which nodes hold the relevant ranges (via cached range descriptors)
3. It coordinates execution, local or distributed, and streams results back to your client
Implications for Load Balancing:
Since any node can be a gateway, you typically place a load balancer in front of your cluster:
┌─────────────────────────────────────┐
│ Load Balancer │
│ (HAProxy, NGINX, Cloud LB, etc.) │
└─────────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│(Gateway)│ │(Gateway)│ │(Gateway)│
└─────────┘ └─────────┘ └─────────┘
Load balancing distributes the gateway work across nodes, preventing any single node from becoming a bottleneck.
Gateway Overhead:
The gateway node does coordination work even for queries that don't touch its local data:
- Parsing, planning, and optimizing the query
- Setting up and coordinating the DistSQL flow
- Buffering and merging result streams before returning them to the client
For latency-sensitive applications, connecting to the node that owns the data minimizes hops. CockroachDB supports topology-aware routing (connecting to the nearest replica) and follower reads (reading from any replica for slightly stale data).
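A follower read is opt-in per query. Note that the AS OF SYSTEM TIME clause goes immediately after the FROM clause, not at the end of the statement:

SELECT id, status, total
FROM orders AS OF SYSTEM TIME follower_read_timestamp()
WHERE user_id = 123;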
| Scenario | Gateway Location | Data Location | Network Round Trips | Optimization |
|---|---|---|---|---|
| Local query | Node A | Node A (leaseholder) | 0 | Best case—no network |
| Remote single range | Node A | Node B (leaseholder) | 1 RT to B | Consider connecting to B directly |
| Multi-range query | Node A | Nodes B, C, D | 1 RT to each in parallel | DistSQL parallelism helps |
| Global aggregation | Node A | All nodes | 1 RT to each in parallel | Ensure gateway isn't bottleneck |
Leaseholder Routing:
For reads and writes, the leaseholder is the authoritative replica for a range. The gateway must communicate with the leaseholder to ensure consistency:
- Reads are served by the leaseholder, which is guaranteed to see all committed writes
- Writes are proposed through the leaseholder and replicated to followers via Raft
The gateway uses range descriptors (cached locally, refreshed via gossip) to determine which node is the leaseholder for each range needed by the query.
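You can inspect range boundaries and placement directly from SQL. A sketch, assuming the users table from earlier (output columns vary by version; newer versions require WITH DETAILS for leaseholder information):

SHOW RANGES FROM TABLE users;
-- Shows each range's start/end key, range ID, and replica placement.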
Minimizing Gateway Overhead:
Strategies for reducing gateway-related latency:
- Place a load balancer in front of the cluster and distribute connections evenly
- Connect applications to nodes in their own region, using topology-aware connection pools (as sketched below)
- Use follower reads for staleness-tolerant queries so reads stay region-local
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq"
)

// RegionalPool demonstrates connecting to region-local nodes for lower latency.
type RegionalPool struct {
	primary   *sql.DB // primary region for writes
	localRead *sql.DB // local region for reads
}

func NewRegionalPool(userRegion string) (*RegionalPool, error) {
	// CockroachDB cluster endpoints by region
	endpoints := map[string]string{
		"us-east": "cockroach-us-east.example.com:26257",
		"us-west": "cockroach-us-west.example.com:26257",
		"eu-west": "cockroach-eu-west.example.com:26257",
	}

	// Primary region for writes (configured as leaseholder preference)
	primary, err := sql.Open("postgres",
		fmt.Sprintf("postgresql://root@%s/mydb?sslmode=require", endpoints["us-east"]))
	if err != nil {
		return nil, err
	}

	// Local region for reads (using follower reads)
	localEndpoint := endpoints[userRegion]
	if localEndpoint == "" {
		localEndpoint = endpoints["us-east"] // fallback
	}
	localRead, err := sql.Open("postgres",
		fmt.Sprintf("postgresql://root@%s/mydb?sslmode=require", localEndpoint))
	if err != nil {
		return nil, err
	}

	return &RegionalPool{primary: primary, localRead: localRead}, nil
}

func (p *RegionalPool) Write(query string, args ...interface{}) (sql.Result, error) {
	// Writes always go to the primary for strong consistency.
	return p.primary.Exec(query, args...)
}

func (p *RegionalPool) Read(query string, args ...interface{}) (*sql.Rows, error) {
	// Reads can go to a local replica with follower reads (bounded staleness).
	// NOTE: AS OF SYSTEM TIME belongs immediately after the FROM clause, so this
	// naive concatenation only works for queries that end with their FROM clause.
	// For general queries, run SET TRANSACTION AS OF SYSTEM TIME
	// follower_read_timestamp() inside an explicit read-only transaction instead.
	followerQuery := query + " AS OF SYSTEM TIME follower_read_timestamp()"
	return p.localRead.Query(followerQuery, args...)
}

func (p *RegionalPool) StrongRead(query string, args ...interface{}) (*sql.Rows, error) {
	// Strong reads go to the leaseholder via the primary connection.
	return p.primary.Query(query, args...)
}

If all connections route to a single node, that node becomes a bottleneck—even though data is distributed. Monitor connection distribution in the CockroachDB console and ensure load balancers are distributing evenly. For very high-throughput applications, provision enough gateway nodes for the connection count.
Joins in a distributed database are inherently challenging—the data from different tables may reside on different nodes. CockroachDB implements several join strategies, each optimized for different scenarios.
Join Strategy 1: Lookup Join
The lookup join (also called index nested loop join) is used when one side of the join is small and the other has an index on the join key: for each row from the small side, CockroachDB performs an index lookup into the large side.
When it's chosen:
- One input is estimated to be small (often after a selective filter)
- The other input has an index on the join column(s)
- The cost of per-row index lookups beats scanning the large table
Example:
-- users table: 100 filtered rows
-- orders table: 10 million rows with index on user_id
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.country = 'Iceland'; -- Small country = few users
The optimizer will likely choose lookup join: scan the ~100 Icelandic users, then do 100 index lookups into orders.
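You can confirm which strategy the optimizer picked with EXPLAIN. A sketch, assuming the schema above (the plan text varies by version):

EXPLAIN SELECT u.name, o.total
FROM users u JOIN orders o ON u.id = o.user_id
WHERE u.country = 'Iceland';
-- Expect a "lookup join" node over the orders index on user_id.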
Join Strategy 2: Hash Join
The hash join builds a hash table from one side of the join (the smaller, "build" side) and probes it with the other side: each probe row is matched against the hash table in constant expected time. It's chosen when neither side is small enough, or suitably indexed, for a lookup join.
Distributed Hash Join:
When both tables are large and spread across nodes:
DISTRIBUTED HASH JOIN

Query:
  SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id;

Assumption: both tables spread across 3 nodes, no co-location on user_id.

Phase 1: Hash partitioning
  Each node reads its local data and computes hash(join_key):
    Node 1 users: A(id=1)→0, B(id=2)→1, C(id=3)→0
    Node 1 orders: X(user=2)→1, Y(user=1)→0, Z(user=3)→0
    Node 2 users: D(id=4)→1, E(id=5)→1
    Node 2 orders: P(user=5)→1, Q(user=4)→1
    Node 3 users: F(id=6)→0, G(id=7)→1
    Node 3 orders: R(user=6)→0, S(user=7)→1

Phase 2: Shuffle by hash partition
  Rows with hash=0 → Node 1; rows with hash=1 → Node 2.
  After shuffle:
    Node 1 (hash=0 partition):
      Users:  A(1), C(3), F(6)
      Orders: Y(user=1), Z(user=3), R(user=6)
    Node 2 (hash=1 partition):
      Users:  B(2), D(4), E(5), G(7)
      Orders: X(user=2), P(user=5), Q(user=4), S(user=7)

Phase 3: Local hash join (parallel)
  Node 1: A↔Y, C↔Z, F↔R
  Node 2: B↔X, D↔Q, E↔P, G↔S

Phase 4: Combine results
  Gateway collects results from Node 1 and Node 2 and returns them to the client.

COST ANALYSIS:
  Network: full shuffle of both tables (expensive)
  CPU:     parallel hash join (efficient)
  Memory:  hash table for the smaller side on each node

  Best for: large-large table joins with no co-location
  Avoid:    when tables could be co-located by schema design

Join Strategy 3: Merge Join
The merge join requires both sides to be sorted on the join key: the two sorted streams are scanned in lockstep, matching rows as they advance.
When it's chosen:
- Both inputs are already sorted on the join key (for example, both have indexes on it)
- The inputs are large, so avoiding an in-memory hash table saves memory
Join Strategy 4: Cross Join
When no join predicate exists (or only non-equality predicates), CockroachDB may use a cross join: every row from one side is paired with every row from the other, and any remaining predicate is applied afterward.
Cross joins are expensive (O(n×m)) and should be avoided for large tables.
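When the optimizer's estimates are off, CockroachDB's join hints let you force a specific algorithm. A sketch using the same tables:

-- Force a hash join:
SELECT u.name, o.total
FROM users u INNER HASH JOIN orders o ON u.id = o.user_id;

-- Force a lookup join (requires an index on the right side's join key):
SELECT u.name, o.total
FROM users u INNER LOOKUP JOIN orders o ON u.id = o.user_id;

-- Force a merge join:
SELECT u.name, o.total
FROM users u INNER MERGE JOIN orders o ON u.id = o.user_id;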
Co-located Joins: The Best Case
The most efficient distributed join is one that doesn't need network shuffling—a co-located join. If related tables are partitioned on the same key:
-- Users partitioned by region
-- Orders partitioned by user's region (derived from user_id)
-- Join happens locally on each node—no shuffle needed
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.region = 'us-east';
Schema design that enables co-located joins is one of the most impactful optimizations for distributed databases.
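In a multi-region cluster, one way to get this co-location is row-level locality. A sketch, assuming the database already has regions configured with ALTER DATABASE ... ADD REGION:

ALTER TABLE users  SET LOCALITY REGIONAL BY ROW;
ALTER TABLE orders SET LOCALITY REGIONAL BY ROW;
-- Rows in both tables are now partitioned by their hidden crdb_region column,
-- so a user's row and that user's orders can live in the same region and the
-- region-filtered join above stays local.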
| Strategy | Best For | Data Movement | Memory Usage | Time Complexity |
|---|---|---|---|---|
| Lookup Join | Small-large table, indexed | Minimal (point lookups) | Low | O(n) probes |
| Hash Join | Medium tables, no index | Shuffle if distributed | Build-side in memory | O(n + m) |
| Merge Join | Both sides sorted/indexed | Shuffle if distributed | Low | O(n + m) |
| Co-located Join | Same partition key | None (local) | Per-partition | O(n + m) local |
| Cross Join | Small tables only | Full combination | Low | O(n × m), avoid |
When designing schemas for CockroachDB, consider which tables are frequently joined and partition them on the same key. For example, partition both users and orders by user_id so joins are always local. This is the distributed database equivalent of Spanner's interleaved tables.
SQL queries in CockroachDB execute within transactions—even single statements are implicitly wrapped in a transaction. Understanding how transactions work in a distributed context is essential for both correctness and performance.
Every Query is a Transaction
When you execute:
SELECT * FROM users WHERE region = 'us-east';
CockroachDB implicitly wraps this in a transaction:
BEGIN;
SELECT * FROM users WHERE region = 'us-east';
COMMIT;
This ensures the query sees a consistent snapshot of the database, even if other transactions are modifying data concurrently.
Read-Only vs. Read-Write Transactions
CockroachDB distinguishes between:
- Read-only transactions: no write intents and no two-phase commit; they read a consistent MVCC snapshot and can be served by follower reads
- Read-write transactions: lay down write intents and commit via the distributed protocol described below
Distributed Transaction Flow (Read-Write):
DISTRIBUTED TRANSACTION LIFECYCLE

Example transaction:
  BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
  UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
  COMMIT;

Assumption: Alice's account on Node A, Bob's account on Node B.

STEP 1: BEGIN (gateway node)
  Gateway (whichever node the client connected to):
  - Generates transaction ID: txn-abc123
  - Assigns provisional commit timestamp: ts=1000
  - Tracks transaction state: PENDING

STEP 2: First UPDATE (Alice's account)
  Gateway locates the leaseholder for 'alice' (Node A) and sends the write.
  Node A (leaseholder for alice):
  - Locks Alice's row (write intent)
  - Writes provisional value alice.balance = (old - 100) @ ts=1000,
    marked with intent txn-abc123 (not yet committed)
  - Stores the write in the Raft log
  - Responds to the gateway: write intent placed

STEP 3: Second UPDATE (Bob's account)
  The same flow runs against Node B, the leaseholder for 'bob':
  a write intent for bob.balance = (old + 100) @ ts=1000 is placed.

STEP 4: COMMIT (two-phase commit)
  Phase 1: PREPARE (parallel to all participants)
    Gateway → Node A, Node B: "Prepare to commit txn-abc123"
    Each node checks its write intent is still valid → PREPARED
  Phase 2: COMMIT (after all prepared)
    Gateway marks the transaction record COMMITTED @ ts=1000
    (the record itself is replicated via Raft)
    Gateway → Node A, Node B: "Resolve intent txn-abc123 as committed"
    Each node converts its intent to a permanent value and releases the lock

STEP 5: Return to client
  Gateway → client: transaction committed successfully.

CONFLICT HANDLING:
If another transaction (txn-xyz) touches Alice's row while txn-abc123 is uncommitted:
  Case 1: txn-xyz reads → waits for txn-abc123 to commit or abort,
          then reads the resolved value
  Case 2: txn-xyz writes → write-write conflict detected;
          one transaction must abort and retry (SSI)
  Case 3: txn-xyz reads at an older timestamp → reads the pre-intent
          value (MVCC), no conflict

Parallel Commits Optimization:
The 2PC flow described above requires two round trips: PREPARE then COMMIT. CockroachDB's parallel commits optimization eliminates this for common cases:
- The transaction record is written in a STAGING state in parallel with the final write intents
- The transaction is logically committed as soon as all intents are durably replicated
- Intents are resolved and the record marked COMMITTED asynchronously, off the critical path
This reduces commit latency from 2 round trips to 1 for transactions touching multiple ranges.
Automatic Retries:
CockroachDB automatically retries transactions that encounter conflicts, within limits:
- Server-side retries happen transparently, but only while no results have been streamed to the client (the first result batch is buffered for exactly this reason)
- Once results have been sent, the client receives a retryable error (SQLSTATE 40001) and must retry the transaction itself
Applications using CockroachDB's Go, Java, or Python drivers benefit from automatic retry logic. For raw SQL connections, applications should implement retry loops for 40001 (serialization failure) errors.
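A raw-SQL client can implement that retry loop with CockroachDB's documented savepoint protocol. A sketch of one attempt:

BEGIN;
SAVEPOINT cockroach_restart;

UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';

RELEASE SAVEPOINT cockroach_restart;  -- may fail with SQLSTATE 40001
COMMIT;

-- On 40001: ROLLBACK TO SAVEPOINT cockroach_restart, re-issue the
-- statements, and attempt RELEASE again.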
High contention on hot keys—many transactions trying to update the same row—causes performance degradation. CockroachDB provides contention metrics in the admin UI. If you see high contention, consider: (1) redesigning to avoid hot keys, (2) using SELECT FOR UPDATE to acquire locks early, or (3) breaking large transactions into smaller ones.
Writing efficient queries for CockroachDB requires understanding both SQL optimization principles and distributed-specific considerations.
Key Optimization Strategies:
1. Use Indexes Effectively
Indexes in CockroachDB work similarly to traditional databases but have distributed implications:
-- Covering index for common query pattern
CREATE INDEX orders_user_status_idx ON orders(user_id, status)
STORING (total, created_at);
-- Query satisfied entirely by index
SELECT user_id, status, total, created_at
FROM orders
WHERE user_id = 123 AND status = 'shipped';
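EXPLAIN confirms the index-only plan: if the index covers the query, the plan shows a single scan of the secondary index with no join back to the primary index. A sketch (plan text varies by version):

EXPLAIN SELECT user_id, status, total, created_at
FROM orders
WHERE user_id = 123 AND status = 'shipped';
-- Expect: scan of orders@orders_user_status_idx, and no "index join" node.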
2. Minimize Network Round Trips
Each range access is potentially a network round trip. Minimize by:
- Batching point lookups (WHERE id IN (1,2,3) instead of multiple queries)
- Using covering indexes so a single index scan answers the query
- Fetching only the columns you need

3. Leverage Locality
For multi-region deployments, query performance depends on data locality:
- Use REGIONAL BY ROW tables to pin data to regions
- Configure lease_preferences for leaseholder placement
- Use follower reads for staleness-tolerant, region-local reads
-- ═══════════════════════════════════════════════════════════════════
-- QUERY OPTIMIZATION PATTERNS FOR COCKROACHDB
-- ═══════════════════════════════════════════════════════════════════

-- PATTERN 1: Covering Indexes
-- ─────────────────────────────────────────────────────────────────────
-- BAD: requires a secondary lookup to fetch 'name'
SELECT id, name FROM users WHERE email = 'alice@example.com';
-- Index on email doesn't include 'name', so:
--   1. Index scan to find the row
--   2. Primary key lookup to get 'name'

-- GOOD: covering index includes all needed columns
CREATE INDEX users_email_covering ON users(email) STORING (name);
-- Now the query is satisfied entirely by an index scan

-- PATTERN 2: Batch Point Lookups
-- ─────────────────────────────────────────────────────────────────────
-- BAD: multiple round trips
SELECT * FROM products WHERE id = 1;
SELECT * FROM products WHERE id = 2;
SELECT * FROM products WHERE id = 3;
-- Each query = a potential network round trip

-- GOOD: single batched query
SELECT * FROM products WHERE id IN (1, 2, 3);
-- Single query; CockroachDB parallelizes the lookups

-- PATTERN 3: Avoid SELECT *
-- ─────────────────────────────────────────────────────────────────────
-- BAD: fetches all columns, some may require additional lookups
SELECT * FROM orders WHERE user_id = 123;

-- GOOD: fetch only needed columns (matches a covering index)
SELECT id, status, total FROM orders WHERE user_id = 123;

-- PATTERN 4: Predicate Pushdown for Joins
-- ─────────────────────────────────────────────────────────────────────
-- SUBOPTIMAL: filter applied after the join
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.region = 'us-east';

-- BETTER: same query, but ensure an index supports the filter
CREATE INDEX users_region_idx ON users(region);
-- The optimizer will push the filter down and use the index

-- PATTERN 5: Follower Reads for Staleness-Tolerant Queries
-- ─────────────────────────────────────────────────────────────────────
-- STANDARD: strong consistency, must read from the leaseholder
SELECT * FROM product_catalog WHERE category = 'electronics';
-- May require a cross-region round trip if the leaseholder is remote

-- WITH FOLLOWER READS: read from the nearest replica (bounded staleness)
SELECT * FROM product_catalog
AS OF SYSTEM TIME follower_read_timestamp()
WHERE category = 'electronics';
-- Reads from a local replica, ~4.8 seconds behind (configurable)

-- PATTERN 6: Pagination with Key-Based Cursors
-- ─────────────────────────────────────────────────────────────────────
-- BAD: OFFSET-based pagination (scans and discards rows)
SELECT * FROM events ORDER BY created_at DESC LIMIT 20 OFFSET 1000;
-- Must scan 1020 rows to return 20

-- GOOD: key-based cursor (starts the scan at the cursor position)
SELECT * FROM events
WHERE created_at < '2024-01-15T10:30:00Z'  -- cursor from previous page
ORDER BY created_at DESC LIMIT 20;
-- Starts the scan at the cursor, returns exactly 20

-- PATTERN 7: Lock Hints for High-Contention Scenarios
-- ─────────────────────────────────────────────────────────────────────
-- STANDARD: optimistic locking, may conflict at commit
BEGIN;
SELECT balance FROM accounts WHERE id = 'hot_account';
-- computation...
UPDATE accounts SET balance = new_balance WHERE id = 'hot_account';
COMMIT;
-- May fail if another transaction committed first

-- WITH FOR UPDATE: pessimistic lock acquired early
BEGIN;
SELECT balance FROM accounts WHERE id = 'hot_account' FOR UPDATE;
-- Lock acquired immediately; other transactions wait
UPDATE accounts SET balance = new_balance WHERE id = 'hot_account';
COMMIT;
-- No conflict at commit time

CockroachDB's EXPLAIN ANALYZE shows actual execution statistics—not just the plan but real row counts, timing, and network transfers. For distributed queries, it shows which nodes participated and how much data moved. This is your primary tool for optimization.
We've explored how CockroachDB processes SQL queries across a distributed cluster. Let's consolidate the key concepts:
- The SQL layer turns queries into key-value operations through parsing, semantic analysis, logical planning, optimization, and physical planning
- DistSQL pushes computation to the data and parallelizes execution across nodes
- Every node can act as a gateway; load balancing and locality-aware connections keep gateways from becoming bottlenecks
- The join strategy (lookup, hash, merge, co-located) determines how much data crosses the network
- Every statement runs in a transaction; distributed commits use write intents, two-phase commit, and the parallel commits optimization
What's Next:
Distributed SQL provides the query semantics, but consistency guarantees require something more—serializable isolation. In the next page, we'll dive deep into CockroachDB's transaction model, exploring how Multi-Version Concurrency Control (MVCC), Serializable Snapshot Isolation (SSI), and write intents work together to provide the strongest isolation level without sacrificing performance.
You now understand how CockroachDB processes SQL queries in a distributed environment—from parsing through optimization to DistSQL execution. You can reason about query performance, identify optimization opportunities, and understand the trade-offs in distributed query processing. Next, we'll explore the serializable isolation that makes CockroachDB's transactions reliable across any scale.