Traditional databases require constant operational attention. As data grows, DBAs manually split tables, redistribute shards, and balance load. When nodes fail, they trigger failovers and repair replication. When traffic patterns change, they reconfigure routing rules.
CockroachDB eliminates this burden through automatic load balancing. The database splits and merges ranges as data grows and shrinks, rebalances replicas across nodes, and moves leaseholders to follow load—all without operator intervention.
This self-managing behavior is one of CockroachDB's most powerful features. A 3-node cluster and a 300-node cluster operate the same way—you just have more capacity.
Understanding how this works helps you design schemas that cooperate with the balancing algorithms and diagnose distribution problems when they arise.
By the end of this page, you will understand the range abstraction, how leaseholders coordinate access, how CockroachDB splits and merges ranges, the rebalancing algorithms that distribute load, and how to configure policies for your specific requirements.
All data in CockroachDB is organized into ranges—contiguous portions of the key space. Understanding ranges is fundamental to understanding how CockroachDB distributes and manages data.
What is a Range?
A range is a contiguous, sorted span of the cluster's key space, replicated as a unit (three ways by default).
The default maximum range size is 512 MB (configurable; range_min_bytes defaults to 64 MB and range_max_bytes to 512 MB). When a range grows past the maximum, it automatically splits.
Range Key Space:
CockroachDB encodes all data—table rows, indexes, system metadata—into a single sorted key-value store. The key format is:
/Table/{table_id}/{index_id}/{primary_key_columns}/{column_family}
Because keys are sorted, rows from the same table (same table_id) are stored together. Rows with similar primary keys are in the same or adjacent ranges.
Range Metadata:
Each range has metadata stored in a range descriptor: its start and end keys, the set of replicas and the nodes that hold them, and the current lease state.
RANGE ORGANIZATION IN COCKROACHDB
═══════════════════════════════════════════════════════════════════

KEY SPACE OVERVIEW:
────────────────────────────────────────────────────────────────────
The entire keyspace is divided into ranges:

Key Space: [min_key ────────────────────────────────────── max_key]
           ├─ Range 1 ─┤├─ Range 2 ─┤├─ Range 3 ─┤├─ Range 4 ─┤
              [a, g)       [g, m)       [m, t)       [t, z]

RANGE CONTENTS EXAMPLE:
────────────────────────────────────────────────────────────────────
Database: mydb
Tables: users, orders, products

Key encoding:
  /Table/56/1/1   = users table, primary index, row id=1
  /Table/56/1/2   = users table, primary index, row id=2
  /Table/56/2/... = users table, secondary index
  /Table/57/...   = orders table
  /Table/58/...   = products table

Range 42 (example):
┌─────────────────────────────────────────────────────────────────┐
│ Range 42                                                        │
│ Start Key: /Table/56/1/1000                                     │
│ End Key:   /Table/56/1/2000                                     │
│                                                                 │
│ Contents:                                                       │
│   /Table/56/1/1000 → {id: 1000, name: "Alice", ...}             │
│   /Table/56/1/1001 → {id: 1001, name: "Bob", ...}               │
│   /Table/56/1/1002 → {id: 1002, name: "Carol", ...}             │
│   ...                                                           │
│   /Table/56/1/1999 → {id: 1999, name: "Zara", ...}              │
│                                                                 │
│ Replicas: Node 1 (leaseholder), Node 4, Node 7                  │
│ Size: 487 MB                                                    │
│ Raft Leader: Node 1                                             │
└─────────────────────────────────────────────────────────────────┘

RANGE DISTRIBUTION ACROSS NODES:
────────────────────────────────────────────────────────────────────
┌─────────────────────────────────────────────────────────────────┐
│                        5-NODE CLUSTER                           │
├───────────┬───────────┬───────────┬───────────┬────────────────┤
│  Node 1   │  Node 2   │  Node 3   │  Node 4   │  Node 5        │
├───────────┼───────────┼───────────┼───────────┼────────────────┤
│  R1 (L)   │  R1       │  R1       │           │                │
│  R2 (L)   │           │  R2       │  R2       │                │
│           │  R3 (L)   │  R3       │           │  R3            │
│  R4       │  R4       │           │  R4 (L)   │                │
│           │  R5       │  R5       │  R5 (L)   │                │
│  R6       │           │  R6 (L)   │           │  R6            │
│           │  R7       │           │  R7       │  R7 (L)        │
│  R8 (L)   │           │  R8       │           │  R8            │
└───────────┴───────────┴───────────┴───────────┴────────────────┘

Legend: R# = Range #, (L) = Leaseholder for that range

Observations:
- Each range has 3 replicas (replication factor = 3)
- Replicas are distributed across different nodes
- Leaseholders are distributed to balance read/write load
- No node has all leaseholders (avoiding hot spots)

Viewing Ranges:
CockroachDB provides commands to inspect range distribution:
-- Show ranges for a table
SHOW RANGES FROM TABLE users;
-- Show range for a specific key
SHOW RANGE FROM TABLE users FOR ROW (12345);
-- Detailed range info in the admin UI
-- http://localhost:8080/#/debug/range/42
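The bracketed-statement form also lets you treat SHOW RANGES output as a relation for quick aggregates—a small sketch:
-- Count a table's ranges by querying SHOW RANGES as a subquery
SELECT count(*) AS range_count
FROM [SHOW RANGES FROM TABLE users];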
Range Size Considerations:
The 512 MB default works well for most workloads. However, smaller ranges split, merge, and rebalance more quickly (useful for hot, high-throughput tables), while larger ranges mean fewer Raft groups and less metadata overhead.
You can configure range size per table:
ALTER TABLE hot_table CONFIGURE ZONE USING
range_max_bytes = 134217728, -- 128 MB max
range_min_bytes = 16777216; -- 16 MB min
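If the override later proves unnecessary, you can drop it so the table falls back to the settings it inherits from its database or the cluster default:
-- Remove the table-level override; inherited settings apply again
ALTER TABLE hot_table CONFIGURE ZONE DISCARD;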
Unlike application-managed shards (where the application knows which shard holds which data), ranges are an internal abstraction. Applications issue SQL queries; CockroachDB internally routes to the appropriate ranges. You don't shard tables—CockroachDB creates and manages ranges automatically.
Each range has multiple replicas, but one replica is special: the leaseholder. Understanding the leaseholder's role is crucial for understanding performance.
What is a Leaseholder?
The leaseholder is the replica that serves all strongly consistent reads for its range and coordinates every write, proposing changes to the range's Raft group.
Leases vs. Raft Leadership:
CockroachDB separates two concepts: the Raft leader, which coordinates log replication for the range, and the leaseholder, which serves reads and proposes writes.
Typically, the same replica is both Raft leader and leaseholder, but they can diverge temporarily during transitions.
Why Leases?
Leases enable consistent reads without running Raft consensus for every read: while its lease is valid, the leaseholder knows no other replica can serve reads or coordinate writes, so it can safely answer from local storage.
This design provides strong consistency while minimizing read latency.
LEASEHOLDER OPERATION FLOW
═══════════════════════════════════════════════════════════════════

RANGE SETUP:
────────────────────────────────────────────────────────────────────
Range 42: accounts rows 1000-2000
Replicas: Node 1 (Leaseholder), Node 3, Node 5
Lease expiry: current_time + 9 seconds

READ OPERATION (Strongly Consistent):
────────────────────────────────────────────────────────────────────
Client: SELECT * FROM accounts WHERE id = 1500

1. Gateway receives query
   └── Locate leaseholder for range containing key /accounts/1500
       └── Found: Node 1 is leaseholder

2. Gateway sends read request to Node 1
   └── DistSQL routes read to Node 1

3. Node 1 (leaseholder) processes read:
   ├── Check: Lease still valid? Yes (6 seconds remaining)
   ├── No Raft needed—read directly from local storage
   └── Return result to gateway

4. Gateway returns to client

Total latency: Gateway → Node 1 → Gateway (2 hops)

WRITE OPERATION:
────────────────────────────────────────────────────────────────────
Client: UPDATE accounts SET balance = 500 WHERE id = 1500

1. Gateway receives query
   └── Locate leaseholder: Node 1

2. Gateway sends write to Node 1

3. Node 1 (leaseholder) coordinates write:
   ├── Acquire write intents
   ├── Propose entry to Raft log
   │   └── Replicate to Node 3 and Node 5
   ├── Wait for majority acknowledgment (2 of 3)
   ├── Apply to local state machine
   └── Return success to gateway

4. Gateway returns to client

Total latency: Gateway → Node 1 → (Node 3, Node 5 parallel) → Node 1 → Gateway

LEASEHOLDER LOCATION MATTERS:
────────────────────────────────────────────────────────────────────
Scenario A: Gateway on Node 2, Leaseholder on Node 1 (same region)
  Latency: ~1-2 ms per hop × 2 = ~2-4 ms read, ~5-10 ms write

Scenario B: Gateway on Node 2 (US), Leaseholder on Node 7 (EU)
  Latency: ~70 ms per hop × 2 = ~140 ms read, ~200 ms+ write

Optimization: Connect clients to nodes near their data's leaseholders

LEASE TRANSFER:
────────────────────────────────────────────────────────────────────
CockroachDB can transfer leases to optimize performance:

1. Observe: Most reads for Range 42 come from Node 5's region
2. Decide: Transfer leaseholder from Node 1 to Node 5
3. Execute:
   ├── Node 1 proposes lease transfer via Raft
   ├── All replicas acknowledge
   ├── Node 1 stops serving reads
   └── Node 5 starts serving reads
4. Result: Reduced latency for majority of reads

Leaseholder Preferences:
You can influence where leaseholders are placed:
-- Prefer leaseholders in us-east region
ALTER TABLE accounts CONFIGURE ZONE USING
lease_preferences = '[[+region=us-east]]';
-- Constrain replicas to specific regions, prefer us-west lease
ALTER TABLE user_data CONFIGURE ZONE USING
num_replicas = 5,
constraints = '{"+region=us-east": 2, "+region=us-west": 2, "+region=eu-west": 1}',
lease_preferences = '[[+region=us-west], [+region=us-east]]';
Follower Reads:
For applications that can tolerate slight staleness, follower reads bypass the leaseholder:
-- Read from any replica (with bounded staleness)
SELECT * FROM products
AS OF SYSTEM TIME follower_read_timestamp();
Follower reads trade a few seconds of staleness for latency: the query can be served by the nearest replica, avoiding a round trip to a possibly distant leaseholder.
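Newer CockroachDB versions also offer bounded-staleness reads, which cap how stale a result may be. Note this form applies only to single-row point lookups; the id primary key here is assumed for illustration:
-- Bounded staleness: result is at most 10 seconds stale
-- (valid only for single-row point reads; assumes an id primary key)
SELECT * FROM products
AS OF SYSTEM TIME with_max_staleness('10s')
WHERE id = 42;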
Uneven leaseholder distribution causes hot spots. Check the 'Leaseholder' metric in the admin UI. If one node holds significantly more leaseholders than others, investigate zone configurations and rebalancing thresholds. CockroachDB automatically rebalances, but constraints or recent changes may cause temporary imbalance.
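To quantify the imbalance from SQL, you can count leaseholders per node. A sketch using crdb_internal (an internal, version-dependent schema):
-- Leaseholders per node; a large skew suggests a hot spot
SELECT lease_holder, count(*) AS leaseholder_count
FROM crdb_internal.ranges
GROUP BY lease_holder
ORDER BY leaseholder_count DESC;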
As data grows and shrinks, CockroachDB automatically splits and merges ranges to maintain optimal sizes.
Automatic Splitting:
A range splits when it exceeds the maximum size (default 512 MB). The split key is chosen near the range's midpoint, so each resulting range holds roughly half the data.
Split Example:
Before split:
Range 42: [/users/1, /users/10000)
Size: 600 MB
Leaseholder: Node 1
After split:
Range 42: [/users/1, /users/5000) Size: 300 MB, Leaseholder: Node 1
Range 97: [/users/5000, /users/10000) Size: 300 MB, Leaseholder: Node 1
Splits are online—queries continue during the split. The leaseholder initially owns both new ranges; rebalancing may move them later.
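You can observe the new boundaries directly with SHOW RANGES; a sketch (output column names vary slightly across versions):
-- List range boundaries for the table after the split
SELECT range_id, start_key, end_key
FROM [SHOW RANGES FROM TABLE users];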
Load-Based Splitting:
CockroachDB also splits based on load, not just size: a range with very high query volume splits even if it's small.
This prevents a single popular key from becoming a bottleneck.
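Load-based splitting is governed by cluster settings; a sketch (setting names and defaults vary by version):
-- Load-based splitting is on by default; the QPS threshold is tunable
SET CLUSTER SETTING kv.range_split.by_load_enabled = true;
SET CLUSTER SETTING kv.range_split.load_qps_threshold = 2500;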
RANGE SPLITTING AND MERGING
═══════════════════════════════════════════════════════════════════

SIZE-BASED SPLITTING:
────────────────────────────────────────────────────────────────────
Timeline of table growth:

T=0 (Initial):
  Table: users
  Range 1: [/users/, /users/ÿ)    Size: 10 MB
  (all users in one range)

T=1 (Growth):
  Range 1: [/users/, /users/ÿ)    Size: 480 MB (approaching threshold)

T=2 (Split triggered at 512 MB):
  Splitting Range 1 at key /users/id_50000...
  ├── Range 1: [/users/, /users/id_50000)    Size: 256 MB
  └── Range 2: [/users/id_50000, /users/ÿ)   Size: 256 MB

T=3 (Continued growth):
  Range 1: Size 400 MB
  Range 2: Size 550 MB (exceeds threshold)
  Splitting Range 2 at key /users/id_75000...

  Table now has 3 ranges:
  ├── Range 1: [/users/, /users/id_50000)           Size: 400 MB
  ├── Range 2: [/users/id_50000, /users/id_75000)   Size: 275 MB
  └── Range 3: [/users/id_75000, /users/ÿ)          Size: 275 MB

LOAD-BASED SPLITTING:
────────────────────────────────────────────────────────────────────
Scenario: Popular product page causes query spike

Range 42: [/products/1000, /products/2000)
  Size: 150 MB (well under threshold)
  QPS: 50,000 reads/sec (hot!)

Hot key analysis:
  /products/1500 → 40,000 QPS (viral product)
  Other keys     → 10,000 QPS combined

Action: Load-based split at /products/1500

Result:
  Range 42: [/products/1000, /products/1500)   QPS: ~5,000
  Range 98: [/products/1500, /products/1501)   QPS: ~40,000 (isolated hot key)
  Range 99: [/products/1501, /products/2000)   QPS: ~5,000

Now Range 98 can be served by dedicated resources.

AUTOMATIC MERGING:
────────────────────────────────────────────────────────────────────
Scenario: Data deletion causes underfilled ranges

Before (after bulk delete):
  Range 42: [/old_data/1, /old_data/1000)      Size: 5 MB
  Range 43: [/old_data/1000, /old_data/2000)   Size: 8 MB

Merge condition:
  Both ranges < range_min_bytes (64 MB)
  Adjacent in keyspace
  Same zone configuration

After merge:
  Range 42: [/old_data/1, /old_data/2000)      Size: 13 MB

Benefits:
- Fewer ranges to manage
- Less metadata overhead
- Reduced Raft group count

SPLIT/MERGE CONFIGURATION:
────────────────────────────────────────────────────────────────────
-- View current zone config
SHOW ZONE CONFIGURATION FOR TABLE users;

-- Configure split thresholds
ALTER TABLE users CONFIGURE ZONE USING
  range_max_bytes = 536870912,  -- 512 MB (default)
  range_min_bytes = 67108864;   -- 64 MB (default)

-- For high-throughput tables, consider smaller ranges
ALTER TABLE hot_table CONFIGURE ZONE USING
  range_max_bytes = 134217728,  -- 128 MB (smaller = more granular)
  range_min_bytes = 16777216;   -- 16 MB
Split Triggers:

- Range size exceeds range_max_bytes
- Sustained high query load on a range (load-based splitting)
- ALTER TABLE ... SPLIT AT for explicit splits

Merge Triggers:

- Adjacent ranges are both below range_min_bytes and share the same zone configuration
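Merging can also be paused cluster-wide—for example, before re-importing data into a key span you just emptied. A sketch (the setting name has changed across versions):
-- Temporarily disable the automatic merge queue
SET CLUSTER SETTING kv.range_merge.queue_enabled = false;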
Pre-Splitting for Bulk Loads:
When loading large amounts of data, pre-splitting avoids the bottleneck of one range handling all inserts:
-- Pre-split table by key values
ALTER TABLE large_import SPLIT AT VALUES
(1000000), (2000000), (3000000), (4000000);
-- Now bulk load distributes across 5 ranges
IMPORT INTO large_import ...;
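Right after a pre-split, all the new ranges' leaseholders still sit on the original node; SCATTER asks the cluster to randomize replica and lease placement before the load begins:
-- Randomize replica and lease placement for the freshly split ranges
ALTER TABLE large_import SCATTER;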
Load-based splitting can isolate hot keys, but if the hot spot is a single key (like a global counter), even splitting doesn't help—that key still requires coordination through one leaseholder. For true single-key hot spots, consider application-level solutions like sharded counters or batching.
CockroachDB continuously rebalances data across nodes to maintain even distribution. The rebalancing algorithm considers multiple factors.
Rebalancing Goals:

- Even storage utilization across nodes
- Even range and leaseholder counts (and therefore even query load)
- Zone constraints satisfied at all times
- Replica diversity across failure domains (nodes, zones, regions)
The Store Rebalancer:
The store rebalancer runs continuously on every node, evaluating whether to move replicas from overloaded stores to underloaded ones, transfer leases to even out read/write traffic, or add and remove replicas to satisfy zone constraints.
Rebalancing Decisions:
The algorithm computes a score for each potential action:
score = storage_score + load_score + constraint_score + diversity_score
REBALANCING ALGORITHM IN ACTION
═══════════════════════════════════════════════════════════════════

INITIAL STATE (UNBALANCED):
────────────────────────────────────────────────────────────────────
              Storage    Ranges    Leaseholders    QPS
Node 1 (new)   50 GB      100          20         2,000
Node 2        200 GB      400         180        18,000
Node 3        180 GB      360         150        15,000
Node 4        170 GB      340         150        15,000

Problem: Node 1 is underutilized (just added or recovered)
         Node 2 is slightly overloaded

REBALANCING PROCESS:
────────────────────────────────────────────────────────────────────
Step 1: Identify imbalance
  Target per node: ~150 GB, ~300 ranges, ~125 leaseholders
  Node 1: -100 GB under, -200 ranges under → needs MORE replicas
  Node 2: +50 GB over, +100 ranges over → needs FEWER replicas

Step 2: Select ranges to move
  Algorithm considers:
  ├── Range size (prefer medium-sized for efficient transfer)
  ├── Range activity (prefer moving cold ranges)
  ├── Constraint compatibility (can Node 1 host this range?)
  └── Network cost (prefer ranges where Node 1 is already a replica)

Step 3: Execute transfers (throttled)
  ├── Move Range 42 replica: Node 2 → Node 1
  ├── Move Range 58 replica: Node 2 → Node 1
  ├── Move Range 67 replica: Node 3 → Node 1
  └── ... (continues until balanced)

Step 4: Lease transfers (after replicas stable)
  ├── Transfer leaseholder for Range 101: Node 2 → Node 1
  ├── Transfer leaseholder for Range 102: Node 3 → Node 1
  └── ... (balances read/write load)

AFTER REBALANCING:
────────────────────────────────────────────────────────────────────
              Storage    Ranges    Leaseholders    QPS
Node 1        150 GB      300         125        12,500
Node 2        150 GB      300         125        12,500
Node 3        150 GB      300         125        12,500
Node 4        150 GB      300         125        12,500

REBALANCING RATE LIMITING:
────────────────────────────────────────────────────────────────────
Rebalancing is throttled to avoid overwhelming the cluster:

Configurable parameters:
  kv.snapshot_rebalance.max_rate: 32 MB/s (default)
  kv.snapshot_recovery.max_rate: 32 MB/s (default)

Impact: Moving 100 GB of data at 32 MB/s ≈ 52 minutes

Trade-off:
- Higher rate: Faster rebalancing, more network/disk impact
- Lower rate: Slower rebalancing, less production impact

LOCALITY-AWARE REBALANCING:
────────────────────────────────────────────────────────────────────
Configuration:

ALTER TABLE user_data CONFIGURE ZONE USING
  num_replicas = 5,
  constraints = '{
    "+region=us-east": 2,
    "+region=us-west": 2,
    "+region=eu-west": 1
  }',
  lease_preferences = '[[+region=us-east]]';

Rebalancer ensures:
├── Exactly 2 replicas in us-east (never more, never fewer)
├── Exactly 2 replicas in us-west
├── Exactly 1 replica in eu-west
└── Leaseholder preferably in us-east

Rebalancing Triggers:

- A node is added, removed, drained, or fails
- Storage, range, or leaseholder counts drift from the cluster average
- Zone configurations change
Viewing Rebalancing Status:
-- Check range distribution
SELECT node_id, count(*) as range_count
FROM crdb_internal.ranges_no_leases
GROUP BY node_id;
-- Check for under-replicated ranges
SELECT * FROM crdb_internal.ranges
WHERE array_length(replicas, 1) < 3;
-- Admin UI: Replication dashboard
-- http://localhost:8080/#/metrics/replication
Rebalancing happens continuously in the background. During major operations (bulk loads, migrations), you may want to temporarily increase rebalance rates or pause rebalancing. Use SET CLUSTER SETTING to adjust rates, and drain nodes gracefully before removal to trigger proactive rebalancing.
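For example, to speed up rebalancing during a planned maintenance window (a sketch; remember to revert afterwards):
-- Double the default snapshot transfer rate for the migration window
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '64MiB';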
Zone configurations control where data is placed and how it's replicated. They're the primary mechanism for implementing data locality, disaster recovery, and compliance requirements.
Zone Configuration Hierarchy:
Configurations can be set at multiple levels: the cluster-wide default, a database, a table, an index, or a partition.
More specific configurations override more general ones, so a table setting beats the database setting, which beats the cluster default.
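A small sketch of inheritance in action (mydb and audit_log are hypothetical names):
-- Database-level setting: applies to every table in mydb by default
ALTER DATABASE mydb CONFIGURE ZONE USING num_replicas = 5;
-- Table-level override: wins over the database setting for this table
ALTER TABLE mydb.audit_log CONFIGURE ZONE USING num_replicas = 3;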
Key Zone Configuration Options:
ALTER TABLE accounts CONFIGURE ZONE USING
-- Replication
num_replicas = 5,
-- Placement constraints
constraints = '{
"+region=us-east": 2,
"+region=us-west": 2,
"+region=eu-west": 1
}',
-- Lease preferences (for read/write performance)
lease_preferences = '[[+region=us-east], [+region=us-west]]',
-- Range sizing
range_max_bytes = 536870912,
range_min_bytes = 67108864,
-- Garbage collection
gc.ttlseconds = 90000; -- 25 hours
-- ═══════════════════════════════════════════════════════════════════
-- ZONE CONFIGURATION PATTERNS
-- ═══════════════════════════════════════════════════════════════════

-- PATTERN 1: Multi-Region Table with Regional Leaders
-- ─────────────────────────────────────────────────────────────────────
-- User data should be geographically distributed with local reads

CREATE TABLE user_profiles (
    user_id UUID PRIMARY KEY,
    region STRING NOT NULL,
    data JSONB
);

-- Partition by user region
ALTER TABLE user_profiles PARTITION BY LIST (region) (
    PARTITION americas VALUES IN ('us-east', 'us-west', 'brazil'),
    PARTITION europe VALUES IN ('eu-west', 'eu-central'),
    PARTITION asia VALUES IN ('asia-east', 'asia-south')
);

-- Configure each partition for locality
ALTER PARTITION americas OF TABLE user_profiles CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+region=us-east, +region=us-west]',
    lease_preferences = '[[+region=us-east]]';

ALTER PARTITION europe OF TABLE user_profiles CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+region=eu-west]',
    lease_preferences = '[[+region=eu-west]]';

ALTER PARTITION asia OF TABLE user_profiles CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+region=asia-east]',
    lease_preferences = '[[+region=asia-east]]';

-- PATTERN 2: REGIONAL BY ROW (Automatic Partitioning)
-- ─────────────────────────────────────────────────────────────────────
-- CockroachDB can automatically partition and configure based on a column

-- Set database to multi-region
ALTER DATABASE mydb PRIMARY REGION "us-east";
ALTER DATABASE mydb ADD REGION "us-west";
ALTER DATABASE mydb ADD REGION "eu-west";

-- Create table that auto-partitions by crdb_region column
CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    customer_id UUID,
    crdb_region crdb_internal_region AS (
        CASE
            WHEN customer_id::TEXT < '50000000' THEN 'us-east'
            WHEN customer_id::TEXT < 'a0000000' THEN 'us-west'
            ELSE 'eu-west'
        END
    ) STORED
) LOCALITY REGIONAL BY ROW;

-- Each row automatically placed in its computed region

-- PATTERN 3: Global Tables (Optimized for Reads)
-- ─────────────────────────────────────────────────────────────────────
-- Reference data that's read frequently, written rarely

CREATE TABLE country_codes (
    code STRING PRIMARY KEY,
    name STRING,
    continent STRING
) LOCALITY GLOBAL;

-- GLOBAL tables:
-- - Replicas in ALL regions
-- - Reads served locally (no cross-region latency)
-- - Writes must propagate to all regions (slower)

-- PATTERN 4: SSD vs HDD Tiering
-- ─────────────────────────────────────────────────────────────────────
-- Hot data on SSD, cold data on HDD

-- Assuming nodes are labeled with storage type
-- Node 1-3: storage:ssd
-- Node 4-6: storage:hdd

-- Hot transactional data on SSD
ALTER TABLE active_sessions CONFIGURE ZONE USING
    constraints = '[+storage=ssd]';

-- Cold archive data on HDD
ALTER TABLE archived_logs CONFIGURE ZONE USING
    constraints = '[+storage=hdd]';

-- PATTERN 5: Survival Goals
-- ─────────────────────────────────────────────────────────────────────
-- Configure what level of failure the database survives

-- Survive region failure (strongest)
ALTER DATABASE critical_db SURVIVE REGION FAILURE;
-- Requires: Replicas across 3+ regions
-- Impact: Higher write latency (cross-region consensus)

-- Survive zone failure only (faster writes)
ALTER DATABASE standard_db SURVIVE ZONE FAILURE;
-- Requires: Replicas across 3+ zones
-- Impact: Lower write latency (intra-region consensus)

Constraint Types:
Required constraint (+): Replica MUST be placed here
  +region=us-east → Must have replica in us-east

Prohibited constraint (-): Replica MUST NOT be placed here
  -region=eu-west → No replicas allowed in eu-west

Constraint count (n): Exactly n replicas in this location
  "+region=us-east": 2 → Exactly 2 replicas in us-east

Viewing Zone Configurations:
-- Show all zone configs
SHOW ALL ZONE CONFIGURATIONS;
-- Show effective config for a specific table
-- (settings not set directly are inherited from the database/cluster)
SHOW ZONE CONFIGURATION FOR TABLE accounts;
Zone configurations are powerful but can't fix a schema that ignores locality. Design tables with locality in mind: use REGIONAL BY ROW for user-facing data, GLOBAL for reference data, and consider partitioning for tables accessed by geographic region. The schema design determines what zone configurations can achieve.
Effective operation of CockroachDB requires monitoring the automatic load balancing systems and knowing how to diagnose issues.
Key Metrics to Monitor:
1. Range Distribution: replica counts per node should be roughly even; persistent skew points to constraint problems or an unhealthy node.
2. Leaseholder Distribution: leaseholders (and therefore query traffic) should also be spread evenly; a node holding most leases becomes a hot spot.
3. Rebalancing Activity: snapshot generation and application rates show whether rebalancing is keeping up or falling behind.
4. Storage Health: per-node disk utilization; nodes approaching capacity cannot accept new replicas.
-- ═══════════════════════════════════════════════════════════════════
-- MONITORING AND TROUBLESHOOTING QUERIES
-- ═══════════════════════════════════════════════════════════════════
-- Note: crdb_internal and the replication reports are internal,
-- version-dependent surfaces; column names may differ in your version.

-- CHECK RANGE DISTRIBUTION
-- ─────────────────────────────────────────────────────────────────────
-- See how range replicas are distributed across nodes

SELECT unnest(replicas) AS node_id, count(*) AS replica_count
FROM crdb_internal.ranges_no_leases
GROUP BY node_id
ORDER BY node_id;

-- Healthy output shows similar numbers across nodes

-- CHECK FOR UNDER-REPLICATED RANGES
-- ─────────────────────────────────────────────────────────────────────
-- Ranges with fewer replicas than configured (DATA AT RISK)

SELECT range_id, start_key, end_key,
       array_length(replicas, 1) AS replica_count,
       lease_holder
FROM crdb_internal.ranges
WHERE array_length(replicas, 1) < 3  -- assuming replication factor 3
ORDER BY range_id;

-- Any rows here = immediate investigation needed

-- CHECK FOR UNAVAILABLE RANGES
-- ─────────────────────────────────────────────────────────────────────
-- Ranges that cannot serve reads/writes (OUTAGE), via replication reports

SELECT zone_id, total_ranges, unavailable_ranges, under_replicated_ranges
FROM system.replication_stats
WHERE unavailable_ranges > 0 OR under_replicated_ranges > 0;

-- CHECK REBALANCING PROGRESS
-- ─────────────────────────────────────────────────────────────────────
-- See ongoing snapshot transfers

SELECT node_id,
       metrics->>'range.snapshots.generated' AS snapshots_sent,
       metrics->>'range.snapshots.applied-voter' AS snapshots_received
FROM crdb_internal.kv_node_status;

-- IDENTIFY HOT RANGES
-- ─────────────────────────────────────────────────────────────────────
-- Per-range QPS columns are not exposed in every version; the Admin UI
-- Hot Ranges page is the most reliable source. Where available:

SELECT range_id, start_pretty, end_pretty,
       queries_per_second, lease_holder
FROM crdb_internal.ranges
ORDER BY queries_per_second DESC
LIMIT 20;

-- CHECK ZONE CONSTRAINT VIOLATIONS
-- ─────────────────────────────────────────────────────────────────────
-- Zones whose placement constraints are currently violated,
-- via the built-in replication reports

SELECT zone_id, type, config, violating_ranges
FROM system.replication_constraint_stats
WHERE violating_ranges > 0;

-- STORAGE CAPACITY CHECK
-- ─────────────────────────────────────────────────────────────────────
-- Check storage utilization per node

SELECT node_id, store_id,
       capacity / 1073741824 AS capacity_gb,
       available / 1073741824 AS available_gb,
       used / 1073741824 AS used_gb,
       (1 - available::FLOAT / capacity) * 100 AS pct_used
FROM crdb_internal.kv_store_status
ORDER BY pct_used DESC;

-- Alert if any node > 80% used

CockroachDB's Admin UI (default port 8080) provides excellent visibility into replication, rebalancing, and range distribution. The Replication Dashboard, Node List, and Range Debug pages surface most issues immediately. Use it as your first stop for troubleshooting.
We've explored how CockroachDB automatically manages data distribution across the cluster. The key concepts:

- Ranges are the unit of distribution: contiguous, ~512 MB spans of the key space
- Each range's leaseholder serves consistent reads and coordinates writes
- Ranges split (by size or load) and merge automatically
- The store rebalancer continuously evens out storage, ranges, and leaseholders
- Zone configurations control replica placement, lease preferences, and survival goals
What's Next:
We've now covered CockroachDB's core architecture. In the final page, we'll step back and answer the practical question: When should you use CockroachDB? We'll examine use cases, alternatives, and provide a decision framework for evaluating whether CockroachDB is the right choice for your system.
You now understand how CockroachDB automatically distributes and balances data—ranges, leaseholders, splitting, merging, and zone configurations. This knowledge enables you to design schemas that cooperate with the balancing algorithms and troubleshoot distribution issues. Next, we'll discuss when CockroachDB is the right choice and how it compares to alternatives.