In a world of eventually consistent NoSQL databases and "read committed" defaults, CockroachDB makes an audacious promise: Serializable Isolation by default—the strongest isolation level defined by the SQL standard.
Serializable isolation means that concurrent transactions behave as if they executed one after another in some serial order. No phantom reads. No write skew. No anomalies that leave your data in an inconsistent state. The database handles the complexity of concurrency control so your application doesn't have to.
This guarantee comes at a cost: serializable isolation is traditionally considered expensive and slow. CockroachDB achieves it efficiently through a combination of multiversion concurrency control (MVCC), optimistic conflict detection via Serializable Snapshot Isolation (SSI), and hybrid logical clock (HLC) timestamps, each explored below.
Understanding this model is essential for writing correct applications and debugging issues when they arise. Most distributed databases offer weaker guarantees; knowing what serializable buys you helps you appreciate when it matters.
By the end of this page, you will understand transaction isolation levels and why serializability matters, how MVCC stores and manages multiple versions of data, how CockroachDB's SSI implementation detects conflicts, and practical implications for application developers.
Before diving into CockroachDB's implementation, let's understand what isolation levels are and why the choices matter.
The Problem: Concurrent Access
When multiple transactions access the same data simultaneously, anomalies can occur:
1. Dirty Read: Transaction A reads data written by Transaction B before B commits. If B rolls back, A has read data that never existed.
2. Non-Repeatable Read: Transaction A reads a row, then Transaction B modifies it and commits. A reads again and sees different data.
3. Phantom Read: Transaction A queries rows matching a condition. Transaction B inserts a new row matching that condition and commits. A queries again and sees a new row that wasn't there before.
4. Write Skew: Two transactions read overlapping data, make decisions based on what they read, and write to disjoint parts of the data. The final state violates an invariant that both transactions believed they were maintaining.
SQL Standard Isolation Levels:
The SQL standard defines four isolation levels, each preventing different anomalies:
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible | Possible |
| Serializable | Prevented | Prevented | Prevented | Prevented |
Most databases default to Read Committed or weaker, leaving applications to handle potential anomalies. PostgreSQL defaults to Read Committed. MySQL (InnoDB) defaults to Repeatable Read. Many NoSQL databases don't provide isolation at all.
CockroachDB provides Serializable by default—preventing all anomalies. This is a significant advantage for application developers who don't need to reason about concurrency edge cases.
The Write Skew Problem in Detail:
Write skew is subtle and often causes real-world bugs. Consider a hospital on-call scheduling system:
Rule: At least one doctor must be on-call at all times.
Current state: Dr. Alice is on-call, Dr. Bob is on-call.
Transaction 1 (Alice wants off-call):
Read: Alice on-call=true, Bob on-call=true
Check: Someone else is on-call? Yes (Bob)
Write: Alice on-call=false
Commit
Transaction 2 (Bob wants off-call): [concurrent]
Read: Alice on-call=true, Bob on-call=true
Check: Someone else is on-call? Yes (Alice)
Write: Bob on-call=false
Commit
Result: No one is on-call. Rule violated!
Both transactions read consistent data and made correct individual decisions. But the interleaved execution violated the invariant. Serializable isolation prevents this—one transaction would be forced to retry after the other committed.
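To make this concrete, here is a sketch of the scenario as two interleaved SQL sessions against a hypothetical doctors(id, on_call) table. The exact error text varies, but the losing session receives a retryable SQLSTATE 40001 error:

```sql
-- Hypothetical schema: CREATE TABLE doctors (id STRING PRIMARY KEY, on_call BOOL);

-- Session 1 (Alice)
BEGIN;
SELECT count(*) FROM doctors WHERE on_call AND id != 'alice';  -- → 1 (Bob)

-- Session 2 (Bob), concurrently
BEGIN;
SELECT count(*) FROM doctors WHERE on_call AND id != 'bob';    -- → 1 (Alice)

-- Session 1
UPDATE doctors SET on_call = false WHERE id = 'alice';
COMMIT;                                                        -- succeeds

-- Session 2
UPDATE doctors SET on_call = false WHERE id = 'bob';
COMMIT;  -- fails with a retryable error (SQLSTATE 40001); on retry, the
         -- SELECT sees 0 other doctors on-call and the app keeps Bob on
```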
Many applications running on databases with weaker isolation have latent bugs that only manifest under high concurrency. These bugs often go undetected in development and staging, appearing only in production when load increases. Serializable isolation eliminates an entire class of subtle, hard-to-reproduce bugs.
MVCC is the foundation of CockroachDB's concurrency control. Instead of locking rows during reads (which would block writers), MVCC keeps multiple versions of each row, allowing readers and writers to operate without blocking each other.
How MVCC Works:
Every write creates a new version of the data with a timestamp:
Row: accounts.balance WHERE id = 'alice'
Versions stored:
@ timestamp 100: balance = 500
@ timestamp 150: balance = 450 (after $50 withdrawal)
@ timestamp 200: balance = 550 (after $100 deposit)
When a transaction reads, it reads at its read timestamp—seeing only versions with timestamps less than or equal to that value.
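As a toy illustration of this versioned lookup (not CockroachDB's storage format), a sorted list of (timestamp, value) pairs per key is enough:

```python
import bisect

class MVCCStore:
    """Toy MVCC store: each key maps to a list of (timestamp, value)
    versions kept sorted by timestamp."""

    def __init__(self):
        self.versions = {}  # key -> [(ts, value), ...] sorted ascending

    def write(self, key, ts, value):
        # Append a new version; never overwrite old ones.
        bisect.insort(self.versions.setdefault(key, []), (ts, value))

    def read(self, key, read_ts):
        # Return the newest version with timestamp <= read_ts.
        vs = self.versions.get(key, [])
        i = bisect.bisect_right(vs, (read_ts, float("inf")))
        return vs[i - 1][1] if i > 0 else None

store = MVCCStore()
store.write("accounts/alice/balance", 100, 500)
store.write("accounts/alice/balance", 150, 450)
store.write("accounts/alice/balance", 200, 550)

print(store.read("accounts/alice/balance", 175))  # 450
print(store.read("accounts/alice/balance", 225))  # 550
```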
Benefits of MVCC:
- Readers never block writers, and writers never block readers.
- Every transaction reads from a stable, consistent snapshot.
- Old versions remain available for time-travel queries and consistent historical reads.
```
MVCC DATA MODEL IN COCKROACHDB
═══════════════════════════════════════════════════════════════

LOGICAL VIEW (what applications see):

  accounts table
  ┌────────┬─────────┬─────────────────────┐
  │ id     │ balance │ last_updated        │
  ├────────┼─────────┼─────────────────────┤
  │ alice  │ 550     │ 2024-01-15 10:30:00 │
  │ bob    │ 300     │ 2024-01-15 09:15:00 │
  │ carol  │ 1200    │ 2024-01-14 16:45:00 │
  └────────┴─────────┴─────────────────────┘

PHYSICAL STORAGE (what MVCC actually stores):

  Key format: /Table/accounts/Primary/id/Column/timestamp

  MVCC key-value pairs (sorted by key):
    /accounts/alice/balance @ ts=200 → 550
    /accounts/alice/balance @ ts=150 → 450
    /accounts/alice/balance @ ts=100 → 500
    /accounts/alice/balance @ ts=50  → 500 (initial)

    /accounts/bob/balance   @ ts=175 → 300
    /accounts/bob/balance   @ ts=80  → 350

    /accounts/carol/balance @ ts=160 → 1200
    /accounts/carol/balance @ ts=90  → 1000

READING AT DIFFERENT TIMESTAMPS:

  Query: SELECT balance FROM accounts WHERE id = 'alice'

  At timestamp 175:
    └── Scan for /accounts/alice/balance @ ts ≤ 175
        └── Found: /accounts/alice/balance @ ts=150 → 450
            └── Result: balance = 450

  At timestamp 225:
    └── Scan for /accounts/alice/balance @ ts ≤ 225
        └── Found: /accounts/alice/balance @ ts=200 → 550
            └── Result: balance = 550

TIME TRAVEL QUERY:

  SELECT balance FROM accounts WHERE id = 'alice'
  AS OF SYSTEM TIME '2024-01-15 10:00:00';
    └── Converts timestamp to HLC
        └── Reads version at or before that HLC
            └── Returns historical value

GARBAGE COLLECTION:

  Old versions are garbage collected after the GC TTL
  (default: 25 hours).

  Before GC (all versions):
    /accounts/alice/balance @ ts=200 → 550
    /accounts/alice/balance @ ts=150 → 450
    /accounts/alice/balance @ ts=100 → 500
    /accounts/alice/balance @ ts=50  → 500

  After GC (versions older than gc.ttlseconds removed):
    /accounts/alice/balance @ ts=200 → 550
    /accounts/alice/balance @ ts=150 → 450
    (ts=100 and ts=50 garbage collected)
```

MVCC and Storage:
MVCC versions are stored in CockroachDB's underlying storage engine (Pebble, an LSM-tree). The key includes the timestamp, meaning versions are naturally sorted by time. Reading the latest version is a simple seek to the key with the highest timestamp ≤ the read timestamp.
MVCC Overhead:
MVCC isn't free:
- Every update adds a new version, so storage grows with write volume until old versions are garbage collected.
- Scans must skip over old versions and deletion tombstones, which can slow reads on heavily-updated keys.
- Garbage collection itself consumes background CPU and I/O.
CockroachDB mitigates these through:
- A configurable GC TTL (gc.ttlseconds, default 25 hours) that bounds how long old versions are retained.
- LSM-tree compactions in Pebble that physically drop garbage-collected versions.
- Storing a key's versions adjacent in the keyspace, so reading the latest version remains a cheap seek.
MVCC enables powerful time-travel queries. Use AS OF SYSTEM TIME to query historical data: SELECT * FROM orders AS OF SYSTEM TIME '-1h'. This is useful for: (1) debugging by examining past state, (2) generating consistent reports, (3) implementing optimistic locking patterns, and (4) follower reads for lower latency.
MVCC alone provides Snapshot Isolation (SI)—each transaction sees a consistent snapshot. But SI doesn't prevent all anomalies (notably write skew). CockroachDB extends SI with conflict detection to achieve Serializable Snapshot Isolation (SSI).
The SSI Approach:
SSI allows transactions to run optimistically (without blocking), then detects conflicts at commit time. If a conflict would cause a serializability violation, one transaction is aborted and retried.
Conflict Types SSI Detects:
1. Read-Write Conflicts (rw-antidependency)
Transaction B writes data that Transaction A read (and A hasn't committed yet):
A starts @ ts=100: reads accounts.balance
B starts @ ts=150: writes accounts.balance, commits
A tries to commit @ ts=200: ???
If A commits, A would have seen a state that existed before B's write, but A's commit appears after B's. This could cause anomalies. CockroachDB either:
- pushes A's timestamp forward and refreshes A's reads to prove they are still valid at the new timestamp, or
- aborts A with a retryable error (SQLSTATE 40001) so it can restart with fresh data.
2. Write-Read Conflicts (wr-dependency)
Transaction A writes data, Transaction B reads A's write:
A starts @ ts=100: writes accounts.balance = 500
B starts @ ts=150: reads accounts.balance (sees A's write intent)
B must wait for A's outcome:
- If A commits, B reads the committed value (possibly at a pushed timestamp).
- If A aborts, B ignores the intent and reads the latest committed version beneath it.
3. Write-Write Conflicts (ww-dependency)
Both transactions write the same key:
A starts @ ts=100: writes accounts.balance = 500
B starts @ ts=150: writes accounts.balance = 600
Only one can win. CockroachDB uses timestamp ordering—the transaction with the lower commit timestamp wins (if it commits). The other must retry.
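Putting the three conflict types together, commit-time validation can be pictured as re-checking a transaction's read set, as in this simplified Python sketch. This is illustrative only: CockroachDB actually tracks reads in a timestamp cache and resolves conflicts via pushes and read refreshes rather than explicit read-set scans, and the names here are ours:

```python
from dataclasses import dataclass, field

class SerializationFailure(Exception):
    """Analogous to SQLSTATE 40001: safe to retry with fresh reads."""

@dataclass
class Txn:
    read_set: dict = field(default_factory=dict)   # key -> timestamp read at
    write_set: dict = field(default_factory=dict)  # key -> new value
    commit_ts: int = 0

def try_commit(txn: Txn, committed: dict) -> None:
    """committed maps key -> (latest_commit_ts, value).

    A transaction may commit only if nothing it read has since been
    overwritten at a newer timestamp (the rw-antidependency check)."""
    for key, read_ts in txn.read_set.items():
        latest = committed.get(key)
        if latest is not None and latest[0] > read_ts:
            raise SerializationFailure(f"read of {key!r} is stale")
    for key, value in txn.write_set.items():
        committed[key] = (txn.commit_ts, value)

# Doctor example: each transaction read the *other* doctor's row.
committed = {"alice.on_call": (50, True), "bob.on_call": (50, True)}
a = Txn({"bob.on_call": 100}, {"alice.on_call": False}, 110)
b = Txn({"alice.on_call": 105}, {"bob.on_call": False}, 120)

try_commit(a, committed)  # A commits: bob.on_call unchanged since ts=100
try_commit(b, committed)  # raises SerializationFailure: alice.on_call was
                          # overwritten at ts=110 > B's read at ts=105
```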
```
SERIALIZABLE SNAPSHOT ISOLATION CONFLICT DETECTION
═══════════════════════════════════════════════════════════════

SCENARIO: Write Skew Prevention (Doctor On-Call Example)

  Initial state: doctors(alice, on_call=true)
                 doctors(bob,   on_call=true)
  Invariant: COUNT(*) FROM doctors WHERE on_call = true >= 1

  Transaction A (Alice wants off):     Transaction B (Bob wants off):
  ──────────────────────────────────   ──────────────────────────────────
  BEGIN @ ts=100                       BEGIN @ ts=105

  SELECT COUNT(*) FROM doctors         SELECT COUNT(*) FROM doctors
  WHERE on_call AND id != 'alice'      WHERE on_call AND id != 'bob'
  → Result: 1 (Bob is on-call)         → Result: 1 (Alice is on-call)

  -- A thinks OK to go off-call        -- B thinks OK to go off-call

  UPDATE doctors SET on_call=false     UPDATE doctors SET on_call=false
  WHERE id = 'alice'                   WHERE id = 'bob'
  → Write intent:                      → Write intent:
    alice.on_call=false                  bob.on_call=false

  COMMIT                               COMMIT

SSI CONFLICT DETECTION DURING COMMIT:

  When A commits (assumed first):
  ├── A's read set: {bob.on_call} @ ts=100
  ├── A's write set: {alice.on_call}
  ├── No conflicts with A's reads (bob not modified)
  └── A commits successfully

  When B tries to commit:
  ├── B's read set: {alice.on_call} @ ts=105
  ├── B's write set: {bob.on_call}
  ├── Check: Has alice.on_call changed since B read it?
  │   └── YES! A committed alice.on_call=false
  │       (B's snapshot should have reflected it, but B read
  │        before A committed, so B's read is stale)
  ├── Conflict detected: B read stale data
  └── B's transaction is ABORTED

RETRY FLOW:

  B retries automatically:

  Transaction B (retry):
  ──────────────────────────────────
  BEGIN @ ts=200

  SELECT COUNT(*) FROM doctors
  WHERE on_call AND id != 'bob'
  → Result: 0 (Alice is OFF-call now)

  -- Application logic: cannot go off-call, someone must stay.
  -- Application either:
  --   1. Aborts the request: "Cannot go off-call, no one else available"
  --   2. Or finds alternative logic

  COMMIT (no changes)

  INVARIANT PRESERVED! ✓

CONFLICT RESOLUTION STRATEGIES:

  CockroachDB uses multiple strategies:

  1. TIMESTAMP PUSH: If B's reads are still valid at a higher
     timestamp, push B's timestamp forward and continue.

  2. TRANSACTION WAIT: If B encounters an uncommitted write intent
     from A, B waits for A to commit/abort before proceeding.

  3. ABORT AND RETRY: If the conflict is unresolvable (B's reads are
     genuinely stale), abort B and let it retry with fresh data.
```

Write Intents: The Locking Mechanism:
When a transaction writes data, it doesn't immediately commit the value. Instead, it writes a write intent—a provisional value that:
- carries the new value together with a pointer to the writer's transaction record, and
- is not visible as committed data until that transaction commits.
Other transactions encountering a write intent must:
- look up the writer's transaction record to learn its status,
- treat the intent as a committed value if the writer committed,
- ignore it (reading the version beneath) if the writer aborted, or
- wait for, or attempt to push, the writer if it is still pending.
The Transaction Record:
Each transaction has a transaction record stored in the same keyspace. This record tracks:
- the transaction's current status (PENDING, COMMITTED, or ABORTED), and
- its current (provisional) commit timestamp.
When a transaction commits, its record is updated atomically. Other transactions checking intent status read this record to determine the outcome.
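As a rough model of that decision, not CockroachDB's actual code, here is what a reader does when it hits an intent (all type and field names here are ours):

```python
from dataclasses import dataclass
from enum import Enum

class TxnStatus(Enum):
    PENDING = 1
    COMMITTED = 2
    ABORTED = 3

class WouldBlock(Exception):
    """Reader must wait for (or try to push) the writer."""

@dataclass
class WriteIntent:
    value: object                     # the provisional value
    txn_id: str                       # points at the writer's txn record
    previous_committed_value: object  # latest committed MVCC version below

@dataclass
class TxnRecord:
    txn_id: str
    status: TxnStatus

def read_through_intent(intent: WriteIntent, record: TxnRecord):
    """Decide what a reader sees when it encounters a write intent."""
    if record.status is TxnStatus.COMMITTED:
        # Writer committed: the intent is effectively a real value.
        return intent.value
    if record.status is TxnStatus.ABORTED:
        # Writer aborted: the intent is garbage; use the version below it.
        return intent.previous_committed_value
    # Writer still pending: wait for its outcome (or push its timestamp).
    raise WouldBlock(record.txn_id)
```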
Traditional databases often use Two-Phase Locking (2PL) for serializability—acquiring locks before accessing data, releasing after commit. 2PL blocks readers and writers. SSI is optimistic—transactions run without blocking, conflicts detected at commit. SSI has better throughput for read-heavy workloads but may have more aborts under high contention.
Timestamps are central to CockroachDB's consistency model. Every transaction has a timestamp that determines:
- which MVCC versions its reads can see, and
- the version timestamp attached to its writes, and therefore the transaction's position in the apparent serial order.
Hybrid Logical Clocks (HLC):
CockroachDB uses Hybrid Logical Clocks, which combine:
- physical time from the node's system clock, and
- a logical counter, in the style of Lamport clocks, to order events that share the same physical time.
HLC timestamps have the form (physical_time, logical_counter). They're compared lexicographically—physical time first, then logical counter.
Why Not Just System Clocks?
Distributed systems can't rely solely on system clocks because:
- clocks on different machines drift apart despite NTP,
- NTP corrections can step a clock backwards, and
- identical wall-clock readings on two nodes imply nothing about the causal order of their events.
HLC handles these by:
- Never moving backwards: the physical component only ratchets forward, even if the system clock steps back.
- Absorbing remote timestamps: on receiving a message, a node advances its HLC to at least the sender's timestamp.
- Using the logical counter to order events that share the same physical component (see the sketch below).
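These update rules translate almost directly into code. Here is a minimal, illustrative Python version (CockroachDB's real implementation is in Go and adds monotonicity and offset enforcement; the class name and structure here are ours). The same rules appear diagrammatically just below:

```python
import time

class HLC:
    """Minimal Hybrid Logical Clock."""

    def __init__(self, now=time.time_ns):
        self.now = now        # wall-clock source, injectable for testing
        self.physical = 0
        self.logical = 0

    def tick(self):
        """Rule 1: local or send event."""
        wall = self.now()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1         # physical unchanged: bump logical
        return (self.physical, self.logical)

    def update(self, msg_physical, msg_logical):
        """Rule 2: receive event carrying a remote HLC timestamp."""
        wall = self.now()
        old = self.physical
        self.physical = max(old, msg_physical, wall)
        if self.physical == old == msg_physical:
            self.logical = max(self.logical, msg_logical) + 1
        elif self.physical == old:
            self.logical += 1
        elif self.physical == msg_physical:
            self.logical = msg_logical + 1
        else:
            self.logical = 0          # the wall clock alone moved us forward
        return (self.physical, self.logical)
```

Timestamps compare lexicographically as (physical, logical) pairs, which Python's tuple comparison provides directly: (1005, 0) > (1000, 2).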
```
HLC TIMESTAMP MANAGEMENT
═══════════════════════════════════════════════════════════════

HLC TIMESTAMP STRUCTURE:

  ┌─────────────────────────────────┬─────────────────┐
  │ Physical Time (nanoseconds)     │ Logical Counter │
  │ 64 bits                         │ 32 bits         │
  └─────────────────────────────────┴─────────────────┘

  Example: 1705312800000000000.00000005
           └────────────┬──────────┘ └───┬───┘
            Physical (ns since epoch)  Logical

HLC UPDATE RULES:

  Rule 1: Local event (send message, local operation)
    hlc.physical = max(hlc.physical, system_clock)
    if hlc.physical unchanged: hlc.logical++
    else:                      hlc.logical = 0

  Rule 2: Receive event (receive message with remote HLC)
    old_physical = hlc.physical
    hlc.physical = max(hlc.physical, msg.hlc.physical, system_clock)
    if hlc.physical == old_physical == msg.hlc.physical:
        hlc.logical = max(hlc.logical, msg.hlc.logical) + 1
    elif hlc.physical == old_physical:
        hlc.logical++
    elif hlc.physical == msg.hlc.physical:
        hlc.logical = msg.hlc.logical + 1
    else:
        hlc.logical = 0

CLOCK SKEW SCENARIO:

  Node A (clock accurate):        Node B (clock 100ms behind):
  System time: 1000ms             System time: 900ms

  Event sequence:
  1. Node A: local event
       hlc_A = (1000, 0)
  2. Node A sends message to B with hlc=(1000, 0)
  3. Node B receives message:
       system_clock = 900ms (behind!)
       msg.hlc = (1000, 0)
       hlc_B.physical = max(900, 1000, 900) = 1000
       hlc_B.logical = 0 + 1 = 1
       Result: hlc_B = (1000, 1)
  4. Node B: local event
       hlc_B = (1000, 2)   [physical unchanged, logical++]
  5. Node B sends message to A with hlc=(1000, 2)
  6. Node A receives (now system_clock=1005):
       hlc_A.physical = max(1000, 1000, 1005) = 1005
       hlc_A.logical = 0   [new physical time]
       Result: hlc_A = (1005, 0)

  CAUSALITY PRESERVED: A's later event (1005,0) > B's event (1000,2) ✓

UNCERTAINTY WINDOWS:

  CockroachDB's max_offset setting (default: 500ms) defines the
  maximum expected clock skew between nodes. When reading,
  CockroachDB considers the UNCERTAINTY WINDOW just above its
  read timestamp:

    (read_timestamp, read_timestamp + max_offset]

  If any version exists in this window, CockroachDB must either:
  1. Refresh its reads at a higher timestamp that covers the value, OR
  2. Restart the transaction at a higher timestamp

  Example:
    Transaction starts @ ts=1000
    max_offset = 500ms
    Uncertainty window: (1000, 1500]

    Reading key K:
    └── Version @ ts=1200 exists
        └── 1200 is in (1000, 1500] (uncertain!)
            └── Could this version have been written before our txn
                started, by a node whose clock was ahead of ours?
                └── Restart/refresh at a higher ts to resolve uncertainty

  This ensures external consistency even without perfect clock sync.
```

The Uncertainty Window:
Because node clocks aren't perfectly synchronized, a transaction's timestamp might not reflect the true physical order of events. CockroachDB handles this with uncertainty windows.
When a transaction reads data, it must also consider values with timestamps slightly above its own:
(transaction_timestamp, transaction_timestamp + max_offset]
A value in this window may have been written by a node whose clock runs ahead, meaning it could causally precede the transaction despite carrying a higher timestamp. CockroachDB must either:
- refresh the transaction's reads at a higher timestamp that includes the uncertain value, or
- restart the transaction with a retryable "read within uncertainty interval" error, as the small sketch below illustrates.
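A minimal way to picture the visibility decision, with read_ts and max_offset as plain integers (illustrative only; the real logic also tracks observed timestamps per node to shrink the window):

```python
def classify_version(version_ts: int, read_ts: int, max_offset: int) -> str:
    """Classify an MVCC version relative to a reader's uncertainty window.

    Versions above read_ts but within max_offset may have been written
    causally before us by a node whose clock runs ahead of ours.
    """
    if version_ts <= read_ts:
        return "visible"      # definitely in our past
    if version_ts <= read_ts + max_offset:
        return "uncertain"    # forces an uncertainty restart/refresh
    return "invisible"        # definitely in our future

# Transaction at read_ts=1000 with max_offset=500:
print(classify_version(800, 1000, 500))   # visible
print(classify_version(1200, 1000, 500))  # uncertain
print(classify_version(1600, 1000, 500))  # invisible
```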
Practical Implications:
If a node's clock drifts more than max_offset from the rest of the cluster, the node shuts itself down rather than risk serving inconsistent data. Monitor clock synchronization in production. Cloud VMs with PTP (Precision Time Protocol) or Amazon Time Sync Service can achieve <1ms accuracy, minimizing uncertainty overhead.
Understanding CockroachDB's transaction model helps you write applications that perform well and handle errors correctly.
Transaction Retries:
CockroachDB automatically retries some transactions internally, but applications must be prepared to handle retry errors, surfaced as SQLSTATE 40001 (serialization_failure).
Application-Level Retry Loop:
For operations that might conflict, wrap in a retry loop:
```python
import psycopg2
from psycopg2 import errors
import time
import random

def execute_with_retry(conn, operation, max_retries=5):
    """
    Execute a transaction with automatic retry on serialization failures.

    CockroachDB may return 40001 (serialization_failure) when transactions
    conflict. These errors are safe to retry with a fresh transaction.
    """
    retries = 0
    while retries < max_retries:
        try:
            with conn.cursor() as cur:
                cur.execute("BEGIN")        # start transaction
                result = operation(cur)     # run the application logic
                cur.execute("COMMIT")
                return result
        except errors.SerializationFailure:
            # 40001: safe to retry
            conn.rollback()
            retries += 1
            # Exponential backoff with jitter
            wait_time = min(1.0, (2 ** retries) * 0.1) + random.uniform(0, 0.1)
            print(f"Serialization conflict, retry {retries}/{max_retries} "
                  f"after {wait_time:.2f}s")
            time.sleep(wait_time)
        except psycopg2.Error:
            conn.rollback()
            raise  # other errors are not safe to retry blindly
    raise Exception(f"Transaction failed after {max_retries} retries")

# Example: transfer money between accounts
def transfer_money(from_account, to_account, amount):
    conn = psycopg2.connect("postgresql://root@localhost:26257/bank")

    def do_transfer(cur):
        # Read current balances (select id too, so we can key the dict)
        cur.execute(
            "SELECT id, balance FROM accounts WHERE id IN (%s, %s) FOR UPDATE",
            (from_account, to_account),
        )
        balances = {row[0]: row[1] for row in cur.fetchall()}
        if balances[from_account] < amount:
            raise ValueError("Insufficient funds")
        # Update balances
        cur.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_account),
        )
        cur.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_account),
        )
        return {"status": "success", "amount": amount}

    return execute_with_retry(conn, do_transfer)

# Example: inventory reservation with contention handling
def reserve_inventory(product_id, quantity):
    conn = psycopg2.connect("postgresql://root@localhost:26257/inventory")

    def do_reservation(cur):
        # SELECT FOR UPDATE acquires the lock early, reducing retry probability
        cur.execute(
            """
            SELECT quantity_available FROM inventory
            WHERE product_id = %s
            FOR UPDATE
            """,
            (product_id,),
        )
        row = cur.fetchone()
        if row is None:
            raise ValueError(f"Product {product_id} not found")
        available = row[0]
        if available < quantity:
            raise ValueError(f"Insufficient inventory: {available} < {quantity}")
        cur.execute(
            """
            UPDATE inventory
            SET quantity_available = quantity_available - %s,
                quantity_reserved = quantity_reserved + %s
            WHERE product_id = %s
            """,
            (quantity, quantity, product_id),
        )
        return {"reserved": quantity, "remaining": available - quantity}

    return execute_with_retry(conn, do_reservation)
```

Best Practices for Transactions:
1. Keep Transactions Short
Long-running transactions:
- hold write intents longer, forcing other transactions to wait,
- accumulate larger read sets, raising the odds that a conflicting write forces a retry, and
- become more expensive to retry, since all of their work must be redone.
Best practice: Transactions should complete in seconds, not minutes.
2. Use SELECT FOR UPDATE for Hot Keys
For frequently-updated data (counters, inventory, balances):
-- Acquire lock immediately, don't wait until UPDATE
SELECT balance FROM accounts WHERE id = 'hot_key' FOR UPDATE;
-- Now other transactions wait at SELECT, not at COMMIT
UPDATE accounts SET balance = balance + 100 WHERE id = 'hot_key';
This trades optimistic concurrency for pessimistic locking on known hot spots.
3. Batch Operations
Instead of separate transactions for each row:
-- BAD: 1000 separate transactions
FOR each item:
BEGIN; INSERT INTO orders VALUES (...); COMMIT;
-- GOOD: One transaction with batch
BEGIN;
INSERT INTO orders VALUES (...), (...), ..., (...);
COMMIT;
Batching reduces coordination overhead dramatically.
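In application code, the same idea looks like this with psycopg2's execute_values helper (table and column names are illustrative):

```python
from psycopg2.extras import execute_values

def insert_orders(conn, orders):
    """Insert many rows in one transaction instead of one transaction
    per row. `orders` is a list of (id, total) tuples."""
    with conn:                     # one BEGIN/COMMIT for the whole batch
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO orders (id, total) VALUES %s",
                orders,
            )
```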
Because CockroachDB provides serializable isolation, you don't need to reason about subtle concurrency anomalies. Write your transaction logic as if it runs in isolation—CockroachDB ensures this is true (or safely retries until it is). This dramatically simplifies application development compared to databases with weaker guarantees.
CockroachDB provides Serializable isolation by default, but offers configuration options for specific scenarios.
Default: SERIALIZABLE
All transactions run with serializable isolation unless explicitly changed:
SHOW default_transaction_isolation;
-- Result: serializable
Read Committed Isolation (Experimental)
CockroachDB has experimental support for Read Committed isolation:
SET CLUSTER SETTING sql.txn.read_committed_isolation.enabled = true;
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Statements here see Read Committed isolation
COMMIT;
Why use Read Committed?
- Fewer retries under high contention, since statements operate on the latest committed data instead of failing serializable validation.
- Easier migration for applications written against PostgreSQL's Read Committed default.
Why avoid it?
- It reintroduces the anomalies serializable prevents, including write skew and non-repeatable reads, pushing that correctness burden back onto the application.
- As an experimental feature, its behavior and guarantees may change between releases.
Follower Reads: Trading Consistency for Latency
Follower reads allow reading from any replica (not just the leaseholder), accepting bounded staleness:
-- Read data as of ~4.8 seconds ago (guaranteed consistent at that point)
SELECT * FROM products
AS OF SYSTEM TIME follower_read_timestamp();
-- Read data from a specific time
SELECT * FROM products
AS OF SYSTEM TIME '-30s';
Follower reads provide:
- Lower latency, since reads are served by the nearest replica rather than a possibly remote leaseholder.
- Reduced load on leaseholders in read-heavy workloads.
- Zero contention with in-flight writes, at the price of bounded staleness.
| Option | Consistency | Latency | Contention | Use Case |
|---|---|---|---|---|
| Serializable (default) | Strongest | Higher | May retry | All transactional workloads |
| Read Committed | Weaker | Lower | Fewer retries | Compatibility, analytics |
| Follower Reads | Bounded stale | Lowest | None | Read-heavy, latency-sensitive |
| AS OF SYSTEM TIME | Point-in-time | Low | None | Reports, time-travel queries |
Savepoints and Nested Transactions:
CockroachDB supports savepoints for partial rollback within transactions:
BEGIN;
INSERT INTO orders (id, total) VALUES (1, 100);
SAVEPOINT before_items;
INSERT INTO order_items (order_id, product_id) VALUES (1, 'PROD_A');
INSERT INTO order_items (order_id, product_id) VALUES (1, 'PROD_B_INVALID');
-- Oops, PROD_B_INVALID doesn't exist
ROLLBACK TO SAVEPOINT before_items;
-- Order insert preserved, items rolled back
INSERT INTO order_items (order_id, product_id) VALUES (1, 'PROD_C');
COMMIT;
Savepoints are useful for:
- Recovering from expected, per-statement errors without abandoning the whole transaction.
- Emulating nested transactions in ORMs and application frameworks.
- Retrying just a failed portion of a larger transaction.
Unless you have a specific, well-understood reason, use the default serializable isolation. The correctness guarantees vastly outweigh the minor performance cost. If serializable creates unacceptable contention, the right solution is usually optimizing the schema or access patterns, not weakening isolation.
We've explored CockroachDB's transaction model in depth. Let's consolidate the key concepts:
- Serializable isolation by default eliminates dirty reads, non-repeatable reads, phantoms, and write skew.
- MVCC keeps timestamped versions of every row, so readers and writers don't block each other, and time-travel queries come for free.
- SSI runs transactions optimistically and validates at commit time; conflicting transactions are pushed, made to wait, or aborted with a retryable 40001 error.
- Hybrid logical clocks order events across nodes despite clock skew, with uncertainty windows guarding against values written by fast clocks.
- Applications should keep transactions short, retry on SQLSTATE 40001, use SELECT FOR UPDATE on hot keys, and batch writes.
What's Next:
Strong consistency is only part of the picture. In the next page, we'll explore Automatic Load Balancing—how CockroachDB distributes data across nodes, manages leaseholders, and rebalances the cluster as data grows and access patterns change. This is the operational magic that makes CockroachDB self-managing.
You now understand CockroachDB's transaction model—MVCC, SSI, timestamps, and conflict handling. You can write applications that leverage serializable isolation, implement proper retry logic, and avoid common pitfalls. Next, we'll see how CockroachDB automatically distributes and balances data across the cluster.