In a world of eventually consistent NoSQL databases and "read committed" defaults, CockroachDB makes an audacious promise: Serializable Isolation by default—the strongest isolation level defined by the SQL standard.
Serializable isolation means that concurrent transactions behave as if they executed one after another in some serial order. No phantom reads. No write skew. No anomalies that leave your data in an inconsistent state. The database handles the complexity of concurrency control so your application doesn't have to.
This guarantee comes at a cost: serializable isolation is traditionally considered expensive and slow. CockroachDB achieves it efficiently through a combination of multiversion concurrency control (MVCC), optimistic conflict detection via Serializable Snapshot Isolation (SSI), and hybrid logical clock (HLC) timestamps, each explored below.
Understanding this model is essential for writing correct applications and debugging issues when they arise. Most distributed databases offer weaker guarantees; knowing what serializable buys you helps you appreciate when it matters.
By the end of this page, you will understand transaction isolation levels and why serializability matters, how MVCC stores and manages multiple versions of data, how CockroachDB's SSI implementation detects conflicts, and practical implications for application developers.
Before diving into CockroachDB's implementation, let's understand what isolation levels are and why the choices matter.
The Problem: Concurrent Access
When multiple transactions access the same data simultaneously, anomalies can occur:
1. Dirty Read: Transaction A reads data written by Transaction B before B commits. If B rolls back, A has read data that never existed.
2. Non-Repeatable Read: Transaction A reads a row, then Transaction B modifies it and commits. A reads again and sees different data.
3. Phantom Read: Transaction A queries rows matching a condition. Transaction B inserts a new row matching that condition and commits. A queries again and sees a new row that wasn't there before.
4. Write Skew: Two transactions read overlapping data, make decisions based on what they read, and write to disjoint parts of the data. The final state violates an invariant that both transactions believed they were maintaining.
SQL Standard Isolation Levels:
The SQL standard defines four isolation levels, each preventing different anomalies:
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible | Possible |
| Serializable | Prevented | Prevented | Prevented | Prevented |
Most databases default to Read Committed or weaker, leaving applications to handle potential anomalies. PostgreSQL defaults to Read Committed. MySQL (InnoDB) defaults to Repeatable Read. Many NoSQL databases don't provide isolation at all.
CockroachDB provides Serializable by default—preventing all anomalies. This is a significant advantage for application developers who don't need to reason about concurrency edge cases.
The Write Skew Problem in Detail:
Write skew is subtle and often causes real-world bugs. Consider a hospital on-call scheduling system:
Rule: At least one doctor must be on-call at all times.
Current state: Dr. Alice is on-call, Dr. Bob is on-call.
Transaction 1 (Alice wants off-call):
Read: Alice on-call=true, Bob on-call=true
Check: Someone else is on-call? Yes (Bob)
Write: Alice on-call=false
Commit
Transaction 2 (Bob wants off-call): [concurrent]
Read: Alice on-call=true, Bob on-call=true
Check: Someone else is on-call? Yes (Alice)
Write: Bob on-call=false
Commit
Result: No one is on-call. Rule violated!
Both transactions read consistent data and made correct individual decisions. But the interleaved execution violated the invariant. Serializable isolation prevents this—one transaction would be forced to retry after the other committed.
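To make this concrete, here is a sketch of the scenario as two interleaved SQL sessions against a hypothetical doctors(id, on_call) table. The exact error text varies, but the losing session receives a retryable SQLSTATE 40001 error:

```sql
-- Hypothetical schema: CREATE TABLE doctors (id STRING PRIMARY KEY, on_call BOOL);

-- Session 1 (Alice)
BEGIN;
SELECT count(*) FROM doctors WHERE on_call AND id != 'alice';  -- → 1 (Bob)

-- Session 2 (Bob), concurrently
BEGIN;
SELECT count(*) FROM doctors WHERE on_call AND id != 'bob';    -- → 1 (Alice)

-- Session 1
UPDATE doctors SET on_call = false WHERE id = 'alice';
COMMIT;                                                        -- succeeds

-- Session 2
UPDATE doctors SET on_call = false WHERE id = 'bob';
COMMIT;  -- fails with a retryable error (SQLSTATE 40001); on retry, the
         -- SELECT sees 0 other doctors on-call and the app keeps Bob on
```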
Many applications running on databases with weaker isolation have latent bugs that only manifest under high concurrency. These bugs often go undetected in development and staging, appearing only in production when load increases. Serializable isolation eliminates an entire class of subtle, hard-to-reproduce bugs.
MVCC is the foundation of CockroachDB's concurrency control. Instead of locking rows during reads (which would block writers), MVCC keeps multiple versions of each row, allowing readers and writers to operate without blocking each other.
How MVCC Works:
Every write creates a new version of the data with a timestamp:
Row: accounts.balance WHERE id = 'alice'
Versions stored:
@ timestamp 100: balance = 500
@ timestamp 150: balance = 450 (after $50 withdrawal)
@ timestamp 200: balance = 550 (after $100 deposit)
When a transaction reads, it reads at its read timestamp—seeing only versions with timestamps less than or equal to that value.
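As a toy illustration of this versioned lookup (not CockroachDB's storage format), a sorted list of (timestamp, value) pairs per key is enough:

```python
import bisect

class MVCCStore:
    """Toy MVCC store: each key maps to a list of (timestamp, value)
    versions kept sorted by timestamp."""

    def __init__(self):
        self.versions = {}  # key -> [(ts, value), ...] sorted ascending

    def write(self, key, ts, value):
        # Append a new version; never overwrite old ones.
        bisect.insort(self.versions.setdefault(key, []), (ts, value))

    def read(self, key, read_ts):
        # Return the newest version with timestamp <= read_ts.
        vs = self.versions.get(key, [])
        i = bisect.bisect_right(vs, (read_ts, float("inf")))
        return vs[i - 1][1] if i > 0 else None

store = MVCCStore()
store.write("accounts/alice/balance", 100, 500)
store.write("accounts/alice/balance", 150, 450)
store.write("accounts/alice/balance", 200, 550)

print(store.read("accounts/alice/balance", 175))  # 450
print(store.read("accounts/alice/balance", 225))  # 550
```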
Benefits of MVCC:
- Readers never block writers, and writers never block readers.
- Every transaction reads from a stable, consistent snapshot.
- Old versions remain available for time-travel queries and consistent historical reads.
```
MVCC DATA MODEL IN COCKROACHDB
═══════════════════════════════════════════════════════════════

LOGICAL VIEW (what applications see):

  accounts table
  ┌────────┬─────────┬─────────────────────┐
  │ id     │ balance │ last_updated        │
  ├────────┼─────────┼─────────────────────┤
  │ alice  │ 550     │ 2024-01-15 10:30:00 │
  │ bob    │ 300     │ 2024-01-15 09:15:00 │
  │ carol  │ 1200    │ 2024-01-14 16:45:00 │
  └────────┴─────────┴─────────────────────┘

PHYSICAL STORAGE (what MVCC actually stores):

  Key format: /Table/accounts/Primary/id/Column/timestamp

  MVCC key-value pairs (sorted by key):
    /accounts/alice/balance @ ts=200 → 550
    /accounts/alice/balance @ ts=150 → 450
    /accounts/alice/balance @ ts=100 → 500
    /accounts/alice/balance @ ts=50  → 500 (initial)

    /accounts/bob/balance   @ ts=175 → 300
    /accounts/bob/balance   @ ts=80  → 350

    /accounts/carol/balance @ ts=160 → 1200
    /accounts/carol/balance @ ts=90  → 1000

READING AT DIFFERENT TIMESTAMPS:

  Query: SELECT balance FROM accounts WHERE id = 'alice'

  At timestamp 175:
    └── Scan for /accounts/alice/balance @ ts ≤ 175
        └── Found: /accounts/alice/balance @ ts=150 → 450
            └── Result: balance = 450

  At timestamp 225:
    └── Scan for /accounts/alice/balance @ ts ≤ 225
        └── Found: /accounts/alice/balance @ ts=200 → 550
            └── Result: balance = 550

TIME TRAVEL QUERY:

  SELECT balance FROM accounts WHERE id = 'alice'
  AS OF SYSTEM TIME '2024-01-15 10:00:00';
    └── Converts timestamp to HLC
        └── Reads version at or before that HLC
            └── Returns historical value

GARBAGE COLLECTION:

  Old versions are garbage collected after the GC TTL
  (default: 25 hours).

  Before GC (all versions):
    /accounts/alice/balance @ ts=200 → 550
    /accounts/alice/balance @ ts=150 → 450
    /accounts/alice/balance @ ts=100 → 500
    /accounts/alice/balance @ ts=50  → 500

  After GC (versions older than gc.ttlseconds removed):
    /accounts/alice/balance @ ts=200 → 550
    /accounts/alice/balance @ ts=150 → 450
    (ts=100 and ts=50 garbage collected)
```

MVCC and Storage:
MVCC versions are stored in CockroachDB's underlying storage engine (Pebble, an LSM-tree). The key includes the timestamp, meaning versions are naturally sorted by time. Reading the latest version is a simple seek to the key with the highest timestamp ≤ the read timestamp.
MVCC Overhead:
MVCC isn't free:
- Every update adds a new version, so storage grows with write volume until old versions are garbage collected.
- Scans must skip over old versions and deletion tombstones, which can slow reads on heavily-updated keys.
- Garbage collection itself consumes background CPU and I/O.
CockroachDB mitigates these through:
- A configurable GC TTL (gc.ttlseconds, default 25 hours) that bounds how long old versions are retained.
- LSM-tree compactions in Pebble that physically drop garbage-collected versions.
- Storing a key's versions adjacent in the keyspace, so reading the latest version remains a cheap seek.
MVCC enables powerful time-travel queries. Use AS OF SYSTEM TIME to query historical data: SELECT * FROM orders AS OF SYSTEM TIME '-1h'. This is useful for: (1) debugging by examining past state, (2) generating consistent reports, (3) implementing optimistic locking patterns, and (4) follower reads for lower latency.
MVCC alone provides Snapshot Isolation (SI)—each transaction sees a consistent snapshot. But SI doesn't prevent all anomalies (notably write skew). CockroachDB extends SI with conflict detection to achieve Serializable Snapshot Isolation (SSI).
The SSI Approach:
SSI allows transactions to run optimistically (without blocking), then detects conflicts at commit time. If a conflict would cause a serializability violation, one transaction is aborted and retried.
Conflict Types SSI Detects:
1. Read-Write Conflicts (rw-antidependency)
Transaction B writes data that Transaction A read (and A hasn't committed yet):
A starts @ ts=100: reads accounts.balance
B starts @ ts=150: writes accounts.balance, commits
A tries to commit @ ts=200: ???
If A commits, A would have seen a state that existed before B's write, but A's commit appears after B's. This could cause anomalies. CockroachDB either:
- pushes A's timestamp forward and refreshes A's reads to prove they are still valid at the new timestamp, or
- aborts A with a retryable error (SQLSTATE 40001) so it can restart with fresh data.
2. Write-Read Conflicts (wr-dependency)
Transaction A writes data, Transaction B reads A's write:
A starts @ ts=100: writes accounts.balance = 500
B starts @ ts=150: reads accounts.balance (sees A's write intent)
B must wait for A's outcome:
- If A commits, B reads the committed value (possibly at a pushed timestamp).
- If A aborts, B ignores the intent and reads the latest committed version beneath it.
3. Write-Write Conflicts (ww-dependency)
Both transactions write the same key:
A starts @ ts=100: writes accounts.balance = 500
B starts @ ts=150: writes accounts.balance = 600
Only one can win. CockroachDB uses timestamp ordering—the transaction with the lower commit timestamp wins (if it commits). The other must retry.
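Putting the three conflict types together, commit-time validation can be pictured as re-checking a transaction's read set, as in this simplified Python sketch. This is illustrative only: CockroachDB actually tracks reads in a timestamp cache and resolves conflicts via pushes and read refreshes rather than explicit read-set scans, and the names here are ours:

```python
from dataclasses import dataclass, field

class SerializationFailure(Exception):
    """Analogous to SQLSTATE 40001: safe to retry with fresh reads."""

@dataclass
class Txn:
    read_set: dict = field(default_factory=dict)   # key -> timestamp read at
    write_set: dict = field(default_factory=dict)  # key -> new value
    commit_ts: int = 0

def try_commit(txn: Txn, committed: dict) -> None:
    """committed maps key -> (latest_commit_ts, value).

    A transaction may commit only if nothing it read has since been
    overwritten at a newer timestamp (the rw-antidependency check)."""
    for key, read_ts in txn.read_set.items():
        latest = committed.get(key)
        if latest is not None and latest[0] > read_ts:
            raise SerializationFailure(f"read of {key!r} is stale")
    for key, value in txn.write_set.items():
        committed[key] = (txn.commit_ts, value)

# Doctor example: each transaction read the *other* doctor's row.
committed = {"alice.on_call": (50, True), "bob.on_call": (50, True)}
a = Txn({"bob.on_call": 100}, {"alice.on_call": False}, 110)
b = Txn({"alice.on_call": 105}, {"bob.on_call": False}, 120)

try_commit(a, committed)  # A commits: bob.on_call unchanged since ts=100
try_commit(b, committed)  # raises SerializationFailure: alice.on_call was
                          # overwritten at ts=110 > B's read at ts=105
```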
```
SERIALIZABLE SNAPSHOT ISOLATION CONFLICT DETECTION
═══════════════════════════════════════════════════════════════

SCENARIO: Write Skew Prevention (Doctor On-Call Example)

  Initial state: doctors(alice, on_call=true)
                 doctors(bob,   on_call=true)
  Invariant: COUNT(*) FROM doctors WHERE on_call = true >= 1

  Transaction A (Alice wants off):     Transaction B (Bob wants off):
  ──────────────────────────────────   ──────────────────────────────────
  BEGIN @ ts=100                       BEGIN @ ts=105

  SELECT COUNT(*) FROM doctors         SELECT COUNT(*) FROM doctors
  WHERE on_call AND id != 'alice'      WHERE on_call AND id != 'bob'
  → Result: 1 (Bob is on-call)         → Result: 1 (Alice is on-call)

  -- A thinks OK to go off-call        -- B thinks OK to go off-call

  UPDATE doctors SET on_call=false     UPDATE doctors SET on_call=false
  WHERE id = 'alice'                   WHERE id = 'bob'
  → Write intent:                      → Write intent:
    alice.on_call=false                  bob.on_call=false

  COMMIT                               COMMIT

SSI CONFLICT DETECTION DURING COMMIT:

  When A commits (assumed first):
  ├── A's read set: {bob.on_call} @ ts=100
  ├── A's write set: {alice.on_call}
  ├── No conflicts with A's reads (bob not modified)
  └── A commits successfully

  When B tries to commit:
  ├── B's read set: {alice.on_call} @ ts=105
  ├── B's write set: {bob.on_call}
  ├── Check: Has alice.on_call changed since B read it?
  │   └── YES! A committed alice.on_call=false
  │       (B's snapshot should have reflected it, but B read
  │        before A committed, so B's read is stale)
  ├── Conflict detected: B read stale data
  └── B's transaction is ABORTED

RETRY FLOW:

  B retries automatically:

  Transaction B (retry):
  ──────────────────────────────────
  BEGIN @ ts=200

  SELECT COUNT(*) FROM doctors
  WHERE on_call AND id != 'bob'
  → Result: 0 (Alice is OFF-call now)

  -- Application logic: cannot go off-call, someone must stay.
  -- Application either:
  --   1. Aborts the request: "Cannot go off-call, no one else available"
  --   2. Or finds alternative logic

  COMMIT (no changes)

  INVARIANT PRESERVED! ✓

CONFLICT RESOLUTION STRATEGIES:

  CockroachDB uses multiple strategies:

  1. TIMESTAMP PUSH: If B's reads are still valid at a higher
     timestamp, push B's timestamp forward and continue.

  2. TRANSACTION WAIT: If B encounters an uncommitted write intent
     from A, B waits for A to commit/abort before proceeding.

  3. ABORT AND RETRY: If the conflict is unresolvable (B's reads are
     genuinely stale), abort B and let it retry with fresh data.
```

Write Intents: The Locking Mechanism:
When a transaction writes data, it doesn't immediately commit the value. Instead, it writes a write intent—a provisional value that:
- carries the new value together with a pointer to the writer's transaction record, and
- is not visible as committed data until that transaction commits.
Other transactions encountering a write intent must:
- look up the writer's transaction record to learn its status,
- treat the intent as a committed value if the writer committed,
- ignore it (reading the version beneath) if the writer aborted, or
- wait for, or attempt to push, the writer if it is still pending.
The Transaction Record:
Each transaction has a transaction record stored in the same keyspace. This record tracks:
- the transaction's current status (PENDING, COMMITTED, or ABORTED), and
- its current (provisional) commit timestamp.
When a transaction commits, its record is updated atomically. Other transactions checking intent status read this record to determine the outcome.
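As a rough model of that decision, not CockroachDB's actual code, here is what a reader does when it hits an intent (all type and field names here are ours):

```python
from dataclasses import dataclass
from enum import Enum

class TxnStatus(Enum):
    PENDING = 1
    COMMITTED = 2
    ABORTED = 3

class WouldBlock(Exception):
    """Reader must wait for (or try to push) the writer."""

@dataclass
class WriteIntent:
    value: object                     # the provisional value
    txn_id: str                       # points at the writer's txn record
    previous_committed_value: object  # latest committed MVCC version below

@dataclass
class TxnRecord:
    txn_id: str
    status: TxnStatus

def read_through_intent(intent: WriteIntent, record: TxnRecord):
    """Decide what a reader sees when it encounters a write intent."""
    if record.status is TxnStatus.COMMITTED:
        # Writer committed: the intent is effectively a real value.
        return intent.value
    if record.status is TxnStatus.ABORTED:
        # Writer aborted: the intent is garbage; use the version below it.
        return intent.previous_committed_value
    # Writer still pending: wait for its outcome (or push its timestamp).
    raise WouldBlock(record.txn_id)
```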
Traditional databases often use Two-Phase Locking (2PL) for serializability—acquiring locks before accessing data, releasing after commit. 2PL blocks readers and writers. SSI is optimistic—transactions run without blocking, conflicts detected at commit. SSI has better throughput for read-heavy workloads but may have more aborts under high contention.
Timestamps are central to CockroachDB's consistency model. Every transaction has a timestamp that determines:
- which MVCC versions its reads can see, and
- the version timestamp attached to its writes, and therefore the transaction's position in the apparent serial order.
Hybrid Logical Clocks (HLC):
CockroachDB uses Hybrid Logical Clocks, which combine:
- physical time from the node's system clock, and
- a logical counter, in the style of Lamport clocks, to order events that share the same physical time.
HLC timestamps have the form (physical_time, logical_counter). They're compared lexicographically—physical time first, then logical counter.
Why Not Just System Clocks?
Distributed systems can't rely solely on system clocks because:
- clocks on different machines drift apart despite NTP,
- NTP corrections can step a clock backwards, and
- identical wall-clock readings on two nodes imply nothing about the causal order of their events.
HLC handles these by:
- Never moving backwards: the physical component only ratchets forward, even if the system clock steps back.
- Absorbing remote timestamps: on receiving a message, a node advances its HLC to at least the sender's timestamp.
- Using the logical counter to order events that share the same physical component (see the sketch below).
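These update rules translate almost directly into code. Here is a minimal, illustrative Python version (CockroachDB's real implementation is in Go and adds monotonicity and offset enforcement; the class name and structure here are ours). The same rules appear diagrammatically just below:

```python
import time

class HLC:
    """Minimal Hybrid Logical Clock."""

    def __init__(self, now=time.time_ns):
        self.now = now        # wall-clock source, injectable for testing
        self.physical = 0
        self.logical = 0

    def tick(self):
        """Rule 1: local or send event."""
        wall = self.now()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1         # physical unchanged: bump logical
        return (self.physical, self.logical)

    def update(self, msg_physical, msg_logical):
        """Rule 2: receive event carrying a remote HLC timestamp."""
        wall = self.now()
        old = self.physical
        self.physical = max(old, msg_physical, wall)
        if self.physical == old == msg_physical:
            self.logical = max(self.logical, msg_logical) + 1
        elif self.physical == old:
            self.logical += 1
        elif self.physical == msg_physical:
            self.logical = msg_logical + 1
        else:
            self.logical = 0          # the wall clock alone moved us forward
        return (self.physical, self.logical)
```

Timestamps compare lexicographically as (physical, logical) pairs, which Python's tuple comparison provides directly: (1005, 0) > (1000, 2).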
```
HLC TIMESTAMP MANAGEMENT
═══════════════════════════════════════════════════════════════

HLC TIMESTAMP STRUCTURE:

  ┌─────────────────────────────────┬─────────────────┐
  │ Physical Time (nanoseconds)     │ Logical Counter │
  │ 64 bits                         │ 32 bits         │
  └─────────────────────────────────┴─────────────────┘

  Example: 1705312800000000000.00000005
           └────────────┬──────────┘ └───┬───┘
            Physical (ns since epoch)  Logical

HLC UPDATE RULES:

  Rule 1: Local event (send message, local operation)
    hlc.physical = max(hlc.physical, system_clock)
    if hlc.physical unchanged: hlc.logical++
    else:                      hlc.logical = 0

  Rule 2: Receive event (receive message with remote HLC)
    old_physical = hlc.physical
    hlc.physical = max(hlc.physical, msg.hlc.physical, system_clock)
    if hlc.physical == old_physical == msg.hlc.physical:
        hlc.logical = max(hlc.logical, msg.hlc.logical) + 1
    elif hlc.physical == old_physical:
        hlc.logical++
    elif hlc.physical == msg.hlc.physical:
        hlc.logical = msg.hlc.logical + 1
    else:
        hlc.logical = 0

CLOCK SKEW SCENARIO:

  Node A (clock accurate):        Node B (clock 100ms behind):
  System time: 1000ms             System time: 900ms

  Event sequence:
  1. Node A: local event
       hlc_A = (1000, 0)
  2. Node A sends message to B with hlc=(1000, 0)
  3. Node B receives message:
       system_clock = 900ms (behind!)
       msg.hlc = (1000, 0)
       hlc_B.physical = max(900, 1000, 900) = 1000
       hlc_B.logical = 0 + 1 = 1
       Result: hlc_B = (1000, 1)
  4. Node B: local event
       hlc_B = (1000, 2)   [physical unchanged, logical++]
  5. Node B sends message to A with hlc=(1000, 2)
  6. Node A receives (now system_clock=1005):
       hlc_A.physical = max(1000, 1000, 1005) = 1005
       hlc_A.logical = 0   [new physical time]
       Result: hlc_A = (1005, 0)

  CAUSALITY PRESERVED: A's later event (1005,0) > B's event (1000,2) ✓

UNCERTAINTY WINDOWS:

  CockroachDB's max_offset setting (default: 500ms) defines the
  maximum expected clock skew between nodes. When reading,
  CockroachDB considers the UNCERTAINTY WINDOW just above its
  read timestamp:

    (read_timestamp, read_timestamp + max_offset]

  If any version exists in this window, CockroachDB must either:
  1. Refresh its reads at a higher timestamp that covers the value, OR
  2. Restart the transaction at a higher timestamp

  Example:
    Transaction starts @ ts=1000
    max_offset = 500ms
    Uncertainty window: (1000, 1500]

    Reading key K:
    └── Version @ ts=1200 exists
        └── 1200 is in (1000, 1500] (uncertain!)
            └── Could this version have been written before our txn
                started, by a node whose clock was ahead of ours?
                └── Restart/refresh at a higher ts to resolve uncertainty

  This ensures external consistency even without perfect clock sync.
```

The Uncertainty Window:
Because node clocks aren't perfectly synchronized, a transaction's timestamp might not reflect the true physical order of events. CockroachDB handles this with uncertainty windows.
When a transaction reads data, it must also consider values with timestamps slightly above its own:
(transaction_timestamp, transaction_timestamp + max_offset]
A value in this window may have been written by a node whose clock runs ahead, meaning it could causally precede the transaction despite carrying a higher timestamp. CockroachDB must either:
- refresh the transaction's reads at a higher timestamp that includes the uncertain value, or
- restart the transaction with a retryable "read within uncertainty interval" error, as the small sketch below illustrates.
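A minimal way to picture the visibility decision, with read_ts and max_offset as plain integers (illustrative only; the real logic also tracks observed timestamps per node to shrink the window):

```python
def classify_version(version_ts: int, read_ts: int, max_offset: int) -> str:
    """Classify an MVCC version relative to a reader's uncertainty window.

    Versions above read_ts but within max_offset may have been written
    causally before us by a node whose clock runs ahead of ours.
    """
    if version_ts <= read_ts:
        return "visible"      # definitely in our past
    if version_ts <= read_ts + max_offset:
        return "uncertain"    # forces an uncertainty restart/refresh
    return "invisible"        # definitely in our future

# Transaction at read_ts=1000 with max_offset=500:
print(classify_version(800, 1000, 500))   # visible
print(classify_version(1200, 1000, 500))  # uncertain
print(classify_version(1600, 1000, 500))  # invisible
```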
Practical Implications:
If a node's clock drifts more than max_offset from the rest of the cluster, the node shuts itself down rather than risk serving inconsistent data. Monitor clock synchronization in production. Cloud VMs with PTP (Precision Time Protocol) or Amazon Time Sync Service can achieve <1ms accuracy, minimizing uncertainty overhead.
Understanding CockroachDB's transaction model helps you write applications that perform well and handle errors correctly.
Transaction Retries:
CockroachDB automatically retries some transactions internally, but applications must be prepared to handle retry errors, surfaced as SQLSTATE 40001 (serialization_failure).
Application-Level Retry Loop:
For operations that might conflict, wrap in a retry loop:
```python
import psycopg2
from psycopg2 import errors
import time
import random

def execute_with_retry(conn, operation, max_retries=5):
    """
    Execute a transaction with automatic retry on serialization failures.

    CockroachDB may return 40001 (serialization_failure) when transactions
    conflict. These errors are safe to retry with a fresh transaction.
    """
    retries = 0
    while retries < max_retries:
        try:
            with conn.cursor() as cur:
                cur.execute("BEGIN")        # start transaction
                result = operation(cur)     # run the application logic
                cur.execute("COMMIT")
                return result
        except errors.SerializationFailure:
            # 40001: safe to retry
            conn.rollback()
            retries += 1
            # Exponential backoff with jitter
            wait_time = min(1.0, (2 ** retries) * 0.1) + random.uniform(0, 0.1)
            print(f"Serialization conflict, retry {retries}/{max_retries} "
                  f"after {wait_time:.2f}s")
            time.sleep(wait_time)
        except psycopg2.Error:
            conn.rollback()
            raise  # other errors are not safe to retry blindly
    raise Exception(f"Transaction failed after {max_retries} retries")

# Example: transfer money between accounts
def transfer_money(from_account, to_account, amount):
    conn = psycopg2.connect("postgresql://root@localhost:26257/bank")

    def do_transfer(cur):
        # Read current balances (select id too, so we can key the dict)
        cur.execute(
            "SELECT id, balance FROM accounts WHERE id IN (%s, %s) FOR UPDATE",
            (from_account, to_account),
        )
        balances = {row[0]: row[1] for row in cur.fetchall()}
        if balances[from_account] < amount:
            raise ValueError("Insufficient funds")
        # Update balances
        cur.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_account),
        )
        cur.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_account),
        )
        return {"status": "success", "amount": amount}

    return execute_with_retry(conn, do_transfer)

# Example: inventory reservation with contention handling
def reserve_inventory(product_id, quantity):
    conn = psycopg2.connect("postgresql://root@localhost:26257/inventory")

    def do_reservation(cur):
        # SELECT FOR UPDATE acquires the lock early, reducing retry probability
        cur.execute(
            """
            SELECT quantity_available FROM inventory
            WHERE product_id = %s
            FOR UPDATE
            """,
            (product_id,),
        )
        row = cur.fetchone()
        if row is None:
            raise ValueError(f"Product {product_id} not found")
        available = row[0]
        if available < quantity:
            raise ValueError(f"Insufficient inventory: {available} < {quantity}")
        cur.execute(
            """
            UPDATE inventory
            SET quantity_available = quantity_available - %s,
                quantity_reserved = quantity_reserved + %s
            WHERE product_id = %s
            """,
            (quantity, quantity, product_id),
        )
        return {"reserved": quantity, "remaining": available - quantity}

    return execute_with_retry(conn, do_reservation)
```

Best Practices for Transactions:
1. Keep Transactions Short
Long-running transactions:
- hold write intents longer, forcing other transactions to wait,
- accumulate larger read sets, raising the odds that a conflicting write forces a retry, and
- become more expensive to retry, since all of their work must be redone.
Best practice: Transactions should complete in seconds, not minutes.
2. Use SELECT FOR UPDATE for Hot Keys
For frequently-updated data (counters, inventory, balances):
-- Acquire lock immediately, don't wait until UPDATE
SELECT balance FROM accounts WHERE id = 'hot_key' FOR UPDATE;
-- Now other transactions wait at SELECT, not at COMMIT
UPDATE accounts SET balance = balance + 100 WHERE id = 'hot_key';
This trades optimistic concurrency for pessimistic locking on known hot spots.
3. Batch Operations
Instead of separate transactions for each row:
-- BAD: 1000 separate transactions
FOR each item:
BEGIN; INSERT INTO orders VALUES (...); COMMIT;
-- GOOD: One transaction with batch
BEGIN;
INSERT INTO orders VALUES (...), (...), ..., (...);
COMMIT;
Batching reduces coordination overhead dramatically.
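In application code, the same idea looks like this with psycopg2's execute_values helper (table and column names are illustrative):

```python
from psycopg2.extras import execute_values

def insert_orders(conn, orders):
    """Insert many rows in one transaction instead of one transaction
    per row. `orders` is a list of (id, total) tuples."""
    with conn:                     # one BEGIN/COMMIT for the whole batch
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO orders (id, total) VALUES %s",
                orders,
            )
```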
Because CockroachDB provides serializable isolation, you don't need to reason about subtle concurrency anomalies. Write your transaction logic as if it runs in isolation—CockroachDB ensures this is true (or safely retries until it is). This dramatically simplifies application development compared to databases with weaker guarantees.
CockroachDB provides Serializable isolation by default, but offers configuration options for specific scenarios.
Default: SERIALIZABLE
All transactions run with serializable isolation unless explicitly changed:
SHOW default_transaction_isolation;
-- Result: serializable
Read Committed Isolation (Experimental)
CockroachDB has experimental support for Read Committed isolation:
SET CLUSTER SETTING sql.txn.read_committed_isolation.enabled = true;
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Statements here see Read Committed isolation
COMMIT;
Why use Read Committed?
- Fewer retries under high contention, since statements operate on the latest committed data instead of failing serializable validation.
- Easier migration for applications written against PostgreSQL's Read Committed default.
Why avoid it?
- It reintroduces the anomalies serializable prevents, including write skew and non-repeatable reads, pushing that correctness burden back onto the application.
- As an experimental feature, its behavior and guarantees may change between releases.
Follower Reads: Trading Consistency for Latency
Follower reads allow reading from any replica (not just the leaseholder), accepting bounded staleness:
-- Read data as of ~4.8 seconds ago (guaranteed consistent at that point)
SELECT * FROM products
AS OF SYSTEM TIME follower_read_timestamp();
-- Read data from a specific time
SELECT * FROM products
AS OF SYSTEM TIME '-30s';
Follower reads provide:
- Lower latency, since reads are served by the nearest replica rather than a possibly remote leaseholder.
- Reduced load on leaseholders in read-heavy workloads.
- Zero contention with in-flight writes, at the price of bounded staleness.
| Option | Consistency | Latency | Contention | Use Case |
|---|---|---|---|---|
| Serializable (default) | Strongest | Higher | May retry | All transactional workloads |
| Read Committed | Weaker | Lower | Fewer retries | Compatibility, analytics |
| Follower Reads | Bounded stale | Lowest | None | Read-heavy, latency-sensitive |
| AS OF SYSTEM TIME | Point-in-time | Low | None | Reports, time-travel queries |
Savepoints and Nested Transactions:
CockroachDB supports savepoints for partial rollback within transactions:
BEGIN;
INSERT INTO orders (id, total) VALUES (1, 100);
SAVEPOINT before_items;
INSERT INTO order_items (order_id, product_id) VALUES (1, 'PROD_A');
INSERT INTO order_items (order_id, product_id) VALUES (1, 'PROD_B_INVALID');
-- Oops, PROD_B_INVALID doesn't exist
ROLLBACK TO SAVEPOINT before_items;
-- Order insert preserved, items rolled back
INSERT INTO order_items (order_id, product_id) VALUES (1, 'PROD_C');
COMMIT;
Savepoints are useful for:
- Recovering from expected, per-statement errors without abandoning the whole transaction.
- Emulating nested transactions in ORMs and application frameworks.
- Retrying just a failed portion of a larger transaction.
Unless you have a specific, well-understood reason, use the default serializable isolation. The correctness guarantees vastly outweigh the minor performance cost. If serializable creates unacceptable contention, the right solution is usually optimizing the schema or access patterns, not weakening isolation.
We've explored CockroachDB's transaction model in depth. Let's consolidate the key concepts:
- Serializable isolation by default eliminates dirty reads, non-repeatable reads, phantoms, and write skew.
- MVCC keeps timestamped versions of every row, so readers and writers don't block each other, and time-travel queries come for free.
- SSI runs transactions optimistically and validates at commit time; conflicting transactions are pushed, made to wait, or aborted with a retryable 40001 error.
- Hybrid logical clocks order events across nodes despite clock skew, with uncertainty windows guarding against values written by fast clocks.
- Applications should keep transactions short, retry on SQLSTATE 40001, use SELECT FOR UPDATE on hot keys, and batch writes.
What's Next:
Strong consistency is only part of the picture. In the next page, we'll explore Automatic Load Balancing—how CockroachDB distributes data across nodes, manages leaseholders, and rebalances the cluster as data grows and access patterns change. This is the operational magic that makes CockroachDB self-managing.
You now understand CockroachDB's transaction model—MVCC, SSI, timestamps, and conflict handling. You can write applications that leverage serializable isolation, implement proper retry logic, and avoid common pitfalls. Next, we'll see how CockroachDB automatically distributes and balances data across the cluster.