For decades, distributed transaction support was the holy grail of database systems—and the graveyard of many attempts. The fundamental challenge: how do you ensure that a transaction affecting data in Tokyo, Frankfurt, and Virginia either commits everywhere or nowhere, without unacceptable latency or availability loss?
Traditional two-phase commit (2PC) protocols solve the atomicity problem but create new ones: they require all participants to be available, they hold locks during coordination, and coordinator failures can leave transactions in limbo. At global scale, these problems become severe.
Spanner's distributed transaction protocol synthesizes several innovations: two-phase commit layered on Paxos-replicated participants, TrueTime-based commit timestamps with commit-wait, strict two-phase locking with wound-wait deadlock prevention, and lock-free snapshot reads for read-only work.
The result is a system that provides serializable, externally consistent transactions spanning the globe—something that was considered impractical before Spanner demonstrated it at scale.
By the end of this page, you will understand Spanner's transaction protocol in detail—from lock acquisition through commit. You'll see how 2PC and Paxos work together, how deadlocks are prevented, and how the system maintains performance despite global coordination requirements.
Before diving into the distributed case, let's establish how transactions work within a single Paxos group.
Single-Group Transactions:
When a transaction only touches data managed by one Paxos group (common for well-designed schemas using interleaved tables), the protocol is straightforward: the group's leader acquires the necessary locks, buffers the transaction's writes, assigns a commit timestamp from TrueTime, logs the commit through Paxos, performs commit-wait, applies the mutations, and releases the locks.
This single-group path is highly optimized. The Paxos replication ensures durability (transaction survives server failures), and the commit-wait ensures external consistency.
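The single-group commit path can be sketched as follows. This is an illustrative simulation, not Spanner's implementation: `tt_now`, the `lock_table`, the `paxos_log`, and the `txn` object are all stand-ins.

```python
import time

# Illustrative TrueTime stand-in: an interval [earliest, latest] around the
# physical clock with a fixed uncertainty bound.
EPSILON = 0.004  # ~4ms, illustrative

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def commit_single_group(lock_table, paxos_log, txn):
    # 1. Upgrade/acquire write locks for the buffered mutations.
    for key in txn.write_set:
        lock_table.acquire_exclusive(key, txn)
    # 2. Assign the commit timestamp: no earlier than any time that could
    #    already have been observed anywhere (TT.now().latest).
    s = tt_now()[1]
    # 3. Replicate the commit record through Paxos for durability.
    paxos_log.append(("COMMIT", txn.id, s, txn.write_set))
    # 4. Commit-wait: block until s is guaranteed to be in the past
    #    everywhere, i.e. TT.now().earliest > s.
    while tt_now()[0] <= s:
        time.sleep(0.001)
    # 5. Apply mutations and release locks; the writes are now visible.
    txn.apply()
    for key in txn.write_set:
        lock_table.release(key, txn)
    return s
```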
Transaction Types:
Spanner supports several transaction types, each optimized for its use case:
Read-Write Transactions: Full ACID transactions that can read and modify data. Required for any writes.
Read-Only Transactions: Transactions that only read data. Execute at a consistent snapshot without blocking writes.
DML Statements: Individual SQL statements (INSERT, UPDATE, DELETE) executed as implicit transactions.
Partitioned DML: Bulk operations that execute in parallel across shards, with weaker atomicity guarantees but higher throughput.
| Type | Can Write? | Locks? | Serializable? | Cross-Group? | Typical Use |
|---|---|---|---|---|---|
| Read-Write Txn | Yes | Yes (read/write) | Yes | Yes (with 2PC) | Updates requiring consistency |
| Read-Only Txn | No | No | Yes (snapshot) | Yes (no 2PC needed) | Reports, analytics, bulk reads |
| Single DML | Yes | Yes | Yes | Yes (with 2PC) | Simple single-statement updates |
| Partitioned DML | Yes | Per-partition | Per-row | Yes (parallel) | Bulk updates, data cleanup |
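As a rough illustration of how these types surface in client code, here is a sketch using the Cloud Spanner Python client (`google-cloud-spanner`); the instance, database, table, and column names are placeholders.

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

# Read-write transaction: full ACID semantics; the work function may be
# retried by the client library if the transaction is aborted.
def transfer(transaction):
    transaction.execute_update(
        "UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    transaction.execute_update(
        "UPDATE accounts SET balance = balance + 100 WHERE id = 2")

database.run_in_transaction(transfer)

# Read-only transaction: lock-free snapshot read, never blocks writers.
with database.snapshot(multi_use=True) as snapshot:
    rows = list(snapshot.execute_sql("SELECT id, balance FROM accounts"))

# Partitioned DML: bulk update with weaker (per-partition) atomicity.
row_count = database.execute_partitioned_dml(
    "UPDATE accounts SET flagged = TRUE WHERE balance < 0")
```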
Concurrency Control: Two-Phase Locking:
Spanner uses strict two-phase locking (S2PL) for concurrency control: reads acquire shared locks, writes acquire exclusive locks, and no lock is released until the transaction finishes.
Locks are held until the transaction commits or aborts, which ensures serializability: no other transaction can modify data this transaction has read, and no other transaction can observe this transaction's writes before they are committed.
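A minimal sketch of the shared/exclusive compatibility rule that strict 2PL relies on; the lock-manager details here are illustrative, not Spanner's actual implementation.

```python
# Shared (read) locks are compatible with each other; exclusive (write)
# locks conflict with everything.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

def can_grant(requested_mode, held_modes):
    # A lock can be granted only if it is compatible with every lock
    # currently held on the same item.
    return all(COMPATIBLE[(requested_mode, h)] for h in held_modes)

assert can_grant("S", ["S", "S"])   # concurrent readers are fine
assert not can_grant("X", ["S"])    # a writer must wait for readers
```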
Lock Granularity:
Spanner supports multiple lock granularities:
Read-only transactions execute at a snapshot timestamp and don't acquire locks. This means they never block writes and can be served by any replica. For read-heavy workloads, this is a massive performance optimization.
When a transaction spans multiple Paxos groups—for example, transferring money between two users in different regions—Spanner uses two-phase commit (2PC) to ensure atomicity.
The Participants: every Paxos group whose data the transaction touches takes part. One of them is chosen as the coordinator; the rest act as participants.
Phase 1: Prepare
Coordinator sends PREPARE to all participants
Each participant acquires the required locks, durably logs a PREPARED record through its own Paxos group, and replies PREPARED (with the highest timestamp it has seen) or ABORT
Coordinator waits for all participants to respond
Phase 2: Commit
If all participants responded PREPARED: the coordinator assigns the commit timestamp, performs commit-wait, durably logs COMMIT through its own Paxos group (the commit point), and sends COMMIT to every participant, which applies its writes and releases its locks
If any participant responded ABORT: the coordinator logs ABORT and notifies all participants, which discard their buffered writes and release their locks
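A schematic coordinator loop for the two phases. `send_prepare`, `send_commit`, and `send_abort` are assumed RPC helpers, and each durable log append would itself be a Paxos write in Spanner; the TrueTime parts of timestamp assignment are shown separately below.

```python
def two_phase_commit(coordinator_log, participants, txn):
    """Schematic 2PC driver; every durable step would be a Paxos write."""
    # Phase 1: Prepare. Each participant acquires locks and logs PREPARED
    # through its own Paxos group before voting.
    votes = []
    for p in participants:
        # Each vote is ("PREPARED", max_timestamp_seen) or ("ABORT", None).
        votes.append(p.send_prepare(txn))

    if all(vote == "PREPARED" for vote, _ in votes):
        # Phase 2a: choose a commit timestamp >= every prepare timestamp.
        s = max(ts for _, ts in votes)
        # Phase 2b: durably log the decision (the commit point), then notify.
        coordinator_log.append(("COMMIT", txn.id, s))
        for p in participants:
            p.send_commit(txn, s)      # participants apply writes, release locks
        return ("COMMITTED", s)
    else:
        coordinator_log.append(("ABORT", txn.id))
        for p in participants:
            p.send_abort(txn)          # participants discard writes, release locks
        return ("ABORTED", None)
```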
The Critical Innovation: Paxos-Backed 2PC
Traditional 2PC has a critical vulnerability: if the coordinator fails between receiving all PREPARED votes and writing COMMIT, the transaction is stuck. Participants can't safely commit (they don't know if all participants prepared) or abort (another participant might have already committed).
Spanner solves this by making the coordinator itself a Paxos group:
Consider a concrete example: transferring $100 from Account A (US-East) to Account B (EU-West).

Participants:
- Paxos group US-EAST (coordinator): leader US-E1, follower US-E2; holds Account A (balance $500).
- Paxos group EU-WEST (participant): leader EU-W1, follower EU-W2; holds Account B (balance $200).

Phase 1: Prepare
- 1a. The coordinator logs PREPARE via Paxos (replicated to US-E2) and sends PREPARE{txn, writes_A} to the participant.
- 1b. The participant acquires locks, logs PREPARED via its own Paxos group, and replies PREPARED{max_timestamp}.
- 1c. The coordinator handles its own writes locally: it acquires locks on Account A and logs PREPARED.

Phase 2: Commit
- 2a. The coordinator assigns the commit timestamp s = max(all participant timestamps, TT.now().latest).
- 2b. Commit-wait: the coordinator waits until TT.after(s) is true, ensuring s has passed everywhere.
- 2c. The coordinator logs COMMIT(s) via Paxos. This is the commit point: the transaction is now durably committed.
- 2d. The coordinator sends COMMIT{timestamp=s}; the participant logs COMMIT(s), applies its writes, and releases its locks.
- 2e. The coordinator applies its writes locally (Account A: $500 → $400), releases its locks, and receives the participant's ACK. Both accounts are now updated atomically at timestamp s.

Key insight: fault tolerance.
- If US-E1 (the coordinator leader) fails after Phase 1, US-E2 already has the PREPARE record in its Paxos log; it becomes the new leader and continues the protocol. The transaction is never stuck.
- If EU-W1 (the participant leader) fails after Phase 1, EU-W2 has the PREPARED record in its Paxos log; it becomes the new leader and can respond to the coordinator. The transaction completes normally.
- Every state transition is logged via Paxos before proceeding, making the entire protocol recoverable from any single failure.

Timestamp Coordination in 2PC:
A subtle but critical aspect of distributed transactions is timestamp assignment. The commit timestamp must be no smaller than any timestamp a participant has already used or observed, and no smaller than the coordinator's own TT.now().latest.
During prepare, each participant reports the maximum timestamp it has seen. The coordinator takes the maximum of all these values and TT.now().latest to assign the commit timestamp.
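The timestamp rule and commit-wait can be expressed compactly. The `tt_now()` interval clock below is an illustrative stand-in for TrueTime with a fixed uncertainty bound, not a real API.

```python
import time

EPSILON = 0.004  # illustrative TrueTime uncertainty bound, in seconds

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)   # (earliest, latest)

def assign_commit_timestamp(participant_prepare_timestamps):
    # The commit timestamp must be >= every participant's prepare timestamp
    # and >= any timestamp that could already have been observed anywhere.
    return max(participant_prepare_timestamps + [tt_now()[1]])

def commit_wait(s):
    # Block until TT.after(s): even the most pessimistic reading of any
    # clock in the fleet now agrees that s is in the past.
    while tt_now()[0] <= s:
        time.sleep(0.001)

# Example: two participants reported prepare timestamps; the coordinator
# picks s = max(t1, t2, TT.now().latest), waits out the uncertainty, and
# only then sends COMMIT(s).
s = assign_commit_timestamp([time.time() - 0.002, time.time() - 0.001])
commit_wait(s)
```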
Commit-Wait in Distributed Transactions:
The coordinator performs commit-wait after assigning the timestamp but before sending the COMMIT message. This means participants never perform their own commit-wait: by the time they receive COMMIT, the timestamp is already guaranteed to be in the past everywhere, and the coordinator's wait largely overlaps with the Paxos logging and network round trips the protocol requires anyway.
This optimization hides much of the commit-wait latency within the distributed protocol overhead.
The coordinator's location affects transaction latency since commit-wait happens there. Spanner strategically selects coordinators to minimize latency—often choosing the Paxos group geographically central to all participants.
Locking creates the possibility of deadlocks: Transaction A holds lock X and waits for lock Y, while Transaction B holds lock Y and waits for lock X. Neither can proceed.
Spanner prevents deadlocks using the wound-wait protocol—a priority-based scheme using transaction timestamps:
The Wound-Wait Rules: when a transaction requests a lock held by another, their start timestamps are compared. If the requester is older (smaller timestamp), it wounds the holder, which aborts and releases its locks; if the requester is younger, it simply waits.
Why This Prevents Deadlocks:
Consider the potential deadlock scenario from above: Transaction A holds lock X and wants lock Y, while Transaction B holds lock Y and wants lock X.
With wound-wait: if A is older than B, then when A requests lock Y it wounds B. B aborts, releases lock Y, and A proceeds.
Alternatively: if A is younger than B, A waits for lock Y. When B requests lock X, B (being older) wounds A; A aborts, releases lock X, and B proceeds.
In no scenario do both transactions wait for each other.
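The decision rule itself is tiny. A sketch, with plain integers standing in for transaction start timestamps:

```python
def resolve_conflict(requester_ts, holder_ts):
    """Wound-wait: smaller timestamp = older = higher priority."""
    if requester_ts < holder_ts:
        # The older transaction wounds the younger holder: the holder is
        # aborted, releases its locks, and will retry with its ORIGINAL
        # timestamp, so it keeps "aging" and eventually wins.
        return "WOUND_HOLDER"
    else:
        # The younger requester simply waits; it can never be waited on
        # by an older transaction, so no cycle can form.
        return "WAIT"

# T1 (ts=100) requests a lock held by T2 (ts=200): T1 wounds T2.
assert resolve_conflict(100, 200) == "WOUND_HOLDER"
# T2 (ts=200) requests a lock held by T1 (ts=100): T2 waits.
assert resolve_conflict(200, 100) == "WAIT"
```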
Implementation Details:
Scenario 1: an older transaction wants a lock held by a younger one. T1 (timestamp 100) wants Lock X, which is held by T2 (timestamp 200). T1 is older (100 < 200), so T1 has higher priority. Action: T1 "wounds" T2. T2 receives an abort signal, releases all of its locks, and rolls back; T1 acquires Lock X. T2 will retry later with the SAME timestamp (200), so it gains "age" relative to newly started transactions.

Scenario 2: a younger transaction wants a lock held by an older one. T2 (timestamp 200) wants Lock Y, which is held by T1 (timestamp 100). T2 is younger (200 > 100), so T2 has lower priority. Action: T2 waits. T2 blocks until T1 releases Lock Y; T1 continues normally, and when T1 commits or aborts, Lock Y is released and T2 acquires it.

Why deadlock is impossible: a deadlock requires a cycle such as T1 waits for T2 and T2 waits for T1. For T1 to wait for T2, T1 must be younger than T2; for T2 to wait for T1, T2 must be younger than T1. Both cannot hold, because timestamps form a total order. The wait-for graph is therefore always acyclic, so deadlocks cannot occur.

Example with multiple competing transactions: T1 (100) holds Lock A, T2 (150) holds Lock B, T3 (200) holds Lock C, and each wants a lock another holds. T1 wants Lock B (held by T2): 100 < 150, so T1 wounds T2 and T2 aborts. T1 then wants Lock C (held by T3): 100 < 200, so T1 wounds T3 and T3 aborts. T1 completes; T2 and T3 retry later with priorities 150 and 200 respectively and will eventually complete, so the system always makes progress.

Retry Behavior:
When a transaction is aborted due to being wounded, it is rolled back and retried, and the retry keeps the transaction's original timestamp. Its priority therefore only rises relative to newer transactions, so every transaction eventually wins its locks and completes.
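In application code, wounded transactions surface as aborted attempts. With the Cloud Spanner Python client, the `run_in_transaction` helper retries the work function when an attempt is aborted, which is why the function must be safe to run more than once. A sketch with placeholder names:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

def update_profile(transaction):
    # May be executed more than once if the transaction is wounded and
    # retried, so it must not have side effects outside the transaction.
    transaction.execute_update(
        "UPDATE users SET email = 'new@email.com' WHERE user_id = 123")

# run_in_transaction re-invokes update_profile on aborted attempts
# (including wound-wait aborts) until the commit succeeds.
database.run_in_transaction(update_profile)
```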
The Cost of Wound-Wait:
Wound-wait has a cost: transactions may be aborted and retried. In high-contention scenarios, this can cause elevated abort rates, wasted work from rolled-back transactions, and higher tail latencies as retries accumulate.
However, for most workloads, wound-wait's overhead is minimal compared to deadlock detection and resolution approaches used by other systems.
The best way to minimize wound-wait overhead is to design for low contention: use interleaved tables to localize data, keep transactions short, and avoid hot rows that many transactions update. Well-designed schemas rarely experience significant abort rates.
Let's trace through a complete read-write transaction lifecycle, from client request to committed response.
Pre-Transaction: Session and Context
Before executing transactions, clients establish a session with Spanner. Sessions maintain:
The Transaction Lifecycle:
Example: update a user's email and log the change.

BEGIN TRANSACTION;
SELECT email FROM users WHERE user_id = 123; -- Read
UPDATE users SET email = 'new@email.com' WHERE user_id = 123;
INSERT INTO audit_log (user_id, action, timestamp) VALUES (123, 'email_update', CURRENT_TIMESTAMP); -- Write
COMMIT;

Phase 1: Transaction Start. The client calls BeginTransaction(readWrite) on the Spanner frontend and receives a unique transaction ID (txn_abc123) and a start timestamp (t_start = 1704672000). The start timestamp is used for wound-wait priority.

Phase 2: Read Phase. The client reads users row 123 through the frontend, which asks the leader of the users Paxos group to acquire a read lock on the row (the request may wait, or wound a conflicting transaction). Once the lock is granted, the leader reads the row and returns {email: old@...} to the client. Read locks prevent concurrent modifications but are compatible with other read locks, and the data is returned to the client for application logic.

Phase 3: Write Buffering. The client issues UPDATE(users, 123, email='new@...') and INSERT(audit_log, ...). The mutations are buffered in the transaction context and are not yet sent to any Paxos group, which lets the client issue multiple operations efficiently.

Phase 4: Commit (single Paxos group case). Because users and audit_log may be interleaved in the same Paxos group, the frontend sends a single PrepareAndCommit request with the buffered mutations and held locks to the group leader, which: (1) upgrades the read lock to a write lock, (2) validates that there are no conflicts, (3) assigns the commit timestamp s, (4) logs the commit via Paxos, (5) performs commit-wait until TT.after(s), (6) applies the mutations, and (7) releases the locks. The client receives COMMITTED with timestamp s.

Typical single-group latency breakdown: read phase 5-20ms (depends on data locality), client-side buffering ~0ms, lock upgrade 1-2ms, Paxos logging 5-10ms (depends on replica spread), commit-wait 4-7ms (TrueTime uncertainty), apply <1ms; total 15-40ms.

Multi-Group Transaction Flow:
When the transaction spans multiple Paxos groups (e.g., users in one group, audit_log in another), the flow adds 2PC coordination: one group is chosen as coordinator, each participant runs a prepare round (acquiring locks and logging PREPARED via its own Paxos group), and the coordinator then assigns the timestamp, performs commit-wait, logs COMMIT, and fans the decision out to the participants.
Latency Impact of Distribution:
Multi-group transactions add a prepare round trip to every participant, an extra Paxos write per participant for the PREPARED record, and the coordinator's commit decision and fan-out on top of the single-group path.
For geographically distributed groups, this can add 100-200ms compared to single-group transactions.
The difference between a single-group transaction (15-40ms) and a multi-group transaction (100-300ms) is dramatic. This is why Spanner's interleaved table design is so important—keeping related data in the same Paxos group avoids distributed transaction overhead.
Read-only transactions are a powerful optimization in Spanner. They provide serializable isolation without acquiring locks, enabling high read throughput without blocking writers.
How Read-Only Transactions Work: Spanner assigns the transaction a read timestamp, and every read executes against the consistent snapshot at that timestamp. No locks are taken, and any replica that is sufficiently caught up to the timestamp can serve the reads.
The Power of Lock-Free Reads:
Consider a report that reads from multiple tables:
SELECT SUM(balance) FROM accounts WHERE region = 'US';
SELECT COUNT(*) FROM transactions WHERE date = TODAY;
SELECT * FROM audit_log WHERE timestamp > YESTERDAY;
With a read-only transaction, all three queries observe the same consistent snapshot, no locks are acquired, concurrent writers are never blocked, and each read can be served by the replica closest to the client.
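With the Python client, the whole report can run inside one multi-use read-only snapshot. A sketch assuming the tables from the queries above, with the pseudo-values TODAY and YESTERDAY replaced by concrete GoogleSQL expressions and placeholder instance/database names:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

# One multi-use read-only snapshot: all three queries observe the same
# consistent timestamp, acquire no locks, and never block writers.
with database.snapshot(multi_use=True) as snapshot:
    total = list(snapshot.execute_sql(
        "SELECT SUM(balance) FROM accounts WHERE region = 'US'"))
    count = list(snapshot.execute_sql(
        "SELECT COUNT(*) FROM transactions WHERE date = CURRENT_DATE()"))
    recent = list(snapshot.execute_sql(
        "SELECT * FROM audit_log "
        "WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)"))
```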
| Aspect | Read-Write Transaction | Read-Only Transaction |
|---|---|---|
| Locks | Read locks during reads, write locks at commit | No locks |
| Writer Blocking | Can block concurrent writers on same rows | Never blocks writers |
| Replica Choice | Reads must go to leader (for consistency) | Any replica with data at timestamp |
| Geographic Impact | Bound to leader location for reads | Can read from nearest replica |
| Commit Overhead | Paxos + Commit-wait | None (no commit phase) |
| Abort Possible? | Yes (conflicts, deadlocks) | No (nothing to abort) |
Strong vs. Stale Reads:
The timestamp selection for read-only transactions creates different consistency-latency tradeoffs:
Strong Reads: the read timestamp reflects every transaction committed before the read began. This may require contacting the leader or waiting for a replica to catch up, which adds latency, especially across regions.
Bounded Staleness: the application specifies a maximum acceptable staleness (for example, 10 seconds), and Spanner chooses the freshest timestamp the serving replica can provide within that bound, usually avoiding any leader round trip.
Exact Staleness: the application names a precise timestamp or a fixed duration in the past. Any replica that has caught up to that timestamp can answer immediately, making this the cheapest and most predictable option.
Multi-Region Optimization:
For globally deployed applications, stale reads provide dramatic performance improvements: a strong read from a distant region may need to reach a faraway leader and take 50-150ms, while a stale read is answered by the nearest replica in 1-10ms (see the latency table below).
For read-heavy workloads where slight staleness is acceptable, this 30x improvement transforms user experience.
A common pattern: use strong reads for critical user-facing operations (checking account balance before transfer) but stale reads for non-critical reads (displaying recent transactions). This provides strong consistency where it matters while optimizing performance elsewhere.
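A sketch of this mixed pattern with the Python client: the default snapshot gives a strong read for the balance check, while `exact_staleness` serves the transaction list from a nearby replica. Table and column names and the 15-second bound are placeholders.

```python
import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

# Strong read: reflects everything committed before the read started.
# Used here for the balance check that gates a transfer.
with database.snapshot() as snapshot:
    balance = list(snapshot.execute_sql(
        "SELECT balance FROM accounts WHERE id = 1"))

# Stale read: served at a timestamp 15 seconds in the past, so the
# nearest replica can answer; fine for a "recent transactions" display.
with database.snapshot(
        exact_staleness=datetime.timedelta(seconds=15)) as snapshot:
    recent = list(snapshot.execute_sql(
        "SELECT * FROM transactions WHERE account_id = 1 "
        "ORDER BY ts DESC LIMIT 20"))
```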
Understanding Spanner's performance characteristics helps you design schemas and applications that work with the system rather than against it.
Latency Components:
Total transaction latency comprises several components: network round trips, Paxos replication, 2PC coordination for multi-group transactions, commit-wait, and the reads themselves:
| Scenario | Components | Typical Latency | Notes |
|---|---|---|---|
| Single-group write (regional) | Network + Paxos + Commit-wait | 10-25ms | Best case for writes |
| Single-group write (multi-region) | Network + Paxos + Commit-wait | 50-150ms | Paxos spans regions |
| Multi-group write (regional) | Network + 2PC + Paxos + Commit-wait | 20-50ms | 2PC adds coordination |
| Multi-group write (multi-region) | Network + 2PC + Paxos + Commit-wait | 100-300ms | Geographic distribution dominates |
| Read-only strong (regional) | Network + Read | 5-15ms | May wait for replica catch-up |
| Read-only strong (multi-region) | Network + Read | 50-150ms | May need leader contact |
| Read-only stale (any config) | Network + Read | 1-10ms | Always served locally |
Throughput Considerations:
Spanner scales horizontally—adding more nodes increases capacity. However, several factors affect throughput:
1. Hot Spots: If many transactions access the same rows, lock contention limits throughput. Solutions include spreading writes across the key space (for example, sharding a hot counter across many rows, as sketched after this list), avoiding a single row that every transaction must update, and keeping hot rows out of long-running transactions.
2. Geographic Distribution: Multi-region configurations have lower per-transaction throughput due to coordination overhead. However, total system throughput may be higher due to geographic distribution of load.
3. Transaction Size: Larger transactions (more rows read/written) hold locks longer, reducing concurrency. Keep transactions small and focused.
4. 2PC Overhead: Multi-group transactions have lower throughput than single-group. Design schemas to keep related data together.
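One common mitigation for hot rows is to shard a hot counter across many rows so concurrent increments rarely collide. The sketch below assumes a hypothetical `counter_shards(counter_name, shard_id, value)` table and placeholder instance/database names.

```python
import random
from google.cloud import spanner

NUM_SHARDS = 16
client = spanner.Client()
database = client.instance("my-instance").database("my-db")

def increment_counter(counter_name):
    # Each increment touches one randomly chosen shard row, so up to
    # NUM_SHARDS increments can proceed without lock contention.
    shard = random.randrange(NUM_SHARDS)

    def work(transaction):
        transaction.execute_update(
            "UPDATE counter_shards SET value = value + 1 "
            "WHERE counter_name = @name AND shard_id = @shard",
            params={"name": counter_name, "shard": shard},
            param_types={"name": spanner.param_types.STRING,
                         "shard": spanner.param_types.INT64})

    database.run_in_transaction(work)

def read_counter(counter_name):
    # Reads are lock-free; they simply sum the shards at a snapshot.
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            "SELECT SUM(value) FROM counter_shards WHERE counter_name = @name",
            params={"name": counter_name},
            param_types={"name": spanner.param_types.STRING})
        return list(rows)[0][0]
```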
Benchmarks and Real-World Performance:
Google has published performance data showing:
Cloud Spanner pricing is based on node-hours and storage. Multi-region configurations require more nodes for the same throughput (due to coordination overhead). Factor this into capacity planning when choosing regional vs. multi-regional deployments.
We've explored the sophisticated transaction machinery that enables Spanner's globally distributed ACID guarantees. The key insights: layering 2PC on Paxos groups removes the classic coordinator single point of failure; TrueTime commit timestamps plus commit-wait deliver external consistency; wound-wait prevents deadlocks without any detection machinery; read-only transactions are lock-free and can be served by nearby replicas; and schema design, especially interleaving, determines whether a transaction stays on the fast single-group path.
What's Next:
With transactions covered, we'll explore how Spanner automatically manages data distribution. The next page covers Automatic Sharding—how Spanner dynamically partitions, rebalances, and moves data without application involvement or downtime.
You now understand Spanner's distributed transaction architecture—from single-group commits through multi-group 2PC, from lock management to performance optimization. Next, we'll see how Spanner automatically manages the physical distribution of data.