In the Two-Phase Commit protocol, the coordinator (also called the transaction manager or transaction coordinator) serves as the central authority orchestrating the commit process. While participants execute local transaction operations, it is the coordinator that transforms a collection of independent local decisions into a unified global outcome.
The coordinator's role is both powerful and perilous. It is the single entity that can definitively resolve whether a distributed transaction commits or aborts. Yet this centrality also makes it a critical point of failure—if the coordinator fails at the wrong moment, participants may be left stranded in an uncertain state, unable to proceed. Understanding the coordinator's responsibilities, mechanisms, and recovery procedures is essential for anyone implementing or troubleshooting distributed transaction systems.
By the end of this page, you will have a comprehensive understanding of the coordinator's responsibilities throughout the distributed transaction lifecycle. You'll understand transaction initiation, participant tracking, vote collection, decision dissemination, timeout management, and the coordinator's recovery procedures after failures.
Before examining the coordinator's operational responsibilities, we must understand how a coordinator is selected and what architectural role it plays in the distributed system.
Coordinator Selection Strategies:
In production distributed database systems, the coordinator is typically selected through one of several mechanisms:
1. Client-Designated Coordinator The application or client driver selects which node will coordinate the transaction, often based on which node was first contacted or which holds the primary data being accessed. This is common in sharded database systems.
2. First-Participant Coordinator The first database node accessed by the transaction automatically becomes the coordinator. Subsequent participants register with this initial node.
3. Dedicated Transaction Manager A separate, specialized transaction manager service coordinates all distributed transactions. This is common in enterprise transaction processing systems using X/Open XA or similar standards.
4. Leader Election In highly available systems, a leader election protocol (Raft, Paxos) selects a node to serve as coordinator. If this coordinator fails, a new leader is elected.
| Approach | Advantages | Disadvantages | Use Case |
|---|---|---|---|
| Client-Designated | Predictable, client control, low latency | Requires client awareness, potential hotspots | Sharded databases (CockroachDB) |
| First-Participant | Automatic, no central service | Coordinator location varies, harder to monitor | Federated databases |
| Dedicated TM | Centralized management, auditing | Single point of failure without HA | Enterprise OLTP (Tuxedo, CICS) |
| Leader Election | High availability, fault-tolerant | Election latency, complexity | Cloud-native databases (Spanner) |
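As a concrete illustration of the first-participant approach (row 2 above), here is a minimal sketch of how a routing layer might pin the coordinator to the first node a transaction touches. The `TransactionRouter` class and its API are hypothetical names for illustration:

```typescript
// Hypothetical sketch: first-participant coordinator selection.
// The first node a transaction touches becomes its coordinator;
// later participants are routed to that same node for registration.
type NodeId = string;

class TransactionRouter {
  private coordinators: Map<string, NodeId> = new Map();

  // Called on each data access; returns the coordinator for this transaction.
  access(txId: string, nodeId: NodeId): NodeId {
    const existing = this.coordinators.get(txId);
    if (existing === undefined) {
      // First node touched becomes the coordinator.
      this.coordinators.set(txId, nodeId);
      return nodeId;
    }
    // Subsequent accesses route registration to the existing coordinator.
    return existing;
  }
}
```

Note the per-transaction mapping: different transactions may have different coordinators, which spreads coordination load but makes monitoring harder, as the table notes.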
Coordinator Identity Persistence:
Regardless of how the coordinator is selected, it is critical that all participants know the coordinator's identity and can contact it to resolve uncertainty. Each participant must record this identity durably (typically alongside its prepare record) and it must remain resolvable across failures, so that a recovering participant can reach the coordinator or its successor.
Without this information, a participant that crashed while in the PREPARED state would have no way to discover the transaction's outcome upon recovery.
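The requirement above can be sketched in a few lines: a participant that records the coordinator's identity inside its own prepare record can rediscover whom to query after a crash. The class and field names below are illustrative assumptions:

```typescript
// Illustrative sketch: a participant stores the coordinator's identity
// inside its prepare record, so a recovering participant in the
// PREPARED state knows whom to ask about the transaction's outcome.
interface PrepareRecord {
  transactionId: string;
  coordinatorId: string;       // who to query after a crash
  coordinatorEndpoint: string; // how to reach them
}

class ParticipantLog {
  private records: PrepareRecord[] = [];

  logPrepare(rec: PrepareRecord): void {
    // In a real system this would be a forced write to stable storage.
    this.records.push(rec);
  }

  // After recovery: find the coordinator for an in-doubt transaction.
  coordinatorFor(txId: string): PrepareRecord | undefined {
    return this.records.find(r => r.transactionId === txId);
  }
}
```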
To manage distributed transactions effectively, the coordinator maintains several essential data structures. These structures track the state of ongoing transactions, participant information, and timing constraints.
Essential Coordinator Data Structures:
```typescript
// Transaction Record maintained by the Coordinator
interface TransactionRecord {
  // Unique identifier for the distributed transaction
  transactionId: string;
  // Current state in the coordinator's state machine
  state: 'INITIAL' | 'WAIT' | 'COMMIT' | 'ABORT' | 'END';
  // List of all participants in this transaction
  participants: ParticipantInfo[];
  // Timestamp when commit process began
  prepareStartTime: number;
  // Timeout value for vote collection
  voteTimeout: number;
  // Received votes from participants
  votes: Map<string, 'VOTE_COMMIT' | 'VOTE_ABORT'>;
  // Acknowledgments received from participants
  acknowledgments: Set<string>;
  // The global decision once determined
  decision?: 'COMMIT' | 'ABORT';
}

// Information about each participant
interface ParticipantInfo {
  // Unique identifier for the participant
  participantId: string;
  // Network address for communication
  endpoint: string;
  // Whether we've received their vote
  hasVoted: boolean;
  // The vote received (if any)
  vote?: 'VOTE_COMMIT' | 'VOTE_ABORT';
  // Whether they've acknowledged the decision
  hasAcknowledged: boolean;
  // Number of retry attempts for this participant
  retryCount: number;
}

// The Coordinator's persistent transaction table
class CoordinatorTransactionManager {
  // In-memory transaction table (also persisted to log)
  private activeTransactions: Map<string, TransactionRecord>;
  // Pending timer handles for timeouts
  private timeoutHandles: Map<string, Timer>;
  // Durable log for recovery
  private durableLog: DurableLog;
}
```

Transaction Table:
The transaction table is the coordinator's primary data structure—a mapping from transaction IDs to transaction records. This table must support fast lookup by transaction ID, concurrent updates as votes and acknowledgments arrive, and durable persistence (via the log) so it can be rebuilt after a crash.
Participant Registry:
For each transaction, the coordinator maintains a participant registry recording each participant's identifier, network endpoint, vote status and received vote, acknowledgment status, and retry count.
This registry is populated during the transaction's execution phase as participants join the distributed transaction.
Before the commit process begins, there's an implicit or explicit registration phase. When a transaction first accesses data at a participant node, that participant registers with the coordinator. This registration must complete before the prepare phase begins—the coordinator must know all participants to send them PREPARE messages.
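This register-then-seal ordering can be sketched as follows. The `register`/`sealForPrepare` API is a hypothetical illustration of the constraint described above: once the prepare phase begins, the membership is fixed and late joins must be rejected.

```typescript
// Sketch of the implicit registration phase. Participants join while the
// transaction executes; when commit is requested, the membership is
// sealed so the coordinator knows exactly who must receive PREPARE.
class ParticipantRegistry {
  private participants: Map<string, Set<string>> = new Map();
  private sealed: Set<string> = new Set();

  register(txId: string, participantId: string): void {
    if (this.sealed.has(txId)) {
      throw new Error(`Cannot join ${txId}: prepare phase already started`);
    }
    let members = this.participants.get(txId);
    if (!members) {
      members = new Set();
      this.participants.set(txId, members);
    }
    members.add(participantId);
  }

  // Called when the application requests commit: no more joins allowed.
  sealForPrepare(txId: string): string[] {
    this.sealed.add(txId);
    return Array.from(this.participants.get(txId) ?? []);
  }
}
```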
Let's trace through the complete lifecycle of a distributed transaction from the coordinator's standpoint, examining each phase in detail.
Phase 0: Transaction Initiation and Participant Registration
The coordinator's involvement begins when the distributed transaction is initiated: it assigns a globally unique transaction ID, creates a transaction record in the INITIAL state, and registers each participant in the transaction record as the transaction first touches data on that node.
Phase 1: Vote Collection (Coordinator's Prepare Phase Actions)
When the application requests commit, the coordinator transitions from transaction execution to the commit protocol:
Step 1: Log Prepare Initiation
log_force(<PREPARE T, participants=[P1, P2, P3], timeout=30s>)
Step 2: Transition to WAIT State Update the transaction record's state to WAIT.
Step 3: Start Timeout Timer Initiate a timer for vote collection. If this timer expires without receiving all votes, the coordinator will decide ABORT.
Step 4: Send PREPARE to All Participants Send PREPARE messages to all registered participants. These messages should include the transaction ID and may include other context.
Step 5: Collect Votes Wait for VOTE_COMMIT or VOTE_ABORT from each participant. Record each vote in the transaction record.
Step 6: Make Decision Once all votes are received (or the timeout expires), the coordinator decides: COMMIT if and only if every participant voted VOTE_COMMIT; ABORT if any participant voted VOTE_ABORT or failed to respond before the timeout.
```typescript
class Coordinator {
  /**
   * Initiate the prepare phase for a distributed transaction
   */
  async initiateCommit(txId: string): Promise<'COMMIT' | 'ABORT'> {
    const txRecord = this.activeTransactions.get(txId);
    if (!txRecord || txRecord.state !== 'INITIAL') {
      throw new Error(`Invalid transaction state for commit: ${txId}`);
    }

    // Step 1: Force-write prepare record to durable log
    await this.durableLog.forceWrite({
      type: 'PREPARE',
      transactionId: txId,
      participants: txRecord.participants.map(p => p.participantId),
      timestamp: Date.now()
    });

    // Step 2: Transition to WAIT state
    txRecord.state = 'WAIT';
    txRecord.prepareStartTime = Date.now();

    // Step 3: Start timeout timer
    this.startVoteTimeout(txId);

    // Step 4: Send PREPARE to all participants (parallel, fire-and-forget);
    // we don't await these — votes are collected as they arrive
    for (const p of txRecord.participants) {
      this.sendPrepare(p, txId);
    }

    // Step 5: Wait for votes (with timeout)
    const decision = await this.collectVotes(txId);

    // Step 6: Execute commit phase
    return this.executeDecision(txId, decision);
  }

  /**
   * Collect votes from all participants
   */
  private async collectVotes(txId: string): Promise<'COMMIT' | 'ABORT'> {
    const txRecord = this.activeTransactions.get(txId)!;

    return new Promise((resolve) => {
      const checkComplete = () => {
        // Check if all votes received
        if (txRecord.votes.size === txRecord.participants.length) {
          // All votes in - check if unanimous COMMIT
          const allCommit = Array.from(txRecord.votes.values())
            .every(v => v === 'VOTE_COMMIT');
          this.clearVoteTimeout(txId);
          resolve(allCommit ? 'COMMIT' : 'ABORT');
          return;
        }
        // Check for any ABORT vote (early termination)
        if (Array.from(txRecord.votes.values()).includes('VOTE_ABORT')) {
          this.clearVoteTimeout(txId);
          resolve('ABORT');
        }
      };

      // Register vote handler
      this.voteHandlers.set(txId, (participantId, vote) => {
        txRecord.votes.set(participantId, vote);
        txRecord.participants
          .find(p => p.participantId === participantId)!.hasVoted = true;
        checkComplete();
      });

      // Register timeout handler
      this.timeoutHandlers.set(txId, () => {
        // Timeout - not all votes received, must abort
        resolve('ABORT');
      });
    });
  }
}
```

Once the coordinator has made a decision (either COMMIT or ABORT), it must reliably disseminate this decision to all participants. This is Phase 2 of the protocol, and the coordinator has specific responsibilities to ensure all participants receive and act on the decision.
Decision Persistence:
Before sending the decision to any participant, the coordinator must force-write the decision to its durable log. This is critical because recovery is driven entirely by the log: if the coordinator crashed after notifying some participants but before logging, recovery would have no record that a decision was made and could reach the opposite outcome, leaving participants inconsistent.
The Commit Point:
For COMMIT decisions, the moment when <COMMIT T> is written to stable storage is the commit point. This is the point of no return—once this record is durable, the transaction is committed regardless of any subsequent failure: the coordinator must drive the COMMIT to every participant, retrying as long as necessary, and no participant may unilaterally abort.
The decision record must be on stable storage before ANY participant receives the decision. If the coordinator crashes after sending COMMIT to one participant but before logging, recovery would not know to commit—potentially leading to one committed participant and others that abort. The log must always lead the execution.
Reliable Decision Delivery:
The coordinator must ensure every participant receives the decision. This requires a reliable delivery mechanism: send the decision, collect acknowledgments, and periodically resend to any participant that has not acknowledged, until all have responded or are presumed failed (in which case they learn the outcome by querying the coordinator after recovery).
```typescript
class Coordinator {
  /**
   * Execute the commit or abort decision
   */
  async executeDecision(txId: string, decision: 'COMMIT' | 'ABORT'): Promise<'COMMIT' | 'ABORT'> {
    const txRecord = this.activeTransactions.get(txId)!;

    // CRITICAL: Force-write decision to log BEFORE sending to participants
    await this.durableLog.forceWrite({
      type: decision,
      transactionId: txId,
      timestamp: Date.now()
    });

    // Update in-memory state
    txRecord.state = decision;
    txRecord.decision = decision;

    // Determine which participants need notification
    // For COMMIT: all participants
    // For ABORT: only those that voted COMMIT (others already aborted)
    const participantsToNotify = decision === 'COMMIT'
      ? txRecord.participants
      : txRecord.participants.filter(p => p.vote === 'VOTE_COMMIT');

    // Send decision to all relevant participants
    for (const participant of participantsToNotify) {
      this.sendDecision(participant, txId, decision);
    }

    // Start acknowledgment collection phase
    this.startAckPhase(txId, participantsToNotify);

    // Wait for all acknowledgments (with retry loop)
    await this.collectAcknowledgments(txId, participantsToNotify);

    // All acknowledged - log END and clean up
    await this.completeTransaction(txId);

    return decision;
  }

  /**
   * Collect acknowledgments with retry for reliability
   */
  private async collectAcknowledgments(
    txId: string,
    participants: ParticipantInfo[]
  ): Promise<void> {
    const txRecord = this.activeTransactions.get(txId)!;

    while (true) {
      // Check if all have acknowledged
      const allAcked = participants.every(p =>
        txRecord.acknowledgments.has(p.participantId)
      );
      if (allAcked) {
        return; // All done
      }

      // Find who hasn't acknowledged
      const unackedParticipants = participants.filter(p =>
        !txRecord.acknowledgments.has(p.participantId)
      );

      // Retry sending decision to unacknowledged participants
      for (const participant of unackedParticipants) {
        participant.retryCount++;
        if (participant.retryCount > MAX_RETRIES) {
          // Participant seems permanently failed.
          // Log this and consider them acknowledged
          // (when they recover, they'll query us)
          this.log.warn(`Participant ${participant.participantId} not responding`);
          txRecord.acknowledgments.add(participant.participantId);
          continue;
        }
        // Retry sending decision
        this.sendDecision(participant, txId, txRecord.decision!);
      }

      // Back off before the next retry round, scaling with the highest
      // retry count seen this round (capped)
      const maxRetry = Math.max(...unackedParticipants.map(p => p.retryCount));
      await this.sleep(RETRY_INTERVAL * Math.min(maxRetry, 5));
    }
  }

  /**
   * Complete the transaction - log END and clean up
   */
  private async completeTransaction(txId: string): Promise<void> {
    // Log transaction end
    await this.durableLog.forceWrite({
      type: 'END',
      transactionId: txId,
      timestamp: Date.now()
    });

    // Clean up in-memory structures
    this.activeTransactions.delete(txId);
    this.timeoutHandles.delete(txId);
    this.voteHandlers.delete(txId);
  }
}
```

Timeouts are essential for ensuring the coordinator doesn't wait indefinitely for participants. The coordinator uses several types of timeouts throughout the protocol.
Types of Coordinator Timeouts:
1. Prepare Timeout (Vote Collection Timeout)
After sending PREPARE to all participants, the coordinator starts a timer. If this timer expires before all votes are received, the coordinator treats the missing votes as VOTE_ABORT and decides ABORT; aborting is always safe as long as no decision has been logged.
2. Acknowledgment Timeout
After sending the global decision, the coordinator waits for acknowledgments. This timeout handling is more nuanced: the decision is already durable and cannot be changed, so the coordinator cannot simply give up. It must keep retrying the notification until every participant acknowledges or is presumed failed, in which case that participant learns the outcome by querying after recovery.
| Timeout Type | Typical Value | On Expiry | Blocked Resources |
|---|---|---|---|
| Prepare Timeout | 5-30 seconds | Decide ABORT | No locks held by coordinator |
| Ack Timeout (per retry) | 1-5 seconds | Retry send | Transaction record in memory |
| Ack Maximum Retries | 10-50 attempts | Log warning, mark complete | Log space for transaction |
| Transaction Total Timeout | Minutes to hours | Force abort (if possible) | All participant locks |
Timeout values are critical for system behavior: Too short means aborting transactions that could succeed (false positives). Too long means holding locks while waiting for failed participants (reduced availability). Production systems often use adaptive timeouts that adjust based on historical network latency and participant response times.
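One simple form of the adaptive timeout mentioned above is an exponentially weighted moving average (EWMA) of observed vote round-trip times. The constants, bounds, and safety multiplier below are illustrative assumptions, not tuned recommendations:

```typescript
// Sketch of an adaptive prepare timeout: track an EWMA of vote
// round-trip times and set the timeout to a clamped multiple of it.
class AdaptiveTimeout {
  private avgMs: number;

  constructor(
    initialMs: number,
    private alpha = 0.2,        // EWMA smoothing factor
    private multiplier = 4,     // safety margin over average RTT
    private floorMs = 1000,     // never shorter than this
    private ceilingMs = 30000   // never longer than this
  ) {
    this.avgMs = initialMs;
  }

  // Record a vote's observed round-trip time.
  observe(rttMs: number): void {
    this.avgMs = this.alpha * rttMs + (1 - this.alpha) * this.avgMs;
  }

  // Timeout = multiple of the average RTT, clamped to sane bounds.
  currentTimeoutMs(): number {
    const t = this.avgMs * this.multiplier;
    return Math.min(this.ceilingMs, Math.max(this.floorMs, t));
  }
}
```

The floor and ceiling guard against the two failure modes in the table: a floor prevents false-positive aborts when the network is briefly fast, and a ceiling bounds how long locks can be held waiting for a slow participant.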
Implementing Robust Timeouts:
A robust timeout implementation must handle concurrent operations, timer cancellation, and clock drift:
Use Monotonic Clocks: Wall-clock time can jump (NTP adjustments), causing spurious timeouts or missed timeouts. Use monotonic clocks for measuring elapsed time.
Handle Timer Cancellation: When a vote arrives, check if all votes are now complete. If so, cancel the prepare timeout before it fires.
Avoid Timer Storms: If many transactions timeout simultaneously (e.g., after a network partition heals), the system should handle the load gracefully.
Persist Timeout State: On recovery, recalculate how much time remains rather than restarting timers from zero.
```typescript
class TimeoutManager {
  private timers: Map<string, NodeJS.Timeout> = new Map();
  private deadlines: Map<string, number> = new Map(); // Monotonic time

  /**
   * Start a timeout for vote collection
   */
  startVoteTimeout(txId: string, timeoutMs: number, onTimeout: () => void): void {
    // Record deadline using monotonic time
    const deadline = this.getMonotonicTime() + timeoutMs;
    this.deadlines.set(txId, deadline);

    // Schedule callback
    const handle = setTimeout(() => {
      this.timers.delete(txId);
      this.deadlines.delete(txId);
      onTimeout();
    }, timeoutMs);

    this.timers.set(txId, handle);
  }

  /**
   * Cancel a pending timeout
   */
  cancelTimeout(txId: string): void {
    const handle = this.timers.get(txId);
    if (handle) {
      clearTimeout(handle);
      this.timers.delete(txId);
      this.deadlines.delete(txId);
    }
  }

  /**
   * On recovery, recalculate remaining timeout
   */
  resumeTimeout(txId: string, originalDeadline: number, onTimeout: () => void): void {
    const now = this.getMonotonicTime();
    const remaining = Math.max(0, originalDeadline - now);

    if (remaining === 0) {
      // Already expired
      onTimeout();
    } else {
      // Resume with remaining time
      this.startVoteTimeout(txId, remaining, onTimeout);
    }
  }

  /**
   * Get monotonic time (not affected by wall-clock adjustments)
   */
  private getMonotonicTime(): number {
    // In Node.js: process.hrtime() or perf_hooks
    // In browser: performance.now()
    return performance.now();
  }
}
```

When a coordinator crashes and restarts, it must recover the state of all active distributed transactions and complete them. The recovery procedure is driven entirely by the durable log.
Log-Based Recovery Algorithm:
The coordinator scans its log from the most recent checkpoint forward and categorizes each transaction:
Case 1: Only <PREPARE T> found, no <COMMIT T> or <ABORT T>. The crash occurred during vote collection and no decision was logged. The safe action is to log <ABORT T> and notify all participants to abort.
Case 2: <COMMIT T> found, no <END T>. The commit decision is durable, but some participants may not have received it. Recovery must resend COMMIT, collect acknowledgments, and then log <END T>.
Case 3: <ABORT T> found, no <END T>. Recovery resends ABORT to the participants that voted COMMIT, collects acknowledgments, and logs <END T>.
Case 4: <END T> found. The transaction completed fully before the crash; no recovery action is needed.
Case 5: No log records for a transaction (participant asks about it). The transaction never reached the prepare logging step, so no decision could have been made; the coordinator answers ABORT.
Many systems implement 'Presumed Abort' optimization: if there's no log record for a transaction, assume it aborted. This reduces logging for abort cases (no need to write <ABORT T>) and allows faster response to participant queries. However, COMMIT records must still be force-written before any participant is told to commit.
```typescript
class CoordinatorRecovery {
  /**
   * Recover coordinator state after crash
   */
  async recover(): Promise<void> {
    console.log('Starting coordinator recovery...');

    // Step 1: Scan log and categorize transactions
    const transactions = await this.classifyTransactions();

    // Step 2: Handle each category
    for (const [txId, state] of transactions) {
      await this.recoverTransaction(txId, state);
    }

    console.log('Coordinator recovery complete');
  }

  /**
   * Scan log and classify transaction states
   */
  private async classifyTransactions(): Promise<Map<string, TransactionState>> {
    const transactions = new Map<string, TransactionState>();

    // Read log from checkpoint
    const logRecords = await this.durableLog.readFromCheckpoint();

    for (const record of logRecords) {
      switch (record.type) {
        case 'PREPARE':
          transactions.set(record.transactionId, {
            phase: 'WAIT',
            participants: record.participants,
            decision: undefined
          });
          break;
        case 'COMMIT': {
          const commitTx = transactions.get(record.transactionId);
          if (commitTx) {
            commitTx.decision = 'COMMIT';
          }
          break;
        }
        case 'ABORT': {
          const abortTx = transactions.get(record.transactionId);
          if (abortTx) {
            abortTx.decision = 'ABORT';
          }
          break;
        }
        case 'END':
          // Transaction complete - remove from recovery list
          transactions.delete(record.transactionId);
          break;
      }
    }

    return transactions;
  }

  /**
   * Recover a single transaction
   */
  private async recoverTransaction(txId: string, state: TransactionState): Promise<void> {
    if (state.decision === 'COMMIT') {
      // Decision was COMMIT - need to complete it
      console.log(`Recovering COMMIT for ${txId}`);
      await this.completeCommit(txId, state.participants);
    } else if (state.decision === 'ABORT') {
      // Decision was ABORT - need to notify voting COMMIT participants
      console.log(`Recovering ABORT for ${txId}`);
      await this.completeAbort(txId, state.participants);
    } else {
      // No decision - crashed during vote collection
      // Safe choice: ABORT
      console.log(`No decision found for ${txId} - aborting`);
      await this.durableLog.forceWrite({
        type: 'ABORT',
        transactionId: txId
      });
      await this.completeAbort(txId, state.participants);
    }
  }

  /**
   * Handle participant query about transaction outcome
   */
  handleDecisionQuery(txId: string): 'COMMIT' | 'ABORT' {
    // Look up in recovered transactions
    const state = this.activeTransactions.get(txId);
    if (state?.decision === 'COMMIT') {
      return 'COMMIT';
    }
    // No record or explicit ABORT - return ABORT
    // (Presumed Abort)
    return 'ABORT';
  }
}
```

The coordinator is a single point of failure in classic 2PC. If the coordinator fails while participants are in the PREPARED state, those participants are blocked—unable to commit or abort—until the coordinator recovers. In production systems, this risk is mitigated through coordinator high availability (HA) mechanisms.
High Availability Strategies:
1. Synchronous Log Replication
The simplest HA approach replicates the coordinator's log to a standby node: every forced log write is also shipped to the standby and acknowledged before the coordinator proceeds. If the primary fails, the standby holds a complete copy of all logged decisions and can take over without losing any committed outcome.
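A minimal sketch of this idea follows, assuming a hypothetical `Standby` interface whose `replicate` resolves only once the record is durable on the standby. The key property is that `forceWrite` does not complete until both copies exist, which is the "+1 RTT per commit" cost in the comparison table below:

```typescript
// Conceptual sketch of synchronous log replication: a forced write
// returns only after the standby has acknowledged the record.
interface Standby {
  replicate(record: object): Promise<void>; // resolves once durable on standby
}

class ReplicatedLog {
  public localRecords: object[] = [];

  constructor(private standby: Standby) {}

  async forceWrite(record: object): Promise<void> {
    // Write locally (stand-in for a disk sync)...
    this.localRecords.push(record);
    // ...then wait for the standby's acknowledgment before proceeding,
    // so no decision exists on the primary alone.
    await this.standby.replicate(record);
  }
}
```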
2. Active-Passive Failover A standby coordinator monitors the primary (typically via heartbeats or a shared lease). On failure, the standby promotes itself, replays the replicated log, and resumes in-flight transactions exactly as the primary's own recovery procedure would.
3. Active-Active Coordination (Partitioned) Multiple coordinators run simultaneously, each responsible for a disjoint partition of the transaction ID space. A coordinator failure affects only its partition, and its transactions fail over to a designated peer.
| Approach | Failover Time | Data Loss Risk | Complexity | Latency Impact |
|---|---|---|---|---|
| Synchronous Replication | Seconds | None | Low-Medium | High (+1 RTT per commit) |
| Asynchronous Replication | Seconds | Some possible | Medium | Low |
| Consensus (Paxos/Raft) | Sub-second | None with quorum | High | Medium |
| Shared Storage | Seconds-Minutes | None if storage survives | Medium | Low |
Modern distributed databases like CockroachDB, TiDB, and YugabyteDB avoid the single-coordinator problem by using consensus protocols (Raft) for transaction coordination. Each transaction's state is replicated across multiple nodes, ensuring no single failure blocks the transaction. These systems effectively integrate 2PC with consensus-based replication.
Fencing to Prevent Split-Brain:
When using failover mechanisms, it's critical to ensure that a 'zombie' coordinator (an old primary that was incorrectly presumed dead) cannot interfere with the new coordinator. This is called the split-brain problem.
Fencing Mechanisms:
Epoch/Generation Numbers: Each coordinator incarnation has a unique epoch number. Participants reject messages from old epochs.
Lease-Based Leadership: The coordinator holds a time-limited lease. It cannot send decisions after its lease expires. The new coordinator waits for the old lease to expire.
STONITH (Shoot The Other Node In The Head): Physically power off the old coordinator before the new one takes over.
Consensus-Based Leadership: Use a consensus protocol to elect a single leader. Only the elected leader can coordinate transactions.
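The epoch mechanism described first can be sketched in a few lines: each participant remembers the highest coordinator epoch it has seen and rejects anything stamped with an older one, so a zombie coordinator's messages are ignored after failover. The class name is illustrative:

```typescript
// Illustrative epoch-based fencing. A failed-over coordinator restarts
// with a higher epoch; messages from the old incarnation carry a lower
// epoch and are rejected, preventing split-brain decisions.
class FencedParticipant {
  private highestEpoch = 0;

  // Returns true if the message is accepted.
  acceptMessage(epoch: number): boolean {
    if (epoch < this.highestEpoch) {
      return false; // stale coordinator (zombie) - reject
    }
    this.highestEpoch = epoch; // remember the newest incarnation
    return true;
  }
}
```

In practice the epoch is allocated durably (e.g., by the election service), so it only ever increases across coordinator incarnations.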
Logging is a significant performance bottleneck in 2PC—every transaction requires multiple synchronous writes to stable storage. Several optimizations reduce this overhead while maintaining correctness.
Key optimizations include presumed abort (discussed earlier, which eliminates logging for aborted transactions and enables fast answers to participant queries), group commit (batching many log forces into one disk sync), and deferring non-critical records such as <END T> rather than forcing them immediately.
Group Commit in Detail:
Group commit is the most impactful optimization. Instead of forcing each log record individually, the coordinator appends records to an in-memory batch and issues a single sync when the batch fills or a short timer expires; every transaction waiting on that batch then proceeds together.
This amortizes the cost of a single sync (typically 2-10ms for spinning disk, <1ms for NVMe SSD) across many transactions.
```typescript
class GroupCommitLog {
  private pendingRecords: LogRecord[] = [];
  private waiters: Map<string, () => void> = new Map();
  private flushTimer: NodeJS.Timeout | null = null;
  private readonly MAX_BATCH_SIZE = 100;
  private readonly MAX_WAIT_MS = 10;

  /**
   * Write a log record with group commit optimization
   */
  async writeWithGroupCommit(record: LogRecord): Promise<void> {
    return new Promise((resolve) => {
      // Add record to pending batch
      this.pendingRecords.push(record);
      this.waiters.set(record.transactionId, resolve);

      // Check if batch is full
      if (this.pendingRecords.length >= this.MAX_BATCH_SIZE) {
        this.flushBatch();
        return;
      }

      // Start timer if not already running
      if (!this.flushTimer) {
        this.flushTimer = setTimeout(
          () => this.flushBatch(),
          this.MAX_WAIT_MS
        );
      }
    });
  }

  /**
   * Flush pending records to disk
   */
  private async flushBatch(): Promise<void> {
    if (this.flushTimer) {
      clearTimeout(this.flushTimer);
      this.flushTimer = null;
    }
    if (this.pendingRecords.length === 0) return;

    // Capture current batch
    const batch = this.pendingRecords;
    const batchWaiters = new Map(this.waiters);
    this.pendingRecords = [];
    this.waiters = new Map();

    // Write and sync
    await this.writeToDisk(batch);
    await this.syncToDisk();

    // Notify all waiters
    for (const resolve of batchWaiters.values()) {
      resolve();
    }
  }
}
```

We've thoroughly examined the coordinator's responsibilities in the Two-Phase Commit protocol. The coordinator is the central authority that transforms distributed uncertainty into global consensus. Its responsibilities span the whole lifecycle: registering participants, force-logging the prepare record and the decision (the log must always lead execution), collecting votes and acknowledgments under timeouts with retries, and recovering in-flight transactions from the durable log after a crash.
What's Next:
The next page examines the Participant Role—how participants process coordinator messages, manage their local state, handle the uncertain PREPARED state, and recover after failures. Understanding both roles is essential for a complete picture of distributed transaction processing.
You now understand the coordinator's comprehensive responsibilities in the Two-Phase Commit protocol—from transaction initiation through vote collection, decision making, reliable notification, recovery, and high availability. Next, we'll explore how participants respond to and implement these coordinator directives.