In a distributed database system, failures are not exceptional events—they are inevitable realities. Network cables get unplugged. Servers crash unexpectedly. Power outages affect entire data centers. Hard drives fail silently. Messages are delayed indefinitely or arrive out of order. The Two-Phase Commit protocol must navigate these treacherous waters while still guaranteeing transaction atomicity.
The original 2PC design prioritizes safety over liveness. It ensures that transactions never violate atomicity (no partial commits) even at the cost of blocking during certain failure scenarios. Understanding how 2PC handles—and struggles with—failures is essential for architects and engineers who must build systems that remain both correct and available despite failures.
By the end of this page, you will understand every significant failure scenario in the Two-Phase Commit protocol: coordinator failures at each protocol phase, participant failures, network partitions, and message losses. You'll learn the recovery strategies for each scenario and understand why some failures inevitably cause blocking.
Before analyzing specific failure scenarios, we must establish the failure model—the assumptions about what kinds of failures can occur and how the system behaves when they happen.
Crash-Recovery Model:
The Two-Phase Commit protocol assumes a crash-recovery (or fail-stop with recovery) model: nodes fail only by halting, losing the contents of volatile memory but preserving stable storage (the log); a crashed node eventually restarts and runs a recovery procedure that reads its log; and running nodes never execute incorrectly or send bogus messages (no Byzantine behavior).
Network Model:
The network is assumed to be asynchronous: messages may be delayed arbitrarily, delivered out of order, duplicated, or lost, but are never corrupted, and there is no bound on delivery time that could be used to reliably distinguish a slow node from a crashed one.
| Failure Type | Description | 2PC Assumption | Handling Strategy |
|---|---|---|---|
| Crash Failure | Node stops executing entirely | Assumed—primary model | Logging, recovery protocol |
| Transient Crash | Node crashes and recovers quickly | Assumed | Log-based recovery |
| Permanent Crash | Node never recovers | Not explicitly handled | Administrative intervention |
| Message Loss | Individual messages fail to arrive | Network retries | Timeout + retry |
| Network Partition | Network split prevents communication | Eventually heals | Timeout, blocking possible |
| Byzantine Failure | Node sends incorrect/malicious messages | NOT assumed | Not handled by basic 2PC |
The Fischer-Lynch-Paterson theorem proves that no deterministic protocol can guarantee consensus in an asynchronous system with even one crash failure. 2PC navigates this impossibility by potentially blocking—it guarantees safety (atomicity) but not liveness (termination) in all failure scenarios.
Let's analyze what happens when the coordinator fails at various points during Phase 1 (the Prepare phase).
Scenario 2.1: Coordinator Crashes Before Sending Any PREPARE
Scenario 2.2: Coordinator Crashes After Sending Some PREPAREs
Scenario 2.3: Coordinator Crashes While Collecting Votes
Key Insight:
Coordinator failures during Phase 1 are relatively benign—the coordinator can safely decide ABORT upon recovery because a commit decision requires ALL votes, and an incomplete Phase 1 means not all votes were received.
With the presumed abort optimization, the coordinator doesn't need to log the abort decision. Any transaction with no COMMIT record is presumed aborted. This simplifies Phase 1 failure recovery—the coordinator just needs to ensure any PREPARED participants learn to abort.
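The presumed-abort rule can be captured in a few lines: when a recovering coordinator is asked about an in-doubt transaction, it only needs to check for a COMMIT record. A minimal sketch (the `LogRecord` shape and function name are illustrative, not a real API):

```typescript
type LogRecord = {
  txId: string;
  kind: 'PREPARE' | 'COMMIT' | 'ABORT' | 'END';
};

// Answer a participant's "what happened to T?" query after recovery.
// Presumed abort: any transaction with no COMMIT record is treated as
// aborted, so abort decisions never need to be force-logged.
function resolveInDoubt(log: LogRecord[], txId: string): 'COMMIT' | 'ABORT' {
  const committed = log.some(r => r.txId === txId && r.kind === 'COMMIT');
  return committed ? 'COMMIT' : 'ABORT';
}
```

Note that a transaction the coordinator has never heard of also resolves to ABORT, which is exactly what makes the optimization safe.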
Coordinator failures during Phase 2 are more complex because a commit decision may have been made. The exact recovery depends on what was logged before the crash.
Scenario 3.1: Coordinator Crashes Before Logging Decision
Critical Point: This scenario is safe because nothing was committed. The invariant is: if COMMIT wasn't logged, the transaction can be aborted.
There's a tiny window between when the coordinator decides to commit (in memory) and when it logs that decision. If the coordinator crashes in this window, the in-memory decision is lost. The force-write requirement ensures this window is as small as possible—the decision is logged synchronously before proceeding.
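The required ordering is easy to state in code: the decision record must be fsync'd before the first GLOBAL_COMMIT leaves the coordinator. A sketch under assumed interfaces (`Stable`, `forceLog`-style append/fsync, and `send` are illustrative):

```typescript
// Assumed minimal stable-storage interface for illustration.
interface Stable {
  append(rec: string): void;
  fsync(): void;
}

// The commit decision must reach stable storage BEFORE any
// GLOBAL_COMMIT message is sent; this is the force-write rule.
function decideCommit(
  txId: string,
  log: Stable,
  send: (participant: string, msg: string) => void,
  participants: string[],
): void {
  log.append(`COMMIT ${txId}`);
  log.fsync(); // force-write: closes the lost-decision window
  for (const p of participants) {
    send(p, `GLOBAL_COMMIT ${txId}`); // only after the decision is durable
  }
}
```

If the coordinator crashes before the `fsync` returns, recovery finds no COMMIT record and safely aborts; if it crashes after, recovery resends GLOBAL_COMMIT.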
Scenario 3.2: Coordinator Crashes After Logging COMMIT But Before Sending Any GLOBAL_COMMIT
Scenario 3.3: Coordinator Crashes After Sending Some GLOBAL_COMMIT Messages
Scenario 3.4: Coordinator Crashes After Logging ABORT But Before Sending GLOBAL_ABORT
Summary of Coordinator Recovery Actions:
The coordinator's recovery action depends entirely on what's in the log:
| Log Contains | Recovery Action | Rationale |
|---|---|---|
| No PREPARE record | Nothing to do | Transaction never started committing |
| PREPARE but no decision | Decide ABORT, send GLOBAL_ABORT | Phase 1 incomplete or decision never made |
| COMMIT record, no END | Resend GLOBAL_COMMIT to all | Decision made, must complete |
| ABORT record, no END | Resend GLOBAL_ABORT to PREPARED participants | Decision made, must complete |
| END record | Nothing to do | Transaction fully complete |
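Because the recovery action depends only on the most advanced log record, the table above reduces to a direct lookup. A sketch with illustrative names:

```typescript
type CoordLogState = 'NONE' | 'PREPARE' | 'COMMIT' | 'ABORT' | 'END';
type RecoveryAction =
  | 'NOTHING'
  | 'DECIDE_ABORT'
  | 'RESEND_COMMIT'
  | 'RESEND_ABORT';

// The coordinator recovery table as a lookup, keyed by the furthest
// record found in the log for a transaction.
const coordinatorRecoveryAction: Record<CoordLogState, RecoveryAction> = {
  NONE: 'NOTHING',         // transaction never started committing
  PREPARE: 'DECIDE_ABORT', // Phase 1 incomplete: abort is always safe
  COMMIT: 'RESEND_COMMIT', // decision durable: must be pushed to completion
  ABORT: 'RESEND_ABORT',   // likewise, resent to PREPARED participants
  END: 'NOTHING',          // transaction fully complete
};
```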
Participant failures affect the coordinator's ability to collect votes and disseminate decisions. Let's examine the key scenarios.
Scenario 4.1: Participant Crashes During Execution (ACTIVE State)
Scenario 4.2: Participant Crashes Before Voting
Scenario 4.3: Participant Crashes After Voting COMMIT (PREPARED State)
Sub-scenario 4.3a: All Others Also Voted COMMIT
Sub-scenario 4.3b: Some Other Participant Voted ABORT
Scenario 4.4: Participant Crashes After Receiving Decision But Before Applying It
Notice that the coordinator may send the same decision multiple times (due to retries). The participant must handle this idempotently—processing GLOBAL_COMMIT when already committed should simply return ACK without re-committing.
The most significant limitation of Two-Phase Commit is the blocking problem. Under certain failure scenarios, participants can be stuck indefinitely, unable to make progress.
The Classic Blocking Scenario:
Why Participants Cannot Proceed:
Can't unilaterally COMMIT: They don't know if all others prepared. Perhaps another participant voted ABORT and the coordinator decided ABORT.
Can't unilaterally ABORT: The coordinator might have decided COMMIT (in memory before crashing). Other participants may have received COMMIT and already committed.
Cooperative termination fails: All participants are in PREPARED state. None of them knows the decision.
In this scenario, all participants hold locks and wait. Other transactions that need these resources are blocked. If the coordinator never recovers (hardware destroyed, disk unrecoverable), the participants may be blocked forever—or until an administrator manually resolves the situation.
Practical Mitigation Strategies:
While the blocking problem cannot be eliminated in 2PC, it can be mitigated:
1. Coordinator High Availability: Replicate the coordinator's state to standby nodes. If the primary fails, a standby takes over and completes the transaction.
2. Transaction Timeouts: Set maximum transaction durations. After a very long timeout (minutes or hours), administrators can manually resolve blocked transactions.
3. Presumed Abort with Persistent Decision: Ensure the COMMIT decision is replicated before the coordinator fails. If the decision is replicated, standbys can deliver it.
4. Use Consensus-Based Coordination: Modern systems (CockroachDB, Spanner) use Paxos or Raft to replicate coordinator state. Coordinator failure causes a brief delay for leader election, not indefinite blocking.
5. Accept Some Blocking: In many systems, coordinator failures are rare enough that brief blocking periods are acceptable. The key is ensuring coordinators are highly available.
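Mitigations 2 and 5 usually combine retry-with-backoff and an operator alert threshold. A sketch where all constants and names are illustrative:

```typescript
// A blocked PREPARED participant retries its status query with capped
// exponential backoff, and eventually flags the transaction for manual
// resolution. Thresholds here are arbitrary examples.
function nextRetryDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60_000,
): number {
  return Math.min(capMs, baseMs * 2 ** attempt); // 1s, 2s, 4s, ... capped
}

function shouldAlertOperator(
  blockedMs: number,
  alertAfterMs: number = 10 * 60 * 1000, // e.g. 10 minutes in-doubt
): boolean {
  return blockedMs >= alertAfterMs;
}
```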
Network partitions—when nodes can communicate with some peers but not others—create some of the most challenging failure scenarios for 2PC.
Scenario 6.1: Partition During Phase 1
The network partitions after the coordinator sends PREPARE to some participants:
Scenario 6.2: Partition During Phase 2
This is more problematic. After all participants vote COMMIT:
Effects:
Scenario 6.3: Asymmetric Partition
Consider a more complex scenario:
If P1 receives GLOBAL_COMMIT:
Key Insight: Partial Information Propagation
Cooperative termination helps when at least one participant has definitive information (COMMITTED or ABORTED state). That participant can share the outcome with others. But if the partition isolates all PREPARED participants from any that know the outcome, blocking occurs.
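The cooperative termination rule can be sketched as a function of the peer states a blocked participant observes. One common extension is also shown: a peer still ACTIVE has not voted, so it can still veto and ABORT is safe. This is an illustrative sketch, not a complete termination protocol:

```typescript
type PeerState = 'ACTIVE' | 'PREPARED' | 'COMMITTED' | 'ABORTED';

// A blocked PREPARED participant polls its reachable peers:
// - any peer with a definitive outcome resolves the transaction;
// - an ACTIVE peer never voted, so the outcome can only be ABORT;
// - if every reachable peer is PREPARED, nothing is resolved.
function cooperativeTermination(
  peerStates: PeerState[],
): 'COMMIT' | 'ABORT' | 'BLOCKED' {
  if (peerStates.includes('COMMITTED')) return 'COMMIT';
  if (peerStates.includes('ABORTED')) return 'ABORT';
  if (peerStates.includes('ACTIVE')) return 'ABORT'; // unvoted peer: safe to abort
  return 'BLOCKED'; // all PREPARED: outcome unknown, still blocked
}
```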
Partition Healing:
When the partition heals:
The behavior of 2PC during network partitions illustrates the CAP theorem trade-offs. 2PC chooses Consistency over Availability: during a partition, it blocks (sacrifices availability) rather than allowing some nodes to commit while others might abort (which would violate consistency).
While TCP provides reliable delivery, messages can still be 'lost' from the application's perspective due to crashes, reordering, or extreme delays. 2PC must handle these cases gracefully.
Scenario 7.1: PREPARE Message Lost
Scenario 7.2: Vote Message Lost
| Lost Message | Sender State | Receiver Expected Action | Resolution |
|---|---|---|---|
| PREPARE | Coordinator in WAIT | Participant never votes | Timeout → ABORT |
| VOTE_COMMIT | Participant in PREPARED | Coordinator missing vote | Timeout → ABORT |
| VOTE_ABORT | Participant aborted | Coordinator missing vote | Timeout → ABORT (same outcome) |
| GLOBAL_COMMIT | Coordinator sent decision | Participant still PREPARED | Retry until ACK received |
| GLOBAL_ABORT | Coordinator sent decision | Participant still PREPARED | Retry until ACK received |
| ACK | Participant completed | Coordinator waiting for ACK | Retry decision, idempotent handling |
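The "retry until ACK" rows in the table above amount to a simple loop: the coordinator re-sends its decision to every participant that has not acknowledged, and duplicate deliveries are harmless because the participant handles them idempotently. A sketch with illustrative names (a real system would retry indefinitely with backoff rather than a fixed round count):

```typescript
// Re-deliver the decision until every participant ACKs (or rounds run out).
// `send` abstracts one delivery attempt and its outcome.
function deliverDecision(
  participants: string[],
  send: (participant: string) => 'ACK' | 'TIMEOUT',
  maxRounds: number = 5,
): Set<string> {
  const acked = new Set<string>();
  for (let r = 0; r < maxRounds && acked.size < participants.length; r++) {
    for (const p of participants) {
      if (!acked.has(p) && send(p) === 'ACK') acked.add(p);
    }
  }
  return acked; // coordinator logs <END T> once this covers everyone
}
```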
Scenario 7.3: GLOBAL_COMMIT Lost
Scenario 7.4: ACK Message Lost
Delayed Messages:
Messages that arrive very late can also cause issues:
Sequence Numbers and Transaction IDs:
To handle duplicates and out-of-order messages:
```typescript
type TxState = 'ACTIVE' | 'PREPARED' | 'COMMITTED' | 'ABORTED' | 'UNKNOWN';

abstract class ParticipantMessageHandler {
  // Storage and transaction hooks supplied by the concrete participant.
  protected abstract getTransactionState(txId: string): Promise<TxState>;
  protected abstract checkPersistentLog(txId: string): Promise<TxState>;
  protected abstract executeCommit(txId: string): Promise<void>;
  protected abstract processPrepare(txId: string): Promise<'VOTE_COMMIT' | 'VOTE_ABORT'>;
  protected abstract getPreviousVote(txId: string): Promise<'VOTE_COMMIT' | 'VOTE_ABORT'>;

  /**
   * Handle incoming GLOBAL_COMMIT with idempotency
   */
  async handleCommit(txId: string): Promise<'ACK' | 'UNKNOWN'> {
    const state = await this.getTransactionState(txId);

    switch (state) {
      case 'PREPARED':
        // Normal case - execute the commit
        await this.executeCommit(txId);
        return 'ACK';

      case 'COMMITTED':
        // Already committed - duplicate message, just ACK
        console.log(`Duplicate COMMIT for ${txId} - already committed`);
        return 'ACK';

      case 'ABORTED':
        // This is unexpected - we should not receive COMMIT if we aborted
        // This could be a bug or message reordering issue
        throw new Error(`Received COMMIT for aborted transaction ${txId}`);

      case 'UNKNOWN': {
        // Transaction not found - might have been cleaned up long ago
        // Check persistent log to see if we ever committed it
        const logState = await this.checkPersistentLog(txId);
        if (logState === 'COMMITTED') {
          return 'ACK';
        } else if (logState === 'ABORTED') {
          throw new Error(`Stale COMMIT for aborted transaction ${txId}`);
        }
        // No record at all - very old transaction or bug
        return 'UNKNOWN';
      }

      default:
        throw new Error(`Unexpected state ${state} for transaction ${txId}`);
    }
  }

  /**
   * Handle incoming PREPARE with idempotency
   */
  async handlePrepare(txId: string): Promise<'VOTE_COMMIT' | 'VOTE_ABORT' | 'ALREADY_PREPARED'> {
    const state = await this.getTransactionState(txId);

    switch (state) {
      case 'ACTIVE':
        // Normal case - process the prepare
        return this.processPrepare(txId);

      case 'PREPARED':
        // Already prepared - duplicate message
        // Return our previous vote
        return this.getPreviousVote(txId);

      case 'COMMITTED':
      case 'ABORTED':
        // Transaction already finished - late PREPARE
        console.log(`Late PREPARE for finished transaction ${txId}`);
        // Inform coordinator of current state
        return state === 'COMMITTED' ? 'ALREADY_PREPARED' : 'VOTE_ABORT';

      default:
        // Unknown transaction - treat as new, but this is unusual
        return 'VOTE_ABORT';
    }
  }
}
```

Let's consolidate the recovery procedures for all failure scenarios into a comprehensive reference.
Coordinator Recovery Procedure:
```
COORDINATOR RECOVERY ALGORITHM
═══════════════════════════════════════════════════════════════

1. SCAN LOG from last checkpoint
   For each transaction T in log:

2. IF <END T> found:
   → Transaction complete, no action needed

3. ELSE IF <COMMIT T> found (but no <END T>):
   → Decision was COMMIT
   → Resend GLOBAL_COMMIT to all participants
   → Collect ACKs, log <END T> when all acknowledge

4. ELSE IF <ABORT T> found (but no <END T>):
   → Decision was ABORT
   → Resend GLOBAL_ABORT to participants that voted COMMIT
   → Collect ACKs, log <END T> when all acknowledge

5. ELSE IF <PREPARE T> found (but no decision):
   → Phase 1 was in progress, decision never made
   → Since we cannot confirm all votes, decide ABORT
   → Log <ABORT T>
   → Send GLOBAL_ABORT to all participants
   → Collect ACKs, log <END T>

6. HANDLE PARTICIPANT QUERIES:
   IF participant asks about transaction T:
   → Look up T in active transactions or log
   → IF T has COMMIT record: return COMMIT
   → IF T has ABORT record or not found: return ABORT (presumed abort)
```

Participant Recovery Procedure:
```
PARTICIPANT RECOVERY ALGORITHM
═══════════════════════════════════════════════════════════════

1. SCAN LOG from last checkpoint
   For each transaction T in log:

2. IF <COMMIT T> found:
   → Complete commit if not already done
   → Redo any changes if necessary
   → Release locks

3. ELSE IF <ABORT T> found:
   → Complete abort if not already done
   → Undo any changes if necessary
   → Release locks

4. ELSE IF <PREPARED T> found (no COMMIT or ABORT):
   → Transaction is in-doubt
   → Re-acquire locks based on write set in PREPARED record
   → Query coordinator for decision (may block if unreachable)
   → Once decision learned:
     → IF COMMIT: complete commit, release locks
     → IF ABORT: complete abort, release locks

5. ELSE (only execution records found):
   → Transaction was active, never voted
   → Abort locally using undo log
   → Release any locks
   → Log <ABORT T>

6. SPECIAL HANDLING FOR BLOCKED TRANSACTIONS:
   → Periodically retry querying coordinator
   → Try cooperative termination with other participants
   → Log warnings if blocked for extended period
   → Eventually may require administrator intervention
```

Production systems should regularly test recovery procedures using fault injection. Simulate crashes at every protocol phase, verify correct recovery, and measure recovery time. This is often formalized as 'chaos engineering' in modern distributed systems practice.
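In the spirit of that fault-injection advice, a toy recovery test can "crash" the coordinator just after a COMMIT record becomes durable and then verify that recovery re-delivers the decision. Everything below is an illustrative sketch, not a production harness:

```typescript
type Rec = { txId: string; kind: 'PREPARE' | 'COMMIT' | 'END' };

// Replay a coordinator log after a simulated crash:
// - COMMIT without END  -> GLOBAL_COMMIT must be re-delivered
// - PREPARE without a decision -> abort (presumed abort), deliver GLOBAL_ABORT
// - END -> nothing to do
function recover(
  log: Rec[],
  resend: (txId: string, msg: 'GLOBAL_COMMIT' | 'GLOBAL_ABORT') => void,
): void {
  const byTx = new Map<string, Set<Rec['kind']>>();
  for (const r of log) {
    if (!byTx.has(r.txId)) byTx.set(r.txId, new Set());
    byTx.get(r.txId)!.add(r.kind);
  }
  for (const [txId, kinds] of byTx) {
    if (kinds.has('END')) continue;                 // fully complete
    if (kinds.has('COMMIT')) resend(txId, 'GLOBAL_COMMIT');
    else if (kinds.has('PREPARE')) resend(txId, 'GLOBAL_ABORT');
  }
}
```

A fuller harness would inject crashes at every logging point and assert the same atomicity invariant after each recovery.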
We've comprehensively examined how Two-Phase Commit handles—and sometimes struggles with—failures in distributed systems. Let's consolidate the key insights:
What's Next:
The next page examines the Three-Phase Commit (3PC) Protocol, which attempts to eliminate the blocking problem by adding an extra phase. We'll see how 3PC improves on 2PC's liveness properties while understanding why even 3PC cannot fully solve all distributed consensus challenges.
You now understand the comprehensive failure handling in Two-Phase Commit—from coordinator and participant crashes to network partitions and message losses. Next, we'll explore how the Three-Phase Commit protocol addresses the blocking problem.