Imagine you're building an e-commerce platform. A customer places an order—what seems like a simple action actually requires coordinating multiple operations across different services: reserving inventory, charging the customer's payment method, arranging shipment, and sending a confirmation.
In a monolithic application with a single database, you'd wrap all these operations in a single ACID transaction. If any step fails, everything rolls back—atomicity guaranteed. But in a microservices architecture, each service owns its own database. There is no single transaction boundary that spans all services.
This is the fundamental challenge of distributed systems: How do you maintain data consistency across service boundaries when traditional transactions don't work?
By the end of this page, you will understand why distributed transactions are problematic, why the Saga pattern emerged as the industry-standard alternative, and the fundamental principles that make Sagas work. You'll see how this pattern enables consistency without sacrificing the autonomy and scalability that microservices provide.
Before understanding the Saga pattern, we must deeply understand why the seemingly obvious solution—distributed transactions using protocols like Two-Phase Commit (2PC)—doesn't work in modern distributed systems.
The Two-Phase Commit Protocol (2PC)
2PC coordinates transactions across multiple databases through a central coordinator:
Phase 1 (Prepare):

- The coordinator sends PREPARE to all participants
- Each participant votes VOTE_COMMIT or VOTE_ABORT

Phase 2 (Commit/Rollback):

- If all participants voted COMMIT: the coordinator sends GLOBAL_COMMIT
- If any participant voted ABORT: the coordinator sends GLOBAL_ROLLBACK

This seems elegant—guaranteed atomicity across databases! So why doesn't it work?
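The two phases can be sketched as a minimal coordinator loop. The `Participant` interface and method names here are hypothetical, and a real implementation would also need durable logging and timeout handling:

```typescript
// Hypothetical participant interface, for illustration only
interface Participant {
  prepare(txId: string): Promise<'VOTE_COMMIT' | 'VOTE_ABORT'>;
  commit(txId: string): Promise<void>;
  rollback(txId: string): Promise<void>;
}

// Phase 1: collect votes. Phase 2: commit only if ALL voted commit.
async function twoPhaseCommit(
  txId: string,
  participants: Participant[]
): Promise<boolean> {
  const votes = await Promise.all(participants.map(p => p.prepare(txId)));
  if (votes.every(v => v === 'VOTE_COMMIT')) {
    await Promise.all(participants.map(p => p.commit(txId)));
    return true;
  }
  await Promise.all(participants.map(p => p.rollback(txId)));
  return false;
}
```

Note that a single VOTE_ABORT (or a single unreachable participant) aborts the whole transaction, which is exactly the availability problem analyzed next.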
Let's formalize why 2PC's availability problem is catastrophic at scale. Consider a distributed transaction involving n independent services, each with individual availability a.
For 2PC to succeed, ALL services must be available simultaneously:
P(transaction_success) = a^n
This exponential degradation is devastating:
| Individual Service Availability | 2 Services | 3 Services | 5 Services | 10 Services |
|---|---|---|---|---|
| 99.9% (three nines) | 99.8% | 99.7% | 99.5% | 99.0% |
| 99.5% | 99.0% | 98.5% | 97.5% | 95.1% |
| 99.0% | 98.0% | 97.0% | 95.1% | 90.4% |
| 95.0% | 90.3% | 85.7% | 77.4% | 59.9% |
The implications are severe:
With just 5 services at 99% availability each (which is quite common), your transaction success rate drops to 95.1%. That means 1 in 20 transactions fail due to availability issues alone—not bugs, not load, just the fundamental mathematics of coordinating distributed systems.
At 10 services with 95% individual availability, 4 in 10 transactions fail. This is unacceptable for any production system.
Contrast this with eventual consistency approaches:
With Sagas and eventual consistency, each operation succeeds or fails independently:
P(step_success with k retries) = 1 - (1 - a)^k → approaches 1 as k grows

Because each local transaction is independent and can be retried until it succeeds, system availability is governed by each step eventually completing, not by the simultaneous availability of all components.
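The contrast is easy to check numerically. This sketch assumes every retry is an independent attempt with the same per-attempt availability a:

```typescript
// 2PC: all n services must be up at the same instant
function twoPCSuccess(a: number, n: number): number {
  return Math.pow(a, n);
}

// Saga step with k independent retries: it fails only if ALL attempts fail
function stepSuccessWithRetries(a: number, k: number): number {
  return 1 - Math.pow(1 - a, k);
}

console.log(twoPCSuccess(0.99, 5));           // ≈ 0.951, matching the table above
console.log(stepSuccessWithRetries(0.99, 3)); // ≈ 0.999999 after just 3 attempts
```

Three retries at 99% per-attempt availability already push a single step past "six nines", while adding participants to a 2PC transaction only ever pushes a^n downward.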
2PC chooses strong Consistency over Availability. When network partitions occur (and they will), 2PC blocks rather than proceeding. The Saga pattern makes the opposite choice: it prioritizes Availability and achieves eventual Consistency through compensating actions. This aligns with the BASE principles (Basically Available, Soft state, Eventually consistent).
Beyond availability, 2PC creates severe performance problems due to lock duration. Let's analyze this quantitatively.
Traditional Local Transaction:
Lock Duration = Transaction Execution Time
≈ 5-50ms for typical operations
2PC Distributed Transaction:
Lock Duration = Coordinator Processing Time
+ Network Round Trip to All Participants (Phase 1)
+ Slowest Participant Processing Time (voting)
+ Network Round Trip to All Participants (Phase 2)
+ Logging Overhead
≈ 100-500ms even on fast networks
Locks are held 10-100x longer in 2PC compared to local transactions. This has cascading effects:
```typescript
// Quantitative analysis of lock duration impact
interface TransactionMetrics {
  localLockDuration: number; // milliseconds
  distributedLockDuration: number;
  transactionsPerSecond: number;
  lockContentionProbability: number;
}

function analyzeLockImpact(
  avgLocalLatency: number = 20,  // 20ms local transaction
  networkLatency: number = 50,   // 50ms network round trip
  loggingOverhead: number = 30,  // 30ms durable log writes across both phases
  transactionsPerSecond: number = 1000
): TransactionMetrics {
  // Local transaction: just the execution time
  const localLockDuration = avgLocalLatency;

  // 2PC: a network round trip and processing for each phase, plus logging
  const distributedLockDuration =
    (networkLatency * 2) +  // Round trips for Phase 1 and Phase 2
    (avgLocalLatency * 2) + // Processing at each phase
    loggingOverhead;        // Durable log writes

  // Little's Law: concurrent transactions = Arrival Rate × Lock Duration
  const activeTransactions =
    transactionsPerSecond * (distributedLockDuration / 1000);

  // Lock contention probability, simplified from an M/M/c queueing model
  // Higher values mean more frequent lock conflicts
  const contentionProbability = 1 - Math.exp(-activeTransactions / 100);

  return {
    localLockDuration,       // 20ms
    distributedLockDuration, // 170ms
    transactionsPerSecond: (1000 / distributedLockDuration) * 1000,
    lockContentionProbability: contentionProbability
  };
}

// Example calculation:
const metrics = analyzeLockImpact(20, 50, 30, 1000);
console.log("Local lock duration:", metrics.localLockDuration, "ms");
console.log("Distributed lock duration:", metrics.distributedLockDuration, "ms");
console.log("Max TPS with 2PC:", Math.round(metrics.transactionsPerSecond));
console.log("Lock contention probability:",
  (metrics.lockContentionProbability * 100).toFixed(1), "%");

// Output:
// Local lock duration: 20 ms
// Distributed lock duration: 170 ms
// Max TPS with 2PC: 5882
// Lock contention probability: 81.7%
```

The Domino Effect:
When one transaction holds locks for 170ms instead of 20ms:

- Other transactions touching the same rows queue behind it, so lock wait time grows
- Queued transactions hold their own locks longer while waiting, spreading the contention
- Clients time out and retry, adding load exactly when the system is slowest
- Under sustained traffic, throughput collapses well before CPU or I/O limits are reached
This is why production systems with high throughput requirements (thousands of TPS) cannot use 2PC. The lock duration alone makes it mathematically impossible to achieve required performance.
The Saga pattern was originally proposed by Hector Garcia-Molina and Kenneth Salem in their 1987 paper "Sagas". The core insight was revolutionary: instead of trying to make distributed transactions atomic, accept that they will be non-atomic and design for it.
Definition:
A Saga is a sequence of local transactions, where each local transaction updates data within a single service. If any local transaction fails, the Saga executes a series of compensating transactions to undo the changes made by the preceding transactions.
Key Properties:

- Each local transaction Ti is a normal ACID transaction within a single service's database
- Each Ti has a corresponding compensating transaction Ci that semantically undoes its effects
- The Saga guarantees that either all of T1..Tn complete, or the completed prefix is compensated in reverse order
- Sagas preserve Atomicity, Consistency, and Durability, but sacrifice Isolation (ACD, not ACID)
The Saga Execution Model:
Happy Path (All Transactions Succeed):
T1 → T2 → T3 → ... → Tn → SUCCESS
Failure at Step k (Compensating Transactions Execute):
T1 → T2 → ... → Tk-1 → Tk (FAILS) → Ck-1 → Ck-2 → ... → C1 → COMPENSATED
Notice that we don't have Ck — the failing transaction Tk never completed, so there's nothing to compensate. We compensate in reverse order, from Tk-1 back to T1.
Compensating transactions provide semantic rollback, not technical rollback. If T1 created an order, C1 doesn't magically undo that database write—it creates a CANCELLATION record. The order existed; now it's cancelled. This is a crucial mental model shift: you're not erasing history, you're appending corrective actions.
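A minimal sketch of this append-only mental model, with a hypothetical in-memory event log standing in for a real store:

```typescript
// Semantic rollback: append a corrective record instead of deleting history
interface OrderEvent {
  orderId: string;
  type: 'CREATED' | 'CANCELLED';
  at: number;
}

const orderLog: OrderEvent[] = [];

// T1: record that the order came into existence
function createOrder(orderId: string): void {
  orderLog.push({ orderId, type: 'CREATED', at: Date.now() });
}

// C1: does NOT erase the CREATED event — it appends a CANCELLED one
function compensateOrder(orderId: string): void {
  orderLog.push({ orderId, type: 'CANCELLED', at: Date.now() });
}

// Current state is derived from the full history
function isActive(orderId: string): boolean {
  return orderLog.some(e => e.orderId === orderId && e.type === 'CREATED') &&
    !orderLog.some(e => e.orderId === orderId && e.type === 'CANCELLED');
}
```

After compensation, both events remain in the log: an auditor can see that the order existed and was then cancelled, which is exactly the "appending corrective actions" model described above.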
Understanding the fundamental differences between Sagas and 2PC illuminates when each is appropriate and why Sagas dominate in microservices architectures.
| Characteristic | Two-Phase Commit (2PC) | Saga Pattern |
|---|---|---|
| Consistency Model | Strong consistency (linearizable) | Eventual consistency |
| Isolation | Full isolation (ACID) | No isolation (ACD) |
| Lock Duration | Duration of entire distributed transaction | Only during local transaction |
| Blocking Behavior | Synchronous, all participants block | Asynchronous, non-blocking |
| Failure Recovery | Automatic rollback | Compensating transactions required |
| Coordinator | Required (single point of failure) | Optional (choreography) or distributed (orchestration) |
| Availability | Degrades exponentially with participants | Maintains high availability |
| Scalability | Limited by lock contention | Highly scalable |
| Complexity | Protocol complexity in coordinator | Business logic complexity in compensation |
| Network Dependency | Requires reliable network | Tolerates network partitions |
| Latency | Sum of all participants + coordination | Individual transaction latency |
ACID transactions guarantee Isolation: concurrent transactions don't interfere with each other. Sagas explicitly sacrifice this property, leading to potential anomalies that must be handled at the application level.
Anomalies in Saga Execution:

- Lost updates: one saga overwrites changes made by another saga before it completes
- Dirty reads: a saga reads the intermediate state of another saga that may later be compensated
- Fuzzy (non-repeatable) reads: two reads within one saga return different data because another saga committed in between

Countermeasures for Isolation Anomalies:

The Saga pattern literature defines several countermeasures, of which semantic locks are the most common:
```typescript
// 1. SEMANTIC LOCKS
// Use application-level flags to indicate in-progress sagas
interface Order {
  id: string;
  status: 'PENDING' | 'CONFIRMED' | 'SHIPPED' | 'CANCELLED';
  sagaState: 'APPROVAL_PENDING' | 'COMPLETED' | 'COMPENSATING'; // Semantic lock
}

// Other sagas check sagaState before proceeding
async function processOrder(orderId: string) {
  const order = await orderRepo.findById(orderId);
  if (order.sagaState === 'APPROVAL_PENDING') {
    throw new Error('Order is currently being processed by another saga');
  }

  // Acquire semantic lock
  await orderRepo.update(orderId, { sagaState: 'APPROVAL_PENDING' });

  try {
    // ... perform saga steps
    await orderRepo.update(orderId, { sagaState: 'COMPLETED' });
  } catch (error) {
    await orderRepo.update(orderId, { sagaState: 'COMPENSATING' });
    throw error;
  }
}

// 2. COMMUTATIVE UPDATES
// Design operations that can execute in any order with the same result
interface Account {
  id: string;
  balanceAdjustments: BalanceAdjustment[]; // Append-only
  computedBalance: number; // Derived, not directly updated
}

// Instead of: balance = balance - amount (order-dependent)
// Use: adjustments.push({ amount: -100, sagaId, timestamp })
// Balance = sum(adjustments) — commutative!

// 3. PESSIMISTIC VIEW
// Read data at the END of a saga, not the beginning
async function generateInvoice(orderId: string) {
  // Perform all mutations first
  await applyDiscounts(orderId);
  await calculateTaxes(orderId);
  await addShippingCharges(orderId);

  // Read final state AFTER all changes
  const finalOrder = await orderRepo.findById(orderId);
  return createInvoiceFromOrder(finalOrder);
}

// 4. VERSION-BASED CONCURRENCY
// Optimistic locking to detect concurrent modifications
interface VersionedEntity {
  id: string;
  version: number;
  data: unknown;
}

async function updateWithOptimisticLock(
  entity: VersionedEntity,
  updates: Partial<VersionedEntity>
) {
  const result = await db.update({
    where: { id: entity.id, version: entity.version },
    data: { ...updates, version: entity.version + 1 }
  });

  if (result.count === 0) {
    throw new ConcurrentModificationError('Entity was modified by another saga');
  }
}
```

These countermeasures add complexity but are far less costly than the availability and performance penalties of 2PC. The key insight: isolation anomalies are observable and recoverable, while 2PC's availability problems are inherent to the protocol and cannot be engineered away.
Let's examine how major organizations implement Sagas for critical business processes:
Uber's Ride Booking Saga:
When a rider requests a trip, Uber executes a saga spanning multiple services:
Forward Transactions:
Forward Transactions:

- T1: Create trip request (Trip Service)
- T2: Match with driver (Matching Service)
- T3: Driver accepts (Driver Service)
- T4: Pre-authorize payment (Payment Service)
- T5: Activate navigation (Maps Service)

Compensating Transactions:

- C4: Release payment hold
- C3: Notify driver of cancellation
- C2: Return driver to available pool
- C1: Cancel trip request, notify rider

If payment pre-authorization fails (T4), Uber doesn't just abort—it executes C3, C2, C1 in sequence, gracefully unwinding the partial operation.
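This unwinding behavior can be simulated with stub steps (the bodies are placeholders, not Uber's code): a failure at T4 triggers exactly C3, C2, C1, in that order.

```typescript
// T4 (payment pre-auth) throws, so the completed prefix T1..T3 is compensated.
type Step = { name: string; run: () => void; undo?: () => void };

const executed: string[] = [];
const steps: Step[] = [
  { name: 'Create trip',       run: () => executed.push('T1'), undo: () => executed.push('C1') },
  { name: 'Match driver',      run: () => executed.push('T2'), undo: () => executed.push('C2') },
  { name: 'Driver accepts',    run: () => executed.push('T3'), undo: () => executed.push('C3') },
  { name: 'Pre-auth payment',  run: () => { throw new Error('card declined'); },
                               undo: () => executed.push('C4') },
  { name: 'Activate navigation', run: () => executed.push('T5') },
];

const done: Step[] = [];
try {
  for (const s of steps) { s.run(); done.push(s); }
} catch {
  for (const s of done.reverse()) s.undo?.();
}
console.log(executed.join(' → ')); // T1 → T2 → T3 → C3 → C2 → C1
```

Note that C4 and T5 never run: T4 failed before completing, and T5 was never reached.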
We've established why the Saga pattern is the industry-standard approach for managing distributed transactions in microservices architectures. Let's consolidate the key insights:

- 2PC's success probability degrades exponentially (a^n) with the number of participants, and its long lock durations destroy throughput
- A Saga replaces one distributed transaction with a sequence of local transactions, each committing independently
- Failures are handled by compensating transactions executed in reverse order: semantic rollback, not technical rollback
- Sagas trade Isolation for Availability; the resulting anomalies are managed with application-level countermeasures
What's Next:
Now that we understand why Sagas exist and their core principles, we'll explore the two primary coordination mechanisms: Choreography (event-driven, decentralized) and Orchestration (command-driven, centralized). Each has distinct trade-offs that determine when to use which approach.
You now understand the fundamental motivation for the Saga pattern: it's not merely an alternative to distributed transactions—it's a fundamentally different paradigm designed for the realities of distributed systems. Next, we'll explore how Sagas are coordinated in practice.