Loading learning content...
We've covered the theory, coordination mechanisms, compensating transactions, and failure handling. Now it's time to bridge the gap between understanding sagas and actually building them in production systems.\n\nThis page synthesizes everything into actionable implementation guidance: choosing between build vs. buy, leveraging saga frameworks, designing for testability, deploying safely, and learning from real-world implementations at scale.\n\nThe Implementation Reality:\n\nBuilding a robust saga system from scratch is a multi-month engineering effort. Before writing any code, understand the full scope of what 'saga infrastructure' entails:\n\n- State persistence and recovery\n- Retry logic with backoff and jitter\n- Timeout management at multiple levels\n- Circuit breaker integration\n- Distributed tracing\n- Metrics and alerting\n- Dead letter queue management\n- Admin tooling for operators\n\nMost organizations should leverage existing frameworks rather than building from scratch.
By the end of this page, you will be equipped to implement sagas in your organization: evaluating frameworks, designing saga workflows, implementing step definitions, testing effectively, deploying safely, and operating in production.
The first implementation decision is whether to build saga infrastructure yourself or adopt an existing framework. This is rarely a close call.
| Framework | Type | Languages | Best For | Trade-offs |
|---|---|---|---|---|
| Temporal | Orchestration | Go, Java, TS, Python, PHP | Complex workflows, long-running | Learning curve, infrastructure |
| Camunda | Orchestration/BPMN | Java, REST API | Enterprise, visual modeling | Heavyweight, enterprise pricing |
| Axon Framework | Orchestration + ES | Java/Kotlin | DDD + Event Sourcing | Java-only, framework coupling |
| Eventuate Tram | Both | Java/Spring | Microservices patterns book style | Java-only, simpler features |
| MassTransit | Orchestration | C#/.NET | .NET ecosystem | .NET only |
| Conductor (Netflix) | Orchestration | Java, any via REST | Microservices, Netflix stack | Complex setup, Netflix-specific |
For 80% of teams, Temporal is the right choice. It's production-proven at scale (Uber, Netflix, Snap), has excellent multi-language support, provides durable execution out of the box, and has a growing community. Start there unless you have specific requirements that push you elsewhere.
Temporal is the most widely adopted saga/workflow framework. Let's examine a production-ready saga implementation using Temporal's TypeScript SDK.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149
// Complete Order Saga Implementation with Temporal // ============================================// WORKFLOW DEFINITION (saga-workflows.ts)// ============================================import { proxyActivities, sleep, ApplicationFailure, defineSignal, setHandler, condition} from '@temporalio/workflow'; import type * as activities from './activities'; // Activity proxies with timeout configurationsconst { createOrder, reserveInventory, processPayment, scheduleShipment, sendConfirmation } = proxyActivities<typeof activities>({ startToCloseTimeout: '30 seconds', retry: { maximumAttempts: 3, initialInterval: '1 second', maximumInterval: '30 seconds', backoffCoefficient: 2 }}); const { cancelOrder, releaseInventory, refundPayment } = proxyActivities<typeof activities>({ // Compensations get more aggressive retry startToCloseTimeout: '60 seconds', retry: { maximumAttempts: 10, initialInterval: '2 seconds', maximumInterval: '5 minutes', backoffCoefficient: 2 }}); // Signals for external interactionexport const cancelOrderSignal = defineSignal<[string]>('cancelOrder'); // The saga workflowexport async function orderSaga(input: OrderInput): Promise<OrderResult> { let cancelled = false; let cancellationReason = ''; // Handle external cancellation requests setHandler(cancelOrderSignal, (reason: string) => { cancelled = true; cancellationReason = reason; }); // Compensation tracking const compensations: Array<() => Promise<void>> = []; try { // Check for cancellation before each step if (cancelled) { throw ApplicationFailure.nonRetryable('Order cancelled before start', 'CANCELLED'); } // Step 1: Create Order console.log('Step 1: Creating order'); const order = await createOrder(input); compensations.push(async () => { console.log('Compensating: Cancelling order'); await cancelOrder(order.id); }); if (cancelled) { throw ApplicationFailure.nonRetryable(cancellationReason, 'CANCELLED'); } // Step 2: Reserve Inventory console.log('Step 2: Reserving inventory'); const reservation = await reserveInventory({ orderId: order.id, items: input.items }); compensations.push(async () => { console.log('Compensating: Releasing inventory'); await releaseInventory(reservation.id); }); if (cancelled) { throw ApplicationFailure.nonRetryable(cancellationReason, 'CANCELLED'); } // Step 3: Process Payment console.log('Step 3: Processing payment'); const payment = await processPayment({ orderId: order.id, amount: order.totalAmount, customerId: input.customerId }); compensations.push(async () => { console.log('Compensating: Refunding payment'); await refundPayment(payment.id); }); // Step 4: Schedule Shipment (non-compensatable - last) console.log('Step 4: Scheduling shipment'); const shipment = await scheduleShipment({ orderId: order.id, address: input.shippingAddress, items: input.items }); // Step 5: Send Confirmation console.log('Step 5: Sending confirmation'); await sendConfirmation({ orderId: order.id, email: input.customerEmail, trackingNumber: shipment.trackingNumber }); return { success: true, orderId: order.id, trackingNumber: shipment.trackingNumber }; } catch (error) { console.error('Saga failed, executing compensations:', error); // Execute compensations in reverse order for (const compensate of compensations.reverse()) { try { await compensate(); } catch (compensationError) { // Log but continue with other compensations console.error('Compensation failed:', compensationError); // In production: alert, add to manual queue } } throw error; }}123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105
// Activity Implementations (external service calls) import { OrderService, InventoryService, PaymentService, ShippingService } from './services'; // Activities are where actual external calls happen// They should be idempotent and retriable export async function createOrder(input: OrderInput): Promise<Order> { const orderService = new OrderService(); // Use idempotency key to prevent duplicates on retry return await orderService.create(input, { idempotencyKey: `order-${input.customerId}-${input.requestId}` });} export async function cancelOrder(orderId: string): Promise<void> { const orderService = new OrderService(); // Idempotent: safe to call multiple times const order = await orderService.findById(orderId); if (order.status === 'CANCELLED') { return; // Already cancelled } await orderService.cancel(orderId);} export async function reserveInventory( input: ReserveInventoryInput): Promise<Reservation> { const inventoryService = new InventoryService(); return await inventoryService.reserve(input.items, { idempotencyKey: `inventory-${input.orderId}` });} export async function releaseInventory(reservationId: string): Promise<void> { const inventoryService = new InventoryService(); const reservation = await inventoryService.findReservation(reservationId); if (!reservation || reservation.status === 'RELEASED') { return; // Already released or doesn't exist } await inventoryService.release(reservationId);} export async function processPayment(input: PaymentInput): Promise<Payment> { const paymentService = new PaymentService(); // Use authorization + capture pattern const auth = await paymentService.authorize({ amount: input.amount, customerId: input.customerId, idempotencyKey: `payment-${input.orderId}` }); // Capture immediately (or could delay until shipment) await paymentService.capture(auth.id); return auth;} export async function refundPayment(paymentId: string): Promise<void> { const paymentService = new PaymentService(); const payment = await paymentService.findById(paymentId); if (!payment || payment.status === 'REFUNDED') { return; // Already refunded } await paymentService.refund(paymentId);} export async function scheduleShipment( input: ShipmentInput): Promise<Shipment> { const shippingService = new ShippingService(); return await shippingService.schedule(input, { idempotencyKey: `shipment-${input.orderId}` });} export async function sendConfirmation( input: ConfirmationInput): Promise<void> { // Notifications are fire-and-forget // Failures shouldn't compensate the saga try { await emailService.send({ to: input.email, template: 'order-confirmation', data: { orderId: input.orderId, trackingNumber: input.trackingNumber } }); } catch (error) { console.warn('Failed to send confirmation email:', error); // Don't throw - notification failure shouldn't fail the saga }}Beyond the basic saga structure, several patterns address common production requirements:
Parallel Step Execution\n\nSome saga steps can execute concurrently to reduce latency. For example, reserving inventory and pre-authorizing payment can happen simultaneously.
123456789101112131415161718192021222324
// Parallel saga steps in Temporalasync function orderSagaWithParallelSteps(input: OrderInput) { const compensations: Array<() => Promise<void>> = []; // Step 1: Create order const order = await createOrder(input); compensations.push(() => cancelOrder(order.id)); // Steps 2a and 2b: Execute in parallel const [reservation, paymentAuth] = await Promise.all([ reserveInventory({ orderId: order.id, items: input.items }), authorizePayment({ orderId: order.id, amount: order.totalAmount }) ]); // Add compensations (in reverse order they should execute) compensations.push(() => releaseInventory(reservation.id)); compensations.push(() => voidPaymentAuth(paymentAuth.id)); // Step 3: Capture payment (depends on both parallel steps) const payment = await capturePayment(paymentAuth.id); compensations.push(() => refundPayment(payment.id)); // Continue with remaining sequential steps...}Sagas have multiple execution paths (success, failure at each step, compensation) that all need testing. A comprehensive test strategy covers multiple levels:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201
// Comprehensive Saga Testing import { TestWorkflowEnvironment } from '@temporalio/testing';import { Worker } from '@temporalio/worker'; describe('Order Saga', () => { let testEnv: TestWorkflowEnvironment; beforeAll(async () => { testEnv = await TestWorkflowEnvironment.createLocal(); }); afterAll(async () => { await testEnv.teardown(); }); // ============================================ // HAPPY PATH TESTS // ============================================ describe('Happy Path', () => { it('should complete successfully with all steps', async () => { // Mock activities const activities = { createOrder: jest.fn().mockResolvedValue({ id: 'order-123' }), reserveInventory: jest.fn().mockResolvedValue({ id: 'res-123' }), processPayment: jest.fn().mockResolvedValue({ id: 'pay-123' }), scheduleShipment: jest.fn().mockResolvedValue({ trackingNumber: 'TRACK-123' }), sendConfirmation: jest.fn().mockResolvedValue(undefined), // Compensations should NOT be called cancelOrder: jest.fn(), releaseInventory: jest.fn(), refundPayment: jest.fn() }; const worker = await Worker.create({ connection: testEnv.nativeConnection, taskQueue: 'test', workflowsPath: require.resolve('./workflows'), activities }); const result = await worker.runUntil( testEnv.client.workflow.execute('orderSaga', { taskQueue: 'test', workflowId: 'test-order-saga-1', args: [{ customerId: 'cust-1', items: [{ id: 'item-1', qty: 2 }] }] }) ); expect(result.success).toBe(true); expect(result.orderId).toBe('order-123'); expect(activities.cancelOrder).not.toHaveBeenCalled(); expect(activities.releaseInventory).not.toHaveBeenCalled(); expect(activities.refundPayment).not.toHaveBeenCalled(); }); }); // ============================================ // COMPENSATION PATH TESTS // ============================================ describe('Compensation Paths', () => { it('should compensate when payment fails', async () => { const activities = { createOrder: jest.fn().mockResolvedValue({ id: 'order-123' }), reserveInventory: jest.fn().mockResolvedValue({ id: 'res-123' }), processPayment: jest.fn().mockRejectedValue(new Error('Insufficient funds')), // Compensations cancelOrder: jest.fn().mockResolvedValue(undefined), releaseInventory: jest.fn().mockResolvedValue(undefined), refundPayment: jest.fn().mockResolvedValue(undefined) }; const worker = await Worker.create({ connection: testEnv.nativeConnection, taskQueue: 'test', workflowsPath: require.resolve('./workflows'), activities }); await expect( worker.runUntil( testEnv.client.workflow.execute('orderSaga', { taskQueue: 'test', workflowId: 'test-order-saga-2', args: [{ customerId: 'cust-1', items: [] }] }) ) ).rejects.toThrow('Insufficient funds'); // Verify compensations were called in reverse order expect(activities.releaseInventory).toHaveBeenCalledWith('res-123'); expect(activities.cancelOrder).toHaveBeenCalledWith('order-123'); // refundPayment should NOT be called (payment never succeeded) expect(activities.refundPayment).not.toHaveBeenCalled(); }); // Test compensation at each step const failureScenarios = [ { step: 'reserveInventory', expectedCompensations: ['cancelOrder'] }, { step: 'processPayment', expectedCompensations: ['releaseInventory', 'cancelOrder'] }, { step: 'scheduleShipment', expectedCompensations: ['refundPayment', 'releaseInventory', 'cancelOrder'] }, ]; test.each(failureScenarios)( 'should compensate correctly when %s fails', async ({ step, expectedCompensations }) => { // Generate activities with failure at specified step const activities = generateActivitiesWithFailureAt(step); const worker = await createWorker(activities); await expect(executeOrderSaga(worker)).rejects.toThrow(); // Verify expected compensations were called for (const compensation of expectedCompensations) { expect(activities[compensation]).toHaveBeenCalled(); } } ); }); // ============================================ // IDEMPOTENCY TESTS // ============================================ describe('Idempotency', () => { it('should handle activity retry without duplication', async () => { let callCount = 0; const activities = { createOrder: jest.fn().mockImplementation(async (input) => { callCount++; if (callCount === 1) { throw new Error('Transient failure'); } return { id: 'order-123' }; }), // ... other activities }; const worker = await Worker.create({ connection: testEnv.nativeConnection, taskQueue: 'test', workflowsPath: require.resolve('./workflows'), activities }); const result = await worker.runUntil( testEnv.client.workflow.execute('orderSaga', { taskQueue: 'test', workflowId: 'test-order-saga-idempotent', args: [testInput] }) ); expect(result.success).toBe(true); expect(callCount).toBe(2); // Failed once, succeeded once }); }); // ============================================ // SIGNAL HANDLING TESTS // ============================================ describe('Signal Handling', () => { it('should handle cancellation signal mid-saga', async () => { const activities = { createOrder: jest.fn().mockResolvedValue({ id: 'order-123' }), reserveInventory: jest.fn().mockImplementation(async () => { await new Promise(resolve => setTimeout(resolve, 100)); return { id: 'res-123' }; }), cancelOrder: jest.fn().mockResolvedValue(undefined), releaseInventory: jest.fn().mockResolvedValue(undefined) }; const worker = await Worker.create({ /* ... */ }); // Start workflow const handle = await testEnv.client.workflow.start('orderSaga', { taskQueue: 'test', workflowId: 'test-order-saga-cancel', args: [testInput] }); // Wait a bit, then send cancellation signal await new Promise(resolve => setTimeout(resolve, 50)); await handle.signal('cancelOrder', 'Customer requested cancellation'); // Workflow should fail with cancellation await expect(handle.result()).rejects.toThrow('Customer requested cancellation'); }); });});Deploying saga infrastructure requires careful planning around versioning, migration, rollback, and scaling.
12345678910111213141516171819202122232425262728293031323334353637383940
// Saga Versioning with Temporal import { patched, deprecatePatch } from '@temporalio/workflow'; async function orderSagaVersioned(input: OrderInput) { const order = await createOrder(input); // Version 1: Original inventory reservation // Version 2: Added availability check before reservation if (patched('inventory-check-v2')) { // New behavior: check availability first const available = await checkInventoryAvailability(input.items); if (!available) { await cancelOrder(order.id); throw ApplicationFailure.nonRetryable('Items not available'); } } const reservation = await reserveInventory({ orderId: order.id }); // Version 1: Single payment processor // Version 2: Regional payment processors let payment; if (patched('regional-payments-v2')) { // New behavior: use regional processor payment = await processRegionalPayment(order, input.region); } else { // Old behavior: single processor payment = await processPayment(order); } // Continue with saga...} // Deployment strategy:// 1. Deploy with patched() returning false (old behavior)// 2. Monitor in-flight sagas// 3. Enable patch for new sagas// 4. Wait for old sagas to complete// 5. deprecatePatch() to remove old code pathLong-running sagas (hours/days) make versioning especially challenging. A saga started last week might still be running when you deploy breaking changes. Plan for this: maintain compensation logic for old versions, use feature flags, and consider saga completion before major deployments.
Saga infrastructure must scale with your system load. Key considerations:
| Component | Scaling Approach | Typical Limits | Bottleneck Indicators |
|---|---|---|---|
| Saga Store (DB) | Vertical + Sharding | Varies by DB | High latency, connection exhaustion |
| Saga Workers | Horizontal | Memory per worker | Task queue depth growing |
| Event Bus | Partitioning | Partitions × throughput | Consumer lag increasing |
| Service Endpoints | Load balancing | Service capacity | Step timeouts, retries |
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273
// Temporal Worker Configuration for Production import { Worker, NativeConnection } from '@temporalio/worker'; async function createProductionWorker() { // Establish connection to Temporal cluster const connection = await NativeConnection.connect({ address: process.env.TEMPORAL_ADDRESS!, tls: { serverRootCACertificate: await fs.readFile(process.env.TEMPORAL_TLS_CERT!), } }); // Create worker with production settings const worker = await Worker.create({ connection, namespace: process.env.TEMPORAL_NAMESPACE!, taskQueue: 'order-sagas', workflowsPath: require.resolve('./workflows'), activities: require('./activities'), // Concurrency tuning maxConcurrentActivityTaskExecutions: 100, maxConcurrentWorkflowTaskExecutions: 100, maxConcurrentLocalActivityExecutions: 100, // Memory management maxCachedWorkflows: 500, // Sticky execution (performance optimization) stickyQueueScheduleToStartTimeout: '10s', // Graceful shutdown shutdownGraceTime: '30s' }); // Metrics integration worker.registerPayload('metrics', (payload) => { metrics.increment('temporal.saga.task_completed'); }); return worker;} // Kubernetes HPA configuration for workersconst hpaConfig = { apiVersion: 'autoscaling/v2', kind: 'HorizontalPodAutoscaler', spec: { scaleTargetRef: { apiVersion: 'apps/v1', kind: 'Deployment', name: 'saga-workers' }, minReplicas: 3, maxReplicas: 50, metrics: [ { type: 'External', external: { metric: { name: 'temporal_task_queue_depth', selector: { matchLabels: { task_queue: 'order-sagas' } } }, target: { type: 'AverageValue', averageValue: '10' // Scale up when >10 pending tasks per worker } } } ] }};Learning from how industry leaders implement sagas provides invaluable practical insight:
Uber's Workflow Platform\n\nUber built Cadence (open-sourced as Temporal) to handle their complex workflow needs across 5,000+ microservices.\n\nScale:\n- 100+ million workflow executions per day\n- Thousands of unique workflow types\n- Workflows spanning hours to months\n\nKey Learnings:\n\n1. Deterministic replay is essential — Workflows must be deterministic for recovery after failures\n2. Visibility is critical — Built extensive tooling to inspect, debug, and manage workflows\n3. Testing at scale matters — Developed replay testing to verify workflow changes against historical executions\n4. Start simple, iterate — Many teams start with simple orchestration before adding saga compensation\n\nUse Cases at Uber:\n- Trip lifecycle management\n- Driver onboarding (multi-day process with background checks)\n- Marketing campaign orchestration\n- Financial reconciliation workflows
We've covered the complete journey from saga theory to production implementation. Let's consolidate the essential guidance:
Module Complete:\n\nYou've now mastered the Saga pattern from first principles through production implementation. You understand why distributed transactions fail, how choreography and orchestration work, the art of compensating transactions, resilient failure handling, and practical implementation patterns.\n\nThis knowledge equips you to design and build distributed systems that maintain data consistency across service boundaries—a fundamental capability for any microservices architecture.
You've completed a comprehensive exploration of the Saga pattern. From understanding why 2PC fails at scale, through choreography vs orchestration trade-offs, compensating transaction design, failure handling strategies, and production implementation—you now have the knowledge to implement reliable distributed transactions in your systems.