Choreography Vs Orchestration - Learning Module


Orchestration — Centralized Control

The Conductor's Perspective

Now imagine an orchestra. Unlike the jazz ensemble improvising together, an orchestra has a conductor—a central figure who sees the entire score, coordinates the timing, and ensures every section plays its part at the right moment. The musicians are skilled, but they follow the conductor's direction. The complete piece exists in one place: the conductor's mind and the score before them.

This is orchestration in distributed systems. A central service—the orchestrator—holds the complete workflow definition, explicitly invokes other services, tracks progress, and makes decisions about what happens next. The workflow isn't emergent; it's explicit, visible, and controlled from a single point.

Orchestration trades the autonomy of choreography for visibility and control. When you need to understand exactly what's happening in a complex multi-step process, when business rules change frequently, or when the workflow requires sophisticated branching logic, orchestration often provides the clarity you need.

What You Will Learn

By the end of this page, you will understand orchestration as a coordination pattern: its architecture, implementation approaches, state management strategies, and the scenarios where centralized control is the right choice. You'll be able to design orchestrated workflows that provide clear visibility into distributed processes.

Understanding Orchestration

Orchestration is a coordination pattern where a central service—the orchestrator—explicitly controls the workflow by invoking services in sequence, handling responses, and making routing decisions. The orchestrator knows the complete process and directs other services like a conductor directing musicians.

The fundamental principle: In orchestration, the workflow is explicit and centralized. One service contains the complete business process logic. Other services are invoked rather than reacting. The orchestrator maintains state, handles errors, and decides what happens next based on responses.

Key characteristics of orchestration:

Orchestration Characteristics

•Explicit Workflow Definition — The complete process is defined in one place, often as code or declarative configuration.
•Centralized State Management — The orchestrator tracks the current step, accumulated results, and error states.
•Command-Based Communication — The orchestrator sends commands to services ('Process payment') rather than services reacting to events.
•Synchronous or Asynchronous — Can use synchronous calls or asynchronous messaging, depending on requirements.
•Clear Control Flow — The workflow is visible in the orchestrator's code—no need to trace through multiple services.

Choreography vs Orchestration Comparison
Aspect	Choreography	Orchestration
Workflow Location	Distributed across services	Centralized in orchestrator
Control Flow	Implicit (event subscriptions)	Explicit (orchestrator code)
Coupling	Services coupled to events	Orchestrator coupled to all services
Visibility	Requires tracing across services	Visible in one place
Autonomy	Services fully autonomous	Services follow orchestrator commands
Change Impact	Add services without changing others	Workflow changes centralized
Debugging	Requires distributed tracing	Linear execution trace
Scalability	Each service scales independently	Orchestrator can be bottleneck

The same order process, orchestrated:

Consider our e-commerce example from the previous page, now with orchestration:

1. Order Orchestrator receives a purchase request
2. Orchestrator invokes Payment Service: "Process payment for order X"
3. Orchestrator receives payment result, decides next step
4. Orchestrator invokes Inventory Service: "Reserve items for order X"
5. Orchestrator receives inventory result, decides next step
6. Orchestrator invokes Shipping Service: "Schedule shipment for order X"
7. Orchestrator invokes Notification Service: "Send confirmation for order X"
8. Orchestrator marks order complete

The complete workflow exists in the orchestrator's code. Each service provides capabilities; the orchestrator composes them into a process.
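To make the steps above concrete, here is a minimal sketch of the order process as a single orchestrating function. The `OrderServices` interface, the `runOrder` function, and the result shapes are hypothetical names chosen for illustration; a real orchestrator would also persist state, retry, and compensate, which later sections cover.

```typescript
// Hypothetical service interfaces; real clients would make HTTP/gRPC calls.
interface OrderServices {
  processPayment(orderId: string): Promise<{ paymentId: string }>;
  reserveInventory(orderId: string): Promise<{ reservationId: string }>;
  scheduleShipment(orderId: string): Promise<{ trackingNumber: string }>;
  sendConfirmation(orderId: string, trackingNumber: string): Promise<void>;
}

type OrderOutcome =
  | { status: 'completed'; trackingNumber: string }
  | { status: 'failed'; failedStep: string };

// The whole workflow reads top to bottom in one function.
async function runOrder(orderId: string, services: OrderServices): Promise<OrderOutcome> {
  let step = 'payment';
  try {
    await services.processPayment(orderId);
    step = 'inventory';
    await services.reserveInventory(orderId);
    step = 'shipping';
    const { trackingNumber } = await services.scheduleShipment(orderId);
    step = 'notification';
    await services.sendConfirmation(orderId, trackingNumber);
    return { status: 'completed', trackingNumber };
  } catch {
    // A real orchestrator would retry or compensate here.
    return { status: 'failed', failedStep: step };
  }
}
```

Because the services are passed in, the same function can be exercised with in-memory fakes in a test and with real clients in production.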

The Visibility Trade-off

Orchestration trades autonomy for visibility. You can see the entire workflow in one place, debug it with a single trace, and change it without coordinating multiple teams. The cost is that the orchestrator becomes a central coordination point: every workflow change goes through it and it is coupled to every service it calls, even when it is not a performance bottleneck.

Architecture of an Orchestrator

A well-designed orchestrator has distinct architectural components that separate concerns and enable reliability. Let's examine the key components and how they interact.

Core Components:

1. Workflow Definition: The declarative or imperative specification of the process steps, transitions, and decision logic. This could be code, BPMN, a state machine, or a DSL.

2. Execution Engine: The runtime that interprets the workflow definition, manages execution state, schedules steps, and handles transitions.

3. State Store: Persistent storage for workflow instances, their current state, accumulated data, and history. Essential for durability across restarts.

4. Service Connectors: Adapters that invoke external services, whether through synchronous HTTP/gRPC calls or asynchronous message queues.

5. Error Handler: Logic for retries, compensation, timeout handling, and escalation when steps fail.

Orchestrator Architecture
TypeScript
// Core interfaces of an orchestrator architecture
 
/**
 * Workflow Definition - The declarative description of the process
 */
interface WorkflowDefinition {
  name: string;
  version: string;
  steps: WorkflowStep[];
  errorHandlers: ErrorHandler[];
  timeout: Duration;
}
 
interface WorkflowStep {
  id: string;
  name: string;
  type: 'task' | 'decision' | 'parallel' | 'wait' | 'compensate';
  action: string;               // Reference to service action
  input: InputMapping;          // How to map workflow data to action input
  output: OutputMapping;        // How to map action output to workflow data
  next: string | Transition[];  // Next step or conditional transitions
  retryPolicy?: RetryPolicy;
  timeout?: Duration;
  compensateWith?: string;      // Step to run on rollback
}
 
interface Transition {
  condition: string;    // Expression language: "$.payment.status == 'declined'"
  next: string;         // Target step ID
}
 
/**
 * Execution Engine - Runs workflow instances
 */
class WorkflowEngine {
  constructor(
    private readonly stateStore: WorkflowStateStore,
    private readonly serviceRegistry: ServiceRegistry,
    private readonly scheduler: StepScheduler,
  ) {}
 
  async startWorkflow(
    definition: WorkflowDefinition,
    input: Record<string, unknown>,
    correlationId: string,
  ): Promise<WorkflowInstance> {
    // Create initial state
    const instance = await this.stateStore.create({
      id: crypto.randomUUID(),  // Web Crypto global in recent Node.js versions and browsers
      definitionName: definition.name,
      definitionVersion: definition.version,
      correlationId,
      status: 'running',
      currentStep: definition.steps[0].id,
      data: input,
      history: [],
      createdAt: new Date(),
    });
    
    // Schedule first step
    await this.scheduler.scheduleStep(instance.id, definition.steps[0].id);
    
    return instance;
  }
 
  async executeStep(instanceId: string, stepId: string): Promise<void> {
    const instance = await this.stateStore.get(instanceId);
    const definition = await this.getDefinition(instance);
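    const step = definition.steps.find(s => s.id === stepId);
    if (!step) {
      // Guard against a stale or mistyped step ID before touching any service
      throw new Error(`Step ${stepId} not found in workflow ${definition.name}`);
    }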
    
    try {
      // Execute the step
      const input = this.mapInput(step.input, instance.data);
      const service = this.serviceRegistry.get(step.action);
      const output = await service.execute(input, {
        timeout: step.timeout,
        retryPolicy: step.retryPolicy,
      });
      
      // Update state
      const newData = this.mapOutput(step.output, instance.data, output);
      const nextStepId = this.determineNextStep(step, newData);
      
      await this.stateStore.update(instanceId, {
        currentStep: nextStepId,
        data: newData,
        history: [...instance.history, {
          stepId,
          status: 'completed',
          output,
          timestamp: new Date(),
        }],
      });
      
      // Schedule next step or complete
      if (nextStepId) {
        await this.scheduler.scheduleStep(instanceId, nextStepId);
      } else {
        await this.stateStore.update(instanceId, { status: 'completed' });
      }
      
    } catch (error) {
      await this.handleStepError(instance, step, error);
    }
  }
}
 
/**
 * State Store - Durable workflow state persistence
 */
interface WorkflowStateStore {
  create(instance: WorkflowInstance): Promise<WorkflowInstance>;
  get(id: string): Promise<WorkflowInstance>;
  update(id: string, changes: Partial<WorkflowInstance>): Promise<void>;
  findByCorrelationId(correlationId: string): Promise<WorkflowInstance[]>;
  findPending(): Promise<WorkflowInstance[]>;
}
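The engine above calls `determineNextStep` without showing it. The following is one possible sketch of that routing logic, not the definitive version. As a simplifying assumption, conditions are plain predicate functions rather than the expression-language strings used in `Transition`, and the `RoutedStep` and `ConditionalTransition` types are hypothetical.

```typescript
type WorkflowData = Record<string, unknown>;

// Assumption: conditions are predicates instead of expression strings.
interface ConditionalTransition {
  when: (data: WorkflowData) => boolean;
  next: string;
}

interface RoutedStep {
  id: string;
  // A fixed successor, an ordered list of conditional transitions, or null to end.
  next: string | ConditionalTransition[] | null;
}

// Returns the ID of the next step, or null when the workflow is finished.
function determineNextStep(step: RoutedStep, data: WorkflowData): string | null {
  const next = step.next;
  if (next === null || typeof next === 'string') {
    return next;
  }
  // First matching transition wins, so order the list from most to least specific.
  const match = next.find(t => t.when(data));
  if (!match) {
    throw new Error(`No transition matched after step ${step.id}`);
  }
  return match.next;
}

const paymentStep: RoutedStep = {
  id: 'payment',
  next: [
    { when: d => d['paymentStatus'] === 'declined', next: 'notify-declined' },
    { when: () => true, next: 'reserve-inventory' }, // default branch
  ],
};
```

Throwing when nothing matches is a deliberate choice in this sketch: a workflow that silently stops on an unexpected value is harder to diagnose than one that fails loudly.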

Synchronous vs Asynchronous Orchestration:

Orchestrators can invoke services either synchronously or asynchronously:

Synchronous Orchestration:

•Orchestrator makes HTTP/gRPC calls and waits for responses
•Simpler implementation and debugging
•Orchestrator thread blocked during service execution
•Best for short-lived operations (seconds)

Asynchronous Orchestration:

•Orchestrator sends commands via message queue
•Services respond via callback or response queue
•Orchestrator persists state between steps
•Required for long-running operations (minutes/hours/days)
•More complex but more resilient

Many production orchestrators use asynchronous orchestration with durable state, even for operations that complete quickly. Because state is persisted between steps, a restarted orchestrator can resume in-flight workflows, and any orchestrator instance can pick up the next step, which enables horizontal scaling.
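The asynchronous style can be sketched as a small state machine that sends a command, returns, and only advances when a reply arrives. This is an illustrative sketch under stated assumptions: `AsyncOrchestrator`, `Command`, `Reply`, and the `send` callback are hypothetical names, the step list is fixed, and an in-memory map stands in for the durable state store.

```typescript
interface Command { workflowId: string; step: string; }
interface Reply { workflowId: string; step: string; ok: boolean; }

const STEPS = ['payment', 'inventory', 'shipping'];

class AsyncOrchestrator {
  // Stands in for a durable store: workflowId -> index of the step in flight.
  private readonly positions = new Map<string, number>();
  readonly finished = new Map<string, 'completed' | 'failed'>();

  // `send` publishes a command to a queue (hypothetical transport).
  constructor(private readonly send: (command: Command) => void) {}

  start(workflowId: string): void {
    this.positions.set(workflowId, 0);
    this.send({ workflowId, step: STEPS[0] });
  }

  // Called whenever a reply message arrives; nothing blocks in between.
  onReply(reply: Reply): void {
    const position = this.positions.get(reply.workflowId);
    // Ignore replies for unknown workflows or for steps that are not in flight (duplicates).
    if (position === undefined || STEPS[position] !== reply.step) return;

    if (!reply.ok) {
      this.positions.delete(reply.workflowId);
      this.finished.set(reply.workflowId, 'failed');
      return;
    }
    const next = position + 1;
    if (next === STEPS.length) {
      this.positions.delete(reply.workflowId);
      this.finished.set(reply.workflowId, 'completed');
    } else {
      this.positions.set(reply.workflowId, next);
      this.send({ workflowId: reply.workflowId, step: STEPS[next] });
    }
  }
}
```

The workflow ID acts as the correlation ID that ties each reply back to its instance, and the position check makes duplicate replies harmless, which matters because most message queues deliver at least once.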

Workflow Engines vs Custom Orchestrators

You can build a custom orchestrator or use a dedicated workflow engine such as Temporal, Camunda, or AWS Step Functions (Apache Airflow fills a similar role for scheduled data pipelines). Workflow engines provide the execution infrastructure, such as durable state, retries, and timers, letting you focus on workflow logic. For complex workflows, an engine usually provides better reliability and observability than custom code.

State Management in Orchestration

The orchestrator's superpower is centralized state management. Unlike choreography where state is distributed across services, the orchestrator maintains a complete picture of each workflow instance.

What state must be tracked:

1. Execution Position — What step is the workflow currently at? What steps have completed? What's pending?

2. Accumulated Data — Results from previous steps that will be used in future steps. Payment confirmation needed for shipping, etc.

3. Error State — Which steps failed? How many retries? What compensation has been attempted?

4. Timing Information — When did each step start/complete? Are any steps timing out?

5. Business Context — Correlation IDs, customer information, order details—anything needed throughout the workflow.

Workflow State Management
TypeScript
// Comprehensive workflow instance state
interface WorkflowInstance {
  // Identity
  id: string;
  definitionName: string;
  definitionVersion: string;
  correlationId: string;
  
  // Execution status
  status: 'pending' | 'running' | 'waiting' | 'completed' | 'failed' | 'compensating';
  currentStep: string | null;
  
  // Accumulated workflow data
  data: {
    // Initial input
    input: {
      orderId: string;
      customerId: string;
      items: OrderItem[];
      shippingAddress: Address;
      totalAmount: Money;
    };
    
    // Results from completed steps
    payment?: {
      paymentId: string;
      status: 'completed' | 'failed';
      transactionRef: string;
      processedAt: string;
    };
    
    inventory?: {
      reservationId: string;
      reservedItems: ReservedItem[];
      warehouseId: string;
    };
    
    shipping?: {
      shipmentId: string;
      carrier: string;
      trackingNumber: string;
      estimatedDelivery: string;
    };
    
    // Error information
    lastError?: {
      step: string;
      message: string;
      code: string;
      timestamp: string;
    };
  };
  
  // Execution history
  history: StepExecution[];
  
  // Timing
  createdAt: Date;
  startedAt?: Date;
  completedAt?: Date;
  
  // Timeout tracking
  timeoutAt?: Date;
  
  // For compensation/rollback
  completedCompensableSteps: string[];
}
 
interface StepExecution {
  stepId: string;
  stepName: string;
  status: 'started' | 'completed' | 'failed' | 'compensated';
  input?: Record<string, unknown>;
  output?: Record<string, unknown>;
  error?: {
    message: string;
    code: string;
    retryCount: number;
  };
  startedAt: Date;
  completedAt?: Date;
  duration?: number;
}
 
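// State transitions are explicit and audited
class WorkflowStateManager {
  // Dependencies are injected; the emitter can be any object with an async emit()
  constructor(
    private readonly stateStore: WorkflowStateStore,
    private readonly eventEmitter: { emit(event: string, payload: unknown): Promise<void> },
  ) {}
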
  async transitionTo(
    instanceId: string,
    newStatus: WorkflowInstance['status'],
    changes: Partial<WorkflowInstance>,
    reason: string,
  ): Promise<void> {
    const instance = await this.stateStore.get(instanceId);
    
    // Validate transition is legal
    this.validateTransition(instance.status, newStatus);
    
    // Apply changes atomically
    await this.stateStore.update(instanceId, {
      ...changes,
      status: newStatus,
      history: [
        ...instance.history,
        {
          type: 'transition',
          from: instance.status,
          to: newStatus,
          reason,
          timestamp: new Date(),
        }
      ],
    });
    
    // Emit state change event for monitoring
    await this.eventEmitter.emit('workflow.state.changed', {
      instanceId,
      previousStatus: instance.status,
      newStatus,
      reason,
    });
  }
  
  private validateTransition(from: string, to: string): void {
    const validTransitions: Record<string, string[]> = {
      'pending': ['running', 'failed'],
      'running': ['waiting', 'completed', 'failed', 'compensating'],
      'waiting': ['running', 'failed', 'compensating'],
      'compensating': ['completed', 'failed'],
      'completed': [],  // Terminal
      'failed': [],     // Terminal
    };
    
    if (!validTransitions[from]?.includes(to)) {
      throw new InvalidTransitionError(from, to);
    }
  }
}

State Store Requirements:

The state store is critical infrastructure for orchestration. Key requirements:

Durability: State must survive orchestrator restarts. Use a database, not in-memory storage.

Consistency: State updates must be atomic. A step can't complete without its results being persisted.

Query Support: You need to find workflows by correlation ID, status, age, etc.

Performance: High-throughput orchestration needs a fast state store. Consider write-ahead logs for efficiency.

Visibility: Expose state for debugging and monitoring. Every transition should be queryable.

State Store as Single Point of Failure

The state store is the orchestrator's durability guarantee. If it fails, workflow execution stops. Use replicated databases with failover. Never use single-instance stores in production. The state store's availability directly determines the orchestrator's availability.
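The consistency requirement (a step cannot complete without its results being persisted) is often met with optimistic concurrency: each write names the version it read, and the store rejects the write if another writer got there first. Here is a minimal sketch of that idea. `InMemoryStateStore`, `VersionedInstance`, and `completeStep` are hypothetical names, and the in-memory map stands in for a database that would perform the check and the write in one atomic statement or transaction.

```typescript
interface VersionedInstance {
  id: string;
  version: number; // Incremented on every successful write
  currentStep: string | null;
  data: Record<string, unknown>;
}

class VersionConflictError extends Error {}

// In-memory stand-in for a database table.
class InMemoryStateStore {
  private readonly rows = new Map<string, VersionedInstance>();

  create(id: string, firstStep: string): VersionedInstance {
    const row: VersionedInstance = { id, version: 1, currentStep: firstStep, data: {} };
    this.rows.set(id, row);
    return { ...row };
  }

  get(id: string): VersionedInstance {
    const row = this.rows.get(id);
    if (!row) throw new Error(`Unknown workflow instance ${id}`);
    return { ...row };
  }

  // Persists the step result and the new position together, or not at all.
  completeStep(
    id: string,
    expectedVersion: number,
    nextStep: string | null,
    result: Record<string, unknown>,
  ): VersionedInstance {
    const row = this.get(id);
    if (row.version !== expectedVersion) {
      throw new VersionConflictError(`Expected version ${expectedVersion}, found ${row.version}`);
    }
    const updated: VersionedInstance = {
      ...row,
      version: row.version + 1,
      currentStep: nextStep,
      data: { ...row.data, ...result },
    };
    this.rows.set(id, updated);
    return { ...updated };
  }
}
```

With this shape, two orchestrator instances that both try to complete the same step cannot both succeed: the slower one receives a conflict and can reload the instance instead of overwriting newer state.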

Implementing Orchestrated Workflows

Let's implement a complete orchestrated order workflow. We'll use Temporal as an example workflow engine, but the patterns apply to any orchestration platform.

Workflow Definition:

Order Workflow Implementation (Temporal)
TypeScript
import { 
  proxyActivities, 
  defineSignal, 
  defineQuery,
  setHandler,
  condition,
  sleep,
  ApplicationFailure,
} from '@temporalio/workflow';
 
import type { OrderActivities } from './activities';
 
// Create proxies for activities (service calls)
const activities = proxyActivities<OrderActivities>({
  startToCloseTimeout: '30s',
  retry: {
    initialInterval: '1s',
    backoffCoefficient: 2,
    maximumAttempts: 3,
    nonRetryableErrorTypes: ['PaymentDeclinedError', 'CustomerBlockedError'],
  },
});
 
// Workflow signals (external events)
const cancelOrderSignal = defineSignal('cancelOrder');
const updateShippingSignal = defineSignal<[Address]>('updateShipping');
 
// Workflow queries (read state)
const getStatusQuery = defineQuery<OrderStatus>('getStatus');
const getHistoryQuery = defineQuery<StepHistory[]>('getHistory');
 
// Workflow state
interface OrderWorkflowState {
  orderId: string;
  status: OrderStatus;
  paymentResult?: PaymentResult;
  inventoryResult?: InventoryResult;
  shippingResult?: ShippingResult;
  history: StepHistory[];
  cancelled: boolean;
}
 
/**
 * Order Processing Workflow
 * 
 * Complete order fulfillment from payment through shipping
 */
export async function orderWorkflow(input: OrderInput): Promise<OrderResult> {
  // Initialize workflow state
  const state: OrderWorkflowState = {
    orderId: input.orderId,
    status: 'processing_payment',
    history: [],
    cancelled: false,
  };
  
  // Set up signal and query handlers
  setHandler(cancelOrderSignal, () => {
    state.cancelled = true;
  });
  
  setHandler(getStatusQuery, () => state.status);
  setHandler(getHistoryQuery, () => state.history);
  
  try {
    // Step 1: Process Payment
    state.status = 'processing_payment';
    state.history.push({ step: 'payment', status: 'started', at: new Date() });
    
    state.paymentResult = await activities.processPayment({
      orderId: input.orderId,
      customerId: input.customerId,
      amount: input.totalAmount,
      paymentMethod: input.paymentMethod,
    });
    
    state.history.push({ 
      step: 'payment', 
      status: 'completed', 
      at: new Date(),
      result: { paymentId: state.paymentResult.paymentId },
    });
    
    // Check for cancellation between steps
    if (state.cancelled) {
      return await compensateOrder(state, 'User cancelled');
    }
    
    // Step 2: Reserve Inventory
    state.status = 'reserving_inventory';
    state.history.push({ step: 'inventory', status: 'started', at: new Date() });
    
    state.inventoryResult = await activities.reserveInventory({
      orderId: input.orderId,
      items: input.items,
      warehousePreference: input.shippingAddress.region,
    });
    
    state.history.push({
      step: 'inventory',
      status: 'completed',
      at: new Date(),
      result: { reservationId: state.inventoryResult.reservationId },
    });
    
    if (state.cancelled) {
      return await compensateOrder(state, 'User cancelled');
    }
    
    // Step 3: Schedule Shipping
    state.status = 'scheduling_shipment';
    state.history.push({ step: 'shipping', status: 'started', at: new Date() });
    
    state.shippingResult = await activities.scheduleShipment({
      orderId: input.orderId,
      reservationId: state.inventoryResult.reservationId,
      shippingAddress: input.shippingAddress,
      customerTier: input.customerTier,
    });
    
    state.history.push({
      step: 'shipping',
      status: 'completed',
      at: new Date(),
      result: { 
        shipmentId: state.shippingResult.shipmentId,
        trackingNumber: state.shippingResult.trackingNumber,
      },
    });
    
    // Step 4: Send Confirmation
    state.status = 'sending_confirmation';
    await activities.sendOrderConfirmation({
      orderId: input.orderId,
      customerId: input.customerId,
      email: input.customerEmail,
      orderDetails: {
        items: input.items,
        total: input.totalAmount,
        trackingNumber: state.shippingResult.trackingNumber,
        estimatedDelivery: state.shippingResult.estimatedDelivery,
      },
    });
    
    state.status = 'completed';
    state.history.push({ step: 'complete', status: 'completed', at: new Date() });
    
    return {
      orderId: input.orderId,
      status: 'completed',
      paymentId: state.paymentResult.paymentId,
      shipmentId: state.shippingResult.shipmentId,
      trackingNumber: state.shippingResult.trackingNumber,
    };
    
  } catch (error) {
    // Caught values are `unknown` in strict TypeScript, so extract the message once
    const message = error instanceof Error ? error.message : 'Unknown error';
    state.status = 'failed';
    state.history.push({
      step: 'error',
      status: 'failed',
      at: new Date(),
      error: message,
    });
    
    // Attempt compensation
    return await compensateOrder(state, message);
  }
}
 
/**
 * Compensation logic - undo completed steps
 */
async function compensateOrder(
  state: OrderWorkflowState,
  reason: string,
): Promise<OrderResult> {
  state.status = 'compensating';
  
  // Compensate in reverse order
  if (state.shippingResult) {
    try {
      await activities.cancelShipment({
        shipmentId: state.shippingResult.shipmentId,
      });
      state.history.push({ 
        step: 'shipping_cancel', 
        status: 'completed', 
        at: new Date() 
      });
    } catch (e) {
      // Log but continue compensation
    }
  }
  
  if (state.inventoryResult) {
    try {
      await activities.releaseInventory({
        reservationId: state.inventoryResult.reservationId,
      });
      state.history.push({
        step: 'inventory_release',
        status: 'completed',
        at: new Date(),
      });
    } catch (e) { /* continue */ }
  }
  
  if (state.paymentResult) {
    try {
      await activities.refundPayment({
        paymentId: state.paymentResult.paymentId,
        amount: state.paymentResult.amount,
        reason,
      });
      state.history.push({
        step: 'payment_refund',
        status: 'completed',
        at: new Date(),
      });
    } catch (e) { /* continue */ }
  }
  
  state.status = 'cancelled';
  
  return {
    orderId: state.orderId,
    status: 'cancelled',
    reason,
  };
}
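The compensation function above checks each result field by hand. A common generalization is to register an undo action as each step succeeds and replay the list in reverse on failure. The sketch below shows that idea in plain TypeScript; it does not use the Temporal API, and the `Saga` class and its method names are hypothetical.

```typescript
type Compensation = () => Promise<void>;

class Saga {
  private readonly compensations: Compensation[] = [];

  // Runs an action and, only if it succeeds, remembers how to undo it.
  async step<T>(action: () => Promise<T>, undo: (result: T) => Promise<void>): Promise<T> {
    const result = await action();
    this.compensations.push(() => undo(result));
    return result;
  }

  // Undoes completed steps in reverse order and returns any errors raised,
  // so the caller can alert on compensations that did not go through.
  async compensate(): Promise<unknown[]> {
    const failures: unknown[] = [];
    while (this.compensations.length > 0) {
      const undo = this.compensations.pop()!;
      try {
        await undo();
      } catch (error) {
        failures.push(error); // Keep going so one failed undo does not block the rest
      }
    }
    return failures;
  }
}
```

Returning the failures instead of swallowing them keeps the "continue on error" behavior of the original while leaving a record of which undo actions need manual follow-up.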

Activities (Service Invocations):

Activities are the actual service calls. They're separate from workflow logic, allowing different retry policies and timeouts for each service.

Activities Implementation
TypeScript
import { Context } from '@temporalio/activity';
import { ApplicationFailure } from '@temporalio/common';
 
// Activities interface - defines what services the workflow can call
export interface OrderActivities {
  processPayment(input: PaymentInput): Promise<PaymentResult>;
  reserveInventory(input: InventoryInput): Promise<InventoryResult>;
  scheduleShipment(input: ShipmentInput): Promise<ShippingResult>;
  sendOrderConfirmation(input: ConfirmationInput): Promise<void>;
  refundPayment(input: RefundInput): Promise<void>;
  releaseInventory(input: ReleaseInput): Promise<void>;
  cancelShipment(input: CancelInput): Promise<void>;
}
 
// Implementation of activities
export function createOrderActivities(
  paymentService: PaymentServiceClient,
  inventoryService: InventoryServiceClient,
  shippingService: ShippingServiceClient,
  notificationService: NotificationServiceClient,
): OrderActivities {
  return {
    async processPayment(input) {
      const ctx = Context.current();
      ctx.heartbeat('Processing payment');
      
      try {
        const result = await paymentService.charge({
          orderId: input.orderId,
          customerId: input.customerId,
          amount: input.amount,
          method: input.paymentMethod,
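          // activityId is only unique within one workflow, so combine it with the workflow ID
          idempotencyKey: `${ctx.info.workflowExecution.workflowId}:${ctx.info.activityId}`,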
        });
        
        return {
          paymentId: result.id,
          status: 'completed',
          amount: input.amount,
          transactionRef: result.transactionReference,
        };
      } catch (error) {
        if (error instanceof PaymentDeclinedError) {
          // Non-retryable error - workflow will handle
          throw new ApplicationFailure(
            'Payment declined',
            'PaymentDeclinedError',
            true, // nonRetryable
            [{ reason: error.declineReason }],
          );
        }
        throw error; // Retryable error
      }
    },
    
    async reserveInventory(input) {
      const ctx = Context.current();
      ctx.heartbeat('Reserving inventory');
      
      const result = await inventoryService.reserve({
        orderId: input.orderId,
        items: input.items,
        warehousePreference: input.warehousePreference,
        holdDuration: '24h',
      });
      
      return {
        reservationId: result.reservationId,
        reservedItems: result.items,
        warehouseId: result.warehouseId,
        expiresAt: result.expiresAt,
      };
    },
    
    async scheduleShipment(input) {
      const ctx = Context.current();
      ctx.heartbeat('Scheduling shipment');
      
      const result = await shippingService.schedule({
        orderId: input.orderId,
        reservationId: input.reservationId,
        destination: input.shippingAddress,
        serviceLevel: input.customerTier === 'premium' ? 'express' : 'standard',
      });
      
      return {
        shipmentId: result.shipmentId,
        carrier: result.carrier,
        trackingNumber: result.trackingNumber,
        estimatedDelivery: result.estimatedDelivery,
      };
    },
    
    async sendOrderConfirmation(input) {
      await notificationService.sendEmail({
        to: input.email,
        template: 'order_confirmation',
        data: input.orderDetails,
      });
    },
    
    // Compensation activities
    async refundPayment(input) {
      await paymentService.refund({
        paymentId: input.paymentId,
        amount: input.amount,
        reason: input.reason,
      });
    },
    
    async releaseInventory(input) {
      await inventoryService.release({
        reservationId: input.reservationId,
      });
    },
    
    async cancelShipment(input) {
      await shippingService.cancel({
        shipmentId: input.shipmentId,
      });
    },
  };
}

Heartbeats for Long Activities

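Notice the heartbeat calls in activities. For long-running operations, heartbeats tell the workflow engine the activity is still alive. If heartbeats stop arriving within the configured heartbeat timeout (heartbeatTimeout in Temporal's activity options), the engine treats the attempt as failed and can retry the activity on another worker. Without a heartbeat timeout, a crashed worker is only noticed when the overall activity timeout expires. This is how orchestrators detect and recover from worker failures.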
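The detection side of heartbeating can be sketched as a small monitor that records the last heartbeat per activity and reports the ones that have gone quiet. This illustrates the general mechanism, not Temporal's internals; `HeartbeatMonitor` and its methods are hypothetical, and the current time is passed in so the logic stays deterministic.

```typescript
class HeartbeatMonitor {
  private readonly lastSeen = new Map<string, number>();

  constructor(private readonly timeoutMs: number) {}

  // Called when an activity starts and each time it reports progress.
  heartbeat(activityId: string, nowMs: number): void {
    this.lastSeen.set(activityId, nowMs);
  }

  // Finished activities no longer need to be watched.
  complete(activityId: string): void {
    this.lastSeen.delete(activityId);
  }

  // Returns activities whose last heartbeat is older than the timeout;
  // an engine would reschedule these on another worker.
  findTimedOut(nowMs: number): string[] {
    const timedOut: string[] = [];
    this.lastSeen.forEach((seenAt, activityId) => {
      if (nowMs - seenAt > this.timeoutMs) timedOut.push(activityId);
    });
    return timedOut;
  }
}
```

The timeout should be several times the interval at which activities actually heartbeat, so a single delayed heartbeat does not trigger a needless retry.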

Error Handling in Orchestration

Unlike choreography, where error handling is distributed, orchestration centralizes error handling in the orchestrator. This provides powerful capabilities but requires careful design.

Retry Strategies:

The orchestrator controls retry behavior for each step:

•Immediate Retry: Try again right away for transient failures
•Backoff Retry: Wait progressively longer between attempts
•Circuit Breaker: Stop retrying if a service is consistently failing
•No Retry: Some failures (validation, business rules) shouldn't be retried
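Of the strategies above, the circuit breaker is the least obvious to implement, so here is a minimal sketch of its state machine. The `CircuitBreaker` class, its thresholds, and its method names are assumptions for illustration; production libraries typically add rolling failure windows and a limit on concurrent probe requests.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';

  constructor(
    private readonly failureThreshold: number, // consecutive failures before opening
    private readonly cooldownMs: number,       // how long to stay open before probing
  ) {}

  // Ask before each call; `nowMs` is injected so the logic is easy to test.
  allowRequest(nowMs: number): boolean {
    if (this.state === 'open' && nowMs - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // cooldown elapsed: allow probe requests
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    // A failed probe reopens immediately; otherwise open at the threshold.
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = nowMs;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

An orchestrator would keep one breaker per downstream service, call `allowRequest` before invoking a step, and treat a refusal as a reason to delay the step instead of spending one of its retry attempts.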

Sophisticated Error Handling
TypeScript
// Error handling patterns in orchestration
 
interface RetryPolicy {
  maxAttempts: number;
  initialInterval: Duration;
  backoffCoefficient: number;
  maxInterval: Duration;
  nonRetryableErrors: string[];
}
 
// Different retry policies for different types of failures
const retryPolicies = {
  // Payment service - careful retries, idempotent
  payment: {
    maxAttempts: 3,
    initialInterval: '1s',
    backoffCoefficient: 2,
    maxInterval: '10s',
    nonRetryableErrors: [
      'PaymentDeclinedError',      // Customer's card declined
      'FraudDetectedError',        // Fraud prevention triggered
      'InsufficientFundsError',    // Not enough balance
      'CardExpiredError',          // Card is expired
    ],
  },
  
  // Inventory service - quick retries for lock contention
  inventory: {
    maxAttempts: 5,
    initialInterval: '100ms',
    backoffCoefficient: 1.5,
    maxInterval: '2s',
    nonRetryableErrors: [
      'ItemDiscontinuedError',     // Item no longer sold
      'RegionNotServicedError',    // Can't ship to that region
    ],
  },
  
  // Shipping service - longer intervals for external carrier APIs
  shipping: {
    maxAttempts: 3,
    initialInterval: '5s',
    backoffCoefficient: 2,
    maxInterval: '30s',
    nonRetryableErrors: [
      'AddressUndeliverableError', // Invalid shipping address
    ],
  },
  
  // Notification - aggressive retries, not critical
  notification: {
    maxAttempts: 10,
    initialInterval: '10s',
    backoffCoefficient: 2,
    maxInterval: '5m',
    nonRetryableErrors: [],  // Always retry notifications
  },
};
 
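// Orchestrator error handling logic.
// These functions are methods of the engine (hence `this`), shown outside the class for brevity.

// Converts a duration such as '100ms', '1s', or '5m' to milliseconds
// (assumes Duration is a string in that form, as in the policies above)
function toMillis(duration: string): number {
  const match = /^(\d+)(ms|s|m|h)$/.exec(duration);
  if (!match) throw new Error(`Unsupported duration: ${duration}`);
  const unitMs: Record<string, number> = { ms: 1, s: 1_000, m: 60_000, h: 3_600_000 };
  return Number(match[1]) * unitMs[match[2]];
}

async function handleStepError(
  instance: WorkflowInstance,
  step: WorkflowStep,
  error: Error,
): Promise<void> {
  const policy = retryPolicies[step.serviceType];
  
  // Check if error is non-retryable
  if (policy.nonRetryableErrors.includes(error.constructor.name)) {
    await this.handleNonRetryableError(instance, step, error);
    return;
  }
  
  // Count earlier failures of this step; the current failure is not recorded yet
  const retryCount = instance.history.filter(
    h => h.stepId === step.id && h.status === 'failed'
  ).length;
  
  // maxAttempts includes the first attempt, so count the current failure too
  if (retryCount + 1 >= policy.maxAttempts) {
    await this.handleMaxRetriesExceeded(instance, step, error);
    return;
  }
  
  // Calculate backoff in milliseconds, capped at maxInterval
  const interval = Math.min(
    toMillis(policy.initialInterval) * Math.pow(policy.backoffCoefficient, retryCount),
    toMillis(policy.maxInterval),
  );
  
  // Persist the failure before scheduling the retry, so the retry count
  // is accurate when the step runs again
  await this.stateStore.update(instance.id, {
    history: [...instance.history, {
      stepId: step.id,
      status: 'failed',
      error: { message: error.message, retryCount },
      timestamp: new Date(),
      nextRetry: new Date(Date.now() + interval),
    }],
  });
  await this.scheduler.scheduleStep(instance.id, step.id, { delay: interval });
}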
 
async function handleNonRetryableError(
  instance: WorkflowInstance,
  step: WorkflowStep,
  error: Error,
): Promise<void> {
  // For non-retryable errors, we need to make a decision
  // based on the error type and step
  
  if (step.compensateWith) {
    // This step can be compensated - start compensation flow
    await this.startCompensation(instance, step, error);
  } else if (step.fallback) {
    // There's a fallback step defined
    await this.executeFallback(instance, step);
  } else {
    // No recovery possible - fail the workflow
    await this.failWorkflow(instance, step, error);
  }
}
 
async function handleMaxRetriesExceeded(
  instance: WorkflowInstance,
  step: WorkflowStep,
  error: Error,
): Promise<void> {
  // Max retries exceeded - could escalate, compensate, or fail
  
  // Emit alert for human intervention
  await this.alerting.send({
    severity: 'high',
    workflow: instance.definitionName,
    instanceId: instance.id,
    step: step.id,
    error: error.message,
    retryCount: await this.getRetryCount(instance, step),
    action: 'Max retries exceeded - manual intervention may be required',
  });
  
  // Start compensation or wait for manual resolution
  if (step.onMaxRetriesExceeded === 'compensate') {
    await this.startCompensation(instance, step, error);
  } else if (step.onMaxRetriesExceeded === 'pause') {
    await this.pauseForManualResolution(instance, step);
  } else {
    await this.failWorkflow(instance, step, error);
  }
}
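
To make the backoff arithmetic concrete, here is a minimal sketch that computes the delay before each retry for a policy shaped like the ones above. The `parseDurationMs` and `backoffSchedule` helpers are hypothetical names introduced for illustration, and the sketch assumes `maxAttempts` counts every attempt including the first.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  initialInterval: string;
  backoffCoefficient: number;
  maxInterval: string;
}

// Hypothetical helper: converts '500ms', '5s', '5m', or '1h' to milliseconds.
function parseDurationMs(duration: string): number {
  const match = /^(\d+)(ms|s|m|h)$/.exec(duration);
  if (!match) throw new Error(`Unsupported duration: ${duration}`);
  const unitMs: Record<string, number> = { ms: 1, s: 1000, m: 60000, h: 3600000 };
  return Number(match[1]) * unitMs[match[2]];
}

// Delay (in ms) before each retry. With maxAttempts = N there are N - 1 retries.
function backoffSchedule(policy: RetryPolicy): number[] {
  const initial = parseDurationMs(policy.initialInterval);
  const cap = parseDurationMs(policy.maxInterval);
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    delays.push(Math.min(initial * Math.pow(policy.backoffCoefficient, retry), cap));
  }
  return delays;
}

// Shipping policy from above: delays of 5s and 10s before the two retries.
const shippingDelays = backoffSchedule({
  maxAttempts: 3, initialInterval: '5s', backoffCoefficient: 2, maxInterval: '30s',
});

// Notification policy: doubles from 10s until it hits the 5m cap.
const notificationDelays = backoffSchedule({
  maxAttempts: 10, initialInterval: '10s', backoffCoefficient: 2, maxInterval: '5m',
});
```

Printing the schedule for each policy before deploying it is a cheap way to check that the total retry window fits inside the step's timeout.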

Error Handling Best Practices

•Classify Errors by Retryability — Distinguish between transient errors (network timeout) and permanent errors (validation failed). Only retry transient errors.
•Use Idempotency Keys — Pass unique keys to services so retries don't cause duplicate effects. The activity ID or workflow ID often works.
•Implement Timeouts at Every Level — Step timeout, overall workflow timeout, activity timeout. Don't let anything hang indefinitely.
•Design Compensation Explicitly — Every step that changes state should have a corresponding compensation step. Document what 'undoing' means for each step.
•Alert on Anomalies — Monitor retry rates, compensation frequency, and workflow duration. Unusual patterns may indicate service issues.
•Provide Manual Override Options — Some failures need human decisions. Design workflows that can pause and wait for manual intervention.
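
The idempotency-key practice can be sketched as follows. This is an in-memory illustration only: `PaymentService`, `charge`, and `idempotencyKey` are hypothetical names, and a real service would persist the key-to-result mapping durably.

```typescript
// Hypothetical payment service that remembers results by idempotency key.
class PaymentService {
  private readonly results = new Map<string, string>();
  public chargeCount = 0;

  charge(key: string, amountCents: number): string {
    const existing = this.results.get(key);
    if (existing !== undefined) return existing; // Retry: return the earlier result
    this.chargeCount++;
    const chargeId = `charge-${this.chargeCount}-${amountCents}`;
    this.results.set(key, chargeId);
    return chargeId;
  }
}

// Derive the key from identifiers that stay stable across retries.
function idempotencyKey(workflowId: string, stepId: string): string {
  return `${workflowId}:${stepId}`;
}

const paymentService = new PaymentService();
const chargeKey = idempotencyKey('order-123', 'charge-payment');
const firstCharge = paymentService.charge(chargeKey, 4999);
const retriedCharge = paymentService.charge(chargeKey, 4999); // Same key, so no second charge
```

Because the key is built from the workflow and step identifiers rather than generated per call, every retry of the same step presents the same key.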

The Compensation Order Problem

Compensation must happen in the reverse order of execution. If you reserved inventory then charged payment, compensation must refund payment then release inventory. Getting this order wrong can leave the system in inconsistent states. Track completed steps and compensate them in reverse.
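
A minimal sketch of reverse-order compensation, assuming completed steps are tracked in execution order. The `CompletedStep` shape and `compensateInReverse` are hypothetical, and the compensations are synchronous here only to keep the example short; real ones would usually be async service calls.

```typescript
interface CompletedStep {
  name: string;
  compensate: () => void;
}

// Runs compensations newest-first, like unwinding a stack.
function compensateInReverse(completed: CompletedStep[]): string[] {
  const compensated: string[] = [];
  for (const step of [...completed].reverse()) {
    step.compensate();
    compensated.push(step.name);
  }
  return compensated;
}

const compensationLog: string[] = [];
const completedSteps: CompletedStep[] = [
  { name: 'reserveInventory', compensate: () => { compensationLog.push('releaseInventory'); } },
  { name: 'chargePayment', compensate: () => { compensationLog.push('refundPayment'); } },
];

// Payment was charged last, so it is refunded first.
const compensatedOrder = compensateInReverse(completedSteps);
```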

Timeouts and Deadlines

Time management is crucial in orchestration. Workflows can't run forever, steps can't hang indefinitely, and some business processes have real deadlines (24-hour sale, reservation expiration, etc.).

Types of Timeouts:

1. Step Timeout (Start-to-Close): Maximum time a single step can take from invocation to completion. Prevents individual service calls from hanging.

2. Workflow Timeout (Execution Timeout): Maximum time for the entire workflow from start to completion. Prevents zombie workflows.

3. Schedule-to-Start Timeout: Maximum time a step can wait in queue before a worker picks it up. Detects capacity issues.

4. Heartbeat Timeout: Maximum time between heartbeats for long-running steps. Detects worker failures mid-activity.

5. Business Deadline: Time by which the workflow must reach a certain state (e.g., order must ship within 24 hours of payment).
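
Outside a workflow engine, a step timeout (type 1 above) can be approximated by racing the call against a timer. The sketch below is one way to do that under stated assumptions: `withStepTimeout` and `StepTimeoutError` are hypothetical names, and the helper only stops waiting for the work; it does not cancel the underlying call.

```typescript
class StepTimeoutError extends Error {
  constructor(stepName: string, timeoutMs: number) {
    super(`Step "${stepName}" exceeded ${timeoutMs}ms`);
    this.name = 'StepTimeoutError';
  }
}

// Rejects with StepTimeoutError if the work does not settle within timeoutMs.
async function withStepTimeout<T>(
  stepName: string,
  timeoutMs: number,
  work: Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new StepTimeoutError(stepName, timeoutMs)), timeoutMs);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // Avoid leaving a stray timer once the race is decided
  }
}
```

Since the abandoned call may still complete on the remote side, this pairs naturally with idempotency keys so a later retry does not duplicate its effect.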

Timeout and Deadline Handling
TypeScript
import {
  proxyActivities,
  sleep,
  CancellationScope,
  ApplicationFailure, // Used by the periodic timeout monitor below
  isCancellation,
  condition,
} from '@temporalio/workflow';
 
/**
 * Workflow with comprehensive timeout handling
 */
export async function orderWithTimeouts(input: OrderInput): Promise<OrderResult> {
  const state = initializeState(input);
  
  // Track business deadline
  const orderMustShipBy = new Date(Date.now() + 24 * 60 * 60 * 1000); // 24 hours
  
  try {
    // Step 1: Payment with 30s timeout
    const paymentActivities = proxyActivities<PaymentActivities>({
      startToCloseTimeout: '30s',
      scheduleToCloseTimeout: '1m',  // Including queue time
      heartbeatTimeout: '10s',
    });
    
    state.payment = await paymentActivities.processPayment(input);
    
    // Check if we still have time for shipping deadline
    const timeRemaining = orderMustShipBy.getTime() - Date.now();
    if (timeRemaining < 60 * 60 * 1000) { // Less than 1 hour
      // Expedite the process
      state.expedited = true;
    }
    
    // Step 2: Inventory with business-aware timeout
    const inventoryActivities = proxyActivities<InventoryActivities>({
      startToCloseTimeout: state.expedited ? '10s' : '30s',
    });
    
    state.inventory = await inventoryActivities.reserveInventory({
      ...input,
      priority: state.expedited ? 'high' : 'normal',
    });
    
    // Step 3: Shipping - must complete before deadline
    const shippingDeadline = Math.max(
      timeRemaining - 30 * 60 * 1000, // Leave 30 min buffer
      5 * 60 * 1000, // Minimum 5 minutes
    );
    
    const shippingActivities = proxyActivities<ShippingActivities>({
      startToCloseTimeout: `${Math.floor(shippingDeadline / 1000)}s`,
    });
    
    state.shipping = await shippingActivities.scheduleShipment({
      ...input,
      reservationId: state.inventory.reservationId,
      mustShipBy: orderMustShipBy,
    });
    
    return { status: 'completed', ...state };
    
  } catch (error) {
    if (isCancellation(error)) {
      // Workflow was cancelled - clean up
      return await compensate(state, 'Workflow cancelled');
    }
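    // Temporal reports an activity timeout as an ActivityFailure whose
    // cause is a TimeoutFailure, so check the cause's name
    if ((error as { cause?: { name?: string } })?.cause?.name === 'TimeoutFailure') {
      // A step timed out
      return await handleTimeout(state, error);
    }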
    throw error;
  }
}
 
/**
 * Long-running workflow with wait states
 */
export async function orderWithApproval(input: OrderInput): Promise<OrderResult> {
  const state = initializeState(input);
  
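  // Process payment (the activity proxy must be created in this workflow's scope;
  // the 30s timeout mirrors the earlier example)
  const paymentActivities = proxyActivities<PaymentActivities>({
    startToCloseTimeout: '30s',
  });
  state.payment = await paymentActivities.processPayment(input);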
  
  // For high-value orders, wait for manual approval
  if (input.totalAmount.amount > 100000) { // > $1000
    state.status = 'awaiting_approval';
    
    // Wait for approval signal OR timeout after 24 hours
    const approved = await condition(
      () => state.approvalDecision !== undefined,
      '24h', // Timeout
    );
    
    if (!approved) {
      // Approval timed out - cancel order
      return await compensate(state, 'Approval timeout');
    }
    
    if (state.approvalDecision === 'rejected') {
      return await compensate(state, 'Order rejected by approver');
    }
  }
  
  // Continue with approved order
  return await continueOrder(state, input);
}
 
/**
 * Periodic timeout checks
 */
export async function orderWithPeriodicChecks(input: OrderInput): Promise<OrderResult> {
  const state = initializeState(input);
  
  // Run the timeout monitor in its own cancellation scope so it can be stopped later
  const timeoutScope = new CancellationScope();
  const monitor = timeoutScope.run(async () => {
    // Check every minute if we should abandon
    while (true) {
      await sleep('1m');
      
      const elapsed = Date.now() - state.startTime.getTime();
      const currentStep = state.currentStep;
      
      // Alert if taking too long
      if (elapsed > 5 * 60 * 1000 && currentStep === 'payment') {
        await alertActivities.sendSlowWorkflowAlert({
          workflowId: input.orderId,
          step: currentStep,
          elapsed,
        });
      }
      
      // Hard limit: 1 hour
      if (elapsed > 60 * 60 * 1000) {
        throw new ApplicationFailure('Workflow timeout exceeded');
      }
    }
  });
  
  try {
    // Race the main workflow logic against the monitor so the hard limit
    // actually fails the workflow instead of rejecting unobserved
    return await Promise.race([processOrder(state, input), monitor]);
  } finally {
    // Stop the monitor on success or failure and swallow its cancellation error
    timeoutScope.cancel();
    await monitor.catch(() => undefined);
  }
}

Timeouts Are Business Decisions

Timeout values should reflect business requirements, not just technical constraints. A payment timeout of 30 seconds might be technically feasible but terrible for customers on slow connections. A shipping deadline of 24 hours might be a contractual SLA. Work with business stakeholders to set appropriate timeouts.

Scaling Orchestration

A common concern with orchestration is that the orchestrator becomes a scalability bottleneck. While valid, this concern is often overstated. Properly designed orchestrators scale effectively.

Scaling Strategies:

1. Partition Workflows: Run multiple orchestrator instances, each handling a subset of workflows. Partition by customer ID, order region, workflow type, etc.

2. Separate Workers: Separate workflow logic (orchestration) from activity execution (service calls). Scale workers independently based on load.

3. Optimize State Store: The state store is often the bottleneck. Use partitioned databases, caching, and efficient serialization. Consider event sourcing for append-only writes.

4. Async Communication: Don't wait synchronously for slow activities. Use message queues so orchestrator threads aren't blocked.
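
Partitioning (strategy 1) depends on every orchestrator instance mapping a given workflow ID to the same partition. Below is a minimal sketch of one way to do that, using an FNV-1a-style hash over the ID's UTF-16 code units. `hashWorkflowId` and `partitionFor` are hypothetical helpers, not part of any particular engine.

```typescript
// FNV-1a-style 32-bit hash: deterministic across processes and restarts.
function hashWorkflowId(workflowId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < workflowId.length; i++) {
    hash ^= workflowId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // Force an unsigned 32-bit result so the modulo is never negative
}

// Maps a workflow ID to a partition index in [0, totalPartitions).
function partitionFor(workflowId: string, totalPartitions: number): number {
  if (!Number.isInteger(totalPartitions) || totalPartitions <= 0) {
    throw new Error('totalPartitions must be a positive integer');
  }
  return hashWorkflowId(workflowId) % totalPartitions;
}

// Every instance computes the same owner for the same workflow ID.
const ownerPartition = partitionFor('order-123', 4);
```

A plain modulo reassigns most workflows when the partition count changes, so this sketch suits a fixed partition count.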

Orchestration Scaling Patterns
Pattern	How It Works	When to Use
Horizontal Orchestrator Scaling	Multiple orchestrator instances sharing a partitioned state store	High workflow volume, stateless orchestrator logic
Worker Pool Scaling	Activities executed by auto-scaling worker pools	Variable activity load, expensive operations
Event-Driven Activities	Activities triggered by events, not direct calls	Very high throughput, eventual consistency acceptable
Workflow Sharding	Workflows partitioned across clusters by key	Multi-region, tenant isolation requirements
Tiered Execution	Urgent workflows on dedicated fast path	Mixed latency requirements

Scalable Orchestrator Architecture
TypeScript
// Scalable orchestrator deployment pattern
 
/**
 * Multiple orchestrator instances with partitioned state
 */
class PartitionedOrchestrator {
  constructor(
    private readonly partitionId: string,
    private readonly totalPartitions: number,
    private readonly stateStore: PartitionedStateStore,
  ) {}
  
  // Each instance only processes workflows in its partition
  shouldProcess(workflowId: string): boolean {
    const hash = this.hashWorkflowId(workflowId);
    // Parse the partition ID as a base-10 integer before comparing
    return (hash % this.totalPartitions) === Number.parseInt(this.partitionId, 10);
  }
  
  async pollAndExecute(): Promise<void> {
    // Only poll for workflows in this partition
    const pending = await this.stateStore.findPendingInPartition(
      this.partitionId
    );
    
    for (const instance of pending) {
      await this.executeStep(instance);
    }
  }
}
 
/**
 * Separate worker pools for different activity types
 */
// A config object rather than an interface: these are concrete values, not types
const workerPoolConfig = {
  paymentWorkers: {
    minInstances: 5,
    maxInstances: 50,
    scaleMetric: 'queue_depth',
    scaleThreshold: 100, // Scale up when queue > 100
  },
  inventoryWorkers: {
    minInstances: 10,
    maxInstances: 100,
    scaleMetric: 'queue_depth',
    scaleThreshold: 200,
  },
  shippingWorkers: {
    minInstances: 3,
    maxInstances: 20,
    scaleMetric: 'queue_latency_p99',
    scaleThreshold: 5000, // Scale up when p99 > 5s
  },
  notificationWorkers: {
    minInstances: 2,
    maxInstances: 20,
    scaleMetric: 'queue_depth',
    scaleThreshold: 1000, // Notification can batch more
  },
} as const;
 
/**
 * Event-driven activity for high throughput
 */
class EventDrivenActivityHandler {
  async handleActivityRequest(event: ActivityRequestEvent): Promise<void> {
    // Perform the activity
    const result = await this.executeActivity(event.activityType, event.input);
    
    // Send result back via event (not blocking the orchestrator)
    await this.eventBus.publish('activity.completed', {
      workflowId: event.workflowId,
      stepId: event.stepId,
      result,
    });
  }
}
 
// Orchestrator consumes completion events
class AsyncOrchestrator {
  async handleActivityCompleted(event: ActivityCompletedEvent): Promise<void> {
    const instance = await this.stateStore.get(event.workflowId);
    
    // Merge the activity result into the workflow data and persist it
    const data = { ...instance.data, [event.stepId]: event.result };
    await this.stateStore.update(event.workflowId, { data });
    
    // Determine and schedule next step using the updated data, so the
    // next step's input can see the result that just arrived
    const nextStep = this.determineNextStep({ ...instance, data }, event.stepId);
    if (nextStep) {
      await this.eventBus.publish('activity.request', {
        workflowId: event.workflowId,
        stepId: nextStep.id,
        activityType: nextStep.activityType,
        input: this.mapInput(nextStep, data),
      });
    }
  }
}

The Orchestrator Isn't Always the Bottleneck

In many systems, the orchestrator's work is minimal—it's just deciding what to do next and updating state. The actual bottleneck is the services being called (payment processing, inventory checks). Profile before optimizing; you may find the orchestrator is never the limiting factor.
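
One way to act on "profile before optimizing" is to accumulate time per phase and compare the totals. The sketch below is a hypothetical helper (`PhaseTimer` is not from any library) that assumes you can wrap orchestrator decisions and activity calls separately.

```typescript
type Phase = 'orchestrator' | 'activity';

class PhaseTimer {
  private readonly totals: Record<Phase, number> = { orchestrator: 0, activity: 0 };

  // Times an async operation and adds its wall-clock duration to the given phase.
  async measure<T>(phase: Phase, work: () => Promise<T>): Promise<T> {
    const start = Date.now();
    try {
      return await work();
    } finally {
      this.totals[phase] += Date.now() - start;
    }
  }

  // Adds an externally measured duration (for example, from existing metrics).
  record(phase: Phase, durationMs: number): void {
    this.totals[phase] += durationMs;
  }

  // Fraction of all measured time spent in the orchestrator itself.
  orchestratorShare(): number {
    const total = this.totals.orchestrator + this.totals.activity;
    return total === 0 ? 0 : this.totals.orchestrator / total;
  }
}

// Illustrative numbers only: 5ms deciding the next step, 95ms waiting on a service.
const phaseTimer = new PhaseTimer();
phaseTimer.record('orchestrator', 5);
phaseTimer.record('activity', 95);
const orchestratorFraction = phaseTimer.orchestratorShare();
```

If the orchestrator's share stays small under load, scaling effort is better spent on the called services or the worker pools.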

Summary: When Orchestration Shines

We've explored orchestration as a coordination pattern for event-driven architectures. Let's consolidate the key insights:

Key Takeaways

•Orchestration centralizes workflow logic — A single service knows the complete process, invokes services, and controls flow.
•State management is the orchestrator's core — Track execution position, accumulated data, and error state durably.
•Error handling is centralized and powerful — Retry policies, compensation, and timeout handling are explicit and configurable per step.
•Workflow engines handle the hard parts — Temporal, Step Functions, Camunda provide durable execution, freeing you to focus on business logic.
•Visibility is orchestration's superpower — The complete workflow is visible in one place, making debugging and changes easier.
•Scalability requires proper architecture — Partition state, separate workers, use async communication to scale orchestration effectively.

Orchestration excels when you need:

Complex branching logic based on intermediate results
Frequent workflow changes managed by one team
Clear visibility into process state
Long-running workflows with timeouts and deadlines
Manual intervention points in business processes
Strict ordering and timing guarantees

In the next page, we'll explore Saga patterns—the critical mechanism for handling distributed transactions in both choreography and orchestration. You'll learn how to maintain consistency across services without distributed locks.
