Imagine a jazz ensemble where musicians improvise together without a conductor. Each player responds to what they hear, contributing their part based on established patterns and mutual understanding. No one gives explicit instructions—yet the music comes together coherently because each musician knows their role and responds appropriately to the others.
This is choreography in distributed systems. Services react to events independently, each knowing its responsibilities without any central authority dictating the flow. The result, when done well, is a system that's resilient, scalable, and naturally decoupled.
But just as a jazz ensemble requires skilled musicians who understand the genre deeply, choreography requires services that are thoughtfully designed to participate in a larger dance without stepping on each other's toes. Get it right, and you have a beautifully autonomous system. Get it wrong, and you have chaos masquerading as architecture.
By the end of this page, you will understand choreography as a coordination pattern: its philosophical foundations, implementation mechanics, event design requirements, testing strategies, and the specific scenarios where it excels. You'll be able to design choreographed workflows that maintain consistency without central control.
Choreography is a coordination pattern where the workflow emerges from independent services reacting to events, rather than being directed by a central controller. Each service subscribes to relevant events, performs its work, and emits new events that trigger subsequent services. There is no single entity that knows or controls the entire workflow.
The fundamental principle: In choreography, services are reactive rather than commanded. They observe the world, respond when they see something relevant, and announce what they've done. The overall business process emerges from these individual reactions, like a complex pattern emerging from simple rules in cellular automata.
| Characteristic | Request-Response | Choreography |
|---|---|---|
| Control Flow | Caller knows and controls the sequence | No single entity controls the sequence |
| Coupling | Caller coupled to callees | Services coupled only to events |
| Synchrony | Typically synchronous | Inherently asynchronous |
| Failure Handling | Caller handles callee failures | Each service handles its own failures |
| Scalability | Limited by calling service capacity | Each service scales independently |
| Visibility | Caller sees the entire flow | No single point sees the entire flow |
An illustrative example:
Consider an e-commerce order process. In a choreographed system:
1. Order Service accepts the order and emits OrderPlaced.
2. Payment Service observes OrderPlaced, processes payment, and emits PaymentCompleted.
3. Inventory Service observes PaymentCompleted, reserves items, and emits InventoryReserved.
4. Shipping Service observes InventoryReserved, schedules shipment, and emits ShipmentScheduled.
5. Notification Service observes ShipmentScheduled and sends a confirmation email.

No service knows the complete workflow. Order Service doesn't know that its event will eventually trigger shipping. Payment Service doesn't know whether it's the first or fifth step. Each service simply reacts to what it observes and announces what it does.
In choreography, the workflow is implicit in the event subscriptions, not explicit in any single service's code. Understanding the complete flow requires examining multiple services and their event relationships—a fundamental shift from traditional architectures where control flow is visible in calling code.
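The order flow above can be sketched with a tiny in-memory event bus. This is only an illustration under simplifying assumptions: `InMemoryBus` and the `register...` functions are hypothetical names, delivery is synchronous, and a real system would use a message broker with asynchronous delivery. What it shows is that the sequence lives entirely in the subscriptions; no function names the whole workflow.

```typescript
// Hypothetical in-memory stand-in for a message broker (synchronous for brevity).
type EventName =
  | "OrderPlaced"
  | "PaymentCompleted"
  | "InventoryReserved"
  | "ShipmentScheduled";

interface DomainEvent {
  type: EventName;
  orderId: string;
}

type Handler = (event: DomainEvent) => void;

class InMemoryBus {
  private readonly handlers = new Map<EventName, Handler[]>();
  readonly log: EventName[] = []; // every published event type, in order

  subscribe(type: EventName, handler: Handler): void {
    const existing = this.handlers.get(type) ?? [];
    this.handlers.set(type, [...existing, handler]);
  }

  publish(event: DomainEvent): void {
    this.log.push(event.type);
    for (const handler of this.handlers.get(event.type) ?? []) {
      handler(event);
    }
  }
}

// Each service declares only what it reacts to and what it announces.
function registerPaymentService(bus: InMemoryBus): void {
  bus.subscribe("OrderPlaced", (e) =>
    bus.publish({ type: "PaymentCompleted", orderId: e.orderId }));
}

function registerInventoryService(bus: InMemoryBus): void {
  bus.subscribe("PaymentCompleted", (e) =>
    bus.publish({ type: "InventoryReserved", orderId: e.orderId }));
}

function registerShippingService(bus: InMemoryBus): void {
  bus.subscribe("InventoryReserved", (e) =>
    bus.publish({ type: "ShipmentScheduled", orderId: e.orderId }));
}

// Registration order does not matter: the flow emerges from the subscriptions.
const bus = new InMemoryBus();
registerShippingService(bus);
registerInventoryService(bus);
registerPaymentService(bus);

bus.publish({ type: "OrderPlaced", orderId: "order-123" });
console.log(bus.log.join(" -> "));
// OrderPlaced -> PaymentCompleted -> InventoryReserved -> ShipmentScheduled
```

Adding a sixth participant, such as a notification service, would mean one more `subscribe` call and no change to any existing service.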
Choreography isn't just a technical pattern—it embodies specific philosophical commitments about how distributed systems should work. Understanding these foundations helps you decide when choreography aligns with your goals.
Tell, Don't Ask: In traditional systems, services often ask others for information: "What's the inventory level? Okay, now reserve it." In choreography, services tell the world what happened: "Payment succeeded." Other services react based on their own logic. This inversion reduces temporal coupling—the teller doesn't wait for the reactor.
Loose Coupling, High Cohesion: Choreographed services are loosely coupled because they communicate only through events. A service doesn't import another's code or even know another exists. It knows only about event types. Simultaneously, each service remains highly cohesive—focused on one responsibility and containing all logic to fulfill it.
Autonomous Operation: Each service in a choreographed system is operationally autonomous. It can deploy, scale, and fail independently. This autonomy is profound: you can replace, upgrade, or remove a service without changing other services (though you must consider event contract compatibility).
Choreography aligns naturally with autonomous teams. When each team owns a service completely, choreography lets them move independently. The event contracts become the interfaces between teams, much like APIs but with less temporal coupling. This makes choreography particularly attractive in organizations embracing microservices and team autonomy.
Events in a choreographed system carry outsized importance. They aren't just notifications; they're the contracts that allow autonomous services to coordinate. Poor event design undermines the entire pattern.
Domain Events vs Integration Events:
Domain events capture something that happened within a bounded context: OrderPlaced, PaymentFailed, InventoryReserved. They express business facts, not technical operations.
Integration events are designed explicitly for cross-service communication. They may be derived from domain events but are crafted for external consumption—versioned, documented, and stable.
In choreography, both types appear, but integration events require particular care. They become the interface between autonomous services.
```typescript
// Events should be self-describing, immutable records of facts
interface OrderPlacedEvent {
  // Metadata
  readonly eventId: string;        // Unique identifier for idempotency
  readonly eventType: 'OrderPlaced';
  readonly timestamp: string;      // ISO 8601
  readonly version: '1.0';         // Schema version
  readonly correlationId: string;  // For tracing across services
  readonly causationId?: string;   // What event caused this one

  // Business payload
  readonly orderId: string;
  readonly customerId: string;
  readonly items: readonly OrderItem[];
  readonly totalAmount: Money;
  readonly currency: string;
  readonly shippingAddress: Address;

  // Context for consumers
  readonly customerTier: 'standard' | 'premium' | 'vip';
  readonly isFirstOrder: boolean;
}

interface PaymentCompletedEvent {
  readonly eventId: string;
  readonly eventType: 'PaymentCompleted';
  readonly timestamp: string;
  readonly version: '1.0';
  readonly correlationId: string;
  readonly causationId: string;    // Links to OrderPlaced

  readonly orderId: string;
  readonly paymentId: string;
  readonly amount: Money;
  readonly paymentMethod: 'card' | 'bank_transfer' | 'wallet';
  readonly transactionReference: string;

  // Information for downstream services
  readonly fraudScore: number;     // 0-100, helps shipping decide
}
```

Critical event design principles for choreography:
1. Include Enough Context: Consumers shouldn't need to call back to producers for more information. The event should contain everything needed for a consumer to make decisions. This reduces coupling and enables true autonomy.
2. Use Past Tense: Events represent facts that have already happened: OrderPlaced, not PlaceOrder; PaymentCompleted, not ProcessPayment. This semantic clarity prevents confusion between commands and events.
3. Design for Unknown Consumers: You don't know who will consume your events. Include information broadly useful, but never include sensitive data that shouldn't propagate (like raw credit card numbers).
4. Support Correlation: Include correlation IDs so consumers can trace events across the entire workflow. Include causation IDs to establish event lineage—which event caused this one?
5. Version Explicitly: Events are contracts. Include version numbers and design for backward compatibility. Consumers of v1.0 should still work when v1.1 is published.
Including enough context doesn't mean including everything. Events that carry entire aggregates create coupling through shared data models and bloat message sizes. Include what consumers need, not what producers have. If you find events growing excessively large, you may be mixing concerns or designing too coarsely.
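Principles 4 and 5 can be sketched as two small helpers. Both are hypothetical and not part of any library: `followUpMeta` shows how a service might derive the metadata of an outgoing event from the event it is reacting to, and `canConsume` shows one possible compatibility rule. The rule assumes a "major.minor" version string in which minor bumps only add optional fields; the text above asks for backward compatibility but does not prescribe this exact scheme.

```typescript
interface EventMeta {
  readonly eventId: string;
  readonly eventType: string;
  readonly version: string;        // assumed format: "major.minor"
  readonly correlationId: string;
  readonly causationId?: string;
}

// Hypothetical helper: metadata for an event emitted in reaction to `cause`.
function followUpMeta(
  cause: EventMeta,
  eventType: string,
  newEventId: string,
  version = "1.0",
): EventMeta {
  return {
    eventId: newEventId,
    eventType,
    version,
    correlationId: cause.correlationId, // unchanged for the whole workflow
    causationId: cause.eventId,         // direct parent in the event lineage
  };
}

// Assumed rule: a consumer built for major version N accepts any N.x event.
function canConsume(eventVersion: string, supportedMajor: number): boolean {
  const [majorPart = ""] = eventVersion.split(".");
  return Number.parseInt(majorPart, 10) === supportedMajor;
}

const orderPlaced: EventMeta = {
  eventId: "evt-1",
  eventType: "OrderPlaced",
  version: "1.1",
  correlationId: "corr-42",
};

const paymentCompleted = followUpMeta(orderPlaced, "PaymentCompleted", "evt-2");

console.log(paymentCompleted.correlationId);       // corr-42
console.log(paymentCompleted.causationId);         // evt-1
console.log(canConsume(orderPlaced.version, 1));   // true
console.log(canConsume("2.0", 1));                 // false
```

A consumer that rejects an unsupported major version should park the event (for example in a dead-letter queue) rather than drop it silently, so the mismatch stays visible.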
Implementing choreography requires patterns that ensure reliable event delivery, proper event handling, and consistent state management. Let's examine the essential patterns that make choreography work in production.
Pattern 1: The Transactional Outbox
A critical challenge in choreography is ensuring that state changes and event publication happen atomically. If a service updates its database but fails before publishing the event, the system becomes inconsistent.
The Outbox Pattern solves this by writing events to an "outbox" table in the same database transaction as the state change. A separate process then publishes events from the outbox to the message broker.
```typescript
// Assumes the `uuid` npm package. Database, EventPublisher, MessageBroker,
// and logger are application-specific abstractions not shown here.
import { v4 as uuid } from 'uuid';

// Outbox table schema (PostgreSQL)
// CREATE TABLE outbox (
//   id UUID PRIMARY KEY,
//   aggregate_type VARCHAR(255) NOT NULL,
//   aggregate_id VARCHAR(255) NOT NULL,
//   event_type VARCHAR(255) NOT NULL,
//   payload JSONB NOT NULL,
//   created_at TIMESTAMP DEFAULT NOW(),
//   published_at TIMESTAMP NULL
// );

class OrderService {
  constructor(
    private readonly db: Database,
    private readonly eventPublisher: EventPublisher // For polling, not direct use
  ) {}

  async placeOrder(command: PlaceOrderCommand): Promise<Order> {
    // Single transaction ensures atomicity
    return this.db.transaction(async (tx) => {
      // 1. Create the order
      const order = Order.create(command);
      await tx.orders.insert(order);

      // 2. Write event to outbox in SAME transaction
      const event: OrderPlacedEvent = {
        eventId: uuid(),
        eventType: 'OrderPlaced',
        timestamp: new Date().toISOString(),
        version: '1.0',
        correlationId: command.correlationId,
        orderId: order.id,
        customerId: command.customerId,
        items: order.items,
        totalAmount: order.totalAmount,
        currency: order.currency,
        shippingAddress: command.shippingAddress,
        customerTier: command.customerTier,
        isFirstOrder: await this.isFirstOrder(tx, command.customerId),
      };

      await tx.outbox.insert({
        id: event.eventId,
        aggregateType: 'Order',
        aggregateId: order.id,
        eventType: event.eventType,
        payload: event,
        createdAt: new Date(),
        publishedAt: null,
      });

      return order;
    });
  }
}

// Separate process: Outbox Publisher
class OutboxPublisher {
  constructor(
    private readonly db: Database,
    private readonly messageBroker: MessageBroker,
  ) {}

  async poll(): Promise<void> {
    const events = await this.db.outbox.findMany({
      where: { publishedAt: null },
      orderBy: { createdAt: 'asc' },
      limit: 100,
    });

    for (const event of events) {
      try {
        await this.messageBroker.publish(event.eventType, event.payload);

        // If this update fails after a successful publish, the event is
        // published again on the next poll (at-least-once delivery).
        await this.db.outbox.update({
          where: { id: event.id },
          data: { publishedAt: new Date() },
        });
      } catch (error) {
        // Log and continue; will retry on next poll
        logger.error('Failed to publish event', { eventId: event.id, error });
      }
    }
  }
}
```

Pattern 2: Idempotent Consumers
In distributed systems, events may be delivered more than once. Network issues, retries, and at-least-once delivery semantics all contribute. Consumers must be idempotent—processing the same event multiple times produces the same result as processing it once.
Strategies for idempotency:
```typescript
class PaymentConsumer {
  constructor(
    private readonly db: Database,
    private readonly paymentGateway: PaymentGateway,
  ) {}

  async handleOrderPlaced(event: OrderPlacedEvent): Promise<void> {
    // Idempotency check: Have we processed this event?
    const processed = await this.db.processedEvents.findUnique({
      where: { eventId: event.eventId }
    });

    if (processed) {
      logger.info('Event already processed, skipping', { eventId: event.eventId });
      return;
    }

    // Use database transaction for idempotency
    await this.db.transaction(async (tx) => {
      // Double-check inside transaction (optimistic locking alternative)
      const exists = await tx.processedEvents.findUnique({
        where: { eventId: event.eventId }
      });
      if (exists) return;

      // Process the payment
      const payment = await this.paymentGateway.charge({
        orderId: event.orderId,
        amount: event.totalAmount,
        customerId: event.customerId,
        idempotencyKey: event.eventId, // Gateway-level idempotency
      });

      // Store payment record
      await tx.payments.create({
        orderId: event.orderId,
        paymentId: payment.id,
        amount: event.totalAmount,
        status: payment.status,
      });

      // Mark event as processed. A unique constraint on eventId is assumed,
      // so a concurrent duplicate fails here and the transaction rolls back.
      await tx.processedEvents.create({
        eventId: event.eventId,
        processedAt: new Date(),
        processor: 'PaymentConsumer',
      });

      // Write outgoing event to outbox
      if (payment.status === 'completed') {
        await tx.outbox.insert({
          id: uuid(),
          aggregateType: 'Payment',
          aggregateId: payment.id,
          eventType: 'PaymentCompleted',
          payload: this.buildPaymentCompletedEvent(event, payment),
        });
      }
    });
  }
}
```

Notice how the event ID becomes the idempotency key for the payment gateway. Propagating idempotency through the entire chain, from the initial event through external API calls, is crucial for reliable choreography. Always pass correlation and idempotency information through the entire workflow.
In choreography, there's no central controller to catch exceptions and coordinate recovery. Each service must handle its own failures and communicate them through events. This distributed error handling is both a challenge and a strength.
The Principle of Compensating Events:
When something goes wrong, services emit failure events that trigger compensating actions in other services. Instead of rolling back a distributed transaction (which choreography doesn't support), you emit events that undo or compensate for previous work.
Example: Payment Failure in Order Processing:
```typescript
// Happy path:
// OrderPlaced → PaymentCompleted → InventoryReserved → ShipmentScheduled

// Payment fails:
// OrderPlaced → PaymentFailed

// Inventory reservation fails (after payment succeeded):
// OrderPlaced → PaymentCompleted → InventoryReservationFailed
//                                        ↓
//               PaymentService reacts to InventoryReservationFailed:
//                                  → PaymentRefunded
//                                        ↓
//               OrderService reacts to PaymentRefunded:
//                                  → OrderCancelled

class PaymentService {
  async handleInventoryReservationFailed(
    event: InventoryReservationFailedEvent
  ): Promise<void> {
    // Find the payment for this order
    const payment = await this.db.payments.findUnique({
      where: { orderId: event.orderId }
    });

    if (!payment || payment.status !== 'completed') {
      return; // Nothing to refund
    }

    await this.db.transaction(async (tx) => {
      // Process refund
      const refund = await this.paymentGateway.refund({
        paymentId: payment.paymentId,
        amount: payment.amount,
        reason: 'inventory_unavailable',
      });

      // Update payment status
      await tx.payments.update({
        where: { paymentId: payment.paymentId },
        data: { status: 'refunded', refundId: refund.id },
      });

      // Emit compensating event. One ID is used for both the outbox row and
      // the payload so consumers deduplicate on the same key.
      const eventId = uuid();
      await tx.outbox.insert({
        id: eventId,
        aggregateType: 'Payment',
        aggregateId: payment.paymentId,
        eventType: 'PaymentRefunded',
        payload: {
          eventId,
          eventType: 'PaymentRefunded',
          timestamp: new Date().toISOString(),
          version: '1.0',
          correlationId: event.correlationId,
          causationId: event.eventId,
          orderId: event.orderId,
          paymentId: payment.paymentId,
          refundAmount: payment.amount,
          reason: 'inventory_unavailable',
        },
      });
    });
  }
}

class OrderService {
  async handlePaymentRefunded(event: PaymentRefundedEvent): Promise<void> {
    const order = await this.db.orders.findUnique({
      where: { orderId: event.orderId }
    });

    if (!order || order.status === 'cancelled') {
      return; // Already cancelled or doesn't exist
    }

    await this.db.transaction(async (tx) => {
      await tx.orders.update({
        where: { orderId: event.orderId },
        data: {
          status: 'cancelled',
          cancelledAt: new Date(),
          cancelReason: event.reason,
        },
      });

      const eventId = uuid();
      await tx.outbox.insert({
        id: eventId,
        aggregateType: 'Order',
        aggregateId: event.orderId,
        eventType: 'OrderCancelled',
        payload: {
          eventId,
          eventType: 'OrderCancelled',
          timestamp: new Date().toISOString(),
          version: '1.0',
          correlationId: event.correlationId,
          causationId: event.eventId,
          orderId: event.orderId,
          reason: event.reason,
        },
      });
    });
  }
}
```

Failure events such as PaymentFailed, InventoryReservationFailed, and ShipmentDelayed are all valid business events.

Be careful with compensation chains that create their own failure events. If refunding payment fails, you now have a PaymentRefundFailed event that might trigger more compensations. Design circuit breakers and terminal states to prevent infinite compensation loops.
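One way to keep a compensation chain from looping is sketched below. It uses a retry cap plus terminal states rather than a full circuit breaker, and every name in it (`OrderStatus` values, `MAX_REFUND_ATTEMPTS`, `onRefundFailed`, the `needs_manual_review` state) is an assumption made for illustration. The handler for a hypothetical PaymentRefundFailed event retries a bounded number of times, then moves the workflow to a terminal state that a human resolves; events arriving after a terminal state are ignored.

```typescript
type OrderStatus =
  | "refund_pending"
  | "cancelled"
  | "completed"
  | "needs_manual_review";

// Terminal states: once reached, no further compensation is triggered.
const TERMINAL = new Set<OrderStatus>([
  "cancelled",
  "completed",
  "needs_manual_review",
]);

interface CompensationState {
  readonly status: OrderStatus;
  readonly refundAttempts: number;
}

const MAX_REFUND_ATTEMPTS = 3; // assumed limit; tune per business policy

// Pure state transition for a (hypothetical) PaymentRefundFailed event.
function onRefundFailed(state: CompensationState): CompensationState {
  if (TERMINAL.has(state.status)) {
    return state; // late or duplicate failure event: ignore
  }
  const attempts = state.refundAttempts + 1;
  if (attempts >= MAX_REFUND_ATTEMPTS) {
    // Stop compensating automatically and escalate to a person.
    return { status: "needs_manual_review", refundAttempts: attempts };
  }
  return { status: "refund_pending", refundAttempts: attempts };
}

// Five consecutive failures: the state stops changing after the third.
let state: CompensationState = { status: "refund_pending", refundAttempts: 0 };
for (let i = 0; i < 5; i++) {
  state = onRefundFailed(state);
}
console.log(state); // { status: 'needs_manual_review', refundAttempts: 3 }
```

Keeping the transition a pure function makes the loop-prevention rule easy to unit test independently of the broker and database.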
The greatest challenge with choreography is visibility. When no single service knows the complete workflow, how do you understand what's happening? How do you debug a failed order when the relevant events span five services?
Distributed Tracing is Essential:
Correlation IDs aren't just nice-to-have—they're mandatory. Every event must carry a correlation ID that traces back to the initiating action. Tools like Jaeger, Zipkin, or cloud-native equivalents (AWS X-Ray, Google Cloud Trace) can visualize the entire flow.
The Correlation Graph:
Build the ability to reconstruct the complete event chain for any workflow instance. Given an order ID or correlation ID, you should be able to see every event in the chain, in order, with timing information.
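As a minimal sketch of that reconstruction, the function below walks causation links in memory. It assumes events have already been loaded from some store into an array, that timestamps are ISO 8601 strings in UTC (so string comparison orders them), and that the initiating event has no causation ID; `StoredEvent` and `reconstructChain` are hypothetical names.

```typescript
interface StoredEvent {
  eventId: string;
  eventType: string;
  timestamp: string;      // ISO 8601, UTC (assumed)
  correlationId: string;
  causationId?: string;   // absent on the initiating event
}

interface ChainEntry {
  eventType: string;
  depth: number;
}

function reconstructChain(
  events: readonly StoredEvent[],
  correlationId: string,
): ChainEntry[] {
  const byTime = (a: StoredEvent, b: StoredEvent): number =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0;

  // Keep only this workflow's events, oldest first.
  const workflow = events
    .filter((e) => e.correlationId === correlationId)
    .sort(byTime);

  // Index events by the event that caused them.
  const children = new Map<string, StoredEvent[]>();
  const roots: StoredEvent[] = [];
  for (const event of workflow) {
    if (event.causationId === undefined) {
      roots.push(event);
      continue;
    }
    const siblings = children.get(event.causationId) ?? [];
    siblings.push(event);
    children.set(event.causationId, siblings);
  }

  // Depth-first walk from each root; `seen` guards against malformed cycles.
  const chain: ChainEntry[] = [];
  const seen = new Set<string>();
  const visit = (event: StoredEvent, depth: number): void => {
    if (seen.has(event.eventId)) return;
    seen.add(event.eventId);
    chain.push({ eventType: event.eventType, depth });
    for (const child of children.get(event.eventId) ?? []) {
      visit(child, depth + 1);
    }
  };
  for (const root of roots) visit(root, 1);
  return chain;
}

// Events arrive out of order and mixed with another workflow.
const sample: StoredEvent[] = [
  { eventId: "e3", eventType: "InventoryReserved", timestamp: "2024-01-15T10:00:03Z", correlationId: "corr-1", causationId: "e2" },
  { eventId: "e1", eventType: "OrderPlaced", timestamp: "2024-01-15T10:00:00Z", correlationId: "corr-1" },
  { eventId: "x1", eventType: "OrderPlaced", timestamp: "2024-01-15T10:00:01Z", correlationId: "corr-2" },
  { eventId: "e2", eventType: "PaymentCompleted", timestamp: "2024-01-15T10:00:02Z", correlationId: "corr-1", causationId: "e1" },
];

const summary = reconstructChain(sample, "corr-1")
  .map((entry) => `${entry.eventType}@${entry.depth}`)
  .join(" -> ");
console.log(summary);
// OrderPlaced@1 -> PaymentCompleted@2 -> InventoryReserved@3
```

The same traversal can be pushed into the database when events live in a relational event store.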
```sql
-- Query to reconstruct event chain for an order
-- This assumes events are stored in an event store or log
-- :correlationId is the correlation ID of the workflow (one order)

WITH RECURSIVE event_chain AS (
  -- Start with the initial event
  SELECT
    e.event_id,
    e.event_type,
    e.timestamp,
    e.correlation_id,
    e.causation_id,
    e.service_name,
    e.payload,
    1 AS depth
  FROM events e
  WHERE e.correlation_id = :correlationId
    AND e.causation_id IS NULL

  UNION ALL

  -- Recursively find caused events
  SELECT
    e.event_id,
    e.event_type,
    e.timestamp,
    e.correlation_id,
    e.causation_id,
    e.service_name,
    e.payload,
    ec.depth + 1
  FROM events e
  JOIN event_chain ec ON e.causation_id = ec.event_id
)
SELECT
  event_type,
  service_name,
  timestamp,
  depth,
  -- EPOCH * 1000 gives the full interval in milliseconds; EXTRACT(MILLISECONDS ...)
  -- would return only the seconds field and wrap for gaps over a minute
  EXTRACT(EPOCH FROM
    timestamp - LAG(timestamp) OVER (ORDER BY depth, timestamp)
  ) * 1000 AS ms_since_previous
FROM event_chain
ORDER BY depth, timestamp;

-- Example output:
-- | event_type          | service_name | timestamp           | depth | ms_since_previous |
-- |---------------------|--------------|---------------------|-------|-------------------|
-- | OrderPlaced         | orders       | 2024-01-15 10:00:00 | 1     | NULL              |
-- | PaymentCompleted    | payments     | 2024-01-15 10:00:02 | 2     | 2000              |
-- | InventoryReserved   | inventory    | 2024-01-15 10:00:03 | 3     | 1000              |
-- | ShipmentScheduled   | shipping     | 2024-01-15 10:00:05 | 4     | 2000              |
```

Process Mining and Visualization:
For complex choreographies, consider process mining tools that can reconstruct workflow patterns from event logs. These tools visualize the actual flows in your system, which may differ from your intended design.
Monitoring Points:
Choreography requires significant investment in observability tooling. Without it, you're flying blind. Budget time and resources for distributed tracing, event stores for replay and debugging, and dashboards that show workflow health. This isn't optional—it's the cost of doing choreography correctly.
Testing choreographed systems requires strategies that span multiple services without introducing tight coupling. The testing pyramid remains relevant, but each level takes on new characteristics.
Unit Testing: Focus on Event Handling Logic
Unit tests verify that a service correctly handles incoming events and produces correct outgoing events. Mock the message infrastructure; test the business logic.
```typescript
describe('PaymentService', () => {
  let service: PaymentService;
  let mockDb: MockDatabase;
  let mockPaymentGateway: MockPaymentGateway;
  let capturedOutboxEvents: OutboxEntry[];

  beforeEach(() => {
    mockDb = createMockDatabase();
    mockPaymentGateway = createMockPaymentGateway();
    capturedOutboxEvents = [];

    // Capture events written to outbox
    mockDb.outbox.insert = jest.fn((entry) => {
      capturedOutboxEvents.push(entry);
      return Promise.resolve(entry);
    });

    service = new PaymentService(mockDb, mockPaymentGateway);
  });

  describe('handleOrderPlaced', () => {
    it('should process payment and emit PaymentCompleted', async () => {
      const orderPlacedEvent = createOrderPlacedEvent({
        orderId: 'order-123',
        totalAmount: { amount: 9999, currency: 'USD' },
      });

      mockPaymentGateway.charge.mockResolvedValue({
        id: 'payment-456',
        status: 'completed',
        transactionRef: 'tx-789',
      });

      await service.handleOrderPlaced(orderPlacedEvent);

      // Verify payment was processed
      expect(mockPaymentGateway.charge).toHaveBeenCalledWith(
        expect.objectContaining({
          orderId: 'order-123',
          amount: { amount: 9999, currency: 'USD' },
        })
      );

      // Verify correct event was emitted
      expect(capturedOutboxEvents).toHaveLength(1);
      expect(capturedOutboxEvents[0].eventType).toBe('PaymentCompleted');
      expect(capturedOutboxEvents[0].payload).toMatchObject({
        orderId: 'order-123',
        paymentId: 'payment-456',
        correlationId: orderPlacedEvent.correlationId,
        causationId: orderPlacedEvent.eventId,
      });
    });

    it('should emit PaymentFailed when gateway fails', async () => {
      const orderPlacedEvent = createOrderPlacedEvent({
        orderId: 'order-123',
      });

      mockPaymentGateway.charge.mockRejectedValue(
        new PaymentDeclinedError('Insufficient funds')
      );

      await service.handleOrderPlaced(orderPlacedEvent);

      expect(capturedOutboxEvents[0].eventType).toBe('PaymentFailed');
      expect(capturedOutboxEvents[0].payload.reason).toBe('insufficient_funds');
    });

    it('should be idempotent for duplicate events', async () => {
      const orderPlacedEvent = createOrderPlacedEvent({
        orderId: 'order-123',
        eventId: 'event-duplicate',
      });

      // The gateway must return a payment, or the first call would fail
      // before the event is recorded as processed
      mockPaymentGateway.charge.mockResolvedValue({
        id: 'payment-456',
        status: 'completed',
        transactionRef: 'tx-789',
      });

      // First processing
      await service.handleOrderPlaced(orderPlacedEvent);
      const firstCallCount = mockPaymentGateway.charge.mock.calls.length;

      // Mark as processed in mock
      mockDb.processedEvents.findUnique.mockResolvedValue({
        eventId: 'event-duplicate',
      });

      // Second processing of same event
      await service.handleOrderPlaced(orderPlacedEvent);

      // Should not process again
      expect(mockPaymentGateway.charge).toHaveBeenCalledTimes(firstCallCount);
    });
  });
});
```

Integration Testing: Consumer Contract Testing
Contract testing verifies that services communicate correctly through events. Each consumer maintains contracts describing what events it expects; producers verify they satisfy those contracts.
```typescript
// Consumer defines expected event format
// payment-service/pacts/order-service-messages.ts
import {
  Matchers,
  MessageConsumerPact,
  synchronousBodyHandler,
} from '@pact-foundation/pact';

// like/eachLike match on shape and type rather than exact values
const { like, eachLike } = Matchers;

describe('Payment Service - Order Events Contract', () => {
  const messagePact = new MessageConsumerPact({
    consumer: 'PaymentService',
    provider: 'OrderService',
    dir: './pacts',
  });

  it('consumes OrderPlaced events', () => {
    return messagePact
      .expectsToReceive('an OrderPlaced event')
      .withContent({
        eventType: 'OrderPlaced',
        eventId: like('uuid-string'),
        timestamp: like('2024-01-15T10:00:00Z'),
        version: '1.0',
        correlationId: like('correlation-id'),
        orderId: like('order-id'),
        customerId: like('customer-id'),
        totalAmount: {
          amount: like(9999),
          currency: like('USD'),
        },
        items: eachLike({
          productId: like('product-id'),
          quantity: like(1),
          unitPrice: like(1999),
        }),
      })
      .withMetadata({ contentType: 'application/json' })
      .verify(synchronousBodyHandler(async (event: OrderPlacedEvent) => {
        // Verify our consumer can handle this event shape
        const handler = new PaymentEventHandler();
        await expect(handler.validateEvent(event)).resolves.toBe(true);
      }));
  });
});

// Producer verifies it satisfies consumer contracts
// order-service/tests/pact-verification.test.ts
// (separate file; needs: import { MessageProviderPact } from '@pact-foundation/pact';)
describe('Order Service - Pact Verification', () => {
  it('satisfies PaymentService expectations for OrderPlaced', async () => {
    const verifier = new MessageProviderPact({
      provider: 'OrderService',
      pactUrls: ['./pacts/order-service-payment-service.json'],
      messageProviders: {
        'an OrderPlaced event': async () => {
          const order = createTestOrder();
          return OrderPlacedEvent.create(order, 'test-correlation-id');
        },
      },
    });

    await verifier.verify();
  });
});
```

For end-to-end tests, deploy all services and verify complete workflows: place an order and verify that a shipment is eventually scheduled. Use the event store to validate intermediate states. These tests are expensive but essential for verifying the choreography works as a whole.
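Because the outcome is only eventually visible, end-to-end assertions usually need to poll. The helper below is one possible sketch, not part of any test framework: `eventually` is a hypothetical name, the timeout and interval defaults are arbitrary, and the array `fakeEventStore` stands in for whatever query your real event store exposes.

```typescript
// Polls `probe` until it returns a value or the timeout elapses.
async function eventually<T>(
  probe: () => Promise<T | undefined>,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await probe();
    if (value !== undefined) return value;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Stand-in for an event store: the shipment event shows up a little later,
// as it would when several services react in turn.
const fakeEventStore: string[] = ["OrderPlaced"];
setTimeout(() => fakeEventStore.push("ShipmentScheduled"), 30);

const observed = eventually(
  async () => fakeEventStore.find((type) => type === "ShipmentScheduled"),
  1000,
  10,
);

observed.then((type) => console.log(`observed ${type}`));
```

In a real suite the probe would query the event store by correlation ID, and the timeout would reflect the slowest acceptable end-to-end latency for the workflow.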
We've explored choreography as a coordination pattern for event-driven architectures. Let's consolidate the key insights:
Choreography excels when:
In the next page, we'll explore the alternative: Orchestration, where a central service explicitly controls the workflow. You'll see how the same order processing workflow looks with centralized control, understand the tradeoffs, and learn when each approach is appropriate.
You now understand choreography as a coordination pattern: its philosophy, implementation requirements, error handling approach, and testing strategies. You can design choreographed workflows where services react autonomously to events, creating loosely coupled systems that scale independently. Next, we'll examine orchestration—the centralized alternative.