In the world of distributed systems, some failures are loud and obvious—servers crash, databases time out, networks partition. But there's a category of failure that's far more insidious: data inconsistency caused by partial failures during multi-system updates. These failures often go undetected until they've corrupted significant amounts of business data.
Consider this scenario: An e-commerce service needs to:

1. Save a new order to its database
2. Publish an OrderCreated event to a message broker for downstream processing

Seems simple enough. But what happens when the database write succeeds and the message publish fails? Or when the publish succeeds but the database transaction rolls back? You're left with inconsistent state—the order exists in one system but not the other, or events describe orders that don't exist.
This problem is so common and so treacherous that it has a name: the dual-write problem. And it has destroyed more distributed systems than nearly any other failure mode.
By the end of this page, you will deeply understand the dual-write problem, why it cannot be solved through naive retry mechanisms, and how the Outbox Pattern provides a fundamental solution. This foundation is essential before exploring implementation strategies.
Event-driven architecture has become the dominant paradigm for building scalable, loosely coupled systems. Instead of synchronous request-response, services communicate by publishing and consuming events. This approach offers tremendous benefits: producers and consumers are decoupled, each service can scale independently, and a slow or unavailable consumer doesn't block the producer.
But there's a fundamental problem at the heart of event-driven architectures: How do you reliably publish events when the publishing operation involves multiple systems?
```typescript
// THE NAIVE APPROACH - DECEPTIVELY DANGEROUS

class OrderService {
  constructor(
    private database: Database,
    private messageQueue: MessageQueue
  ) {}

  async createOrder(orderRequest: CreateOrderRequest): Promise<Order> {
    // Step 1: Save to database
    const order = await this.database.orders.create({
      customerId: orderRequest.customerId,
      items: orderRequest.items,
      total: this.calculateTotal(orderRequest.items),
      status: 'CREATED'
    });

    // Step 2: Publish event to message broker
    await this.messageQueue.publish('order.created', {
      orderId: order.id,
      customerId: order.customerId,
      total: order.total,
      timestamp: new Date()
    });

    return order;
  }
}

// WHAT CAN GO WRONG?
//
// SCENARIO 1: Message publish fails after DB write
// ───────────────────────────────────────────────
// Timeline:
//   T1: database.create()      → SUCCESS (order saved)
//   T2: messageQueue.publish() → FAILURE (network timeout)
//   T3: Exception thrown, function returns error
//
// Result: Order exists in DB, but no event was published.
//         Downstream systems never learn about the order.
//         Customer is confused why the order isn't processing.
//
// SCENARIO 2: Process crashes between operations
// ───────────────────────────────────────────────
// Timeline:
//   T1: database.create() → SUCCESS (order saved)
//   T2: Process crashes (out of memory, killed, etc.)
//   T3: messageQueue.publish() → NEVER EXECUTED
//
// Result: Same as Scenario 1—orphaned order.
//
// SCENARIO 3: Message published but DB transaction fails
// ───────────────────────────────────────────────────────
// (If using a transaction that could still roll back)
// Timeline:
//   T1: database transaction begins
//   T2: order inserted (not yet committed)
//   T3: messageQueue.publish() → SUCCESS (event published!)
//   T4: database.commit() → FAILURE (constraint violation)
//   T5: transaction rolled back
//
// Result: Event published for an order that DOESN'T EXIST!
//         Downstream systems try to process a ghost order.
//         Corrupted state that's very hard to detect.
```

You cannot atomically update a database AND publish to a message broker in a single transaction. They are separate systems with separate failure modes. Any sequence of operations across these systems can fail partially, leaving the systems in inconsistent states.
Why Retries Don't Solve This
The first instinct is often to add retry logic. But retries introduce new problems:
```typescript
// RETRY APPROACH - STILL BROKEN
async createOrderWithRetry(request: CreateOrderRequest): Promise<Order> {
  const order = await this.database.orders.create(request);

  let published = false;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      await this.messageQueue.publish('order.created', { orderId: order.id });
      published = true;
      break;
    } catch (e) {
      await sleep(exponentialBackoff(attempt));
    }
  }

  if (!published) {
    // What now? Order is already in DB!
    // Can't roll back—DB transaction already committed.
    // Can't delete order—what if customer sees it?
    throw new Error('Failed to publish event');
  }
  return order;
}
```
The retry approach fails because:

- Retries only mask transient failures. If the broker stays down longer than the retry window, you still end up with a committed order and no event.
- When all retries are exhausted there is no good recovery path: the database transaction has already committed, and deleting the order isn't safe once a customer may have seen it.
- If the process crashes during the retry loop, the event is lost entirely: nothing durable records the intent to publish.
- Throwing an error misleads the caller: the order was actually created, so a client-side retry may create a duplicate order.
The dual-write problem occurs when a single business operation must update two or more systems that cannot participate in the same transaction. The term 'dual' refers to the minimum case of two systems, but the problem extends to any number of independent data stores.
Formal Definition:
A dual-write is an operation that modifies data in two or more systems without a coordinated transaction mechanism. Because each system commits independently, partial failures leave the systems in mutually inconsistent states.
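This definition can be made concrete with a toy simulation. Everything below is an illustrative stand-in, not a real client: the "database" and "broker" are plain arrays, and a boolean flag simulates a publish failure.

```typescript
// A toy, synchronous simulation of the dual-write problem. The "database"
// and "broker" are plain arrays; a flag simulates a network failure on
// publish. All names here are illustrative.

type Order = { id: string; status: string };

const dbOrders: Order[] = [];      // committed database state
const brokerEvents: string[] = []; // events the broker accepted

function publish(event: string, brokerDown: boolean): void {
  if (brokerDown) throw new Error('broker timeout'); // simulated network failure
  brokerEvents.push(event);
}

// The naive flow: two independent writes, each committing on its own.
function createOrder(id: string, brokerDown: boolean): void {
  dbOrders.push({ id, status: 'CREATED' });    // write 1 commits immediately
  publish(`order.created:${id}`, brokerDown);  // write 2 may fail afterwards
}

createOrder('order-1', false); // happy path: both writes succeed
try {
  createOrder('order-2', true); // partial failure: DB write sticks, publish fails
} catch {
  // the caller sees an error, but write 1 has already committed
}

console.log(dbOrders.length);     // 2: both orders are in the database
console.log(brokerEvents.length); // 1: only one event reached the broker
```

The two stores now disagree, and no amount of error handling at the call site can undo the committed first write.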
Why Can't We Just Use Distributed Transactions?
The textbook solution to coordinating writes across multiple systems is distributed transactions using protocols like Two-Phase Commit (2PC). However, 2PC has severe practical limitations:
| Concern | Two-Phase Commit Limitation | Practical Impact |
|---|---|---|
| Availability | Blocks if coordinator fails | Single point of failure; system unavailable during recovery |
| Performance | Holds locks during prepare phase | High latency; reduced throughput under contention |
| Heterogeneity | All participants must support XA/2PC | Message brokers like Kafka, RabbitMQ don't support XA |
| Scalability | Coordinator becomes bottleneck | Doesn't scale horizontally |
| Cloud Compatibility | Requires persistent connections | Not compatible with serverless, managed services |
| Complexity | Complex failure recovery | Operational burden; hard to debug |
Most message brokers simply don't support distributed transactions. Apache Kafka, RabbitMQ, Amazon SQS, Google Pub/Sub—none of them can participate in an XA transaction with your PostgreSQL database. Even if they could, the performance and availability costs would be prohibitive for most use cases.
This leaves us with a fundamental constraint:
We cannot achieve atomicity across a database and a message broker through traditional transaction mechanisms.
This is not a solvable problem through better engineering of the naive approach. It requires a fundamentally different architecture.
Dual-write failures often don't crash your system—they corrupt it slowly. You'll see symptoms like 'occasional missing orders,' 'sometimes the notification doesn't send,' or 'the audit log is incomplete.' These intermittent issues are extremely hard to debug because they depend on precise timing of unrelated network failures.
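One way to surface this silent drift is a periodic reconciliation job that diffs the source database against what downstream systems have seen. The sketch below uses in-memory ID lists as stand-ins; the `reconcile` helper and all data are hypothetical.

```typescript
// Hypothetical reconciliation check: compare order IDs in the source
// database against order IDs a downstream system has processed. Orphans
// on either side indicate a dual-write failure.

function reconcile(dbOrderIds: string[], downstreamOrderIds: string[]) {
  const db = new Set(dbOrderIds);
  const seen = new Set(downstreamOrderIds);
  return {
    // order committed, but its event never arrived downstream
    missingEvents: dbOrderIds.filter(id => !seen.has(id)),
    // event arrived for an order that doesn't exist (a "ghost order")
    ghostOrders: downstreamOrderIds.filter(id => !db.has(id)),
  };
}

const report = reconcile(['o1', 'o2', 'o3'], ['o1', 'o3', 'o9']);
console.log(report.missingEvents); // ['o2']
console.log(report.ghostOrders);   // ['o9']
```

A job like this won't prevent the inconsistency, but it turns "occasional missing orders" from an anecdote into an alert.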
The Outbox Pattern solves the dual-write problem through a simple but powerful insight:
Instead of publishing events directly to a message broker, write them to an 'outbox' table in the same database as your business data—within the same transaction. A separate process then reads from the outbox and publishes to the message broker.
This transforms the problem from impossible (atomic cross-system write) to trivial (atomic single-database write).
The Key Insight:
Databases are really good at one thing: atomic transactions within themselves. If you insert an order and an outbox event in the same transaction, either both succeed or both fail. There's no inconsistent state.
Once the data is safely in the outbox (with transactional guarantees), a separate process can take responsibility for reliable delivery to the message broker. This process can retry indefinitely, use idempotency keys, and handle all the complexity of reliable messaging—without any risk to the source transaction.
```typescript
// THE OUTBOX PATTERN - CORRECT APPROACH

// STEP 1: Database transaction includes outbox write
// ──────────────────────────────────────────────────

class OrderService {
  async createOrder(request: CreateOrderRequest): Promise<Order> {
    // Single database transaction
    return await this.database.transaction(async (tx) => {
      // Insert the order
      const order = await tx.orders.create({
        customerId: request.customerId,
        items: request.items,
        total: this.calculateTotal(request.items),
        status: 'CREATED'
      });

      // Insert event into outbox table (SAME TRANSACTION)
      await tx.outbox.create({
        aggregateType: 'Order',
        aggregateId: order.id,
        eventType: 'OrderCreated',
        payload: JSON.stringify({
          orderId: order.id,
          customerId: order.customerId,
          items: order.items,
          total: order.total,
          createdAt: new Date()
        }),
        createdAt: new Date(),
        publishedAt: null // Not yet published
      });

      return order;
    });
    // At this point:
    // - If transaction committed: order AND outbox event both exist
    // - If transaction failed: neither exists
    // - NEVER inconsistent!
  }
}

// STEP 2: Separate process publishes from outbox
// ──────────────────────────────────────────────

class OutboxPublisher {
  // Runs continuously or on a schedule
  async publishPendingEvents(): Promise<void> {
    // Get unpublished events
    const pendingEvents = await this.database.outbox.findMany({
      where: { publishedAt: null },
      orderBy: { createdAt: 'asc' },
      take: 100
    });

    for (const event of pendingEvents) {
      try {
        // Publish to message broker
        await this.messageQueue.publish(
          `${event.aggregateType.toLowerCase()}.${event.eventType.toLowerCase()}`,
          JSON.parse(event.payload)
        );

        // Mark as published
        await this.database.outbox.update({
          where: { id: event.id },
          data: { publishedAt: new Date() }
        });
      } catch (error) {
        // Failed? No problem! Event stays in outbox.
        // Will be picked up in next run.
        // Source transaction is NOT affected.
        console.error(`Failed to publish event ${event.id}:`, error);
      }
    }
  }
}

// KEY GUARANTEES:
//
// 1. ATOMICITY: Order and outbox event are in the same transaction.
//    They either both exist or neither exists.
//
// 2. DURABILITY: Once the transaction commits, the event is durably stored.
//    Even if the process crashes, the event is not lost.
//
// 3. EVENTUAL DELIVERY: Publisher will keep retrying until successful.
//    Events might be delayed, but never lost.
//
// 4. ORDERING: Events published in order of creation (by timestamp).
//    No event skipped due to an earlier failure.
```

The Outbox Pattern cleanly separates the business transaction from the messaging concern. The order service's responsibility is to create orders atomically (with their events). The publisher's responsibility is to deliver events reliably. Neither needs to worry about the other's failure modes.
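The eventual-delivery guarantee can be exercised with a toy drain loop. The in-memory outbox rows and the flaky broker below are stand-ins assuming nothing beyond this snippet.

```typescript
// Toy publisher drain: events stay in the outbox until the broker accepts
// them. A broker that rejects the first attempts loses nothing; pending
// rows are simply retried on the next run.

type OutboxRow = { id: string; payload: string; publishedAt: Date | null };

const outbox: OutboxRow[] = [
  { id: 'e1', payload: 'order.created:1', publishedAt: null },
  { id: 'e2', payload: 'order.created:2', publishedAt: null },
];
const delivered: string[] = [];

let failuresLeft = 2; // broker rejects the first two publish attempts
function publish(payload: string): void {
  if (failuresLeft > 0) { failuresLeft--; throw new Error('broker unavailable'); }
  delivered.push(payload);
}

function publishPendingEvents(): void {
  for (const row of outbox.filter(r => r.publishedAt === null)) {
    try {
      publish(row.payload);
      row.publishedAt = new Date(); // mark as published only after success
    } catch {
      // leave the row pending; it will be retried on the next run
    }
  }
}

publishPendingEvents(); // both attempts fail: nothing delivered, nothing lost
publishPendingEvents(); // broker recovered: both events drain in order
console.log(delivered); // ['order.created:1', 'order.created:2']
```

Note that a row is marked published only after the broker accepts it; that ordering is what trades "exactly-once" for "never lost".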
The outbox table is the heart of the pattern. Its design significantly impacts reliability, performance, and operational characteristics. A well-designed outbox table supports: efficient lookup of unpublished events, stable event ordering, routing and filtering by aggregate, tracing and retry diagnostics, and cheap cleanup of already-published rows.
```sql
-- PRODUCTION-GRADE OUTBOX TABLE SCHEMA

CREATE TABLE outbox_events (
    -- Primary identifier (UUID recommended for distributed systems)
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

    -- Event ordering (monotonically increasing, no gaps)
    -- More reliable than timestamps for ordering
    sequence_number BIGSERIAL NOT NULL,

    -- Aggregate information (for routing and filtering)
    aggregate_type VARCHAR(255) NOT NULL,     -- e.g., 'Order', 'Customer'
    aggregate_id VARCHAR(255) NOT NULL,       -- e.g., order UUID

    -- Event information
    event_type VARCHAR(255) NOT NULL,         -- e.g., 'OrderCreated'
    event_version INTEGER NOT NULL DEFAULT 1, -- schema version

    -- Payload (the actual event data)
    payload JSONB NOT NULL,

    -- Metadata for tracing and debugging
    correlation_id VARCHAR(255),              -- trace across services
    causation_id VARCHAR(255),                -- which event caused this
    metadata JSONB,                           -- additional context

    -- Timestamps
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,                 -- NULL = not yet published

    -- Retry tracking
    publish_attempts INTEGER NOT NULL DEFAULT 0,
    last_attempt_at TIMESTAMPTZ,
    last_error TEXT
);

-- CRITICAL INDEX: Efficiently find unpublished events in order
-- PARTIAL INDEX: Only index what we actually query
-- Much smaller than indexing all rows
CREATE INDEX idx_outbox_unpublished
    ON outbox_events (sequence_number)
    WHERE published_at IS NULL;

-- For cleanup: find old published events
CREATE INDEX idx_outbox_published_cleanup
    ON outbox_events (published_at)
    WHERE published_at IS NOT NULL;

-- For debugging: find events by aggregate
CREATE INDEX idx_outbox_aggregate
    ON outbox_events (aggregate_type, aggregate_id);

-- ALTERNATIVE: Partitioned table for high-volume systems
-- Partition by date for efficient cleanup

CREATE TABLE outbox_events_partitioned (
    id UUID NOT NULL,
    sequence_number BIGINT NOT NULL,
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id VARCHAR(255) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,
    created_date DATE NOT NULL DEFAULT CURRENT_DATE,
    PRIMARY KEY (created_date, id)
) PARTITION BY RANGE (created_date);

-- Create partitions (automate this in production)
CREATE TABLE outbox_events_2024_01 PARTITION OF outbox_events_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Drop old partitions instead of DELETE (much faster)
-- DROP TABLE outbox_events_2023_12;
```
```typescript
// TYPE-SAFE OUTBOX EVENT HANDLING

// Base event structure (all events extend this)
interface BaseEvent<T extends string, P> {
  eventId: string;
  eventType: T;
  eventVersion: number;
  aggregateType: string;
  aggregateId: string;
  payload: P;
  metadata: EventMetadata;
  createdAt: Date;
}

interface EventMetadata {
  correlationId: string;
  causationId?: string;
  userId?: string;
  traceId?: string;
  spanId?: string;
}

// Specific event types
interface OrderCreatedPayload {
  customerId: string;
  items: Array<{
    productId: string;
    quantity: number;
    unitPrice: number;
  }>;
  totalAmount: number;
  currency: string;
  shippingAddress: Address;
}

type OrderCreatedEvent = BaseEvent<'OrderCreated', OrderCreatedPayload>;

interface OrderShippedPayload {
  trackingNumber: string;
  carrier: string;
  estimatedDelivery: Date;
}

type OrderShippedEvent = BaseEvent<'OrderShipped', OrderShippedPayload>;

// Union type for all order events
type OrderEvent = OrderCreatedEvent | OrderShippedEvent;

// Outbox repository with type safety
class OutboxRepository {
  constructor(private db: Database) {}

  async addEvent<E extends BaseEvent<string, unknown>>(
    tx: Transaction,
    event: E
  ): Promise<void> {
    await tx.execute(`
      INSERT INTO outbox_events (
        id, aggregate_type, aggregate_id,
        event_type, event_version,
        payload, metadata,
        correlation_id, causation_id
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
    `, [
      event.eventId,
      event.aggregateType,
      event.aggregateId,
      event.eventType,
      event.eventVersion,
      JSON.stringify(event.payload),
      JSON.stringify(event.metadata),
      event.metadata.correlationId,
      event.metadata.causationId
    ]);
  }
}

// Usage in service layer
class OrderService {
  constructor(
    private db: Database,
    private outbox: OutboxRepository
  ) {}

  async createOrder(
    command: CreateOrderCommand,
    metadata: EventMetadata
  ): Promise<Order> {
    return await this.db.transaction(async (tx) => {
      // Create the order
      const order = await tx.orders.create({
        customerId: command.customerId,
        items: command.items,
        status: 'CREATED',
        totalAmount: this.calculateTotal(command.items)
      });

      // Create type-safe event
      const event: OrderCreatedEvent = {
        eventId: crypto.randomUUID(),
        eventType: 'OrderCreated',
        eventVersion: 1,
        aggregateType: 'Order',
        aggregateId: order.id,
        payload: {
          customerId: order.customerId,
          items: order.items.map(i => ({
            productId: i.productId,
            quantity: i.quantity,
            unitPrice: i.unitPrice
          })),
          totalAmount: order.totalAmount,
          currency: 'USD',
          shippingAddress: command.shippingAddress
        },
        metadata: {
          ...metadata,
          causationId: metadata.correlationId // This event caused by the command
        },
        createdAt: new Date()
      };

      // Add to outbox (same transaction)
      await this.outbox.addEvent(tx, event);

      return order;
    });
  }
}
```

Generate event IDs (UUIDs) in the application before inserting. This allows consumers to use the event ID for idempotency checks, even before the event reaches them. If you rely on database-generated IDs, you can't know the ID until after insert—complicating some patterns.
The Outbox Pattern provides strong guarantees, but like all distributed systems solutions, it involves trade-offs. Understanding exactly what the pattern guarantees—and what it doesn't—is crucial for correct usage.
At-Least-Once vs Exactly-Once
The Outbox Pattern inherently provides at-least-once delivery semantics. Duplicates can occur when:

- The publisher publishes an event to the broker but crashes before marking it as published; the next run publishes it again.
- The update that sets published_at fails after a successful publish, leaving the event eligible for republication.
- Multiple publisher instances pick up the same unpublished rows concurrently.
This is unavoidable in any system without distributed transactions. The solution is to design consumers to be idempotent:
```typescript
// IDEMPOTENT CONSUMER
class OrderEventConsumer {
  constructor(private db: Database) {}

  async handleOrderCreated(event: OrderCreatedEvent): Promise<void> {
    // Check if we've already processed this event
    const exists = await this.db.processedEvents.findFirst({
      where: { eventId: event.eventId }
    });
    if (exists) {
      console.log(`Event ${event.eventId} already processed, skipping`);
      return; // Idempotent: processing again is a no-op
    }

    // Process the event
    await this.db.transaction(async (tx) => {
      // Do the actual work
      await tx.notifications.create({
        type: 'ORDER_CONFIRMATION',
        orderId: event.payload.orderId,
        customerId: event.payload.customerId
      });

      // Record that we processed this event
      await tx.processedEvents.create({
        eventId: event.eventId,
        processedAt: new Date()
      });
    });
  }
}
```
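The effect of idempotent handling can be seen in a self-contained simulation. The Set and the counter below are illustrative stand-ins for the processed-events table and the real side effect.

```typescript
// Minimal idempotent-consumer simulation: delivering the same event twice
// produces the side effect exactly once. The Set stands in for the
// processed-events table; the counter stands in for real work.

const processed = new Set<string>();
let notificationsSent = 0;

function handleEvent(eventId: string): void {
  if (processed.has(eventId)) return; // duplicate delivery: no-op
  notificationsSent += 1;             // the actual work
  processed.add(eventId);             // record the event as processed
}

// At-least-once delivery may replay events:
handleEvent('evt-1');
handleEvent('evt-1'); // duplicate, skipped
handleEvent('evt-2');

console.log(notificationsSent); // 2
```

In a real consumer the side effect and the processed-event record must commit in the same transaction, as in the class above; this sketch omits that for brevity.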
| Trade-off | Impact | Mitigation |
|---|---|---|
| Added Latency | 50-500ms delay between commit and publish (polling) | Use CDC instead of polling; tune poll frequency |
| Database Load | Additional writes (outbox) and reads (polling) | Efficient indexes; separate read replica for polling |
| Storage Growth | Outbox table accumulates events | Regular cleanup of old published events; partitioning |
| Duplicate Events | At-least-once means possible duplicates | Idempotent consumers; deduplication tables |
| Operational Complexity | Additional component (publisher) to monitor | Robust monitoring; alerting on publication lag |
| Schema Coupling | Event schema in database must match expectations | Event versioning; schema registry |
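The storage-growth mitigation in the table can be sketched as a simple retention sweep. The `sweepPublished` helper and the in-memory rows are hypothetical, standing in for a SQL DELETE on old published rows (or a partition drop).

```typescript
// Hypothetical retention sweep: keep unpublished rows unconditionally,
// drop published rows older than the retention window. In production this
// would be a SQL DELETE or a partition drop; here it runs over an array.

type OutboxRow = { id: string; publishedAt: Date | null };

function sweepPublished(rows: OutboxRow[], retentionMs: number, now: Date): OutboxRow[] {
  return rows.filter(row =>
    row.publishedAt === null ||                              // never delete pending events
    now.getTime() - row.publishedAt.getTime() < retentionMs  // keep recent history
  );
}

const now = new Date('2024-02-01T00:00:00Z');
const rows: OutboxRow[] = [
  { id: 'a', publishedAt: new Date('2024-01-01T00:00:00Z') }, // old, published
  { id: 'b', publishedAt: new Date('2024-01-31T12:00:00Z') }, // recent, published
  { id: 'c', publishedAt: null },                             // still pending
];

const kept = sweepPublished(rows, 7 * 24 * 3600 * 1000, now); // 7-day retention
console.log(kept.map(r => r.id)); // ['b', 'c']
```

The key invariant is that unpublished rows are untouchable: deleting a pending event would silently break the "never lost" guarantee.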
If you need sub-second event delivery, polling-based outbox may not suffice. Consider CDC (Change Data Capture) which can detect new outbox rows almost instantly. We'll cover this in detail in the 'Polling vs CDC' section.
The Outbox Pattern is not always necessary. For some systems, simpler approaches work fine. For others, the Outbox Pattern is essential. As a rule of thumb: reach for the outbox when losing an event would corrupt business state or break a downstream workflow (orders, payments, inventory, audit trails); a plain fire-and-forget publish may be acceptable when events are best-effort and occasional loss is tolerable (metrics, cache warming, non-critical notifications).
You now understand the dual-write problem and why the Outbox Pattern is the fundamental solution. Next, we'll dive deep into the transactional outbox mechanism—exploring table design, transaction patterns, and ensuring atomicity across different database engines.