Consider the most common operation in distributed systems: save something to a database, then notify other systems about it. It seems trivially simple. Write to database. Send message. Done.
```typescript
const order = await database.orders.create(orderData);
await messageQueue.publish('order.created', order);
```
Two lines of code. What could go wrong?
Everything.
This innocuous pattern is the source of more data corruption, lost transactions, and debugging nightmares than almost any other in distributed systems. It's called a dual write, and understanding exactly why it fails—and why clever workarounds don't work—is essential for any engineer building distributed systems.
This page dissects the dual-write problem with surgical precision, examining every failure mode and demonstrating why the Outbox Pattern is the principled solution.
By the end of this page, you will understand every way dual writes can fail, why common fixes don't work, the theoretical foundations explaining why this problem is hard, and how the Outbox Pattern's 'write locally, publish later' approach provides the correct abstraction.
A dual write occurs whenever a single logical operation must update two independent systems. Let's systematically catalog every way this can fail.
```typescript
// THE DUAL-WRITE ANTI-PATTERN
//
// Any variation of: "update system A, then update system B"
// without a coordinating transaction is a dual write.

async function createOrder(request: OrderRequest): Promise<Order> {
  // SYSTEM A: Database
  const order = await database.orders.create({
    customerId: request.customerId,
    items: request.items,
    total: calculateTotal(request.items),
    status: 'CREATED'
  });

  // SYSTEM B: Message Broker
  await messageBroker.publish('OrderCreated', {
    orderId: order.id,
    customerId: order.customerId,
    total: order.total
  });

  return order;
}

// Common variations of dual writes:
//
// 1. Database + Message Broker (most common)
// 2. Database + Cache (Redis, Memcached)
// 3. Database + Search Index (Elasticsearch)
// 4. Database + External API (payment processor, etc.)
// 5. Two Databases (cross-service writes)
// 6. Database + File System
//
// ALL of these are dual writes and suffer the same problems.
```

Failure Mode 1: Second System Failure
The most obvious failure: the second operation fails after the first succeeds.
```typescript
// FAILURE MODE 1: SECOND SYSTEM FAILS

async function createOrder(request: OrderRequest): Promise<Order> {
  // T1: Database write SUCCEEDS
  const order = await database.orders.create(request);
  // Order now exists in database ✓

  // T2: Message broker FAILS
  // - Network timeout
  // - Broker unavailable
  // - Quota exceeded
  // - Any of 100 possible failures
  await messageBroker.publish('OrderCreated', order); // ❌ Throws exception

  // Exception propagates up
  // What happens to the order?
  // IT'S STILL IN THE DATABASE.
}

// CONSEQUENCES:
//
// 1. Order exists without corresponding event
// 2. Downstream systems never learn about order
// 3. Inventory not reserved (if inventory service consumes events)
// 4. Payment not processed (if payment service consumes events)
// 5. Customer confirmation email not sent
// 6. Analytics data missing this order
//
// The order is an "orphan" - visible in UI but not processed

// "FIX" ATTEMPT: Retry the publish
async function createOrderWithRetry(request: OrderRequest): Promise<Order> {
  const order = await database.orders.create(request);

  let published = false;
  for (let i = 0; i < 3; i++) {
    try {
      await messageBroker.publish('OrderCreated', order);
      published = true;
      break;
    } catch (e) {
      await sleep(exponentialBackoff(i));
    }
  }

  if (!published) {
    // NOW WHAT?
    // Option A: Throw error
    //   → Order still exists! User sees "error" but order is placed.
    //   → If they retry, they create a DUPLICATE order.
    //
    // Option B: Delete the order
    //   → What if delete fails too?
    //   → What if another process is already using the order?
    //   → What about the items we already created?
    //
    // Option C: Mark order as "pending_publish"
    //   → Who retries it? When? How?
    //   → You've just reinvented the outbox pattern, poorly.
    throw new Error('Failed to publish event');
  }

  return order;
}
```

Failure Mode 2: Process Crash
The operation is interrupted by a crash between the two operations.
```typescript
// FAILURE MODE 2: PROCESS CRASH

async function createOrder(request: OrderRequest): Promise<Order> {
  // T1: Database write SUCCEEDS
  const order = await database.orders.create(request);
  // Order now committed to database ✓

  // T2: PROCESS CRASH
  // - Out of memory (OOM kill)
  // - Container killed (Kubernetes pod termination)
  // - Hardware failure
  // - Unhandled exception in unrelated code
  // - Deployment/restart during this window

  // This line is NEVER REACHED:
  await messageBroker.publish('OrderCreated', order);
}

// The window between database commit and publish
// is called the "danger zone" or "crash window"
//
// Even if this window is only milliseconds,
// multiplied by millions of operations,
// you WILL hit this case.
//
// Rule of thumb:
// - A 1ms window × 1,000 ops/sec adds up to a full second of
//   cumulative crash-window exposure for every second the service runs
// - Every restart, OOM kill, or deploy is therefore likely to catch
//   at least one operation inside its window
// - At scale, "rare" becomes "constant"

// THERE IS NO CODE-LEVEL FIX
//
// You cannot catch a crash.
// You cannot wrap process death in try/catch.
// You cannot "finally" after the kernel kills you.
//
// The only solution is to not have a crash window.
// The Outbox Pattern eliminates the crash window entirely.
```

Failure Mode 3: Ghost Events (Rollback After Publish)
The message is published, but the database transaction later fails.
```typescript
// FAILURE MODE 3: GHOST EVENTS

async function createOrder(request: OrderRequest): Promise<Order> {
  // Using database transaction
  return await database.transaction(async (tx) => {
    // T1: Insert into database (NOT YET COMMITTED)
    const order = await tx.orders.create(request);

    // T2: Publish event (while transaction still open)
    // DANGER: This event references a row that might not exist!
    await messageBroker.publish('OrderCreated', {
      orderId: order.id,
      customerId: order.customerId
    });
    // ✓ Event now in message broker

    // T3: Insert order items
    await tx.orderItems.createMany(request.items);

    // T4: Trigger FAILURE
    // - Unique constraint violation
    // - Foreign key check fails
    // - Serialization conflict (SSI)
    // - Explicit rollback from business logic

    // TRANSACTION ROLLS BACK
    // - Order row is GONE
    // - Order items are GONE
    //
    // But the event is ALREADY IN KAFKA
    // Consumers will try to process an order that DOESN'T EXIST
  });
}

// CONSEQUENCES OF GHOST EVENTS:
//
// 1. Consumer queries order by ID → 404 Not Found
//    - Is this a timing issue? Should I retry?
//    - Is the order actually missing?
//    - No way to distinguish from race condition
//
// 2. Consumer processes ghost order → Corrupted state
//    - Inventory decremented for non-existent order
//    - Payment captured for phantom order
//    - Shipping label created for nothing
//
// 3. Debugging nightmare
//    - "Why is there a payment for order X?"
//    - "Order X doesn't exist in the database"
//    - "But here's the event showing it was created!"
//
// Ghost events are WORSE than missing events
// because they actively corrupt downstream systems.

// "FIX" ATTEMPT: Publish after commit
async function createOrderPublishAfterCommit(request: OrderRequest): Promise<Order> {
  let order: Order;

  // Transaction for database only
  order = await database.transaction(async (tx) => {
    const createdOrder = await tx.orders.create(request);
    await tx.orderItems.createMany(request.items);
    return createdOrder;
  });
  // Transaction committed ✓
  // Order definitely exists ✓

  // Now publish (outside transaction)
  await messageBroker.publish('OrderCreated', order);
  // But now we're back to Failure Mode 1 and 2!
  // If this fails or crashes, order exists without event.

  return order;
}

// There is no sequence that solves this with dual writes.
```

| Failure Mode | Trigger | Result | Probability |
|---|---|---|---|
| Second Write Fails | Network error, broker down | Data without event | Common (1-5% of operations during incidents) |
| Process Crash | OOM, kill signal, hardware | Data without event | Rare individually, guaranteed at scale |
| Transaction Rollback | Constraint violation, conflict | Event without data (ghost) | Depends on schema/concurrency |
| Network Partition | Split-brain scenario | Inconsistent state | Rare but catastrophic |
| Out-of-Order | Retry publishes wrong sequence | Corrupted event stream | Common with naive retries |
There is no sequence of operations across two independent systems that guarantees atomicity. This is not an implementation problem; it is a theoretical impossibility without distributed transactions, which message brokers don't support.
Engineers encountering the dual-write problem often propose various 'fixes.' Let's examine why each fails to solve the fundamental issue.
```typescript
// ATTEMPTED FIX 1: RETRY LOGIC
// "Just retry the publish until it succeeds"

async function createOrderWithRetry(request: OrderRequest): Promise<Order> {
  const order = await database.orders.create(request);
  // Order committed ✓

  // Retry publish with exponential backoff
  await retryWithBackoff(async () => {
    await messageBroker.publish('OrderCreated', order);
  }, {
    maxRetries: 10,
    initialDelayMs: 100,
    maxDelayMs: 30000
  });

  return order;
}

// WHY THIS DOESN'T WORK:
//
// 1. CRASH DURING RETRY
//    What if the process crashes during the retry loop?
//    The order exists, but no one is retrying the publish.
//    It's lost forever.
//
// 2. PROLONGED OUTAGE
//    What if the broker is down for 2 hours?
//    You can't hold the HTTP response open that long.
//    User gets timeout, retries, creates duplicate order.
//
// 3. RETRY STORM
//    During broker outage, every operation is retrying.
//    Thousands of threads waiting, consuming resources.
//    When broker comes back: thundering herd.
//
// 4. IDEMPOTENCY NIGHTMARE
//    If broker connection drops AFTER it received message
//    but BEFORE it sends ACK, you don't know if it worked.
//    You retry → duplicate event.
//
// 5. BLOCKING RESPONSE
//    User waits 30+ seconds for retries.
//    Terrible user experience.
//
// Retries don't fix the crash window problem.
// They just make it less frequent.
```
```typescript
// ATTEMPTED FIX 2: IN-MEMORY RETRY QUEUE
// "Store failed publishes in memory, retry in background"

class FailedEventQueue {
  private queue: FailedEvent[] = [];

  add(event: Event): void {
    this.queue.push({
      event,
      attempts: 0,
      firstFailedAt: new Date()
    });
  }

  async startBackgroundRetry(): Promise<void> {
    setInterval(async () => {
      const batch = this.queue.splice(0, 100);
      for (const item of batch) {
        try {
          await messageBroker.publish(item.event);
        } catch (e) {
          this.queue.push({ ...item, attempts: item.attempts + 1 });
        }
      }
    }, 1000);
  }
}

// Usage
async function createOrder(request: OrderRequest): Promise<Order> {
  const order = await database.orders.create(request);

  try {
    await messageBroker.publish('OrderCreated', order);
  } catch (e) {
    // Queue for background retry
    failedEventQueue.add({ type: 'OrderCreated', data: order });
  }

  return order; // Return immediately
}

// WHY THIS DOESN'T WORK:
//
// 1. IN-MEMORY = VOLATILE
//    Process restarts? Queue is gone.
//    All those pending events? Lost.
//    You've traded one problem for a worse one.
//
// 2. PROCESS CRASH
//    Still doesn't help. If crash happens after
//    database.create() but before queue.add(),
//    the event is lost.
//
// 3. MULTI-INSTANCE PROBLEM
//    With multiple service instances, which one
//    owns the retry queue? How do they coordinate?
//    Race conditions everywhere.
//
// 4. NO DURABILITY
//    "But I'll persist the queue to disk!"
//    Congratulations, you've invented a worse database.
//    Now you have THREE systems to coordinate.
//
// 5. ORDERING LOST
//    Events retry in arbitrary order.
//    OrderCreated might arrive after OrderShipped.
//
// This is the Outbox Pattern with worse durability.
```
```typescript
// ATTEMPTED FIX 3: COMPENSATING TRANSACTION
// "If publish fails, delete the order"

async function createOrder(request: OrderRequest): Promise<Order> {
  const order = await database.orders.create(request);

  try {
    await messageBroker.publish('OrderCreated', order);
  } catch (e) {
    // Compensate: undo the database write
    await database.orders.delete({ where: { id: order.id } });
    throw new Error('Failed to create order: could not publish event');
  }

  return order;
}

// WHY THIS DOESN'T WORK:
//
// 1. DELETE CAN FAIL TOO
//    Network error on publish → network error on delete.
//    Now order exists without event AND delete didn't work.
//    You're in a worse state.
//
// 2. CRASH BEFORE DELETE
//    Publish fails, delete starts, process crashes.
//    Order still exists, no event, no one knows.
//
// 3. ANOTHER PROCESS USED THE ORDER
//    Between create and compensating delete:
//    - Another request read the order
//    - A scheduled job processed it
//    - A foreign key reference was created
//    Now delete fails due to FK violation.
//
// 4. USER SAW THE ORDER
//    Between create and delete (milliseconds!):
//    - User's dashboard refreshed
//    - They saw order #12345
//    - You deleted it
//    "Where did my order go?!"
//
// 5. COMPENSATION IS HARD
//    What if creating the order triggered:
//    - Email confirmation sent
//    - Inventory reserved
//    - Credit card pre-authorized
//    How do you undo all of that?
//
// Compensating transactions create more problems than they solve.
// And they STILL don't fix the crash window.
```
```typescript
// ATTEMPTED FIX 4: PUBLISH FIRST
// "Publish the event first, then write to database"

async function createOrder(request: OrderRequest): Promise<Order> {
  const orderId = uuid(); // Pre-generate ID

  // Step 1: Publish event first
  await messageBroker.publish('OrderCreated', {
    orderId: orderId,
    customerId: request.customerId,
    items: request.items
  });
  // Event in broker ✓

  // Step 2: Now write to database
  const order = await database.orders.create({
    id: orderId,
    ...request
  });

  return order;
}

// WHY THIS DOESN'T WORK:
//
// 1. SAME PROBLEM, REVERSED
//    If database write fails after publish:
//    - Event in broker references non-existent order
//    - Ghost event (Failure Mode 3)
//
// 2. CONSUMERS RACE AHEAD
//    Event published → consumers start processing
//    Consumer queries order → 404, order doesn't exist yet
//    Is this a timing issue or a real error?
//
// 3. VALIDATION REVERSED
//    Database constraints caught on write:
//    - Unique email already exists
//    - Customer doesn't exist
//    - Credit limit exceeded
//    Event already published for invalid order.
//
// 4. NO ATOMICITY EITHER DIRECTION
//    "Publish first" doesn't create atomicity.
//    It just changes which failure mode you get.
//
// The order of operations doesn't matter.
// The problem is having TWO systems to update.
```

Notice that every attempted fix either: (1) doesn't address the crash window, (2) creates new failure modes, or (3) partially reinvents the Outbox Pattern. There is no code-level fix for the dual-write problem. You need an architectural change.
The dual-write problem isn't just practically hard—it's theoretically impossible to solve without specific mechanisms. Understanding why helps clarify the solution.
The Consensus Problem
At its core, a dual write is asking: "Can two independent systems agree that an operation happened?" This is a variant of the distributed consensus problem.
The FLP Impossibility Result (Fischer, Lynch, Paterson, 1985) proves that in an asynchronous distributed system where even one process can fail, no deterministic algorithm can guarantee consensus: there is no way to ensure that all correct processes agree on a value in bounded time if any process might crash.
Translated to our problem: There is no algorithm that guarantees both the database and message broker agree an operation occurred, if either system or the network between them might fail.
Why Databases 'Solve' This
Databases achieve atomicity through specific mechanisms: a single write-ahead log that records every change, one commit point decided by the engine itself, and crash recovery that replays or discards that log on restart.
These mechanisms are internal to the database. The database controls every component.
Why Dual Writes Can't Have This
With dual writes, the database and the message broker each keep their own log, their own commit point, and their own recovery process. There's no shared commit log, no coordinated commit protocol, no unified crash recovery.
```typescript
// TWO-PHASE COMMIT: THE "TEXTBOOK" SOLUTION
//
// Distributed transactions using 2PC coordinate commits across systems.
// In theory, this solves dual writes. In practice, it's rarely viable.

// How 2PC works (conceptually):
//
// PHASE 1: PREPARE
//   Coordinator → Database: "Can you commit transaction X?"
//   Coordinator → Broker:   "Can you commit message Y?"
//   Database → Coordinator: "Yes, I'm prepared"
//   Broker → Coordinator:   "Yes, I'm prepared"
//
// PHASE 2: COMMIT
//   Coordinator → Database: "Commit transaction X"
//   Coordinator → Broker:   "Commit message Y"
//   Database → Coordinator: "Done"
//   Broker → Coordinator:   "Done"
//
// Both systems commit only when both are prepared.
// If either fails, both roll back.

// WHY 2PC DOESN'T WORK FOR THE OUTBOX USE CASE:
//
// 1. MESSAGE BROKERS DON'T SUPPORT IT
//    - Kafka: No XA/2PC support
//    - RabbitMQ: No XA/2PC support
//    - AWS SQS: No XA/2PC support
//    - Google Pub/Sub: No XA/2PC support
//
//    The systems we want to coordinate don't implement the protocol.
//
// 2. BLOCKING PROTOCOL
//    During the prepare phase, all participants hold locks.
//    If the coordinator crashes, locks are held indefinitely.
//    The system becomes unavailable.
//
// 3. PERFORMANCE
//    2PC requires multiple network round-trips:
//    - Client → Coordinator
//    - Coordinator → All participants (prepare)
//    - All participants → Coordinator (prepared)
//    - Coordinator → All participants (commit)
//    - All participants → Coordinator (committed)
//
//    Minimum 4 round-trips for every operation.
//    At 1ms per hop, that's 4ms minimum latency added.
//
// 4. AVAILABILITY
//    If ANY participant is unreachable, the entire operation fails.
//    A single point of failure multiplied by the participant count.
//
// 5. HETEROGENEOUS SYSTEMS
//    Even if the broker supported 2PC, it would need to be the
//    same IMPLEMENTATION (the XA standard has variations).
//    Different vendors, different interpretations.
//
// 2PC is used for homogeneous database-to-database transactions.
// It's not practical for database-to-message-broker coordination.
```

The Outbox Solution: Local Commit, Eventual Delivery
The Outbox Pattern sidesteps the distributed consensus problem by reframing it:
INSTEAD OF:
"Atomically update two systems" (impossible)
WE DO:
"Atomically update one system" (trivial)
"Eventually propagate to second system" (reliable with at-least-once)
By writing both the business data and the event data to the same database in a single transaction, we reduce the problem to a LOCAL commit—something databases handle perfectly.
The event's presence in the outbox table IS the record of truth. It will eventually be published. The only question is when, not if.
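
The "eventually propagate" half is handled by a small relay process, referred to on this page as the publisher. The sketch below is a minimal polling variant, not the only possible implementation (change data capture is a common alternative); it assumes an outbox table with `id`, `eventType`, `payload`, and `publishedAt` columns and the same illustrative `db` and `messageBroker` clients used in the other examples on this page.

```typescript
// Minimal polling publisher (illustrative sketch, not a specific library).
// Assumes: outbox(id, eventType, payload, publishedAt) plus the pseudo
// `db` / `messageBroker` clients used throughout this page.

async function publishPendingOutboxEvents(batchSize = 100): Promise<void> {
  // 1. Read committed-but-unpublished events, oldest first,
  //    so per-run publish order matches insertion order.
  const pending = await db.outbox.findMany({
    where: { publishedAt: null },
    orderBy: { id: 'asc' },
    take: batchSize
  });

  for (const event of pending) {
    // 2. Publish to the broker. If this throws, the row stays
    //    unpublished and the next run retries it.
    await messageBroker.publish(event.eventType, JSON.parse(event.payload));

    // 3. Mark as published only AFTER the broker accepts it.
    //    A crash between steps 2 and 3 causes a duplicate publish:
    //    exactly the at-least-once behavior described below.
    await db.outbox.update({
      where: { id: event.id },
      data: { publishedAt: new Date() }
    });
  }
}

// Run forever with a short delay between polls.
async function runPublisherLoop(): Promise<void> {
  while (true) {
    try {
      await publishPendingOutboxEvents();
    } catch (e) {
      // Broker or database unavailable: nothing is lost,
      // the unpublished rows simply wait for the next iteration.
      console.error('Outbox publish failed, will retry', e);
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}
```

If several publisher instances run concurrently, they either need row-level coordination (for example, a `SELECT ... FOR UPDATE SKIP LOCKED` style query) or a single-leader setup; if they don't coordinate, the only consequence is more duplicates, which idempotent consumers already tolerate.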
Let's trace through each failure mode and see how the Outbox Pattern handles it.
```typescript
// OUTBOX PATTERN: FAILURE MODE ANALYSIS

async function createOrder(request: OrderRequest): Promise<Order> {
  return await database.transaction(async (tx) => {
    // Step 1: Create order
    const order = await tx.orders.create(request);

    // Step 2: Create outbox event (SAME TRANSACTION)
    await tx.outbox.create({
      aggregateType: 'Order',
      aggregateId: order.id,
      eventType: 'OrderCreated',
      payload: JSON.stringify({ orderId: order.id, ...order })
    });

    return order;
  });
  // Transaction commits here
  // Both order AND event exist, or neither exists
}

// ===================================================
// FAILURE MODE 1: "Second system fails"
// ===================================================
//
// Q: What if the outbox insert fails?
// A: The entire transaction rolls back. No order, no event.
//    Consistent state. User gets error, can retry.
//
// The "second system" (outbox) is the SAME system (database).
// Same transaction = atomic. Both succeed or both fail.

// ===================================================
// FAILURE MODE 2: "Process crash"
// ===================================================
//
// SCENARIO A: Crash BEFORE transaction commit
// - Transaction never commits
// - Database rollback (automatic on connection loss)
// - No order, no event
// - Consistent state ✓
//
// SCENARIO B: Crash AFTER transaction commit
// - Order and outbox event both committed
// - On recovery, outbox event is still there
// - Publisher will find it and publish
// - Consistent state ✓
//
// There is NO crash window. Either the transaction commits
// (both exist) or it doesn't (neither exists).

// ===================================================
// FAILURE MODE 3: "Ghost events (rollback after publish)"
// ===================================================
//
// Q: What if we publish before commit?
// A: We don't! The Outbox Pattern NEVER publishes before commit.
//
// Sequence:
// 1. Transaction commits (order + outbox event)
// 2. Only THEN does publisher read and publish
//
// It's impossible to publish an event for data that
// might roll back, because publishing happens AFTER commit.

// ===================================================
// FAILURE MODE NEW: "Publisher fails"
// ===================================================
//
// Q: What if the publisher can't publish?
// A: Event stays in outbox. Publisher will retry later.
//
// - Event is durably stored in database
// - Publisher is stateless; can restart
// - On restart, queries for unpublished events
// - Retries until successful
//
// Worst case: events delayed. Never lost.

// ===================================================
// FAILURE MODE NEW: "Publisher publishes but crashes
//                    before marking published"
// ===================================================
//
// Q: Event gets published, then publisher crashes before
//    UPDATE outbox SET published_at = NOW()
// A: Event will be published again (duplicate).
//
// This is expected! Outbox provides AT-LEAST-ONCE delivery.
// Consumers MUST be idempotent.
//
// But: data is never lost, never orphaned.
// Duplicates are handleable. Missing events are not.
```

The Outbox Pattern provides at-least-once delivery: events will definitely be delivered, but possibly more than once. This shifts responsibility for handling duplicates to the consumer.
This isn't a bug—it's a feature. At-least-once is the strongest delivery guarantee practical for most systems. Exactly-once requires distributed transactions, which (as we discussed) have prohibitive costs.
```typescript
// IDEMPOTENT CONSUMER PATTERNS

// PATTERN 1: Event ID Tracking
// Store processed event IDs; skip if already seen

class OrderNotificationConsumer {
  async handleOrderCreated(event: OrderCreatedEvent): Promise<void> {
    // Check if we've processed this event
    const existing = await this.db.processedEvents.findFirst({
      where: { eventId: event.eventId }
    });

    if (existing) {
      console.log(`Event ${event.eventId} already processed, skipping`);
      return; // Idempotent: no-op on duplicate
    }

    // Process in transaction with event recording
    await this.db.transaction(async (tx) => {
      // Do the work
      await tx.notifications.create({
        type: 'ORDER_CONFIRMATION',
        orderId: event.payload.orderId,
        customerId: event.payload.customerId,
        sentAt: new Date()
      });

      // Record that we processed this event
      await tx.processedEvents.create({
        eventId: event.eventId,
        processedAt: new Date()
      });
    });
  }
}

// PATTERN 2: Natural Idempotency Keys
// Use business identifiers instead of event IDs

class InventoryConsumer {
  async handleOrderCreated(event: OrderCreatedEvent): Promise<void> {
    for (const item of event.payload.items) {
      // UPSERT instead of INSERT
      // If reservation for this order+product exists, update it
      await this.db.inventoryReservations.upsert({
        where: {
          orderId_productId: {
            orderId: event.payload.orderId,
            productId: item.productId
          }
        },
        create: {
          orderId: event.payload.orderId,
          productId: item.productId,
          quantity: item.quantity,
          reservedAt: new Date()
        },
        update: {
          quantity: item.quantity, // Update to same value = no-op
          reservedAt: new Date()
        }
      });
    }
    // Even if event processed twice, state converges to same result
  }
}

// PATTERN 3: Conditional Updates
// Only apply change if preconditions match

class OrderStatusConsumer {
  async handleOrderShipped(event: OrderShippedEvent): Promise<void> {
    // Only update if order is in expected state
    const result = await this.db.orders.updateMany({
      where: {
        id: event.payload.orderId,
        status: 'CONFIRMED' // Only if still confirmed
      },
      data: {
        status: 'SHIPPED',
        shippedAt: new Date(),
        trackingNumber: event.payload.trackingNumber
      }
    });

    if (result.count === 0) {
      // Either order doesn't exist, or already shipped
      console.log(`Order ${event.payload.orderId} not in CONFIRMED state`);
      // Still idempotent - we don't error on duplicate
    }
  }
}

// PATTERN 4: Version-Based Optimistic Locking
// Reject updates that don't increment version correctly

class AccountBalanceConsumer {
  async handlePaymentReceived(event: PaymentReceivedEvent): Promise<void> {
    const result = await this.db.accounts.updateMany({
      where: {
        id: event.payload.accountId,
        version: event.payload.expectedVersion // Must match exactly
      },
      data: {
        balance: { increment: event.payload.amount },
        version: { increment: 1 },
        lastUpdated: new Date()
      }
    });

    if (result.count === 0) {
      // Version mismatch - either:
      // 1. Duplicate event (version already incremented)
      // 2. Concurrent modification (need to re-read and retry)
      const current = await this.db.accounts.findUnique({
        where: { id: event.payload.accountId }
      });

      if (current.version > event.payload.expectedVersion) {
        // Already processed - idempotent no-op
        return;
      } else {
        // Concurrent modification - throw to trigger retry
        throw new ConcurrentModificationError();
      }
    }
  }
}
```

| Pattern | Trade-offs | Best For |
|---|---|---|
| Event ID Tracking | Extra storage; cleanup needed | General purpose; works for any event |
| Natural Keys | Requires suitable business keys | CRUD operations with natural identifiers |
| Conditional Updates | Extra query; potential races | State machine transitions |
| Version Locking | Requires version tracking | Concurrent updates; optimistic control |
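
The table notes that event ID tracking needs cleanup. A minimal sketch of such a retention job is shown below; it reuses the `processedEvents` table from Pattern 1, and the 30-day window is an assumption, not a recommendation from this page.

```typescript
// Illustrative retention job for the processedEvents table from Pattern 1.
// The retention window must comfortably exceed the longest time a duplicate
// could plausibly arrive (broker retention, replays, outbox re-publishing).
async function pruneProcessedEvents(retentionDays = 30): Promise<void> {
  const cutoff = new Date(Date.now() - retentionDays * 24 * 60 * 60 * 1000);

  // Delete dedupe records older than the cutoff; events older than this
  // can no longer be deduplicated by this consumer.
  const result = await db.processedEvents.deleteMany({
    where: { processedAt: { lt: cutoff } }
  });

  console.log(`Pruned ${result.count} processed-event records`);
}
```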
With the Outbox Pattern, duplicate events WILL occur—during crashes, retries, and reprocessing. Every consumer MUST be idempotent. Design for duplicates from day one; retrofitting idempotency is painful.
The dual-write problem is not a bug to fix—it's a fundamental constraint of distributed systems. Attempting to write to two independent systems atomically without coordination protocols is theoretically impossible.
You now have a deep understanding of why dual writes fail and why the Outbox Pattern is the correct architectural solution. In the final page, we'll explore practical implementation patterns for different technologies and frameworks, bringing everything together into production-ready code.