Consider the most common operation in distributed systems: save something to a database, then notify other systems about it. It seems trivially simple. Write to database. Send message. Done.
```typescript
const order = await database.orders.create(orderData);
await messageQueue.publish('order.created', order);
```
Two lines of code. What could go wrong?
Everything.
This innocuous pattern is the source of more data corruption, lost transactions, and debugging nightmares than almost any other in distributed systems. It's called a dual write, and understanding exactly why it fails—and why clever workarounds don't work—is essential for any engineer building distributed systems.
This page dissects the dual-write problem with surgical precision, examining every failure mode and demonstrating why the Outbox Pattern is the principled solution.
By the end of this page, you will understand every way dual writes can fail, why common fixes don't work, the theoretical foundations explaining why this problem is hard, and how the Outbox Pattern's 'write locally, publish later' approach provides the correct abstraction.
A dual write occurs whenever a single logical operation must update two independent systems. Let's systematically catalog every way this can fail.
```typescript
// THE DUAL-WRITE ANTI-PATTERN
//
// Any variation of: "update system A, then update system B"
// without a coordinating transaction is a dual write.

async function createOrder(request: OrderRequest): Promise<Order> {
  // SYSTEM A: Database
  const order = await database.orders.create({
    customerId: request.customerId,
    items: request.items,
    total: calculateTotal(request.items),
    status: 'CREATED'
  });

  // SYSTEM B: Message Broker
  await messageBroker.publish('OrderCreated', {
    orderId: order.id,
    customerId: order.customerId,
    total: order.total
  });

  return order;
}

// Common variations of dual writes:
//
// 1. Database + Message Broker (most common)
// 2. Database + Cache (Redis, Memcached)
// 3. Database + Search Index (Elasticsearch)
// 4. Database + External API (payment processor, etc.)
// 5. Two Databases (cross-service writes)
// 6. Database + File System
//
// ALL of these are dual writes and suffer the same problems.
```

Failure Mode 1: Second System Failure
The most obvious failure: the second operation fails after the first succeeds.
```typescript
// FAILURE MODE 1: SECOND SYSTEM FAILS

async function createOrder(request: OrderRequest): Promise<Order> {
  // T1: Database write SUCCEEDS
  const order = await database.orders.create(request);
  // Order now exists in database ✓

  // T2: Message broker FAILS
  // - Network timeout
  // - Broker unavailable
  // - Quota exceeded
  // - Any of 100 possible failures
  await messageBroker.publish('OrderCreated', order); // ❌ Throws exception

  // Exception propagates up
  // What happens to the order?
  // IT'S STILL IN THE DATABASE.
}

// CONSEQUENCES:
//
// 1. Order exists without corresponding event
// 2. Downstream systems never learn about order
// 3. Inventory not reserved (if inventory service consumes events)
// 4. Payment not processed (if payment service consumes events)
// 5. Customer confirmation email not sent
// 6. Analytics data missing this order
//
// The order is an "orphan" - visible in UI but not processed

// "FIX" ATTEMPT: Retry the publish
async function createOrderWithRetry(request: OrderRequest): Promise<Order> {
  const order = await database.orders.create(request);

  let published = false;
  for (let i = 0; i < 3; i++) {
    try {
      await messageBroker.publish('OrderCreated', order);
      published = true;
      break;
    } catch (e) {
      await sleep(exponentialBackoff(i));
    }
  }

  if (!published) {
    // NOW WHAT?
    // Option A: Throw error
    //   → Order still exists! User sees "error" but order is placed.
    //   → If they retry, they create a DUPLICATE order.
    //
    // Option B: Delete the order
    //   → What if delete fails too?
    //   → What if another process is already using the order?
    //   → What about the items we already created?
    //
    // Option C: Mark order as "pending_publish"
    //   → Who retries it? When? How?
    //   → You've just reinvented the outbox pattern, poorly.
    throw new Error('Failed to publish event');
  }

  return order;
}
```

Failure Mode 2: Process Crash
The operation is interrupted by a crash between the two operations.
```typescript
// FAILURE MODE 2: PROCESS CRASH

async function createOrder(request: OrderRequest): Promise<Order> {
  // T1: Database write SUCCEEDS
  const order = await database.orders.create(request);
  // Order now committed to database ✓

  // T2: PROCESS CRASH
  // - Out of memory (OOM kill)
  // - Container killed (Kubernetes pod termination)
  // - Hardware failure
  // - Unhandled exception in unrelated code
  // - Deployment/restart during this window

  // This line is NEVER REACHED:
  await messageBroker.publish('OrderCreated', order);
}

// The window between database commit and publish
// is called the "danger zone" or "crash window"
//
// Even if this window is only milliseconds,
// multiplied by millions of operations,
// you WILL hit this case.
//
// Rule of thumb:
// - A 1ms window × 1,000 ops/sec adds up to a full second of
//   cumulative crash-window exposure for every second the service runs
// - Every restart, OOM kill, or deploy is therefore likely to catch
//   at least one operation inside its window
// - At scale, "rare" becomes "constant"

// THERE IS NO CODE-LEVEL FIX
//
// You cannot catch a crash.
// You cannot wrap process death in try/catch.
// You cannot "finally" after the kernel kills you.
//
// The only solution is to not have a crash window.
// The Outbox Pattern eliminates the crash window entirely.
```

Failure Mode 3: Ghost Events (Rollback After Publish)
The message is published, but the database transaction later fails.
```typescript
// FAILURE MODE 3: GHOST EVENTS

async function createOrder(request: OrderRequest): Promise<Order> {
  // Using database transaction
  return await database.transaction(async (tx) => {
    // T1: Insert into database (NOT YET COMMITTED)
    const order = await tx.orders.create(request);

    // T2: Publish event (while transaction still open)
    // DANGER: This event references a row that might not exist!
    await messageBroker.publish('OrderCreated', {
      orderId: order.id,
      customerId: order.customerId
    });
    // ✓ Event now in message broker

    // T3: Insert order items
    await tx.orderItems.createMany(request.items);

    // T4: Trigger FAILURE
    // - Unique constraint violation
    // - Foreign key check fails
    // - Serialization conflict (SSI)
    // - Explicit rollback from business logic

    // TRANSACTION ROLLS BACK
    // - Order row is GONE
    // - Order items are GONE
    //
    // But the event is ALREADY IN KAFKA
    // Consumers will try to process an order that DOESN'T EXIST
  });
}

// CONSEQUENCES OF GHOST EVENTS:
//
// 1. Consumer queries order by ID → 404 Not Found
//    - Is this a timing issue? Should I retry?
//    - Is the order actually missing?
//    - No way to distinguish from race condition
//
// 2. Consumer processes ghost order → Corrupted state
//    - Inventory decremented for non-existent order
//    - Payment captured for phantom order
//    - Shipping label created for nothing
//
// 3. Debugging nightmare
//    - "Why is there a payment for order X?"
//    - "Order X doesn't exist in the database"
//    - "But here's the event showing it was created!"
//
// Ghost events are WORSE than missing events
// because they actively corrupt downstream systems.

// "FIX" ATTEMPT: Publish after commit
async function createOrderPublishAfterCommit(request: OrderRequest): Promise<Order> {
  let order: Order;

  // Transaction for database only
  order = await database.transaction(async (tx) => {
    const createdOrder = await tx.orders.create(request);
    await tx.orderItems.createMany(request.items);
    return createdOrder;
  });
  // Transaction committed ✓
  // Order definitely exists ✓

  // Now publish (outside transaction)
  await messageBroker.publish('OrderCreated', order);
  // But now we're back to Failure Mode 1 and 2!
  // If this fails or crashes, order exists without event.

  return order;
}

// There is no sequence that solves this with dual writes.
```

| Failure Mode | Trigger | Result | Probability |
|---|---|---|---|
| Second Write Fails | Network error, broker down | Data without event | Common (1-5% of operations during incidents) |
| Process Crash | OOM, kill signal, hardware | Data without event | Rare individually, guaranteed at scale |
| Transaction Rollback | Constraint violation, conflict | Event without data (ghost) | Depends on schema/concurrency |
| Network Partition | Split-brain scenario | Inconsistent state | Rare but catastrophic |
| Out-of-Order | Retry publishes wrong sequence | Corrupted event stream | Common with naive retries |
There is no sequence of operations across two independent systems that guarantees atomicity. This is not an implementation problem; it is a theoretical impossibility without distributed transactions, which message brokers don't support.
Engineers encountering the dual-write problem often propose various 'fixes.' Let's examine why each fails to solve the fundamental issue.
```typescript
// ATTEMPTED FIX 1: RETRY LOGIC
// "Just retry the publish until it succeeds"

async function createOrderWithRetry(request: OrderRequest): Promise<Order> {
  const order = await database.orders.create(request);
  // Order committed ✓

  // Retry publish with exponential backoff
  await retryWithBackoff(async () => {
    await messageBroker.publish('OrderCreated', order);
  }, {
    maxRetries: 10,
    initialDelayMs: 100,
    maxDelayMs: 30000
  });

  return order;
}

// WHY THIS DOESN'T WORK:
//
// 1. CRASH DURING RETRY
//    What if the process crashes during the retry loop?
//    The order exists, but no one is retrying the publish.
//    It's lost forever.
//
// 2. PROLONGED OUTAGE
//    What if the broker is down for 2 hours?
//    You can't hold the HTTP response open that long.
//    User gets timeout, retries, creates duplicate order.
//
// 3. RETRY STORM
//    During broker outage, every operation is retrying.
//    Thousands of threads waiting, consuming resources.
//    When broker comes back: thundering herd.
//
// 4. IDEMPOTENCY NIGHTMARE
//    If broker connection drops AFTER it received message
//    but BEFORE it sends ACK, you don't know if it worked.
//    You retry → duplicate event.
//
// 5. BLOCKING RESPONSE
//    User waits 30+ seconds for retries.
//    Terrible user experience.
//
// Retries don't fix the crash window problem.
// They just make it less frequent.
```
```typescript
// ATTEMPTED FIX 2: IN-MEMORY RETRY QUEUE
// "Store failed publishes in memory, retry in background"

class FailedEventQueue {
  private queue: FailedEvent[] = [];

  add(event: Event): void {
    this.queue.push({
      event,
      attempts: 0,
      firstFailedAt: new Date()
    });
  }

  async startBackgroundRetry(): Promise<void> {
    setInterval(async () => {
      const batch = this.queue.splice(0, 100);
      for (const item of batch) {
        try {
          await messageBroker.publish(item.event);
        } catch (e) {
          this.queue.push({ ...item, attempts: item.attempts + 1 });
        }
      }
    }, 1000);
  }
}

// Usage
async function createOrder(request: OrderRequest): Promise<Order> {
  const order = await database.orders.create(request);

  try {
    await messageBroker.publish('OrderCreated', order);
  } catch (e) {
    // Queue for background retry
    failedEventQueue.add({ type: 'OrderCreated', data: order });
  }

  return order; // Return immediately
}

// WHY THIS DOESN'T WORK:
//
// 1. IN-MEMORY = VOLATILE
//    Process restarts? Queue is gone.
//    All those pending events? Lost.
//    You've traded one problem for a worse one.
//
// 2. PROCESS CRASH
//    Still doesn't help. If crash happens after
//    database.create() but before queue.add(),
//    the event is lost.
//
// 3. MULTI-INSTANCE PROBLEM
//    With multiple service instances, which one
//    owns the retry queue? How do they coordinate?
//    Race conditions everywhere.
//
// 4. NO DURABILITY
//    "But I'll persist the queue to disk!"
//    Congratulations, you've invented a worse database.
//    Now you have THREE systems to coordinate.
//
// 5. ORDERING LOST
//    Events retry in arbitrary order.
//    OrderCreated might arrive after OrderShipped.
//
// This is the Outbox Pattern with worse durability.
```
```typescript
// ATTEMPTED FIX 3: COMPENSATING TRANSACTION
// "If publish fails, delete the order"

async function createOrder(request: OrderRequest): Promise<Order> {
  const order = await database.orders.create(request);

  try {
    await messageBroker.publish('OrderCreated', order);
  } catch (e) {
    // Compensate: undo the database write
    await database.orders.delete({ where: { id: order.id } });
    throw new Error('Failed to create order: could not publish event');
  }

  return order;
}

// WHY THIS DOESN'T WORK:
//
// 1. DELETE CAN FAIL TOO
//    Network error on publish → network error on delete.
//    Now order exists without event AND delete didn't work.
//    You're in a worse state.
//
// 2. CRASH BEFORE DELETE
//    Publish fails, delete starts, process crashes.
//    Order still exists, no event, no one knows.
//
// 3. ANOTHER PROCESS USED THE ORDER
//    Between create and compensating delete:
//    - Another request read the order
//    - A scheduled job processed it
//    - A foreign key reference was created
//    Now delete fails due to FK violation.
//
// 4. USER SAW THE ORDER
//    Between create and delete (milliseconds!):
//    - User's dashboard refreshed
//    - They saw order #12345
//    - You deleted it
//    "Where did my order go?!"
//
// 5. COMPENSATION IS HARD
//    What if creating the order triggered:
//    - Email confirmation sent
//    - Inventory reserved
//    - Credit card pre-authorized
//    How do you undo all of that?
//
// Compensating transactions create more problems than they solve.
// And they STILL don't fix the crash window.
```
```typescript
// ATTEMPTED FIX 4: PUBLISH FIRST
// "Publish the event first, then write to database"

async function createOrder(request: OrderRequest): Promise<Order> {
  const orderId = uuid(); // Pre-generate ID

  // Step 1: Publish event first
  await messageBroker.publish('OrderCreated', {
    orderId: orderId,
    customerId: request.customerId,
    items: request.items
  });
  // Event in broker ✓

  // Step 2: Now write to database
  const order = await database.orders.create({
    id: orderId,
    ...request
  });

  return order;
}

// WHY THIS DOESN'T WORK:
//
// 1. SAME PROBLEM, REVERSED
//    If database write fails after publish:
//    - Event in broker references non-existent order
//    - Ghost event (Failure Mode 3)
//
// 2. CONSUMERS RACE AHEAD
//    Event published → consumers start processing
//    Consumer queries order → 404, order doesn't exist yet
//    Is this a timing issue or a real error?
//
// 3. VALIDATION REVERSED
//    Database constraints caught on write:
//    - Unique email already exists
//    - Customer doesn't exist
//    - Credit limit exceeded
//    Event already published for invalid order.
//
// 4. NO ATOMICITY EITHER DIRECTION
//    "Publish first" doesn't create atomicity.
//    It just changes which failure mode you get.
//
// The order of operations doesn't matter.
// The problem is having TWO systems to update.
```

Notice that every attempted fix either: (1) doesn't address the crash window, (2) creates new failure modes, or (3) partially reinvents the Outbox Pattern. There is no code-level fix for the dual-write problem. You need an architectural change.
The dual-write problem isn't just practically hard—it's theoretically impossible to solve without specific mechanisms. Understanding why helps clarify the solution.
The Consensus Problem
At its core, a dual write is asking: "Can two independent systems agree that an operation happened?" This is a variant of the distributed consensus problem.
The FLP Impossibility Result (Fischer, Lynch, Paterson, 1985) proves that in an asynchronous distributed system where even one process can fail, no deterministic algorithm can guarantee consensus: there is no way to ensure that all correct processes agree on a value in bounded time if any process might crash.
Translated to our problem: There is no algorithm that guarantees both the database and message broker agree an operation occurred, if either system or the network between them might fail.
Why Databases 'Solve' This
Databases achieve atomicity through specific mechanisms: a single write-ahead log that records every change, one commit point decided by the engine itself, and crash recovery that replays or discards that log on restart.
These mechanisms are internal to the database. The database controls every component.
Why Dual Writes Can't Have This
With dual writes, the database and the message broker each keep their own log, their own commit point, and their own recovery process. There's no shared commit log, no coordinated commit protocol, no unified crash recovery.
```typescript
// TWO-PHASE COMMIT: THE "TEXTBOOK" SOLUTION
//
// Distributed transactions using 2PC coordinate commits across systems.
// In theory, this solves dual writes. In practice, it's rarely viable.

// How 2PC works (conceptually):
//
// PHASE 1: PREPARE
//   Coordinator → Database: "Can you commit transaction X?"
//   Coordinator → Broker:   "Can you commit message Y?"
//   Database → Coordinator: "Yes, I'm prepared"
//   Broker → Coordinator:   "Yes, I'm prepared"
//
// PHASE 2: COMMIT
//   Coordinator → Database: "Commit transaction X"
//   Coordinator → Broker:   "Commit message Y"
//   Database → Coordinator: "Done"
//   Broker → Coordinator:   "Done"
//
// Both systems commit only when both are prepared.
// If either fails, both roll back.

// WHY 2PC DOESN'T WORK FOR THE OUTBOX USE CASE:
//
// 1. MESSAGE BROKERS DON'T SUPPORT IT
//    - Kafka: No XA/2PC support
//    - RabbitMQ: No XA/2PC support
//    - AWS SQS: No XA/2PC support
//    - Google Pub/Sub: No XA/2PC support
//
//    The systems we want to coordinate don't implement the protocol.
//
// 2. BLOCKING PROTOCOL
//    During the prepare phase, all participants hold locks.
//    If the coordinator crashes, locks are held indefinitely.
//    The system becomes unavailable.
//
// 3. PERFORMANCE
//    2PC requires multiple network round-trips:
//    - Client → Coordinator
//    - Coordinator → All participants (prepare)
//    - All participants → Coordinator (prepared)
//    - Coordinator → All participants (commit)
//    - All participants → Coordinator (committed)
//
//    Minimum 4 round-trips for every operation.
//    At 1ms per hop, that's 4ms minimum latency added.
//
// 4. AVAILABILITY
//    If ANY participant is unreachable, the entire operation fails.
//    A single point of failure multiplied by the participant count.
//
// 5. HETEROGENEOUS SYSTEMS
//    Even if the broker supported 2PC, it would need to be the
//    same IMPLEMENTATION (the XA standard has variations).
//    Different vendors, different interpretations.
//
// 2PC is used for homogeneous database-to-database transactions.
// It's not practical for database-to-message-broker coordination.
```

The Outbox Solution: Local Commit, Eventual Delivery
The Outbox Pattern sidesteps the distributed consensus problem by reframing it:
INSTEAD OF:
"Atomically update two systems" (impossible)
WE DO:
"Atomically update one system" (trivial)
"Eventually propagate to second system" (reliable with at-least-once)
By writing both the business data and the event data to the same database in a single transaction, we reduce the problem to a LOCAL commit—something databases handle perfectly.
The event's presence in the outbox table IS the record of truth. It will eventually be published. The only question is when, not if.
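
The "eventually propagate" half is handled by a small relay process, referred to on this page as the publisher. The sketch below is a minimal polling variant, not the only possible implementation (change data capture is a common alternative); it assumes an outbox table with `id`, `eventType`, `payload`, and `publishedAt` columns and the same illustrative `db` and `messageBroker` clients used in the other examples on this page.

```typescript
// Minimal polling publisher (illustrative sketch, not a specific library).
// Assumes: outbox(id, eventType, payload, publishedAt) plus the pseudo
// `db` / `messageBroker` clients used throughout this page.

async function publishPendingOutboxEvents(batchSize = 100): Promise<void> {
  // 1. Read committed-but-unpublished events, oldest first,
  //    so per-run publish order matches insertion order.
  const pending = await db.outbox.findMany({
    where: { publishedAt: null },
    orderBy: { id: 'asc' },
    take: batchSize
  });

  for (const event of pending) {
    // 2. Publish to the broker. If this throws, the row stays
    //    unpublished and the next run retries it.
    await messageBroker.publish(event.eventType, JSON.parse(event.payload));

    // 3. Mark as published only AFTER the broker accepts it.
    //    A crash between steps 2 and 3 causes a duplicate publish:
    //    exactly the at-least-once behavior described below.
    await db.outbox.update({
      where: { id: event.id },
      data: { publishedAt: new Date() }
    });
  }
}

// Run forever with a short delay between polls.
async function runPublisherLoop(): Promise<void> {
  while (true) {
    try {
      await publishPendingOutboxEvents();
    } catch (e) {
      // Broker or database unavailable: nothing is lost,
      // the unpublished rows simply wait for the next iteration.
      console.error('Outbox publish failed, will retry', e);
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}
```

If several publisher instances run concurrently, they either need row-level coordination (for example, a `SELECT ... FOR UPDATE SKIP LOCKED` style query) or a single-leader setup; if they don't coordinate, the only consequence is more duplicates, which idempotent consumers already tolerate.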
Let's trace through each failure mode and see how the Outbox Pattern handles it.
```typescript
// OUTBOX PATTERN: FAILURE MODE ANALYSIS

async function createOrder(request: OrderRequest): Promise<Order> {
  return await database.transaction(async (tx) => {
    // Step 1: Create order
    const order = await tx.orders.create(request);

    // Step 2: Create outbox event (SAME TRANSACTION)
    await tx.outbox.create({
      aggregateType: 'Order',
      aggregateId: order.id,
      eventType: 'OrderCreated',
      payload: JSON.stringify({ orderId: order.id, ...order })
    });

    return order;
  });
  // Transaction commits here
  // Both order AND event exist, or neither exists
}

// ===================================================
// FAILURE MODE 1: "Second system fails"
// ===================================================
//
// Q: What if the outbox insert fails?
// A: The entire transaction rolls back. No order, no event.
//    Consistent state. User gets error, can retry.
//
// The "second system" (outbox) is the SAME system (database).
// Same transaction = atomic. Both succeed or both fail.

// ===================================================
// FAILURE MODE 2: "Process crash"
// ===================================================
//
// SCENARIO A: Crash BEFORE transaction commit
// - Transaction never commits
// - Database rollback (automatic on connection loss)
// - No order, no event
// - Consistent state ✓
//
// SCENARIO B: Crash AFTER transaction commit
// - Order and outbox event both committed
// - On recovery, outbox event is still there
// - Publisher will find it and publish
// - Consistent state ✓
//
// There is NO crash window. Either the transaction commits
// (both exist) or it doesn't (neither exists).

// ===================================================
// FAILURE MODE 3: "Ghost events (rollback after publish)"
// ===================================================
//
// Q: What if we publish before commit?
// A: We don't! The Outbox Pattern NEVER publishes before commit.
//
// Sequence:
// 1. Transaction commits (order + outbox event)
// 2. Only THEN does publisher read and publish
//
// It's impossible to publish an event for data that
// might roll back, because publishing happens AFTER commit.

// ===================================================
// FAILURE MODE NEW: "Publisher fails"
// ===================================================
//
// Q: What if the publisher can't publish?
// A: Event stays in outbox. Publisher will retry later.
//
// - Event is durably stored in database
// - Publisher is stateless; can restart
// - On restart, queries for unpublished events
// - Retries until successful
//
// Worst case: events delayed. Never lost.

// ===================================================
// FAILURE MODE NEW: "Publisher publishes but crashes
//                    before marking published"
// ===================================================
//
// Q: Event gets published, then publisher crashes before
//    UPDATE outbox SET published_at = NOW()
// A: Event will be published again (duplicate).
//
// This is expected! Outbox provides AT-LEAST-ONCE delivery.
// Consumers MUST be idempotent.
//
// But: data is never lost, never orphaned.
// Duplicates are handleable. Missing events are not.
```

The Outbox Pattern provides at-least-once delivery: events will definitely be delivered, but possibly more than once. This shifts responsibility for handling duplicates to the consumer.
This isn't a bug—it's a feature. At-least-once is the strongest delivery guarantee practical for most systems. Exactly-once requires distributed transactions, which (as we discussed) have prohibitive costs.
```typescript
// IDEMPOTENT CONSUMER PATTERNS

// PATTERN 1: Event ID Tracking
// Store processed event IDs; skip if already seen

class OrderNotificationConsumer {
  async handleOrderCreated(event: OrderCreatedEvent): Promise<void> {
    // Check if we've processed this event
    const existing = await this.db.processedEvents.findFirst({
      where: { eventId: event.eventId }
    });

    if (existing) {
      console.log(`Event ${event.eventId} already processed, skipping`);
      return; // Idempotent: no-op on duplicate
    }

    // Process in transaction with event recording
    await this.db.transaction(async (tx) => {
      // Do the work
      await tx.notifications.create({
        type: 'ORDER_CONFIRMATION',
        orderId: event.payload.orderId,
        customerId: event.payload.customerId,
        sentAt: new Date()
      });

      // Record that we processed this event
      await tx.processedEvents.create({
        eventId: event.eventId,
        processedAt: new Date()
      });
    });
  }
}

// PATTERN 2: Natural Idempotency Keys
// Use business identifiers instead of event IDs

class InventoryConsumer {
  async handleOrderCreated(event: OrderCreatedEvent): Promise<void> {
    for (const item of event.payload.items) {
      // UPSERT instead of INSERT
      // If reservation for this order+product exists, update it
      await this.db.inventoryReservations.upsert({
        where: {
          orderId_productId: {
            orderId: event.payload.orderId,
            productId: item.productId
          }
        },
        create: {
          orderId: event.payload.orderId,
          productId: item.productId,
          quantity: item.quantity,
          reservedAt: new Date()
        },
        update: {
          quantity: item.quantity, // Update to same value = no-op
          reservedAt: new Date()
        }
      });
    }
    // Even if event processed twice, state converges to same result
  }
}

// PATTERN 3: Conditional Updates
// Only apply change if preconditions match

class OrderStatusConsumer {
  async handleOrderShipped(event: OrderShippedEvent): Promise<void> {
    // Only update if order is in expected state
    const result = await this.db.orders.updateMany({
      where: {
        id: event.payload.orderId,
        status: 'CONFIRMED' // Only if still confirmed
      },
      data: {
        status: 'SHIPPED',
        shippedAt: new Date(),
        trackingNumber: event.payload.trackingNumber
      }
    });

    if (result.count === 0) {
      // Either order doesn't exist, or already shipped
      console.log(`Order ${event.payload.orderId} not in CONFIRMED state`);
      // Still idempotent - we don't error on duplicate
    }
  }
}

// PATTERN 4: Version-Based Optimistic Locking
// Reject updates that don't increment version correctly

class AccountBalanceConsumer {
  async handlePaymentReceived(event: PaymentReceivedEvent): Promise<void> {
    const result = await this.db.accounts.updateMany({
      where: {
        id: event.payload.accountId,
        version: event.payload.expectedVersion // Must match exactly
      },
      data: {
        balance: { increment: event.payload.amount },
        version: { increment: 1 },
        lastUpdated: new Date()
      }
    });

    if (result.count === 0) {
      // Version mismatch - either:
      // 1. Duplicate event (version already incremented)
      // 2. Concurrent modification (need to re-read and retry)
      const current = await this.db.accounts.findUnique({
        where: { id: event.payload.accountId }
      });

      if (current.version > event.payload.expectedVersion) {
        // Already processed - idempotent no-op
        return;
      } else {
        // Concurrent modification - throw to trigger retry
        throw new ConcurrentModificationError();
      }
    }
  }
}
```

| Pattern | Trade-offs | Best For |
|---|---|---|
| Event ID Tracking | Extra storage; cleanup needed | General purpose; works for any event |
| Natural Keys | Requires suitable business keys | CRUD operations with natural identifiers |
| Conditional Updates | Extra query; potential races | State machine transitions |
| Version Locking | Requires version tracking | Concurrent updates; optimistic control |
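
The table notes that event ID tracking needs cleanup. A minimal sketch of such a retention job is shown below; it reuses the `processedEvents` table from Pattern 1, and the 30-day window is an assumption, not a recommendation from this page.

```typescript
// Illustrative retention job for the processedEvents table from Pattern 1.
// The retention window must comfortably exceed the longest time a duplicate
// could plausibly arrive (broker retention, replays, outbox re-publishing).
async function pruneProcessedEvents(retentionDays = 30): Promise<void> {
  const cutoff = new Date(Date.now() - retentionDays * 24 * 60 * 60 * 1000);

  // Delete dedupe records older than the cutoff; events older than this
  // can no longer be deduplicated by this consumer.
  const result = await db.processedEvents.deleteMany({
    where: { processedAt: { lt: cutoff } }
  });

  console.log(`Pruned ${result.count} processed-event records`);
}
```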
With the Outbox Pattern, duplicate events WILL occur—during crashes, retries, and reprocessing. Every consumer MUST be idempotent. Design for duplicates from day one; retrofitting idempotency is painful.
The dual-write problem is not a bug to fix—it's a fundamental constraint of distributed systems. Attempting to write to two independent systems atomically without coordination protocols is theoretically impossible.
You now have a deep understanding of why dual writes fail and why the Outbox Pattern is the correct architectural solution. In the final page, we'll explore practical implementation patterns for different technologies and frameworks, bringing everything together into production-ready code.