In the world of distributed systems, some failures are loud and obvious—servers crash, databases time out, networks partition. But there's a category of failure that's far more insidious: data inconsistency caused by partial failures during multi-system updates. These failures often go undetected until they've corrupted significant amounts of business data.
Consider this scenario: An e-commerce service needs to:

1. Save a new order to its database
2. Publish an OrderCreated event to a message broker for downstream processing

Seems simple enough. But what happens when the database write succeeds and the message publish fails? Or when the publish succeeds but the database transaction rolls back? You're left with inconsistent state—the order exists in one system but not the other, or events describe orders that don't exist.
This problem is so common and so treacherous that it has a name: the dual-write problem. And it has destroyed more distributed systems than nearly any other failure mode.
By the end of this page, you will deeply understand the dual-write problem, why it cannot be solved through naive retry mechanisms, and how the Outbox Pattern provides a fundamental solution. This foundation is essential before exploring implementation strategies.
Event-driven architecture has become the dominant paradigm for building scalable, loosely coupled systems. Instead of synchronous request-response, services communicate by publishing and consuming events. This approach offers tremendous benefits: producers and consumers are decoupled, each service can scale independently, and a slow or unavailable consumer doesn't block the producer.
But there's a fundamental problem at the heart of event-driven architectures: How do you reliably publish events when the publishing operation involves multiple systems?
```typescript
// THE NAIVE APPROACH - DECEPTIVELY DANGEROUS

class OrderService {
  constructor(
    private database: Database,
    private messageQueue: MessageQueue
  ) {}

  async createOrder(orderRequest: CreateOrderRequest): Promise<Order> {
    // Step 1: Save to database
    const order = await this.database.orders.create({
      customerId: orderRequest.customerId,
      items: orderRequest.items,
      total: this.calculateTotal(orderRequest.items),
      status: 'CREATED'
    });

    // Step 2: Publish event to message broker
    await this.messageQueue.publish('order.created', {
      orderId: order.id,
      customerId: order.customerId,
      total: order.total,
      timestamp: new Date()
    });

    return order;
  }
}

// WHAT CAN GO WRONG?
//
// SCENARIO 1: Message publish fails after DB write
// ───────────────────────────────────────────────
// Timeline:
//   T1: database.create()      → SUCCESS (order saved)
//   T2: messageQueue.publish() → FAILURE (network timeout)
//   T3: Exception thrown, function returns error
//
// Result: Order exists in DB, but no event was published.
//         Downstream systems never learn about the order.
//         Customer is confused why the order isn't processing.
//
// SCENARIO 2: Process crashes between operations
// ───────────────────────────────────────────────
// Timeline:
//   T1: database.create() → SUCCESS (order saved)
//   T2: Process crashes (out of memory, killed, etc.)
//   T3: messageQueue.publish() → NEVER EXECUTED
//
// Result: Same as Scenario 1—orphaned order.
//
// SCENARIO 3: Message published but DB transaction fails
// ───────────────────────────────────────────────────────
// (If using a transaction that could still roll back)
// Timeline:
//   T1: database transaction begins
//   T2: order inserted (not yet committed)
//   T3: messageQueue.publish() → SUCCESS (event published!)
//   T4: database.commit() → FAILURE (constraint violation)
//   T5: transaction rolled back
//
// Result: Event published for an order that DOESN'T EXIST!
//         Downstream systems try to process a ghost order.
//         Corrupted state that's very hard to detect.
```

You cannot atomically update a database AND publish to a message broker in a single transaction. They are separate systems with separate failure modes. Any sequence of operations across these systems can fail partially, leaving the systems in inconsistent states.
Why Retries Don't Solve This
The first instinct is often to add retry logic. But retries introduce new problems:
```typescript
// RETRY APPROACH - STILL BROKEN
async createOrderWithRetry(request: CreateOrderRequest): Promise<Order> {
  const order = await this.database.orders.create(request);

  let published = false;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      await this.messageQueue.publish('order.created', { orderId: order.id });
      published = true;
      break;
    } catch (e) {
      await sleep(exponentialBackoff(attempt));
    }
  }

  if (!published) {
    // What now? Order is already in DB!
    // Can't roll back—DB transaction already committed.
    // Can't delete order—what if customer sees it?
    throw new Error('Failed to publish event');
  }
  return order;
}
```
The retry approach fails because:

- Retries only mask transient failures. If the broker stays down longer than the retry window, you still end up with a committed order and no event.
- When all retries are exhausted there is no good recovery path: the database transaction has already committed, and deleting the order isn't safe once a customer may have seen it.
- If the process crashes during the retry loop, the event is lost entirely: nothing durable records the intent to publish.
- Throwing an error misleads the caller: the order was actually created, so a client-side retry may create a duplicate order.
The dual-write problem occurs when a single business operation must update two or more systems that cannot participate in the same transaction. The term 'dual' refers to the minimum case of two systems, but the problem extends to any number of independent data stores.
Formal Definition:
A dual-write is an operation that modifies data in two or more systems without a coordinated transaction mechanism. Because each system commits independently, partial failures leave the systems in mutually inconsistent states.
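This definition can be made concrete with a toy simulation. Everything below is an illustrative stand-in, not a real client: the "database" and "broker" are plain arrays, and a boolean flag simulates a publish failure.

```typescript
// A toy, synchronous simulation of the dual-write problem. The "database"
// and "broker" are plain arrays; a flag simulates a network failure on
// publish. All names here are illustrative.

type Order = { id: string; status: string };

const dbOrders: Order[] = [];      // committed database state
const brokerEvents: string[] = []; // events the broker accepted

function publish(event: string, brokerDown: boolean): void {
  if (brokerDown) throw new Error('broker timeout'); // simulated network failure
  brokerEvents.push(event);
}

// The naive flow: two independent writes, each committing on its own.
function createOrder(id: string, brokerDown: boolean): void {
  dbOrders.push({ id, status: 'CREATED' });    // write 1 commits immediately
  publish(`order.created:${id}`, brokerDown);  // write 2 may fail afterwards
}

createOrder('order-1', false); // happy path: both writes succeed
try {
  createOrder('order-2', true); // partial failure: DB write sticks, publish fails
} catch {
  // the caller sees an error, but write 1 has already committed
}

console.log(dbOrders.length);     // 2: both orders are in the database
console.log(brokerEvents.length); // 1: only one event reached the broker
```

The two stores now disagree, and no amount of error handling at the call site can undo the committed first write.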
Why Can't We Just Use Distributed Transactions?
The textbook solution to coordinating writes across multiple systems is distributed transactions using protocols like Two-Phase Commit (2PC). However, 2PC has severe practical limitations:
| Concern | Two-Phase Commit Limitation | Practical Impact |
|---|---|---|
| Availability | Blocks if coordinator fails | Single point of failure; system unavailable during recovery |
| Performance | Holds locks during prepare phase | High latency; reduced throughput under contention |
| Heterogeneity | All participants must support XA/2PC | Message brokers like Kafka, RabbitMQ don't support XA |
| Scalability | Coordinator becomes bottleneck | Doesn't scale horizontally |
| Cloud Compatibility | Requires persistent connections | Not compatible with serverless, managed services |
| Complexity | Complex failure recovery | Operational burden; hard to debug |
Most message brokers simply don't support distributed transactions. Apache Kafka, RabbitMQ, Amazon SQS, Google Pub/Sub—none of them can participate in an XA transaction with your PostgreSQL database. Even if they could, the performance and availability costs would be prohibitive for most use cases.
This leaves us with a fundamental constraint:
We cannot achieve atomicity across a database and a message broker through traditional transaction mechanisms.
This is not a solvable problem through better engineering of the naive approach. It requires a fundamentally different architecture.
Dual-write failures often don't crash your system—they corrupt it slowly. You'll see symptoms like 'occasional missing orders,' 'sometimes the notification doesn't send,' or 'the audit log is incomplete.' These intermittent issues are extremely hard to debug because they depend on precise timing of unrelated network failures.
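One way to surface this silent drift is a periodic reconciliation job that diffs the source database against what downstream systems have seen. The sketch below uses in-memory ID lists as stand-ins; the `reconcile` helper and all data are hypothetical.

```typescript
// Hypothetical reconciliation check: compare order IDs in the source
// database against order IDs a downstream system has processed. Orphans
// on either side indicate a dual-write failure.

function reconcile(dbOrderIds: string[], downstreamOrderIds: string[]) {
  const db = new Set(dbOrderIds);
  const seen = new Set(downstreamOrderIds);
  return {
    // order committed, but its event never arrived downstream
    missingEvents: dbOrderIds.filter(id => !seen.has(id)),
    // event arrived for an order that doesn't exist (a "ghost order")
    ghostOrders: downstreamOrderIds.filter(id => !db.has(id)),
  };
}

const report = reconcile(['o1', 'o2', 'o3'], ['o1', 'o3', 'o9']);
console.log(report.missingEvents); // ['o2']
console.log(report.ghostOrders);   // ['o9']
```

A job like this won't prevent the inconsistency, but it turns "occasional missing orders" from an anecdote into an alert.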
The Outbox Pattern solves the dual-write problem through a simple but powerful insight:
Instead of publishing events directly to a message broker, write them to an 'outbox' table in the same database as your business data—within the same transaction. A separate process then reads from the outbox and publishes to the message broker.
This transforms the problem from impossible (atomic cross-system write) to trivial (atomic single-database write).
The Key Insight:
Databases are really good at one thing: atomic transactions within themselves. If you insert an order and an outbox event in the same transaction, either both succeed or both fail. There's no inconsistent state.
Once the data is safely in the outbox (with transactional guarantees), a separate process can take responsibility for reliable delivery to the message broker. This process can retry indefinitely, use idempotency keys, and handle all the complexity of reliable messaging—without any risk to the source transaction.
```typescript
// THE OUTBOX PATTERN - CORRECT APPROACH

// STEP 1: Database transaction includes outbox write
// ──────────────────────────────────────────────────

class OrderService {
  async createOrder(request: CreateOrderRequest): Promise<Order> {
    // Single database transaction
    return await this.database.transaction(async (tx) => {
      // Insert the order
      const order = await tx.orders.create({
        customerId: request.customerId,
        items: request.items,
        total: this.calculateTotal(request.items),
        status: 'CREATED'
      });

      // Insert event into outbox table (SAME TRANSACTION)
      await tx.outbox.create({
        aggregateType: 'Order',
        aggregateId: order.id,
        eventType: 'OrderCreated',
        payload: JSON.stringify({
          orderId: order.id,
          customerId: order.customerId,
          items: order.items,
          total: order.total,
          createdAt: new Date()
        }),
        createdAt: new Date(),
        publishedAt: null // Not yet published
      });

      return order;
    });
    // At this point:
    // - If transaction committed: order AND outbox event both exist
    // - If transaction failed: neither exists
    // - NEVER inconsistent!
  }
}

// STEP 2: Separate process publishes from outbox
// ──────────────────────────────────────────────

class OutboxPublisher {
  // Runs continuously or on a schedule
  async publishPendingEvents(): Promise<void> {
    // Get unpublished events
    const pendingEvents = await this.database.outbox.findMany({
      where: { publishedAt: null },
      orderBy: { createdAt: 'asc' },
      take: 100
    });

    for (const event of pendingEvents) {
      try {
        // Publish to message broker
        await this.messageQueue.publish(
          `${event.aggregateType.toLowerCase()}.${event.eventType.toLowerCase()}`,
          JSON.parse(event.payload)
        );

        // Mark as published
        await this.database.outbox.update({
          where: { id: event.id },
          data: { publishedAt: new Date() }
        });
      } catch (error) {
        // Failed? No problem! Event stays in outbox.
        // Will be picked up in next run.
        // Source transaction is NOT affected.
        console.error(`Failed to publish event ${event.id}:`, error);
      }
    }
  }
}

// KEY GUARANTEES:
//
// 1. ATOMICITY: Order and outbox event are in the same transaction.
//    They either both exist or neither exists.
//
// 2. DURABILITY: Once the transaction commits, the event is durably stored.
//    Even if the process crashes, the event is not lost.
//
// 3. EVENTUAL DELIVERY: Publisher will keep retrying until successful.
//    Events might be delayed, but never lost.
//
// 4. ORDERING: Events published in order of creation (by timestamp).
//    No event skipped due to an earlier failure.
```

The Outbox Pattern cleanly separates the business transaction from the messaging concern. The order service's responsibility is to create orders atomically (with their events). The publisher's responsibility is to deliver events reliably. Neither needs to worry about the other's failure modes.
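The eventual-delivery guarantee can be exercised with a toy drain loop. The in-memory outbox rows and the flaky broker below are stand-ins assuming nothing beyond this snippet.

```typescript
// Toy publisher drain: events stay in the outbox until the broker accepts
// them. A broker that rejects the first attempts loses nothing; pending
// rows are simply retried on the next run.

type OutboxRow = { id: string; payload: string; publishedAt: Date | null };

const outbox: OutboxRow[] = [
  { id: 'e1', payload: 'order.created:1', publishedAt: null },
  { id: 'e2', payload: 'order.created:2', publishedAt: null },
];
const delivered: string[] = [];

let failuresLeft = 2; // broker rejects the first two publish attempts
function publish(payload: string): void {
  if (failuresLeft > 0) { failuresLeft--; throw new Error('broker unavailable'); }
  delivered.push(payload);
}

function publishPendingEvents(): void {
  for (const row of outbox.filter(r => r.publishedAt === null)) {
    try {
      publish(row.payload);
      row.publishedAt = new Date(); // mark as published only after success
    } catch {
      // leave the row pending; it will be retried on the next run
    }
  }
}

publishPendingEvents(); // both attempts fail: nothing delivered, nothing lost
publishPendingEvents(); // broker recovered: both events drain in order
console.log(delivered); // ['order.created:1', 'order.created:2']
```

Note that a row is marked published only after the broker accepts it; that ordering is what trades "exactly-once" for "never lost".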
The outbox table is the heart of the pattern. Its design significantly impacts reliability, performance, and operational characteristics. A well-designed outbox table supports: efficient lookup of unpublished events, stable event ordering, routing and filtering by aggregate, tracing and retry diagnostics, and cheap cleanup of already-published rows.
```sql
-- PRODUCTION-GRADE OUTBOX TABLE SCHEMA

CREATE TABLE outbox_events (
    -- Primary identifier (UUID recommended for distributed systems)
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

    -- Event ordering (monotonically increasing, no gaps)
    -- More reliable than timestamps for ordering
    sequence_number BIGSERIAL NOT NULL,

    -- Aggregate information (for routing and filtering)
    aggregate_type VARCHAR(255) NOT NULL,     -- e.g., 'Order', 'Customer'
    aggregate_id VARCHAR(255) NOT NULL,       -- e.g., order UUID

    -- Event information
    event_type VARCHAR(255) NOT NULL,         -- e.g., 'OrderCreated'
    event_version INTEGER NOT NULL DEFAULT 1, -- schema version

    -- Payload (the actual event data)
    payload JSONB NOT NULL,

    -- Metadata for tracing and debugging
    correlation_id VARCHAR(255),              -- trace across services
    causation_id VARCHAR(255),                -- which event caused this
    metadata JSONB,                           -- additional context

    -- Timestamps
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,                 -- NULL = not yet published

    -- Retry tracking
    publish_attempts INTEGER NOT NULL DEFAULT 0,
    last_attempt_at TIMESTAMPTZ,
    last_error TEXT
);

-- CRITICAL INDEX: Efficiently find unpublished events in order
-- PARTIAL INDEX: Only index what we actually query
-- Much smaller than indexing all rows
CREATE INDEX idx_outbox_unpublished
    ON outbox_events (sequence_number)
    WHERE published_at IS NULL;

-- For cleanup: find old published events
CREATE INDEX idx_outbox_published_cleanup
    ON outbox_events (published_at)
    WHERE published_at IS NOT NULL;

-- For debugging: find events by aggregate
CREATE INDEX idx_outbox_aggregate
    ON outbox_events (aggregate_type, aggregate_id);

-- ALTERNATIVE: Partitioned table for high-volume systems
-- Partition by date for efficient cleanup

CREATE TABLE outbox_events_partitioned (
    id UUID NOT NULL,
    sequence_number BIGINT NOT NULL,
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id VARCHAR(255) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,
    created_date DATE NOT NULL DEFAULT CURRENT_DATE,
    PRIMARY KEY (created_date, id)
) PARTITION BY RANGE (created_date);

-- Create partitions (automate this in production)
CREATE TABLE outbox_events_2024_01 PARTITION OF outbox_events_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Drop old partitions instead of DELETE (much faster)
-- DROP TABLE outbox_events_2023_12;
```
```typescript
// TYPE-SAFE OUTBOX EVENT HANDLING

// Base event structure (all events extend this)
interface BaseEvent<T extends string, P> {
  eventId: string;
  eventType: T;
  eventVersion: number;
  aggregateType: string;
  aggregateId: string;
  payload: P;
  metadata: EventMetadata;
  createdAt: Date;
}

interface EventMetadata {
  correlationId: string;
  causationId?: string;
  userId?: string;
  traceId?: string;
  spanId?: string;
}

// Specific event types
interface OrderCreatedPayload {
  customerId: string;
  items: Array<{
    productId: string;
    quantity: number;
    unitPrice: number;
  }>;
  totalAmount: number;
  currency: string;
  shippingAddress: Address;
}

type OrderCreatedEvent = BaseEvent<'OrderCreated', OrderCreatedPayload>;

interface OrderShippedPayload {
  trackingNumber: string;
  carrier: string;
  estimatedDelivery: Date;
}

type OrderShippedEvent = BaseEvent<'OrderShipped', OrderShippedPayload>;

// Union type for all order events
type OrderEvent = OrderCreatedEvent | OrderShippedEvent;

// Outbox repository with type safety
class OutboxRepository {
  constructor(private db: Database) {}

  async addEvent<E extends BaseEvent<string, unknown>>(
    tx: Transaction,
    event: E
  ): Promise<void> {
    await tx.execute(`
      INSERT INTO outbox_events (
        id, aggregate_type, aggregate_id,
        event_type, event_version,
        payload, metadata,
        correlation_id, causation_id
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
    `, [
      event.eventId,
      event.aggregateType,
      event.aggregateId,
      event.eventType,
      event.eventVersion,
      JSON.stringify(event.payload),
      JSON.stringify(event.metadata),
      event.metadata.correlationId,
      event.metadata.causationId
    ]);
  }
}

// Usage in service layer
class OrderService {
  constructor(
    private db: Database,
    private outbox: OutboxRepository
  ) {}

  async createOrder(
    command: CreateOrderCommand,
    metadata: EventMetadata
  ): Promise<Order> {
    return await this.db.transaction(async (tx) => {
      // Create the order
      const order = await tx.orders.create({
        customerId: command.customerId,
        items: command.items,
        status: 'CREATED',
        totalAmount: this.calculateTotal(command.items)
      });

      // Create type-safe event
      const event: OrderCreatedEvent = {
        eventId: crypto.randomUUID(),
        eventType: 'OrderCreated',
        eventVersion: 1,
        aggregateType: 'Order',
        aggregateId: order.id,
        payload: {
          customerId: order.customerId,
          items: order.items.map(i => ({
            productId: i.productId,
            quantity: i.quantity,
            unitPrice: i.unitPrice
          })),
          totalAmount: order.totalAmount,
          currency: 'USD',
          shippingAddress: command.shippingAddress
        },
        metadata: {
          ...metadata,
          causationId: metadata.correlationId // This event caused by the command
        },
        createdAt: new Date()
      };

      // Add to outbox (same transaction)
      await this.outbox.addEvent(tx, event);

      return order;
    });
  }
}
```

Generate event IDs (UUIDs) in the application before inserting. This allows consumers to use the event ID for idempotency checks, even before the event reaches them. If you rely on database-generated IDs, you can't know the ID until after insert—complicating some patterns.
The Outbox Pattern provides strong guarantees, but like all distributed systems solutions, it involves trade-offs. Understanding exactly what the pattern guarantees—and what it doesn't—is crucial for correct usage.
At-Least-Once vs Exactly-Once
The Outbox Pattern inherently provides at-least-once delivery semantics. Duplicates can occur when:

- The publisher publishes an event to the broker but crashes before marking it as published; the next run publishes it again.
- The update that sets published_at fails after a successful publish, leaving the event eligible for republication.
- Multiple publisher instances pick up the same unpublished rows concurrently.
This is unavoidable in any system without distributed transactions. The solution is to design consumers to be idempotent:
```typescript
// IDEMPOTENT CONSUMER
class OrderEventConsumer {
  constructor(private db: Database) {}

  async handleOrderCreated(event: OrderCreatedEvent): Promise<void> {
    // Check if we've already processed this event
    const exists = await this.db.processedEvents.findFirst({
      where: { eventId: event.eventId }
    });
    if (exists) {
      console.log(`Event ${event.eventId} already processed, skipping`);
      return; // Idempotent: processing again is a no-op
    }

    // Process the event
    await this.db.transaction(async (tx) => {
      // Do the actual work
      await tx.notifications.create({
        type: 'ORDER_CONFIRMATION',
        orderId: event.payload.orderId,
        customerId: event.payload.customerId
      });

      // Record that we processed this event
      await tx.processedEvents.create({
        eventId: event.eventId,
        processedAt: new Date()
      });
    });
  }
}
```
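The effect of idempotent handling can be seen in a self-contained simulation. The Set and the counter below are illustrative stand-ins for the processed-events table and the real side effect.

```typescript
// Minimal idempotent-consumer simulation: delivering the same event twice
// produces the side effect exactly once. The Set stands in for the
// processed-events table; the counter stands in for real work.

const processed = new Set<string>();
let notificationsSent = 0;

function handleEvent(eventId: string): void {
  if (processed.has(eventId)) return; // duplicate delivery: no-op
  notificationsSent += 1;             // the actual work
  processed.add(eventId);             // record the event as processed
}

// At-least-once delivery may replay events:
handleEvent('evt-1');
handleEvent('evt-1'); // duplicate, skipped
handleEvent('evt-2');

console.log(notificationsSent); // 2
```

In a real consumer the side effect and the processed-event record must commit in the same transaction, as in the class above; this sketch omits that for brevity.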
| Trade-off | Impact | Mitigation |
|---|---|---|
| Added Latency | 50-500ms delay between commit and publish (polling) | Use CDC instead of polling; tune poll frequency |
| Database Load | Additional writes (outbox) and reads (polling) | Efficient indexes; separate read replica for polling |
| Storage Growth | Outbox table accumulates events | Regular cleanup of old published events; partitioning |
| Duplicate Events | At-least-once means possible duplicates | Idempotent consumers; deduplication tables |
| Operational Complexity | Additional component (publisher) to monitor | Robust monitoring; alerting on publication lag |
| Schema Coupling | Event schema in database must match expectations | Event versioning; schema registry |
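The storage-growth mitigation in the table can be sketched as a simple retention sweep. The `sweepPublished` helper and the in-memory rows are hypothetical, standing in for a SQL DELETE on old published rows (or a partition drop).

```typescript
// Hypothetical retention sweep: keep unpublished rows unconditionally,
// drop published rows older than the retention window. In production this
// would be a SQL DELETE or a partition drop; here it runs over an array.

type OutboxRow = { id: string; publishedAt: Date | null };

function sweepPublished(rows: OutboxRow[], retentionMs: number, now: Date): OutboxRow[] {
  return rows.filter(row =>
    row.publishedAt === null ||                              // never delete pending events
    now.getTime() - row.publishedAt.getTime() < retentionMs  // keep recent history
  );
}

const now = new Date('2024-02-01T00:00:00Z');
const rows: OutboxRow[] = [
  { id: 'a', publishedAt: new Date('2024-01-01T00:00:00Z') }, // old, published
  { id: 'b', publishedAt: new Date('2024-01-31T12:00:00Z') }, // recent, published
  { id: 'c', publishedAt: null },                             // still pending
];

const kept = sweepPublished(rows, 7 * 24 * 3600 * 1000, now); // 7-day retention
console.log(kept.map(r => r.id)); // ['b', 'c']
```

The key invariant is that unpublished rows are untouchable: deleting a pending event would silently break the "never lost" guarantee.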
If you need sub-second event delivery, polling-based outbox may not suffice. Consider CDC (Change Data Capture) which can detect new outbox rows almost instantly. We'll cover this in detail in the 'Polling vs CDC' section.
The Outbox Pattern is not always necessary. For some systems, simpler approaches work fine. For others, the Outbox Pattern is essential. As a rule of thumb: reach for the outbox when losing an event would corrupt business state or break a downstream workflow (orders, payments, inventory, audit trails); a plain fire-and-forget publish may be acceptable when events are best-effort and occasional loss is tolerable (metrics, cache warming, non-critical notifications).
You now understand the dual-write problem and why the Outbox Pattern is the fundamental solution. Next, we'll dive deep into the transactional outbox mechanism—exploring table design, transaction patterns, and ensuring atomicity across different database engines.