It's 3 AM. Customer support reports that orders are being charged but not shipped. The payment system shows successful captures. The shipping system shows empty queues. A customer placed an order 4 hours ago—what happened to the OrderPaidEvent?
This is the nightmare scenario of event-driven systems. Events are ephemeral. They flow through message brokers, trigger handlers, and vanish. When something goes wrong, there's no stack trace that spans the entire flow. No single log file contains the complete picture. The causal chain is distributed across services, queues, and time.
Debugging event-driven systems requires specialized techniques. This page equips you with those techniques—the tools and patterns that Principal Engineers use to diagnose issues in minutes that would otherwise take days.
By the end of this page, you will master: (1) Distributed tracing with correlation IDs and causation chains, (2) Dead letter queue analysis and recovery, (3) Event replay techniques for reproducing issues, (4) Structured logging patterns for event-driven systems, and (5) Observability dashboards and alerting strategies.
Event-driven systems break traditional debugging assumptions. In synchronous systems, a stack trace shows exactly what happened—function A called function B, which called function C. The cause-effect chain is explicit.
In event-driven systems, this chain is implicit and distributed: a service publishes an event, a broker delivers it to handlers in other services, and those handlers publish further events of their own. No single stack trace connects these steps, and no single process observes them all.
The solution is proactive observability. You can't add debugging capabilities after a problem occurs—you must build them into the system from the start. The following sections detail the essential observability patterns for event-driven systems.
If you're reading this after a production incident, you'll be limited to whatever observability was already in place. The lesson: invest in observability during development, not during outages.
The cornerstone of event debugging is correlation IDs—unique identifiers that link all events and operations stemming from a single originating action. When a user places an order, every event, log entry, and database operation generated as a result shares the same correlation ID.
```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

/**
 * Correlation context that flows through the entire event chain
 */
export interface CorrelationContext {
  // Links all events from the same user action
  correlationId: string;
  // ID of the event/command that directly caused this event
  causationId: string | null;
  // The original request that started the chain
  originatingRequestId: string | null;
  // User who initiated the action
  userId: string | null;
  // When the chain started
  initiatedAt: Date;
  // Span for distributed tracing (OpenTelemetry compatible)
  traceId?: string;
  spanId?: string;
}

// Holds the correlation context for the duration of a handler invocation
const correlationStorage = new AsyncLocalStorage<CorrelationContext>();

export function getCurrentCorrelationContext(): CorrelationContext {
  const context = correlationStorage.getStore();
  if (!context) {
    throw new Error('No correlation context active outside a handler scope');
  }
  return context;
}

/**
 * Base event class with built-in correlation
 */
export abstract class DomainEvent {
  public readonly eventId: string;
  public readonly occurredAt: Date;
  public readonly correlationId: string;
  public readonly causationId: string | null;

  protected constructor(
    eventId: string,
    correlationId: string,
    causationId: string | null = null
  ) {
    this.eventId = eventId;
    this.occurredAt = new Date();
    this.correlationId = correlationId;
    this.causationId = causationId;
  }
}

/**
 * Event handler base that propagates correlation
 */
export abstract class CorrelatedEventHandler<TEvent extends DomainEvent> {
  protected abstract handleCore(
    event: TEvent,
    context: CorrelationContext
  ): Promise<void>;

  public async handle(event: TEvent): Promise<void> {
    // Extract or create correlation context
    const context: CorrelationContext = {
      correlationId: event.correlationId,
      causationId: event.eventId, // This event causes what we do next
      originatingRequestId: null,
      userId: null,
      initiatedAt: event.occurredAt,
    };

    // Set up logging context so nested code (and the logger) can read it
    await correlationStorage.run(context, () => this.handleCore(event, context));
  }

  // Utility for handlers to raise follow-up events with propagated correlation
  protected createFollowUpEvent<T extends DomainEvent>(
    EventClass: new (data: any) => T,
    eventData: Omit<T, 'eventId' | 'correlationId' | 'causationId' | 'occurredAt'>
  ): T {
    const context = getCurrentCorrelationContext();
    return new EventClass({
      ...eventData,
      eventId: generateEventId(),
      correlationId: context.correlationId,
      causationId: context.causationId, // Links to parent event
    });
  }
}

// Usage in a handler
class InventoryReservationHandler extends CorrelatedEventHandler<OrderPlacedEvent> {
  protected async handleCore(
    event: OrderPlacedEvent,
    context: CorrelationContext
  ): Promise<void> {
    // All logging automatically includes correlation ID
    this.logger.info('Processing order for reservation', {
      orderId: event.orderId,
      itemCount: event.items.length,
    });

    // Perform reservation
    const reservation = await this.inventoryService.reserve(
      event.orderId,
      event.items
    );

    // Publish follow-up event with correlation chain
    const reservedEvent = this.createFollowUpEvent(InventoryReservedEvent, {
      orderId: event.orderId,
      reservationId: reservation.id,
      items: reservation.items,
    });

    await this.eventBus.publish(reservedEvent);
  }
}
```

With correlation IDs in place, debugging becomes a search operation:
```typescript
/**
 * Debugging utility: trace an entire event chain from correlation ID
 */
export async function traceEventChain(correlationId: string): Promise<EventChainView> {
  // Query event store for all events with this correlation ID
  const events = await eventStore.findByCorrelationId(correlationId);

  // Query logs for all log entries with this correlation ID
  const logs = await logService.query({
    filter: { correlationId },
    sort: { timestamp: 'asc' },
  });

  // Build causation tree
  const tree = buildCausationTree(events);

  return {
    correlationId,
    events: events.sort((a, b) => a.occurredAt.getTime() - b.occurredAt.getTime()),
    eventTree: tree,
    logs,
    // Computed insights
    totalDuration: calculateDuration(events),
    failedHandlers: identifyFailures(logs),
    timeline: buildTimeline(events, logs),
  };
}

/**
 * Build tree showing which events caused which
 */
function buildCausationTree(events: DomainEvent[]): CausationNode {
  const eventMap = new Map(events.map(e => [e.eventId, e]));
  const children = new Map<string, DomainEvent[]>();

  // Group by causation ID
  for (const event of events) {
    if (event.causationId) {
      const existing = children.get(event.causationId) || [];
      existing.push(event);
      children.set(event.causationId, existing);
    }
  }

  // Find root (null causation ID or external)
  const root = events.find(e => !e.causationId || !eventMap.has(e.causationId));

  function buildNode(event: DomainEvent): CausationNode {
    return {
      event,
      causedEvents: (children.get(event.eventId) || []).map(buildNode),
    };
  }

  return root ? buildNode(root) : { event: null, causedEvents: [] };
}

// Example output for debugging:
/*
Correlation ID: corr-abc123

Event Chain:
├── OrderPlacedEvent (evt-001) @ 10:00:00.000
│   ├── InventoryReservedEvent (evt-002) @ 10:00:00.123
│   │   └── PaymentCapturedEvent (evt-003) @ 10:00:00.456
│   │       └── OrderConfirmedEvent (evt-004) @ 10:00:00.789
│   └── NotificationSentEvent (evt-005) @ 10:00:00.234

Duration: 789ms
Failed Handlers: None
*/
```

For production systems, integrate with OpenTelemetry: use traceId as your correlation ID and have each handler create a child span. Tools like Jaeger, Zipkin, or cloud provider tracing (AWS X-Ray, Google Cloud Trace) then visualize the complete distributed trace automatically.
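As a rough sketch of that integration, a handler wrapper can open a child span per handler using the @opentelemetry/api package. The tracer name and attribute keys here are illustrative assumptions, as is the premise that the parent trace context was already propagated with the incoming message:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Illustrative tracer name; use your service's name in practice
const tracer = trace.getTracer('event-handlers');

/**
 * Wraps a handler invocation in a child span so the whole event chain
 * shows up as one distributed trace in Jaeger, Zipkin, or X-Ray.
 */
export async function handleWithSpan<TEvent extends DomainEvent>(
  handlerName: string,
  event: TEvent,
  handle: (event: TEvent) => Promise<void>
): Promise<void> {
  await tracer.startActiveSpan(`${handlerName}.handle`, async span => {
    // Attribute keys are assumptions for this sketch; pick one convention and stick to it
    span.setAttribute('event.id', event.eventId);
    span.setAttribute('event.type', event.constructor.name);
    span.setAttribute('correlation.id', event.correlationId);
    try {
      await handle(event);
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

With a wrapper like this in every consumer, each event chain appears as a single trace, and the trace ID can double as the correlation ID described above.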
When event handlers fail permanently, events end up in a Dead Letter Queue (DLQ). The DLQ is your crime scene—it contains evidence of what went wrong and the original events that caused failures.
```typescript
/**
 * Dead Letter Message with debugging context
 */
export interface DeadLetterMessage {
  // The original event that failed
  originalEvent: DomainEvent;
  // Serialized form as received (for exact reproduction)
  rawPayload: string;
  // Which queue/topic the event came from
  sourceQueue: string;
  // Which handler(s) failed
  failedHandler: string;
  // Error details
  error: {
    type: string;
    message: string;
    stackTrace: string;
    code?: string;
  };
  // Retry history
  attempts: {
    attemptNumber: number;
    timestamp: Date;
    error: string;
  }[];
  // When it landed in DLQ
  deadLetteredAt: Date;
  // For prioritization
  metadata: {
    correlationId: string;
    originatingUserId?: string;
    customerId?: string;
    orderId?: string;
    monetaryValue?: number;
  };
}

/**
 * DLQ analysis and recovery tooling
 */
export class DeadLetterQueueManager {
  constructor(
    private readonly dlqRepository: DeadLetterRepository,
    private readonly eventBus: EventBus,
    private readonly logger: Logger
  ) {}

  /**
   * Get overview of DLQ for dashboard
   */
  async getDashboard(): Promise<DLQDashboard> {
    const messages = await this.dlqRepository.getAll();

    return {
      totalCount: messages.length,
      // Group by error type
      byErrorType: this.groupBy(messages, m => m.error.type),
      // Group by handler
      byHandler: this.groupBy(messages, m => m.failedHandler),
      // Group by event type
      byEventType: this.groupBy(messages, m => m.originalEvent.constructor.name),
      // Oldest entries (highest priority)
      oldest: messages.slice().sort(
        (a, b) => a.deadLetteredAt.getTime() - b.deadLetteredAt.getTime()
      ).slice(0, 10),
      // Highest value (prioritize by business impact)
      highestValue: messages.slice().sort(
        (a, b) => (b.metadata.monetaryValue || 0) - (a.metadata.monetaryValue || 0)
      ).slice(0, 10),
      // Recent entries (potential ongoing issue)
      recent: messages.slice().sort(
        (a, b) => b.deadLetteredAt.getTime() - a.deadLetteredAt.getTime()
      ).slice(0, 10),
    };
  }

  /**
   * Retry a specific dead letter message
   */
  async retrySingle(messageId: string): Promise<RetryResult> {
    const message = await this.dlqRepository.findById(messageId);
    if (!message) {
      return { success: false, error: 'Message not found' };
    }

    this.logger.info('Retrying dead letter message', {
      messageId,
      eventType: message.originalEvent.constructor.name,
      handler: message.failedHandler,
      correlationId: message.metadata.correlationId,
    });

    try {
      await this.eventBus.publish(message.originalEvent);
      await this.dlqRepository.markAsRetried(messageId);
      return { success: true };
    } catch (error) {
      this.logger.error('Retry failed', { messageId, error });
      return {
        success: false,
        error: error instanceof Error ? error.message : 'Unknown error',
      };
    }
  }

  /**
   * Batch retry all messages matching criteria
   */
  async retryBatch(criteria: RetryCriteria): Promise<BatchRetryResult> {
    const messages = await this.dlqRepository.findMatching(criteria);

    const results = await Promise.allSettled(
      messages.map(m => this.retrySingle(m.id))
    );

    const succeeded = results.filter(
      r => r.status === 'fulfilled' && r.value.success
    ).length;

    return {
      total: messages.length,
      succeeded,
      failed: messages.length - succeeded,
    };
  }

  /**
   * Analyze common failure patterns
   */
  async analyzePatterns(): Promise<FailurePattern[]> {
    const messages = await this.dlqRepository.getAll();
    const patterns: Map<string, FailurePattern> = new Map();

    for (const message of messages) {
      const patternKey = `${message.error.type}:${message.failedHandler}`;
      const existing = patterns.get(patternKey) || {
        errorType: message.error.type,
        handler: message.failedHandler,
        count: 0,
        firstSeen: message.deadLetteredAt,
        lastSeen: message.deadLetteredAt,
        sampleErrors: [],
      };

      existing.count++;
      existing.lastSeen = new Date(Math.max(
        existing.lastSeen.getTime(),
        message.deadLetteredAt.getTime()
      ));
      if (existing.sampleErrors.length < 3) {
        existing.sampleErrors.push(message.error.message);
      }

      patterns.set(patternKey, existing);
    }

    return Array.from(patterns.values())
      .sort((a, b) => b.count - a.count);
  }

  private groupBy<T>(
    items: T[],
    keyFn: (item: T) => string
  ): Record<string, number> {
    const result: Record<string, number> = {};
    for (const item of items) {
      const key = keyFn(item);
      result[key] = (result[key] || 0) + 1;
    }
    return result;
  }
}
```

| Error Pattern | Likely Cause | Investigation Steps | Resolution |
|---|---|---|---|
| Validation errors spike | Schema change in producer | Compare event schema to handler expectations | Update handler or roll back producer |
| Connection refused | Database/service down | Check dependent service health | Fix service, then batch retry |
| Timeout errors | Slow downstream service | Check service metrics and latency | Increase timeout or fix bottleneck |
| Not found errors | Race condition or deleted data | Check if entity exists, timing of operations | Add handler resilience or reorder operations |
| Duplicate key | Handler not idempotent | Check idempotency implementation | Fix handler, manually dedupe DLQ |
Set up alerts for: (1) DLQ depth exceeding threshold, (2) New error types appearing, (3) High-value events in DLQ, (4) Age of oldest DLQ message. These alerts catch issues before they compound.
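A minimal sketch of how those checks could run on a schedule against the DeadLetterQueueManager above; the threshold values, the shape of the alerts parameter, and the knownErrorTypes set are assumptions for illustration, not prescribed defaults:

```typescript
/**
 * Periodic DLQ alert check. Thresholds and the alert sink are illustrative.
 */
export async function checkDlqAlerts(
  dlq: DeadLetterQueueManager,
  alerts: { fire(name: string, details: Record<string, unknown>): Promise<void> },
  knownErrorTypes: Set<string>
): Promise<void> {
  const dashboard = await dlq.getDashboard();

  // 1. DLQ depth exceeding threshold
  if (dashboard.totalCount > 100) {
    await alerts.fire('dlq.depth_exceeded', { depth: dashboard.totalCount });
  }

  // 2. New error types appearing
  for (const errorType of Object.keys(dashboard.byErrorType)) {
    if (!knownErrorTypes.has(errorType)) {
      await alerts.fire('dlq.new_error_type', { errorType });
    }
  }

  // 3. High-value events in DLQ (threshold is an example value)
  const highValue = dashboard.highestValue.filter(
    m => (m.metadata.monetaryValue ?? 0) > 1_000
  );
  if (highValue.length > 0) {
    await alerts.fire('dlq.high_value_events', { count: highValue.length });
  }

  // 4. Age of oldest DLQ message (example: older than one hour)
  const oldest = dashboard.oldest[0];
  if (oldest && Date.now() - oldest.deadLetteredAt.getTime() > 60 * 60 * 1000) {
    await alerts.fire('dlq.message_too_old', {
      correlationId: oldest.metadata.correlationId,
      deadLetteredAt: oldest.deadLetteredAt,
    });
  }
}
```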
One of the most powerful debugging techniques in event-driven systems is event replay—taking production events and replaying them in a controlled environment to reproduce issues exactly.
```typescript
// Small helper used by the replay loop to simulate time gaps between events
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

/**
 * Event Replay System for debugging and testing
 */
export class EventReplayService {
  constructor(
    private readonly eventStore: EventStore,
    private readonly eventBus: EventBus,
    private readonly stateSnapshot: StateSnapshotService,
    private readonly logger: Logger
  ) {}

  /**
   * Replay events for a specific correlation ID
   */
  async replayCorrelation(
    correlationId: string,
    options: ReplayOptions = {}
  ): Promise<ReplayResult> {
    const {
      startFrom = null,
      endAt = null,
      handlers = null,      // null = all handlers, or list of specific handlers
      dryRun = true,        // Default to dry run for safety
      speedMultiplier = 1.0, // 1.0 = real-time, 0 = instant
    } = options;

    this.logger.info('Starting event replay', { correlationId, options });

    // Fetch events
    let events = await this.eventStore.findByCorrelationId(correlationId);
    events = events.sort((a, b) => a.occurredAt.getTime() - b.occurredAt.getTime());

    // Apply filters
    if (startFrom) {
      events = events.filter(e => e.occurredAt >= startFrom);
    }
    if (endAt) {
      events = events.filter(e => e.occurredAt <= endAt);
    }

    this.logger.info(`Replaying ${events.length} events`);

    const results: ReplayEventResult[] = [];
    let previousTime = events[0]?.occurredAt.getTime();

    for (const event of events) {
      // Simulate time gaps if not instant replay
      if (speedMultiplier > 0 && previousTime) {
        const gap = event.occurredAt.getTime() - previousTime;
        if (gap > 0) {
          await sleep(gap / speedMultiplier);
        }
      }
      previousTime = event.occurredAt.getTime();

      // Replay event
      const result = await this.replayEvent(event, { handlers, dryRun });
      results.push(result);

      this.logger.info('Replayed event', {
        eventId: event.eventId,
        eventType: event.constructor.name,
        success: result.success,
        handlerResults: result.handlerResults,
      });
    }

    return {
      correlationId,
      eventsReplayed: events.length,
      results,
      succeeded: results.filter(r => r.success).length,
      failed: results.filter(r => !r.success).length,
    };
  }

  /**
   * Replay events from a specific time window (for incident investigation)
   */
  async replayTimeWindow(
    startTime: Date,
    endTime: Date,
    eventTypes: string[],
    options: ReplayOptions = {}
  ): Promise<ReplayResult> {
    const events = await this.eventStore.findInTimeWindow(
      startTime,
      endTime,
      { eventTypes }
    );

    this.logger.info(`Found ${events.length} events in time window`);

    // Group by correlation ID for ordered replay
    const byCorrelation = new Map<string, DomainEvent[]>();
    for (const event of events) {
      const existing = byCorrelation.get(event.correlationId) || [];
      existing.push(event);
      byCorrelation.set(event.correlationId, existing);
    }

    const allResults: ReplayEventResult[] = [];
    for (const corrId of byCorrelation.keys()) {
      const result = await this.replayCorrelation(corrId, {
        ...options,
        startFrom: startTime,
        endAt: endTime,
      });
      allResults.push(...result.results);
    }

    return {
      correlationId: 'multiple',
      eventsReplayed: allResults.length,
      results: allResults,
      succeeded: allResults.filter(r => r.success).length,
      failed: allResults.filter(r => !r.success).length,
    };
  }

  /**
   * Replay single event with detailed tracking
   */
  private async replayEvent(
    event: DomainEvent,
    options: { handlers: string[] | null; dryRun: boolean }
  ): Promise<ReplayEventResult> {
    if (options.dryRun) {
      // Dry run: log what would happen without side effects
      const registeredHandlers = this.eventBus.getHandlersFor(event);

      return {
        eventId: event.eventId,
        eventType: event.constructor.name,
        success: true,
        handlerResults: registeredHandlers.map(h => ({
          handler: h.name,
          wouldExecute: true,
          dryRun: true,
        })),
      };
    }

    // Real replay
    try {
      await this.eventBus.publish(event);
      return {
        eventId: event.eventId,
        eventType: event.constructor.name,
        success: true,
        handlerResults: [{ handler: 'all', success: true }],
      };
    } catch (error) {
      return {
        eventId: event.eventId,
        eventType: event.constructor.name,
        success: false,
        error: error instanceof Error ? error.message : 'Unknown',
        handlerResults: [],
      };
    }
  }
}

// Usage example
async function debugOrderIssue(orderId: string) {
  const correlationId = await lookupCorrelationId(orderId);

  // Step 1: Dry run to see what would happen
  const dryRunResult = await replayService.replayCorrelation(correlationId, {
    dryRun: true,
  });
  console.log('Dry run complete:', dryRunResult);

  // Step 2: If dry run looks good, replay in isolated environment
  await setupIsolatedEnvironment();
  const replayResult = await replayService.replayCorrelation(correlationId, {
    dryRun: false,
    handlers: ['InventoryReservationHandler'], // Test specific handler
  });
  console.log('Replay complete:', replayResult);
}
```

Logs are your primary debugging tool when correlation queries and event replay aren't sufficient. But unstructured logs ('Processing order...') are nearly useless for debugging. Structured logging with consistent fields enables powerful queries.
```typescript
/**
 * Structured logging for event-driven systems
 */
export class EventLogger {
  constructor(
    private readonly baseLogger: Logger,
    private readonly contextProvider: () => CorrelationContext | null
  ) {}

  /**
   * Log event publication
   */
  eventPublished(event: DomainEvent, metadata: Record<string, any> = {}): void {
    this.log('info', 'event.published', {
      eventId: event.eventId,
      eventType: event.constructor.name,
      correlationId: event.correlationId,
      causationId: event.causationId,
      ...this.extractEventContext(event),
      ...metadata,
    });
  }

  /**
   * Log handler invocation start
   */
  handlerStarted(
    handlerName: string,
    event: DomainEvent,
    metadata: Record<string, any> = {}
  ): void {
    this.log('info', 'handler.started', {
      handler: handlerName,
      eventId: event.eventId,
      eventType: event.constructor.name,
      correlationId: event.correlationId,
      ...metadata,
    });
  }

  /**
   * Log handler completion
   */
  handlerCompleted(
    handlerName: string,
    event: DomainEvent,
    durationMs: number,
    metadata: Record<string, any> = {}
  ): void {
    this.log('info', 'handler.completed', {
      handler: handlerName,
      eventId: event.eventId,
      eventType: event.constructor.name,
      correlationId: event.correlationId,
      durationMs,
      ...metadata,
    });
  }

  /**
   * Log handler failure
   */
  handlerFailed(
    handlerName: string,
    event: DomainEvent,
    error: Error,
    durationMs: number,
    metadata: Record<string, any> = {}
  ): void {
    this.log('error', 'handler.failed', {
      handler: handlerName,
      eventId: event.eventId,
      eventType: event.constructor.name,
      correlationId: event.correlationId,
      durationMs,
      error: {
        type: error.constructor.name,
        message: error.message,
        stack: error.stack,
      },
      ...metadata,
    });
  }

  /**
   * Log business-level action
   */
  action(
    action: string,
    details: Record<string, any>
  ): void {
    const context = this.contextProvider();
    this.log('info', `action.${action}`, {
      correlationId: context?.correlationId,
      ...details,
    });
  }

  private log(
    level: 'debug' | 'info' | 'warn' | 'error',
    event: string,
    data: Record<string, any>
  ): void {
    const context = this.contextProvider();

    this.baseLogger[level]({
      // Standard fields (always present)
      timestamp: new Date().toISOString(),
      level,
      event,
      // Correlation (automatically added)
      correlationId: context?.correlationId ?? data.correlationId,
      traceId: context?.traceId,
      spanId: context?.spanId,
      // Event data
      ...data,
      // Environment context
      service: process.env.SERVICE_NAME,
      version: process.env.APP_VERSION,
      environment: process.env.NODE_ENV,
    });
  }

  private extractEventContext(event: DomainEvent): Record<string, any> {
    // Extract common identifiers for querying
    const context: Record<string, any> = {};
    if ('orderId' in event) context.orderId = (event as any).orderId?.value ?? (event as any).orderId;
    if ('customerId' in event) context.customerId = (event as any).customerId?.value ?? (event as any).customerId;
    if ('productId' in event) context.productId = (event as any).productId?.value ?? (event as any).productId;
    if ('amount' in event) context.amount = (event as any).amount?.value ?? (event as any).amount;
    return context;
  }
}

// Usage in handlers
class InventoryReservationHandler {
  async handle(event: OrderPlacedEvent): Promise<void> {
    const startTime = Date.now();
    this.logger.handlerStarted('InventoryReservationHandler', event);

    try {
      // Do work
      this.logger.action('inventory.check', {
        productIds: event.items.map(i => i.productId),
      });

      const reservation = await this.inventoryService.reserve(event.items);

      this.logger.action('inventory.reserved', {
        reservationId: reservation.id,
        itemCount: event.items.length,
      });

      this.logger.handlerCompleted(
        'InventoryReservationHandler',
        event,
        Date.now() - startTime,
        { reservationId: reservation.id }
      );
    } catch (error) {
      this.logger.handlerFailed(
        'InventoryReservationHandler',
        event,
        error as Error,
        Date.now() - startTime
      );
      throw error;
    }
  }
}
```

With structured logging, you can query efficiently:
```
# Find all events for a specific order
correlationId="corr-abc123" AND orderId="order-456"

# Find all handler failures in the last hour
event="handler.failed" AND timestamp > now() - 1h

# Find slow handlers (over 5 seconds)
event="handler.completed" AND durationMs > 5000

# Find specific error type
event="handler.failed" AND error.type="InsufficientInventoryError"

# Trace inventory issues for a customer
action="inventory.*" AND customerId="cust-789"

# Find all events in a time window for an order
orderId="order-456" AND timestamp > "2024-01-15T10:00:00Z" AND timestamp < "2024-01-15T11:00:00Z"
```

Structured logs are only valuable if they're queryable. Use Elasticsearch/Kibana, Datadog, Splunk, or cloud-native solutions (CloudWatch Logs Insights, Google Cloud Logging). Avoid unstructured file logs in production event-driven systems.
Dashboards provide real-time visibility into event flow health. Well-designed dashboards help you spot problems before they become incidents.
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Event publish rate | Events/second by type | Sudden drop (50% decrease) |
| Handler success rate | % of events successfully processed | Below 99.9% |
| Handler latency p50/p95/p99 | Processing time distribution | p99 > 5s |
| Queue depth | Unprocessed messages per queue | Above 1000 (or 10x normal) |
| DLQ depth | Failed messages awaiting resolution | Any non-zero (new failures) |
| Event lag | Time between publish and process | Above 30 seconds |
| Duplicate events | Events processed more than once | Rising trend |
| Handler retry rate | Retries per event | Above 5% |
Google's SRE book recommends monitoring the four golden signals: latency, traffic, errors, and saturation. In event-driven systems, handler latency, event publish rate, handler failure rate, and queue depth cover these signals respectively.
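One way to emit these metrics, sketched with the prom-client library; the metric names, label sets, and bucket boundaries are assumptions, and any metrics backend with counters, gauges, and histograms works the same way:

```typescript
import { Counter, Gauge, Histogram } from 'prom-client';

// Traffic: events published, by type
const eventsPublished = new Counter({
  name: 'events_published_total',
  help: 'Events published, by event type',
  labelNames: ['event_type'],
});

// Errors: handler failures, by handler and error type
const handlerFailures = new Counter({
  name: 'handler_failures_total',
  help: 'Handler failures, by handler and error type',
  labelNames: ['handler', 'error_type'],
});

// Latency: handler processing time distribution (p50/p95/p99 come from the histogram)
const handlerDuration = new Histogram({
  name: 'handler_duration_seconds',
  help: 'Handler processing time',
  labelNames: ['handler', 'event_type'],
  buckets: [0.05, 0.1, 0.5, 1, 5, 10],
});

// Saturation: unprocessed messages per queue
const queueDepth = new Gauge({
  name: 'queue_depth',
  help: 'Unprocessed messages per queue',
  labelNames: ['queue'],
});

export function recordEventPublished(eventType: string): void {
  eventsPublished.labels(eventType).inc();
}

export function recordHandlerRun(
  handler: string,
  eventType: string,
  durationMs: number,
  error?: Error
): void {
  handlerDuration.labels(handler, eventType).observe(durationMs / 1000);
  if (error) {
    handlerFailures.labels(handler, error.constructor.name).inc();
  }
}

export function recordQueueDepth(queue: string, depth: number): void {
  queueDepth.labels(queue).set(depth);
}
```

Scraping these metrics feeds the dashboard panels and alert thresholds listed in the table above.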
Beyond the infrastructure and patterns above, here's a practical toolkit for debugging event-driven systems:
```typescript
/**
 * CLI utilities for event debugging
 */

// 1. Event Inspector: Detailed view of a single event
async function inspectEvent(eventId: string): Promise<void> {
  const event = await eventStore.findById(eventId);
  if (!event) {
    console.error(`Event ${eventId} not found`);
    return;
  }

  console.log('='.repeat(60));
  console.log('EVENT DETAILS');
  console.log('='.repeat(60));
  console.log(`Event ID:     ${event.eventId}`);
  console.log(`Type:         ${event.constructor.name}`);
  console.log(`Occurred At:  ${event.occurredAt.toISOString()}`);
  console.log(`Correlation:  ${event.correlationId}`);
  console.log(`Causation:    ${event.causationId || 'none'}`);
  console.log('');
  console.log('Payload:');
  console.log(JSON.stringify(event, null, 2));
  console.log('');

  // Show handlers that processed this event
  const handlerLogs = await logService.query({
    filter: { eventId: event.eventId, event: 'handler.*' },
    sort: { timestamp: 'asc' },
  });

  console.log('Handler Processing:');
  for (const log of handlerLogs) {
    const status = log.event === 'handler.completed' ? '✓' : '✗';
    const duration = log.durationMs ? `(${log.durationMs}ms)` : '';
    console.log(`  ${status} ${log.handler} ${duration}`);
    if (log.error) {
      console.log(`    Error: ${log.error.message}`);
    }
  }
}

// 2. Trace Viewer: Complete chain for a correlation ID
async function traceChain(correlationId: string): Promise<void> {
  const events = await eventStore.findByCorrelationId(correlationId);
  events.sort((a, b) => a.occurredAt.getTime() - b.occurredAt.getTime());

  console.log('='.repeat(60));
  console.log(`EVENT CHAIN: ${correlationId}`);
  console.log('='.repeat(60));

  const root = events.find(e => !e.causationId);

  function printEvent(event: DomainEvent, indent: number): void {
    const prefix = '  '.repeat(indent) + '├── ';
    const time = event.occurredAt.toISOString().split('T')[1].slice(0, 12);
    console.log(`${prefix}${time} ${event.constructor.name} [${event.eventId.slice(0, 8)}]`);

    // Find children
    const children = events.filter(e => e.causationId === event.eventId);
    children.forEach(child => printEvent(child, indent + 1));
  }

  if (root) {
    printEvent(root, 0);
  }

  // Summary
  const totalDuration = events.length > 1
    ? events[events.length - 1].occurredAt.getTime() - events[0].occurredAt.getTime()
    : 0;
  console.log('');
  console.log(`Total Events:   ${events.length}`);
  console.log(`Total Duration: ${totalDuration}ms`);
}

// 3. Handler History: Recent invocations for a specific handler
async function handlerHistory(
  handlerName: string,
  limit: number = 20
): Promise<void> {
  const logs = await logService.query({
    filter: { handler: handlerName, event: /handler\..*/ },
    sort: { timestamp: 'desc' },
    limit,
  });

  console.log('='.repeat(60));
  console.log(`HANDLER HISTORY: ${handlerName}`);
  console.log('='.repeat(60));
  console.log('');

  for (const log of logs) {
    const status = log.event === 'handler.completed' ? '✓' : '✗';
    const duration = log.durationMs ? `${log.durationMs}ms` : '--';
    console.log(`${status} ${log.timestamp} | ${duration.padStart(6)} | ${log.eventType}`);
    if (log.error) {
      console.log(`  └─ ${log.error.type}: ${log.error.message}`);
    }
  }

  // Stats
  const completed = logs.filter(l => l.event === 'handler.completed').length;
  const failed = logs.filter(l => l.event === 'handler.failed').length;
  const avgDuration = logs
    .filter(l => l.durationMs)
    .reduce((sum, l) => sum + l.durationMs, 0) / completed || 0;

  console.log('');
  console.log(`Success Rate: ${(completed / logs.length * 100).toFixed(1)}%`);
  console.log(`Avg Duration: ${avgDuration.toFixed(0)}ms`);
}

// 4. Compare Events: Side-by-side comparison (useful for debugging schema issues)
async function compareEvents(eventId1: string, eventId2: string): Promise<void> {
  const [event1, event2] = await Promise.all([
    eventStore.findById(eventId1),
    eventStore.findById(eventId2),
  ]);

  console.log('='.repeat(60));
  console.log('EVENT COMPARISON');
  console.log('='.repeat(60));

  const json1 = JSON.stringify(event1, null, 2).split('\n');
  const json2 = JSON.stringify(event2, null, 2).split('\n');
  const maxLines = Math.max(json1.length, json2.length);

  for (let i = 0; i < maxLines; i++) {
    const line1 = json1[i] || '';
    const line2 = json2[i] || '';
    const marker = line1 !== line2 ? '≠' : ' ';
    console.log(`${marker} ${line1.padEnd(40)} | ${line2}`);
  }
}
```

These utilities should be available before production incidents occur. During an outage is not the time to write debugging scripts. Invest in tooling proactively.
Debugging event-driven systems requires preparation, the right tools, and systematic approaches: correlation IDs that link every event and log entry back to the originating action, DLQ tooling for analyzing and retrying failures, event replay for reproducing issues exactly, structured logging that makes event flows queryable, and dashboards with alerting that surface problems before they become incidents.
Module Complete:
You've now completed the comprehensive module on Testing Event-Driven Designs. You've learned how to verify event raising, test event handlers in isolation, perform integration testing across event flows, and debug production issues when they occur.
These skills are the mark of an engineer who can design, build, AND maintain event-driven systems confidently. When events inevitably misbehave in production, you'll have the tools and techniques to diagnose and resolve issues rapidly.
Congratulations! You've mastered testing and debugging for event-driven designs. From unit testing individual handlers to integration testing complete flows to debugging production incidents—you now have the comprehensive toolkit that separates competent developers from elite engineers who can confidently operate event-driven systems at scale.