Having explored at-most-once delivery and its message loss trade-offs, we now turn to the opposite end of the reliability spectrum: at-least-once delivery.
At-least-once delivery answers a fundamental business requirement: No message should ever be permanently lost. Every message produced must eventually be consumed, even if the system experiences failures along the way. This guarantee is essential for business-critical workflows where message loss translates directly to financial loss, broken processes, or compliance violations.
But this guarantee comes with a price. To ensure no message is lost, the system must be willing to deliver messages more than once under certain failure conditions. This trade-off fundamentally changes how consumers must be designed.
By the end of this page, you will deeply understand at-least-once delivery—its formal guarantees, the mechanisms that enable it, the failure scenarios that cause duplicate delivery, and the critical implications for consumer design. You will understand why at-least-once is the default choice for business-critical messaging and how to design systems that handle it correctly.
At-least-once delivery guarantees that every message will be delivered one or more times—never zero. In the happy path a message arrives exactly once; under failure conditions it may arrive several times, but it is never silently dropped.
The formal definition:
For any message M sent by producer P to consumer C, let D(M) represent the number of times M is delivered to C. At-least-once guarantees: D(M) ≥ 1.
Unlike at-most-once (which bounds the upper limit), at-least-once bounds the lower limit. The message will arrive—the question is whether it arrives once or multiple times.
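The asymmetry between the two guarantees can be made concrete with a small simulation. The sketch below is purely illustrative (no real broker involved): a producer retries a message until it observes an ACK, and we inject a lost ACK for one message to show that delivery counts can exceed one but never reach zero.

```typescript
// Illustrative only: simulate an at-least-once channel where the producer
// retries until it observes an ACK, and ACKs can be lost in transit.
type DeliveryLog = Map<string, number>;

function simulateAtLeastOnce(
  messageIds: string[],
  ackLost: (attempt: number, id: string) => boolean
): DeliveryLog {
  const deliveries: DeliveryLog = new Map();

  for (const id of messageIds) {
    let acked = false;
    let attempt = 0;

    while (!acked) {
      attempt++;
      // The broker receives and persists the message on every attempt...
      deliveries.set(id, (deliveries.get(id) ?? 0) + 1);
      // ...but the ACK back to the producer may be lost, forcing a retry
      acked = !ackLost(attempt, id);
    }
  }

  return deliveries;
}

// The first ACK for m2 is lost, so m2 is delivered twice;
// every message still satisfies D(M) >= 1
const log = simulateAtLeastOnce(
  ['m1', 'm2', 'm3'],
  (attempt, id) => id === 'm2' && attempt === 1
);
```

Here `m2` ends up with a delivery count of 2 even though it was sent once: the retry that guarantees D(M) ≥ 1 is the same retry that produces the duplicate.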
```typescript
// At-least-once delivery: Producer persists and waits for confirmation
class AtLeastOnceProducer {
  private broker: MessageBroker;
  private maxRetries: number = 3;
  private retryDelay: number = 1000;

  constructor(broker: MessageBroker) {
    this.broker = broker;
  }

  /**
   * Send a message with at-least-once semantics.
   *
   * Key characteristics:
   * - Wait for broker acknowledgment
   * - Retry on failure
   * - Message persisted before acknowledgment
   * - Message guaranteed to arrive (eventually)
   * - Message may be duplicated on retry
   */
  async send(topic: string, message: Message): Promise<void> {
    let lastError: Error | null = null;

    for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
      try {
        // Send and wait for broker confirmation
        const ack = await this.broker.publish(topic, message, {
          timeout: 5000,
          requireAck: true, // Wait for broker to persist
        });

        if (ack.success) {
          return; // Success - message guaranteed persisted
        }
      } catch (error) {
        lastError = error as Error;
        console.warn(`Attempt ${attempt} failed: ${lastError.message}`);

        // Wait before retry (with exponential backoff)
        await this.sleep(this.retryDelay * Math.pow(2, attempt - 1));
      }
    }

    // After all retries exhausted, we have options:
    // 1. Throw error (caller must handle)
    // 2. Queue for later retry (dead letter queue)
    // 3. Alert operations team
    throw new Error(`Failed to deliver message after ${this.maxRetries} attempts: ${lastError?.message}`);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
```

The very mechanism that prevents message loss—retries—is also what causes duplicates. If the producer's first attempt succeeded but the acknowledgment was lost (network failure), the producer will retry, delivering the same message again. The consumer will receive it twice.
At-least-once delivery requires several key mechanisms working together across the entire message pipeline.
Producer-Side Mechanics:

- Send the message and wait for an explicit broker acknowledgment before treating it as delivered
- On timeout or error, retry the send (ideally with exponential backoff)
- Only after all retries are exhausted, escalate: throw, route to a dead letter queue, or alert operators
Broker-Side Mechanics:
The message broker plays a critical role in at-least-once delivery:
```typescript
class AtLeastOnceBroker {
  private storage: DurableStorage;
  private activeDeliveries: Map<string, MessageDelivery> = new Map();
  private redeliveryTimeout: number = 30000; // 30 seconds

  /**
   * Receive message from producer with at-least-once guarantee
   */
  async receiveFromProducer(message: Message): Promise<ProducerAck> {
    // 1. FIRST: Persist the message durably
    // This is the critical step - message survives broker crash
    await this.storage.persist(message);

    // 2. ONLY THEN: Acknowledge to producer
    // If broker crashes between persist and ack, producer retries
    // Result: duplicate in storage (consumer must handle)
    return { success: true, messageId: message.id };
  }

  /**
   * Deliver message to consumer with redelivery on timeout
   */
  async deliverToConsumer(message: Message, consumer: Consumer): Promise<void> {
    const deliveryId = `${message.id}-${Date.now()}`;

    // Track this delivery attempt
    this.activeDeliveries.set(deliveryId, {
      message,
      consumer,
      startedAt: Date.now(),
      acknowledged: false,
    });

    // Deliver the message
    await consumer.receive(message);

    // Start redelivery timer
    // If consumer doesn't ACK within timeout, redeliver
    this.scheduleRedeliveryCheck(deliveryId);
  }

  /**
   * Process acknowledgment from consumer
   */
  async acknowledgeFromConsumer(messageId: string): Promise<void> {
    // Find and mark delivery as acknowledged
    for (const [deliveryId, delivery] of this.activeDeliveries) {
      if (delivery.message.id === messageId) {
        delivery.acknowledged = true;

        // Remove from active deliveries
        this.activeDeliveries.delete(deliveryId);

        // Mark as consumed in storage (or delete)
        await this.storage.markConsumed(messageId);
        return;
      }
    }
  }

  /**
   * Check if redelivery is needed
   */
  private scheduleRedeliveryCheck(deliveryId: string): void {
    setTimeout(async () => {
      const delivery = this.activeDeliveries.get(deliveryId);

      if (delivery && !delivery.acknowledged) {
        // Consumer didn't acknowledge in time
        // REDELIVER - this causes the duplicate
        console.warn(`Message ${delivery.message.id} not acknowledged, redelivering`);

        // Remove stale delivery record
        this.activeDeliveries.delete(deliveryId);

        // Redeliver to same or different consumer
        await this.deliverToConsumer(delivery.message, delivery.consumer);
      }
    }, this.redeliveryTimeout);
  }
}
```

Consumer-Side Mechanics:
The consumer's behavior is crucial for at-least-once delivery. The key rule: acknowledge AFTER processing, not before.
```typescript
class AtLeastOnceConsumer {
  private broker: MessageBroker;
  private processingTimeout: number = 25000; // Less than broker's redelivery timeout

  async consumeMessage(): Promise<void> {
    const message = await this.broker.receive();

    try {
      // CRITICAL: Process BEFORE acknowledging
      // This is what guarantees at-least-once delivery
      await this.processMessage(message);

      // ONLY acknowledge after successful processing
      await this.broker.acknowledge(message.id);
    } catch (error) {
      // Processing failed - DO NOT acknowledge
      // Message will be redelivered by broker
      console.error(`Processing failed: ${(error as Error).message}`);

      // Optionally: negative acknowledge to speed up redelivery
      await this.broker.negativeAcknowledge(message.id);

      // The same message will be delivered again
    }
  }

  private async processMessage(message: Message): Promise<void> {
    // Business logic here
    // MUST be idempotent because duplicates are possible!
    await this.orderService.fulfillOrder(message.orderId);
  }
}
```

At-least-once delivery fundamentally requires idempotent consumers. Since the same message may be delivered multiple times, processing that message multiple times must produce the same result as processing it once. We'll cover idempotency patterns in detail in Page 4.
Understanding exactly when and why duplicates occur is essential for designing robust systems. Let's examine each failure scenario in detail.
| Failure Scenario | What Happens | Duplicate Cause | Consumer Impact |
|---|---|---|---|
| Producer ACK lost | Producer sent message, broker received and persisted, ACK packet lost | Producer retries, sends same message again | Message arrives twice from broker |
| Consumer crash after processing | Consumer processed message, crashed before sending ACK | Broker times out, redelivers to same/different consumer | Same message processed twice |
| Consumer ACK lost | Consumer sent ACK, network lost the packet | Broker times out waiting for ACK, redelivers | Same message processed twice |
| Broker failover | Primary broker received message, failed before replicating | Consumer reconnects to new broker, may receive again | Depends on replication lag |
| Network partition healed | Consumer was partitioned, processed old message, reconnects | Out-of-order delivery after partition | Old message reprocessed |
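The middle rows of the table (consumer crash after processing, lost consumer ACK) both reduce to the same broker behavior: an unacknowledged delivery is redelivered after a timeout. Here is a toy simulation of that behavior, with a tick counter standing in for real time; all names are hypothetical and not tied to any particular broker.

```typescript
// Hypothetical sketch: a broker that redelivers any message not acknowledged
// within a timeout. "Time" is a simple tick counter rather than a real clock.
interface Pending { id: string; deliveredAt: number }

class RedeliverySimulator {
  private pending: Pending[] = [];
  deliveryCount = new Map<string, number>();

  constructor(private timeoutTicks: number) {}

  // Deliver a message and start tracking it as unacknowledged
  deliver(id: string, now: number): void {
    this.deliveryCount.set(id, (this.deliveryCount.get(id) ?? 0) + 1);
    this.pending.push({ id, deliveredAt: now });
  }

  // An ACK removes the message from the redelivery watch list
  ack(id: string): void {
    this.pending = this.pending.filter(p => p.id !== id);
  }

  // On each tick, redeliver anything whose ACK window has expired
  tick(now: number): void {
    const expired = this.pending.filter(p => now - p.deliveredAt >= this.timeoutTicks);
    this.pending = this.pending.filter(p => now - p.deliveredAt < this.timeoutTicks);
    for (const p of expired) this.deliver(p.id, now);
  }
}

const sim = new RedeliverySimulator(3);
sim.deliver('m1', 0);
sim.ack('m1');        // ACK arrives: m1 is done, delivered once
sim.deliver('m2', 0); // consumer processes m2 but its ACK is lost
sim.tick(3);          // timeout expires: m2 is redelivered (duplicate)
```

Note that the broker cannot distinguish "consumer crashed before processing" from "consumer processed but the ACK was lost": in both cases it sees a missing ACK and must redeliver, which is why only the consumer can resolve the duplicate.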
Scenario Deep Dive: Consumer Crash After Processing
This is the most common duplicate scenario. Let's trace through exactly what happens:
```typescript
/**
 * Timeline of a consumer crash causing duplicate processing
 *
 * T+0ms:     Broker delivers message M1 to Consumer A
 * T+100ms:   Consumer A receives message M1
 * T+500ms:   Consumer A starts processing M1
 * T+2000ms:  Consumer A completes processing M1
 *            - Order is fulfilled
 *            - Database is updated
 *            - Email is sent
 *            ⚡ CRASH: Consumer A dies before sending ACK
 *
 * T+30000ms: Broker's redelivery timeout expires
 *            Broker notices M1 was never acknowledged
 *            Broker marks M1 for redelivery
 *
 * T+30100ms: Broker delivers M1 to Consumer B (or restarted A)
 *
 * T+32000ms: Consumer B processes M1 again
 *            - Order is fulfilled AGAIN (duplicate!)
 *            - Database is updated AGAIN
 *            - Email is sent AGAIN
 *            Consumer B sends ACK
 *
 * Result: Customer receives order twice, gets charged twice (potentially),
 *         receives two confirmation emails
 *
 * This is why AT-LEAST-ONCE REQUIRES IDEMPOTENT CONSUMERS
 */

// The solution: Make processing idempotent
class IdempotentOrderProcessor {
  private db: Database;

  async processOrder(orderId: string): Promise<void> {
    // Check if already processed
    const existingOrder = await this.db.orders.findUnique({
      where: { orderId, status: 'FULFILLED' }
    });

    if (existingOrder) {
      console.log(`Order ${orderId} already fulfilled, skipping`);
      return; // Idempotent: duplicate processing is a no-op
    }

    // Process the order
    await this.db.orders.update({
      where: { orderId },
      data: { status: 'FULFILLED' }
    });

    await this.fulfillmentService.ship(orderId);
  }
}
```

Quantifying Duplicate Probability
The probability of duplicate delivery varies with system characteristics: network reliability (lost ACKs), consumer stability (crashes before acknowledging), and how acknowledgment timeouts compare to actual processing times.
In well-designed production systems, duplicate rates typically range from 0.01% to 0.5%. While this seems low, at high volumes it becomes significant:
```typescript
/**
 * Calculate expected duplicates at different scales
 */
function calculateDuplicates(
  messagesPerDay: number,
  duplicateRate: number = 0.001 // 0.1%
): void {
  const dailyDuplicates = messagesPerDay * duplicateRate;

  console.log(`At ${messagesPerDay.toLocaleString()} messages/day:`);
  console.log(`  Daily duplicates: ~${Math.round(dailyDuplicates).toLocaleString()}`);
  console.log(`  Monthly duplicates: ~${Math.round(dailyDuplicates * 30).toLocaleString()}`);
}

// Real-world examples
calculateDuplicates(100_000); // Small service
// Daily: ~100, Monthly: ~3,000

calculateDuplicates(10_000_000); // Medium service
// Daily: ~10,000, Monthly: ~300,000

calculateDuplicates(1_000_000_000); // Large platform
// Daily: ~1,000,000, Monthly: ~30,000,000

// At scale, even a 0.1% duplicate rate means MILLIONS of duplicates per month
// Every consumer MUST be designed to handle this
```

Do not treat duplicate messages as bugs to be eliminated. They are the natural consequence of choosing reliability over simplicity. Your system design must embrace duplicates as a normal operating condition, not an exceptional failure mode.
The acknowledgment (ACK) mechanism is the heart of at-least-once delivery. Different acknowledgment patterns offer different trade-offs between reliability, performance, and complexity.
Pattern 1: Single Message Acknowledgment
The consumer acknowledges each message individually after processing. This is the safest and most common pattern.
```typescript
class SingleMessageAckConsumer {
  private broker: MessageBroker;

  async consume(): Promise<void> {
    while (true) {
      const message = await this.broker.receive();

      try {
        await this.process(message);

        // ACK this specific message
        await this.broker.ack(message.deliveryTag);
      } catch (error) {
        // NACK to trigger redelivery
        await this.broker.nack(message.deliveryTag, { requeue: true });
      }
    }
  }
}

// Pros:
// - Fine-grained control over which messages succeed
// - Failed messages are redelivered independently
// - Easy to reason about

// Cons:
// - ACK round-trip for every message (latency overhead)
// - Lower throughput for high-volume scenarios
```

Pattern 2: Batch Acknowledgment
The consumer acknowledges multiple messages at once, typically up to a specific offset or batch boundary.
```typescript
class BatchAckConsumer {
  private broker: MessageBroker;
  private batchSize: number = 100;
  private processedMessages: Message[] = [];

  async consume(): Promise<void> {
    while (true) {
      const message = await this.broker.receive();

      try {
        await this.process(message);
        this.processedMessages.push(message);

        // ACK when batch is complete
        if (this.processedMessages.length >= this.batchSize) {
          // Acknowledge all messages up to the latest offset
          const latestOffset = this.processedMessages[this.processedMessages.length - 1].offset;
          await this.broker.ackUpTo(latestOffset);
          this.processedMessages = []; // Clear batch
        }
      } catch (error) {
        // Batch ACK complexity: if message 50/100 fails,
        // we can only ACK up to message 49
        // Messages 50-99 must be reprocessed
        const lastSuccessful = this.processedMessages[this.processedMessages.length - 1];
        if (lastSuccessful) {
          await this.broker.ackUpTo(lastSuccessful.offset);
        }

        // NACK the failed message
        await this.broker.nack(message.deliveryTag, { requeue: true });
        this.processedMessages = [];
      }
    }
  }
}

// Pros:
// - Higher throughput (fewer ACK round-trips)
// - Better performance for high-volume scenarios

// Cons:
// - Failure in batch affects subsequent messages
// - More complex error handling
// - Larger redelivery units (more duplicates on failure)
```

Pattern 3: Offset-Based Acknowledgment (Kafka-style)
Common in log-based messaging systems, this pattern commits the consumer's read position (offset) rather than individual messages.
```typescript
import { Kafka, EachMessagePayload } from 'kafkajs';

const kafka = new Kafka({ brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'order-processor' });

await consumer.subscribe({ topic: 'orders' });

// Manual offset commit for at-least-once guarantee
await consumer.run({
  autoCommit: false, // CRITICAL: disable auto-commit for at-least-once
  eachMessage: async ({ topic, partition, message }: EachMessagePayload) => {
    try {
      // Process the message
      await processOrder(JSON.parse(message.value!.toString()));

      // Commit offset AFTER successful processing
      await consumer.commitOffsets([{
        topic,
        partition,
        offset: (BigInt(message.offset) + 1n).toString(),
      }]);
    } catch (error) {
      // Don't commit - message will be redelivered on restart
      // Or: seek back to retry immediately
      console.error(`Failed to process message at offset ${message.offset}`);

      // Option: explicit seek for immediate retry
      // consumer.seek({ topic, partition, offset: message.offset });
    }
  },
});

// WARNING: Auto-commit (autoCommit: true) weakens the guarantee!
// Auto-commit happens on a timer, regardless of processing success,
// so offsets can be committed for messages whose processing later fails.
// For at-least-once, you MUST use manual commits
```

Many developers are surprised to learn that relying on Kafka's auto-commit behavior does not give a clean at-least-once guarantee. Auto-commit happens on a timer (every 5 seconds by default), regardless of whether processing succeeded, so an offset can be committed for a message whose processing later fails, degrading to at-most-once for that message. For at-least-once, always disable auto-commit and commit manually after processing.
Pattern 4: Transactional Acknowledgment
For systems requiring exactly-once processing with external side effects, some brokers support transactional acknowledgment patterns.
```typescript
// Kafka Transactions: consume-transform-produce atomically
const producer = kafka.producer({
  transactionalId: 'order-processor-tx',
  idempotent: true,
});

await consumer.run({
  autoCommit: false,
  eachMessage: async ({ topic, partition, message }) => {
    // Begin transaction
    const transaction = await producer.transaction();

    try {
      // Process and produce output
      const result = await processOrder(message.value);

      await transaction.send({
        topic: 'processed-orders',
        messages: [{ value: JSON.stringify(result) }],
      });

      // Commit consumer offset AS PART OF the transaction
      await transaction.sendOffsets({
        consumerGroupId: 'order-processor',
        topics: [{
          topic,
          partitions: [{
            partition,
            offset: (BigInt(message.offset) + 1n).toString(),
          }],
        }],
      });

      // Atomically commit: output message + offset commit
      await transaction.commit();
    } catch (error) {
      // Abort transaction: neither output nor offset is committed
      await transaction.abort();
    }
  },
});

// This provides exactly-once semantics for consume-transform-produce workflows
// But external side effects (APIs, databases) still need additional handling
```

Retry mechanisms are fundamental to at-least-once delivery, but naive retry strategies can cause more problems than they solve. Let's explore the key retry patterns and their trade-offs.
Immediate Retry (Anti-Pattern)
The simplest—and worst—retry strategy is to retry immediately upon failure.
```typescript
// ❌ ANTI-PATTERN: Immediate retry
async function sendWithImmediateRetry(broker: MessageBroker, message: Message): Promise<void> {
  while (true) {
    try {
      await broker.send(message);
      return;
    } catch (error) {
      // Immediately retry - THIS IS DANGEROUS
      console.error(`Send failed, retrying immediately...`);
      // No delay between retries
    }
  }
}

// Problems:
// 1. If failure is due to overload, immediate retries make it worse
// 2. CPU/network saturation from tight retry loop
// 3. Can trigger cascading failures across the system
// 4. No opportunity for transient issues to resolve
```

Exponential Backoff
The standard solution is exponential backoff—each retry waits longer than the previous one.
```typescript
class ExponentialBackoffRetry {
  private broker: MessageBroker;
  private baseDelay: number = 100;   // 100ms initial delay
  private maxDelay: number = 30000;  // 30 second max delay
  private maxRetries: number = 10;

  /**
   * Calculate delay for retry attempt.
   * Delay doubles with each attempt: 100ms, 200ms, 400ms, 800ms...
   */
  private calculateDelay(attempt: number): number {
    const delay = this.baseDelay * Math.pow(2, attempt);
    return Math.min(delay, this.maxDelay);
  }

  async sendWithRetry(message: Message): Promise<void> {
    let lastError: Error | null = null;

    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        await this.broker.send(message);
        return; // Success
      } catch (error) {
        lastError = error as Error;

        if (attempt < this.maxRetries - 1) {
          const delay = this.calculateDelay(attempt);
          console.log(`Retry ${attempt + 1} in ${delay}ms...`);
          await this.sleep(delay);
        }
      }
    }

    throw new Error(`Failed after ${this.maxRetries} retries: ${lastError?.message}`);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Retry timeline example:
// Attempt 1: immediate
// Attempt 2: wait 100ms
// Attempt 3: wait 200ms
// Attempt 4: wait 400ms
// Attempt 5: wait 800ms
// Attempt 6: wait 1600ms
// Attempt 7: wait 3200ms
// Attempt 8: wait 6400ms
// Attempt 9: wait 12800ms
// Attempt 10: wait 25600ms
// Total max wait: ~50 seconds across all retries
```

Exponential Backoff with Jitter
When many clients retry simultaneously (thundering herd), their retries can synchronize and overwhelm the system. Adding jitter (randomness) desynchronizes retries.
```typescript
class JitteredExponentialBackoff {
  private baseDelay: number = 100;
  private maxDelay: number = 30000;

  /**
   * Full Jitter: Random value between 0 and exponential delay
   * Recommended by AWS and Google for distributed systems
   */
  calculateDelayWithFullJitter(attempt: number): number {
    const exponentialDelay = this.baseDelay * Math.pow(2, attempt);
    const cappedDelay = Math.min(exponentialDelay, this.maxDelay);

    // Full jitter: random between 0 and calculated delay
    return Math.random() * cappedDelay;
  }

  /**
   * Equal Jitter: Half fixed, half random
   * More predictable minimum wait time
   */
  calculateDelayWithEqualJitter(attempt: number): number {
    const exponentialDelay = this.baseDelay * Math.pow(2, attempt);
    const cappedDelay = Math.min(exponentialDelay, this.maxDelay);

    // Equal jitter: half delay + random(0, half delay)
    const halfDelay = cappedDelay / 2;
    return halfDelay + (Math.random() * halfDelay);
  }

  /**
   * Decorrelated Jitter: Each delay based on previous delay
   * AWS recommendation for highly contended resources
   */
  calculateDecorrelatedJitter(previousDelay: number): number {
    const newDelay = this.baseDelay + Math.random() * (previousDelay * 3 - this.baseDelay);
    return Math.min(newDelay, this.maxDelay);
  }
}

// Without jitter (1000 clients failing at T=0):
// T+100ms: 1000 clients retry simultaneously → system overloaded
// T+200ms: 1000 clients retry simultaneously → system still overloaded
// Repeating pattern prevents recovery

// With full jitter (1000 clients failing at T=0):
// T+0-100ms: Clients retry randomly spread across 100ms window
// T+0-200ms: Second attempts spread across 200ms window
// System has time to recover between requests
```

AWS's architecture best practices recommend full jitter for most retry scenarios. Their analysis shows full jitter completes work fastest on average because it maximizes the spread of retry attempts, giving the downstream system the best chance to recover.
Let's examine how major messaging systems implement at-least-once delivery guarantees.
Apache Kafka
Kafka provides at-least-once delivery through producer acknowledgments and manual consumer offset commits.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'order-service',
  brokers: ['kafka-1:9092', 'kafka-2:9092', 'kafka-3:9092'],
});

// Producer: At-least-once configuration
const producer = kafka.producer({
  // Enable idempotent producer to deduplicate within producer session
  idempotent: true,

  // Retry configuration
  retry: {
    retries: 10,
    initialRetryTime: 100,
    maxRetryTime: 30000,
  },
});

// Producer usage
await producer.send({
  topic: 'orders',
  // acks: -1 (equivalent to 'all') = wait for all in-sync replicas
  // This is the key setting for durability; in kafkajs it is passed per send()
  acks: -1,
  messages: [{
    key: orderId, // Key for partitioning (ordering guarantee)
    value: JSON.stringify(order),
  }],
});
// This call returns ONLY after all replicas have acknowledged

// Consumer: At-least-once configuration
const consumer = kafka.consumer({ groupId: 'order-processor' });

await consumer.run({
  // Manual commits for at-least-once guarantee (a run() option in kafkajs)
  autoCommit: false,
  eachMessage: async ({ topic, partition, message }) => {
    // Process first
    await processOrder(message.value);

    // Then commit
    await consumer.commitOffsets([{
      topic,
      partition,
      offset: (BigInt(message.offset) + 1n).toString(),
    }]);
  },
});
```

RabbitMQ
RabbitMQ provides at-least-once through publisher confirms and consumer acknowledgments.
```typescript
import amqp from 'amqplib';

// Publisher: At-least-once with publisher confirms
const connection = await amqp.connect('amqp://localhost');

// Create a confirm channel so the broker confirms each publish
const channel = await connection.createConfirmChannel();

// Declare durable queue (survives broker restart)
await channel.assertQueue('orders', {
  durable: true, // Queue definition persisted
});

// Publish with confirmation
const message = JSON.stringify(order);
channel.sendToQueue(
  'orders',
  Buffer.from(message),
  {
    persistent: true, // Message persisted to disk
  }
);

// Wait for broker confirm
await channel.waitForConfirms();
// Now we know the message is safely persisted

// Consumer: At-least-once with manual acknowledgment
await channel.consume(
  'orders',
  async (msg) => {
    if (!msg) return;

    try {
      // Process the message
      await processOrder(JSON.parse(msg.content.toString()));

      // Acknowledge AFTER successful processing
      channel.ack(msg);
    } catch (error) {
      // Negative acknowledge - message will be requeued
      channel.nack(msg, false, true); // (msg, allUpTo, requeue)
    }
  },
  {
    noAck: false, // CRITICAL: Enable acknowledgments
  }
);
```

AWS SQS
SQS provides at-least-once delivery through visibility timeouts and explicit deletion.
```typescript
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' });
const queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789/orders';

async function consumeMessages(): Promise<void> {
  while (true) {
    // Receive messages (they become invisible, not deleted)
    const response = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 10,
      VisibilityTimeout: 300, // 5 minutes to process
      WaitTimeSeconds: 20,    // Long polling
    }));

    for (const message of response.Messages || []) {
      try {
        // Process the message
        await processOrder(JSON.parse(message.Body!));

        // Delete AFTER successful processing
        // This is the "acknowledgment" in SQS
        await sqs.send(new DeleteMessageCommand({
          QueueUrl: queueUrl,
          ReceiptHandle: message.ReceiptHandle!,
        }));
      } catch (error) {
        // Don't delete - message will become visible again
        // after visibility timeout expires
        console.error(`Processing failed: ${(error as Error).message}`);

        // SQS will redeliver after visibility timeout
        // This provides at-least-once delivery
      }
    }
  }
}

// SQS At-Least-Once Mechanics:
// 1. Message is received and becomes invisible
// 2. Consumer has VisibilityTimeout seconds to process and delete
// 3. If deleted: message is gone (successfully processed)
// 4. If not deleted: message becomes visible again (redelivered)
//
// After MaxReceiveCount redeliveries, message goes to Dead Letter Queue
```

Despite different APIs and terminology, all systems implement the same fundamental pattern: receive → process → acknowledge. The acknowledgment is only sent after processing succeeds, ensuring no message is lost. The cost is potential duplicates when acknowledgment fails.
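That shared pattern can be written down broker-agnostically. The sketch below uses a hypothetical Broker interface of my own naming; its acknowledge() would map onto commitOffsets in Kafka, ack in RabbitMQ, or DeleteMessage in SQS. A minimal in-memory broker is included so the loop can be exercised.

```typescript
// Broker-agnostic sketch of the universal at-least-once consumer loop.
// The Broker interface is hypothetical; real systems differ only in naming.
interface BrokerMessage { id: string; body: string }

interface Broker {
  receive(): BrokerMessage | undefined;
  acknowledge(id: string): void;
}

function drainQueue(
  broker: Broker,
  handle: (body: string) => void
): string[] {
  const ackedIds: string[] = [];
  let msg: BrokerMessage | undefined;

  while ((msg = broker.receive()) !== undefined) {
    try {
      handle(msg.body);           // 1. process first
      broker.acknowledge(msg.id); // 2. acknowledge only on success
      ackedIds.push(msg.id);
    } catch {
      // 3. no ACK on failure: a real broker would redeliver later
    }
  }

  return ackedIds;
}

// Minimal in-memory broker for demonstration
class InMemoryBroker implements Broker {
  constructor(private queue: BrokerMessage[]) {}
  receive(): BrokerMessage | undefined { return this.queue.shift(); }
  acknowledge(_id: string): void { /* mark consumed */ }
}

const demoBroker = new InMemoryBroker([
  { id: 'a', body: 'ok' },
  { id: 'b', body: 'fail' },
  { id: 'c', body: 'ok' },
]);

const acked = drainQueue(demoBroker, body => {
  if (body === 'fail') throw new Error('processing error');
});
// acked is ['a', 'c']: 'b' was never acknowledged
```

With the queue above, only 'a' and 'c' are acknowledged; 'b' stays unacknowledged, and a real broker would make it visible again for redelivery.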
At-least-once delivery is the workhorse of reliable distributed messaging, providing the guarantee that no message will ever be lost. Let's consolidate the key insights:

- The formal guarantee is D(M) ≥ 1: a message may arrive more than once, but never zero times
- The universal pattern is receive → process → acknowledge, with the acknowledgment sent only after processing succeeds
- The same mechanisms that prevent loss (retries, redelivery timeouts, persistence before ACK) are exactly what produce duplicates
- Duplicates are a normal operating condition, not a bug, so consumers must be idempotent
- Retries should use exponential backoff with jitter to avoid amplifying failures into thundering herds
You now understand at-least-once delivery—the most common choice for reliable messaging in distributed systems. In the next page, we'll tackle the most challenging guarantee: exactly-once delivery, exploring why it's so difficult to achieve and what 'exactly-once' actually means in practice.