While Kafka and RabbitMQ offer powerful capabilities, they come with significant operational overhead—cluster management, capacity planning, failover handling, and ongoing maintenance. Amazon Simple Queue Service (SQS) represents a fundamentally different philosophy: a fully managed queue service where AWS handles all infrastructure concerns, allowing teams to focus entirely on their application logic.
SQS was one of AWS's earliest services, launched in 2006, and has since processed trillions of messages for organizations worldwide. Its design prioritizes simplicity, reliability, and infinite scalability—you never provision capacity, manage servers, or worry about storage limits. Messages flow, and AWS handles everything else.
By the end of this page, you will understand SQS's architecture, the differences between Standard and FIFO queues, visibility timeouts and message lifecycle, dead letter queue configuration, long polling optimization, and how SQS integrates with the broader AWS ecosystem.
Unlike self-managed message brokers where you control (and are responsible for) every component, SQS abstracts all infrastructure complexity behind a simple API.
Key architectural properties:
Distributed by design: Messages are stored redundantly across multiple AWS Availability Zones, providing durability without manual replication configuration.
Serverless scaling: SQS automatically scales to handle any volume of messages—from zero to millions per second—without capacity planning.
At-least-once delivery: Standard queues guarantee each message is delivered at least once, though duplicates are possible; FIFO queues provide exactly-once processing.
Pull-based consumption: Consumers poll SQS for messages rather than receiving pushes. This enables backpressure management and graceful degradation under load.
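Because Standard queues deliver at least once, consumers are usually written to be idempotent: processing a duplicate must be harmless. A minimal in-memory sketch; a real system would key on SQS's MessageId or a business ID in a durable store such as DynamoDB:

```python
def make_idempotent(handler, seen=None):
    """Wrap a message handler so duplicate deliveries are processed once.

    `seen` holds already-processed message IDs. It is an in-memory set
    here for illustration; production systems use a durable store.
    """
    seen = set() if seen is None else seen

    def wrapped(message_id, body):
        if message_id in seen:
            return False  # duplicate delivery: skip quietly
        handler(body)
        seen.add(message_id)  # record only after the handler succeeds
        return True

    return wrapped
```

Note the ordering: the ID is recorded only after the handler succeeds, so a crash mid-processing leads to a retry rather than a lost message.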
SQS Message Lifecycle

+================================================================+
| PRODUCER                                                       |
| Sends message to queue via AWS SDK/API                         |
+================================================================+
                |
                ↓ SendMessage()
+================================================================+
| SQS QUEUE (AWS Managed)                                        |
| - Stored redundantly across multiple AZs                       |
| - Retained up to 14 days (configurable 1 min - 14 days)        |
| - No size limits on queue (stores unlimited messages)          |
+================================================================+
                |
                ↓ ReceiveMessage()
+================================================================+
| CONSUMER                                                       |
| 1. Receives message (message becomes "in-flight")              |
| 2. Processes message                                           |
| 3. Deletes message via DeleteMessage()                         |
+================================================================+

If consumer fails before delete:
→ Visibility timeout expires
→ Message becomes visible again
→ Another consumer can receive it

The SQS API is intentionally simple:
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/my-queue'

# Send a message
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='Hello World',
    MessageAttributes={
        'Author': {'DataType': 'String', 'StringValue': 'System'}
    }
)

# Receive messages (up to 10 at a time)
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20  # Long polling
)

# Process and delete
for message in response.get('Messages', []):
    process(message['Body'])  # your handler
    sqs.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=message['ReceiptHandle']
    )
The entire API surface is small: SendMessage, ReceiveMessage, DeleteMessage, ChangeMessageVisibility, and a few management operations. This simplicity is SQS's greatest strength.
SQS charges per request (roughly $0.40 per million requests) with no charges for data transfer within the same region. The first million requests per month are free. This pay-per-use model means you never pay for idle capacity.
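The pay-per-use arithmetic is simple enough to sketch. The prices below are the figures quoted above; actual pricing varies by region and queue type:

```python
def monthly_request_cost(requests_per_month,
                         price_per_million=0.40,
                         free_tier=1_000_000):
    """Estimate the monthly SQS bill for API requests alone.

    Assumes the $0.40/million Standard-queue price and the 1M-request
    free tier mentioned above; data transfer is excluded.
    """
    billable = max(0, requests_per_month - free_tier)
    return billable / 1_000_000 * price_per_million
```

For example, 11 million requests in a month would cost about $4 after the free tier.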
SQS offers two queue types with fundamentally different guarantees. Choosing correctly is crucial for your use case.
Standard Queues:
The original SQS queue type, optimized for maximum throughput.
FIFO Queues:
Introduced in 2016 for use cases requiring strict ordering.
| Characteristic | Standard Queue | FIFO Queue |
|---|---|---|
| Throughput | Nearly unlimited | 300 msg/sec (3,000 with batching) |
| Ordering | Best-effort | Strict (first-in-first-out) |
| Delivery | At-least-once | Exactly-once |
| Duplicates | Possible | None (within 5-min window) |
| Cost | $0.40 per million | $0.50 per million |
| Queue name | Any valid name | Must end with .fifo |
| Use case | Throughput matters, dups OK | Order matters, dups not OK |
Message Groups in FIFO Queues:
FIFO queues support Message Group IDs that enable parallel processing while maintaining order within each group:
# All messages with same MessageGroupId are ordered
# Different groups can be processed in parallel
# (assumes ContentBasedDeduplication is enabled on the queue,
#  otherwise each send also needs a MessageDeduplicationId)

# Order 123 - processed in sequence
sqs.send_message(
    QueueUrl=fifo_queue_url,
    MessageBody='Order 123 placed',
    MessageGroupId='order-123'
)
sqs.send_message(
    QueueUrl=fifo_queue_url,
    MessageBody='Order 123 paid',
    MessageGroupId='order-123'
)

# Order 456 - processed in parallel with order 123
sqs.send_message(
    QueueUrl=fifo_queue_url,
    MessageBody='Order 456 placed',
    MessageGroupId='order-456'
)
Deduplication in FIFO Queues:
FIFO queues prevent duplicates within a 5-minute deduplication interval using either:

- Content-based deduplication: SQS computes a SHA-256 hash of the message body (enabled via the ContentBasedDeduplication queue attribute)
- An explicit MessageDeduplicationId supplied by the producer
# Explicit deduplication
sqs.send_message(
    QueueUrl=fifo_queue_url,
    MessageBody='Process payment',
    MessageGroupId='order-123',
    MessageDeduplicationId='payment-attempt-uuid-abc123'
)
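With content-based deduplication (the alternative to an explicit ID), SQS derives the deduplication ID itself as a SHA-256 hash of the message body. The equivalent computation, useful for reasoning about which sends SQS will treat as duplicates:

```python
import hashlib


def content_dedup_id(body: str) -> str:
    """The deduplication ID SQS derives under ContentBasedDeduplication:
    a SHA-256 hash of the message body (attributes are not included)."""
    return hashlib.sha256(body.encode('utf-8')).hexdigest()
```

Two sends with byte-identical bodies inside the 5-minute window produce the same hash and are deduplicated; any difference in the body yields a new ID.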
Use Standard queues unless you explicitly need ordering or exactly-once processing. Standard queues are simpler, cheaper, and infinitely scalable. For financial transactions, order processing, or any workflow where duplicates cause problems, FIFO is worth the throughput trade-off.
The visibility timeout is SQS's mechanism for ensuring messages are processed successfully. Understanding and configuring it correctly is essential for reliable message processing.
How visibility timeout works:
When a consumer receives a message, that message becomes "invisible" to other consumers for a configurable period. During this window, the consumer processes the message and deletes it. If the consumer fails to delete before the timeout expires, the message becomes visible again for another consumer to process.
1. Consumer A calls ReceiveMessage() at T=0
2. Message becomes invisible, visibility timeout = 30 sec
3. Consumer A processes message...
Success path:
4a. Consumer A calls DeleteMessage() at T=15
5a. Message permanently removed
Failure path:
4b. Consumer A crashes at T=15
5b. At T=30, message becomes visible
6b. Consumer B receives and processes message
Visibility Timeout Timeline

+------------------------------------------------------------------+
| T=0: Consumer receives message                                   |
|  │   Message becomes INVISIBLE                                   |
|  ▼                                                               |
| ╔════════════════════════════════════════════════════════════╗  |
| ║ VISIBILITY TIMEOUT (30 sec)                                ║  |
| ║                                                            ║  |
| ║ T=10: Processing...                                        ║  |
| ║ T=20: Still processing...                                  ║  |
| ║                                                            ║  |
| ╚════════════════════════════════════════════════════════════╝  |
|  ▼                                                               |
| T=30: Timeout expires                                            |
|       Message becomes VISIBLE again (if not deleted)             |
+------------------------------------------------------------------+

If processing takes >30 sec:
→ Message becomes visible while still being processed
→ Another consumer receives a duplicate
→ Solution: Extend the visibility timeout during processing

Configuring visibility timeout:
# Set default visibility timeout for queue (0 sec - 12 hours)
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={'VisibilityTimeout': '300'}  # 5 minutes
)

# Override per-receive
response = sqs.receive_message(
    QueueUrl=queue_url,
    VisibilityTimeout=60  # 1 minute for this batch
)

# Extend during processing for long-running jobs
sqs.change_message_visibility(
    QueueUrl=queue_url,
    ReceiptHandle=message['ReceiptHandle'],
    VisibilityTimeout=120  # Grant 2 more minutes
)
Best practices for visibility timeout:
- Monitor the ApproximateAgeOfOldestMessage CloudWatch metric to detect stuck messages

| Processing Time | Recommended Timeout | Strategy |
|---|---|---|
| < 30 sec | 30-60 sec | Default, no extension needed |
| 30 sec - 5 min | 5-10 min | Buffer for retries |
| 5 min - 1 hour | Initial 5 min + heartbeat | Extend every 3 min |
| > 1 hour | Use Step Functions or different pattern | SQS may not be ideal |
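The heartbeat strategy for long-running work can be sketched as a background thread that keeps extending the visibility timeout until the caller signals completion. The interval and extension values here are illustrative:

```python
import threading


def start_visibility_heartbeat(sqs_client, queue_url, receipt_handle,
                               extend_seconds=300, interval_seconds=180):
    """Periodically extend a message's visibility while work is in progress.

    Returns a threading.Event; set it when processing finishes (or fails)
    to stop the heartbeat.
    """
    stop = threading.Event()

    def _beat():
        # Wait for one interval, then extend; repeat until stopped.
        while not stop.wait(interval_seconds):
            sqs_client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extend_seconds,
            )

    threading.Thread(target=_beat, daemon=True).start()
    return stop
```

The initial visibility timeout covers the first interval, so the thread waits before its first extension; stopping the heartbeat on failure lets the message reappear promptly for retry.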
Messages that repeatedly fail processing can create "poison pill" scenarios—endlessly cycling through receive-fail loops, consuming resources without progress. SQS Dead Letter Queues (DLQ) isolate these problematic messages for investigation.
Redrive policy configuration:
A redrive policy specifies when messages should move to the DLQ:
import json

# Create the dead letter queue first
dlq_response = sqs.create_queue(QueueName='my-queue-dlq')
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_response['QueueUrl'],
    AttributeNames=['QueueArn']
)['Attributes']['QueueArn']

# Configure main queue with redrive policy
sqs.set_queue_attributes(
    QueueUrl=main_queue_url,
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_arn,
            'maxReceiveCount': 3  # After 3 failed attempts
        })
    }
)
Dead Letter Queue Flow

+----------------------------------------------------------+
| Main Queue: orders-queue                                  |
| RedrivePolicy: maxReceiveCount = 3                        |
+----------------------------------------------------------+
        ↓ Message received
+----------------------------------------------------------+
| Receive #1: Consumer fails, message returns to queue      |
+----------------------------------------------------------+
        ↓ Message received
+----------------------------------------------------------+
| Receive #2: Consumer fails again, message returns         |
+----------------------------------------------------------+
        ↓ Message received
+----------------------------------------------------------+
| Receive #3: Consumer fails again                          |
| ApproximateReceiveCount = 3 (equals maxReceiveCount)      |
+----------------------------------------------------------+
        ↓ Moved to DLQ
+----------------------------------------------------------+
| DLQ: orders-queue-dlq                                     |
| Message stored with original attributes + receive count   |
| Awaits manual inspection/redrive                          |
+----------------------------------------------------------+

Redrive to source queue:
After fixing the issue causing failures, you can redrive messages back to the main queue:
# Start redrive from DLQ back to source
sqs.start_message_move_task(
    SourceArn=dlq_arn,
    DestinationArn=main_queue_arn,
    MaxNumberOfMessagesPerSecond=100  # Throttle redrive
)
DLQ best practices:

- Alarm on DLQ depth (ApproximateNumberOfMessagesVisible > 0) so failures are noticed quickly
- Set the DLQ's retention period to the maximum (14 days) to allow time for investigation
- Choose a maxReceiveCount high enough (typically 3-5) that transient errors don't dead-letter messages prematurely
FIFO queues can only use FIFO dead letter queues (ending in .fifo). Messages in FIFO DLQs retain their Message Group ID. Be aware that redriving maintains ordering within groups, which may not match original chronological order across groups.
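A common first step when investigating is a small script that peeks at dead-lettered messages without deleting them. A sketch, written against an injected client; the receive is non-destructive because the messages simply reappear after their visibility timeout:

```python
def inspect_dlq(sqs_client, dlq_url, max_messages=10):
    """Peek at dead-lettered messages without deleting them."""
    response = sqs_client.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=max_messages,
        AttributeNames=['ApproximateReceiveCount'],  # how often it failed
        WaitTimeSeconds=5,
    )
    return [
        {
            'body': msg['Body'],
            'receive_count': int(msg['Attributes']['ApproximateReceiveCount']),
        }
        for msg in response.get('Messages', [])
    ]
```

The receive count tells you how many processing attempts the message survived before being moved, which often distinguishes a poison-pill payload from a transient outage.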
How consumers poll for messages significantly impacts both cost and latency. SQS supports two polling modes with different trade-offs.
Short polling (default):
With short polling, ReceiveMessage queries only a subset of SQS servers and returns immediately—even with an empty response. This can result in:

- Empty responses that still count as billable requests
- Messages missed on a given poll because they reside on unsampled servers (subsequent polls return them)
- Tight polling loops that waste consumer CPU and inflate costs
Long polling (recommended):
Long polling waits (up to 20 seconds) for messages to arrive before returning. This eliminates empty responses and reduces costs:
# Enable long polling per request
response = sqs.receive_message(
    QueueUrl=queue_url,
    WaitTimeSeconds=20,  # Wait up to 20 sec
    MaxNumberOfMessages=10
)

# Or set as queue default
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={'ReceiveMessageWaitTimeSeconds': '20'}
)
Cost impact example:
Scenario: 1000 messages per minute, processing time 100ms
Short polling:
- Consumer polls every 100 ms = 600 requests/minute
- 36,000 requests/hour, most of them empty
- ~26 million requests/month ≈ $10.40/month in requests alone

Long polling (20 sec wait):
- 1,000 messages = 100 batched requests/minute (10 per batch)
- 6,000 requests/hour, none empty
- ~4.3 million requests/month ≈ $1.75/month

Savings: ~83% reduction in API costs
There's almost never a reason to use short polling. Long polling reduces costs, improves latency (messages returned immediately when available), and eliminates empty responses. Set WaitTimeSeconds to 20 as your default.
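Putting long polling, batching, and the delete-on-success discipline together, a consumer loop might look like this sketch. The client and handler are injected; max_polls exists only to make the loop finite for testing:

```python
def drain_queue(sqs_client, queue_url, handler, max_polls=None):
    """Long-poll a queue, process batches, and delete only on success.

    Messages whose handler raises are left undeleted, so they reappear
    after the visibility timeout and are retried (or dead-lettered).
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # batch up to 10 messages
            WaitTimeSeconds=20,      # long polling: no empty spins
        )
        entries = []
        for msg in response.get('Messages', []):
            try:
                handler(msg['Body'])
                entries.append({
                    'Id': msg['MessageId'],
                    'ReceiptHandle': msg['ReceiptHandle'],
                })
            except Exception:
                pass  # leave in flight; SQS will redeliver
        if entries:
            sqs_client.delete_message_batch(
                QueueUrl=queue_url, Entries=entries)
```

Deleting in a batch halves the request count versus per-message deletes, and deleting only successful messages is what makes the at-least-once guarantee useful.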
SQS's native integration with other AWS services enables powerful serverless architectures without custom glue code.
Lambda event source mapping:
AWS Lambda can automatically poll SQS and invoke functions for each message batch:
# SAM template
Resources:
  ProcessingFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Events:
        SQSEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt MyQueue.Arn
            BatchSize: 10
            MaximumBatchingWindowSeconds: 5
Lambda handles the undifferentiated work for you:

- Polling the queue (no consumer fleet to run)
- Scaling concurrent executions up and down with queue depth
- Deleting messages automatically when the function succeeds
- Returning failed messages to the queue for retry—and eventually to the DLQ
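With the ReportBatchItemFailures response type enabled on the event source mapping, the function can return just the failed records so only those are retried rather than the whole batch. A sketch, where process_record is a hypothetical stand-in for real business logic:

```python
import json


def process_record(body):
    # Hypothetical business logic: parse the JSON payload.
    return json.loads(body)


def handler(event, context):
    """SQS-triggered Lambda reporting partial batch failures.

    Requires FunctionResponseTypes: ["ReportBatchItemFailures"] on the
    event source mapping; without it, one failure retries the batch.
    """
    failures = []
    for record in event['Records']:
        try:
            process_record(record['body'])
        except Exception:
            # Only this record returns to the queue for retry.
            failures.append({'itemIdentifier': record['messageId']})
    return {'batchItemFailures': failures}
```

Returning an empty batchItemFailures list tells Lambda the whole batch succeeded and every message can be deleted.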
SNS → SQS fan-out:
Combining SNS (pub-sub) with SQS (queuing) enables fan-out patterns:
Publisher → SNS Topic → SQS Queue 1 (Service A)
                      → SQS Queue 2 (Service B)
                      → SQS Queue 3 (Service C)
Each queue buffers messages independently—if Service B is slow, it doesn't affect A or C.
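Wiring a queue into this fan-out takes two steps: a queue policy that lets the topic send to it, then an SNS subscription. A sketch with illustrative ARNs, written as a function so the clients are injected:

```python
import json


def fan_out_subscribe(sns_client, sqs_client, topic_arn, queue_arn, queue_url):
    """Attach one SQS queue to an SNS topic for fan-out (a sketch)."""
    # The queue must explicitly allow SNS to send messages to it.
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={'Policy': json.dumps({
            'Version': '2012-10-17',
            'Statement': [{
                'Effect': 'Allow',
                'Principal': {'Service': 'sns.amazonaws.com'},
                'Action': 'sqs:SendMessage',
                'Resource': queue_arn,
                'Condition': {'ArnEquals': {'aws:SourceArn': topic_arn}},
            }],
        })},
    )
    # RawMessageDelivery strips the SNS JSON envelope from each message.
    sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol='sqs',
        Endpoint=queue_arn,
        Attributes={'RawMessageDelivery': 'true'},
    )
```

Enabling RawMessageDelivery means consumers receive the original payload rather than having to unwrap the SNS notification envelope.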
| Pattern | Services | Use Case |
|---|---|---|
| Lambda consumer | SQS → Lambda | Serverless message processing |
| Fan-out | SNS → SQS | One event, multiple processors |
| API buffering | API Gateway → SQS → Lambda | Absorb traffic spikes |
| Step Functions | SQS → Step Functions | Complex workflows with queuing |
| EventBridge | EventBridge → SQS | Event routing with buffering |
| S3 notifications | S3 → SQS → Lambda | Process uploaded files |
API Gateway → SQS (direct integration):
For extreme resilience, API Gateway can write directly to SQS without Lambda:
# API Gateway sends directly to SQS
integration:
  type: AWS
  uri: arn:aws:apigateway:us-east-1:sqs:path/123456789/my-queue
  httpMethod: POST
  requestParameters:
    integration.request.header.Content-Type: "'application/x-www-form-urlencoded'"
  requestTemplates:
    application/json: "Action=SendMessage&MessageBody=$input.body"
This pattern ensures messages are durably stored even if downstream processing is unavailable—the queue absorbs the load until consumers catch up.
For private connectivity without internet egress, use VPC Endpoints for SQS. This keeps traffic within AWS's network, improves security, and can reduce data transfer costs.
SQS excels in specific scenarios but isn't the right choice for every messaging need.
Ideal use cases:

- Decoupling microservices with simple point-to-point queues
- Background job processing (emails, image resizing, report generation)
- Buffering bursts of traffic in front of slower downstream systems
- Serverless pipelines where Lambda consumes the queue
Where SQS falls short:

- No message replay—once deleted, a message is gone (streaming systems like Kafka retain history)
- No routing—producers target a single queue (pair with SNS or EventBridge for routing)
- Higher latency than self-managed brokers (tens of milliseconds rather than single digits)
- AWS-only—not a fit for multi-cloud or on-premises deployments
| Feature | SQS | Kafka | RabbitMQ |
|---|---|---|---|
| Operations | Fully managed | Self-managed | Self-managed |
| Throughput | Very high | Highest | High |
| Message replay | No | Yes | No |
| Ordering | FIFO queues only | Per-partition | Per-queue |
| Routing | None | Topic-based | Flexible exchanges |
| Latency | ~20-100ms | ~5-10ms | ~1-10ms |
| Scaling | Automatic | Manual partitions | Manual cluster |
Choose SQS when: (1) You're building on AWS and want minimal ops, (2) You need simple point-to-point queuing, (3) Message replay isn't required, (4) Standard at-least-once or FIFO exactly-once semantics suffice, (5) Lambda integration is valuable. Consider alternatives for streaming, complex routing, or multi-cloud deployments.
AWS SQS represents the philosophy that infrastructure should disappear—no servers, no capacity planning, no maintenance. Just queues that work.
You now understand AWS SQS's managed queue service model and its trade-offs. Next, we'll explore NATS—a lightweight, cloud-native messaging system designed for simplicity and performance in modern microservices architectures.