Serverless Patterns - Learning Module

Loading content...

0/273

Data Processing Pipelines

Serverless Data Pipelines

Data is the lifeblood of modern applications. Whether ingesting clickstream events, transforming logs for analysis, aggregating IoT sensor readings, or synchronizing data between systems, data processing pipelines are essential infrastructure components. Serverless computing has revolutionized how we build these pipelines.

Traditional data pipelines required provisioning Hadoop clusters, managing Spark installations, or maintaining dedicated ETL servers. Serverless pipelines eliminate this operational burden. Data flows through managed services—Kinesis, Lambda, Firehose, S3—with automatic scaling, pay-per-use pricing, and minimal maintenance.

This page provides a comprehensive guide to building serverless data processing pipelines. We'll cover streaming versus batch processing, common pipeline patterns, transformation strategies, error handling, and integration with analytics and machine learning services.

What You Will Learn

By the end of this page, you will understand: (1) The difference between streaming and batch processing in serverless, (2) How to design ingestion, transformation, and delivery stages, (3) Common serverless data pipeline patterns, (4) Error handling and retry strategies for data processing, (5) Integration with analytics services like Athena and Redshift, and (6) Best practices for production data pipelines.

Streaming vs Batch Processing

Data pipelines generally fall into two categories: streaming (real-time) and batch (periodic). Serverless computing supports both, with different services optimized for each.

Streaming (Real-Time) Processing:

Data is processed as it arrives, with latency measured in seconds or milliseconds. Use cases:

Real-time dashboards and monitoring
Fraud detection and alerting
Live recommendations and personalization
IoT sensor processing and control systems

Serverless streaming services:

Kinesis Data Streams + Lambda: Process records in near-real-time
Kinesis Data Firehose: Direct delivery to S3, Redshift, or Elasticsearch
DynamoDB Streams + Lambda: React to database changes
EventBridge: Route and filter events in real-time

Converting Mermaid diagram...

Batch Processing:

Data is collected over a period and processed in bulk at scheduled intervals. Use cases:

Daily/weekly reports and analytics
Machine learning model training
Large-scale data migrations
Historical data analysis and backfills

Serverless batch services:

S3 + Lambda (triggered by object creation)
EventBridge Scheduler + Lambda: Time-triggered batch jobs
Step Functions: Orchestrated multi-step batch workflows
AWS Glue: Managed ETL for large datasets
Athena: Serverless SQL queries on S3 data

Streaming vs Batch Processing
Aspect	Streaming	Batch
Latency	Seconds to milliseconds	Minutes to hours
Data volume per operation	Single records or small batches	Large datasets
Processing pattern	Continuous, event-driven	Periodic, scheduled
State management	Complex (windowing, aggregation)	Simpler (full dataset access)
Cost model	Per invocation/record	Per invocation/duration
Failure impact	Affects individual records	May require full reprocessing
Use case fit	Time-sensitive, reactive	Analytical, comprehensive

Lambda Streaming Architecture

Lambda's event source mappings with Kinesis or DynamoDB Streams provide a hybrid model: the platform batches records (up to 10,000 or a time window) before invoking your function. This gives you streaming semantics with batched efficiency. Tune batch size and window based on latency vs. efficiency trade-offs.

Pipeline Architecture

Well-designed data pipelines follow a consistent architectural pattern with distinct stages: Ingestion, Transformation, Storage, and Consumption. Each stage has specific responsibilities and design considerations.

Stage 1: Ingestion

Collecting data from various sources and feeding it into the pipeline:

Direct writes: Applications write to Kinesis, SQS, or S3
API endpoints: API Gateway + Lambda receives data via HTTP
Connectors: Pre-built integrations (Firehose agents, IoT rules)
Change data capture: DynamoDB Streams, database CDC

Ingestion must handle:

High throughput and burst traffic
Data validation at the boundary
Buffering during downstream outages
Deduplication of retried writes

Ingestion Endpoint with Validation
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
import { APIGatewayProxyHandlerV2 } from "aws-lambda";
import { KinesisClient, PutRecordCommand } from "@aws-sdk/client-kinesis";
import { z } from "zod";
 
const kinesis = new KinesisClient({});
const STREAM_NAME = process.env.KINESIS_STREAM!;
 
// Define event schema
const EventSchema = z.object({
  eventType: z.enum(["page_view", "click", "purchase", "signup"]),
  userId: z.string().uuid(),
  timestamp: z.string().datetime(),
  properties: z.record(z.unknown()).optional(),
  sessionId: z.string().optional()
});
 
type Event = z.infer<typeof EventSchema>;
 
export const handler: APIGatewayProxyHandlerV2 = async (event) => {
  try {
    // Parse and validate incoming event
    const body = JSON.parse(event.body || "{}");
    const validatedEvent = EventSchema.parse(body);
    
    // Enrich with ingestion metadata
    const enrichedEvent = {
      ...validatedEvent,
      ingestedAt: new Date().toISOString(),
      sourceIp: event.requestContext.http.sourceIp,
      userAgent: event.headers["user-agent"]
    };
    
    // Write to Kinesis (partition by userId for ordering)
    await kinesis.send(new PutRecordCommand({
      StreamName: STREAM_NAME,
      Data: Buffer.from(JSON.stringify(enrichedEvent)),
      PartitionKey: validatedEvent.userId
    }));
    
    return {
      statusCode: 202,
      body: JSON.stringify({ accepted: true })
    };
  } catch (error) {
    if (error instanceof z.ZodError) {
      return {
        statusCode: 400,
        body: JSON.stringify({ 
          error: "Validation failed", 
          details: error.errors 
        })
      };
    }
    throw error;
  }
};

Stage 2: Transformation

Processing raw data into usable formats:

Parsing: Extract structured data from logs, JSON, XML
Enrichment: Add contextual data (geo-lookup, user profiles)
Aggregation: Compute metrics, counters, averages
Filtering: Remove irrelevant or invalid records
Normalization: Standardize formats, units, encodings

Stage 3: Storage

Persisting processed data for consumption:

S3: Cost-effective, durable storage for analytics
DynamoDB: Low-latency access for operational data
Redshift: Analytical queries on large datasets
Elasticsearch: Full-text search and log analysis
Timestream: Time-series data for IoT and metrics

Stage 4: Consumption

Accessing processed data:

Athena: Serverless SQL queries on S3
QuickSight: Business intelligence dashboards
API endpoints: Serve processed data to applications
ML services: Feed models with training data

Converting Mermaid diagram...

The Medallion Architecture

A popular pattern organizes data into Bronze (raw), Silver (cleaned/validated), and Gold (aggregated/business-ready) tiers. Each tier has different SLAs, access patterns, and retention policies. Serverless pipelines naturally support this pattern with separate Lambda functions for each transformation stage.

Kinesis Data Streams Processing

Kinesis Data Streams is AWS's core streaming service, and Lambda's integration with it enables powerful real-time processing. Understanding the integration's configuration options is key to building efficient pipelines.

Event Source Mapping Configuration:

Lambda polls Kinesis and invokes your function with batches of records. Key settings:

Batch Size (1-10,000): Maximum records per invocation
Batch Window (0-300 seconds): Maximum wait time to fill batch
Starting Position: TRIM_HORIZON (oldest) or LATEST (newest)
Parallelization Factor (1-10): Concurrent Lambda invocations per shard
On-Failure Destination: SNS, SQS, or Lambda for failed batches

Kinesis Stream Processor
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
import { KinesisStreamHandler, KinesisStreamRecord } from "aws-lambda";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
 
const s3 = new S3Client({});
const BUCKET = process.env.OUTPUT_BUCKET!;
 
interface ProcessedEvent {
  eventType: string;
  userId: string;
  timestamp: string;
  sessionId?: string;
  // Enriched fields
  processedAt: string;
  hourBucket: string;
  isNewUser: boolean;
}
 
export const handler: KinesisStreamHandler = async (event) => {
  console.log(`Processing ${event.Records.length} records`);
  
  const processedEvents: ProcessedEvent[] = [];
  const failures: Array<{ record: KinesisStreamRecord; error: string }> = [];
  
  for (const record of event.Records) {
    try {
      // Decode and parse record
      const payload = Buffer.from(record.kinesis.data, "base64").toString();
      const data = JSON.parse(payload);
      
      // Transform and enrich
      const processed = await transformEvent(data);
      processedEvents.push(processed);
      
    } catch (error) {
      console.error(`Failed to process record ${record.kinesis.sequenceNumber}`, error);
      failures.push({ 
        record, 
        error: (error as Error).message 
      });
    }
  }
  
  // Write batch to S3 (partitioned by hour)
  if (processedEvents.length > 0) {
    await writeToS3(processedEvents);
  }
  
  // Report failures (partial batch failure)
  if (failures.length > 0) {
    console.error(`${failures.length} records failed processing`);
    // Return failed records for retry (requires reportBatchItemFailures: true)
    return {
      batchItemFailures: failures.map(f => ({
        itemIdentifier: f.record.kinesis.sequenceNumber
      }))
    };
  }
  
  console.log(`Successfully processed ${processedEvents.length} events`);
};
 
async function transformEvent(raw: any): Promise<ProcessedEvent> {
  const timestamp = new Date(raw.timestamp);
  
  return {
    eventType: raw.eventType,
    userId: raw.userId,
    timestamp: raw.timestamp,
    sessionId: raw.sessionId,
    processedAt: new Date().toISOString(),
    // Partition key for S3/Athena queries
    hourBucket: timestamp.toISOString().slice(0, 13).replace("T", "-"),
    // Enrichment: check if user is new (simplified)
    isNewUser: await checkIfNewUser(raw.userId)
  };
}
 
async function writeToS3(events: ProcessedEvent[]): Promise<void> {
  // Group by hour for partitioned storage
  const byHour = events.reduce((acc, event) => {
    const hour = event.hourBucket;
    if (!acc[hour]) acc[hour] = [];
    acc[hour].push(event);
    return acc;
  }, {} as Record<string, ProcessedEvent[]>);
  
  // Write each partition
  for (const [hour, hourEvents] of Object.entries(byHour)) {
    const key = `processed/dt=${hour}/batch-${Date.now()}.json`;
    const body = hourEvents.map(e => JSON.stringify(e)).join("\n");
    
    await s3.send(new PutObjectCommand({
      Bucket: BUCKET,
      Key: key,
      Body: body,
      ContentType: "application/x-ndjson"
    }));
  }
}

Partial Batch Failure Handling:

By default, if your Lambda throws an error, the entire batch is retried—including records that succeeded. This can cause duplicate processing. Enable ReportBatchItemFailures to return specific failed records:

// Return only failed sequence numbers
return {
  batchItemFailures: [
    { itemIdentifier: "123456" },
    { itemIdentifier: "123457" }
  ]
};

Only failed records are retried; successful ones are checkpointed.

Kinesis Lambda Configuration Recommendations
Setting	Low Latency	High Throughput	Cost Optimized
Batch Size	1-10	1,000-10,000	10,000
Batch Window	0s	30-60s	60-300s
Parallelization Factor	10	1-5	1
Memory	512MB-1GB	1-3GB	512MB
Timeout	30s	5m	15m

Watch for Iterator Age

Kinesis iterator age metric shows how far behind your processor is from the stream head. If iterator age grows continuously, your Lambda can't keep up with ingestion rate. Solutions: increase parallelization factor, add shards, optimize Lambda code, or increase memory (which also increases CPU).

S3-Triggered Processing

S3 event notifications trigger Lambda functions when objects are created, modified, or deleted. This pattern is fundamental to many data pipelines—files land in S3, triggering processing workflows.

Common S3 Processing Patterns:

Image/Media Processing: Resize images, generate thumbnails, transcode video
Log Processing: Parse and index logs as they arrive
Data Import: Process uploaded CSVs/JSON files
ETL Triggers: Kick off larger pipelines when data files arrive
Archive Processing: Validate and index archived data

S3-Triggered File Processor
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
import { S3Handler, S3Event } from "aws-lambda";
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import { createReadStream } from "stream";
import { pipeline } from "stream/promises";
import * as readline from "readline";
 
const s3 = new S3Client({});
 
interface LogEntry {
  timestamp: string;
  level: string;
  message: string;
  service?: string;
  traceId?: string;
}
 
export const handler: S3Handler = async (event: S3Event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    
    console.log(`Processing s3://${bucket}/${key}`);
    
    // Skip if not a log file
    if (!key.endsWith(".log") && !key.endsWith(".json")) {
      console.log(`Skipping non-log file: ${key}`);
      continue;
    }
    
    try {
      // Get the object
      const response = await s3.send(new GetObjectCommand({
        Bucket: bucket,
        Key: key
      }));
      
      // Process line by line (memory efficient for large files)
      const processedEntries = await processLogFile(response.Body as NodeJS.ReadableStream);
      
      // Write processed output
      const outputKey = key
        .replace("raw-logs/", "processed-logs/")
        .replace(".log", ".json");
      
      await s3.send(new PutObjectCommand({
        Bucket: process.env.OUTPUT_BUCKET!,
        Key: outputKey,
        Body: JSON.stringify(processedEntries, null, 2),
        ContentType: "application/json"
      }));
      
      console.log(`Processed ${processedEntries.length} entries to ${outputKey}`);
      
    } catch (error) {
      console.error(`Failed to process ${key}:`, error);
      throw error;
    }
  }
};
 
async function processLogFile(stream: NodeJS.ReadableStream): Promise<LogEntry[]> {
  const entries: LogEntry[] = [];
  
  const rl = readline.createInterface({
    input: stream,
    crlfDelay: Infinity
  });
  
  for await (const line of rl) {
    try {
      // Parse log line (example: structured JSON logs)
      const entry = JSON.parse(line) as LogEntry;
      
      // Enrich and validate
      if (isValidEntry(entry)) {
        entries.push({
          ...entry,
          timestamp: normalizeTimestamp(entry.timestamp)
        });
      }
    } catch {
      // Skip malformed lines, log for debugging
      console.warn(`Malformed log line: ${line.substring(0, 100)}`);
    }
  }
  
  return entries;
}
 
function isValidEntry(entry: any): entry is LogEntry {
  return entry && typeof entry.timestamp === "string" && typeof entry.message === "string";
}
 
function normalizeTimestamp(ts: string): string {
  return new Date(ts).toISOString();
}

Handling Large Files:

Lambda's /tmp storage is limited to 512MB-10GB (configurable). For files larger than memory, use streaming techniques:

Stream processing: Read and process line-by-line without loading entire file
S3 Select: Query specific data from large files without downloading
Step Functions with distributed Map: Split large files across many Lambda invocations
Fargate: For truly large files, kick off a Fargate task

Best Practices for S3 Triggers:

Use prefixes to scope triggers: Only trigger on specific folders
Avoid recursive triggers: Don't write to the same prefix you're reading from
Configure SQS dead letter: Failed events have a path to recovery
Use S3 batch operations: For reprocessing existing files
Enable versioning: Protect against accidental overwrites

S3 Select for Large Files
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import { S3Client, SelectObjectContentCommand } from "@aws-sdk/client-s3";
 
const s3 = new S3Client({});
 
// Query specific records from a large CSV without downloading entire file
async function queryLargeFile(bucket: string, key: string): Promise<any[]> {
  const response = await s3.send(new SelectObjectContentCommand({
    Bucket: bucket,
    Key: key,
    ExpressionType: "SQL",
    Expression: `
      SELECT s.user_id, s.event_type, s.timestamp
      FROM s3object s
      WHERE s.event_type = 'purchase'
        AND CAST(s.amount AS DECIMAL) > 100
    `,
    InputSerialization: {
      CSV: {
        FileHeaderInfo: "USE",
        RecordDelimiter: "\n",
        FieldDelimiter: ","
      }
    },
    OutputSerialization: {
      JSON: { RecordDelimiter: "\n" }
    }
  }));
  
  // Collect results from stream
  const results: any[] = [];
  for await (const event of response.Payload || []) {
    if (event.Records?.Payload) {
      const chunk = Buffer.from(event.Records.Payload).toString();
      const lines = chunk.split("\n").filter(l => l.trim());
      results.push(...lines.map(l => JSON.parse(l)));
    }
  }
  
  return results;
}

SQS as Buffer for S3 Events

Instead of triggering Lambda directly from S3, route events through SQS first. S3 → SQS → Lambda provides built-in buffering, dead letter queues, and message visibility timeout. This improves resilience and simplifies failure handling.

Serverless ETL Patterns

ETL (Extract, Transform, Load) pipelines move data between systems while reshaping it. Serverless ETL combines Lambda functions, Step Functions, and managed services like Glue for powerful data transformation.

Pattern 1: Lambda-Native ETL

For moderate data volumes (<500MB per file, <15 minute processing), Lambda can handle the entire ETL process:

Lambda ETL: Customer Data Sync
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
import { ScheduledHandler } from "aws-lambda";
import { DynamoDBClient, BatchWriteItemCommand } from "@aws-sdk/client-dynamodb";
import { marshall } from "@aws-sdk/util-dynamodb";
 
const dynamodb = new DynamoDBClient({});
 
interface ExternalCustomer {
  id: string;
  first_name: string;
  last_name: string;
  email_address: string;
  phone_number: string;
  created_at: string;
  loyalty_points: number;
}
 
interface InternalCustomer {
  customerId: string;
  fullName: string;
  email: string;
  phone: string;
  createdAt: string;
  loyaltyTier: "bronze" | "silver" | "gold" | "platinum";
  updatedAt: string;
}
 
export const handler: ScheduledHandler = async () => {
  // EXTRACT: Fetch data from external API
  const externalData = await extractFromExternalAPI();
  console.log(`Extracted ${externalData.length} customers`);
  
  // TRANSFORM: Convert to internal format
  const transformedData = externalData.map(transformCustomer);
  console.log(`Transformed ${transformedData.length} customers`);
  
  // LOAD: Write to DynamoDB in batches
  await loadToDynamoDB(transformedData);
  console.log(`Loaded ${transformedData.length} customers to DynamoDB`);
};
 
async function extractFromExternalAPI(): Promise<ExternalCustomer[]> {
  // Fetch from external CRM API
  const response = await fetch(`${process.env.CRM_API_URL}/customers`, {
    headers: {
      "Authorization": `Bearer ${process.env.CRM_API_KEY}`
    }
  });
  
  if (!response.ok) {
    throw new Error(`CRM API error: ${response.status}`);
  }
  
  return response.json();
}
 
function transformCustomer(external: ExternalCustomer): InternalCustomer {
  return {
    customerId: external.id,
    fullName: `${external.first_name} ${external.last_name}`.trim(),
    email: external.email_address.toLowerCase(),
    phone: normalizePhone(external.phone_number),
    createdAt: new Date(external.created_at).toISOString(),
    loyaltyTier: calculateTier(external.loyalty_points),
    updatedAt: new Date().toISOString()
  };
}
 
function calculateTier(points: number): InternalCustomer["loyaltyTier"] {
  if (points >= 10000) return "platinum";
  if (points >= 5000) return "gold";
  if (points >= 1000) return "silver";
  return "bronze";
}
 
async function loadToDynamoDB(customers: InternalCustomer[]): Promise<void> {
  const TABLE = process.env.CUSTOMERS_TABLE!;
  
  // Batch write in chunks of 25 (DynamoDB limit)
  for (let i = 0; i < customers.length; i += 25) {
    const batch = customers.slice(i, i + 25);
    
    await dynamodb.send(new BatchWriteItemCommand({
      RequestItems: {
        [TABLE]: batch.map(customer => ({
          PutRequest: {
            Item: marshall(customer)
          }
        }))
      }
    }));
  }
}

Pattern 2: AWS Glue for Large-Scale ETL

For larger datasets (gigabytes to terabytes), AWS Glue provides managed Spark-based ETL:

Glue Crawlers: Automatically discover schemas in S3
Glue Jobs: Serverless Spark for heavy transformations
Glue Data Catalog: Central metadata repository
Glue Studio: Visual ETL job designer

Pattern 3: Step Functions Orchestrated ETL

For complex multi-stage ETL with error handling and human approval:

Converting Mermaid diagram...

Pattern 4: Kinesis Firehose for Continuous ETL

Firehose provides near-zero operations streaming ETL:

Data sources write to Firehose delivery stream
Optional Lambda transformation inline
Automatic batching, compression, and delivery to S3/Redshift/Elasticsearch
Built-in retry and error handling

Firehose is ideal when:

Transformation is simple (filtering, format conversion)
Destination is S3, Redshift, Elasticsearch, or Splunk
You want minimal operational overhead

ETL Pattern Selection Guide
Pattern	Data Size	Complexity	Use Case
Lambda-only	<500MB	Simple transforms	API sync, format conversion
Kinesis Firehose	Any (streaming)	Light transforms	Log delivery, metrics ingestion
Step Functions	Variable	Complex workflows	Multi-step with approvals
AWS Glue	GB-TB	Complex joins/aggregations	Data warehouse loading
EMR Serverless	TB+	Custom Spark/Presto	ML pipelines, complex analytics

Idempotency in ETL

ETL jobs may be retried on failure. Design for idempotency: use upserts instead of inserts, use atomic operations, and include timestamps to identify stale data. This ensures that running the same ETL twice produces correct results.

Analytics Integration

Serverless data pipelines ultimately serve analytics and business intelligence. AWS provides several serverless analytics services that integrate seamlessly with Lambda-processed data.

Amazon Athena: Serverless SQL on S3

Athena enables SQL queries directly on S3 data without loading it into a database:

Serverless: No infrastructure to manage
Pay per query: Charged by data scanned
Standard SQL: Familiar syntax, supports complex analytics
Integration: Works with BI tools, QuickSight, Jupyter

Athena Query from Lambda
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
import { AthenaClient, StartQueryExecutionCommand, GetQueryExecutionCommand, GetQueryResultsCommand } from "@aws-sdk/client-athena";
 
const athena = new AthenaClient({});
const DATABASE = "analytics";
const OUTPUT_BUCKET = process.env.ATHENA_OUTPUT_BUCKET!;
 
export async function runAthenaQuery(query: string): Promise<any[]> {
  // Start query execution
  const startResult = await athena.send(new StartQueryExecutionCommand({
    QueryString: query,
    QueryExecutionContext: { Database: DATABASE },
    ResultConfiguration: {
      OutputLocation: `s3://${OUTPUT_BUCKET}/athena-results/`
    }
  }));
  
  const executionId = startResult.QueryExecutionId!;
  
  // Wait for completion
  await waitForQuery(executionId);
  
  // Get results
  const results = await athena.send(new GetQueryResultsCommand({
    QueryExecutionId: executionId
  }));
  
  // Parse results into objects
  return parseAthenaResults(results);
}
 
async function waitForQuery(executionId: string): Promise<void> {
  const maxAttempts = 60;
  
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await athena.send(new GetQueryExecutionCommand({
      QueryExecutionId: executionId
    }));
    
    const state = status.QueryExecution?.Status?.State;
    
    if (state === "SUCCEEDED") return;
    if (state === "FAILED" || state === "CANCELLED") {
      throw new Error(`Query ${state}: ${status.QueryExecution?.Status?.StateChangeReason}`);
    }
    
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  
  throw new Error("Query timed out");
}
 
// Example: Generate daily summary
export async function getDailySummary(date: string): Promise<any> {
  const query = `
    SELECT 
      event_type,
      COUNT(*) as event_count,
      COUNT(DISTINCT user_id) as unique_users
    FROM events
    WHERE date(timestamp) = date('${date}')
    GROUP BY event_type
    ORDER BY event_count DESC
  `;
  
  return runAthenaQuery(query);
}

Optimizing Athena Queries:

Athena charges by data scanned, so optimization saves money:

Partitioning: Store data in partitioned folders (e.g., dt=2024-01-15/)
Columnar formats: Use Parquet or ORC instead of JSON/CSV (10-100x less data scanned)
Compression: Gzip, Snappy, or ZSTD reduce data size
Projection: SELECT only needed columns, never SELECT *
Bucketing: For frequently joined columns

Amazon Redshift Serverless:

For more demanding analytics workloads:

Automatic scaling based on workload
Pay for compute when queries run
Full data warehouse capabilities
ML integration with Redshift ML

Analytics Service Comparison
Service	Query Latency	Concurrency	Best For
Athena	Seconds-minutes	Moderate	Ad-hoc queries, data exploration
Redshift Serverless	Sub-second to seconds	High	Dashboards, BI tools, complex analytics
OpenSearch	Milliseconds	Very high	Log search, full-text, time-series
Timestream	Milliseconds	High	IoT metrics, time-series data
QuickSight	Sub-second (cached)	High	Business dashboards, embedded analytics

Partitioning is Critical

Proper partitioning can reduce Athena costs by 90%+. If you frequently query by date, partition by date. If you frequently query by region , partition by region within date.The output of your data pipeline should write data in a partitioned structure that matches common query patterns.

Error Handling & Reliability

Data pipelines must be resilient. Data is often irreplaceable, and processing failures can have cascading effects. Robust error handling ensures data isn't lost and processing can recover from failures.

Dead Letter Queues (DLQ) for Data Pipelines:

Every pipeline stage should have a DLQ for failed records:

Kinesis → Lambda: Configure on-failure destination (SQS, SNS, Lambda)
SQS → Lambda: Configure DLQ on the source queue
S3 → Lambda via SQS: Failed processing sends to DLQ

DLQs should contain enough context to diagnose and replay:

DLQ Message Structure
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
 
const sqs = new SQSClient({});
 
interface DLQMessage {
  originalEvent: any;
  processingAttempts: number;
  lastError: {
    name: string;
    message: string;
    stack?: string;
  };
  failedAt: string;
  sourceFunction: string;
  context: {
    requestId: string;
    traceId?: string;
  };
}
 
async function sendToDLQ(
  originalEvent: any,
  error: Error,
  context: { requestId: string; traceId?: string }
): Promise<void> {
  const dlqMessage: DLQMessage = {
    originalEvent,
    processingAttempts: 1,
    lastError: {
      name: error.name,
      message: error.message,
      stack: error.stack
    },
    failedAt: new Date().toISOString(),
    sourceFunction: process.env.AWS_LAMBDA_FUNCTION_NAME!,
    context
  };
  
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.DLQ_URL!,
    MessageBody: JSON.stringify(dlqMessage),
    MessageAttributes: {
      ErrorType: {
        DataType: "String",
        StringValue: error.name
      },
      OriginalTimestamp: {
        DataType: "String",
        StringValue: originalEvent.timestamp || new Date().toISOString()
      }
    }
  }));
}

Replay Strategies:

After fixing issues, you need to replay failed or missing data:

DLQ Replay: Process messages from DLQ after fixing bugs
S3 Reprocessing: Re-trigger S3 events or run backfill job
Kinesis Replay: Use iterator at TRIM_HORIZON to reprocess from start
Manual Backfill: Run targeted processing for specific time ranges

Data Validation and Alerting:

Monitor pipeline health with specific metrics:

Pipeline Monitoring Checklist

•Input vs Output counts — Compare records ingested to records processed. Large discrepancies indicate data loss.
•Processing latency — Track time from ingestion to final storage. Increasing latency may indicate capacity issues.
•Error rates by type — Categorize errors (validation, transform, load). Different error types need different fixes.
•DLQ depth — Any messages in DLQ require investigation. Alert immediately when depth > 0.
•Schema validation failures — Track records failing schema validation—may indicate upstream changes.
•Data freshness — For scheduled pipelines, verify data is as recent as expected.

Pipeline Metrics Emission
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";
 
const cloudwatch = new CloudWatchClient({});
 
interface PipelineMetrics {
  recordsReceived: number;
  recordsProcessed: number;
  recordsFailed: number;
  processingDurationMs: number;
  bytesProcessed: number;
}
 
async function emitPipelineMetrics(
  pipelineName: string, 
  metrics: PipelineMetrics
): Promise<void> {
  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: "DataPipelines",
    MetricData: [
      {
        MetricName: "RecordsReceived",
        Value: metrics.recordsReceived,
        Unit: "Count",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      },
      {
        MetricName: "RecordsProcessed",
        Value: metrics.recordsProcessed,
        Unit: "Count",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      },
      {
        MetricName: "RecordsFailed",
        Value: metrics.recordsFailed,
        Unit: "Count",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      },
      {
        MetricName: "ProcessingDuration",
        Value: metrics.processingDurationMs,
        Unit: "Milliseconds",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      },
      {
        MetricName: "SuccessRate",
        Value: metrics.recordsReceived > 0 
          ? (metrics.recordsProcessed / metrics.recordsReceived) * 100 
          : 100,
        Unit: "Percent",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      }
    ]
  }));
}

Data Loss is Unacceptable

In data pipelines, silent data loss is the worst outcome. Design for visibility: every record should either succeed, land in a DLQ, or generate an alert. Unknown states—where records disappear without trace—must be eliminated through comprehensive error handling and monitoring.

Summary: Data Processing Pipelines

Serverless data processing pipelines combine the flexibility of Lambda with the power of managed streaming and storage services. By understanding streaming versus batch patterns, pipeline architecture, and integration with analytics services, you can build data infrastructure that scales automatically and costs nothing when idle.

Let's consolidate the key takeaways:

Key Takeaways

•Choose streaming or batch by latency needs — Real-time dashboards need streaming; daily reports can use batch. Many systems combine both.
•Structure pipelines in stages — Ingestion, Transformation, Storage, and Consumption stages have distinct responsibilities and optimization strategies.
•Kinesis + Lambda is powerful for streaming — Tune batch size, parallelization, and error handling for optimal throughput and latency.
•S3 triggers enable event-driven batch processing — Files landing in S3 can kick off complex transformation workflows.
•Partition data for analytics — Athena costs are proportional to data scanned. Partitioning and columnar formats dramatically reduce costs.
•Dead letter queues are mandatory — Every pipeline must capture failed records for investigation and replay.
•Monitor input/output counts — Discrepancies indicate data loss. Alert on any pipeline health anomalies.

What's Next:

With data processing pipelines covered, we'll explore the final serverless pattern in this module: Real-Time File Processing. You'll learn how to build sophisticated file processing workflows—image manipulation, video transcoding, document processing—using serverless components.

Page Complete

You now have comprehensive knowledge of building serverless data processing pipelines. From streaming with Kinesis to batch with S3 triggers, from Lambda transformations to Athena analytics—these patterns enable you to build sophisticated data infrastructure that scales automatically and integrates with the broader AWS analytics ecosystem.

Data Processing Pipelines

Serverless Data Pipelines

What You Will Learn

Streaming vs Batch Processing

Data pipelines generally fall into two categories: streaming (real-time) and batch (periodic). Serverless computing supports both, with different services optimized for each.

Streaming (Real-Time) Processing:

Data is processed as it arrives, with latency measured in seconds or milliseconds. Use cases:

Real-time dashboards and monitoring
Fraud detection and alerting
Live recommendations and personalization
IoT sensor processing and control systems

Serverless streaming services:

Kinesis Data Streams + Lambda: Process records in near-real-time
Kinesis Data Firehose: Direct delivery to S3, Redshift, or Elasticsearch
DynamoDB Streams + Lambda: React to database changes
EventBridge: Route and filter events in real-time

Converting Mermaid diagram...

Batch Processing:

Data is collected over a period and processed in bulk at scheduled intervals. Use cases:

Daily/weekly reports and analytics
Machine learning model training
Large-scale data migrations
Historical data analysis and backfills

Serverless batch services:

S3 + Lambda (triggered by object creation)
EventBridge Scheduler + Lambda: Time-triggered batch jobs
Step Functions: Orchestrated multi-step batch workflows
AWS Glue: Managed ETL for large datasets
Athena: Serverless SQL queries on S3 data

Streaming vs Batch Processing
Aspect	Streaming	Batch
Latency	Seconds to milliseconds	Minutes to hours
Data volume per operation	Single records or small batches	Large datasets
Processing pattern	Continuous, event-driven	Periodic, scheduled
State management	Complex (windowing, aggregation)	Simpler (full dataset access)
Cost model	Per invocation/record	Per invocation/duration
Failure impact	Affects individual records	May require full reprocessing
Use case fit	Time-sensitive, reactive	Analytical, comprehensive

Lambda Streaming Architecture

Pipeline Architecture

Stage 1: Ingestion

Collecting data from various sources and feeding it into the pipeline:

Direct writes: Applications write to Kinesis, SQS, or S3
API endpoints: API Gateway + Lambda receives data via HTTP
Connectors: Pre-built integrations (Firehose agents, IoT rules)
Change data capture: DynamoDB Streams, database CDC

Ingestion must handle:

High throughput and burst traffic
Data validation at the boundary
Buffering during downstream outages
Deduplication of retried writes

Ingestion Endpoint with Validation
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
import { APIGatewayProxyHandlerV2 } from "aws-lambda";
import { KinesisClient, PutRecordCommand } from "@aws-sdk/client-kinesis";
import { z } from "zod";
 
const kinesis = new KinesisClient({});
const STREAM_NAME = process.env.KINESIS_STREAM!;
 
// Define event schema
const EventSchema = z.object({
  eventType: z.enum(["page_view", "click", "purchase", "signup"]),
  userId: z.string().uuid(),
  timestamp: z.string().datetime(),
  properties: z.record(z.unknown()).optional(),
  sessionId: z.string().optional()
});
 
type Event = z.infer<typeof EventSchema>;
 
export const handler: APIGatewayProxyHandlerV2 = async (event) => {
  try {
    // Parse and validate incoming event
    const body = JSON.parse(event.body || "{}");
    const validatedEvent = EventSchema.parse(body);
    
    // Enrich with ingestion metadata
    const enrichedEvent = {
      ...validatedEvent,
      ingestedAt: new Date().toISOString(),
      sourceIp: event.requestContext.http.sourceIp,
      userAgent: event.headers["user-agent"]
    };
    
    // Write to Kinesis (partition by userId for ordering)
    await kinesis.send(new PutRecordCommand({
      StreamName: STREAM_NAME,
      Data: Buffer.from(JSON.stringify(enrichedEvent)),
      PartitionKey: validatedEvent.userId
    }));
    
    return {
      statusCode: 202,
      body: JSON.stringify({ accepted: true })
    };
  } catch (error) {
    if (error instanceof z.ZodError) {
      return {
        statusCode: 400,
        body: JSON.stringify({ 
          error: "Validation failed", 
          details: error.errors 
        })
      };
    }
    throw error;
  }
};

Stage 2: Transformation

Processing raw data into usable formats:

Parsing: Extract structured data from logs, JSON, XML
Enrichment: Add contextual data (geo-lookup, user profiles)
Aggregation: Compute metrics, counters, averages
Filtering: Remove irrelevant or invalid records
Normalization: Standardize formats, units, encodings

Stage 3: Storage

Persisting processed data for consumption:

S3: Cost-effective, durable storage for analytics
DynamoDB: Low-latency access for operational data
Redshift: Analytical queries on large datasets
Elasticsearch: Full-text search and log analysis
Timestream: Time-series data for IoT and metrics

Stage 4: Consumption

Accessing processed data:

Athena: Serverless SQL queries on S3
QuickSight: Business intelligence dashboards
API endpoints: Serve processed data to applications
ML services: Feed models with training data

Converting Mermaid diagram...

The Medallion Architecture

Kinesis Data Streams Processing

Event Source Mapping Configuration:

Lambda polls Kinesis and invokes your function with batches of records. Key settings:

Batch Size (1-10,000): Maximum records per invocation
Batch Window (0-300 seconds): Maximum wait time to fill batch
Starting Position: TRIM_HORIZON (oldest) or LATEST (newest)
Parallelization Factor (1-10): Concurrent Lambda invocations per shard
On-Failure Destination: SNS, SQS, or Lambda for failed batches

Kinesis Stream Processor
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
import { KinesisStreamHandler, KinesisStreamRecord } from "aws-lambda";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
 
const s3 = new S3Client({});
const BUCKET = process.env.OUTPUT_BUCKET!;
 
interface ProcessedEvent {
  eventType: string;
  userId: string;
  timestamp: string;
  sessionId?: string;
  // Enriched fields
  processedAt: string;
  hourBucket: string;
  isNewUser: boolean;
}
 
export const handler: KinesisStreamHandler = async (event) => {
  console.log(`Processing ${event.Records.length} records`);
  
  const processedEvents: ProcessedEvent[] = [];
  const failures: Array<{ record: KinesisStreamRecord; error: string }> = [];
  
  for (const record of event.Records) {
    try {
      // Decode and parse record
      const payload = Buffer.from(record.kinesis.data, "base64").toString();
      const data = JSON.parse(payload);
      
      // Transform and enrich
      const processed = await transformEvent(data);
      processedEvents.push(processed);
      
    } catch (error) {
      console.error(`Failed to process record ${record.kinesis.sequenceNumber}`, error);
      failures.push({ 
        record, 
        error: (error as Error).message 
      });
    }
  }
  
  // Write batch to S3 (partitioned by hour)
  if (processedEvents.length > 0) {
    await writeToS3(processedEvents);
  }
  
  // Report failures (partial batch failure)
  if (failures.length > 0) {
    console.error(`${failures.length} records failed processing`);
    // Return failed records for retry (requires reportBatchItemFailures: true)
    return {
      batchItemFailures: failures.map(f => ({
        itemIdentifier: f.record.kinesis.sequenceNumber
      }))
    };
  }
  
  console.log(`Successfully processed ${processedEvents.length} events`);
};
 
async function transformEvent(raw: any): Promise<ProcessedEvent> {
  const timestamp = new Date(raw.timestamp);
  
  return {
    eventType: raw.eventType,
    userId: raw.userId,
    timestamp: raw.timestamp,
    sessionId: raw.sessionId,
    processedAt: new Date().toISOString(),
    // Partition key for S3/Athena queries
    hourBucket: timestamp.toISOString().slice(0, 13).replace("T", "-"),
    // Enrichment: check if user is new (simplified)
    isNewUser: await checkIfNewUser(raw.userId)
  };
}
 
async function writeToS3(events: ProcessedEvent[]): Promise<void> {
  // Group by hour for partitioned storage
  const byHour = events.reduce((acc, event) => {
    const hour = event.hourBucket;
    if (!acc[hour]) acc[hour] = [];
    acc[hour].push(event);
    return acc;
  }, {} as Record<string, ProcessedEvent[]>);
  
  // Write each partition
  for (const [hour, hourEvents] of Object.entries(byHour)) {
    const key = `processed/dt=${hour}/batch-${Date.now()}.json`;
    const body = hourEvents.map(e => JSON.stringify(e)).join("\n");
    
    await s3.send(new PutObjectCommand({
      Bucket: BUCKET,
      Key: key,
      Body: body,
      ContentType: "application/x-ndjson"
    }));
  }
}

Partial Batch Failure Handling:

// Return only failed sequence numbers
return {
  batchItemFailures: [
    { itemIdentifier: "123456" },
    { itemIdentifier: "123457" }
  ]
};

Only failed records are retried; successful ones are checkpointed.

Kinesis Lambda Configuration Recommendations
Setting	Low Latency	High Throughput	Cost Optimized
Batch Size	1-10	1,000-10,000	10,000
Batch Window	0s	30-60s	60-300s
Parallelization Factor	10	1-5	1
Memory	512MB-1GB	1-3GB	512MB
Timeout	30s	5m	15m

Watch for Iterator Age

S3-Triggered Processing

S3 event notifications trigger Lambda functions when objects are created, modified, or deleted. This pattern is fundamental to many data pipelines—files land in S3, triggering processing workflows.

Common S3 Processing Patterns:

Image/Media Processing: Resize images, generate thumbnails, transcode video
Log Processing: Parse and index logs as they arrive
Data Import: Process uploaded CSVs/JSON files
ETL Triggers: Kick off larger pipelines when data files arrive
Archive Processing: Validate and index archived data

S3-Triggered File Processor
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
import { S3Handler, S3Event } from "aws-lambda";
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import { createReadStream } from "stream";
import { pipeline } from "stream/promises";
import * as readline from "readline";
 
const s3 = new S3Client({});
 
interface LogEntry {
  timestamp: string;
  level: string;
  message: string;
  service?: string;
  traceId?: string;
}
 
export const handler: S3Handler = async (event: S3Event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    
    console.log(`Processing s3://${bucket}/${key}`);
    
    // Skip if not a log file
    if (!key.endsWith(".log") && !key.endsWith(".json")) {
      console.log(`Skipping non-log file: ${key}`);
      continue;
    }
    
    try {
      // Get the object
      const response = await s3.send(new GetObjectCommand({
        Bucket: bucket,
        Key: key
      }));
      
      // Process line by line (memory efficient for large files)
      const processedEntries = await processLogFile(response.Body as NodeJS.ReadableStream);
      
      // Write processed output
      const outputKey = key
        .replace("raw-logs/", "processed-logs/")
        .replace(".log", ".json");
      
      await s3.send(new PutObjectCommand({
        Bucket: process.env.OUTPUT_BUCKET!,
        Key: outputKey,
        Body: JSON.stringify(processedEntries, null, 2),
        ContentType: "application/json"
      }));
      
      console.log(`Processed ${processedEntries.length} entries to ${outputKey}`);
      
    } catch (error) {
      console.error(`Failed to process ${key}:`, error);
      throw error;
    }
  }
};
 
async function processLogFile(stream: NodeJS.ReadableStream): Promise<LogEntry[]> {
  const entries: LogEntry[] = [];
  
  const rl = readline.createInterface({
    input: stream,
    crlfDelay: Infinity
  });
  
  for await (const line of rl) {
    try {
      // Parse log line (example: structured JSON logs)
      const entry = JSON.parse(line) as LogEntry;
      
      // Enrich and validate
      if (isValidEntry(entry)) {
        entries.push({
          ...entry,
          timestamp: normalizeTimestamp(entry.timestamp)
        });
      }
    } catch {
      // Skip malformed lines, log for debugging
      console.warn(`Malformed log line: ${line.substring(0, 100)}`);
    }
  }
  
  return entries;
}
 
function isValidEntry(entry: any): entry is LogEntry {
  return entry && typeof entry.timestamp === "string" && typeof entry.message === "string";
}
 
function normalizeTimestamp(ts: string): string {
  return new Date(ts).toISOString();
}

Handling Large Files:

Lambda's /tmp storage is limited to 512MB-10GB (configurable). For files larger than memory, use streaming techniques:

Stream processing: Read and process line-by-line without loading entire file
S3 Select: Query specific data from large files without downloading
Step Functions with distributed Map: Split large files across many Lambda invocations
Fargate: For truly large files, kick off a Fargate task

Best Practices for S3 Triggers:

Use prefixes to scope triggers: Only trigger on specific folders
Avoid recursive triggers: Don't write to the same prefix you're reading from
Configure SQS dead letter: Failed events have a path to recovery
Use S3 batch operations: For reprocessing existing files
Enable versioning: Protect against accidental overwrites

S3 Select for Large Files
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import { S3Client, SelectObjectContentCommand } from "@aws-sdk/client-s3";
 
const s3 = new S3Client({});
 
// Query specific records from a large CSV without downloading entire file
async function queryLargeFile(bucket: string, key: string): Promise<any[]> {
  const response = await s3.send(new SelectObjectContentCommand({
    Bucket: bucket,
    Key: key,
    ExpressionType: "SQL",
    Expression: `
      SELECT s.user_id, s.event_type, s.timestamp
      FROM s3object s
      WHERE s.event_type = 'purchase'
        AND CAST(s.amount AS DECIMAL) > 100
    `,
    InputSerialization: {
      CSV: {
        FileHeaderInfo: "USE",
        RecordDelimiter: "\n",
        FieldDelimiter: ","
      }
    },
    OutputSerialization: {
      JSON: { RecordDelimiter: "\n" }
    }
  }));
  
  // Collect results from stream
  const results: any[] = [];
  for await (const event of response.Payload || []) {
    if (event.Records?.Payload) {
      const chunk = Buffer.from(event.Records.Payload).toString();
      const lines = chunk.split("\n").filter(l => l.trim());
      results.push(...lines.map(l => JSON.parse(l)));
    }
  }
  
  return results;
}

SQS as Buffer for S3 Events

Serverless ETL Patterns

Pattern 1: Lambda-Native ETL

For moderate data volumes (<500MB per file, <15 minute processing), Lambda can handle the entire ETL process:

Lambda ETL: Customer Data Sync
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
import { ScheduledHandler } from "aws-lambda";
import { DynamoDBClient, BatchWriteItemCommand } from "@aws-sdk/client-dynamodb";
import { marshall } from "@aws-sdk/util-dynamodb";
 
const dynamodb = new DynamoDBClient({});
 
interface ExternalCustomer {
  id: string;
  first_name: string;
  last_name: string;
  email_address: string;
  phone_number: string;
  created_at: string;
  loyalty_points: number;
}
 
interface InternalCustomer {
  customerId: string;
  fullName: string;
  email: string;
  phone: string;
  createdAt: string;
  loyaltyTier: "bronze" | "silver" | "gold" | "platinum";
  updatedAt: string;
}
 
export const handler: ScheduledHandler = async () => {
  // EXTRACT: Fetch data from external API
  const externalData = await extractFromExternalAPI();
  console.log(`Extracted ${externalData.length} customers`);
  
  // TRANSFORM: Convert to internal format
  const transformedData = externalData.map(transformCustomer);
  console.log(`Transformed ${transformedData.length} customers`);
  
  // LOAD: Write to DynamoDB in batches
  await loadToDynamoDB(transformedData);
  console.log(`Loaded ${transformedData.length} customers to DynamoDB`);
};
 
async function extractFromExternalAPI(): Promise<ExternalCustomer[]> {
  // Fetch from external CRM API
  const response = await fetch(`${process.env.CRM_API_URL}/customers`, {
    headers: {
      "Authorization": `Bearer ${process.env.CRM_API_KEY}`
    }
  });
  
  if (!response.ok) {
    throw new Error(`CRM API error: ${response.status}`);
  }
  
  return response.json();
}
 
function transformCustomer(external: ExternalCustomer): InternalCustomer {
  return {
    customerId: external.id,
    fullName: `${external.first_name} ${external.last_name}`.trim(),
    email: external.email_address.toLowerCase(),
    phone: normalizePhone(external.phone_number),
    createdAt: new Date(external.created_at).toISOString(),
    loyaltyTier: calculateTier(external.loyalty_points),
    updatedAt: new Date().toISOString()
  };
}
 
function calculateTier(points: number): InternalCustomer["loyaltyTier"] {
  if (points >= 10000) return "platinum";
  if (points >= 5000) return "gold";
  if (points >= 1000) return "silver";
  return "bronze";
}
 
async function loadToDynamoDB(customers: InternalCustomer[]): Promise<void> {
  const TABLE = process.env.CUSTOMERS_TABLE!;
  
  // Batch write in chunks of 25 (DynamoDB limit)
  for (let i = 0; i < customers.length; i += 25) {
    const batch = customers.slice(i, i + 25);
    
    await dynamodb.send(new BatchWriteItemCommand({
      RequestItems: {
        [TABLE]: batch.map(customer => ({
          PutRequest: {
            Item: marshall(customer)
          }
        }))
      }
    }));
  }
}

Pattern 2: AWS Glue for Large-Scale ETL

For larger datasets (gigabytes to terabytes), AWS Glue provides managed Spark-based ETL:

Glue Crawlers: Automatically discover schemas in S3
Glue Jobs: Serverless Spark for heavy transformations
Glue Data Catalog: Central metadata repository
Glue Studio: Visual ETL job designer

Pattern 3: Step Functions Orchestrated ETL

For complex multi-stage ETL with error handling and human approval:

Converting Mermaid diagram...

Pattern 4: Kinesis Firehose for Continuous ETL

Firehose provides near-zero operations streaming ETL:

Data sources write to Firehose delivery stream
Optional Lambda transformation inline
Automatic batching, compression, and delivery to S3/Redshift/Elasticsearch
Built-in retry and error handling

Firehose is ideal when:

Transformation is simple (filtering, format conversion)
Destination is S3, Redshift, Elasticsearch, or Splunk
You want minimal operational overhead

ETL Pattern Selection Guide
Pattern	Data Size	Complexity	Use Case
Lambda-only	<500MB	Simple transforms	API sync, format conversion
Kinesis Firehose	Any (streaming)	Light transforms	Log delivery, metrics ingestion
Step Functions	Variable	Complex workflows	Multi-step with approvals
AWS Glue	GB-TB	Complex joins/aggregations	Data warehouse loading
EMR Serverless	TB+	Custom Spark/Presto	ML pipelines, complex analytics

Idempotency in ETL

Analytics Integration

Serverless data pipelines ultimately serve analytics and business intelligence. AWS provides several serverless analytics services that integrate seamlessly with Lambda-processed data.

Amazon Athena: Serverless SQL on S3

Athena enables SQL queries directly on S3 data without loading it into a database:

Serverless: No infrastructure to manage
Pay per query: Charged by data scanned
Standard SQL: Familiar syntax, supports complex analytics
Integration: Works with BI tools, QuickSight, Jupyter

Athena Query from Lambda
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
import { AthenaClient, StartQueryExecutionCommand, GetQueryExecutionCommand, GetQueryResultsCommand } from "@aws-sdk/client-athena";
 
const athena = new AthenaClient({});
const DATABASE = "analytics";
const OUTPUT_BUCKET = process.env.ATHENA_OUTPUT_BUCKET!;
 
export async function runAthenaQuery(query: string): Promise<any[]> {
  // Start query execution
  const startResult = await athena.send(new StartQueryExecutionCommand({
    QueryString: query,
    QueryExecutionContext: { Database: DATABASE },
    ResultConfiguration: {
      OutputLocation: `s3://${OUTPUT_BUCKET}/athena-results/`
    }
  }));
  
  const executionId = startResult.QueryExecutionId!;
  
  // Wait for completion
  await waitForQuery(executionId);
  
  // Get results
  const results = await athena.send(new GetQueryResultsCommand({
    QueryExecutionId: executionId
  }));
  
  // Parse results into objects
  return parseAthenaResults(results);
}
 
async function waitForQuery(executionId: string): Promise<void> {
  const maxAttempts = 60;
  
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await athena.send(new GetQueryExecutionCommand({
      QueryExecutionId: executionId
    }));
    
    const state = status.QueryExecution?.Status?.State;
    
    if (state === "SUCCEEDED") return;
    if (state === "FAILED" || state === "CANCELLED") {
      throw new Error(`Query ${state}: ${status.QueryExecution?.Status?.StateChangeReason}`);
    }
    
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  
  throw new Error("Query timed out");
}
 
// Example: Generate daily summary
export async function getDailySummary(date: string): Promise<any> {
  const query = `
    SELECT 
      event_type,
      COUNT(*) as event_count,
      COUNT(DISTINCT user_id) as unique_users
    FROM events
    WHERE date(timestamp) = date('${date}')
    GROUP BY event_type
    ORDER BY event_count DESC
  `;
  
  return runAthenaQuery(query);
}

Optimizing Athena Queries:

Athena charges by data scanned, so optimization saves money:

Partitioning: Store data in partitioned folders (e.g., dt=2024-01-15/)
Columnar formats: Use Parquet or ORC instead of JSON/CSV (10-100x less data scanned)
Compression: Gzip, Snappy, or ZSTD reduce data size
Projection: SELECT only needed columns, never SELECT *
Bucketing: For frequently joined columns

Amazon Redshift Serverless:

For more demanding analytics workloads:

Automatic scaling based on workload
Pay for compute when queries run
Full data warehouse capabilities
ML integration with Redshift ML

Analytics Service Comparison
Service	Query Latency	Concurrency	Best For
Athena	Seconds-minutes	Moderate	Ad-hoc queries, data exploration
Redshift Serverless	Sub-second to seconds	High	Dashboards, BI tools, complex analytics
OpenSearch	Milliseconds	Very high	Log search, full-text, time-series
Timestream	Milliseconds	High	IoT metrics, time-series data
QuickSight	Sub-second (cached)	High	Business dashboards, embedded analytics

Partitioning is Critical

Error Handling & Reliability

Dead Letter Queues (DLQ) for Data Pipelines:

Every pipeline stage should have a DLQ for failed records:

Kinesis → Lambda: Configure on-failure destination (SQS, SNS, Lambda)
SQS → Lambda: Configure DLQ on the source queue
S3 → Lambda via SQS: Failed processing sends to DLQ

DLQs should contain enough context to diagnose and replay:

DLQ Message Structure
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
 
const sqs = new SQSClient({});
 
interface DLQMessage {
  originalEvent: any;
  processingAttempts: number;
  lastError: {
    name: string;
    message: string;
    stack?: string;
  };
  failedAt: string;
  sourceFunction: string;
  context: {
    requestId: string;
    traceId?: string;
  };
}
 
async function sendToDLQ(
  originalEvent: any,
  error: Error,
  context: { requestId: string; traceId?: string }
): Promise<void> {
  const dlqMessage: DLQMessage = {
    originalEvent,
    processingAttempts: 1,
    lastError: {
      name: error.name,
      message: error.message,
      stack: error.stack
    },
    failedAt: new Date().toISOString(),
    sourceFunction: process.env.AWS_LAMBDA_FUNCTION_NAME!,
    context
  };
  
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.DLQ_URL!,
    MessageBody: JSON.stringify(dlqMessage),
    MessageAttributes: {
      ErrorType: {
        DataType: "String",
        StringValue: error.name
      },
      OriginalTimestamp: {
        DataType: "String",
        StringValue: originalEvent.timestamp || new Date().toISOString()
      }
    }
  }));
}

Replay Strategies:

After fixing issues, you need to replay failed or missing data:

DLQ Replay: Process messages from DLQ after fixing bugs
S3 Reprocessing: Re-trigger S3 events or run backfill job
Kinesis Replay: Use iterator at TRIM_HORIZON to reprocess from start
Manual Backfill: Run targeted processing for specific time ranges

Data Validation and Alerting:

Monitor pipeline health with specific metrics:

Pipeline Monitoring Checklist

•Input vs Output counts — Compare records ingested to records processed. Large discrepancies indicate data loss.
•Processing latency — Track time from ingestion to final storage. Increasing latency may indicate capacity issues.
•Error rates by type — Categorize errors (validation, transform, load). Different error types need different fixes.
•DLQ depth — Any messages in DLQ require investigation. Alert immediately when depth > 0.
•Schema validation failures — Track records failing schema validation—may indicate upstream changes.
•Data freshness — For scheduled pipelines, verify data is as recent as expected.

Pipeline Metrics Emission
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";
 
const cloudwatch = new CloudWatchClient({});
 
interface PipelineMetrics {
  recordsReceived: number;
  recordsProcessed: number;
  recordsFailed: number;
  processingDurationMs: number;
  bytesProcessed: number;
}
 
async function emitPipelineMetrics(
  pipelineName: string, 
  metrics: PipelineMetrics
): Promise<void> {
  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: "DataPipelines",
    MetricData: [
      {
        MetricName: "RecordsReceived",
        Value: metrics.recordsReceived,
        Unit: "Count",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      },
      {
        MetricName: "RecordsProcessed",
        Value: metrics.recordsProcessed,
        Unit: "Count",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      },
      {
        MetricName: "RecordsFailed",
        Value: metrics.recordsFailed,
        Unit: "Count",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      },
      {
        MetricName: "ProcessingDuration",
        Value: metrics.processingDurationMs,
        Unit: "Milliseconds",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      },
      {
        MetricName: "SuccessRate",
        Value: metrics.recordsReceived > 0 
          ? (metrics.recordsProcessed / metrics.recordsReceived) * 100 
          : 100,
        Unit: "Percent",
        Dimensions: [{ Name: "Pipeline", Value: pipelineName }]
      }
    ]
  }));
}

Data Loss is Unacceptable

Summary: Data Processing Pipelines

Let's consolidate the key takeaways:

Key Takeaways

•Choose streaming or batch by latency needs — Real-time dashboards need streaming; daily reports can use batch. Many systems combine both.
•Structure pipelines in stages — Ingestion, Transformation, Storage, and Consumption stages have distinct responsibilities and optimization strategies.
•Kinesis + Lambda is powerful for streaming — Tune batch size, parallelization, and error handling for optimal throughput and latency.
•S3 triggers enable event-driven batch processing — Files landing in S3 can kick off complex transformation workflows.
•Partition data for analytics — Athena costs are proportional to data scanned. Partitioning and columnar formats dramatically reduce costs.
•Dead letter queues are mandatory — Every pipeline must capture failed records for investigation and replay.
•Monitor input/output counts — Discrepancies indicate data loss. Alert on any pipeline health anomalies.

What's Next:

Page Complete