Understanding CDC concepts and use cases is necessary but not sufficient for production systems. Architecture decisions—how you structure pipelines, handle failures, scale throughput, and manage operations—determine whether your CDC implementation is robust or fragile.
This page synthesizes architectural patterns proven in production CDC deployments. We'll cover topologies for different scales, patterns for handling the inevitable failures, strategies for monitoring and operations, and guidance for making sound architectural decisions.
By the end of this page, you will understand the architectural patterns for building production CDC systems, be able to design pipelines that handle failures gracefully, implement monitoring and alerting strategies, and make informed decisions about scaling and topology based on your requirements.
The way you structure your CDC pipeline significantly impacts scalability, manageability, and failure domains. Let's examine the common topologies:
The Hub and Spoke Pattern:
All source databases connect to a central streaming platform (the "hub"), and consumers read from the hub.
```
                     ┌─────────────┐
   ┌────────────────▶│    Kafka    │──────────────┐
   │                 │    (Hub)    │              │
   │   ┌────────────▶│             │─────┐        │
   │   │             └─────────────┘     │        │
   │   │                                 │        │
┌──┴───┴────┐                       ┌────▼────┐   │
│ Debezium  │                       │Consumer │   │
│ Connectors│                       │  Apps   │   │
└───────────┘                       └─────────┘   │
   ▲   ▲                                          │
   │   │                                          │
┌──┴───┴────┐                           ┌─────────▼┐
│ PostgreSQL│                           │Analytics │
│ MySQL     │                           └──────────┘
│ MongoDB   │
└───────────┘
```
Advantages:
- A single platform to secure, monitor, and scale, instead of many point-to-point links between sources and targets.
- Producers and consumers are decoupled: new consumers subscribe to existing topics without touching the source databases.
- Kafka retains change events, so consumers can replay history or catch up after downtime.

Disadvantages:
- The hub is a critical dependency; a Kafka or Kafka Connect outage stalls every pipeline that flows through it.
- Operating Kafka and a Connect cluster carries its own operational complexity.
- The extra hop adds some end-to-end latency compared to a direct source-to-target connection.

Best For:
- Organizations with several source databases and multiple downstream consumers that need the same change streams, and teams prepared to operate (or buy) a streaming platform.
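One operational consequence of this topology is that adding a new source never touches existing consumers: you register another Debezium connector with the central Kafka Connect cluster and its change topics simply appear on the hub. A minimal sketch, assuming a Connect worker at http://connect:8083 and hypothetical database details (hostnames, credentials, and table names are placeholders):

```typescript
// Sketch: register a new Debezium PostgreSQL source with the central Connect cluster.
async function registerSource(): Promise<void> {
  const response = await fetch('http://connect:8083/connectors', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      name: 'inventory-postgres-connector',
      config: {
        'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
        'database.hostname': 'inventory-db.internal',          // placeholder
        'database.port': '5432',
        'database.user': 'cdc_user',
        'database.password': process.env.CDC_DB_PASSWORD ?? '',
        'database.dbname': 'inventory',
        'topic.prefix': 'inventory',                            // Debezium 2.x: topics become inventory.<schema>.<table>
        'plugin.name': 'pgoutput',
        'table.include.list': 'public.products,public.stock_levels'
      }
    })
  });

  if (!response.ok) {
    throw new Error(`Failed to register connector: ${response.status} ${await response.text()}`);
  }
}
```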
CDC pipelines must scale with data growth. Unlike application scaling, CDC scaling has unique constraints because transaction logs must be read sequentially. Here's how to scale each component:
| Dimension | Challenge | Solution |
|---|---|---|
| More tables | Single connector handles all tables | Multiple connectors, each with table subset |
| Higher throughput | Log reading is sequential | Parallel consumers on partitioned topics |
| More consumers | Shared change stream | Kafka topics with consumer groups |
| Larger transactions | Memory exhaustion | Tune queue sizes, streaming mode |
| More databases | Connector proliferation | Centralized Kafka Connect cluster |
Scaling the Connector (Ingestion):
Debezium connectors read logs sequentially—you can't parallelize a single connector's log reading. But you can:
- Split high-throughput tables onto dedicated connectors, as in the configs below
- Increase snapshot.max.threads to parallelize the initial snapshot load
- Tune max.batch.size and max.queue.size to match each connector's throughput

```
// High-throughput table gets dedicated connector
{
  "name": "orders-connector",
  "config": {
    "table.include.list": "public.orders,public.order_items",
    "max.batch.size": 4096,
    "max.queue.size": 32768
  }
}

// Lower-throughput tables share a connector
{
  "name": "reference-data-connector",
  "config": {
    "table.include.list": "public.products,public.categories,public.users",
    "max.batch.size": 1024
  }
}
```
Scaling Consumers (Processing):
Consumer scaling leverages Kafka's partitioning:
```typescript
// Producer partitions by entity ID
producer.send({
  topic: 'cdc.orders',
  key: event.after.order_id,    // Partition key
  value: JSON.stringify(event)
});

// Consumers in same group divide partitions
// 12 partitions with 4 consumers = 3 partitions each
consumer.subscribe({ topic: 'cdc.orders', fromBeginning: true });

// Each consumer processes its partition subset in parallel
// Order guaranteed within partition (same order_id → same partition)
```
Partitioning Strategy:
| Strategy | Description | Use When |
|---|---|---|
| By record key | Hash of primary key | General purpose |
| By customer | Hash of customer_id | Customer-centric processing |
| By table | Route tables to partitions | Table-specific consumers |
| Round-robin | No ordering guarantee | Maximum parallelism, order doesn't matter |
Kafka guarantees order only within a partition. If you partition by order_id, all changes to order 12345 stay ordered. But changes across different orders may be processed out of order. If you need strict global ordering (rare), you need a single partition—which limits throughput to one consumer.
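To make that guarantee concrete, here is a simplified sketch of keyed partition assignment. Kafka's default partitioner hashes the record key (murmur2 in the Java client) and takes it modulo the partition count; the toy hash below only illustrates why the same key always lands on the same partition:

```typescript
// Conceptual sketch of keyed partition assignment (not Kafka's actual hash).
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) | 0;  // simple deterministic string hash
  }
  return Math.abs(hash) % numPartitions;
}

// Every change event for order 12345 maps to the same partition,
// so one consumer sees that order's changes in commit order.
const p1 = partitionFor('12345', 12);
const p2 = partitionFor('12345', 12);
console.log(p1 === p2); // true
```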
CDC pipelines will fail. Networks partition, databases restart, consumers crash. Robust architectures anticipate failures and recover gracefully.
{ "name": "resilient-connector", "config": { // Resume behavior on failure "errors.tolerance": "all", // Continue on errors (vs "none") "errors.log.enable": "true", // Log errors for investigation "errors.log.include.messages": "true", // Include event data in logs // Dead letter queue for failed events "errors.deadletterqueue.topic.name": "cdc-dlq", "errors.deadletterqueue.context.headers.enable": "true", // Offset management "offset.flush.interval.ms": "1000", // Commit offsets frequently "offset.flush.timeout.ms": "5000", // Timeout for offset commits // Connection retry "connect.poll.interval": "1000", "connect.max.retries": "10", // Heartbeat for detecting hung connections "heartbeat.interval.ms": "10000", // Signal table for controlled operations "signal.data.collection": "public.debezium_signal" }}Achieving exactly-once semantics in CDC pipelines is critical for data integrity. A payment processed twice or an inventory deduction missed causes real business harm.
The Challenge:
With at-least-once delivery (the default), duplicates occur when:
- A consumer processes an event but crashes before committing its offset, so the event is redelivered and processed again after restart.
- A Kafka Connect worker restarts after publishing events but before flushing its source offsets, so the connector re-reads part of the transaction log.
- A consumer group rebalance reassigns partitions while some events have been processed but not yet committed.

The sketch below shows where that window sits in a typical consumer loop.
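A minimal sketch, assuming a kafkajs consumer with manual offset commits; applyChange, the broker address, and the group id are placeholders:

```typescript
import { Kafka } from 'kafkajs';

declare function applyChange(event: unknown): Promise<void>;  // placeholder side effect

const kafka = new Kafka({ brokers: ['kafka:9092'] });          // placeholder broker
const consumer = kafka.consumer({ groupId: 'orders-sync' });

await consumer.connect();
await consumer.subscribe({ topic: 'cdc.orders' });
await consumer.run({
  autoCommit: false,
  eachMessage: async ({ topic, partition, message }) => {
    await applyChange(JSON.parse(message.value!.toString()));  // 1. side effect is applied
    // A crash here means the offset below is never committed,
    // so the same event is redelivered and applied again on restart.
    await consumer.commitOffsets([
      { topic, partition, offset: (Number(message.offset) + 1).toString() }
    ]);                                                        // 2. offset committed afterwards
  }
});
```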
The Solution: Idempotent Consumers + Transactional Semantics
```typescript
class IdempotentCDCConsumer {
  constructor(
    private db: Database,
    private redis: Redis,        // For deduplication
    private kafka: KafkaConsumer
  ) {}

  async processEvent(event: CDCEvent): Promise<void> {
    // Create unique idempotency key from source position
    const idempotencyKey = this.createIdempotencyKey(event);

    // Check if already processed (cheap Redis lookup)
    const alreadyProcessed = await this.redis.get(idempotencyKey);
    if (alreadyProcessed) {
      console.log(`Skipping duplicate event: ${idempotencyKey}`);
      metrics.increment('cdc.duplicates.skipped');
      return;
    }

    // Process with database transaction
    await this.db.transaction(async (tx) => {
      // Apply the change
      await this.applyChange(tx, event);

      // Record processing in same transaction
      await tx.execute(`
        INSERT INTO processed_events (event_key, processed_at)
        VALUES ($1, NOW())
        ON CONFLICT (event_key) DO NOTHING
      `, [idempotencyKey]);
    });

    // Mark as processed in Redis (with TTL for cleanup)
    await this.redis.set(idempotencyKey, '1', 'EX', 86400 * 7);  // 7 days
  }

  private createIdempotencyKey(event: CDCEvent): string {
    // Use source position as globally unique key
    // Format: connector-name:table:lsn or txId:position
    return `${event.source.name}:${event.source.table}:${event.source.lsn}`;
  }

  private async applyChange(tx: Transaction, event: CDCEvent): Promise<void> {
    // Use upsert semantics for natural idempotency
    switch (event.op) {
      case 'c':
      case 'u':
      case 'r':
        await tx.execute(`
          INSERT INTO orders (id, status, amount, updated_at)
          VALUES ($1, $2, $3, $4)
          ON CONFLICT (id) DO UPDATE SET
            status = EXCLUDED.status,
            amount = EXCLUDED.amount,
            updated_at = EXCLUDED.updated_at
          WHERE orders.updated_at < EXCLUDED.updated_at
        `, [
          event.after.id,
          event.after.status,
          event.after.amount,
          new Date(event.source.ts_ms)
        ]);
        break;

      case 'd':
        await tx.execute(`
          DELETE FROM orders WHERE id = $1
        `, [event.before.id]);
        break;
    }
  }
}
```

CDC pipelines require comprehensive monitoring. Unlike request-response systems where failures are immediately visible, CDC failures can silently accumulate stale data until a user notices inconsistencies.
| Metric | Description | Alert Threshold |
|---|---|---|
| cdc.lag.seconds | Time between source commit and Kafka publish | > 10 seconds |
| cdc.consumer.lag.messages | Messages waiting in Kafka for consumers | > 10,000 messages |
| cdc.connector.status | Connector running/paused/failed | != RUNNING |
| cdc.replication.slot.lag.bytes | Bytes the slot lags behind the source log | > 1 GB |
| cdc.dlq.messages.count | Dead letter queue size | > 0 |
| cdc.events.processed.rate | Events per second flowing through the pipeline | Drops below 50% of baseline |
| cdc.errors.rate | Errors per second | > 1/minute |
| cdc.snapshot.progress | Snapshot completion percentage | Stalled > 1 hour |
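The first metric can be derived directly from the Debezium envelope, which carries both the commit timestamp at the source (source.ts_ms) and the time the connector processed the event (the top-level ts_ms). A sketch with a minimal MetricsClient interface (the monitoring class later on this page assumes a fuller client); the end-to-end metric name is illustrative:

```typescript
interface MetricsClient {
  gauge(name: string, value: number): void;
}

interface DebeziumEnvelope {
  ts_ms: number;               // when the connector processed the event
  source: { ts_ms: number };   // when the change was committed at the source
}

function recordLagMetrics(event: DebeziumEnvelope, metrics: MetricsClient): void {
  // Source commit -> connector publish: approximates cdc.lag.seconds from the table above
  metrics.gauge('cdc.lag.seconds', (event.ts_ms - event.source.ts_ms) / 1000);

  // Source commit -> this consumer: end-to-end staleness as seen downstream
  metrics.gauge('cdc.end_to_end.lag.seconds', (Date.now() - event.source.ts_ms) / 1000);
}
```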
```typescript
class CDCMonitor {
  constructor(
    private metrics: MetricsClient,
    private alertManager: AlertManager,
    private kafkaConnect: KafkaConnectClient,  // Kafka Connect REST client
    private kafka: Kafka,                      // Kafka admin client factory
    private db: Database                       // source database (for slot checks)
  ) {}

  async monitorPipeline(): Promise<void> {
    // Monitor every 30 seconds
    setInterval(async () => {
      await Promise.all([
        this.checkConnectorHealth(),
        this.checkConsumerLag(),
        this.checkReplicationSlot(),
        this.checkDLQ()
      ]);
    }, 30000);
  }

  private async checkConnectorHealth(): Promise<void> {
    const connectors = await this.kafkaConnect.getConnectors();

    for (const connector of connectors) {
      const status = await this.kafkaConnect
        .getConnectorStatus(connector.name);

      this.metrics.gauge(
        'cdc.connector.status',
        status.state === 'RUNNING' ? 1 : 0,
        { connector: connector.name }
      );

      if (status.state !== 'RUNNING') {
        await this.alertManager.fire({
          severity: 'critical',
          summary: `CDC Connector ${connector.name} is ${status.state}`,
          description: `Connector stopped: ${status.trace || 'Unknown error'}`
        });
      }

      // Check individual tasks
      for (const task of status.tasks) {
        if (task.state !== 'RUNNING') {
          this.metrics.increment('cdc.task.failures', {
            connector: connector.name,
            taskId: task.id
          });
        }
      }
    }
  }

  private async checkConsumerLag(): Promise<void> {
    const consumerGroups = await this.kafka.admin().describeGroups();

    for (const group of consumerGroups) {
      const offsets = await this.kafka.admin()
        .fetchOffsets({ groupId: group.groupId });

      for (const topicOffset of offsets) {
        const endOffsets = await this.kafka.admin()
          .fetchTopicOffsets(topicOffset.topic);

        const lag = endOffsets.reduce((sum, partition) => {
          const consumed = topicOffset.partitions
            .find(p => p.partition === partition.partition)?.offset || '0';
          return sum + (parseInt(partition.high) - parseInt(consumed));
        }, 0);

        this.metrics.gauge('cdc.consumer.lag.messages', lag, {
          topic: topicOffset.topic,
          consumerGroup: group.groupId
        });

        if (lag > 10000) {
          await this.alertManager.fire({
            severity: 'warning',
            summary: `High consumer lag for ${group.groupId}`,
            description: `Lag: ${lag} messages on ${topicOffset.topic}`
          });
        }
      }
    }
  }

  private async checkReplicationSlot(): Promise<void> {
    // PostgreSQL specific
    const slots = await this.db.query(`
      SELECT slot_name,
             pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as lag_bytes
      FROM pg_replication_slots
      WHERE slot_type = 'logical'
    `);

    for (const slot of slots) {
      this.metrics.gauge('cdc.replication.slot.lag.bytes', slot.lag_bytes, {
        slot: slot.slot_name
      });

      // 1GB lag is critical
      if (slot.lag_bytes > 1073741824) {
        await this.alertManager.fire({
          severity: 'critical',
          summary: `Replication slot ${slot.slot_name} critically behind`,
          description: `Lag: ${(slot.lag_bytes / 1e9).toFixed(2)} GB. Risk of WAL overflow.`
        });
      }
    }
  }
}
```

Even with robust CDC pipelines, data drift can occur. Schema changes, bugs, or missed events can cause source and target to diverge. Periodic reconciliation verifies data consistency and detects issues before they become customer-visible.
```typescript
class CDCReconciler {
  async reconcileTable(
    sourceDb: Database,
    targetDb: Database,
    table: string,
    options: ReconcileOptions = {}
  ): Promise<ReconciliationResult> {
    const {
      sampleRate = 0.01,        // Check 1% of rows
      batchSize = 1000,
      compareColumns = ['*']    // Or specific columns
    } = options;

    const result: ReconciliationResult = {
      table,
      sampledRows: 0,
      matchingRows: 0,
      missingInTarget: 0,
      extraInTarget: 0,
      differentValues: 0,
      discrepancies: []
    };

    // Get row count for sampling
    const [{ count }] = await sourceDb.query(
      `SELECT COUNT(*) FROM ${table}`
    );
    const sampleSize = Math.ceil(count * sampleRate);

    // Sample random rows from source
    const sourceRows = await sourceDb.query(`
      SELECT * FROM ${table}
      ORDER BY RANDOM()
      LIMIT ${sampleSize}
    `);

    // Check each sampled row against target
    for (const batch of this.chunkArray(sourceRows, batchSize)) {
      const ids = batch.map(r => r.id);
      const targetRows = await targetDb.query(`
        SELECT * FROM ${table} WHERE id = ANY($1)
      `, [ids]);

      const targetMap = new Map(targetRows.map(r => [r.id, r]));

      for (const sourceRow of batch) {
        result.sampledRows++;
        const targetRow = targetMap.get(sourceRow.id);

        if (!targetRow) {
          result.missingInTarget++;
          result.discrepancies.push({
            type: 'missing',
            id: sourceRow.id,
            source: sourceRow,
            target: null
          });
        } else if (!this.rowsEqual(sourceRow, targetRow, compareColumns)) {
          result.differentValues++;
          result.discrepancies.push({
            type: 'different',
            id: sourceRow.id,
            source: sourceRow,
            target: targetRow,
            differences: this.findDifferences(sourceRow, targetRow)
          });
        } else {
          result.matchingRows++;
        }
      }
    }

    // Log and alert
    const matchRate = result.matchingRows / result.sampledRows;
    if (matchRate < 0.99) {  // Less than 99% match
      await this.alertManager.fire({
        severity: 'warning',
        summary: `CDC reconciliation found discrepancies in ${table}`,
        description: `Match rate: ${(matchRate * 100).toFixed(2)}%. ` +
          `Missing: ${result.missingInTarget}, Different: ${result.differentValues}`
      });
    }

    return result;
  }

  // Repair function to fix discrepancies
  async repairDiscrepancies(
    result: ReconciliationResult,
    targetDb: Database
  ): Promise<void> {
    for (const discrepancy of result.discrepancies) {
      if (discrepancy.type === 'missing' || discrepancy.type === 'different') {
        // Upsert correct value from source
        await targetDb.upsert(result.table, discrepancy.source);
      }
    }

    console.log(`Repaired ${result.discrepancies.length} discrepancies`);
  }
}
```

Production CDC requires documented procedures for common operational scenarios. Here are runbooks for critical situations:
Runbook: CDC Connector Failed
Symptoms: Alert: connector status != RUNNING, no new events in Kafka
Diagnosis Steps:
Check connector status:
```bash
curl -s http://connect:8083/connectors/<name>/status | jq
```
Check connector logs:
```bash
kubectl logs -l app=kafka-connect --tail=100 | grep ERROR
```
Common causes and resolutions:
| Cause | Fix |
|---|---|
| DB unreachable | Verify connectivity, credentials |
| Slot deleted | Recreate slot, may need re-snapshot |
| Schema break | Fix schema or update connector config |
| OOM | Increase connector memory, reduce batch size |
Recovery:
```bash
# Restart connector
curl -X POST http://connect:8083/connectors/<name>/restart

# If tasks specifically failed
curl -X POST http://connect:8083/connectors/<name>/tasks/0/restart
```
Post-Incident:
- Confirm the connector and all tasks report RUNNING and that consumer lag drains back to baseline.
- Inspect the dead letter queue for events that failed while the connector was down.
- Run a targeted reconciliation on the tables the connector captures to catch any missed or duplicated changes (a sketch follows this list).
- Document the root cause and update this runbook if the failure mode was new.
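A minimal sketch of that reconciliation step, assuming a CDCReconciler instance (reconciler) plus sourceDb and targetDb handles are already wired up; the table names are placeholders for whatever the failed connector captures:

```typescript
// Hypothetical post-incident check: the tables behind the failed connector
const affectedTables = ['orders', 'order_items'];

for (const table of affectedTables) {
  const result = await reconciler.reconcileTable(sourceDb, targetDb, table, {
    sampleRate: 0.05  // sample more heavily than a routine run
  });

  if (result.missingInTarget > 0 || result.differentValues > 0) {
    await reconciler.repairDiscrepancies(result, targetDb);
  }
}
```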
We've covered the architectural patterns essential for production CDC systems. Here are the key takeaways:
- Topology shapes everything: a hub-and-spoke design around Kafka decouples sources from consumers and keeps failure domains manageable.
- Scale ingestion by splitting tables across connectors; scale processing with partitioned topics and consumer groups. Your partition key determines your ordering guarantee.
- Design for failure: error tolerance, dead letter queues, and idempotent consumers turn at-least-once delivery into exactly-once results.
- Monitor lag, connector health, replication slots, and the DLQ; CDC failures are silent until someone notices stale data.
- Reconcile source and target periodically, and keep runbooks ready for the failures you will eventually hit.
Module Complete:
You've now completed the comprehensive coverage of Change Data Capture, from the fundamentals and use cases through the production architecture patterns on this page.
This knowledge positions you to design, implement, and operate CDC pipelines that form the backbone of real-time data infrastructure.
Congratulations! You've mastered Change Data Capture—from fundamentals through production architecture. You now have the knowledge to implement reliable CDC pipelines that enable real-time data synchronization, event-driven architectures, and streaming analytics. Apply these patterns to build systems that treat data changes as first-class events.