System Design (HLD)Change Data Capture

Change Data Capture (CDC)

LevelAdvanced

Duration75 mins

TopicChange Data Capture

3 / 5

Debezium and Other CDC Tools

The CDC Tooling Landscape

Understanding log-based CDC mechanics is valuable, but building production CDC pipelines from scratch would be impractical. A rich ecosystem of tools has emerged to handle the complexities of log reading, schema management, offset tracking, and delivery guarantees.

Debezium has become the de facto open-source standard for CDC, powering data pipelines at companies from startups to enterprises. But it's not the only option—cloud-native services, commercial platforms, and database-native features each have their place.

This page provides comprehensive coverage of CDC tooling: architecture, capabilities, configuration, and when to choose each option.

What You Will Learn

By the end of this page, you will understand Debezium's architecture and capabilities in depth, be able to configure Debezium for production use cases, and evaluate alternative CDC tools for specific requirements. You'll have practical knowledge to build reliable CDC pipelines.

Debezium: The Open-Source Standard

Debezium (pronounced "DEE-bee-zee-um") is an open-source distributed platform for change data capture. Originally developed at Red Hat, it's now a project under the Apache Software Foundation umbrella.

Core Philosophy:

Debezium treats databases as sources of truth for events. Rather than requiring applications to explicitly emit events, Debezium reads the database's transaction log and converts changes into Kafka events. Applications can remain simple CRUD operations while Debezium handles the event generation.

Converting Mermaid diagram...

Debezium Architecture Components:

Component	Purpose	Details
Kafka Connect	Execution framework	Manages connector lifecycle, offsets, parallelism
Source Connectors	Database-specific log readers	PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Db2, Cassandra, Vitess
Schema Registry	Schema management	Confluent or Apicurio Schema Registry for Avro/Protobuf
SMTs (Transforms)	Event transformation	Filter, route, enrich, mask events in-flight
Offset Storage	Progress tracking	Kafka internal topics or external stores

Supported Databases:

PostgreSQL: Logical decoding with pgoutput, decoderbufs, wal2json plugins
MySQL/MariaDB: Binlog reading with GTID support
MongoDB: Oplog and Change Streams
SQL Server: Transaction log reading (no native CDC required)
Oracle: LogMiner or XStream-based capture
Db2: ASN-based capture
Cassandra: Commit log-based capture
Vitess: VStream-based capture

Configuring Debezium for Production

Proper Debezium configuration is critical for reliability and performance. Let's walk through a production-grade configuration:

debezium-postgresql-connector.json
JSON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
{
  "name": "inventory-connector",
  "config": {
    // Connector identification
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.example.com",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${env:DB_PASSWORD}",
    "database.dbname": "inventory",
    
    // Logical naming (used in topic names)
    "topic.prefix": "inventory-db",
    
    // Replication configuration
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_publication",
    
    // What to capture
    "schema.include.list": "public,orders",
    "table.include.list": "public.orders,public.products,orders.line_items",
    
    // Critical: Full row data for updates/deletes
    "replica.identity": "full",
    
    // Snapshot configuration
    "snapshot.mode": "initial",
    "snapshot.isolation.mode": "repeatable_read",
    "snapshot.max.threads": 4,
    
    // Handling large transactions
    "max.batch.size": 2048,
    "max.queue.size": 8192,
    "poll.interval.ms": 500,
    
    // Heartbeat for low-traffic periods
    "heartbeat.interval.ms": 10000,
    "heartbeat.action.query": "INSERT INTO debezium.heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT (id) DO UPDATE SET ts = NOW()",
    
    // Schema handling
    "schema.history.internal.kafka.topic": "schema-history-inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    
    // Transforms (optional)
    "transforms": "route,unwrap",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "(.*)\\.(.*)",
    "transforms.route.replacement": "cdc.$1.$2",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false"
  }
}

Configuration Best Practices

Always use environment variables for credentials (${env:DB_PASSWORD}). 2. Set explicit table include lists rather than capturing everything. 3. Configure heartbeat queries for low-traffic tables. 4. Enable error logging and dead letter queues from day one. 5. Size max.queue.size based on expected transaction sizes.

Debezium Event Structure

Understanding Debezium's event structure is essential for building consumers. Events follow a rich, standardized format that carries full context about each change.

Debezium Event Anatomy
JSON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
{
  "schema": {
    // Kafka Connect schema (omitted for brevity when using Schema Registry)
  },
  "payload": {
    // Source metadata - WHERE did this change happen?
    "source": {
      "version": "2.4.0.Final",
      "connector": "postgresql",
      "name": "inventory-db",
      "ts_ms": 1704067200000,    // Source database timestamp
      "snapshot": "false",       // Was this from snapshot or streaming?
      "db": "inventory",
      "sequence": "[null,\"2F/ABC123\"]",
      "schema": "public",
      "table": "orders",
      "txId": 571,
      "lsn": 33227976,
      "xmin": null
    },
    
    // Before state - NULL for inserts
    "before": {
      "order_id": 12345,
      "status": "processing",
      "amount": 9999,
      "customer_id": 42,
      "updated_at": "2024-01-01T10:00:00Z"
    },
    
    // After state - NULL for deletes
    "after": {
      "order_id": 12345,
      "status": "shipped",
      "amount": 9999,
      "customer_id": 42,
      "updated_at": "2024-01-01T12:05:00Z"
    },
    
    // Operation type: c=create, u=update, d=delete, r=read (snapshot)
    "op": "u",
    
    // Debezium processing timestamp (not source DB time)
    "ts_ms": 1704067205000,
    
    // Transaction metadata (if enabled)
    "transaction": {
      "id": "571:29523420",
      "total_order": 1,
      "data_collection_order": 1
    }
  }
}

Event Fields Explained:

Field	Description	Use Case
`source.ts_ms`	When the change occurred in the database	Ordering, auditing
`source.snapshot`	Whether this came from initial snapshot	Consumer logic may differ
`source.lsn`	Log sequence number	Offset tracking, deduplication
`before`	Row state before change	Differential updates, auditing
`after`	Row state after change	Application of change
`op`	Operation type (c/u/d/r)	Consumer routing logic
`ts_ms` (payload)	Debezium processing time	Lag monitoring

Calculating CDC Lag:

const sourceTsMs = event.payload.source.ts_ms;  // DB commit time
const debeziumTsMs = event.payload.ts_ms;       // Debezium capture time
const now = Date.now();                          // Consumer receive time

const captureLag = debeziumTsMs - sourceTsMs;   // Log read delay
const deliveryLag = now - debeziumTsMs;         // Kafka + consumer delay
const totalLag = now - sourceTsMs;              // End-to-end delay

Unwrapping Events

The full event structure is verbose. For simpler consumers, use the ExtractNewRecordState SMT to flatten events to just the 'after' state. Use 'add.headers' to preserve operation type and source metadata as Kafka headers instead of nested fields.

Single Message Transforms (SMTs)

Debezium supports Single Message Transforms (SMTs)—lightweight transformations applied to each message before it reaches Kafka. SMTs enable filtering, routing, masking, and restructuring without custom code.

Common Debezium SMTs
Transform	Purpose	Example Use Case
ExtractNewRecordState	Flatten to just after-state	Simpler consumer messages
RegexRouter	Modify topic names with regex	Add prefixes, change routing
Filter	Drop messages matching predicate	Exclude test data, specific ops
MaskField	Redact sensitive field values	PII protection (SSN, credit card)
ReplaceField	Rename, include, exclude fields	Schema normalization
TimestampConverter	Convert timestamp formats	Epoch to ISO string conversion
ContentBasedRouter	Route based on content	Route by customer region
Outbox	Extract events from outbox table	Outbox pattern implementation

SMT Configuration Examples
JSON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
{
  "transforms": "filter,unwrap,mask,route",
  
  // 1. Filter out DELETE operations (keep only creates/updates)
  "transforms.filter.type": "io.debezium.transforms.Filter",
  "transforms.filter.language": "jsr223.groovy",
  "transforms.filter.condition": "value.op != 'd'",
  
  // 2. Unwrap to simple format with metadata as headers
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.unwrap.drop.tombstones": "false",
  "transforms.unwrap.delete.handling.mode": "rewrite",
  "transforms.unwrap.add.fields": "op,source.ts_ms,source.lsn",
  "transforms.unwrap.add.headers": "op,source.table",
  
  // 3. Mask sensitive fields
  "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.mask.fields": "ssn,credit_card_number",
  "transforms.mask.replacement": "***MASKED***",
  
  // 4. Route to different topics based on table
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "inventory-db\\.public\\.(.*)",
  "transforms.route.replacement": "events.$1"
  
  // Result: "inventory-db.public.orders" → "events.orders"
}

SMT Performance Considerations

SMTs execute synchronously on every message. Groovy-based predicates are slower than native transforms. Complex SMT chains add latency. For heavy transformations, consider a dedicated stream processing stage (Kafka Streams, Flink) after the CDC topic rather than overloading SMTs.

Debezium Deployment Patterns

Debezium can be deployed in several ways, each with distinct tradeoffs:

The Standard Production Deployment:

Multiple Kafka Connect workers form a cluster. Connectors are distributed across workers for fault tolerance.

Architecture:

┌─────────────────────────────────────────────────────────────┐
│                    Kafka Connect Cluster                     │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐         │
│  │   Worker 1   │ │   Worker 2   │ │   Worker 3   │         │
│  │  Connector A │ │  Connector B │ │  (Standby)   │         │
│  │  Tasks 1-3   │ │  Tasks 1-2   │ │              │         │
│  └──────────────┘ └──────────────┘ └──────────────┘         │
│                                                              │
│  Consensus via: config, offset, status topics               │
└─────────────────────────────────────────────────────────────┘

Pros:

Automatic failover if a worker dies
Scales to many connectors
REST API for management
Offset storage in Kafka (durable)

Cons:

Requires Kafka cluster
Operational complexity
Resource sharing between connectors

When to use:

Production environments with multiple source databases
Teams with Kafka Connect expertise
Need for high availability

Alternative CDC Tools

While Debezium dominates the open-source space, several alternatives serve specific niches:

AWS Database Migration Service (DMS)

Managed CDC service that streams changes from RDS/Aurora to Kinesis, S3, or other AWS services.

Aspect	Details
Supported Sources	PostgreSQL, MySQL, Oracle, SQL Server, MongoDB
Targets	Kinesis, S3, Redshift, DynamoDB, Elasticsearch
Pricing	Pay per DMS instance-hour + data transfer
Strengths	Zero infrastructure, native AWS integration, easy setup
Weaknesses	AWS-only, limited transformation, less schema flexibility

Azure Data Factory with Change Data Capture

Azure's ETL service can perform CDC from Azure databases.

Google Cloud Datastream

Managed CDC service for GCP, targeting BigQuery and Cloud Storage.

Aspect	Details
Supported Sources	MySQL, PostgreSQL, Oracle
Targets	BigQuery, Cloud Storage, Pub/Sub
Pricing	Per GB of changes captured
Strengths	Serverless, BigQuery integration, no Kafka needed
Weaknesses	GCP-only, limited customization

CDC Tool Comparison Matrix
Tool	Best For	Complexity	Cost	Real-Time
Debezium	Kafka-centric architectures	Medium	Free (OSS)	Yes (<1s)
AWS DMS	AWS-native pipelines	Low	Medium	Yes (~5s)
GCP Datastream	BigQuery analytics	Low	Medium	Yes (~10s)
Confluent CDC	Enterprise Kafka	Low	High	Yes (<1s)
Fivetran	Data warehouse loading	Very Low	High	No (5-60m)
Airbyte	Self-hosted ELT	Medium	Free/Low	No (batch)

Choosing the Right CDC Tool

Selecting a CDC tool involves matching requirements to capabilities. Use this decision framework:

Converting Mermaid diagram...

Key Decision Criteria

•Latency Requirements: Real-time (<1s) demands Debezium or Confluent. Near-real-time (5-60s) opens cloud services. Batch (5+ min) allows Fivetran/Airbyte.
•Target Destination: If Kafka is your backbone, Debezium is natural. For cloud warehouses, managed ELT may be simpler.
•Operational Capacity: Self-hosted Debezium requires Kafka and Connect expertise. Managed services trade cost for simplicity.
•Source Database: Check compatibility. Oracle and Db2 have limited tool support. PostgreSQL and MySQL are well-covered everywhere.
•Transformation Needs: Complex transforms favor Debezium + Kafka Streams/Flink. Simple replication favors managed services.
•Budget: Open-source (Debezium) is free but requires ops investment. Managed services have predictable costs but scale expensively.

Summary: CDC Tools

We've comprehensively covered the CDC tooling landscape. Here are the key takeaways:

Key Takeaways

•Debezium is the open-source standard — Wide database support, rich configuration, active community. First choice for Kafka-centric architectures.
•Configuration matters immensely — Proper snapshot modes, offset handling, and error tolerance differentiate reliable pipelines from fragile ones.
•Event structure enables smart consumers — Before/after states, source metadata, and operation types power intelligent downstream processing.
•SMTs enable in-flight transformation — Filtering, routing, and masking without custom code, but don't overload the connector.
•Deployment options match use cases — Kafka Connect for HA, Debezium Server for non-Kafka targets, Embedded for custom apps.
•Alternatives serve specific niches — Cloud-native services for managed simplicity, commercial platforms for enterprise support, database-native for simple replication.

What's Next:

Now that you understand the CDC tooling ecosystem, we'll explore the practical applications of CDC. The next page covers Use Cases: Replication, Cache Sync, and Beyond—real-world patterns for applying CDC to solve common distributed systems problems.

Page Complete

You now have deep knowledge of Debezium architecture, configuration, and the broader CDC tool ecosystem. You can evaluate and select tools appropriate for your use cases. Next, we'll see CDC applied to solve real-world problems.

3 / 5

Loading learning content...

System Design (HLD)Change Data Capture

Change Data Capture (CDC)

LevelAdvanced

Duration75 mins

TopicChange Data Capture

3 / 5

Debezium and Other CDC Tools

The CDC Tooling Landscape

This page provides comprehensive coverage of CDC tooling: architecture, capabilities, configuration, and when to choose each option.

What You Will Learn

Debezium: The Open-Source Standard

Core Philosophy:

Converting Mermaid diagram...

Debezium Architecture Components:

Component	Purpose	Details
Kafka Connect	Execution framework	Manages connector lifecycle, offsets, parallelism
Source Connectors	Database-specific log readers	PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Db2, Cassandra, Vitess
Schema Registry	Schema management	Confluent or Apicurio Schema Registry for Avro/Protobuf
SMTs (Transforms)	Event transformation	Filter, route, enrich, mask events in-flight
Offset Storage	Progress tracking	Kafka internal topics or external stores

Supported Databases:

PostgreSQL: Logical decoding with pgoutput, decoderbufs, wal2json plugins
MySQL/MariaDB: Binlog reading with GTID support
MongoDB: Oplog and Change Streams
SQL Server: Transaction log reading (no native CDC required)
Oracle: LogMiner or XStream-based capture
Db2: ASN-based capture
Cassandra: Commit log-based capture
Vitess: VStream-based capture

Configuring Debezium for Production

Proper Debezium configuration is critical for reliability and performance. Let's walk through a production-grade configuration:

debezium-postgresql-connector.json
JSON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
{
  "name": "inventory-connector",
  "config": {
    // Connector identification
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.example.com",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${env:DB_PASSWORD}",
    "database.dbname": "inventory",
    
    // Logical naming (used in topic names)
    "topic.prefix": "inventory-db",
    
    // Replication configuration
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_publication",
    
    // What to capture
    "schema.include.list": "public,orders",
    "table.include.list": "public.orders,public.products,orders.line_items",
    
    // Critical: Full row data for updates/deletes
    "replica.identity": "full",
    
    // Snapshot configuration
    "snapshot.mode": "initial",
    "snapshot.isolation.mode": "repeatable_read",
    "snapshot.max.threads": 4,
    
    // Handling large transactions
    "max.batch.size": 2048,
    "max.queue.size": 8192,
    "poll.interval.ms": 500,
    
    // Heartbeat for low-traffic periods
    "heartbeat.interval.ms": 10000,
    "heartbeat.action.query": "INSERT INTO debezium.heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT (id) DO UPDATE SET ts = NOW()",
    
    // Schema handling
    "schema.history.internal.kafka.topic": "schema-history-inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    
    // Transforms (optional)
    "transforms": "route,unwrap",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "(.*)\\.(.*)",
    "transforms.route.replacement": "cdc.$1.$2",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false"
  }
}

Configuration Best Practices

Always use environment variables for credentials (${env:DB_PASSWORD}). 2. Set explicit table include lists rather than capturing everything. 3. Configure heartbeat queries for low-traffic tables. 4. Enable error logging and dead letter queues from day one. 5. Size max.queue.size based on expected transaction sizes.

Debezium Event Structure

Understanding Debezium's event structure is essential for building consumers. Events follow a rich, standardized format that carries full context about each change.

Debezium Event Anatomy
JSON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
{
  "schema": {
    // Kafka Connect schema (omitted for brevity when using Schema Registry)
  },
  "payload": {
    // Source metadata - WHERE did this change happen?
    "source": {
      "version": "2.4.0.Final",
      "connector": "postgresql",
      "name": "inventory-db",
      "ts_ms": 1704067200000,    // Source database timestamp
      "snapshot": "false",       // Was this from snapshot or streaming?
      "db": "inventory",
      "sequence": "[null,\"2F/ABC123\"]",
      "schema": "public",
      "table": "orders",
      "txId": 571,
      "lsn": 33227976,
      "xmin": null
    },
    
    // Before state - NULL for inserts
    "before": {
      "order_id": 12345,
      "status": "processing",
      "amount": 9999,
      "customer_id": 42,
      "updated_at": "2024-01-01T10:00:00Z"
    },
    
    // After state - NULL for deletes
    "after": {
      "order_id": 12345,
      "status": "shipped",
      "amount": 9999,
      "customer_id": 42,
      "updated_at": "2024-01-01T12:05:00Z"
    },
    
    // Operation type: c=create, u=update, d=delete, r=read (snapshot)
    "op": "u",
    
    // Debezium processing timestamp (not source DB time)
    "ts_ms": 1704067205000,
    
    // Transaction metadata (if enabled)
    "transaction": {
      "id": "571:29523420",
      "total_order": 1,
      "data_collection_order": 1
    }
  }
}

Event Fields Explained:

Field	Description	Use Case
`source.ts_ms`	When the change occurred in the database	Ordering, auditing
`source.snapshot`	Whether this came from initial snapshot	Consumer logic may differ
`source.lsn`	Log sequence number	Offset tracking, deduplication
`before`	Row state before change	Differential updates, auditing
`after`	Row state after change	Application of change
`op`	Operation type (c/u/d/r)	Consumer routing logic
`ts_ms` (payload)	Debezium processing time	Lag monitoring

Calculating CDC Lag:

const sourceTsMs = event.payload.source.ts_ms;  // DB commit time
const debeziumTsMs = event.payload.ts_ms;       // Debezium capture time
const now = Date.now();                          // Consumer receive time

const captureLag = debeziumTsMs - sourceTsMs;   // Log read delay
const deliveryLag = now - debeziumTsMs;         // Kafka + consumer delay
const totalLag = now - sourceTsMs;              // End-to-end delay

Unwrapping Events

Single Message Transforms (SMTs)

Common Debezium SMTs
Transform	Purpose	Example Use Case
ExtractNewRecordState	Flatten to just after-state	Simpler consumer messages
RegexRouter	Modify topic names with regex	Add prefixes, change routing
Filter	Drop messages matching predicate	Exclude test data, specific ops
MaskField	Redact sensitive field values	PII protection (SSN, credit card)
ReplaceField	Rename, include, exclude fields	Schema normalization
TimestampConverter	Convert timestamp formats	Epoch to ISO string conversion
ContentBasedRouter	Route based on content	Route by customer region
Outbox	Extract events from outbox table	Outbox pattern implementation

SMT Configuration Examples
JSON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
{
  "transforms": "filter,unwrap,mask,route",
  
  // 1. Filter out DELETE operations (keep only creates/updates)
  "transforms.filter.type": "io.debezium.transforms.Filter",
  "transforms.filter.language": "jsr223.groovy",
  "transforms.filter.condition": "value.op != 'd'",
  
  // 2. Unwrap to simple format with metadata as headers
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.unwrap.drop.tombstones": "false",
  "transforms.unwrap.delete.handling.mode": "rewrite",
  "transforms.unwrap.add.fields": "op,source.ts_ms,source.lsn",
  "transforms.unwrap.add.headers": "op,source.table",
  
  // 3. Mask sensitive fields
  "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.mask.fields": "ssn,credit_card_number",
  "transforms.mask.replacement": "***MASKED***",
  
  // 4. Route to different topics based on table
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "inventory-db\\.public\\.(.*)",
  "transforms.route.replacement": "events.$1"
  
  // Result: "inventory-db.public.orders" → "events.orders"
}

SMT Performance Considerations

Debezium Deployment Patterns

Debezium can be deployed in several ways, each with distinct tradeoffs:

The Standard Production Deployment:

Multiple Kafka Connect workers form a cluster. Connectors are distributed across workers for fault tolerance.

Architecture:

┌─────────────────────────────────────────────────────────────┐
│                    Kafka Connect Cluster                     │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐         │
│  │   Worker 1   │ │   Worker 2   │ │   Worker 3   │         │
│  │  Connector A │ │  Connector B │ │  (Standby)   │         │
│  │  Tasks 1-3   │ │  Tasks 1-2   │ │              │         │
│  └──────────────┘ └──────────────┘ └──────────────┘         │
│                                                              │
│  Consensus via: config, offset, status topics               │
└─────────────────────────────────────────────────────────────┘

Pros:

Automatic failover if a worker dies
Scales to many connectors
REST API for management
Offset storage in Kafka (durable)

Cons:

Requires Kafka cluster
Operational complexity
Resource sharing between connectors

When to use:

Production environments with multiple source databases
Teams with Kafka Connect expertise
Need for high availability

Alternative CDC Tools

While Debezium dominates the open-source space, several alternatives serve specific niches:

AWS Database Migration Service (DMS)

Managed CDC service that streams changes from RDS/Aurora to Kinesis, S3, or other AWS services.

Aspect	Details
Supported Sources	PostgreSQL, MySQL, Oracle, SQL Server, MongoDB
Targets	Kinesis, S3, Redshift, DynamoDB, Elasticsearch
Pricing	Pay per DMS instance-hour + data transfer
Strengths	Zero infrastructure, native AWS integration, easy setup
Weaknesses	AWS-only, limited transformation, less schema flexibility

Azure Data Factory with Change Data Capture

Azure's ETL service can perform CDC from Azure databases.

Google Cloud Datastream

Managed CDC service for GCP, targeting BigQuery and Cloud Storage.

Aspect	Details
Supported Sources	MySQL, PostgreSQL, Oracle
Targets	BigQuery, Cloud Storage, Pub/Sub
Pricing	Per GB of changes captured
Strengths	Serverless, BigQuery integration, no Kafka needed
Weaknesses	GCP-only, limited customization

CDC Tool Comparison Matrix
Tool	Best For	Complexity	Cost	Real-Time
Debezium	Kafka-centric architectures	Medium	Free (OSS)	Yes (<1s)
AWS DMS	AWS-native pipelines	Low	Medium	Yes (~5s)
GCP Datastream	BigQuery analytics	Low	Medium	Yes (~10s)
Confluent CDC	Enterprise Kafka	Low	High	Yes (<1s)
Fivetran	Data warehouse loading	Very Low	High	No (5-60m)
Airbyte	Self-hosted ELT	Medium	Free/Low	No (batch)

Choosing the Right CDC Tool

Selecting a CDC tool involves matching requirements to capabilities. Use this decision framework:

Converting Mermaid diagram...

Key Decision Criteria

•Latency Requirements: Real-time (<1s) demands Debezium or Confluent. Near-real-time (5-60s) opens cloud services. Batch (5+ min) allows Fivetran/Airbyte.
•Target Destination: If Kafka is your backbone, Debezium is natural. For cloud warehouses, managed ELT may be simpler.
•Operational Capacity: Self-hosted Debezium requires Kafka and Connect expertise. Managed services trade cost for simplicity.
•Source Database: Check compatibility. Oracle and Db2 have limited tool support. PostgreSQL and MySQL are well-covered everywhere.
•Transformation Needs: Complex transforms favor Debezium + Kafka Streams/Flink. Simple replication favors managed services.
•Budget: Open-source (Debezium) is free but requires ops investment. Managed services have predictable costs but scale expensively.

Summary: CDC Tools

We've comprehensively covered the CDC tooling landscape. Here are the key takeaways:

Key Takeaways

•Debezium is the open-source standard — Wide database support, rich configuration, active community. First choice for Kafka-centric architectures.
•Configuration matters immensely — Proper snapshot modes, offset handling, and error tolerance differentiate reliable pipelines from fragile ones.
•Event structure enables smart consumers — Before/after states, source metadata, and operation types power intelligent downstream processing.
•SMTs enable in-flight transformation — Filtering, routing, and masking without custom code, but don't overload the connector.
•Deployment options match use cases — Kafka Connect for HA, Debezium Server for non-Kafka targets, Embedded for custom apps.
•Alternatives serve specific niches — Cloud-native services for managed simplicity, commercial platforms for enterprise support, database-native for simple replication.

What's Next:

Page Complete

3 / 5