Understanding log-based CDC mechanics is valuable, but building production CDC pipelines from scratch would be impractical. A rich ecosystem of tools has emerged to handle the complexities of log reading, schema management, offset tracking, and delivery guarantees.
Debezium has become the de facto open-source standard for CDC, powering data pipelines at companies from startups to enterprises. But it's not the only option—cloud-native services, commercial platforms, and database-native features each have their place.
This page provides comprehensive coverage of CDC tooling: architecture, capabilities, configuration, and when to choose each option.
By the end of this page, you will understand Debezium's architecture and capabilities in depth, be able to configure Debezium for production use cases, and evaluate alternative CDC tools for specific requirements. You'll have practical knowledge to build reliable CDC pipelines.
Debezium (pronounced "DEE-bee-zee-um") is an open-source distributed platform for change data capture. Originally developed at Red Hat, it remains a Red Hat-sponsored community project, licensed under the Apache License 2.0.
Core Philosophy:
Debezium treats databases as sources of truth for events. Rather than requiring applications to explicitly emit events, Debezium reads the database's transaction log and converts changes into Kafka events. Applications can remain simple CRUD operations while Debezium handles the event generation.
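To make this concrete, here is a minimal sketch of what "applications remain simple CRUD" looks like in practice. It assumes a Node.js service using the node-postgres client and the orders table from the event examples later on this page; the function name and connection string are illustrative.

```javascript
// A plain CRUD write — no event-publishing code anywhere in the application.
// Debezium tails the transaction log and emits the change event on its own.
const { Client } = require('pg'); // node-postgres

async function shipOrder(orderId) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // An ordinary UPDATE; the WAL record it produces is what Debezium captures.
    await client.query(
      'UPDATE orders SET status = $1, updated_at = NOW() WHERE order_id = $2',
      ['shipped', orderId]
    );
  } finally {
    await client.end();
  }
}
```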
Debezium Architecture Components:
| Component | Purpose | Details |
|---|---|---|
| Kafka Connect | Execution framework | Manages connector lifecycle, offsets, parallelism |
| Source Connectors | Database-specific log readers | PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Db2, Cassandra, Vitess |
| Schema Registry | Schema management | Confluent or Apicurio Schema Registry for Avro/Protobuf |
| SMTs (Transforms) | Event transformation | Filter, route, enrich, mask events in-flight |
| Offset Storage | Progress tracking | Kafka internal topics or external stores |
Supported Databases: Debezium ships source connectors for PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Db2, Cassandra, and Vitess.
Proper Debezium configuration is critical for reliability and performance. Let's walk through a production-grade configuration:
```json
{
  "name": "inventory-connector",
  "config": {
    // Connector identification
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.example.com",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${env:DB_PASSWORD}",
    "database.dbname": "inventory",

    // Logical naming (used in topic names)
    "topic.prefix": "inventory-db",

    // Replication configuration
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_publication",

    // What to capture
    "schema.include.list": "public,orders",
    "table.include.list": "public.orders,public.products,orders.line_items",

    // Critical: full row images for updates/deletes.
    // Requires REPLICA IDENTITY FULL on the captured tables; the connector can set it for you.
    "replica.identity.autoset.values": "public.orders:FULL,public.products:FULL,orders.line_items:FULL",

    // Snapshot configuration
    "snapshot.mode": "initial",
    "snapshot.isolation.mode": "repeatable_read",
    "snapshot.max.threads": 4,

    // Handling large transactions
    "max.batch.size": 2048,
    "max.queue.size": 8192,
    "poll.interval.ms": 500,

    // Heartbeat for low-traffic periods
    "heartbeat.interval.ms": 10000,
    "heartbeat.action.query": "INSERT INTO debezium.heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT (id) DO UPDATE SET ts = NOW()",

    // Schema handling
    "schema.history.internal.kafka.topic": "schema-history-inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",

    // Transforms (optional)
    "transforms": "route,unwrap",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "(.*)\\.(.*)",
    "transforms.route.replacement": "cdc.$1.$2",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false"
  }
}
```
Understanding Debezium's event structure is essential for building consumers. Events follow a rich, standardized format that carries full context about each change.

```json
{
  "schema": {
    // Kafka Connect schema (omitted for brevity when using Schema Registry)
  },
  "payload": {
    // Source metadata - WHERE did this change happen?
    "source": {
      "version": "2.4.0.Final",
      "connector": "postgresql",
      "name": "inventory-db",
      "ts_ms": 1704067200000,        // Source database timestamp
      "snapshot": "false",           // Was this from snapshot or streaming?
      "db": "inventory",
      "sequence": "[null,\"2F/ABC123\"]",
      "schema": "public",
      "table": "orders",
      "txId": 571,
      "lsn": 33227976,
      "xmin": null
    },

    // Before state - null for inserts
    "before": {
      "order_id": 12345,
      "status": "processing",
      "amount": 9999,
      "customer_id": 42,
      "updated_at": "2024-01-01T10:00:00Z"
    },

    // After state - null for deletes
    "after": {
      "order_id": 12345,
      "status": "shipped",
      "amount": 9999,
      "customer_id": 42,
      "updated_at": "2024-01-01T12:05:00Z"
    },

    // Operation type: c=create, u=update, d=delete, r=read (snapshot)
    "op": "u",

    // Debezium processing timestamp (not source DB time)
    "ts_ms": 1704067205000,

    // Transaction metadata (if enabled)
    "transaction": {
      "id": "571:29523420",
      "total_order": 1,
      "data_collection_order": 1
    }
  }
}
```

Event Fields Explained:
| Field | Description | Use Case |
|---|---|---|
| `source.ts_ms` | When the change occurred in the source database | Ordering, auditing |
| `source.snapshot` | Whether this event came from the initial snapshot | Consumer logic may differ |
| `source.lsn` | Log sequence number | Offset tracking, deduplication |
| `before` | Row state before the change | Differential updates, auditing |
| `after` | Row state after the change | Applying the change |
| `op` | Operation type (c/u/d/r) | Consumer routing logic |
| `ts_ms` (payload) | Debezium processing time | Lag monitoring |
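To show how the `op` field drives consumer routing, here is a minimal sketch using the kafkajs client. The topic name follows the connector config above; `upsertOrder` and `deleteOrder` are hypothetical application functions.

```javascript
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'orders-cdc-consumer', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'orders-materializer' });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'inventory-db.public.orders' });
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (message.value === null) return;          // tombstone after a delete
      const { payload } = JSON.parse(message.value.toString());
      switch (payload.op) {
        case 'c':                                  // insert
        case 'r':                                  // snapshot read — treat like an insert
        case 'u':                                  // update
          await upsertOrder(payload.after);        // hypothetical application function
          break;
        case 'd':                                  // delete — only "before" is populated
          await deleteOrder(payload.before.order_id); // hypothetical application function
          break;
      }
    },
  });
}

run().catch(console.error);
```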
Calculating CDC Lag:
```javascript
const sourceTsMs = event.payload.source.ts_ms;   // DB commit time
const debeziumTsMs = event.payload.ts_ms;        // Debezium capture time
const now = Date.now();                          // Consumer receive time

const captureLag = debeziumTsMs - sourceTsMs;    // Log read delay
const deliveryLag = now - debeziumTsMs;          // Kafka + consumer delay
const totalLag = now - sourceTsMs;               // End-to-end delay
```
The full event structure is verbose. For simpler consumers, use the ExtractNewRecordState SMT to flatten events to just the 'after' state. Use 'add.headers' to preserve operation type and source metadata as Kafka headers instead of nested fields.
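For illustration, an unwrapped message value for the update shown earlier would look roughly like the following, with the operation type and table name travelling as Kafka message headers rather than nested in the value (exact header names depend on the SMT configuration and Debezium version):

```json
{
  "order_id": 12345,
  "status": "shipped",
  "amount": 9999,
  "customer_id": 42,
  "updated_at": "2024-01-01T12:05:00Z"
}
```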
Debezium supports Single Message Transforms (SMTs)—lightweight transformations applied to each message before it reaches Kafka. SMTs enable filtering, routing, masking, and restructuring without custom code.
| Transform | Purpose | Example Use Case |
|---|---|---|
| ExtractNewRecordState | Flatten to just after-state | Simpler consumer messages |
| RegexRouter | Modify topic names with regex | Add prefixes, change routing |
| Filter | Drop messages matching predicate | Exclude test data, specific ops |
| MaskField | Redact sensitive field values | PII protection (SSN, credit card) |
| ReplaceField | Rename, include, exclude fields | Schema normalization |
| TimestampConverter | Convert timestamp formats | Epoch to ISO string conversion |
| ContentBasedRouter | Route based on content | Route by customer region |
| Outbox | Extract events from outbox table | Outbox pattern implementation |
```json
{
  "transforms": "filter,unwrap,mask,route",

  // 1. Filter out DELETE operations (keep only creates/updates)
  "transforms.filter.type": "io.debezium.transforms.Filter",
  "transforms.filter.language": "jsr223.groovy",
  "transforms.filter.condition": "value.op != 'd'",

  // 2. Unwrap to simple format with metadata as headers
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.unwrap.drop.tombstones": "false",
  "transforms.unwrap.delete.handling.mode": "rewrite",
  "transforms.unwrap.add.fields": "op,source.ts_ms,source.lsn",
  "transforms.unwrap.add.headers": "op,source.table",

  // 3. Mask sensitive fields
  "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.mask.fields": "ssn,credit_card_number",
  "transforms.mask.replacement": "***MASKED***",

  // 4. Route to different topics based on table
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "inventory-db\\.public\\.(.*)",
  "transforms.route.replacement": "events.$1"
  // Result: "inventory-db.public.orders" → "events.orders"
}
```

SMTs execute synchronously on every message. Groovy-based predicates are slower than native transforms, and complex SMT chains add latency. For heavy transformations, consider a dedicated stream processing stage (Kafka Streams, Flink) after the CDC topic rather than overloading SMTs.
Debezium can be deployed in several ways, each with distinct tradeoffs:
The Standard Production Deployment:
Multiple Kafka Connect workers form a cluster. Connectors are distributed across workers for fault tolerance.
Architecture:
```
┌─────────────────────────────────────────────────────────────┐
│                    Kafka Connect Cluster                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │   Worker 1   │  │   Worker 2   │  │   Worker 3   │        │
│  │ Connector A  │  │ Connector B  │  │  (Standby)   │        │
│  │  Tasks 1-3   │  │  Tasks 1-2   │  │              │        │
│  └──────────────┘  └──────────────┘  └──────────────┘        │
│                                                               │
│        Consensus via: config, offset, status topics           │
└─────────────────────────────────────────────────────────────┘
```
Pros:

- Fault tolerance: if a worker fails, its connectors and tasks are rebalanced onto the remaining workers
- Horizontal scalability: add workers to run more connectors or tasks
- Durable state: configuration, offsets, and status live in Kafka topics, managed through a single REST API

Cons:

- More moving parts to operate and monitor (a Connect cluster in addition to Kafka itself)
- Rebalances can briefly pause capture when connectors are added or workers join and leave
- Requires Kafka Connect expertise for tuning, upgrades, and troubleshooting

When to use:

- Production deployments that need high availability, or any setup running more than a handful of connectors
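Operationally, the same Connect REST API that registers connectors is used to watch them. Here is a minimal health-check sketch; the Connect worker URL is an assumption.

```javascript
// Check connector and task health via the Kafka Connect REST API.
// GET /connectors/{name}/status reports the connector state plus one entry per task.
const CONNECT_URL = 'http://connect.example.com:8083'; // assumed Connect worker URL

async function assertHealthy(name) {
  const res = await fetch(`${CONNECT_URL}/connectors/${name}/status`);
  if (!res.ok) throw new Error(`Status request failed: ${res.status}`);
  const status = await res.json();

  const failedTasks = status.tasks.filter((t) => t.state === 'FAILED');
  if (status.connector.state !== 'RUNNING' || failedTasks.length > 0) {
    throw new Error(`Connector ${name} unhealthy: ${JSON.stringify(status)}`);
  }
  return status;
}
```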
While Debezium dominates the open-source space, several alternatives serve specific niches:
AWS Database Migration Service (DMS)
Managed CDC service that streams changes from RDS/Aurora to Kinesis, S3, or other AWS services.
| Aspect | Details |
|---|---|
| Supported Sources | PostgreSQL, MySQL, Oracle, SQL Server, MongoDB |
| Targets | Kinesis, S3, Redshift, DynamoDB, Elasticsearch |
| Pricing | Pay per DMS instance-hour + data transfer |
| Strengths | Zero infrastructure, native AWS integration, easy setup |
| Weaknesses | AWS-only, limited transformation, less schema flexibility |
Azure Data Factory with Change Data Capture
Azure's ETL service can perform CDC from Azure databases.
Google Cloud Datastream
Managed CDC service for GCP, targeting BigQuery and Cloud Storage.
| Aspect | Details |
|---|---|
| Supported Sources | MySQL, PostgreSQL, Oracle |
| Targets | BigQuery, Cloud Storage, Pub/Sub |
| Pricing | Per GB of changes captured |
| Strengths | Serverless, BigQuery integration, no Kafka needed |
| Weaknesses | GCP-only, limited customization |
| Tool | Best For | Complexity | Cost | Real-Time |
|---|---|---|---|---|
| Debezium | Kafka-centric architectures | Medium | Free (OSS) | Yes (<1s) |
| AWS DMS | AWS-native pipelines | Low | Medium | Yes (~5s) |
| GCP Datastream | BigQuery analytics | Low | Medium | Yes (~10s) |
| Confluent CDC | Enterprise Kafka | Low | High | Yes (<1s) |
| Fivetran | Data warehouse loading | Very Low | High | No (5-60m) |
| Airbyte | Self-hosted ELT | Medium | Free/Low | No (batch) |
Selecting a CDC tool involves matching requirements to capabilities. A practical decision framework, based on the comparison above:

- Already running Kafka, or need sub-second latency with rich transformation? Choose Debezium (or Confluent's commercial connectors if you want vendor support).
- Committed to a single cloud with managed sources and targets? Prefer the native service: AWS DMS on AWS, Datastream on GCP, Data Factory on Azure.
- Loading a data warehouse where minutes of latency are acceptable? Managed ELT tools like Fivetran, or self-hosted Airbyte, minimize operational effort.
We've comprehensively covered the CDC tooling landscape. Here are the key takeaways:

- Debezium on Kafka Connect is the open-source standard: log-based connectors for the major databases, a rich change-event envelope, and SMTs for in-flight filtering, masking, and routing
- The event envelope's before/after state, operation type, and source metadata support consumer routing, auditing, and lag monitoring
- Managed alternatives (AWS DMS, GCP Datastream, Fivetran, Airbyte) trade flexibility and latency for lower operational burden; the right choice usually follows from your existing platform
What's Next:
Now that you understand the CDC tooling ecosystem, we'll explore the practical applications of CDC. The next page covers Use Cases: Replication, Cache Sync, and Beyond—real-world patterns for applying CDC to solve common distributed systems problems.
You now have deep knowledge of Debezium architecture, configuration, and the broader CDC tool ecosystem. You can evaluate and select tools appropriate for your use cases. Next, we'll see CDC applied to solve real-world problems.