Imagine you're running an e-commerce platform. Your inventory database holds the source of truth for product availability. But this data needs to flow to multiple places: your Elasticsearch index for search, your Redis cache for fast lookups, your analytics warehouse for reporting, and your recommendation engine for personalization.
How do you keep all these systems synchronized?
The naive approach—having your application write to every system—creates a tightly coupled architecture that's fragile, slow, and prone to inconsistencies. What if the cache write fails after the database write succeeds? What about systems that were added after the original code was written?
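To make that fragility concrete, here is a minimal TypeScript sketch of the dual-write approach; the `db`, `cache`, and `searchIndex` clients are hypothetical stand-ins for a SQL client, Redis, and Elasticsearch, not any real API:

```typescript
// Hypothetical clients standing in for a SQL database, Redis, and Elasticsearch.
declare const db: { updateOrderStatus(id: number, status: string): Promise<void> };
declare const cache: { del(key: string): Promise<void> };
declare const searchIndex: { updateDocument(id: number, doc: object): Promise<void> };

async function shipOrder(orderId: number): Promise<void> {
  await db.updateOrderStatus(orderId, 'shipped');                    // 1. succeeds
  await cache.del(`order:${orderId}`);                               // 2. what if this throws?
  await searchIndex.updateDocument(orderId, { status: 'shipped' });  // 3. ...then this never runs

  // The database now says "shipped" while the cache and search index still
  // say "processing". And every new downstream system means yet another
  // write bolted onto this function.
}
```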
This fundamental challenge—propagating data changes reliably across distributed systems—is precisely what Change Data Capture solves.
By the end of this page, you will understand exactly what Change Data Capture is, why it has become essential in modern distributed architectures, and how it fundamentally changes the way we think about data synchronization. You'll grasp the core concepts that differentiate CDC from traditional integration approaches and understand when CDC provides decisive advantages.
Change Data Capture (CDC) is a set of software design patterns and technologies used to identify and capture changes made to data in a database, and then deliver those changes in real-time to downstream systems.
At its core, CDC answers a deceptively simple question:
What data changed, when did it change, and what did it change from and to?
This seemingly straightforward question becomes remarkably complex in distributed systems. Consider a simple UPDATE statement:
UPDATE orders SET status = 'shipped' WHERE order_id = 12345;
For a single database, this operation is atomic and complete. But in a distributed architecture, this change might need to:
- Update the Elasticsearch index so searches return the new status
- Invalidate or refresh the cached order in Redis
- Land in the analytics warehouse for reporting
- Reach the recommendation engine and any other downstream consumers
CDC captures this change once, at the source, and propagates it reliably to all interested consumers.
CDC treats the database's transaction log as a source of truth for what happened, not just what the current state is. This shift from 'state-based' to 'event-based' thinking unlocks powerful architectural patterns that would be impossible or impractical with traditional approaches.
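One way to see the shift: a state-based view only models what a row looks like right now, while an event-based view models the transition itself. A minimal TypeScript sketch; the field names mirror the example change event shown later on this page:

```typescript
// State-based: all you know is the row's current values.
interface OrderRow {
  order_id: number;
  status: string;
  updated_at: string;
}

// Event-based: you know what changed, when, and from what to what.
interface ChangeEvent<Row> {
  op: 'c' | 'u' | 'd' | 'r'; // create, update, delete, snapshot read
  ts_ms: number;             // when the change was committed
  before: Row | null;        // null for inserts
  after: Row | null;         // null for deletes
}
```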
Formal Definition:
Change Data Capture is the process of:
1. Detecting every insert, update, and delete committed to a source database
2. Capturing each change as a structured event, including the values before and after the change
3. Preserving the order in which the changes occurred
4. Delivering those events reliably, and in real time, to downstream consumers
The sophistication of CDC lies not in any single step, but in performing all of these reliably, at scale, with minimal impact on the source system.
To appreciate why CDC has become crucial, we must understand the evolution of data integration:
The ETL Era (1990s-2000s)
Traditionally, organizations moved data between systems using Extract-Transform-Load (ETL) processes. These ran nightly or hourly, extracting full datasets or deltas based on timestamps, transforming them, and loading them into data warehouses.
This approach worked when:
- Data freshness measured in hours or days was acceptable
- There were only a handful of source and target systems to keep in sync
- The data warehouse was the primary, often the only, downstream consumer
The Microservices Revolution (2010s)
As organizations decomposed monoliths into microservices, each with its own database, the challenge of data synchronization exploded. ETL couldn't keep up—services needed real-time awareness of changes in other services.
The Real-Time Expectation (2020s)
Today's users expect real-time experiences. A customer updating their shipping address expects search, recommendations, and customer service to reflect that change immediately—not in tomorrow's batch job.
| Era | Approach | Latency | Coupling | Reliability |
|---|---|---|---|---|
| 1990s | Batch ETL | Hours to days | Low | High (simple) |
| 2000s | Near-real-time ETL | Minutes to hours | Low | Moderate |
| 2010s | Dual writes (app-level) | Real-time | Very high | Low (unreliable) |
| Mid-2010s | Message queues (explicit) | Real-time | Moderate | Moderate |
| 2020s | Log-based CDC | Real-time | Very low | Very high |
Why CDC Won:
CDC emerged as the superior pattern because it uniquely combines:
- Real-time latency: changes propagate in milliseconds rather than batch windows
- Very low coupling: the source application doesn't know or care who consumes the changes
- Completeness: every insert, update, and delete is captured, regardless of which application made it
- Very high reliability with minimal load on the source database
At a conceptual level, CDC operates as a bridge between the transactional world and the event-driven world. Here's how the data flows:
The Data Flow Explained:
Application writes to database: Your application performs normal CRUD operations—inserts, updates, deletes.
Database writes to transaction log: Before committing any change, the database records it in its transaction log (also called write-ahead log, redo log, or binlog depending on the database). This is fundamental to how databases ensure durability.
CDC connector reads the log: A CDC system connects to the database and reads the transaction log in order, exactly as the database would during recovery.
Changes become events: Each change is transformed into a structured event containing:
- The operation type (create, update, delete, or snapshot read)
- The row's state before and after the change
- The commit timestamp and transaction metadata
- Source metadata identifying the connector, database, schema, and table
Events flow to consumers: These change events are published to a message broker or stream, where any number of consumers can process them independently.
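As a sketch of the consumer side, here is how a downstream service might subscribe to such a stream with the kafkajs client; the broker address and topic name are assumptions, and the event shape matches the example shown just below:

```typescript
import { Kafka } from 'kafkajs';

// Assumed broker address and topic; the topic follows the
// source-name.schema.table convention used in the example event below.
const kafka = new Kafka({ clientId: 'search-indexer', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'search-indexer' });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: 'inventory-db.public.orders', fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      const event = JSON.parse(message.value.toString());
      // Each consumer group receives its own copy of the stream, so the
      // search indexer, cache invalidator, and warehouse loader all process
      // the same changes independently, at their own pace.
      console.log(event.op, event.after ?? event.before);
    },
  });
}

run().catch(console.error);
```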
{ "source": { "version": "2.4.0", "connector": "postgresql", "name": "inventory-db", "db": "inventory", "schema": "public", "table": "orders" }, "op": "u", // u=update, c=create, d=delete, r=read (snapshot) "ts_ms": 1704067200000, "transaction": { "id": "571:29523420", "total_order": 1, "data_collection_order": 1 }, "before": { "order_id": 12345, "status": "processing", "updated_at": "2024-01-01T12:00:00Z" }, "after": { "order_id": 12345, "status": "shipped", "updated_at": "2024-01-01T12:05:00Z" }}Having both the before and after states of a row is remarkably powerful. It allows consumers to understand not just what the current state is, but what transformation occurred. This enables intelligent cache invalidation, differential updates to search indexes, and audit trails without keeping full history in every consuming system.
Understanding CDC's advantages requires comparing it to traditional data synchronization approaches. Each alternative has significant drawbacks that CDC addresses:
Detailed Comparison with All Alternatives:
| Approach | Latency | Reliability | Completeness | Coupling | DB Load |
|---|---|---|---|---|---|
| Dual Writes | Low (ms) | Poor | Incomplete (app-only) | Very High | None |
| Timestamp Polling | Medium (sec-min) | Good | Incomplete (no deletes) | Low | High |
| Application Events | Low (ms) | Moderate | Incomplete (app-only) | Moderate | Low |
| Database Triggers | Low (ms) | Good | Complete | Moderate | High |
| Log-based CDC | Low (ms) | Excellent | Complete | Very Low | Very Low |
How timestamp polling works:
Applications or ETL jobs periodically query tables for rows where updated_at > last_poll_time.
Why it fails:
- Deletes never show up: a deleted row leaves no updated_at value to query
- Every table needs an updated_at column maintained correctly by every writer
- Long-running transactions can commit with an older timestamp than rows already polled, so changes slip through the cracks
- Frequent polling puts significant load on the source database

```sql
-- The polling query
SELECT * FROM orders
WHERE updated_at > '2024-01-01T12:00:00Z'
ORDER BY updated_at;

-- Problem: Transaction A sets updated_at = 12:00:01 but takes 5 seconds to commit,
-- while transaction B sets updated_at = 12:00:03 and commits instantly.
-- A poll at 12:00:04 sees B and advances the watermark past 12:00:01,
-- so A's change is missed when it finally commits.
```
Production-grade CDC systems must exhibit several critical properties. Understanding these properties helps you evaluate CDC solutions and design robust pipelines:
CDC typically guarantees ordering per-row (or per-partition in streaming terms), not globally. If you update Order A then Order B, consumers might see B before A if they're on different partitions. For most use cases this is fine—operations on different entities are independent. When you need cross-entity ordering (rare), you must design for single partitions or coordination.
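In practice, per-row ordering comes from keying each message by the row's primary key, so that all changes for one row land on the same partition; log-based connectors typically do this for you. A sketch of the idea with kafkajs, using assumed topic and broker names:

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'ordering-demo', brokers: ['localhost:9092'] });
const producer = kafka.producer();

async function publishOrderChange(event: { after: { order_id: number } }): Promise<void> {
  // Using the primary key as the message key routes every change for the same
  // row to the same partition, so consumers see that row's changes in commit
  // order. Changes to different rows may still interleave across partitions.
  await producer.send({
    topic: 'inventory-db.public.orders',
    messages: [{ key: String(event.after.order_id), value: JSON.stringify(event) }],
  });
}

async function main(): Promise<void> {
  await producer.connect();
  await publishOrderChange({ after: { order_id: 12345 } });
  await producer.disconnect();
}

main().catch(console.error);
```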
Delivery Semantics Deep Dive:
Understanding delivery guarantees is crucial for designing correct consumers:
| Semantic | Meaning | Consumer Requirement |
|---|---|---|
| At-most-once | Changes may be lost | Tolerate missing data |
| At-least-once | Changes may duplicate | Idempotent processing |
| Exactly-once | Each change processed exactly once | Transactional offset commit |
At-least-once is the practical standard for most CDC systems. Exactly-once requires the entire pipeline (source, CDC, broker, consumer) to support transactional semantics, which adds complexity and latency.
Designing for at-least-once:
```typescript
// Idempotent consumer example.
// Assumes `cache` is a key-value client (e.g. Redis) exposing get/set with a TTL,
// and `updateSearchIndex` is the consumer's own indexing logic.
async function handleOrderUpdate(event: CDCEvent) {
  // Use the event's transaction ID and log offset as the idempotency key
  // (the exact fields depend on the connector and source database)
  const dedupeKey = `${event.source.txId}:${event.source.offset}`;

  // Check if already processed
  const processed = await cache.get(dedupeKey);
  if (processed) return; // Skip duplicate

  // Process the change
  await updateSearchIndex(event.after);

  // Mark as processed (with a TTL to avoid unbounded growth)
  await cache.set(dedupeKey, true, { ttl: '7d' });
}
```
CDC has become a foundational building block in several dominant architectural patterns. Understanding where CDC fits helps you recognize opportunities to apply it:
The Microservices Data Challenge:
In microservices, each service owns its data. But services frequently need data from other services:
CDC's Role:
CDC enables services to maintain local read replicas of data from other services without tight coupling:
Catalog Service DB → CDC → Kafka → Order Service (local product cache)
Order Service DB → CDC → Kafka → Analytics Service (order events)
This pattern provides:
- Loose coupling: the Catalog Service neither knows nor cares that the Order Service is listening
- Fast local reads: the Order Service looks up product data locally instead of making a synchronous cross-service call
- Real-time freshness: catalog changes reach the local replica within moments of being committed
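As an illustration of the first flow above, here is a sketch of the Order Service maintaining a local product replica from the catalog's change stream; the topic name, product shape, and in-memory Map are assumptions made for brevity:

```typescript
import { Kafka } from 'kafkajs';

interface Product { product_id: number; name: string; price: number; }
interface ProductChange { op: 'c' | 'u' | 'd' | 'r'; before: Product | null; after: Product | null; }

// The Order Service's local read replica of catalog data. In production this
// would be a table or cache rather than an in-memory Map.
const localProducts = new Map<number, Product>();

function applyChange(event: ProductChange): void {
  if (event.op === 'd' && event.before) {
    localProducts.delete(event.before.product_id);            // row deleted upstream
  } else if (event.after) {
    localProducts.set(event.after.product_id, event.after);   // insert, update, or snapshot row
  }
}

async function run(): Promise<void> {
  const kafka = new Kafka({ clientId: 'order-service', brokers: ['localhost:9092'] });
  const consumer = kafka.consumer({ groupId: 'order-service-product-replica' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'catalog-db.public.products', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (message.value) applyChange(JSON.parse(message.value.toString()));
    },
  });
}

run().catch(console.error);
```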
CDC is powerful but not universal. Understanding when it shines—and when to use alternatives—is essential for good architecture decisions.
Many organizations use both CDC and application events. CDC captures all data changes for infrastructure concerns (caching, search, replication). Application events carry business semantics for workflow and domain logic. They're complementary, not mutually exclusive.
We've established the foundation for understanding Change Data Capture. Let's consolidate the key insights:
- CDC captures changes once, at the source, by reading the database's transaction log, and delivers them as ordered events to any number of consumers
- It treats the log as the source of truth for what happened, not just what the current state is
- Compared with dual writes, timestamp polling, triggers, and application events, log-based CDC is the only approach that is complete, low-latency, loosely coupled, and gentle on the source database
- Most CDC pipelines deliver at-least-once, so consumers must be idempotent; ordering is guaranteed per row, not globally
- CDC complements application-level events rather than replacing them
What's Next:
Now that you understand what CDC is and why it matters, we'll dive into how it actually works at the database level. The next page explores log-based CDC in detail—how databases write their transaction logs, how CDC systems read them, and the technical mechanics that make real-time change capture possible.
You now understand what Change Data Capture is, its historical context, how it compares to alternatives, and where it fits in modern architectures.