Here's an uncomfortable truth that distributed systems architects grapple with: in microservices, data duplication isn't just acceptable—it's often necessary. This contradicts decades of database normalization training that taught us duplication is evil, data should live in one place, and redundancy breeds inconsistency.
But the rules change when you're designing for distribution. If the Order Service needs customer information to function, and the Customer Service might be unavailable, what do you do? You either block (coupling, reduced availability) or cache locally (duplication). Most production systems choose duplication.
This page equips you to make duplication decisions deliberately rather than accidentally—understanding when duplication helps, when it hurts, and how to structure it so the costs remain manageable.
By the end of this page, you will understand why data duplication emerges in microservices, the specific costs and benefits of duplicating data, how to choose which data to duplicate, and strategies for managing duplicated data without creating consistency nightmares.
In a monolith, you can JOIN across tables in a single query. Users, orders, products, inventory—all available in one database transaction. But microservices deliberately fragment this architecture. Each service has its own database. Cross-service joins are impossible.
This creates a fundamental tension between service autonomy and data availability. When the Order Service needs to display customer name and email on an order confirmation, it has three options:
Option 1: Call Customer Service at Query Time
Option 2: Return Incomplete Data
Option 3: Store Local Copy of Customer Data
| Approach | Availability | Latency | Consistency | Coupling |
|---|---|---|---|---|
| Synchronous API call | Low (depends on other service) | High (network roundtrip) | Strong | High |
| Return IDs only | High | Low (for initial response) | Strong | Medium |
| Local data copy | High | Low | Eventual | Low |
In practice, most production microservices systems adopt Option 3 for read operations. The pattern is so common that it has a name: materializing data locally. The trade-off is explicit: strong consistency is exchanged for availability and autonomy.
The CAP theorem influence:
The CAP theorem tells us that during network partitions, we must choose between consistency and availability. By duplicating data, you're choosing availability: the Order Service can serve requests even when partitioned from Customer Service. You're accepting that the customer name might be slightly out of date—a trade-off most business scenarios gladly accept.
Database normalization optimizes for storage efficiency and update consistency within a single database. In microservices, you have multiple databases by design. The normalization principle doesn't cross database boundaries—each database can be internally normalized while the system as a whole has strategic duplication.
Duplication isn't free. Understanding the costs helps you make informed trade-offs and design mitigations.
Quantifying staleness:
A key question: How stale can this data be? Different data has different tolerance:
| Data Type | Staleness Tolerance | Reason |
|---|---|---|
| Customer name | Hours to days | Rarely changes; display only |
| Customer email | Minutes to hours | May affect communications |
| Account balance | Seconds | Financial accuracy |
| Inventory count | Seconds to minutes | Overselling risk |
| Product description | Days | Marketing updates infrequent |
| Pricing | Seconds to minutes | Revenue impact |
Data with low staleness tolerance shouldn't be duplicated—call the source synchronously instead. Data with high tolerance is a good candidate for local caching.
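The tolerance thresholds above can be encoded as an explicit policy rather than left as tribal knowledge. Here is a minimal sketch; the type names, fields, and the one-minute threshold are illustrative assumptions, not part of any standard API.

```typescript
// Illustrative policy: duplicate only when the business tolerates
// at least a minute of staleness; anything tighter calls the source.
type AccessStrategy = "duplicate" | "call-source";

interface DataPolicy {
  field: string;
  maxStalenessSeconds: number; // how out-of-date this field may safely be
}

function chooseStrategy(policy: DataPolicy): AccessStrategy {
  return policy.maxStalenessSeconds >= 60 ? "duplicate" : "call-source";
}

// Example policies mirroring the table above
const policies: DataPolicy[] = [
  { field: "customerName", maxStalenessSeconds: 86_400 },      // hours to days
  { field: "accountBalance", maxStalenessSeconds: 5 },         // seconds
  { field: "productDescription", maxStalenessSeconds: 604_800 }, // days
];
```

Making the policy a data structure means new fields get an explicit staleness decision at design time rather than inheriting one by accident.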
When you duplicate data, each consumer makes assumptions about its meaning. Over time, these assumptions diverge. Customer Service adds a 'preferred_name' field; Order Service still uses 'name' without knowing about the change. These semantic drifts are subtle and dangerous.
Despite the costs, duplication offers substantial benefits that justify its use in most microservices architectures.
The availability argument in depth:
Consider a checkout flow that calls five services sequentially. If each has 99.9% availability:
Combined availability = 0.999^5 ≈ 0.995 = 99.5%

That's roughly 3.6 hours of downtime per month from chained dependencies alone. If instead each service holds local copies of what it needs:
Checkout availability ≈ Individual service availability = 99.9%
The result: fewer cascading failures and a better user experience.
This is why companies like Amazon invest heavily in data duplication—the availability gains compound across the system.
Some 'duplication' isn't really duplication—it's historical record-keeping. When an order captures the shipping address, it's not copying current customer data; it's recording facts about that order. If the customer moves, the order's address shouldn't change. This framing clarifies many duplication decisions.
Not all data should be duplicated. A principled approach evaluates each piece of data against specific criteria.
Step 1: Is this data needed for critical operations?
If the consuming service can function (degrade gracefully) without this data, consider not duplicating. Fallback to ID-only relationships with on-demand enrichment.
Step 2: What is the staleness tolerance?
Data that must be current (inventory, pricing, account balance) is dangerous to duplicate. Accept synchronous calls for these. Data that tolerates minutes/hours of staleness is safe to duplicate.
Step 3: How frequently does this data change?
Static or rarely-changing data (country codes, product categories) is easy to synchronize and has minimal staleness window. Frequently-changing data has more consistency complexity.
Step 4: What is the access pattern?
Data accessed on every request (customer name on orders) benefits most from local copies. Data accessed rarely may not justify duplication overhead.
Step 5: What happens if it's wrong?
Showing wrong customer name = minor embarrassment. Showing wrong price = financial loss. Showing wrong medical dosage = safety issue. Risk level determines acceptable staleness.
| Criterion | Duplicate ✓ | Don't Duplicate ✗ |
|---|---|---|
| Staleness tolerance | Minutes to hours acceptable | Must be real-time |
| Change frequency | Rarely or occasionally | Changes constantly |
| Access pattern | Read on every request | Accessed rarely |
| Failure impact | UX degradation | Financial/safety impact |
| Data size | Small (names, IDs, flags) | Large (documents, media) |
| Historical relevance | Point-in-time matters | Only current matters |
```typescript
// ===================================================
// EXAMPLE: Order Service - What to Duplicate
// ===================================================
// The Order Service evaluates each piece of customer and
// product data to decide whether to duplicate locally.
// ===================================================

interface Order {
  id: string;
  customerId: string;

  // DUPLICATED: Customer name for display
  // - Staleness tolerance: HIGH (days acceptable)
  // - Access pattern: Every order list/detail view
  // - Failure impact: UX only (wrong name displayed)
  // Decision: DUPLICATE
  customerName: string;

  // DUPLICATED: Email for notifications
  // - Staleness tolerance: MEDIUM (hours acceptable)
  // - Access pattern: Order confirmation, shipping updates
  // - Failure impact: Missed notification (customer can check site)
  // Decision: DUPLICATE
  customerEmail: string;

  // NOT DUPLICATED: Customer payment method ID
  // - Staleness tolerance: LOW (payment might be expired)
  // - Access pattern: At checkout only
  // - Failure impact: Payment failure, revenue loss
  // Decision: CALL PAYMENTS SERVICE IN REAL-TIME

  // SNAPSHOTTED AT ORDER TIME: Shipping address
  // - This is ORDER data, not customer data
  // - Captures address as provided for THIS shipment
  // - Customer can change address for future orders without
  //   affecting already-placed orders
  // Decision: STORE AS ORDER ATTRIBUTE (not duplication)
  shippingAddress: Address;

  // SNAPSHOTTED AT ORDER TIME: Product price
  // - Captures agreed price at purchase
  // - Protects against price changes affecting past orders
  // Decision: STORE AS LINE ITEM ATTRIBUTE
  lineItems: Array<{
    productId: string;
    productName: string; // Snapshotted for historical record
    unitPrice: number;   // Snapshotted: price at time of purchase
    quantity: number;
  }>;

  // NOT DUPLICATED: Current inventory level
  // - Staleness tolerance: VERY LOW
  // - Wrong data = overselling = customer anger + refunds
  // Decision: CALL INVENTORY SERVICE IN REAL-TIME
}
```

```typescript
// ===================================================
// SYNCHRONIZATION STRATEGY
// ===================================================
// Duplicated fields (customerName, customerEmail) are
// synchronized via events from Customer Service.
//
// Non-duplicated data (payment method, inventory) is
// fetched at the moment it's needed via API calls.
// ===================================================

class OrderService {
  async createOrder(request: CreateOrderRequest): Promise<Order> {
    // REAL-TIME CHECKS (not duplicated data):

    // 1. Verify inventory availability
    const inventory = await this.inventoryClient.checkAvailability(
      request.items.map(i => i.productId)
    );
    if (!inventory.allAvailable) {
      throw new InsufficientInventoryError(inventory.unavailable);
    }

    // 2. Verify payment method is valid
    const paymentValid = await this.paymentClient.validatePaymentMethod(
      request.paymentMethodId
    );
    if (!paymentValid) {
      throw new InvalidPaymentMethodError();
    }

    // DUPLICATED DATA - use local cache:

    // 3. Get customer info from local view
    const customer = await this.localCustomerView.findById(
      request.customerId
    );

    // SNAPSHOT DATA - capture current values:

    // 4. Get current prices (snapshot at order time)
    const products = await this.catalogClient.getProducts(
      request.items.map(i => i.productId)
    );

    // 5. Create order with appropriate data sources
    const order: Order = {
      id: generateId(),
      customerId: request.customerId,

      // From local view (duplicated, may be slightly stale)
      customerName: customer.name,
      customerEmail: customer.email,

      // From request (user-provided for this order)
      shippingAddress: request.shippingAddress,

      // From catalog (snapshotted at order time)
      lineItems: request.items.map(item => {
        const product = products.find(p => p.id === item.productId)!;
        return {
          productId: item.productId,
          productName: product.name, // Snapshotted
          unitPrice: product.price,  // Snapshotted
          quantity: item.quantity,
        };
      }),
    };

    return this.repository.save(order);
  }
}
```

Once you've decided to duplicate data, you need strategies for keeping copies reasonably synchronized. Several patterns exist, each with different trade-offs.
The owner publishes events when data changes. Consumers subscribe and update their local copies.
Pros:
- Near-real-time updates (sub-second to seconds of staleness)
- Loose coupling: the owner doesn't know or care who consumes
- Scales to many consumers without added load on the source

Cons:
- Requires event infrastructure (broker, topics, delivery guarantees)
- Consumers must handle out-of-order and duplicate events
- A missed event silently leaves a copy stale until reconciliation
Best for: Most duplication scenarios; the default choice
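A minimal sketch of an event-driven consumer, assuming a `CustomerUpdatedEvent` shape with a monotonically increasing version from the owner (the event fields and class names are illustrative):

```typescript
// Event published by the owning Customer Service on every change.
interface CustomerUpdatedEvent {
  customerId: string;
  name: string;
  email: string;
  version: number; // monotonically increasing at the owner
}

// Local materialized view maintained by a consuming service.
class LocalCustomerView {
  private store = new Map<string, CustomerUpdatedEvent>();

  // Apply an update only if it is newer than what we hold:
  // this guards against out-of-order and duplicate delivery.
  apply(event: CustomerUpdatedEvent): boolean {
    const current = this.store.get(event.customerId);
    if (current && current.version >= event.version) {
      return false; // stale or duplicate event: ignore
    }
    this.store.set(event.customerId, event);
    return true;
  }

  get(customerId: string): CustomerUpdatedEvent | undefined {
    return this.store.get(customerId);
  }
}
```

The version guard is what makes at-least-once delivery safe: redelivered or reordered events become harmless no-ops.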
Monitor the source database's transaction log and push changes to consumers.
Pros:
- Captures every committed change, even ones made outside application code
- No changes required to the source service's code
- Sub-second to seconds of staleness

Cons:
- Couples consumers to the source's database schema
- Requires CDC tooling and operational expertise to run it
- Low-level row changes must be translated into meaningful updates
Best for: Systems without existing event infrastructure; brownfield migrations
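As a sketch, a consumer of CDC output applies row-level changes to its local copy. This assumes a Debezium-style change envelope with `op`, `before`, and `after` fields; the row type and helper are illustrative:

```typescript
// Debezium-style change envelope: op is "c"reate, "u"pdate, or "d"elete.
interface ChangeEvent<T> {
  op: "c" | "u" | "d";
  before: T | null; // row state before the change (null on create)
  after: T | null;  // row state after the change (null on delete)
}

interface CustomerRow { id: string; name: string }

// Translate a raw row change into an upsert/delete on the local copy.
function applyChange(
  local: Map<string, CustomerRow>,
  event: ChangeEvent<CustomerRow>
): void {
  if (event.op === "d") {
    if (event.before) local.delete(event.before.id); // row removed at source
  } else if (event.after) {
    local.set(event.after.id, event.after); // create/update: upsert
  }
}
```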
Periodic jobs fetch all data from the source and update local copies.
Pros:
- Simple to build and reason about (a scheduled job and a fetch)
- No event infrastructure required
- Naturally self-healing: every run replaces accumulated staleness

Cons:
- Staleness window equals the sync interval (minutes to hours)
- Full refreshes can put heavy load on the source
- Changes between runs are invisible to consumers
Best for: Non-critical data; external systems without events; initial data loads
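A minimal sketch of one refresh run; the `Product` shape and the injected `fetchAll` function are illustrative assumptions (in production this would run under cron or a job scheduler):

```typescript
interface Product { id: string; category: string }

// One full-refresh run: fetch everything from the source and
// replace the local copy wholesale.
async function refreshLocalCopy(
  fetchAll: () => Promise<Product[]>,
  local: Map<string, Product>
): Promise<number> {
  const rows = await fetchAll();
  local.clear();                          // full replace also handles source deletes
  for (const row of rows) local.set(row.id, row);
  return rows.length;                     // rows synchronized this run
}
```

The full replace is what makes this strategy self-healing: rows deleted at the source disappear locally without any tombstone handling.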
Store data locally with a time-to-live. When TTL expires, fetch fresh data.
Pros:
- Simple, well-understood pattern with off-the-shelf tooling
- Staleness is bounded and explicitly configured
- No coordination with the source service required

Cons:
- Cache misses reintroduce the synchronous call and its latency
- Choosing a TTL is a guess: too short wastes calls, too long serves stale data
- A cold cache (e.g., after restart) can flood the source with requests
Best for: Reference data; lookup tables; non-critical caching
| Strategy | Staleness | Complexity | Infrastructure | Best Use Case |
|---|---|---|---|---|
| Event-Driven | Sub-second to seconds | Medium | Event bus | Most scenarios |
| CDC | Sub-second to seconds | High | CDC tooling | Legacy systems |
| Scheduled Sync | Minutes to hours | Low | Cron jobs | Non-critical data |
| Cache + TTL | Bounded by TTL | Low | Cache server | Reference data |
Production systems often combine strategies. Use events for critical data updates, scheduled sync for bulk reconciliation, and TTL caching for reference data. Different data types within the same service may use different strategies.
When you duplicate data, inconsistency will occur. Networks fail, events get delayed, and consumers process at different rates. Engineering for inconsistency means building systems that detect, tolerate, and recover from it.
Include a version number or timestamp with every data update. Consumers can detect when they have stale data by comparing versions.
```typescript
interface VersionedData {
  customerId: string;
  name: string;
  version: number; // Incremented on each update
  updatedAt: Date;
}

// Consumer can log or alert if its local version is far behind
// the latest version reported by the source
if (localCustomer.version < sourceVersion - 10) {
  alertStaleData('customer', customerId, localCustomer.version);
}
```
Periodically compare source and copies, fixing discrepancies. Run during low-traffic periods.
```typescript
async function reconcileCustomers() {
  const sourceCustomers = await customerService.getAllCustomerHashes();
  const localCustomers = await localView.getAllCustomerHashes();
  const discrepancies = findDifferences(sourceCustomers, localCustomers);

  for (const customerId of discrepancies) {
    const fresh = await customerService.getCustomer(customerId);
    await localView.upsert(fresh);
    metrics.increment('reconciliation.fixed');
  }
}
```
Design UIs and workflows to handle stale data gracefully.
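One graceful-degradation tactic is to surface staleness to users instead of hiding it. A minimal sketch; the field names and one-hour threshold are illustrative assumptions:

```typescript
// A locally duplicated record carries the time it was last synchronized.
interface SyncedCustomer { name: string; lastSyncedAt: number }

// Beyond a threshold, qualify the data rather than presenting it as current.
function renderCustomerLabel(c: SyncedCustomer, now: number): string {
  const ageMs = now - c.lastSyncedAt;
  return ageMs > 60 * 60 * 1000
    ? `${c.name} (as of ${new Date(c.lastSyncedAt).toISOString()})`
    : c.name;
}
```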
When stale data leads to wrong actions, have mechanisms to correct them.
The key insight: perfect consistency isn't always possible or cost-effective. Sometimes it's cheaper to handle the occasional inconsistency manually than to engineer a perfectly consistent system.
Humans have dealt with inconsistency forever. Paper-based systems were always slightly inconsistent (forms in transit, updates pending). Many digital processes can tolerate similar latencies. The question is: what's the business-acceptable window of inconsistency?
While duplication can be beneficial, certain practices turn it into a maintenance nightmare. Avoid these anti-patterns.
Anti-Pattern: Cascading Duplication
Service A (owner)
↓ event
Service B (copy 1)
↓ event (forwarding!)
Service C (copy of copy)
↓ event
Service D (copy of copy of copy)
Each hop:
- Adds propagation delay, widening the staleness window
- Obscures who the true owner of the data is
- Multiplies failure modes: a bug in B corrupts C and D
Correct: Hub-and-Spoke
Service A (owner)
↓ events
├── Service B (copy 1)
├── Service C (copy 2)
└── Service D (copy 3)
All consumers:
- Subscribe directly to the owner's events
- See the same staleness (one hop from the source)
- Can be added or removed without affecting each other
A fintech company allowed their Risk Service to 'enrich' duplicated customer data with risk scores—writing to their local customer copy. Later, events from Customer Service overwrote the risk scores. Weeks of risk assessments were lost. Rule: never write to duplicated data except to sync from the source.
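That rule can be enforced in code by giving the local copy a single write path. A minimal sketch under that assumption (class and method names are illustrative):

```typescript
interface CustomerCopy { id: string; name: string }

// The local copy exposes exactly one write path: synchronization
// from the owner. Reads return defensive copies so callers cannot
// mutate the view in place.
class SyncedCustomerCopy {
  private data = new Map<string, CustomerCopy>();

  // ONLY write path: updates driven by owner events.
  applySyncEvent(customer: CustomerCopy): void {
    this.data.set(customer.id, { ...customer });
  }

  findById(id: string): CustomerCopy | undefined {
    const row = this.data.get(id);
    return row ? { ...row } : undefined; // defensive copy
  }
}

// Derived data (e.g., risk scores) lives in a separate, service-owned
// structure keyed by customerId, so sync events can never overwrite it.
const riskScores = new Map<string, number>();
```

Keeping service-owned enrichments in their own store is what would have prevented the fintech incident above: sync events replace only the mirrored fields, never the derived ones.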
Data duplication in microservices isn't a failure of design—it's a deliberate trade-off for availability, performance, and autonomy. The key is making duplication explicit and managed rather than accidental and chaotic.
What's next:
With duplication understood, the question becomes: how do we keep copies synchronized? The next page dives deep into event-driven data synchronization—the dominant pattern for maintaining duplicated data across microservices.
You now understand the trade-offs of data duplication in microservices. Duplication enables autonomy and availability but introduces staleness and sync complexity. Strategic, managed duplication—with clear ownership preserved—is the standard approach in production systems. Next, we'll explore event-driven synchronization for keeping distributed data consistent.