Network connectivity bridges on-premises infrastructure to the cloud. But data is where hybrid cloud becomes genuinely complex. Unlike compute, which can be spun up instantly in any location, data has gravity—it accumulates over years, establishes relationships and dependencies, and cannot simply be moved or replicated without careful consideration of consistency, latency, compliance, and cost.
In hybrid architectures, questions arise constantly: Where should each dataset live? How do on-premises and cloud copies stay consistent? How much latency can applications tolerate? Which data may legally leave the data center, and what does moving it cost?
Hybrid data strategy is the discipline of answering these questions with patterns, technologies, and architectural decisions that enable organizations to leverage their data wherever it resides.
By the end of this page, you will understand the fundamental patterns for managing data across hybrid environments. You'll learn replication topologies, consistency tradeoffs, caching strategies, and how to design data architectures that respect both technical constraints and business requirements.
Data gravity is the concept that large datasets attract applications, services, and other data. Like celestial bodies, massive data accumulations create a gravitational pull that makes migration increasingly difficult over time.
| Data Volume | 100 Mbps Link | 1 Gbps Link | 10 Gbps Link | AWS Snowball |
|---|---|---|---|---|
| 100 GB | 2.2 hours | 13 minutes | 1.3 minutes | N/A (too small) |
| 1 TB | 22 hours | 2.2 hours | 13 minutes | ~1 day (shipping) |
| 10 TB | 9 days | 22 hours | 2.2 hours | ~1 day (shipping) |
| 100 TB | 93 days | 9 days | 22 hours | ~1 week (shipping) |
| 1 PB | 2.5 years | 93 days | 9 days | ~2 weeks (Snowmobile) |
These times assume 100% link saturation, which is unrealistic in practice. Real-world migrations often achieve 30-50% efficiency due to protocol overhead, competing traffic, and endpoint limitations. Plan accordingly—data moves slower than you expect.
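To make the arithmetic behind the table concrete, here is a minimal Python sketch; the function and constants are illustrative, and the efficiency factor models the 30-50% real-world throughput noted above.

```python
# Transfer time for a dataset over a WAN link, as in the table above.
# efficiency models protocol overhead and competing traffic.

def transfer_time_seconds(data_bytes: float, link_mbps: float,
                          efficiency: float = 1.0) -> float:
    """Seconds to move data_bytes over a link_mbps link at the given efficiency."""
    bits = data_bytes * 8
    effective_bps = link_mbps * 1_000_000 * efficiency
    return bits / effective_bps

TB = 10**12
# 10 TB over 1 Gbps at 100% efficiency: ~22 hours, matching the table
print(transfer_time_seconds(10 * TB, 1000) / 3600)        # ~22.2
# The same transfer at a realistic 40% efficiency: ~2.3 days
print(transfer_time_seconds(10 * TB, 1000, 0.4) / 86400)  # ~2.3
```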
When data must exist in both on-premises and cloud environments, replication copies data between locations. The choice of replication pattern depends on consistency requirements, latency tolerance, and conflict handling needs.
Synchronous vs Asynchronous Replication:
Synchronous — Write is not acknowledged until it's confirmed on all replicas. Zero data loss but introduces latency (write + network RTT to replica + replica write). Suitable only for low-latency links.
Asynchronous — Write is acknowledged immediately; replication happens in background. Lower write latency but risk of data loss if primary fails before replication completes. Suitable for most hybrid scenarios.
Semi-Synchronous — Write is acknowledged after reaching at least one replica, not all. Balances durability with latency. MySQL semi-sync replication is an example.
| Mode | Write Latency | Data Loss Risk | Consistency | Use Case |
|---|---|---|---|---|
| Synchronous | High (+ 2x network RTT) | Zero (RPO = 0) | Strong | Financial transactions, critical records |
| Semi-Synchronous | Medium (+ 1x network RTT) | Very Low | Near-Strong | Balanced durability needs |
| Asynchronous | Low (local only) | Some (seconds to minutes) | Eventual | Analytics, reporting, non-critical data |
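The latency column can be read as a simple model: synchronous writes wait for every replica, semi-synchronous for the fastest single replica, asynchronous for none. The sketch below encodes that model under the simplifying assumptions that replication fans out in parallel and replica write time is folded into each RTT; commit protocols that need a second round trip (hence the up-to-2x figure above) are not modeled, and all numbers are hypothetical.

```python
# Expected write latency under the three replication modes described above.
# Assumes parallel fan-out and folds replica write time into each RTT.

def write_latency_ms(local_write_ms: float, replica_rtts_ms: list[float],
                     mode: str) -> float:
    if mode == "sync":       # acknowledged by all replicas: bounded by the slowest
        return local_write_ms + max(replica_rtts_ms)
    if mode == "semi-sync":  # acknowledged by at least one: bounded by the fastest
        return local_write_ms + min(replica_rtts_ms)
    if mode == "async":      # acknowledged locally; replication happens later
        return local_write_ms
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical topology: on-prem replica at 2 ms RTT, cloud replica at 40 ms RTT
rtts = [2.0, 40.0]
for mode in ("sync", "semi-sync", "async"):
    print(mode, write_latency_ms(5.0, rtts, mode))
# sync 45.0, semi-sync 7.0, async 5.0
```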
In hybrid environments spanning on-prem and cloud, network partitions are not theoretical—they happen. When connectivity fails, you must choose: reject writes (consistency) or allow writes on both sides with later reconciliation (availability). Design your replication strategy with this tradeoff in mind.
Different database technologies offer varying levels of support for hybrid replication. Understanding your database's native capabilities is essential for designing effective hybrid data architectures.
```javascript
// MongoDB Hybrid Replica Set Configuration
// Spans on-premises data center and cloud (AWS/Azure/GCP)

// Replica set configuration document
config = {
  _id: "hybridRS",
  version: 1,
  members: [
    // On-premises members (primary eligible)
    { _id: 0, host: "mongo-onprem-1.internal:27017", priority: 2 },
    { _id: 1, host: "mongo-onprem-2.internal:27017", priority: 1 },

    // Cloud members (secondary, can become primary on failover)
    { _id: 2, host: "mongo-cloud-1.us-east-1.compute.internal:27017", priority: 1 },
    { _id: 3, host: "mongo-cloud-2.us-east-1.compute.internal:27017", priority: 1 },

    // Cloud arbiter (votes but holds no data, breaks ties)
    { _id: 4, host: "mongo-arbiter.us-east-1.compute.internal:27017", arbiterOnly: true }
  ],
  settings: {
    // Write concern to ensure durability across locations
    getLastErrorDefaults: { w: "majority", wtimeout: 5000 }
  }
};

// Apply configuration
rs.initiate(config);

// Set read preference for application
// Prefer local reads, fallback to remote on failure
// db.getMongo().setReadPref("nearest", [{"dc": "onprem"}, {}]);
```

Distributed databases requiring quorum (Cassandra, MongoDB, Kafka) can experience write availability issues if WAN latency or partitions prevent majority acknowledgment. Carefully consider member placement and write concern settings to avoid availability cliffs.
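To see the availability cliff concretely, the following sketch (hypothetical member names, assuming one vote per member) checks which side of a partition retains a strict majority of votes and can therefore elect a primary:

```python
# One vote per member; a side needs a strict majority of all votes to
# elect a primary after a partition.

VOTING_MEMBERS = {
    "mongo-onprem-1": "onprem",
    "mongo-onprem-2": "onprem",
    "mongo-cloud-1": "cloud",
    "mongo-cloud-2": "cloud",
    "mongo-arbiter": "cloud",  # votes but holds no data
}

def writable_sides(members: dict[str, str]) -> set[str]:
    """Sites that keep a voting majority if cut off from all other sites."""
    majority = len(members) // 2 + 1
    sites = set(members.values())
    return {site for site in sites
            if sum(1 for s in members.values() if s == site) >= majority}

print(writable_sides(VOTING_MEMBERS))  # {'cloud'}: 3 of 5 votes are in the cloud
# Moving the arbiter on-prem flips the answer, so arbiter placement decides
# which side keeps accepting writes during a WAN partition.
```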
Change Data Capture (CDC) is a pattern for capturing row-level changes from source databases and applying them to target systems. CDC is foundational for hybrid data architectures because it enables near-real-time synchronization without impacting source system performance.
| Tool | Type | Source Support | Target Support | Latency |
|---|---|---|---|---|
| Debezium | Log-based (OSS) | MySQL, PostgreSQL, MongoDB, SQL Server, Oracle | Kafka, then any consumer | Sub-second |
| AWS DMS | Log-based (Managed) | Major RDBMS, MongoDB, S3 | RDS, Redshift, S3, Kinesis | Seconds |
| Striim | Log-based (Enterprise) | All major databases + mainframes | Cloud databases, warehouses | Sub-second |
| Oracle GoldenGate | Log-based (Enterprise) | Oracle, SQL Server, MySQL | Oracle Cloud, others | Sub-second |
| Azure Data Factory | Batch + CDC | Multiple sources via connectors | Azure services, S3 | Minutes to sub-second |
{ "name": "onprem-postgres-to-cloud", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "postgres.onprem.internal", "database.port": "5432", "database.user": "replicator", "database.password": "${secrets:postgres-password}", "database.dbname": "sales", "database.server.name": "onprem-postgres", // Capture specific tables only "table.include.list": "public.orders,public.customers,public.products", // Plugin configuration for logical decoding "plugin.name": "pgoutput", "publication.autocreate.mode": "filtered", // Topic naming "topic.prefix": "hybrid-cdc", // Transforms for cloud compatibility "transforms": "route", "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter", "transforms.route.regex": "(.*)", "transforms.route.replacement": "cloud-ingest.$1", // Snapshot configuration for initial sync "snapshot.mode": "initial", // Handle deletes explicitly "tombstones.on.delete": "true", // Decimal handling for analytics compatibility "decimal.handling.mode": "double" }}Debezium + Kafka provides a robust hybrid data backbone. On-prem changes stream to Kafka (on-prem or cloud), where cloud consumers (data lakes, warehouses, microservices) process them. This decouples source and target, enabling flexible data routing.
When authoritative data resides on-premises but cloud applications need low-latency access, caching bridges the gap. Effective caching strategies reduce cross-boundary traffic, improve application performance, and reduce load on source systems.
| Technology | Deployment Model | Best For | Considerations |
|---|---|---|---|
| Amazon ElastiCache (Redis) | Fully managed in cloud | Session data, API responses, real-time analytics | Must sync from on-prem sources |
| Redis Enterprise | Hybrid (on-prem + cloud) | Active-active with CRDT conflict resolution | Enterprise license required |
| Hazelcast | Distributed (hybrid capable) | In-memory data grid, distributed compute | Java-centric, complex configuration |
| CDN (CloudFront, Akamai) | Edge locations | Static assets, API response caching | Limited to HTTP(S) content |
| Local In-Memory | Application layer | Reference data, configuration | Per-instance, no coordination |
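A common way to apply these technologies is a read-through cache: the cloud application reads from a nearby cache and crosses the hybrid link only on a miss. A minimal sketch using redis-py follows; the hostnames, key scheme, TTL, and fetch_from_onprem are assumptions, not part of any product's API.

```python
# Read-through cache: serve reads from a nearby Redis, cross the hybrid
# link only on a miss, and bound staleness with a TTL.

import json
import redis

cache = redis.Redis(host="elasticache.internal", port=6379)  # hypothetical host
TTL_SECONDS = 300  # this data class tolerates up to 5 minutes of staleness

def fetch_from_onprem(product_id: str) -> dict:
    # Placeholder for the call across the hybrid link (REST, gRPC, DB query, ...)
    raise NotImplementedError

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)            # hit: no WAN round trip
    product = fetch_from_onprem(product_id)  # miss: one trip to the source
    cache.setex(key, TTL_SECONDS, json.dumps(product))
    return product
```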
Cache Invalidation in Hybrid:
The hardest problem in caching is ensuring cached data reflects source changes. In hybrid environments, this is compounded by network latency and potential partitions.
Approaches:

TTL expiry: accept bounded staleness and let entries age out; the simplest option, and the most robust under network partitions.

Event-driven invalidation: source changes (for example, CDC events) trigger evictions of the affected keys, as sketched below.

Versioned keys: writers publish under a new key version, so readers never observe a partially updated entry.

Write-through: writes update the cache and the source together, trading write latency for freshness.
Not all data requires real-time consistency. Product catalogs, user profiles, and configuration data can often tolerate seconds or minutes of staleness. Focus real-time invalidation efforts on truly time-sensitive data like inventory levels or pricing.
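Below is a sketch of the event-driven approach: a small consumer reads Debezium change events from Kafka and evicts the affected cache keys, so the next read repopulates from the source. The topic name follows the connector configuration shown earlier, but it, the hosts, the key scheme, and the envelope handling are assumptions to adapt.

```python
# Consume CDC events and evict affected cache keys (kafka-python + redis-py).
# Assumes Debezium's default envelope, where payload.after is null on deletes.

import json
import redis
from kafka import KafkaConsumer

cache = redis.Redis(host="elasticache.internal", port=6379)  # hypothetical host

consumer = KafkaConsumer(
    "cloud-ingest.hybrid-cdc.public.products",  # per the routing config above
    bootstrap_servers=["kafka.internal:9092"],
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    event = message.value
    if event is None:
        continue  # tombstone record; the delete event was already processed
    row = event["payload"]["after"] or event["payload"]["before"]
    cache.delete(f"product:{row['product_id']}")  # next read repopulates
```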
Hybrid data strategies must account for data sovereignty—laws and regulations governing where data can be stored and processed. This often dictates which data can move to cloud and which must remain on-premises or in specific geographic regions.
| Strategy | Description | Use Case |
|---|---|---|
| Data Residency by Design | Partition data by jurisdiction; each region stores only local data | GDPR compliance for multi-region apps |
| Tokenization | Replace sensitive data with tokens; tokens are safe to process in the cloud | PCI DSS scope reduction; analytics on card data |
| Pseudonymization | Replace identifiers with pseudonyms; mapping stays on-prem | GDPR compliance with cloud processing |
| Encryption with Customer Keys | Data encrypted; customer manages keys on-prem (BYOK) | Sensitive data in cloud under customer control |
| Confidential Computing | Process encrypted data in secure enclaves (TEE) | Sensitive ML training in cloud with privacy |
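To illustrate the tokenization row, here is a deliberately minimal sketch of a token vault that stays on-premises while only opaque tokens travel to the cloud. A real vault would be a hardened, persistent, access-controlled service; the class and token format here are purely illustrative.

```python
# Token vault kept on-premises: cloud systems only ever see opaque tokens.

import secrets

class TokenVault:
    """Maps sensitive values to random tokens with no mathematical link back."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value in self._value_to_token:
            return self._value_to_token[value]  # stable token per value
        token = "tok_" + secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]  # only ever invoked on-prem

vault = TokenVault()
token = vault.tokenize("4111-1111-1111-1111")
# Ship {"card": token} to cloud analytics: grouping and joining still work,
# while the real card number never leaves the on-prem boundary.
```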
Violating data sovereignty regulations carries severe penalties—GDPR fines can reach 4% of global annual revenue. Architecture decisions around data placement must involve legal and compliance teams from the beginning, not as an afterthought.
Beyond replication, hybrid architectures often require data integration—combining data from multiple sources (on-prem and cloud) for analytics, reporting, or unified APIs. Several patterns enable this integration.
```sql
-- dbt model combining on-prem CDC data with cloud-native data
-- This runs in Snowflake (cloud warehouse) after CDC ingestion

-- models/marts/order_analytics.sql

WITH onprem_orders AS (
    -- Orders from on-prem ERP, landed via Debezium CDC
    SELECT
        order_id,
        customer_id,
        order_date,
        region,  -- needed for the weather join below
        total_amount,
        _cdc_timestamp AS synced_at
    FROM {{ source('cdc_ingestion', 'erp_orders') }}
    WHERE _cdc_operation != 'd'  -- Exclude soft-deleted
),

cloud_enrichment AS (
    -- Customer segments from cloud marketing platform
    SELECT
        customer_id,
        segment,
        lifetime_value,
        churn_probability
    FROM {{ source('cloud_marketing', 'customer_segments') }}
),

weather_data AS (
    -- External cloud data source
    SELECT
        date,
        region,
        avg_temperature,
        precipitation
    FROM {{ source('external_apis', 'weather_history') }}
)

SELECT
    o.order_id,
    o.order_date,
    o.total_amount,
    c.segment AS customer_segment,
    c.lifetime_value AS customer_ltv,
    c.churn_probability,
    w.avg_temperature AS weather_temp,
    w.precipitation,
    -- Calculate derived metrics
    CASE
        WHEN c.churn_probability > 0.7 THEN 'high_risk'
        WHEN c.churn_probability > 0.4 THEN 'medium_risk'
        ELSE 'low_risk'
    END AS churn_risk_category
FROM onprem_orders o
LEFT JOIN cloud_enrichment c ON o.customer_id = c.customer_id
LEFT JOIN weather_data w ON o.order_date = w.date AND o.region = w.region
```

The 'Modern Data Stack' (Fivetran/Airbyte for ingestion, Snowflake/BigQuery for warehouse, dbt for transformation, Looker/Mode for visualization) works well for hybrid scenarios when combined with CDC from on-prem sources. Decoupled components allow mixing on-prem and cloud data seamlessly.
Managing data across hybrid environments is the most complex aspect of hybrid cloud architecture. Let's consolidate the key principles:

Respect data gravity: large datasets are slow and costly to move, so decide on placement early and deliberately.

Match replication mode to requirements: synchronous where data loss is unacceptable, asynchronous where write latency matters more.

Use CDC for near-real-time synchronization without burdening source systems.

Cache deliberately: classify data by staleness tolerance and invalidate accordingly.

Design for sovereignty from the start, involving legal and compliance teams in placement decisions.
What's next:
With connectivity established and data strategies defined, how do organizations actually move workloads from on-premises to cloud? The next page explores Migration Patterns—approaches for transitioning applications and data, from lift-and-shift to re-architecture, with strategies for minimizing risk and downtime.
You now understand the patterns and technologies for managing data across hybrid cloud environments. From replication to CDC to caching, you have the toolkit to design data architectures that bridge on-premises and cloud while maintaining consistency, performance, and compliance.