Compute is light. Data is heavy.
You can spin up a containerized service in any cloud within minutes. But moving a 50TB database? That takes days, can cost thousands of dollars in egress fees alone, and carries real risk of data loss or corruption.
Data gravity is the phenomenon where applications cluster around data because moving data is so expensive and slow. Understanding data portability—and its realistic limits—is essential for multi-cloud architecture. This page examines the challenges, patterns, and tooling for managing data across multiple clouds.
After completing this page, you will understand: (1) The physics and economics of data movement, (2) Data format standards that enable portability, (3) Synchronization patterns for different use cases, (4) Transfer mechanisms and their trade-offs, and (5) Strategic approaches to data placement in multi-cloud architectures.
The term data gravity was coined by Dave McCrory in 2010 to describe how data attracts applications and services. Like physical mass attracting objects, large data stores attract the compute workloads that need low-latency access to them.
1. Latency: Applications need fast access to their data. Cross-cloud API calls add 10-100ms+ of latency compared to local calls. For high-throughput workloads, this becomes prohibitive.
2. Egress Costs: Cloud providers charge $0.02-$0.12 per GB to move data out. A 100TB data warehouse queried frequently would cost thousands monthly just in egress.
3. Transfer Time: Even at enterprise network speeds, moving petabytes takes weeks to months. Meanwhile, applications can't access the data in transit.
4. Compliance and Data Residency: Regulations may require data to remain in specific regions or under specific cloud providers' control, preventing movement.
| Data Volume | Transfer at 1 Gbps | Transfer at 10 Gbps | Egress Cost (Approx.) |
|---|---|---|---|
| 1 TB | ~2.5 hours | ~15 minutes | $20-90 |
| 10 TB | ~1 day | ~2.5 hours | $200-900 |
| 100 TB | ~10 days | ~1 day | $2,000-9,000 |
| 1 PB | ~100 days | ~10 days | $20,000-90,000 |
| 10 PB | ~1,000 days (~3 years) | ~100 days | $200,000-900,000 |
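The table's numbers follow directly from line rate and per-GB pricing. Below is a quick sketch of that arithmetic; the egress prices are illustrative assumptions ($0.02-$0.09 per GB), and real rates vary by provider, region, and destination.

```typescript
// Quick sketch of the arithmetic behind the table above.
// Assumes ideal line rate with no protocol overhead; egress prices are
// illustrative, not provider quotes.

function transferHours(terabytes: number, gbps: number): number {
  const bits = terabytes * 1e12 * 8;      // decimal TB -> bits
  const seconds = bits / (gbps * 1e9);    // bits / (bits per second)
  return seconds / 3600;
}

function egressCostUsd(terabytes: number, pricePerGb: number): number {
  return terabytes * 1000 * pricePerGb;   // decimal TB -> GB
}

for (const tb of [1, 10, 100, 1000]) {
  console.log(
    `${tb} TB: ${transferHours(tb, 1).toFixed(1)} h @ 1 Gbps, ` +
    `${transferHours(tb, 10).toFixed(1)} h @ 10 Gbps, ` +
    `$${egressCostUsd(tb, 0.02).toFixed(0)}-$${egressCostUsd(tb, 0.09).toFixed(0)} egress`
  );
}
// 1 TB   ≈ 2.2 h @ 1 Gbps / 0.2 h @ 10 Gbps,  $20-$90 egress
// 100 TB ≈ 222 h (~9 days) @ 1 Gbps,          $2,000-$9,000 egress
```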
For truly massive migrations, network transfer isn't practical. AWS Snowball, Azure Data Box, and Google Transfer Appliance provide physical devices: you load data onto hardware that's shipped to the destination cloud. At petabyte scale, FedEx is faster than the internet.
The Pragmatic Reality: you rarely eliminate data gravity; you design around it. Keep large datasets anchored to a primary cloud and reserve cross-cloud movement for the data that genuinely needs it, using one of the following patterns.
Multi-Cloud Data Architecture Patterns:
| Pattern | Description | When to Use |
|---|---|---|
| Primary with Read Replicas | Data masters in one cloud, replicas in others | Read-heavy workloads requiring low latency across regions/clouds |
| Data Federation | Data stays in place; query layer aggregates | Analytics across clouds without moving underlying data |
| Event Streaming | Changes propagated via events; materialized views per cloud | Eventually consistent read models, decoupled systems |
| Active-Active | Writes accepted anywhere; conflict resolution | Maximum availability at cost of complexity |
| Data Partitioning by Cloud | Different data types in different clouds | Regulatory requirements, specialized processing |
Data portability starts with formats. Proprietary formats lock data into specific systems; open formats enable movement and interoperability.
The Revolution:
Traditional data lakes stored data in raw files (CSV, JSON, Parquet) with metadata managed by compute engines (Hive, Spark). This created tight coupling between data and specific processing tools.
Modern Open Table Formats decouple storage from compute:
| Format | Origin | Key Features | Multi-Cloud Status |
|---|---|---|---|
| Apache Iceberg | Netflix | ACID transactions, schema evolution, time travel, partition evolution | Excellent - cloud-agnostic design, broad support |
| Delta Lake | Databricks | ACID transactions, time travel, unified batch/streaming | Good - originally Databricks-focused, now open |
| Apache Hudi | Uber | Incremental processing, record-level updates, compaction | Good - designed for incremental pipelines |
Why This Matters for Multi-Cloud:
With Apache Iceberg (for example), a table stored on AWS S3 can be queried by Trino, Spark, or Flink engines running in any cloud, read at historical snapshots via time travel, and evolved with new columns without breaking existing readers.
The format is the contract; compute engines implement that contract regardless of where they run.
```sql
-- Apache Iceberg table defined in a catalog
-- Data physically stored on AWS S3
-- Queryable from Trino running anywhere

-- Create catalog pointing to AWS S3
-- (Configuration in Trino's catalog properties file)

-- iceberg-aws.properties:
-- connector.name=iceberg
-- iceberg.catalog.type=glue
-- hive.metastore.glue.region=us-east-1
-- fs.native-s3.enabled=true

-- Query from Trino on GCP accessing S3 data
SELECT
    order_date,
    customer_region,
    SUM(order_total) AS total_revenue,
    COUNT(*) AS order_count
FROM iceberg_aws.sales.orders
WHERE order_date >= DATE '2024-01-01'
  AND order_status = 'COMPLETED'
GROUP BY order_date, customer_region
ORDER BY order_date DESC;

-- Time travel query - access historical snapshot
SELECT COUNT(*) AS historical_orders
FROM iceberg_aws.sales.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-15 00:00:00'
WHERE order_status = 'PENDING';

-- Schema evolution is transparent
-- Old readers continue working as columns are added
ALTER TABLE iceberg_aws.sales.orders
ADD COLUMN customer_tier VARCHAR;
```

For data in motion (APIs, events, messages):
| Format | Pros | Cons | Multi-Cloud Fit |
|---|---|---|---|
| JSON | Human-readable, universally supported | Verbose, no schema enforcement | Excellent - lowest friction |
| Protocol Buffers | Compact, strongly typed, fast | Requires schema coordination | Excellent - with schema registry |
| Avro | Schema embedded, Kafka-native | Less language support than Protobuf | Excellent - especially for streaming |
| MessagePack | Binary JSON, compact | Less tooling than alternatives | Good - drop-in JSON replacement |
For data at rest (files, tables):
| Format | Pros | Cons | Multi-Cloud Fit |
|---|---|---|---|
| Parquet | Columnar, compressed, schema | Row-level updates expensive | Excellent - industry standard |
| ORC | Columnar, heavily optimized for Hive | Smaller ecosystem than Parquet | Good - but Parquet more universal |
| Arrow | In-memory columnar, zero-copy | Primarily for processing, not storage | Excellent - for data exchange |
A common portable data stack: Apache Iceberg for table management over Parquet files for storage, with Protocol Buffers for event schemas. This combination provides strong typing, schema evolution, and query performance—all cloud-agnostic.
The Problem: Producers and consumers of data need to agree on schema. In multi-cloud environments, schema must be accessible from all clouds.
Options: a centralized registry reachable from every cloud (Confluent Schema Registry, whether Confluent Cloud or self-hosted), cloud-native registries such as AWS Glue Schema Registry, open-source registries such as Apicurio, or a Git repository of schemas distributed to each cloud via CI/CD.
Multi-Cloud Schema Distribution:
```
# Multi-cloud schema registry architecture
#
# Option 1: Centralized Registry (accessible from all clouds)
#
#   ┌─────────────────────────────────────────┐
#   │        Schema Registry (Primary)        │
#   │     (Confluent Cloud / Self-hosted)     │
#   │           ┌─────────────────┐           │
#   │           │  Schema Store   │           │
#   │           │  (Kafka topic)  │           │
#   │           └────────┬────────┘           │
#   └────────────────────┼────────────────────┘
#                        │ API
#        ┌───────────────┼───────────────┐
#        │               │               │
#        ▼               ▼               ▼
#  ┌───────────┐   ┌───────────┐   ┌───────────┐
#  │    AWS    │   │    GCP    │   │   Azure   │
#  │ Producers │   │ Consumers │   │ Producers │
#  │ Consumers │   │ Producers │   │ Consumers │
#  └───────────┘   └───────────┘   └───────────┘
#
# Option 2: Federated Registries with Sync
#
#  ┌──────────────┐        ┌──────────────┐        ┌──────────────┐
#  │ AWS Registry │◄──────►│ GCP Registry │◄──────►│ Azure        │
#  │ (Glue +      │  Sync  │ (Apicurio)   │  Sync  │ Registry     │
#  │  Local)      │        │              │        │ (Apicurio)   │
#  └──────┬───────┘        └──────┬───────┘        └──────┬───────┘
#         │                       │                       │
#         ▼                       ▼                       ▼
#    AWS Services            GCP Services            Azure Services
#
# Sync can be:
# - Git-based (schemas in repo, CI/CD pushes to registries)
# - Bi-directional replication between registries
# - Change Data Capture from primary to replicas
```

When data must exist in multiple clouds, synchronization becomes a central architectural concern. The right pattern depends on consistency requirements, latency tolerance, and data volume.
How It Works: Every write is confirmed by multiple clouds before returning success to the client.
Pros: Strong consistency across clouds; no window of data loss if a single cloud fails immediately after a write.
Cons: Every write pays a cross-cloud round trip (often 50-200ms on top of local latency); availability is bounded by the least available participating cloud; write throughput drops accordingly. The sketch below shows where that latency enters the write path.
When to Use: Rarely. Only for small, critical datasets where consistency is non-negotiable and latency is acceptable.
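To make the trade-off concrete, here is a minimal sketch of the synchronous write path. The `CloudWriter` interface and class names are illustrative assumptions, not any provider's SDK; a production version would also need real distributed-commit handling.

```typescript
// Minimal sketch of synchronous cross-cloud writes (hypothetical interfaces).
// A write only succeeds once every participating cloud has confirmed it.

interface CloudWriter {
  cloud: 'aws' | 'gcp' | 'azure';
  write(table: string, row: Record<string, unknown>): Promise<void>;
}

class SynchronousMultiCloudStore {
  constructor(private writers: CloudWriter[], private timeoutMs = 2000) {}

  async write(table: string, row: Record<string, unknown>): Promise<void> {
    // Fan out to every cloud and wait for all of them.
    // The caller pays the latency of the slowest cloud on every write.
    const results = await Promise.allSettled(
      this.writers.map((w) => this.withTimeout(w.write(table, row)))
    );

    const failed = results.filter((r) => r.status === 'rejected');
    if (failed.length > 0) {
      // Without a true two-phase commit, a partial failure leaves clouds
      // inconsistent -- exactly the complexity this pattern carries.
      throw new Error(
        `Write confirmed by only ${results.length - failed.length}/${results.length} clouds`
      );
    }
  }

  private withTimeout<T>(p: Promise<T>): Promise<T> {
    return Promise.race([
      p,
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new Error('cross-cloud write timed out')), this.timeoutMs)
      ),
    ]);
  }
}
```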
How It Works: Writes succeed locally; changes are replicated to other clouds in the background.
Pros: Writes complete at local speed; clouds stay decoupled, so an outage in a secondary cloud never blocks the primary; replication traffic can be batched and throttled.
Cons: Replicas lag behind the primary (typically seconds to minutes); readers in other clouds can see stale data; a primary failure can lose changes that had not yet been replicated.
When to Use: Most common pattern. Suitable for read-heavy workloads where slight staleness is acceptable.
```typescript
// Asynchronous replication architecture for multi-cloud
// Primary writes to local cloud; replicator syncs to secondary clouds

interface ReplicationEvent {
  eventId: string;
  timestamp: Date;
  operation: 'INSERT' | 'UPDATE' | 'DELETE';
  table: string;
  primaryKey: Record<string, unknown>;
  data: Record<string, unknown>;
  sourceCloud: 'aws' | 'gcp' | 'azure';
}

class CrossCloudReplicator {
  private sourceDb: DatabaseConnection;
  private targetDbs: Map<string, DatabaseConnection>;
  private eventQueue: MessageQueue;
  private metrics: MetricsClient;

  async startReplication() {
    // Change Data Capture from source database
    const changeStream = await this.sourceDb.watchChanges({
      tables: ['orders', 'customers', 'products'],
      startAfter: await this.getLastReplicatedPosition(),
    });

    for await (const change of changeStream) {
      const event = this.transformToReplicationEvent(change);

      try {
        // Queue for asynchronous processing
        await this.eventQueue.publish({
          topic: 'cross-cloud-replication',
          key: `${event.table}:${JSON.stringify(event.primaryKey)}`,
          value: event,
          headers: {
            'source-cloud': event.sourceCloud,
            'operation': event.operation,
          },
        });

        // Track replication lag
        this.metrics.recordLag(
          'replication.queue.lag',
          Date.now() - event.timestamp.getTime()
        );

        // Update checkpoint for recovery
        await this.updateReplicationPosition(change.position);
      } catch (error) {
        // Dead letter for failed events
        await this.eventQueue.publish({
          topic: 'cross-cloud-replication-dlq',
          value: { event, error: error.message },
        });
        this.metrics.increment('replication.failures');
      }
    }
  }

  // Consumer running in each target cloud
  async consumeAndApply(targetCloud: string) {
    const targetDb = this.targetDbs.get(targetCloud)!;

    await this.eventQueue.consume({
      topic: 'cross-cloud-replication',
      groupId: `replicator-${targetCloud}`,
      handler: async (event: ReplicationEvent) => {
        // Skip events from our own cloud (avoid loops)
        if (event.sourceCloud === targetCloud) {
          return;
        }

        // Apply with idempotency
        await this.applyEventIdempotently(targetDb, event);

        // Track end-to-end lag
        this.metrics.recordLag(
          'replication.e2e.lag',
          Date.now() - event.timestamp.getTime(),
          { source: event.sourceCloud, target: targetCloud }
        );
      },
    });
  }

  private async applyEventIdempotently(
    db: DatabaseConnection,
    event: ReplicationEvent
  ) {
    // Use event ID to ensure exactly-once semantics
    const applied = await db.query(
      `SELECT 1 FROM _replication_log WHERE event_id = $1`,
      [event.eventId]
    );

    if (applied.rowCount > 0) {
      return; // Already applied
    }

    await db.transaction(async (tx) => {
      // Apply the change
      switch (event.operation) {
        case 'INSERT':
          await tx.insert(event.table, event.data);
          break;
        case 'UPDATE':
          await tx.update(event.table, event.primaryKey, event.data);
          break;
        case 'DELETE':
          await tx.delete(event.table, event.primaryKey);
          break;
      }

      // Record application for idempotency
      await tx.insert('_replication_log', {
        event_id: event.eventId,
        applied_at: new Date(),
      });
    });
  }
}
```

How It Works: Writes are accepted in any cloud; changes are synchronized bidirectionally.
The Conflict Problem:
When users can write to any cloud simultaneously, conflicts are inevitable: the same customer record updated in AWS and in GCP within the same replication window produces two divergent versions, each claiming to be current.
Conflict Resolution Strategies:
| Strategy | Description | Trade-off |
|---|---|---|
| Last-Write-Wins (LWW) | Higher timestamp wins | Simple but can lose data |
| First-Write-Wins | Lower timestamp wins | Predictable but arbitrary |
| Application Logic | Business rules decide | Most correct, most complex |
| CRDTs | Conflict-free replicated data types | Mathematically correct for specific data types |
| Merge Functions | Custom merge logic per field | Flexible but error-prone |
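To make the first and last rows of the table concrete, here is a minimal sketch of record-level last-write-wins and a per-field merge function. The `VersionedRecord` and `FieldVersion` types are illustrative assumptions, not part of any particular library.

```typescript
// Minimal sketch of two conflict-resolution strategies from the table above.
// Types and field names are illustrative only.

interface VersionedRecord {
  data: Record<string, unknown>;
  updatedAt: number;                      // epoch millis assigned by the writing cloud
  sourceCloud: 'aws' | 'gcp' | 'azure';
}

// Last-Write-Wins: simple, but silently discards the losing update.
function resolveLWW(a: VersionedRecord, b: VersionedRecord): VersionedRecord {
  if (a.updatedAt === b.updatedAt) {
    // Tie-break deterministically so every cloud converges to the same winner.
    return a.sourceCloud < b.sourceCloud ? a : b;
  }
  return a.updatedAt > b.updatedAt ? a : b;
}

// Per-field merge: keep the newer value of each field independently.
// More flexible than record-level LWW, but every field needs its own timestamp.
interface FieldVersion {
  value: unknown;
  updatedAt: number;
}

function mergeFields(
  a: Record<string, FieldVersion>,
  b: Record<string, FieldVersion>
): Record<string, FieldVersion> {
  const merged: Record<string, FieldVersion> = { ...a };
  for (const [field, version] of Object.entries(b)) {
    if (!merged[field] || version.updatedAt > merged[field].updatedAt) {
      merged[field] = version;
    }
  }
  return merged;
}
```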
Multi-master replication across clouds is one of the most complex distributed systems problems. Unless you have specific requirements (highest availability, geo-distributed writes), prefer single-primary replication with read replicas. The engineering effort for robust active-active is substantial.
Pattern: Instead of synchronizing database state, synchronize the events that produce that state.
How It Works: Every state change is captured as an immutable event and appended to a durable log (a Kafka or Pulsar topic, for example). The log, not any individual database, is the source of truth; each cloud consumes the stream and builds the materialized views its workloads need.
Advantages for Multi-Cloud: append-only streams replicate cleanly across clouds (topic mirroring is a well-understood problem), each cloud materializes only what it needs in its native stores, and any view can be rebuilt anywhere by replaying the log. A sketch of a per-cloud consumer follows; the diagram after it shows the overall architecture.
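Here is a minimal sketch of such a consumer, with an illustrative `OrderEvent` shape and a generic `ViewStore` interface standing in for whichever database each cloud chooses.

```typescript
// Minimal sketch of a per-cloud consumer building a materialized view from
// the shared event stream. Event shape and store interface are illustrative.

interface OrderEvent {
  eventId: string;
  type: 'OrderPlaced' | 'OrderCompleted' | 'OrderCancelled';
  orderId: string;
  customerId: string;
  total: number;
  occurredAt: string; // ISO timestamp
}

interface ViewStore {
  // e.g. DynamoDB in AWS, BigQuery or Firestore in GCP -- each cloud picks its own
  upsert(key: string, value: Record<string, unknown>): Promise<void>;
  get(key: string): Promise<Record<string, unknown> | undefined>;
}

// Builds an "orders by customer" view; the same events could feed a search
// index or an analytics table in another cloud without any coordination.
async function applyEvent(store: ViewStore, event: OrderEvent): Promise<void> {
  const key = `customer#${event.customerId}`;
  const view =
    (await store.get(key)) ??
    { customerId: event.customerId, openOrders: 0, lifetimeSpend: 0 };

  switch (event.type) {
    case 'OrderPlaced':
      view.openOrders = (view.openOrders as number) + 1;
      break;
    case 'OrderCompleted':
      view.openOrders = Math.max(0, (view.openOrders as number) - 1);
      view.lifetimeSpend = (view.lifetimeSpend as number) + event.total;
      break;
    case 'OrderCancelled':
      view.openOrders = Math.max(0, (view.openOrders as number) - 1);
      break;
  }

  await store.upsert(key, view);
}
```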
```
# Event Sourcing for Multi-Cloud Data Synchronization
#
# ┌────────────────────────────────────────────────────────────────────┐
# │                     Event Bus (Kafka / Pulsar)                     │
# │                                                                    │
# │  ┌────────────────┐   ┌──────────────────┐   ┌─────────────────┐   │
# │  │ orders.events  │   │ customers.events │   │ products.events │   │
# │  │ Topic          │   │ Topic            │   │ Topic           │   │
# │  └───────┬────────┘   └────────┬─────────┘   └────────┬────────┘   │
# └──────────┼─────────────────────┼──────────────────────┼────────────┘
#            │                     │                      │
#            │    Mirroring / Cross-Cloud Replication     │
#            │                     │                      │
# ┌──────────┼─────────────────────┼──────────────────────┼────────────┐
# │ AWS      ▼                     ▼                      ▼            │
# │  ┌────────────────┐   ┌──────────────────┐   ┌─────────────────┐   │
# │  │ Order Service  │   │ Customer Service │   │ Product Service │   │
# │  │ Consumer       │   │ Consumer         │   │ Consumer        │   │
# │  └───────┬────────┘   └────────┬─────────┘   └────────┬────────┘   │
# │          │                     │                      │            │
# │          ▼                     ▼                      ▼            │
# │  ┌───────────────────────────────────────────────────────────────┐ │
# │  │           AWS Materialized Views (DynamoDB / RDS)             │ │
# │  │  - Orders by customer (for order service)                     │ │
# │  │  - Customer profiles (for recommendation engine)              │ │
# │  │  - Product catalog (denormalized)                             │ │
# │  └───────────────────────────────────────────────────────────────┘ │
# └────────────────────────────────────────────────────────────────────┘
#
# ┌────────────────────────────────────────────────────────────────────┐
# │ GCP Cloud                                                          │
# │  ┌────────────────┐   ┌──────────────────┐   ┌─────────────────┐   │
# │  │ Analytics      │   │ ML Pipeline      │   │ Search Index    │   │
# │  │ Consumer       │   │ Consumer         │   │ Consumer        │   │
# │  └───────┬────────┘   └────────┬─────────┘   └────────┬────────┘   │
# │          │                     │                      │            │
# │          ▼                     ▼                      ▼            │
# │  ┌──────────────┐     ┌──────────────────┐   ┌──────────────┐      │
# │  │ BigQuery     │     │ Vertex AI        │   │ Elastic      │      │
# │  │ (Analytics)  │     │ Feature Store    │   │ Search       │      │
# │  └──────────────┘     └──────────────────┘   └──────────────┘      │
# └────────────────────────────────────────────────────────────────────┘
#
# Key insight: same events, different materialized views optimized per cloud
```

Understanding the available tools for moving data between clouds helps in choosing the right approach for your specific requirements.
For S3, GCS, Azure Blob Storage:
Cloud-Native Transfer Services: AWS DataSync, Google Storage Transfer Service, and Azure Data Factory (or AzCopy) provide managed, scheduled copies into their own clouds and can read from competitors' object stores.
Third-Party / Open Source: rclone (shown below), MinIO Client (mc mirror), and similar tools speak to any S3-compatible or major cloud object store and are the most cloud-neutral option.
```bash
#!/bin/bash
# Cross-cloud sync with rclone
# Efficient, resumable, and bandwidth-throttled

# Configure remotes (one-time setup in rclone.conf)
# rclone config

# Sync S3 bucket to GCS
#   --transfers 32    parallel transfers
#   --checkers 16     parallel file checking
#   --bwlimit 100M    bandwidth limit (100 MB/s)
#   --stats 30s       progress stats every 30s
#   --retries 5       retry failed transfers
#   --exclude         skip matching patterns
rclone sync aws-s3:source-bucket gcs:destination-bucket \
  --transfers 32 \
  --checkers 16 \
  --bwlimit 100M \
  --progress \
  --log-file /var/log/rclone-sync.log \
  --log-level INFO \
  --stats 30s \
  --retries 5 \
  --exclude ".git/**" \
  --filter-from /etc/rclone/filter-rules.txt

# Bidirectional sync (careful - conflicts are not automatically resolved)
#   --resync    first run: establish baseline
#   --dry-run   preview changes first
rclone bisync aws-s3:bucket gcs:bucket \
  --resync \
  --dry-run \
  --verbose

# Scheduled sync via cron
# 0 */6 * * * /usr/local/bin/rclone sync ... >> /var/log/rclone.log 2>&1

# For massive migrations, consider parallel rclone instances
# with prefix-based partitioning:
for prefix in {a..z}; do
  rclone sync aws-s3:bucket gcs:bucket \
    --include "${prefix}**" &   # run each prefix in parallel
done
wait  # wait for all to complete
```

For Relational Databases:
| Tool | Description | Multi-Cloud Support |
|---|---|---|
| AWS DMS | Database Migration Service | AWS to/from external |
| GCP Database Migration Service | Managed migration | GCP-focused |
| Azure DMS | Database Migration Service | Azure-focused |
| Debezium | Open-source CDC platform | Any-to-any via Kafka |
| pgloader | Loads MySQL, SQLite, and MS SQL data into PostgreSQL | Any PostgreSQL-compatible target |
| Flyway / Liquibase | Schema migration tools | Cloud-agnostic SQL |
Kafka MirrorMaker 2:
Replicates Kafka clusters across clouds. Essential for multi-cloud event streaming.
# MirrorMaker 2 configuration for cross-cloud Kafka replication
clusters:
- alias: aws-cluster
bootstrap.servers: kafka-aws.example.com:9092
security.protocol: SASL_SSL
- alias: gcp-cluster
bootstrap.servers: kafka-gcp.example.com:9092
security.protocol: SASL_SSL
mirrors:
- source.cluster.alias: aws-cluster
target.cluster.alias: gcp-cluster
topics:
- "orders.*"
- "customers.*"
groups:
- "order-service-consumer"
emit.checkpoints.interval.seconds: 60
sync.topics.interval.seconds: 10
replication.factor: 3
Confluent Cluster Linking:
For Confluent Cloud users, Cluster Linking provides low-latency topic mirroring across regions and clouds.
When Network Isn't Enough:
| Service | Provider | Capacity | When to Use |
|---|---|---|---|
| AWS Snowball | AWS | 80TB per device | >10TB, days faster than network |
| AWS Snowball Edge | AWS | 100TB, with compute | Edge processing + transfer |
| AWS Snowmobile | AWS | 100PB (truck) | Exabyte-scale migration |
| Azure Data Box | Azure | 100TB per device | Large migrations to Azure |
| Google Transfer Appliance | GCP | 100TB-1PB | Large migrations to GCP |
Large migrations often combine approaches: Snowball for initial bulk transfer, then DMS or Debezium for ongoing CDC replication of changes that occurred during transfer. Plan for the "catch-up" period where incremental changes are replicated after bulk load completes.
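A minimal sketch of that sequence follows. The `CdcSource` and `TargetWarehouse` interfaces are hypothetical stand-ins for tools like Debezium and your warehouse loader, not any specific product's API.

```typescript
// Minimal sketch of the "bulk load + CDC catch-up" sequence described above.
// Interfaces are illustrative assumptions, not a specific tool's API.

interface CdcSource {
  currentPosition(): Promise<string>;   // e.g. binlog offset or LSN
  readChangesSince(position: string): AsyncIterable<{ position: string; change: unknown }>;
}

interface TargetWarehouse {
  apply(change: unknown): Promise<void>;
}

async function migrateWithCatchUp(
  source: CdcSource,
  target: TargetWarehouse,
  bulkTransfer: () => Promise<void>     // e.g. export snapshot -> appliance -> import
): Promise<void> {
  // 1. Record the CDC position *before* taking the snapshot, so no change
  //    that happens during the physical transfer is lost.
  const cutover = await source.currentPosition();

  // 2. Bulk transfer the snapshot (days or weeks at petabyte scale).
  await bulkTransfer();

  // 3. Catch-up phase: replay everything that happened since the snapshot.
  //    Changes must be applied idempotently because the snapshot may already
  //    contain some of them.
  for await (const { change } of source.readChangesSince(cutover)) {
    await target.apply(change);
  }

  // 4. Once replication lag is near zero, cut applications over to the target.
}
```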
Rather than replicating everything everywhere, strategic data placement minimizes complexity and cost while meeting requirements.
Classify data by access pattern (read/write frequency and origin), latency sensitivity, consistency requirements, regulatory constraints, and size and growth rate. The table below shows how those classifications translate into placement and synchronization decisions; the sketch after the table captures the same decisions as code.
| Data Type | Primary Cloud | Secondary Clouds | Synchronization |
|---|---|---|---|
| User Profiles | Cloud with most users | Read replicas in other clouds | Async replication, ~seconds lag |
| Transaction Logs | Single cloud (primary) | Event stream for analytics | Event sourcing, append-only |
| ML Training Data | Cloud with ML platform | Usually not replicated | One-time ETL for feature engineering |
| Session Data | Closest to user | Not replicated | User-local, ephemeral |
| Audit Logs | Compliance-dictated location | Archival copies | Write-once, batch sync |
| Content/Media | CDN origin + backups | CDN caches globally | Origin sync, edge cache |
| Analytics Warehouse | Cloud with best analytics | Subset replicas for local BI | Batch ETL, nightly/hourly |
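One way to make such a classification actionable is to record placement decisions as data that deployment tooling can validate. The sketch below uses illustrative types and example entries that mirror rows of the table above; none of the names come from a specific framework.

```typescript
// Minimal sketch of a data classification record capturing placement
// decisions like those in the table above. Names and enums are illustrative.

type Cloud = 'aws' | 'gcp' | 'azure';

type SyncStrategy =
  | { kind: 'async-replication'; maxLagSeconds: number }
  | { kind: 'event-stream'; topic: string }
  | { kind: 'batch-etl'; schedule: 'hourly' | 'nightly' }
  | { kind: 'none' };

interface DataPlacement {
  dataset: string;
  temperature: 'hot' | 'warm' | 'cold' | 'archive';
  residencyConstraint?: string;     // e.g. "EU only" -- checked before placement
  primaryCloud: Cloud;
  secondaryClouds: Cloud[];
  sync: SyncStrategy;
}

// Example entries mirroring two rows of the table above.
const placements: DataPlacement[] = [
  {
    dataset: 'user-profiles',
    temperature: 'hot',
    primaryCloud: 'aws',
    secondaryClouds: ['gcp'],
    sync: { kind: 'async-replication', maxLagSeconds: 5 },
  },
  {
    dataset: 'audit-logs',
    temperature: 'archive',
    residencyConstraint: 'EU only',
    primaryCloud: 'azure',
    secondaryClouds: [],
    sync: { kind: 'batch-etl', schedule: 'nightly' },
  },
];
```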
Hot Data (Frequently Accessed): keep it in the same cloud and region as the compute that uses it; replicate only where another cloud genuinely needs low-latency access.
Warm Data (Periodically Accessed): a single primary copy usually suffices, reached through federated queries or scheduled batch exports.
Cold Data (Rarely Accessed): store one copy in a low-cost tier (S3 Standard-IA, GCS Nearline or Coldline, Azure Cool) and retrieve on demand; cross-cloud replication is rarely worth the egress.
Archive Data (Compliance/Legal Hold): use deep-archive tiers (S3 Glacier Deep Archive, GCS Archive, Azure Archive) in whatever location compliance dictates, and plan for retrieval times measured in hours.
Before optimizing for performance or cost, ensure data placement complies with relevant regulations (GDPR, CCPA, HIPAA, etc.). Regulatory fines far exceed any egress cost savings. Build compliance into your data classification framework.
When moving data isn't feasible, bring queries to the data:
Tools for Data Federation: Trino/Presto (and commercial distributions such as Starburst) query object stores and databases in place across clouds; BigQuery Omni runs BigQuery against data sitting in S3 or Azure Blob Storage; Athena federated queries reach beyond S3 on the AWS side.
Trade-offs:
| Approach | Latency | Egress Cost | Complexity |
|---|---|---|---|
| Replicate all data | Low (local access) | Very High (initial + ongoing) | Medium |
| Federated query | Higher (cross-cloud) | Per-query egress | Medium |
| Move compute to data | Lowest | Minimal | Higher (multi-cloud deployment) |
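A back-of-the-envelope comparison of the first two rows can make the choice concrete. The sketch below uses an assumed egress price of $0.09/GB and made-up workload numbers purely for illustration; plug in your own figures.

```typescript
// Rough comparison of "replicate all data" vs "federated query" egress cost.
// Prices and workload numbers are illustrative assumptions only.

const EGRESS_PER_GB = 0.09;

interface Workload {
  datasetGb: number;                   // dataset size in the "home" cloud
  monthlyChangeGb: number;             // data changed per month (ongoing replication)
  queriesPerMonth: number;
  gbCrossCloudPerQuery: number;        // data pulled across clouds per federated query
}

function replicateAllCost(w: Workload, months: number): number {
  // One full copy up front, then deltas every month.
  return (w.datasetGb + w.monthlyChangeGb * months) * EGRESS_PER_GB;
}

function federatedQueryCost(w: Workload, months: number): number {
  // No bulk copy; every query pays egress for whatever crosses the boundary.
  return w.queriesPerMonth * months * w.gbCrossCloudPerQuery * EGRESS_PER_GB;
}

// Example: 50 TB dataset, 1 TB/month of changes, 10,000 queries/month
// pulling ~0.1 GB across clouds each.
const w: Workload = {
  datasetGb: 50_000,
  monthlyChangeGb: 1_000,
  queriesPerMonth: 10_000,
  gbCrossCloudPerQuery: 0.1,
};

console.log('replicate (12 mo): $', replicateAllCost(w, 12).toFixed(0));   // ~$5,580
console.log('federate  (12 mo): $', federatedQueryCost(w, 12).toFixed(0)); // ~$1,080
```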
Data portability is arguably the most challenging aspect of multi-cloud architecture. Let's consolidate the key principles:
The Data Portability Mindset:
True data portability is often less about moving data and more about designing systems that can access data wherever it resides. A combination of open formats, clear data classification, and appropriate synchronization patterns creates flexibility without the astronomical costs of full replication.
What's Next:
Having examined data portability, the final page of this module explores vendor lock-in mitigation—strategies for preserving strategic flexibility while still benefiting from cloud-specific capabilities.
You now understand the realities of data portability in multi-cloud environments—from the physics of data gravity to practical synchronization patterns and strategic placement decisions. This knowledge is essential for realistic multi-cloud architecture planning.