Compute is light. Data is heavy.
You can spin up a containerized service in any cloud within minutes. But moving a 50TB database? That takes days, can cost thousands of dollars in egress fees alone, and carries real risk of data loss or corruption.
Data gravity is the phenomenon where applications cluster around data because moving data is so expensive and slow. Understanding data portability—and its realistic limits—is essential for multi-cloud architecture. This page examines the challenges, patterns, and tooling for managing data across multiple clouds.
After completing this page, you will understand: (1) The physics and economics of data movement, (2) Data format standards that enable portability, (3) Synchronization patterns for different use cases, (4) Transfer mechanisms and their trade-offs, and (5) Strategic approaches to data placement in multi-cloud architectures.
The term data gravity was coined by Dave McCrory in 2010 to describe how data attracts applications and services. Like physical mass attracting objects, large data stores attract the compute workloads that need low-latency access to them.
1. Latency: Applications need fast access to their data. Cross-cloud API calls add 10-100ms+ of latency compared to local calls. For high-throughput workloads, this becomes prohibitive.
2. Egress Costs: Cloud providers charge $0.02-$0.12 per GB to move data out. A 100TB data warehouse queried frequently would cost thousands monthly just in egress.
3. Transfer Time: Even at enterprise network speeds, moving petabytes takes weeks to months. Meanwhile, applications can't access the data in transit.
4. Compliance and Data Residency: Regulations may require data to remain in specific regions or under specific cloud providers' control, preventing movement.
| Data Volume | Transfer at 1 Gbps | Transfer at 10 Gbps | Egress Cost (Approx.) |
|---|---|---|---|
| 1 TB | ~2.5 hours | ~15 minutes | $20-90 |
| 10 TB | ~1 day | ~2.5 hours | $200-900 |
| 100 TB | ~10 days | ~1 day | $2,000-9,000 |
| 1 PB | ~100 days | ~10 days | $20,000-90,000 |
| 10 PB | ~1,000 days (~3 years) | ~100 days | $200,000-900,000 |
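The table's numbers follow directly from line rate and per-GB pricing. Below is a quick sketch of that arithmetic; the egress prices are illustrative assumptions ($0.02-$0.09 per GB), and real rates vary by provider, region, and destination.

```typescript
// Quick sketch of the arithmetic behind the table above.
// Assumes ideal line rate with no protocol overhead; egress prices are
// illustrative, not provider quotes.

function transferHours(terabytes: number, gbps: number): number {
  const bits = terabytes * 1e12 * 8;      // decimal TB -> bits
  const seconds = bits / (gbps * 1e9);    // bits / (bits per second)
  return seconds / 3600;
}

function egressCostUsd(terabytes: number, pricePerGb: number): number {
  return terabytes * 1000 * pricePerGb;   // decimal TB -> GB
}

for (const tb of [1, 10, 100, 1000]) {
  console.log(
    `${tb} TB: ${transferHours(tb, 1).toFixed(1)} h @ 1 Gbps, ` +
    `${transferHours(tb, 10).toFixed(1)} h @ 10 Gbps, ` +
    `$${egressCostUsd(tb, 0.02).toFixed(0)}-$${egressCostUsd(tb, 0.09).toFixed(0)} egress`
  );
}
// 1 TB   ≈ 2.2 h @ 1 Gbps / 0.2 h @ 10 Gbps,  $20-$90 egress
// 100 TB ≈ 222 h (~9 days) @ 1 Gbps,          $2,000-$9,000 egress
```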
For truly massive migrations, network transfer isn't practical. AWS Snowball, Azure Data Box, and Google Transfer Appliance provide physical devices: you load data onto hardware that's shipped to the destination cloud. At petabyte scale, FedEx is faster than the internet.
The Pragmatic Reality: you rarely eliminate data gravity; you design around it. Keep large datasets anchored to a primary cloud and reserve cross-cloud movement for the data that genuinely needs it, using one of the following patterns.
Multi-Cloud Data Architecture Patterns:
| Pattern | Description | When to Use |
|---|---|---|
| Primary with Read Replicas | Data masters in one cloud, replicas in others | Read-heavy workloads requiring low latency across regions/clouds |
| Data Federation | Data stays in place; query layer aggregates | Analytics across clouds without moving underlying data |
| Event Streaming | Changes propagated via events; materialized views per cloud | Eventually consistent read models, decoupled systems |
| Active-Active | Writes accepted anywhere; conflict resolution | Maximum availability at cost of complexity |
| Data Partitioning by Cloud | Different data types in different clouds | Regulatory requirements, specialized processing |
Data portability starts with formats. Proprietary formats lock data into specific systems; open formats enable movement and interoperability.
The Revolution:
Traditional data lakes stored data in raw files (CSV, JSON, Parquet) with metadata managed by compute engines (Hive, Spark). This created tight coupling between data and specific processing tools.
Modern Open Table Formats decouple storage from compute:
| Format | Origin | Key Features | Multi-Cloud Status |
|---|---|---|---|
| Apache Iceberg | Netflix | ACID transactions, schema evolution, time travel, partition evolution | Excellent - cloud-agnostic design, broad support |
| Delta Lake | Databricks | ACID transactions, time travel, unified batch/streaming | Good - originally Databricks-focused, now open |
| Apache Hudi | Uber | Incremental processing, record-level updates, compaction | Good - designed for incremental pipelines |
Why This Matters for Multi-Cloud:
With Apache Iceberg (for example), a table stored on AWS S3 can be queried by Trino, Spark, or Flink engines running in any cloud, read at historical snapshots via time travel, and evolved with new columns without breaking existing readers.
The format is the contract; compute engines implement that contract regardless of where they run.
```sql
-- Apache Iceberg table defined in a catalog
-- Data physically stored on AWS S3
-- Queryable from Trino running anywhere

-- Create catalog pointing to AWS S3
-- (Configuration in Trino's catalog properties file)

-- iceberg-aws.properties:
-- connector.name=iceberg
-- iceberg.catalog.type=glue
-- hive.metastore.glue.region=us-east-1
-- fs.native-s3.enabled=true

-- Query from Trino on GCP accessing S3 data
SELECT
    order_date,
    customer_region,
    SUM(order_total) AS total_revenue,
    COUNT(*) AS order_count
FROM iceberg_aws.sales.orders
WHERE order_date >= DATE '2024-01-01'
  AND order_status = 'COMPLETED'
GROUP BY order_date, customer_region
ORDER BY order_date DESC;

-- Time travel query - access historical snapshot
SELECT COUNT(*) AS historical_orders
FROM iceberg_aws.sales.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-15 00:00:00'
WHERE order_status = 'PENDING';

-- Schema evolution is transparent
-- Old readers continue working as columns are added
ALTER TABLE iceberg_aws.sales.orders
ADD COLUMN customer_tier VARCHAR;
```

For data in motion (APIs, events, messages):
| Format | Pros | Cons | Multi-Cloud Fit |
|---|---|---|---|
| JSON | Human-readable, universally supported | Verbose, no schema enforcement | Excellent - lowest friction |
| Protocol Buffers | Compact, strongly typed, fast | Requires schema coordination | Excellent - with schema registry |
| Avro | Schema embedded, Kafka-native | Less language support than Protobuf | Excellent - especially for streaming |
| MessagePack | Binary JSON, compact | Less tooling than alternatives | Good - drop-in JSON replacement |
For data at rest (files, tables):
| Format | Pros | Cons | Multi-Cloud Fit |
|---|---|---|---|
| Parquet | Columnar, compressed, schema | Row-level updates expensive | Excellent - industry standard |
| ORC | Columnar, heavily optimized for Hive | Smaller ecosystem than Parquet | Good - but Parquet more universal |
| Arrow | In-memory columnar, zero-copy | Primarily for processing, not storage | Excellent - for data exchange |
A common portable data stack: Apache Iceberg for table management over Parquet files for storage, with Protocol Buffers for event schemas. This combination provides strong typing, schema evolution, and query performance—all cloud-agnostic.
The Problem: Producers and consumers of data need to agree on schema. In multi-cloud environments, schema must be accessible from all clouds.
Options: a centralized registry reachable from every cloud (Confluent Schema Registry, whether Confluent Cloud or self-hosted), cloud-native registries such as AWS Glue Schema Registry, open-source registries such as Apicurio, or a Git repository of schemas distributed to each cloud via CI/CD.
Multi-Cloud Schema Distribution:
```
# Multi-cloud schema registry architecture
#
# Option 1: Centralized Registry (accessible from all clouds)
#
#   ┌─────────────────────────────────────────┐
#   │        Schema Registry (Primary)        │
#   │     (Confluent Cloud / Self-hosted)     │
#   │           ┌─────────────────┐           │
#   │           │  Schema Store   │           │
#   │           │  (Kafka topic)  │           │
#   │           └────────┬────────┘           │
#   └────────────────────┼────────────────────┘
#                        │ API
#        ┌───────────────┼───────────────┐
#        │               │               │
#        ▼               ▼               ▼
#  ┌───────────┐   ┌───────────┐   ┌───────────┐
#  │    AWS    │   │    GCP    │   │   Azure   │
#  │ Producers │   │ Consumers │   │ Producers │
#  │ Consumers │   │ Producers │   │ Consumers │
#  └───────────┘   └───────────┘   └───────────┘
#
# Option 2: Federated Registries with Sync
#
#  ┌──────────────┐        ┌──────────────┐        ┌──────────────┐
#  │ AWS Registry │◄──────►│ GCP Registry │◄──────►│ Azure        │
#  │ (Glue +      │  Sync  │ (Apicurio)   │  Sync  │ Registry     │
#  │  Local)      │        │              │        │ (Apicurio)   │
#  └──────┬───────┘        └──────┬───────┘        └──────┬───────┘
#         │                       │                       │
#         ▼                       ▼                       ▼
#    AWS Services            GCP Services            Azure Services
#
# Sync can be:
# - Git-based (schemas in repo, CI/CD pushes to registries)
# - Bi-directional replication between registries
# - Change Data Capture from primary to replicas
```

When data must exist in multiple clouds, synchronization becomes a central architectural concern. The right pattern depends on consistency requirements, latency tolerance, and data volume.
How It Works: Every write is confirmed by multiple clouds before returning success to the client.
Pros: Strong consistency across clouds; no window of data loss if a single cloud fails immediately after a write.
Cons: Every write pays a cross-cloud round trip (often 50-200ms on top of local latency); availability is bounded by the least available participating cloud; write throughput drops accordingly. The sketch below shows where that latency enters the write path.
When to Use: Rarely. Only for small, critical datasets where consistency is non-negotiable and latency is acceptable.
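To make the trade-off concrete, here is a minimal sketch of the synchronous write path. The `CloudWriter` interface and class names are illustrative assumptions, not any provider's SDK; a production version would also need real distributed-commit handling.

```typescript
// Minimal sketch of synchronous cross-cloud writes (hypothetical interfaces).
// A write only succeeds once every participating cloud has confirmed it.

interface CloudWriter {
  cloud: 'aws' | 'gcp' | 'azure';
  write(table: string, row: Record<string, unknown>): Promise<void>;
}

class SynchronousMultiCloudStore {
  constructor(private writers: CloudWriter[], private timeoutMs = 2000) {}

  async write(table: string, row: Record<string, unknown>): Promise<void> {
    // Fan out to every cloud and wait for all of them.
    // The caller pays the latency of the slowest cloud on every write.
    const results = await Promise.allSettled(
      this.writers.map((w) => this.withTimeout(w.write(table, row)))
    );

    const failed = results.filter((r) => r.status === 'rejected');
    if (failed.length > 0) {
      // Without a true two-phase commit, a partial failure leaves clouds
      // inconsistent -- exactly the complexity this pattern carries.
      throw new Error(
        `Write confirmed by only ${results.length - failed.length}/${results.length} clouds`
      );
    }
  }

  private withTimeout<T>(p: Promise<T>): Promise<T> {
    return Promise.race([
      p,
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new Error('cross-cloud write timed out')), this.timeoutMs)
      ),
    ]);
  }
}
```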
How It Works: Writes succeed locally; changes are replicated to other clouds in the background.
Pros: Writes complete at local speed; clouds stay decoupled, so an outage in a secondary cloud never blocks the primary; replication traffic can be batched and throttled.
Cons: Replicas lag behind the primary (typically seconds to minutes); readers in other clouds can see stale data; a primary failure can lose changes that had not yet been replicated.
When to Use: Most common pattern. Suitable for read-heavy workloads where slight staleness is acceptable.
```typescript
// Asynchronous replication architecture for multi-cloud
// Primary writes to local cloud; replicator syncs to secondary clouds

interface ReplicationEvent {
  eventId: string;
  timestamp: Date;
  operation: 'INSERT' | 'UPDATE' | 'DELETE';
  table: string;
  primaryKey: Record<string, unknown>;
  data: Record<string, unknown>;
  sourceCloud: 'aws' | 'gcp' | 'azure';
}

class CrossCloudReplicator {
  private sourceDb: DatabaseConnection;
  private targetDbs: Map<string, DatabaseConnection>;
  private eventQueue: MessageQueue;
  private metrics: MetricsClient;

  async startReplication() {
    // Change Data Capture from source database
    const changeStream = await this.sourceDb.watchChanges({
      tables: ['orders', 'customers', 'products'],
      startAfter: await this.getLastReplicatedPosition(),
    });

    for await (const change of changeStream) {
      const event = this.transformToReplicationEvent(change);

      try {
        // Queue for asynchronous processing
        await this.eventQueue.publish({
          topic: 'cross-cloud-replication',
          key: `${event.table}:${JSON.stringify(event.primaryKey)}`,
          value: event,
          headers: {
            'source-cloud': event.sourceCloud,
            'operation': event.operation,
          },
        });

        // Track replication lag
        this.metrics.recordLag(
          'replication.queue.lag',
          Date.now() - event.timestamp.getTime()
        );

        // Update checkpoint for recovery
        await this.updateReplicationPosition(change.position);
      } catch (error) {
        // Dead letter for failed events
        await this.eventQueue.publish({
          topic: 'cross-cloud-replication-dlq',
          value: { event, error: error.message },
        });
        this.metrics.increment('replication.failures');
      }
    }
  }

  // Consumer running in each target cloud
  async consumeAndApply(targetCloud: string) {
    const targetDb = this.targetDbs.get(targetCloud)!;

    await this.eventQueue.consume({
      topic: 'cross-cloud-replication',
      groupId: `replicator-${targetCloud}`,
      handler: async (event: ReplicationEvent) => {
        // Skip events from our own cloud (avoid loops)
        if (event.sourceCloud === targetCloud) {
          return;
        }

        // Apply with idempotency
        await this.applyEventIdempotently(targetDb, event);

        // Track end-to-end lag
        this.metrics.recordLag(
          'replication.e2e.lag',
          Date.now() - event.timestamp.getTime(),
          { source: event.sourceCloud, target: targetCloud }
        );
      },
    });
  }

  private async applyEventIdempotently(
    db: DatabaseConnection,
    event: ReplicationEvent
  ) {
    // Use event ID to ensure exactly-once semantics
    const applied = await db.query(
      `SELECT 1 FROM _replication_log WHERE event_id = $1`,
      [event.eventId]
    );

    if (applied.rowCount > 0) {
      return; // Already applied
    }

    await db.transaction(async (tx) => {
      // Apply the change
      switch (event.operation) {
        case 'INSERT':
          await tx.insert(event.table, event.data);
          break;
        case 'UPDATE':
          await tx.update(event.table, event.primaryKey, event.data);
          break;
        case 'DELETE':
          await tx.delete(event.table, event.primaryKey);
          break;
      }

      // Record application for idempotency
      await tx.insert('_replication_log', {
        event_id: event.eventId,
        applied_at: new Date(),
      });
    });
  }
}
```

How It Works: Writes are accepted in any cloud; changes are synchronized bidirectionally.
The Conflict Problem:
When users can write to any cloud simultaneously, conflicts are inevitable: the same customer record updated in AWS and in GCP within the same replication window produces two divergent versions, each claiming to be current.
Conflict Resolution Strategies:
| Strategy | Description | Trade-off |
|---|---|---|
| Last-Write-Wins (LWW) | Higher timestamp wins | Simple but can lose data |
| First-Write-Wins | Lower timestamp wins | Predictable but arbitrary |
| Application Logic | Business rules decide | Most correct, most complex |
| CRDTs | Conflict-free replicated data types | Mathematically correct for specific data types |
| Merge Functions | Custom merge logic per field | Flexible but error-prone |
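To make the first and last rows of the table concrete, here is a minimal sketch of record-level last-write-wins and a per-field merge function. The `VersionedRecord` and `FieldVersion` types are illustrative assumptions, not part of any particular library.

```typescript
// Minimal sketch of two conflict-resolution strategies from the table above.
// Types and field names are illustrative only.

interface VersionedRecord {
  data: Record<string, unknown>;
  updatedAt: number;                      // epoch millis assigned by the writing cloud
  sourceCloud: 'aws' | 'gcp' | 'azure';
}

// Last-Write-Wins: simple, but silently discards the losing update.
function resolveLWW(a: VersionedRecord, b: VersionedRecord): VersionedRecord {
  if (a.updatedAt === b.updatedAt) {
    // Tie-break deterministically so every cloud converges to the same winner.
    return a.sourceCloud < b.sourceCloud ? a : b;
  }
  return a.updatedAt > b.updatedAt ? a : b;
}

// Per-field merge: keep the newer value of each field independently.
// More flexible than record-level LWW, but every field needs its own timestamp.
interface FieldVersion {
  value: unknown;
  updatedAt: number;
}

function mergeFields(
  a: Record<string, FieldVersion>,
  b: Record<string, FieldVersion>
): Record<string, FieldVersion> {
  const merged: Record<string, FieldVersion> = { ...a };
  for (const [field, version] of Object.entries(b)) {
    if (!merged[field] || version.updatedAt > merged[field].updatedAt) {
      merged[field] = version;
    }
  }
  return merged;
}
```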
Multi-master replication across clouds is one of the most complex distributed systems problems. Unless you have specific requirements (highest availability, geo-distributed writes), prefer single-primary replication with read replicas. The engineering effort for robust active-active is substantial.
Pattern: Instead of synchronizing database state, synchronize the events that produce that state.
How It Works: Every state change is captured as an immutable event and appended to a durable log (a Kafka or Pulsar topic, for example). The log, not any individual database, is the source of truth; each cloud consumes the stream and builds the materialized views its workloads need.
Advantages for Multi-Cloud: append-only streams replicate cleanly across clouds (topic mirroring is a well-understood problem), each cloud materializes only what it needs in its native stores, and any view can be rebuilt anywhere by replaying the log. A sketch of a per-cloud consumer follows; the diagram after it shows the overall architecture.
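Here is a minimal sketch of such a consumer, with an illustrative `OrderEvent` shape and a generic `ViewStore` interface standing in for whichever database each cloud chooses.

```typescript
// Minimal sketch of a per-cloud consumer building a materialized view from
// the shared event stream. Event shape and store interface are illustrative.

interface OrderEvent {
  eventId: string;
  type: 'OrderPlaced' | 'OrderCompleted' | 'OrderCancelled';
  orderId: string;
  customerId: string;
  total: number;
  occurredAt: string; // ISO timestamp
}

interface ViewStore {
  // e.g. DynamoDB in AWS, BigQuery or Firestore in GCP -- each cloud picks its own
  upsert(key: string, value: Record<string, unknown>): Promise<void>;
  get(key: string): Promise<Record<string, unknown> | undefined>;
}

// Builds an "orders by customer" view; the same events could feed a search
// index or an analytics table in another cloud without any coordination.
async function applyEvent(store: ViewStore, event: OrderEvent): Promise<void> {
  const key = `customer#${event.customerId}`;
  const view =
    (await store.get(key)) ??
    { customerId: event.customerId, openOrders: 0, lifetimeSpend: 0 };

  switch (event.type) {
    case 'OrderPlaced':
      view.openOrders = (view.openOrders as number) + 1;
      break;
    case 'OrderCompleted':
      view.openOrders = Math.max(0, (view.openOrders as number) - 1);
      view.lifetimeSpend = (view.lifetimeSpend as number) + event.total;
      break;
    case 'OrderCancelled':
      view.openOrders = Math.max(0, (view.openOrders as number) - 1);
      break;
  }

  await store.upsert(key, view);
}
```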
```
# Event Sourcing for Multi-Cloud Data Synchronization
#
# ┌────────────────────────────────────────────────────────────────────┐
# │                     Event Bus (Kafka / Pulsar)                     │
# │                                                                    │
# │  ┌────────────────┐   ┌──────────────────┐   ┌─────────────────┐   │
# │  │ orders.events  │   │ customers.events │   │ products.events │   │
# │  │ Topic          │   │ Topic            │   │ Topic           │   │
# │  └───────┬────────┘   └────────┬─────────┘   └────────┬────────┘   │
# └──────────┼─────────────────────┼──────────────────────┼────────────┘
#            │                     │                      │
#            │    Mirroring / Cross-Cloud Replication     │
#            │                     │                      │
# ┌──────────┼─────────────────────┼──────────────────────┼────────────┐
# │ AWS      ▼                     ▼                      ▼            │
# │  ┌────────────────┐   ┌──────────────────┐   ┌─────────────────┐   │
# │  │ Order Service  │   │ Customer Service │   │ Product Service │   │
# │  │ Consumer       │   │ Consumer         │   │ Consumer        │   │
# │  └───────┬────────┘   └────────┬─────────┘   └────────┬────────┘   │
# │          │                     │                      │            │
# │          ▼                     ▼                      ▼            │
# │  ┌───────────────────────────────────────────────────────────────┐ │
# │  │           AWS Materialized Views (DynamoDB / RDS)             │ │
# │  │  - Orders by customer (for order service)                     │ │
# │  │  - Customer profiles (for recommendation engine)              │ │
# │  │  - Product catalog (denormalized)                             │ │
# │  └───────────────────────────────────────────────────────────────┘ │
# └────────────────────────────────────────────────────────────────────┘
#
# ┌────────────────────────────────────────────────────────────────────┐
# │ GCP Cloud                                                          │
# │  ┌────────────────┐   ┌──────────────────┐   ┌─────────────────┐   │
# │  │ Analytics      │   │ ML Pipeline      │   │ Search Index    │   │
# │  │ Consumer       │   │ Consumer         │   │ Consumer        │   │
# │  └───────┬────────┘   └────────┬─────────┘   └────────┬────────┘   │
# │          │                     │                      │            │
# │          ▼                     ▼                      ▼            │
# │  ┌──────────────┐     ┌──────────────────┐   ┌──────────────┐      │
# │  │ BigQuery     │     │ Vertex AI        │   │ Elastic      │      │
# │  │ (Analytics)  │     │ Feature Store    │   │ Search       │      │
# │  └──────────────┘     └──────────────────┘   └──────────────┘      │
# └────────────────────────────────────────────────────────────────────┘
#
# Key insight: same events, different materialized views optimized per cloud
```

Understanding the available tools for moving data between clouds helps in choosing the right approach for your specific requirements.
For S3, GCS, Azure Blob Storage:
Cloud-Native Transfer Services: AWS DataSync, Google Storage Transfer Service, and Azure Data Factory (or AzCopy) provide managed, scheduled copies into their own clouds and can read from competitors' object stores.
Third-Party / Open Source: rclone (shown below), MinIO Client (mc mirror), and similar tools speak to any S3-compatible or major cloud object store and are the most cloud-neutral option.
```bash
#!/bin/bash
# Cross-cloud sync with rclone
# Efficient, resumable, and bandwidth-throttled

# Configure remotes (one-time setup in rclone.conf)
# rclone config

# Sync S3 bucket to GCS
#   --transfers 32    parallel transfers
#   --checkers 16     parallel file checking
#   --bwlimit 100M    bandwidth limit (100 MB/s)
#   --stats 30s       progress stats every 30s
#   --retries 5       retry failed transfers
#   --exclude         skip matching patterns
rclone sync aws-s3:source-bucket gcs:destination-bucket \
  --transfers 32 \
  --checkers 16 \
  --bwlimit 100M \
  --progress \
  --log-file /var/log/rclone-sync.log \
  --log-level INFO \
  --stats 30s \
  --retries 5 \
  --exclude ".git/**" \
  --filter-from /etc/rclone/filter-rules.txt

# Bidirectional sync (careful - conflicts are not automatically resolved)
#   --resync    first run: establish baseline
#   --dry-run   preview changes first
rclone bisync aws-s3:bucket gcs:bucket \
  --resync \
  --dry-run \
  --verbose

# Scheduled sync via cron
# 0 */6 * * * /usr/local/bin/rclone sync ... >> /var/log/rclone.log 2>&1

# For massive migrations, consider parallel rclone instances
# with prefix-based partitioning:
for prefix in {a..z}; do
  rclone sync aws-s3:bucket gcs:bucket \
    --include "${prefix}**" &   # run each prefix in parallel
done
wait  # wait for all to complete
```

For Relational Databases:
| Tool | Description | Multi-Cloud Support |
|---|---|---|
| AWS DMS | Database Migration Service | AWS to/from external |
| GCP Database Migration Service | Managed migration | GCP-focused |
| Azure DMS | Database Migration Service | Azure-focused |
| Debezium | Open-source CDC platform | Any-to-any via Kafka |
| pgloader | Loads MySQL, SQLite, and MS SQL data into PostgreSQL | Any PostgreSQL-compatible target |
| Flyway / Liquibase | Schema migration tools | Cloud-agnostic SQL |
Kafka MirrorMaker 2:
Replicates Kafka clusters across clouds. Essential for multi-cloud event streaming.
# MirrorMaker 2 configuration for cross-cloud Kafka replication
clusters:
- alias: aws-cluster
bootstrap.servers: kafka-aws.example.com:9092
security.protocol: SASL_SSL
- alias: gcp-cluster
bootstrap.servers: kafka-gcp.example.com:9092
security.protocol: SASL_SSL
mirrors:
- source.cluster.alias: aws-cluster
target.cluster.alias: gcp-cluster
topics:
- "orders.*"
- "customers.*"
groups:
- "order-service-consumer"
emit.checkpoints.interval.seconds: 60
sync.topics.interval.seconds: 10
replication.factor: 3
Confluent Cluster Linking:
For Confluent Cloud users, Cluster Linking provides low-latency topic mirroring across regions and clouds.
When Network Isn't Enough:
| Service | Provider | Capacity | When to Use |
|---|---|---|---|
| AWS Snowball | AWS | 80TB per device | >10TB, days faster than network |
| AWS Snowball Edge | AWS | 100TB, with compute | Edge processing + transfer |
| AWS Snowmobile | AWS | 100PB (truck) | Exabyte-scale migration |
| Azure Data Box | Azure | 100TB per device | Large migrations to Azure |
| Google Transfer Appliance | GCP | 100TB-1PB | Large migrations to GCP |
Large migrations often combine approaches: Snowball for initial bulk transfer, then DMS or Debezium for ongoing CDC replication of changes that occurred during transfer. Plan for the "catch-up" period where incremental changes are replicated after bulk load completes.
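A minimal sketch of that sequence follows. The `CdcSource` and `TargetWarehouse` interfaces are hypothetical stand-ins for tools like Debezium and your warehouse loader, not any specific product's API.

```typescript
// Minimal sketch of the "bulk load + CDC catch-up" sequence described above.
// Interfaces are illustrative assumptions, not a specific tool's API.

interface CdcSource {
  currentPosition(): Promise<string>;   // e.g. binlog offset or LSN
  readChangesSince(position: string): AsyncIterable<{ position: string; change: unknown }>;
}

interface TargetWarehouse {
  apply(change: unknown): Promise<void>;
}

async function migrateWithCatchUp(
  source: CdcSource,
  target: TargetWarehouse,
  bulkTransfer: () => Promise<void>     // e.g. export snapshot -> appliance -> import
): Promise<void> {
  // 1. Record the CDC position *before* taking the snapshot, so no change
  //    that happens during the physical transfer is lost.
  const cutover = await source.currentPosition();

  // 2. Bulk transfer the snapshot (days or weeks at petabyte scale).
  await bulkTransfer();

  // 3. Catch-up phase: replay everything that happened since the snapshot.
  //    Changes must be applied idempotently because the snapshot may already
  //    contain some of them.
  for await (const { change } of source.readChangesSince(cutover)) {
    await target.apply(change);
  }

  // 4. Once replication lag is near zero, cut applications over to the target.
}
```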
Rather than replicating everything everywhere, strategic data placement minimizes complexity and cost while meeting requirements.
Classify data by access pattern (read/write frequency and origin), latency sensitivity, consistency requirements, regulatory constraints, and size and growth rate. The table below shows how those classifications translate into placement and synchronization decisions; the sketch after the table captures the same decisions as code.
| Data Type | Primary Cloud | Secondary Clouds | Synchronization |
|---|---|---|---|
| User Profiles | Cloud with most users | Read replicas in other clouds | Async replication, ~seconds lag |
| Transaction Logs | Single cloud (primary) | Event stream for analytics | Event sourcing, append-only |
| ML Training Data | Cloud with ML platform | Usually not replicated | One-time ETL for feature engineering |
| Session Data | Closest to user | Not replicated | User-local, ephemeral |
| Audit Logs | Compliance-dictated location | Archival copies | Write-once, batch sync |
| Content/Media | CDN origin + backups | CDN caches globally | Origin sync, edge cache |
| Analytics Warehouse | Cloud with best analytics | Subset replicas for local BI | Batch ETL, nightly/hourly |
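One way to make such a classification actionable is to record placement decisions as data that deployment tooling can validate. The sketch below uses illustrative types and example entries that mirror rows of the table above; none of the names come from a specific framework.

```typescript
// Minimal sketch of a data classification record capturing placement
// decisions like those in the table above. Names and enums are illustrative.

type Cloud = 'aws' | 'gcp' | 'azure';

type SyncStrategy =
  | { kind: 'async-replication'; maxLagSeconds: number }
  | { kind: 'event-stream'; topic: string }
  | { kind: 'batch-etl'; schedule: 'hourly' | 'nightly' }
  | { kind: 'none' };

interface DataPlacement {
  dataset: string;
  temperature: 'hot' | 'warm' | 'cold' | 'archive';
  residencyConstraint?: string;     // e.g. "EU only" -- checked before placement
  primaryCloud: Cloud;
  secondaryClouds: Cloud[];
  sync: SyncStrategy;
}

// Example entries mirroring two rows of the table above.
const placements: DataPlacement[] = [
  {
    dataset: 'user-profiles',
    temperature: 'hot',
    primaryCloud: 'aws',
    secondaryClouds: ['gcp'],
    sync: { kind: 'async-replication', maxLagSeconds: 5 },
  },
  {
    dataset: 'audit-logs',
    temperature: 'archive',
    residencyConstraint: 'EU only',
    primaryCloud: 'azure',
    secondaryClouds: [],
    sync: { kind: 'batch-etl', schedule: 'nightly' },
  },
];
```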
Hot Data (Frequently Accessed): keep it in the same cloud and region as the compute that uses it; replicate only where another cloud genuinely needs low-latency access.
Warm Data (Periodically Accessed): a single primary copy usually suffices, reached through federated queries or scheduled batch exports.
Cold Data (Rarely Accessed): store one copy in a low-cost tier (S3 Standard-IA, GCS Nearline or Coldline, Azure Cool) and retrieve on demand; cross-cloud replication is rarely worth the egress.
Archive Data (Compliance/Legal Hold): use deep-archive tiers (S3 Glacier Deep Archive, GCS Archive, Azure Archive) in whatever location compliance dictates, and plan for retrieval times measured in hours.
Before optimizing for performance or cost, ensure data placement complies with relevant regulations (GDPR, CCPA, HIPAA, etc.). Regulatory fines far exceed any egress cost savings. Build compliance into your data classification framework.
When moving data isn't feasible, bring queries to the data:
Tools for Data Federation: Trino/Presto (and commercial distributions such as Starburst) query object stores and databases in place across clouds; BigQuery Omni runs BigQuery against data sitting in S3 or Azure Blob Storage; Athena federated queries reach beyond S3 on the AWS side.
Trade-offs:
| Approach | Latency | Egress Cost | Complexity |
|---|---|---|---|
| Replicate all data | Low (local access) | Very High (initial + ongoing) | Medium |
| Federated query | Higher (cross-cloud) | Per-query egress | Medium |
| Move compute to data | Lowest | Minimal | Higher (multi-cloud deployment) |
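A back-of-the-envelope comparison of the first two rows can make the choice concrete. The sketch below uses an assumed egress price of $0.09/GB and made-up workload numbers purely for illustration; plug in your own figures.

```typescript
// Rough comparison of "replicate all data" vs "federated query" egress cost.
// Prices and workload numbers are illustrative assumptions only.

const EGRESS_PER_GB = 0.09;

interface Workload {
  datasetGb: number;                   // dataset size in the "home" cloud
  monthlyChangeGb: number;             // data changed per month (ongoing replication)
  queriesPerMonth: number;
  gbCrossCloudPerQuery: number;        // data pulled across clouds per federated query
}

function replicateAllCost(w: Workload, months: number): number {
  // One full copy up front, then deltas every month.
  return (w.datasetGb + w.monthlyChangeGb * months) * EGRESS_PER_GB;
}

function federatedQueryCost(w: Workload, months: number): number {
  // No bulk copy; every query pays egress for whatever crosses the boundary.
  return w.queriesPerMonth * months * w.gbCrossCloudPerQuery * EGRESS_PER_GB;
}

// Example: 50 TB dataset, 1 TB/month of changes, 10,000 queries/month
// pulling ~0.1 GB across clouds each.
const w: Workload = {
  datasetGb: 50_000,
  monthlyChangeGb: 1_000,
  queriesPerMonth: 10_000,
  gbCrossCloudPerQuery: 0.1,
};

console.log('replicate (12 mo): $', replicateAllCost(w, 12).toFixed(0));   // ~$5,580
console.log('federate  (12 mo): $', federatedQueryCost(w, 12).toFixed(0)); // ~$1,080
```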
Data portability is arguably the most challenging aspect of multi-cloud architecture. Let's consolidate the key principles:
The Data Portability Mindset:
True data portability is often less about moving data and more about designing systems that can access data wherever it resides. A combination of open formats, clear data classification, and appropriate synchronization patterns creates flexibility without the astronomical costs of full replication.
What's Next:
Having examined data portability, the final page of this module explores vendor lock-in mitigation—strategies for preserving strategic flexibility while still benefiting from cloud-specific capabilities.
You now understand the realities of data portability in multi-cloud environments—from the physics of data gravity to practical synchronization patterns and strategic placement decisions. This knowledge is essential for realistic multi-cloud architecture planning.