Google Cloud Storage (GCS) launched in 2010, entering a market that Amazon S3 had already defined. Rather than simply copying S3's model, Google leveraged its unique infrastructure advantages—the same distributed systems that power Search, Gmail, and YouTube—to create an object storage service with distinct characteristics.
GCS was built atop Colossus, Google's next-generation distributed file system (successor to the legendary Google File System), and benefits from Google's global private network, Andromeda SDN, and the same consistency infrastructure used by Spanner. These foundations give GCS architectural properties that, while similar to S3 in API, differ significantly in implementation and behavior.
By the end of this page, you'll understand GCS's architecture, its consistency guarantees (which preceded S3's strong consistency by years), its unique features like composite objects and turbo replication, and when to choose GCS over S3 or Azure Blob Storage in system design.
Google Cloud Storage's architecture reflects Google's unique infrastructure philosophy, honed over decades of operating hyperscale systems.
1. Strong Consistency from Day One
Unlike S3's 14-year journey to strong consistency, GCS offered strong read-after-write consistency from the beginning. This stems from Google's experience with Spanner and Colossus, where consistency was a non-negotiable requirement for internal services.
2. Global by Default
GCS positions itself for global applications, with features like multi-region and dual-region buckets and turbo replication (covered later on this page).
3. Simplicity in Storage Classes
Where S3 has evolved into 8+ storage classes with complex retrieval semantics, GCS offers four classes—Standard, Nearline, Coldline, and Archive—with simpler, more predictable behavior.
All classes share identical performance characteristics for reads—no retrieval delays like S3 Glacier. The difference is purely in cost structure (storage cheaper, access more expensive as you move down).
4. Built on Colossus
Colossus is Google's distributed file system, the successor to GFS described in Google's foundational 2003 paper. Its key characteristics (erasure-coded durability, separate metadata and data planes, an append-oriented design, and continuous rebalancing) are examined in the deep dive below.
GCS's internal architecture represents decades of distributed systems refinement at Google. While Google keeps implementation details close, published research and documentation reveal a sophisticated system.
The Colossus Foundation
At GCS's core is Colossus, the distributed file system. Colossus differs from traditional replicated storage in several ways:
Erasure coding over replication: Instead of storing 3 complete copies (3x storage overhead), Colossus uses Reed-Solomon encoding. A typical configuration might store 1.5x the data while tolerating more failures than 3x replication.
Separation of metadata and data: Colossus metadata services track what data exists and where it lives. Data services handle actual bytes. These scale independently.
Append-only model: Like GFS before it, Colossus optimizes for append. This aligns perfectly with object storage's immutable object model.
Automatic resharding: Data placement isn't static. Colossus continuously rebalances based on access patterns, storage utilization, and failure recovery.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                          Global Load Balancer                            │
│          (Anycast IPs route to nearest GFE - Google Front End)           │
└────────────────────────────────────┬──────────────────────────────────────┘
                                     │
         ┌───────────────────────────┼───────────────────────────┐
         ▼                           ▼                           ▼
┌─────────────────┐        ┌─────────────────┐        ┌─────────────────┐
│  Google Front   │        │  GCS Metadata   │        │    Colossus     │
│   End (GFE)     │        │    Service      │        │  Data Service   │
│                 │        │                 │        │                 │
│ • TLS/HTTP      │──────▶ │ • Bucket Index  │──────▶ │ • Block Store   │
│ • Auth (IAM)    │        │ • Object Index  │        │ • Erasure Code  │
│ • Rate Limit    │        │ • ACL Cache     │        │ • Checksums     │
└─────────────────┘        └────────┬────────┘        └────────┬────────┘
                                    │                          │
                                    ▼                          ▼
                           ┌─────────────────┐        ┌─────────────────┐
                           │   Bigtable /    │        │    Colossus     │
                           │    Spanner      │        │     Cells       │
                           │   (Metadata     │        │   (Physical     │
                           │     Store)      │        │    Storage)     │
                           └────────┬────────┘        └────────┬────────┘
                                    │                          │
                                    └────────────┬─────────────┘
                                                 ▼
                           ┌─────────────────────────────────────────┐
                           │         Google Private Network          │
                           │ (Andromeda SDN - TB/s inter-datacenter) │
                           └─────────────────────────────────────────┘
```
Metadata Layer
GCS metadata is stored in a highly consistent, globally distributed database (likely a Spanner variant or specialized Bigtable configuration). This metadata includes bucket configuration, object names, generations and metagenerations, ACLs and IAM bindings, storage class, checksums, and pointers to the data blocks held in Colossus.
The metadata layer is the key to GCS's strong consistency. Every write updates metadata atomically. Every read consults authoritative metadata, ensuring no stale reads.
Data Layer
Actual object bytes flow through Colossus: uploads are chunked, checksummed, erasure coded, and spread across disks within a Colossus cell, while reads reassemble the fragments and verify checksums before returning data.
Edge Layer
Google Front Ends (GFEs) handle external traffic: TLS termination, HTTP handling, authentication and authorization against IAM, rate limiting, and routing onto Google's private backbone toward the storage backend.
Erasure coding (Reed-Solomon) achieves high durability with less overhead than replication. A 6+3 scheme (6 data fragments + 3 parity fragments) tolerates 3 failures while storing only 1.5x the data. This efficiency is why GCS and modern storage systems favor erasure coding over simple replication.
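To make that overhead figure concrete, here is the arithmetic for a generic k+m Reed-Solomon layout; the 6+3 numbers are the example from the paragraph above, not a confirmed detail of GCS's internal encoding.

```
overhead          = (k + m) / k = (6 + 3) / 6 = 1.5x stored bytes per logical byte
tolerated losses  = m = 3 fragments
3-way replication : overhead = 3.0x, tolerates loss of only 2 copies
```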
GCS offers three location types that fundamentally affect data placement, availability, and cost:
1. Region (Single Region)
Data is stored in a single geographic region (e.g., us-central1, europe-west1).
2. Dual-Region
Data is replicated across two specific regions (e.g., nam4 = Iowa + South Carolina).
3. Multi-Region
Data is distributed across multiple regions within a continent (e.g., US, EU, ASIA).
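The location type is fixed when a bucket is created and cannot be changed afterward. A minimal sketch of the three options; the bucket names are hypothetical, while the location codes (us-central1, nam4, US) are real GCS location identifiers.

```bash
# Location is chosen at bucket creation time and is immutable for the bucket's lifetime.
gsutil mb -l us-central1 gs://example-regional-bucket      # single region
gsutil mb -l nam4        gs://example-dual-region-bucket   # dual-region (Iowa + South Carolina)
gsutil mb -l US          gs://example-multi-region-bucket  # multi-region (United States)
```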
| Location Type | Typical Durability | Standard Availability | Failure Tolerance | Relative Cost |
|---|---|---|---|---|
| Region | 99.999999999% | 99.9% | Zone failures | Lowest |
| Dual-Region | 99.999999999% | 99.95% | Single region failure | Medium |
| Multi-Region | 99.999999999% | 99.95% | Region failures | Highest |
Location Selection Strategy
Choosing the right location involves balancing multiple factors:
Performance Considerations: keep data close to the compute and users that read it most; cross-region reads add latency and network egress charges.
Compliance Considerations: data residency requirements may dictate a specific region (or an EU-only multi-region) regardless of performance or cost.
Cost Considerations: dual-region and multi-region storage, and the replication traffic behind them, cost more than single-region storage, so pay for geo-redundancy only where the availability requirement justifies it.
Turbo replication for dual-region buckets guarantees 99.9% of objects replicate within 15 minutes. This is a dramatic improvement over standard async replication (which can lag hours for large objects). Enable it for critical data where DR RPO matters. It costs extra but provides predictable recovery guarantees.
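Turbo replication is a bucket-level setting. A minimal sketch of enabling it on an existing dual-region bucket, assuming a recent gsutil with the rpo command; the bucket name is hypothetical.

```bash
# ASYNC_TURBO = turbo replication; DEFAULT = standard asynchronous replication.
gsutil rpo set ASYNC_TURBO gs://example-dual-region-bucket
gsutil rpo get gs://example-dual-region-bucket   # verify the current setting
```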
GCS's storage class model is simpler than S3's while covering similar use cases. The critical distinction from S3 is that all GCS storage classes have identical read performance—no retrieval delays or restore jobs.
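For example, an object written directly to Archive can be read back immediately, with no restore job. This sketch assumes gsutil's -s storage-class flag and uses hypothetical names; retrieval fees still apply to the read.

```bash
gsutil cp -s ARCHIVE backup.tar.gz gs://example-bucket/backup.tar.gz
gsutil cat gs://example-bucket/backup.tar.gz > restored.tar.gz   # immediate read from Archive
```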
Standard Storage
The default class for frequently accessed data: no minimum storage duration, no retrieval fees, and the highest storage price per GB.
Nearline Storage
For data accessed less than once per month: 30-day minimum storage duration and a per-GB retrieval fee, with storage priced at roughly half of Standard.
Coldline Storage
For data accessed quarterly or less: 90-day minimum storage duration and a higher per-GB retrieval fee, with storage around a third of Standard's price.
Archive Storage
For long-term archival, accessed yearly or less: 365-day minimum storage duration, the highest per-GB retrieval fee, and the lowest storage price, yet still readable with no restore delay.
| Class | Min Duration | Retrieval Fee | Use Case | Relative Storage Cost |
|---|---|---|---|---|
| Standard | None | None | Frequently accessed | 1x (baseline) |
| Nearline | 30 days | $0.01/GB | Monthly access | ~0.5x |
| Coldline | 90 days | $0.02/GB | Quarterly access | ~0.3x |
| Archive | 365 days | $0.05/GB | Yearly access | ~0.15x |
If you store an object in Nearline for 10 days and then delete it, you're still charged for the full 30 days of storage. This applies to class changes too—changing from Coldline to Standard within 90 days incurs the remaining Coldline storage cost. Budget for this when designing lifecycle policies.
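One way to respect those minimums is to make lifecycle transitions wait at least as long as the current class's minimum duration. A sketch of such a policy; the bucket name and thresholds are illustrative, not a recommendation.

```bash
# Objects move Nearline -> Coldline only after 90 days, and are deleted after a year,
# so neither rule can trigger an early-deletion charge.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365, "matchesStorageClass": ["COLDLINE"]}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://example-bucket
```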
Autoclass: Automatic Tiering
Autoclass automatically moves objects between storage classes based on access patterns: new objects start in Standard, objects that go unaccessed are transitioned to colder classes over time (Nearline after 30 days, optionally continuing to Coldline and Archive), and any object that is read is moved back to Standard.
Autoclass eliminates lifecycle management complexity for unpredictable access patterns. The trade-off is less granular control and potential for unnecessary transitions if access patterns are sporadic.
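Autoclass is a bucket-level setting. A sketch of enabling it with gcloud, assuming the --enable-autoclass flag available in recent gcloud releases; the bucket name is hypothetical.

```bash
# Enable Autoclass on an existing bucket; objects then tier automatically by access pattern.
gcloud storage buckets update gs://example-bucket --enable-autoclass
```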
GCS's consistency model is notably simpler than S3's historical model because GCS was designed for strong consistency from inception.
Strong Consistency Guarantees
GCS provides strong consistency for all operations: read-after-write, read-after-overwrite, read-after-delete, read-after-metadata-update, and bucket and object listings all reflect the most recent successful mutation.
There are no qualifications or edge cases. If GCS returns success for a write, all subsequent reads will see that write—period.
How GCS Achieves This
GCS's consistency comes from its metadata layer, which uses strongly consistent databases (likely Spanner derivatives): every mutation commits its metadata record transactionally before the API returns success, and every read consults that authoritative metadata rather than a cache, so a stale read simply has nowhere to come from.
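Because every mutation is serialized through that metadata layer, request preconditions behave like compare-and-swap operations. A sketch using gsutil with the x-goog-if-generation-match precondition header; object and bucket names are hypothetical.

```bash
# Create the object only if no live version exists (generation 0 means "must not exist").
gsutil -h "x-goog-if-generation-match:0" cp state.json gs://example-bucket/state.json

# Overwrite only if the live object is still the generation we last read;
# otherwise the request fails with 412 Precondition Failed.
gsutil -h "x-goog-if-generation-match:1673531234567890" cp state.json gs://example-bucket/state.json
```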
Object Versioning
GCS supports object versioning, which preserves historical versions of objects: with versioning enabled on a bucket, overwriting or deleting an object archives the previous generation as a noncurrent version instead of destroying it.
Generation Numbers
Unlike S3's opaque version IDs, GCS uses generation numbers:
gs://my-bucket/photo.jpg # Current (live) version
gs://my-bucket/photo.jpg#1673531234567890 # Specific generation
Generation numbers are monotonically increasing integers (in practice, microsecond-precision timestamps), making it easy to understand version ordering.
Metageneration Numbers
GCS also tracks metageneration—how many times an object's metadata has changed within its current generation. Metageneration starts at 1 for each new generation and increments on every metadata update, which makes it useful as a precondition for metadata changes.
```bash
# Enable versioning on a bucket
gsutil versioning set on gs://my-bucket

# List all versions of an object
gsutil ls -a gs://my-bucket/photo.jpg
# Output shows generations:
# gs://my-bucket/photo.jpg#1673531234567890
# gs://my-bucket/photo.jpg#1673527654321098
# gs://my-bucket/photo.jpg#1673520000000000

# Read a specific version
gsutil cat gs://my-bucket/photo.jpg#1673527654321098

# Delete a specific version (permanent!)
gsutil rm gs://my-bucket/photo.jpg#1673520000000000

# Restore a previous version (copy old gen to current)
gsutil cp gs://my-bucket/photo.jpg#1673527654321098 gs://my-bucket/photo.jpg
```
Every version consumes storage and incurs cost. A 1GB object overwritten 100 times is billed for 100GB of storage. Use lifecycle policies to delete old versions: 'Delete noncurrent versions older than N days' prevents unbounded storage growth.
GCS offers several features not found in S3 or with different implementations:
1. Composite Objects
GCS can combine up to 32 existing objects into a single composite object without downloading/uploading data:
gsutil compose gs://bucket/part1 gs://bucket/part2 gs://bucket/combined
This is incredibly powerful for: parallel uploads composed into a single object, append-style workloads such as log aggregation, and assembling large files from chunks produced by independent workers.
The operation happens entirely server-side with no data transfer.
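For example, a log-aggregation pattern might "append" by composing the existing object with a freshly uploaded chunk. The names below are hypothetical, the destination must already exist for this particular sketch, and each compose call accepts up to 32 source objects.

```bash
# Upload the new chunk, then rewrite events.log as (old events.log + new chunk), entirely server-side.
gsutil cp today-events.log gs://example-bucket/chunks/events-0001.log
gsutil compose gs://example-bucket/events.log \
               gs://example-bucket/chunks/events-0001.log \
               gs://example-bucket/events.log
```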
2. Signed URLs with Post Policies
GCS supports signed URLs for time-limited access and signed POST policy documents that constrain browser uploads (allowed content types, size limits, key prefixes), covering the same use cases as S3's presigned URLs and POST policies.
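A minimal signed-URL sketch with gsutil, assuming a service-account key file; the file and object names are hypothetical.

```bash
# Generate a URL that lets anyone holding it GET the object for the next 10 minutes.
gsutil signurl -d 10m service-account.json gs://example-bucket/report.pdf
```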
3. Bucket Lock and Retention Policies
GCS provides regulatory-grade data protection: a bucket retention policy enforces a minimum object age before deletion or overwrite is allowed, and locking that policy makes it permanent, which is what enables the SEC 17a-4 style compliance noted in the comparison table below.
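A sketch of setting and locking a retention policy with gsutil's retention command; the bucket name is hypothetical, and locking cannot be undone.

```bash
gsutil retention set 7y gs://example-compliance-bucket   # objects cannot be deleted or overwritten for 7 years
gsutil retention lock gs://example-compliance-bucket     # permanently locks the policy (irreversible)
```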
4. Object Holds
Beyond retention policies, GCS supports holds: temporary holds and event-based holds can be placed on individual objects.
Holds prevent deletion until removed, useful for legal holds or compliance scenarios.
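Holds are per-object flags. A sketch using gsutil's retention subcommands; object names are hypothetical.

```bash
# Temporary hold: blocks deletion until explicitly released.
gsutil retention temp set gs://example-bucket/evidence.zip
gsutil retention temp release gs://example-bucket/evidence.zip

# Event-based hold: also pauses the retention clock until the hold is released.
gsutil retention event set gs://example-bucket/contract.pdf
```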
5. Parallel Composite Uploads
The gsutil tool can automatically parallelize large uploads using composite objects:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp large-file.dat gs://bucket/
This splits large files into parts, uploads in parallel, and composes them server-side—achieving wire-speed uploads for large files.
6. Requester Pays Buckets
Like S3, GCS supports Requester Pays buckets, where the requester is charged for data access and egress: callers must identify a billing project on each request, and that project, rather than the bucket owner, pays the request and network costs.
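A sketch of enabling Requester Pays and then accessing the bucket while naming the project to bill; the project and bucket names are hypothetical.

```bash
gsutil requesterpays set on gs://example-dataset-bucket
# Readers must specify a billing project with -u; requests without one are rejected.
gsutil -u my-billing-project cp gs://example-dataset-bucket/data.csv .
```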
Google is transitioning from gsutil to 'gcloud storage' commands. Both work, but 'gcloud storage' is more consistent with other gcloud commands and has performance improvements. For new projects, prefer 'gcloud storage cp' over 'gsutil cp'.
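The two CLIs are largely interchangeable for common operations; for instance (hypothetical paths):

```bash
gsutil cp large-file.dat gs://example-bucket/           # legacy CLI
gcloud storage cp large-file.dat gs://example-bucket/   # preferred for new projects
```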
Understanding the architectural differences between GCS and S3 helps select the right service and avoid surprises during migrations.
API Compatibility
GCS offers an S3-interoperable XML API (authenticated with HMAC keys rather than AWS credentials), but it's not 100% compatible: interoperability covers core object operations, while service-specific features on either side (composite objects, Pub/Sub notifications, S3 event targets, IAM specifics) don't translate.
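As a sketch of that interoperability, the XML API endpoint accepts S3-style requests signed with GCS HMAC keys, so existing S3 tooling can point at storage.googleapis.com; the credentials and bucket below are placeholders.

```bash
# Assumes an HMAC key pair has been created for a service account.
export AWS_ACCESS_KEY_ID="GOOG1EXAMPLEACCESSID"
export AWS_SECRET_ACCESS_KEY="example-hmac-secret"
aws s3 ls s3://example-bucket --endpoint-url https://storage.googleapis.com
```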
| Feature | Google Cloud Storage | Amazon S3 |
|---|---|---|
| Consistency | Strong (always) | Strong (since Dec 2020) |
| Storage Classes | 4 (Standard, Nearline, Coldline, Archive) | 8+ (Standard, IA, One Zone-IA, Glacier variants) |
| Archive Retrieval | Immediate (no restore) | Expedited: minutes, Standard: hours, Bulk: 5-12 hours |
| Location Types | Region, Dual-Region, Multi-Region | Region only (cross-region via replication) |
| Composite Objects | Yes (server-side combine) | No (must download/upload) |
| Bucket Lock | Yes (SEC 17a-4 compliant) | Yes (S3 Object Lock, Governance/Compliance modes) |
| Object Size Limit | 5 TB | 5 TB |
| Multipart Parts | Composite objects or XML API | 10,000 parts max, each 5GB max |
| Versioning | Generation numbers (ordered) | Opaque version IDs |
| Event Notifications | Pub/Sub | SNS, SQS, Lambda, EventBridge |
When to Choose GCS Over S3
GCS is the better choice when: you want multi-region or dual-region buckets out of the box, immediate reads from archival tiers, server-side composition of objects, or tight integration with the rest of Google Cloud (BigQuery, Dataflow, Pub/Sub).
When to Choose S3 Over GCS
S3 is the better choice when: your workloads already run on AWS, you need its broader ecosystem of integrations and event targets (Lambda, SQS, EventBridge), or you rely on S3-specific storage classes such as One Zone-IA or the Glacier variants.
Migrating between S3 and GCS is straightforward using Storage Transfer Service (GCS) or DataSync (AWS). The challenges are usually in: (1) updating application code for different SDKs, (2) mapping storage classes appropriately, (3) reconfiguring IAM and access policies, and (4) updating event notification handlers.
Let's consolidate the key insights about Google Cloud Storage:
Architectural Patterns for System Design
When designing systems with GCS: choose the location type from your latency and disaster-recovery requirements, use lifecycle rules or Autoclass to control storage cost, exploit composite objects for parallel ingestion, and lean on strong consistency to simplify read-after-write paths.
What's Next:
The next page examines Azure Blob Storage, completing our survey of major cloud object storage services and providing the knowledge needed to design multi-cloud or Azure-specific storage architectures.
You now understand Google Cloud Storage's architecture, consistency model, storage classes, and unique features. You can articulate how GCS differs from S3 and when each is the better choice—essential knowledge for cloud architecture decisions.