Understanding object storage in theory is valuable, but recognizing where it applies in real-world systems is transformative. Object storage isn't just another storage option—it fundamentally changes how you architect for scale, cost, and durability.
In this page, we examine the major use cases where object storage excels, diving into the architectural patterns, implementation considerations, and trade-offs for each, so that you can recognize object storage opportunities in system designs and implement them effectively.
The use cases covered are: static website and asset hosting, data lakes and analytics, backup and disaster recovery, user-generated content management, media storage and streaming, log aggregation and archival, and machine learning datasets.
One of the most common and impactful uses of object storage is serving static content—HTML, CSS, JavaScript, images, fonts, and other assets that don't change per-request. This use case leverages object storage's HTTP-native interface, infinite scalability, and cost-effectiveness.
Why Object Storage for Static Content?
Traditionally, static files were served from web servers (nginx, Apache) running on VMs. That approach means provisioning and patching servers, scaling them for peak traffic, and paying for idle capacity just to return files that never change per request.
Object storage eliminates these concerns:
```
                     STATIC WEBSITE ARCHITECTURE

  User Request
       │
       ▼
  ┌─────────────┐
  │   Route53   │ ─── DNS resolution to CloudFront
  │    (DNS)    │
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐      Cache Hit (200 OK)      ┌─────────────┐
  │ CloudFront  │ ────────────────────────────▶│   Browser   │
  │    (CDN)    │                              └─────────────┘
  └──────┬──────┘
         │ Cache Miss
         ▼
  ┌─────────────┐
  │  S3 Bucket  │ ─── Static files: HTML, CSS, JS, images
  │  (Origin)   │
  └─────────────┘

  Benefits:
  • Global edge caching (low latency worldwide)
  • DDoS protection at CDN layer
  • HTTPS/SSL termination at CDN
  • Pay-per-request pricing (no idle servers)
  • Infinite scalability
```

Implementation Best Practices
Cache control headers: Configure appropriate Cache-Control on objects. Static assets with hashed filenames (app.a1b2c3.js) can be cached forever (max-age=31536000). index.html should be shorter-lived (max-age=3600 or no-cache) so updates are seen promptly (see the upload sketch after this list).
Content-Type accuracy: Ensure correct MIME types are set. A JavaScript file served as text/plain may not execute correctly.
Gzip/Brotli compression: Pre-compress files before upload or configure CloudFront to compress on-the-fly.
Error pages: Configure a custom error document (e.g., a branded 404 page) in S3 static website settings, or custom error responses in CloudFront, so users never see raw XML errors.
SPA routing: For single-page apps with client-side routing, configure S3/CloudFront to redirect 404s to index.html, allowing the SPA router to handle routes.
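As a concrete illustration of the first two practices, here is a minimal upload sketch in Python with boto3. The bucket name and file paths are hypothetical; the point is that Content-Type and Cache-Control are set explicitly at upload time rather than left to defaults.

```python
import mimetypes
import boto3

s3 = boto3.client("s3")
BUCKET = "my-static-site"   # hypothetical bucket name

def upload_asset(local_path: str, key: str) -> None:
    """Upload a static asset with an explicit Content-Type and Cache-Control."""
    content_type, _ = mimetypes.guess_type(local_path)
    # Hashed filenames (app.a1b2c3.js) are immutable: cache for a year.
    # index.html changes on every deploy: force revalidation on each request.
    cache_control = (
        "no-cache" if key.endswith("index.html")
        else "public, max-age=31536000, immutable"
    )
    s3.upload_file(
        local_path,
        BUCKET,
        key,
        ExtraArgs={
            "ContentType": content_type or "application/octet-stream",
            "CacheControl": cache_control,
        },
    )

upload_asset("dist/index.html", "index.html")
upload_asset("dist/assets/app.a1b2c3.js", "assets/app.a1b2c3.js")
```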
Static hosting on S3 + CloudFront is extraordinarily cost-effective. A website serving 10 million page views/month with 2GB of assets might cost $5-15/month. The same traffic on EC2 instances would cost hundreds of dollars. For static content, object storage isn't just an option—it's the obvious choice.
Object storage has become the foundation of modern data analytics architecture. The "data lake" pattern—storing raw, heterogeneous data in its native format for later analysis—relies almost exclusively on object storage.
Why Object Storage for Data Lakes?
Cost at scale: Storing petabytes of analytical data on block storage or databases would be prohibitively expensive. Object storage costs pennies per GB.
Schema-on-read: Unlike databases requiring upfront schema definition, object storage accepts any data format. Structure is applied when reading (by Spark, Athena, Presto, etc.).
Decoupled storage and compute: Compute engines spin up, process data, spin down. Data persists independently. No need for always-on clusters.
Open formats: Parquet, ORC, Avro, JSON—analytics tools understand these formats stored as objects.
Unlimited capacity: A data lake can grow indefinitely without re-architecture.
| Format | Type | Compression | Query Performance | Best For |
|---|---|---|---|---|
| Parquet | Columnar | Excellent | Excellent for analytics | Analytical queries, data warehousing |
| ORC | Columnar | Excellent | Excellent | Hive ecosystem, Presto |
| Avro | Row-based | Good | Good for row access | Streaming, schema evolution |
| JSON (NDJSON) | Row-based | Poor | Poor (parse overhead) | Logs, semi-structured data |
| CSV | Row-based | Poor | Poor | Simple data interchange |
| Delta Lake | Columnar + metadata | Excellent | Excellent | ACID transactions on data lake |
Data Lake Architecture Components
Raw/Bronze Layer: Ingested data in original format. Immutable, append-only. Source of truth for reprocessing.
Processed/Silver Layer: Cleaned, validated, deduplicated data. Partitioned for efficient queries.
Curated/Gold Layer: Business-level aggregations and dimensions. Ready for BI tools and dashboards.
Catalog/Metadata: Services like AWS Glue Catalog, Apache Hive Metastore that track table schemas, partitions, and statistics.
Query Engines: Athena, Presto, Spark SQL, Redshift Spectrum that query data directly from object storage.
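To make the Bronze-to-Silver step concrete, here is a hedged sketch using pandas and PyArrow that reads a raw NDJSON clickstream file, applies light cleaning, and writes partitioned Parquet back to object storage. The column names (event_id, user_id, timestamp) are illustrative, and reading s3:// paths from pandas assumes s3fs is installed.

```python
import pandas as pd

# Bronze: read one raw NDJSON clickstream file (s3fs required for s3:// paths).
raw = pd.read_json(
    "s3://company-data-lake/raw/clickstream/year=2024/month=01/day=15/hour=00/events-00001.json.gz",
    lines=True,
    compression="gzip",
)

# Light cleaning: drop duplicate events and rows missing a user id
# (event_id and user_id are illustrative column names).
clean = raw.drop_duplicates(subset="event_id").dropna(subset=["user_id"])

# Derive partition columns from the event timestamp (illustrative column name).
ts = pd.to_datetime(clean["timestamp"])
clean = clean.assign(year=ts.dt.year, month=ts.dt.month, day=ts.dt.day)

# Silver: write columnar, partitioned Parquet that query engines can prune efficiently.
clean.to_parquet(
    "s3://company-data-lake/processed/clickstream/",
    engine="pyarrow",
    partition_cols=["year", "month", "day"],
    compression="snappy",
)
```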
```
s3://company-data-lake/
├── raw/                          # Bronze Layer - Raw ingested data
│   ├── clickstream/
│   │   └── year=2024/month=01/
│   │       ├── day=15/hour=00/
│   │       │   ├── events-00001.json.gz
│   │       │   └── events-00002.json.gz
│   │       └── day=15/hour=01/...
│   ├── transactions/
│   │   └── year=2024/month=01/
│   │       └── transactions-2024-01-15.avro
│   └── user_profiles/
│       └── full_export_2024-01-15.csv.gz
│
├── processed/                    # Silver Layer - Cleaned & normalized
│   ├── clickstream/
│   │   └── year=2024/month=01/day=15/
│   │       ├── part-00000.parquet
│   │       ├── part-00001.parquet
│   │       └── _SUCCESS
│   └── transactions/
│       └── year=2024/month=01/...
│
├── curated/                      # Gold Layer - Business aggregates
│   ├── daily_metrics/
│   │   └── dt=2024-01-15/metrics.parquet
│   ├── user_segments/
│   │   └── segment=high_value/users.parquet
│   └── product_analytics/
│       └── sku_performance.parquet
│
└── _metadata/                    # Catalog, schemas, configs
    ├── schemas/
    └── job_configs/
```

Modern data architecture is converging on the "lakehouse" pattern—combining data lake cost and flexibility with data warehouse performance and ACID guarantees. Technologies like Delta Lake, Apache Iceberg, and Apache Hudi add transaction support, time travel, and schema evolution to object storage-based data lakes.
Object storage's extreme durability (11 nines = 99.999999999% annual durability) makes it the gold standard for backup storage. At this durability level, if you store 10 million objects, you might statistically lose one object every 10,000 years.
Why Object Storage for Backups?
| Storage Class | Use Case | Min Duration | Retrieval Time | Cost (relative) |
|---|---|---|---|---|
| S3 Standard | Frequently accessed backups | None | Immediate | 1.0x |
| S3 Standard-IA | Monthly/quarterly backups | 30 days | Immediate | 0.5x |
| S3 One Zone-IA | Reproducible backups | 30 days | Immediate | 0.4x |
| S3 Glacier Instant | Quarterly backups, fast retrieval | 90 days | Immediate | 0.2x |
| S3 Glacier Flexible | Annual/compliance archives | 90 days | 1-12 hours | 0.1x |
| S3 Glacier Deep Archive | Long-term compliance | 180 days | 12-48 hours | 0.05x |
Backup Architecture Patterns
Pattern 1: Database Backups to Object Storage
Database → pg_dump/mysqldump → Compress (gzip/lz4) → Upload to S3
→ Replicate cross-region (CRR)
Pattern 2: Filesystem Snapshots
EBS Volume → EBS Snapshot → (Automatic storage in S3 backend)
→ Cross-region copy for DR
Pattern 3: Application-Level Backups
Application → Stream data to S3 multipart upload
→ Verify with checksum
→ Update metadata catalog
Pattern 4: Continuous Data Protection (CDP)
Change stream → Buffer (Kinesis/Kafka) → Batch to S3 (minute/hour)
→ Point-in-time recovery catalog
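A hedged sketch of Patterns 1 and 3 combined, in Python with boto3: dump the database, compress, record a checksum, and upload. The bucket and path names are hypothetical; boto3's upload_file switches to multipart upload automatically for large files.

```python
import gzip
import hashlib
import shutil
import subprocess
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "company-backups"   # hypothetical backup bucket

def backup_postgres(database: str) -> str:
    """Dump, compress, checksum, and upload a Postgres database."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    dump_path = f"/tmp/{database}-{stamp}.sql"
    gz_path = dump_path + ".gz"

    # 1. Dump the database to a local file.
    subprocess.run(["pg_dump", "--file", dump_path, database], check=True)

    # 2. Compress before upload to cut transfer time and storage cost.
    with open(dump_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # 3. Record a checksum so restores can be verified later.
    with open(gz_path, "rb") as f:
        sha256 = hashlib.sha256(f.read()).hexdigest()

    # 4. Upload; boto3 handles multipart automatically for large files.
    key = f"backups/postgres/{database}/{stamp}.sql.gz"
    s3.upload_file(gz_path, BUCKET, key, ExtraArgs={"Metadata": {"sha256": sha256}})
    return key
```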
{ "Rules": [ { "ID": "BackupLifecycle", "Status": "Enabled", "Filter": { "Prefix": "backups/" }, "Transitions": [ { "Days": 30, "StorageClass": "STANDARD_IA" }, { "Days": 90, "StorageClass": "GLACIER_IR" }, { "Days": 365, "StorageClass": "DEEP_ARCHIVE" } ], "Expiration": { "Days": 2555 // 7 years for compliance } } ]} // Cost impact example for 1TB of backups:// Year 1: $23/month (Standard → IA → Glacier IR)// Year 2+: $1/month (Deep Archive)// vs. keeping in Standard: $23/month foreverA backup is worthless if you can't restore from it. Regularly test restore procedures. Verify data integrity after restore. Time your restore operations to ensure they meet RTO requirements. Many organizations have discovered during a real disaster that their 'backups' were incomplete or corrupted—too late to fix.
Applications that accept user uploads—profile pictures, documents, videos, attachments—face unique challenges: unpredictable volume, untrusted content, varied file types, and the need to serve content globally. Object storage is the natural solution.
The User Upload Pipeline
A production-grade user upload system typically involves:
```
                     USER-GENERATED CONTENT FLOW

  ┌─────────┐  1. Request upload URL            ┌───────────────┐
  │ Client  │ ─────────────────────────────────▶│  API Server   │
  │  (App)  │ ◀─────────────────────────────────│               │
  └────┬────┘  2. Pre-signed PUT URL            └───────────────┘
       │
       │ 3. Direct upload to S3
       ▼
  ┌───────────────────────────────────────────────────────────┐
  │ S3 Bucket: user-uploads-staging                            │
  │   ├── uploads/{upload-id}/original.jpg                     │
  │   └── (temporary, expires in 24h)                          │
  └──────────────────────────────┬────────────────────────────┘
                                 │ 4. Upload complete (S3 Event Notification)
                                 ▼
  ┌───────────────────────────────────────────────────────────┐
  │ Lambda / Processing Service                                │
  │   ├── Validate content type (magic bytes, not extension)   │
  │   ├── Scan for malware                                     │
  │   ├── Check image dimensions / video duration              │
  │   ├── Generate thumbnails / transcode                      │
  │   └── Move to permanent location                           │
  └──────────────────────────────┬────────────────────────────┘
                                 ▼
  ┌───────────────────────────────────────────────────────────┐
  │ S3 Bucket: user-uploads-production                         │
  │   ├── users/{user-id}/avatars/{hash}.jpg                   │
  │   ├── users/{user-id}/avatars/{hash}_thumb.jpg             │
  │   └── users/{user-id}/documents/{doc-id}.pdf               │
  └──────────────────────────────┬────────────────────────────┘
                                 │
            5. Access via CDN ◀──┘
            (CloudFront with signed cookies for private content)
```
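Steps 1-2 of this flow hinge on pre-signed URLs: the API server authorizes the upload and hands the client a short-lived URL, so file bytes never pass through the application tier. A minimal sketch with boto3 (the bucket name and expiry are illustrative):

```python
import uuid

import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "user-uploads-staging"   # hypothetical staging bucket

def create_upload_url(content_type: str, expires_in: int = 900) -> dict:
    """Return a short-lived pre-signed PUT URL the client can upload to directly."""
    upload_id = str(uuid.uuid4())
    key = f"uploads/{upload_id}/original"
    url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": STAGING_BUCKET,
            "Key": key,
            "ContentType": content_type,   # client must send the same Content-Type header
        },
        ExpiresIn=expires_in,              # URL becomes useless after 15 minutes
    )
    return {"upload_id": upload_id, "upload_url": url}
```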
Key Structure for User Content

Well-designed key structures enable efficient access patterns:
// Good: User-scoped, identifiable, hashable
users/{user-id}/avatars/{content-hash}.jpg
users/{user-id}/documents/{document-id}/{filename}
// Good: Content-addressed (hash of content)
blobs/{sha256-hash-prefix}/{sha256-hash}
// Bad: Sequential/timestamp prefixes (hot partitions)
uploads/2024-01-15-12-00-00-001.jpg
// Bad: User-provided filenames (security risk, encoding issues)
uploads/{user-id}/{user-provided-filename}
Content-addressed storage (hash as key) provides automatic deduplication—if two users upload the same image, it's stored once.
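A hedged sketch of content-addressed storage in Python: hash the bytes, derive the key from the hash, and skip the upload if the object already exists (the bucket name is hypothetical).

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "user-uploads-production"   # hypothetical production bucket

def store_content_addressed(data: bytes, content_type: str) -> str:
    """Store a blob under its SHA-256 hash; identical content maps to the same key."""
    digest = hashlib.sha256(data).hexdigest()
    key = f"blobs/{digest[:2]}/{digest}"   # short hash prefix keeps listings manageable

    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return key                          # already stored: deduplication for free
    except ClientError:
        pass                                # not found: fall through and upload

    s3.put_object(Bucket=BUCKET, Key=key, Body=data, ContentType=content_type)
    return key
```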
User-generated content can grow explosively. Implement lifecycle policies to archive inactive content. Set per-user quotas. Use Intelligent Tiering for unpredictable access patterns. Consider deferred delete (delete marker now, actual deletion after retention period) for accidental deletion protection.
Video and audio streaming is one of object storage's most demanding use cases—requiring high throughput, low latency at the edge, and sophisticated content processing pipelines.
The Media Pipeline
A typical video streaming architecture involves:
| Format | Protocol | Latency | Adaptive? | Best For |
|---|---|---|---|---|
| HLS | HTTP | 10-30s | Yes | General VOD, wide compatibility |
| DASH | HTTP | 10-30s | Yes | Modern browsers, DRM support |
| Low-Latency HLS | HTTP | 2-5s | Yes | Near-live streaming |
| WebRTC | UDP/TCP | <1s | Limited | Real-time video (calls, gaming) |
| Progressive Download | HTTP | N/A | No | Simple playback, download-to-play |
```
s3://media-streaming-bucket/
├── videos/
│   └── {video-id}/
│       ├── master.m3u8              # Master playlist (lists all qualities)
│       ├── 1080p/
│       │   ├── playlist.m3u8        # Quality-specific playlist
│       │   ├── segment-001.ts       # 6-10 second video segments
│       │   ├── segment-002.ts
│       │   └── ...
│       ├── 720p/
│       │   ├── playlist.m3u8
│       │   └── segment-*.ts
│       ├── 480p/
│       │   ├── playlist.m3u8
│       │   └── segment-*.ts
│       ├── audio/
│       │   ├── playlist.m3u8
│       │   └── segment-*.aac
│       └── thumbnails/
│           ├── poster.jpg
│           ├── sprite.jpg           # Thumbnail strip for preview
│           └── vtt/timeline.vtt     # Thumbnail timing metadata

# Flow:
# 1. Player fetches master.m3u8
# 2. Player selects quality based on bandwidth
# 3. Player fetches quality's playlist.m3u8
# 4. Player downloads segments in order, buffers, plays
```

Transcoding multiplies storage requirements significantly. A 1GB source video becomes ~5-10GB when transcoded to multiple resolutions with multiple codecs. Plan for 5-10x storage multiplier. Use lifecycle policies to expire old or unpopular content.
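To tie the layout and the playback flow together, here is a hedged sketch of how a transcoding job might generate and upload the master.m3u8 above. The rendition bandwidths and resolutions are illustrative, not prescriptive.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "media-streaming-bucket"

# Illustrative renditions; a real transcoding job would derive these from its outputs.
RENDITIONS = [
    {"name": "1080p", "bandwidth": 5_000_000, "resolution": "1920x1080"},
    {"name": "720p",  "bandwidth": 2_800_000, "resolution": "1280x720"},
    {"name": "480p",  "bandwidth": 1_400_000, "resolution": "854x480"},
]

def write_master_playlist(video_id: str) -> str:
    """Build the master.m3u8 that points players at each quality-specific playlist."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for r in RENDITIONS:
        lines.append(
            f'#EXT-X-STREAM-INF:BANDWIDTH={r["bandwidth"]},RESOLUTION={r["resolution"]}'
        )
        lines.append(f'{r["name"]}/playlist.m3u8')

    key = f"videos/{video_id}/master.m3u8"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body="\n".join(lines).encode(),
        ContentType="application/vnd.apple.mpegurl",   # correct MIME type matters to players
    )
    return key
```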
Application and infrastructure logs represent one of the highest-volume data streams in most organizations. Object storage provides the cost-effective, durable solution for log storage that database solutions can't match at scale.
Log Storage Requirements
```
                 LOG AGGREGATION TO OBJECT STORAGE

  Application Servers
  ┌──────┐ ┌──────┐ ┌──────┐
  │ App  │ │ App  │ │ App  │ ──── stdout/file logs ────┐
  └──────┘ └──────┘ └──────┘                           │
                                                       ▼
  ┌────────────────────────────────────────────────────────────┐
  │ Log Shipper (Fluentd / Fluent Bit / Filebeat / Vector)      │
  │   • Collect from multiple sources                           │
  │   • Parse and enrich (add metadata, timestamps)             │
  │   • Buffer in memory/disk                                   │
  │   • Batch for efficient upload                              │
  └──────────────────────────────┬─────────────────────────────┘
                                 ▼
  ┌────────────────────────────────────────────────────────────┐
  │ Streaming Buffer (Kinesis Firehose / Kafka)                 │
  │   • Handle burst traffic                                    │
  │   • Reliable delivery                                       │
  │   • Batch to S3 in configurable intervals (1-15 min)        │
  │   • Convert to Parquet/ORC for query efficiency             │
  └──────────────────────────────┬─────────────────────────────┘
                                 ▼
  ┌────────────────────────────────────────────────────────────┐
  │ S3: Partitioned Log Storage                                 │
  │   s3://logs-bucket/                                         │
  │   ├── app=myservice/year=2024/month=01/day=15/hour=14/      │
  │   │   ├── logs-001.parquet   (compressed, columnar)         │
  │   │   └── logs-002.parquet                                  │
  │   └── app=otherservice/...                                  │
  └──────────────────────────────┬─────────────────────────────┘
                                 ▼
  ┌────────────────────────────────────────────────────────────┐
  │ Query Layer: Athena / Presto / CloudWatch Logs Insights     │
  │   SELECT * FROM logs                                        │
  │   WHERE app='myservice' AND time > '2024-01-15'             │
  └────────────────────────────────────────────────────────────┘
```

Storing 10TB of logs: Elasticsearch ($3,000/month), CloudWatch Logs ($500/month), S3 Standard (~$230/month), S3 Glacier ($40/month). For infrequently accessed compliance logs, object storage is 10-100x cheaper than search systems. Use Athena for ad-hoc queries at $5/TB scanned.
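The query layer shown above is an API call away. Here is a hedged sketch using boto3 and Athena, assuming a logs database and myservice table have been registered in the Glue Catalog over the partitioned Parquet files (the names and result location are hypothetical).

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical database/table registered in the Glue Catalog over s3://logs-bucket/.
QUERY = """
SELECT status, count(*) AS requests
FROM logs.myservice
WHERE year = '2024' AND month = '01' AND day = '15'
GROUP BY status
"""

def run_query() -> list:
    """Run an ad-hoc Athena query over partitioned logs and wait for the result."""
    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "logs"},
        ResultConfiguration={"OutputLocation": "s3://logs-bucket/athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (Athena is asynchronous).
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```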
Machine learning workflows generate and consume massive amounts of data: training datasets, intermediate features, model checkpoints, and final model artifacts. Object storage is the standard for ML data management.
ML Data Lifecycle
| Format | Framework | Use Case | S3 Integration |
|---|---|---|---|
| TFRecord | TensorFlow | Training data | tf.data.Dataset from S3 |
| Parquet | General | Tabular ML data | Pandas, Spark, PyArrow |
| WebDataset | PyTorch | Large-scale training | tar shards from S3 |
| HDF5 | General | Scientific datasets | h5py with S3 backend |
| SavedModel | TensorFlow | Model deployment | Direct S3 load |
| ONNX | General | Model interchange | Universal inference |
```
s3://ml-platform/
├── datasets/
│   ├── raw/
│   │   └── imagenet-2024/                   # Raw dataset
│   │       ├── images/
│   │       └── labels.csv
│   ├── processed/
│   │   └── imagenet-2024-v1/                # Versioned processed data
│   │       ├── train/
│   │       │   └── shard-{00000..01000}.tfrecord
│   │       ├── validation/
│   │       │   └── shard-{00000..00100}.tfrecord
│   │       └── metadata.json                # Dataset stats, schema
│   └── features/
│       └── embeddings-resnet50/
│           └── {dataset-version}/
│
├── experiments/
│   └── {experiment-id}/
│       ├── config.yaml                      # Hyperparameters, reproducibility
│       ├── checkpoints/
│       │   ├── epoch-010.pt                 # Training checkpoints
│       │   ├── epoch-020.pt
│       │   └── best.pt
│       ├── logs/
│       │   └── training-metrics.jsonl
│       └── artifacts/
│           └── model.onnx                   # Final exportable model
│
├── models/
│   └── production/
│       └── {model-name}/
│           └── v{version}/
│               ├── model.tar.gz             # Deployable model package
│               ├── signature.json
│               └── requirements.txt
│
└── inference/
    └── predictions/
        └── {date}/
            └── batch-{id}.parquet           # Inference outputs for analysis
```

Modern ML frameworks efficiently stream data from S3 during training. For example, TensorFlow's tf.data.TFRecordDataset can read directly from S3 paths, prefetching and parallelizing I/O to keep GPUs fed. SageMaker, Vertex AI, and other ML platforms integrate natively with object storage.
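A minimal sketch of that streaming input pipeline in TensorFlow. Reading s3:// paths directly assumes TensorFlow's S3 filesystem support is available (provided by the tensorflow-io plugin in recent releases), and the feature names in the parse function are illustrative rather than the dataset's real schema.

```python
import tensorflow as tf

# Listing and reading s3:// paths requires TensorFlow's S3 filesystem support
# (bundled in older builds, provided by the tensorflow-io plugin in newer ones).
SHARDS = "s3://ml-platform/datasets/processed/imagenet-2024-v1/train/shard-*.tfrecord"

def parse_example(record):
    # Illustrative feature spec; the real schema lives in metadata.json.
    features = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features["image"], channels=3)
    return tf.image.resize(image, [224, 224]), features["label"]

dataset = (
    tf.data.Dataset.list_files(SHARDS, shuffle=True)
    .interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)   # overlap S3 reads with GPU compute
)
```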
We've explored the major use cases where object storage transforms architecture. Here are the key principles to remember:
When to Choose Object Storage: Quick Decision Guide
| If you need... | Use Object Storage? |
|---|---|
| Store files > 100MB | ✅ Yes — optimal for large blobs |
| Store billions of small files | ✅ Yes — designed for unlimited scale |
| Low-latency random access | ❌ No — use block storage or cache |
| POSIX filesystem semantics | ❌ No — use file storage |
| Durability for critical data | ✅ Yes — 11 nines durability |
| Cost-sensitive large datasets | ✅ Yes — pennies per GB |
| Data shared across multiple apps | ✅ Yes — HTTP access from anywhere |
| Frequently modified files | ⚠️ Maybe — immutability may be friction |
Congratulations! You've completed the Object Storage Fundamentals module. You now understand the fundamental differences between storage paradigms, the internal mechanics of object storage, consistency considerations, and the diverse use cases where object storage excels. This foundation prepares you for advanced topics: cloud provider implementations, distributed file systems, storage optimization, and disaster recovery strategies in the following modules.