When Amazon launched Simple Storage Service (S3) in March 2006, it quietly revolutionized how the world thinks about data storage. What began as an internal solution for Amazon's e-commerce infrastructure became the blueprint for an entirely new category of storage systems. Today, S3 stores over 100 trillion objects and handles millions of requests per second—a scale that challenges the most sophisticated distributed systems engineering on the planet.
Understanding S3's architecture isn't merely about learning one product. S3 defined the paradigms that all modern object storage systems follow: the RESTful API model, eventual consistency tradeoffs (now evolved to strong consistency), storage class tiering, and the fundamental object storage data model. Whether you're using Google Cloud Storage, Azure Blob Storage, or any S3-compatible system like MinIO, you're working within patterns S3 established.
By the end of this page, you will understand S3's internal architecture at a depth suitable for senior system design discussions. You'll grasp how S3 achieves eleven 9s of durability, the engineering behind its recent transition to strong consistency, and the design patterns that enable it to scale to exabytes while maintaining sub-second latency.
S3's architecture embodies several core design principles that guided its development and continue to shape its evolution. Understanding these principles illuminates not just how S3 works, but why it makes specific tradeoffs.
1. Durability Over Everything
S3's eleven 9s of durability (99.999999999%) means that if you store 10 million objects, you can statistically expect to lose one object every 10,000 years. This isn't marketing—it's an engineering requirement that drives fundamental architectural decisions, from synchronous cross-AZ replication to continuous checksum verification and background repair.
Durability and availability are distinct concepts. Durability means your data won't be lost; availability means you can access it right now. S3 standard offers 99.99% availability (about 52 minutes of downtime per year) but 99.999999999% durability. You might temporarily be unable to read data during an outage, but you won't lose it.
2. Infinite Scale Without Pre-Provisioning
Unlike traditional storage systems where you provision capacity upfront, S3 presents effectively unlimited storage: the service partitions data and adds capacity behind the scenes as data volume and traffic grow (see the partitioning discussion later on this page).
3. Simple Mental Model, Complex Implementation
S3's API is deceptively simple: buckets contain objects, objects have keys and data. But this simplicity masks extraordinary complexity in the indexing, replication, and consistency machinery examined throughout this page.
S3's data model appears simple but has nuances that significantly affect system design. Let's examine each component:
Buckets: The Top-Level Container
A bucket is a globally unique namespace for organizing objects. Bucket names are unique across all AWS accounts: if someone already owns my-bucket, no one else can use it.
Objects: The Fundamental Unit
An object consists of:
```
# Conceptual structure of an S3 object
Object:
  Key: "products/electronics/phone-123.jpg"
  VersionId: "3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo"
  Data:
    Content: <binary data - up to 5TB>
    Size: 2,457,600 bytes
    ETag: "d41d8cd98f00b204e9800998ecf8427e"   # MD5 hash for simple uploads
    ContentType: "image/jpeg"
  SystemMetadata:
    LastModified: "2024-01-15T14:30:00Z"
    StorageClass: "STANDARD"
    ContentLength: 2457600
    ServerSideEncryption: "AES256"
  UserMetadata:   # User-defined, prefixed with 'x-amz-meta-'
    x-amz-meta-photographer: "Jane Smith"
    x-amz-meta-location: "Seattle, WA"
    x-amz-meta-camera: "Canon EOS R5"
```

Object Keys and the Flat Namespace
S3 uses a flat namespace—there are no actual directories or folders. The apparent folder structure in the console is an illusion created by key prefixes and delimiters: a key like photos/2024/vacation/beach.jpg is a single string, and the / is just a character; you could use - or :: instead.

This design has critical implications for performance and cost:
| Aspect | Implication | Design Consideration |
|---|---|---|
| Listing Performance | LIST operations scan linearly by prefix | Deep hierarchies with many objects slow down listing |
| Partition Distribution | S3 partitions by key prefix | Sequential keys (timestamps) can create hot partitions |
| Rename Operations | No rename—must copy entire object + delete | Avoid designs requiring frequent renames |
| Folder Operations | 'Delete folder' = delete all objects with prefix | Can be very slow for large prefixes |
| Billing | No folder concept in storage costs | Empty 'folders' are zero-byte objects if created explicitly |
S3 automatically partitions data by key prefix for performance. If all keys start with the same prefix (e.g., timestamps), requests concentrate on one partition. In 2018, AWS improved this significantly, but for extreme throughput (>3,500 PUT/s or >5,500 GET/s per prefix), introduce randomness into key prefixes, for example a short hash, as the sketch below shows.
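As a rough illustration of that advice, here is a small Python sketch; the prefixed_key helper and the four-character prefix length are hypothetical choices, not an AWS requirement.

```python
import hashlib

def prefixed_key(original_key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash so sequential keys spread across partitions."""
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"

# A timestamp-style key no longer shares its leading characters with its neighbors:
print(prefixed_key("2024-01-15T14:30:00Z/event.json"))
# -> e.g. "af52/2024-01-15T14:30:00Z/event.json"
```

Because the prefix is derived from the key itself, the mapping stays deterministic: any reader that knows the original key can recompute where the object lives.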
While Amazon doesn't publicly document S3's internals in complete detail, we can reconstruct the architecture from published papers, re:Invent presentations, and engineering blog posts. S3's architecture separates into distinct subsystems:
1. Request Routing Layer
When a request arrives at S3, DNS first resolves bucket-name.s3.region.amazonaws.com to an edge location or regional endpoint, which forwards the request to the front-end fleet. This layer handles TLS termination, authentication (Signature Version 4), authorization (IAM/bucket policies), and request validation—all before touching storage.
2. Index Subsystem
S3 maintains a distributed index mapping (bucket, key, version) → physical storage location. This index is the core metadata layer:
The index stores the key-to-location mapping, the version map, and per-object metadata such as size, ETag, and storage class, as sketched in the diagram below.
```
                          Request Routing Layer
 ┌────────┐     ┌──────────┐     ┌────────────┐     ┌──────────┐
 │  Edge  │ ──▶ │   Auth   │ ──▶ │ IAM Policy │ ──▶ │ Request  │
 │ Router │     │ Service  │     │            │     │  Router  │
 └────────┘     └──────────┘     └────────────┘     └──────────┘
                                       │
            ┌──────────────────────────┼──────────────────────────┐
            ▼                          ▼                          ▼
 ┌─────────────────┐        ┌─────────────────┐        ┌─────────────────┐
 │  Index Service  │        │  Data Service   │        │    Lifecycle    │
 │                 │        │                 │        │     Service     │
 │ • Metadata DB   │        │ • Chunk Store   │        │ • Transitions   │
 │ • Version Map   │        │ • Replication   │        │ • Expirations   │
 │ • Key Lookup    │        │ • Checksums     │        │ • Cleanups      │
 └─────────────────┘        └─────────────────┘        └─────────────────┘
            │                          │                          │
            └──────────────────────────┼──────────────────────────┘
                                       ▼
                         Physical Storage Layer
 ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
 │    AZ-1     │          │    AZ-2     │          │    AZ-3     │
 │   Storage   │ ◀──────▶ │   Storage   │ ◀──────▶ │   Storage   │
 │    Fleet    │   sync   │    Fleet    │   sync   │    Fleet    │
 └─────────────┘          └─────────────┘          └─────────────┘
```

3. Data Storage Subsystem
The actual bytes are stored in a distributed storage system with several key characteristics:
Chunking: Large objects are split into chunks (typically 4-16MB). Each chunk is stored, checksummed, and replicated independently. This enables parallel transfers, byte-range reads served from individual chunks, and repair of a single corrupted chunk without rewriting the whole object.
Replication: Each chunk is synchronously replicated to at least 3 AZs before the write is acknowledged. The replication protocol ensures that an acknowledged write is durable on every replica, so a committed PUT survives the loss of an entire AZ.
Erasure Coding: For some storage classes (especially cheaper ones), S3 uses erasure coding rather than simple replication. Data is encoded into fragments plus parity fragments so the object can be reconstructed from a subset of them, achieving similar durability with less raw storage overhead.
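To make the storage-overhead tradeoff concrete, here is a back-of-the-envelope comparison; AWS does not publish S3's actual coding parameters, so the 10-data/4-parity scheme below is purely illustrative.

```python
# Hypothetical Reed-Solomon-style scheme: k data fragments + m parity fragments.
k, m = 10, 4

erasure_overhead = (k + m) / k        # raw bytes stored per logical byte -> 1.4x
replication_overhead = 3.0            # 3-way replication stores 3.0x
tolerated_fragment_losses = m         # any m of the k+m fragments can be lost

print(f"Erasure coding: {erasure_overhead:.1f}x overhead, survives {tolerated_fragment_losses} fragment losses")
print(f"Replication:    {replication_overhead:.1f}x overhead, survives 2 replica losses")
```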
Three-way replication across AZs provides tolerance for: (1) individual disk failures, (2) individual server failures, (3) rack failures, (4) entire data center (AZ) failures, and (5) network partitions between AZs. The probability of simultaneous failures across 3 independent AZs is astronomically low, and that independence is what yields eleven 9s of durability.
Understanding how PUT and GET operations work internally reveals S3's distributed systems sophistication.
PUT Operation Flow (Simple Upload)
```
Client            S3 Front-End               Index          Data Stores (AZ-1/2/3)
  │                     │                       │                      │
  │──── PUT object ────▶│                       │                      │
  │                     │── validate request    │                      │
  │                     │── check permissions   │                      │
  │                     │                       │                      │
  │                     │───── reserve slot ───▶│                      │
  │                     │◀──── slot reserved ───│                      │
  │                     │                       │                      │
  │                     │──────────────── write chunk ────────────────▶│
  │                     │        (synchronously replicated to          │
  │                     │         AZ-1, AZ-2, and AZ-3)                │
  │                     │◀────────── all replicas confirmed ───────────│
  │                     │                       │                      │
  │                     │───── commit entry ───▶│                      │
  │                     │◀──── entry committed ─│                      │
  │                     │                       │                      │
  │◀── 200 OK (ETag) ───│                       │                      │
```

Key observations about PUT:
Synchronous replication: The 200 OK doesn't return until data is durably stored in all 3 AZs. This is why PUT latency is higher than a simple write.
Atomic commits: The index update is atomic. Either the object is fully present in the index, or it's not there at all. There's no 'partially uploaded' visible state.
Two-phase commit: S3 uses a two-phase commit protocol—first confirming data storage, then updating the index. This ensures consistency even during failures.
ETag generation: The ETag (typically MD5 for single-part uploads) is computed and returned, providing a checksum for verification.
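As a client-side sketch of that checksum handshake (boto3; the bucket name is a placeholder, and the ETag equals the MD5 only for single-part, non-KMS uploads):

```python
import hashlib
import boto3

s3 = boto3.client("s3")
body = b"hello, object storage"

local_md5 = hashlib.md5(body).hexdigest()
response = s3.put_object(Bucket="my-bucket", Key="demo/hello.txt", Body=body)

remote_etag = response["ETag"].strip('"')   # S3 returns the ETag wrapped in quotes
assert remote_etag == local_md5, "payload corrupted in transit?"
print("PUT confirmed, ETag:", remote_etag)
```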
Multipart Upload: Handling Large Objects
For objects larger than 100MB (the recommended threshold), multipart upload splits the work into three phases: initiate the upload, upload the parts in parallel (each part between 5MB and 5GB, up to 10,000 parts), and complete the upload, at which point S3 assembles the parts into a single object. A low-level sketch follows.
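A sketch of those three phases using boto3's low-level API; bucket, key, and file names are placeholders, and the parts are uploaded serially here for brevity even though production code would parallelize them.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "backups/large-file.bin"
part_size = 100 * 1024 * 1024          # 100MB parts (minimum 5MB, except the last part)

# Phase 1: initiate the upload.
upload = s3.create_multipart_upload(Bucket=bucket, Key=key)

# Phase 2: upload the parts, collecting their ETags.
parts = []
with open("large-file.bin", "rb") as f:
    part_number = 1
    while chunk := f.read(part_size):
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=upload["UploadId"], Body=chunk)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

# Phase 3: complete the upload; only now does S3 assemble the parts into one object.
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})
```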
GET Operation Flow
GET is simpler but still involves distributed coordination:
```
Client            S3 Front-End               Index          Data Stores (AZ-1/2/3)
  │                     │                       │                      │
  │──── GET object ────▶│                       │                      │
  │                     │── validate request    │                      │
  │                     │── check permissions   │                      │
  │                     │                       │                      │
  │                     │────── lookup key ────▶│                      │
  │                     │◀─ location(s) + meta ─│                      │
  │                     │                       │                      │
  │                     │  (select best AZ based on                    │
  │                     │   latency and availability)                  │
  │                     │                       │                      │
  │                     │──────────────── read chunks ────────────────▶│
  │                     │◀─────────────── chunk data ──────────────────│
  │                     │  (verify checksum on each chunk)             │
  │                     │                       │                      │
  │◀─ 200 OK + data stream ─│                   │                      │
```

Key observations about GET:
Read from nearest healthy replica: S3 doesn't read from all 3 AZs—it picks the fastest available one, reducing latency.
Byte-range support: GET can fetch partial objects, which S3 serves from specific chunks without reading the entire object.
Checksum verification: Every chunk is verified on read. Corrupted chunks trigger automatic repair in the background.
Caching at multiple layers: Metadata is cached aggressively; frequently accessed objects may benefit from implicit caching.
For global applications, S3 Transfer Acceleration uses CloudFront edge locations as ingress points. Instead of sending data directly to the bucket's region, you upload to the nearest edge, and AWS's optimized backbone routes it to the destination region. This can reduce upload latency significantly for distant clients.
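A minimal sketch of how a client opts in, assuming boto3 and placeholder bucket and file names; acceleration is a one-time bucket setting, after which clients choose the accelerate endpoint per client.

```python
import boto3
from botocore.config import Config

# One-time: enable Transfer Acceleration on the bucket.
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="my-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Clients then route uploads/downloads through the nearest edge location.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("report.pdf", "my-bucket", "reports/report.pdf")
```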
For 14 years (2006-2020), S3 offered eventual consistency for overwrites and deletes. If you PUT an object and immediately GET it, you might receive the old version. This was S3's original tradeoff for availability and scale.
In December 2020, AWS announced that S3 now provides strong read-after-write consistency for all operations—at no additional cost and with no performance penalty. This was a remarkable engineering achievement that deserves deep examination.
What Strong Consistency Means
| Operation | Old Model (Before Dec 2020) | New Model (Current) |
|---|---|---|
| New object PUT | Read-after-write consistent | Read-after-write consistent |
| Overwrite PUT | Eventually consistent (might get old version) | Read-after-write consistent (always get new version) |
| DELETE | Eventually consistent (might get deleted object) | Read-after-write consistent (GET returns 404 immediately) |
| LIST after PUT | Eventually consistent (new object might not appear) | Read-after-write consistent (new object appears immediately) |
| HEAD after PUT | Eventually consistent | Read-after-write consistent |
How Was This Achieved?
Achieving strong consistency at S3's scale without sacrificing performance or availability is non-trivial. Based on AWS's published information, the approach involved:
1. Witness-Based Protocol
The index service evolved to use a witness-based consistency protocol. Before returning success for a write, S3 ensures that enough witnesses (distributed nodes) have acknowledged the write. Subsequent reads consult witnesses to ensure they see the latest version.
2. Logical Clocks and Versioning
Each write receives a logical timestamp. Reads track which timestamp they've observed. If a read attempts to access a version the local node hasn't received yet, the system either waits for propagation or fails forward to a node that has the latest data.
3. Caching Invalidation
S3's internal caching layer needed to become consistency-aware. Writes now invalidate or update caches synchronously, ensuring stale data isn't returned. This was the hardest part—cache coherence at exabyte scale.
4. No Quorum Reads
AWS specifically stated they don't use quorum reads (reading from multiple replicas and taking majority). Quorum reads would increase latency. Instead, they track write propagation and route reads to nodes guaranteed to have current data.
Strong consistency might seem to violate the CAP theorem's tradeoff. However, S3 achieves this within a region (not globally) and during normal operation (not during network partitions). During rare partition events, S3 would sacrifice availability rather than consistency—briefly returning errors rather than stale data. This CP choice was a significant strategic shift.
Impact on System Design
This change eliminated many workarounds developers previously needed: read-after-write retry loops, writing every update to a new key instead of overwriting, and external consistency layers (such as EMRFS consistent view or S3Guard) that tracked which objects should be visible.
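A small sketch of what that simplification looks like in client code (boto3, placeholder names): a read issued immediately after a successful write returns the new bytes, with no retry loop.

```python
import boto3

s3 = boto3.client("s3")

# Overwrite an existing key...
s3.put_object(Bucket="my-bucket", Key="config/flag.json", Body=b'{"enabled": true}')

# ...and the very next GET in the same region sees the new version.
latest = s3.get_object(Bucket="my-bucket", Key="config/flag.json")["Body"].read()
assert latest == b'{"enabled": true}'
```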
Strong consistency applies within a region. Cross-Region Replication (CRR) remains asynchronous and eventually consistent. If you write to us-east-1 and read from eu-west-1, there's replication lag—typically seconds to minutes depending on object size and replication backlog.
Understanding S3's performance model is essential for building high-performance applications. S3 has undergone significant improvements, but certain characteristics remain:
Request Rate Limits
S3 supports extremely high request rates, but there are practical limits:
| Metric | Per-Prefix Baseline | Notes |
|---|---|---|
| PUT/POST/DELETE | 3,500 requests/second | Per prefix; scales automatically beyond this |
| GET/HEAD | 5,500 requests/second | Per prefix; scales automatically beyond this |
| LIST | Hundreds/second | Much slower than point operations |
| Parallel connections | Effectively unlimited | Multipart enables massive parallelism |
The Prefix Partitioning Story
Historically, S3 partitioned by key prefix. If all your keys started with timestamp/ and you had high traffic, you'd create a hot partition. The guidance was to add random prefixes like hash characters.
In mid-2018, AWS announced S3 now automatically scales to handle any request rate. The system dynamically partitions based on actual traffic patterns, not just key prefixes. However, for extreme workloads (billions of objects, tens of thousands of requests/second), the original prefix randomization advice still helps.
Latency Characteristics
For S3 Standard, time to first byte is typically in the tens to low hundreds of milliseconds, and overall throughput scales with the number of parallel connections; S3 Express One Zone (covered below) targets single-digit milliseconds.
Optimizing S3 Performance
Several techniques maximize S3 performance:
1. Use Multipart Upload
For objects > 100MB, multipart upload enables parallel part uploads. The AWS SDKs handle this automatically (TransferManager in the Java SDK, managed transfers in boto3), splitting an object into as many as 10,000 parts that upload over parallel connections.
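A sketch of the high-level counterpart to the low-level multipart flow shown earlier, assuming boto3's managed transfer; the thresholds below are illustrative, not recommendations.

```python
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # switch to multipart above ~100MB
    multipart_chunksize=16 * 1024 * 1024,    # 16MB parts
    max_concurrency=16,                      # parallel part uploads
)

s3 = boto3.client("s3")
s3.upload_file("dataset.parquet", "my-bucket", "data/dataset.parquet", Config=config)
```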
2. Use Byte-Range Fetches
For large files, request specific byte ranges. This parallelizes downloads and enables resumable transfers. Example: split a 1GB file into 10MB ranges and download concurrently.
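For example, with boto3 a single ranged GET looks like this (bucket, key, and range are placeholders):

```python
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(Bucket="my-bucket", Key="videos/movie.mp4",
                     Range="bytes=0-10485759")      # first 10MB only
first_chunk = resp["Body"].read()
print(len(first_chunk), "bytes,", resp["ContentRange"])
```

Issuing several such requests for adjacent ranges in parallel, then stitching the pieces together locally, is the standard way to saturate a fast network link from a single large object.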
3. Aggressive Concurrency
S3 has no cross-request coordination penalty. If you need 1000 objects, request all 1000 concurrently, not sequentially. The limiting factor is your client's network bandwidth, not S3.
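A sketch of that pattern with a thread pool (boto3 clients are thread-safe; the bucket and key names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
keys = [f"thumbnails/{i}.jpg" for i in range(1000)]

def fetch(key: str) -> bytes:
    return s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()

# S3 absorbs the parallelism; client bandwidth and thread count are the real limits.
with ThreadPoolExecutor(max_workers=64) as pool:
    objects = list(pool.map(fetch, keys))
```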
4. Optimize Object Size
Many small objects incur per-request overhead. If you're storing millions of tiny files, consider bundling them (e.g., using tar archives, Parquet files, or S3 Object Lambda for on-the-fly assembly).
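As a minimal sketch of the bundling idea (the file names are placeholders):

```python
import tarfile
import boto3

# Bundle many small files into one compressed archive object before upload.
with tarfile.open("bundle.tar.gz", "w:gz") as tar:
    for name in ["a.json", "b.json", "c.json"]:
        tar.add(name)

boto3.client("s3").upload_file("bundle.tar.gz", "my-bucket", "bundles/bundle.tar.gz")
```

One archive PUT replaces thousands of tiny PUTs, trading per-request overhead for the need to read the whole bundle (or use a range-friendly format like Parquet) on retrieval.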
5. Use S3 Express One Zone for Latency-Sensitive Workloads
AWS launched S3 Express One Zone in 2023, offering single-digit millisecond latency (10x faster than standard S3) with a directory bucket model. It trades durability (single-AZ) for speed.
When benchmarking S3, ensure your test client isn't the bottleneck. Use machines in the same region as your bucket, with sufficient network bandwidth (at least 10Gbps for serious testing). Many 'slow S3' reports trace to client limitations, not S3 itself.
S3's security model is comprehensive but complex, with multiple overlapping layers. Understanding this model prevents both security breaches and access denials.
Identity-Based Policies (IAM)
IAM policies attached to users, roles, or groups define what S3 operations they can perform:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```
Resource-Based Policies (Bucket Policies)
Bucket policies attached to the bucket itself control access. They can grant cross-account access, restrict by IP, require MFA, and more:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "Bool": {"aws:SecureTransport": "false"}
      }
    }
  ]
}
```
Access Control Lists (ACLs)
ACLs are the original S3 access control mechanism. They're simpler but less flexible than policies. AWS now recommends disabling ACLs entirely (bucket ownership controls) and using policies exclusively.
Block Public Access
S3 Block Public Access settings override any policies or ACLs that might grant public access. They work at both the account and bucket levels through four settings: BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, and RestrictPublicBuckets.
With Block Public Access enabled at the account level, even a misconfigured policy can't accidentally expose data publicly.
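A sketch of turning on all four settings for a single bucket with boto3 (the bucket name is a placeholder; the account-level equivalent lives in the s3control API):

```python
import boto3

boto3.client("s3").put_public_access_block(
    Bucket="my-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```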
Access Points
Access Points provide application-specific entry points to buckets with their own policies. A large bucket might have, for example, one access point granting read-only access to an analytics team, another granting write-only access to an ingestion pipeline, and a third restricted to a specific VPC.
Each access point has its own DNS name and policy, simplifying access control in complex environments.
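A sketch of creating one such access point with boto3's s3control client; the account ID and names are placeholders.

```python
import boto3

s3control = boto3.client("s3control")
s3control.create_access_point(
    AccountId="111122223333",
    Name="analytics-readonly",
    Bucket="my-bucket",
)
# The access point then receives its own policy via put_access_point_policy(...).
```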
S3 access evaluation: an explicit DENY always wins. Beyond that, an explicit ALLOW is required from both the identity-based AND the resource-based policy for cross-account access, or from either one for same-account access. This complexity causes many access issues; use the IAM Policy Simulator and IAM Access Analyzer for S3 to debug.
We've covered substantial ground exploring Amazon S3's architecture. Let's consolidate the key insights:
Architectural Patterns for System Design
When designing systems with S3: treat the namespace as flat and choose key prefixes that match your access patterns, rely on strong consistency within a region but plan for asynchronous cross-region replication, exploit parallelism (multipart, byte ranges, concurrent requests) rather than fighting per-request latency, and layer Block Public Access, bucket policies, and IAM for defense in depth.
What's Next:
The next page examines Google Cloud Storage, comparing its architecture with S3 and highlighting the differences that affect system design decisions.
You now understand Amazon S3's architecture at a depth suitable for senior system design discussions. You can explain its durability guarantees, consistency model, performance characteristics, and security layers—knowledge essential for designing systems that leverage object storage at scale.