When Amazon launched Simple Storage Service (S3) in March 2006, it quietly revolutionized how the world thinks about data storage. What began as an internal solution for Amazon's e-commerce infrastructure became the blueprint for an entirely new category of storage systems. Today, S3 stores over 100 trillion objects and handles millions of requests per second—a scale that challenges the most sophisticated distributed systems engineering on the planet.
Understanding S3's architecture isn't merely about learning one product. S3 defined the paradigms that all modern object storage systems follow: the RESTful API model, eventual consistency tradeoffs (now evolved to strong consistency), storage class tiering, and the fundamental object storage data model. Whether you're using Google Cloud Storage, Azure Blob Storage, or any S3-compatible system like MinIO, you're working within patterns S3 established.
By the end of this page, you will understand S3's internal architecture at a depth suitable for senior system design discussions. You'll grasp how S3 achieves eleven 9s of durability, the engineering behind its recent transition to strong consistency, and the design patterns that enable it to scale to exabytes while maintaining sub-second latency.
S3's architecture embodies several core design principles that guided its development and continue to shape its evolution. Understanding these principles illuminates not just how S3 works, but why it makes specific tradeoffs.
1. Durability Over Everything
S3's eleven 9s of durability (99.999999999%) means that if you store 10 million objects, you can statistically expect to lose one object every 10,000 years. This isn't marketing—it's an engineering requirement that drives fundamental architectural decisions, from synchronous cross-AZ replication to continuous checksum verification and background repair.
Durability and availability are distinct concepts. Durability means your data won't be lost; availability means you can access it right now. S3 standard offers 99.99% availability (about 52 minutes of downtime per year) but 99.999999999% durability. You might temporarily be unable to read data during an outage, but you won't lose it.
2. Infinite Scale Without Pre-Provisioning
Unlike traditional storage systems where you provision capacity upfront, S3 presents effectively unlimited storage: the service partitions data and adds capacity behind the scenes as data volume and traffic grow (see the partitioning discussion later on this page).
3. Simple Mental Model, Complex Implementation
S3's API is deceptively simple: buckets contain objects, objects have keys and data. But this simplicity masks extraordinary complexity in the indexing, replication, and consistency machinery examined throughout this page.
S3's data model appears simple but has nuances that significantly affect system design. Let's examine each component:
Buckets: The Top-Level Container
A bucket is a globally unique namespace for organizing objects. Bucket names are unique across all AWS accounts: if someone already owns my-bucket, no one else can use it.
Objects: The Fundamental Unit
An object consists of:
```
# Conceptual structure of an S3 object
Object:
  Key: "products/electronics/phone-123.jpg"
  VersionId: "3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo"
  Data:
    Content: <binary data - up to 5TB>
    Size: 2,457,600 bytes
    ETag: "d41d8cd98f00b204e9800998ecf8427e"   # MD5 hash for simple uploads
    ContentType: "image/jpeg"
  SystemMetadata:
    LastModified: "2024-01-15T14:30:00Z"
    StorageClass: "STANDARD"
    ContentLength: 2457600
    ServerSideEncryption: "AES256"
  UserMetadata:   # User-defined, prefixed with 'x-amz-meta-'
    x-amz-meta-photographer: "Jane Smith"
    x-amz-meta-location: "Seattle, WA"
    x-amz-meta-camera: "Canon EOS R5"
```

Object Keys and the Flat Namespace
S3 uses a flat namespace—there are no actual directories or folders. The apparent folder structure in the console is an illusion created by key prefixes and delimiters: a key like photos/2024/vacation/beach.jpg is a single string, and the / is just a character; you could use - or :: instead.

This design has critical implications for performance and cost:
| Aspect | Implication | Design Consideration |
|---|---|---|
| Listing Performance | LIST operations scan linearly by prefix | Deep hierarchies with many objects slow down listing |
| Partition Distribution | S3 partitions by key prefix | Sequential keys (timestamps) can create hot partitions |
| Rename Operations | No rename—must copy entire object + delete | Avoid designs requiring frequent renames |
| Folder Operations | 'Delete folder' = delete all objects with prefix | Can be very slow for large prefixes |
| Billing | No folder concept in storage costs | Empty 'folders' are zero-byte objects if created explicitly |
S3 automatically partitions data by key prefix for performance. If all keys start with the same prefix (e.g., timestamps), requests concentrate on one partition. In 2018, AWS improved this significantly, but for extreme throughput (>3,500 PUT/s or >5,500 GET/s per prefix), introduce randomness into key prefixes, for example a short hash, as the sketch below shows.
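As a rough illustration of that advice, here is a small Python sketch; the prefixed_key helper and the four-character prefix length are hypothetical choices, not an AWS requirement.

```python
import hashlib

def prefixed_key(original_key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash so sequential keys spread across partitions."""
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"

# A timestamp-style key no longer shares its leading characters with its neighbors:
print(prefixed_key("2024-01-15T14:30:00Z/event.json"))
# -> e.g. "af52/2024-01-15T14:30:00Z/event.json"
```

Because the prefix is derived from the key itself, the mapping stays deterministic: any reader that knows the original key can recompute where the object lives.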
While Amazon doesn't publicly document S3's internals in complete detail, we can reconstruct the architecture from published papers, re:Invent presentations, and engineering blog posts. S3's architecture separates into distinct subsystems:
1. Request Routing Layer
When a request arrives at S3, DNS first resolves bucket-name.s3.region.amazonaws.com to an edge location or regional endpoint, which forwards the request to the front-end fleet. This layer handles TLS termination, authentication (Signature Version 4), authorization (IAM/bucket policies), and request validation—all before touching storage.
2. Index Subsystem
S3 maintains a distributed index mapping (bucket, key, version) → physical storage location. This index is the core metadata layer:
The index stores the key-to-location mapping, the version map, and per-object metadata such as size, ETag, and storage class, as sketched in the diagram below.
```
                          Request Routing Layer
 ┌────────┐     ┌──────────┐     ┌────────────┐     ┌──────────┐
 │  Edge  │ ──▶ │   Auth   │ ──▶ │ IAM Policy │ ──▶ │ Request  │
 │ Router │     │ Service  │     │            │     │  Router  │
 └────────┘     └──────────┘     └────────────┘     └──────────┘
                                       │
            ┌──────────────────────────┼──────────────────────────┐
            ▼                          ▼                          ▼
 ┌─────────────────┐        ┌─────────────────┐        ┌─────────────────┐
 │  Index Service  │        │  Data Service   │        │    Lifecycle    │
 │                 │        │                 │        │     Service     │
 │ • Metadata DB   │        │ • Chunk Store   │        │ • Transitions   │
 │ • Version Map   │        │ • Replication   │        │ • Expirations   │
 │ • Key Lookup    │        │ • Checksums     │        │ • Cleanups      │
 └─────────────────┘        └─────────────────┘        └─────────────────┘
            │                          │                          │
            └──────────────────────────┼──────────────────────────┘
                                       ▼
                         Physical Storage Layer
 ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
 │    AZ-1     │          │    AZ-2     │          │    AZ-3     │
 │   Storage   │ ◀──────▶ │   Storage   │ ◀──────▶ │   Storage   │
 │    Fleet    │   sync   │    Fleet    │   sync   │    Fleet    │
 └─────────────┘          └─────────────┘          └─────────────┘
```

3. Data Storage Subsystem
The actual bytes are stored in a distributed storage system with several key characteristics:
Chunking: Large objects are split into chunks (typically 4-16MB). Each chunk is stored, checksummed, and replicated independently. This enables parallel transfers, byte-range reads served from individual chunks, and repair of a single corrupted chunk without rewriting the whole object.
Replication: Each chunk is synchronously replicated to at least 3 AZs before the write is acknowledged. The replication protocol ensures that an acknowledged write is durable on every replica, so a committed PUT survives the loss of an entire AZ.
Erasure Coding: For some storage classes (especially cheaper ones), S3 uses erasure coding rather than simple replication. Data is encoded into fragments plus parity fragments so the object can be reconstructed from a subset of them, achieving similar durability with less raw storage overhead.
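To make the storage-overhead tradeoff concrete, here is a back-of-the-envelope comparison; AWS does not publish S3's actual coding parameters, so the 10-data/4-parity scheme below is purely illustrative.

```python
# Hypothetical Reed-Solomon-style scheme: k data fragments + m parity fragments.
k, m = 10, 4

erasure_overhead = (k + m) / k        # raw bytes stored per logical byte -> 1.4x
replication_overhead = 3.0            # 3-way replication stores 3.0x
tolerated_fragment_losses = m         # any m of the k+m fragments can be lost

print(f"Erasure coding: {erasure_overhead:.1f}x overhead, survives {tolerated_fragment_losses} fragment losses")
print(f"Replication:    {replication_overhead:.1f}x overhead, survives 2 replica losses")
```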
Three-way replication across AZs provides tolerance for: (1) individual disk failures, (2) individual server failures, (3) rack failures, (4) entire data center (AZ) failures, and (5) network partitions between AZs. The probability of simultaneous failures across 3 independent AZs is astronomically low, and that independence is what yields eleven 9s of durability.
Understanding how PUT and GET operations work internally reveals S3's distributed systems sophistication.
PUT Operation Flow (Simple Upload)
```
Client            S3 Front-End               Index          Data Stores (AZ-1/2/3)
  │                     │                       │                      │
  │──── PUT object ────▶│                       │                      │
  │                     │── validate request    │                      │
  │                     │── check permissions   │                      │
  │                     │                       │                      │
  │                     │───── reserve slot ───▶│                      │
  │                     │◀──── slot reserved ───│                      │
  │                     │                       │                      │
  │                     │──────────────── write chunk ────────────────▶│
  │                     │        (synchronously replicated to          │
  │                     │         AZ-1, AZ-2, and AZ-3)                │
  │                     │◀────────── all replicas confirmed ───────────│
  │                     │                       │                      │
  │                     │───── commit entry ───▶│                      │
  │                     │◀──── entry committed ─│                      │
  │                     │                       │                      │
  │◀── 200 OK (ETag) ───│                       │                      │
```

Key observations about PUT:
Synchronous replication: The 200 OK doesn't return until data is durably stored in all 3 AZs. This is why PUT latency is higher than a simple write.
Atomic commits: The index update is atomic. Either the object is fully present in the index, or it's not there at all. There's no 'partially uploaded' visible state.
Two-phase commit: S3 uses a two-phase commit protocol—first confirming data storage, then updating the index. This ensures consistency even during failures.
ETag generation: The ETag (typically MD5 for single-part uploads) is computed and returned, providing a checksum for verification.
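As a client-side sketch of that checksum handshake (boto3; the bucket name is a placeholder, and the ETag equals the MD5 only for single-part, non-KMS uploads):

```python
import hashlib
import boto3

s3 = boto3.client("s3")
body = b"hello, object storage"

local_md5 = hashlib.md5(body).hexdigest()
response = s3.put_object(Bucket="my-bucket", Key="demo/hello.txt", Body=body)

remote_etag = response["ETag"].strip('"')   # S3 returns the ETag wrapped in quotes
assert remote_etag == local_md5, "payload corrupted in transit?"
print("PUT confirmed, ETag:", remote_etag)
```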
Multipart Upload: Handling Large Objects
For objects larger than 100MB (the recommended threshold), multipart upload splits the work into three phases: initiate the upload, upload the parts in parallel (each part between 5MB and 5GB, up to 10,000 parts), and complete the upload, at which point S3 assembles the parts into a single object. A low-level sketch follows.
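A sketch of those three phases using boto3's low-level API; bucket, key, and file names are placeholders, and the parts are uploaded serially here for brevity even though production code would parallelize them.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "backups/large-file.bin"
part_size = 100 * 1024 * 1024          # 100MB parts (minimum 5MB, except the last part)

# Phase 1: initiate the upload.
upload = s3.create_multipart_upload(Bucket=bucket, Key=key)

# Phase 2: upload the parts, collecting their ETags.
parts = []
with open("large-file.bin", "rb") as f:
    part_number = 1
    while chunk := f.read(part_size):
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=upload["UploadId"], Body=chunk)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

# Phase 3: complete the upload; only now does S3 assemble the parts into one object.
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})
```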
GET Operation Flow
GET is simpler but still involves distributed coordination:
```
Client            S3 Front-End               Index          Data Stores (AZ-1/2/3)
  │                     │                       │                      │
  │──── GET object ────▶│                       │                      │
  │                     │── validate request    │                      │
  │                     │── check permissions   │                      │
  │                     │                       │                      │
  │                     │────── lookup key ────▶│                      │
  │                     │◀─ location(s) + meta ─│                      │
  │                     │                       │                      │
  │                     │  (select best AZ based on                    │
  │                     │   latency and availability)                  │
  │                     │                       │                      │
  │                     │──────────────── read chunks ────────────────▶│
  │                     │◀─────────────── chunk data ──────────────────│
  │                     │  (verify checksum on each chunk)             │
  │                     │                       │                      │
  │◀─ 200 OK + data stream ─│                   │                      │
```

Key observations about GET:
Read from nearest healthy replica: S3 doesn't read from all 3 AZs—it picks the fastest available one, reducing latency.
Byte-range support: GET can fetch partial objects, which S3 serves from specific chunks without reading the entire object.
Checksum verification: Every chunk is verified on read. Corrupted chunks trigger automatic repair in the background.
Caching at multiple layers: Metadata is cached aggressively; frequently accessed objects may benefit from implicit caching.
For global applications, S3 Transfer Acceleration uses CloudFront edge locations as ingress points. Instead of sending data directly to the bucket's region, you upload to the nearest edge, and AWS's optimized backbone routes it to the destination region. This can reduce upload latency significantly for distant clients.
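A minimal sketch of how a client opts in, assuming boto3 and placeholder bucket and file names; acceleration is a one-time bucket setting, after which clients choose the accelerate endpoint per client.

```python
import boto3
from botocore.config import Config

# One-time: enable Transfer Acceleration on the bucket.
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="my-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Clients then route uploads/downloads through the nearest edge location.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("report.pdf", "my-bucket", "reports/report.pdf")
```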
For 14 years (2006-2020), S3 offered eventual consistency for overwrites and deletes. If you PUT an object and immediately GET it, you might receive the old version. This was S3's original tradeoff for availability and scale.
In December 2020, AWS announced that S3 now provides strong read-after-write consistency for all operations—at no additional cost and with no performance penalty. This was a remarkable engineering achievement that deserves deep examination.
What Strong Consistency Means
| Operation | Old Model (Before Dec 2020) | New Model (Current) |
|---|---|---|
| New object PUT | Read-after-write consistent | Read-after-write consistent |
| Overwrite PUT | Eventually consistent (might get old version) | Read-after-write consistent (always get new version) |
| DELETE | Eventually consistent (might get deleted object) | Read-after-write consistent (GET returns 404 immediately) |
| LIST after PUT | Eventually consistent (new object might not appear) | Read-after-write consistent (new object appears immediately) |
| HEAD after PUT | Eventually consistent | Read-after-write consistent |
How Was This Achieved?
Achieving strong consistency at S3's scale without sacrificing performance or availability is non-trivial. Based on AWS's published information, the approach involved:
1. Witness-Based Protocol
The index service evolved to use a witness-based consistency protocol. Before returning success for a write, S3 ensures that enough witnesses (distributed nodes) have acknowledged the write. Subsequent reads consult witnesses to ensure they see the latest version.
2. Logical Clocks and Versioning
Each write receives a logical timestamp. Reads track which timestamp they've observed. If a read attempts to access a version the local node hasn't received yet, the system either waits for propagation or fails forward to a node that has the latest data.
3. Caching Invalidation
S3's internal caching layer needed to become consistency-aware. Writes now invalidate or update caches synchronously, ensuring stale data isn't returned. This was the hardest part—cache coherence at exabyte scale.
4. No Quorum Reads
AWS specifically stated they don't use quorum reads (reading from multiple replicas and taking majority). Quorum reads would increase latency. Instead, they track write propagation and route reads to nodes guaranteed to have current data.
Strong consistency might seem to violate the CAP theorem's tradeoff. However, S3 achieves this within a region (not globally) and during normal operation (not during network partitions). During rare partition events, S3 would sacrifice availability rather than consistency—briefly returning errors rather than stale data. This CP choice was a significant strategic shift.
Impact on System Design
This change eliminated many workarounds developers previously needed: read-after-write retry loops, writing every update to a new key instead of overwriting, and external consistency layers (such as EMRFS consistent view or S3Guard) that tracked which objects should be visible.
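A small sketch of what that simplification looks like in client code (boto3, placeholder names): a read issued immediately after a successful write returns the new bytes, with no retry loop.

```python
import boto3

s3 = boto3.client("s3")

# Overwrite an existing key...
s3.put_object(Bucket="my-bucket", Key="config/flag.json", Body=b'{"enabled": true}')

# ...and the very next GET in the same region sees the new version.
latest = s3.get_object(Bucket="my-bucket", Key="config/flag.json")["Body"].read()
assert latest == b'{"enabled": true}'
```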
Strong consistency applies within a region. Cross-Region Replication (CRR) remains asynchronous and eventually consistent. If you write to us-east-1 and read from eu-west-1, there's replication lag—typically seconds to minutes depending on object size and replication backlog.
Understanding S3's performance model is essential for building high-performance applications. S3 has undergone significant improvements, but certain characteristics remain:
Request Rate Limits
S3 supports extremely high request rates, but there are practical limits:
| Metric | Per-Prefix Baseline | Notes |
|---|---|---|
| PUT/POST/DELETE | 3,500 requests/second | Per prefix; scales automatically beyond this |
| GET/HEAD | 5,500 requests/second | Per prefix; scales automatically beyond this |
| LIST | Hundreds/second | Much slower than point operations |
| Parallel connections | Effectively unlimited | Multipart enables massive parallelism |
The Prefix Partitioning Story
Historically, S3 partitioned by key prefix. If all your keys started with timestamp/ and you had high traffic, you'd create a hot partition. The guidance was to add random prefixes like hash characters.
In mid-2018, AWS announced S3 now automatically scales to handle any request rate. The system dynamically partitions based on actual traffic patterns, not just key prefixes. However, for extreme workloads (billions of objects, tens of thousands of requests/second), the original prefix randomization advice still helps.
Latency Characteristics
For S3 Standard, time to first byte is typically in the tens to low hundreds of milliseconds, and overall throughput scales with the number of parallel connections; S3 Express One Zone (covered below) targets single-digit milliseconds.
Optimizing S3 Performance
Several techniques maximize S3 performance:
1. Use Multipart Upload
For objects > 100MB, multipart upload enables parallel part uploads. The AWS SDKs handle this automatically (TransferManager in the Java SDK, managed transfers in boto3), splitting an object into as many as 10,000 parts that upload over parallel connections.
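A sketch of the high-level counterpart to the low-level multipart flow shown earlier, assuming boto3's managed transfer; the thresholds below are illustrative, not recommendations.

```python
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # switch to multipart above ~100MB
    multipart_chunksize=16 * 1024 * 1024,    # 16MB parts
    max_concurrency=16,                      # parallel part uploads
)

s3 = boto3.client("s3")
s3.upload_file("dataset.parquet", "my-bucket", "data/dataset.parquet", Config=config)
```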
2. Use Byte-Range Fetches
For large files, request specific byte ranges. This parallelizes downloads and enables resumable transfers. Example: split a 1GB file into 10MB ranges and download concurrently.
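For example, with boto3 a single ranged GET looks like this (bucket, key, and range are placeholders):

```python
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(Bucket="my-bucket", Key="videos/movie.mp4",
                     Range="bytes=0-10485759")      # first 10MB only
first_chunk = resp["Body"].read()
print(len(first_chunk), "bytes,", resp["ContentRange"])
```

Issuing several such requests for adjacent ranges in parallel, then stitching the pieces together locally, is the standard way to saturate a fast network link from a single large object.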
3. Aggressive Concurrency
S3 has no cross-request coordination penalty. If you need 1000 objects, request all 1000 concurrently, not sequentially. The limiting factor is your client's network bandwidth, not S3.
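A sketch of that pattern with a thread pool (boto3 clients are thread-safe; the bucket and key names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
keys = [f"thumbnails/{i}.jpg" for i in range(1000)]

def fetch(key: str) -> bytes:
    return s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()

# S3 absorbs the parallelism; client bandwidth and thread count are the real limits.
with ThreadPoolExecutor(max_workers=64) as pool:
    objects = list(pool.map(fetch, keys))
```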
4. Optimize Object Size
Many small objects incur per-request overhead. If you're storing millions of tiny files, consider bundling them (e.g., using tar archives, Parquet files, or S3 Object Lambda for on-the-fly assembly).
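As a minimal sketch of the bundling idea (the file names are placeholders):

```python
import tarfile
import boto3

# Bundle many small files into one compressed archive object before upload.
with tarfile.open("bundle.tar.gz", "w:gz") as tar:
    for name in ["a.json", "b.json", "c.json"]:
        tar.add(name)

boto3.client("s3").upload_file("bundle.tar.gz", "my-bucket", "bundles/bundle.tar.gz")
```

One archive PUT replaces thousands of tiny PUTs, trading per-request overhead for the need to read the whole bundle (or use a range-friendly format like Parquet) on retrieval.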
5. Use S3 Express One Zone for Latency-Sensitive Workloads
AWS launched S3 Express One Zone in 2023, offering single-digit millisecond latency (10x faster than standard S3) with a directory bucket model. It trades durability (single-AZ) for speed.
When benchmarking S3, ensure your test client isn't the bottleneck. Use machines in the same region as your bucket, with sufficient network bandwidth (at least 10Gbps for serious testing). Many 'slow S3' reports trace to client limitations, not S3 itself.
S3's security model is comprehensive but complex, with multiple overlapping layers. Understanding this model prevents both security breaches and access denials.
Identity-Based Policies (IAM)
IAM policies attached to users, roles, or groups define what S3 operations they can perform:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```
Resource-Based Policies (Bucket Policies)
Bucket policies attached to the bucket itself control access. They can grant cross-account access, restrict by IP, require MFA, and more:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "Bool": {"aws:SecureTransport": "false"}
      }
    }
  ]
}
```
Access Control Lists (ACLs)
ACLs are the original S3 access control mechanism. They're simpler but less flexible than policies. AWS now recommends disabling ACLs entirely (bucket ownership controls) and using policies exclusively.
Block Public Access
S3 Block Public Access settings override any policies or ACLs that might grant public access. They work at both the account and bucket levels through four settings: BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, and RestrictPublicBuckets.
With Block Public Access enabled at the account level, even a misconfigured policy can't accidentally expose data publicly.
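A sketch of turning on all four settings for a single bucket with boto3 (the bucket name is a placeholder; the account-level equivalent lives in the s3control API):

```python
import boto3

boto3.client("s3").put_public_access_block(
    Bucket="my-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```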
Access Points
Access Points provide application-specific entry points to buckets with their own policies. A large bucket might have, for example, one access point granting read-only access to an analytics team, another granting write-only access to an ingestion pipeline, and a third restricted to a specific VPC.
Each access point has its own DNS name and policy, simplifying access control in complex environments.
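A sketch of creating one such access point with boto3's s3control client; the account ID and names are placeholders.

```python
import boto3

s3control = boto3.client("s3control")
s3control.create_access_point(
    AccountId="111122223333",
    Name="analytics-readonly",
    Bucket="my-bucket",
)
# The access point then receives its own policy via put_access_point_policy(...).
```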
S3 access evaluation: an explicit DENY always wins. Beyond that, an explicit ALLOW is required from both the identity-based AND the resource-based policy for cross-account access, or from either one for same-account access. This complexity causes many access issues; use the IAM Policy Simulator and IAM Access Analyzer for S3 to debug.
We've covered substantial ground exploring Amazon S3's architecture. Let's consolidate the key insights:
Architectural Patterns for System Design
When designing systems with S3: treat the namespace as flat and choose key prefixes that match your access patterns, rely on strong consistency within a region but plan for asynchronous cross-region replication, exploit parallelism (multipart, byte ranges, concurrent requests) rather than fighting per-request latency, and layer Block Public Access, bucket policies, and IAM for defense in depth.
What's Next:
The next page examines Google Cloud Storage, comparing its architecture with S3 and highlighting the differences that affect system design decisions.
You now understand Amazon S3's architecture at a depth suitable for senior system design discussions. You can explain its durability guarantees, consistency model, performance characteristics, and security layers—knowledge essential for designing systems that leverage object storage at scale.