In 2007, Amazon published a paper titled "Dynamo: Amazon's Highly Available Key-value Store". This paper didn't just describe a database—it laid out a comprehensive architecture for building highly available distributed systems. Its influence cannot be overstated.
Dynamo's innovations addressed real operational pain at Amazon's scale: the company's e-commerce infrastructure needed to survive datacenter failures, handle holiday traffic spikes, and provide millisecond latencies to millions of concurrent shoppers. Traditional databases couldn't meet these requirements.
The resulting "Dynamo-style" architecture became a template, inspiring databases like Cassandra, Riak, Voldemort, and influencing countless proprietary systems. Understanding Dynamo's approach is essential knowledge for any engineer working with distributed systems.
By the end of this page, you will understand the key innovations in Amazon's Dynamo: consistent hashing with virtual nodes, preference lists, sloppy quorums with hinted handoff, anti-entropy using Merkle trees, and the "always writeable" design philosophy. You'll see how these components work together to create a highly available system.
Before diving into Dynamo's architecture, we must understand the operational context that drove its design. Amazon's requirements were specific and demanding, shaped by years of running large-scale e-commerce infrastructure.
The core requirement: "Always writeable"
"Even if disks are failing, network routes are flapping, or entire datacenters are being destroyed, the shopping cart should remain available."
This single requirement shaped every design decision. Availability wasn't just important—it was the primary constraint. Temporary inconsistency was acceptable; losing customer actions was not.
Service Level Agreements (SLAs) at Amazon:
Amazon's internal SLAs specified latency requirements in percentiles, not averages. A representative SLA cited in the Dynamo paper: respond within 300 ms for 99.9% of requests at a peak load of 500 requests per second.
This focus on tail latency influenced Dynamo's design significantly. Techniques that improved average latency but worsened p99 were rejected. Every design decision had to be evaluated at the tail.
| Design dimension | Traditional RDBMS | Dynamo's Choice |
|---|---|---|
| Consistency vs. Availability | Consistency (ACID) | Availability (always writeable) |
| Latency vs. Consistency | Consistency (synchronous replication) | Latency (async, eventual consistency) |
| Simplicity vs. Flexibility | Simple queries, fixed schemas | Simple key-value, flexible conflict resolution |
| Centralization vs. Distribution | Master-based architecture | Fully decentralized peers |
| Operational model | DBA-managed, careful operations | Self-healing, minimal operations |
Dynamo's primary use case was Amazon's shopping cart. The cart must allow additions even during failures—rejecting a customer's 'Add to Cart' click means losing a potential sale. This framing helps explain Dynamo's choices: better to occasionally end up with two divergent versions of the cart that are later merged than to ever refuse the cart operation.
Dynamo's data distribution strategy builds on consistent hashing, but with crucial enhancements that address the limitations of the basic algorithm.
Basic consistent hashing recap: both nodes and keys are hashed onto a circular ring, and each key is stored on the first node encountered walking clockwise from the key's position. Adding or removing a node only remaps keys in that node's immediate neighborhood.
The problem with basic consistent hashing:
With one random ring position per node, the distribution is rarely uniform: some nodes end up owning much larger arcs of the ring (and therefore more keys and more traffic) than others, and when a node joins or leaves, only its immediate neighbor absorbs the change in load. Basic placement is also blind to differences in hardware capacity.
Dynamo's solution: Virtual nodes (vnodes)
Instead of each physical node having one position on the ring, Dynamo assigns multiple "virtual nodes" to each physical node:
Physical Node A → Virtual nodes: v1, v5, v9,  v13, v17
Physical Node B → Virtual nodes: v2, v6, v10, v14, v18
Physical Node C → Virtual nodes: v3, v7, v11, v15, v19
Physical Node D → Virtual nodes: v4, v8, v12, v16, v20
Ring with 20 virtual-node positions, evenly distributed (5 per physical node)
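To make this concrete, here is a minimal Python sketch of a hash ring with virtual nodes. The `HashRing` class, the vnode count, and the MD5-based `ring_hash` helper are illustrative assumptions, not Dynamo's actual implementation.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a ring position (assumption: any stable hash works)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring where each physical node owns several virtual nodes."""

    def __init__(self, nodes, vnodes_per_node=8):
        self._ring = []  # sorted list of (position, physical_node)
        for node in nodes:
            for i in range(vnodes_per_node):
                self._ring.append((ring_hash(f"{node}#vnode{i}"), node))
        self._ring.sort()

    def owner(self, key: str) -> str:
        """Return the physical node owning the first vnode clockwise from the key."""
        pos = ring_hash(key)
        idx = bisect.bisect_left(self._ring, (pos,))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["A", "B", "C", "D"])
print(ring.owner("cart:alice"))  # one of A/B/C/D, stable for a given ring layout
```

Because each physical node owns many small arcs, removing one node redistributes its load across all remaining nodes instead of dumping it on a single neighbor.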
Benefits of virtual nodes:
- When a node fails, its load spreads evenly across the remaining nodes instead of landing on a single neighbor.
- When a node joins, it takes a roughly equal share of load from every existing node.
- Heterogeneous hardware is handled by assigning more virtual nodes to more capable machines.
| Scenario | Vnodes per Node | Trade-off |
|---|---|---|
| Homogeneous cluster | 256 per node | Standard choice, good distribution |
| Mixed hardware | Proportional to CPU/RAM | Capacity-proportional load |
| Large cluster (1000+ nodes) | Lower (64-128) | Reduces metadata size |
| Small cluster (3-5 nodes) | Higher (256-512) | Better distribution with few nodes |
More vnodes means better distribution but larger ring metadata. In Cassandra, each vnode requires ~30 bytes in the ring metadata, which all nodes must store and synchronize. Very large clusters with many vnodes per node can have significant metadata overhead.
For each key, Dynamo maintains a preference list: an ordered list of nodes that should store replicas of that key. The preference list determines which nodes participate in reads and writes, and in what order they're preferred.
Constructing the preference list:
Hash the key to a ring position, then walk clockwise. For each virtual node encountered, add its physical node to the list, skipping vnodes whose physical node is already present, until N distinct physical nodes have been collected.
Example with N=3:
Key K hashes to position 50
Walking clockwise:
- Position 52: vnode owned by Physical Node A → Add A
- Position 55: vnode owned by Physical Node A → Skip (A already included)
- Position 58: vnode owned by Physical Node B → Add B
- Position 61: vnode owned by Physical Node C → Add C
- Preference list complete: [A, B, C]
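The walk above maps directly onto code. A sketch against the hypothetical `HashRing` from earlier (not Dynamo's API):

```python
def preference_list(ring: HashRing, key: str, n: int = 3) -> list:
    """Walk clockwise from the key's position, collecting N distinct physical nodes."""
    pos = ring_hash(key)
    start = bisect.bisect_left(ring._ring, (pos,))
    result = []
    for step in range(len(ring._ring)):
        node = ring._ring[(start + step) % len(ring._ring)][1]
        if node not in result:       # skip vnodes of already-chosen physical nodes
            result.append(node)
        if len(result) == n:
            break
    return result

print(preference_list(ring, "cart:alice", n=3))  # e.g. ['C', 'A', 'D']
```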
Coordinator selection from preference list:
The first node in the preference list is called the coordinator for that key. When a client wants to read or write, it should contact the coordinator first. However, if the coordinator is unavailable, the client can contact any node in the preference list.
Why ordering matters:
The preference list order has practical implications: the first healthy node acts as coordinator and handles most traffic for the key; fallback nodes for sloppy quorums are chosen in ring order after the list; and because every node derives the same ordering from the ring, all participants agree on which replicas to try, and in what sequence, without extra coordination.
Modern Dynamo-style systems like Cassandra don't use explicit 'preference lists' but implement the same concept through their token ring and replication strategies. The idea of an ordered, deterministic set of replicas for each key remains central to the architecture.
Dynamo's "always writeable" requirement led to two interconnected innovations: sloppy quorums and hinted handoff. These mechanisms allow writes to succeed even when the preferred replica nodes are unavailable.
Traditional (strict) quorums:
With strict quorums, a write to a key must reach a quorum (W) of the nodes in that key's preference list:
Preference list for key K: [A, B, C]
Write requires 2 of 3 (quorum)
If A and B are available → Write succeeds
If only A is available → Write FAILS (can't reach quorum)
Sloppy quorums:
Dynamo extends the concept: if preferred nodes are unavailable, the system can write to other nodes outside the preference list:
Preference list for key K: [A, B, C]
Write requires 2 of 3
If only A is available:
→ Write to A (preferred)
→ Write to D (next healthy node on ring, outside preference list)
→ Write succeeds! (sloppy quorum met)
The key insight: durability is preserved (data exists on 2 nodes) even though it's not on the "correct" nodes temporarily.
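A rough sketch of how a coordinator might choose write targets under a sloppy quorum, reusing the hypothetical ring helpers above; `healthy` is a placeholder for failure detection, and the real Dynamo logic is more involved.

```python
def sloppy_write_targets(ring, key, n=3, healthy=lambda node: True):
    """Pick N healthy write targets, preferring the key's preference list.

    A target outside the preference list carries a hint naming the down node
    it stands in for (hinted handoff).
    """
    preferred = preference_list(ring, key, n)
    down = [node for node in preferred if not healthy(node)]
    targets = [(node, None) for node in preferred if healthy(node)]

    # Keep walking the ring past the preference list for healthy stand-ins.
    start = bisect.bisect_left(ring._ring, (ring_hash(key),))
    for step in range(len(ring._ring)):
        if len(targets) == n:
            break
        node = ring._ring[(start + step) % len(ring._ring)][1]
        taken = node in preferred or any(node == t for t, _ in targets)
        if taken or not healthy(node):
            continue
        targets.append((node, down.pop(0) if down else None))
    return targets

# If B is down, the write lands on A, C, and one stand-in holding a hint for B.
print(sloppy_write_targets(ring, "cart:alice", healthy=lambda node: node != "B"))
```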
Hinted handoff mechanics:
When data is written to a non-preferred node (like D in the example above), the write includes a hint—metadata indicating the intended recipient:
Data stored on Node D:
{
key: K,
value: V,
timestamp: T,
hint: {
intended_for: "Node B",
reason: "B unreachable during write"
}
}
Node D periodically checks whether Node B has recovered. Once B is healthy, D delivers the hinted data to B and, after B acknowledges, deletes its local copy.
The handoff process:
[During failure]
Client → A (primary) ✓
Client → B (primary) ✗ (down)
Client → D (handoff) ✓ with hint for B
[After B recovers]
D → B: "Here's data that was for you"
B: Stores data, ACKs
D: Deletes hinted copy
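A minimal sketch of the hint-delivery loop on the stand-in node. The hint layout and the `is_healthy` and `send_to` callbacks are placeholders, not a real Dynamo or Cassandra API.

```python
import time

def deliver_hints(local_hints, is_healthy, send_to, interval_s=30):
    """Periodically try to hand hinted writes back to their intended owners.

    local_hints: list of dicts like {"key": ..., "value": ..., "intended_for": "Node B"}
    send_to(owner, key, value) -> True once the owner has durably stored the write.
    """
    while local_hints:
        for hint in list(local_hints):
            owner = hint["intended_for"]
            if not is_healthy(owner):
                continue                      # owner still down; retry next round
            if send_to(owner, hint["key"], hint["value"]):
                local_hints.remove(hint)      # delivered, drop the hinted copy
        time.sleep(interval_s)
```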
| Aspect | Strict Quorum | Sloppy Quorum |
|---|---|---|
| Write availability | Limited by preference list health | Higher (uses backup nodes) |
| Read consistency | Overlapping quorums (R + W > N) see the latest acknowledged write | Stale reads possible if hints not delivered |
| Data locality | Data always on correct nodes | Temporarily on "wrong" nodes |
| Recovery complexity | Read repair only | Read repair + hint delivery |
| Storage overhead | None | Hints consume storage on backup nodes |
| Best for | Consistency-sensitive applications | Availability-first applications |
Sloppy quorums can violate traditional quorum intersection guarantees. If a write goes to [A, D] (with D hinted) and a subsequent read contacts [B, C], the read won't see the write—there's no overlap. Dynamo accepts this trade-off for availability, relying on anti-entropy to eventually repair such gaps.
Hinted handoff handles transient failures, but what about durable replica divergence? Long failures, disk corruption, or missed updates can cause replicas to drift apart. Dynamo uses anti-entropy with Merkle trees to detect and repair these inconsistencies.
The anti-entropy concept:
Anti-entropy is a background process that periodically compares replicas and synchronizes differences. The challenge is efficiency: comparing every key between replicas would be prohibitively expensive.
Merkle trees for efficient comparison:
A Merkle tree (hash tree) enables comparison of large datasets by comparing only hashes:
Merkle tree structure:
                  [Root Hash]
                 /           \
          [Hash1]             [Hash2]
          /      \            /      \
      [H1]      [H2]      [H3]      [H4]   ← Hash of key ranges
        |         |         |         |
      keys      keys      keys      keys   ← Actual data
Anti-entropy comparison process:
- Two replicas exchange the root hashes of the Merkle trees covering the key range they both store.
- If the root hashes match, the replicas are in sync for that range and no data is transferred.
- If they differ, the nodes walk down the tree, comparing child hashes to narrow in on the leaf ranges that disagree.
- Only the keys in the differing leaf ranges are exchanged and repaired.
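A toy Python illustration of the idea, assuming both replicas bucket keys into the same fixed leaf ranges; real implementations hash serialized key ranges and build much deeper trees.

```python
import hashlib

def leaf_hashes(data: dict, num_buckets: int = 4) -> list:
    """Bucket keys into fixed ranges and hash each bucket's contents."""
    buckets = [[] for _ in range(num_buckets)]
    for key in sorted(data):
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
        buckets[idx].append(f"{key}={data[key]}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]

def merkle_root(leaves: list) -> str:
    """Hash pairs of nodes upward until a single root hash remains."""
    level = leaves
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            right = level[i + 1] if i + 1 < len(level) else level[i]
            nxt.append(hashlib.sha256((level[i] + right).encode()).hexdigest())
        level = nxt
    return level[0]

def diverging_buckets(a: dict, b: dict, num_buckets: int = 4) -> list:
    """Compare roots first; only on mismatch, compare leaves to find bad ranges."""
    la, lb = leaf_hashes(a, num_buckets), leaf_hashes(b, num_buckets)
    if merkle_root(la) == merkle_root(lb):
        return []                    # replicas agree, nothing to transfer
    return [i for i, (x, y) in enumerate(zip(la, lb)) if x != y]

replica_1 = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_2 = {"k1": "v1", "k2": "v2-stale", "k3": "v3"}
print(diverging_buckets(replica_1, replica_2))  # only k2's bucket needs repair
```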
Merkle tree challenges in Dynamo:
Challenge 1: Tree maintenance. Every write must update leaf and interior hashes (or trees must be rebuilt periodically), adding CPU and I/O cost on top of the write path.
Challenge 2: Key range stability. When nodes join or leave, the key ranges each node owns change, invalidating existing trees; the Dynamo paper notes the affected trees must be recalculated.
Challenge 3: Comparison frequency. Running anti-entropy too often wastes network and disk bandwidth; running it too rarely lets replicas diverge for long periods.
Modern implementations (like Cassandra):
The original Dynamo maintained Merkle trees continuously. Modern systems like Cassandra offer choice: 'full repair' compares all data, 'incremental repair' tracks what's changed since last repair. Most production deployments run repairs on schedules (daily, weekly) rather than continuously.
When multiple replicas can accept writes independently, concurrent modifications can create conflicting versions. Dynamo uses vector clocks (a form of version vectors) to track causality and detect conflicts.
Why simple timestamps fail:
Consider two concurrent writes at physical time T:
Client 1 @ Node A: SET key=X, value="apple", time=1000
Client 2 @ Node B: SET key=X, value="banana", time=1000
Which write happened first? We can't know—the timestamps are identical.
Even if different by milliseconds, clock skew means we can't trust ordering.
Vector clocks solution:
Instead of a single timestamp, track a vector of per-node counters:
Initial state: X = {value: null, vclock: []}
Client 1 writes via Node A:
X = {value: "apple", vclock: [{A: 1}]}
Node B receives, client 2 writes:
X = {value: "banana", vclock: [{A: 1}, {B: 1}]} // Builds on A:1
OR if client 2 hadn't seen A's update:
X = {value: "banana", vclock: [{B: 1}]} // Independent
Comparing vector clocks:
Given two vector clocks V1 and V2:
V1 < V2 (V1 happened before V2): every counter in V1 is ≤ its counterpart in V2, and at least one is strictly smaller
V1 > V2 (V1 happened after V2): every counter in V1 is ≥ its counterpart in V2, and at least one is strictly larger
V1 || V2 (concurrent, conflict): each clock has at least one counter greater than the other's
Example comparison:
V1 = [{A: 2, B: 1}]
V2 = [{A: 1, B: 2}]
A: 2 > 1 (V1 has more from A)
B: 1 < 2 (V2 has more from B)
Result: V1 || V2 (CONCURRENT)
Conflict detected—application must resolve
| Operation | Action | Example |
|---|---|---|
| Write (new) | Add {node: 1} | {} → [{A: 1}] |
| Write (update) | Increment own counter | [{A: 1}] → [{A: 2}] (via A) |
| Merge (no conflict) | Take max of each counter | [{A: 2}] ⊔ [{A: 1, B: 1}] = [{A: 2, B: 1}] |
| Merge (conflict) | Keep both versions for resolution | [{A: 2}] || [{B: 1}] = [CONFLICT] |
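A compact sketch of vector clock comparison and merge, representing clocks as plain dicts from node name to counter; this is illustrative, not Dynamo's wire format.

```python
def compare(v1: dict, v2: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for two vector clocks."""
    nodes = set(v1) | set(v2)
    dominated = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)   # v1 <= v2 everywhere
    dominates = all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)   # v1 >= v2 everywhere
    if dominated and dominates:
        return "equal"
    if dominated:
        return "before"
    if dominates:
        return "after"
    return "concurrent"            # conflict: application must reconcile

def merge(v1: dict, v2: dict) -> dict:
    """Element-wise max: the smallest clock that dominates both inputs."""
    return {n: max(v1.get(n, 0), v2.get(n, 0)) for n in set(v1) | set(v2)}

print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))   # concurrent
print(merge({"A": 2}, {"A": 1, "B": 1}))             # {'A': 2, 'B': 1}
```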
Vector clocks grow with the number of nodes that have modified a key. In theory, this is unbounded. Dynamo uses clock truncation: old entries are pruned based on timestamp. This can cause siblings (false conflicts) but prevents unbounded growth. Most Dynamo-style systems limit vector clock size to a few dozen entries.
Now we can see how Dynamo's components work together to create a highly available, eventually consistent key-value store. Each component addresses a specific challenge:
Component summary:
| Component | Challenge Addressed | Mechanism |
|---|---|---|
| Consistent hashing + vnodes | Data distribution and scaling | Hash ring with multiple virtual nodes per physical node |
| Preference lists | Replica coordination | Ordered list of nodes for each key |
| Sloppy quorums | Write availability during failures | Accept writes on non-preferred nodes |
| Hinted handoff | Data recovery after failures | Forward data to intended node when recovered |
| Anti-entropy / Merkle trees | Long-term consistency | Background comparison and repair of replicas |
| Vector clocks | Conflict detection | Track causality to identify concurrent modifications |
| Read repair | Opportunistic consistency | Fix stale replicas during reads |
A complete write flow:
- The client sends the write to the coordinator (the first healthy node in the key's preference list).
- The coordinator updates the vector clock for the new version and writes it locally.
- The coordinator forwards the write to the remaining N-1 preference-list nodes, substituting healthy stand-ins with hints for any that are down.
- The write succeeds once W nodes have acknowledged.
A complete read flow:
- The client sends the read to the coordinator.
- The coordinator requests the key from the N highest-ranked reachable nodes and waits for R responses.
- If all returned versions are causally related, the newest is returned; if the vector clocks show concurrent versions, all siblings are returned for the application to reconcile.
- Replicas found to be stale are updated via read repair (see the sketch below).
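A highly simplified coordinator-side read with read repair, assuming placeholder `fetch(node, key)` and `repair(node, key, value, vclock)` RPC callbacks and reusing the hypothetical `compare` helper above.

```python
def coordinated_read(replicas, key, r, fetch, repair):
    """Contact replicas, wait for R responses, resolve versions, repair stale copies."""
    responses = []                                    # [(node, value, vclock)]
    for node in replicas:
        result = fetch(node, key)                     # (value, vclock) or None
        if result is not None:
            responses.append((node, result[0], result[1]))
        if len(responses) >= r:
            break
    if len(responses) < r:
        raise RuntimeError("read quorum not met")

    # A version survives if no other response strictly dominates it.
    winners = [(val, vc) for _, val, vc in responses
               if not any(compare(vc, other) == "before" for _, _, other in responses)]

    # Read repair: replicas holding a dominated version get the winners pushed back.
    for node, _, vc in responses:
        if any(compare(vc, wvc) == "before" for _, wvc in winners):
            for val, wvc in winners:
                repair(node, key, val, wvc)

    return winners                                    # one version, or concurrent siblings
```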
Dynamo's design influenced an entire generation of databases. Cassandra adopted its consistency model, Riak implemented many concepts faithfully, and even databases like DynamoDB (the AWS service) evolved from these principles. Understanding Dynamo means understanding the foundation of modern distributed data systems.
We've explored the architectural innovations that made Dynamo revolutionary: an always-writeable design built from consistent hashing with virtual nodes, preference lists, sloppy quorums with hinted handoff, vector clocks for conflict detection, and Merkle-tree anti-entropy for long-term repair.
What's next:
With Dynamo's architecture understood, we'll examine quorum reads and writes in depth. The next page explores how quorum parameters (N, R, W) provide tunable consistency, the mathematics behind quorum intersection, and practical guidance for configuring quorums.
You now understand the Dynamo-style architecture that revolutionized distributed databases. The following pages will explore quorum semantics in depth and conflict resolution strategies that handle the inevitable divergence in leaderless systems.