In 2007, Amazon published a paper titled "Dynamo: Amazon's Highly Available Key-value Store". This paper didn't just describe a database—it laid out a comprehensive architecture for building highly available distributed systems. Its influence cannot be overstated.
Dynamo's innovations addressed real operational pain at Amazon's scale: the company's e-commerce infrastructure needed to survive datacenter failures, handle holiday traffic spikes, and provide millisecond latencies to millions of concurrent shoppers. Traditional databases couldn't meet these requirements.
The resulting "Dynamo-style" architecture became a template, inspiring databases like Cassandra, Riak, Voldemort, and influencing countless proprietary systems. Understanding Dynamo's approach is essential knowledge for any engineer working with distributed systems.
By the end of this page, you will understand the key innovations in Amazon's Dynamo: consistent hashing with virtual nodes, preference lists, sloppy quorums with hinted handoff, anti-entropy using Merkle trees, and the "always writeable" design philosophy. You'll see how these components work together to create a highly available system.
Before diving into Dynamo's architecture, we must understand the operational context that drove its design. Amazon's requirements were specific and demanding, shaped by years of running large-scale e-commerce infrastructure.
The core requirement: "Always writeable"
"Even if disks are failing, network routes are flapping, or entire datacenters are being destroyed, the shopping cart should remain available."
This single requirement shaped every design decision. Availability wasn't just important—it was the primary constraint. Temporary inconsistency was acceptable; losing customer actions was not.
Service Level Agreements (SLAs) at Amazon:
Amazon's internal SLAs specified latency requirements in percentiles, not averages. A representative SLA cited in the Dynamo paper: respond within 300 ms for 99.9% of requests at a peak load of 500 requests per second.
This focus on tail latency influenced Dynamo's design significantly. Techniques that improved average latency but worsened p99 were rejected. Every design decision had to be evaluated at the tail.
| Design dimension | Traditional RDBMS | Dynamo's Choice |
|---|---|---|
| Consistency vs. Availability | Consistency (ACID) | Availability (always writeable) |
| Latency vs. Consistency | Consistency (synchronous replication) | Latency (async, eventual consistency) |
| Simplicity vs. Flexibility | Simple queries, fixed schemas | Simple key-value, flexible conflict resolution |
| Centralization vs. Distribution | Master-based architecture | Fully decentralized peers |
| Operational model | DBA-managed, careful operations | Self-healing, minimal operations |
Dynamo's primary use case was Amazon's shopping cart. The cart must allow additions even during failures—rejecting a customer's 'Add to Cart' click means losing a potential sale. This framing helps explain Dynamo's choices: better to occasionally end up with two divergent versions of the cart that are later merged than to ever refuse the cart operation.
Dynamo's data distribution strategy builds on consistent hashing, but with crucial enhancements that address the limitations of the basic algorithm.
Basic consistent hashing recap: both nodes and keys are hashed onto a circular ring, and each key is stored on the first node encountered walking clockwise from the key's position. Adding or removing a node only remaps keys in that node's immediate neighborhood.
The problem with basic consistent hashing:
With one random ring position per node, the distribution is rarely uniform: some nodes end up owning much larger arcs of the ring (and therefore more keys and more traffic) than others, and when a node joins or leaves, only its immediate neighbor absorbs the change in load. Basic placement is also blind to differences in hardware capacity.
Dynamo's solution: Virtual nodes (vnodes)
Instead of each physical node having one position on the ring, Dynamo assigns multiple "virtual nodes" to each physical node:
Physical Node A → Virtual nodes: v1, v5, v9,  v13, v17
Physical Node B → Virtual nodes: v2, v6, v10, v14, v18
Physical Node C → Virtual nodes: v3, v7, v11, v15, v19
Physical Node D → Virtual nodes: v4, v8, v12, v16, v20
Ring with 20 virtual-node positions, evenly distributed (5 per physical node)
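To make this concrete, here is a minimal Python sketch of a hash ring with virtual nodes. The `HashRing` class, the vnode count, and the MD5-based `ring_hash` helper are illustrative assumptions, not Dynamo's actual implementation.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a ring position (assumption: any stable hash works)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring where each physical node owns several virtual nodes."""

    def __init__(self, nodes, vnodes_per_node=8):
        self._ring = []  # sorted list of (position, physical_node)
        for node in nodes:
            for i in range(vnodes_per_node):
                self._ring.append((ring_hash(f"{node}#vnode{i}"), node))
        self._ring.sort()

    def owner(self, key: str) -> str:
        """Return the physical node owning the first vnode clockwise from the key."""
        pos = ring_hash(key)
        idx = bisect.bisect_left(self._ring, (pos,))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["A", "B", "C", "D"])
print(ring.owner("cart:alice"))  # one of A/B/C/D, stable for a given ring layout
```

Because each physical node owns many small arcs, removing one node redistributes its load across all remaining nodes instead of dumping it on a single neighbor.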
Benefits of virtual nodes:
- When a node fails, its load spreads evenly across the remaining nodes instead of landing on a single neighbor.
- When a node joins, it takes a roughly equal share of load from every existing node.
- Heterogeneous hardware is handled by assigning more virtual nodes to more capable machines.
| Scenario | Vnodes per Node | Trade-off |
|---|---|---|
| Homogeneous cluster | 256 per node | Standard choice, good distribution |
| Mixed hardware | Proportional to CPU/RAM | Capacity-proportional load |
| Large cluster (1000+ nodes) | Lower (64-128) | Reduces metadata size |
| Small cluster (3-5 nodes) | Higher (256-512) | Better distribution with few nodes |
More vnodes means better distribution but larger ring metadata. In Cassandra, each vnode requires ~30 bytes in the ring metadata, which all nodes must store and synchronize. Very large clusters with many vnodes per node can have significant metadata overhead.
For each key, Dynamo maintains a preference list: an ordered list of nodes that should store replicas of that key. The preference list determines which nodes participate in reads and writes, and in what order they're preferred.
Constructing the preference list:
Hash the key to a ring position, then walk clockwise. For each virtual node encountered, add its physical node to the list, skipping vnodes whose physical node is already present, until N distinct physical nodes have been collected.
Example with N=3:
Key K hashes to position 50
Walking clockwise:
- Position 52: vnode owned by Physical Node A → Add A
- Position 55: vnode owned by Physical Node A → Skip (A already included)
- Position 58: vnode owned by Physical Node B → Add B
- Position 61: vnode owned by Physical Node C → Add C
- Preference list complete: [A, B, C]
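The walk above maps directly onto code. A sketch against the hypothetical `HashRing` from earlier (not Dynamo's API):

```python
def preference_list(ring: HashRing, key: str, n: int = 3) -> list:
    """Walk clockwise from the key's position, collecting N distinct physical nodes."""
    pos = ring_hash(key)
    start = bisect.bisect_left(ring._ring, (pos,))
    result = []
    for step in range(len(ring._ring)):
        node = ring._ring[(start + step) % len(ring._ring)][1]
        if node not in result:       # skip vnodes of already-chosen physical nodes
            result.append(node)
        if len(result) == n:
            break
    return result

print(preference_list(ring, "cart:alice", n=3))  # e.g. ['C', 'A', 'D']
```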
Coordinator selection from preference list:
The first node in the preference list is called the coordinator for that key. When a client wants to read or write, it should contact the coordinator first. However, if the coordinator is unavailable, the client can contact any node in the preference list.
Why ordering matters:
The preference list order has practical implications: the first healthy node acts as coordinator and handles most traffic for the key; fallback nodes for sloppy quorums are chosen in ring order after the list; and because every node derives the same ordering from the ring, all participants agree on which replicas to try, and in what sequence, without extra coordination.
Modern Dynamo-style systems like Cassandra don't use explicit 'preference lists' but implement the same concept through their token ring and replication strategies. The idea of an ordered, deterministic set of replicas for each key remains central to the architecture.
Dynamo's "always writeable" requirement led to two interconnected innovations: sloppy quorums and hinted handoff. These mechanisms allow writes to succeed even when the preferred replica nodes are unavailable.
Traditional (strict) quorums:
With strict quorums, a write to a key must reach a quorum (W) of the nodes in that key's preference list:
Preference list for key K: [A, B, C]
Write requires 2 of 3 (quorum)
If A and B are available → Write succeeds
If only A is available → Write FAILS (can't reach quorum)
Sloppy quorums:
Dynamo extends the concept: if preferred nodes are unavailable, the system can write to other nodes outside the preference list:
Preference list for key K: [A, B, C]
Write requires 2 of 3
If only A is available:
→ Write to A (preferred)
→ Write to D (next healthy node on ring, outside preference list)
→ Write succeeds! (sloppy quorum met)
The key insight: durability is preserved (data exists on 2 nodes) even though it's not on the "correct" nodes temporarily.
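A rough sketch of how a coordinator might choose write targets under a sloppy quorum, reusing the hypothetical ring helpers above; `healthy` is a placeholder for failure detection, and the real Dynamo logic is more involved.

```python
def sloppy_write_targets(ring, key, n=3, healthy=lambda node: True):
    """Pick N healthy write targets, preferring the key's preference list.

    A target outside the preference list carries a hint naming the down node
    it stands in for (hinted handoff).
    """
    preferred = preference_list(ring, key, n)
    down = [node for node in preferred if not healthy(node)]
    targets = [(node, None) for node in preferred if healthy(node)]

    # Keep walking the ring past the preference list for healthy stand-ins.
    start = bisect.bisect_left(ring._ring, (ring_hash(key),))
    for step in range(len(ring._ring)):
        if len(targets) == n:
            break
        node = ring._ring[(start + step) % len(ring._ring)][1]
        taken = node in preferred or any(node == t for t, _ in targets)
        if taken or not healthy(node):
            continue
        targets.append((node, down.pop(0) if down else None))
    return targets

# If B is down, the write lands on A, C, and one stand-in holding a hint for B.
print(sloppy_write_targets(ring, "cart:alice", healthy=lambda node: node != "B"))
```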
Hinted handoff mechanics:
When data is written to a non-preferred node (like D in the example above), the write includes a hint—metadata indicating the intended recipient:
Data stored on Node D:
{
key: K,
value: V,
timestamp: T,
hint: {
intended_for: "Node B",
reason: "B unreachable during write"
}
}
Node D periodically checks whether Node B has recovered. Once B is healthy, D delivers the hinted data to B and, after B acknowledges, deletes its local copy.
The handoff process:
[During failure]
Client → A (primary) ✓
Client → B (primary) ✗ (down)
Client → D (handoff) ✓ with hint for B
[After B recovers]
D → B: "Here's data that was for you"
B: Stores data, ACKs
D: Deletes hinted copy
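A minimal sketch of the hint-delivery loop on the stand-in node. The hint layout and the `is_healthy` and `send_to` callbacks are placeholders, not a real Dynamo or Cassandra API.

```python
import time

def deliver_hints(local_hints, is_healthy, send_to, interval_s=30):
    """Periodically try to hand hinted writes back to their intended owners.

    local_hints: list of dicts like {"key": ..., "value": ..., "intended_for": "Node B"}
    send_to(owner, key, value) -> True once the owner has durably stored the write.
    """
    while local_hints:
        for hint in list(local_hints):
            owner = hint["intended_for"]
            if not is_healthy(owner):
                continue                      # owner still down; retry next round
            if send_to(owner, hint["key"], hint["value"]):
                local_hints.remove(hint)      # delivered, drop the hinted copy
        time.sleep(interval_s)
```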
| Aspect | Strict Quorum | Sloppy Quorum |
|---|---|---|
| Write availability | Limited by preference list health | Higher (uses backup nodes) |
| Read consistency | Overlapping quorums (R + W > N) see the latest acknowledged write | Stale reads possible if hints not delivered |
| Data locality | Data always on correct nodes | Temporarily on "wrong" nodes |
| Recovery complexity | Read repair only | Read repair + hint delivery |
| Storage overhead | None | Hints consume storage on backup nodes |
| Best for | Consistency-sensitive applications | Availability-first applications |
Sloppy quorums can violate traditional quorum intersection guarantees. If a write goes to [A, D] (with D hinted) and a subsequent read contacts [B, C], the read won't see the write—there's no overlap. Dynamo accepts this trade-off for availability, relying on anti-entropy to eventually repair such gaps.
Hinted handoff handles transient failures, but what about durable replica divergence? Long failures, disk corruption, or missed updates can cause replicas to drift apart. Dynamo uses anti-entropy with Merkle trees to detect and repair these inconsistencies.
The anti-entropy concept:
Anti-entropy is a background process that periodically compares replicas and synchronizes differences. The challenge is efficiency: comparing every key between replicas would be prohibitively expensive.
Merkle trees for efficient comparison:
A Merkle tree (hash tree) enables comparison of large datasets by comparing only hashes:
Merkle tree structure:
                  [Root Hash]
                 /           \
          [Hash1]             [Hash2]
          /      \            /      \
      [H1]      [H2]      [H3]      [H4]   ← Hash of key ranges
        |         |         |         |
      keys      keys      keys      keys   ← Actual data
Anti-entropy comparison process:
- Two replicas exchange the root hashes of the Merkle trees covering the key range they both store.
- If the root hashes match, the replicas are in sync for that range and no data is transferred.
- If they differ, the nodes walk down the tree, comparing child hashes to narrow in on the leaf ranges that disagree.
- Only the keys in the differing leaf ranges are exchanged and repaired.
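A toy Python illustration of the idea, assuming both replicas bucket keys into the same fixed leaf ranges; real implementations hash serialized key ranges and build much deeper trees.

```python
import hashlib

def leaf_hashes(data: dict, num_buckets: int = 4) -> list:
    """Bucket keys into fixed ranges and hash each bucket's contents."""
    buckets = [[] for _ in range(num_buckets)]
    for key in sorted(data):
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
        buckets[idx].append(f"{key}={data[key]}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]

def merkle_root(leaves: list) -> str:
    """Hash pairs of nodes upward until a single root hash remains."""
    level = leaves
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            right = level[i + 1] if i + 1 < len(level) else level[i]
            nxt.append(hashlib.sha256((level[i] + right).encode()).hexdigest())
        level = nxt
    return level[0]

def diverging_buckets(a: dict, b: dict, num_buckets: int = 4) -> list:
    """Compare roots first; only on mismatch, compare leaves to find bad ranges."""
    la, lb = leaf_hashes(a, num_buckets), leaf_hashes(b, num_buckets)
    if merkle_root(la) == merkle_root(lb):
        return []                    # replicas agree, nothing to transfer
    return [i for i, (x, y) in enumerate(zip(la, lb)) if x != y]

replica_1 = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_2 = {"k1": "v1", "k2": "v2-stale", "k3": "v3"}
print(diverging_buckets(replica_1, replica_2))  # only k2's bucket needs repair
```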
Merkle tree challenges in Dynamo:
Challenge 1: Tree maintenance. Every write must update leaf and interior hashes (or trees must be rebuilt periodically), adding CPU and I/O cost on top of the write path.
Challenge 2: Key range stability. When nodes join or leave, the key ranges each node owns change, invalidating existing trees; the Dynamo paper notes the affected trees must be recalculated.
Challenge 3: Comparison frequency. Running anti-entropy too often wastes network and disk bandwidth; running it too rarely lets replicas diverge for long periods.
Modern implementations (like Cassandra):
The original Dynamo maintained Merkle trees continuously. Modern systems like Cassandra offer choice: 'full repair' compares all data, 'incremental repair' tracks what's changed since last repair. Most production deployments run repairs on schedules (daily, weekly) rather than continuously.
When multiple replicas can accept writes independently, concurrent modifications can create conflicting versions. Dynamo uses vector clocks (a form of version vectors) to track causality and detect conflicts.
Why simple timestamps fail:
Consider two concurrent writes at physical time T:
Client 1 @ Node A: SET key=X, value="apple", time=1000
Client 2 @ Node B: SET key=X, value="banana", time=1000
Which write happened first? We can't know—the timestamps are identical.
Even if different by milliseconds, clock skew means we can't trust ordering.
Vector clocks solution:
Instead of a single timestamp, track a vector of per-node counters:
Initial state: X = {value: null, vclock: []}
Client 1 writes via Node A:
X = {value: "apple", vclock: [{A: 1}]}
Node B receives, client 2 writes:
X = {value: "banana", vclock: [{A: 1}, {B: 1}]} // Builds on A:1
OR if client 2 hadn't seen A's update:
X = {value: "banana", vclock: [{B: 1}]} // Independent
Comparing vector clocks:
Given two vector clocks V1 and V2:
V1 < V2 (V1 happened before V2): every counter in V1 is ≤ its counterpart in V2, and at least one is strictly smaller
V1 > V2 (V1 happened after V2): every counter in V1 is ≥ its counterpart in V2, and at least one is strictly larger
V1 || V2 (concurrent, conflict): each clock has at least one counter greater than the other's
Example comparison:
V1 = [{A: 2, B: 1}]
V2 = [{A: 1, B: 2}]
A: 2 > 1 (V1 has more from A)
B: 1 < 2 (V2 has more from B)
Result: V1 || V2 (CONCURRENT)
Conflict detected—application must resolve
| Operation | Action | Example |
|---|---|---|
| Write (new) | Add {node: 1} | {} → [{A: 1}] |
| Write (update) | Increment own counter | [{A: 1}] → [{A: 2}] (via A) |
| Merge (no conflict) | Take max of each counter | [{A: 2}] ⊔ [{A: 1, B: 1}] = [{A: 2, B: 1}] |
| Merge (conflict) | Keep both versions for resolution | [{A: 2}] || [{B: 1}] = [CONFLICT] |
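A compact sketch of vector clock comparison and merge, representing clocks as plain dicts from node name to counter; this is illustrative, not Dynamo's wire format.

```python
def compare(v1: dict, v2: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for two vector clocks."""
    nodes = set(v1) | set(v2)
    dominated = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)   # v1 <= v2 everywhere
    dominates = all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)   # v1 >= v2 everywhere
    if dominated and dominates:
        return "equal"
    if dominated:
        return "before"
    if dominates:
        return "after"
    return "concurrent"            # conflict: application must reconcile

def merge(v1: dict, v2: dict) -> dict:
    """Element-wise max: the smallest clock that dominates both inputs."""
    return {n: max(v1.get(n, 0), v2.get(n, 0)) for n in set(v1) | set(v2)}

print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))   # concurrent
print(merge({"A": 2}, {"A": 1, "B": 1}))             # {'A': 2, 'B': 1}
```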
Vector clocks grow with the number of nodes that have modified a key. In theory, this is unbounded. Dynamo uses clock truncation: old entries are pruned based on timestamp. This can cause siblings (false conflicts) but prevents unbounded growth. Most Dynamo-style systems limit vector clock size to a few dozen entries.
Now we can see how Dynamo's components work together to create a highly available, eventually consistent key-value store. Each component addresses a specific challenge:
Component summary:
| Component | Challenge Addressed | Mechanism |
|---|---|---|
| Consistent hashing + vnodes | Data distribution and scaling | Hash ring with multiple virtual nodes per physical node |
| Preference lists | Replica coordination | Ordered list of nodes for each key |
| Sloppy quorums | Write availability during failures | Accept writes on non-preferred nodes |
| Hinted handoff | Data recovery after failures | Forward data to intended node when recovered |
| Anti-entropy / Merkle trees | Long-term consistency | Background comparison and repair of replicas |
| Vector clocks | Conflict detection | Track causality to identify concurrent modifications |
| Read repair | Opportunistic consistency | Fix stale replicas during reads |
A complete write flow:
- The client sends the write to the coordinator (the first healthy node in the key's preference list).
- The coordinator updates the vector clock for the new version and writes it locally.
- The coordinator forwards the write to the remaining N-1 preference-list nodes, substituting healthy stand-ins with hints for any that are down.
- The write succeeds once W nodes have acknowledged.
A complete read flow:
- The client sends the read to the coordinator.
- The coordinator requests the key from the N highest-ranked reachable nodes and waits for R responses.
- If all returned versions are causally related, the newest is returned; if the vector clocks show concurrent versions, all siblings are returned for the application to reconcile.
- Replicas found to be stale are updated via read repair (see the sketch below).
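A highly simplified coordinator-side read with read repair, assuming placeholder `fetch(node, key)` and `repair(node, key, value, vclock)` RPC callbacks and reusing the hypothetical `compare` helper above.

```python
def coordinated_read(replicas, key, r, fetch, repair):
    """Contact replicas, wait for R responses, resolve versions, repair stale copies."""
    responses = []                                    # [(node, value, vclock)]
    for node in replicas:
        result = fetch(node, key)                     # (value, vclock) or None
        if result is not None:
            responses.append((node, result[0], result[1]))
        if len(responses) >= r:
            break
    if len(responses) < r:
        raise RuntimeError("read quorum not met")

    # A version survives if no other response strictly dominates it.
    winners = [(val, vc) for _, val, vc in responses
               if not any(compare(vc, other) == "before" for _, _, other in responses)]

    # Read repair: replicas holding a dominated version get the winners pushed back.
    for node, _, vc in responses:
        if any(compare(vc, wvc) == "before" for _, wvc in winners):
            for val, wvc in winners:
                repair(node, key, val, wvc)

    return winners                                    # one version, or concurrent siblings
```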
Dynamo's design influenced an entire generation of databases. Cassandra adopted its consistency model, Riak implemented many concepts faithfully, and even databases like DynamoDB (the AWS service) evolved from these principles. Understanding Dynamo means understanding the foundation of modern distributed data systems.
We've explored the architectural innovations that made Dynamo revolutionary: an always-writeable design built from consistent hashing with virtual nodes, preference lists, sloppy quorums with hinted handoff, vector clocks for conflict detection, and Merkle-tree anti-entropy for long-term repair.
What's next:
With Dynamo's architecture understood, we'll examine quorum reads and writes in depth. The next page explores how quorum parameters (N, R, W) provide tunable consistency, the mathematics behind quorum intersection, and practical guidance for configuring quorums.
You now understand the Dynamo-style architecture that revolutionized distributed databases. The following pages will explore quorum semantics in depth and conflict resolution strategies that handle the inevitable divergence in leaderless systems.