In 1997, David Karger and his colleagues at MIT published a paper that would fundamentally reshape how distributed systems handle data placement. Consistent hashing—the algorithm they introduced—solved a problem that had plagued distributed systems: how to redistribute data when servers are added or removed without requiring massive data movement.
The elegance of consistent hashing lies in its guarantee: when a cluster changes from N nodes to N+1 nodes, only 1/(N+1) of the keys need to be redistributed, rather than nearly all of them. This property transforms rebalancing from an expensive, disruptive operation into a manageable, incremental process.
By the end of this page, you will understand the mathematical foundations of consistent hashing, why traditional hashing fails for distributed systems, how virtual nodes improve load distribution, the implementation details of consistent hashing rings, and how production systems like DynamoDB, Cassandra, and Riak apply these concepts.
Before understanding why consistent hashing matters, we must understand why naive approaches fail. Consider the simplest approach to distributing data across N servers:
Modulo Hashing:
server = hash(key) % N
This approach seems reasonable: hash the key to get a number, take modulo N to get a server index. Data is evenly distributed (assuming a good hash function), and lookups are O(1).
The Catastrophic Failure Mode:
The problem emerges when the cluster size changes. Consider a cluster with 3 servers:
| Key | Hash Value | Server (hash % 3) |
|---|---|---|
| user:100 | 297 | 0 |
| user:101 | 432 | 0 |
| user:102 | 156 | 0 |
| user:103 | 847 | 1 |
| user:104 | 593 | 2 |
| user:105 | 721 | 1 |
Now add one server (N=4). The same keys map to different servers:
| Key | Hash Value | Server (hash % 4) | Moved? |
|---|---|---|---|
| user:100 | 297 | 1 | Yes (was 0) |
| user:101 | 432 | 0 | No |
| user:102 | 156 | 0 | No |
| user:103 | 847 | 3 | Yes (was 1) |
| user:104 | 593 | 1 | Yes (was 2) |
| user:105 | 721 | 1 | No |
The Mathematics of Modulo Redistribution:
When changing from N to N+1 servers, the probability that a key stays on the same server is:
P(same server) = 1/(N+1) (approximately)
This means nearly all keys must move. For a 3-server to 4-server transition, roughly 75% of keys relocate; even the six-key sample above moves half its keys.
This is catastrophic for large-scale systems. Adding a single server to a 100-node cluster storing 100 TB of data would require moving ~99 TB of data.
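The near-total reshuffle is easy to verify empirically. The sketch below (illustrative; the `stable_hash` helper and `user:N` key format are arbitrary choices, not part of any real system) counts how many keys change servers when a modulo-hashed cluster grows from 3 to 4 nodes:

```python
import hashlib

def stable_hash(key: str) -> int:
    # Stable across processes, unlike Python's salted built-in hash()
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def fraction_moved(num_keys: int, n_before: int, n_after: int) -> float:
    """Fraction of keys whose modulo assignment changes on resize."""
    moved = 0
    for i in range(num_keys):
        h = stable_hash(f"user:{i}")
        if h % n_before != h % n_after:
            moved += 1
    return moved / num_keys

# Growing from 3 to 4 servers relocates roughly 3 of every 4 keys
print(f"{fraction_moved(100_000, 3, 4):.1%}")
```

The measured fraction lands very close to the theoretical 1 - 1/(N+1) = 75%.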
The Real-World Impact:
Modulo hashing appears elegant in prototype systems but fails catastrophically at scale. Many teams have learned this lesson the hard way when their first cluster expansion triggered multi-hour outages. Always use consistent hashing for distributed data placement.
Consistent hashing solves the redistribution problem by fundamentally changing how we think about key-to-server mapping. Instead of using modulo arithmetic, we conceptualize both keys and servers as points on a circle (ring).
The Hash Ring Concept:
Visual Representation:
                 0°
                  |
        S3 ●------+------● S1
           /             \
          /               \
         /                 \
 270° --●                   ●-- 90°
         \      K1 ✕       /
          \     K2 ✕      /
           \    K3 ✕     /
            \           /
             \         /
          S2 ●---------●
                  |
                180°
Keys K1, K2, K3 are all assigned to Server S2
(the first server clockwise from each key's position)
The Key Insight: Minimal Redistribution
When a server is added or removed, only the keys between the affected server and its predecessor need to move:
Adding Server S4:
Before: ... ---[ S2 ]-----[ S3 ]--- ...
Keys K1, K2, K3 belong to S3
After: ... ---[ S2 ]--[ S4 ]--[ S3 ]--- ...
Keys K1, K2 now belong to S4
Key K3 still belongs to S3
Only keys between S2 and S4 need to move from S3 to S4. All other keys remain unchanged.
Mathematical Guarantee:
When changing from N to N+1 servers:
Expected keys to move = K / (N + 1)
Where K = total number of keys
This is a dramatic improvement over modulo hashing, where nearly all keys must move.
Consistent hashing provides O(K/N) redistribution, which is mathematically optimal. You cannot move fewer keys on average while maintaining balanced distribution. This optimality is why consistent hashing has become ubiquitous in distributed systems.
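The O(K/N) guarantee can be checked with a short simulation. This sketch (the `make_ring` and `owner` helpers, the `server-*` names, and the 100-vnode count are all illustrative choices) builds a small ring with virtual nodes, adds one node, and measures how many keys change owners:

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def make_ring(nodes, vnodes=100):
    """Sorted list of (position, node) pairs, one entry per virtual node."""
    ring = [(ring_hash(f"{n}:vnode:{i}"), n) for n in nodes for i in range(vnodes)]
    ring.sort()
    return ring

def owner(ring, positions, key):
    # First virtual node clockwise from the key's position, wrapping at the end
    idx = bisect.bisect_left(positions, ring_hash(key)) % len(ring)
    return ring[idx][1]

nodes = [f"server-{i}" for i in range(10)]
before = make_ring(nodes)
after = make_ring(nodes + ["server-new"])
pos_before = [p for p, _ in before]
pos_after = [p for p, _ in after]

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(owner(before, pos_before, k) != owner(after, pos_after, k) for k in keys)
print(f"{moved / len(keys):.1%} of keys moved")  # close to 1/11, about 9%
```

Note that every key that moves lands on the new node; adding a node never shuffles keys between existing nodes.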
While basic consistent hashing solves the redistribution problem, it introduces a new challenge: uneven load distribution. With only a few physical servers on the ring, some servers may own larger portions of the key space than others simply due to the random nature of hash function outputs.
The Imbalance Problem:
Consider 3 servers placed on a ring by hashing their identifiers:
Server A: position 10°
Server B: position 170°
Server C: position 200°
Key space distribution:
- Server A: 170° (from 200° to 10°, wrapping around)
- Server B: 160° (from 10° to 170°)
- Server C: 30° (from 170° to 200°)
Server A handles 5.7x more keys than Server C!
The Virtual Node Solution:
Instead of placing each physical server once on the ring, place it multiple times using different hash inputs:
Physical Server A → Virtual Nodes A-1, A-2, A-3, ..., A-100
Physical Server B → Virtual Nodes B-1, B-2, B-3, ..., B-100
Physical Server C → Virtual Nodes C-1, C-2, C-3, ..., C-100
Each virtual node is hashed to its own position on the ring. A key is assigned to the virtual node encountered first clockwise, which maps to a physical server.
| Virtual Nodes per Server | Expected Std Dev of Load | Max/Min Load Ratio |
|---|---|---|
| 1 | ~50% of mean | 5x - 10x |
| 10 | ~16% of mean | 1.5x - 2x |
| 100 | ~5% of mean | 1.1x - 1.2x |
| 500 | ~2% of mean | ~1.05x |
| 1000 | ~1.5% of mean | ~1.03x |
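The trend in the table above is easy to reproduce. This sketch (server and key naming, and the 20,000-key sample size, are arbitrary) measures per-server load spread for 1 versus 100 virtual nodes:

```python
import bisect
import hashlib
from statistics import mean, pstdev

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def load_spread(num_servers: int, vnodes: int, num_keys: int = 20_000) -> float:
    """Std dev of per-server key counts, as a fraction of the mean."""
    ring = sorted(
        (ring_hash(f"server-{s}:vnode:{v}"), s)
        for s in range(num_servers)
        for v in range(vnodes)
    )
    positions = [p for p, _ in ring]
    counts = [0] * num_servers
    for k in range(num_keys):
        idx = bisect.bisect_left(positions, ring_hash(f"key:{k}")) % len(ring)
        counts[ring[idx][1]] += 1
    return pstdev(counts) / mean(counts)

print(f"  1 vnode/server: {load_spread(10, 1):.0%} spread")
print(f"100 vnodes/server: {load_spread(10, 100):.0%} spread")
```

With 100 virtual nodes per server the spread drops by roughly an order of magnitude, matching the table.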
Virtual Nodes in Practice:
Benefits Beyond Load Balancing:
Heterogeneous Hardware: Assign more virtual nodes to more powerful servers
Graceful Addition/Removal: When adding a server, its virtual nodes are spread across the ring, taking a small portion from each existing server rather than a large portion from one
Finer Replication Control: Each virtual node can be replicated independently, enabling rack-aware or zone-aware placement
Implementation Considerations:
RING_SIZE = 2**32  # Number of positions on the ring

def get_virtual_nodes(server_id, num_vnodes=100):
    virtual_nodes = []
    for i in range(num_vnodes):
        vnode_id = f"{server_id}:vnode-{i}"
        # In practice use a stable hash (e.g. hashlib), not Python's
        # built-in hash(), which is salted per process
        position = hash(vnode_id) % RING_SIZE
        virtual_nodes.append((position, server_id))
    return virtual_nodes
def build_ring(servers, vnodes_per_server=100):
    ring = []
    for server in servers:
        ring.extend(get_virtual_nodes(server, vnodes_per_server))
    ring.sort(key=lambda x: x[0])  # Sort by position
    return ring
More virtual nodes mean better load balance but higher memory usage for the ring data structure. A cluster with 100 servers and 1000 virtual nodes each requires storing 100,000 ring entries. For most systems, 100-500 virtual nodes per server provides a good balance.
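A sorted ring like the one `build_ring` returns can be queried in O(log V) time with the standard library's `bisect` module rather than a hand-rolled binary search. A minimal lookup sketch (the `ring_hash` helper and server names are illustrative; it assumes the ring holds `(position, server_id)` pairs sorted by position):

```python
import bisect
import hashlib

RING_SIZE = 2**32

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING_SIZE

def lookup(ring, key):
    """ring: list of (position, server_id) pairs sorted by position."""
    positions = [pos for pos, _ in ring]  # cache this if the ring is static
    # First position >= the key's hash; wrap to index 0 past the end
    idx = bisect.bisect_left(positions, ring_hash(key)) % len(ring)
    return ring[idx][1]

ring = sorted(
    (ring_hash(f"{s}:vnode:{i}"), s) for s in ["a", "b", "c"] for i in range(100)
)
print(lookup(ring, "user:42"))  # one of 'a', 'b', 'c'
```

Rebuilding the `positions` list per call is wasteful; a production implementation would keep it alongside the ring and refresh both on membership changes.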
Implementing consistent hashing correctly requires attention to several details: efficient lookup, atomic updates, and proper handling of edge cases.
Core Data Structures:
The ring is typically implemented as a sorted array or balanced tree of (position, server) pairs:
class ConsistentHashRing:
    def __init__(self, nodes=None, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # Sorted list of (position, node_id)
        self.nodes = set()
        if nodes:
            for node in nodes:
                self.add_node(node)

    def _hash(self, key):
        """Hash function: use a cryptographic hash for uniform distribution"""
        import hashlib
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add_node(self, node_id):
        """Add a node to the ring with its virtual nodes"""
        if node_id in self.nodes:
            return
        self.nodes.add(node_id)
        for i in range(self.vnodes):
            vnode_key = f"{node_id}:vnode:{i}"
            position = self._hash(vnode_key)
            self.ring.append((position, node_id))
        self.ring.sort(key=lambda x: x[0])

    def remove_node(self, node_id):
        """Remove a node and all its virtual nodes from the ring"""
        if node_id not in self.nodes:
            return
        self.nodes.discard(node_id)
        self.ring = [(pos, nid) for pos, nid in self.ring if nid != node_id]
    def get_node(self, key):
        """Find the node responsible for a given key"""
        if not self.ring:
            return None
        key_hash = self._hash(key)
        # Binary search for the first node with position >= key_hash
        left, right = 0, len(self.ring)
        while left < right:
            mid = (left + right) // 2
            if self.ring[mid][0] < key_hash:
                left = mid + 1
            else:
                right = mid
        # If we've gone past the end, wrap to the first node (ring property)
        if left == len(self.ring):
            left = 0
        return self.ring[left][1]

    def get_nodes(self, key, count=3):
        """Get multiple nodes for replication (e.g., for quorum writes)"""
        if not self.ring or count <= 0:
            return []
        key_hash = self._hash(key)
        # Find starting position
        left, right = 0, len(self.ring)
        while left < right:
            mid = (left + right) // 2
            if self.ring[mid][0] < key_hash:
                left = mid + 1
            else:
                right = mid
        # Collect unique nodes walking clockwise
        result = []
        seen = set()
        idx = left if left < len(self.ring) else 0
        while len(result) < count and len(seen) < len(self.nodes):
            node_id = self.ring[idx][1]
            if node_id not in seen:
                result.append(node_id)
                seen.add(node_id)
            idx = (idx + 1) % len(self.ring)
        return result

Optimization Techniques:
Thread Safety:
import threading

class ThreadSafeConsistentHashRing:
    def __init__(self):
        self._ring = ConsistentHashRing()
        # Python's standard library has no readers-writer lock; a plain Lock
        # is correct, though a third-party RW lock would allow concurrent reads
        self._lock = threading.Lock()

    def get_node(self, key):
        with self._lock:
            return self._ring.get_node(key)

    def add_node(self, node_id):
        with self._lock:
            self._ring.add_node(node_id)
Use a hash function with uniform output distribution (MD5, SHA-1, MurmurHash3). Poor hash functions cause clustering on the ring, negating the benefits of consistent hashing. Cryptographic strength is not required here, only uniformity, which is why performance-critical systems often prefer the faster MurmurHash3 or xxHash.
Consistent hashing naturally extends to support replication by walking further around the ring to find additional replica locations.
The N Replicas Pattern:
For replication factor N, a key is stored on N consecutive distinct physical servers walking clockwise:
Key K at position 100°
Replication factor N = 3
Walk clockwise, collecting distinct physical servers:
- Position 110°: Virtual node of Server A → Server A (replica 1)
- Position 125°: Virtual node of Server A → Skip (already have A)
- Position 140°: Virtual node of Server B → Server B (replica 2)
- Position 155°: Virtual node of Server C → Server C (replica 3)
Key K is replicated to Servers A, B, C
Rack-Aware Replica Placement:
In production environments, replicas should span failure domains (racks, availability zones):
def get_replicas_rack_aware(key, ring, replication_factor, rack_map):
    """
    Select replicas ensuring they're on different racks.
    rack_map: {server_id: rack_id}
    Assumes ring._find_position returns the ring index of the first
    position >= the key's hash, wrapping to 0 past the end.
    """
    key_hash = ring._hash(key)
    start_idx = ring._find_position(key_hash)
    available_racks = len(set(rack_map.values()))
    replicas = []
    racks_used = set()
    idx = start_idx
    while len(replicas) < replication_factor:
        node_id = ring.ring[idx][1]
        rack = rack_map.get(node_id)
        # Only add if this is a new rack (or we've exhausted rack diversity)
        if rack not in racks_used or len(racks_used) >= available_racks:
            if node_id not in replicas:
                replicas.append(node_id)
                racks_used.add(rack)
        idx = (idx + 1) % len(ring.ring)
        # Prevent infinite loop if we can't find enough diversity
        if idx == start_idx:
            break
    return replicas
Preference Lists:
Systems like Dynamo maintain preference lists—ordered lists of nodes that should store each key. The first N healthy nodes in the preference list hold the replicas:
Preference list for key K: [A, B, C, D, E]
Replication factor: 3
Normal case: Replicas on A, B, C
If A is down: Replicas on B, C, D (D temporarily holds A's replica)
If A and B are down: Replicas on C, D, E
When primary replica nodes are unavailable, writes can go to 'sloppy' replicas (the next nodes on the preference list). The stand-in node stores a hint recording the intended owner and forwards the data when the primary recovers. This is called 'hinted handoff.'
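The preference-list behavior above can be sketched in a few lines (the node names and the `healthy`-set interface here are illustrative, not Dynamo's actual API):

```python
def replicas_for(preference_list, healthy, n=3):
    """Pick the first n healthy nodes from the preference list.

    Nodes standing in for a down primary are the 'sloppy' replicas; a real
    system would also store a hint and hand the data off when the primary
    returns (hinted handoff)."""
    return [node for node in preference_list if node in healthy][:n]

pref = ["A", "B", "C", "D", "E"]
print(replicas_for(pref, {"A", "B", "C", "D", "E"}))  # ['A', 'B', 'C']
print(replicas_for(pref, {"B", "C", "D", "E"}))       # ['B', 'C', 'D'], D stands in for A
print(replicas_for(pref, {"C", "D", "E"}))            # ['C', 'D', 'E']
```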
The true power of consistent hashing emerges when cluster membership changes. Properly handling additions and removals is crucial for maintaining availability and data integrity.
Adding a New Node:
When a new node joins the cluster:
| Step | Duration | System State | Risk Level |
|---|---|---|---|
| Ring Position Calculation | Milliseconds | New node determines its positions | None |
| Gossip/Discovery | Seconds | Cluster learns about new node | Low |
| Data Streaming | Minutes to hours | Data flows from existing nodes | Medium |
| Read Activation | Immediate after sync | New node serves reads for owned ranges | Low |
| Write Activation | After full sync | New node accepts writes | Medium |
| Old Data Cleanup | Hours after stabilization | Previous owners delete transferred data | Low |
Removing a Node (Planned):
Planned removal (decommissioning) follows an orderly process: the departing node streams its data to the nodes that will inherit its ranges, and it leaves the ring only after the transfer completes.
Removing a Node (Failure):
Unplanned removal requires different handling: the failed node cannot stream its data away, so the surviving replicas must re-replicate the lost ranges to restore the replication factor.
Key Insight: Minimal Movement
In both cases, only data owned by the changing node moves:
Adding Node X with 100 virtual nodes to a 10-node cluster:
- Each existing node loses ~10% of its data to X
- Total data movement: ~10% of entire dataset
- Compare to modulo: ~100% would move
When a new node joins and begins streaming data, it can overwhelm the network and source nodes. Always throttle streaming throughput and add nodes one at a time with stabilization periods between additions.
Consistent hashing underpins many of the world's most scalable data systems. Understanding how production systems apply these concepts provides practical insights.
Amazon DynamoDB:
DynamoDB, inspired by the original Dynamo paper, uses consistent hashing as its core distribution mechanism:
| System | Virtual Nodes | Replication Model | Notable Feature |
|---|---|---|---|
| Apache Cassandra | 256 default (configurable) | Tunable RF, rack-aware | Token ring, multiple tokens per node |
| Amazon DynamoDB | Managed (variable) | 3 replicas across AZs | Automatic adaptive scaling |
| Riak | 64 default | Configurable RF | True leaderless, CRDTs |
| Voldemort | Configurable | Configurable RF | LinkedIn's key-value store |
| ScyllaDB | 256 default | NetworkTopologyStrategy | Cassandra-compatible, C++ |
Apache Cassandra's Token Ring:
Cassandra's implementation is particularly instructive:
-- Cassandra keyspace with replication
CREATE KEYSPACE my_app WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'eu-west': 3
};
Memcached and Redis Clusters:
Cache clusters also use consistent hashing: memcached clients commonly hash keys onto a server ring (the widely used ketama scheme) so that adding or removing a cache node invalidates only a small fraction of cached entries.
CDN Edge Servers:
Content Delivery Networks use consistent hashing to route requests to edge servers, ensuring the same content is cached on predictable servers.
Redis Cluster uses a hash slot model (0-16383) instead of pure consistent hashing. This simplifies resharding to slot reassignment rather than ring reconfiguration. Both approaches achieve minimal redistribution; the tradeoff is implementation complexity vs. flexibility.
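Redis Cluster's slot mapping is simple enough to sketch: the slot is CRC16 of the key, modulo 16384, using the XMODEM CRC16 variant (the real implementation also honors `{hash tag}` syntax, omitted here for brevity):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM: polynomial 0x1021, init 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Redis Cluster maps every key to one of 16384 slots
    return crc16_xmodem(key.encode()) % 16384

print(key_slot("user:100"))  # a deterministic slot in [0, 16383]
```

Because the slot count is fixed, resharding is a matter of reassigning slot ranges between nodes, with no ring positions to recompute.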
Consistent hashing is the algorithmic foundation that makes modern distributed database rebalancing possible. The key insights from this page: modulo hashing forces nearly all keys to move on every resize, while consistent hashing moves only ~K/(N+1); virtual nodes smooth out load imbalance and enable heterogeneous hardware and rack-aware placement; and replication falls out naturally by walking the ring to successive distinct physical servers.
What's Next:
With the algorithmic foundation established, the next page explores the challenges of online resharding—the complexities that arise when resharding must occur while the system continues serving traffic. We'll examine consistency challenges, coordination protocols, and failure scenarios that make online resharding one of the most demanding operations in distributed systems.
You now have a deep understanding of consistent hashing—from its mathematical guarantees to its implementation details to its application in production systems. This knowledge is essential for any distributed systems engineer.