In 1997, David Karger and his colleagues at MIT published a paper titled "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web." This paper introduced an algorithm that would become foundational to distributed systems: consistent hashing.
The problem Karger addressed was precisely what we explored in the previous page: how do you distribute cache data across multiple servers in a way that minimizes disruption when servers are added or removed? The naive modulo approach fails catastrophically because nearly all keys remap on topology changes. Consistent hashing solves this elegantly by ensuring that only K/N keys need to move when adding or removing a node (where K is the total number of keys and N is the number of nodes).
Today, consistent hashing is ubiquitous. It underpins Amazon's Dynamo, Apache Cassandra, Discord's message storage, Akamai's CDN, and virtually every distributed cache including Memcached and Redis Cluster. Understanding consistent hashing deeply—not just the concept but the implementation details, edge cases, and optimizations—is essential for any engineer working on distributed systems.
This page provides that deep understanding. We'll build consistent hashing from first principles, analyze its mathematical properties, implement it in code, and explore the optimizations that make it production-ready.
By the end of this page, you will understand the consistent hashing algorithm and its mathematical properties, implement consistent hashing with virtual nodes, analyze why consistent hashing achieves minimal key movement, explore real-world implementations and optimizations, and recognize when and how to apply consistent hashing in system design.
The core insight of consistent hashing is to treat the hash space as a ring (or circle) rather than a linear range. Both keys and nodes are hashed onto this ring, and each key is assigned to the first node encountered when traversing the ring clockwise from the key's position.
Visualizing the Hash Ring
Imagine a circle where positions represent all possible hash values from 0 to 2³² - 1 (for a 32-bit hash). This circle is the hash ring:
```
            0°
            │
 270° ──────┼────── 90°
            │
           180°
```
Nodes are hashed onto this ring using their identifiers (hostname, IP address, etc.):
hash("node-A") → position 45° on ring
hash("node-B") → position 150° on ring
hash("node-C") → position 280° on ring
Keys are similarly hashed:
hash("user:123") → position 80° on ring
Key-to-Node Assignment
To find which node owns a key, we start at the key's position and walk clockwise until we hit a node. That node owns the key.
```python
import bisect
import hashlib
from typing import Optional


class BasicHashRing:
    """
    Basic consistent hashing implementation using a hash ring.
    """

    def __init__(self):
        # Maps hash position -> node name
        self.ring: dict[int, str] = {}
        # Sorted list of positions for binary search
        self.sorted_positions: list[int] = []

    def _hash(self, key: str) -> int:
        """Generate a 32-bit hash value for a key."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add_node(self, node: str) -> None:
        """Add a node to the ring."""
        position = self._hash(node)
        self.ring[position] = node
        self.sorted_positions = sorted(self.ring.keys())

    def remove_node(self, node: str) -> None:
        """Remove a node from the ring."""
        position = self._hash(node)
        if position in self.ring:
            del self.ring[position]
            self.sorted_positions = sorted(self.ring.keys())

    def get_node(self, key: str) -> Optional[str]:
        """Find the node responsible for a key."""
        if not self.ring:
            return None
        key_hash = self._hash(key)
        # Binary search for the first position >= key_hash
        idx = bisect.bisect_left(self.sorted_positions, key_hash)
        # Wrap around if we're past the last position
        if idx >= len(self.sorted_positions):
            idx = 0
        position = self.sorted_positions[idx]
        return self.ring[position]


# Example usage
ring = BasicHashRing()
ring.add_node("cache-server-1")
ring.add_node("cache-server-2")
ring.add_node("cache-server-3")

# Distribute some keys
keys = ["user:100", "product:555", "session:abc", "order:999"]
for key in keys:
    node = ring.get_node(key)
    print(f"{key} -> {node}")
```
Why the Ring Works
The ring structure provides the crucial property that modulo hashing lacks: locality of change. When you add or remove a node:
Adding a node: The new node appears at one position on the ring. It takes ownership of keys that were previously assigned to the next node clockwise. Only those keys move.
Removing a node: The removed node's keys move to the next node clockwise. No other keys are affected.
In both cases, only the keys in the affected arc of the ring move. This is fundamentally different from modulo hashing, where the entire key space reshuffles.
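This locality is easy to verify empirically. The following self-contained sketch (node and key names are illustrative) counts how many keys actually change owners when a fourth node joins a three-node ring; note that every key that moves lands on the new node, never on another existing node:

```python
import bisect
import hashlib

def h(s: str) -> int:
    """32-bit ring position from MD5."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def owner(key: str, nodes: list[str]) -> str:
    """First node clockwise from the key's position."""
    ring = sorted((h(n), n) for n in nodes)
    positions = [p for p, _ in ring]
    idx = bisect.bisect_left(positions, h(key)) % len(ring)
    return ring[idx][1]

nodes = ["node-A", "node-B", "node-C"]
keys = [f"key-{i}" for i in range(10000)]

before = {k: owner(k, nodes) for k in keys}
after = {k: owner(k, nodes + ["node-D"]) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved, all to the new node")
```

With modulo hashing the same experiment would remap nearly every key; here only the keys in node-D's new arc move.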
The 'clockwise walk' to find a node is typically implemented using binary search on the sorted list of node positions. This gives O(log N) lookup time, where N is the number of nodes. With virtual nodes, N becomes the total number of virtual nodes, but even with 1000 virtual nodes, log₂(1000) ≈ 10 comparisons is negligible.
Understanding the mathematical properties of consistent hashing explains why it works and guides parameter tuning.
Key Distribution
With a uniform hash function and N nodes on the ring, each node owns approximately 1/N of the key space. However, with only N physical positions on the ring, the variance is high.
- Expected arc length per node: 1/N of the ring
- Standard deviation of arc length: ~1/N (the same order of magnitude!)
This means with 3 nodes, you might see load distributions like 50%/30%/20% rather than the ideal 33%/33%/33%. This variance is unacceptable for production systems.
| Number of Nodes | Expected Load per Node | Typical Max Deviation | Worst Case |
|---|---|---|---|
| 3 | 33.3% | ±15% | One node at 50%+ |
| 10 | 10% | ±5% | One node at 20%+ |
| 100 | 1% | ±0.5% | Acceptable variance |
Virtual Nodes Reduce Variance
The solution is virtual nodes (vnodes). Instead of placing each physical node at one position, we place it at V positions (virtual nodes). Each physical node owns V arcs on the ring.
With V virtual nodes per physical node, each physical node owns V separate arcs scattered around the ring. Summing V roughly independent arc lengths averages out the extremes, shrinking the relative standard deviation of a node's load by a factor of about √V. The simulation below demonstrates this variance reduction directly:
```python
import bisect
import random

def simulate_load_distribution(num_nodes: int, vnodes_per_node: int,
                               num_keys: int = 100000):
    """
    Simulate key distribution to demonstrate variance reduction
    with virtual nodes.
    """
    # Place virtual nodes on the ring
    ring_size = 2**32
    node_loads = {f"node-{i}": 0 for i in range(num_nodes)}

    # Create virtual node positions
    vnode_positions = []
    for i in range(num_nodes):
        for v in range(vnodes_per_node):
            # Hash the virtual node identifier
            vnode_id = f"node-{i}-vnode-{v}"
            position = hash(vnode_id) % ring_size
            vnode_positions.append((position, f"node-{i}"))
    vnode_positions.sort()
    positions = [p[0] for p in vnode_positions]

    # Distribute random keys
    for _ in range(num_keys):
        key_hash = random.randint(0, ring_size - 1)
        # Binary search for first vnode >= key_hash
        idx = bisect.bisect_left(positions, key_hash)
        if idx >= len(vnode_positions):
            idx = 0
        owning_node = vnode_positions[idx][1]
        node_loads[owning_node] += 1

    # Calculate statistics
    loads = list(node_loads.values())
    expected_per_node = num_keys / num_nodes
    max_deviation = max(abs(l - expected_per_node) / expected_per_node
                        for l in loads)
    return max_deviation


# Compare variance with different numbers of virtual nodes
print("Max load deviation from expected:")
for vnodes in [1, 10, 50, 100, 200]:
    deviation = simulate_load_distribution(10, vnodes)
    print(f"  {vnodes:3d} vnodes/node: {deviation*100:.1f}% max deviation")

# Expected output (variance decreases with more vnodes):
#     1 vnodes/node: 45.3% max deviation
#    10 vnodes/node: 18.2% max deviation
#    50 vnodes/node: 8.1% max deviation
#   100 vnodes/node: 5.2% max deviation
#   200 vnodes/node: 3.8% max deviation
```
Key Movement Analysis
When adding a node with V virtual nodes to a cluster of N physical nodes, the new node's V arcs are scattered around the ring, so it takes a small slice from many existing nodes rather than one large slice from a single successor. In total it claims roughly 1/(N+1) of the key space, drawn nearly evenly from all N existing nodes.
This is exactly the optimal amount! The new node gets its fair share (1/(N+1) of keys) and no more.
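A quick back-of-the-envelope comparison makes the difference tangible (the one-million-key figure is just an assumed example):

```python
# Growing a cluster from N to N+1 nodes, with K keys total
K, N = 1_000_000, 10

# Consistent hashing: only the new node's fair share moves
consistent_moves = K // (N + 1)

# Modulo hashing: a key stays put only if hash % N == hash % (N + 1),
# which holds for roughly 1/(N+1) of keys - everything else moves
modulo_moves = K - K // (N + 1)

print(f"consistent: ~{consistent_moves:,} keys move")
print(f"modulo:     ~{modulo_moves:,} keys move")
```

Roughly a 10x difference at this cluster size, and the gap widens as N grows.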
Common production values are 100-200 virtual nodes per physical node. More vnodes = better distribution but more memory for the ring structure, more positions to search, and larger cluster state to replicate. The sweet spot depends on cluster size and load sensitivity. Cassandra defaults to 256 vnodes.
A production-ready consistent hashing implementation requires careful attention to data structures, hash functions, and concurrency. Let's build a complete implementation.
```python
import bisect
import hashlib
from dataclasses import dataclass
from threading import RLock
from typing import List, Optional, Set


@dataclass
class VirtualNode:
    """Represents a virtual node on the hash ring."""
    position: int
    physical_node: str
    vnode_index: int


class ConsistentHashRing:
    """
    Production-grade consistent hashing with virtual nodes.

    Features:
    - Virtual nodes for load distribution
    - Thread-safe operations
    - Efficient O(log N) lookups
    - Replica selection support
    - Weight-based capacity (heterogeneous nodes)
    """

    RING_SIZE = 2**32

    def __init__(self, vnodes_per_node: int = 150):
        self.vnodes_per_node = vnodes_per_node
        self.ring: List[VirtualNode] = []  # Sorted by position
        self.nodes: Set[str] = set()
        self.node_weights: dict[str, int] = {}
        self._lock = RLock()

    def _hash(self, key: str) -> int:
        """
        Generate a position on the ring using MD5.
        MD5 is chosen for uniformity, not security.
        """
        digest = hashlib.md5(key.encode('utf-8')).digest()
        # Use first 4 bytes for a 32-bit position
        return int.from_bytes(digest[:4], byteorder='big')

    def _hash_xxhash(self, key: str) -> int:
        """
        Alternative using xxHash for better performance.
        Requires: pip install xxhash
        """
        import xxhash
        return xxhash.xxh32(key.encode('utf-8')).intdigest()

    def add_node(self, node: str, weight: int = 1) -> int:
        """
        Add a node to the ring with optional weight.
        Weight multiplies the number of virtual nodes.
        Returns the number of virtual nodes created.
        """
        with self._lock:
            if node in self.nodes:
                return 0
            self.nodes.add(node)
            self.node_weights[node] = weight

            # Create virtual nodes
            num_vnodes = self.vnodes_per_node * weight
            new_vnodes = []
            for i in range(num_vnodes):
                vnode_key = f"{node}:vnode:{i}"
                position = self._hash(vnode_key)
                new_vnodes.append(VirtualNode(position, node, i))

            # Merge into ring
            self.ring = sorted(self.ring + new_vnodes,
                               key=lambda v: v.position)
            return len(new_vnodes)

    def remove_node(self, node: str) -> int:
        """
        Remove a node from the ring.
        Returns the number of virtual nodes removed.
        """
        with self._lock:
            if node not in self.nodes:
                return 0

            # Count vnodes before removal
            vnodes_removed = sum(1 for v in self.ring
                                 if v.physical_node == node)

            # Remove all virtual nodes for this physical node
            self.ring = [v for v in self.ring if v.physical_node != node]
            self.nodes.remove(node)
            del self.node_weights[node]
            return vnodes_removed

    def get_node(self, key: str) -> Optional[str]:
        """
        Get the primary node for a key.
        O(log N) lookup, where N is the total number of virtual nodes.
        """
        with self._lock:
            if not self.ring:
                return None
            key_hash = self._hash(key)

            # Binary search for first vnode with position >= key_hash
            # (rebuilding this list is O(N); cache it in add/remove
            # if the lookup path needs to be truly O(log N))
            positions = [v.position for v in self.ring]
            idx = bisect.bisect_left(positions, key_hash)

            # Wrap around
            if idx >= len(self.ring):
                idx = 0
            return self.ring[idx].physical_node

    def get_nodes(self, key: str, count: int = 3) -> List[str]:
        """
        Get multiple nodes for a key (for replication).
        Returns 'count' distinct physical nodes.
        """
        with self._lock:
            if not self.ring:
                return []
            key_hash = self._hash(key)
            positions = [v.position for v in self.ring]
            idx = bisect.bisect_left(positions, key_hash)

            result = []
            seen_nodes = set()
            # Walk clockwise until we have enough distinct nodes
            for i in range(len(self.ring)):
                current_idx = (idx + i) % len(self.ring)
                node = self.ring[current_idx].physical_node
                if node not in seen_nodes:
                    result.append(node)
                    seen_nodes.add(node)
                    if len(result) >= count:
                        break
            return result

    def get_key_distribution(self) -> dict[str, float]:
        """
        Calculate the fraction of the ring owned by each node.
        Useful for monitoring load distribution.
        """
        with self._lock:
            if not self.ring:
                return {}
            distribution = {node: 0.0 for node in self.nodes}
            for i, vnode in enumerate(self.ring):
                # Calculate arc size from this vnode to the next
                next_idx = (i + 1) % len(self.ring)
                next_pos = self.ring[next_idx].position
                if next_pos > vnode.position:
                    arc_size = next_pos - vnode.position
                else:
                    # Wrapped around
                    arc_size = (self.RING_SIZE - vnode.position) + next_pos
                distribution[vnode.physical_node] += arc_size / self.RING_SIZE
            return distribution
```
Key Implementation Details
Hash Function Choice: MD5 is commonly used for its uniformity. Security doesn't matter here—we're not protecting against adversaries. For performance-critical paths, xxHash or MurmurHash are 10x faster.
Binary Search: The clockwise walk is implemented as binary search on sorted positions. bisect.bisect_left finds the insertion point efficiently.
Thread Safety: Using RLock (reentrant lock) allows the same thread to nest calls safely. For read-heavy workloads, consider RWLock or lock-free data structures.
Replica Selection: The get_nodes method walks clockwise collecting distinct physical nodes. This naturally spreads replicas across the ring.
The 'ketama' algorithm is the de facto standard for consistent hashing in Memcached clients. It uses a specific approach: 160 virtual nodes per physical node, points placed at positions hash(node + '-' + index), and MD5-based hashing. Understanding ketama helps when integrating with existing infrastructure.
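The point-placement scheme is commonly described as 40 MD5 digests per server, each split into four 4-byte points, for 160 points total. A sketch of that scheme (the exact key format and byte order follow common descriptions of libmemcached and may differ in detail from a specific client):

```python
import hashlib

def ketama_points(server: str, hashes: int = 40,
                  points_per_hash: int = 4) -> list[int]:
    """Generate 160 ring positions for one server, ketama-style:
    40 MD5 digests, each yielding 4 little-endian 4-byte points.
    (Format assumed from common descriptions, not a spec.)"""
    points = []
    for i in range(hashes):
        digest = hashlib.md5(f"{server}-{i}".encode()).digest()
        for j in range(points_per_hash):
            # Each 4-byte slice of the 16-byte digest is one ring position
            chunk = digest[j * 4:(j + 1) * 4]
            points.append(int.from_bytes(chunk, "little"))
    return points

pts = ketama_points("10.0.0.1:11211")
print(f"{len(pts)} ring points for one server")
```

Because the hashing is deterministic, every client that uses the same scheme computes an identical ring without any coordination between clients or servers.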
Consistent hashing's elegance extends to failure handling. When a node fails, its keys automatically 'fall through' to the next node on the ring. But production systems need more sophisticated failure handling.
Failure Detection
Before handling failure, you must detect it. Common approaches:
| Mechanism | Speed | Accuracy | Complexity |
|---|---|---|---|
| Heartbeat/Ping | Fast (100ms-1s) | Subject to false positives | Low |
| Gossip Protocol | Moderate (seconds) | High (quorum-based) | Medium |
| External Monitor | Variable | Depends on monitor | High |
| Connection Failure | Immediate | May be transient | Low |
```python
import logging
from threading import Timer
from time import time
from typing import Optional, Set


class FailureAwareHashRing:
    """
    Consistent hash ring with failure detection and handling.
    """

    def __init__(self, ring: ConsistentHashRing,
                 health_check_interval: float = 1.0):
        self.ring = ring
        self.failed_nodes: Set[str] = set()
        self.node_last_success: dict[str, float] = {}
        self.failure_threshold = 3.0  # seconds
        self.health_check_interval = health_check_interval
        self._start_health_checker()

    def _start_health_checker(self):
        """Periodically check node health."""
        def check():
            current_time = time()
            for node in list(self.ring.nodes):
                last_success = self.node_last_success.get(node, current_time)
                if current_time - last_success > self.failure_threshold:
                    if node not in self.failed_nodes:
                        logging.warning(f"Node {node} marked as failed")
                        self.failed_nodes.add(node)
                elif node in self.failed_nodes:
                    logging.info(f"Node {node} recovered")
                    self.failed_nodes.remove(node)
            Timer(self.health_check_interval, check).start()
        check()

    def record_success(self, node: str):
        """Record a successful interaction with a node."""
        self.node_last_success[node] = time()

    def record_failure(self, node: str):
        """Record a failed interaction with a node."""
        # Don't update last_success, letting it age out
        pass

    def get_node(self, key: str) -> Optional[str]:
        """
        Get the node for a key, skipping failed nodes.
        Falls back to the next healthy node on the ring.
        """
        candidates = self.ring.get_nodes(key, count=len(self.ring.nodes))
        for node in candidates:
            if node not in self.failed_nodes:
                return node
        # All nodes failed - return None or the primary anyway
        logging.error(f"All replica nodes for key {key} are failed")
        return candidates[0] if candidates else None

    def get_healthy_nodes(self, key: str, count: int) -> list[str]:
        """Get up to 'count' healthy nodes for a key."""
        candidates = self.ring.get_nodes(key, count=len(self.ring.nodes))
        return [n for n in candidates if n not in self.failed_nodes][:count]
```
Hinted Handoff
When the primary node for a key is down, writes can be sent to a 'hint' node temporarily: the fallback node stores the write along with a hint recording the intended owner, and when the primary recovers, the fallback replays the hinted writes to it.
This maintains write availability during failures while preserving eventual consistency.
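A minimal sketch of the mechanism (the class and method names here are illustrative, not from any particular system):

```python
from collections import defaultdict

class HintedHandoffStore:
    """Sketch: buffer writes for a down node, replay them on recovery."""

    def __init__(self):
        self.data = defaultdict(dict)    # node -> {key: value}
        self.hints = defaultdict(list)   # down node -> [(key, value), ...]
        self.down = set()

    def write(self, primary: str, fallback: str, key: str, value: str):
        if primary in self.down:
            # Fallback stores the value plus a hint naming the real owner
            self.data[fallback][key] = value
            self.hints[primary].append((key, value))
        else:
            self.data[primary][key] = value

    def recover(self, node: str):
        # On recovery, replay the hinted writes to the original owner
        self.down.discard(node)
        for key, value in self.hints.pop(node, []):
            self.data[node][key] = value

store = HintedHandoffStore()
store.down.add("node-A")
store.write("node-A", "node-B", "user:1", "alice")  # buffered on node-B
store.recover("node-A")                             # hint replayed
print(store.data["node-A"]["user:1"])
```

Real systems add durability for the hint queue and a cap on how long hints are kept, but the write-buffer-then-replay shape is the same.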
When a node fails, its traffic shifts to the successor. If the successor is already at capacity, it may also fail, triggering a cascade. Mitigation: over-provision nodes, use circuit breakers, implement request shedding, and spread replicas to avoid successor chains under load.
Production deployments often require optimizations and extensions beyond the basic consistent hashing algorithm.
Jump Consistent Hash
Google's 'Jump Consistent Hash' provides O(1) memory usage and excellent distribution, but doesn't support weighted nodes or arbitrary node removal:
```python
def jump_hash(key: int, num_buckets: int) -> int:
    b = -1
    j = 0
    while j < num_buckets:
        b = j
        key = ((key * 2862933555777941757) + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b
```
Jump hash is ideal when nodes are numbered 0 to N-1 and only the count changes (not arbitrary additions/removals).
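Its minimal-movement property is easy to check empirically (the function is repeated here so the snippet runs standalone): when the bucket count grows from 10 to 11, a key either stays where it was or moves to the new bucket, never to another existing bucket, and only about 1/11 of keys move.

```python
def jump_hash(key: int, num_buckets: int) -> int:
    b = -1
    j = 0
    while j < num_buckets:
        b = j
        key = ((key * 2862933555777941757) + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

# Grow from 10 to 11 buckets and track which keys relocate
moved = 0
for k in range(100_000):
    old, new = jump_hash(k, 10), jump_hash(k, 11)
    if old != new:
        assert new == 10  # relocated keys land only on the new bucket
        moved += 1

print(f"{moved} of 100000 keys moved, all to the new bucket")
```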
| Algorithm | Memory | Lookup | Add Node | Remove Node |
|---|---|---|---|---|
| Ketama (Ring) | O(N × V) | O(log(N × V)) | Arbitrary | Arbitrary |
| Jump Hash | O(1) | O(log N) | Append only | Not supported |
| Rendezvous/HRW | O(N) | O(N) | Arbitrary | Arbitrary |
| Maglev | O(M) | O(1) | Rebuild table | Rebuild table |
Rendezvous Hashing (Highest Random Weight)
An alternative to ring-based consistent hashing where each key computes a score with every node, selecting the highest-scoring node:
```python
import hashlib

def get_node_hrw(key: str, nodes: list[str]) -> str:
    max_weight = -1
    selected = None
    for node in nodes:
        # Use a stable hash: Python's built-in hash() is randomized
        # per process, so different clients would disagree on the owner
        weight = int(hashlib.md5(f"{key}:{node}".encode()).hexdigest(), 16)
        if weight > max_weight:
            max_weight = weight
            selected = node
    return selected
```
HRW has O(N) lookup but simpler implementation and excellent distribution. Used by Microsoft Azure and Vimeo.
Bounded Load Consistent Hashing
Google's 'Consistent Hashing with Bounded Loads' addresses the hot spot problem where unlucky hash distribution overloads some nodes: each node's load is capped at a small multiple c of the average load (c = 1.25 is a commonly cited choice), and any key that would push a node past its cap spills over to the next node clockwise.
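A simplified sketch of the bounded-load idea, assuming a static key set and a cap of c times the average load (real implementations enforce the cap incrementally as keys arrive and leave):

```python
import bisect
import hashlib
import math

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def assign_bounded(keys: list[str], nodes: list[str], c: float = 1.25):
    """Each node accepts at most ceil(c * K / N) keys;
    overflow spills to the next node clockwise."""
    ring = sorted((h(n), n) for n in nodes)
    positions = [p for p, _ in ring]
    cap = math.ceil(c * len(keys) / len(nodes))
    load = {n: 0 for n in nodes}
    assignment = {}
    for key in keys:
        idx = bisect.bisect_left(positions, h(key)) % len(ring)
        # Walk clockwise past any node already at capacity
        while load[ring[idx][1]] >= cap:
            idx = (idx + 1) % len(ring)
        node = ring[idx][1]
        load[node] += 1
        assignment[key] = node
    return assignment, load

keys = [f"k{i}" for i in range(1000)]
assignment, load = assign_bounded(keys, ["node-a", "node-b", "node-c"])
print(load)  # no node exceeds ceil(1.25 * 1000 / 3) = 417 keys
```

The trade-off: lookups are no longer a pure function of the key and ring, because they depend on current loads.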
For distributed caches: use Ketama-style ring hashing with virtual nodes. It's well-understood, widely implemented, and handles arbitrary node addition/removal. Reserve Jump Hash for specific scenarios like load balancer backends. Use Rendezvous (HRW) when simplicity trumps O(N) lookup cost.
Consistent hashing appears across the industry in various forms. Understanding these implementations contextualizes the theory.
Amazon Dynamo and Cassandra
Dynamo (2007) popularized consistent hashing for distributed databases. Key design choices from the paper:

- The ring is partitioned with virtual nodes ("tokens") for incremental scalability
- Each key is stored on the first N distinct nodes encountered walking clockwise (the 'preference list')
- Reads and writes use quorums (R + W > N), relaxed to 'sloppy quorums' during failures
- Hinted handoff and Merkle-tree anti-entropy repair handle node outages
Cassandra's default: 256 vnodes per node, replication factor 3.
Redis Cluster
Redis takes a different approach with 'hash slots':

- The key space is divided into 16,384 fixed slots
- A key maps to slot CRC16(key) mod 16384 (hash tags like {user} can pin related keys to one slot)
- Each master node owns an explicit subset of slots; resharding means reassigning slots between nodes
This is conceptually simpler but requires explicit slot migration.
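Slot assignment can be computed client-side. Redis specifies CRC16 in the XMODEM variant (polynomial 0x1021, initial value 0); a sketch, omitting hash-tag handling:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (poly 0x1021, init 0), the checksum
    Redis Cluster specifies for slot assignment."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def redis_slot(key: str) -> int:
    # HASH_SLOT = CRC16(key) mod 16384 (hash tags {...} not handled here)
    return crc16_xmodem(key.encode()) % 16384

print(redis_slot("user:1000"))
```

The Redis Cluster specification's own test vector is CRC16("123456789") = 0x31C3, which this function reproduces.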
Memcached (libmemcached ketama)
The ketama library for Memcached:

- Places each server at 160 points on the ring (40 MD5 digests yielding 4 points each)
- Runs entirely client-side; the Memcached servers themselves are unaware of the ring
- Supports weighting, so servers with more memory receive proportionally more points
| System | Approach | Virtual Nodes | Replication |
|---|---|---|---|
| Amazon Dynamo | Ring with vnodes | Configurable | N-way with quorum |
| Apache Cassandra | Ring with vnodes | 256 default | Configurable RF |
| Redis Cluster | Fixed hash slots | 16,384 slots | 1 replica per master |
| Memcached/ketama | Ring with vnodes | 160 per node | None (stateless) |
| Riak | Ring with vnodes | Configurable | N-way |
| Discord | Ring with vnodes | Custom | In-house sharding |
When discussing consistent hashing in interviews, cite these real-world systems. Saying 'Cassandra uses consistent hashing with 256 virtual nodes for partition assignment' demonstrates practical knowledge beyond textbook understanding. Interviewers value candidates who can connect theory to practice.
Consistent hashing is foundational to distributed systems. Let's consolidate the key concepts:
When to Use Consistent Hashing
✅ Distributing cache data across multiple nodes
✅ Partitioning database storage
✅ Load balancing with session affinity
✅ Distributing work across processing nodes
✅ Any scenario where cluster topology changes and you need minimal disruption
When NOT to Use Consistent Hashing
❌ Single-node systems (no distribution needed)
❌ Scenarios requiring strict ordering (use range partitioning)
❌ When topology never changes (simpler approaches work)
❌ When all data must be accessible from any node (use replication instead)
You now have deep understanding of consistent hashing—from the hash ring concept through production implementation details. In the next page, we'll explore Eviction Policies—the algorithms that decide what to remove when cache memory is full.