In 1997, David Karger and his colleagues at MIT published a paper titled "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web." This paper introduced an algorithm that would become foundational to distributed systems: consistent hashing.
The problem Karger addressed was precisely what we explored in the previous page: how do you distribute cache data across multiple servers in a way that minimizes disruption when servers are added or removed? The naive modulo approach fails catastrophically because nearly all keys remap on topology changes. Consistent hashing solves this elegantly by ensuring that only K/N keys need to move when adding or removing a node (where K is the total number of keys and N is the number of nodes).
Today, consistent hashing is ubiquitous. It underpins Amazon's Dynamo, Apache Cassandra, Discord's message storage, Akamai's CDN, and virtually every distributed cache including Memcached and Redis Cluster. Understanding consistent hashing deeply—not just the concept but the implementation details, edge cases, and optimizations—is essential for any engineer working on distributed systems.
This page provides that deep understanding. We'll build consistent hashing from first principles, analyze its mathematical properties, implement it in code, and explore the optimizations that make it production-ready.
By the end of this page, you will understand the consistent hashing algorithm and its mathematical properties, implement consistent hashing with virtual nodes, analyze why consistent hashing achieves minimal key movement, explore real-world implementations and optimizations, and recognize when and how to apply consistent hashing in system design.
The core insight of consistent hashing is to treat the hash space as a ring (or circle) rather than a linear range. Both keys and nodes are hashed onto this ring, and each key is assigned to the first node encountered when traversing the ring clockwise from the key's position.
Visualizing the Hash Ring
Imagine a circle where positions represent all possible hash values from 0 to 2³² - 1 (for a 32-bit hash). This circle is the hash ring:
```
            0°
            │
 270° ──────┼────── 90°
            │
           180°
```
Nodes are hashed onto this ring using their identifiers (hostname, IP address, etc.):
hash("node-A") → position 45° on ring
hash("node-B") → position 150° on ring
hash("node-C") → position 280° on ring
Keys are similarly hashed:
hash("user:123") → position 80° on ring
Key-to-Node Assignment
To find which node owns a key, we start at the key's position and walk clockwise until we hit a node. That node owns the key.
```python
import bisect
import hashlib
from typing import Optional


class BasicHashRing:
    """
    Basic consistent hashing implementation using a hash ring.
    """

    def __init__(self):
        # Maps hash position -> node name
        self.ring: dict[int, str] = {}
        # Sorted list of positions for binary search
        self.sorted_positions: list[int] = []

    def _hash(self, key: str) -> int:
        """Generate a 32-bit hash value for a key."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add_node(self, node: str) -> None:
        """Add a node to the ring."""
        position = self._hash(node)
        self.ring[position] = node
        self.sorted_positions = sorted(self.ring.keys())

    def remove_node(self, node: str) -> None:
        """Remove a node from the ring."""
        position = self._hash(node)
        if position in self.ring:
            del self.ring[position]
            self.sorted_positions = sorted(self.ring.keys())

    def get_node(self, key: str) -> Optional[str]:
        """Find the node responsible for a key."""
        if not self.ring:
            return None
        key_hash = self._hash(key)
        # Binary search for the first position >= key_hash
        idx = bisect.bisect_left(self.sorted_positions, key_hash)
        # Wrap around if we're past the last position
        if idx >= len(self.sorted_positions):
            idx = 0
        position = self.sorted_positions[idx]
        return self.ring[position]


# Example usage
ring = BasicHashRing()
ring.add_node("cache-server-1")
ring.add_node("cache-server-2")
ring.add_node("cache-server-3")

# Distribute some keys
keys = ["user:100", "product:555", "session:abc", "order:999"]
for key in keys:
    node = ring.get_node(key)
    print(f"{key} -> {node}")
```
Why the Ring Works
The ring structure provides the crucial property that modulo hashing lacks: locality of change. When you add or remove a node:
Adding a node: The new node appears at one position on the ring. It takes ownership of keys that were previously assigned to the next node clockwise. Only those keys move.
Removing a node: The removed node's keys move to the next node clockwise. No other keys are affected.
In both cases, only the keys in the affected arc of the ring move. This is fundamentally different from modulo hashing, where the entire key space reshuffles.
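This locality is easy to verify empirically. The following self-contained sketch (node and key names are illustrative) counts how many keys actually change owners when a fourth node joins a three-node ring; note that every key that moves lands on the new node, never on another existing node:

```python
import bisect
import hashlib

def h(s: str) -> int:
    """32-bit ring position from MD5."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def owner(key: str, nodes: list[str]) -> str:
    """First node clockwise from the key's position."""
    ring = sorted((h(n), n) for n in nodes)
    positions = [p for p, _ in ring]
    idx = bisect.bisect_left(positions, h(key)) % len(ring)
    return ring[idx][1]

nodes = ["node-A", "node-B", "node-C"]
keys = [f"key-{i}" for i in range(10000)]

before = {k: owner(k, nodes) for k in keys}
after = {k: owner(k, nodes + ["node-D"]) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved, all to the new node")
```

With modulo hashing the same experiment would remap nearly every key; here only the keys in node-D's new arc move.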
The 'clockwise walk' to find a node is typically implemented using binary search on the sorted list of node positions. This gives O(log N) lookup time, where N is the number of nodes. With virtual nodes, N becomes the total number of virtual nodes, but even with 1000 virtual nodes, log₂(1000) ≈ 10 comparisons is negligible.
Understanding the mathematical properties of consistent hashing explains why it works and guides parameter tuning.
Key Distribution
With a uniform hash function and N nodes on the ring, each node owns approximately 1/N of the key space. However, with only N physical positions on the ring, the variance is high.
- Expected arc length per node: 1/N of the ring
- Standard deviation of arc length: ~1/N (the same order of magnitude!)
This means with 3 nodes, you might see load distributions like 50%/30%/20% rather than the ideal 33%/33%/33%. This variance is unacceptable for production systems.
| Number of Nodes | Expected Load per Node | Typical Max Deviation | Worst Case |
|---|---|---|---|
| 3 | 33.3% | ±15% | One node at 50%+ |
| 10 | 10% | ±5% | One node at 20%+ |
| 100 | 1% | ±0.5% | Acceptable variance |
Virtual Nodes Reduce Variance
The solution is virtual nodes (vnodes). Instead of placing each physical node at one position, we place it at V positions (virtual nodes). Each physical node owns V arcs on the ring.
With V virtual nodes per physical node, each physical node owns V separate arcs scattered around the ring. Summing V roughly independent arc lengths averages out the extremes, shrinking the relative standard deviation of a node's load by a factor of about √V. The simulation below demonstrates this variance reduction directly:
```python
import bisect
import random

def simulate_load_distribution(num_nodes: int, vnodes_per_node: int,
                               num_keys: int = 100000):
    """
    Simulate key distribution to demonstrate variance reduction
    with virtual nodes.
    """
    # Place virtual nodes on the ring
    ring_size = 2**32
    node_loads = {f"node-{i}": 0 for i in range(num_nodes)}

    # Create virtual node positions
    vnode_positions = []
    for i in range(num_nodes):
        for v in range(vnodes_per_node):
            # Hash the virtual node identifier
            vnode_id = f"node-{i}-vnode-{v}"
            position = hash(vnode_id) % ring_size
            vnode_positions.append((position, f"node-{i}"))
    vnode_positions.sort()
    positions = [p[0] for p in vnode_positions]

    # Distribute random keys
    for _ in range(num_keys):
        key_hash = random.randint(0, ring_size - 1)
        # Binary search for first vnode >= key_hash
        idx = bisect.bisect_left(positions, key_hash)
        if idx >= len(vnode_positions):
            idx = 0
        owning_node = vnode_positions[idx][1]
        node_loads[owning_node] += 1

    # Calculate statistics
    loads = list(node_loads.values())
    expected_per_node = num_keys / num_nodes
    max_deviation = max(abs(l - expected_per_node) / expected_per_node
                        for l in loads)
    return max_deviation


# Compare variance with different numbers of virtual nodes
print("Max load deviation from expected:")
for vnodes in [1, 10, 50, 100, 200]:
    deviation = simulate_load_distribution(10, vnodes)
    print(f"  {vnodes:3d} vnodes/node: {deviation*100:.1f}% max deviation")

# Expected output (variance decreases with more vnodes):
#     1 vnodes/node: 45.3% max deviation
#    10 vnodes/node: 18.2% max deviation
#    50 vnodes/node: 8.1% max deviation
#   100 vnodes/node: 5.2% max deviation
#   200 vnodes/node: 3.8% max deviation
```
Key Movement Analysis
When adding a node with V virtual nodes to a cluster of N physical nodes, the new node's V arcs are scattered around the ring, so it takes a small slice from many existing nodes rather than one large slice from a single successor. In total it claims roughly 1/(N+1) of the key space, drawn nearly evenly from all N existing nodes.
This is exactly the optimal amount! The new node gets its fair share (1/(N+1) of keys) and no more.
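A quick back-of-the-envelope comparison makes the difference tangible (the one-million-key figure is just an assumed example):

```python
# Growing a cluster from N to N+1 nodes, with K keys total
K, N = 1_000_000, 10

# Consistent hashing: only the new node's fair share moves
consistent_moves = K // (N + 1)

# Modulo hashing: a key stays put only if hash % N == hash % (N + 1),
# which holds for roughly 1/(N+1) of keys - everything else moves
modulo_moves = K - K // (N + 1)

print(f"consistent: ~{consistent_moves:,} keys move")
print(f"modulo:     ~{modulo_moves:,} keys move")
```

Roughly a 10x difference at this cluster size, and the gap widens as N grows.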
Common production values are 100-200 virtual nodes per physical node. More vnodes = better distribution but more memory for the ring structure, more positions to search, and larger cluster state to replicate. The sweet spot depends on cluster size and load sensitivity. Cassandra defaults to 256 vnodes.
A production-ready consistent hashing implementation requires careful attention to data structures, hash functions, and concurrency. Let's build a complete implementation.
```python
import bisect
import hashlib
from dataclasses import dataclass
from threading import RLock
from typing import List, Optional, Set


@dataclass
class VirtualNode:
    """Represents a virtual node on the hash ring."""
    position: int
    physical_node: str
    vnode_index: int


class ConsistentHashRing:
    """
    Production-grade consistent hashing with virtual nodes.

    Features:
    - Virtual nodes for load distribution
    - Thread-safe operations
    - Efficient O(log N) lookups
    - Replica selection support
    - Weight-based capacity (heterogeneous nodes)
    """

    RING_SIZE = 2**32

    def __init__(self, vnodes_per_node: int = 150):
        self.vnodes_per_node = vnodes_per_node
        self.ring: List[VirtualNode] = []  # Sorted by position
        self.nodes: Set[str] = set()
        self.node_weights: dict[str, int] = {}
        self._lock = RLock()

    def _hash(self, key: str) -> int:
        """
        Generate a position on the ring using MD5.
        MD5 is chosen for uniformity, not security.
        """
        digest = hashlib.md5(key.encode('utf-8')).digest()
        # Use first 4 bytes for a 32-bit position
        return int.from_bytes(digest[:4], byteorder='big')

    def _hash_xxhash(self, key: str) -> int:
        """
        Alternative using xxHash for better performance.
        Requires: pip install xxhash
        """
        import xxhash
        return xxhash.xxh32(key.encode('utf-8')).intdigest()

    def add_node(self, node: str, weight: int = 1) -> int:
        """
        Add a node to the ring with optional weight.
        Weight multiplies the number of virtual nodes.
        Returns the number of virtual nodes created.
        """
        with self._lock:
            if node in self.nodes:
                return 0
            self.nodes.add(node)
            self.node_weights[node] = weight

            # Create virtual nodes
            num_vnodes = self.vnodes_per_node * weight
            new_vnodes = []
            for i in range(num_vnodes):
                vnode_key = f"{node}:vnode:{i}"
                position = self._hash(vnode_key)
                new_vnodes.append(VirtualNode(position, node, i))

            # Merge into ring
            self.ring = sorted(self.ring + new_vnodes,
                               key=lambda v: v.position)
            return len(new_vnodes)

    def remove_node(self, node: str) -> int:
        """
        Remove a node from the ring.
        Returns the number of virtual nodes removed.
        """
        with self._lock:
            if node not in self.nodes:
                return 0

            # Count vnodes before removal
            vnodes_removed = sum(1 for v in self.ring
                                 if v.physical_node == node)

            # Remove all virtual nodes for this physical node
            self.ring = [v for v in self.ring if v.physical_node != node]
            self.nodes.remove(node)
            del self.node_weights[node]
            return vnodes_removed

    def get_node(self, key: str) -> Optional[str]:
        """
        Get the primary node for a key.
        O(log N) lookup, where N is the total number of virtual nodes.
        """
        with self._lock:
            if not self.ring:
                return None
            key_hash = self._hash(key)

            # Binary search for first vnode with position >= key_hash
            # (rebuilding this list is O(N); cache it in add/remove
            # if the lookup path needs to be truly O(log N))
            positions = [v.position for v in self.ring]
            idx = bisect.bisect_left(positions, key_hash)

            # Wrap around
            if idx >= len(self.ring):
                idx = 0
            return self.ring[idx].physical_node

    def get_nodes(self, key: str, count: int = 3) -> List[str]:
        """
        Get multiple nodes for a key (for replication).
        Returns 'count' distinct physical nodes.
        """
        with self._lock:
            if not self.ring:
                return []
            key_hash = self._hash(key)
            positions = [v.position for v in self.ring]
            idx = bisect.bisect_left(positions, key_hash)

            result = []
            seen_nodes = set()
            # Walk clockwise until we have enough distinct nodes
            for i in range(len(self.ring)):
                current_idx = (idx + i) % len(self.ring)
                node = self.ring[current_idx].physical_node
                if node not in seen_nodes:
                    result.append(node)
                    seen_nodes.add(node)
                    if len(result) >= count:
                        break
            return result

    def get_key_distribution(self) -> dict[str, float]:
        """
        Calculate the fraction of the ring owned by each node.
        Useful for monitoring load distribution.
        """
        with self._lock:
            if not self.ring:
                return {}
            distribution = {node: 0.0 for node in self.nodes}
            for i, vnode in enumerate(self.ring):
                # Calculate arc size from this vnode to the next
                next_idx = (i + 1) % len(self.ring)
                next_pos = self.ring[next_idx].position
                if next_pos > vnode.position:
                    arc_size = next_pos - vnode.position
                else:
                    # Wrapped around
                    arc_size = (self.RING_SIZE - vnode.position) + next_pos
                distribution[vnode.physical_node] += arc_size / self.RING_SIZE
            return distribution
```
Key Implementation Details
Hash Function Choice: MD5 is commonly used for its uniformity. Security doesn't matter here—we're not protecting against adversaries. For performance-critical paths, xxHash or MurmurHash are 10x faster.
Binary Search: The clockwise walk is implemented as binary search on sorted positions. bisect.bisect_left finds the insertion point efficiently.
Thread Safety: Using RLock (reentrant lock) allows the same thread to nest calls safely. For read-heavy workloads, consider RWLock or lock-free data structures.
Replica Selection: The get_nodes method walks clockwise collecting distinct physical nodes. This naturally spreads replicas across the ring.
The 'ketama' algorithm is the de facto standard for consistent hashing in Memcached clients. It uses a specific approach: 160 virtual nodes per physical node, points placed at positions hash(node + '-' + index), and MD5-based hashing. Understanding ketama helps when integrating with existing infrastructure.
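The point-placement scheme is commonly described as 40 MD5 digests per server, each split into four 4-byte points, for 160 points total. A sketch of that scheme (the exact key format and byte order follow common descriptions of libmemcached and may differ in detail from a specific client):

```python
import hashlib

def ketama_points(server: str, hashes: int = 40,
                  points_per_hash: int = 4) -> list[int]:
    """Generate 160 ring positions for one server, ketama-style:
    40 MD5 digests, each yielding 4 little-endian 4-byte points.
    (Format assumed from common descriptions, not a spec.)"""
    points = []
    for i in range(hashes):
        digest = hashlib.md5(f"{server}-{i}".encode()).digest()
        for j in range(points_per_hash):
            # Each 4-byte slice of the 16-byte digest is one ring position
            chunk = digest[j * 4:(j + 1) * 4]
            points.append(int.from_bytes(chunk, "little"))
    return points

pts = ketama_points("10.0.0.1:11211")
print(f"{len(pts)} ring points for one server")
```

Because the hashing is deterministic, every client that uses the same scheme computes an identical ring without any coordination between clients or servers.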
Consistent hashing's elegance extends to failure handling. When a node fails, its keys automatically 'fall through' to the next node on the ring. But production systems need more sophisticated failure handling.
Failure Detection
Before handling failure, you must detect it. Common approaches:
| Mechanism | Speed | Accuracy | Complexity |
|---|---|---|---|
| Heartbeat/Ping | Fast (100ms-1s) | Subject to false positives | Low |
| Gossip Protocol | Moderate (seconds) | High (quorum-based) | Medium |
| External Monitor | Variable | Depends on monitor | High |
| Connection Failure | Immediate | May be transient | Low |
```python
import logging
from threading import Timer
from time import time
from typing import Optional, Set


class FailureAwareHashRing:
    """
    Consistent hash ring with failure detection and handling.
    """

    def __init__(self, ring: ConsistentHashRing,
                 health_check_interval: float = 1.0):
        self.ring = ring
        self.failed_nodes: Set[str] = set()
        self.node_last_success: dict[str, float] = {}
        self.failure_threshold = 3.0  # seconds
        self.health_check_interval = health_check_interval
        self._start_health_checker()

    def _start_health_checker(self):
        """Periodically check node health."""
        def check():
            current_time = time()
            for node in list(self.ring.nodes):
                last_success = self.node_last_success.get(node, current_time)
                if current_time - last_success > self.failure_threshold:
                    if node not in self.failed_nodes:
                        logging.warning(f"Node {node} marked as failed")
                        self.failed_nodes.add(node)
                elif node in self.failed_nodes:
                    logging.info(f"Node {node} recovered")
                    self.failed_nodes.remove(node)
            Timer(self.health_check_interval, check).start()
        check()

    def record_success(self, node: str):
        """Record a successful interaction with a node."""
        self.node_last_success[node] = time()

    def record_failure(self, node: str):
        """Record a failed interaction with a node."""
        # Don't update last_success, letting it age out
        pass

    def get_node(self, key: str) -> Optional[str]:
        """
        Get the node for a key, skipping failed nodes.
        Falls back to the next healthy node on the ring.
        """
        candidates = self.ring.get_nodes(key, count=len(self.ring.nodes))
        for node in candidates:
            if node not in self.failed_nodes:
                return node
        # All nodes failed - return None or the primary anyway
        logging.error(f"All replica nodes for key {key} are failed")
        return candidates[0] if candidates else None

    def get_healthy_nodes(self, key: str, count: int) -> list[str]:
        """Get up to 'count' healthy nodes for a key."""
        candidates = self.ring.get_nodes(key, count=len(self.ring.nodes))
        return [n for n in candidates if n not in self.failed_nodes][:count]
```
Hinted Handoff
When the primary node for a key is down, writes can be sent to a 'hint' node temporarily: the fallback node stores the write along with a hint recording the intended owner, and when the primary recovers, the fallback replays the hinted writes to it.
This maintains write availability during failures while preserving eventual consistency.
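A minimal sketch of the mechanism (the class and method names here are illustrative, not from any particular system):

```python
from collections import defaultdict

class HintedHandoffStore:
    """Sketch: buffer writes for a down node, replay them on recovery."""

    def __init__(self):
        self.data = defaultdict(dict)    # node -> {key: value}
        self.hints = defaultdict(list)   # down node -> [(key, value), ...]
        self.down = set()

    def write(self, primary: str, fallback: str, key: str, value: str):
        if primary in self.down:
            # Fallback stores the value plus a hint naming the real owner
            self.data[fallback][key] = value
            self.hints[primary].append((key, value))
        else:
            self.data[primary][key] = value

    def recover(self, node: str):
        # On recovery, replay the hinted writes to the original owner
        self.down.discard(node)
        for key, value in self.hints.pop(node, []):
            self.data[node][key] = value

store = HintedHandoffStore()
store.down.add("node-A")
store.write("node-A", "node-B", "user:1", "alice")  # buffered on node-B
store.recover("node-A")                             # hint replayed
print(store.data["node-A"]["user:1"])
```

Real systems add durability for the hint queue and a cap on how long hints are kept, but the write-buffer-then-replay shape is the same.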
When a node fails, its traffic shifts to the successor. If the successor is already at capacity, it may also fail, triggering a cascade. Mitigation: over-provision nodes, use circuit breakers, implement request shedding, and spread replicas to avoid successor chains under load.
Production deployments often require optimizations and extensions beyond the basic consistent hashing algorithm.
Jump Consistent Hash
Google's 'Jump Consistent Hash' provides O(1) memory usage and excellent distribution, but doesn't support weighted nodes or arbitrary node removal:
```python
def jump_hash(key: int, num_buckets: int) -> int:
    b = -1
    j = 0
    while j < num_buckets:
        b = j
        key = ((key * 2862933555777941757) + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b
```
Jump hash is ideal when nodes are numbered 0 to N-1 and only the count changes (not arbitrary additions/removals).
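Its minimal-movement property is easy to check empirically (the function is repeated here so the snippet runs standalone): when the bucket count grows from 10 to 11, a key either stays where it was or moves to the new bucket, never to another existing bucket, and only about 1/11 of keys move.

```python
def jump_hash(key: int, num_buckets: int) -> int:
    b = -1
    j = 0
    while j < num_buckets:
        b = j
        key = ((key * 2862933555777941757) + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

# Grow from 10 to 11 buckets and track which keys relocate
moved = 0
for k in range(100_000):
    old, new = jump_hash(k, 10), jump_hash(k, 11)
    if old != new:
        assert new == 10  # relocated keys land only on the new bucket
        moved += 1

print(f"{moved} of 100000 keys moved, all to the new bucket")
```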
| Algorithm | Memory | Lookup | Add Node | Remove Node |
|---|---|---|---|---|
| Ketama (Ring) | O(N × V) | O(log(N × V)) | Arbitrary | Arbitrary |
| Jump Hash | O(1) | O(log N) | Append only | Not supported |
| Rendezvous/HRW | O(N) | O(N) | Arbitrary | Arbitrary |
| Maglev | O(M) | O(1) | Rebuild table | Rebuild table |
Rendezvous Hashing (Highest Random Weight)
An alternative to ring-based consistent hashing where each key computes a score with every node, selecting the highest-scoring node:
```python
import hashlib

def get_node_hrw(key: str, nodes: list[str]) -> str:
    max_weight = -1
    selected = None
    for node in nodes:
        # Use a stable hash: Python's built-in hash() is randomized
        # per process, so different clients would disagree on the owner
        weight = int(hashlib.md5(f"{key}:{node}".encode()).hexdigest(), 16)
        if weight > max_weight:
            max_weight = weight
            selected = node
    return selected
```
HRW has O(N) lookup but simpler implementation and excellent distribution. Used by Microsoft Azure and Vimeo.
Bounded Load Consistent Hashing
Google's 'Consistent Hashing with Bounded Loads' addresses the hot spot problem where unlucky hash distribution overloads some nodes: each node's load is capped at a small multiple c of the average load (c = 1.25 is a commonly cited choice), and any key that would push a node past its cap spills over to the next node clockwise.
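A simplified sketch of the bounded-load idea, assuming a static key set and a cap of c times the average load (real implementations enforce the cap incrementally as keys arrive and leave):

```python
import bisect
import hashlib
import math

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def assign_bounded(keys: list[str], nodes: list[str], c: float = 1.25):
    """Each node accepts at most ceil(c * K / N) keys;
    overflow spills to the next node clockwise."""
    ring = sorted((h(n), n) for n in nodes)
    positions = [p for p, _ in ring]
    cap = math.ceil(c * len(keys) / len(nodes))
    load = {n: 0 for n in nodes}
    assignment = {}
    for key in keys:
        idx = bisect.bisect_left(positions, h(key)) % len(ring)
        # Walk clockwise past any node already at capacity
        while load[ring[idx][1]] >= cap:
            idx = (idx + 1) % len(ring)
        node = ring[idx][1]
        load[node] += 1
        assignment[key] = node
    return assignment, load

keys = [f"k{i}" for i in range(1000)]
assignment, load = assign_bounded(keys, ["node-a", "node-b", "node-c"])
print(load)  # no node exceeds ceil(1.25 * 1000 / 3) = 417 keys
```

The trade-off: lookups are no longer a pure function of the key and ring, because they depend on current loads.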
For distributed caches: use Ketama-style ring hashing with virtual nodes. It's well-understood, widely implemented, and handles arbitrary node addition/removal. Reserve Jump Hash for specific scenarios like load balancer backends. Use Rendezvous (HRW) when simplicity trumps O(N) lookup cost.
Consistent hashing appears across the industry in various forms. Understanding these implementations contextualizes the theory.
Amazon Dynamo and Cassandra
Dynamo (2007) popularized consistent hashing for distributed databases. Key design choices from the paper:

- The ring is partitioned with virtual nodes ("tokens") for incremental scalability
- Each key is stored on the first N distinct nodes encountered walking clockwise (the 'preference list')
- Reads and writes use quorums (R + W > N), relaxed to 'sloppy quorums' during failures
- Hinted handoff and Merkle-tree anti-entropy repair handle node outages
Cassandra's default: 256 vnodes per node, replication factor 3.
Redis Cluster
Redis takes a different approach with 'hash slots':

- The key space is divided into 16,384 fixed slots
- A key maps to slot CRC16(key) mod 16384 (hash tags like {user} can pin related keys to one slot)
- Each master node owns an explicit subset of slots; resharding means reassigning slots between nodes
This is conceptually simpler but requires explicit slot migration.
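Slot assignment can be computed client-side. Redis specifies CRC16 in the XMODEM variant (polynomial 0x1021, initial value 0); a sketch, omitting hash-tag handling:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (poly 0x1021, init 0), the checksum
    Redis Cluster specifies for slot assignment."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def redis_slot(key: str) -> int:
    # HASH_SLOT = CRC16(key) mod 16384 (hash tags {...} not handled here)
    return crc16_xmodem(key.encode()) % 16384

print(redis_slot("user:1000"))
```

The Redis Cluster specification's own test vector is CRC16("123456789") = 0x31C3, which this function reproduces.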
Memcached (libmemcached ketama)
The ketama library for Memcached:

- Places each server at 160 points on the ring (40 MD5 digests yielding 4 points each)
- Runs entirely client-side; the Memcached servers themselves are unaware of the ring
- Supports weighting, so servers with more memory receive proportionally more points
| System | Approach | Virtual Nodes | Replication |
|---|---|---|---|
| Amazon Dynamo | Ring with vnodes | Configurable | N-way with quorum |
| Apache Cassandra | Ring with vnodes | 256 default | Configurable RF |
| Redis Cluster | Fixed hash slots | 16,384 slots | 1 replica per master |
| Memcached/ketama | Ring with vnodes | 160 per node | None (stateless) |
| Riak | Ring with vnodes | Configurable | N-way |
| Discord | Ring with vnodes | Custom | In-house sharding |
When discussing consistent hashing in interviews, cite these real-world systems. Saying 'Cassandra uses consistent hashing with 256 virtual nodes for partition assignment' demonstrates practical knowledge beyond textbook understanding. Interviewers value candidates who can connect theory to practice.
Consistent hashing is foundational to distributed systems. Let's consolidate the key concepts:
When to Use Consistent Hashing
✅ Distributing cache data across multiple nodes
✅ Partitioning database storage
✅ Load balancing with session affinity
✅ Distributing work across processing nodes
✅ Any scenario where cluster topology changes and you need minimal disruption
When NOT to Use Consistent Hashing
❌ Single-node systems (no distribution needed)
❌ Scenarios requiring strict ordering (use range partitioning)
❌ When topology never changes (simpler approaches work)
❌ When all data must be accessible from any node (use replication instead)
You now have deep understanding of consistent hashing—from the hash ring concept through production implementation details. In the next page, we'll explore Eviction Policies—the algorithms that decide what to remove when cache memory is full.