In 1997, David Karger and colleagues published a paper that would become foundational to distributed systems: Consistent Hashing and Random Trees. The algorithm they described removed a key obstacle to building truly scalable distributed systems.
The Problem (Recap):
Simple hash-based distribution (hash(key) % n) suffers from the remap problem. When the number of servers changes from n to n±1, nearly all keys must be remapped. For a distributed cache with 100 servers, adding one server invalidates approximately 99% of cached data—a catastrophic cache miss storm.
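To make the scale of the problem concrete, here is a small simulation sketch (the key names and counts are arbitrary) that counts how many keys change owner when a modulo-distributed pool grows from 100 to 101 servers:

```python
import hashlib

def modulo_server(key: str, n: int) -> int:
    """Naive distribution: hash the key, then take it modulo the server count."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], 'big') % n

# Count keys whose owner changes when the pool grows from 100 to 101 servers.
keys = [f"user-{i}" for i in range(100_000)]
moved = sum(1 for k in keys if modulo_server(k, 100) != modulo_server(k, 101))
print(f"Remapped: {moved / len(keys):.1%}")  # typically around 99% of keys move
```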
The Solution:
Consistent hashing limits remapping to approximately K/n keys (where K is total keys and n is servers). Adding one server to a 100-server pool remaps only ~1% of keys. This makes hash-based distribution practical for dynamic environments.
Where It's Used:
Distributed caches (for example, Memcached client libraries), storage systems such as Amazon's Dynamo, Apache Cassandra, and Riak, content delivery networks (the original research led to Akamai), and the load balancers configured later on this page (NGINX, HAProxy, Envoy).
By the end of this page, you will understand the hash ring concept and its mathematical properties, virtual nodes for balance, multiple implementation approaches, real-world configurations in production load balancers, and when consistent hashing is essential versus overkill.
Consistent hashing maps both servers and keys to positions on a circular hash ring (conceptually, a circle from 0 to 2³² - 1 or similar).
The Core Idea:
Each server is placed on the ring at position = hash(server_id), and each key at position = hash(key). A key belongs to the first server encountered moving clockwise from the key's position.
Why This Works:
When a server is added, it only "steals" keys from the server that was previously responsible for that portion of the ring. All other servers keep their keys. Similarly, when a server is removed, only its keys redistribute—to the next server clockwise.
Ring Walkthrough Example:
Imagine a ring from 0 to 100 with three servers: Server A at position 20, Server B at position 50, and Server C at position 80.
Key routing: each key walks clockwise to the first server it meets. A key at position 30 routes to B (at 50), a key at position 60 routes to C (at 80), and a key at position 90 wraps around to A (at 20).
Now add Server D at position 65:
Only keys between positions 50 and 65 remap from C to D. Everything else is unchanged.
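The same walkthrough can be checked in a few lines of code. This is a toy sketch using the illustrative positions above (A=20, B=50, C=80, then D=65); the real implementations later on this page use a 32-bit ring.

```python
import bisect

def owner(ring: dict, key_pos: int) -> str:
    """Walk clockwise from key_pos to the first server position, wrapping at the end."""
    positions = sorted(ring)
    idx = bisect.bisect_left(positions, key_pos)
    return ring[positions[idx % len(positions)]]

ring = {20: "A", 50: "B", 80: "C"}
key_positions = range(0, 100, 5)

before = {k: owner(ring, k) for k in key_positions}
ring[65] = "D"  # add Server D at position 65
after = {k: owner(ring, k) for k in key_positions}

moved = [k for k in key_positions if before[k] != after[k]]
print(moved)  # only positions in (50, 65] change owner: [55, 60, 65]
```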
When adding/removing a server from a pool of n servers: expected keys remapped = K/n (where K is total keys). With 100 servers, only ~1% of keys remap on a change. This is the fundamental property that makes consistent hashing so powerful.
Let's implement a basic consistent hash ring without virtual nodes first, then enhance it.
```python
import hashlib
import bisect
from typing import Optional, List, Dict


class BasicConsistentHash:
    """
    Basic consistent hash ring implementation.

    Maps servers and keys to positions on a ring (0 to 2^32 - 1).
    Uses binary search for efficient O(log n) lookups.

    Limitation: Poor balance with few servers (solved by virtual nodes).
    """

    def __init__(self):
        # Sorted list of ring positions; server lookup goes through _position_to_server
        self._ring: List[int] = []
        self._position_to_server: Dict[int, str] = {}

    def _hash(self, key: str) -> int:
        """
        Hash a key to a position on the ring.
        Uses MD5 for good distribution.
        Returns value in [0, 2^32).
        """
        h = hashlib.md5(key.encode())
        # Take first 4 bytes as integer
        return int.from_bytes(h.digest()[:4], 'big')

    def add_server(self, server: str) -> None:
        """
        Add a server to the ring.
        Time Complexity: O(n) due to list insertion
        """
        position = self._hash(server)
        if position in self._position_to_server:
            # Collision - add suffix and rehash
            position = self._hash(f"{server}:collision")
        bisect.insort(self._ring, position)
        self._position_to_server[position] = server

    def remove_server(self, server: str) -> None:
        """
        Remove a server from the ring.
        Time Complexity: O(n) due to list removal
        """
        position = self._hash(server)
        if position in self._position_to_server:
            self._ring.remove(position)
            del self._position_to_server[position]

    def get_server(self, key: str) -> Optional[str]:
        """
        Get the server responsible for a key.
        Walks clockwise from key's position to find first server.
        Time Complexity: O(log n) using binary search
        """
        if not self._ring:
            return None
        position = self._hash(key)
        # Find first server position >= key position
        idx = bisect.bisect_left(self._ring, position)
        # Wrap around if past the end
        if idx >= len(self._ring):
            idx = 0
        server_position = self._ring[idx]
        return self._position_to_server[server_position]

    def get_servers_for_key(self, key: str, n: int = 3) -> List[str]:
        """
        Get n servers responsible for a key (for replication).
        Returns n distinct servers walking clockwise from key position.
        """
        if not self._ring:
            return []
        position = self._hash(key)
        idx = bisect.bisect_left(self._ring, position)
        servers = []
        seen = set()
        for i in range(len(self._ring)):
            server_pos = self._ring[(idx + i) % len(self._ring)]
            server = self._position_to_server[server_pos]
            if server not in seen:
                servers.append(server)
                seen.add(server)
            if len(servers) >= n:
                break
        return servers


# Demonstration
if __name__ == "__main__":
    ch = BasicConsistentHash()

    # Add servers
    servers = ["server-a", "server-b", "server-c"]
    for s in servers:
        ch.add_server(s)

    # Generate some keys
    keys = [f"user-{i}" for i in range(10)]

    print("Initial distribution (3 servers):")
    print("-" * 50)
    distribution = {}
    key_to_server = {}
    for key in keys:
        server = ch.get_server(key)
        print(f"  {key} → {server}")
        distribution[server] = distribution.get(server, 0) + 1
        key_to_server[key] = server

    print(f"\nCounts: {distribution}")

    # Add a server
    print("\nAdding server-d:")
    ch.add_server("server-d")

    new_distribution = {}
    remapped = 0
    for key in keys:
        old_server = key_to_server[key]
        new_server = ch.get_server(key)
        new_distribution[new_server] = new_distribution.get(new_server, 0) + 1
        if old_server != new_server:
            remapped += 1
            print(f"  {key}: {old_server} → {new_server} (REMAPPED)")

    print(f"\nRemapped: {remapped}/{len(keys)} = {remapped/len(keys)*100:.0f}%")
    print(f"Expected: ~{100/4:.0f}% (1/n where n=4)")
```

With only a few physical servers, positions on the ring may cluster, creating imbalanced distribution. Server A might handle 60% of keys while C handles only 10%. Virtual nodes solve this problem.
The virtual nodes technique maps each physical server to multiple positions on the ring. Instead of one position per server, we create 100-300 positions per server.
Why Virtual Nodes Work:
With a single position per server, the arc each server owns depends on a handful of random points, so arcs can vary wildly in size. With hundreds of pseudo-random positions per server, each server's share of the ring is the sum of many small arcs, and those sums even out: the more virtual nodes, the closer every server gets to an equal 1/n share.
The Tradeoff:
More virtual nodes mean more ring entries to store and search. Memory grows with n × v, adding or removing a server touches v positions instead of one, and lookups become O(log(n × v)). The cost is modest at typical settings but grows with the virtual node count.
```python
import hashlib
import bisect
from typing import Optional, List, Dict, Set
from dataclasses import dataclass
import threading


@dataclass
class RingPosition:
    """Represents a position on the hash ring."""
    position: int
    physical_server: str
    virtual_node_id: int


class ConsistentHashRing:
    """
    Production-quality consistent hash ring with virtual nodes.

    Features:
    - Configurable virtual nodes per server
    - O(log n) lookups
    - Thread-safe operations
    - Server weight support via variable virtual nodes
    """

    def __init__(self, virtual_nodes: int = 150):
        """
        Initialize the hash ring.

        Args:
            virtual_nodes: Number of virtual nodes per physical server.
                           Higher = better balance, more memory.
                           Typical values: 100-300.
        """
        self._virtual_nodes = virtual_nodes
        self._ring: List[int] = []
        self._position_to_info: Dict[int, RingPosition] = {}
        self._servers: Set[str] = set()
        self._lock = threading.Lock()

    def _hash(self, key: str) -> int:
        """Hash key to 32-bit position."""
        h = hashlib.md5(key.encode())
        return int.from_bytes(h.digest()[:4], 'big')

    def _virtual_node_key(self, server: str, vnode_id: int) -> str:
        """Generate unique key for a virtual node."""
        return f"{server}#vnode{vnode_id}"

    def add_server(self, server: str, weight: int = 1) -> None:
        """
        Add a server with optional weight.

        Weight controls how many virtual nodes are created.
        weight=2 means twice the virtual nodes (twice the traffic).

        Time Complexity: O(v * log(n*v)) where v=virtual_nodes
        """
        with self._lock:
            if server in self._servers:
                return
            self._servers.add(server)

            # Create virtual nodes (scaled by weight)
            num_vnodes = self._virtual_nodes * weight
            for i in range(num_vnodes):
                vnode_key = self._virtual_node_key(server, i)
                position = self._hash(vnode_key)

                # Handle collisions by rehashing
                attempts = 0
                while position in self._position_to_info and attempts < 10:
                    position = self._hash(f"{vnode_key}:{attempts}")
                    attempts += 1

                if position not in self._position_to_info:
                    ring_pos = RingPosition(
                        position=position,
                        physical_server=server,
                        virtual_node_id=i
                    )
                    bisect.insort(self._ring, position)
                    self._position_to_info[position] = ring_pos

    def remove_server(self, server: str) -> None:
        """
        Remove a server and all its virtual nodes.

        Time Complexity: O(v * n) where v=virtual_nodes
        """
        with self._lock:
            if server not in self._servers:
                return
            self._servers.discard(server)

            # Remove all positions for this server
            positions_to_remove = [
                pos for pos, info in self._position_to_info.items()
                if info.physical_server == server
            ]
            for pos in positions_to_remove:
                self._ring.remove(pos)
                del self._position_to_info[pos]

    def get_server(self, key: str) -> Optional[str]:
        """
        Get the server responsible for a key.

        Time Complexity: O(log n)
        """
        with self._lock:
            if not self._ring:
                return None
            position = self._hash(key)
            idx = bisect.bisect_left(self._ring, position)
            if idx >= len(self._ring):
                idx = 0
            ring_pos = self._position_to_info[self._ring[idx]]
            return ring_pos.physical_server

    def get_servers(self, key: str, count: int = 3) -> List[str]:
        """
        Get multiple distinct servers for a key (for replication).
        """
        with self._lock:
            if not self._ring:
                return []
            position = self._hash(key)
            idx = bisect.bisect_left(self._ring, position)
            servers = []
            seen: Set[str] = set()
            for i in range(len(self._ring)):
                ring_idx = (idx + i) % len(self._ring)
                ring_pos = self._position_to_info[self._ring[ring_idx]]
                server = ring_pos.physical_server
                if server not in seen:
                    servers.append(server)
                    seen.add(server)
                if len(servers) >= count:
                    break
            return servers

    def get_distribution_stats(self) -> Dict[str, Dict]:
        """
        Get distribution statistics for monitoring.
        Returns per-server: virtual node count and ring coverage.
        """
        with self._lock:
            stats: Dict[str, Dict] = {}
            for server in self._servers:
                vnodes = sum(
                    1 for info in self._position_to_info.values()
                    if info.physical_server == server
                )
                stats[server] = {
                    "virtual_nodes": vnodes,
                    "coverage_pct": round(vnodes / len(self._ring) * 100, 2) if self._ring else 0
                }
            return stats


# Demonstration: Balance improvement with virtual nodes
if __name__ == "__main__":
    import random

    # Test with and without virtual nodes
    for vnodes in [1, 10, 100, 300]:
        ring = ConsistentHashRing(virtual_nodes=vnodes)
        servers = ["server-a", "server-b", "server-c", "server-d"]
        for s in servers:
            ring.add_server(s)

        # Distribute 10,000 random keys
        distribution = {s: 0 for s in servers}
        for i in range(10000):
            key = f"key-{random.randint(0, 1000000)}"
            server = ring.get_server(key)
            distribution[server] += 1

        # Calculate standard deviation
        values = list(distribution.values())
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        std_dev = variance ** 0.5

        print(f"\nVirtual nodes: {vnodes}")
        print(f"  Distribution: {distribution}")
        print(f"  Std deviation: {std_dev:.0f} (lower is better)")
        print(f"  Balance score: {(mean - std_dev) / mean * 100:.1f}%")
```

| Virtual Nodes | Balance Quality | Memory Usage | Use Case |
|---|---|---|---|
| 1 (none) | Poor | Minimal | Never use in production |
| 10-50 | Moderate | Low | Small clusters, testing |
| 100-150 | Good | Moderate | Most production use cases |
| 200-300 | Excellent | Higher | Large clusters, strict balance requirements |
| 500+ | Marginal improvement | High | Rarely justified |
For most applications, 100-200 virtual nodes per server provides excellent balance without excessive memory usage. Beyond 300 vnodes, improvements are minimal but memory and computation costs continue to increase.
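As a quick sanity check, the get_distribution_stats() helper in the ring implementation above can confirm that weights translate into proportional ring coverage. The server names here are illustrative:

```python
# Assumes ConsistentHashRing from the implementation above is in scope.
ring = ConsistentHashRing(virtual_nodes=150)
ring.add_server("cache-small")            # weight 1 -> 150 virtual nodes
ring.add_server("cache-large", weight=2)  # weight 2 -> 300 virtual nodes

for server, stats in ring.get_distribution_stats().items():
    print(server, stats)
# Expected roughly: cache-small ~33% ring coverage, cache-large ~67%
```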
Major load balancers support consistent hashing. Here's how to configure them.
```nginx
# NGINX Consistent Hashing Configuration

upstream backend_consistent_hash {
    # Enable consistent hashing based on request URI
    hash $request_uri consistent;

    # Server list - order doesn't matter with consistent hash
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;

    # Weight affects virtual node count
    # Higher weight = more virtual nodes = more traffic
    # server 10.0.1.4:8080 weight=2;
}

upstream consistent_hash_by_ip {
    # Hash by client IP for session affinity
    hash $remote_addr consistent;

    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;
}

upstream consistent_hash_custom {
    # Hash by custom variable
    # Cookie, header, or computed value
    hash $cookie_session_id consistent;

    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;
}

# Consistent hash with health checking
# When server goes down, only its keys remap
upstream consistent_with_backup {
    hash $request_uri consistent;

    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;

    # down marker: keys remap to other servers
    # server 10.0.1.4:8080 down;

    # backup: only receives traffic if primary server fails
    # server 10.0.1.5:8080 backup;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://backend_consistent_hash;

        # Retry on failure - request may remap to different server
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 2;
    }

    location /session/ {
        proxy_pass http://consistent_hash_by_ip;
    }
}
```
```haproxy
# HAProxy Consistent Hashing Configuration

global
    maxconn 50000

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 60s

backend consistent_hash_url
    # Consistent hash on URL
    balance uri
    hash-type consistent

    server app1 10.0.1.1:8080 check
    server app2 10.0.1.2:8080 check
    server app3 10.0.1.3:8080 check

backend consistent_hash_source
    # Consistent hash on source IP
    balance source
    hash-type consistent

    server app1 10.0.1.1:8080 check
    server app2 10.0.1.2:8080 check
    server app3 10.0.1.3:8080 check

backend consistent_hash_header
    # Consistent hash on custom header
    balance hdr(X-User-ID)
    hash-type consistent

    # Weight affects traffic share
    server app1 10.0.1.1:8080 weight 100 check
    server app2 10.0.1.2:8080 weight 100 check
    server app3 10.0.1.3:8080 weight 50 check   # Half traffic

backend consistent_hash_cookie
    # Consistent hash on the sessionid URL parameter (session affinity)
    balance url_param sessionid
    hash-type consistent

    server app1 10.0.1.1:8080 check
    server app2 10.0.1.2:8080 check

frontend http_front
    bind *:80

    # Route based on path
    acl is_api path_beg /api
    use_backend consistent_hash_url if is_api
    default_backend consistent_hash_source

listen stats
    bind *:8404
    stats enable
    stats uri /stats
```
```yaml
# Envoy Consistent Hashing Configuration

static_resources:
  clusters:
  - name: backend_service
    type: STRICT_DNS
    connect_timeout: 5s

    # Ring hash load balancing (consistent hashing)
    lb_policy: RING_HASH
    ring_hash_lb_config:
      # Minimum ring size (virtual nodes)
      minimum_ring_size: 1024
      # Maximum ring size
      maximum_ring_size: 8388608
      # Hash function
      hash_function: XX_HASH  # or MURMUR_HASH_2

    load_assignment:
      cluster_name: backend_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 10.0.1.1
                port_value: 8080
        - endpoint:
            address:
              socket_address:
                address: 10.0.1.2
                port_value: 8080
        - endpoint:
            address:
              socket_address:
                address: 10.0.1.3
                port_value: 8080

    health_checks:
    - timeout: 3s
      interval: 5s
      unhealthy_threshold: 3
      healthy_threshold: 2
      http_health_check:
        path: /health

  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 80
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: backend_service
                  hash_policy:
                  # Hash on header
                  - header:
                      header_name: x-user-id
                  # Fallback: hash on source IP
                  - connection_properties:
                      source_ip: true
          http_filters:
          - name: envoy.filters.http.router
```

Envoy's minimum_ring_size parameter controls virtual nodes. A value of 1024 with 4 servers means ~256 virtual nodes per server. Increase this for better balance or more servers, but be mindful of memory implications.
Consistent hashing has evolved with several important extensions.
Bounded Loads (Google, 2017):
Standard consistent hashing can still create imbalance—some servers may receive significantly more traffic than others, especially during server failures or additions.
Bounded load consistent hashing adds a constraint: no server can exceed (1 + ε) times the average load. If a server is "full," requests walk further clockwise to find the next available server.
This guarantees an upper bound of (1 + ε) times the average load on every server, while keeping most of consistent hashing's stability: only the requests that would overflow a full server are routed past it to the next server clockwise.
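The sketch below shows one way to layer a bounded-load check on top of a plain ring; it is not Google's exact algorithm, and the candidates, loads, and epsilon names are illustrative. The idea is to walk clockwise past servers that are already at their (1 + ε) capacity.

```python
import math
from typing import Dict, List, Optional

def bounded_load_pick(
    candidates: List[str],      # servers in clockwise order from the key's position
    loads: Dict[str, int],      # current number of assignments per server
    epsilon: float = 0.25,
) -> Optional[str]:
    """Pick the first clockwise server whose load stays within the (1 + epsilon) cap."""
    total = sum(loads.values()) + 1                      # include the request being placed
    cap = math.ceil((1 + epsilon) * total / len(loads))  # per-server capacity
    for server in candidates:
        if loads[server] + 1 <= cap:
            loads[server] += 1
            return server
    return None  # cannot happen if candidates covers every server

# Usage with the ConsistentHashRing above (illustrative):
#   candidates = ring.get_servers("user-42", count=len(current_loads))
#   target = bounded_load_pick(candidates, current_loads)
```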
Jump Hash (Google, 2014):
A simpler, faster alternative that doesn't use a ring structure: Jump Hash computes a bucket number directly from the key and the bucket count, using O(1) memory and O(log n) time per lookup, with a perfectly uniform distribution and minimal remapping when the bucket count changes (see the implementation below).
Limitation: Only works when servers are numbered 0 to n-1. Removing server k means renumbering all servers > k, which can cause widespread remapping. Best for scenarios where the server set is append-only (only adding servers, never removing specific ones).
```python
def jump_hash(key: int, num_buckets: int) -> int:
    """
    Google's Jump Hash algorithm.

    Fast, memory-efficient consistent hashing.
    Returns bucket number in [0, num_buckets).

    Properties:
    - O(log n) time
    - O(1) space
    - Perfectly uniform distribution
    - Minimal remapping on bucket count changes

    Limitation: Only works for numbered buckets 0..n-1
    Removing bucket k requires renumbering k+1..n

    Reference: https://arxiv.org/abs/1406.2294
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")

    b = -1
    j = 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def string_to_int(s: str) -> int:
    """Convert string to int for hashing."""
    import hashlib
    h = hashlib.md5(s.encode())
    return int.from_bytes(h.digest()[:8], 'big')


# Demonstration
if __name__ == "__main__":
    print("Jump Hash Distribution Test")
    print("=" * 50)

    # Test distribution with 5 buckets
    buckets = 5
    counts = [0] * buckets
    for i in range(10000):
        key = string_to_int(f"key-{i}")
        bucket = jump_hash(key, buckets)
        counts[bucket] += 1

    print(f"\n5 buckets, 10,000 keys:")
    for i, count in enumerate(counts):
        bar = "█" * (count // 100)
        print(f"  Bucket {i}: {count:5d} {bar}")

    # Test remapping when adding bucket
    print(f"\nRemapping test (5 → 6 buckets):")
    remapped = 0
    for i in range(10000):
        key = string_to_int(f"key-{i}")
        old = jump_hash(key, 5)
        new = jump_hash(key, 6)
        if old != new:
            remapped += 1
    print(f"  Remapped: {remapped}/10,000 = {remapped/100:.1f}%")
    print(f"  Expected: ~{100/6:.1f}% (1/n)")
```

| Algorithm | Memory | Lookup Time | Add/Remove | Balance |
|---|---|---|---|---|
| Ring (basic) | O(n) | O(log n) | O(n) | Poor |
| Ring + vnodes | O(n×v) | O(log(n×v)) | O(v log n) | Good |
| Jump Hash | O(1) | O(log n) compute | O(1) if append-only | Perfect |
| Bounded Load | O(n×v) | O(log(n×v)) + routing | O(v log n) | Guaranteed |
Use consistent hashing when: (1) you need hash-based routing (affinity, partitioning), AND (2) the server pool changes (autoscaling, deployments, failures). If either condition is false, simpler algorithms suffice.
Module Complete:
You've now covered the complete spectrum of load distribution algorithms, culminating in consistent hashing: the hash ring, virtual nodes, bounded loads, jump hash, and production configurations for NGINX, HAProxy, and Envoy.
You can now select and configure the appropriate algorithm for any load balancing scenario, understanding the tradeoffs each entails.
Congratulations! You have achieved mastery of load distribution algorithms. You understand not just how each algorithm works, but when to apply each one, their implementation nuances, and production configuration patterns. This knowledge forms the foundation for designing scalable, resilient distributed systems.