A single-node cache, no matter how powerful, cannot satisfy the requirements of large-scale systems. When you need to store terabytes of data in memory, handle hundreds of thousands of requests per second, or survive individual machine failures without service disruption, you must distribute your cache across multiple nodes. This distribution is called partitioning (or sharding).
Partitioning transforms a cache from a single point of failure into a scalable, resilient system. But this transformation introduces fundamental challenges: How do you decide which node stores which key? How do you ensure even distribution of data and load? What happens when you add or remove nodes? How do you handle nodes that fail? These questions define the partitioning problem space.
The partitioning strategy you choose affects every aspect of your cache's behavior—from latency and throughput to failure handling and operational complexity. There is no universally optimal approach; each strategy makes different trade-offs suited to different use cases. Understanding these trade-offs deeply is essential for designing caches that meet your specific requirements.
This page examines partitioning strategies from first principles, progressing from naive approaches through production-grade solutions. By the end, you'll understand why modern distributed caches converge on consistent hashing, and you'll be prepared to dive deep into its implementation in the next page.
By the end of this page, you will understand the fundamental partitioning problem in distributed caches, compare different partitioning strategies with their trade-offs, analyze the key movement problem when cluster topology changes, and recognize why consistent hashing emerged as the dominant solution.
Before exploring how to partition, we must understand why partitioning is necessary. The motivations extend beyond simply 'we have too much data for one machine'—though that's certainly a factor.
The Partitioning Function
At its core, partitioning requires a function that maps each key to a specific node (or set of nodes, if replicated):
partition(key) → node(s)
This function must have several properties:
| Property | Description | Consequence of Violation |
|---|---|---|
| Deterministic | Same key always maps to same node | Keys cannot be found; correctness fails |
| Uniform | Keys distribute evenly across nodes | Hot nodes and cold nodes; inefficient resource use |
| Stable | Minimal remapping when nodes change | Massive cache misses on topology changes |
| Fast | O(1) or O(log n) computation | Partitioning overhead dominates latency |
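These properties can be checked empirically. The sketch below is a hypothetical test harness (the function names are ours, not any standard API): it runs synthetic keys through a partition function, asserts determinism, and prints per-node counts so skew is easy to spot.

```python
import hashlib
from collections import Counter

def check_partitioner(partition_fn, num_nodes: int, num_keys: int = 10_000):
    """Empirically check determinism and uniformity of a partition function."""
    keys = [f"key:{i}" for i in range(num_keys)]

    # Determinism: the same key must map to the same node on every call
    assert all(partition_fn(k) == partition_fn(k) for k in keys)

    # Uniformity: each node should hold roughly num_keys / num_nodes keys
    counts = Counter(partition_fn(k) for k in keys)
    expected = num_keys / num_nodes
    for node, count in sorted(counts.items()):
        skew = (count - expected) / expected * 100
        print(f"node {node}: {count} keys ({skew:+.1f}% vs even split)")

# Example with a simple hash-based partition function
def md5_mod_3(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 3

check_partitioner(md5_mod_3, num_nodes=3)
```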
The terms 'partition' and 'shard' are often used interchangeably in caching contexts. Technically, a partition is a logical division of data, while a shard is the physical host of that data. A cache cluster might have 100 partitions distributed across 20 shards (physical nodes), with replication. In practice, most discussions use these terms loosely.
The simplest partitioning approach uses the modulo operation on the hash of the key:
node_index = hash(key) % number_of_nodes
This approach is intuitive and provides good distribution when the hash function has good uniformity properties.
```python
import hashlib

class ModuloPartitioner:
    def __init__(self, nodes: list[str]):
        self.nodes = nodes

    def get_node(self, key: str) -> str:
        """Return the node responsible for this key."""
        hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
        node_index = hash_value % len(self.nodes)
        return self.nodes[node_index]

    def set_nodes(self, nodes: list[str]):
        """Update the node list (triggers redistribution)."""
        self.nodes = nodes

# Example usage
partitioner = ModuloPartitioner(["node-0", "node-1", "node-2"])

# Distribute some keys
keys = ["user:123", "session:abc", "product:456", "cart:789"]
for key in keys:
    print(f"{key} -> {partitioner.get_node(key)}")

# Output might be:
# user:123 -> node-1
# session:abc -> node-0
# product:456 -> node-2
# cart:789 -> node-1
```

Advantages of Modulo Partitioning

- Simple to implement and reason about: one hash plus one modulo
- O(1) lookup with no metadata or range table to maintain
- Even key distribution, provided the hash function is uniform
The Fatal Flaw: Node Changes
Modulo partitioning works beautifully—until the number of nodes changes. When you add or remove a node, the modulo divisor changes, and nearly every key maps to a different node.
Consider what happens when we add a fourth node:
```python
import hashlib

def simulate_redistribution():
    """Demonstrate the redistribution problem with modulo hashing."""
    # Generate 1000 keys
    keys = [f"key:{i}" for i in range(1000)]

    def get_node(key: str, num_nodes: int) -> int:
        hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return hash_value % num_nodes

    # Initial state: 3 nodes
    initial_mapping = {key: get_node(key, 3) for key in keys}

    # After adding a node: 4 nodes
    new_mapping = {key: get_node(key, 4) for key in keys}

    # Count keys that moved
    moved = sum(1 for key in keys if initial_mapping[key] != new_mapping[key])
    move_percentage = (moved / len(keys)) * 100

    print(f"Keys that moved when adding node: {moved}/{len(keys)} ({move_percentage:.1f}%)")

    # Why ~75%? A key stays put only if hash % 3 == hash % 4, which holds
    # for exactly 3 of every 12 hash values (those congruent to 0, 1, or 2
    # mod 12). In general, the fraction of keys that stays when going from
    # a to b nodes is gcd(a, b) / max(a, b) -- here gcd(3, 4) / 4 = 25%.
    # So ~75% of keys move!

simulate_redistribution()
# Output: Keys that moved when adding node: ~750/1000 (75.0%)
```

When adding a single node to a 3-node cluster, approximately 75% of keys remap to different nodes. These remapped keys become cache misses, causing a thundering herd to the origin. For a cache serving 100,000 RPS with a 95% hit rate, suddenly 70,000+ RPS hit the origin instead of 5,000. The database likely cannot handle this load and crashes.
Real-World Impact
This redistribution problem isn't theoretical; mass remapping during resharding has caused production outages at major companies when a cache tier suddenly stopped absorbing load.
Modulo partitioning is only acceptable when:

- The cluster topology is effectively static: nodes are never added, removed, or replaced
- The origin can survive the miss storm of a full redistribution if topology does change
- Simplicity and O(1) lookup matter more than elasticity
An alternative to hash-based partitioning is range partitioning, where the key space is divided into contiguous ranges, each assigned to a node.
```text
Node 0: keys "a" - "m"
Node 1: keys "n" - "z"
```
For hash-based range partitioning, you divide the hash space into ranges:
```text
Node 0: hash values 0x0000... - 0x5555...
Node 1: hash values 0x5556... - 0xAAAA...
Node 2: hash values 0xAAAB... - 0xFFFF...
```
```python
import hashlib

class RangePartitioner:
    def __init__(self, node_ranges: list[tuple[str, int, int]]):
        """
        node_ranges: List of (node_name, start_hash, end_hash)
        Ranges should be contiguous and cover the full hash space.
        """
        self.node_ranges = sorted(node_ranges, key=lambda x: x[1])

    def get_node(self, key: str) -> str:
        """Return the node responsible for this key."""
        hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
        # Linear scan for clarity; a production version would binary-search
        # the sorted ranges for O(log n) lookup
        for node_name, start_hash, end_hash in self.node_ranges:
            if start_hash <= hash_value <= end_hash:
                return node_name
        # Should never reach here if ranges are properly defined
        raise ValueError(f"No node found for hash {hash_value}")

# Example: 3 nodes splitting the 128-bit MD5 hash space
MAX_HASH = 2**128 - 1
THIRD = MAX_HASH // 3

partitioner = RangePartitioner([
    ("node-0", 0, THIRD),
    ("node-1", THIRD + 1, 2 * THIRD),
    ("node-2", 2 * THIRD + 1, MAX_HASH),
])

# Now adding a node requires splitting ONE existing range
# Only keys in that range need to move
```

Advantages of Range Partitioning

- Adding a node only splits one existing range, so only the keys in that range move
- Range queries are possible, because adjacent keys (or hash values) live on the same node

Disadvantages of Range Partitioning

- Hot spots are common when keys cluster in one range (e.g., sequential IDs or a popular prefix)
- A range table must be maintained and distributed as cluster metadata
- Lookups cost O(log n) (binary search over ranges) rather than O(1)
| Aspect | Modulo | Range |
|---|---|---|
| Lookup complexity | O(1) | O(log n) |
| Adding a node | ~n/(n+1) keys move | Only split range moves |
| Hot spots | Rare (if hash is good) | Common (if keys clustered) |
| Metadata required | None | Range table |
| Range queries | Not possible | Possible |
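To make the "range queries: possible" row concrete: with key-ordered ranges, a scan only has to contact the nodes whose ranges overlap the query. A minimal sketch, using illustrative names and the two-node "a"-"m" / "n"-"z" layout from earlier:

```python
def nodes_for_scan(node_ranges, start_key: str, end_key: str) -> list[str]:
    """Return the nodes whose key ranges overlap [start_key, end_key].

    node_ranges: list of (node_name, first_key, last_key) partitioning
    the key space by raw key order (not by hash).
    """
    return [
        node for node, first, last in node_ranges
        if first <= end_key and start_key <= last  # ranges overlap
    ]

ranges = [("node-0", "a", "m"), ("node-1", "n", "z")]
print(nodes_for_scan(ranges, "k", "p"))  # ['node-0', 'node-1'] -- spans the split
print(nodes_for_scan(ranges, "b", "c"))  # ['node-0'] -- one node suffices
```

Note that this only works when keys are partitioned in their natural order; hashing the keys first destroys the adjacency that range scans rely on.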
Range partitioning suits databases more than caches. It's used by systems like HBase, Cassandra (with ByteOrderedPartitioner), and Bigtable for its range query capabilities. For pure key-value caches, hash-based consistent hashing is typically superior.
At the heart of partitioning challenges lies the key movement problem: when the cluster topology changes, which keys need to move, and how do we minimize this movement?
Why Key Movement Matters
Every key that moves from its old node to a new node during topology changes creates problems:

- The next read for that key is a cache miss, pushing the request to the origin
- At scale, those misses arrive together as a thundering herd that can overwhelm the origin
- Latency degrades until the moved keys are re-fetched and the cache warms back up
Quantifying Optimal Key Movement
What's the theoretical minimum key movement when the cluster changes?
Adding a node: If we go from N to N+1 nodes, the new node should ideally receive 1/(N+1) of the total keys (its fair share). These keys come from the existing N nodes. So the minimum movement is ~1/(N+1) of all keys.
Removing a node: If we go from N to N-1 nodes, the removed node's keys (1/N of total) must be distributed among remaining nodes. So the minimum movement is 1/N of all keys.
Optimal movement formula: moved_keys = K × (1 − min(N_old, N_new) / max(N_old, N_new))

Where K is the total number of keys. Adding one node (N → N+1) gives K/(N+1) moved keys; removing one (N → N−1) gives K/N, matching the two cases above.
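A short simulation, written for illustration rather than taken from any library, makes the gap concrete by comparing modulo hashing's measured movement against this optimal bound for the cluster sizes in the table below.

```python
import hashlib

def moved_fraction_modulo(n_old: int, n_new: int, num_keys: int = 100_000) -> float:
    """Fraction of keys that change nodes under modulo hashing."""
    def node(key: str, n: int) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % n
    keys = [f"key:{i}" for i in range(num_keys)]
    moved = sum(1 for k in keys if node(k, n_old) != node(k, n_new))
    return moved / num_keys

def moved_fraction_optimal(n_old: int, n_new: int) -> float:
    """Theoretical minimum: only the new (or lost) node's share moves."""
    return 1 - min(n_old, n_new) / max(n_old, n_new)

for n in (3, 4, 9, 99):
    actual = moved_fraction_modulo(n, n + 1)
    optimal = moved_fraction_optimal(n, n + 1)
    print(f"{n} -> {n+1} nodes: modulo moves {actual:.0%}, optimal is {optimal:.0%}")
```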
| Cluster Change | Modulo Hashing Movement | Optimal Movement | Excess Movement |
|---|---|---|---|
| 3 → 4 nodes | ~75% | ~25% | 3x optimal |
| 4 → 5 nodes | ~80% | ~20% | 4x optimal |
| 9 → 10 nodes | ~90% | ~10% | 9x optimal |
| 99 → 100 nodes | ~99% | ~1% | 99x optimal |
Counter-intuitively, modulo hashing gets WORSE as clusters grow. Adding one node to a 100-node cluster moves 99% of keys instead of the optimal 1%. This makes modulo hashing completely unsuitable for large-scale deployments.
The Quest for Minimal Movement
We need a partitioning scheme that achieves or approaches optimal key movement. This leads us to consistent hashing, which we'll explore in depth in the next page.
Consistent hashing achieves:

- Near-optimal key movement: adding or removing a node relocates only about K/N keys
- Uniform key distribution, particularly when combined with virtual nodes
- O(log n) lookups with no centralized coordination
Beyond the basic partitioning function, two additional concepts are crucial for production cache architectures: virtual nodes and replication.
Virtual Nodes (VNodes)
In simple partitioning, each physical node owns one partition. This creates problems:

- With few nodes, one partition per node makes distribution coarse and uneven
- Heterogeneous hardware cannot be weighted: a machine with 3x the capacity still gets the same share
- When a node fails or leaves, its entire partition lands on a single successor instead of spreading across the cluster
Virtual nodes solve these by assigning multiple virtual partitions to each physical node:
```python
import bisect
import hashlib

class VirtualNodePartitioner:
    """
    Each physical node owns multiple virtual nodes (vNodes).
    This improves distribution and enables heterogeneous hardware.
    """

    def __init__(self, nodes: dict[str, int]):
        """
        nodes: Dict of physical_node_name -> number of virtual nodes
        More vNodes = more capacity share
        """
        self.virtual_to_physical = {}
        for physical_node, num_vnodes in nodes.items():
            for i in range(num_vnodes):
                vnode_id = f"{physical_node}_vnode_{i}"
                self.virtual_to_physical[vnode_id] = physical_node

        # Sort virtual nodes by their hash for range-based lookup, and
        # precompute the hash list once so per-key lookups stay fast
        self.sorted_vnodes = sorted(
            self.virtual_to_physical.keys(),
            key=lambda v: self._hash(v)
        )
        self.vnode_hashes = [self._hash(v) for v in self.sorted_vnodes]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        """Find the physical node for this key using virtual nodes."""
        key_hash = self._hash(key)

        # Binary search for the first vnode with hash >= key_hash
        idx = bisect.bisect_left(self.vnode_hashes, key_hash)

        # Wrap around if past the last vnode
        if idx >= len(self.sorted_vnodes):
            idx = 0

        vnode = self.sorted_vnodes[idx]
        return self.virtual_to_physical[vnode]

# Example: Heterogeneous cluster
# node-big has 3x the capacity of node-small
partitioner = VirtualNodePartitioner({
    "node-big": 300,    # 300 virtual nodes
    "node-small": 100,  # 100 virtual nodes
})

# node-big will receive approximately 75% of keys
# node-small will receive approximately 25% of keys
```

Replication for Availability
Partitioning distributes data but doesn't protect against loss. When a node fails, its partition becomes unavailable. Replication stores copies of each key on multiple nodes:
| Strategy | Write Latency | Consistency | Failure Handling |
|---|---|---|---|
| No replication | Lowest | N/A | Data lost on failure |
| Async replication | Low | Eventually consistent | May lose recent writes |
| Sync replication (1 replica) | Medium | Read-your-writes | Survives 1 failure |
| Sync quorum (N/2+1) | Higher | Strong (with reads from quorum) | Survives up to ⌊(N−1)/2⌋ failures |
Unlike databases, caches often skip replication entirely or use async replication. Data can be regenerated from the origin, so durability isn't required. The cost of replication (memory, network, latency) may exceed the benefit. Many production caches replicate for availability (survive node failures) rather than durability.
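When a cache does replicate for availability, a common placement rule is to store each key on its primary node plus the next R−1 distinct nodes in partition order. A minimal sketch of that rule (illustrative names, stepping through a plain node list rather than a full consistent-hash ring):

```python
import hashlib

def replica_nodes(key: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Primary node plus the next replicas-1 nodes in ring order.

    A real system would walk a consistent-hash ring; here we just step
    through the node list from the key's primary position. Nodes stay
    distinct as long as replicas <= len(nodes).
    """
    primary = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]

nodes = ["node-0", "node-1", "node-2", "node-3"]
print(replica_nodes("user:123", nodes, replicas=2))
# e.g. ['node-2', 'node-3'] -- writes go to both; reads can fall back to
# the replica if the primary is down (availability, not durability)
```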
An architectural decision that profoundly impacts cache design is where partitioning logic resides: in the client, in the server, or in a separate proxy layer.
| Aspect | Client-Side | Proxy |
|---|---|---|
| Latency | Lower (1 hop) | Higher (2 hops) |
| Client complexity | Higher | Lower (thin client) |
| Topology changes | Clients must update | Proxy handles transparently |
| Protocol support | Client must support | Proxy can abstract |
| Scalability bottleneck | None (distributed) | Proxy can become bottleneck |
| Debugging | Harder (distributed) | Easier (centralized logs) |
| Client language support | Need library per language | Any client works |
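As a rough sketch of the client-side column, assuming a partitioner object like those above and a hypothetical `connections` map of per-node clients: all routing happens in the application process, which is where both the single-hop latency and the per-client topology burden come from.

```python
class PartitionedCacheClient:
    """Thin client-side router: pick the node, then talk to it directly.

    `partitioner` is any object with get_node(key) -> node name (see the
    partitioners above); `connections` maps node names to per-node clients
    exposing get/set. Both are assumptions for illustration, not a real
    library API.
    """
    def __init__(self, partitioner, connections: dict):
        self.partitioner = partitioner
        self.connections = connections

    def get(self, key: str):
        node = self.partitioner.get_node(key)   # one local computation
        return self.connections[node].get(key)  # one network hop

    def set(self, key: str, value) -> None:
        node = self.partitioner.get_node(key)
        self.connections[node].set(key, value)
```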
Hybrid Approaches
Many production systems use hybrid approaches:
Memcached traditionally uses client-side partitioning with 'ketama' consistent hashing. Redis Cluster uses a hybrid where servers can redirect clients. Large deployments (Facebook, Twitter) often use custom proxy layers for operational control. Cloud services (AWS ElastiCache, Azure Redis) typically use proxy layers to enable features like automatic failover and encryption.
We've explored the partitioning problem space comprehensively. Let's consolidate our findings:
The Case for Consistent Hashing
Our analysis reveals that an ideal partitioning scheme must:
✅ Achieve uniform distribution of keys across nodes
✅ Minimize key movement when nodes are added or removed (approaching K/N optimal)
✅ Support virtual nodes for heterogeneous capacity and improved distribution
✅ Enable O(log n) or O(1) lookup performance
✅ Work without centralized coordination for lookups
This wishlist describes consistent hashing, the algorithm that powers virtually every production distributed cache. In the next page, we'll dive deep into its mechanics, implementation, and optimizations.
You now understand why partitioning is necessary, the trade-offs of different approaches, and why modulo hashing fails at scale. The next page provides a comprehensive treatment of Consistent Hashing — the elegant algorithm that solves the key movement problem and enables distributed caches to scale gracefully.