A single-node cache, no matter how powerful, cannot satisfy the requirements of large-scale systems. When you need to store terabytes of data in memory, handle hundreds of thousands of requests per second, or survive individual machine failures without service disruption, you must distribute your cache across multiple nodes. This distribution is called partitioning (or sharding).
Partitioning transforms a cache from a single point of failure into a scalable, resilient system. But this transformation introduces fundamental challenges: How do you decide which node stores which key? How do you ensure even distribution of data and load? What happens when you add or remove nodes? How do you handle nodes that fail? These questions define the partitioning problem space.
The partitioning strategy you choose affects every aspect of your cache's behavior—from latency and throughput to failure handling and operational complexity. There is no universally optimal approach; each strategy makes different trade-offs suited to different use cases. Understanding these trade-offs deeply is essential for designing caches that meet your specific requirements.
This page examines partitioning strategies from first principles, progressing from naive approaches through production-grade solutions. By the end, you'll understand why modern distributed caches converge on consistent hashing, and you'll be prepared to dive deep into its implementation in the next page.
By the end of this page, you will understand the fundamental partitioning problem in distributed caches, compare different partitioning strategies with their trade-offs, analyze the key movement problem when cluster topology changes, and recognize why consistent hashing emerged as the dominant solution.
Before exploring how to partition, we must understand why partitioning is necessary. The motivations extend beyond simply 'we have too much data for one machine'—though that's certainly a factor.
The Partitioning Function
At its core, partitioning requires a function that maps each key to a specific node (or set of nodes, if replicated):
partition(key) → node(s)
This function must have several properties:
| Property | Description | Consequence of Violation |
|---|---|---|
| Deterministic | Same key always maps to same node | Keys cannot be found; correctness fails |
| Uniform | Keys distribute evenly across nodes | Hot nodes and cold nodes; inefficient resource use |
| Stable | Minimal remapping when nodes change | Massive cache misses on topology changes |
| Fast | O(1) or O(log n) computation | Partitioning overhead dominates latency |
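These properties can be checked empirically. The sketch below is a hypothetical test harness (the function names are ours, not any standard API): it runs synthetic keys through a partition function, asserts determinism, and prints per-node counts so skew is easy to spot.

```python
import hashlib
from collections import Counter

def check_partitioner(partition_fn, num_nodes: int, num_keys: int = 10_000):
    """Empirically check determinism and uniformity of a partition function."""
    keys = [f"key:{i}" for i in range(num_keys)]

    # Determinism: the same key must map to the same node on every call
    assert all(partition_fn(k) == partition_fn(k) for k in keys)

    # Uniformity: each node should hold roughly num_keys / num_nodes keys
    counts = Counter(partition_fn(k) for k in keys)
    expected = num_keys / num_nodes
    for node, count in sorted(counts.items()):
        skew = (count - expected) / expected * 100
        print(f"node {node}: {count} keys ({skew:+.1f}% vs even split)")

# Example with a simple hash-based partition function
def md5_mod_3(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 3

check_partitioner(md5_mod_3, num_nodes=3)
```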
The terms 'partition' and 'shard' are often used interchangeably in caching contexts. Technically, a partition is a logical division of data, while a shard is the physical host of that data. A cache cluster might have 100 partitions distributed across 20 shards (physical nodes), with replication. In practice, most discussions use these terms loosely.
The simplest partitioning approach uses the modulo operation on the hash of the key:
node_index = hash(key) % number_of_nodes
This approach is intuitive and provides good distribution when the hash function has good uniformity properties.
```python
import hashlib

class ModuloPartitioner:
    def __init__(self, nodes: list[str]):
        self.nodes = nodes

    def get_node(self, key: str) -> str:
        """Return the node responsible for this key."""
        hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
        node_index = hash_value % len(self.nodes)
        return self.nodes[node_index]

    def set_nodes(self, nodes: list[str]):
        """Update the node list (triggers redistribution)."""
        self.nodes = nodes

# Example usage
partitioner = ModuloPartitioner(["node-0", "node-1", "node-2"])

# Distribute some keys
keys = ["user:123", "session:abc", "product:456", "cart:789"]
for key in keys:
    print(f"{key} -> {partitioner.get_node(key)}")

# Output might be:
# user:123 -> node-1
# session:abc -> node-0
# product:456 -> node-2
# cart:789 -> node-1
```

Advantages of Modulo Partitioning

- Simple to implement and reason about: one hash plus one modulo
- O(1) lookup with no metadata or range table to maintain
- Even key distribution, provided the hash function is uniform
The Fatal Flaw: Node Changes
Modulo partitioning works beautifully—until the number of nodes changes. When you add or remove a node, the modulo divisor changes, and nearly every key maps to a different node.
Consider what happens when we add a fourth node:
```python
import hashlib

def simulate_redistribution():
    """Demonstrate the redistribution problem with modulo hashing."""
    # Generate 1000 keys
    keys = [f"key:{i}" for i in range(1000)]

    def get_node(key: str, num_nodes: int) -> int:
        hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return hash_value % num_nodes

    # Initial state: 3 nodes
    initial_mapping = {key: get_node(key, 3) for key in keys}

    # After adding a node: 4 nodes
    new_mapping = {key: get_node(key, 4) for key in keys}

    # Count keys that moved
    moved = sum(1 for key in keys if initial_mapping[key] != new_mapping[key])
    move_percentage = (moved / len(keys)) * 100

    print(f"Keys that moved when adding node: {moved}/{len(keys)} ({move_percentage:.1f}%)")

    # Why ~75%? A key stays put only if hash % 3 == hash % 4, which holds
    # for exactly 3 of every 12 hash values (those congruent to 0, 1, or 2
    # mod 12). In general, the fraction of keys that stays when going from
    # a to b nodes is gcd(a, b) / max(a, b) -- here gcd(3, 4) / 4 = 25%.
    # So ~75% of keys move!

simulate_redistribution()
# Output: Keys that moved when adding node: ~750/1000 (75.0%)
```

When adding a single node to a 3-node cluster, approximately 75% of keys remap to different nodes. These remapped keys become cache misses, causing a thundering herd to the origin. For a cache serving 100,000 RPS with a 95% hit rate, suddenly 70,000+ RPS hit the origin instead of 5,000. The database likely cannot handle this load and crashes.
Real-World Impact
This redistribution problem isn't theoretical; mass remapping during resharding has caused production outages at major companies when a cache tier suddenly stopped absorbing load.
Modulo partitioning is only acceptable when:

- The cluster topology is effectively static: nodes are never added, removed, or replaced
- The origin can survive the miss storm of a full redistribution if topology does change
- Simplicity and O(1) lookup matter more than elasticity
An alternative to hash-based partitioning is range partitioning, where the key space is divided into contiguous ranges, each assigned to a node.
```text
Node 0: keys "a" - "m"
Node 1: keys "n" - "z"
```
For hash-based range partitioning, you divide the hash space into ranges:
```text
Node 0: hash values 0x0000... - 0x5555...
Node 1: hash values 0x5556... - 0xAAAA...
Node 2: hash values 0xAAAB... - 0xFFFF...
```
```python
import hashlib

class RangePartitioner:
    def __init__(self, node_ranges: list[tuple[str, int, int]]):
        """
        node_ranges: List of (node_name, start_hash, end_hash)
        Ranges should be contiguous and cover the full hash space.
        """
        self.node_ranges = sorted(node_ranges, key=lambda x: x[1])

    def get_node(self, key: str) -> str:
        """Return the node responsible for this key."""
        hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
        # Linear scan for clarity; a production version would binary-search
        # the sorted ranges for O(log n) lookup
        for node_name, start_hash, end_hash in self.node_ranges:
            if start_hash <= hash_value <= end_hash:
                return node_name
        # Should never reach here if ranges are properly defined
        raise ValueError(f"No node found for hash {hash_value}")

# Example: 3 nodes splitting the 128-bit MD5 hash space
MAX_HASH = 2**128 - 1
THIRD = MAX_HASH // 3

partitioner = RangePartitioner([
    ("node-0", 0, THIRD),
    ("node-1", THIRD + 1, 2 * THIRD),
    ("node-2", 2 * THIRD + 1, MAX_HASH),
])

# Now adding a node requires splitting ONE existing range
# Only keys in that range need to move
```

Advantages of Range Partitioning

- Adding a node only splits one existing range, so only the keys in that range move
- Range queries are possible, because adjacent keys (or hash values) live on the same node

Disadvantages of Range Partitioning

- Hot spots are common when keys cluster in one range (e.g., sequential IDs or a popular prefix)
- A range table must be maintained and distributed as cluster metadata
- Lookups cost O(log n) (binary search over ranges) rather than O(1)
| Aspect | Modulo | Range |
|---|---|---|
| Lookup complexity | O(1) | O(log n) |
| Adding a node | ~n/(n+1) keys move | Only split range moves |
| Hot spots | Rare (if hash is good) | Common (if keys clustered) |
| Metadata required | None | Range table |
| Range queries | Not possible | Possible |
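To make the "range queries: possible" row concrete: with key-ordered ranges, a scan only has to contact the nodes whose ranges overlap the query. A minimal sketch, using illustrative names and the two-node "a"-"m" / "n"-"z" layout from earlier:

```python
def nodes_for_scan(node_ranges, start_key: str, end_key: str) -> list[str]:
    """Return the nodes whose key ranges overlap [start_key, end_key].

    node_ranges: list of (node_name, first_key, last_key) partitioning
    the key space by raw key order (not by hash).
    """
    return [
        node for node, first, last in node_ranges
        if first <= end_key and start_key <= last  # ranges overlap
    ]

ranges = [("node-0", "a", "m"), ("node-1", "n", "z")]
print(nodes_for_scan(ranges, "k", "p"))  # ['node-0', 'node-1'] -- spans the split
print(nodes_for_scan(ranges, "b", "c"))  # ['node-0'] -- one node suffices
```

Note that this only works when keys are partitioned in their natural order; hashing the keys first destroys the adjacency that range scans rely on.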
Range partitioning suits databases more than caches. It's used by systems like HBase, Cassandra (with ByteOrderedPartitioner), and Bigtable for its range query capabilities. For pure key-value caches, hash-based consistent hashing is typically superior.
At the heart of partitioning challenges lies the key movement problem: when the cluster topology changes, which keys need to move, and how do we minimize this movement?
Why Key Movement Matters
Every key that moves from its old node to a new node during topology changes creates problems:

- The next read for that key is a cache miss, pushing the request to the origin
- At scale, those misses arrive together as a thundering herd that can overwhelm the origin
- Latency degrades until the moved keys are re-fetched and the cache warms back up
Quantifying Optimal Key Movement
What's the theoretical minimum key movement when the cluster changes?
Adding a node: If we go from N to N+1 nodes, the new node should ideally receive 1/(N+1) of the total keys (its fair share). These keys come from the existing N nodes. So the minimum movement is ~1/(N+1) of all keys.
Removing a node: If we go from N to N-1 nodes, the removed node's keys (1/N of total) must be distributed among remaining nodes. So the minimum movement is 1/N of all keys.
Optimal movement formula: moved_keys = K × (1 − min(N_old, N_new) / max(N_old, N_new))

Where K is the total number of keys. Adding one node (N → N+1) gives K/(N+1) moved keys; removing one (N → N−1) gives K/N, matching the two cases above.
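A short simulation, written for illustration rather than taken from any library, makes the gap concrete by comparing modulo hashing's measured movement against this optimal bound for the cluster sizes in the table below.

```python
import hashlib

def moved_fraction_modulo(n_old: int, n_new: int, num_keys: int = 100_000) -> float:
    """Fraction of keys that change nodes under modulo hashing."""
    def node(key: str, n: int) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % n
    keys = [f"key:{i}" for i in range(num_keys)]
    moved = sum(1 for k in keys if node(k, n_old) != node(k, n_new))
    return moved / num_keys

def moved_fraction_optimal(n_old: int, n_new: int) -> float:
    """Theoretical minimum: only the new (or lost) node's share moves."""
    return 1 - min(n_old, n_new) / max(n_old, n_new)

for n in (3, 4, 9, 99):
    actual = moved_fraction_modulo(n, n + 1)
    optimal = moved_fraction_optimal(n, n + 1)
    print(f"{n} -> {n+1} nodes: modulo moves {actual:.0%}, optimal is {optimal:.0%}")
```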
| Cluster Change | Modulo Hashing Movement | Optimal Movement | Excess Movement |
|---|---|---|---|
| 3 → 4 nodes | ~75% | ~25% | 3x optimal |
| 4 → 5 nodes | ~80% | ~20% | 4x optimal |
| 9 → 10 nodes | ~90% | ~10% | 9x optimal |
| 99 → 100 nodes | ~99% | ~1% | 99x optimal |
Counter-intuitively, modulo hashing gets WORSE as clusters grow. Adding one node to a 100-node cluster moves 99% of keys instead of the optimal 1%. This makes modulo hashing completely unsuitable for large-scale deployments.
The Quest for Minimal Movement
We need a partitioning scheme that achieves or approaches optimal key movement. This leads us to consistent hashing, which we'll explore in depth in the next page.
Consistent hashing achieves:

- Near-optimal key movement: adding or removing a node relocates only about K/N keys
- Uniform key distribution, particularly when combined with virtual nodes
- O(log n) lookups with no centralized coordination
Beyond the basic partitioning function, two additional concepts are crucial for production cache architectures: virtual nodes and replication.
Virtual Nodes (VNodes)
In simple partitioning, each physical node owns one partition. This creates problems:

- With few nodes, one partition per node makes distribution coarse and uneven
- Heterogeneous hardware cannot be weighted: a machine with 3x the capacity still gets the same share
- When a node fails or leaves, its entire partition lands on a single successor instead of spreading across the cluster
Virtual nodes solve these by assigning multiple virtual partitions to each physical node:
```python
import bisect
import hashlib

class VirtualNodePartitioner:
    """
    Each physical node owns multiple virtual nodes (vNodes).
    This improves distribution and enables heterogeneous hardware.
    """

    def __init__(self, nodes: dict[str, int]):
        """
        nodes: Dict of physical_node_name -> number of virtual nodes
        More vNodes = more capacity share
        """
        self.virtual_to_physical = {}
        for physical_node, num_vnodes in nodes.items():
            for i in range(num_vnodes):
                vnode_id = f"{physical_node}_vnode_{i}"
                self.virtual_to_physical[vnode_id] = physical_node

        # Sort virtual nodes by their hash for range-based lookup, and
        # precompute the hash list once so per-key lookups stay fast
        self.sorted_vnodes = sorted(
            self.virtual_to_physical.keys(),
            key=lambda v: self._hash(v)
        )
        self.vnode_hashes = [self._hash(v) for v in self.sorted_vnodes]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        """Find the physical node for this key using virtual nodes."""
        key_hash = self._hash(key)

        # Binary search for the first vnode with hash >= key_hash
        idx = bisect.bisect_left(self.vnode_hashes, key_hash)

        # Wrap around if past the last vnode
        if idx >= len(self.sorted_vnodes):
            idx = 0

        vnode = self.sorted_vnodes[idx]
        return self.virtual_to_physical[vnode]

# Example: Heterogeneous cluster
# node-big has 3x the capacity of node-small
partitioner = VirtualNodePartitioner({
    "node-big": 300,    # 300 virtual nodes
    "node-small": 100,  # 100 virtual nodes
})

# node-big will receive approximately 75% of keys
# node-small will receive approximately 25% of keys
```

Replication for Availability
Partitioning distributes data but doesn't protect against loss. When a node fails, its partition becomes unavailable. Replication stores copies of each key on multiple nodes:
| Strategy | Write Latency | Consistency | Failure Handling |
|---|---|---|---|
| No replication | Lowest | N/A | Data lost on failure |
| Async replication | Low | Eventually consistent | May lose recent writes |
| Sync replication (1 replica) | Medium | Read-your-writes | Survives 1 failure |
| Sync quorum (N/2+1) | Higher | Strong (with reads from quorum) | Survives up to ⌊(N−1)/2⌋ failures |
Unlike databases, caches often skip replication entirely or use async replication. Data can be regenerated from the origin, so durability isn't required. The cost of replication (memory, network, latency) may exceed the benefit. Many production caches replicate for availability (survive node failures) rather than durability.
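When a cache does replicate for availability, a common placement rule is to store each key on its primary node plus the next R−1 distinct nodes in partition order. A minimal sketch of that rule (illustrative names, stepping through a plain node list rather than a full consistent-hash ring):

```python
import hashlib

def replica_nodes(key: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Primary node plus the next replicas-1 nodes in ring order.

    A real system would walk a consistent-hash ring; here we just step
    through the node list from the key's primary position. Nodes stay
    distinct as long as replicas <= len(nodes).
    """
    primary = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]

nodes = ["node-0", "node-1", "node-2", "node-3"]
print(replica_nodes("user:123", nodes, replicas=2))
# e.g. ['node-2', 'node-3'] -- writes go to both; reads can fall back to
# the replica if the primary is down (availability, not durability)
```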
An architectural decision that profoundly impacts cache design is where partitioning logic resides: in the client, in the server, or in a separate proxy layer.
| Aspect | Client-Side | Proxy |
|---|---|---|
| Latency | Lower (1 hop) | Higher (2 hops) |
| Client complexity | Higher | Lower (thin client) |
| Topology changes | Clients must update | Proxy handles transparently |
| Protocol support | Client must support | Proxy can abstract |
| Scalability bottleneck | None (distributed) | Proxy can become bottleneck |
| Debugging | Harder (distributed) | Easier (centralized logs) |
| Client language support | Need library per language | Any client works |
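As a rough sketch of the client-side column, assuming a partitioner object like those above and a hypothetical `connections` map of per-node clients: all routing happens in the application process, which is where both the single-hop latency and the per-client topology burden come from.

```python
class PartitionedCacheClient:
    """Thin client-side router: pick the node, then talk to it directly.

    `partitioner` is any object with get_node(key) -> node name (see the
    partitioners above); `connections` maps node names to per-node clients
    exposing get/set. Both are assumptions for illustration, not a real
    library API.
    """
    def __init__(self, partitioner, connections: dict):
        self.partitioner = partitioner
        self.connections = connections

    def get(self, key: str):
        node = self.partitioner.get_node(key)   # one local computation
        return self.connections[node].get(key)  # one network hop

    def set(self, key: str, value) -> None:
        node = self.partitioner.get_node(key)
        self.connections[node].set(key, value)
```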
Hybrid Approaches
Many production systems use hybrid approaches:
Memcached traditionally uses client-side partitioning with 'ketama' consistent hashing. Redis Cluster uses a hybrid where servers can redirect clients. Large deployments (Facebook, Twitter) often use custom proxy layers for operational control. Cloud services (AWS ElastiCache, Azure Redis) typically use proxy layers to enable features like automatic failover and encryption.
We've explored the partitioning problem space comprehensively. Let's consolidate our findings:
The Case for Consistent Hashing
Our analysis reveals that an ideal partitioning scheme must:
✅ Achieve uniform distribution of keys across nodes
✅ Minimize key movement when nodes are added or removed (approaching K/N optimal)
✅ Support virtual nodes for heterogeneous capacity and improved distribution
✅ Enable O(log n) or O(1) lookup performance
✅ Work without centralized coordination for lookups
This wishlist describes consistent hashing, the algorithm that powers virtually every production distributed cache. In the next page, we'll dive deep into its mechanics, implementation, and optimizations.
You now understand why partitioning is necessary, the trade-offs of different approaches, and why modulo hashing fails at scale. The next page provides a comprehensive treatment of Consistent Hashing — the elegant algorithm that solves the key movement problem and enables distributed caches to scale gracefully.