Deploying a distributed cache is only the beginning. The real challenge lies in operating cache clusters reliably at scale—handling cluster topology changes, managing capacity, ensuring high availability, responding to incidents, and maintaining performance as traffic patterns evolve.
Cache cluster management encompasses a broad set of operational concerns:
Deployment and Configuration: How do you provision cache clusters, configure them for your workload, and maintain configuration consistency across environments?
Scaling Operations: How do you add capacity without disrupting live traffic? How do you handle traffic spikes that exceed current capacity?
Monitoring and Alerting: What metrics indicate cache health? What thresholds trigger investigation or action?
Maintenance and Upgrades: How do you perform maintenance without cache outages? How do you upgrade versions safely?
Failure Response: What happens when nodes fail? How quickly can you recover? How do you prevent cascade failures?
This page provides a comprehensive operational playbook for managing distributed cache clusters, drawing from lessons learned at organizations operating caches at massive scale.
By the end of this page, you will be able to:

- Understand cache cluster deployment patterns and infrastructure considerations
- Master monitoring strategies for cache health and performance
- Develop operational procedures for scaling, maintenance, and failure response
- Apply capacity planning methodologies for cache clusters
How you deploy cache clusters significantly impacts availability, performance, and operational complexity. Understanding deployment patterns helps you design for your requirements.
The simplest topology places all cache nodes in a single availability zone:
Advantages:
Disadvantages:
When to Use:
Distributing cache nodes across availability zones provides fault tolerance:
Advantages:
Disadvantages:
When to Use:
For global applications, deploying caches in multiple regions:
Approaches:
Independent Clusters per Region:
Active-Passive Multi-Region:
Active-Active Multi-Region (Advanced):
Considerations:
Multi-region caching adds substantial complexity. Most applications can achieve adequate availability with multi-AZ deployment in a single region. Only pursue multi-region when you have genuine global user bases or regulatory requirements for data locality.
Effective capacity planning prevents both over-provisioning (wasted cost) and under-provisioning (performance degradation, cache thrashing).
The core question: How much memory does your cache need?
Approach:
Estimate Active Dataset Size:
Memory = NumKeys × (AvgKeySize + AvgValueSize + PerKeyOverhead)
Add Working Set Buffer:
Account for System Overhead:
```python
# Cache Capacity Planning Calculator

def calculate_cache_capacity(
    num_keys: int,
    avg_value_bytes: int,
    avg_key_bytes: int = 30,
    key_overhead_bytes: int = 60,      # Redis per-key overhead
    growth_buffer_pct: float = 0.25,
    system_overhead_pct: float = 0.15,
    replication_factor: int = 1,       # 1 = no replicas, 2 = 1 replica per master
) -> dict:
    """
    Calculate required cache memory.
    """
    # Raw data size
    raw_data_mb = (num_keys * (avg_value_bytes + avg_key_bytes)) / (1024 * 1024)

    # Per-key overhead
    overhead_mb = (num_keys * key_overhead_bytes) / (1024 * 1024)

    # Base memory requirement
    base_memory_mb = raw_data_mb + overhead_mb

    # Add growth buffer
    with_buffer_mb = base_memory_mb * (1 + growth_buffer_pct)

    # Add system overhead
    total_per_node_mb = with_buffer_mb * (1 + system_overhead_pct)

    # Account for replication
    total_cluster_mb = total_per_node_mb * replication_factor

    return {
        "raw_data_mb": raw_data_mb,
        "with_overhead_mb": base_memory_mb,
        "recommended_per_node_mb": total_per_node_mb,
        "total_cluster_mb": total_cluster_mb,
        "recommended_per_node_gb": total_per_node_mb / 1024,
    }

# Example: E-commerce session cache
result = calculate_cache_capacity(
    num_keys=500_000,        # 500K active sessions
    avg_value_bytes=2048,    # 2KB per session
    replication_factor=2,    # 1 master + 1 replica
)
print(f"Recommended per node: {result['recommended_per_node_gb']:.1f} GB")
print(f"Total cluster memory: {result['total_cluster_mb']:.0f} MB")
```

Memory is one constraint; CPU/network throughput is another.
Estimating Operations:
Rough Throughput Guidelines:
Scaling Decision:
- If memory-constrained: add more total memory (more nodes or bigger nodes)
- If throughput-constrained: add more nodes to distribute load

A sketch combining both constraints is shown below.
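Here is a minimal Python sketch of that decision, assuming hypothetical per-node figures (the memory and ops capacities are placeholders; benchmark your own hardware before using numbers like these):

```python
import math

def nodes_needed(
    total_memory_gb: float,           # e.g., output of the capacity calculator above
    peak_ops_per_sec: int,
    node_memory_gb: float = 32,       # hypothetical instance size
    node_ops_capacity: int = 80_000,  # assumed per-node throughput; measure yours
    headroom: float = 0.30,           # keep 30% spare for spikes and failover
) -> int:
    """Node count is set by whichever constraint, memory or throughput, binds first."""
    usable_mem = node_memory_gb * (1 - headroom)
    usable_ops = node_ops_capacity * (1 - headroom)
    memory_nodes = math.ceil(total_memory_gb / usable_mem)
    throughput_nodes = math.ceil(peak_ops_per_sec / usable_ops)
    return max(memory_nodes, throughput_nodes)

# Example: 120 GB dataset at 400K ops/sec peak -> throughput, not memory, binds
print(nodes_needed(total_memory_gb=120, peak_ops_per_sec=400_000))
```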
| Dimension | Questions to Answer | Sizing Impact |
|---|---|---|
| Dataset Size | How many keys? Average value size? Key distribution? | Total memory requirement |
| Growth Rate | How fast is data growing? Seasonal patterns? | Buffer sizing, scaling timeline |
| Access Patterns | Read/write ratio? Hot spots? | Sharding strategy, node sizing |
| Peak Traffic | What's peak vs average? Duration of peaks? | Throughput headroom needed |
| Latency Requirements | p50? p99? p999? | Node sizing, network proximity |
| Availability Target | What's acceptable downtime? | Replication factor, failover speed |
Comprehensive monitoring is essential for proactive cache management. You can't fix what you can't see.
Every cache deployment should track these fundamental metrics:
| Metric Category | Specific Metrics | Why It Matters |
|---|---|---|
| Hit Rate | hit_rate = hits / (hits + misses) | Core cache effectiveness indicator |
| Memory | used_memory, memory_fragmentation_ratio | Capacity utilization, health |
| Evictions | evictions_per_second, evicted_keys_total | Memory pressure indicator |
| Connections | connected_clients, rejected_connections | Client health, capacity |
| Throughput | commands_per_second, bytes_in/out | Load and network utilization |
| Latency | p50, p99, p999 latency per operation | Client experience, SLA compliance |
| Replication | replication_lag, connected_replicas | Data safety, read capacity |
| Persistence (Redis) | rdb_last_save_time, aof_rewrite_status | Durability, recovery point |
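As one concrete example, the hit rate in the table above can be computed directly from Redis INFO counters. A minimal redis-py sketch (the hostname is a placeholder; note the counters are cumulative since server start, so sample them over a window for a live rate):

```python
import redis  # assumes the redis-py client is installed

def hit_rate(r: redis.Redis) -> float:
    """Compute hit_rate = hits / (hits + misses) from Redis INFO stats."""
    stats = r.info("stats")
    hits = stats["keyspace_hits"]
    misses = stats["keyspace_misses"]
    total = hits + misses
    return hits / total if total else 0.0

r = redis.Redis(host="redis-master", port=6379)
print(f"hit rate: {hit_rate(r):.2%}")
```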
```yaml
# Prometheus monitoring for Redis
# Deploy redis_exporter as sidecar or standalone

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-exporter
spec:
  selector:
    matchLabels:
      app: redis-exporter
  template:
    metadata:
      labels:
        app: redis-exporter
    spec:
      containers:
        - name: redis-exporter
          image: oliver006/redis_exporter:latest
          env:
            - name: REDIS_ADDR
              value: "redis://redis-master:6379"
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-secret
                  key: password
          ports:
            - containerPort: 9121
              name: metrics
---
# Prometheus scrape config
scrape_configs:
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

# Key metrics to alert on:
# redis_memory_used_bytes / redis_memory_max_bytes > 0.85
# rate(redis_evicted_keys_total[5m]) > 100
# redis_connected_clients / redis_config_maxclients > 0.8
# redis_master_repl_offset - redis_slave_repl_offset > 10000
```

Not all metrics warrant alerts. Focus on actionable signals:
| Alert | Condition | Severity | Response |
|---|---|---|---|
| High Memory Usage | memory_used_pct > 85% | Warning | Plan capacity expansion |
| Critical Memory | memory_used_pct > 95% | Critical | Immediate action needed |
| Low Hit Rate | hit_rate < threshold (varies) | Warning | Investigate access patterns |
| Elevated Latency | p99_latency > 10ms | Warning | Check load, network, slow queries |
| Connection Saturation | connections > 80% max | Warning | Increase limit or add nodes |
| Replication Lag | lag_seconds > 30 | Warning | Check replica health |
| Master Down | master_unreachable | Critical | Verify failover, check Sentinel/Cluster |
| High Eviction Rate | evictions_per_min spike | Warning | Memory pressure, capacity needed |
A session cache might achieve 99% hit rate, while a recommendation cache might only hit 60%. Set thresholds based on your baseline, not industry averages. Alert on significant deviations from YOUR normal, not absolute values.
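One lightweight way to encode "deviation from your own normal" is to keep a rolling baseline per metric and alert on z-score rather than a fixed threshold. A sketch under illustrative assumptions (window size and threshold are placeholders to tune):

```python
from collections import deque
from statistics import mean, stdev

class BaselineAlert:
    """Flag metric samples that deviate sharply from this cache's own recent baseline."""

    def __init__(self, window: int = 288, z_threshold: float = 3.0):
        self.samples = deque(maxlen=window)  # e.g., 288 five-minute samples = 24h
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the rolling baseline."""
        anomalous = False
        if len(self.samples) >= 30:  # wait for a minimal baseline first
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                anomalous = True
        self.samples.append(value)
        return anomalous

alert = BaselineAlert()
if alert.observe(0.62):  # e.g., current hit rate for a recommendation cache
    print("hit rate deviates from baseline; investigate")
```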
Essential Dashboard Panels:
Drill-Down Capability:
Design dashboards to support incident investigation:
Scaling cache clusters requires careful planning to avoid service disruption. Both scaling up (adding capacity) and scaling down (reducing cost) have operational considerations.
Adding Nodes:
```bash
# 1. Add the new (empty) master to the cluster
redis-cli --cluster add-node new-node:6379 existing-node:6379

# 2. Rebalance slots onto the new node
redis-cli --cluster rebalance cluster:6379 --cluster-use-empty-masters

# 3. Verify cluster health and slot coverage
redis-cli --cluster check cluster:6379
```
During Resharding:
```bash
#!/bin/bash
# Redis Cluster Scaling Playbook

CLUSTER_HOST="redis-1:6379"
NEW_NODE="redis-new:6379"

echo "=== Pre-scaling checks ==="
redis-cli --cluster check $CLUSTER_HOST

echo "=== Adding new node ==="
redis-cli --cluster add-node $NEW_NODE $CLUSTER_HOST

echo "=== Waiting for node to join ==="
sleep 10
redis-cli --cluster check $CLUSTER_HOST

echo "=== Adding replica for new master ==="
# Get new node ID
NEW_NODE_ID=$(redis-cli -h redis-new -p 6379 CLUSTER MYID)
# Add replica pointing to new master
redis-cli --cluster add-node redis-replica:6379 $CLUSTER_HOST \
  --cluster-slave --cluster-master-id $NEW_NODE_ID

echo "=== Rebalancing slots ==="
# This migrates slots to the new node
redis-cli --cluster rebalance $CLUSTER_HOST \
  --cluster-weight $NEW_NODE_ID=1 \
  --cluster-use-empty-masters \
  --cluster-threshold 1

echo "=== Post-scaling verification ==="
redis-cli --cluster check $CLUSTER_HOST

echo "=== Monitor during business hours ==="
echo "Watch: latency p99, error rate, slot migration progress"
```

Memcached scaling is client-driven:
Adding Nodes:
Best Practices:
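Because Memcached servers don't coordinate with each other, the remapping behavior lives entirely in the client's hash ring. A minimal consistent-hashing sketch showing why adding a node only remaps roughly 1/N of keys (host names are hypothetical; production clients such as libmemcached provide this built in):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: adding a node remaps only ~1/N of keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self._hashes = []  # sorted virtual-node hash points
        self._nodes = {}   # hash point -> physical node
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str, vnodes: int = 100) -> None:
        # Each physical node gets many virtual points for even distribution.
        for i in range(vnodes):
            point = self._hash(f"{node}#vnode{i}")
            bisect.insort(self._hashes, point)
            self._nodes[point] = node

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect_left(self._hashes, self._hash(key))
        if idx == len(self._hashes):
            idx = 0  # wrap around the ring
        return self._nodes[self._hashes[idx]]

ring = ConsistentHashRing(["mc-1:11211", "mc-2:11211", "mc-3:11211"])
before = {f"key:{i}": ring.node_for(f"key:{i}") for i in range(10_000)}
ring.add_node("mc-4:11211")  # scale out: only ~25% of keys should move
moved = sum(before[k] != ring.node_for(k) for k in before)
print(f"keys remapped: {moved / len(before):.1%}")
```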
Removing nodes is riskier than adding:
Redis Cluster:
Memcached:
Risks:
Perform scaling operations during off-peak hours when possible. Slot migrations, cache misses, and configuration changes all have potential to cause latency spikes. Give yourself room to handle issues without peak traffic pressure.
Regular maintenance keeps cache clusters healthy. Well-defined procedures minimize risk.
Rolling Upgrades for Redis Cluster:
Memcached Upgrades:
(Memcached keeps data only in memory, so each node restarted during a rolling upgrade comes back empty. Even with consistent hashing, effectively every key will miss at least once over the course of the upgrade; plan for the resulting database load.)
```bash
#!/bin/bash
# Redis Cluster Rolling Upgrade Procedure

set -e

CLUSTER_NODES=("redis-1" "redis-2" "redis-3" "redis-4" "redis-5" "redis-6")
NEW_VERSION="7.2.3"

# Phase 1: Upgrade replicas
echo "=== Phase 1: Upgrading replicas ==="
for node in "${CLUSTER_NODES[@]}"; do
  role=$(redis-cli -h $node ROLE | head -1)
  if [ "$role" == "slave" ]; then
    echo "Upgrading replica: $node"
    # Graceful shutdown
    redis-cli -h $node SHUTDOWN SAVE
    # Upgrade binary (implementation depends on deployment)
    ssh $node "yum update redis-$NEW_VERSION -y"
    # Start new version
    ssh $node "systemctl start redis"
    # Wait for resync
    sleep 30
    until redis-cli -h $node INFO replication | grep -q "master_link_status:up"; do
      echo "Waiting for $node to resync..."
      sleep 5
    done
    echo "$node upgraded and synced"
  fi
done

# Phase 2: Failover masters to upgraded replicas
echo "=== Phase 2: Failing over masters ==="
for node in "${CLUSTER_NODES[@]}"; do
  role=$(redis-cli -h $node ROLE | head -1)
  if [ "$role" == "master" ]; then
    echo "Failing over master: $node"
    # Get replica
    replica=$(redis-cli -h $node INFO replication | grep slave0 | cut -d',' -f1 | cut -d'=' -f2)
    # Trigger failover
    redis-cli -h $replica CLUSTER FAILOVER
    sleep 10
    # Verify failover
    new_role=$(redis-cli -h $node ROLE | head -1)
    if [ "$new_role" == "slave" ]; then
      echo "$node is now replica"
    else
      echo "ERROR: Failover failed for $node"
      exit 1
    fi
  fi
done

# Phase 3: Upgrade old masters (now replicas)
echo "=== Phase 3: Upgrading demoted masters ==="
# Repeat Phase 1 logic for remaining non-upgraded nodes

echo "=== Upgrade complete ==="
redis-cli --cluster check redis-1:6379
```

Redis Memory Fragmentation:
Over time, Redis memory can become fragmented:
```
mem_fragmentation_ratio > 1.5   # Warning
mem_fragmentation_ratio > 2.0   # Action needed
```
Solutions:
- MEMORY PURGE - Attempt to release fragmented memory
- activedefrag yes - Background defragmentation (Redis 4+)

Memcached Slab Rebalancing:
If slabs are imbalanced (check with stats slabs):
```
slabs automove 1   # Enable automatic slab rebalancing
```
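To tie the Redis fragmentation thresholds above to action, a small redis-py sketch can poll the ratio and suggest a response (the hostname and the responses are illustrative, mirroring the thresholds listed earlier):

```python
import redis  # assumes the redis-py client is installed

def check_fragmentation(host: str, port: int = 6379) -> str:
    """Read mem_fragmentation_ratio from INFO memory and map it to an action."""
    r = redis.Redis(host=host, port=port)
    ratio = r.info("memory")["mem_fragmentation_ratio"]
    if ratio > 2.0:
        return f"ratio={ratio:.2f}: action needed (MEMORY PURGE, activedefrag, or rolling restart)"
    if ratio > 1.5:
        return f"ratio={ratio:.2f}: warning, watch the trend"
    return f"ratio={ratio:.2f}: healthy"

print(check_fragmentation("redis-1"))
```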
Consistency Matters:
All nodes in a cluster should have consistent configuration. Drift causes subtle bugs.
Best Practices:
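As a spot check for drift, a small script can diff key settings across nodes. A sketch with hypothetical hostnames and an illustrative parameter list:

```python
import redis  # assumes the redis-py client is installed

NODES = ["redis-1", "redis-2", "redis-3"]                 # hypothetical hostnames
CHECKED = ["maxmemory", "maxmemory-policy", "appendonly"]  # settings worth pinning

def find_drift(nodes, params):
    """Compare selected config values across nodes; return any disagreements."""
    configs = {}
    for host in nodes:
        conf = redis.Redis(host=host, decode_responses=True).config_get()
        configs[host] = {p: conf.get(p) for p in params}
    drift = {}
    for p in params:
        values = {host: cfg[p] for host, cfg in configs.items()}
        if len(set(values.values())) > 1:  # nodes disagree on this setting
            drift[p] = values
    return drift

for param, values in find_drift(NODES, CHECKED).items():
    print(f"DRIFT in {param}: {values}")
```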
Cache failures are inevitable. The goal is quick detection, automatic recovery where possible, and well-practiced manual recovery procedures.
| Failure | Detection | Automatic Recovery | Manual Response |
|---|---|---|---|
| Node crash | Health check failure | Sentinel/Cluster failover | Replace node, resync |
| Network partition | Connectivity loss | Client failover to healthy nodes | Resolve network issue |
| Memory exhaustion | OOM errors, eviction spike | Eviction policy activates | Add capacity, review data |
| Slow performance | Latency alerts | None (needs investigation) | Profile, optimize, scale |
| Full cluster down | All health checks fail | None | Emergency recovery procedure |
| Data corruption | Checksum errors | None | Restore from backup |
When Redis Sentinel detects master failure:
Failover Timing:
down-after-milliseconds (default 30s, recommend 5-10s)
```bash
#!/bin/bash
# Monitor Redis Sentinel failover status

SENTINEL_HOST="sentinel-1"
MASTER_NAME="mymaster"

echo "=== Sentinel Status ==="
redis-cli -h $SENTINEL_HOST -p 26379 SENTINEL MASTER $MASTER_NAME

echo "=== Current Master ==="
redis-cli -h $SENTINEL_HOST -p 26379 SENTINEL GET-MASTER-ADDR-BY-NAME $MASTER_NAME

echo "=== Replicas ==="
redis-cli -h $SENTINEL_HOST -p 26379 SENTINEL REPLICAS $MASTER_NAME

echo "=== Sentinel Quorum ==="
redis-cli -h $SENTINEL_HOST -p 26379 SENTINEL CKQUORUM $MASTER_NAME

# Subscribe to failover events
echo "=== Watching for failover events (Ctrl+C to exit) ==="
redis-cli -h $SENTINEL_HOST -p 26379 SUBSCRIBE +switch-master +sdown +odown
```

Cache failures can cascade to overwhelm databases:
Thundering Herd on Cache Miss:
Multiple requests for the same uncached key simultaneously query the database.
Mitigations:
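One widely used mitigation is request coalescing (sometimes called single-flight): concurrent misses for the same key share one database load instead of each issuing their own. A minimal in-process sketch (a plain dict stands in for the cache client; the loader is any database query):

```python
import threading

class SingleFlightCache:
    """Collapse concurrent misses for one key into a single loader call."""

    def __init__(self, cache: dict, loader):
        self.cache = cache    # stand-in for a real cache client
        self.loader = loader  # e.g., a database query function
        self._locks = {}
        self._guard = threading.Lock()

    def get(self, key: str):
        value = self.cache.get(key)
        if value is not None:
            return value  # fast path: cache hit
        with self._guard:  # one lock object per key
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have populated the key while we waited.
            value = self.cache.get(key)
            if value is None:
                value = self.loader(key)  # only one thread reaches the database
                self.cache[key] = value
        return value
```

In a distributed deployment the per-key lock would live in the cache itself (for example, a short-TTL SET NX key in Redis) so that requests coalesce across processes, not just threads.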
Your system should survive (degraded) if the entire cache layer fails. If cache failure causes total system outage, cache has become a single point of failure. Implement graceful degradation: slower responses from database, simplified features, or static fallbacks.
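A sketch of that degradation path, with hypothetical cache_client and db_client objects (the redis-style set(..., ex=...) call is an assumption about the client API):

```python
import logging

def get_user_profile(cache_client, db_client, user_id: str):
    """Read-through with degradation: cache errors must not take the system down."""
    key = f"profile:{user_id}"
    try:
        cached = cache_client.get(key)
        if cached is not None:
            return cached
    except Exception as exc:  # e.g., ConnectionError while the cache layer is down
        logging.warning("cache unavailable, degrading to database: %s", exc)
    profile = db_client.fetch_profile(user_id)  # slower path, but still serving
    try:
        cache_client.set(key, profile, ex=300)
    except Exception:
        pass  # best-effort repopulation; never fail the request on a cache write
    return profile
```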
Cache security is often overlooked, but cache systems can expose sensitive data and provide attack surfaces.
Never Expose to Public Internet:
Network Isolation:
```
# Redis Security Configuration

# Bind to private interface only
bind 10.0.1.100 127.0.0.1

# Require password authentication
requirepass your-strong-password-here

# Disable dangerous commands in production
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command CONFIG ""
rename-command SHUTDOWN ""
rename-command DEBUG ""
rename-command KEYS ""   # Use SCAN instead

# Enable TLS (Redis 6+)
tls-port 6380
port 0                   # Disable non-TLS port
tls-cert-file /path/to/redis.crt
tls-key-file /path/to/redis.key
tls-ca-cert-file /path/to/ca.crt
tls-auth-clients yes     # Require client certs

# ACL for fine-grained access control (Redis 6+)
user app-read on >readpassword ~cached:* +@read
user app-write on >writepassword ~* +@all -@admin
user admin on >adminpassword ~* +@all
```

Sensitive Data in Cache:
Protections:
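One common protection is encrypting sensitive values before they ever reach cache memory. A minimal sketch using Fernet from the cryptography package, assuming that package is available (a plain dict stands in for the cache client, and in practice the key would come from a secrets manager):

```python
from cryptography.fernet import Fernet  # assumes the 'cryptography' package

key = Fernet.generate_key()  # in practice, load this from a secrets manager
fernet = Fernet(key)

def cache_set_encrypted(cache, cache_key: str, plaintext: str) -> None:
    """Store only ciphertext in the cache layer."""
    cache[cache_key] = fernet.encrypt(plaintext.encode())

def cache_get_decrypted(cache, cache_key: str):
    """Decrypt on read; a missing key returns None."""
    token = cache.get(cache_key)
    return fernet.decrypt(token).decode() if token is not None else None

cache = {}  # stand-in for a Redis/Memcached client
cache_set_encrypted(cache, "session:abc", '{"user_id": 42}')
print(cache_get_decrypted(cache, "session:abc"))
```

The tradeoff is extra CPU per operation and the loss of server-side value inspection, so this pattern is usually reserved for fields that are genuinely sensitive rather than applied to every key.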
If caching data subject to GDPR, HIPAA, PCI-DSS, or similar regulations, the cache infrastructure must meet those requirements. This often includes encryption at rest (Redis Enterprise, managed services), encryption in transit, access logging, and data residency controls. Consult your compliance team before caching regulated data.
Operating cache clusters at scale requires systematic attention to deployment, monitoring, scaling, maintenance, and security. Let's consolidate the key operational principles:
What's Next:
With operational foundations in place, the next page examines Cache Consistency Challenges—the complex problems that arise when cached data diverges from source data, and strategies for maintaining acceptable consistency levels.
You now have a comprehensive operational playbook for managing distributed cache clusters. This knowledge enables you to deploy, monitor, scale, maintain, and secure cache infrastructure with confidence.