"Everything fails, all the time." This principle, articulated by Werner Vogels of Amazon, is the foundation of reliable distributed systems. In distributed caching, failures are not exceptional events to be handled as edge cases—they are routine occurrences that the system must handle gracefully.
The irony of distributed caching is that caches are introduced to improve reliability (reducing load on databases), but they also introduce new failure modes. A cache that fails in the wrong way can take down the entire system—the very system it was supposed to protect. Understanding failure modes and designing appropriate mitigations is what separates production-grade caches from prototypes.
This page examines failure handling comprehensively: from the taxonomy of failures through detection mechanisms, recovery strategies, and the operational practices that ensure caches enhance rather than undermine system reliability.
We will cover:
- The taxonomy of failures in distributed caches
- Failure detection with appropriate timeouts and health checks
- Recovery mechanisms, including automatic failover and cache warming
- Preventing cascading failures through circuit breakers and request shedding
- Operational practices for cache reliability
Understanding the types of failures helps you design appropriate responses, because different failures require different handling.
Node Failures
Individual cache nodes can fail in various ways:
| Failure Type | Cause | Detection | Recovery |
|---|---|---|---|
| Crash (fail-stop) | Hardware fault, OOM, kernel panic | Heartbeat timeout | Failover to replica or rebuild |
| Hang (fail-slow) | Resource exhaustion, deadlock, GC pause | Latency spike detection | Kill and restart, failover |
| Data corruption | Memory error, software bug | Checksum validation | Rebuild from source |
| Byzantine | Software bug returning wrong data | Very hard to detect | Voting, checksums |
Network Failures
Network issues create the most challenging failure modes, because the network can fail partially: a node may be reachable from some clients but not others, or reachable only intermittently.
Cluster-Wide Failures
Some failures affect the entire cache cluster rather than individual nodes, such as a bad configuration push or a datacenter-wide outage.
The hardest failures to handle are partial failures—where some requests succeed and others fail, or where a node is sometimes reachable and sometimes not. These require sophisticated detection (not just binary up/down) and graceful degradation rather than hard failover.
You cannot respond to failures you don't detect. Failure detection mechanisms balance speed (fast detection) against accuracy (avoiding false positives).
Heartbeat-Based Detection
The simplest approach: nodes periodically signal that they're alive. Missing heartbeats indicate failure.
```python
import threading
import time
from typing import Dict, Set, Callable


class HeartbeatMonitor:
    """
    Monitor node health via heartbeats.
    Nodes that miss consecutive heartbeats are marked failed.
    """

    def __init__(
        self,
        heartbeat_interval: float = 1.0,
        failure_threshold: int = 3,
        on_failure: Callable[[str], None] = None,
        on_recovery: Callable[[str], None] = None
    ):
        self.heartbeat_interval = heartbeat_interval
        self.failure_threshold = failure_threshold
        self.on_failure = on_failure
        self.on_recovery = on_recovery

        self.last_heartbeat: Dict[str, float] = {}
        self.missed_beats: Dict[str, int] = {}
        self.failed_nodes: Set[str] = set()
        self.lock = threading.Lock()

        self._start_monitor()

    def record_heartbeat(self, node_id: str):
        """Record a heartbeat from a node."""
        with self.lock:
            self.last_heartbeat[node_id] = time.time()
            self.missed_beats[node_id] = 0

            # Check for recovery
            if node_id in self.failed_nodes:
                self.failed_nodes.remove(node_id)
                if self.on_recovery:
                    self.on_recovery(node_id)

    def _check_nodes(self):
        """Periodic check for missing heartbeats."""
        with self.lock:
            current_time = time.time()
            for node_id, last_time in list(self.last_heartbeat.items()):
                if current_time - last_time > self.heartbeat_interval:
                    self.missed_beats[node_id] = self.missed_beats.get(node_id, 0) + 1

                    if (self.missed_beats[node_id] >= self.failure_threshold
                            and node_id not in self.failed_nodes):
                        self.failed_nodes.add(node_id)
                        if self.on_failure:
                            self.on_failure(node_id)

    def _start_monitor(self):
        """Start background monitoring thread."""
        def monitor_loop():
            while True:
                time.sleep(self.heartbeat_interval)
                self._check_nodes()

        thread = threading.Thread(target=monitor_loop, daemon=True)
        thread.start()

    def is_healthy(self, node_id: str) -> bool:
        """Check if a node is considered healthy."""
        with self.lock:
            return node_id not in self.failed_nodes
```

Timeout Configuration
Timeout values balance failure detection speed against false positives: too aggressive, and a transient network blip triggers an unnecessary failover; too lenient, and clients keep sending requests to a dead node. Typical values match the defaults in the monitor above: a heartbeat every 1 second and a node marked failed after 3 consecutive misses, so detection takes a few seconds.
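For a concrete illustration, the snippet below sets client-side timeouts in the same spirit: fail fast on the request path and leave the "node is down" decision to the health checker. It is a sketch assuming the redis-py client (whose constructor accepts these socket timeout parameters); the hostname and the specific values are placeholders.

```python
import redis

# Sketch of client-side timeout settings, assuming the redis-py library.
cache = redis.Redis(
    host="cache-node-1.internal",   # placeholder hostname
    port=6379,
    socket_connect_timeout=0.1,     # 100 ms to establish a connection
    socket_timeout=0.25,            # 250 ms per request before giving up
    retry_on_timeout=False,         # don't retry blindly; let failover logic decide
    health_check_interval=30,       # ping idle connections periodically
)
```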
Request-Based Detection
Instead of dedicated heartbeats, use real request traffic:
```python
from typing import Dict


class RequestBasedHealthCheck:
    """
    Track node health based on request success/failure.
    More accurate than heartbeats for detecting degraded performance.
    """

    def __init__(
        self,
        failure_rate_threshold: float = 0.5,  # 50% failure rate
        window_size: int = 100,               # Last 100 requests
        min_requests: int = 10                # Need at least 10 requests to judge
    ):
        self.failure_rate_threshold = failure_rate_threshold
        self.window_size = window_size
        self.min_requests = min_requests

        # Per-node sliding window of success/failure
        self.request_history: Dict[str, list[bool]] = {}

    def record_success(self, node_id: str):
        """Record a successful request to a node."""
        self._record(node_id, True)

    def record_failure(self, node_id: str):
        """Record a failed request to a node."""
        self._record(node_id, False)

    def _record(self, node_id: str, success: bool):
        if node_id not in self.request_history:
            self.request_history[node_id] = []

        history = self.request_history[node_id]
        history.append(success)

        # Keep only recent requests
        if len(history) > self.window_size:
            history.pop(0)

    def is_healthy(self, node_id: str) -> bool:
        """Determine if node is healthy based on recent requests."""
        if node_id not in self.request_history:
            return True  # No data, assume healthy

        history = self.request_history[node_id]
        if len(history) < self.min_requests:
            return True  # Not enough data

        failure_rate = 1 - (sum(history) / len(history))
        return failure_rate < self.failure_rate_threshold

    def get_failure_rate(self, node_id: str) -> float:
        """Get the current failure rate for a node."""
        if node_id not in self.request_history:
            return 0.0

        history = self.request_history[node_id]
        if not history:
            return 0.0

        return 1 - (sum(history) / len(history))
```

The Phi Accrual Failure Detector (used in Cassandra, Akka) provides a probability of failure rather than binary up/down. It tracks heartbeat arrival times, models their distribution, and calculates the probability that a node has failed given how late the heartbeat is. This adapts to network conditions automatically.
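A minimal sketch of that idea follows. It simplifies the real algorithm by modeling heartbeat intervals with a normal distribution; the window size and phi threshold are illustrative assumptions, not values from Cassandra or Akka.

```python
import math
import time
from collections import deque


class SimplePhiAccrualDetector:
    """Simplified phi accrual detector: phi = -log10(P(heartbeat this late))."""

    def __init__(self, window_size: int = 100, phi_threshold: float = 8.0):
        self.intervals = deque(maxlen=window_size)  # recent heartbeat gaps
        self.phi_threshold = phi_threshold
        self.last_heartbeat = None

    def record_heartbeat(self):
        now = time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self) -> float:
        """Current suspicion level; higher means more likely failed."""
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0
        elapsed = time.time() - self.last_heartbeat
        mean = sum(self.intervals) / len(self.intervals)
        variance = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(variance), 1e-3)  # avoid division by zero
        # P(gap >= elapsed) under a normal model, via the complementary CDF
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = min(max(p_later, 1e-15), 1.0)  # clamp to keep phi finite and non-negative
        return -math.log10(p_later)

    def is_suspected(self) -> bool:
        return self.phi() > self.phi_threshold
```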
Once a failure is detected, the system must recover. Recovery strategies depend on whether the cache has replicas and whether data loss is acceptable.
Replica Failover
With replicated caches, clients can failover to replicas:
```python
class ReplicatedCacheClient:
    """
    Cache client with automatic failover to replicas.
    """

    def __init__(self, hash_ring, health_checker, replication_factor: int = 3):
        self.hash_ring = hash_ring
        self.health_checker = health_checker
        self.replication_factor = replication_factor

    def get(self, key: str):
        """
        Read with automatic failover.
        Try primary first, then replicas in order.
        """
        nodes = self.hash_ring.get_nodes(key, count=self.replication_factor)

        for node in nodes:
            if not self.health_checker.is_healthy(node):
                continue

            try:
                value = self._send_get(node, key)
                self.health_checker.record_success(node)
                return value
            except TimeoutError:
                self.health_checker.record_failure(node)
                continue
            except ConnectionError:
                self.health_checker.record_failure(node)
                continue

        # All replicas failed
        raise CacheUnavailableError(f"No healthy replica for key {key}")

    def set(self, key: str, value: any, ttl: int = None):
        """
        Write to primary and replicas.
        Configurable consistency: write to N replicas before returning.
        """
        nodes = self.hash_ring.get_nodes(key, count=self.replication_factor)
        healthy_nodes = [n for n in nodes if self.health_checker.is_healthy(n)]

        if not healthy_nodes:
            raise CacheUnavailableError(f"No healthy node for key {key}")

        # Write to all healthy replicas (async for speed, or sync for durability)
        errors = []
        for node in healthy_nodes:
            try:
                self._send_set(node, key, value, ttl)
                self.health_checker.record_success(node)
            except Exception as e:
                self.health_checker.record_failure(node)
                errors.append((node, e))

        # Decide success based on write quorum (e.g., W=2 of RF=3)
        successful_writes = len(healthy_nodes) - len(errors)
        required_writes = (self.replication_factor // 2) + 1

        if successful_writes < required_writes:
            raise InsufficientReplicasError(
                f"Only {successful_writes}/{required_writes} writes succeeded"
            )
```

Cache Warming
When a new node joins or a failed node restarts, its cache is cold (empty). This creates a temporary load spike on the origin. Cache warming strategies mitigate this:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Lazy warming | Cache fills as requests arrive | Simple, automatic | Temporary hit rate drop |
| Pre-warming | Copy data from healthy replicas | Immediate hit rate | Network/CPU overhead |
| Backup restore | Restore from periodic RDB/AOF snapshots | Fast for large datasets | Snapshot may be stale |
| Traffic replay | Replay recent request log to warm cache | Accurate working set | Requires logging infrastructure |
```python
import asyncio


class CacheWarmer:
    """
    Pre-warm a new cache node from an existing replica.
    """

    async def warm_from_replica(
        self,
        new_node: str,
        source_node: str,
        keys_to_warm: list[str]
    ):
        """
        Copy hot keys from source to new node.
        Done in batches to avoid overwhelming either node.
        """
        batch_size = 100
        delay_between_batches = 0.1  # seconds

        for i in range(0, len(keys_to_warm), batch_size):
            batch = keys_to_warm[i:i + batch_size]

            # Fetch from source
            values = await self._mget(source_node, batch)

            # Write to new node
            items = [(k, v) for k, v in zip(batch, values) if v is not None]
            if items:
                await self._mset(new_node, items)

            # Throttle to avoid overload
            await asyncio.sleep(delay_between_batches)

    def get_hot_keys(self, source_node: str, count: int = 10000) -> list[str]:
        """
        Identify the hottest keys to prioritize warming.
        Options: sample by frequency, use access logs, or LRU order.
        """
        # Redis example: SCAN with OBJECT FREQ
        # Memcached: Use LRU crawler or slab stats
        pass
```

When a new node joins, shift traffic gradually rather than instantly. Start with 10% of the node's intended traffic, monitor hit rate and latency, then increase. This prevents a cold node from being overwhelmed and allows time for the cache to warm.
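A minimal sketch of that gradual ramp is shown below. The weighted router (`router.set_weight`) and the metrics helpers (`get_hit_rate`, `get_p99_latency_ms`) are hypothetical placeholders, and the step sizes and thresholds are illustrative assumptions.

```python
import asyncio


async def ramp_up_node(
    router,                      # hypothetical weighted router with set_weight(node, fraction)
    metrics,                     # hypothetical per-node metrics source
    node_id: str,
    steps=(0.1, 0.25, 0.5, 0.75, 1.0),
    observe_seconds: float = 60.0,
    min_hit_rate: float = 0.7,
    max_p99_ms: float = 20.0,
) -> bool:
    """Gradually shift traffic to a newly warmed node, backing off if it struggles."""
    for fraction in steps:
        router.set_weight(node_id, fraction)
        await asyncio.sleep(observe_seconds)  # let the cache warm at this traffic level

        hit_rate = metrics.get_hit_rate(node_id)       # hypothetical helper
        p99 = metrics.get_p99_latency_ms(node_id)      # hypothetical helper
        if hit_rate < min_hit_rate or p99 > max_p99_ms:
            # Node is still cold or overloaded: back off and stop the ramp for now
            router.set_weight(node_id, fraction / 2)
            return False
    return True
```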
Cascading failures are the most dangerous failure mode. A single failure triggers a chain reaction that brings down the entire system. In caching, the most common cascading failure is the thundering herd.
The Cascade Pattern
A typical cascade unfolds in stages: a cache node fails, its traffic falls through to the database, the database slows under the extra load, application requests time out and are retried, the retries add even more load, and the slowdown spreads to every service that shares the database.
Circuit Breakers
Circuit breakers prevent cascading failures by stopping requests to unhealthy components:
```python
import time
from enum import Enum
from threading import Lock


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if recovered


class CircuitBreaker:
    """
    Prevent cascading failures by stopping requests to failing components.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.half_open_calls = 0
        self.lock = Lock()

    def can_execute(self) -> bool:
        """Check if request should proceed."""
        with self.lock:
            if self.state == CircuitState.CLOSED:
                return True

            if self.state == CircuitState.OPEN:
                # Check if recovery timeout has passed
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_calls = 0
                    return True
                return False

            if self.state == CircuitState.HALF_OPEN:
                if self.half_open_calls < self.half_open_max_calls:
                    self.half_open_calls += 1
                    return True
                return False

            return False

    def record_success(self):
        """Record a successful call."""
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                # Recovered - close the circuit
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            elif self.state == CircuitState.CLOSED:
                # Reset failure count on success
                self.failure_count = 0

    def record_failure(self):
        """Record a failed call."""
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.state == CircuitState.HALF_OPEN:
                # Failed during recovery - reopen
                self.state = CircuitState.OPEN
            elif (self.state == CircuitState.CLOSED
                    and self.failure_count >= self.failure_threshold):
                # Threshold exceeded - open circuit
                self.state = CircuitState.OPEN


# Usage
cache_circuit = CircuitBreaker(failure_threshold=5)

def get_from_cache(key):
    if not cache_circuit.can_execute():
        raise CircuitOpenError("Cache circuit open, falling back")

    try:
        result = cache.get(key)
        cache_circuit.record_success()
        return result
    except Exception as e:
        cache_circuit.record_failure()
        raise
```

Request Shedding and Backpressure
When a system is overloaded, it's better to reject some requests quickly than to accept all requests and process them slowly (causing timeouts everywhere).
```python
from threading import Lock


class AdaptiveLoadShedder:
    """
    Shed load when latency increases, indicating overload.
    Based on TCP congestion control principles.
    """

    def __init__(self, target_latency_ms: float = 10.0):
        self.target_latency = target_latency_ms
        self.current_limit = 100  # Max concurrent requests
        self.in_flight = 0
        self.lock = Lock()

    def try_acquire(self) -> bool:
        """Try to acquire a slot for a new request."""
        with self.lock:
            if self.in_flight >= self.current_limit:
                return False
            self.in_flight += 1
            return True

    def release(self, latency_ms: float):
        """Release a slot and adjust limit based on observed latency."""
        with self.lock:
            self.in_flight -= 1

            if latency_ms < self.target_latency:
                # Under target: increase limit (additive increase)
                self.current_limit = min(self.current_limit + 1, 1000)
            else:
                # Over target: decrease limit (multiplicative decrease)
                self.current_limit = max(self.current_limit * 0.9, 10)
```

When systems are overloaded, the worst thing you can do is keep trying. Timeouts queue up, memory fills with pending requests, and latency becomes unbounded. Instead: detect overload early, shed load aggressively, and recover quickly when load decreases. A 1-second outage with fast shedding is better than a 10-minute cascading failure.
When the cache fails, the application must have fallback behavior. The right fallback depends on the use case and the nature of the failure.
Fallback Hierarchy
```python
import logging


class ResilientCacheClient:
    """
    Cache client with multiple fallback layers.
    """

    def __init__(self):
        self.local_cache = LocalLRUCache(max_size=1000)
        self.distributed_cache = DistributedCache()
        self.database = Database()
        self.circuit_breaker = CircuitBreaker()

    def get(self, key: str, fallback_value=None):
        """
        Try multiple layers with graceful degradation.
        """
        # Layer 1: Local cache (always available)
        value = self.local_cache.get(key)
        if value is not None:
            return value

        # Layer 2: Distributed cache (may be unavailable)
        if self.circuit_breaker.can_execute():
            try:
                value = self.distributed_cache.get(key)
                self.circuit_breaker.record_success()
                if value is not None:
                    # Populate local cache
                    self.local_cache.set(key, value)
                    return value
            except Exception as e:
                self.circuit_breaker.record_failure()
                logging.warning(f"Distributed cache failed: {e}")

        # Layer 3: Database (source of truth)
        try:
            value = self.database.query(key)
            if value is not None:
                # Populate caches (best effort)
                self.local_cache.set(key, value)
                try:
                    if self.circuit_breaker.can_execute():
                        self.distributed_cache.set(key, value)
                except Exception:
                    pass  # Cache population is not critical
                return value
        except Exception as e:
            logging.error(f"Database query failed: {e}")

        # Layer 4: Fallback value (stale data, default, or error)
        return fallback_value
```

| Strategy | When To Use | Trade-off |
|---|---|---|
| Return stale data | Data is still useful if old | May serve outdated info |
| Return default value | Acceptable placeholder exists | Incorrect but functional |
| Return error | Correctness is critical | User sees failure |
| Queue for retry | Async processing acceptable | Adds latency and complexity |
| Degrade feature | Feature is non-essential | Reduced functionality |
Stale-While-Revalidate
A powerful pattern: return potentially stale data immediately while refreshing in the background:
```python
import asyncio
import time


class StaleWhileRevalidateCache:
    """
    Return cached data immediately, refresh in background when stale.
    Reduces latency while maintaining freshness.
    """

    def __init__(self, cache, database, stale_threshold: float = 30.0):
        self.cache = cache
        self.database = database
        self.stale_threshold = stale_threshold
        self.refresh_tasks = {}  # key -> background task

    async def get(self, key: str):
        """
        Get value, triggering background refresh if stale.
        """
        cached = self.cache.get_with_metadata(key)

        if cached is None:
            # Cache miss - synchronous fetch
            value = await self._fetch_and_cache(key)
            return value

        value, cached_at = cached
        age = time.time() - cached_at

        if age > self.stale_threshold:
            # Stale - trigger background refresh, return stale value
            self._trigger_background_refresh(key)

        return value

    def _trigger_background_refresh(self, key: str):
        """Start background refresh if not already in progress."""
        if key in self.refresh_tasks:
            return  # Already refreshing

        async def refresh():
            try:
                await self._fetch_and_cache(key)
            finally:
                del self.refresh_tasks[key]

        self.refresh_tasks[key] = asyncio.create_task(refresh())

    async def _fetch_and_cache(self, key: str):
        """Fetch from database and update cache."""
        value = await self.database.query(key)
        self.cache.set_with_metadata(key, value, time.time())
        return value
```

Netflix's Hystrix library popularized the pattern of wrapping every external call with: timeout, circuit breaker, fallback, and bulkhead (isolated thread pools). While Hystrix is deprecated, its patterns live on in Resilience4j (Java) and similar libraries. Apply these patterns to all cache interactions.
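As a rough illustration of composing those patterns around a single cache call, the sketch below combines a hard timeout, the CircuitBreaker class defined earlier on this page, and a fallback function. It is not the Hystrix or Resilience4j API; `cache`, `fallback`, and the timeout value are placeholders.

```python
import concurrent.futures

# Shared worker pool so a hung cache call doesn't block the caller (bulkhead-style isolation)
_cache_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def resilient_cache_get(cache, key, circuit_breaker, fallback, timeout_seconds=0.05):
    """Wrap a cache read with a timeout, a circuit breaker, and a fallback."""
    if not circuit_breaker.can_execute():
        return fallback(key)  # circuit open: skip the cache entirely

    try:
        future = _cache_pool.submit(cache.get, key)
        value = future.result(timeout=timeout_seconds)  # enforce a hard timeout
        circuit_breaker.record_success()
        return value
    except Exception:
        # Timeout, connection error, or any other cache failure
        circuit_breaker.record_failure()
        return fallback(key)
```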
Technical mechanisms are necessary but not sufficient for reliability. Operational practices prevent failures and speed recovery.
Monitoring and Alerting
Track these metrics with appropriate thresholds:
| Metric | Normal Range | Alert Threshold | Action |
|---|---|---|---|
| Hit rate | >90% | <80% | Investigate eviction, TTL |
| Latency p99 | <5ms | >20ms | Check overload, network |
| Memory usage | <85% | >90% | Add capacity or evict |
| Connection count | Stable | >80% of max | Check for leaks, scale |
| Eviction rate | Low | Spike | Check for hot keys, capacity |
| Error rate | <0.01% | >1% | Check node health, network |
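As a minimal sketch of turning the table above into automated checks, the snippet below encodes the alert thresholds and evaluates a metrics snapshot; the metric names and the `notify` hook are assumptions, not part of any particular monitoring system.

```python
# Alert thresholds from the table above; metric names and notify() are assumptions.
ALERTS = {
    "hit_rate":        lambda v: v < 0.80,   # alert when hit rate drops below 80%
    "latency_p99_ms":  lambda v: v > 20.0,   # alert when p99 exceeds 20 ms
    "memory_used_pct": lambda v: v > 0.90,   # alert above 90% memory usage
    "connections_pct": lambda v: v > 0.80,   # alert above 80% of max connections
    "error_rate":      lambda v: v > 0.01,   # alert above 1% errors
}


def evaluate_alerts(snapshot: dict, notify) -> None:
    """Check a metrics snapshot against thresholds and fire alerts."""
    for metric, breached in ALERTS.items():
        value = snapshot.get(metric)
        if value is not None and breached(value):
            notify(f"{metric} out of range: {value}")
```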
Capacity Planning
Proactive capacity management prevents capacity-related failures: maintain headroom for traffic spikes and for the load redistribution that follows a node failure, and revisit capacity estimates as traffic grows.
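A back-of-the-envelope sketch of that sizing follows; the usable-memory fraction, growth factor, and example numbers are illustrative assumptions.

```python
import math


def nodes_needed(
    working_set_gb: float,
    node_memory_gb: float,
    usable_fraction: float = 0.85,   # keep memory below ~85%, per the table above
    tolerated_node_failures: int = 1,
    growth_factor: float = 1.3,      # assumed headroom for growth and spikes
) -> int:
    """Estimate cache node count with headroom for failures and growth (illustrative)."""
    effective_per_node = node_memory_gb * usable_fraction
    base = math.ceil(working_set_gb * growth_factor / effective_per_node)
    # Add spare nodes so the working set still fits after failures
    return base + tolerated_node_failures


# Example: a 400 GB working set on 64 GB nodes -> ceil(520 / 54.4) + 1 = 11 nodes
print(nodes_needed(400, 64))
```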
Chaos Engineering
Proactively test failure handling by injecting failures in controlled conditions: kill cache nodes, add network latency, or partition the cluster, and verify that failover, circuit breakers, and fallbacks behave as designed.
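The sketch below outlines one such controlled experiment; `cluster.kill_node`, `cluster.restore_node`, and the metrics helpers are hypothetical, and the acceptance bounds are assumptions.

```python
import time


def run_node_kill_experiment(cluster, metrics, node_id: str,
                             min_hit_rate: float = 0.6,
                             max_error_rate: float = 0.01,
                             observe_seconds: int = 300):
    """Kill one cache node in a controlled window and verify graceful degradation."""
    baseline_hit_rate = metrics.get_hit_rate()        # hypothetical helper
    cluster.kill_node(node_id)                        # hypothetical helper
    try:
        deadline = time.time() + observe_seconds
        while time.time() < deadline:
            assert metrics.get_error_rate() <= max_error_rate, "clients should fail over cleanly"
            assert metrics.get_hit_rate() >= min_hit_rate, "hit rate should dip, not collapse"
            time.sleep(10)
    finally:
        cluster.restore_node(node_id)                 # always restore, even if checks fail
    print(f"hit rate: {baseline_hit_rate:.2f} -> {metrics.get_hit_rate():.2f}")
```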
Runbooks and Automation
Document common failure scenarios and automate responses where possible, so that during an incident operators follow a tested procedure rather than improvising.
Most outages involve operator error—bad config, incomplete rollback, or mistakes during incident response. Invest in automation, safe deployment practices, and training. The best failure handling is preventing humans from making errors under pressure.
Failure handling is what transforms a prototype into production infrastructure. Let's consolidate the key principles:
The Reliability Checklist
✅ Health checks on all cache nodes (heartbeat + request-based)
✅ Automatic failover when nodes fail
✅ Circuit breakers on cache calls
✅ Fallback to origin when cache unavailable
✅ Request shedding under overload
✅ Local cache as additional layer
✅ Monitoring and alerting for key metrics
✅ Capacity headroom for failures and spikes
✅ Tested through chaos engineering
✅ Documented runbooks for incidents
Congratulations! You've completed the comprehensive Distributed Cache module. You now understand requirements and capacity planning, cache partitioning with consistent hashing, eviction policies from LRU to adaptive algorithms, cache coherence and invalidation strategies, and failure handling for production reliability. You're equipped to design, implement, and operate distributed caching systems at scale.