"Everything fails, all the time." This principle, articulated by Werner Vogels of Amazon, is the foundation of reliable distributed systems. In distributed caching, failures are not exceptional events to be handled as edge cases—they are routine occurrences that the system must handle gracefully.
The irony of distributed caching is that caches are introduced to improve reliability (reducing load on databases), but they also introduce new failure modes. A cache that fails in the wrong way can take down the entire system—the very system it was supposed to protect. Understanding failure modes and designing appropriate mitigations is what separates production-grade caches from prototypes.
This page examines failure handling comprehensively: from the taxonomy of failures through detection mechanisms, recovery strategies, and the operational practices that ensure caches enhance rather than undermine system reliability.
We will cover:
- The taxonomy of failures in distributed caches
- Failure detection with appropriate timeouts and health checks
- Recovery mechanisms, including automatic failover and cache warming
- Preventing cascading failures through circuit breakers and request shedding
- Operational practices for cache reliability
Understanding the types of failures helps you design appropriate responses, because different failures require different handling.
Node Failures
Individual cache nodes can fail in various ways:
| Failure Type | Cause | Detection | Recovery |
|---|---|---|---|
| Crash (fail-stop) | Hardware fault, OOM, kernel panic | Heartbeat timeout | Failover to replica or rebuild |
| Hang (fail-slow) | Resource exhaustion, deadlock, GC pause | Latency spike detection | Kill and restart, failover |
| Data corruption | Memory error, software bug | Checksum validation | Rebuild from source |
| Byzantine | Software bug returning wrong data | Very hard to detect | Voting, checksums |
Network Failures
Network issues create the most challenging failure modes, because the network can fail partially: a node may be reachable from some clients but not others, or reachable only intermittently.
Cluster-Wide Failures
Some failures affect the entire cache cluster rather than individual nodes, such as a bad configuration push or a datacenter-wide outage.
The hardest failures to handle are partial failures—where some requests succeed and others fail, or where a node is sometimes reachable and sometimes not. These require sophisticated detection (not just binary up/down) and graceful degradation rather than hard failover.
You cannot respond to failures you don't detect. Failure detection mechanisms balance speed (fast detection) against accuracy (avoiding false positives).
Heartbeat-Based Detection
The simplest approach: nodes periodically signal that they're alive. Missing heartbeats indicate failure.
```python
import threading
import time
from typing import Dict, Set, Callable


class HeartbeatMonitor:
    """
    Monitor node health via heartbeats.
    Nodes that miss consecutive heartbeats are marked failed.
    """

    def __init__(
        self,
        heartbeat_interval: float = 1.0,
        failure_threshold: int = 3,
        on_failure: Callable[[str], None] = None,
        on_recovery: Callable[[str], None] = None
    ):
        self.heartbeat_interval = heartbeat_interval
        self.failure_threshold = failure_threshold
        self.on_failure = on_failure
        self.on_recovery = on_recovery

        self.last_heartbeat: Dict[str, float] = {}
        self.missed_beats: Dict[str, int] = {}
        self.failed_nodes: Set[str] = set()
        self.lock = threading.Lock()

        self._start_monitor()

    def record_heartbeat(self, node_id: str):
        """Record a heartbeat from a node."""
        with self.lock:
            self.last_heartbeat[node_id] = time.time()
            self.missed_beats[node_id] = 0

            # Check for recovery
            if node_id in self.failed_nodes:
                self.failed_nodes.remove(node_id)
                if self.on_recovery:
                    self.on_recovery(node_id)

    def _check_nodes(self):
        """Periodic check for missing heartbeats."""
        with self.lock:
            current_time = time.time()
            for node_id, last_time in list(self.last_heartbeat.items()):
                if current_time - last_time > self.heartbeat_interval:
                    self.missed_beats[node_id] = self.missed_beats.get(node_id, 0) + 1

                    if (self.missed_beats[node_id] >= self.failure_threshold
                            and node_id not in self.failed_nodes):
                        self.failed_nodes.add(node_id)
                        if self.on_failure:
                            self.on_failure(node_id)

    def _start_monitor(self):
        """Start background monitoring thread."""
        def monitor_loop():
            while True:
                time.sleep(self.heartbeat_interval)
                self._check_nodes()

        thread = threading.Thread(target=monitor_loop, daemon=True)
        thread.start()

    def is_healthy(self, node_id: str) -> bool:
        """Check if a node is considered healthy."""
        with self.lock:
            return node_id not in self.failed_nodes
```

Timeout Configuration
Timeout values balance failure detection speed against false positives: too aggressive, and a transient network blip triggers an unnecessary failover; too lenient, and clients keep sending requests to a dead node. Typical values match the defaults in the monitor above: a heartbeat every 1 second and a node marked failed after 3 consecutive misses, so detection takes a few seconds.
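For a concrete illustration, the snippet below sets client-side timeouts in the same spirit: fail fast on the request path and leave the "node is down" decision to the health checker. It is a sketch assuming the redis-py client (whose constructor accepts these socket timeout parameters); the hostname and the specific values are placeholders.

```python
import redis

# Sketch of client-side timeout settings, assuming the redis-py library.
cache = redis.Redis(
    host="cache-node-1.internal",   # placeholder hostname
    port=6379,
    socket_connect_timeout=0.1,     # 100 ms to establish a connection
    socket_timeout=0.25,            # 250 ms per request before giving up
    retry_on_timeout=False,         # don't retry blindly; let failover logic decide
    health_check_interval=30,       # ping idle connections periodically
)
```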
Request-Based Detection
Instead of dedicated heartbeats, use real request traffic:
```python
from typing import Dict


class RequestBasedHealthCheck:
    """
    Track node health based on request success/failure.
    More accurate than heartbeats for detecting degraded performance.
    """

    def __init__(
        self,
        failure_rate_threshold: float = 0.5,  # 50% failure rate
        window_size: int = 100,               # Last 100 requests
        min_requests: int = 10                # Need at least 10 requests to judge
    ):
        self.failure_rate_threshold = failure_rate_threshold
        self.window_size = window_size
        self.min_requests = min_requests

        # Per-node sliding window of success/failure
        self.request_history: Dict[str, list[bool]] = {}

    def record_success(self, node_id: str):
        """Record a successful request to a node."""
        self._record(node_id, True)

    def record_failure(self, node_id: str):
        """Record a failed request to a node."""
        self._record(node_id, False)

    def _record(self, node_id: str, success: bool):
        if node_id not in self.request_history:
            self.request_history[node_id] = []

        history = self.request_history[node_id]
        history.append(success)

        # Keep only recent requests
        if len(history) > self.window_size:
            history.pop(0)

    def is_healthy(self, node_id: str) -> bool:
        """Determine if node is healthy based on recent requests."""
        if node_id not in self.request_history:
            return True  # No data, assume healthy

        history = self.request_history[node_id]
        if len(history) < self.min_requests:
            return True  # Not enough data

        failure_rate = 1 - (sum(history) / len(history))
        return failure_rate < self.failure_rate_threshold

    def get_failure_rate(self, node_id: str) -> float:
        """Get the current failure rate for a node."""
        if node_id not in self.request_history:
            return 0.0

        history = self.request_history[node_id]
        if not history:
            return 0.0

        return 1 - (sum(history) / len(history))
```

The Phi Accrual Failure Detector (used in Cassandra, Akka) provides a probability of failure rather than binary up/down. It tracks heartbeat arrival times, models their distribution, and calculates the probability that a node has failed given how late the heartbeat is. This adapts to network conditions automatically.
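A minimal sketch of that idea follows. It simplifies the real algorithm by modeling heartbeat intervals with a normal distribution; the window size and phi threshold are illustrative assumptions, not values from Cassandra or Akka.

```python
import math
import time
from collections import deque


class SimplePhiAccrualDetector:
    """Simplified phi accrual detector: phi = -log10(P(heartbeat this late))."""

    def __init__(self, window_size: int = 100, phi_threshold: float = 8.0):
        self.intervals = deque(maxlen=window_size)  # recent heartbeat gaps
        self.phi_threshold = phi_threshold
        self.last_heartbeat = None

    def record_heartbeat(self):
        now = time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self) -> float:
        """Current suspicion level; higher means more likely failed."""
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0
        elapsed = time.time() - self.last_heartbeat
        mean = sum(self.intervals) / len(self.intervals)
        variance = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(variance), 1e-3)  # avoid division by zero
        # P(gap >= elapsed) under a normal model, via the complementary CDF
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = min(max(p_later, 1e-15), 1.0)  # clamp to keep phi finite and non-negative
        return -math.log10(p_later)

    def is_suspected(self) -> bool:
        return self.phi() > self.phi_threshold
```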
Once a failure is detected, the system must recover. Recovery strategies depend on whether the cache has replicas and whether data loss is acceptable.
Replica Failover
With replicated caches, clients can failover to replicas:
```python
class ReplicatedCacheClient:
    """
    Cache client with automatic failover to replicas.
    """

    def __init__(self, hash_ring, health_checker, replication_factor: int = 3):
        self.hash_ring = hash_ring
        self.health_checker = health_checker
        self.replication_factor = replication_factor

    def get(self, key: str):
        """
        Read with automatic failover.
        Try primary first, then replicas in order.
        """
        nodes = self.hash_ring.get_nodes(key, count=self.replication_factor)

        for node in nodes:
            if not self.health_checker.is_healthy(node):
                continue

            try:
                value = self._send_get(node, key)
                self.health_checker.record_success(node)
                return value
            except TimeoutError:
                self.health_checker.record_failure(node)
                continue
            except ConnectionError:
                self.health_checker.record_failure(node)
                continue

        # All replicas failed
        raise CacheUnavailableError(f"No healthy replica for key {key}")

    def set(self, key: str, value: any, ttl: int = None):
        """
        Write to primary and replicas.
        Configurable consistency: write to N replicas before returning.
        """
        nodes = self.hash_ring.get_nodes(key, count=self.replication_factor)
        healthy_nodes = [n for n in nodes if self.health_checker.is_healthy(n)]

        if not healthy_nodes:
            raise CacheUnavailableError(f"No healthy node for key {key}")

        # Write to all healthy replicas (async for speed, or sync for durability)
        errors = []
        for node in healthy_nodes:
            try:
                self._send_set(node, key, value, ttl)
                self.health_checker.record_success(node)
            except Exception as e:
                self.health_checker.record_failure(node)
                errors.append((node, e))

        # Decide success based on write quorum (e.g., W=2 of RF=3)
        successful_writes = len(healthy_nodes) - len(errors)
        required_writes = (self.replication_factor // 2) + 1

        if successful_writes < required_writes:
            raise InsufficientReplicasError(
                f"Only {successful_writes}/{required_writes} writes succeeded"
            )
```

Cache Warming
When a new node joins or a failed node restarts, its cache is cold (empty). This creates a temporary load spike on the origin. Cache warming strategies mitigate this:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Lazy warming | Cache fills as requests arrive | Simple, automatic | Temporary hit rate drop |
| Pre-warming | Copy data from healthy replicas | Immediate hit rate | Network/CPU overhead |
| Backup restore | Restore from periodic RDB/AOF snapshots | Fast for large datasets | Snapshot may be stale |
| Traffic replay | Replay recent request log to warm cache | Accurate working set | Requires logging infrastructure |
```python
import asyncio


class CacheWarmer:
    """
    Pre-warm a new cache node from an existing replica.
    """

    async def warm_from_replica(
        self,
        new_node: str,
        source_node: str,
        keys_to_warm: list[str]
    ):
        """
        Copy hot keys from source to new node.
        Done in batches to avoid overwhelming either node.
        """
        batch_size = 100
        delay_between_batches = 0.1  # seconds

        for i in range(0, len(keys_to_warm), batch_size):
            batch = keys_to_warm[i:i + batch_size]

            # Fetch from source
            values = await self._mget(source_node, batch)

            # Write to new node
            items = [(k, v) for k, v in zip(batch, values) if v is not None]
            if items:
                await self._mset(new_node, items)

            # Throttle to avoid overload
            await asyncio.sleep(delay_between_batches)

    def get_hot_keys(self, source_node: str, count: int = 10000) -> list[str]:
        """
        Identify the hottest keys to prioritize warming.
        Options: sample by frequency, use access logs, or LRU order.
        """
        # Redis example: SCAN with OBJECT FREQ
        # Memcached: Use LRU crawler or slab stats
        pass
```

When a new node joins, shift traffic gradually rather than instantly. Start with 10% of the node's intended traffic, monitor hit rate and latency, then increase. This prevents a cold node from being overwhelmed and allows time for the cache to warm.
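A minimal sketch of that gradual ramp is shown below. The weighted router (`router.set_weight`) and the metrics helpers (`get_hit_rate`, `get_p99_latency_ms`) are hypothetical placeholders, and the step sizes and thresholds are illustrative assumptions.

```python
import asyncio


async def ramp_up_node(
    router,                      # hypothetical weighted router with set_weight(node, fraction)
    metrics,                     # hypothetical per-node metrics source
    node_id: str,
    steps=(0.1, 0.25, 0.5, 0.75, 1.0),
    observe_seconds: float = 60.0,
    min_hit_rate: float = 0.7,
    max_p99_ms: float = 20.0,
) -> bool:
    """Gradually shift traffic to a newly warmed node, backing off if it struggles."""
    for fraction in steps:
        router.set_weight(node_id, fraction)
        await asyncio.sleep(observe_seconds)  # let the cache warm at this traffic level

        hit_rate = metrics.get_hit_rate(node_id)       # hypothetical helper
        p99 = metrics.get_p99_latency_ms(node_id)      # hypothetical helper
        if hit_rate < min_hit_rate or p99 > max_p99_ms:
            # Node is still cold or overloaded: back off and stop the ramp for now
            router.set_weight(node_id, fraction / 2)
            return False
    return True
```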
Cascading failures are the most dangerous failure mode. A single failure triggers a chain reaction that brings down the entire system. In caching, the most common cascading failure is the thundering herd.
The Cascade Pattern
A typical cascade unfolds in stages: a cache node fails, its traffic falls through to the database, the database slows under the extra load, application requests time out and are retried, the retries add even more load, and the slowdown spreads to every service that shares the database.
Circuit Breakers
Circuit breakers prevent cascading failures by stopping requests to unhealthy components:
```python
import time
from enum import Enum
from threading import Lock


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if recovered


class CircuitBreaker:
    """
    Prevent cascading failures by stopping requests to failing components.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.half_open_calls = 0
        self.lock = Lock()

    def can_execute(self) -> bool:
        """Check if request should proceed."""
        with self.lock:
            if self.state == CircuitState.CLOSED:
                return True

            if self.state == CircuitState.OPEN:
                # Check if recovery timeout has passed
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_calls = 0
                    return True
                return False

            if self.state == CircuitState.HALF_OPEN:
                if self.half_open_calls < self.half_open_max_calls:
                    self.half_open_calls += 1
                    return True
                return False

            return False

    def record_success(self):
        """Record a successful call."""
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                # Recovered - close the circuit
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            elif self.state == CircuitState.CLOSED:
                # Reset failure count on success
                self.failure_count = 0

    def record_failure(self):
        """Record a failed call."""
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.state == CircuitState.HALF_OPEN:
                # Failed during recovery - reopen
                self.state = CircuitState.OPEN
            elif (self.state == CircuitState.CLOSED
                    and self.failure_count >= self.failure_threshold):
                # Threshold exceeded - open circuit
                self.state = CircuitState.OPEN


# Usage
cache_circuit = CircuitBreaker(failure_threshold=5)

def get_from_cache(key):
    if not cache_circuit.can_execute():
        raise CircuitOpenError("Cache circuit open, falling back")

    try:
        result = cache.get(key)
        cache_circuit.record_success()
        return result
    except Exception as e:
        cache_circuit.record_failure()
        raise
```

Request Shedding and Backpressure
When a system is overloaded, it's better to reject some requests quickly than to accept all requests and process them slowly (causing timeouts everywhere).
```python
from threading import Lock


class AdaptiveLoadShedder:
    """
    Shed load when latency increases, indicating overload.
    Based on TCP congestion control principles.
    """

    def __init__(self, target_latency_ms: float = 10.0):
        self.target_latency = target_latency_ms
        self.current_limit = 100  # Max concurrent requests
        self.in_flight = 0
        self.lock = Lock()

    def try_acquire(self) -> bool:
        """Try to acquire a slot for a new request."""
        with self.lock:
            if self.in_flight >= self.current_limit:
                return False
            self.in_flight += 1
            return True

    def release(self, latency_ms: float):
        """Release a slot and adjust limit based on observed latency."""
        with self.lock:
            self.in_flight -= 1

            if latency_ms < self.target_latency:
                # Under target: increase limit (additive increase)
                self.current_limit = min(self.current_limit + 1, 1000)
            else:
                # Over target: decrease limit (multiplicative decrease)
                self.current_limit = max(self.current_limit * 0.9, 10)
```

When systems are overloaded, the worst thing you can do is keep trying. Timeouts queue up, memory fills with pending requests, and latency becomes unbounded. Instead: detect overload early, shed load aggressively, and recover quickly when load decreases. A 1-second outage with fast shedding is better than a 10-minute cascading failure.
When the cache fails, the application must have fallback behavior. The right fallback depends on the use case and the nature of the failure.
Fallback Hierarchy
```python
import logging


class ResilientCacheClient:
    """
    Cache client with multiple fallback layers.
    """

    def __init__(self):
        self.local_cache = LocalLRUCache(max_size=1000)
        self.distributed_cache = DistributedCache()
        self.database = Database()
        self.circuit_breaker = CircuitBreaker()

    def get(self, key: str, fallback_value=None):
        """
        Try multiple layers with graceful degradation.
        """
        # Layer 1: Local cache (always available)
        value = self.local_cache.get(key)
        if value is not None:
            return value

        # Layer 2: Distributed cache (may be unavailable)
        if self.circuit_breaker.can_execute():
            try:
                value = self.distributed_cache.get(key)
                self.circuit_breaker.record_success()
                if value is not None:
                    # Populate local cache
                    self.local_cache.set(key, value)
                    return value
            except Exception as e:
                self.circuit_breaker.record_failure()
                logging.warning(f"Distributed cache failed: {e}")

        # Layer 3: Database (source of truth)
        try:
            value = self.database.query(key)
            if value is not None:
                # Populate caches (best effort)
                self.local_cache.set(key, value)
                try:
                    if self.circuit_breaker.can_execute():
                        self.distributed_cache.set(key, value)
                except Exception:
                    pass  # Cache population is not critical
                return value
        except Exception as e:
            logging.error(f"Database query failed: {e}")

        # Layer 4: Fallback value (stale data, default, or error)
        return fallback_value
```

| Strategy | When To Use | Trade-off |
|---|---|---|
| Return stale data | Data is still useful if old | May serve outdated info |
| Return default value | Acceptable placeholder exists | Incorrect but functional |
| Return error | Correctness is critical | User sees failure |
| Queue for retry | Async processing acceptable | Adds latency and complexity |
| Degrade feature | Feature is non-essential | Reduced functionality |
Stale-While-Revalidate
A powerful pattern: return potentially stale data immediately while refreshing in the background:
```python
import asyncio
import time


class StaleWhileRevalidateCache:
    """
    Return cached data immediately, refresh in background when stale.
    Reduces latency while maintaining freshness.
    """

    def __init__(self, cache, database, stale_threshold: float = 30.0):
        self.cache = cache
        self.database = database
        self.stale_threshold = stale_threshold
        self.refresh_tasks = {}  # key -> background task

    async def get(self, key: str):
        """
        Get value, triggering background refresh if stale.
        """
        cached = self.cache.get_with_metadata(key)

        if cached is None:
            # Cache miss - synchronous fetch
            value = await self._fetch_and_cache(key)
            return value

        value, cached_at = cached
        age = time.time() - cached_at

        if age > self.stale_threshold:
            # Stale - trigger background refresh, return stale value
            self._trigger_background_refresh(key)

        return value

    def _trigger_background_refresh(self, key: str):
        """Start background refresh if not already in progress."""
        if key in self.refresh_tasks:
            return  # Already refreshing

        async def refresh():
            try:
                await self._fetch_and_cache(key)
            finally:
                del self.refresh_tasks[key]

        self.refresh_tasks[key] = asyncio.create_task(refresh())

    async def _fetch_and_cache(self, key: str):
        """Fetch from database and update cache."""
        value = await self.database.query(key)
        self.cache.set_with_metadata(key, value, time.time())
        return value
```

Netflix's Hystrix library popularized the pattern of wrapping every external call with: timeout, circuit breaker, fallback, and bulkhead (isolated thread pools). While Hystrix is deprecated, its patterns live on in Resilience4j (Java) and similar libraries. Apply these patterns to all cache interactions.
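As a rough illustration of composing those patterns around a single cache call, the sketch below combines a hard timeout, the CircuitBreaker class defined earlier on this page, and a fallback function. It is not the Hystrix or Resilience4j API; `cache`, `fallback`, and the timeout value are placeholders.

```python
import concurrent.futures

# Shared worker pool so a hung cache call doesn't block the caller (bulkhead-style isolation)
_cache_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def resilient_cache_get(cache, key, circuit_breaker, fallback, timeout_seconds=0.05):
    """Wrap a cache read with a timeout, a circuit breaker, and a fallback."""
    if not circuit_breaker.can_execute():
        return fallback(key)  # circuit open: skip the cache entirely

    try:
        future = _cache_pool.submit(cache.get, key)
        value = future.result(timeout=timeout_seconds)  # enforce a hard timeout
        circuit_breaker.record_success()
        return value
    except Exception:
        # Timeout, connection error, or any other cache failure
        circuit_breaker.record_failure()
        return fallback(key)
```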
Technical mechanisms are necessary but not sufficient for reliability. Operational practices prevent failures and speed recovery.
Monitoring and Alerting
Track these metrics with appropriate thresholds:
| Metric | Normal Range | Alert Threshold | Action |
|---|---|---|---|
| Hit rate | >90% | <80% | Investigate eviction, TTL |
| Latency p99 | <5ms | >20ms | Check overload, network |
| Memory usage | <85% | >90% | Add capacity or evict |
| Connection count | Stable | >80% of max | Check for leaks, scale |
| Eviction rate | Low | Spike | Check for hot keys, capacity |
| Error rate | <0.01% | >1% | Check node health, network |
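As a minimal sketch of turning the table above into automated checks, the snippet below encodes the alert thresholds and evaluates a metrics snapshot; the metric names and the `notify` hook are assumptions, not part of any particular monitoring system.

```python
# Alert thresholds from the table above; metric names and notify() are assumptions.
ALERTS = {
    "hit_rate":        lambda v: v < 0.80,   # alert when hit rate drops below 80%
    "latency_p99_ms":  lambda v: v > 20.0,   # alert when p99 exceeds 20 ms
    "memory_used_pct": lambda v: v > 0.90,   # alert above 90% memory usage
    "connections_pct": lambda v: v > 0.80,   # alert above 80% of max connections
    "error_rate":      lambda v: v > 0.01,   # alert above 1% errors
}


def evaluate_alerts(snapshot: dict, notify) -> None:
    """Check a metrics snapshot against thresholds and fire alerts."""
    for metric, breached in ALERTS.items():
        value = snapshot.get(metric)
        if value is not None and breached(value):
            notify(f"{metric} out of range: {value}")
```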
Capacity Planning
Proactive capacity management prevents capacity-related failures: maintain headroom for traffic spikes and for the load redistribution that follows a node failure, and revisit capacity estimates as traffic grows.
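A back-of-the-envelope sketch of that sizing follows; the usable-memory fraction, growth factor, and example numbers are illustrative assumptions.

```python
import math


def nodes_needed(
    working_set_gb: float,
    node_memory_gb: float,
    usable_fraction: float = 0.85,   # keep memory below ~85%, per the table above
    tolerated_node_failures: int = 1,
    growth_factor: float = 1.3,      # assumed headroom for growth and spikes
) -> int:
    """Estimate cache node count with headroom for failures and growth (illustrative)."""
    effective_per_node = node_memory_gb * usable_fraction
    base = math.ceil(working_set_gb * growth_factor / effective_per_node)
    # Add spare nodes so the working set still fits after failures
    return base + tolerated_node_failures


# Example: a 400 GB working set on 64 GB nodes -> ceil(520 / 54.4) + 1 = 11 nodes
print(nodes_needed(400, 64))
```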
Chaos Engineering
Proactively test failure handling by injecting failures in controlled conditions: kill cache nodes, add network latency, or partition the cluster, and verify that failover, circuit breakers, and fallbacks behave as designed.
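The sketch below outlines one such controlled experiment; `cluster.kill_node`, `cluster.restore_node`, and the metrics helpers are hypothetical, and the acceptance bounds are assumptions.

```python
import time


def run_node_kill_experiment(cluster, metrics, node_id: str,
                             min_hit_rate: float = 0.6,
                             max_error_rate: float = 0.01,
                             observe_seconds: int = 300):
    """Kill one cache node in a controlled window and verify graceful degradation."""
    baseline_hit_rate = metrics.get_hit_rate()        # hypothetical helper
    cluster.kill_node(node_id)                        # hypothetical helper
    try:
        deadline = time.time() + observe_seconds
        while time.time() < deadline:
            assert metrics.get_error_rate() <= max_error_rate, "clients should fail over cleanly"
            assert metrics.get_hit_rate() >= min_hit_rate, "hit rate should dip, not collapse"
            time.sleep(10)
    finally:
        cluster.restore_node(node_id)                 # always restore, even if checks fail
    print(f"hit rate: {baseline_hit_rate:.2f} -> {metrics.get_hit_rate():.2f}")
```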
Runbooks and Automation
Document common failure scenarios and automate responses where possible, so that during an incident operators follow a tested procedure rather than improvising.
Most outages involve operator error—bad config, incomplete rollback, or mistakes during incident response. Invest in automation, safe deployment practices, and training. The best failure handling is preventing humans from making errors under pressure.
Failure handling is what transforms a prototype into production infrastructure. Let's consolidate the key principles:
The Reliability Checklist
✅ Health checks on all cache nodes (heartbeat + request-based)
✅ Automatic failover when nodes fail
✅ Circuit breakers on cache calls
✅ Fallback to origin when cache unavailable
✅ Request shedding under overload
✅ Local cache as additional layer
✅ Monitoring and alerting for key metrics
✅ Capacity headroom for failures and spikes
✅ Tested through chaos engineering
✅ Documented runbooks for incidents
Congratulations! You've completed the comprehensive Distributed Cache module. You now understand requirements and capacity planning, cache partitioning with consistent hashing, eviction policies from LRU to adaptive algorithms, cache coherence and invalidation strategies, and failure handling for production reliability. You're equipped to design, implement, and operate distributed caching systems at scale.