Caching creates a fundamental problem: there are now two copies of the data, and they can diverge. The authoritative source (database, API, service) may be updated, leaving the cache serving stale data. Conversely, some cache patterns write to the cache first, creating a window where the cache is ahead of the source.
This divergence is called cache incoherence, and managing it is one of the most nuanced challenges in distributed systems. The famous Phil Karlton quote—"There are only two hard things in Computer Science: cache invalidation and naming things"—reflects the difficulty engineers have wrestled with for decades.
The challenge isn't merely technical. Cache coherence involves fundamental trade-offs between consistency (how fresh reads must be), performance (read and write latency), and complexity (how much machinery you must operate to keep copies in sync).
There is no universally correct approach—different applications have different tolerances for staleness, different write patterns, and different failure modes. Understanding the full spectrum of cache coherence strategies, and when each is appropriate, is essential for designing effective caching systems.
This page examines cache coherence comprehensively: from fundamental patterns through production considerations, from simple TTL-based approaches through sophisticated event-driven invalidation.
By the end of this page, you will understand the fundamental caching patterns and their consistency guarantees, master invalidation strategies from TTL through event-driven approaches, recognize and prevent common coherence bugs, analyze the consistency-performance trade-off for different use cases, and design coherent caching strategies for complex systems.
Before discussing coherence strategies, we must understand the fundamental patterns for integrating caches with data sources. Each pattern has different consistency characteristics.
Cache-Aside (Lazy Loading)
The most common pattern. The application manages both cache and database:
Read:
1. Check cache for key
2. If hit: return cached value
3. If miss: read from database, write to cache, return value
Write:
1. Write to database
2. Invalidate cache (delete key) OR update cache
```python
class CacheAsidePattern:
    """
    Application manages cache and database explicitly.
    Most common pattern in practice.
    """
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database

    def read(self, key: str):
        # 1. Check cache
        value = self.cache.get(key)
        if value is not None:
            return value  # Cache hit

        # 2. Cache miss: read from database
        value = self.database.query(key)

        # 3. Populate cache for future reads
        if value is not None:
            self.cache.set(key, value, ttl=300)

        return value

    def write(self, key: str, value):
        # 1. Write to database (source of truth)
        self.database.update(key, value)

        # 2. Invalidate cache
        # Option A: Delete (recommended)
        self.cache.delete(key)

        # Option B: Update (not recommended - see race conditions below)
        # self.cache.set(key, value)
```
Read-Through Cache
The cache itself handles database reads. The application only talks to the cache:
Read:
1. Request from cache
2. Cache checks its storage
3. If miss: cache reads from database, stores result, returns value
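The steps above can be sketched as a minimal in-process read-through cache. The `loader` callable (e.g., `lambda key: database.query(key)`) and the tuple-based store are assumptions for illustration, not a particular library's API:

```python
import time

class ReadThroughCache:
    """Sketch: the cache owns the miss path, so callers never touch the DB."""
    def __init__(self, loader, ttl: int = 300):
        self._store = {}       # key -> (value, expires_at)
        self._loader = loader  # called on a miss, e.g. lambda k: database.query(k)
        self._ttl = ttl

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.time() < expires_at:
                return value  # hit: still fresh
        # Miss or expired: the cache itself loads from the source
        value = self._loader(key)
        self._store[key] = (value, time.time() + self._ttl)
        return value
```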
Write-Through Cache
Writes go through the cache to the database:
Write:
1. Write to cache
2. Cache synchronously writes to database
3. Cache returns success only after database confirms
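A minimal sketch of that write path, reusing the same hypothetical `cache`/`database` objects as earlier examples. One common ordering persists to the database first, so the cache never holds a value the database rejected:

```python
class WriteThroughCache:
    """Sketch: every write goes through the cache layer to the database."""
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database

    def write(self, key: str, value):
        # Persist synchronously; an exception here means nothing is cached,
        # so success is only reported after the database has confirmed.
        self.database.update(key, value)
        # Cache the confirmed value so subsequent reads hit.
        self.cache.set(key, value)

    def read(self, key: str):
        value = self.cache.get(key)
        if value is not None:
            return value
        value = self.database.query(key)
        self.cache.set(key, value)
        return value
```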
Write-Behind (Write-Back) Cache
Writes are buffered in the cache and asynchronously written to the database:
Write:
1. Write to cache (returns immediately)
2. Cache queues write for async database update
3. Database updated later (possibly batched)
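A sketch of the buffering behavior, again with hypothetical `cache`/`database` objects. The queue-draining loop shows both the batching benefit and the durability risk (queued writes die with the process):

```python
import queue
import threading

class WriteBehindCache:
    """Sketch: writes hit the cache immediately; a background thread
    flushes them to the database in batches."""
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database
        self._pending = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write(self, key: str, value):
        self.cache.set(key, value)       # fast path: memory only
        self._pending.put((key, value))  # durable write deferred

    def _flush_loop(self):
        while True:
            # Block for the first pending write, then drain the rest into
            # one batch; the last write per key wins.
            key, value = self._pending.get()
            batch = {key: value}
            while not self._pending.empty():
                k, v = self._pending.get_nowait()
                batch[k] = v
            for k, v in batch.items():
                # Writes still queued here are LOST if the process crashes.
                self.database.update(k, v)
```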
| Pattern | Read Latency | Write Latency | Consistency | Complexity |
|---|---|---|---|---|
| Cache-Aside | Hit: fast; miss: DB read | DB write + cache delete | Eventual | Low |
| Read-Through | Hit: fast; miss: DB read | N/A (read path only) | Eventual | Medium |
| Write-Through | Usually fast (writes keep cache warm) | High (synchronous DB write) | Strong | Medium |
| Write-Behind | Usually fast (writes keep cache warm) | Very low (async) | Eventual (risk of data loss) | High |
Cache-Aside is the most widely used pattern because it's simple, the application has full control, the cache layer doesn't need database knowledge, and failures are explicit. Read-Through and Write-Through are used when the caching layer supports them natively (e.g., Amazon's DAX for DynamoDB). Write-Behind is rare due to durability concerns.
When data changes, how do we ensure the cache reflects the change? This is the invalidation problem. There are three fundamental approaches:
1. Time-Based Invalidation (TTL)
The simplest approach: cached data automatically expires after a fixed time.
```python
class TTLInvalidation:
    """
    Simplest invalidation: rely entirely on TTL.
    Data may be stale for up to the TTL duration after an update.
    """
    def __init__(self, cache, database, ttl: int = 300):
        self.cache = cache
        self.database = database
        self.ttl = ttl

    def read(self, key: str):
        value = self.cache.get(key)
        if value is not None:
            return value

        value = self.database.query(key)
        self.cache.set(key, value, ttl=self.ttl)
        return value

    def write(self, key: str, value):
        # Just write to the database.
        # The cache will serve stale data until the TTL expires.
        self.database.update(key, value)
        # NOTE: Intentionally NOT invalidating the cache here.
        # This is sometimes acceptable for background updates.
```
2. Explicit Invalidation
The application invalidates the cache on every write. There are two variants:
1. Delete-on-write: update the database, then delete the cached key.
2. Update-on-write: update the database, then overwrite the cached value.
Both have race conditions—we'll analyze these soon.
3. Event-Driven Invalidation
Database changes trigger events that invalidate caches:
```python
class EventDrivenInvalidation:
    """
    Database changes emit events that trigger cache invalidation.
    Common with Change Data Capture (CDC) systems.
    """
    def __init__(self, cache, database, event_bus):
        self.cache = cache
        self.database = database
        self.event_bus = event_bus
        # Subscribe to database change events
        self.event_bus.subscribe("db.change", self._on_database_change)

    def _on_database_change(self, event):
        """Handle database change events."""
        key = event['key']
        operation = event['operation']  # INSERT, UPDATE, DELETE

        if operation in ('UPDATE', 'DELETE'):
            # Invalidate cache
            self.cache.delete(key)

        # Optionally pre-populate cache for frequently accessed keys
        if event.get('preload', False):
            new_value = event.get('new_value')
            self.cache.set(key, new_value, ttl=300)

    def read(self, key: str):
        """Standard cache-aside read."""
        value = self.cache.get(key)
        if value is not None:
            return value

        value = self.database.query(key)
        self.cache.set(key, value, ttl=300)
        return value

    def write(self, key: str, value):
        """Write to database only; CDC will trigger invalidation."""
        self.database.update(key, value)
        # Cache invalidation happens via the event handler
```
| Strategy | Staleness Window | Complexity | Write Performance | Best For |
|---|---|---|---|---|
| TTL only | Up to TTL duration | Very low | Best (no cache ops) | Low-change data |
| Explicit delete | Brief race window | Low | Good | Most applications |
| Write-through | None (sync) | Medium | Poor (sync writes) | Critical data |
| Event-driven | Event processing delay | High | Good | Large-scale systems |
When invalidating, prefer deletion over update. Deletion is simpler (no value to serialize), safer (cache miss is correct behavior), and avoids subtle bugs. Update-on-write should only be used when you need to guarantee the cache is warm immediately after a write.
Cache coherence bugs are often subtle race conditions. Understanding these is essential for designing robust systems.
The Classic Race: Write-Then-Invalidate
Consider the standard pattern: update database, then delete cache.
```python
# The standard pattern - has a subtle race condition
def write(key, new_value):
    database.update(key, new_value)  # Step 1
    cache.delete(key)                # Step 2

# Race scenario:
# T=0: Thread A updates DB to "value_A"
# T=1: Thread B reads cache (miss) and queries DB, gets "value_A"
# T=2: Thread A deletes cache
# T=3: Thread B writes "value_A" to cache
#
# Result: Cache is correct, no problem here.

# But consider this scenario:
# T=0: Thread A updates DB to "value_A"
# T=1: Thread B reads cache (miss) and queries DB, gets "value_A"
# T=2: Thread A deletes cache
# T=3: Thread C updates DB to "value_C"
# T=4: Thread C deletes cache
# T=5: Thread B writes "value_A" to cache (from its T=1 DB read!)
#
# Result: Cache has stale "value_A" while DB has "value_C"!
```
The Thundering Herd on Invalidation
When a popular key is invalidated, many concurrent requests may simultaneously hit the database:
```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Popular key accessed 1000 times/second
# T=0: Key is invalidated (cache.delete)
# T=1: 1000 requests arrive, all get cache miss
# T=2: 1000 requests query the database simultaneously
# T=3: Database is overwhelmed or slows to a crawl
# T=4: All 1000 requests try to write to the cache
#
# Solution: request coalescing / cache stampede protection

class StampedeProtectedCache:
    """
    Prevent the thundering herd with request coalescing.
    Only one request fetches from the DB; the others wait on its Future.
    """
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database
        self.pending_fetches = {}  # key -> Future
        self.lock = threading.Lock()
        # One shared executor for all fetches
        self.executor = ThreadPoolExecutor(max_workers=8)

    def get(self, key: str):
        # Check cache first
        value = self.cache.get(key)
        if value is not None:
            return value

        # Cache miss - coalesce concurrent fetches for the same key
        with self.lock:
            if key in self.pending_fetches:
                # Another request is already fetching
                future = self.pending_fetches[key]
            else:
                # We're the first - start the fetch
                future = self.executor.submit(self._fetch, key)
                self.pending_fetches[key] = future

        # Every caller blocks on the same Future
        return future.result()

    def _fetch(self, key: str):
        """Fetch from the database and populate the cache."""
        try:
            value = self.database.query(key)
            self.cache.set(key, value, ttl=300)
            return value
        finally:
            with self.lock:
                del self.pending_fetches[key]
```
Update-On-Write Race
When updating cache (instead of deleting), races are worse:
```python
# Why update-on-write is dangerous:
#
# T=0: Thread A updates DB to "A", about to update cache
# T=1: Thread B updates DB to "B" (newer than A!)
# T=2: Thread B updates cache to "B"
# T=3: Thread A updates cache to "A" (A is now STALE!)
#
# Result: DB has "B" (correct), Cache has "A" (stale)
# The cache will serve stale data until TTL or the next write!

# This is why delete-on-write is preferred:
# T=0: Thread A updates DB to "A"
# T=1: Thread B updates DB to "B"
# T=2: Thread B deletes cache
# T=3: Thread A deletes cache (idempotent, no harm)
#
# Result: Cache is empty; the next read gets fresh "B" from the DB
```
Each mitigation adds complexity that can introduce new bugs. A simple scheme (delete-on-write with a short TTL) that's occasionally stale is often better than a complex scheme with subtle correctness bugs.
Different applications require different consistency levels. Understanding the spectrum of guarantees helps choose the right approach.
Strong Consistency
Every read reflects the most recent write. Requires synchronous cache updates or bypassing the cache entirely on writes.
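One simple way to get strong reads is to not trust the cache at all for critical keys. A sketch, assuming a hypothetical `critical_keys` set and the same `cache`/`database` objects as before:

```python
class StrongWhereItMatters:
    """Sketch: bypass the cache for keys requiring strong consistency;
    everything else uses ordinary cache-aside."""
    def __init__(self, cache, database, critical_keys: set):
        self.cache = cache
        self.database = database
        self.critical_keys = critical_keys  # hypothetical: e.g. account-balance keys

    def read(self, key: str):
        if key in self.critical_keys:
            # Always read the source of truth; never risk a stale hit.
            return self.database.query(key)
        value = self.cache.get(key)
        if value is not None:
            return value
        value = self.database.query(key)
        self.cache.set(key, value, ttl=300)
        return value
```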
Read-Your-Writes Consistency
A client sees its own writes immediately, but may see stale data from other clients.
```python
import time

class ReadYourWritesCache:
    """
    Ensures clients see their own writes immediately.
    Other clients may see stale data briefly.
    """
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database
        # Track recent writes per user
        self.user_write_timestamps = {}
        self.bypass_window = 5.0  # seconds

    def read(self, key: str, user_id: str):
        # Check if this user recently wrote this key
        last_write = self.user_write_timestamps.get((user_id, key), 0)
        if time.time() - last_write < self.bypass_window:
            # User wrote recently - bypass cache for freshness
            return self.database.query(key)

        # Normal cache-aside read
        value = self.cache.get(key)
        if value is not None:
            return value

        value = self.database.query(key)
        self.cache.set(key, value, ttl=300)
        return value

    def write(self, key: str, value, user_id: str):
        self.database.update(key, value)
        self.cache.delete(key)
        # Record that this user wrote this key
        self.user_write_timestamps[(user_id, key)] = time.time()
```
Eventual Consistency
Given no new writes, all reads will eventually return the same value. This is the default for most caching setups.
Bounded Staleness
Data is never more than N seconds old. Achieved through TTL and/or versioning.
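One way to enforce the bound, sketched below under the assumption that cached entries carry their write timestamp: reads reject any entry older than `max_staleness`, even if its TTL hasn't expired.

```python
import time

class BoundedStalenessCache:
    """Sketch: entries are stored as (value, written_at) pairs, and a
    read refuses anything older than the staleness bound."""
    def __init__(self, cache, database, max_staleness: float = 10.0):
        self.cache = cache
        self.database = database
        self.max_staleness = max_staleness

    def read(self, key: str):
        entry = self.cache.get(key)  # expected shape: (value, written_at)
        if entry is not None:
            value, written_at = entry
            if time.time() - written_at <= self.max_staleness:
                return value  # within the bound
        # Too old or missing: refresh from the database
        value = self.database.query(key)
        self.cache.set(key, (value, time.time()), ttl=300)
        return value
```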
| Level | Staleness | Performance | Example Use Cases |
|---|---|---|---|
| Strong | None | Lowest | Account balance, inventory |
| Read-Your-Writes | For other writers only | Good | User profiles, preferences |
| Bounded Staleness | ≤ configured limit | Good | Leaderboards, analytics |
| Eventual | Unbounded (until TTL/invalidation) | Best | Static content, catalogs |
Start with eventual consistency and short TTLs. Only add complexity if you can demonstrate a business requirement for stronger guarantees. Most applications work fine with 30-second staleness—users are reading far more often than writing, and slight staleness is imperceptible.
Production systems often have multiple caching layers, each with different characteristics:
Typical Cache Hierarchy
Browser/client → CDN edge cache → application-server local (in-process) cache → shared distributed cache (e.g., Redis) → database (source of truth).
Each layer can hold stale data, and invalidation must propagate through all layers.
```python
import logging

class MultiLayerCacheInvalidation:
    """
    Coordinate invalidation across multiple cache layers.
    """
    def __init__(self):
        self.local_cache = LocalCache()        # In-process
        self.distributed_cache = RedisCache()  # Shared
        self.cdn = CDNAPI()                    # Edge

    def invalidate(self, key: str, related_urls: list[str] = None):
        """
        Invalidate a key across all cache layers.
        Must handle partial failures.
        """
        errors = []

        # 1. Invalidate local cache (synchronous, fast)
        self.local_cache.delete(key)

        # 2. Invalidate distributed cache
        try:
            self.distributed_cache.delete(key)
        except Exception as e:
            errors.append(f"Distributed cache: {e}")

        # 3. Invalidate CDN (may be async)
        if related_urls:
            try:
                self.cdn.purge(related_urls)
            except Exception as e:
                errors.append(f"CDN: {e}")

        # 4. Publish invalidation event for other app servers
        try:
            self._publish_invalidation_event(key)
        except Exception as e:
            errors.append(f"Event publish: {e}")

        if errors:
            logging.error(f"Partial invalidation failure for {key}: {errors}")
            # Consider: retry? alert? accept staleness?

    def _publish_invalidation_event(self, key: str):
        """
        Notify other application servers to clear local caches.
        Uses pub/sub or a message queue.
        """
        redis_pubsub.publish("cache:invalidate", {"key": key})
```
Local Cache + Distributed Cache
A common pattern combines fast local (in-process) cache with shared distributed cache:
Invalidation challenges:
1. Deleting a key from the distributed cache does nothing to the local caches held by other application servers; each server must be told to drop its copy, typically via pub/sub (the `_publish_invalidation_event` above).
2. There is a propagation window between the write and the broadcast reaching every server, during which peers still serve the old value.
3. A lost invalidation message can leave a local cache stale indefinitely, so local entries should carry a short TTL as a backstop.
A listener that applies these broadcasts is sketched below.
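A minimal listener to pair with the `_publish_invalidation_event` publisher above, assuming the redis-py client and a JSON payload like `{"key": ...}` (the publisher would need to serialize accordingly):

```python
import json
import threading

import redis  # assumes the redis-py client


def start_invalidation_listener(local_cache, channel: str = "cache:invalidate"):
    """Clear local cache entries when any app server broadcasts an
    invalidation on the shared channel."""
    pubsub = redis.Redis().pubsub()
    pubsub.subscribe(channel)

    def listen():
        for message in pubsub.listen():
            if message["type"] != "message":
                continue  # skip subscribe confirmations
            event = json.loads(message["data"])
            local_cache.delete(event["key"])

    threading.Thread(target=listen, daemon=True).start()
```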
CDN Cache Considerations
CDNs add geographic distribution but complicate invalidation: purge requests propagate asynchronously to every edge location, so different regions may serve different versions for seconds to minutes after a change; purge APIs are often rate-limited; and purging by URL requires knowing every URL a change affects. Common mitigations are short TTLs on dynamic responses and versioned (cache-busted) URLs for static assets, which turn invalidation into simply publishing a new URL.
When using local + distributed cache, ensure reads populate both levels properly. A common bug: local cache miss → distributed cache hit → return value but forget to populate local cache. This creates unnecessary distributed cache traffic.
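A sketch of the correct two-level read, with the easy-to-forget local population step called out (cache objects with `get`/`set(key, value, ttl=...)` are assumed, as in earlier examples):

```python
def two_level_get(local_cache, distributed_cache, database, key: str):
    """Read through both cache layers, warming each on the way back."""
    value = local_cache.get(key)
    if value is not None:
        return value  # fastest path

    value = distributed_cache.get(key)
    if value is not None:
        # The commonly forgotten step: warm the local cache too, with a
        # short TTL so stale local entries age out quickly.
        local_cache.set(key, value, ttl=30)
        return value

    value = database.query(key)
    distributed_cache.set(key, value, ttl=300)
    local_cache.set(key, value, ttl=30)
    return value
```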
Real-world cache coherence involves operational concerns beyond the core algorithms.
Monitoring Cache Coherence
You can't fix what you can't measure. Key metrics:
| Metric | Description | Alert Threshold |
|---|---|---|
| Cache hit rate | % requests served by cache | < 90% (workload dependent) |
| Stale read rate | % reads returning stale data | Depends on SLA |
| Invalidation latency | Time from write to cache clear | > 1 second |
| Invalidation failures | Failed invalidation attempts | > 0 (investigate) |
| Cache/DB divergence | Sampled comparison of cache vs DB | > 0.1% |
Measuring Staleness
Periodically sample cache entries and compare to database:
```python
import random

def measure_staleness(cache, database, sample_size=1000):
    """Sample cached keys and count entries that disagree with the DB."""
    stale_count = 0
    keys = list(cache.keys())
    sample_keys = random.sample(keys, min(sample_size, len(keys)))
    for key in sample_keys:
        cached = cache.get(key)
        actual = database.query(key)
        if cached != actual:
            stale_count += 1
    return stale_count / len(sample_keys)
```
Graceful Degradation
When cache coherence machinery fails, have fallback behavior: if invalidation events cannot be published, shorten TTLs or raise an alert rather than silently serving stale data; if the cache itself is unreachable, fall through to the database with load shedding so the extra read traffic doesn't take it down; and where neither works, prefer serving clearly bounded stale data over failing requests. A minimal fallback read is sketched below.
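A sketch of the fallback read, assuming cache errors surface as `ConnectionError`:

```python
def read_with_degradation(cache, database, key: str):
    """If the cache is unreachable, fall through to the database
    instead of failing the request."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except ConnectionError:
        # Cache down: degrade to direct DB reads (consider rate limiting
        # here so the database isn't overwhelmed by the full read load).
        pass
    return database.query(key)
```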
Testing Cache Coherence
Coherence bugs often appear only under concurrent load: unit tests of a single read/write pair pass while races hide in the interleavings. Effective techniques include stress tests with many concurrent readers and writers over a small key space, property-based tests asserting invariants such as "after writes stop, the cache never contradicts the database," and fault injection that drops or delays invalidation messages. A stress-test sketch follows.
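A sketch of such a stress test, assuming a cache-aside `store` like the `CacheAsidePattern` above that exposes `read`/`write` plus its `cache` and `database` for inspection:

```python
import random
import threading

def coherence_stress_test(store, workers: int = 8, iterations: int = 10_000):
    """Hammer a small key space with concurrent reads and writes, then
    assert the invariant: once activity stops, the cache never
    contradicts the database."""
    keys = [f"k{i}" for i in range(16)]

    def writer():
        for _ in range(iterations):
            store.write(random.choice(keys), random.random())

    def reader():
        for _ in range(iterations):
            store.read(random.choice(keys))

    threads = [threading.Thread(target=writer) for _ in range(workers)]
    threads += [threading.Thread(target=reader) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Quiescence check: any cached value must match the source of truth.
    for key in keys:
        cached = store.cache.get(key)
        assert cached is None or cached == store.database.query(key)
```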
The worst cache coherence bugs are rare and random. They occur only under specific timing conditions that may happen once per million requests. These bugs are nearly impossible to reproduce manually. Invest in automated property testing and production monitoring to catch them.
Cache coherence is the art of balancing consistency, performance, and complexity. Let's consolidate the key insights:
1. Cache-aside with delete-on-write is the default for good reason: it is simple, explicit, and its failure modes are well understood.
2. Always set a TTL, even with explicit invalidation; it bounds staleness when an invalidation fails or a message is lost.
3. Prefer deletion over update-on-write: deletion is idempotent and avoids the stale-value races analyzed above.
4. Start with eventual consistency and short TTLs; add stronger guarantees (read-your-writes, write-through) only for demonstrated business requirements.
5. Measure staleness, invalidation latency, and invalidation failures in production; you can't fix what you can't measure.
The Pragmatic Approach
Most applications don't need perfect consistency from their cache—they need a cache that works well 99.9% of the time and fails gracefully the rest.
You now understand cache coherence from fundamental patterns through production operations. In the final page, we'll explore Failure Handling—how to design distributed caches that survive node failures, network partitions, and cascading failures.