Caching creates a fundamental problem: there are now two copies of the data, and they can diverge. The authoritative source (database, API, service) may be updated, leaving the cache serving stale data. Conversely, some cache patterns write to the cache first, creating a window where the cache is ahead of the source.
This divergence is called cache incoherence, and managing it is one of the most nuanced challenges in distributed systems. The famous Phil Karlton quote—"There are only two hard things in Computer Science: cache invalidation and naming things"—reflects the difficulty engineers have wrestled with for decades.
The challenge isn't merely technical. Cache coherence involves fundamental trade-offs between consistency (how fresh reads must be), performance (read and write latency), and complexity (how much machinery you must operate to keep copies in sync).
There is no universally correct approach—different applications have different tolerances for staleness, different write patterns, and different failure modes. Understanding the full spectrum of cache coherence strategies, and when each is appropriate, is essential for designing effective caching systems.
This page examines cache coherence comprehensively: from fundamental patterns through production considerations, from simple TTL-based approaches through sophisticated event-driven invalidation.
By the end of this page, you will understand the fundamental caching patterns and their consistency guarantees, master invalidation strategies from TTL through event-driven approaches, recognize and prevent common coherence bugs, analyze the consistency-performance trade-off for different use cases, and design coherent caching strategies for complex systems.
Before discussing coherence strategies, we must understand the fundamental patterns for integrating caches with data sources. Each pattern has different consistency characteristics.
Cache-Aside (Lazy Loading)
The most common pattern. The application manages both cache and database:
Read:
1. Check cache for key
2. If hit: return cached value
3. If miss: read from database, write to cache, return value
Write:
1. Write to database
2. Invalidate cache (delete key) OR update cache
```python
class CacheAsidePattern:
    """
    Application manages cache and database explicitly.
    Most common pattern in practice.
    """
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database

    def read(self, key: str):
        # 1. Check cache
        value = self.cache.get(key)
        if value is not None:
            return value  # Cache hit

        # 2. Cache miss: read from database
        value = self.database.query(key)

        # 3. Populate cache for future reads
        if value is not None:
            self.cache.set(key, value, ttl=300)

        return value

    def write(self, key: str, value):
        # 1. Write to database (source of truth)
        self.database.update(key, value)

        # 2. Invalidate cache
        # Option A: Delete (recommended)
        self.cache.delete(key)

        # Option B: Update (not recommended - see race conditions below)
        # self.cache.set(key, value)
```
Read-Through Cache
The cache itself handles database reads. The application only talks to the cache:
Read:
1. Request from cache
2. Cache checks its storage
3. If miss: cache reads from database, stores result, returns value
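The steps above can be sketched as a minimal in-process read-through cache. The `loader` callable (e.g., `lambda key: database.query(key)`) and the tuple-based store are assumptions for illustration, not a particular library's API:

```python
import time

class ReadThroughCache:
    """Sketch: the cache owns the miss path, so callers never touch the DB."""
    def __init__(self, loader, ttl: int = 300):
        self._store = {}       # key -> (value, expires_at)
        self._loader = loader  # called on a miss, e.g. lambda k: database.query(k)
        self._ttl = ttl

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.time() < expires_at:
                return value  # hit: still fresh
        # Miss or expired: the cache itself loads from the source
        value = self._loader(key)
        self._store[key] = (value, time.time() + self._ttl)
        return value
```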
Write-Through Cache
Writes go through the cache to the database:
Write:
1. Write to cache
2. Cache synchronously writes to database
3. Cache returns success only after database confirms
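A minimal sketch of that write path, reusing the same hypothetical `cache`/`database` objects as earlier examples. One common ordering persists to the database first, so the cache never holds a value the database rejected:

```python
class WriteThroughCache:
    """Sketch: every write goes through the cache layer to the database."""
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database

    def write(self, key: str, value):
        # Persist synchronously; an exception here means nothing is cached,
        # so success is only reported after the database has confirmed.
        self.database.update(key, value)
        # Cache the confirmed value so subsequent reads hit.
        self.cache.set(key, value)

    def read(self, key: str):
        value = self.cache.get(key)
        if value is not None:
            return value
        value = self.database.query(key)
        self.cache.set(key, value)
        return value
```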
Write-Behind (Write-Back) Cache
Writes are buffered in the cache and asynchronously written to the database:
Write:
1. Write to cache (returns immediately)
2. Cache queues write for async database update
3. Database updated later (possibly batched)
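A sketch of the buffering behavior, again with hypothetical `cache`/`database` objects. The queue-draining loop shows both the batching benefit and the durability risk (queued writes die with the process):

```python
import queue
import threading

class WriteBehindCache:
    """Sketch: writes hit the cache immediately; a background thread
    flushes them to the database in batches."""
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database
        self._pending = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write(self, key: str, value):
        self.cache.set(key, value)       # fast path: memory only
        self._pending.put((key, value))  # durable write deferred

    def _flush_loop(self):
        while True:
            # Block for the first pending write, then drain the rest into
            # one batch; the last write per key wins.
            key, value = self._pending.get()
            batch = {key: value}
            while not self._pending.empty():
                k, v = self._pending.get_nowait()
                batch[k] = v
            for k, v in batch.items():
                # Writes still queued here are LOST if the process crashes.
                self.database.update(k, v)
```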
| Pattern | Read Latency | Write Latency | Consistency | Complexity |
|---|---|---|---|---|
| Cache-Aside | Hit: fast; miss: DB read | DB write + cache delete | Eventual | Low |
| Read-Through | Hit: fast; miss: DB read | N/A (read path only) | Eventual | Medium |
| Write-Through | Usually fast (writes keep cache warm) | High (synchronous DB write) | Strong | Medium |
| Write-Behind | Usually fast (writes keep cache warm) | Very low (async) | Eventual (risk of data loss) | High |
Cache-Aside is the most widely used pattern because it's simple, the application has full control, the cache layer doesn't need database knowledge, and failures are explicit. Read-Through and Write-Through are used when the caching layer supports them natively (e.g., Amazon's DAX for DynamoDB). Write-Behind is rare due to durability concerns.
When data changes, how do we ensure the cache reflects the change? This is the invalidation problem. There are three fundamental approaches:
1. Time-Based Invalidation (TTL)
The simplest approach: cached data automatically expires after a fixed time.
```python
class TTLInvalidation:
    """
    Simplest invalidation: rely entirely on TTL.
    Data may be stale for up to the TTL duration after an update.
    """
    def __init__(self, cache, database, ttl: int = 300):
        self.cache = cache
        self.database = database
        self.ttl = ttl

    def read(self, key: str):
        value = self.cache.get(key)
        if value is not None:
            return value

        value = self.database.query(key)
        self.cache.set(key, value, ttl=self.ttl)
        return value

    def write(self, key: str, value):
        # Just write to the database.
        # The cache will serve stale data until the TTL expires.
        self.database.update(key, value)
        # NOTE: Intentionally NOT invalidating the cache here.
        # This is sometimes acceptable for background updates.
```
2. Explicit Invalidation
The application invalidates the cache on every write. There are two variants:
1. Delete-on-write: update the database, then delete the cached key.
2. Update-on-write: update the database, then overwrite the cached value.
Both have race conditions—we'll analyze these soon.
3. Event-Driven Invalidation
Database changes trigger events that invalidate caches:
```python
class EventDrivenInvalidation:
    """
    Database changes emit events that trigger cache invalidation.
    Common with Change Data Capture (CDC) systems.
    """
    def __init__(self, cache, database, event_bus):
        self.cache = cache
        self.database = database
        self.event_bus = event_bus
        # Subscribe to database change events
        self.event_bus.subscribe("db.change", self._on_database_change)

    def _on_database_change(self, event):
        """Handle database change events."""
        key = event['key']
        operation = event['operation']  # INSERT, UPDATE, DELETE

        if operation in ('UPDATE', 'DELETE'):
            # Invalidate cache
            self.cache.delete(key)

        # Optionally pre-populate cache for frequently accessed keys
        if event.get('preload', False):
            new_value = event.get('new_value')
            self.cache.set(key, new_value, ttl=300)

    def read(self, key: str):
        """Standard cache-aside read."""
        value = self.cache.get(key)
        if value is not None:
            return value

        value = self.database.query(key)
        self.cache.set(key, value, ttl=300)
        return value

    def write(self, key: str, value):
        """Write to database only; CDC will trigger invalidation."""
        self.database.update(key, value)
        # Cache invalidation happens via the event handler
```
| Strategy | Staleness Window | Complexity | Write Performance | Best For |
|---|---|---|---|---|
| TTL only | Up to TTL duration | Very low | Best (no cache ops) | Low-change data |
| Explicit delete | Brief race window | Low | Good | Most applications |
| Write-through | None (sync) | Medium | Poor (sync writes) | Critical data |
| Event-driven | Event processing delay | High | Good | Large-scale systems |
When invalidating, prefer deletion over update. Deletion is simpler (no value to serialize), safer (cache miss is correct behavior), and avoids subtle bugs. Update-on-write should only be used when you need to guarantee the cache is warm immediately after a write.
Cache coherence bugs are often subtle race conditions. Understanding these is essential for designing robust systems.
The Classic Race: Write-Then-Invalidate
Consider the standard pattern: update database, then delete cache.
```python
# The standard pattern - has a subtle race condition
def write(key, new_value):
    database.update(key, new_value)  # Step 1
    cache.delete(key)                # Step 2

# Race scenario:
# T=0: Thread A updates DB to "value_A"
# T=1: Thread B reads cache (miss) and queries DB, gets "value_A"
# T=2: Thread A deletes cache
# T=3: Thread B writes "value_A" to cache
#
# Result: Cache is correct, no problem here.

# But consider this scenario:
# T=0: Thread A updates DB to "value_A"
# T=1: Thread B reads cache (miss) and queries DB, gets "value_A"
# T=2: Thread A deletes cache
# T=3: Thread C updates DB to "value_C"
# T=4: Thread C deletes cache
# T=5: Thread B writes "value_A" to cache (from its T=1 DB read!)
#
# Result: Cache has stale "value_A" while DB has "value_C"!
```
The Thundering Herd on Invalidation
When a popular key is invalidated, many concurrent requests may simultaneously hit the database:
```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Popular key accessed 1000 times/second
# T=0: Key is invalidated (cache.delete)
# T=1: 1000 requests arrive, all get cache miss
# T=2: 1000 requests query the database simultaneously
# T=3: Database is overwhelmed or slows to a crawl
# T=4: All 1000 requests try to write to the cache
#
# Solution: request coalescing / cache stampede protection

class StampedeProtectedCache:
    """
    Prevent the thundering herd with request coalescing.
    Only one request fetches from the DB; the others wait on its Future.
    """
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database
        self.pending_fetches = {}  # key -> Future
        self.lock = threading.Lock()
        # One shared executor for all fetches
        self.executor = ThreadPoolExecutor(max_workers=8)

    def get(self, key: str):
        # Check cache first
        value = self.cache.get(key)
        if value is not None:
            return value

        # Cache miss - coalesce concurrent fetches for the same key
        with self.lock:
            if key in self.pending_fetches:
                # Another request is already fetching
                future = self.pending_fetches[key]
            else:
                # We're the first - start the fetch
                future = self.executor.submit(self._fetch, key)
                self.pending_fetches[key] = future

        # Every caller blocks on the same Future
        return future.result()

    def _fetch(self, key: str):
        """Fetch from the database and populate the cache."""
        try:
            value = self.database.query(key)
            self.cache.set(key, value, ttl=300)
            return value
        finally:
            with self.lock:
                del self.pending_fetches[key]
```
Update-On-Write Race
When updating cache (instead of deleting), races are worse:
```python
# Why update-on-write is dangerous:
#
# T=0: Thread A updates DB to "A", about to update cache
# T=1: Thread B updates DB to "B" (newer than A!)
# T=2: Thread B updates cache to "B"
# T=3: Thread A updates cache to "A" (A is now STALE!)
#
# Result: DB has "B" (correct), Cache has "A" (stale)
# The cache will serve stale data until TTL or the next write!

# This is why delete-on-write is preferred:
# T=0: Thread A updates DB to "A"
# T=1: Thread B updates DB to "B"
# T=2: Thread B deletes cache
# T=3: Thread A deletes cache (idempotent, no harm)
#
# Result: Cache is empty; the next read gets fresh "B" from the DB
```
Each mitigation adds complexity that can introduce new bugs. A simple scheme (delete-on-write with a short TTL) that's occasionally stale is often better than a complex scheme with subtle correctness bugs.
Different applications require different consistency levels. Understanding the spectrum of guarantees helps choose the right approach.
Strong Consistency
Every read reflects the most recent write. Requires synchronous cache updates or bypassing the cache entirely on writes.
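One simple way to get strong reads is to not trust the cache at all for critical keys. A sketch, assuming a hypothetical `critical_keys` set and the same `cache`/`database` objects as before:

```python
class StrongWhereItMatters:
    """Sketch: bypass the cache for keys requiring strong consistency;
    everything else uses ordinary cache-aside."""
    def __init__(self, cache, database, critical_keys: set):
        self.cache = cache
        self.database = database
        self.critical_keys = critical_keys  # hypothetical: e.g. account-balance keys

    def read(self, key: str):
        if key in self.critical_keys:
            # Always read the source of truth; never risk a stale hit.
            return self.database.query(key)
        value = self.cache.get(key)
        if value is not None:
            return value
        value = self.database.query(key)
        self.cache.set(key, value, ttl=300)
        return value
```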
Read-Your-Writes Consistency
A client sees its own writes immediately, but may see stale data from other clients.
```python
import time

class ReadYourWritesCache:
    """
    Ensures clients see their own writes immediately.
    Other clients may see stale data briefly.
    """
    def __init__(self, cache, database):
        self.cache = cache
        self.database = database
        # Track recent writes per user
        self.user_write_timestamps = {}
        self.bypass_window = 5.0  # seconds

    def read(self, key: str, user_id: str):
        # Check if this user recently wrote this key
        last_write = self.user_write_timestamps.get((user_id, key), 0)
        if time.time() - last_write < self.bypass_window:
            # User wrote recently - bypass cache for freshness
            return self.database.query(key)

        # Normal cache-aside read
        value = self.cache.get(key)
        if value is not None:
            return value

        value = self.database.query(key)
        self.cache.set(key, value, ttl=300)
        return value

    def write(self, key: str, value, user_id: str):
        self.database.update(key, value)
        self.cache.delete(key)
        # Record that this user wrote this key
        self.user_write_timestamps[(user_id, key)] = time.time()
```
Eventual Consistency
Given no new writes, all reads will eventually return the same value. This is the default for most caching setups.
Bounded Staleness
Data is never more than N seconds old. Achieved through TTL and/or versioning.
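One way to enforce the bound, sketched below under the assumption that cached entries carry their write timestamp: reads reject any entry older than `max_staleness`, even if its TTL hasn't expired.

```python
import time

class BoundedStalenessCache:
    """Sketch: entries are stored as (value, written_at) pairs, and a
    read refuses anything older than the staleness bound."""
    def __init__(self, cache, database, max_staleness: float = 10.0):
        self.cache = cache
        self.database = database
        self.max_staleness = max_staleness

    def read(self, key: str):
        entry = self.cache.get(key)  # expected shape: (value, written_at)
        if entry is not None:
            value, written_at = entry
            if time.time() - written_at <= self.max_staleness:
                return value  # within the bound
        # Too old or missing: refresh from the database
        value = self.database.query(key)
        self.cache.set(key, (value, time.time()), ttl=300)
        return value
```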
| Level | Staleness | Performance | Example Use Cases |
|---|---|---|---|
| Strong | None | Lowest | Account balance, inventory |
| Read-Your-Writes | For other writers only | Good | User profiles, preferences |
| Bounded Staleness | ≤ configured limit | Good | Leaderboards, analytics |
| Eventual | Unbounded (until TTL/invalidation) | Best | Static content, catalogs |
Start with eventual consistency and short TTLs. Only add complexity if you can demonstrate a business requirement for stronger guarantees. Most applications work fine with 30-second staleness—users are reading far more often than writing, and slight staleness is imperceptible.
Production systems often have multiple caching layers, each with different characteristics:
Typical Cache Hierarchy
Browser/client → CDN edge cache → application-server local (in-process) cache → shared distributed cache (e.g., Redis) → database (source of truth).
Each layer can hold stale data, and invalidation must propagate through all layers.
```python
import logging

class MultiLayerCacheInvalidation:
    """
    Coordinate invalidation across multiple cache layers.
    """
    def __init__(self):
        self.local_cache = LocalCache()        # In-process
        self.distributed_cache = RedisCache()  # Shared
        self.cdn = CDNAPI()                    # Edge

    def invalidate(self, key: str, related_urls: list[str] = None):
        """
        Invalidate a key across all cache layers.
        Must handle partial failures.
        """
        errors = []

        # 1. Invalidate local cache (synchronous, fast)
        self.local_cache.delete(key)

        # 2. Invalidate distributed cache
        try:
            self.distributed_cache.delete(key)
        except Exception as e:
            errors.append(f"Distributed cache: {e}")

        # 3. Invalidate CDN (may be async)
        if related_urls:
            try:
                self.cdn.purge(related_urls)
            except Exception as e:
                errors.append(f"CDN: {e}")

        # 4. Publish invalidation event for other app servers
        try:
            self._publish_invalidation_event(key)
        except Exception as e:
            errors.append(f"Event publish: {e}")

        if errors:
            logging.error(f"Partial invalidation failure for {key}: {errors}")
            # Consider: retry? alert? accept staleness?

    def _publish_invalidation_event(self, key: str):
        """
        Notify other application servers to clear local caches.
        Uses pub/sub or a message queue.
        """
        redis_pubsub.publish("cache:invalidate", {"key": key})
```
Local Cache + Distributed Cache
A common pattern combines fast local (in-process) cache with shared distributed cache:
Invalidation challenges:
1. Deleting a key from the distributed cache does nothing to the local caches held by other application servers; each server must be told to drop its copy, typically via pub/sub (the `_publish_invalidation_event` above).
2. There is a propagation window between the write and the broadcast reaching every server, during which peers still serve the old value.
3. A lost invalidation message can leave a local cache stale indefinitely, so local entries should carry a short TTL as a backstop.
A listener that applies these broadcasts is sketched below.
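A minimal listener to pair with the `_publish_invalidation_event` publisher above, assuming the redis-py client and a JSON payload like `{"key": ...}` (the publisher would need to serialize accordingly):

```python
import json
import threading

import redis  # assumes the redis-py client


def start_invalidation_listener(local_cache, channel: str = "cache:invalidate"):
    """Clear local cache entries when any app server broadcasts an
    invalidation on the shared channel."""
    pubsub = redis.Redis().pubsub()
    pubsub.subscribe(channel)

    def listen():
        for message in pubsub.listen():
            if message["type"] != "message":
                continue  # skip subscribe confirmations
            event = json.loads(message["data"])
            local_cache.delete(event["key"])

    threading.Thread(target=listen, daemon=True).start()
```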
CDN Cache Considerations
CDNs add geographic distribution but complicate invalidation: purge requests propagate asynchronously to every edge location, so different regions may serve different versions for seconds to minutes after a change; purge APIs are often rate-limited; and purging by URL requires knowing every URL a change affects. Common mitigations are short TTLs on dynamic responses and versioned (cache-busted) URLs for static assets, which turn invalidation into simply publishing a new URL.
When using local + distributed cache, ensure reads populate both levels properly. A common bug: local cache miss → distributed cache hit → return value but forget to populate local cache. This creates unnecessary distributed cache traffic.
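A sketch of the correct two-level read, with the easy-to-forget local population step called out (cache objects with `get`/`set(key, value, ttl=...)` are assumed, as in earlier examples):

```python
def two_level_get(local_cache, distributed_cache, database, key: str):
    """Read through both cache layers, warming each on the way back."""
    value = local_cache.get(key)
    if value is not None:
        return value  # fastest path

    value = distributed_cache.get(key)
    if value is not None:
        # The commonly forgotten step: warm the local cache too, with a
        # short TTL so stale local entries age out quickly.
        local_cache.set(key, value, ttl=30)
        return value

    value = database.query(key)
    distributed_cache.set(key, value, ttl=300)
    local_cache.set(key, value, ttl=30)
    return value
```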
Real-world cache coherence involves operational concerns beyond the core algorithms.
Monitoring Cache Coherence
You can't fix what you can't measure. Key metrics:
| Metric | Description | Alert Threshold |
|---|---|---|
| Cache hit rate | % requests served by cache | < 90% (workload dependent) |
| Stale read rate | % reads returning stale data | Depends on SLA |
| Invalidation latency | Time from write to cache clear | > 1 second |
| Invalidation failures | Failed invalidation attempts | > 0 (investigate) |
| Cache/DB divergence | Sampled comparison of cache vs DB | > 0.1% |
Measuring Staleness
Periodically sample cache entries and compare to database:
```python
import random

def measure_staleness(cache, database, sample_size=1000):
    """Sample cached keys and count entries that disagree with the DB."""
    stale_count = 0
    keys = list(cache.keys())
    sample_keys = random.sample(keys, min(sample_size, len(keys)))
    for key in sample_keys:
        cached = cache.get(key)
        actual = database.query(key)
        if cached != actual:
            stale_count += 1
    return stale_count / len(sample_keys)
```
Graceful Degradation
When cache coherence machinery fails, have fallback behavior: if invalidation events cannot be published, shorten TTLs or raise an alert rather than silently serving stale data; if the cache itself is unreachable, fall through to the database with load shedding so the extra read traffic doesn't take it down; and where neither works, prefer serving clearly bounded stale data over failing requests. A minimal fallback read is sketched below.
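A sketch of the fallback read, assuming cache errors surface as `ConnectionError`:

```python
def read_with_degradation(cache, database, key: str):
    """If the cache is unreachable, fall through to the database
    instead of failing the request."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except ConnectionError:
        # Cache down: degrade to direct DB reads (consider rate limiting
        # here so the database isn't overwhelmed by the full read load).
        pass
    return database.query(key)
```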
Testing Cache Coherence
Coherence bugs often appear only under concurrent load: unit tests of a single read/write pair pass while races hide in the interleavings. Effective techniques include stress tests with many concurrent readers and writers over a small key space, property-based tests asserting invariants such as "after writes stop, the cache never contradicts the database," and fault injection that drops or delays invalidation messages. A stress-test sketch follows.
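A sketch of such a stress test, assuming a cache-aside `store` like the `CacheAsidePattern` above that exposes `read`/`write` plus its `cache` and `database` for inspection:

```python
import random
import threading

def coherence_stress_test(store, workers: int = 8, iterations: int = 10_000):
    """Hammer a small key space with concurrent reads and writes, then
    assert the invariant: once activity stops, the cache never
    contradicts the database."""
    keys = [f"k{i}" for i in range(16)]

    def writer():
        for _ in range(iterations):
            store.write(random.choice(keys), random.random())

    def reader():
        for _ in range(iterations):
            store.read(random.choice(keys))

    threads = [threading.Thread(target=writer) for _ in range(workers)]
    threads += [threading.Thread(target=reader) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Quiescence check: any cached value must match the source of truth.
    for key in keys:
        cached = store.cache.get(key)
        assert cached is None or cached == store.database.query(key)
```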
The worst cache coherence bugs are rare and random. They occur only under specific timing conditions that may happen once per million requests. These bugs are nearly impossible to reproduce manually. Invest in automated property testing and production monitoring to catch them.
Cache coherence is the art of balancing consistency, performance, and complexity. Let's consolidate the key insights:
1. Cache-aside with delete-on-write is the default for good reason: it is simple, explicit, and its failure modes are well understood.
2. Always set a TTL, even with explicit invalidation; it bounds staleness when an invalidation fails or a message is lost.
3. Prefer deletion over update-on-write: deletion is idempotent and avoids the stale-value races analyzed above.
4. Start with eventual consistency and short TTLs; add stronger guarantees (read-your-writes, write-through) only for demonstrated business requirements.
5. Measure staleness, invalidation latency, and invalidation failures in production; you can't fix what you can't measure.
The Pragmatic Approach
Most applications don't need perfect consistency from their cache—they need a cache that works well 99.9% of the time and fails gracefully the rest.
You now understand cache coherence from fundamental patterns through production operations. In the final page, we'll explore Failure Handling—how to design distributed caches that survive node failures, network partitions, and cascading failures.