In traditional system architectures, the database is sacrosanct. It is the system of record, the authoritative source of truth, the final arbiter of what data exists. Caches are ephemeral helpers—optimization layers that can be purged, rebuilt, or discarded without losing data.
Write-back caching fundamentally challenges this model. When writes go to the cache first, the cache temporarily becomes the system of record. For any dirty entry, the cache holds data that doesn't exist anywhere else in the system. The database contains stale information until the flush completes.
This page explores what it means—architecturally, operationally, and philosophically—to treat the cache as the primary write destination.
By the end of this page, you will deeply understand the implications of cache-first writes: how the cache becomes temporarily authoritative, what this means for system of record semantics, how to reason about data location during the dirty window, and the consistency guarantees this architecture provides and sacrifices.
Understanding that the cache becomes the system of record for dirty entries is perhaps the most important mental shift for architects working with write-back caching. Let's examine what this means:
Traditional Cache Role:
Database (System of Record)
↓
↓ replicates to
↓
Cache (Optimization Layer)
↓
↓ serves
↓
Application
In this model, the cache is always derivable from the database. If the cache fails, you can rebuild it from the database. The database always has the "truth."
Write-Back Cache Role:
Application
↓
↓ writes to
↓
Cache (Temporary System of Record)
↓
↓ asynchronously syncs to
↓
Database (Eventual System of Record)
In this model, during the dirty window, the cache contains data that doesn't exist in the database. If the cache fails before flushing, that data is lost. The cache is not merely an optimization—it's holding irreplaceable state.
When the cache holds dirty entries, cache failure means data loss. This is fundamentally different from traditional caching where cache failure means performance degradation. This difference must inform your high availability and disaster recovery strategies.
The Dirty Window Concept:
The period between a write being acknowledged to the client and that write being persisted to the database is called the dirty window. During this window:
- The cache holds the only copy of the new value; it cannot be rebuilt from anywhere else.
- The database contains stale data and serves it to anyone who queries it directly.
- A cache failure loses the write, even though the client already received an acknowledgment.
The length of the dirty window is determined by your flush policy:
| Flush Strategy | Typical Dirty Window | Data Loss Risk on Cache Failure |
|---|---|---|
| Immediate flush (every write) | ~0ms | Minimal (approaches write-through) |
| Time-based (every 1 second) | 0-1 second | Up to 1 second of writes |
| Time-based (every 30 seconds) | 0-30 seconds | Up to 30 seconds of writes |
| Count-based (every 1000 writes) | Variable | Up to 1000 writes |
| Lazy flush (low priority threshold) | Minutes to hours | Significant |
The trade-off is clear: shorter dirty windows reduce data loss risk but give up much of the performance benefit. Longer dirty windows maximize write coalescing and throughput but increase risk.
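To make the table concrete, here is a minimal Python sketch of a hybrid flush trigger; the FlushPolicy class and its max_age_seconds / max_pending parameters are illustrative names, not from any particular cache product:

```python
import time

class FlushPolicy:
    """Decides when to flush. Flushing when EITHER threshold trips
    bounds the dirty window: max_age_seconds caps time-based risk,
    max_pending caps count-based risk."""

    def __init__(self, max_age_seconds=1.0, max_pending=1000):
        self.max_age_seconds = max_age_seconds
        self.max_pending = max_pending
        self.pending = 0                # dirty writes since the last flush
        self.oldest_dirty_at = None     # when the oldest unflushed write arrived

    def record_write(self):
        if self.pending == 0:
            self.oldest_dirty_at = time.monotonic()
        self.pending += 1

    def should_flush(self):
        if self.pending == 0:
            return False
        age = time.monotonic() - self.oldest_dirty_at
        return age >= self.max_age_seconds or self.pending >= self.max_pending

    def record_flush(self):
        self.pending = 0
        self.oldest_dirty_at = None
```

With max_age_seconds=1.0, the worst-case dirty window is roughly one second, matching the second row of the table; raising it trades risk for more coalescing.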
One of the most nuanced aspects of write-back caching is understanding exactly where data lives at any given moment. Let's trace through the lifecycle of a piece of data:
Scenario: User updates their profile bio
Timeline:
T0: User submits new bio "Hello World"
- Database: bio = "Old bio"
- Cache: no entry (or bio = "Old bio", clean)
T1: Write-back cache receives write
- Database: bio = "Old bio" (stale)
- Cache: bio = "Hello World", dirty=true
- User: receives acknowledgment
T2: User reads their profile
- Cache hit: returns "Hello World" ✓
- (Consistent from user's perspective)
T3: Background service queries database directly
- Database query: returns "Old bio" ✗
- (Inconsistent - sees stale data)
T4: Flush occurs
- Database: bio = "Hello World" (now current)
- Cache: bio = "Hello World", dirty=false
T5: All queries return "Hello World" ✓
The key insight is that during T1-T4, the location of truth depends on how you access it. Through the cache: correct. Direct to database: stale.
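The split between the two access paths can be demonstrated in a few lines of Python; the plain dicts below are stand-ins for a real cache and database:

```python
database = {"bio": "Old bio"}   # the eventual system of record
cache = {}                      # key -> value
dirty = set()                   # written to cache, not yet flushed

def write_back(key, value):
    cache[key] = value
    dirty.add(key)              # acknowledged before the database sees it

def read_through_cache(key):
    if key in cache:
        return cache[key]       # sees the newest value, dirty or not
    cache[key] = database[key]  # miss: load from the database
    return cache[key]

def read_database_directly(key):
    return database[key]        # bypasses the cache: stale during the dirty window

def flush():
    for key in dirty:
        database[key] = cache[key]
    dirty.clear()

write_back("bio", "Hello World")                       # T1
assert read_through_cache("bio") == "Hello World"      # T2: session consistency
assert read_database_directly("bio") == "Old bio"      # T3: stale direct read
flush()                                                # T4
assert read_database_directly("bio") == "Hello World"  # T5: consistent again
```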
NEVER evict a dirty cache entry without first flushing it to the database. Dirty entry eviction = data loss. Your cache eviction policy must treat dirty entries specially: either flush them before eviction or refuse to evict them entirely.
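A minimal sketch of a dirty-aware LRU eviction step, assuming a hypothetical flush_to_database callback; a production cache would handle flush failures and likely flush asynchronously, but the invariant is the same:

```python
from collections import OrderedDict

class WriteBackCache:
    def __init__(self, capacity, flush_to_database):
        self.capacity = capacity
        self.flush_to_database = flush_to_database  # callback: (key, value) -> None
        self.entries = OrderedDict()                # key -> (value, dirty), LRU order

    def set(self, key, value):
        self.entries[key] = (value, True)           # write-back: every write starts dirty
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self._evict_one()

    def _evict_one(self):
        # Evict the least recently used entry -- but NEVER drop a dirty
        # entry without flushing it first, or the write is lost forever.
        key, (value, dirty) = next(iter(self.entries.items()))
        if dirty:
            self.flush_to_database(key, value)      # flush before eviction
        del self.entries[key]

flushed = []
c = WriteBackCache(capacity=1, flush_to_database=lambda k, v: flushed.append((k, v)))
c.set("a", 1)
c.set("b", 2)                 # evicts "a", flushing it first
assert flushed == [("a", 1)]
```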
Write-back caching provides specific consistency guarantees while relaxing others. Understanding these semantics precisely is essential for correctly applying the pattern.
Guarantees provided:
- Read-your-writes for clients reading through the cache: once a write is acknowledged, subsequent cache reads return it.
- Eventual consistency for the database: once the flush completes, the database reflects the write.
Guarantees NOT provided:
- Durability at acknowledgment time: an acknowledged write can still be lost if the cache fails before the flush.
- Consistency for readers that bypass the cache: direct database queries see stale data throughout the dirty window.
- Timely visibility for downstream consumers of the database (reports, replicas, ETL jobs), which lag behind the cache.
Write-back caching effectively provides eventual consistency to non-cache readers and session consistency to cache readers. This split consistency model must be understood by all system components.
Making the cache the primary write destination has sweeping implications for system architecture. These must be addressed in design, not discovered in production.
1. Cache Must Have High Availability
Because the cache holds the only copy of dirty data, cache availability becomes critical: replication across nodes, cache persistence (such as Redis AOF), and automated failover stop being optional optimizations and become durability requirements.
2. All Writes Must Go Through Cache
Direct database writes bypass the cache and create consistency issues: a subsequent flush of a dirty entry can silently overwrite the direct write, and readers going through the cache never see the change. Every write path, in every service, must be routed through the cache.
3. Read Path Must Go Through Cache
Direct database reads return stale data during the dirty window: any component that queries the database directly (reporting jobs, ad-hoc SQL, other services) must either read through the cache or explicitly tolerate staleness.
4. Cache Capacity Planning Changes
Unlike read caches that can evict freely, write-back caches must hold all dirty entries: capacity planning must account for the peak dirty backlog, including periods when the database is slow or unavailable and the backlog grows unchecked.
5. Application Logic Complexity
The application layer takes on responsibilities that traditionally belonged to the database: tracking dirty state, scheduling and retrying flushes, resolving concurrent-write conflicts, and handling failures in the asynchronous pipeline.
Cache-first writes increase architectural complexity significantly. The performance benefits must justify this complexity. For simple CRUD applications, the overhead often isn't worth it. For high-throughput systems with hot keys, it can be transformative.
In distributed systems, multiple clients or services may write to the same key concurrently. Write-back caching must handle these scenarios correctly.
Scenario: Concurrent writes to the same key
Time 0ms:
Client A: Write key → value_A
Client B: Write key → value_B
(Both arrive nearly simultaneously)
Cache behavior (typical: last-write-wins):
T0: Empty
T1: key = value_A (from Client A)
T2: key = value_B (from Client B, overwrites)
Flushed to database: value_B
In most write-back implementations, concurrent writes to the same key follow last-write-wins semantics. The cache holds a single value per key, and the most recent write overwrites previous values.
Important consideration: The "most recent" is determined by cache arrival order, which may differ from client request order due to network variability. This is acceptable for many use cases (counters, status updates) but problematic for others (inventory management, account balances).
| Strategy | Mechanism | Use Case | Trade-offs |
|---|---|---|---|
| Last-Write-Wins | Latest value overwrites | Status updates, session data | Simple but may lose concurrent updates |
| Merge Function | Custom logic combines values | Counters, sets, CRDTs | Complex but preserves all writes |
| Optimistic Locking | Version checks before write | Conflict detection | May reject valid updates under contention |
| Serialized Access | Single writer per key | Critical data with ordering requirements | Lower concurrency, simpler correctness |
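As an illustration of the Optimistic Locking row, here is a minimal Python sketch; the VersionedCache class is a hypothetical stand-in, and a real deployment would use the cache's own compare-and-set primitive:

```python
class VersionedCache:
    """Optimistic locking: every write must name the version it read."""

    def __init__(self):
        self.store = {}   # key -> (value, version)

    def read(self, key):
        return self.store.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self.read(key)
        if current != expected_version:
            return False                        # conflict: caller must re-read and retry
        self.store[key] = (value, current + 1)
        return True

c = VersionedCache()
_, v = c.read("stock")
assert c.write("stock", 9, v)        # first writer succeeds
assert not c.write("stock", 8, v)    # concurrent writer with a stale version is rejected
```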
Merge-Based Write Coalescing:
For certain data types, write coalescing can use merge functions instead of replacement:
// Counter increment example
T0: INCR counter by 5 → pending: +5
T1: INCR counter by 3 → pending: +8 (merged)
T2: INCR counter by 2 → pending: +10 (merged)
T3: Flush → Database: UPDATE counter = counter + 10
This requires the cache to understand the operation semantics, not just store final values. Redis, for example, can do this naturally with INCR commands. General-purpose write-back caches may need application-level support for merge logic.
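Here is what merge-based coalescing might look like in Python for the counter case; CoalescingCounterCache and the dict standing in for the database are illustrative assumptions:

```python
class CoalescingCounterCache:
    """Coalesces increments per key instead of storing final values,
    so no concurrent increment is lost; one flush applies the sum."""

    def __init__(self):
        self.pending = {}   # key -> accumulated delta since the last flush

    def incr(self, key, delta):
        # Merge by addition rather than overwriting.
        self.pending[key] = self.pending.get(key, 0) + delta

    def flush(self, database):
        for key, delta in self.pending.items():
            # One database write per key, e.g. UPDATE ... SET v = v + delta
            database[key] = database.get(key, 0) + delta
        self.pending.clear()

db = {"counter": 100}
cache = CoalescingCounterCache()
cache.incr("counter", 5)
cache.incr("counter", 3)
cache.incr("counter", 2)    # pending delta is now +10
cache.flush(db)
assert db["counter"] == 110
```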
Conflict-Free Replicated Data Types (CRDTs):
For advanced use cases, CRDTs provide mathematically guaranteed conflict-free merging:
CRDTs are particularly powerful in distributed caches with multiple replicas, ensuring that regardless of update order or timing, all replicas converge to the same value.
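As a concrete example, the grow-only counter (G-Counter) is one of the simplest CRDTs; this sketch shows why replicas converge regardless of merge order:

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot;
    merge takes the per-replica maximum, so merges commute and converge."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}    # replica_id -> count

    def incr(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("replica-a"), GCounter("replica-b")
a.incr(5); b.incr(3)
a.merge(b); b.merge(a)                 # exchange state in either order...
assert a.value() == b.value() == 8     # ...both replicas converge
```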
Match your conflict resolution strategy to your data semantics. Counters should merge (sum). Timestamps should last-write-win. Sets should union. Forcing the wrong merge strategy onto data leads to correctness bugs.
Implementing cache-first writes correctly requires careful attention to several patterns and practices. These patterns apply regardless of the specific cache technology used.
Pseudo-code for a write operation:
function writeWithCacheFirst(key, value):
    // Step 1: Write to cache and mark dirty
    cache.set(key, value)
    cache.markDirty(key)
    cache.setLastModified(key, currentTimestamp())

    // Step 2: Add to flush queue (if using queue-based flush)
    flushQueue.add(key)

    // Step 3: Acknowledge to caller immediately
    return SUCCESS
    // Note: Database write happens asynchronously later
Pseudo-code for the flush process:
function flushDirtyEntries():
    dirtyKeys = cache.getDirtyKeys()
    for each key in dirtyKeys:
        // Snapshot the timestamp BEFORE reading the value: if the entry
        // changes after this point, the check below fails, the entry
        // stays dirty, and the newer value is flushed in the next cycle.
        lastModified = cache.getLastModified(key)
        value = cache.get(key)
        try:
            // Write to database
            database.upsert(key, value, lastModified)
            // Only clear dirty if the cache value hasn't changed since
            if cache.getLastModified(key) == lastModified:
                cache.clearDirty(key)
            // else: entry was modified again, remains dirty
        catch DatabaseException:
            // Leave dirty for retry
            log.error("Flush failed for key: " + key)
            metrics.increment("flush.failures")
This pattern ensures that even if the entry is modified again during flush, the new value will be flushed in the next cycle.
Cache-first writes introduce specific failure modes that don't exist in write-through architectures. Understanding and planning for these is essential.
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Cache node crash | All dirty entries on that node are lost | Replication, persistent cache (Redis AOF), short dirty windows |
| Cache cluster network partition | Dirty entries stranded on unreachable nodes | Quorum writes, partition-tolerant flush design |
| Flush process crash | Dirty entries accumulate, eventually causing memory pressure | Multiple flush workers, flush process monitoring, dead-letter handling |
| Database unavailable during flush | Dirty entries can't flush, cache fills up | Exponential backoff retry, back-pressure on writes, dead-letter queue |
| Application crash after cache write | Write acknowledged but never triggered flush logic | Dirty entry scanner (independent of write path), TTL-based safety flush |
| Network partition (cache ↔ database) | Dirty entries can't flush but writes continue | Circuit breaker to block new writes, admin alerts, reconciliation on recovery |
Cache-first writes mean cache failure equals data loss for dirty entries. This is an inherent trade-off, not a fixable bug. Your architecture must either accept this risk (for appropriate use cases) or add redundancy that effectively makes the cache durable (replication plus persistence).
Designing for failure:
Assume failures will happen — Design recovery mechanisms from day one, not after the first incident.
Bound the blast radius — Shorter dirty windows limit maximum data loss. Trade performance for reduced risk where appropriate.
Monitor dirty entry growth — Alert when dirty entries exceed thresholds; this is a leading indicator of problems (see the sketch after this list).
Test failure scenarios — Use chaos engineering to simulate cache failures. Verify data loss is within acceptable bounds.
Have runbooks — Document what to do when cache fails: how to recover, how to reconcile state, how to communicate to users.
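A minimal sketch of threshold-based dirty-backlog monitoring with back-pressure; the threshold values and names are hypothetical and must be tuned to your flush capacity and risk tolerance:

```python
import logging

DIRTY_WARN_THRESHOLD = 10_000    # flush pipeline falling behind (hypothetical value)
DIRTY_BLOCK_THRESHOLD = 50_000   # apply back-pressure: stop accepting writes (hypothetical)

def check_dirty_backlog(dirty_count):
    """Return True if new writes should be accepted, False to shed load."""
    if dirty_count >= DIRTY_BLOCK_THRESHOLD:
        logging.critical("Dirty backlog %d: rejecting new writes (circuit open)", dirty_count)
        return False
    if dirty_count >= DIRTY_WARN_THRESHOLD:
        logging.warning("Dirty backlog %d: flush pipeline falling behind", dirty_count)
    return True
```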
Let's consolidate the implications of making the cache the primary write destination:
The fundamental perspective shift:
When you adopt cache-first writes, you're not just "adding a cache"—you're changing the system's consistency model and durability guarantees. The cache becomes infrastructure, not optimization. It must be treated with the same seriousness as the database itself.
What's next:
Now that we understand what it means for data to go to the cache first, the next page explores the asynchronous database write process: how dirty entries are flushed to the database, the mechanisms involved, and how to make the async pipeline reliable.
You now understand the profound implications of cache-first writes: the cache as temporary system of record, dirty window semantics, consistency model changes, and failure scenarios. Next, we'll explore the asynchronous database write mechanism.