In traditional system architectures, the database is sacrosanct. It is the system of record, the authoritative source of truth, the final arbiter of what data exists. Caches are ephemeral helpers—optimization layers that can be purged, rebuilt, or discarded without losing data.
Write-back caching fundamentally challenges this model. When writes go to the cache first, the cache temporarily becomes the system of record. For any dirty entry, the cache holds data that doesn't exist anywhere else in the system. The database contains stale information until the flush completes.
This page explores what it means—architecturally, operationally, and philosophically—to treat the cache as the primary write destination.
By the end of this page, you will deeply understand the implications of cache-first writes: how the cache becomes temporarily authoritative, what this means for system of record semantics, how to reason about data location during the dirty window, and the consistency guarantees this architecture provides and sacrifices.
Understanding that the cache becomes the system of record for dirty entries is perhaps the most important mental shift for architects working with write-back caching. Let's examine what this means:
Traditional Cache Role:
Database (System of Record)
↓
↓ replicates to
↓
Cache (Optimization Layer)
↓
↓ serves
↓
Application
In this model, the cache is always derivable from the database. If the cache fails, you can rebuild it from the database. The database always has the "truth."
Write-Back Cache Role:
Application
↓
↓ writes to
↓
Cache (Temporary System of Record)
↓
↓ asynchronously syncs to
↓
Database (Eventual System of Record)
In this model, during the dirty window, the cache contains data that doesn't exist in the database. If the cache fails before flushing, that data is lost. The cache is not merely an optimization—it's holding irreplaceable state.
When the cache holds dirty entries, cache failure means data loss. This is fundamentally different from traditional caching where cache failure means performance degradation. This difference must inform your high availability and disaster recovery strategies.
The Dirty Window Concept:
The period between a write being acknowledged to the client and that write being persisted to the database is called the dirty window. During this window:
- The cache holds the only copy of the new value; it cannot be rebuilt from anywhere else.
- The database contains stale data and serves it to anyone who queries it directly.
- A cache failure loses the write, even though the client already received an acknowledgment.
The length of the dirty window is determined by your flush policy:
| Flush Strategy | Typical Dirty Window | Data Loss Risk on Cache Failure |
|---|---|---|
| Immediate flush (every write) | ~0ms | Minimal (approaches write-through) |
| Time-based (every 1 second) | 0-1 second | Up to 1 second of writes |
| Time-based (every 30 seconds) | 0-30 seconds | Up to 30 seconds of writes |
| Count-based (every 1000 writes) | Variable | Up to 1000 writes |
| Lazy flush (low priority threshold) | Minutes to hours | Significant |
The trade-off is clear: shorter dirty windows reduce data loss risk but give up much of the performance benefit. Longer dirty windows maximize write coalescing and throughput but increase risk.
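To make the table concrete, here is a minimal Python sketch of a hybrid flush trigger; the FlushPolicy class and its max_age_seconds / max_pending parameters are illustrative names, not from any particular cache product:

```python
import time

class FlushPolicy:
    """Decides when to flush. Flushing when EITHER threshold trips
    bounds the dirty window: max_age_seconds caps time-based risk,
    max_pending caps count-based risk."""

    def __init__(self, max_age_seconds=1.0, max_pending=1000):
        self.max_age_seconds = max_age_seconds
        self.max_pending = max_pending
        self.pending = 0                # dirty writes since the last flush
        self.oldest_dirty_at = None     # when the oldest unflushed write arrived

    def record_write(self):
        if self.pending == 0:
            self.oldest_dirty_at = time.monotonic()
        self.pending += 1

    def should_flush(self):
        if self.pending == 0:
            return False
        age = time.monotonic() - self.oldest_dirty_at
        return age >= self.max_age_seconds or self.pending >= self.max_pending

    def record_flush(self):
        self.pending = 0
        self.oldest_dirty_at = None
```

With max_age_seconds=1.0, the worst-case dirty window is roughly one second, matching the second row of the table; raising it trades risk for more coalescing.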
One of the most nuanced aspects of write-back caching is understanding exactly where data lives at any given moment. Let's trace through the lifecycle of a piece of data:
Scenario: User updates their profile bio
Timeline:
T0: User submits new bio "Hello World"
- Database: bio = "Old bio"
- Cache: no entry (or bio = "Old bio", clean)
T1: Write-back cache receives write
- Database: bio = "Old bio" (stale)
- Cache: bio = "Hello World", dirty=true
- User: receives acknowledgment
T2: User reads their profile
- Cache hit: returns "Hello World" ✓
- (Consistent from user's perspective)
T3: Background service queries database directly
- Database query: returns "Old bio" ✗
- (Inconsistent - sees stale data)
T4: Flush occurs
- Database: bio = "Hello World" (now current)
- Cache: bio = "Hello World", dirty=false
T5: All queries return "Hello World" ✓
The key insight is that during T1-T4, the location of truth depends on how you access it. Through the cache: correct. Direct to database: stale.
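The split between the two access paths can be demonstrated in a few lines of Python; the plain dicts below are stand-ins for a real cache and database:

```python
database = {"bio": "Old bio"}   # the eventual system of record
cache = {}                      # key -> value
dirty = set()                   # written to cache, not yet flushed

def write_back(key, value):
    cache[key] = value
    dirty.add(key)              # acknowledged before the database sees it

def read_through_cache(key):
    if key in cache:
        return cache[key]       # sees the newest value, dirty or not
    cache[key] = database[key]  # miss: load from the database
    return cache[key]

def read_database_directly(key):
    return database[key]        # bypasses the cache: stale during the dirty window

def flush():
    for key in dirty:
        database[key] = cache[key]
    dirty.clear()

write_back("bio", "Hello World")                       # T1
assert read_through_cache("bio") == "Hello World"      # T2: session consistency
assert read_database_directly("bio") == "Old bio"      # T3: stale direct read
flush()                                                # T4
assert read_database_directly("bio") == "Hello World"  # T5: consistent again
```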
NEVER evict a dirty cache entry without first flushing it to the database. Dirty entry eviction = data loss. Your cache eviction policy must treat dirty entries specially: either flush them before eviction or refuse to evict them entirely.
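A minimal sketch of a dirty-aware LRU eviction step, assuming a hypothetical flush_to_database callback; a production cache would handle flush failures and likely flush asynchronously, but the invariant is the same:

```python
from collections import OrderedDict

class WriteBackCache:
    def __init__(self, capacity, flush_to_database):
        self.capacity = capacity
        self.flush_to_database = flush_to_database  # callback: (key, value) -> None
        self.entries = OrderedDict()                # key -> (value, dirty), LRU order

    def set(self, key, value):
        self.entries[key] = (value, True)           # write-back: every write starts dirty
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self._evict_one()

    def _evict_one(self):
        # Evict the least recently used entry -- but NEVER drop a dirty
        # entry without flushing it first, or the write is lost forever.
        key, (value, dirty) = next(iter(self.entries.items()))
        if dirty:
            self.flush_to_database(key, value)      # flush before eviction
        del self.entries[key]

flushed = []
c = WriteBackCache(capacity=1, flush_to_database=lambda k, v: flushed.append((k, v)))
c.set("a", 1)
c.set("b", 2)                 # evicts "a", flushing it first
assert flushed == [("a", 1)]
```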
Write-back caching provides specific consistency guarantees while relaxing others. Understanding these semantics precisely is essential for correctly applying the pattern.
Guarantees provided:
- Read-your-writes for clients reading through the cache: once a write is acknowledged, subsequent cache reads return it.
- Eventual consistency for the database: once the flush completes, the database reflects the write.
Guarantees NOT provided:
- Durability at acknowledgment time: an acknowledged write can still be lost if the cache fails before the flush.
- Consistency for readers that bypass the cache: direct database queries see stale data throughout the dirty window.
- Timely visibility for downstream consumers of the database (reports, replicas, ETL jobs), which lag behind the cache.
Write-back caching effectively provides eventual consistency to non-cache readers and session consistency to cache readers. This split consistency model must be understood by all system components.
Making the cache the primary write destination has sweeping implications for system architecture. These must be addressed in design, not discovered in production.
1. Cache Must Have High Availability
Because the cache holds the only copy of dirty data, cache availability becomes critical: replication across nodes, cache persistence (such as Redis AOF), and automated failover stop being optional optimizations and become durability requirements.
2. All Writes Must Go Through Cache
Direct database writes bypass the cache and create consistency issues: a subsequent flush of a dirty entry can silently overwrite the direct write, and readers going through the cache never see the change. Every write path, in every service, must be routed through the cache.
3. Read Path Must Go Through Cache
Direct database reads return stale data during the dirty window: any component that queries the database directly (reporting jobs, ad-hoc SQL, other services) must either read through the cache or explicitly tolerate staleness.
4. Cache Capacity Planning Changes
Unlike read caches that can evict freely, write-back caches must hold all dirty entries: capacity planning must account for the peak dirty backlog, including periods when the database is slow or unavailable and the backlog grows unchecked.
5. Application Logic Complexity
The application layer takes on responsibilities that traditionally belonged to the database: tracking dirty state, scheduling and retrying flushes, resolving concurrent-write conflicts, and handling failures in the asynchronous pipeline.
Cache-first writes increase architectural complexity significantly. The performance benefits must justify this complexity. For simple CRUD applications, the overhead often isn't worth it. For high-throughput systems with hot keys, it can be transformative.
In distributed systems, multiple clients or services may write to the same key concurrently. Write-back caching must handle these scenarios correctly.
Scenario: Concurrent writes to the same key
Time 0ms:
Client A: Write key → value_A
Client B: Write key → value_B
(Both arrive nearly simultaneously)
Cache behavior (typical: last-write-wins):
T0: Empty
T1: key = value_A (from Client A)
T2: key = value_B (from Client B, overwrites)
Flushed to database: value_B
In most write-back implementations, concurrent writes to the same key follow last-write-wins semantics. The cache holds a single value per key, and the most recent write overwrites previous values.
Important consideration: The "most recent" is determined by cache arrival order, which may differ from client request order due to network variability. This is acceptable for many use cases (counters, status updates) but problematic for others (inventory management, account balances).
| Strategy | Mechanism | Use Case | Trade-offs |
|---|---|---|---|
| Last-Write-Wins | Latest value overwrites | Status updates, session data | Simple but may lose concurrent updates |
| Merge Function | Custom logic combines values | Counters, sets, CRDTs | Complex but preserves all writes |
| Optimistic Locking | Version checks before write | Conflict detection | May reject valid updates under contention |
| Serialized Access | Single writer per key | Critical data with ordering requirements | Lower concurrency, simpler correctness |
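As an illustration of the Optimistic Locking row, here is a minimal Python sketch; the VersionedCache class is a hypothetical stand-in, and a real deployment would use the cache's own compare-and-set primitive:

```python
class VersionedCache:
    """Optimistic locking: every write must name the version it read."""

    def __init__(self):
        self.store = {}   # key -> (value, version)

    def read(self, key):
        return self.store.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self.read(key)
        if current != expected_version:
            return False                        # conflict: caller must re-read and retry
        self.store[key] = (value, current + 1)
        return True

c = VersionedCache()
_, v = c.read("stock")
assert c.write("stock", 9, v)        # first writer succeeds
assert not c.write("stock", 8, v)    # concurrent writer with a stale version is rejected
```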
Merge-Based Write Coalescing:
For certain data types, write coalescing can use merge functions instead of replacement:
// Counter increment example
T0: INCR counter by 5 → pending: +5
T1: INCR counter by 3 → pending: +8 (merged)
T2: INCR counter by 2 → pending: +10 (merged)
T3: Flush → Database: UPDATE counter = counter + 10
This requires the cache to understand the operation semantics, not just store final values. Redis, for example, can do this naturally with INCR commands. General-purpose write-back caches may need application-level support for merge logic.
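Here is what merge-based coalescing might look like in Python for the counter case; CoalescingCounterCache and the dict standing in for the database are illustrative assumptions:

```python
class CoalescingCounterCache:
    """Coalesces increments per key instead of storing final values,
    so no concurrent increment is lost; one flush applies the sum."""

    def __init__(self):
        self.pending = {}   # key -> accumulated delta since the last flush

    def incr(self, key, delta):
        # Merge by addition rather than overwriting.
        self.pending[key] = self.pending.get(key, 0) + delta

    def flush(self, database):
        for key, delta in self.pending.items():
            # One database write per key, e.g. UPDATE ... SET v = v + delta
            database[key] = database.get(key, 0) + delta
        self.pending.clear()

db = {"counter": 100}
cache = CoalescingCounterCache()
cache.incr("counter", 5)
cache.incr("counter", 3)
cache.incr("counter", 2)    # pending delta is now +10
cache.flush(db)
assert db["counter"] == 110
```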
Conflict-Free Replicated Data Types (CRDTs):
For advanced use cases, CRDTs provide mathematically guaranteed conflict-free merging:
CRDTs are particularly powerful in distributed caches with multiple replicas, ensuring that regardless of update order or timing, all replicas converge to the same value.
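As a concrete example, the grow-only counter (G-Counter) is one of the simplest CRDTs; this sketch shows why replicas converge regardless of merge order:

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot;
    merge takes the per-replica maximum, so merges commute and converge."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}    # replica_id -> count

    def incr(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("replica-a"), GCounter("replica-b")
a.incr(5); b.incr(3)
a.merge(b); b.merge(a)                 # exchange state in either order...
assert a.value() == b.value() == 8     # ...both replicas converge
```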
Match your conflict resolution strategy to your data semantics. Counters should merge (sum). Timestamps should last-write-win. Sets should union. Forcing the wrong merge strategy onto data leads to correctness bugs.
Implementing cache-first writes correctly requires careful attention to several patterns and practices. These patterns apply regardless of the specific cache technology used.
Pseudo-code for a write operation:
function writeWithCacheFirst(key, value):
    // Step 1: Write to cache and mark dirty
    cache.set(key, value)
    cache.markDirty(key)
    cache.setLastModified(key, currentTimestamp())

    // Step 2: Add to flush queue (if using queue-based flush)
    flushQueue.add(key)

    // Step 3: Acknowledge to caller immediately
    return SUCCESS
    // Note: Database write happens asynchronously later
Pseudo-code for the flush process:
function flushDirtyEntries():
    dirtyKeys = cache.getDirtyKeys()
    for each key in dirtyKeys:
        // Snapshot the timestamp BEFORE reading the value: if the entry
        // changes after this point, the check below fails, the entry
        // stays dirty, and the newer value is flushed in the next cycle.
        lastModified = cache.getLastModified(key)
        value = cache.get(key)
        try:
            // Write to database
            database.upsert(key, value, lastModified)
            // Only clear dirty if the cache value hasn't changed since
            if cache.getLastModified(key) == lastModified:
                cache.clearDirty(key)
            // else: entry was modified again, remains dirty
        catch DatabaseException:
            // Leave dirty for retry
            log.error("Flush failed for key: " + key)
            metrics.increment("flush.failures")
This pattern ensures that even if the entry is modified again during flush, the new value will be flushed in the next cycle.
Cache-first writes introduce specific failure modes that don't exist in write-through architectures. Understanding and planning for these is essential.
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Cache node crash | All dirty entries on that node are lost | Replication, persistent cache (Redis AOF), short dirty windows |
| Cache cluster network partition | Dirty entries stranded on unreachable nodes | Quorum writes, partition-tolerant flush design |
| Flush process crash | Dirty entries accumulate, eventually causing memory pressure | Multiple flush workers, flush process monitoring, dead-letter handling |
| Database unavailable during flush | Dirty entries can't flush, cache fills up | Exponential backoff retry, back-pressure on writes, dead-letter queue |
| Application crash after cache write | Write acknowledged but never triggered flush logic | Dirty entry scanner (independent of write path), TTL-based safety flush |
| Network partition (cache ↔ database) | Dirty entries can't flush but writes continue | Circuit breaker to block new writes, admin alerts, reconciliation on recovery |
Cache-first writes mean cache failure equals data loss for dirty entries. This is an inherent trade-off, not a fixable bug. Your architecture must either accept this risk (for appropriate use cases) or add redundancy that effectively makes the cache durable (replication plus persistence).
Designing for failure:
Assume failures will happen — Design recovery mechanisms from day one, not after the first incident.
Bound the blast radius — Shorter dirty windows limit maximum data loss. Trade performance for reduced risk where appropriate.
Monitor dirty entry growth — Alert when dirty entries exceed thresholds; this is a leading indicator of problems (see the sketch after this list).
Test failure scenarios — Use chaos engineering to simulate cache failures. Verify data loss is within acceptable bounds.
Have runbooks — Document what to do when cache fails: how to recover, how to reconcile state, how to communicate to users.
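A minimal sketch of threshold-based dirty-backlog monitoring with back-pressure; the threshold values and names are hypothetical and must be tuned to your flush capacity and risk tolerance:

```python
import logging

DIRTY_WARN_THRESHOLD = 10_000    # flush pipeline falling behind (hypothetical value)
DIRTY_BLOCK_THRESHOLD = 50_000   # apply back-pressure: stop accepting writes (hypothetical)

def check_dirty_backlog(dirty_count):
    """Return True if new writes should be accepted, False to shed load."""
    if dirty_count >= DIRTY_BLOCK_THRESHOLD:
        logging.critical("Dirty backlog %d: rejecting new writes (circuit open)", dirty_count)
        return False
    if dirty_count >= DIRTY_WARN_THRESHOLD:
        logging.warning("Dirty backlog %d: flush pipeline falling behind", dirty_count)
    return True
```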
Let's consolidate the implications of making the cache the primary write destination:
The fundamental perspective shift:
When you adopt cache-first writes, you're not just "adding a cache"—you're changing the system's consistency model and durability guarantees. The cache becomes infrastructure, not optimization. It must be treated with the same seriousness as the database itself.
What's next:
Now that we understand what it means for data to go to the cache first, the next page explores the asynchronous database write process: how dirty entries are flushed to the database, the mechanisms involved, and how to make the async pipeline reliable.
You now understand the profound implications of cache-first writes: the cache as temporary system of record, dirty window semantics, consistency model changes, and failure scenarios. Next, we'll explore the asynchronous database write mechanism.