Every engineering decision involves trade-offs. Write-back caching delivers exceptional performance, but that performance comes at a cost: durability risk. When writes are acknowledged before persistence, data lives in memory—and memory, unlike disk, disappears when power fails or processes crash.
This page confronts durability concerns directly. We'll examine exactly what can be lost, quantify the risk in concrete terms, explore mitigation strategies, and develop frameworks for deciding when write-back caching's durability trade-off is acceptable.
This is not fear-mongering. Many successful systems operate with write-back caching at massive scale. But they do so with clear understanding of the risks and deliberate choices about what data can tolerate those risks.
By the end of this page, you will understand: what durability means and why it matters, exactly what data is at risk in write-back architectures, how to quantify potential data loss, mitigation strategies for reducing risk, hybrid approaches that balance performance and safety, and frameworks for making durability decisions.
Before discussing durability concerns, we need a precise definition of durability and why it matters.
What is durability?
Durability is a data storage property guaranteeing that once a system acknowledges a write operation, that data will persist even through failures:
In database terminology (ACID), the D stands for Durability: committed data survives failures.
The durability spectrum, from less durable to more durable:

| | In-memory only | Write-back cache | Write-through to local disk | Synchronous replication to remote DC |
|---|---|---|---|---|
| Survives | Nothing | Process restart, graceful shutdown | Process crash, power loss | Disk crash, DC failure, region disaster |
| Risk | Maximum | Bounded by flush interval | Low | Minimal |
Write-back caching sits in the middle of this spectrum—more durable than pure in-memory (because periodic flushes persist data), but less durable than immediate disk writes.
Durability (data survives failures) and availability (system accepts requests) are different properties. Write-back caching can improve availability (cache buffers during database outages) while reducing durability (unflushed data at risk). Don't conflate them.
Why durability matters:
For some data, durability is non-negotiable:
| Data Type | Durability Requirement | Consequence of Loss |
|---|---|---|
| Financial transactions | Absolute | Legal liability, regulatory fines |
| Medical records | Very high | Patient safety, legal compliance |
| Audit logs | High | Compliance violations |
| Order records | High | Revenue loss, customer disputes |
| User content | Medium-high | User trust erosion |
| Session state | Medium | User inconvenience |
| Analytics events | Lower | Metrics inaccuracy |
| Counters/views | Low | Approximate data acceptable |
The key insight: not all data has the same durability requirements. Write-back caching is appropriate for data where bounded loss is acceptable, not for data where any loss is catastrophic.
Let's be precise about what data is at risk in a write-back caching system.
The At-Risk Data:
Only dirty entries (writes that haven't been flushed to the database) are at risk. Clean entries exist in both cache and database, so cache failure doesn't lose them.
Cache State at Time of Failure:
Entry A: value=100, dirty=false → Safe (exists in DB)
Entry B: value=200, dirty=true → AT RISK (only in cache)
Entry C: value=300, dirty=true → AT RISK (only in cache)
Entry D: value=400, dirty=false → Safe (exists in DB)
If cache fails now:
- Entry B lost
- Entry C lost
- Entries A and D unaffected
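To make the bookkeeping concrete, here is a minimal TypeScript sketch (hypothetical names, not a specific library) of how a write-back cache tracks dirty entries and can report exactly what would be lost if the process died right now:

```typescript
interface CacheEntry<V> {
  value: V;
  dirty: boolean;      // true until the entry has been flushed to the database
  dirtiedAt?: number;  // when the unflushed write happened (ms since epoch)
}

class WriteBackCache<V> {
  private entries = new Map<string, CacheEntry<V>>();

  // Acknowledge immediately; persistence happens later in the flush loop.
  write(key: string, value: V): void {
    this.entries.set(key, { value, dirty: true, dirtiedAt: Date.now() });
  }

  // Called by the flush loop after the database write succeeds.
  markFlushed(key: string): void {
    const entry = this.entries.get(key);
    if (entry) entry.dirty = false;
  }

  // Everything returned here would be lost if the process crashed now.
  atRiskKeys(): string[] {
    return [...this.entries].filter(([, e]) => e.dirty).map(([key]) => key);
  }
}
```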
What is NOT lost: clean entries that have already been flushed, and anything already persisted in the database. A cache failure forces those entries to be re-read from the database (a cold-cache performance hit), but the data itself survives.
Failure scenarios:
| Failure | What's Lost | Recovery Path |
|---|---|---|
| Single cache process crash | Dirty entries in that process | Cache restart, processes rebuild from DB |
| Cache node failure | Dirty entries on that node | Failover to replica (if replicated) |
| Cache cluster partition | Dirty entries on unreachable side | Partition heals, or data lost |
| Full cache cluster failure | All dirty entries | Rebuild from database (losing dirty data) |
| Power loss (non-persistent cache) | All cache data | Rebuild from database |
| Redis with AOF | Writes since the last AOF fsync | Replay AOF on restart |
| Redis with replication | Usually none (replica has copy) | Failover to replica |
Quantifying maximum loss:
The maximum data at risk at any moment is:
Max Loss = (Writes per second) × (Flush interval)
Example:
Write rate: 10,000 writes/second
Flush interval: 5 seconds
Max unflushed writes = 10,000 × 5 = 50,000
If cache fails instantly before flush:
Maximum data loss = 50,000 unflushed writes
With write coalescing, the picture changes:
Write rate: 10,000 writes/second
Unique keys written per second: 1,000
Flush interval: 5 seconds
Max dirty entries = 5,000 unique keys
(but representing 50,000 write operations)
You lose the latest value for 5,000 keys, but the database has the prior values.
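The same arithmetic as a back-of-the-envelope helper (the numbers are the example's, not benchmarks):

```typescript
// Worst case without coalescing: every write in the interval is unflushed.
function maxWritesAtRisk(writesPerSecond: number, flushIntervalSec: number): number {
  return writesPerSecond * flushIntervalSec;
}

// With coalescing, only the latest value per unique key is dirty.
function maxKeysAtRisk(uniqueKeysPerSecond: number, flushIntervalSec: number): number {
  return uniqueKeysPerSecond * flushIntervalSec;
}

console.log(maxWritesAtRisk(10_000, 5)); // 50,000 write operations at risk
console.log(maxKeysAtRisk(1_000, 5));    // latest values of 5,000 keys at risk
```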
The worst case is cache failure immediately before a scheduled flush, when dirty entry count is at maximum. Design for this scenario, not average case. If maximum potential loss is unacceptable, write-back caching is not appropriate for that data.
Different failures have different impacts. Understanding failure modes helps you design appropriate mitigations.
1. Cache Process Crash
Cause: Bug, OOM, signal
Scope: Single process
Frequency: Rare with good software practices
Mitigation: Process restart, replication
Data loss: Dirty entries in that process only
Impact assessment: Usually limited. Modern cache systems restart quickly. Replication eliminates loss entirely.
2. Cache Host Failure
Cause: Hardware failure, kernel panic, power failure
Scope: Single server
Frequency: Rare (once per server-year typical)
Mitigation: Cluster deployment, replication
Data loss: Dirty entries served by that host
Impact assessment: Moderate. If cache is replicated, data is preserved. If not replicated, data loss equals dirty entries on that host.
3. Network Partition
Cause: Network equipment failure, configuration error
Scope: Subset of cluster
Frequency: Uncommon but not rare
Mitigation: Partition-tolerant cluster design
Data loss: Depends on consistency settings
Impact assessment: Complex. Writes may queue on one side of partition. When partition heals, conflict resolution determines outcome. Poorly designed systems can lose data or duplicate it.
4. Full Cluster Outage
Cause: DC power failure, catastrophic event
Scope: Entire cache cluster
Frequency: Very rare
Mitigation: Persistence (AOF/RDB), multi-region
Data loss: All dirty entries if no persistence
Impact assessment: Severe but rare. If cache has persistence enabled (Redis AOF), data loss is minimal. Without persistence, all dirty data is lost.
5. Cascading Failure
Cause: Cache overload, resource exhaustion
Scope: Entire system
Frequency: Rare with proper capacity planning
Mitigation: Circuit breakers, graceful degradation
Data loss: Potentially all dirty entries
Impact assessment: The most dangerous scenario. A failure cascade can kill the entire cache layer. Recovery often requires manual intervention.
| Failure Type | Probability | Impact (No Mitigation) | Impact (With Best Practices) |
|---|---|---|---|
| Process crash | Medium | Up to one flush interval of writes | None (replication) |
| Host failure | Low | Host's dirty data | None (replication) |
| Network partition | Low | Complex/variable | Minimal (good design) |
| Cluster outage | Very low | All dirty data | Seconds of data (with persistence) |
| Cascading failure | Very low | All dirty data + downtime | Graceful degradation |
Layer your mitigations: replication handles node failures, persistence handles cluster failures, multi-region handles datacenter failures. Each layer reduces residual risk. The goal is to make data loss require multiple simultaneous failures.
You can't eliminate durability risk in write-back caching (that would make it write-through), but you can substantially reduce it. The comparison table later in this section summarizes the primary mitigation strategies; the most widely used one, Redis persistence, deserves a closer look first.
Strategy deep-dive: Redis persistence options
Redis, the most common write-back cache, offers two persistence mechanisms:
RDB (Snapshots):
- Periodic point-in-time snapshots to disk
- Fast recovery (load snapshot)
- Data loss: Changes since last snapshot
- Typical interval: 60 seconds to 15 minutes
- Best for: Disaster recovery, backup
AOF (Append-Only File):
- Logs every write operation to disk
- Configurable fsync: every-write, every-second, OS-controlled
- Data loss with every-second: Up to 1 second
- Data loss with every-write: Near zero (but slower)
- Best for: Higher durability requirements
Combined approach:
- Enable both RDB and AOF
- RDB for efficient recovery/backup
- AOF for minimizing data loss
- Redis uses AOF if both exist on restart
With appendfsync everysec (default), Redis provides near-database-level durability while maintaining cache performance. Maximum data loss is ~1 second of writes.
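As a concrete sketch, a redis.conf enabling both mechanisms might look like this (the snapshot threshold is illustrative, not a recommendation):

```
# Append-only file, fsynced once per second (bounds loss to ~1 second of writes)
appendonly yes
appendfsync everysec

# RDB snapshots for fast recovery and backups:
# snapshot if at least 10,000 keys changed within 60 seconds
save 60 10000
```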
| Strategy | Max Data Loss | Performance Impact | Complexity | Cost |
|---|---|---|---|---|
| Baseline (no mitigation) | Full flush interval | None | Low | Low |
| Shorter flush interval | Reduced proportionally | Higher DB load | Low | Low |
| Cache replication | Near zero (sync) | Small latency increase | Medium | 2x cache cost |
| Redis AOF everysec | ~1 second | Minimal (~5%) | Low | Disk for cache |
| Redis AOF always | Near zero | Moderate (~20%) | Low | Disk for cache |
| Multi-region | Near zero | Cross-region latency | High | Significant |
Match mitigation to requirements. If 5 seconds of data loss is acceptable, a 5-second flush interval is sufficient. If <1 second is needed, use Redis AOF. If zero loss is required, write-back caching may not be appropriate—consider write-through instead.
Real-world systems often don't fit neatly into "write-through" or "write-back" categories. Hybrid approaches apply different strategies to different data types, optimizing for each.
Pattern 1: Data Classification
Classify writes by durability requirements:
Incoming Write
│
├── Is this critical data?
│ (payments, orders, audit logs)
│ │
│ └── YES → Write-through (immediate DB write)
│
└── Is this non-critical data?
(counters, session, analytics)
│
└── YES → Write-back (async DB write)
This ensures critical data always has immediate durability while non-critical data benefits from write-back performance.
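A sketch of that routing logic in TypeScript (the interfaces and prefix-based classification are illustrative stand-ins, not a real client API):

```typescript
type DataClass = 'critical' | 'non_critical';

// Minimal interfaces standing in for your real database and cache clients.
interface Database { write(key: string, value: unknown): Promise<void>; }
interface Cache {
  set(key: string, value: unknown): Promise<void>;       // clean entry
  setDirty(key: string, value: unknown): Promise<void>;  // flushed later by the flush loop
}

// Hypothetical classification by key prefix; real systems usually classify
// by data type or table rather than by string matching.
function classify(key: string): DataClass {
  return /^(payment|order|audit):/.test(key) ? 'critical' : 'non_critical';
}

async function handleWrite(db: Database, cache: Cache, key: string, value: unknown): Promise<void> {
  if (classify(key) === 'critical') {
    await db.write(key, value);       // write-through: persist before acknowledging
    await cache.set(key, value);
  } else {
    await cache.setDirty(key, value); // write-back: acknowledge now, persist asynchronously
  }
}
```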
Pattern 2: Priority-Based Flushing
High-Priority Dirty Entries (e.g., order updates):
- Flush within 1 second
- Smaller batches for lower latency
- Higher flush priority
Low-Priority Dirty Entries (e.g., view counts):
- Flush within 30 seconds or 1000 entries
- Larger batches for efficiency
- Lower flush priority
Both remain write-back, but with tiered durability guarantees.
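One way to express those tiers is as per-priority flush policies that the flush loop consults, sketched below (thresholds copied from the example above, not recommendations):

```typescript
interface FlushPolicy {
  maxAgeMs: number;     // flush an entry no later than this after it becomes dirty
  maxBatchSize: number; // flush early once this many entries are pending
}

// Tiered durability: tighter bounds for high-priority data.
const flushPolicies: Record<'high' | 'low', FlushPolicy> = {
  high: { maxAgeMs: 1_000, maxBatchSize: 100 },    // e.g. order updates
  low:  { maxAgeMs: 30_000, maxBatchSize: 1_000 }, // e.g. view counts
};

// The flush loop calls this per tier to decide whether to flush now.
function flushDue(oldestDirtyAgeMs: number, pendingCount: number, policy: FlushPolicy): boolean {
  return oldestDirtyAgeMs >= policy.maxAgeMs || pendingCount >= policy.maxBatchSize;
}
```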
Pattern 3: Synchronous-Then-Async
Critical Phase (e.g., checkout):
- User initiates checkout → Write-through mode
- Every write persisted synchronously
- Latency increase acceptable during critical flow
Non-Critical Phase (e.g., browsing):
- User is browsing → Write-back mode
- Session updates, view tracking async
- Maximum performance for best UX
Pattern 4: Acknowledgment Levels
Some systems offer client-controlled durability:
// The durability level is chosen per write; levels might include:
//   'cache_only'        - fastest, least durable
//   'cache_replicated'  - cache + synchronous replica
//   'cache_and_disk'    - cache + AOF fsync
//   'full'              - cache + immediate DB write
cache.write(key, value, { durability: 'cache_replicated' });
This pushes durability decisions to the application layer, which has context about data importance.
Hybrid approaches add complexity. Start with a uniform approach (all write-back or all write-through). Add classification only when you have clear evidence that different data needs different treatment.
The ultimate question: for a given use case, is write-back caching's durability trade-off acceptable? Here's a framework for making that decision.
Step 1: Characterize the data
Questions to answer:
1. What is this data? (user content, transactions, metrics, etc.)
2. What is the cost of losing 1 minute of this data?
□ Catastrophic (legal, financial, safety)
□ Serious (revenue loss, customer impact)
□ Moderate (user inconvenience, reprocessing needed)
□ Minor (approximate data acceptable)
□ None (data is recreatable or ephemeral)
3. Can lost data be recovered or recreated?
□ No, it's gone forever
□ Partially, with significant effort
□ Yes, from source systems
□ Yes, automatically
4. What's the frequency and volume of writes?
(This determines performance benefit of write-back)
Step 2: Apply the decision matrix
| Data Characteristic | Write-Back Suitability |
|---|---|
| Loss is catastrophic | ❌ Not suitable |
| Regulatory requirements for immediate persistence | ❌ Not suitable |
| Loss causes moderate customer impact | ⚠️ Use with strong mitigations |
| Loss is user inconvenience only | ✅ Suitable with standard mitigations |
| Loss is acceptable (approximate data OK) | ✅ Excellent fit |
| Data is recreatable from source | ✅ Excellent fit |
| High write frequency with hot keys | ✅ Maximum benefit |
| Low write frequency | ⚠️ Benefits may not justify complexity |
Step 3: Define acceptable loss bound
If data is suitable for write-back, define:
Maximum acceptable data loss: ______ seconds
This determines your flush interval ceiling.
Example:
Acceptable loss: 5 seconds
Flush interval: ≤ 5 seconds (probably 2-3 for margin)
Step 4: Evaluate mitigation adequacy
Given max acceptable loss and failure probability:
With replication: Risk level = _____
With persistence (AOF): Risk level = _____
With both: Risk level = _____
Is residual risk acceptable to business stakeholders?
(This is a business decision, not just technical)
Step 5: Document the decision
Decision Record:
Data: [Description]
Pattern: [Write-back / Write-through / Hybrid]
Max loss tolerance: [N seconds]
Mitigations: [Replication, AOF, etc.]
Residual risk: [Description]
Approved by: [Stakeholder]
Review date: [When to revisit]
Documenting the decision ensures that future team members understand why write-back was chosen and what risk was accepted.
Durability decisions should be made by stakeholders who understand the business impact, not just engineers optimizing for performance. If you're reducing durability for performance, get explicit acknowledgment from product/business leadership.
Despite best efforts, data loss can occur. Having a recovery plan minimizes impact and restores confidence.
Incident detection:
How will you know data was lost?
Indicators of potential data loss:
1. Cache node failure alerts
2. Sudden drop in flush success rate
3. Discrepancy between expected and actual DB records
4. User reports of missing data
5. Monitoring gaps in time-series data
Monitoring requirements:
- Alert on any cache node failure
- Track dirty entry count and age
- Monitor flush lag continuously
- Log flush failures with entry details
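A minimal sketch of the core checks (metric names and thresholds are placeholders; derive real thresholds from your accepted loss bound):

```typescript
interface CacheHealth {
  dirtyEntryCount: number;   // current unflushed entries
  oldestDirtyAgeMs: number;  // age of the longest-unflushed entry (flush lag)
  flushSuccessRate: number;  // 0..1 over the last monitoring window
}

function durabilityAlerts(h: CacheHealth, maxAcceptableLossMs: number): string[] {
  const alerts: string[] = [];
  if (h.oldestDirtyAgeMs > maxAcceptableLossMs) {
    alerts.push(`flush lag ${h.oldestDirtyAgeMs}ms exceeds loss budget of ${maxAcceptableLossMs}ms`);
  }
  if (h.flushSuccessRate < 0.99) {
    alerts.push(`flush success rate dropped to ${(h.flushSuccessRate * 100).toFixed(1)}%`);
  }
  if (h.dirtyEntryCount > 100_000) {
    alerts.push(`dirty backlog at ${h.dirtyEntryCount} entries`);
  }
  return alerts;
}
```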
Recovery options by source:
| Data Source | Recovery Approach | Limitations |
|---|---|---|
| Redis RDB/AOF | Replay on restart | Only data since last persistence |
| Replica cache | Failover | Sync lag may lose recent writes |
| Source system logs | Replay events | Compute-intensive, may miss transformations |
| Client retry | Client resends failed writes | Only works if client knows it failed |
| Upstream APIs | Re-fetch current state | Only works for derived/cached data |
| No recovery source | Accept loss | Document, compensate users if needed |
Post-incident improvements:
After any data loss incident, evaluate what was lost, how quickly it was detected, whether the loss stayed within the bound you accepted, and which mitigation (replication, persistence, a shorter flush interval) would have prevented or reduced it.
Every incident is a learning opportunity to improve the system's durability posture.
Don't wait for a real incident to test your recovery procedures. Regularly practice cache failure scenarios and recovery processes. Chaos engineering approaches (intentionally killing cache nodes) build confidence and reveal gaps.
Large-scale systems at major tech companies routinely use write-back caching. Understanding their approaches provides practical guidance.
Facebook/Meta:
- Uses Memcached and TAO (graph cache) extensively
- Differentiated durability by data type:
- Social graph: Highly replicated, strong consistency needs
- Counters/likes: Eventual consistency, write-back acceptable
- User content: Write-through for text, async for derived
- Key insight: "Perfect" durability isn't needed for all data
Twitter:
- Tweet delivery uses caching heavily
- Timeline cache can be reconstructed from database
- Follower counts and engagement metrics use write-back
- Key insight: If you can rebuild from source, durability is less critical
Netflix:
- EVCache (Memcached-based) for massive caching layer
- View history can tolerate short-term loss (reconstructed from events)
- Critical state (playback position) has tighter guarantees
- Key insight: Design for rebuild-ability to reduce durability needs
Common themes from industry leaders:
Not all data is equal — Differentiate durability requirements by data criticality.
Design for rebuildability — If data can be reconstructed from events or sources, cache durability is less important.
Accept bounded loss for performance — Explicitly accept that some data loss is possible in exchange for performance benefits.
Invest in detection, not just prevention — Quick detection of data loss enables faster recovery and limits blast radius.
Communicate with users — When data loss affects users, communicate clearly. Users are forgiving when informed.
Defense in depth — Layer mitigations. No single mitigation is perfect.
The pragmatic approach:
"We accept that social-graph engagement data may lose up to 5 seconds
of updates in a cache failure event, estimated at <0.001% of updates
annually. This trade-off enables 10x better performance. Critical
data (messages, content) uses synchronous persistence."
— Example durability policy
This is the mature approach: explicit trade-offs, documented risk acceptance, differentiated treatment.
The largest systems in the world use write-back caching. They've learned that accepting bounded data loss for certain data types is a reasonable trade-off. The key is making that trade-off explicit and informed, not accidental.
Durability is the critical constraint in write-back caching. Making informed decisions about durability trade-offs is essential for successful implementation.
The durability decision in one sentence:
Write-back caching is appropriate when the performance benefit justifies the bounded durability risk for that specific data type, with appropriate mitigations in place.
If the performance benefit does justify that risk for your data, proceed confidently: you have made an informed decision. If it does not, write-through or synchronous patterns are more appropriate.
Module complete:
You now have a comprehensive understanding of write-back (write-behind) caching: its mechanics, its performance benefits, its durability risks, the mitigations that bound those risks, and the frameworks for deciding when to apply it.
With this knowledge, you can design and implement write-back caching systems that deliver exceptional performance while managing durability risk appropriately.
Congratulations! You have completed the Write-Back (Write-Behind) Caching module. You now understand this powerful caching pattern deeply—its mechanics, benefits, risks, and when to apply it. Use this knowledge to build high-performance systems with informed trade-offs.