The magic of write-back caching lies in its asynchronous nature: writes are acknowledged immediately while database persistence happens in the background. But 'asynchronous' isn't a simple concept—it's an entire engineering discipline.
How do you ensure dirty entries eventually reach the database? What happens when the database is slow or unavailable? How do you batch writes efficiently without accumulating unbounded risk? How do you monitor an invisible background process?
This page explores the engineering of asynchronous database writes: the flush mechanisms, reliability patterns, backpressure handling, and operational visibility needed to make write-back caching production-ready.
By the end of this page, you will understand how to design and implement reliable asynchronous persistence: flush scheduling policies, batching strategies, retry mechanisms, backpressure handling, ordering guarantees, and the observability infrastructure needed to operate async write systems confidently.
The flush process is the component responsible for moving dirty entries from the cache to the database. Its architecture determines the system's performance, reliability, and failure modes.
Core components of the flush system:
| Component | Responsibility | Key Design Considerations |
|---|---|---|
| Dirty Entry Scanner | Identifies entries needing flush | Efficiency of dirty detection, incremental vs full scan |
| Flush Scheduler | Determines when to trigger flush | Trigger policies (time, count, size), adaptive scheduling |
| Batch Assembler | Groups entries for efficient writes | Batch size optimization, transaction boundaries |
| Database Writer | Executes actual database operations | Connection pooling, retry logic, timeout handling |
| Confirmation Handler | Marks entries clean after success | Atomicity of confirmation, handling partial batch success |
| Failure Processor | Handles write failures | Retry policies, dead-letter handling, alerting |
Flush process flow:
┌───────────────────────────────────────────────────────────────────┐
│ FLUSH CYCLE │
└───────────────────────────────────────────────────────────────────┘
┌─────────────┐
│ Trigger │ ◄── Time elapsed / Entry count / Size threshold
└──────┬──────┘
│
▼
┌─────────────┐
│ Scan for │
│Dirty Entries│
└──────┬──────┘
│
▼
┌─────────────┐
│ Assemble │
│ Batches │
└──────┬──────┘
│
▼
┌─────────────┐ ┌─────────────┐
│ Write │────▶│ Database │
│ Batches │◀────│ Response │
└──────┬──────┘ └─────────────┘
│
┌───────┴───────┐
│ │
Success? Failure?
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Mark Clean │ │ Retry / │
│ │ │ Dead Letter │
└─────────────┘ └─────────────┘
Each component in this flow must be carefully designed. Failures at any stage need graceful handling without losing data or leaving the system in an inconsistent state.
The flush process can be push-based (writes add entries to a flush queue) or pull-based (periodic scan finds dirty entries). Push is more responsive but adds write-path complexity. Pull is simpler but may have latency variation. Many systems use hybrids: push for immediate visibility, pull as a safety net.
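As a rough illustration, a minimal pull-based flush cycle can be sketched as follows. The `cache` and `db` objects and their methods (`dirty_entries`, `write_batch`, `mark_clean`, `record_flush_failure`) are hypothetical placeholders, not a real API:

import time

def flush_cycle(cache, db, batch_size=200):
    """One flush cycle: scan for dirty entries, batch them, write, confirm."""
    dirty = cache.dirty_entries()                 # Dirty Entry Scanner
    for i in range(0, len(dirty), batch_size):
        batch = dirty[i:i + batch_size]           # Batch Assembler
        try:
            db.write_batch(batch)                 # Database Writer
            cache.mark_clean(batch)               # Confirmation Handler
        except Exception:
            cache.record_flush_failure(batch)     # Failure Processor (retry / DLQ)

def run_flusher(cache, db, interval_seconds=5):
    """Pull-based safety net: run a flush cycle on a fixed interval."""
    while True:
        flush_cycle(cache, db)
        time.sleep(interval_seconds)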
The flush scheduler determines when dirty entries move to the database. The policy choice fundamentally affects durability, performance, and system behavior under various loads.
Policy 1: Fixed Interval Flush
Every T seconds:
Scan for dirty entries
Flush all dirty entries to database
Characteristics: bounds the age of unflushed data to T seconds, but batch sizes (and therefore efficiency) vary with the write rate; low complexity; no adaptation to load.
Best for: Systems with relatively stable write rates and moderate durability requirements.
Policy 2: Threshold-Based Flush
When dirty_entry_count >= N:
Flush dirty entries to database
Characteristics: batches are consistently sized, so flush efficiency is high and naturally tracks load, but the time an entry can remain unflushed is unbounded when traffic is light; low complexity.
Best for: Systems with variable load where batching efficiency matters more than strict time bounds.
Policy 3: Hybrid (Time OR Count)
Flush when:
(dirty_entry_count >= N) OR (time_since_last_flush >= T)
Characteristics: caps unflushed age at T seconds while still batching efficiently under load; moderate complexity; partially adaptive.
Best for: Most production workloads. Recommended as the default approach.
Policy 4: Adaptive Flush
Monitor:
- Current dirty entry count
- Database latency percentiles
- Cache memory pressure
Adjust flush frequency dynamically:
- More aggressive when dirty count is high or memory is tight
- Less aggressive when database is slow (to allow recovery)
- Surge protection when approaching limits
Characteristics: configurable durability bounds, excellent efficiency, and full adaptation to changing conditions, at the cost of high complexity and ongoing tuning.
Best for: Large-scale systems with dedicated infrastructure teams and varying workload patterns.
| Policy | Durability Bound | Efficiency | Complexity | Adaptability |
|---|---|---|---|---|
| Fixed Interval | T seconds | Variable | Low | None |
| Threshold-Based | Unbounded | High | Low | To load |
| Hybrid (Time OR Count) | T seconds | Good | Medium | Partial |
| Adaptive | Configurable | Excellent | High | Full |
Begin with a hybrid policy (time OR count threshold). Measure behavior under real load. Only move to adaptive scheduling if you have evidence that simpler policies don't meet your needs. Complexity has ongoing maintenance costs.
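For illustration, the hybrid trigger condition fits in a few lines of Python; the class below is a sketch, not production code:

import time
from dataclasses import dataclass, field

@dataclass
class HybridFlushPolicy:
    max_dirty: int = 1000          # N: flush once this many entries are dirty
    max_interval: float = 5.0      # T: flush at least this often, in seconds
    last_flush: float = field(default_factory=time.monotonic)

    def should_flush(self, dirty_count: int) -> bool:
        # Trigger when EITHER the count threshold OR the time bound is reached.
        elapsed = time.monotonic() - self.last_flush
        return dirty_count >= self.max_dirty or elapsed >= self.max_interval

    def record_flush(self) -> None:
        self.last_flush = time.monotonic()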
Batching is crucial for flush efficiency. Writing entries one-by-one wastes network round-trips and database resources. But batching too aggressively creates other problems. Let's explore the strategies and trade-offs.
Why batch? Each database round-trip carries fixed network and per-statement overhead; grouping many entries into one request amortizes that cost.
The math of batching:
Single write:
Network RTT: 2ms
DB processing: 5ms
Total: 7ms per write
1000 writes = 7000ms = 7 seconds
Batched write (100 entries per batch):
Network RTT: 2ms (same)
DB processing: 50ms (sublinear scaling)
Total: 52ms per batch
1000 writes = 10 batches × 52ms = 520ms
This 13x improvement is typical. Actual gains depend on database, network, and write patterns.
Optimal batch sizing:
Batch size optimization depends on your specific database and network characteristics:
│
Write │ ╭────── Optimal zone
Throughput │ ╭─╯
(writes/sec) │ ╭─╯
│ ╭─╯
│ ╭─╯
│─╯
├──────────────────────────────────────────
1 10 50 100 500 1000 5000 10000
Batch Size
← Too small: Optimal: Too large: →
Network overhead Sweet spot Large batches slower,
dominates for efficiency memory pressure,
partial failure risk
Recommended approach: start in the middle of the optimal zone above (on the order of 100 to 1,000 entries per batch) and tune from measured throughput, flush latency, and memory usage for your database and network.
Different databases have different optimal strategies: PostgreSQL favors COPY for bulk inserts, MySQL batch INSERT is efficient, MongoDB bulkWrite is optimized, and Cassandra BATCH has size limits. Know your database's bulk write primitives.
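As a sketch, here is how a batch of dirty entries might be chunked and upserted through the standard Python DB-API. The table and column names, the PostgreSQL-style ON CONFLICT clause, and the shape of `entries` are assumptions for illustration; substitute your database's bulk primitive (COPY, bulkWrite, etc.) where appropriate:

def flush_entries(conn, entries, batch_size=500):
    """Upsert (key, value) pairs in chunks via DB-API executemany."""
    sql = (
        "INSERT INTO kv_store (key, value) VALUES (%s, %s) "
        "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value"
    )
    with conn.cursor() as cur:
        for i in range(0, len(entries), batch_size):
            # One executemany call per chunk; how the driver batches this on
            # the wire varies by database and driver.
            cur.executemany(sql, entries[i:i + batch_size])
    conn.commit()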
Async systems must handle failures gracefully. Unlike synchronous writes where failures are immediately visible to clients, async failures are invisible—and potentially catastrophic if not handled properly.
Types of flush failures:
| Failure Type | Example | Appropriate Response |
|---|---|---|
| Transient | Network timeout, connection reset | Immediate retry with backoff |
| Temporary Unavailability | Database failover, maintenance | Extended retry, potentially minutes |
| Resource Exhaustion | Connection pool full, disk full | Wait for recovery, possibly shed load |
| Permanent (per-entry) | Constraint violation, invalid data | Dead-letter for manual review |
| Systemic | Wrong credentials, schema mismatch | Stop and alert, requires human intervention |
Retry strategy design:
A robust retry strategy must balance persistence with not overwhelming the database:
Retry Policy:
max_retries: 5
initial_delay: 100ms
max_delay: 30s
backoff_multiplier: 2
jitter: ±10%
Retry Schedule (example):
Attempt 1: Immediate
Attempt 2: 100ms later
Attempt 3: 200ms later
Attempt 4: 400ms later
Attempt 5: 800ms later
Attempt 6: Dead letter queue
With jitter, actual delays vary slightly:
Attempt 2: 90-110ms
Attempt 3: 180-220ms
...
Why jitter? When many flush workers fail simultaneously (e.g., database restart), without jitter they all retry at the same time, causing a retry storm. Jitter spreads retries temporally.
A dead-letter queue (DLQ) is not optional; it's required for production systems. Without one, failed entries are either lost or block the flush pipeline indefinitely. The DLQ serves both needs: it preserves failed entries for later analysis and manual intervention, and it lets the system continue processing.
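A sketch of a retry loop implementing the policy above, with a dead-letter fallback. `write_fn` and `dead_letter_queue` are placeholders for your batch writer and DLQ storage:

import random
import time

def flush_with_retries(write_fn, batch, dead_letter_queue,
                       max_attempts=5, initial_delay=0.1,
                       max_delay=30.0, multiplier=2.0, jitter=0.10):
    """Attempt a batch write with exponential backoff and jitter, then dead-letter."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            write_fn(batch)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: preserve the batch for analysis and manual replay.
                dead_letter_queue.append((batch, repr(exc)))
                return False
            # Jitter spreads retries so many workers don't hammer the DB in lockstep.
            time.sleep(min(delay, max_delay) * random.uniform(1 - jitter, 1 + jitter))
            delay *= multiplier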
What happens when writes arrive faster than the database can accept them? This is the backpressure problem, and how you handle it determines whether your system degrades gracefully or catastrophically.
The backpressure scenario:
Incoming writes: 1000 writes/second
Database capacity: 500 writes/second
Without backpressure:
Accumulation rate: 1000 - 500 = 500 entries/second
T+0:    0 dirty entries
T+10:   5,000 entries accumulated
T+60:   30,000 entries accumulated
T+120:  60,000 entries accumulated → cache memory exhausted → system failure
Without flow control, the system will eventually crash when cache memory is exhausted.
Implementing backpressure signals:
The flush system should expose signals that other components can use:
Backpressure Signals:
dirty_entry_count: Number of unflushed entries
dirty_entry_percentage: Dirty count / Cache capacity
flush_lag_seconds: Oldest unflushed entry age
flush_success_rate: Recent flush success percentage
Thresholds:
WARNING (soft): dirty_percentage > 50% OR flush_lag > 10s
CRITICAL (hard): dirty_percentage > 80% OR flush_lag > 60s
Responses:
WARNING: Alert operators, speed up flush rate
CRITICAL: Enable backpressure (block/reject writes)
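A minimal sketch of turning these signals into a pressure level, using the example thresholds above:

def backpressure_level(dirty_count, cache_capacity, flush_lag_seconds):
    """Map flush signals to OK / WARNING / CRITICAL using the example thresholds."""
    dirty_pct = dirty_count / cache_capacity
    if dirty_pct > 0.80 or flush_lag_seconds > 60:
        return "CRITICAL"   # enable backpressure: block or reject new writes
    if dirty_pct > 0.50 or flush_lag_seconds > 10:
        return "WARNING"    # alert operators, increase flush rate
    return "OK"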
The graceful degradation ladder:
Normal Operation:
→ Full async write-back, maximum performance
Elevated Dirty Count (50-80%):
→ Increase flush frequency, warn operators
High Dirty Count (80-95%):
→ Block new writes until flush catches up
Near Exhaustion (>95%):
→ Emergency synchronous flush, reject new writes
Exhausted (100%):
→ Full write-through mode or system error
Every async system will eventually face more load than it can handle. Design backpressure mechanisms before you need them. A system that fails gracefully under overload is far better than one that crashes spectacularly. The goal is degraded operation, not total failure.
Asynchronous writes introduce ordering challenges that don't exist in synchronous systems. Understanding these is crucial for correct system behavior.
The ordering problem:
Scenario: Two writes to the same key
T1: Write key=A, value=1 (enters cache)
T2: Write key=A, value=2 (updates cache, value=2)
T3: Flush starts, sees value=2
T4: Write to database: key=A, value=2 ✓
This is correct - last value wins.
But what about this scenario?
T1: Write key=A, value=1 (dirty)
T2: Flush starts, snapshot includes key=A, value=1
T3: Write key=A, value=2 (updates cache to value=2)
T4: Flush completes: database has value=1
T5: Key=A marked clean (WRONG!)
Now cache has value=2, database has value=1, entry is marked clean.
Inconsistency!
This is the snapshot isolation problem in write-back caching.
Solutions to the snapshot isolation problem:
Solution 1: Version Numbers
Cache entry: {key, value, version, dirty}
T1: Write key=A, value=1, version=1, dirty=true
T2: Flush starts, records: flushing key=A at version=1
T3: Write key=A, value=2, version=2, dirty=true
T4: Flush completes for version=1
T5: Clear dirty ONLY IF version is still 1
Version is now 2, so entry stays dirty
T6: Next flush picks up version=2
✓ Correct!
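A sketch of Solution 1 as a small, thread-safe cache entry; the structure is illustrative rather than any specific library's API:

import threading

class VersionedEntry:
    """Cache entry whose dirty flag is only cleared if no write raced the flush."""

    def __init__(self, value):
        self.value = value
        self.version = 1
        self.dirty = True
        self._lock = threading.Lock()

    def write(self, value):
        with self._lock:
            self.value = value
            self.version += 1          # a write during a flush bumps the version
            self.dirty = True

    def snapshot_for_flush(self):
        with self._lock:
            return self.value, self.version

    def mark_clean_if_version(self, flushed_version):
        with self._lock:
            if self.version == flushed_version:
                self.dirty = False     # nothing changed since the snapshot
            # Otherwise a newer write exists; stay dirty for the next flush.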
Solution 2: Last-Modified Timestamps
Similar to version numbers but using last-modified timestamps. Simpler, but writes landing within the same clock tick become indistinguishable, and correctness depends on reliable clocks.
Solution 3: Copy-on-Flush
When starting a flush, take a snapshot (copy) of the dirty value. The cache continues accepting writes to the live value. Flush writes the snapshot. After flush completes, compare live value to snapshot—if unchanged, mark clean.
Solution 4: Lock During Flush
Lock the entry while flushing it. Writes to that key block until flush completes. Simple but reduces write concurrency significantly.
| Solution | Concurrency | Complexity | Storage Overhead | Use Case |
|---|---|---|---|---|
| Version Numbers | High | Medium | 8 bytes/entry | General purpose, recommended |
| Timestamps | High | Low | 8 bytes/entry | When clock sync is reliable |
| Copy-on-Flush | High | High | Memory for copies | High write rates, short flushes |
| Lock During Flush | Low | Low | None | Low contention, simple cases |
Note that async writes provide NO ordering guarantees across different keys. If you write key A then key B, they may reach the database in either order. If you need cross-key ordering (e.g., parent before child), you need additional mechanisms like transaction grouping or explicit sequencing.
For high-throughput systems, a single flush worker may not keep up with the rate of dirty entries. Parallelizing the flush process is necessary but introduces its own challenges.
Single-Threaded Flush:
[Dirty Entries] → [Single Flush Worker] → [Database]
Pros: Simple, no coordination needed
Cons: Limited throughput; the single worker becomes the bottleneck
Multi-Worker Flush:
┌─→ [Worker 1] ──┐
[Dirty Entries] ─→ ├─→ [Worker 2] ──┼──→ [Database]
(partitioned) └─→ [Worker 3] ──┘
Pros: Higher throughput, distributes load
Cons: Coordination complexity, potential conflicts
Avoiding flush conflicts:
When multiple workers are flushing, you must ensure that no two workers flush the same key concurrently, and that dirty flags are cleared correctly even when new writes arrive mid-flush.
Recommended architecture:
Partitioned Flush Workers:
Worker W1: Responsible for keys where hash(key) % 4 == 0
Worker W2: Responsible for keys where hash(key) % 4 == 1
Worker W3: Responsible for keys where hash(key) % 4 == 2
Worker W4: Responsible for keys where hash(key) % 4 == 3
Each worker:
1. Scans its partition for dirty entries
2. Creates batch from its dirty entries
3. Writes batch to database
4. Clears dirty flags (with version check)
No coordination needed between workers.
Each key is "owned" by exactly one worker.
This key-based partitioning is simple, scales linearly with worker count, and avoids all coordination overhead.
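A sketch of the partitioning rule. A stable hash (here CRC32) matters because Python's built-in hash() for strings is randomized per process; `cache.dirty_items`, `db.write_batch`, and `cache.mark_clean` are hypothetical interfaces:

import zlib

NUM_WORKERS = 4

def owner_of(key: str) -> int:
    """Stable hash so the same key always maps to the same flush worker."""
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS

def flush_partition(worker_id, cache, db):
    """Each worker flushes only the keys it owns; no cross-worker coordination."""
    mine = [(k, v) for k, v in cache.dirty_items() if owner_of(k) == worker_id]
    if mine:
        db.write_batch(mine)
        cache.mark_clean([k for k, _ in mine])   # clear dirty flags with version check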
The number of flush workers can be dynamic. During high load, spin up more workers. During low load, reduce to save resources. Monitor flush lag as the primary signal for scaling decisions.
Asynchronous operations are notoriously difficult to observe and debug. The flush process is invisible to users, so problems can accumulate silently. Comprehensive monitoring is essential.
Critical metrics to track:
| Metric | What It Tells You | Alert Threshold Example |
|---|---|---|
| dirty_entry_count | Current unflushed entries | 80% of cache capacity |
| flush_lag_seconds | Age of oldest dirty entry | 30 seconds |
| flush_rate_per_second | Entries being flushed | Sudden drop > 50% |
| flush_batch_size_avg | Entries per flush batch | < 10 (inefficient) or > 5000 (too large) |
| flush_latency_p99 | 99th percentile flush time | 1 second |
| flush_error_rate | Failed flush operations | 1% |
| dlq_size | Dead-letter queue entries | 100 entries |
| oldest_dlq_entry_age | Age of oldest DLQ entry | 1 hour |
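For instance, these metrics can be gathered into a single snapshot and pushed to whatever metrics backend you use; `gauge(name, value)` below is a stand-in for that backend's API:

from dataclasses import dataclass

@dataclass
class FlushMetrics:
    """Point-in-time snapshot of the critical flush metrics above."""
    dirty_entry_count: int
    flush_lag_seconds: float
    flush_rate_per_second: float
    flush_batch_size_avg: float
    flush_latency_p99_ms: float
    flush_error_rate: float
    dlq_size: int
    oldest_dlq_entry_age_seconds: float

def emit(metrics: FlushMetrics, gauge):
    """Push each field as a gauge, e.g. writeback.dirty_entry_count."""
    for name, value in vars(metrics).items():
        gauge(f"writeback.{name}", value)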
Tracing async operations:
Distributed tracing for async writes is challenging because the write and flush are disconnected. Consider:
Correlation IDs — Include a trace ID in the cache entry metadata. When flushing, emit a span that references the original write's trace ID.
Write-Flush Linkage — Log both the write and the flush with the same entry identifier. Join them later for analysis.
Async Span Model — Some tracing systems support async spans where the parent span completes before the child.
Log hygiene:
// Good flush logging
{
"event": "flush_batch_complete",
"batch_id": "b-123456",
"entries_count": 247,
"success_count": 245,
"failure_count": 2,
"duration_ms": 156,
"oldest_entry_age_ms": 4521,
"database_latency_ms": 142,
"timestamp": "2024-01-15T14:32:01.456Z"
}
This level of detail enables debugging issues days or weeks later.
Async systems are hard to debug without proper observability. Build monitoring into the flush system from the start, not after the first production incident. When problems occur, you'll be glad you have the data to understand what happened.
Asynchronous database writes are the engine that makes write-back caching work.
The reliability principle:
Asynchronous systems are fundamentally about trading immediate consistency for performance. But "async" doesn't mean "unreliable." A well-engineered async persistence layer can be as reliable as synchronous writes, just with different (and clearly documented) guarantees.
What's next:
Now that we understand the async persistence mechanism, the next page explores the performance benefits of write-back caching in detail: quantifying the gains, understanding where the speedup comes from, and when those benefits are largest.
You now understand how to build reliable asynchronous database write systems: scheduling policies, batching strategies, retry handling, backpressure, ordering, parallelization, and observability.