The magic of write-back caching lies in its asynchronous nature: writes are acknowledged immediately while database persistence happens in the background. But 'asynchronous' isn't a simple concept—it's an entire engineering discipline.
How do you ensure dirty entries eventually reach the database? What happens when the database is slow or unavailable? How do you batch writes efficiently without accumulating unbounded risk? How do you monitor an invisible background process?
This page explores the engineering of asynchronous database writes: the flush mechanisms, reliability patterns, backpressure handling, and operational visibility needed to make write-back caching production-ready.
By the end of this page, you will understand how to design and implement reliable asynchronous persistence: flush scheduling policies, batching strategies, retry mechanisms, backpressure handling, ordering guarantees, and the observability infrastructure needed to operate async write systems confidently.
The flush process is the component responsible for moving dirty entries from the cache to the database. Its architecture determines the system's performance, reliability, and failure modes.
Core components of the flush system:
| Component | Responsibility | Key Design Considerations |
|---|---|---|
| Dirty Entry Scanner | Identifies entries needing flush | Efficiency of dirty detection, incremental vs full scan |
| Flush Scheduler | Determines when to trigger flush | Trigger policies (time, count, size), adaptive scheduling |
| Batch Assembler | Groups entries for efficient writes | Batch size optimization, transaction boundaries |
| Database Writer | Executes actual database operations | Connection pooling, retry logic, timeout handling |
| Confirmation Handler | Marks entries clean after success | Atomicity of confirmation, handling partial batch success |
| Failure Processor | Handles write failures | Retry policies, dead-letter handling, alerting |
Flush process flow:
┌───────────────────────────────────────────────────────────────────┐
│ FLUSH CYCLE │
└───────────────────────────────────────────────────────────────────┘
┌─────────────┐
│ Trigger │ ◄── Time elapsed / Entry count / Size threshold
└──────┬──────┘
│
▼
┌─────────────┐
│ Scan for │
│Dirty Entries│
└──────┬──────┘
│
▼
┌─────────────┐
│ Assemble │
│ Batches │
└──────┬──────┘
│
▼
┌─────────────┐ ┌─────────────┐
│ Write │────▶│ Database │
│ Batches │◀────│ Response │
└──────┬──────┘ └─────────────┘
│
┌───────┴───────┐
│ │
Success? Failure?
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Mark Clean │ │ Retry / │
│ │ │ Dead Letter │
└─────────────┘ └─────────────┘
Each component in this flow must be carefully designed. Failures at any stage need graceful handling without losing data or leaving the system in an inconsistent state.
The flush process can be push-based (writes add entries to a flush queue) or pull-based (periodic scan finds dirty entries). Push is more responsive but adds write-path complexity. Pull is simpler but may have latency variation. Many systems use hybrids: push for immediate visibility, pull as a safety net.
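As a rough illustration, a minimal pull-based flush cycle can be sketched as follows. The `cache` and `db` objects and their methods (`dirty_entries`, `write_batch`, `mark_clean`, `record_flush_failure`) are hypothetical placeholders, not a real API:

import time

def flush_cycle(cache, db, batch_size=200):
    """One flush cycle: scan for dirty entries, batch them, write, confirm."""
    dirty = cache.dirty_entries()                 # Dirty Entry Scanner
    for i in range(0, len(dirty), batch_size):
        batch = dirty[i:i + batch_size]           # Batch Assembler
        try:
            db.write_batch(batch)                 # Database Writer
            cache.mark_clean(batch)               # Confirmation Handler
        except Exception:
            cache.record_flush_failure(batch)     # Failure Processor (retry / DLQ)

def run_flusher(cache, db, interval_seconds=5):
    """Pull-based safety net: run a flush cycle on a fixed interval."""
    while True:
        flush_cycle(cache, db)
        time.sleep(interval_seconds)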
The flush scheduler determines when dirty entries move to the database. The policy choice fundamentally affects durability, performance, and system behavior under various loads.
Policy 1: Fixed Interval Flush
Every T seconds:
Scan for dirty entries
Flush all dirty entries to database
Characteristics: bounds the age of unflushed data to T seconds, but batch sizes (and therefore efficiency) vary with the write rate; low complexity; no adaptation to load.
Best for: Systems with relatively stable write rates and moderate durability requirements.
Policy 2: Threshold-Based Flush
When dirty_entry_count >= N:
Flush dirty entries to database
Characteristics: batches are consistently sized, so flush efficiency is high and naturally tracks load, but the time an entry can remain unflushed is unbounded when traffic is light; low complexity.
Best for: Systems with variable load where batching efficiency matters more than strict time bounds.
Policy 3: Hybrid (Time OR Count)
Flush when:
(dirty_entry_count >= N) OR (time_since_last_flush >= T)
Characteristics: caps unflushed age at T seconds while still batching efficiently under load; moderate complexity; partially adaptive.
Best for: Most production workloads. Recommended as the default approach.
Policy 4: Adaptive Flush
Monitor:
- Current dirty entry count
- Database latency percentiles
- Cache memory pressure
Adjust flush frequency dynamically:
- More aggressive when dirty count is high or memory is tight
- Less aggressive when database is slow (to allow recovery)
- Surge protection when approaching limits
Characteristics: configurable durability bounds, excellent efficiency, and full adaptation to changing conditions, at the cost of high complexity and ongoing tuning.
Best for: Large-scale systems with dedicated infrastructure teams and varying workload patterns.
| Policy | Durability Bound | Efficiency | Complexity | Adaptability |
|---|---|---|---|---|
| Fixed Interval | T seconds | Variable | Low | None |
| Threshold-Based | Unbounded | High | Low | To load |
| Hybrid (Time OR Count) | T seconds | Good | Medium | Partial |
| Adaptive | Configurable | Excellent | High | Full |
Begin with a hybrid policy (time OR count threshold). Measure behavior under real load. Only move to adaptive scheduling if you have evidence that simpler policies don't meet your needs. Complexity has ongoing maintenance costs.
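For illustration, the hybrid trigger condition fits in a few lines of Python; the class below is a sketch, not production code:

import time
from dataclasses import dataclass, field

@dataclass
class HybridFlushPolicy:
    max_dirty: int = 1000          # N: flush once this many entries are dirty
    max_interval: float = 5.0      # T: flush at least this often, in seconds
    last_flush: float = field(default_factory=time.monotonic)

    def should_flush(self, dirty_count: int) -> bool:
        # Trigger when EITHER the count threshold OR the time bound is reached.
        elapsed = time.monotonic() - self.last_flush
        return dirty_count >= self.max_dirty or elapsed >= self.max_interval

    def record_flush(self) -> None:
        self.last_flush = time.monotonic()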
Batching is crucial for flush efficiency. Writing entries one-by-one wastes network round-trips and database resources. But batching too aggressively creates other problems. Let's explore the strategies and trade-offs.
Why batch? Each database round-trip carries fixed network and per-statement overhead; grouping many entries into one request amortizes that cost.
The math of batching:
Single write:
Network RTT: 2ms
DB processing: 5ms
Total: 7ms per write
1000 writes = 7000ms = 7 seconds
Batched write (100 entries per batch):
Network RTT: 2ms (same)
DB processing: 50ms (sublinear scaling)
Total: 52ms per batch
1000 writes = 10 batches × 52ms = 520ms
This 13x improvement is typical. Actual gains depend on database, network, and write patterns.
Optimal batch sizing:
Batch size optimization depends on your specific database and network characteristics:
│
Write │ ╭────── Optimal zone
Throughput │ ╭─╯
(writes/sec) │ ╭─╯
│ ╭─╯
│ ╭─╯
│─╯
├──────────────────────────────────────────
1 10 50 100 500 1000 5000 10000
Batch Size
← Too small: Optimal: Too large: →
Network overhead Sweet spot Large batches slower,
dominates for efficiency memory pressure,
partial failure risk
Recommended approach: start in the middle of the optimal zone above (on the order of 100 to 1,000 entries per batch) and tune from measured throughput, flush latency, and memory usage for your database and network.
Different databases have different optimal strategies: PostgreSQL favors COPY for bulk inserts, MySQL batch INSERT is efficient, MongoDB bulkWrite is optimized, and Cassandra BATCH has size limits. Know your database's bulk write primitives.
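As a sketch, here is how a batch of dirty entries might be chunked and upserted through the standard Python DB-API. The table and column names, the PostgreSQL-style ON CONFLICT clause, and the shape of `entries` are assumptions for illustration; substitute your database's bulk primitive (COPY, bulkWrite, etc.) where appropriate:

def flush_entries(conn, entries, batch_size=500):
    """Upsert (key, value) pairs in chunks via DB-API executemany."""
    sql = (
        "INSERT INTO kv_store (key, value) VALUES (%s, %s) "
        "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value"
    )
    with conn.cursor() as cur:
        for i in range(0, len(entries), batch_size):
            # One executemany call per chunk; how the driver batches this on
            # the wire varies by database and driver.
            cur.executemany(sql, entries[i:i + batch_size])
    conn.commit()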
Async systems must handle failures gracefully. Unlike synchronous writes where failures are immediately visible to clients, async failures are invisible—and potentially catastrophic if not handled properly.
Types of flush failures:
| Failure Type | Example | Appropriate Response |
|---|---|---|
| Transient | Network timeout, connection reset | Immediate retry with backoff |
| Temporary Unavailability | Database failover, maintenance | Extended retry, potentially minutes |
| Resource Exhaustion | Connection pool full, disk full | Wait for recovery, possibly shed load |
| Permanent (per-entry) | Constraint violation, invalid data | Dead-letter for manual review |
| Systemic | Wrong credentials, schema mismatch | Stop and alert, requires human intervention |
Retry strategy design:
A robust retry strategy must balance persistence with not overwhelming the database:
Retry Policy:
max_retries: 5
initial_delay: 100ms
max_delay: 30s
backoff_multiplier: 2
jitter: ±10%
Retry Schedule (example):
Attempt 1: Immediate
Attempt 2: 100ms later
Attempt 3: 200ms later
Attempt 4: 400ms later
Attempt 5: 800ms later
Attempt 6: Dead letter queue
With jitter, actual delays vary slightly:
Attempt 2: 90-110ms
Attempt 3: 180-220ms
...
Why jitter? When many flush workers fail simultaneously (e.g., database restart), without jitter they all retry at the same time, causing a retry storm. Jitter spreads retries temporally.
A dead-letter queue (DLQ) is not optional; it's required for production systems. Without one, failed entries are either lost or block the flush pipeline indefinitely. The DLQ serves both needs: it preserves failed entries for later analysis and manual intervention, and it lets the system continue processing.
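A sketch of a retry loop implementing the policy above, with a dead-letter fallback. `write_fn` and `dead_letter_queue` are placeholders for your batch writer and DLQ storage:

import random
import time

def flush_with_retries(write_fn, batch, dead_letter_queue,
                       max_attempts=5, initial_delay=0.1,
                       max_delay=30.0, multiplier=2.0, jitter=0.10):
    """Attempt a batch write with exponential backoff and jitter, then dead-letter."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            write_fn(batch)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: preserve the batch for analysis and manual replay.
                dead_letter_queue.append((batch, repr(exc)))
                return False
            # Jitter spreads retries so many workers don't hammer the DB in lockstep.
            time.sleep(min(delay, max_delay) * random.uniform(1 - jitter, 1 + jitter))
            delay *= multiplier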
What happens when writes arrive faster than the database can accept them? This is the backpressure problem, and how you handle it determines whether your system degrades gracefully or catastrophically.
The backpressure scenario:
Incoming writes: 1000 writes/second
Database capacity: 500 writes/second
Without backpressure:
Accumulation rate: 1000 - 500 = 500 entries/second
T+0:    0 dirty entries
T+10:   5,000 entries accumulated
T+60:   30,000 entries accumulated
T+120:  60,000 entries accumulated → cache memory exhausted → system failure
Without flow control, the system will eventually crash when cache memory is exhausted.
Implementing backpressure signals:
The flush system should expose signals that other components can use:
Backpressure Signals:
dirty_entry_count: Number of unflushed entries
dirty_entry_percentage: Dirty count / Cache capacity
flush_lag_seconds: Oldest unflushed entry age
flush_success_rate: Recent flush success percentage
Thresholds:
WARNING (soft): dirty_percentage > 50% OR flush_lag > 10s
CRITICAL (hard): dirty_percentage > 80% OR flush_lag > 60s
Responses:
WARNING: Alert operators, speed up flush rate
CRITICAL: Enable backpressure (block/reject writes)
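A minimal sketch of turning these signals into a pressure level, using the example thresholds above:

def backpressure_level(dirty_count, cache_capacity, flush_lag_seconds):
    """Map flush signals to OK / WARNING / CRITICAL using the example thresholds."""
    dirty_pct = dirty_count / cache_capacity
    if dirty_pct > 0.80 or flush_lag_seconds > 60:
        return "CRITICAL"   # enable backpressure: block or reject new writes
    if dirty_pct > 0.50 or flush_lag_seconds > 10:
        return "WARNING"    # alert operators, increase flush rate
    return "OK"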
The graceful degradation ladder:
Normal Operation:
→ Full async write-back, maximum performance
Elevated Dirty Count (50-80%):
→ Increase flush frequency, warn operators
High Dirty Count (80-95%):
→ Block new writes until flush catches up
Near Exhaustion (>95%):
→ Emergency synchronous flush, reject new writes
Exhausted (100%):
→ Full write-through mode or system error
Every async system will eventually face more load than it can handle. Design backpressure mechanisms before you need them. A system that fails gracefully under overload is far better than one that crashes spectacularly. The goal is degraded operation, not total failure.
Asynchronous writes introduce ordering challenges that don't exist in synchronous systems. Understanding these is crucial for correct system behavior.
The ordering problem:
Scenario: Two writes to the same key
T1: Write key=A, value=1 (enters cache)
T2: Write key=A, value=2 (updates cache, value=2)
T3: Flush starts, sees value=2
T4: Write to database: key=A, value=2 ✓
This is correct - last value wins.
But what about this scenario?
T1: Write key=A, value=1 (dirty)
T2: Flush starts, snapshot includes key=A, value=1
T3: Write key=A, value=2 (updates cache to value=2)
T4: Flush completes: database has value=1
T5: Key=A marked clean (WRONG!)
Now cache has value=2, database has value=1, entry is marked clean.
Inconsistency!
This is the snapshot isolation problem in write-back caching.
Solutions to the snapshot isolation problem:
Solution 1: Version Numbers
Cache entry: {key, value, version, dirty}
T1: Write key=A, value=1, version=1, dirty=true
T2: Flush starts, records: flushing key=A at version=1
T3: Write key=A, value=2, version=2, dirty=true
T4: Flush completes for version=1
T5: Clear dirty ONLY IF version is still 1
Version is now 2, so entry stays dirty
T6: Next flush picks up version=2
✓ Correct!
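A sketch of Solution 1 as a small, thread-safe cache entry; the structure is illustrative rather than any specific library's API:

import threading

class VersionedEntry:
    """Cache entry whose dirty flag is only cleared if no write raced the flush."""

    def __init__(self, value):
        self.value = value
        self.version = 1
        self.dirty = True
        self._lock = threading.Lock()

    def write(self, value):
        with self._lock:
            self.value = value
            self.version += 1          # a write during a flush bumps the version
            self.dirty = True

    def snapshot_for_flush(self):
        with self._lock:
            return self.value, self.version

    def mark_clean_if_version(self, flushed_version):
        with self._lock:
            if self.version == flushed_version:
                self.dirty = False     # nothing changed since the snapshot
            # Otherwise a newer write exists; stay dirty for the next flush.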
Solution 2: Last-Modified Timestamps
Similar to version numbers but using last-modified timestamps. Simpler, but writes landing within the same clock tick become indistinguishable, and correctness depends on reliable clocks.
Solution 3: Copy-on-Flush
When starting a flush, take a snapshot (copy) of the dirty value. The cache continues accepting writes to the live value. Flush writes the snapshot. After flush completes, compare live value to snapshot—if unchanged, mark clean.
Solution 4: Lock During Flush
Lock the entry while flushing it. Writes to that key block until flush completes. Simple but reduces write concurrency significantly.
| Solution | Concurrency | Complexity | Storage Overhead | Use Case |
|---|---|---|---|---|
| Version Numbers | High | Medium | 8 bytes/entry | General purpose, recommended |
| Timestamps | High | Low | 8 bytes/entry | When clock sync is reliable |
| Copy-on-Flush | High | High | Memory for copies | High write rates, short flushes |
| Lock During Flush | Low | Low | None | Low contention, simple cases |
Note that async writes provide NO ordering guarantees across different keys. If you write key A then key B, they may reach the database in either order. If you need cross-key ordering (e.g., parent before child), you need additional mechanisms like transaction grouping or explicit sequencing.
For high-throughput systems, a single flush worker may not keep up with the rate of dirty entries. Parallelizing the flush process is necessary but introduces its own challenges.
Single-Threaded Flush:
[Dirty Entries] → [Single Flush Worker] → [Database]
Pros: Simple, no coordination needed
Cons: Limited throughput; the single worker becomes the bottleneck
Multi-Worker Flush:
┌─→ [Worker 1] ──┐
[Dirty Entries] ─→ ├─→ [Worker 2] ──┼──→ [Database]
(partitioned) └─→ [Worker 3] ──┘
Pros: Higher throughput, distributes load
Cons: Coordination complexity, potential conflicts
Avoiding flush conflicts:
When multiple workers are flushing, you must ensure that no two workers flush the same key concurrently, and that dirty flags are cleared correctly even when new writes arrive mid-flush.
Recommended architecture:
Partitioned Flush Workers:
Worker W1: Responsible for keys where hash(key) % 4 == 0
Worker W2: Responsible for keys where hash(key) % 4 == 1
Worker W3: Responsible for keys where hash(key) % 4 == 2
Worker W4: Responsible for keys where hash(key) % 4 == 3
Each worker:
1. Scans its partition for dirty entries
2. Creates batch from its dirty entries
3. Writes batch to database
4. Clears dirty flags (with version check)
No coordination needed between workers.
Each key is "owned" by exactly one worker.
This key-based partitioning is simple, scales linearly with worker count, and avoids all coordination overhead.
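A sketch of the partitioning rule. A stable hash (here CRC32) matters because Python's built-in hash() for strings is randomized per process; `cache.dirty_items`, `db.write_batch`, and `cache.mark_clean` are hypothetical interfaces:

import zlib

NUM_WORKERS = 4

def owner_of(key: str) -> int:
    """Stable hash so the same key always maps to the same flush worker."""
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS

def flush_partition(worker_id, cache, db):
    """Each worker flushes only the keys it owns; no cross-worker coordination."""
    mine = [(k, v) for k, v in cache.dirty_items() if owner_of(k) == worker_id]
    if mine:
        db.write_batch(mine)
        cache.mark_clean([k for k, _ in mine])   # clear dirty flags with version check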
The number of flush workers can be dynamic. During high load, spin up more workers. During low load, reduce to save resources. Monitor flush lag as the primary signal for scaling decisions.
Asynchronous operations are notoriously difficult to observe and debug. The flush process is invisible to users, so problems can accumulate silently. Comprehensive monitoring is essential.
Critical metrics to track:
| Metric | What It Tells You | Alert Threshold Example |
|---|---|---|
| dirty_entry_count | Current unflushed entries | 80% of cache capacity |
| flush_lag_seconds | Age of oldest dirty entry | 30 seconds |
| flush_rate_per_second | Entries being flushed | Sudden drop > 50% |
| flush_batch_size_avg | Entries per flush batch | < 10 (inefficient) or > 5000 (too large) |
| flush_latency_p99 | 99th percentile flush time | 1 second |
| flush_error_rate | Failed flush operations | 1% |
| dlq_size | Dead-letter queue entries | 100 entries |
| oldest_dlq_entry_age | Age of oldest DLQ entry | 1 hour |
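For instance, these metrics can be gathered into a single snapshot and pushed to whatever metrics backend you use; `gauge(name, value)` below is a stand-in for that backend's API:

from dataclasses import dataclass

@dataclass
class FlushMetrics:
    """Point-in-time snapshot of the critical flush metrics above."""
    dirty_entry_count: int
    flush_lag_seconds: float
    flush_rate_per_second: float
    flush_batch_size_avg: float
    flush_latency_p99_ms: float
    flush_error_rate: float
    dlq_size: int
    oldest_dlq_entry_age_seconds: float

def emit(metrics: FlushMetrics, gauge):
    """Push each field as a gauge, e.g. writeback.dirty_entry_count."""
    for name, value in vars(metrics).items():
        gauge(f"writeback.{name}", value)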
Tracing async operations:
Distributed tracing for async writes is challenging because the write and flush are disconnected. Consider:
Correlation IDs — Include a trace ID in the cache entry metadata. When flushing, emit a span that references the original write's trace ID.
Write-Flush Linkage — Log both the write and the flush with the same entry identifier. Join them later for analysis.
Async Span Model — Some tracing systems support async spans where the parent span completes before the child.
Log hygiene:
// Good flush logging
{
"event": "flush_batch_complete",
"batch_id": "b-123456",
"entries_count": 247,
"success_count": 245,
"failure_count": 2,
"duration_ms": 156,
"oldest_entry_age_ms": 4521,
"database_latency_ms": 142,
"timestamp": "2024-01-15T14:32:01.456Z"
}
This level of detail enables debugging issues days or weeks later.
Async systems are hard to debug without proper observability. Build monitoring into the flush system from the start, not after the first production incident. When problems occur, you'll be glad you have the data to understand what happened.
Asynchronous database writes are the engine that makes write-back caching work.
The reliability principle:
Asynchronous systems are fundamentally about trading immediate consistency for performance. But "async" doesn't mean "unreliable." A well-engineered async persistence layer can be as reliable as synchronous writes, just with different (and clearly documented) guarantees.
What's next:
Now that we understand the async persistence mechanism, the next page explores the performance benefits of write-back caching in detail: quantifying the gains, understanding where the speedup comes from, and when those benefits are largest.
You now understand how to build reliable asynchronous database write systems: scheduling policies, batching strategies, retry handling, backpressure, ordering, parallelization, and observability.