If data is the core of systems, persistence is its heartbeat. But here's a nuance that separates thoughtful system design from naive implementation: not all data has the same durability requirements.
Consider the difference between a cached search result, a shopping cart held in a session, a customer's order record, and a financial transaction in an audit log.
All four are 'data.' But treating them identically—storing all with the same durability guarantees—would be either wasteful (over-engineering ephemeral data) or dangerous (under-protecting critical data).
Understanding persistence requirements is about matching durability guarantees to actual business needs—and accepting the trade-offs each choice entails.
By the end of this page, you will understand the spectrum of persistence requirements, from ephemeral storage to ironclad durability. You'll learn how to classify data by its persistence needs, understand the costs of different durability levels, and make informed trade-offs between safety, performance, and complexity.
Persistence exists on a spectrum from completely ephemeral to absolutely permanent. Each position on this spectrum comes with different costs, performance characteristics, and engineering complexity.
The spectrum:
| Level | Durability | Survives... | Typical Use Case | Technologies |
|---|---|---|---|---|
| Ephemeral | None | Nothing—request only | Request-scoped computation | In-memory variables |
| Request Cache | Minutes | N/A (expires quickly) | Repeated computations in request | Process memory |
| Session | Session duration | Page refreshes | Shopping cart, form state | Session storage, Redis |
| Cached | Hours to days | Restarts (if backed) | Computed results, API responses | Redis, Memcached |
| Persistent | Indefinitely | Crashes, restarts | User data, transactions | PostgreSQL, MySQL |
| Durable | Across failures | Disk failures (replication) | Financial records | Replicated databases |
| Immutable | Forever | Everything—append-only | Audit logs, blockchain | Append-only stores, WORM |
Moving along the spectrum:
As you move from ephemeral toward immutable, you gain durability but pay for it in write latency, throughput, infrastructure cost, and operational complexity.
The art of system design involves placing each piece of data at the appropriate position on this spectrum—durable enough to meet requirements, but no more durable than necessary.
Every data type has a 'Goldilocks zone'—durable enough, but not excessively so. Session data in an ACID database is overkill. Financial transactions in Redis are negligent. Finding the right fit for each data type is a core competency of system designers.
When databases and storage systems talk about 'durability,' they're making specific promises about what happens after data is written. Understanding these promises—and their limitations—is essential for choosing appropriate storage.
The fundamental question:
If I receive confirmation that my write succeeded, under what conditions might that data still be lost?
Different systems answer this question differently, and the real answer is often determined by configuration defaults rather than documentation.
The physics of durability:
At a physical level, durability requires getting data out of volatile memory (RAM) and onto persistent media (disk or SSD), or replicated across the network. Each step adds latency:
| Operation | Typical Latency | What It Survives |
|---|---|---|
| Write to RAM | ~100 nanoseconds | Nothing (volatile) |
| Write to OS buffer cache | ~1 microsecond | Process crash (not system crash) |
| fsync to local SSD | ~100 microseconds | Local process/system crash |
| fsync to local HDD | ~10 milliseconds | Local process/system crash |
| Replicate to 1 remote node | ~1-10 milliseconds | Local node failure |
| Replicate across regions | ~50-200 milliseconds | Regional failure |
Notice the roughly 1,000x difference between a write to RAM and an fsync to local SSD, and roughly another 1,000x from a local SSD to cross-region replication. Durability is purchased with latency.
Some databases offer 'durable' modes that aren't truly durable. Writing to the OS page cache without fsync looks fast but can lose data on power failure. Default Redis configuration writes asynchronously—acknowledged writes may be lost. Always verify what durability actually means for your storage system.
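To make the page-cache warning concrete, here is a minimal Python sketch (file paths and function names are ours, not taken from any particular database) contrasting a write that only reaches the OS buffer cache with one that is fsynced to the device:
```python
import os

def write_durable(path: str, data: bytes) -> None:
    """Append data and force it to stable storage before returning."""
    with open(path, "ab") as f:
        f.write(data)          # lands in the process buffer / OS page cache
        f.flush()              # push Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush the page cache to the device

def write_fast_but_fragile(path: str, data: bytes) -> None:
    """Append data without fsync: fast, but lost if the machine loses power."""
    with open(path, "ab") as f:
        f.write(data)
        f.flush()              # survives a process crash, not a system crash or power loss
```
The second function can be orders of magnitude faster per call, which is exactly why "fast by default" modes exist; the trade is that an acknowledged write may not survive a power failure.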
A systematic approach to persistence starts with classifying your data. Each category has different requirements, and misclassifying data leads to either wasted resources or unacceptable risk.
The classification framework:
| Category | If Lost... | Examples | Minimum Durability |
|---|---|---|---|
| Regenerable | Can be fully recreated from source | Caches, computed aggregations, search indexes | None (ephemeral OK) |
| Reconstructable | Can be rebuilt with effort/cost | Analytics summaries, denormalized views | Best-effort (soft durability) |
| Replaceable | Would be inconvenient but recoverable | User preferences, UI settings | Session/local storage |
| Valuable | Would cause user frustration | Draft documents, form progress | Persistent (single node OK) |
| Critical | Would cause business/user harm | User accounts, orders, payments | Durable (replicated) |
| Irreplaceable | Would cause regulatory/legal issues | Financial transactions, audit logs | Immutable (multi-region) |
Applying the classification:
Let's classify data for a typical e-commerce system:
Regenerable (cache/compute): product search indexes, rendered category pages, 'best sellers' lists computed from order history.
Reconstructable (soft persist): analytics rollups, denormalized product-view counts, recommendation models.
Valuable (persistent): shopping cart contents, draft product reviews, partially completed checkout forms.
Critical (durable/replicated): user accounts, orders, payments, inventory adjustments.
Irreplaceable (immutable/multi-region): payment transaction records, audit logs, invoices and records retained for compliance.
Notice that a single application spans the entire persistence spectrum. Treating all data identically would mean either over-engineering the regenerable data or under-engineering the irreplaceable data.
For any system you're designing, explicitly list every data type and classify it. This exercise often reveals assumptions that haven't been examined. 'We're storing that in Redis? What happens when Redis restarts?' becomes a productive conversation rather than a production incident.
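One lightweight way to run that exercise is to keep the inventory in code or config so it is reviewed like any other change. A sketch with hypothetical data types and tier names, mapped to the minimum durability suggested by the table above:
```python
# Hypothetical inventory for the e-commerce example; names are illustrative, not a schema.
DATA_INVENTORY = {
    "search_results_cache":   "regenerable",      # rebuild from the catalog at any time
    "sales_dashboard_rollup": "reconstructable",  # recompute from raw events if lost
    "ui_preferences":         "replaceable",      # user can reconfigure
    "cart_draft":             "valuable",         # losing it frustrates the user
    "orders":                 "critical",         # must survive node failure
    "payment_ledger":         "irreplaceable",    # regulatory / audit requirements
}

MINIMUM_DURABILITY = {
    "regenerable":     "in-memory cache",
    "reconstructable": "single-node store, best effort",
    "replaceable":     "session / local storage",
    "valuable":        "persistent single-node database",
    "critical":        "synchronously replicated database",
    "irreplaceable":   "immutable, multi-region storage",
}

for name, category in DATA_INVENTORY.items():
    print(f"{name:24s} -> {category:16s} -> {MINIMUM_DURABILITY[category]}")
```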
Durability isn't free. Understanding its costs helps you make informed trade-offs rather than defaulting to maximum durability (expensive) or minimum durability (dangerous).
Cost dimensions of durability: write latency (every extra fsync or replica acknowledgment delays the response), throughput (fewer writes per second per client), infrastructure cost (replicas, cross-region bandwidth, backup storage), and operational complexity (failover, monitoring, tested recovery procedures).
Quantifying the difference:
Let's compare write performance across durability levels for a typical workload:
| Durability Level | Writes/Second (Single Client) | Writes/Second (Parallel) |
|---|---|---|
| In-memory only | 500,000+ | 2,000,000+ |
| Write to page cache | 200,000+ | 800,000+ |
| fsync per write (SSD) | 5,000-10,000 | 50,000-100,000 |
| fsync per write (HDD) | 100-200 | 500-1,000 |
| Synchronous replication (local) | 2,000-5,000 | 20,000-50,000 |
| Synchronous replication (regional) | 50-200 | 500-2,000 |
Numbers are illustrative; actual performance depends on hardware, workload, and configuration.
Notice the orders-of-magnitude differences. Choosing 'maximum durability by default' can reduce write throughput by 1,000x compared to in-memory. This is why persistence requirements must be intentional, not accidental.
The cost of fsync is largely per-operation, not per-byte. Writing 1,000 records with one fsync is nearly as fast as writing 1 record with one fsync—but 1,000x faster than 1,000 records with 1,000 fsyncs. This is why databases group commits and why message queues batch writes. Batching amortizes durability costs.
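A rough way to see this yourself, assuming a local file and os.fsync as the durability barrier (the file names are arbitrary and the numbers will vary by hardware):
```python
import os
import time

def write_records(path: str, records: list[bytes], fsync_each: bool) -> float:
    """Append records, fsyncing per record or once for the batch; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "ab") as f:
        for rec in records:
            f.write(rec + b"\n")
            if fsync_each:
                f.flush()
                os.fsync(f.fileno())   # one durability barrier per record
        if not fsync_each:
            f.flush()
            os.fsync(f.fileno())       # one durability barrier for the whole batch
    return time.perf_counter() - start

records = [b"x" * 100 for _ in range(1000)]
print("per-record fsync :", write_records("per_record.log", records, fsync_each=True))
print("one batched fsync:", write_records("batched.log", records, fsync_each=False))
```
On most machines the batched version completes orders of magnitude faster, which is the same effect group commit exploits inside a database.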
Real systems use multiple persistence patterns, often layered together. Understanding common patterns helps you design systems that balance performance with appropriate durability.
Pattern 1: Write-Ahead Logging (WAL)
The most fundamental database durability pattern: every change is first appended to a sequential log and fsynced, and only then acknowledged; the tables and indexes themselves are updated afterwards and periodically checkpointed.
WAL enables durability without synchronous updates to complex data structures. The log is append-only (fast), while actual data structures are updated lazily.
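A toy Python sketch of the idea (not any specific database's implementation): the log append is made durable before acknowledging, the in-memory structure is updated cheaply afterwards, and recovery replays the log.
```python
import json
import os

class TinyWAL:
    """Toy write-ahead log: durable append first, in-memory state second."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.state: dict[str, str] = {}
        self._replay()                       # recover state from the log on startup

    def set(self, key: str, value: str) -> None:
        record = json.dumps({"key": key, "value": value})
        with open(self.log_path, "a") as f:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())             # durable before we acknowledge the write
        self.state[key] = value              # cheap in-memory update afterwards

    def _replay(self) -> None:
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                rec = json.loads(line)
                self.state[rec["key"]] = rec["value"]

db = TinyWAL("tiny.wal")
db.set("user:1", "alice")
print(db.state)
```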
Pattern 2: Write-Behind Caching
For write-heavy workloads where some data loss is acceptable: writes are acknowledged as soon as they land in memory, then flushed to durable storage asynchronously in batches; a crash loses whatever has not yet been flushed.
Used for: Analytics events, metrics, non-critical logs. Redis with RDB snapshots works this way.
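A sketch of the pattern, under the explicit assumption that losing the most recent unflushed batch is acceptable (class and file names are ours):
```python
import os
import threading
import time

class WriteBehindBuffer:
    """Acknowledge writes from memory; flush them to disk periodically in the background."""

    def __init__(self, path: str, flush_interval: float = 1.0):
        self.path = path
        self.buffer: list[str] = []
        self.lock = threading.Lock()
        flusher = threading.Thread(target=self._flush_loop, args=(flush_interval,), daemon=True)
        flusher.start()

    def write(self, event: str) -> None:
        # Acknowledged as soon as it is in memory; lost if we crash before the next flush.
        with self.lock:
            self.buffer.append(event)

    def _flush_loop(self, interval: float) -> None:
        while True:
            time.sleep(interval)
            with self.lock:
                pending, self.buffer = self.buffer, []
            if pending:
                with open(self.path, "a") as f:
                    f.write("\n".join(pending) + "\n")
                    f.flush()
                    os.fsync(f.fileno())   # the whole batch becomes durable at once
```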
Pattern 3: Synchronous Replication
For data that must survive single-node failure: the primary does not acknowledge a write until at least one replica has also received and persisted it.
Used for: Financial transactions, user-critical data. PostgreSQL synchronous replication, Vitess, CockroachDB.
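In outline, the acknowledgment rule looks like the following sketch. The interfaces are hypothetical: assume each replica exposes an append() that returns only once the record is durable on that node (an RPC in a real system).
```python
class SyncReplicatedLog:
    """Acknowledge a write only after the local node and enough replicas persist it."""

    def __init__(self, local_log, replicas, required_acks: int = 1):
        self.local_log = local_log          # hypothetical durable log object
        self.replicas = replicas            # hypothetical remote replica clients
        self.required_acks = required_acks

    def write(self, record: bytes) -> None:
        self.local_log.append(record)       # durable locally (fsynced) first
        acks = 0
        for replica in self.replicas:
            try:
                replica.append(record)      # blocks until durable on the replica
                acks += 1
            except ConnectionError:
                continue                    # a down replica cannot acknowledge
            if acks >= self.required_acks:
                return                      # enough copies exist; acknowledge the caller
        raise RuntimeError("not enough replica acknowledgments; write not durably replicated")
```
The availability cost is visible in the last line: if replicas are unreachable, the write fails rather than silently losing its durability guarantee.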
Pattern 4: Event Sourcing
For complete auditability and reconstructability: every state change is recorded as an immutable event in an append-only log, and current state is derived by replaying those events, so history is never overwritten.
Used for: Audit systems, financial ledgers, systems requiring complete history. Event stores, Kafka with proper configuration.
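A minimal sketch of the derive-state-by-replay idea, with an in-memory list standing in for what would be durable, append-only storage in a real system:
```python
class EventStore:
    """Append-only event log; current state is derived by replaying events."""

    def __init__(self):
        self.events: list[dict] = []   # in a real system: durable, append-only storage

    def append(self, event: dict) -> None:
        self.events.append(event)      # events are never updated or deleted

def account_balance(events: list[dict]) -> int:
    """Fold the full history into the current balance."""
    balance = 0
    for e in events:
        if e["type"] == "deposited":
            balance += e["amount"]
        elif e["type"] == "withdrawn":
            balance -= e["amount"]
    return balance

store = EventStore()
store.append({"type": "deposited", "amount": 100})
store.append({"type": "withdrawn", "amount": 30})
print(account_balance(store.events))   # 70, and the full history remains auditable
```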
Pattern 5: Tiered Persistence
Combining multiple persistence levels for hot vs cold data: recent, frequently accessed data lives in fast (often more expensive) storage, while older data is compacted and moved to cheaper, more durable tiers such as object storage.
Used for: Time-series data, logs, analytics. InfluxDB, Prometheus + Thanos.
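The routing decision itself is usually a simple age-based policy; a sketch with a hypothetical TTL and in-memory stands-ins for the hot and cold tiers:
```python
import time

# Hypothetical policy: data newer than HOT_TTL_SECONDS stays in the fast local tier;
# anything older is moved to cheaper, more durable cold storage.
HOT_TTL_SECONDS = 7 * 24 * 3600

def tier_records(records, hot_store, cold_store, now=None):
    """Route each (timestamp, payload) record to the hot or cold tier by age."""
    now = now or time.time()
    for timestamp, payload in records:
        if now - timestamp <= HOT_TTL_SECONDS:
            hot_store.append((timestamp, payload))    # e.g., local SSD / in-memory index
        else:
            cold_store.append((timestamp, payload))   # e.g., compressed object storage

hot, cold = [], []
tier_records(
    [(time.time(), "fresh metric"), (time.time() - 30 * 24 * 3600, "old metric")],
    hot, cold,
)
print(len(hot), "hot,", len(cold), "cold")
```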
Production systems often layer these patterns. A database might use WAL for local durability, synchronous replication for node-level durability, and asynchronous cross-region replication for disaster recovery. Understanding the patterns lets you reason about the composite behavior.
One of the most important trade-offs in distributed systems is between durability and availability. Stronger durability guarantees often require more coordination, which can impact availability.
The fundamental tension:
| Configuration | Durability | Availability | Latency | When to Use |
|---|---|---|---|---|
| Single node, fsync | Survives crash | Single point of failure | Low | Development, non-critical |
| Async replication | May lose recent writes | High (reads continue on failover) | Low | Read-heavy, eventual consistency OK |
| Sync replication (1 replica) | Survives 1 node failure | Reduced (need 2 nodes up) | Medium | Critical data, local region |
| Sync replication (2+ replicas) | Survives 2+ node failures | Lower (need 3+ nodes) | Higher | Financial, regulated industries |
| Multi-region sync | Survives region failure | Lowest (cross-region latency) | High | Global, disaster-resistant |
Practical decision framework:
When configuring durability, ask:
What's the cost of losing data? — Financial loss, user impact, regulatory consequences, reputation damage.
What's the cost of unavailability? — Lost revenue, user frustration, SLA penalties, operational disruption.
What latency can users tolerate? — 100ms? 500ms? Synchronous cross-region replication may push writes beyond acceptable latency.
What budget is available? — Multi-region synchronous replication requires 2x-3x infrastructure cost plus expensive cross-region networking.
The honest answer for most systems: fully synchronous multi-region durability is rarely worth its latency and cost. Most teams settle on synchronous replication within a region for critical data, asynchronous replication plus tested backups across regions, and deliberately weaker guarantees for everything else.
This tension connects to the CAP theorem: during a partition, you must choose between consistency (durability of the latest write) and availability. Most systems choose availability (serving potentially stale data) over consistency (refusing to serve until sync is confirmed). Know your system's choice.
Durability isn't just about preventing data loss during normal operation—it's about recovering when things go wrong. A complete persistence strategy includes backup and recovery planning.
Types of failures to plan for: process and node crashes, disk failures, regional outages, application bugs that corrupt data, accidental deletion by operators, and malicious events such as ransomware.
Recovery metrics:
Two key metrics govern backup strategy:
Recovery Point Objective (RPO) — How much data can you afford to lose? Measured in time. RPO of 1 hour means you can lose up to 1 hour of data.
Recovery Time Objective (RTO) — How long can recovery take? Measured in time. RTO of 4 hours means service must be restored within 4 hours of the incident.
Backup strategies mapped to RPO/RTO:
| Strategy | Typical RPO | Typical RTO | Use Case |
|---|---|---|---|
| No backup | N/A (total loss) | N/A | Ephemeral data only |
| Daily snapshots | Up to 24 hours | Hours | Development, staging |
| Point-in-time recovery (PITR) | Minutes | Hour+ (depends on archive) | Business-critical databases |
| Synchronous replica | Zero (for replicated data) | Minutes (failover) | High availability |
| Multi-region sync | Zero | Minutes to hours | Disaster recovery |
| Air-gapped backup | Daily/weekly | Hours to days | Ransomware protection |
A backup that has never been restored is a hope, not a guarantee. Schedule regular restore drills. Verify backup integrity. Test that restored data is actually usable. Many organizations have discovered during an incident that their 'backups' were corrupted, incomplete, or unrestorable.
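The cheapest part of that discipline can be automated: record an integrity checksum when the backup is taken and re-verify it before every restore drill. A small sketch (function names are ours); a real drill would also restore into a scratch environment and query the data.
```python
import hashlib
import os

def checksum(path: str) -> str:
    """SHA-256 of a backup artifact, recorded at backup time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(path: str, expected_checksum: str, min_size_bytes: int = 1) -> bool:
    """Cheap sanity checks: the file exists, is non-trivial, and matches its recorded hash."""
    if not os.path.exists(path) or os.path.getsize(path) < min_size_bytes:
        return False
    return checksum(path) == expected_checksum
```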
Let's consolidate the concepts into a practical decision framework for choosing persistence requirements:
1. Classify each data type along the spectrum, from regenerable to irreplaceable.
2. Weigh the cost of losing that data against the cost of unavailability and added latency.
3. Set explicit RPO and RTO targets for each category.
4. Choose the least durable (and least expensive) persistence pattern that still meets those targets.
5. Plan recovery, and rehearse it, before you need it.
Common mistakes to avoid:
Default durability syndrome — Using PostgreSQL with synchronous replication for ephemeral session data. Wasteful and slow.
Optimistic durability — Assuming Redis or in-memory storage is 'good enough' for order data. Dangerous.
Backup theater — Running backup scripts that create files no one ever restores or verifies.
Single-tier thinking — Trying to use one storage system for all durability levels. Inefficient.
Ignoring recovery time — Having backups but no tested recovery procedure. Recovery during an incident is not the time to learn.
Forgetting about corruption — Replication propagates corruption. You need point-in-time recovery for application-level bugs.
You now understand the full spectrum of persistence requirements, from ephemeral caching to ironclad multi-region durability. You can classify data by its persistence needs, understand the performance and cost trade-offs, and design systems with appropriate durability for each data type. Next, we'll explore query and access patterns—how the ways you read data shape your storage choices.