At 3:47 AM, a power surge hits a data center. In 0.003 seconds, before any circuit breaker can react, the voltage spike reaches a database server. The CPU halts mid-instruction. Fifty-three active transactions—representing orders, payments, and inventory updates—are suspended in various states of completion. The RAM, holding gigabytes of modified but unflushed data pages, loses power. Every bit vanishes.
This is a system failure—a catastrophic event where the entire DBMS instance stops abruptly. Unlike transaction failures (which affect individual transactions) or media failures (which damage persistent storage), system failures destroy the volatile state while leaving persistent storage intact.
System failures are both terrifying and manageable. Terrifying because they affect everything simultaneously. Manageable because the log-based recovery mechanisms we'll study are specifically designed to handle them. This page explores system failures in comprehensive detail.
By the end of this page, you will understand what constitutes a system failure, its root causes, its specific effects on database state, and why the combination of volatile memory and persistent logging creates a recoverable system. You will see how system failures differ fundamentally from transaction failures in scope and recovery approach.
A system failure (also called a crash or soft failure) occurs when the DBMS instance terminates abnormally, losing all volatile (in-memory) state. Its key characteristics follow directly from that definition.
Formal Definition:
A system failure is an event where: (1) the DBMS instance, or the host it runs on, halts abruptly, stopping all transaction processing; (2) the entire contents of volatile memory are lost, including the buffer pool, lock table, transaction table, and query execution state; and (3) persistent storage, including the data files and the log, remains intact.
The distinction between volatile and persistent state is crucial. System failures are 'soft' failures precisely because they don't damage the disk—they only lose what was in memory. This creates a recovery problem (how do we restore consistent state?) but also provides the solution (the log on disk tells us what happened).
| State Type | Storage Location | Survives System Failure? | Examples |
|---|---|---|---|
| Volatile (Transient) | Main Memory (RAM) | No | Buffer pool, lock table, transaction table, query execution state |
| Persistent (Durable) | Disk/SSD | Yes | Data files, log files, control files, configuration |
| Semi-Persistent | Write-ahead log buffer | Maybe* | Log records not yet flushed (* if battery-backed) |
Database systems make a specific promise regarding system failures: all committed transactions will be durable, and all uncommitted transactions will be rolled back. These are the Durability and Atomicity guarantees of ACID. The entire recovery subsystem exists to keep this promise after system failures.
What Exactly Is Lost?
When a system failure occurs, the following volatile structures are lost:
1. Buffer Pool (Buffer Cache): The buffer pool holds copies of disk pages in memory for fast access. Many of these pages have been modified ('dirtied') by transactions but not yet written back to disk. All dirty pages that weren't flushed are lost.
2. Lock Table: The lock manager's record of which transactions hold which locks on which objects. After a crash, there are no locks—which is actually helpful for recovery.
3. Transaction Table: The list of active transactions, their states, and their progress. After a crash, we must reconstruct this from the log.
4. Query Execution State: Partially executed queries, cursors, intermediate results. All lost and must be re-executed.
5. Connection State: All client connections are broken. Clients must reconnect after recovery.
6. Cached Plans: Query plans compiled by the optimizer. Must be regenerated.
7. Statistical Caches: Buffer pool hit rates, query timings, performance metrics. Lost and reset.
System failures arise from various sources, spanning hardware, software, and environmental domains. Understanding these causes helps in both prevention and in designing recovery procedures.
Categories of System Failure Causes:
2.1 Power Failures
Power failures are perhaps the most common cause of abrupt system failures, especially in environments without adequate power protection.
Types of Power Problems:
| Problem | Description | Duration | Impact |
|---|---|---|---|
| Blackout | Complete power loss | Seconds to hours | Immediate system halt |
| Brownout | Voltage reduction | Minutes | System instability, potential crash |
| Voltage spike | Brief overvoltage | Milliseconds | Component damage, crash |
| Frequency variation | AC frequency shift | Variable | Motor-driven components affected |
| Harmonic distortion | Waveform degradation | Ongoing | Heat generation, gradual damage |
Protection Mechanisms: uninterruptible power supplies (UPS), redundant power feeds and power supplies, standby generators, and battery-backed write caches.
Battery-backed write caches in RAID controllers and SSDs can survive brief power losses. However, batteries degrade over time. A 5-year-old battery may only provide 30 seconds of protection instead of the rated 72 hours. Regular battery testing is essential—many 'inexplicable' data losses trace to failed cache batteries.
2.2 Operating System Crashes
The OS is the foundation on which the DBMS runs. When the OS crashes, the DBMS crashes with it.
Common OS Crash Causes: kernel bugs, faulty device drivers and kernel modules, unhandled hardware exceptions, and resource exhaustion severe enough to trigger a panic.
Operating systems are generally more stable than application software, but they're not immune to failures. A single faulty kernel module can bring down the entire system.
2.3 DBMS Software Bugs
Database software, despite extensive testing, can contain bugs that cause crashes:
```
# Example: Various system failure scenarios

# Scenario 1: Power Failure
# Timeline: 14:32:01.234
[DBMS] Transaction T1 commits, log flushed
[DBMS] Transaction T2 updates 500 rows in buffer pool
[DBMS] Transaction T3 begins, acquires locks
[POWER] Utility power lost
[UPS] Battery power engaged
[POWER] UPS battery depleted (aged battery)
[SYSTEM] Uncontrolled shutdown
# Result: T1 committed, T2/T3 must be rolled back during recovery

# Scenario 2: OS Crash (Kernel Panic)
# Timeline: 09:15:44.891
[OS] NFS driver encounters unhandled error
[OS] Kernel panic - not syncing: Fatal exception
[DBMS] Process killed by kernel
# Result: All active transactions must be recovered

# Scenario 3: DBMS Bug
# Timeline: 22:08:33.456
[DBMS] Parallel query execution active, 8 threads
[DBMS] Thread 3: Memory allocation for sort buffer
[DBMS] Thread 7: Concurrent deallocation of same region
[DBMS] SIGSEGV: Invalid memory access at 0x7f3a...
[DBMS] Core dump generated: /var/lib/postgres/core.12345
[DBMS] Process terminated
# Result: Bug report filed, recovery from checkpoint
```
2.4 Hardware Failures (Non-Media)
Hardware failures that don't damage persistent storage but halt the system:
| Component | Failure Mode | System Impact | Detection |
|---|---|---|---|
| CPU | Overheating, silicon failure | Immediate crash or hang | Temperature sensors, ECC errors |
| RAM | Bit flips, module failure | Crashes, data corruption | ECC, memory tests |
| Motherboard | Capacitor failure, trace damage | System won't boot | POST failures |
| BIOS/UEFI | Firmware corruption | Boot failure | Checksum validation |
| Network Card | Driver crash, hardware fault | Cluster failover | Health checks |
ECC Memory Is Essential:
For database servers, Error-Correcting Code (ECC) memory is mandatory, not optional. Non-ECC memory experiences bit flips due to cosmic rays, electrical noise, and aging. A single bit flip in a data page could corrupt data silently. ECC memory detects and corrects single-bit errors, and detects (but cannot correct) multi-bit errors—triggering a crash rather than silent corruption.
A clean crash with intact disk storage is far better than silent data corruption. ECC memory, checksums, and defensive programming prefer crashing over continuing with potentially corrupt state. This is why databases have so many assertion checks—failing fast prevents data disasters.
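As a small illustration of this fail-fast philosophy, here is a minimal Python sketch of an invariant check that raises rather than letting corrupt in-memory state propagate; the page fields and limits are hypothetical, not taken from any particular engine.

```python
# A minimal sketch of fail-fast defensive checks; the page fields and limits
# below are hypothetical, not taken from any particular engine.

class CorruptPageError(Exception):
    """Raised instead of silently continuing with a page that violates its invariants."""

PAGE_SIZE = 8192   # bytes per page (illustrative)

def check_page_invariants(page_id: int, free_space_offset: int, slot_count: int) -> None:
    # A real engine checks many more invariants; these two are illustrative.
    if not 0 <= free_space_offset <= PAGE_SIZE:
        raise CorruptPageError(
            f"page {page_id}: free-space offset {free_space_offset} out of range")
    if slot_count < 0 or slot_count * 4 > PAGE_SIZE:
        raise CorruptPageError(f"page {page_id}: implausible slot count {slot_count}")

# Raising here turns potential silent corruption into a clean crash, which
# log-based recovery can handle; continuing could write garbage back to disk.
check_page_invariants(page_id=42, free_space_offset=5120, slot_count=37)
```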
The fundamental challenge of system failures lies in the boundary between volatile memory and persistent storage. Understanding this boundary is crucial for understanding database recovery.
The Problem of Uncommitted Data on Disk:
DBMS systems don't write every change immediately to disk—that would be far too slow. Instead, they use a buffer pool: pages are read from disk into memory, modified there by transactions, and written back to disk lazily, when the buffer manager decides to.
This creates a problem: dirty pages from uncommitted transactions may be written to disk (this is called 'stealing' the buffer). If the system crashes, the disk contains changes from transactions that never committed, and those changes must be undone during recovery.
The Problem of Committed Data in Memory:
Conversely, a transaction may commit but its dirty pages may not yet be on disk: the commit record is safely in the log, yet the modified data pages still sit only in the buffer pool.
Now, if the system crashes, the disk is missing changes from a committed transaction, and those changes must be redone from the log during recovery.
This dual problem—undo uncommitted changes and redo committed changes—is the essence of crash recovery.
| Policy | Description | Recovery Implication |
|---|---|---|
| STEAL | Dirty pages from uncommitted txns can be written to disk | Requires UNDO during recovery |
| NO-STEAL | Dirty pages from uncommitted txns cannot be written to disk | No UNDO needed, but requires more memory |
| FORCE | All dirty pages must be written before commit | No REDO needed, but slow commits |
| NO-FORCE | Dirty pages may remain in memory after commit | Requires REDO during recovery, fast commits |
The STEAL/NO-FORCE Policy:
Most production database systems use STEAL/NO-FORCE because STEAL lets the buffer manager write out any dirty page whenever it needs the memory, rather than holding every dirty page of long-running transactions, and NO-FORCE avoids synchronously flushing all of a transaction's dirty pages at commit time.
The cost is that recovery is more complex: we need both UNDO and REDO capabilities. But the performance benefit is substantial—commit latency drops from potentially hundreds of milliseconds (if we had to write all dirty pages) to just a few milliseconds (writing only the commit log record).
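To make the commit path concrete, here is a minimal Python sketch of a NO-FORCE commit; the names (append_log, flush_log_through) are illustrative, not a real DBMS API. The only synchronous disk work at commit is flushing the log through the commit record's LSN.

```python
# A minimal sketch of a NO-FORCE commit path; the names (append_log,
# flush_log_through) are illustrative, not a real DBMS API.

log_buffer = []     # in-memory log records, in LSN order
flushed_lsn = 0     # highest LSN known to be durable on disk
next_lsn = 1

def append_log(record: dict) -> int:
    """Assign the next LSN and append the record to the in-memory log buffer."""
    global next_lsn
    record["lsn"] = next_lsn
    log_buffer.append(record)
    next_lsn += 1
    return record["lsn"]

def flush_log_through(lsn: int) -> None:
    """Stand-in for writing buffered log records up to `lsn` and fsync-ing them."""
    global flushed_lsn
    flushed_lsn = max(flushed_lsn, lsn)

def commit(txn_id: int) -> None:
    commit_lsn = append_log({"type": "COMMIT", "txn": txn_id})
    flush_log_through(commit_lsn)   # the only synchronous disk work at commit
    # NO-FORCE: none of the transaction's dirty data pages are written here;
    # they stay in the buffer pool and are flushed lazily later.

commit(txn_id=17)
```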
Write-Ahead Logging (WAL) Makes This Work:
The key insight making STEAL/NO-FORCE safe is Write-Ahead Logging:
Before any dirty page is written to disk, the log records describing the changes must be written first.
This guarantees that any change that reaches disk has a log record, including its before-image, that can undo it, and that any committed transaction has durable log records, including after-images, that can redo it even if its data pages never reached disk.
The log is the source of truth. Combined with the data files, we can always reconstruct the correct database state.
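The rule can be sketched as follows, again with hypothetical names rather than a real buffer-manager API: before the buffer manager evicts a dirty page under STEAL, it forces the log out at least as far as that page's last LSN.

```python
# A minimal sketch of the write-ahead rule under STEAL; names are
# illustrative, not a real buffer-manager API.

from dataclasses import dataclass

flushed_lsn = 0   # highest log LSN known to be durable (maintained by the log manager)

@dataclass
class Page:
    page_id: int
    page_lsn: int   # LSN of the last log record that modified this page
    dirty: bool

def flush_log_through(lsn: int) -> None:
    """Stand-in for forcing the log out to disk up to `lsn`."""
    global flushed_lsn
    flushed_lsn = max(flushed_lsn, lsn)

def write_page_to_disk(page: Page) -> None:
    """Stand-in for the actual data-page write."""

def evict(page: Page) -> None:
    if page.dirty:
        if flushed_lsn < page.page_lsn:
            flush_log_through(page.page_lsn)   # WAL: log records first, page second
        write_page_to_disk(page)               # only now is the steal safe
        page.dirty = False

evict(Page(page_id=7, page_lsn=120, dirty=True))
```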
After a system failure, the database on disk is in an inconsistent state. It may contain changes from transactions that never committed (written under the STEAL policy), it may be missing changes from transactions that did commit (because of NO-FORCE), and individual pages may be torn by writes that were interrupted mid-flight.
The recovery system must detect and correct all of these issues before making the database available for new transactions.
Example Scenario:
Consider a database with three concurrent transactions at the moment of crash:
```
Database State at Time of Crash:
================================

Transaction T1: COMMITTED (log record written)
  - Updated Account A: 1000 → 750 [Page in memory, NOT on disk]
  - Updated Account B: 500 → 750  [Page in memory, NOT on disk]

Transaction T2: ACTIVE (uncommitted)
  - Updated Account C: 2000 → 1800 [Page flushed to disk]
  - Updated Account D: 300 → 500   [Page in memory only]

Transaction T3: PARTIALLY COMMITTED (commit in progress)
  - Updated Account E: 400 → 600 [Page flushed to disk]
  - Commit record NOT yet written to log

Disk State After Crash:
=======================
Account A: 1000 (committed change MISSING - needs REDO)
Account B: 500  (committed change MISSING - needs REDO)
Account C: 1800 (uncommitted change PRESENT - needs UNDO)
Account D: 300  (uncommitted change lost - no action needed)
Account E: 600  (uncommitted change PRESENT - needs UNDO*)

* T3 didn't finish committing, so it's treated as aborted

Recovery Actions Required:
==========================
1. REDO T1's changes to A and B (from log after-images)
2. UNDO T2's change to C (from log before-image)
3. UNDO T3's change to E (from log before-image)

Final Consistent State:
=======================
Account A: 750  (T1 committed)
Account B: 750  (T1 committed)
Account C: 2000 (T2 aborted, rolled back)
Account D: 300  (T2 never made it to disk)
Account E: 400  (T3 aborted, rolled back)
```
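The same classification can be derived mechanically from the log. Below is a minimal Python sketch using a hypothetical log-record format: any transaction with a COMMIT record is a redo candidate, and everything else is undone from before-images.

```python
# A minimal sketch (hypothetical record format) of classifying the scenario's
# transactions: transactions with a COMMIT record are redone from after-images,
# all others are undone from before-images.

log = [
    {"txn": "T1", "type": "UPDATE", "obj": "A", "before": 1000, "after": 750},
    {"txn": "T1", "type": "UPDATE", "obj": "B", "before": 500,  "after": 750},
    {"txn": "T2", "type": "UPDATE", "obj": "C", "before": 2000, "after": 1800},
    {"txn": "T2", "type": "UPDATE", "obj": "D", "before": 300,  "after": 500},
    {"txn": "T3", "type": "UPDATE", "obj": "E", "before": 400,  "after": 600},
    {"txn": "T1", "type": "COMMIT"},
    # T3's COMMIT record never reached the log, so T3 is treated as aborted.
]

committed = {r["txn"] for r in log if r["type"] == "COMMIT"}
redo = [r for r in log if r["type"] == "UPDATE" and r["txn"] in committed]
undo = [r for r in log if r["type"] == "UPDATE" and r["txn"] not in committed]

# Undo runs backwards through the log; undoing D is harmless (idempotent)
# because that change never reached disk.
print("REDO (apply after-images):", [(r["obj"], r["after"]) for r in redo])
print("UNDO (apply before-images):", [(r["obj"], r["before"]) for r in reversed(undo)])
```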
Torn Writes: A Special Case:
A 'torn write' or 'partial page write' occurs when only part of a page is written to disk before the crash. Consider an 8 KB database page written to a disk whose atomic write unit is a 512-byte or 4 KB sector: if power is lost partway through the write, some sectors hold the new data while the rest still hold the old.
A torn page is corrupted—it contains a mix of old and new data that may be internally inconsistent.
Detection and Prevention of Torn Writes:
Page Checksums: Each page includes a checksum. If the checksum doesn't match after read, the page is torn.
Double-Write Buffer (MySQL InnoDB): Before writing a page to its final location, write it to a special double-write buffer. If crash occurs during final write, the doublewrite buffer has a clean copy for recovery.
Full-Page Writes (PostgreSQL): After each checkpoint, the first modification to any page logs the entire page (not just the change). Recovery uses this full-page image.
Atomic Write Support: Modern storage (some SSDs, storage arrays) guarantees atomic writes of large blocks, eliminating torn writes.
Torn writes are particularly dangerous because they can go undetected if checksums aren't used. The page is readable, but the data is garbage. Always ensure your database has page checksums enabled—the small performance cost is trivial compared to the risk of undetected corruption.
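As a sketch of the checksum approach, here is a small Python example assuming a hypothetical page layout with a 4-byte CRC32 stored at the front of each page; real systems differ in where and how they store the checksum.

```python
# A sketch of page-checksum verification, assuming a hypothetical layout with
# a 4-byte CRC32 stored at the front of each page (real systems differ).

import zlib

PAGE_SIZE = 8192

def seal_page(body: bytes) -> bytes:
    """Prepend a CRC32 of the page body; called just before the page is written."""
    assert len(body) == PAGE_SIZE - 4
    return zlib.crc32(body).to_bytes(4, "big") + body

def verify_page(raw: bytes) -> bytes:
    """Return the page body, or fail fast if the stored CRC does not match."""
    stored, body = int.from_bytes(raw[:4], "big"), raw[4:]
    if zlib.crc32(body) != stored:
        # Mismatch means a torn or corrupted page: refuse to return a mix of
        # old and new sectors to the caller.
        raise IOError("page checksum mismatch: torn or corrupt page")
    return body

page = seal_page(b"\x00" * (PAGE_SIZE - 4))
verify_page(page)                           # intact page passes
torn = page[:4096] + b"\xff" * 4096         # simulate a write that stopped halfway
# verify_page(torn)                         # would raise IOError
```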
Recovery from system failures follows a structured process that restores the database to a consistent state. While we'll cover recovery algorithms in detail in later chapters, understanding the high-level process is essential for understanding system failures.
The Recovery Process Overview:
1. Analysis: scan the log forward from the last checkpoint to determine which transactions were active at the time of the crash and which pages may have been dirty.
2. Redo: replay logged changes to bring the database pages back to their exact pre-crash state.
3. Undo: roll back every transaction that had not committed, using the before-images recorded in the log.
Why Redo Everything, Then Undo?
A common question: Why redo changes from uncommitted transactions if we're just going to undo them?
The answer lies in the log structure and recovery correctness:
Page-level consistency: The redo phase brings pages to their exact pre-crash state. This is necessary because the undo phase uses before-images that assume specific page states.
Log ordering: The log records reference specific page versions. Skipping some redos would leave pages in states that don't match what the log expects.
Nested transactions and savepoints: Some undo operations might need page states that were created by operations we'd otherwise skip.
Simplicity and correctness: The 'redo all, then undo uncommitted' approach is simpler to implement correctly and verify than selective redo.
Recovery Time Considerations:
Recovery time depends on several factors:
| Factor | Impact | Mitigation |
|---|---|---|
| Time since last checkpoint | More log to replay | Frequent checkpoints |
| Number of active transactions | More undo work | Shorter transactions |
| Size of transactions | More redo/undo operations | Batch judiciously |
| Page read time | I/O during recovery | Fast storage, SSD |
| Parallel recovery capability | Can parallelize work | Modern DBMS features |
More frequent checkpoints mean faster recovery (less log to replay) but impose overhead during normal operation (more disk writes). Most systems allow configuring checkpoint frequency based on your recovery time requirements versus performance needs.
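A rough back-of-envelope model shows the trade-off; the two rates below are illustrative assumptions, not measurements from any particular system.

```python
# A back-of-envelope sketch of the checkpoint-frequency trade-off; the two
# rates are illustrative assumptions, not measurements.

log_generation_rate_mb_s = 20    # how fast the workload produces log
replay_rate_mb_s = 100           # how fast recovery can replay that log

for checkpoint_interval_s in (60, 300, 1800):
    # Worst case: the crash happens just before the next checkpoint.
    log_to_replay_mb = log_generation_rate_mb_s * checkpoint_interval_s
    redo_time_s = log_to_replay_mb / replay_rate_mb_s
    print(f"checkpoint every {checkpoint_interval_s:>4}s -> "
          f"up to {log_to_replay_mb:>5} MB of log, ~{redo_time_s:.0f}s of redo")
```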
```
PROCEDURE RecoverFromSystemFailure():

    // Phase 1: Analysis
    // Find the last checkpoint and determine transaction states
    checkpoint_lsn = FindLastCheckpoint()
    active_transactions = {}    // Will be populated from log
    dirty_pages = {}            // Pages that might need redo

    // Scan log forward from checkpoint
    FOR each log_record from checkpoint_lsn to end_of_log:
        IF log_record.type == BEGIN:
            active_transactions.add(log_record.txn_id)
        ELSE IF log_record.type == COMMIT or ABORT:
            active_transactions.remove(log_record.txn_id)
        ELSE IF log_record.type == UPDATE:
            dirty_pages.add(log_record.page_id)

    // Phase 2: Redo
    // Replay history to bring database to crash state
    redo_lsn = min(dirty_pages first LSN, checkpoint_lsn)
    FOR each log_record from redo_lsn to end_of_log:
        IF log_record.type == UPDATE:
            page = ReadPage(log_record.page_id)
            IF page.lsn < log_record.lsn:
                // Page is outdated; apply the change
                ApplyChange(page, log_record.after_image)
                page.lsn = log_record.lsn
                WritePage(page)

    // Phase 3: Undo
    // Roll back all transactions still in active_transactions
    FOR each txn_id IN active_transactions:
        RollbackTransaction(txn_id)    // Uses log before-images

    // Recovery complete
    WriteLogRecord(RECOVERY_COMPLETE)
    RETURN
```
Understanding the differences between system failures and transaction failures is essential for designing appropriate handling strategies. While both require recovery mechanisms, their characteristics differ significantly:
| Aspect | Transaction Failure | System Failure |
|---|---|---|
| Scope | Single transaction | All active transactions |
| Volatile state | Preserved (system still running) | Lost (memory cleared) |
| Persistent state | Unchanged | Intact but inconsistent |
| Other transactions | Continue normally | All suspended until recovery |
| Detection | Synchronous (immediate) | Obvious (system down) |
| Recovery trigger | Automatic rollback | Explicit recovery at restart |
| Lock management | Locks released immediately | Locks reconstructed or cleared |
| Recovery time | Transaction-proportional | Log-proportional |
| Service impact | Minimal | Full outage during recovery |
| Application handling | Catch error, possibly retry | Reconnect after recovery |
Key Insight: The Lock Table Problem
One significant difference is lock management:
Transaction Failure: The transaction manager immediately releases all locks held by the failed transaction. This happens while the system is still running, so the lock table in memory is updated directly.
System Failure: The lock table in memory is lost. Upon recovery, there are no locks at all. This is actually helpful—we don't need to worry about reconstructing the lock table. Recovery runs in a single-threaded manner, and once complete, new transactions start fresh with clean lock state.
Key Insight: The Buffer Pool Problem
Transaction Failure: Dirty pages remain in the buffer pool. The rollback operation modifies them back to their original state. No disk I/O is strictly necessary (though it happens eventually).
System Failure: The buffer pool is gone. All dirty pages from all transactions are lost. Some of those dirty pages represented committed work (need redo), others represented uncommitted work (need undo). We must read pages from disk, examine log records, and reconstruct correct state.
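On the application side, the last row of the comparison table translates into code roughly like the following Python sketch, which uses placeholder exception classes and a placeholder connect function rather than any specific driver API.

```python
# A sketch of application-side handling, using placeholder exception classes
# and a placeholder connect function rather than any specific driver API.

import time

class TransactionAborted(Exception):
    """Transaction failure: e.g. deadlock victim; the connection is still usable."""

class ConnectionLost(Exception):
    """System failure: the server crashed or is still running recovery."""

def open_connection():
    raise NotImplementedError("placeholder: supply the real driver connect call here")

def reconnect_with_backoff():
    delay = 1.0
    while True:
        try:
            return open_connection()
        except ConnectionLost:
            time.sleep(delay)            # recovery may still be replaying the log
            delay = min(delay * 2, 30.0)

def run_transaction(conn, work):
    for _ in range(5):
        try:
            return work(conn)            # transaction failure: retry on the same connection
        except TransactionAborted:
            continue
        except ConnectionLost:
            conn = reconnect_with_backoff()  # system failure: wait out recovery, reconnect
    raise RuntimeError("giving up after repeated failures")
```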
While system failures cannot be completely eliminated, their frequency and impact can be significantly reduced through proper infrastructure, configuration, and operational practices: uninterruptible power supplies with regularly tested batteries, ECC memory, conservative and well-patched OS and DBMS versions, hardware health monitoring, sensible checkpoint configuration, and standby servers that can take over while the primary recovers.
Recovery Time Objective is the maximum acceptable time for recovery after a failure. For system failures, RTO depends on recovery speed (how fast can we replay the log?) and failover capability (can we switch to a standby server?). Know your RTO and design accordingly.
Let's consolidate the key concepts covered in this page:
- A system failure (crash, soft failure) terminates the DBMS instance abnormally, losing all volatile state: buffer pool, lock table, transaction table, query execution state, connections, and caches. Persistent storage survives intact.
- Causes span power problems, operating system crashes, DBMS software bugs, and non-media hardware faults such as CPU, RAM, or motherboard failures.
- Under the common STEAL/NO-FORCE buffering policy, the disk after a crash may contain uncommitted changes and may be missing committed ones, so recovery needs both UNDO and REDO; write-ahead logging makes this safe.
- Recovery proceeds through analysis, redo, and undo phases, and its duration depends largely on how much log has accumulated since the last checkpoint.
- System failures differ from transaction failures in scope, lost state, recovery trigger, and service impact.
What's Next:
We've now covered transaction failures (individual) and system failures (entire instance). In the next page, we'll examine Media Failures—the most severe failure type where persistent storage itself is damaged or lost. Media failures require different recovery strategies involving backups and archives, since the normal log-based recovery from disk is impossible when the disk itself is the problem.
You now understand system failures comprehensively—their nature, causes, effects on database state, and the high-level recovery approach. This knowledge prepares you for understanding media failures, which add the dimension of persistent storage loss.