At 3:47 AM, a power surge hits a data center. In 0.003 seconds, before any circuit breaker can react, the voltage spike reaches a database server. The CPU halts mid-instruction. Fifty-three active transactions—representing orders, payments, and inventory updates—are suspended in various states of completion. The RAM, holding gigabytes of modified but unflushed data pages, loses power. Every bit vanishes.
This is a system failure—a catastrophic event where the entire DBMS instance stops abruptly. Unlike transaction failures (which affect individual transactions) or media failures (which damage persistent storage), system failures destroy the volatile state while leaving persistent storage intact.
System failures are both terrifying and manageable. Terrifying because they affect everything simultaneously. Manageable because the log-based recovery mechanisms we'll study are specifically designed to handle them. This page explores system failures in comprehensive detail.
By the end of this page, you will understand what constitutes a system failure, its root causes, its specific effects on database state, and why the combination of volatile memory and persistent logging creates a recoverable system. You will see how system failures differ fundamentally from transaction failures in scope and recovery approach.
A system failure (also called a crash or soft failure) occurs when the DBMS instance terminates abnormally, losing all volatile (in-memory) state. Its key characteristics follow directly from that definition.
Formal Definition:
A system failure is an event where: (1) the DBMS instance, or the host it runs on, halts abruptly, stopping all transaction processing; (2) the entire contents of volatile memory are lost, including the buffer pool, lock table, transaction table, and query execution state; and (3) persistent storage, including the data files and the log, remains intact.
The distinction between volatile and persistent state is crucial. System failures are 'soft' failures precisely because they don't damage the disk—they only lose what was in memory. This creates a recovery problem (how do we restore consistent state?) but also provides the solution (the log on disk tells us what happened).
| State Type | Storage Location | Survives System Failure? | Examples |
|---|---|---|---|
| Volatile (Transient) | Main Memory (RAM) | No | Buffer pool, lock table, transaction table, query execution state |
| Persistent (Durable) | Disk/SSD | Yes | Data files, log files, control files, configuration |
| Semi-Persistent | Write-ahead log buffer | Maybe* | Log records not yet flushed (* if battery-backed) |
Database systems make a specific promise regarding system failures: all committed transactions will be durable, and all uncommitted transactions will be rolled back. These are the Durability and Atomicity guarantees of ACID. The entire recovery subsystem exists to keep this promise after system failures.
What Exactly Is Lost?
When a system failure occurs, the following volatile structures are lost:
1. Buffer Pool (Buffer Cache): The buffer pool holds copies of disk pages in memory for fast access. Many of these pages have been modified ('dirtied') by transactions but not yet written back to disk. All dirty pages that weren't flushed are lost.
2. Lock Table: The lock manager's record of which transactions hold which locks on which objects. After a crash, there are no locks—which is actually helpful for recovery.
3. Transaction Table: The list of active transactions, their states, and their progress. After a crash, we must reconstruct this from the log.
4. Query Execution State: Partially executed queries, cursors, intermediate results. All lost and must be re-executed.
5. Connection State: All client connections are broken. Clients must reconnect after recovery.
6. Cached Plans: Query plans compiled by the optimizer. Must be regenerated.
7. Statistical Caches: Buffer pool hit rates, query timings, performance metrics. Lost and reset.
System failures arise from various sources, spanning hardware, software, and environmental domains. Understanding these causes helps in both prevention and in designing recovery procedures.
Categories of System Failure Causes:
2.1 Power Failures
Power failures are perhaps the most common cause of abrupt system failures, especially in environments without adequate power protection.
Types of Power Problems:
| Problem | Description | Duration | Impact |
|---|---|---|---|
| Blackout | Complete power loss | Seconds to hours | Immediate system halt |
| Brownout | Voltage reduction | Minutes | System instability, potential crash |
| Voltage spike | Brief overvoltage | Milliseconds | Component damage, crash |
| Frequency variation | AC frequency shift | Variable | Motor-driven components affected |
| Harmonic distortion | Waveform degradation | Ongoing | Heat generation, gradual damage |
Protection Mechanisms: uninterruptible power supplies (UPS), redundant power feeds and power supplies, standby generators, and battery-backed write caches.
Battery-backed write caches in RAID controllers and SSDs can survive brief power losses. However, batteries degrade over time. A 5-year-old battery may only provide 30 seconds of protection instead of the rated 72 hours. Regular battery testing is essential—many 'inexplicable' data losses trace to failed cache batteries.
2.2 Operating System Crashes
The OS is the foundation on which the DBMS runs. When the OS crashes, the DBMS crashes with it.
Common OS Crash Causes: kernel bugs, faulty device drivers and kernel modules, unhandled hardware exceptions, and resource exhaustion severe enough to trigger a panic.
Operating systems are generally more stable than application software, but they're not immune to failures. A single faulty kernel module can bring down the entire system.
2.3 DBMS Software Bugs
Database software, despite extensive testing, can contain bugs that cause crashes:
```
# Example: Various system failure scenarios

# Scenario 1: Power Failure
# Timeline: 14:32:01.234
[DBMS] Transaction T1 commits, log flushed
[DBMS] Transaction T2 updates 500 rows in buffer pool
[DBMS] Transaction T3 begins, acquires locks
[POWER] Utility power lost
[UPS] Battery power engaged
[POWER] UPS battery depleted (aged battery)
[SYSTEM] Uncontrolled shutdown
# Result: T1 committed, T2/T3 must be rolled back during recovery

# Scenario 2: OS Crash (Kernel Panic)
# Timeline: 09:15:44.891
[OS] NFS driver encounters unhandled error
[OS] Kernel panic - not syncing: Fatal exception
[DBMS] Process killed by kernel
# Result: All active transactions must be recovered

# Scenario 3: DBMS Bug
# Timeline: 22:08:33.456
[DBMS] Parallel query execution active, 8 threads
[DBMS] Thread 3: Memory allocation for sort buffer
[DBMS] Thread 7: Concurrent deallocation of same region
[DBMS] SIGSEGV: Invalid memory access at 0x7f3a...
[DBMS] Core dump generated: /var/lib/postgres/core.12345
[DBMS] Process terminated
# Result: Bug report filed, recovery from checkpoint
```
2.4 Hardware Failures (Non-Media)
Hardware failures that don't damage persistent storage but halt the system:
| Component | Failure Mode | System Impact | Detection |
|---|---|---|---|
| CPU | Overheating, silicon failure | Immediate crash or hang | Temperature sensors, ECC errors |
| RAM | Bit flips, module failure | Crashes, data corruption | ECC, memory tests |
| Motherboard | Capacitor failure, trace damage | System won't boot | POST failures |
| BIOS/UEFI | Firmware corruption | Boot failure | Checksum validation |
| Network Card | Driver crash, hardware fault | Cluster failover | Health checks |
ECC Memory Is Essential:
For database servers, Error-Correcting Code (ECC) memory is mandatory, not optional. Non-ECC memory experiences bit flips due to cosmic rays, electrical noise, and aging. A single bit flip in a data page could corrupt data silently. ECC memory detects and corrects single-bit errors, and detects (but cannot correct) multi-bit errors—triggering a crash rather than silent corruption.
A clean crash with intact disk storage is far better than silent data corruption. ECC memory, checksums, and defensive programming prefer crashing over continuing with potentially corrupt state. This is why databases have so many assertion checks—failing fast prevents data disasters.
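As a small illustration of this fail-fast philosophy, here is a minimal Python sketch of an invariant check that raises rather than letting corrupt in-memory state propagate; the page fields and limits are hypothetical, not taken from any particular engine.

```python
# A minimal sketch of fail-fast defensive checks; the page fields and limits
# below are hypothetical, not taken from any particular engine.

class CorruptPageError(Exception):
    """Raised instead of silently continuing with a page that violates its invariants."""

PAGE_SIZE = 8192   # bytes per page (illustrative)

def check_page_invariants(page_id: int, free_space_offset: int, slot_count: int) -> None:
    # A real engine checks many more invariants; these two are illustrative.
    if not 0 <= free_space_offset <= PAGE_SIZE:
        raise CorruptPageError(
            f"page {page_id}: free-space offset {free_space_offset} out of range")
    if slot_count < 0 or slot_count * 4 > PAGE_SIZE:
        raise CorruptPageError(f"page {page_id}: implausible slot count {slot_count}")

# Raising here turns potential silent corruption into a clean crash, which
# log-based recovery can handle; continuing could write garbage back to disk.
check_page_invariants(page_id=42, free_space_offset=5120, slot_count=37)
```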
The fundamental challenge of system failures lies in the boundary between volatile memory and persistent storage. Understanding this boundary is crucial for understanding database recovery.
The Problem of Uncommitted Data on Disk:
DBMS systems don't write every change immediately to disk—that would be far too slow. Instead, they use a buffer pool: pages are read from disk into memory, modified there by transactions, and written back to disk lazily, when the buffer manager decides to.
This creates a problem: dirty pages from uncommitted transactions may be written to disk (this is called 'stealing' the buffer). If the system crashes, the disk contains changes from transactions that never committed, and those changes must be undone during recovery.
The Problem of Committed Data in Memory:
Conversely, a transaction may commit but its dirty pages may not yet be on disk: the commit record is safely in the log, yet the modified data pages still sit only in the buffer pool.
Now, if the system crashes, the disk is missing changes from a committed transaction, and those changes must be redone from the log during recovery.
This dual problem—undo uncommitted changes and redo committed changes—is the essence of crash recovery.
| Policy | Description | Recovery Implication |
|---|---|---|
| STEAL | Dirty pages from uncommitted txns can be written to disk | Requires UNDO during recovery |
| NO-STEAL | Dirty pages from uncommitted txns cannot be written to disk | No UNDO needed, but requires more memory |
| FORCE | All dirty pages must be written before commit | No REDO needed, but slow commits |
| NO-FORCE | Dirty pages may remain in memory after commit | Requires REDO during recovery, fast commits |
The STEAL/NO-FORCE Policy:
Most production database systems use STEAL/NO-FORCE because STEAL lets the buffer manager write out any dirty page whenever it needs the memory, rather than holding every dirty page of long-running transactions, and NO-FORCE avoids synchronously flushing all of a transaction's dirty pages at commit time.
The cost is that recovery is more complex: we need both UNDO and REDO capabilities. But the performance benefit is substantial—commit latency drops from potentially hundreds of milliseconds (if we had to write all dirty pages) to just a few milliseconds (writing only the commit log record).
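To make the commit path concrete, here is a minimal Python sketch of a NO-FORCE commit; the names (append_log, flush_log_through) are illustrative, not a real DBMS API. The only synchronous disk work at commit is flushing the log through the commit record's LSN.

```python
# A minimal sketch of a NO-FORCE commit path; the names (append_log,
# flush_log_through) are illustrative, not a real DBMS API.

log_buffer = []     # in-memory log records, in LSN order
flushed_lsn = 0     # highest LSN known to be durable on disk
next_lsn = 1

def append_log(record: dict) -> int:
    """Assign the next LSN and append the record to the in-memory log buffer."""
    global next_lsn
    record["lsn"] = next_lsn
    log_buffer.append(record)
    next_lsn += 1
    return record["lsn"]

def flush_log_through(lsn: int) -> None:
    """Stand-in for writing buffered log records up to `lsn` and fsync-ing them."""
    global flushed_lsn
    flushed_lsn = max(flushed_lsn, lsn)

def commit(txn_id: int) -> None:
    commit_lsn = append_log({"type": "COMMIT", "txn": txn_id})
    flush_log_through(commit_lsn)   # the only synchronous disk work at commit
    # NO-FORCE: none of the transaction's dirty data pages are written here;
    # they stay in the buffer pool and are flushed lazily later.

commit(txn_id=17)
```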
Write-Ahead Logging (WAL) Makes This Work:
The key insight making STEAL/NO-FORCE safe is Write-Ahead Logging:
Before any dirty page is written to disk, the log records describing the changes must be written first.
This guarantees that any change that reaches disk has a log record, including its before-image, that can undo it, and that any committed transaction has durable log records, including after-images, that can redo it even if its data pages never reached disk.
The log is the source of truth. Combined with the data files, we can always reconstruct the correct database state.
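The rule can be sketched as follows, again with hypothetical names rather than a real buffer-manager API: before the buffer manager evicts a dirty page under STEAL, it forces the log out at least as far as that page's last LSN.

```python
# A minimal sketch of the write-ahead rule under STEAL; names are
# illustrative, not a real buffer-manager API.

from dataclasses import dataclass

flushed_lsn = 0   # highest log LSN known to be durable (maintained by the log manager)

@dataclass
class Page:
    page_id: int
    page_lsn: int   # LSN of the last log record that modified this page
    dirty: bool

def flush_log_through(lsn: int) -> None:
    """Stand-in for forcing the log out to disk up to `lsn`."""
    global flushed_lsn
    flushed_lsn = max(flushed_lsn, lsn)

def write_page_to_disk(page: Page) -> None:
    """Stand-in for the actual data-page write."""

def evict(page: Page) -> None:
    if page.dirty:
        if flushed_lsn < page.page_lsn:
            flush_log_through(page.page_lsn)   # WAL: log records first, page second
        write_page_to_disk(page)               # only now is the steal safe
        page.dirty = False

evict(Page(page_id=7, page_lsn=120, dirty=True))
```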
After a system failure, the database on disk is in an inconsistent state. It may contain changes from transactions that never committed (written under the STEAL policy), it may be missing changes from transactions that did commit (because of NO-FORCE), and individual pages may be torn by writes that were interrupted mid-flight.
The recovery system must detect and correct all of these issues before making the database available for new transactions.
Example Scenario:
Consider a database with three concurrent transactions at the moment of crash:
```
Database State at Time of Crash:
================================

Transaction T1: COMMITTED (log record written)
  - Updated Account A: 1000 → 750 [Page in memory, NOT on disk]
  - Updated Account B: 500 → 750  [Page in memory, NOT on disk]

Transaction T2: ACTIVE (uncommitted)
  - Updated Account C: 2000 → 1800 [Page flushed to disk]
  - Updated Account D: 300 → 500   [Page in memory only]

Transaction T3: PARTIALLY COMMITTED (commit in progress)
  - Updated Account E: 400 → 600 [Page flushed to disk]
  - Commit record NOT yet written to log

Disk State After Crash:
=======================
Account A: 1000 (committed change MISSING - needs REDO)
Account B: 500  (committed change MISSING - needs REDO)
Account C: 1800 (uncommitted change PRESENT - needs UNDO)
Account D: 300  (uncommitted change lost - no action needed)
Account E: 600  (uncommitted change PRESENT - needs UNDO*)

* T3 didn't finish committing, so it's treated as aborted

Recovery Actions Required:
==========================
1. REDO T1's changes to A and B (from log after-images)
2. UNDO T2's change to C (from log before-image)
3. UNDO T3's change to E (from log before-image)

Final Consistent State:
=======================
Account A: 750  (T1 committed)
Account B: 750  (T1 committed)
Account C: 2000 (T2 aborted, rolled back)
Account D: 300  (T2 never made it to disk)
Account E: 400  (T3 aborted, rolled back)
```
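The same classification can be derived mechanically from the log. Below is a minimal Python sketch using a hypothetical log-record format: any transaction with a COMMIT record is a redo candidate, and everything else is undone from before-images.

```python
# A minimal sketch (hypothetical record format) of classifying the scenario's
# transactions: transactions with a COMMIT record are redone from after-images,
# all others are undone from before-images.

log = [
    {"txn": "T1", "type": "UPDATE", "obj": "A", "before": 1000, "after": 750},
    {"txn": "T1", "type": "UPDATE", "obj": "B", "before": 500,  "after": 750},
    {"txn": "T2", "type": "UPDATE", "obj": "C", "before": 2000, "after": 1800},
    {"txn": "T2", "type": "UPDATE", "obj": "D", "before": 300,  "after": 500},
    {"txn": "T3", "type": "UPDATE", "obj": "E", "before": 400,  "after": 600},
    {"txn": "T1", "type": "COMMIT"},
    # T3's COMMIT record never reached the log, so T3 is treated as aborted.
]

committed = {r["txn"] for r in log if r["type"] == "COMMIT"}
redo = [r for r in log if r["type"] == "UPDATE" and r["txn"] in committed]
undo = [r for r in log if r["type"] == "UPDATE" and r["txn"] not in committed]

# Undo runs backwards through the log; undoing D is harmless (idempotent)
# because that change never reached disk.
print("REDO (apply after-images):", [(r["obj"], r["after"]) for r in redo])
print("UNDO (apply before-images):", [(r["obj"], r["before"]) for r in reversed(undo)])
```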
Torn Writes: A Special Case:
A 'torn write' or 'partial page write' occurs when only part of a page is written to disk before the crash. Consider an 8 KB database page written to a disk whose atomic write unit is a 512-byte or 4 KB sector: if power is lost partway through the write, some sectors hold the new data while the rest still hold the old.
A torn page is corrupted—it contains a mix of old and new data that may be internally inconsistent.
Detection and Prevention of Torn Writes:
Page Checksums: Each page includes a checksum. If the checksum doesn't match after read, the page is torn.
Double-Write Buffer (MySQL InnoDB): Before writing a page to its final location, write it to a special double-write buffer. If crash occurs during final write, the doublewrite buffer has a clean copy for recovery.
Full-Page Writes (PostgreSQL): After each checkpoint, the first modification to any page logs the entire page (not just the change). Recovery uses this full-page image.
Atomic Write Support: Modern storage (some SSDs, storage arrays) guarantees atomic writes of large blocks, eliminating torn writes.
Torn writes are particularly dangerous because they can go undetected if checksums aren't used. The page is readable, but the data is garbage. Always ensure your database has page checksums enabled—the small performance cost is trivial compared to the risk of undetected corruption.
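As a sketch of the checksum approach, here is a small Python example assuming a hypothetical page layout with a 4-byte CRC32 stored at the front of each page; real systems differ in where and how they store the checksum.

```python
# A sketch of page-checksum verification, assuming a hypothetical layout with
# a 4-byte CRC32 stored at the front of each page (real systems differ).

import zlib

PAGE_SIZE = 8192

def seal_page(body: bytes) -> bytes:
    """Prepend a CRC32 of the page body; called just before the page is written."""
    assert len(body) == PAGE_SIZE - 4
    return zlib.crc32(body).to_bytes(4, "big") + body

def verify_page(raw: bytes) -> bytes:
    """Return the page body, or fail fast if the stored CRC does not match."""
    stored, body = int.from_bytes(raw[:4], "big"), raw[4:]
    if zlib.crc32(body) != stored:
        # Mismatch means a torn or corrupted page: refuse to return a mix of
        # old and new sectors to the caller.
        raise IOError("page checksum mismatch: torn or corrupt page")
    return body

page = seal_page(b"\x00" * (PAGE_SIZE - 4))
verify_page(page)                           # intact page passes
torn = page[:4096] + b"\xff" * 4096         # simulate a write that stopped halfway
# verify_page(torn)                         # would raise IOError
```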
Recovery from system failures follows a structured process that restores the database to a consistent state. While we'll cover recovery algorithms in detail in later chapters, understanding the high-level process is essential for understanding system failures.
The Recovery Process Overview:
1. Analysis: scan the log forward from the last checkpoint to determine which transactions were active at the time of the crash and which pages may have been dirty.
2. Redo: replay logged changes to bring the database pages back to their exact pre-crash state.
3. Undo: roll back every transaction that had not committed, using the before-images recorded in the log.
Why Redo Everything, Then Undo?
A common question: Why redo changes from uncommitted transactions if we're just going to undo them?
The answer lies in the log structure and recovery correctness:
Page-level consistency: The redo phase brings pages to their exact pre-crash state. This is necessary because the undo phase uses before-images that assume specific page states.
Log ordering: The log records reference specific page versions. Skipping some redos would leave pages in states that don't match what the log expects.
Nested transactions and savepoints: Some undo operations might need page states that were created by operations we'd otherwise skip.
Simplicity and correctness: The 'redo all, then undo uncommitted' approach is simpler to implement correctly and verify than selective redo.
Recovery Time Considerations:
Recovery time depends on several factors:
| Factor | Impact | Mitigation |
|---|---|---|
| Time since last checkpoint | More log to replay | Frequent checkpoints |
| Number of active transactions | More undo work | Shorter transactions |
| Size of transactions | More redo/undo operations | Batch judiciously |
| Page read time | I/O during recovery | Fast storage, SSD |
| Parallel recovery capability | Can parallelize work | Modern DBMS features |
More frequent checkpoints mean faster recovery (less log to replay) but impose overhead during normal operation (more disk writes). Most systems allow configuring checkpoint frequency based on your recovery time requirements versus performance needs.
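A rough back-of-envelope model shows the trade-off; the two rates below are illustrative assumptions, not measurements from any particular system.

```python
# A back-of-envelope sketch of the checkpoint-frequency trade-off; the two
# rates are illustrative assumptions, not measurements.

log_generation_rate_mb_s = 20    # how fast the workload produces log
replay_rate_mb_s = 100           # how fast recovery can replay that log

for checkpoint_interval_s in (60, 300, 1800):
    # Worst case: the crash happens just before the next checkpoint.
    log_to_replay_mb = log_generation_rate_mb_s * checkpoint_interval_s
    redo_time_s = log_to_replay_mb / replay_rate_mb_s
    print(f"checkpoint every {checkpoint_interval_s:>4}s -> "
          f"up to {log_to_replay_mb:>5} MB of log, ~{redo_time_s:.0f}s of redo")
```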
```
PROCEDURE RecoverFromSystemFailure():

    // Phase 1: Analysis
    // Find the last checkpoint and determine transaction states
    checkpoint_lsn = FindLastCheckpoint()
    active_transactions = {}    // Will be populated from log
    dirty_pages = {}            // Pages that might need redo

    // Scan log forward from checkpoint
    FOR each log_record from checkpoint_lsn to end_of_log:
        IF log_record.type == BEGIN:
            active_transactions.add(log_record.txn_id)
        ELSE IF log_record.type == COMMIT or ABORT:
            active_transactions.remove(log_record.txn_id)
        ELSE IF log_record.type == UPDATE:
            dirty_pages.add(log_record.page_id)

    // Phase 2: Redo
    // Replay history to bring database to crash state
    redo_lsn = min(dirty_pages first LSN, checkpoint_lsn)
    FOR each log_record from redo_lsn to end_of_log:
        IF log_record.type == UPDATE:
            page = ReadPage(log_record.page_id)
            IF page.lsn < log_record.lsn:
                // Page is outdated; apply the change
                ApplyChange(page, log_record.after_image)
                page.lsn = log_record.lsn
                WritePage(page)

    // Phase 3: Undo
    // Roll back all transactions still in active_transactions
    FOR each txn_id IN active_transactions:
        RollbackTransaction(txn_id)    // Uses log before-images

    // Recovery complete
    WriteLogRecord(RECOVERY_COMPLETE)
    RETURN
```
Understanding the differences between system failures and transaction failures is essential for designing appropriate handling strategies. While both require recovery mechanisms, their characteristics differ significantly:
| Aspect | Transaction Failure | System Failure |
|---|---|---|
| Scope | Single transaction | All active transactions |
| Volatile state | Preserved (system still running) | Lost (memory cleared) |
| Persistent state | Unchanged | Intact but inconsistent |
| Other transactions | Continue normally | All suspended until recovery |
| Detection | Synchronous (immediate) | Obvious (system down) |
| Recovery trigger | Automatic rollback | Explicit recovery at restart |
| Lock management | Locks released immediately | Locks reconstructed or cleared |
| Recovery time | Transaction-proportional | Log-proportional |
| Service impact | Minimal | Full outage during recovery |
| Application handling | Catch error, possibly retry | Reconnect after recovery |
Key Insight: The Lock Table Problem
One significant difference is lock management:
Transaction Failure: The transaction manager immediately releases all locks held by the failed transaction. This happens while the system is still running, so the lock table in memory is updated directly.
System Failure: The lock table in memory is lost. Upon recovery, there are no locks at all. This is actually helpful—we don't need to worry about reconstructing the lock table. Recovery runs in a single-threaded manner, and once complete, new transactions start fresh with clean lock state.
Key Insight: The Buffer Pool Problem
Transaction Failure: Dirty pages remain in the buffer pool. The rollback operation modifies them back to their original state. No disk I/O is strictly necessary (though it happens eventually).
System Failure: The buffer pool is gone. All dirty pages from all transactions are lost. Some of those dirty pages represented committed work (need redo), others represented uncommitted work (need undo). We must read pages from disk, examine log records, and reconstruct correct state.
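On the application side, the last row of the comparison table translates into code roughly like the following Python sketch, which uses placeholder exception classes and a placeholder connect function rather than any specific driver API.

```python
# A sketch of application-side handling, using placeholder exception classes
# and a placeholder connect function rather than any specific driver API.

import time

class TransactionAborted(Exception):
    """Transaction failure: e.g. deadlock victim; the connection is still usable."""

class ConnectionLost(Exception):
    """System failure: the server crashed or is still running recovery."""

def open_connection():
    raise NotImplementedError("placeholder: supply the real driver connect call here")

def reconnect_with_backoff():
    delay = 1.0
    while True:
        try:
            return open_connection()
        except ConnectionLost:
            time.sleep(delay)            # recovery may still be replaying the log
            delay = min(delay * 2, 30.0)

def run_transaction(conn, work):
    for _ in range(5):
        try:
            return work(conn)            # transaction failure: retry on the same connection
        except TransactionAborted:
            continue
        except ConnectionLost:
            conn = reconnect_with_backoff()  # system failure: wait out recovery, reconnect
    raise RuntimeError("giving up after repeated failures")
```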
While system failures cannot be completely eliminated, their frequency and impact can be significantly reduced through proper infrastructure, configuration, and operational practices: uninterruptible power supplies with regularly tested batteries, ECC memory, conservative and well-patched OS and DBMS versions, hardware health monitoring, sensible checkpoint configuration, and standby servers that can take over while the primary recovers.
Recovery Time Objective is the maximum acceptable time for recovery after a failure. For system failures, RTO depends on recovery speed (how fast can we replay the log?) and failover capability (can we switch to a standby server?). Know your RTO and design accordingly.
Let's consolidate the key concepts covered in this page:
- A system failure (crash, soft failure) terminates the DBMS instance abnormally, losing all volatile state: buffer pool, lock table, transaction table, query execution state, connections, and caches. Persistent storage survives intact.
- Causes span power problems, operating system crashes, DBMS software bugs, and non-media hardware faults such as CPU, RAM, or motherboard failures.
- Under the common STEAL/NO-FORCE buffering policy, the disk after a crash may contain uncommitted changes and may be missing committed ones, so recovery needs both UNDO and REDO; write-ahead logging makes this safe.
- Recovery proceeds through analysis, redo, and undo phases, and its duration depends largely on how much log has accumulated since the last checkpoint.
- System failures differ from transaction failures in scope, lost state, recovery trigger, and service impact.
What's Next:
We've now covered transaction failures (individual) and system failures (entire instance). In the next page, we'll examine Media Failures—the most severe failure type where persistent storage itself is damaged or lost. Media failures require different recovery strategies involving backups and archives, since the normal log-based recovery from disk is impossible when the disk itself is the problem.
You now understand system failures comprehensively—their nature, causes, effects on database state, and the high-level recovery approach. This knowledge prepares you for understanding media failures, which add the dimension of persistent storage loss.