When a database encounters a failure, the recovery manager faces an immediate challenge: What kind of failure is this? The answer determines everything—which recovery procedure to use, what resources are needed, how long recovery will take, and what data might be lost.
A well-designed failure classification system allows the database to respond appropriately to each failure type. Just as a hospital emergency room triages patients based on severity and type of injury, database recovery systems must classify failures to provide the right treatment.
This page examines the dimensions along which failures are classified, the hierarchies that organize failure types, and how classification decisions are made during system operation. Understanding classification transforms failures from chaotic disasters into manageable, well-understood scenarios.
By the end of this page, you will understand the multiple dimensions used to classify database failures, how these dimensions interact to determine recovery strategies, the detection mechanisms that identify failure types, and how classification maps to specific recovery procedures.
Database failures can be classified along multiple orthogonal dimensions. Each dimension captures a different aspect of the failure that influences recovery:
The Multi-Dimensional Classification Model:
Think of failure classification as a multi-dimensional space where each failure is a point with coordinates along several axes. The combination of coordinates determines the recovery approach.
| Dimension | Categories | Why It Matters |
|---|---|---|
| Scope | Transaction / System / Media | Determines how much of the system must be recovered |
| State Affected | Volatile / Persistent / Both | Determines what needs to be recovered |
| Detectability | Fail-stop / Byzantine | Determines if failure is clearly identifiable |
| Recoverability | Recoverable / Partially / Unrecoverable | Determines expected outcome |
| Duration | Transient / Permanent | Determines if retry is viable |
| Cause | Hardware / Software / Environmental / Human | Informs prevention strategies |
| Timing | Predictable / Unpredictable | Determines if proactive measures are possible |
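As a concrete illustration, the sketch below represents a classified failure as a point in this multi-dimensional space. It is a minimal sketch only: the enum values and field names are invented for the example and do not come from any particular DBMS.

```python
# Minimal sketch of the multi-dimensional model as a data structure.
# Enum values and field names are illustrative, not from any real engine.
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    TRANSACTION = "transaction"
    SYSTEM = "system"
    MEDIA = "media"

class StateAffected(Enum):
    VOLATILE = "volatile"
    PERSISTENT = "persistent"
    BOTH = "both"

class Duration(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"

@dataclass
class FailureClassification:
    scope: Scope
    state_affected: StateAffected
    duration: Duration
    recoverable: bool
    byzantine: bool = False  # True for silent/arbitrary misbehavior

# A deadlock, for example, is a transient transaction failure touching only
# volatile state and is fully recoverable by rolling back one transaction.
deadlock = FailureClassification(
    scope=Scope.TRANSACTION,
    state_affected=StateAffected.VOLATILE,
    duration=Duration.TRANSIENT,
    recoverable=True,
)
```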
Dimension 1: Scope
Scope defines how much of the system is affected:
| Scope | Affected Components | Example |
|---|---|---|
| Transaction | Single transaction | Constraint violation |
| Connection | One client connection | Network timeout |
| Query | One query within transaction | Out of memory for sort |
| Process | One database process | Worker process crash |
| Instance | Entire database instance | Power failure |
| Cluster | Multiple instances | Network partition |
| Site | Entire data center | Natural disaster |
Dimension 2: State Affected
Which types of state are impacted: volatile state held in memory (buffer pool contents, lock tables, active transaction state), persistent state on stable storage (data files and logs), or both.
Dimension 3: Fail-Stop vs. Byzantine
This classic distributed systems distinction is important for databases:
Fail-Stop Failures: the failing component halts and stops responding, and the failure is cleanly detectable by the rest of the system.
Byzantine Failures: the failing component continues to operate but behaves incorrectly or arbitrarily, possibly returning wrong results without signaling any error.
Byzantine failures in storage (silent data corruption) are particularly dangerous. A disk might return data that differs from what was written without signaling any error. Checksums, end-to-end data verification, and ECC are essential defenses. ZFS, Btrfs, and databases with built-in page checksums protect against this.
Dimension 4: Recoverability
Not all failures can be fully recovered:
| Class | Description | Example |
|---|---|---|
| Fully Recoverable | Complete recovery possible with no data loss | System failure with intact log |
| Partially Recoverable | Recovery possible but with some data loss | Media failure with outdated backup |
| Unrecoverable | Database cannot be restored to valid state | Complete data center destruction, no off-site backup |
Dimension 5: Duration (Transient vs. Permanent)
Transient Failures: clear up on their own or after a retry, such as a deadlock, a lock timeout, or a momentary network glitch.
Permanent Failures: persist until the underlying cause is repaired, such as a failed disk or a corrupted data file.
While the multi-dimensional model provides a complete characterization, most practical discussions use a simpler three-category model based primarily on scope and state affected. This is the model used almost universally in the database literature:
The Standard Classification:
| Characteristic | Transaction Failure | System Failure | Media Failure |
|---|---|---|---|
| Scope | Single transaction | All active transactions | All data (potentially) |
| Volatile state | Preserved | Lost | Lost (if system also crashed) |
| Persistent state | Intact | Intact | Damaged or destroyed |
| Detection | Synchronous | System restart | I/O errors or restart |
| Frequency | Common (many per day) | Rare (monthly/yearly) | Very rare (years) |
| Impact duration | Milliseconds | Minutes to hours | Hours to days |
| Data loss risk | None (with proper handling) | None (with proper logging) | Possible (based on backup age) |
| Recovery mechanism | Transaction rollback | Crash recovery | Backup + log restore |
| Automation | Fully automatic | Fully automatic | May require manual steps |
Why This Classification Works:
The three-category model maps directly to distinct recovery mechanisms:
Transaction Failure → Rollback
System Failure → Crash Recovery
Media Failure → Backup Restoration
This clean mapping from failure type to recovery mechanism is why the three-category model remains dominant despite its simplifications.
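To make the mapping concrete, here is a minimal sketch of how a recovery manager might dispatch on the three categories. The handler names and the dictionary-based dispatch are illustrative assumptions, not any specific engine's API.

```python
# Illustrative dispatch from the three-category classification to a
# recovery routine. Handler names are hypothetical placeholders.
def rollback_transaction(failure):        # Transaction failure -> rollback
    print(f"Rolling back transaction {failure['txn_id']} using undo log records")

def run_crash_recovery(failure):          # System failure -> crash recovery
    print("Running redo/undo crash recovery from the log")

def restore_from_backup(failure):         # Media failure -> backup restoration
    print("Restoring from backup and replaying archived logs")

RECOVERY_DISPATCH = {
    "transaction": rollback_transaction,
    "system": run_crash_recovery,
    "media": restore_from_backup,
}

def recover(failure):
    handler = RECOVERY_DISPATCH[failure["type"]]
    handler(failure)

recover({"type": "transaction", "txn_id": 42})
```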
Before classification, failures must be detected. Different failure types are detected by different mechanisms at different times:
Detection Timing:
| Failure Type | Detection Timing | Detection Method |
|---|---|---|
| Transaction (logical) | Immediately during operation | Exception from query executor |
| Transaction (constraint) | At statement or commit time | Constraint checker |
| Transaction (concurrency) | During lock wait or commit | Lock manager, serialization check |
| System (crash) | At restart | Control file check, log examination |
| Media (complete) | At access attempt | I/O error, disk not present |
| Media (corruption) | At read time | Checksum mismatch |
Detection Mechanisms in Detail:
1. Transaction Failure Detection:
```
// Transaction failures are detected by various subsystems
FUNCTION ExecuteStatement(statement):
    TRY:
        // 1. Parse and validate
        parsed = Parser.parse(statement)
        IF parsed.error:
            RAISE SyntaxError(parsed.error)

        // 2. Check permissions
        IF NOT AccessControl.permitted(parsed):
            RAISE SecurityError("Permission denied")

        // 3. Acquire necessary locks before touching data
        locks = LockManager.acquire(parsed.required_locks)
        IF locks.timeout:
            RAISE LockTimeoutError()
        IF locks.deadlock_victim:
            RAISE DeadlockError("Transaction chosen as victim")

        // 4. Execute operations
        FOR each operation IN parsed.operations:
            result = Executor.execute(operation)

            // 5. Check constraints after each DML statement
            IF operation.type IN [INSERT, UPDATE, DELETE]:
                violations = ConstraintChecker.check(operation.table)
                IF violations:
                    RAISE ConstraintViolation(violations)

        RETURN result

    CATCH exception:
        // Mark transaction as failed
        TransactionManager.setFailed(current_transaction)
        // Initiate rollback
        RecoveryManager.rollback(current_transaction)
        // Return error to client
        RAISE exception
```

2. System Failure Detection:
System failures are typically detected at restart:
Control File Flags: Many databases write 'clean shutdown' flags to control files. If the flag isn't set at startup, a crash occurred.
Log Examination: The log is scanned to find uncommitted transactions that need rollback and committed transactions that need redo.
Timestamp Validation: Control files may record the last successful checkpoint time. If it doesn't match expected values, a crash occurred.
Process Monitoring: In running systems, background monitor processes detect when critical processes die unexpectedly.
```
FUNCTION DatabaseStartup():
    // Step 1: Check control file
    control = ReadControlFile()
    IF control.shutdown_flag == CLEAN:
        // Normal shutdown - no crash recovery needed
        LogMessage("Clean startup - no recovery required")
        RETURN StartNormal()

    // Step 2: Crash detected - need recovery
    LogMessage("Crash detected - initiating recovery")

    // Step 3: Verify log file integrity
    IF NOT LogManager.verifyIntegrity():
        RAISE MediaFailureError("Log files corrupted")

    // Step 4: Verify data file integrity
    FOR each datafile IN control.datafiles:
        IF NOT datafile.readable():
            RAISE MediaFailureError("Data file missing: " + datafile.name)
        IF NOT datafile.checksumValid():
            RAISE MediaFailureError("Data file corrupt: " + datafile.name)

    // Step 5: Run crash recovery
    RecoveryManager.crashRecovery()

    // Step 6: Mark the instance as running (the clean flag is written
    // again only at the next orderly shutdown)
    control.shutdown_flag = RUNNING
    WriteControlFile(control)

    RETURN StartNormal()
```

3. Media Failure Detection:
Media failures are detected through I/O errors (failed reads or writes, a missing device) or through data corruption discovered when a page's checksum fails verification at read time.
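As a rough illustration of corruption detection, the sketch below stores a CRC32 with each page and verifies it on read. The 4 KiB page layout and the checksum placement are assumptions made for the example, not the on-disk format of any real storage engine.

```python
# Sketch of checksum-based corruption detection at page-read time.
# Page layout is invented: 4 KiB pages with a CRC32 in the first 4 bytes.
import zlib

PAGE_SIZE = 4096

def write_page(payload: bytes) -> bytes:
    """Pad the payload to the page body size and prepend its CRC32."""
    body = payload.ljust(PAGE_SIZE - 4, b"\x00")
    checksum = zlib.crc32(body).to_bytes(4, "big")
    return checksum + body

def read_page(page: bytes) -> bytes:
    """Verify the stored CRC32 before trusting the page contents."""
    stored = int.from_bytes(page[:4], "big")
    body = page[4:]
    if zlib.crc32(body) != stored:
        raise IOError("checksum mismatch: silent corruption detected")
    return body

page = write_page(b"row data")
page = page[:100] + b"\xff" + page[101:]   # simulate a flipped byte on disk
try:
    read_page(page)
except IOError as e:
    print(e)   # classified as a media failure (corruption)
```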
Don't wait for failures to be detected reactively. Monitor SMART data, verify backup integrity regularly, run consistency checks periodically, and monitor for replication lag. Proactive detection allows orderly response rather than emergency recovery.
Database systems use decision trees (implemented as procedural logic) to classify failures and route them to appropriate handlers. Let's examine these decision processes:
Runtime Classification (During Normal Operation):
```
FUNCTION ClassifyRuntimeFailure(error):
    // Level 1: Is it an I/O error?
    IF error.type == IO_ERROR:
        // Could be media failure
        IF IsTransientIOError(error):
            // Retry a few times before declaring failure
            RETURN RetryableError(error)
        ELSE:
            // Likely media failure
            InitiateMediaFailureProtocol(error.device)
            RETURN MediaFailure(error)

    // Level 2: Is it a resource error?
    IF error.type IN [OUT_OF_MEMORY, DISK_FULL, CONNECTION_LIMIT]:
        IF CanFreeResources():
            // Try to recover resources and retry
            RETURN RetryableError(error)
        ELSE:
            // Fail the transaction, but the system continues
            RETURN TransactionFailure(error)

    // Level 3: Is it a concurrency error?
    IF error.type IN [DEADLOCK, LOCK_TIMEOUT, SERIALIZATION_FAILURE]:
        // These are transaction failures, usually retryable
        RETURN TransactionFailure(error, retryable=TRUE)

    // Level 4: Is it a constraint error?
    IF error.type IN [PK_VIOLATION, FK_VIOLATION, CHECK_VIOLATION]:
        // Non-retryable transaction failure
        RETURN TransactionFailure(error, retryable=FALSE)

    // Level 5: Is it a logic error?
    IF error.type IN [DIVISION_BY_ZERO, NULL_REFERENCE, TYPE_ERROR]:
        // Non-retryable transaction failure (usually)
        RETURN TransactionFailure(error, retryable=FALSE)

    // Level 6: Is it a system error?
    IF error.type IN [ASSERTION_FAILURE, MEMORY_CORRUPTION, INTERNAL_ERROR]:
        // This is bad - likely need to abort the process
        LogEmergency("Internal error: " + error.details)
        // Decide: can we isolate to this transaction, or is the system compromised?
        IF CanIsolateToTransaction(error):
            RETURN TransactionFailure(error)
        ELSE:
            // Initiate controlled shutdown
            InitiateGracefulShutdown()
            RETURN SystemFailure(error)

    // Unknown error type
    LogWarning("Unknown error type: " + error)
    RETURN TransactionFailure(error)
```

Startup Classification (After Unclean Shutdown):
```
FUNCTION ClassifyStartupState():
    // Step 1: Can we read the control file?
    TRY:
        control = ReadControlFile()
    CATCH IOError:
        // Control file missing or unreadable - media failure
        RETURN MediaFailure("Control file inaccessible")

    // Step 2: Check shutdown status
    IF control.shutdown_status == CLEAN:
        // No recovery needed
        RETURN NormalStartup()

    // Step 3: We need recovery - check what's available
    log_status = CheckLogFiles(control)
    data_status = CheckDataFiles(control)

    // Step 4: All intact - system failure
    IF log_status == INTACT AND data_status == INTACT:
        RETURN SystemFailure("Standard crash recovery")

    // Step 5: Log damaged
    IF log_status == DAMAGED:
        // Critical - we might not be able to recover
        IF HasValidBackup():
            RETURN MediaFailure("Log corruption - backup restore required")
        ELSE:
            RETURN UnrecoverableFailure("Log damaged, no backup available")

    // Step 6: Data damaged (but log OK)
    IF data_status == DAMAGED:
        damaged_files = GetDamagedDataFiles(control)
        IF CanRecoverFromLog(damaged_files):
            // Some databases can rebuild data files from the log
            RETURN SystemFailure("Data corruption, log-based recovery")
        ELSE:
            RETURN MediaFailure("Data corruption - backup restore required")

    // Step 7: Partial damage
    IF log_status == PARTIAL OR data_status == PARTIAL:
        // Complex scenario - depends on what's damaged
        RETURN AnalyzePartialDamage(control, log_status, data_status)
```

The classification decision directly determines the recovery path. Misclassification can lead to inappropriate responses: treating a media failure as a system failure, for instance, would attempt crash recovery that cannot succeed, wasting time and potentially worsening the situation.
Once a failure is classified, the database engages the appropriate recovery mechanism. This mapping is central to the recovery system design:
Recovery Strategy Matrix:
| Classification | Primary Strategy | Fallback Strategy | Resources Required |
|---|---|---|---|
| Transaction (retryable) | Rollback + inform application to retry | None (application decides) | Log for undo |
| Transaction (non-retryable) | Rollback + return error | None | Log for undo |
| System (clean log) | Crash recovery (ARIES-style) | Full backup restore | Log + data files |
| System (minor corruption) | Repair + crash recovery | Full backup restore | Log + data files + checksums |
| Media (data only) | Backup restore + log apply | Older backup + more logs | Backup + archived logs |
| Media (log only) | Emergency - backup + partial log | Accept data loss | Backup + surviving logs |
| Media (complete) | Full restore from backup | Accept total loss since backup | All backups + all archived logs |
Recovery Selection Process:
The recovery manager uses a priority-based selection:
Attempt Least Disruptive First: prefer transaction rollback over crash recovery, and crash recovery over a full backup restore.
Escalate on Failure: if the chosen strategy cannot complete (for example, the log turns out to be damaged), fall back to the next, more disruptive option.
Consider Time Constraints: weigh each strategy's estimated recovery time against the recovery time objective (RTO) and its estimated data loss against the recovery point objective (RPO).
```
FUNCTION SelectRecoveryStrategy(classification, constraints):
    strategies = GetApplicableStrategies(classification)
    backup_strategy = NULL

    FOR each strategy IN strategies.orderByLeastDisruptive():
        // Check if strategy is possible with available resources
        IF NOT strategy.resourcesAvailable():
            LogWarning("Strategy unavailable: " + strategy.name)
            CONTINUE

        // Check if strategy meets time constraints
        estimated_time = strategy.estimateRecoveryTime()
        IF constraints.rto AND estimated_time > constraints.rto:
            LogWarning("Strategy too slow for RTO: " + strategy.name)
            // Might still use it if it's the only option
            backup_strategy = strategy
            CONTINUE

        // Check if strategy meets data loss constraints
        estimated_loss = strategy.estimateDataLoss()
        IF constraints.rpo AND estimated_loss > constraints.rpo:
            LogWarning("Strategy exceeds RPO: " + strategy.name)
            // Definitely continue looking for better options
            CONTINUE

        // This strategy is acceptable
        RETURN strategy

    // No strategy met all constraints - return best available
    IF backup_strategy:
        LogWarning("Using backup strategy that doesn't meet RTO")
        RETURN backup_strategy
    ELSE:
        RAISE NoRecoveryStrategyError("No viable recovery path found")
```

Some failure scenarios don't fit neatly into the three-category model. Understanding these edge cases is important for comprehensive failure handling:
1. Cascading Failures:
One failure triggers another: for example, a disk filling up causes log writes to fail, which in turn crashes the instance. Recovery must address the root cause, not just the final symptom.
2. Partial Failures:
Only part of the system is affected: for example, a single corrupted data file or tablespace while the rest of the database remains fully usable.
3. Logical vs. Physical Failures:
Logical failures don't damage storage but corrupt data: an accidental DELETE without a WHERE clause, a mistakenly dropped table, or an application bug writing incorrect values.
These require PITR or logical restore, not physical recovery.
4. Distributed System Failures:
In distributed databases, additional failure types emerge:
| Failure Type | Description | Classification Challenge |
|---|---|---|
| Network Partition | Nodes can't communicate | Is the remote node down or just unreachable? |
| Split Brain | Each side thinks it's primary | Both sides may be functioning, but disagree |
| Replication Lag | Replica behind primary | When does lag become failure? |
| Quorum Loss | Not enough nodes for consensus | Availability vs. consistency trade-off |
| Byzantine Node | Node behaves maliciously/incorrectly | Harder to detect than fail-stop |
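A small sketch of one of these decisions, the quorum check, is shown below. The strict-majority rule is standard, but the simple reachability count is a placeholder for the heartbeat, lease, and membership machinery real systems use.

```python
# Sketch of a quorum-loss check in a distributed database.
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A strict majority of nodes is required to keep accepting writes."""
    return reachable_nodes >= cluster_size // 2 + 1

# With 5 nodes, losing contact with 2 still leaves a quorum; losing 3 does not.
print(has_quorum(3, 5))  # True  - continue serving writes
print(has_quorum(2, 5))  # False - classify as quorum loss, stop accepting writes
```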
Misclassifying a failure can cause serious problems. Treating a media failure as a system failure leads to failed crash recovery and delays. Treating a transient failure as permanent wastes resources. Treating a Byzantine failure as fail-stop might allow corrupted data to propagate. Defensive classification with verification is essential.
Implementing failure classification in a production database system involves several practical considerations:
Error Code Design:
Error codes should encode classification information:
```
PostgreSQL Error Code Structure (SQLSTATE):
============================================
Error codes are 5 characters: 2 (class) + 3 (detail)

Class 00 — Success
Class 01 — Warning
Class 02 — No Data

Class 23 — Integrity Constraint Violation (Transaction Failure)
  23000 — integrity_constraint_violation
  23001 — restrict_violation
  23502 — not_null_violation
  23503 — foreign_key_violation
  23505 — unique_violation
  23514 — check_violation

Class 40 — Transaction Rollback (Transaction Failure - Retryable)
  40000 — transaction_rollback
  40001 — serialization_failure
  40002 — transaction_integrity_constraint_violation
  40003 — statement_completion_unknown
  40P01 — deadlock_detected

Class 53 — Insufficient Resources (May escalate to System Failure)
  53000 — insufficient_resources
  53100 — disk_full
  53200 — out_of_memory
  53300 — too_many_connections

Class 58 — System Error (Serious - May indicate System Failure)
  58000 — system_error
  58030 — io_error
  58P01 — undefined_file
  58P02 — duplicate_file

Class XX — Internal Error (May require System Failure handling)
  XX000 — internal_error
  XX001 — data_corrupted
  XX002 — index_corrupted
```
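A classifier can exploit this structure by looking only at the two-character class prefix. The sketch below follows the groupings in the listing above; the function itself is illustrative and not part of PostgreSQL.

```python
# Mapping the two-character SQLSTATE class prefix to the classification
# used in this page. The groupings mirror the listing above.
RETRYABLE_CLASSES = {"40"}          # serialization failures, deadlocks
CONSTRAINT_CLASSES = {"23"}         # integrity constraint violations
RESOURCE_CLASSES = {"53"}           # may clear up on retry or escalate
SYSTEM_CLASSES = {"58", "XX"}       # I/O and internal errors

def classify_sqlstate(sqlstate: str) -> str:
    cls = sqlstate[:2]
    if cls in RETRYABLE_CLASSES:
        return "transaction failure (retryable)"
    if cls in CONSTRAINT_CLASSES:
        return "transaction failure (non-retryable)"
    if cls in RESOURCE_CLASSES:
        return "resource error (retry or escalate)"
    if cls in SYSTEM_CLASSES:
        return "possible system or media failure"
    return "unclassified"

print(classify_sqlstate("40P01"))   # deadlock_detected -> retryable
print(classify_sqlstate("23505"))   # unique_violation  -> non-retryable
print(classify_sqlstate("58030"))   # io_error          -> system/media
```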
Logging for Classification:

Effective classification requires good logging: each failure should be recorded with its error code, classification, affected transaction, and timestamp so the decision can be audited later.
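One possible shape for such a log record is sketched below; the field names are assumptions rather than a standard schema.

```python
# Illustrative structured log record for classified failures.
import json, time

def log_failure(error_code: str, scope: str, retryable: bool, detail: str):
    record = {
        "ts": time.time(),
        "error_code": error_code,    # e.g. a SQLSTATE value
        "scope": scope,              # transaction / system / media
        "retryable": retryable,
        "detail": detail,
    }
    print(json.dumps(record))        # ship to the log pipeline

log_failure("40P01", "transaction", True, "deadlock detected, victim txn 42")
```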
Testing Classification Logic:
Classification code should be thoroughly tested: inject representative errors of every type and verify that each is routed to the correct category and recovery path, as in the sketch below.
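A fault-injection style test might look like the following. The classify_runtime_failure function here is a toy Python stand-in for the ClassifyRuntimeFailure pseudocode above, written only so the test is self-contained and runnable.

```python
# Sketch of fault-injection tests for runtime classification logic.
import unittest

def classify_runtime_failure(error_type: str, transient: bool = False) -> str:
    if error_type == "io_error":
        return "retryable" if transient else "media failure"
    if error_type in {"deadlock", "lock_timeout", "serialization_failure"}:
        return "transaction failure (retryable)"
    if error_type in {"pk_violation", "fk_violation", "check_violation"}:
        return "transaction failure (non-retryable)"
    return "transaction failure"

class TestRuntimeClassification(unittest.TestCase):
    def test_transient_io_error_is_retried(self):
        self.assertEqual(classify_runtime_failure("io_error", transient=True),
                         "retryable")

    def test_persistent_io_error_is_media_failure(self):
        self.assertEqual(classify_runtime_failure("io_error", transient=False),
                         "media failure")

    def test_deadlock_is_retryable_transaction_failure(self):
        self.assertEqual(classify_runtime_failure("deadlock"),
                         "transaction failure (retryable)")

if __name__ == "__main__":
    unittest.main()
```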
Let's consolidate the key concepts covered in this page:

- Failures can be classified along multiple orthogonal dimensions: scope, state affected, detectability, recoverability, duration, cause, and timing.
- The practical three-category model (transaction, system, media) maps directly to three recovery mechanisms: rollback, crash recovery, and backup restoration.
- Different failure types are detected by different mechanisms at different times, from synchronous exceptions during execution to restart-time control file and log checks.
- Classification is implemented as decision logic at runtime and at startup, and misclassification can make recovery slower or impossible.
- Edge cases such as cascading, partial, logical, and distributed failures complicate the simple model and require additional handling.
What's Next:
We've now explored the three failure types and how they're classified. In the next and final page of this module, we'll examine Recovery Requirements—the properties that recovery systems must guarantee, the resources they need, and the trade-offs involved in recovery system design. This will complete our understanding of failure concepts and prepare us for studying specific recovery mechanisms.
You now understand failure classification comprehensively—the dimensions, detection mechanisms, decision processes, and how classification maps to recovery strategies. This knowledge is essential for understanding how databases automatically respond to different failure scenarios.