When a database encounters a failure, the recovery manager faces an immediate challenge: What kind of failure is this? The answer determines everything—which recovery procedure to use, what resources are needed, how long recovery will take, and what data might be lost.
A well-designed failure classification system allows the database to respond appropriately to each failure type. Just as a hospital emergency room triages patients based on severity and type of injury, database recovery systems must classify failures to provide the right treatment.
This page examines the dimensions along which failures are classified, the hierarchies that organize failure types, and how classification decisions are made during system operation. Understanding classification transforms failures from chaotic disasters into manageable, well-understood scenarios.
By the end of this page, you will understand the multiple dimensions used to classify database failures, how these dimensions interact to determine recovery strategies, the detection mechanisms that identify failure types, and how classification maps to specific recovery procedures.
Database failures can be classified along multiple orthogonal dimensions. Each dimension captures a different aspect of the failure that influences recovery:
The Multi-Dimensional Classification Model:
Think of failure classification as a multi-dimensional space where each failure is a point with coordinates along several axes. The combination of coordinates determines the recovery approach.
| Dimension | Categories | Why It Matters |
|---|---|---|
| Scope | Transaction / System / Media | Determines how much of the system must be recovered |
| State Affected | Volatile / Persistent / Both | Determines what needs to be recovered |
| Detectability | Fail-stop / Byzantine | Determines if failure is clearly identifiable |
| Recoverability | Recoverable / Partially / Unrecoverable | Determines expected outcome |
| Duration | Transient / Permanent | Determines if retry is viable |
| Cause | Hardware / Software / Environmental / Human | Informs prevention strategies |
| Timing | Predictable / Unpredictable | Determines if proactive measures are possible |
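As a concrete illustration, the sketch below represents a classified failure as a point in this multi-dimensional space. It is a minimal sketch only: the enum values and field names are invented for the example and do not come from any particular DBMS.

```python
# Minimal sketch of the multi-dimensional model as a data structure.
# Enum values and field names are illustrative, not from any real engine.
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    TRANSACTION = "transaction"
    SYSTEM = "system"
    MEDIA = "media"

class StateAffected(Enum):
    VOLATILE = "volatile"
    PERSISTENT = "persistent"
    BOTH = "both"

class Duration(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"

@dataclass
class FailureClassification:
    scope: Scope
    state_affected: StateAffected
    duration: Duration
    recoverable: bool
    byzantine: bool = False  # True for silent/arbitrary misbehavior

# A deadlock, for example, is a transient transaction failure touching only
# volatile state and is fully recoverable by rolling back one transaction.
deadlock = FailureClassification(
    scope=Scope.TRANSACTION,
    state_affected=StateAffected.VOLATILE,
    duration=Duration.TRANSIENT,
    recoverable=True,
)
```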
Dimension 1: Scope
Scope defines how much of the system is affected:
| Scope | Affected Components | Example |
|---|---|---|
| Transaction | Single transaction | Constraint violation |
| Connection | One client connection | Network timeout |
| Query | One query within transaction | Out of memory for sort |
| Process | One database process | Worker process crash |
| Instance | Entire database instance | Power failure |
| Cluster | Multiple instances | Network partition |
| Site | Entire data center | Natural disaster |
Dimension 2: State Affected
Which types of state are impacted: volatile state held in memory (buffer pool contents, lock tables, active transaction state), persistent state on stable storage (data files and logs), or both.
Dimension 3: Fail-Stop vs. Byzantine
This classic distributed systems distinction is important for databases:
Fail-Stop Failures: the failing component halts and stops responding, and the failure is cleanly detectable by the rest of the system.
Byzantine Failures: the failing component continues to operate but behaves incorrectly or arbitrarily, possibly returning wrong results without signaling any error.
Byzantine failures in storage (silent data corruption) are particularly dangerous. A disk might return data that differs from what was written without signaling any error. Checksums, end-to-end data verification, and ECC are essential defenses. ZFS, Btrfs, and databases with built-in page checksums protect against this.
Dimension 4: Recoverability
Not all failures can be fully recovered:
| Class | Description | Example |
|---|---|---|
| Fully Recoverable | Complete recovery possible with no data loss | System failure with intact log |
| Partially Recoverable | Recovery possible but with some data loss | Media failure with outdated backup |
| Unrecoverable | Database cannot be restored to valid state | Complete data center destruction, no off-site backup |
Dimension 5: Duration (Transient vs. Permanent)
Transient Failures: clear up on their own or after a retry, such as a deadlock, a lock timeout, or a momentary network glitch.
Permanent Failures: persist until the underlying cause is repaired, such as a failed disk or a corrupted data file.
While the multi-dimensional model provides a complete characterization, most practical discussions use a simpler three-category model based primarily on scope and state affected. This is the model used almost universally in the database literature:
The Standard Classification:
| Characteristic | Transaction Failure | System Failure | Media Failure |
|---|---|---|---|
| Scope | Single transaction | All active transactions | All data (potentially) |
| Volatile state | Preserved | Lost | Lost (if system also crashed) |
| Persistent state | Intact | Intact | Damaged or destroyed |
| Detection | Synchronous | System restart | I/O errors or restart |
| Frequency | Common (many per day) | Rare (monthly/yearly) | Very rare (years) |
| Impact duration | Milliseconds | Minutes to hours | Hours to days |
| Data loss risk | None (with proper handling) | None (with proper logging) | Possible (based on backup age) |
| Recovery mechanism | Transaction rollback | Crash recovery | Backup + log restore |
| Automation | Fully automatic | Fully automatic | May require manual steps |
Why This Classification Works:
The three-category model maps directly to distinct recovery mechanisms:
Transaction Failure → Rollback
System Failure → Crash Recovery
Media Failure → Backup Restoration
This clean mapping from failure type to recovery mechanism is why the three-category model remains dominant despite its simplifications.
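To make the mapping concrete, here is a minimal sketch of how a recovery manager might dispatch on the three categories. The handler names and the dictionary-based dispatch are illustrative assumptions, not any specific engine's API.

```python
# Illustrative dispatch from the three-category classification to a
# recovery routine. Handler names are hypothetical placeholders.
def rollback_transaction(failure):        # Transaction failure -> rollback
    print(f"Rolling back transaction {failure['txn_id']} using undo log records")

def run_crash_recovery(failure):          # System failure -> crash recovery
    print("Running redo/undo crash recovery from the log")

def restore_from_backup(failure):         # Media failure -> backup restoration
    print("Restoring from backup and replaying archived logs")

RECOVERY_DISPATCH = {
    "transaction": rollback_transaction,
    "system": run_crash_recovery,
    "media": restore_from_backup,
}

def recover(failure):
    handler = RECOVERY_DISPATCH[failure["type"]]
    handler(failure)

recover({"type": "transaction", "txn_id": 42})
```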
Before classification, failures must be detected. Different failure types are detected by different mechanisms at different times:
Detection Timing:
| Failure Type | Detection Timing | Detection Method |
|---|---|---|
| Transaction (logical) | Immediately during operation | Exception from query executor |
| Transaction (constraint) | At statement or commit time | Constraint checker |
| Transaction (concurrency) | During lock wait or commit | Lock manager, serialization check |
| System (crash) | At restart | Control file check, log examination |
| Media (complete) | At access attempt | I/O error, disk not present |
| Media (corruption) | At read time | Checksum mismatch |
Detection Mechanisms in Detail:
1. Transaction Failure Detection:
```
// Transaction failures are detected by various subsystems
FUNCTION ExecuteStatement(statement):
    TRY:
        // 1. Parse and validate
        parsed = Parser.parse(statement)
        IF parsed.error:
            RAISE SyntaxError(parsed.error)

        // 2. Check permissions
        IF NOT AccessControl.permitted(parsed):
            RAISE SecurityError("Permission denied")

        // 3. Acquire necessary locks before touching data
        locks = LockManager.acquire(parsed.required_locks)
        IF locks.timeout:
            RAISE LockTimeoutError()
        IF locks.deadlock_victim:
            RAISE DeadlockError("Transaction chosen as victim")

        // 4. Execute operations
        FOR each operation IN parsed.operations:
            result = Executor.execute(operation)

            // 5. Check constraints after each DML statement
            IF operation.type IN [INSERT, UPDATE, DELETE]:
                violations = ConstraintChecker.check(operation.table)
                IF violations:
                    RAISE ConstraintViolation(violations)

        RETURN result

    CATCH exception:
        // Mark transaction as failed
        TransactionManager.setFailed(current_transaction)
        // Initiate rollback
        RecoveryManager.rollback(current_transaction)
        // Return error to client
        RAISE exception
```

2. System Failure Detection:
System failures are typically detected at restart:
Control File Flags: Many databases write 'clean shutdown' flags to control files. If the flag isn't set at startup, a crash occurred.
Log Examination: The log is scanned to find uncommitted transactions that need rollback and committed transactions that need redo.
Timestamp Validation: Control files may record the last successful checkpoint time. If it doesn't match expected values, a crash occurred.
Process Monitoring: In running systems, background monitor processes detect when critical processes die unexpectedly.
```
FUNCTION DatabaseStartup():
    // Step 1: Check control file
    control = ReadControlFile()
    IF control.shutdown_flag == CLEAN:
        // Normal shutdown - no crash recovery needed
        LogMessage("Clean startup - no recovery required")
        RETURN StartNormal()

    // Step 2: Crash detected - need recovery
    LogMessage("Crash detected - initiating recovery")

    // Step 3: Verify log file integrity
    IF NOT LogManager.verifyIntegrity():
        RAISE MediaFailureError("Log files corrupted")

    // Step 4: Verify data file integrity
    FOR each datafile IN control.datafiles:
        IF NOT datafile.readable():
            RAISE MediaFailureError("Data file missing: " + datafile.name)
        IF NOT datafile.checksumValid():
            RAISE MediaFailureError("Data file corrupt: " + datafile.name)

    // Step 5: Run crash recovery
    RecoveryManager.crashRecovery()

    // Step 6: Mark the instance as running (the clean flag is written
    // again only at the next orderly shutdown)
    control.shutdown_flag = RUNNING
    WriteControlFile(control)

    RETURN StartNormal()
```

3. Media Failure Detection:
Media failures are detected through I/O errors (failed reads or writes, a missing device) or through data corruption discovered when a page's checksum fails verification at read time.
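As a rough illustration of corruption detection, the sketch below stores a CRC32 with each page and verifies it on read. The 4 KiB page layout and the checksum placement are assumptions made for the example, not the on-disk format of any real storage engine.

```python
# Sketch of checksum-based corruption detection at page-read time.
# Page layout is invented: 4 KiB pages with a CRC32 in the first 4 bytes.
import zlib

PAGE_SIZE = 4096

def write_page(payload: bytes) -> bytes:
    """Pad the payload to the page body size and prepend its CRC32."""
    body = payload.ljust(PAGE_SIZE - 4, b"\x00")
    checksum = zlib.crc32(body).to_bytes(4, "big")
    return checksum + body

def read_page(page: bytes) -> bytes:
    """Verify the stored CRC32 before trusting the page contents."""
    stored = int.from_bytes(page[:4], "big")
    body = page[4:]
    if zlib.crc32(body) != stored:
        raise IOError("checksum mismatch: silent corruption detected")
    return body

page = write_page(b"row data")
page = page[:100] + b"\xff" + page[101:]   # simulate a flipped byte on disk
try:
    read_page(page)
except IOError as e:
    print(e)   # classified as a media failure (corruption)
```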
Don't wait for failures to be detected reactively. Monitor SMART data, verify backup integrity regularly, run consistency checks periodically, and monitor for replication lag. Proactive detection allows orderly response rather than emergency recovery.
Database systems use decision trees (implemented as procedural logic) to classify failures and route them to appropriate handlers. Let's examine these decision processes:
Runtime Classification (During Normal Operation):
```
FUNCTION ClassifyRuntimeFailure(error):
    // Level 1: Is it an I/O error?
    IF error.type == IO_ERROR:
        // Could be media failure
        IF IsTransientIOError(error):
            // Retry a few times before declaring failure
            RETURN RetryableError(error)
        ELSE:
            // Likely media failure
            InitiateMediaFailureProtocol(error.device)
            RETURN MediaFailure(error)

    // Level 2: Is it a resource error?
    IF error.type IN [OUT_OF_MEMORY, DISK_FULL, CONNECTION_LIMIT]:
        IF CanFreeResources():
            // Try to recover resources and retry
            RETURN RetryableError(error)
        ELSE:
            // Fail the transaction, but the system continues
            RETURN TransactionFailure(error)

    // Level 3: Is it a concurrency error?
    IF error.type IN [DEADLOCK, LOCK_TIMEOUT, SERIALIZATION_FAILURE]:
        // These are transaction failures, usually retryable
        RETURN TransactionFailure(error, retryable=TRUE)

    // Level 4: Is it a constraint error?
    IF error.type IN [PK_VIOLATION, FK_VIOLATION, CHECK_VIOLATION]:
        // Non-retryable transaction failure
        RETURN TransactionFailure(error, retryable=FALSE)

    // Level 5: Is it a logic error?
    IF error.type IN [DIVISION_BY_ZERO, NULL_REFERENCE, TYPE_ERROR]:
        // Non-retryable transaction failure (usually)
        RETURN TransactionFailure(error, retryable=FALSE)

    // Level 6: Is it a system error?
    IF error.type IN [ASSERTION_FAILURE, MEMORY_CORRUPTION, INTERNAL_ERROR]:
        // This is bad - likely need to abort the process
        LogEmergency("Internal error: " + error.details)
        // Decide: can we isolate to this transaction, or is the system compromised?
        IF CanIsolateToTransaction(error):
            RETURN TransactionFailure(error)
        ELSE:
            // Initiate controlled shutdown
            InitiateGracefulShutdown()
            RETURN SystemFailure(error)

    // Unknown error type
    LogWarning("Unknown error type: " + error)
    RETURN TransactionFailure(error)
```

Startup Classification (After Unclean Shutdown):
```
FUNCTION ClassifyStartupState():
    // Step 1: Can we read the control file?
    TRY:
        control = ReadControlFile()
    CATCH IOError:
        // Control file missing or unreadable - media failure
        RETURN MediaFailure("Control file inaccessible")

    // Step 2: Check shutdown status
    IF control.shutdown_status == CLEAN:
        // No recovery needed
        RETURN NormalStartup()

    // Step 3: We need recovery - check what's available
    log_status = CheckLogFiles(control)
    data_status = CheckDataFiles(control)

    // Step 4: All intact - system failure
    IF log_status == INTACT AND data_status == INTACT:
        RETURN SystemFailure("Standard crash recovery")

    // Step 5: Log damaged
    IF log_status == DAMAGED:
        // Critical - we might not be able to recover
        IF HasValidBackup():
            RETURN MediaFailure("Log corruption - backup restore required")
        ELSE:
            RETURN UnrecoverableFailure("Log damaged, no backup available")

    // Step 6: Data damaged (but log OK)
    IF data_status == DAMAGED:
        damaged_files = GetDamagedDataFiles(control)
        IF CanRecoverFromLog(damaged_files):
            // Some databases can rebuild data files from the log
            RETURN SystemFailure("Data corruption, log-based recovery")
        ELSE:
            RETURN MediaFailure("Data corruption - backup restore required")

    // Step 7: Partial damage
    IF log_status == PARTIAL OR data_status == PARTIAL:
        // Complex scenario - depends on what's damaged
        RETURN AnalyzePartialDamage(control, log_status, data_status)
```

The classification decision directly determines the recovery path. Misclassification can lead to inappropriate responses: treating a media failure as a system failure, for instance, would attempt crash recovery that cannot succeed, wasting time and potentially worsening the situation.
Once a failure is classified, the database engages the appropriate recovery mechanism. This mapping is central to the recovery system design:
Recovery Strategy Matrix:
| Classification | Primary Strategy | Fallback Strategy | Resources Required |
|---|---|---|---|
| Transaction (retryable) | Rollback + inform application to retry | None (application decides) | Log for undo |
| Transaction (non-retryable) | Rollback + return error | None | Log for undo |
| System (clean log) | Crash recovery (ARIES-style) | Full backup restore | Log + data files |
| System (minor corruption) | Repair + crash recovery | Full backup restore | Log + data files + checksums |
| Media (data only) | Backup restore + log apply | Older backup + more logs | Backup + archived logs |
| Media (log only) | Emergency - backup + partial log | Accept data loss | Backup + surviving logs |
| Media (complete) | Full restore from backup | Accept total loss since backup | All backups + all archived logs |
Recovery Selection Process:
The recovery manager uses a priority-based selection:
Attempt Least Disruptive First: prefer transaction rollback over crash recovery, and crash recovery over a full backup restore.
Escalate on Failure: if the chosen strategy cannot complete (for example, the log turns out to be damaged), fall back to the next, more disruptive option.
Consider Time Constraints: weigh each strategy's estimated recovery time against the recovery time objective (RTO) and its estimated data loss against the recovery point objective (RPO).
```
FUNCTION SelectRecoveryStrategy(classification, constraints):
    strategies = GetApplicableStrategies(classification)
    backup_strategy = NULL

    FOR each strategy IN strategies.orderByLeastDisruptive():
        // Check if strategy is possible with available resources
        IF NOT strategy.resourcesAvailable():
            LogWarning("Strategy unavailable: " + strategy.name)
            CONTINUE

        // Check if strategy meets time constraints
        estimated_time = strategy.estimateRecoveryTime()
        IF constraints.rto AND estimated_time > constraints.rto:
            LogWarning("Strategy too slow for RTO: " + strategy.name)
            // Might still use it if it's the only option
            backup_strategy = strategy
            CONTINUE

        // Check if strategy meets data loss constraints
        estimated_loss = strategy.estimateDataLoss()
        IF constraints.rpo AND estimated_loss > constraints.rpo:
            LogWarning("Strategy exceeds RPO: " + strategy.name)
            // Definitely continue looking for better options
            CONTINUE

        // This strategy is acceptable
        RETURN strategy

    // No strategy met all constraints - return best available
    IF backup_strategy:
        LogWarning("Using backup strategy that doesn't meet RTO")
        RETURN backup_strategy
    ELSE:
        RAISE NoRecoveryStrategyError("No viable recovery path found")
```

Some failure scenarios don't fit neatly into the three-category model. Understanding these edge cases is important for comprehensive failure handling:
1. Cascading Failures:
One failure triggers another: for example, a disk filling up causes log writes to fail, which in turn crashes the instance. Recovery must address the root cause, not just the final symptom.
2. Partial Failures:
Only part of the system is affected: for example, a single corrupted data file or tablespace while the rest of the database remains fully usable.
3. Logical vs. Physical Failures:
Logical failures don't damage storage but corrupt data: an accidental DELETE without a WHERE clause, a mistakenly dropped table, or an application bug writing incorrect values.
These require PITR or logical restore, not physical recovery.
4. Distributed System Failures:
In distributed databases, additional failure types emerge:
| Failure Type | Description | Classification Challenge |
|---|---|---|
| Network Partition | Nodes can't communicate | Is the remote node down or just unreachable? |
| Split Brain | Each side thinks it's primary | Both sides may be functioning, but disagree |
| Replication Lag | Replica behind primary | When does lag become failure? |
| Quorum Loss | Not enough nodes for consensus | Availability vs. consistency trade-off |
| Byzantine Node | Node behaves maliciously/incorrectly | Harder to detect than fail-stop |
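A small sketch of one of these decisions, the quorum check, is shown below. The strict-majority rule is standard, but the simple reachability count is a placeholder for the heartbeat, lease, and membership machinery real systems use.

```python
# Sketch of a quorum-loss check in a distributed database.
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A strict majority of nodes is required to keep accepting writes."""
    return reachable_nodes >= cluster_size // 2 + 1

# With 5 nodes, losing contact with 2 still leaves a quorum; losing 3 does not.
print(has_quorum(3, 5))  # True  - continue serving writes
print(has_quorum(2, 5))  # False - classify as quorum loss, stop accepting writes
```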
Misclassifying a failure can cause serious problems. Treating a media failure as a system failure leads to failed crash recovery and delays. Treating a transient failure as permanent wastes resources. Treating a Byzantine failure as fail-stop might allow corrupted data to propagate. Defensive classification with verification is essential.
Implementing failure classification in a production database system involves several practical considerations:
Error Code Design:
Error codes should encode classification information:
```
PostgreSQL Error Code Structure (SQLSTATE):
============================================
Error codes are 5 characters: 2 (class) + 3 (detail)

Class 00 — Success
Class 01 — Warning
Class 02 — No Data

Class 23 — Integrity Constraint Violation (Transaction Failure)
  23000 — integrity_constraint_violation
  23001 — restrict_violation
  23502 — not_null_violation
  23503 — foreign_key_violation
  23505 — unique_violation
  23514 — check_violation

Class 40 — Transaction Rollback (Transaction Failure - Retryable)
  40000 — transaction_rollback
  40001 — serialization_failure
  40002 — transaction_integrity_constraint_violation
  40003 — statement_completion_unknown
  40P01 — deadlock_detected

Class 53 — Insufficient Resources (May escalate to System Failure)
  53000 — insufficient_resources
  53100 — disk_full
  53200 — out_of_memory
  53300 — too_many_connections

Class 58 — System Error (Serious - May indicate System Failure)
  58000 — system_error
  58030 — io_error
  58P01 — undefined_file
  58P02 — duplicate_file

Class XX — Internal Error (May require System Failure handling)
  XX000 — internal_error
  XX001 — data_corrupted
  XX002 — index_corrupted
```
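A classifier can exploit this structure by looking only at the two-character class prefix. The sketch below follows the groupings in the listing above; the function itself is illustrative and not part of PostgreSQL.

```python
# Mapping the two-character SQLSTATE class prefix to the classification
# used in this page. The groupings mirror the listing above.
RETRYABLE_CLASSES = {"40"}          # serialization failures, deadlocks
CONSTRAINT_CLASSES = {"23"}         # integrity constraint violations
RESOURCE_CLASSES = {"53"}           # may clear up on retry or escalate
SYSTEM_CLASSES = {"58", "XX"}       # I/O and internal errors

def classify_sqlstate(sqlstate: str) -> str:
    cls = sqlstate[:2]
    if cls in RETRYABLE_CLASSES:
        return "transaction failure (retryable)"
    if cls in CONSTRAINT_CLASSES:
        return "transaction failure (non-retryable)"
    if cls in RESOURCE_CLASSES:
        return "resource error (retry or escalate)"
    if cls in SYSTEM_CLASSES:
        return "possible system or media failure"
    return "unclassified"

print(classify_sqlstate("40P01"))   # deadlock_detected -> retryable
print(classify_sqlstate("23505"))   # unique_violation  -> non-retryable
print(classify_sqlstate("58030"))   # io_error          -> system/media
```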
Logging for Classification:

Effective classification requires good logging: each failure should be recorded with its error code, classification, affected transaction, and timestamp so the decision can be audited later.
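One possible shape for such a log record is sketched below; the field names are assumptions rather than a standard schema.

```python
# Illustrative structured log record for classified failures.
import json, time

def log_failure(error_code: str, scope: str, retryable: bool, detail: str):
    record = {
        "ts": time.time(),
        "error_code": error_code,    # e.g. a SQLSTATE value
        "scope": scope,              # transaction / system / media
        "retryable": retryable,
        "detail": detail,
    }
    print(json.dumps(record))        # ship to the log pipeline

log_failure("40P01", "transaction", True, "deadlock detected, victim txn 42")
```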
Testing Classification Logic:
Classification code should be thoroughly tested: inject representative errors of every type and verify that each is routed to the correct category and recovery path, as in the sketch below.
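A fault-injection style test might look like the following. The classify_runtime_failure function here is a toy Python stand-in for the ClassifyRuntimeFailure pseudocode above, written only so the test is self-contained and runnable.

```python
# Sketch of fault-injection tests for runtime classification logic.
import unittest

def classify_runtime_failure(error_type: str, transient: bool = False) -> str:
    if error_type == "io_error":
        return "retryable" if transient else "media failure"
    if error_type in {"deadlock", "lock_timeout", "serialization_failure"}:
        return "transaction failure (retryable)"
    if error_type in {"pk_violation", "fk_violation", "check_violation"}:
        return "transaction failure (non-retryable)"
    return "transaction failure"

class TestRuntimeClassification(unittest.TestCase):
    def test_transient_io_error_is_retried(self):
        self.assertEqual(classify_runtime_failure("io_error", transient=True),
                         "retryable")

    def test_persistent_io_error_is_media_failure(self):
        self.assertEqual(classify_runtime_failure("io_error", transient=False),
                         "media failure")

    def test_deadlock_is_retryable_transaction_failure(self):
        self.assertEqual(classify_runtime_failure("deadlock"),
                         "transaction failure (retryable)")

if __name__ == "__main__":
    unittest.main()
```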
Let's consolidate the key concepts covered in this page:

- Failures can be classified along multiple orthogonal dimensions: scope, state affected, detectability, recoverability, duration, cause, and timing.
- The practical three-category model (transaction, system, media) maps directly to three recovery mechanisms: rollback, crash recovery, and backup restoration.
- Different failure types are detected by different mechanisms at different times, from synchronous exceptions during execution to restart-time control file and log checks.
- Classification is implemented as decision logic at runtime and at startup, and misclassification can make recovery slower or impossible.
- Edge cases such as cascading, partial, logical, and distributed failures complicate the simple model and require additional handling.
What's Next:
We've now explored the three failure types and how they're classified. In the next and final page of this module, we'll examine Recovery Requirements—the properties that recovery systems must guarantee, the resources they need, and the trade-offs involved in recovery system design. This will complete our understanding of failure concepts and prepare us for studying specific recovery mechanisms.
You now understand failure classification comprehensively—the dimensions, detection mechanisms, decision processes, and how classification maps to recovery strategies. This knowledge is essential for understanding how databases automatically respond to different failure scenarios.