The click. Every database administrator dreads it—that distinctive clicking sound a hard drive makes when its read/write heads fail, repeatedly seeking and failing to find the data tracks. In a fraction of a second, years of accumulated data becomes inaccessible. Unlike a power outage where memory is lost but disk survives, a media failure destroys the very persistent storage that was supposed to be our safety net.
Media failures represent the most severe category of database failures. They damage or destroy the persistent storage that normally survives system crashes. The log files that enable crash recovery might themselves be lost. The data files containing the actual database content might be corrupted or unreadable.
This page examines media failures in comprehensive detail—their causes, their devastating effects, and the specialized recovery strategies they require. Understanding media failures is essential because they demand a fundamentally different approach to recovery: backups, archives, and geographic redundancy.
By the end of this page, you will understand what constitutes a media failure, its root causes across hardware and environmental factors, why normal crash recovery is insufficient, and the recovery strategies that make databases resilient against even catastrophic storage loss.
A media failure (also called a hard failure or disk crash) occurs when persistent storage becomes unreadable, corrupted, or physically destroyed.
Formal Definition:
A media failure is an event where the contents of non-volatile storage (data files, log files, or both) are lost or damaged, where restarting the system does not restore access to that data, and where recovery requires information held outside the failed device, such as backups and archived logs.
Media failures are 'hard' failures because they damage the durable storage that was supposed to persist data indefinitely. Unlike system failures where the disk survives and enables recovery, media failures strike at the foundation of durability itself.
| Failure Type | Volatile State (RAM) | Persistent State (Disk) | Recovery Method |
|---|---|---|---|
| Transaction Failure | Preserved | Preserved | Rollback via log |
| System Failure | Lost | Preserved | Crash recovery via log |
| Media Failure | Lost | Damaged/Lost | Backup + archived logs |
Categories of Media Failure:
Media failures can be categorized by what is affected:
1. Data File Failure: The files containing database tables and indexes are damaged. The transaction log may be intact, allowing some recovery.
2. Log File Failure: The transaction log is damaged. This is particularly dangerous because the log is needed for recovery. Without the log, crash recovery is impossible.
3. Complete Media Failure: Both data files and log files are lost. This requires full restoration from backup plus replaying archived log files.
4. Partial Media Failure: Some storage is damaged but other storage is intact. For example, a single disk in an array fails, or corruption affects only certain files.
5. Controller/Channel Failure: The storage interface fails, making otherwise intact disks inaccessible. This may be recoverable by replacing hardware.
Losing the transaction log is catastrophic. Without the log, we cannot determine which transactions were committed, cannot redo committed work, and cannot undo uncommitted work. This is why production databases store logs on separate physical devices from data files—a media failure affecting data files leaves the log available for recovery.
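In PostgreSQL, for example, this separation can be arranged when the cluster is created; the paths below are placeholders for whatever mount points the two devices use:

# Place the write-ahead log (WAL) on a different physical device than the data files
# (PostgreSQL sketch; /data and /logdisk are placeholder mount points)
$ initdb -D /data/pgdata --waldir=/logdisk/pg_wal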
Media failures arise from various sources spanning hardware wear, manufacturing defects, environmental factors, and human errors. Understanding these causes is essential for implementing appropriate preventive measures.
Categories of Media Failure Causes:
2.1 Hard Disk Drive (HDD) Failures
HDDs use spinning magnetic platters with read/write heads flying nanometers above the surface. This mechanical complexity creates multiple failure modes:
Head Crash: The read/write head contacts the spinning platter, damaging both the head and the magnetic surface. Causes include: vibration, shock, contamination, power interruption during operation.
Motor/Bearing Failure: The spindle motor or its bearings fail, preventing the platters from spinning. Common in older drives or drives operating in high-temperature environments.
PCB Failure: The controller board fails due to power surge, component aging, or manufacturing defect. Sometimes recoverable by swapping PCBs (but risky).
Firmware Corruption: The firmware stored on the drive's internal storage becomes corrupted, rendering the drive unusable even if the platters are fine.
| Failure Type | Typical Cause | Warning Signs | Recovery Possibility |
|---|---|---|---|
| Head crash | Shock, wear, contamination | Clicking sounds, read errors | Professional recovery, expensive |
| Motor failure | Age, heat, bearing wear | Grinding, spin-up failure | Motor transplant, very expensive |
| PCB failure | Power surge, component age | Drive not detected, spin-up issues | PCB swap, moderate chance |
| Firmware bug | Vendor bugs, corrupt updates | Drive not recognized | Firmware tools, specialized |
| Bad sectors | Media degradation over time | SMART warnings, slow reads | Remapping, eventual failure |
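Many of these warning signs show up in the drive's SMART data well before total failure. Assuming smartmontools is installed (and with an illustrative device name), a quick health check looks like this:

# Overall health self-assessment and key SMART attributes (device name illustrative)
$ smartctl -H /dev/sda    # reports PASSED or FAILED
$ smartctl -A /dev/sda    # watch Reallocated_Sector_Ct, Current_Pending_Sector,
                          # and UDMA_CRC_Error_Count for early trouble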
2.2 Solid State Drive (SSD) Failures
SSDs have no moving parts but are subject to different failure modes:
Write Exhaustion: Flash cells can only be written a limited number of times (typically 1,000-100,000 cycles depending on technology). Enterprise SSDs track wear level and warn before failure.
Read Disturb: Repeated reads of certain cells can disturb adjacent cells, causing data corruption. Modern SSDs mitigate this with background refresh operations.
Sudden Power Loss: SSDs with inadequate power-loss protection can corrupt their mapping tables or lose unflushed data during unexpected power loss.
Controller/Firmware Bugs: SSDs have complex controllers running firmware. Bugs can cause data loss, performance degradation, or complete failure.
Retention Loss: Stored data in powered-off SSDs can degrade over time (months to years), especially at high temperatures. Enterprise SSDs are rated for longer retention.
Enterprise-grade drives include superior power-loss protection (capacitors), higher endurance ratings, better error correction, and longer warranties. For production databases, the cost premium is trivially small compared to the risk reduction.
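Wear level is also visible through SMART. For an NVMe drive the check might look like the sketch below (device name illustrative; nvme-cli is optional):

# NVMe wear and spare-capacity check (device name illustrative)
$ smartctl -a /dev/nvme0      # look for "Percentage Used" and "Available Spare"
$ nvme smart-log /dev/nvme0   # same counters via nvme-cli, if installed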
2.3 Environmental and Catastrophic Causes
Some media failures affect entire facilities:
Fire: Direct fire damage destroys drives. Smoke and soot contaminate internals. Sprinkler systems cause water damage. Heat damages electronics.
Flood: Water damages electronics and can contaminate platter surfaces. Data centers in flood zones require special protection.
Earthquake: Shock and vibration cause head crashes. Building collapse destroys equipment. Power infrastructure may be damaged.
Temperature/Humidity: Extreme temperature variations stress components. High humidity causes condensation and corrosion. Low humidity increases static risk.
2.4 Human Error and Malicious Acts
Not all media failures are accidental:
Accidental Deletion:
-- The infamous command that should never be run without WHERE
DELETE FROM critical_table;
-- Or worse:
DROP DATABASE production;
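A common safeguard against this class of mistake is to wrap destructive statements in an explicit transaction and inspect the result before committing; the table and predicate below are placeholders:

-- Defensive habit: destructive statements inside an explicit transaction
BEGIN;
DELETE FROM critical_table WHERE created_at < '2020-01-01';  -- placeholder predicate
SELECT count(*) FROM critical_table;  -- sanity-check what remains
ROLLBACK;  -- if the result looks wrong
-- COMMIT;  -- only once the effect is confirmed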
Ransomware: Malware that encrypts files and demands payment. Database files and backups are prime targets. Offline backups are essential protection.
Intentional Destruction: Disgruntled employees or attackers deliberately destroying data. Requires access controls, audit logging, and off-site backups.
Standard crash recovery (Analysis → Redo → Undo) cannot handle media failures because it assumes the disk is intact. Let's understand precisely why:
The Crash Recovery Assumption:
Crash recovery works by reading the log to locate the last checkpoint, scanning the log to determine which transactions committed (Analysis), reapplying committed changes to the data pages on disk (Redo), and rolling back changes from uncommitted transactions (Undo).
Every step assumes disk access works. Media failure breaks this assumption.
| Recovery Step | Normal Crash | Media Failure (Data) | Media Failure (Log) |
|---|---|---|---|
| Find last checkpoint | Read from log ✓ | Read from log ✓ | Cannot read log ✗ |
| Scan log records | Read log forward ✓ | Read log forward ✓ | Log unavailable ✗ |
| Read data pages | Pages on disk ✓ | Pages missing/corrupt ✗ | Pages may be OK |
| Apply redo/undo | Modify pages ✓ | No pages to modify ✗ | No log info ✗ |
| Write fixed pages | Write to disk ✓ | Disk damaged ✗ | May work |
Scenario Analysis:
Case 1: Data Files Lost, Log Intact
This is actually the most recoverable media failure scenario: restore the data files from the most recent backup, apply the archived logs, then apply the surviving online logs, and finish with crash recovery for transactions active at the failure.
Recovery is possible but may take hours depending on backup age and log volume.
Case 2: Log Files Lost, Data Intact
This is extremely problematic: without the log we cannot tell which transactions committed, cannot redo committed work, and cannot undo uncommitted work, so the data files cannot be trusted to be consistent. The options are to accept potential inconsistency or to restore from backup plus archived logs and lose everything since the last archive.
This scenario is why log files should be on separate physical storage from data files.
Case 3: Both Data and Log Lost
Complete reliance on backups: restore the data files from the most recent backup, apply every archived log available off-site, and accept that transactions committed after the last archived log are lost.
The gap between last archived log and failure point is the data loss window.
Scenario: Media Failure Analysis
=================================

Timeline:
---------
00:00  Full backup completed
06:00  Archived log backup completed (logs 1-100)
12:00  Archived log backup completed (logs 101-200)
18:00  Online logs 201-250 on disk
18:30  [MEDIA FAILURE - All disks destroyed]

Case A: Only data disk fails, log disk survives
------------------------------------------------
Recovery:
  1. Restore data files from 00:00 backup
  2. Apply archived logs 1-200 (from backup)
  3. Apply online logs 201-250 (from surviving disk)
  4. Crash recovery to handle active transactions
Data Loss: Zero (full recovery possible)
Recovery Time: Hours (depends on data and log size)

Case B: Only log disk fails, data disk survives
------------------------------------------------
Recovery:
  Option 1: Accept potential inconsistency
    - Database may have uncommitted data
    - Some committed data may be missing
    - Integrity constraints may be violated
  Option 2: Restore from backup + archived logs
    - Full restore to 12:00 state
    - Lose everything from 12:00 to 18:30
Data Loss: ~6.5 hours of work
Recovery Decision: Business/risk judgment required

Case C: All disks destroyed (fire/flood/disaster)
-------------------------------------------------
Recovery:
  1. Restore data files from 00:00 backup
  2. Apply archived logs 1-200 (from off-site backup)
  3. STOP - no more logs available
Data Loss: ~6.5 hours of work (18:30 - 12:00)
Recovery Time: Hours to days (depends on backup location)

Lesson: Archive logs more frequently, replicate real-time

The interval between your last archived log backup and a media failure is your data loss window. Transactions committed in this window may be unrecoverable. Reducing this window (more frequent archiving, synchronous replication) directly reduces potential data loss but adds overhead.
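In PostgreSQL, one way to keep an eye on this window is the archiver statistics view; the query below is a small illustrative check, and any alerting threshold attached to it is a local policy choice rather than a standard value:

-- How far does WAL archiving lag behind? A growing lag is a growing data loss window.
SELECT last_archived_wal,
       last_archived_time,
       now() - last_archived_time AS archive_lag,
       failed_count
FROM pg_stat_archiver;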
Recovering from media failures requires preparation—specifically, backups and archived logs that were created before the failure occurred. Let's examine the recovery strategies:
The Recovery Building Blocks: a usable backup of the data files (full, incremental, or differential), the archived log files generated since that backup was taken, and any online log files that survived the failure.
Strategy 1: Point-in-Time Recovery (PITR)
PITR restores the database to a specific moment in time: restore a base backup taken before that moment, then replay archived logs up to, and no further than, the chosen target time.
PITR is used for: undoing logical errors such as an accidental DELETE or DROP, recovering to the instant just before corruption began, and producing a consistent copy of the database as of a known point in time.
-- PostgreSQL Point-in-Time Recovery Example

-- Step 1: Stop the database server (already stopped due to failure)

-- Step 2: Restore base backup to data directory
-- (Using pg_basebackup or file copy from backup)

-- Step 3: Create recovery.signal file to enter recovery mode
$ touch /var/lib/postgresql/data/recovery.signal

-- Step 4: Configure recovery in postgresql.conf
restore_command = 'cp /backup/archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00'
recovery_target_action = 'promote'

-- Step 5: Start PostgreSQL - it will enter recovery mode
$ pg_ctl start -D /var/lib/postgresql/data

-- PostgreSQL will:
-- 1. Detect recovery.signal
-- 2. Read archived logs using restore_command
-- 3. Apply logs until recovery_target_time
-- 4. Promote to primary (writable) database
-- 5. Delete recovery.signal

-- Step 6: Verify recovery
SELECT pg_is_in_recovery();  -- Should return FALSE
SELECT max(created_at) FROM transactions;  -- Check latest data

Strategy 2: Restore with Forward Recovery
This applies all available logs to reach the most current state: restore the most recent backup, apply every archived log, apply any surviving online logs, and finish with standard crash recovery. A minimal configuration sketch follows.
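In PostgreSQL terms this is the PITR procedure shown above with no recovery target, so replay simply continues until the archive is exhausted and the server promotes itself; a minimal sketch, with placeholder paths:

-- Forward recovery sketch: same restore steps as the PITR example,
-- but with no recovery_target_* settings in postgresql.conf
restore_command = 'cp /backup/archive/%f %p'
-- With recovery.signal present and no target, PostgreSQL replays every
-- archived log it can fetch, then promotes to a writable primary.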
Strategy 3: Incremental Restore
For large databases with incremental backups: restore the most recent full backup, apply each incremental (or the latest differential) in order, and then apply the archived logs generated since the last backup.
Incremental restore is faster than full restore when incrementals are available because less data needs to be copied.
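As one concrete illustration, a backup tool such as pgBackRest chains full and incremental backups automatically; the commands below are a sketch and assume a stanza named 'main' has already been configured:

# Take backups ('main' is an assumed stanza name)
$ pgbackrest --stanza=main --type=full backup    # e.g. weekly full
$ pgbackrest --stanza=main --type=incr backup    # e.g. nightly incremental

# Restore: the tool selects the full backup plus the incrementals it depends on;
# PostgreSQL then replays archived WAL on startup
$ pgbackrest --stanza=main --delta restore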
| Strategy | Data Loss | Recovery Time | Use Case |
|---|---|---|---|
| Full restore + all logs | Minimal (up to last log) | Longest | Maximum recovery |
| PITR to specific time | From target time to failure | Moderate | Undo logical errors |
| Restore backup only | All work since backup | Fastest | Emergency/test |
| Failover to replica | Depends on replication lag | Minutes | High availability |
Recovery procedures that have never been tested are untested assumptions, not plans. Regularly practice restoring from backup, applying logs, and verifying data integrity. Many organizations discover their backup strategy doesn't work only when they desperately need it.
While we cannot prevent individual disk failures, we can design systems where no single media failure causes data loss. This is achieved through redundancy at multiple levels:
Level 1: RAID (Redundant Array of Independent Disks)
| RAID Level | Redundancy | Performance | Disk Failure Tolerance | Database Suitability |
|---|---|---|---|---|
| RAID 0 | None | Highest | None - total data loss | Never use for production |
| RAID 1 | Full mirror | Good reads | 1 disk failure | Good for logs |
| RAID 5 | Single parity | Good | 1 disk failure | OK, rebuild is slow |
| RAID 6 | Double parity | Moderate | 2 disk failures | Better, still slow rebuild |
| RAID 10 | Mirror + stripe | Very good | 1+ disk failures | Excellent for databases |
RAID 10 is the gold standard for database storage: it combines mirroring with striping, delivers strong read and write performance, tolerates at least one disk failure (more if the failures land in different mirror pairs), and rebuilds quickly because only a single mirror must be copied. A setup sketch follows.
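On Linux, a software RAID 10 array could be assembled roughly as follows; device names and the mount point are illustrative, and hardware RAID controllers achieve the same layout through their own tools:

# Build a 4-disk software RAID 10 array (device names illustrative)
$ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ mkfs.ext4 /dev/md0
$ mount /dev/md0 /var/lib/postgresql/data

# Monitor array health; a degraded array keeps running but has lost redundancy
$ cat /proc/mdstat
$ mdadm --detail /dev/md0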
Level 2: Separate Storage for Logs
As we've emphasized, log files should be on physically separate storage from data files:
If data disks fail, the logs survive for recovery. If log disks fail, immediate action is needed but data isn't immediately lost.
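For an existing PostgreSQL cluster, the WAL directory can be relocated to its own device with the server stopped, and the separation verified afterwards; paths are placeholders:

# Relocate WAL to a separate device (server stopped; paths illustrative)
$ mv /data/pgdata/pg_wal /logdisk/pg_wal
$ ln -s /logdisk/pg_wal /data/pgdata/pg_wal

# Verify that data files and WAL now live on different filesystems
$ df -h /data/pgdata /logdisk/pg_wal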
Level 3: Database Replication
Replication maintains synchronized copies of the database on different servers:
Synchronous Replication: the primary waits for the replica to confirm it has safely received each transaction's log records before acknowledging the commit, so no committed transaction is lost when the primary's storage fails, at the cost of added commit latency.
Asynchronous Replication: the primary commits without waiting for the replica, which keeps commits fast but means transactions committed during the replication lag can be lost if the primary's storage is destroyed. A configuration sketch follows.
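In PostgreSQL, for instance, the choice between the two modes comes down largely to two settings on the primary; the standby name is a placeholder:

-- postgresql.conf on the primary (sketch; 'standby1' is a placeholder name)
synchronous_standby_names = 'standby1'  -- empty string means asynchronous replication
synchronous_commit = on                 -- wait for the standby to flush the WAL
-- synchronous_commit = remote_apply    -- stricter: wait until the standby has applied it
-- synchronous_commit = off             -- fastest, but risks losing recent commits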
Level 4: Geographic Distribution
For disaster protection, copies must be geographically separated: a fire, flood, or regional outage should not be able to reach both the production systems and their backups.
The 3-2-1 rule: maintain 3 copies of data, on 2 different types of media, with 1 copy off-site. For databases: production data + local backup + off-site backup. Different media types (disk + tape, or disk + cloud) protect against media-type-specific failures.
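One common way to get an off-site copy of every archived log is to ship each segment to object storage as part of the archive step; the sketch below uses a placeholder bucket name, and in production a dedicated archiving tool is usually preferable to a raw copy command:

-- postgresql.conf: archive each WAL segment locally AND off-site
-- ('s3://example-db-backups' is a placeholder bucket)
archive_mode = on
archive_command = 'cp %p /backup/archive/%f && aws s3 cp %p s3://example-db-backups/wal/%f'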
Backups are the ultimate insurance against media failure. While RAID and replication provide real-time protection, backups provide point-in-time copies that survive even catastrophic events.
Backup Types:
| Backup Type | Contents | Size | Creation Time | Recovery Time |
|---|---|---|---|---|
| Full backup | Complete database | 100% | Longer | Faster (direct restore) |
| Incremental | Changes since last backup | Small | Faster | Slower (chain needed) |
| Differential | Changes since last full | Growing | Moderate | Moderate (full + diff) |
| Log backup | Transaction log entries | Smallest | Fastest | Requires base backup |
Backup Strategy Design:
A well-designed backup strategy balances:
Recovery Point Objective (RPO): Maximum acceptable data loss
Recovery Time Objective (RTO): Maximum acceptable downtime
Storage Costs: Backup storage capacity and performance
Operational Overhead: Time and resources for backup management
Example Backup Strategy for Production Database
================================================

Requirements:
- RPO: 15 minutes (can lose at most 15 min of transactions)
- RTO: 2 hours (must be back online within 2 hours)
- Database size: 500 GB
- Daily change rate: ~5%

Strategy:
---------
Full Backups: Weekly (Sunday 2 AM)
  - Complete database backup
  - ~500 GB, takes ~4 hours
  - Retained for 4 weeks

Differential Backups: Daily (2 AM Mon-Sat)
  - Changes since Sunday's full backup
  - Size grows through week (25-150 GB)
  - Retained for 2 weeks

Log Backups: Every 15 minutes
  - Transaction log to archive storage
  - ~500 MB per backup (varies with activity)
  - Retained for 30 days

Backup Storage:
  - Local: Fast disk array for quick recovery
  - Off-site: Cloud storage (async replicated)
  - Tape: Monthly for long-term archive

Recovery Procedure (Worst Case):
  1. Restore Sunday full backup (~1 hour)
  2. Apply Saturday differential (~30 min)
  3. Apply log backups from Saturday to failure (~30 min)
  4. Open database, verify integrity
  Total: ~2 hours (meets RTO)
  Data loss: Up to 15 minutes (meets RPO)

An unverified backup is not a backup—it's hope. Regularly restore backups to test systems and verify: file integrity (no corruption), restore procedure works, recovery time is within RTO, application functions correctly on restored data. Discover problems during testing, not during an emergency.
Media failure recovery in production environments involves considerations beyond the technical recovery process:
The Human Factor in Media Failures:
A significant percentage of media failures involve human error or can be prevented or exacerbated by human actions: accidental DELETE or DROP statements, commands run against the wrong server, ransomware introduced through compromised accounts, deliberate destruction by insiders, ignored warning signs, and backup procedures that were never tested.
Monitoring and Alerting:
Proactive monitoring can catch problems before they become failures: SMART attributes and drive error counters, RAID array health, disk space and I/O error rates, log archiving status, and the success and age of the most recent backups.
Document your recovery procedures in a runbook that's accessible even during disasters (not only on the production server!). Include: step-by-step procedures, contact information, credentials (securely stored), vendor support numbers, and escalation paths. Practice following the runbook under simulated pressure.
Let's consolidate the key concepts covered in this page:
- A media failure damages or destroys persistent storage itself (data files, log files, or both), so log-based crash recovery alone cannot repair it.
- Causes range from mechanical HDD faults and SSD wear or firmware bugs to fires, floods, ransomware, and human error.
- Losing the transaction log is the worst case, which is why logs belong on separate physical storage from data files.
- Recovery depends entirely on preparation: backups, archived logs, and restore procedures that have actually been tested (full restore with forward recovery, PITR, incremental restore).
- Redundancy at multiple levels (RAID, separate log storage, replication, geographic distribution) prevents any single media failure from causing data loss.
- The interval between the last archived log and the failure is the data loss window; RPO and RTO requirements drive the backup strategy.
What's Next:
We've now examined the three major failure types: transaction, system, and media failures. In the next page, we'll look at Failure Classification—how databases categorize failures, detect them, and choose appropriate recovery strategies. We'll see how the failure type determines the recovery approach and what mechanisms databases use to distinguish between failure types.
You now understand media failures comprehensively—their causes, their devastating effects, and the backup-based recovery strategies required. This knowledge completes your understanding of the failure spectrum from minor (transaction) to catastrophic (media).