The click. Every database administrator dreads it—that distinctive clicking sound a hard drive makes when its read/write heads fail, repeatedly seeking and failing to find the data tracks. In a fraction of a second, years of accumulated data becomes inaccessible. Unlike a power outage where memory is lost but disk survives, a media failure destroys the very persistent storage that was supposed to be our safety net.
Media failures represent the most severe category of database failures. They damage or destroy the persistent storage that normally survives system crashes. The log files that enable crash recovery might themselves be lost. The data files containing the actual database content might be corrupted or unreadable.
This page examines media failures in comprehensive detail—their causes, their devastating effects, and the specialized recovery strategies they require. Understanding media failures is essential because they demand a fundamentally different approach to recovery: backups, archives, and geographic redundancy.
By the end of this page, you will understand what constitutes a media failure, its root causes across hardware and environmental factors, why normal crash recovery is insufficient, and the recovery strategies that make databases resilient against even catastrophic storage loss.
A media failure (also called a hard failure or disk crash) occurs when persistent storage becomes unreadable, corrupted, or physically destroyed.
Formal Definition:
A media failure is an event where the contents of non-volatile storage (data files, log files, or both) are lost or damaged, where restarting the system does not restore access to that data, and where recovery requires information held outside the failed device, such as backups and archived logs.
Media failures are 'hard' failures because they damage the durable storage that was supposed to persist data indefinitely. Unlike system failures where the disk survives and enables recovery, media failures strike at the foundation of durability itself.
| Failure Type | Volatile State (RAM) | Persistent State (Disk) | Recovery Method |
|---|---|---|---|
| Transaction Failure | Preserved | Preserved | Rollback via log |
| System Failure | Lost | Preserved | Crash recovery via log |
| Media Failure | Lost | Damaged/Lost | Backup + archived logs |
Categories of Media Failure:
Media failures can be categorized by what is affected:
1. Data File Failure: The files containing database tables and indexes are damaged. The transaction log may be intact, allowing some recovery.
2. Log File Failure: The transaction log is damaged. This is particularly dangerous because the log is needed for recovery. Without the log, crash recovery is impossible.
3. Complete Media Failure: Both data files and log files are lost. This requires full restoration from backup plus replaying archived log files.
4. Partial Media Failure: Some storage is damaged but other storage is intact. For example, a single disk in an array fails, or corruption affects only certain files.
5. Controller/Channel Failure: The storage interface fails, making otherwise intact disks inaccessible. This may be recoverable by replacing hardware.
Losing the transaction log is catastrophic. Without the log, we cannot determine which transactions were committed, cannot redo committed work, and cannot undo uncommitted work. This is why production databases store logs on separate physical devices from data files—a media failure affecting data files leaves the log available for recovery.
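In PostgreSQL, for example, this separation can be arranged when the cluster is created; the paths below are placeholders for whatever mount points the two devices use:

# Place the write-ahead log (WAL) on a different physical device than the data files
# (PostgreSQL sketch; /data and /logdisk are placeholder mount points)
$ initdb -D /data/pgdata --waldir=/logdisk/pg_wal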
Media failures arise from various sources spanning hardware wear, manufacturing defects, environmental factors, and human errors. Understanding these causes is essential for implementing appropriate preventive measures.
Categories of Media Failure Causes:
2.1 Hard Disk Drive (HDD) Failures
HDDs use spinning magnetic platters with read/write heads flying nanometers above the surface. This mechanical complexity creates multiple failure modes:
Head Crash: The read/write head contacts the spinning platter, damaging both the head and the magnetic surface. Causes include: vibration, shock, contamination, power interruption during operation.
Motor/Bearing Failure: The spindle motor or its bearings fail, preventing the platters from spinning. Common in older drives or drives operating in high-temperature environments.
PCB Failure: The controller board fails due to power surge, component aging, or manufacturing defect. Sometimes recoverable by swapping PCBs (but risky).
Firmware Corruption: The firmware stored on the drive's internal storage becomes corrupted, rendering the drive unusable even if the platters are fine.
| Failure Type | Typical Cause | Warning Signs | Recovery Possibility |
|---|---|---|---|
| Head crash | Shock, wear, contamination | Clicking sounds, read errors | Professional recovery, expensive |
| Motor failure | Age, heat, bearing wear | Grinding, spin-up failure | Motor transplant, very expensive |
| PCB failure | Power surge, component age | Drive not detected, spin-up issues | PCB swap, moderate chance |
| Firmware bug | Vendor bugs, corrupt updates | Drive not recognized | Firmware tools, specialized |
| Bad sectors | Media degradation over time | SMART warnings, slow reads | Remapping, eventual failure |
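Many of these warning signs show up in the drive's SMART data well before total failure. Assuming smartmontools is installed (and with an illustrative device name), a quick health check looks like this:

# Overall health self-assessment and key SMART attributes (device name illustrative)
$ smartctl -H /dev/sda    # reports PASSED or FAILED
$ smartctl -A /dev/sda    # watch Reallocated_Sector_Ct, Current_Pending_Sector,
                          # and UDMA_CRC_Error_Count for early trouble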
2.2 Solid State Drive (SSD) Failures
SSDs have no moving parts but are subject to different failure modes:
Write Exhaustion: Flash cells can only be written a limited number of times (typically 1,000-100,000 cycles depending on technology). Enterprise SSDs track wear level and warn before failure.
Read Disturb: Repeated reads of certain cells can disturb adjacent cells, causing data corruption. Modern SSDs mitigate this with background refresh operations.
Sudden Power Loss: SSDs with inadequate power-loss protection can corrupt their mapping tables or lose unflushed data during unexpected power loss.
Controller/Firmware Bugs: SSDs have complex controllers running firmware. Bugs can cause data loss, performance degradation, or complete failure.
Retention Loss: Stored data in powered-off SSDs can degrade over time (months to years), especially at high temperatures. Enterprise SSDs are rated for longer retention.
Enterprise-grade drives include superior power-loss protection (capacitors), higher endurance ratings, better error correction, and longer warranties. For production databases, the cost premium is trivially small compared to the risk reduction.
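Wear level is also visible through SMART. For an NVMe drive the check might look like the sketch below (device name illustrative; nvme-cli is optional):

# NVMe wear and spare-capacity check (device name illustrative)
$ smartctl -a /dev/nvme0      # look for "Percentage Used" and "Available Spare"
$ nvme smart-log /dev/nvme0   # same counters via nvme-cli, if installed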
2.3 Environmental and Catastrophic Causes
Some media failures affect entire facilities:
Fire: Direct fire damage destroys drives. Smoke and soot contaminate internals. Sprinkler systems cause water damage. Heat damages electronics.
Flood: Water damages electronics and can contaminate platter surfaces. Data centers in flood zones require special protection.
Earthquake: Shock and vibration cause head crashes. Building collapse destroys equipment. Power infrastructure may be damaged.
Temperature/Humidity: Extreme temperature variations stress components. High humidity causes condensation and corrosion. Low humidity increases static risk.
2.4 Human Error and Malicious Acts
Not all media failures are accidental:
Accidental Deletion:
-- The infamous command that should never be run without WHERE
DELETE FROM critical_table;
-- Or worse:
DROP DATABASE production;
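A common safeguard against this class of mistake is to wrap destructive statements in an explicit transaction and inspect the result before committing; the table and predicate below are placeholders:

-- Defensive habit: destructive statements inside an explicit transaction
BEGIN;
DELETE FROM critical_table WHERE created_at < '2020-01-01';  -- placeholder predicate
SELECT count(*) FROM critical_table;  -- sanity-check what remains
ROLLBACK;  -- if the result looks wrong
-- COMMIT;  -- only once the effect is confirmed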
Ransomware: Malware that encrypts files and demands payment. Database files and backups are prime targets. Offline backups are essential protection.
Intentional Destruction: Disgruntled employees or attackers deliberately destroying data. Requires access controls, audit logging, and off-site backups.
Standard crash recovery (Analysis → Redo → Undo) cannot handle media failures because it assumes the disk is intact. Let's understand precisely why:
The Crash Recovery Assumption:
Crash recovery works by reading the log to locate the last checkpoint, scanning the log to determine which transactions committed (Analysis), reapplying committed changes to the data pages on disk (Redo), and rolling back changes from uncommitted transactions (Undo).
Every step assumes disk access works. Media failure breaks this assumption.
| Recovery Step | Normal Crash | Media Failure (Data) | Media Failure (Log) |
|---|---|---|---|
| Find last checkpoint | Read from log ✓ | Read from log ✓ | Cannot read log ✗ |
| Scan log records | Read log forward ✓ | Read log forward ✓ | Log unavailable ✗ |
| Read data pages | Pages on disk ✓ | Pages missing/corrupt ✗ | Pages may be OK |
| Apply redo/undo | Modify pages ✓ | No pages to modify ✗ | No log info ✗ |
| Write fixed pages | Write to disk ✓ | Disk damaged ✗ | May work |
Scenario Analysis:
Case 1: Data Files Lost, Log Intact
This is actually the most recoverable media failure scenario: restore the data files from the most recent backup, apply the archived logs, then apply the surviving online logs, and finish with crash recovery for transactions active at the failure.
Recovery is possible but may take hours depending on backup age and log volume.
Case 2: Log Files Lost, Data Intact
This is extremely problematic: without the log we cannot tell which transactions committed, cannot redo committed work, and cannot undo uncommitted work, so the data files cannot be trusted to be consistent. The options are to accept potential inconsistency or to restore from backup plus archived logs and lose everything since the last archive.
This scenario is why log files should be on separate physical storage from data files.
Case 3: Both Data and Log Lost
Complete reliance on backups: restore the data files from the most recent backup, apply every archived log available off-site, and accept that transactions committed after the last archived log are lost.
The gap between last archived log and failure point is the data loss window.
Scenario: Media Failure Analysis
=================================

Timeline:
---------
00:00  Full backup completed
06:00  Archived log backup completed (logs 1-100)
12:00  Archived log backup completed (logs 101-200)
18:00  Online logs 201-250 on disk
18:30  [MEDIA FAILURE - All disks destroyed]

Case A: Only data disk fails, log disk survives
------------------------------------------------
Recovery:
  1. Restore data files from 00:00 backup
  2. Apply archived logs 1-200 (from backup)
  3. Apply online logs 201-250 (from surviving disk)
  4. Crash recovery to handle active transactions
Data Loss: Zero (full recovery possible)
Recovery Time: Hours (depends on data and log size)

Case B: Only log disk fails, data disk survives
------------------------------------------------
Recovery:
  Option 1: Accept potential inconsistency
    - Database may have uncommitted data
    - Some committed data may be missing
    - Integrity constraints may be violated
  Option 2: Restore from backup + archived logs
    - Full restore to 12:00 state
    - Lose everything from 12:00 to 18:30
Data Loss: ~6.5 hours of work
Recovery Decision: Business/risk judgment required

Case C: All disks destroyed (fire/flood/disaster)
-------------------------------------------------
Recovery:
  1. Restore data files from 00:00 backup
  2. Apply archived logs 1-200 (from off-site backup)
  3. STOP - no more logs available
Data Loss: ~6.5 hours of work (18:30 - 12:00)
Recovery Time: Hours to days (depends on backup location)

Lesson: Archive logs more frequently, replicate real-time

The interval between your last archived log backup and a media failure is your data loss window. Transactions committed in this window may be unrecoverable. Reducing this window (more frequent archiving, synchronous replication) directly reduces potential data loss but adds overhead.
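In PostgreSQL, one way to keep an eye on this window is the archiver statistics view; the query below is a small illustrative check, and any alerting threshold attached to it is a local policy choice rather than a standard value:

-- How far does WAL archiving lag behind? A growing lag is a growing data loss window.
SELECT last_archived_wal,
       last_archived_time,
       now() - last_archived_time AS archive_lag,
       failed_count
FROM pg_stat_archiver;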
Recovering from media failures requires preparation—specifically, backups and archived logs that were created before the failure occurred. Let's examine the recovery strategies:
The Recovery Building Blocks: a usable backup of the data files (full, incremental, or differential), the archived log files generated since that backup was taken, and any online log files that survived the failure.
Strategy 1: Point-in-Time Recovery (PITR)
PITR restores the database to a specific moment in time: restore a base backup taken before that moment, then replay archived logs up to, and no further than, the chosen target time.
PITR is used for: undoing logical errors such as an accidental DELETE or DROP, recovering to the instant just before corruption began, and producing a consistent copy of the database as of a known point in time.
-- PostgreSQL Point-in-Time Recovery Example

-- Step 1: Stop the database server (already stopped due to failure)

-- Step 2: Restore base backup to data directory
-- (Using pg_basebackup or file copy from backup)

-- Step 3: Create recovery.signal file to enter recovery mode
$ touch /var/lib/postgresql/data/recovery.signal

-- Step 4: Configure recovery in postgresql.conf
restore_command = 'cp /backup/archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00'
recovery_target_action = 'promote'

-- Step 5: Start PostgreSQL - it will enter recovery mode
$ pg_ctl start -D /var/lib/postgresql/data

-- PostgreSQL will:
-- 1. Detect recovery.signal
-- 2. Read archived logs using restore_command
-- 3. Apply logs until recovery_target_time
-- 4. Promote to primary (writable) database
-- 5. Delete recovery.signal

-- Step 6: Verify recovery
SELECT pg_is_in_recovery();  -- Should return FALSE
SELECT max(created_at) FROM transactions;  -- Check latest data

Strategy 2: Restore with Forward Recovery
This applies all available logs to reach the most current state: restore the most recent backup, apply every archived log, apply any surviving online logs, and finish with standard crash recovery. A minimal configuration sketch follows.
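In PostgreSQL terms this is the PITR procedure shown above with no recovery target, so replay simply continues until the archive is exhausted and the server promotes itself; a minimal sketch, with placeholder paths:

-- Forward recovery sketch: same restore steps as the PITR example,
-- but with no recovery_target_* settings in postgresql.conf
restore_command = 'cp /backup/archive/%f %p'
-- With recovery.signal present and no target, PostgreSQL replays every
-- archived log it can fetch, then promotes to a writable primary.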
Strategy 3: Incremental Restore
For large databases with incremental backups: restore the most recent full backup, apply each incremental (or the latest differential) in order, and then apply the archived logs generated since the last backup.
Incremental restore is faster than full restore when incrementals are available because less data needs to be copied.
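As one concrete illustration, a backup tool such as pgBackRest chains full and incremental backups automatically; the commands below are a sketch and assume a stanza named 'main' has already been configured:

# Take backups ('main' is an assumed stanza name)
$ pgbackrest --stanza=main --type=full backup    # e.g. weekly full
$ pgbackrest --stanza=main --type=incr backup    # e.g. nightly incremental

# Restore: the tool selects the full backup plus the incrementals it depends on;
# PostgreSQL then replays archived WAL on startup
$ pgbackrest --stanza=main --delta restore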
| Strategy | Data Loss | Recovery Time | Use Case |
|---|---|---|---|
| Full restore + all logs | Minimal (up to last log) | Longest | Maximum recovery |
| PITR to specific time | From target time to failure | Moderate | Undo logical errors |
| Restore backup only | All work since backup | Fastest | Emergency/test |
| Failover to replica | Depends on replication lag | Minutes | High availability |
Recovery procedures that have never been tested are untested assumptions, not plans. Regularly practice restoring from backup, applying logs, and verifying data integrity. Many organizations discover their backup strategy doesn't work only when they desperately need it.
While we cannot prevent individual disk failures, we can design systems where no single media failure causes data loss. This is achieved through redundancy at multiple levels:
Level 1: RAID (Redundant Array of Independent Disks)
| RAID Level | Redundancy | Performance | Disk Failure Tolerance | Database Suitability |
|---|---|---|---|---|
| RAID 0 | None | Highest | None - total data loss | Never use for production |
| RAID 1 | Full mirror | Good reads | 1 disk failure | Good for logs |
| RAID 5 | Single parity | Good | 1 disk failure | OK, rebuild is slow |
| RAID 6 | Double parity | Moderate | 2 disk failures | Better, still slow rebuild |
| RAID 10 | Mirror + stripe | Very good | 1+ disk failures | Excellent for databases |
RAID 10 is the gold standard for database storage: it combines mirroring with striping, delivers strong read and write performance, tolerates at least one disk failure (more if the failures land in different mirror pairs), and rebuilds quickly because only a single mirror must be copied. A setup sketch follows.
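On Linux, a software RAID 10 array could be assembled roughly as follows; device names and the mount point are illustrative, and hardware RAID controllers achieve the same layout through their own tools:

# Build a 4-disk software RAID 10 array (device names illustrative)
$ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ mkfs.ext4 /dev/md0
$ mount /dev/md0 /var/lib/postgresql/data

# Monitor array health; a degraded array keeps running but has lost redundancy
$ cat /proc/mdstat
$ mdadm --detail /dev/md0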
Level 2: Separate Storage for Logs
As we've emphasized, log files should be on physically separate storage from data files:
If data disks fail, the logs survive for recovery. If log disks fail, immediate action is needed but data isn't immediately lost.
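For an existing PostgreSQL cluster, the WAL directory can be relocated to its own device with the server stopped, and the separation verified afterwards; paths are placeholders:

# Relocate WAL to a separate device (server stopped; paths illustrative)
$ mv /data/pgdata/pg_wal /logdisk/pg_wal
$ ln -s /logdisk/pg_wal /data/pgdata/pg_wal

# Verify that data files and WAL now live on different filesystems
$ df -h /data/pgdata /logdisk/pg_wal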
Level 3: Database Replication
Replication maintains synchronized copies of the database on different servers:
Synchronous Replication: the primary waits for the replica to confirm it has safely received each transaction's log records before acknowledging the commit, so no committed transaction is lost when the primary's storage fails, at the cost of added commit latency.
Asynchronous Replication: the primary commits without waiting for the replica, which keeps commits fast but means transactions committed during the replication lag can be lost if the primary's storage is destroyed. A configuration sketch follows.
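In PostgreSQL, for instance, the choice between the two modes comes down largely to two settings on the primary; the standby name is a placeholder:

-- postgresql.conf on the primary (sketch; 'standby1' is a placeholder name)
synchronous_standby_names = 'standby1'  -- empty string means asynchronous replication
synchronous_commit = on                 -- wait for the standby to flush the WAL
-- synchronous_commit = remote_apply    -- stricter: wait until the standby has applied it
-- synchronous_commit = off             -- fastest, but risks losing recent commits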
Level 4: Geographic Distribution
For disaster protection, copies must be geographically separated: a fire, flood, or regional outage should not be able to reach both the production systems and their backups.
The 3-2-1 rule: maintain 3 copies of data, on 2 different types of media, with 1 copy off-site. For databases: production data + local backup + off-site backup. Different media types (disk + tape, or disk + cloud) protect against media-type-specific failures.
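One common way to get an off-site copy of every archived log is to ship each segment to object storage as part of the archive step; the sketch below uses a placeholder bucket name, and in production a dedicated archiving tool is usually preferable to a raw copy command:

-- postgresql.conf: archive each WAL segment locally AND off-site
-- ('s3://example-db-backups' is a placeholder bucket)
archive_mode = on
archive_command = 'cp %p /backup/archive/%f && aws s3 cp %p s3://example-db-backups/wal/%f'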
Backups are the ultimate insurance against media failure. While RAID and replication provide real-time protection, backups provide point-in-time copies that survive even catastrophic events.
Backup Types:
| Backup Type | Contents | Size | Creation Time | Recovery Time |
|---|---|---|---|---|
| Full backup | Complete database | 100% | Longer | Faster (direct restore) |
| Incremental | Changes since last backup | Small | Faster | Slower (chain needed) |
| Differential | Changes since last full | Growing | Moderate | Moderate (full + diff) |
| Log backup | Transaction log entries | Smallest | Fastest | Requires base backup |
Backup Strategy Design:
A well-designed backup strategy balances:
Recovery Point Objective (RPO): Maximum acceptable data loss
Recovery Time Objective (RTO): Maximum acceptable downtime
Storage Costs: Backup storage capacity and performance
Operational Overhead: Time and resources for backup management
Example Backup Strategy for Production Database
================================================

Requirements:
- RPO: 15 minutes (can lose at most 15 min of transactions)
- RTO: 2 hours (must be back online within 2 hours)
- Database size: 500 GB
- Daily change rate: ~5%

Strategy:
---------
Full Backups: Weekly (Sunday 2 AM)
  - Complete database backup
  - ~500 GB, takes ~4 hours
  - Retained for 4 weeks

Differential Backups: Daily (2 AM Mon-Sat)
  - Changes since Sunday's full backup
  - Size grows through week (25-150 GB)
  - Retained for 2 weeks

Log Backups: Every 15 minutes
  - Transaction log to archive storage
  - ~500 MB per backup (varies with activity)
  - Retained for 30 days

Backup Storage:
  - Local: Fast disk array for quick recovery
  - Off-site: Cloud storage (async replicated)
  - Tape: Monthly for long-term archive

Recovery Procedure (Worst Case):
  1. Restore Sunday full backup (~1 hour)
  2. Apply Saturday differential (~30 min)
  3. Apply log backups from Saturday to failure (~30 min)
  4. Open database, verify integrity
  Total: ~2 hours (meets RTO)
  Data loss: Up to 15 minutes (meets RPO)

An unverified backup is not a backup—it's hope. Regularly restore backups to test systems and verify: file integrity (no corruption), restore procedure works, recovery time is within RTO, application functions correctly on restored data. Discover problems during testing, not during an emergency.
Media failure recovery in production environments involves considerations beyond the technical recovery process:
The Human Factor in Media Failures:
A significant percentage of media failures involve human error or can be prevented or exacerbated by human actions: accidental DELETE or DROP statements, commands run against the wrong server, ransomware introduced through compromised accounts, deliberate destruction by insiders, ignored warning signs, and backup procedures that were never tested.
Monitoring and Alerting:
Proactive monitoring can catch problems before they become failures: SMART attributes and drive error counters, RAID array health, disk space and I/O error rates, log archiving status, and the success and age of the most recent backups.
Document your recovery procedures in a runbook that's accessible even during disasters (not only on the production server!). Include: step-by-step procedures, contact information, credentials (securely stored), vendor support numbers, and escalation paths. Practice following the runbook under simulated pressure.
Let's consolidate the key concepts covered in this page:
- A media failure damages or destroys persistent storage itself (data files, log files, or both), so log-based crash recovery alone cannot repair it.
- Causes range from mechanical HDD faults and SSD wear or firmware bugs to fires, floods, ransomware, and human error.
- Losing the transaction log is the worst case, which is why logs belong on separate physical storage from data files.
- Recovery depends entirely on preparation: backups, archived logs, and restore procedures that have actually been tested (full restore with forward recovery, PITR, incremental restore).
- Redundancy at multiple levels (RAID, separate log storage, replication, geographic distribution) prevents any single media failure from causing data loss.
- The interval between the last archived log and the failure is the data loss window; RPO and RTO requirements drive the backup strategy.
What's Next:
We've now examined the three major failure types: transaction, system, and media failures. In the next page, we'll look at Failure Classification—how databases categorize failures, detect them, and choose appropriate recovery strategies. We'll see how the failure type determines the recovery approach and what mechanisms databases use to distinguish between failure types.
You now understand media failures comprehensively—their causes, their devastating effects, and the backup-based recovery strategies required. This knowledge completes your understanding of the failure spectrum from minor (transaction) to catastrophic (media).