Throughout this module, we've focused on getting data from applications to persistent storage correctly—write-through for durability, write-back for performance, ordered writes for consistency. But there's a deeper question we haven't addressed: How do we know the data on disk is still correct after it's written?
Data can become corrupted silently—without the drive, the filesystem, or the application reporting any error.
The terrifying reality: a study by CERN found that 1 in 10^7 to 10^8 bytes in large storage systems contains silent errors that go undetected by disk-level ECC. At petabyte scale, this means thousands of corrupt bytes per system.
This page explores how systems detect, prevent, and recover from data corruption—the strategies that enable us to trust long-term data storage.
By the end of this page, you will understand the full taxonomy of data corruption, the mathematical foundations of checksums and error detection, end-to-end integrity verification techniques, and how production systems balance integrity checking against performance. You'll be equipped to design systems that detect corruption before it causes harm.
To protect data integrity, we must first understand the failure modes we're protecting against. Each mode has different characteristics and requires different countermeasures.
Corruption Categories
| Failure Mode | Cause | Detection Method | Recovery Possibility |
|---|---|---|---|
| Bit Rot | Media degradation, radiation | Checksums | Redundancy (RAID, replicas) |
| Phantom Write | Firmware bug, lost write | Read verification | WAL replay, redundancy |
| Misdirected Write | FTL bug, bad sector map | Block-level checksum with address | Redundancy |
| Misdirected Read | Controller error | End-to-end checksum | Retry from different replica |
| Torn Write | Power loss mid-write | Checksum + logging | WAL replay, COW |
| Metadata Corruption | Bug, crash, bit flip | Journal, checksummed metadata | Journal replay, fsck |
| Silent Corruption (DRAM) | Radiation, voltage glitch | ECC memory | ECC correction or detection |
The Silent Corruption Problem
Most of these failures are silent—the system doesn't know corruption occurred until corrupted data is read. By then, it may be too late:
Timeline of silent corruption:
Day 1:  Write block B containing important data → disk returns success
Day 5:  Bit flip occurs in block B on the media → no read, no detection
Day 30: Make a backup → the backup contains corrupted block B!
Day 60: Try to read block B → ERROR or wrong data returned
        The backups are also corrupt → the data is lost
The key insight: detection must happen at read time, not just write time. And ideally, detection should happen during routine scrubbing, before the data is actually needed.
Corruption Rates in Practice
Research from NetApp, Google, CERN, and others provides real-world corruption statistics:
- SATA drives: URE (Unrecoverable Read Error) rate of 1 in 10^14 to 10^15 bits
  - roughly 1-10 UREs per 10 TB read over the device lifetime
- Enterprise SAS drives: 1 in 10^15 to 10^16 bits
  - roughly 0.1-1 URE per 10 TB
- Silent corruption (past disk-level ECC): 1 in 10^7 to 10^8 bytes
  - CERN study: 128 undetected errors per PB per year
- DRAM bit flips: 1-5% of DIMMs per year experience errors
  - Google study: rates much higher than previously expected
These numbers may seem small, but at scale they're significant. A 100 PB storage system might expect 10,000+ silent corruption events per year.
Storage device failure rates follow a 'bathtub curve': high early failures (infant mortality), low middle-life failures, and rising late-life failures. But silent corruption can happen anytime. Don't wait for device age to implement integrity checking—start from day one.
Checksums are the primary mechanism for detecting data corruption. Understanding their properties enables choosing the right checksum for each use case.
What a Checksum Provides
A checksum is a fixed-size value computed from data that changes (with high probability) if the data changes. It's a fingerprint of the data.
Checksum(data) → fixed-size value
Properties:
- Deterministic: Same data always produces same checksum
- Fixed output size: Regardless of input size
- Sensitivity: Any change to input changes output
- Distribution: Outputs uniformly distributed (for good checksums)
What it CAN detect:
- Random bit flips
- Truncation
- Insertions/deletions
- Block transposition (with block-addressed checksums)
What it CANNOT reliably detect:
- Intentional tampering (need cryptographic hash)
- Errors that happen to produce matching checksum (rare but possible)
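As a quick sanity check on these properties, here is a minimal sketch using Python's built-in zlib.crc32 (a plain CRC-32, chosen only because it needs no extra dependencies):
import zlib

data = b"hello, storage"

# Deterministic: the same input always yields the same checksum
assert zlib.crc32(data) == zlib.crc32(data)

# Fixed output size: a 4-byte input and a 4 KiB block both map to 32 bits
print(f"{zlib.crc32(b'abcd'):#010x}")        # a 32-bit value
print(f"{zlib.crc32(bytes(4096)):#010x}")    # also a 32-bit value

# Sensitivity: flipping a single bit changes the checksum
flipped = bytearray(data)
flipped[0] ^= 0x01
print(zlib.crc32(data) != zlib.crc32(bytes(flipped)))   # True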
Common Checksum Algorithms
| Algorithm | Size | Speed (GB/s) | Collision Resistance | Use Case |
|---|---|---|---|---|
| XOR | Variable | 10+ | Very weak | Quick parity, not recommended |
| CRC-32 | 32 bits | 5-8 | Good for random errors | Ethernet, ZIP files |
| CRC-32C (Castagnoli) | 32 bits | 20-40 (with SSE4.2) | Better than CRC-32 | iSCSI, ext4, btrfs |
| xxHash | 64/128 bits | 15-30 | Good | General purpose, fast |
| SHA-256 | 256 bits | 0.5-2 | Cryptographically strong | Content-addressed storage |
| BLAKE3 | 256 bits | 5-10 | Cryptographically strong | Modern crypto hash |
| Fletcher-64 | 64 bits | 8-12 | Medium | ZFS (legacy) |
CRC-32C: The Modern Standard
CRC-32C (Castagnoli polynomial) is widely used in storage because its error-detection properties are strong for the random and burst errors seen on real devices, and because modern x86 CPUs accelerate it in hardware via the SSE4.2 crc32 instruction:
#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>

// Hardware-accelerated CRC-32C using SSE4.2 (compile with -msse4.2)
uint32_t crc32c_hardware(const void *data, size_t len) {
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFF;

    // Process 8 bytes at a time
    while (len >= 8) {
        crc = (uint32_t)_mm_crc32_u64(crc, *(const uint64_t *)p);
        p += 8;
        len -= 8;
    }

    // Process remaining bytes
    while (len--) {
        crc = _mm_crc32_u8(crc, *p++);
    }

    return crc ^ 0xFFFFFFFF;
}
// Performance: 20-40 GB/s on modern CPUs
// Compare to disk speed: 3-7 GB/s for NVMe
// Checksum is NOT the bottleneck
Collision Probability
For a 32-bit checksum, the probability of two random inputs having the same checksum is 2^-32 ≈ 2.3 × 10^-10. This seems small, but by the birthday approximation (≈ n² × p / 2 expected colliding pairs) it adds up quickly:
At 1 billion (10^9) blocks:
Expected collisions ≈ 10^9 × 10^9 × 2.3 × 10^-10 / 2 ≈ 1.2 × 10^8
A 64-bit checksum pushes this far out:
Collision probability: 2^-64 ≈ 5.4 × 10^-20
At 1 billion (10^9) blocks:
Expected collisions ≈ 10^9 × 10^9 × 5.4 × 10^-20 / 2 ≈ 0.027
(These collision counts matter for content addressing and deduplication, where distinct blocks must never share a hash. For plain error detection, the relevant figure is the per-block chance that a corruption slips past verification—simply 2^-32 or 2^-64.)
This is why ZFS uses 256-bit checksums—collision probability is negligible even at extreme scale.
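The arithmetic above is the standard birthday approximation, and it is easy to re-run for other checksum widths and block counts; a small helper (illustrative only):
def expected_collisions(num_blocks: int, checksum_bits: int) -> float:
    """Birthday approximation: n*(n-1)/2 pairs, each colliding with probability 2^-bits."""
    pairs = num_blocks * (num_blocks - 1) / 2
    return pairs / float(2 ** checksum_bits)

print(expected_collisions(10**9, 32))     # ~1.2e8 expected collisions
print(expected_collisions(10**9, 64))     # ~0.027
print(expected_collisions(10**12, 256))   # effectively zero (ZFS-scale checksums)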
For error detection: CRC-32C (hardware accelerated, good enough for most cases). For content-addressing: SHA-256 or BLAKE3 (collision resistance matters). For simple cases: xxHash (fast, good distribution). Never use MD5 or SHA-1 for security-sensitive applications.
Device-level checksums (like sector ECC) protect only one segment of the data path. True data integrity requires end-to-end protection: verifying that what the application reads is exactly what it wrote.
The Data Path Vulnerability Points
Application Buffer (DRAM)
│
│ ← Bus errors, DMA errors, DRAM bit flips
▼
Kernel Page Cache (DRAM)
│
│ ← Same DRAM risks, plus kernel memory corruption
▼
Filesystem / Block Layer
│
│ ← Software bugs, misdirected I/O
▼
Device Driver
│
│ ← Driver bugs, incorrect command formatting
▼
HBA/Controller (often has DRAM)
│
│ ← Controller firmware bugs, controller DRAM errors
▼
Network (if SAN/iSCSI/NFS)
│
│ ← Packet corruption, routing errors
▼
Device Controller (has DRAM, cache)
│
│ ← Firmware bugs, cache corruption
▼
Persistent Media
│
│ ← Bit rot, media degradation
▼
[Read path goes back up with same risks]
Each layer might have checksums, but layer-specific checksums don't catch cross-layer errors. A misdirected write (correct checksum, wrong location) isn't detected by media ECC.
End-to-End Verification Architecture
True end-to-end integrity means:
Application writes block B at logical address L:
1. Compute: checksum = CRC32C(L || B)
Including L in checksum detects misdirected writes
2. Write: [L][B][checksum] → storage
3. Read at address L:
- Fetch [B][checksum]
- Compute: expected = CRC32C(L || B)
- Compare: if (checksum != expected) → CORRUPTION!
4. If corruption detected:
- Attempt read from replica/parity
- Log error for analysis
- Mark block as suspect
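A minimal sketch of steps 1-3 in Python, using zlib.crc32 as a stand-in for hardware CRC-32C and a dict as the "device" (the BlockStore name and structure are illustrative, not from any particular system):
import struct
import zlib

class CorruptionError(Exception):
    pass

class BlockStore:
    """Toy block store that checksums (logical address || data) on every write."""

    def __init__(self):
        self.blocks = {}   # logical address -> (data, checksum)

    def write(self, addr: int, data: bytes) -> None:
        # Folding the address into the checksum means data written to the
        # wrong location can never verify as the block it pretends to be.
        checksum = zlib.crc32(struct.pack(">Q", addr) + data)
        self.blocks[addr] = (data, checksum)

    def read(self, addr: int) -> bytes:
        data, stored = self.blocks[addr]
        expected = zlib.crc32(struct.pack(">Q", addr) + data)
        if stored != expected:
            # Step 4 in a real system: retry from a replica, log, mark block suspect
            raise CorruptionError(f"checksum mismatch at logical address {addr}")
        return data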
ZFS: The Gold Standard for End-to-End Integrity
ZFS pioneered comprehensive end-to-end integrity in general-purpose filesystems:
ZFS Integrity Features:
1. Block-level checksums
- 256-bit checksum for every block
- Checksum stored in PARENT block's pointer
- Cannot have checksum-storage mismatch
2. Merkle tree verification
- Root block checksum verifies children
- Children verify grandchildren
- Down to leaf data blocks
- Corruption anywhere is detected at root
3. Self-healing with redundancy
- On checksum mismatch, read from mirror/parity
- If redundant copy is good, repair bad copy
- Silent self-healing, no data loss
4. Scrubbing
- Background process reads all blocks
- Verifies checksums
- Repairs from redundancy if possible
- Reports errors for blocks without redundancy
The parent pointer design is crucial: if the checksum were stored with the data, a misdirected write could overwrite both data and checksum. By storing the checksum in the parent, a misdirected write is immediately detected when the parent's pointer checksum fails.
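The same idea in a few illustrative lines (hypothetical names, not ZFS's actual on-disk structures): the parent records both where the child lives and what it must hash to, so a phantom or misdirected write is exposed the moment the child is read through its parent pointer.
import hashlib
from dataclasses import dataclass

@dataclass
class BlockPointer:
    """Parent-side pointer: the child's address plus the checksum it must match."""
    address: int
    checksum: bytes   # stored in the PARENT, not alongside the child's data

def read_child(device: dict, ptr: BlockPointer) -> bytes:
    data = device[ptr.address]
    if hashlib.sha256(data).digest() != ptr.checksum:
        # Stale or foreign data at this address fails the parent's expectation
        raise IOError(f"block at address {ptr.address} fails parent checksum")
    return data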
T10 DIF (Data Integrity Field) and DIX (Data Integrity Extension) extend protection to hardware. The OS attaches a protection field (checksum + block address + version) that travels with data through the storage stack. Enterprise HBAs and drives verify DIF at each layer. Not widely deployed but very effective where available.
Different filesystems provide different levels of integrity protection. Understanding these differences is crucial for selecting the right filesystem for data-critical applications.
Feature Comparison
| Feature | ext4 | XFS | Btrfs | ZFS | NTFS |
|---|---|---|---|---|---|
| Metadata Checksums | Optional* | Yes (v5) | Yes | Yes | Yes |
| Data Checksums | No | No | Yes | Yes | No |
| Self-Healing | No | No | With RAID | With RAID | No |
| Scrubbing | No | No | Yes | Yes | No |
| COW for Atomicity | No | No | Yes | Yes | No |
| Snapshot Support | No | No | Yes | Yes | Yes |
| Checksum Algorithm | CRC32C | CRC32C | CRC32C, SHA-256 | Fletcher4, SHA-256, etc. | N/A |
*ext4 metadata checksums available with metadata_csum feature (default in newer versions)
ext4 Integrity Strategy
ext4 provides minimal built-in integrity, relying on external tools:
ext4 Approach:
- Journal for metadata consistency (not integrity)
- Optional metadata checksums (mkfs.ext4 -O metadata_csum)
- No data checksums
- Relies on RAID for data protection
- e2fsck for offline repair
When to use ext4:
- Compatibility requirements
- Hardware RAID providing protection
- Applications do their own checksumming
- Performance over safety trade-off
Btrfs Integrity Strategy
Btrfs provides comprehensive checksumming with self-healing:
Btrfs Approach:
- CRC32C checksums for all data and metadata
- Checksums stored in separate checksum tree
- COW ensures atomic updates
- Self-healing with RAID1/5/6
- Online scrubbing
Commands:
btrfs scrub start /mount/point # Start scrub
btrfs scrub status /mount/point # Check progress
btrfs device stats /mount/point # View error counters
# Example output
[/dev/sda].corruption_errs: 2
[/dev/sda].read_errs: 0
[/dev/sda].write_errs: 0
[/dev/sdb].corruption_errs: 0 # Mirror copy was good
ZFS Integrity Architecture
ZFS has the most comprehensive integrity system:
# ZFS Integrity Configuration and Monitoring

# Check checksum algorithm (fletcher4, sha256, sha512, skein, edonr)
zfs get checksum tank/dataset
# NAME          PROPERTY  VALUE   SOURCE
# tank/dataset  checksum  sha256  inherited

# Set stronger checksum for critical data
zfs set checksum=sha256 tank/critical_data

# Enable deduplication (uses checksum for block comparison)
zfs set dedup=on tank/dedupable_data
# WARNING: Dedup is RAM-hungry, needs ~320 bytes per block

# View data integrity statistics
zpool status tank
# pool: tank
# state: ONLINE
# status: One or more devices has experienced an unrecoverable error.
#         Applications are unaffected.
# action: Online the device using 'zpool online' or replace the device.
# scan: scrub repaired 128K in 02:34:56 with 0 errors
# config:
#   NAME        STATE   READ WRITE CKSUM
#   tank        ONLINE     0     0     0
#     mirror-0  ONLINE     0     0     0
#       sda     ONLINE     0     0     2   <- 2 checksum errors!
#       sdb     ONLINE     0     0     0   <- was used for repair

# Start a scrub
zpool scrub tank

# Schedule regular scrubs (cron)
# Weekly scrub is typical for production
0 2 * * 0 /sbin/zpool scrub tank

# View detailed error log
zpool status -v tank
# Shows files affected by errors

# Clear error counters after replacing bad disk
zpool clear tank

If your filesystem supports scrubbing (ZFS, Btrfs), run it regularly (weekly or monthly). Scrubbing finds latent corruption before it's needed—when repair from redundancy is still possible. Without scrubbing, corruption may not be detected until all copies are bad.
Even with checksumming filesystems, applications may need additional integrity measures of their own—protection that travels with the data regardless of which filesystem, transport, or backup medium it crosses. Three common patterns follow.
Pattern 1: Record-Level Checksums
import hashlib
import struct

class IntegrityError(Exception):
    """Raised when a stored checksum does not match the recomputed one."""

class ChecksummedRecord:
    HEADER_SIZE = 36  # 4 bytes length + 32 bytes SHA-256

    def __init__(self, data):
        self.data = data
        self.checksum = hashlib.sha256(data).digest()

    def serialize(self):
        length = len(self.data)
        return struct.pack(">I", length) + self.checksum + self.data

    @classmethod
    def deserialize(cls, raw):
        length = struct.unpack(">I", raw[:4])[0]
        stored_checksum = raw[4:36]
        data = raw[36:36 + length]

        computed_checksum = hashlib.sha256(data).digest()
        if stored_checksum != computed_checksum:
            raise IntegrityError(
                f"Checksum mismatch: expected {stored_checksum.hex()}, "
                f"got {computed_checksum.hex()}"
            )

        record = cls.__new__(cls)
        record.data = data
        record.checksum = stored_checksum
        return record
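A short round-trip shows the failure mode this guards against; flipping one byte of the serialized record makes deserialization fail loudly:
raw = ChecksummedRecord(b"important payload").serialize()
assert ChecksummedRecord.deserialize(raw).data == b"important payload"

damaged = bytearray(raw)
damaged[-1] ^= 0x01          # corrupt one byte of the payload
try:
    ChecksummedRecord.deserialize(bytes(damaged))
except IntegrityError as exc:
    print("corruption detected:", exc)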
Pattern 2: Content-Addressed Storage
Deduplicate and verify in one step:
import hashlib
import os

class ContentAddressedStore:
    """
    Store objects by their content hash.

    Advantages:
    - Automatic deduplication
    - Built-in integrity verification
    - Immutable objects (content defines identity)

    Used by: Git, Docker images, IPFS, many backup systems
    """

    def __init__(self, base_path):
        self.base_path = base_path
        os.makedirs(base_path, exist_ok=True)

    def _hash(self, data):
        """SHA-256 content hash."""
        return hashlib.sha256(data).hexdigest()

    def _object_path(self, content_hash):
        """Two-level directory structure like Git."""
        return os.path.join(
            self.base_path,
            content_hash[:2],
            content_hash[2:]
        )

    def put(self, data):
        """
        Store data, return content hash.
        Integrity is automatic: if hash matches, data is correct.
        """
        content_hash = self._hash(data)
        path = self._object_path(content_hash)

        # Already exists = already verified (by previous write)
        if os.path.exists(path):
            return content_hash

        # Write to temp, rename atomically
        dir_path = os.path.dirname(path)
        os.makedirs(dir_path, exist_ok=True)

        temp_path = path + '.tmp'
        with open(temp_path, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(temp_path, path)

        return content_hash

    def get(self, content_hash):
        """
        Retrieve data by hash.
        Automatically verifies integrity on read.
        """
        path = self._object_path(content_hash)

        if not os.path.exists(path):
            raise KeyError(f"Object not found: {content_hash}")

        with open(path, 'rb') as f:
            data = f.read()

        # Verify integrity
        actual_hash = self._hash(data)
        if actual_hash != content_hash:
            raise IntegrityError(
                f"Corruption detected: expected {content_hash}, "
                f"got {actual_hash}"
            )

        return data

    def verify_all(self):
        """Scrub all objects, report corrupted ones."""
        corrupted = []
        for root, dirs, files in os.walk(self.base_path):
            for filename in files:
                path = os.path.join(root, filename)

                # Reconstruct expected hash from path
                rel_path = os.path.relpath(path, self.base_path)
                expected_hash = rel_path.replace(os.sep, '')

                try:
                    self.get(expected_hash)
                except IntegrityError:
                    corrupted.append(expected_hash)
        return corrupted

Pattern 3: Merkle Trees for Efficient Verification
For large datasets, verify incrementally:
Full verification: hashing 1TB takes minutes
Must read the entire dataset
Merkle verification of the same 1TB:
Level 0: 256MB chunks, hash each → 4096 hashes
Level 1: groups of 16 chunk hashes → 256 hashes
Level 2: groups of 16 → 16 hashes
Level 3: root hash → 1 hash
Total: ~4,400 hashes ≈ 140KB of 32-byte hashes for 1TB of data
To verify ONE chunk changed:
- Hash the chunk (one 32-byte leaf hash to update)
- Recompute the parent path to the root (4 levels, a handful of hash computations)
- Compare root hash
Only the 256MB chunk and its parent path need to be read, not 1TB
This is exactly how Git verifies repository integrity—any change anywhere is detected by checking the root commit hash.
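A compact sketch of the same structure (SHA-256 with a binary fan-out for brevity, rather than the 16-way grouping described above):
import hashlib

def _h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaf_hashes: list) -> bytes:
    """Fold leaf hashes pairwise until a single root hash remains."""
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:                     # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

chunks = [b"chunk-%d" % i for i in range(8)]   # stand-ins for 256MB chunks
leaves = [_h(c) for c in chunks]
root = merkle_root(leaves)

# Modify one chunk: only its leaf hash needs recomputing, and the root changes
leaves[3] = _h(b"tampered chunk")
assert merkle_root(leaves) != root             # corruption detected at the root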
Databases provide their own page-level protection: PostgreSQL has page checksums (data_checksums, enabled via initdb --data-checksums), SQLite offers an optional checksum VFS shim, and InnoDB checksums every page. Enable these features! They catch corruption before it propagates to backups and replicas.
Detection is only half the battle. When corruption is found, systems need recovery strategies.
Recovery Hierarchy
Order of recovery attempts (most to least desirable):
1. Self-heal from redundancy
- Read from mirror/parity
- Repair corrupted copy
- User never knows
2. Replay from transaction log
- WAL contains original data
- Replay recreates correct state
- Applies to crash recovery too
3. Restore from backup
- Most recent backup
- Check backup isn't also corrupt!
- May lose recent changes
4. Reconstruct from replicas
- Distributed system: fetch from other nodes
- Eventually consistent: latest wins
5. Partial recovery
- Salvage what's readable
- Mark bad blocks
- User decides how to proceed
6. Accept data loss
- Log the loss
- Alert operations
- Move on
Self-Healing Implementation
# Assumes helpers not shown here: compute_checksum(), log_info()/log_error(),
# self._mark_suspect()/self._get_all_keys(), and UnrecoverableCorruptionError.
class SelfHealingStorage:
    """
    Storage with automatic corruption repair from replicas.
    """

    def __init__(self, replicas):
        """
        replicas: list of storage backends (local paths, network URLs, etc.)
        """
        self.replicas = replicas
        self.min_replicas_for_read = 1
        self.quorum_for_repair = len(replicas) // 2 + 1

    def read(self, key):
        """
        Read with automatic verification and repair.
        """
        results = []

        # Read from all replicas
        for replica in self.replicas:
            try:
                data, checksum = replica.read(key)
                actual_checksum = compute_checksum(data)
                results.append({
                    'replica': replica,
                    'data': data,
                    'checksum': checksum,
                    'valid': checksum == actual_checksum
                })
            except Exception as e:
                results.append({
                    'replica': replica,
                    'error': e,
                    'valid': False
                })

        # Count valid copies
        valid_results = [r for r in results if r.get('valid')]
        invalid_replicas = [r['replica'] for r in results if not r.get('valid')]

        if len(valid_results) == 0:
            # No valid copies - data is lost
            raise UnrecoverableCorruptionError(
                f"No valid copies of {key} found across {len(self.replicas)} replicas"
            )

        # Use first valid result
        correct_data = valid_results[0]['data']
        correct_checksum = valid_results[0]['checksum']

        # Repair invalid replicas asynchronously
        if invalid_replicas:
            self._schedule_repair(key, correct_data, correct_checksum, invalid_replicas)

        return correct_data

    def _schedule_repair(self, key, data, checksum, bad_replicas):
        """
        Repair corrupted replicas in background.
        """
        for replica in bad_replicas:
            try:
                replica.write(key, data, checksum)
                log_info(f"Repaired {key} on {replica.name}")
            except Exception as e:
                log_error(f"Failed to repair {key} on {replica.name}: {e}")
                # Mark replica as suspect for ops review
                self._mark_suspect(replica)

    def scrub(self):
        """
        Background verification of all data.
        Should run regularly (daily/weekly).
        """
        all_keys = self._get_all_keys()
        issues_found = []

        for key in all_keys:
            try:
                self.read(key)  # Read triggers verification and repair
            except UnrecoverableCorruptionError as e:
                issues_found.append({'key': key, 'error': str(e)})

        return issues_found

Backup Verification
Backups themselves can be corrupted. Always verify:
# PostgreSQL: Verify backup integrity
pg_verifybackup /path/to/backup
# tar: Read through the archive to detect structural corruption
tar -tf backup.tar > /dev/null
# (or `tar --compare -f backup.tar` to diff the archive against the filesystem)
# restic: Verify backup integrity
restic -r /backup/repo check --read-data
# ZFS: Send with checksum verification
zfs send -c pool/dataset | zfs receive -F backup_pool/dataset
# -c sends compressed, preserves checksums
Keep 3 copies of critical data, on 2 different media types, with 1 offsite. This protects against: device failure (multiple copies), media-specific issues (different types), and site disasters (offsite). And verify checksums on ALL copies regularly.
Production systems need continuous monitoring of data integrity metrics. Problems caught early are much easier to resolve.
Key Metrics to Monitor
#!/bin/bash
# Integrity monitoring script for Linux systems

# 1. ZFS pool health
zpool_errors() {
    zpool list -H -o name,health
    # Flag pools whose device lines show a non-zero CKSUM count
    for pool in $(zpool list -H -o name); do
        cksum_errors=$(zpool status "$pool" | awk '$5 ~ /^[0-9]+$/ && $5 > 0' | wc -l)
        if [ "$cksum_errors" -gt 0 ]; then
            echo "WARNING: $pool has checksum errors"
            zpool status "$pool"
        fi
    done
}

# 2. Btrfs device stats
btrfs_errors() {
    for mount in $(grep btrfs /proc/mounts | cut -d' ' -f2); do
        errors=$(btrfs device stats "$mount" 2>/dev/null | grep -v " 0$")
        if [ -n "$errors" ]; then
            echo "WARNING: Btrfs errors on $mount"
            echo "$errors"
        fi
    done
}

# 3. SMART health
smart_health() {
    for disk in /dev/sd?; do
        health=$(smartctl -H "$disk" 2>/dev/null | grep "SMART overall-health")
        if [[ $health != *"PASSED"* ]]; then
            echo "WARNING: SMART failure on $disk"
            smartctl -A "$disk" | grep -E "(Reallocated|Pending|Uncorrectable)"
        fi
    done
}

# 4. Memory errors (ECC)
memory_errors() {
    # Requires the EDAC driver
    if [ -d /sys/devices/system/edac/mc ]; then
        for mc in /sys/devices/system/edac/mc/mc*; do
            ce=$(cat "$mc/ce_count" 2>/dev/null || echo 0)
            ue=$(cat "$mc/ue_count" 2>/dev/null || echo 0)
            if [ "$ce" -gt 0 ] || [ "$ue" -gt 0 ]; then
                echo "WARNING: Memory errors - CE: $ce, UE: $ue"
            fi
        done
    fi
}

# 5. Kernel message analysis
kernel_errors() {
    # Look for I/O errors in recent logs (-E so the alternation is honored)
    dmesg | grep -iE "error|fail|corrupt" | tail -20
}

# Run all checks
echo "=== Integrity Check $(date) ==="
zpool_errors
btrfs_errors
smart_health
memory_errors
kernel_errors

Alerting Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Checksum errors (total) | >0 | >10/day | Investigate disk |
| Scrub errors | >0 | Any uncorrectable | Replace failed device |
| SMART reallocated sectors | >0 | >100 or growing | Plan replacement |
| ECC correctable errors | >100/month | >1000/month | Monitor closely |
| ECC uncorrectable | >0 | >0 | Immediate DIMM replacement |
Prometheus Metrics Example
# Prometheus metrics for integrity monitoring
- name: storage_checksum_errors_total
help: Total number of checksum verification failures
type: counter
labels:
- device
- pool
- name: storage_silent_repairs_total
help: Corruptions repaired silently from redundancy
type: counter
labels:
- device
- pool
- name: storage_last_scrub_timestamp
help: Unix timestamp of last completed scrub
type: gauge
labels:
- pool
- name: storage_last_scrub_errors
help: Errors found in most recent scrub
type: gauge
labels:
- pool
- error_type
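If a custom agent exports these metrics, the Python prometheus_client library is one straightforward option. A sketch assuming that library is installed; the collection logic is a stub to be replaced with real zpool/btrfs parsing:
import time
from prometheus_client import Counter, Gauge, start_http_server

checksum_errors = Counter(
    "storage_checksum_errors_total",
    "Total number of checksum verification failures",
    ["device", "pool"],
)
last_scrub_ts = Gauge(
    "storage_last_scrub_timestamp",
    "Unix timestamp of last completed scrub",
    ["pool"],
)

def collect_once():
    # Stub: in practice, parse `zpool status` / `btrfs device stats` output here
    checksum_errors.labels(device="sda", pool="tank").inc(0)
    last_scrub_ts.labels(pool="tank").set(time.time())

if __name__ == "__main__":
    start_http_server(9101)   # metrics exposed at http://localhost:9101/metrics
    while True:
        collect_once()
        time.sleep(60)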
For critical data: weekly scrubs. For general data: monthly. For archival: before any restore operation. Scrub duration increases with data size—a 100TB array might take 24-48 hours. Schedule during low-usage periods.
Data integrity is the foundation upon which all other storage guarantees rest. Without confidence that stored data remains correct, durability and performance optimizations are meaningless. This page has covered the complete spectrum of integrity concerns, from corruption taxonomy to production monitoring.
Module Complete: Write Strategies
We've now covered the full spectrum of write strategies: write-through for durability, write-back for performance, ordered writes and journaling for crash consistency, and checksums with redundancy for long-term data integrity.
Together, these strategies form the toolkit for building storage systems that meet any combination of performance, durability, and reliability requirements. The art of systems engineering is choosing the right combination for each use case—and now you have the deep understanding needed to make those choices correctly.
Congratulations! You have mastered the fundamental write strategies that govern how data flows from applications to persistent storage. You understand the trade-offs between performance and safety, the mechanisms that enable crash-consistent data structures, and the techniques for ensuring long-term data integrity. This knowledge is essential for designing and operating reliable storage systems.