Throughout this module, we've focused on getting data from applications to persistent storage correctly—write-through for durability, write-back for performance, ordered writes for consistency. But there's a deeper question we haven't addressed: How do we know the data on disk is still correct after it's written?
Data can become corrupted silently—without the drive, the filesystem, or the application reporting any error.
The terrifying reality: a study by CERN found that 1 in 10^7 to 10^8 bytes in large storage systems contains silent errors that go undetected by disk-level ECC. At petabyte scale, this means thousands of corrupt bytes per system.
This page explores how systems detect, prevent, and recover from data corruption—the strategies that enable us to trust long-term data storage.
By the end of this page, you will understand the full taxonomy of data corruption, the mathematical foundations of checksums and error detection, end-to-end integrity verification techniques, and how production systems balance integrity checking against performance. You'll be equipped to design systems that detect corruption before it causes harm.
To protect data integrity, we must first understand the failure modes we're protecting against. Each mode has different characteristics and requires different countermeasures.
Corruption Categories
| Failure Mode | Cause | Detection Method | Recovery Possibility |
|---|---|---|---|
| Bit Rot | Media degradation, radiation | Checksums | Redundancy (RAID, replicas) |
| Phantom Write | Firmware bug, lost write | Read verification | WAL replay, redundancy |
| Misdirected Write | FTL bug, bad sector map | Block-level checksum with address | Redundancy |
| Misdirected Read | Controller error | End-to-end checksum | Retry from different replica |
| Torn Write | Power loss mid-write | Checksum + logging | WAL replay, COW |
| Metadata Corruption | Bug, crash, bit flip | Journal, checksummed metadata | Journal replay, fsck |
| Silent Corruption (DRAM) | Radiation, voltage glitch | ECC memory | ECC correction or detection |
The Silent Corruption Problem
Most of these failures are silent—the system doesn't know corruption occurred until corrupted data is read. By then, it may be too late:
Timeline of silent corruption:
Day 1:  Write block B containing important data → disk returns success
Day 5:  Bit flip occurs in block B on the media → no read, no detection
Day 30: Make a backup → the backup contains corrupted block B!
Day 60: Try to read block B → ERROR or wrong data returned
        The backups are also corrupt → the data is lost
The key insight: detection must happen at read time, not just write time. And ideally, detection should happen during routine scrubbing, before the data is actually needed.
Corruption Rates in Practice
Research from NetApp, Google, CERN, and others provides real-world corruption statistics:
- SATA drives: URE (Unrecoverable Read Error) rate of 1 in 10^14 to 10^15 bits
  - roughly 1-10 UREs per 10 TB read over the device lifetime
- Enterprise SAS drives: 1 in 10^15 to 10^16 bits
  - roughly 0.1-1 URE per 10 TB
- Silent corruption (past disk-level ECC): 1 in 10^7 to 10^8 bytes
  - CERN study: 128 undetected errors per PB per year
- DRAM bit flips: 1-5% of DIMMs per year experience errors
  - Google study: rates much higher than previously expected
These numbers may seem small, but at scale they're significant. A 100 PB storage system might expect 10,000+ silent corruption events per year.
Storage device failure rates follow a 'bathtub curve': high early failures (infant mortality), low middle-life failures, and rising late-life failures. But silent corruption can happen anytime. Don't wait for device age to implement integrity checking—start from day one.
Checksums are the primary mechanism for detecting data corruption. Understanding their properties enables choosing the right checksum for each use case.
What a Checksum Provides
A checksum is a fixed-size value computed from data that changes (with high probability) if the data changes. It's a fingerprint of the data.
Checksum(data) → fixed-size value
Properties:
- Deterministic: Same data always produces same checksum
- Fixed output size: Regardless of input size
- Sensitivity: Any change to input changes output
- Distribution: Outputs uniformly distributed (for good checksums)
What it CAN detect:
- Random bit flips
- Truncation
- Insertions/deletions
- Block transposition (with block-addressed checksums)
What it CANNOT reliably detect:
- Intentional tampering (need cryptographic hash)
- Errors that happen to produce matching checksum (rare but possible)
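As a quick sanity check on these properties, here is a minimal sketch using Python's built-in zlib.crc32 (a plain CRC-32, chosen only because it needs no extra dependencies):
import zlib

data = b"hello, storage"

# Deterministic: the same input always yields the same checksum
assert zlib.crc32(data) == zlib.crc32(data)

# Fixed output size: a 4-byte input and a 4 KiB block both map to 32 bits
print(f"{zlib.crc32(b'abcd'):#010x}")        # a 32-bit value
print(f"{zlib.crc32(bytes(4096)):#010x}")    # also a 32-bit value

# Sensitivity: flipping a single bit changes the checksum
flipped = bytearray(data)
flipped[0] ^= 0x01
print(zlib.crc32(data) != zlib.crc32(bytes(flipped)))   # True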
Common Checksum Algorithms
| Algorithm | Size | Speed (GB/s) | Collision Resistance | Use Case |
|---|---|---|---|---|
| XOR | Variable | 10+ | Very weak | Quick parity, not recommended |
| CRC-32 | 32 bits | 5-8 | Good for random errors | Ethernet, ZIP files |
| CRC-32C (Castagnoli) | 32 bits | 20-40 (with SSE4.2) | Better than CRC-32 | iSCSI, ext4, btrfs |
| xxHash | 64/128 bits | 15-30 | Good | General purpose, fast |
| SHA-256 | 256 bits | 0.5-2 | Cryptographically strong | Content-addressed storage |
| BLAKE3 | 256 bits | 5-10 | Cryptographically strong | Modern crypto hash |
| Fletcher-64 | 64 bits | 8-12 | Medium | ZFS (legacy) |
CRC-32C: The Modern Standard
CRC-32C (Castagnoli polynomial) is widely used in storage because its error-detection properties are strong for the random and burst errors seen on real devices, and because modern x86 CPUs accelerate it in hardware via the SSE4.2 crc32 instruction:
#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>

// Hardware-accelerated CRC-32C using SSE4.2 (compile with -msse4.2)
uint32_t crc32c_hardware(const void *data, size_t len) {
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFF;

    // Process 8 bytes at a time
    while (len >= 8) {
        crc = (uint32_t)_mm_crc32_u64(crc, *(const uint64_t *)p);
        p += 8;
        len -= 8;
    }

    // Process remaining bytes
    while (len--) {
        crc = _mm_crc32_u8(crc, *p++);
    }

    return crc ^ 0xFFFFFFFF;
}
// Performance: 20-40 GB/s on modern CPUs
// Compare to disk speed: 3-7 GB/s for NVMe
// Checksum is NOT the bottleneck
Collision Probability
For a 32-bit checksum, the probability of two random inputs having the same checksum is 2^-32 ≈ 2.3 × 10^-10. This seems small, but by the birthday approximation (≈ n² × p / 2 expected colliding pairs) it adds up quickly:
At 1 billion (10^9) blocks:
Expected collisions ≈ 10^9 × 10^9 × 2.3 × 10^-10 / 2 ≈ 1.2 × 10^8
A 64-bit checksum pushes this far out:
Collision probability: 2^-64 ≈ 5.4 × 10^-20
At 1 billion (10^9) blocks:
Expected collisions ≈ 10^9 × 10^9 × 5.4 × 10^-20 / 2 ≈ 0.027
(These collision counts matter for content addressing and deduplication, where distinct blocks must never share a hash. For plain error detection, the relevant figure is the per-block chance that a corruption slips past verification—simply 2^-32 or 2^-64.)
This is why ZFS uses 256-bit checksums—collision probability is negligible even at extreme scale.
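The arithmetic above is the standard birthday approximation, and it is easy to re-run for other checksum widths and block counts; a small helper (illustrative only):
def expected_collisions(num_blocks: int, checksum_bits: int) -> float:
    """Birthday approximation: n*(n-1)/2 pairs, each colliding with probability 2^-bits."""
    pairs = num_blocks * (num_blocks - 1) / 2
    return pairs / float(2 ** checksum_bits)

print(expected_collisions(10**9, 32))     # ~1.2e8 expected collisions
print(expected_collisions(10**9, 64))     # ~0.027
print(expected_collisions(10**12, 256))   # effectively zero (ZFS-scale checksums)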
For error detection: CRC-32C (hardware accelerated, good enough for most cases). For content-addressing: SHA-256 or BLAKE3 (collision resistance matters). For simple cases: xxHash (fast, good distribution). Never use MD5 or SHA-1 for security-sensitive applications.
Device-level checksums (like sector ECC) protect only one segment of the data path. True data integrity requires end-to-end protection: verifying that what the application reads is exactly what it wrote.
The Data Path Vulnerability Points
Application Buffer (DRAM)
│
│ ← Bus errors, DMA errors, DRAM bit flips
▼
Kernel Page Cache (DRAM)
│
│ ← Same DRAM risks, plus kernel memory corruption
▼
Filesystem / Block Layer
│
│ ← Software bugs, misdirected I/O
▼
Device Driver
│
│ ← Driver bugs, incorrect command formatting
▼
HBA/Controller (often has DRAM)
│
│ ← Controller firmware bugs, controller DRAM errors
▼
Network (if SAN/iSCSI/NFS)
│
│ ← Packet corruption, routing errors
▼
Device Controller (has DRAM, cache)
│
│ ← Firmware bugs, cache corruption
▼
Persistent Media
│
│ ← Bit rot, media degradation
▼
[Read path goes back up with same risks]
Each layer might have checksums, but layer-specific checksums don't catch cross-layer errors. A misdirected write (correct checksum, wrong location) isn't detected by media ECC.
End-to-End Verification Architecture
True end-to-end integrity means:
Application writes block B at logical address L:
1. Compute: checksum = CRC32C(L || B)
Including L in checksum detects misdirected writes
2. Write: [L][B][checksum] → storage
3. Read at address L:
- Fetch [B][checksum]
- Compute: expected = CRC32C(L || B)
- Compare: if (checksum != expected) → CORRUPTION!
4. If corruption detected:
- Attempt read from replica/parity
- Log error for analysis
- Mark block as suspect
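A minimal sketch of steps 1-3 in Python, using zlib.crc32 as a stand-in for hardware CRC-32C and a dict as the "device" (the BlockStore name and structure are illustrative, not from any particular system):
import struct
import zlib

class CorruptionError(Exception):
    pass

class BlockStore:
    """Toy block store that checksums (logical address || data) on every write."""

    def __init__(self):
        self.blocks = {}   # logical address -> (data, checksum)

    def write(self, addr: int, data: bytes) -> None:
        # Folding the address into the checksum means data written to the
        # wrong location can never verify as the block it pretends to be.
        checksum = zlib.crc32(struct.pack(">Q", addr) + data)
        self.blocks[addr] = (data, checksum)

    def read(self, addr: int) -> bytes:
        data, stored = self.blocks[addr]
        expected = zlib.crc32(struct.pack(">Q", addr) + data)
        if stored != expected:
            # Step 4 in a real system: retry from a replica, log, mark block suspect
            raise CorruptionError(f"checksum mismatch at logical address {addr}")
        return data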
ZFS: The Gold Standard for End-to-End Integrity
ZFS pioneered comprehensive end-to-end integrity in general-purpose filesystems:
ZFS Integrity Features:
1. Block-level checksums
- 256-bit checksum for every block
- Checksum stored in PARENT block's pointer
- Cannot have checksum-storage mismatch
2. Merkle tree verification
- Root block checksum verifies children
- Children verify grandchildren
- Down to leaf data blocks
- Corruption anywhere is detected at root
3. Self-healing with redundancy
- On checksum mismatch, read from mirror/parity
- If redundant copy is good, repair bad copy
- Silent self-healing, no data loss
4. Scrubbing
- Background process reads all blocks
- Verifies checksums
- Repairs from redundancy if possible
- Reports errors for blocks without redundancy
The parent pointer design is crucial: if the checksum were stored with the data, a misdirected write could overwrite both data and checksum. By storing the checksum in the parent, a misdirected write is immediately detected when the parent's pointer checksum fails.
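The same idea in a few illustrative lines (hypothetical names, not ZFS's actual on-disk structures): the parent records both where the child lives and what it must hash to, so a phantom or misdirected write is exposed the moment the child is read through its parent pointer.
import hashlib
from dataclasses import dataclass

@dataclass
class BlockPointer:
    """Parent-side pointer: the child's address plus the checksum it must match."""
    address: int
    checksum: bytes   # stored in the PARENT, not alongside the child's data

def read_child(device: dict, ptr: BlockPointer) -> bytes:
    data = device[ptr.address]
    if hashlib.sha256(data).digest() != ptr.checksum:
        # Stale or foreign data at this address fails the parent's expectation
        raise IOError(f"block at address {ptr.address} fails parent checksum")
    return data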
T10 DIF (Data Integrity Field) and DIX (Data Integrity Extension) extend protection to hardware. The OS attaches a protection field (checksum + block address + version) that travels with data through the storage stack. Enterprise HBAs and drives verify DIF at each layer. Not widely deployed but very effective where available.
Different filesystems provide different levels of integrity protection. Understanding these differences is crucial for selecting the right filesystem for data-critical applications.
Feature Comparison
| Feature | ext4 | XFS | Btrfs | ZFS | NTFS |
|---|---|---|---|---|---|
| Metadata Checksums | Optional* | Yes (v5) | Yes | Yes | Yes |
| Data Checksums | No | No | Yes | Yes | No |
| Self-Healing | No | No | With RAID | With RAID | No |
| Scrubbing | No | No | Yes | Yes | No |
| COW for Atomicity | No | No | Yes | Yes | No |
| Snapshot Support | No | No | Yes | Yes | Yes |
| Checksum Algorithm | CRC32C | CRC32C | CRC32C, SHA-256 | Fletcher4, SHA-256, etc. | N/A |
*ext4 metadata checksums available with metadata_csum feature (default in newer versions)
ext4 Integrity Strategy
ext4 provides minimal built-in integrity, relying on external tools:
ext4 Approach:
- Journal for metadata consistency (not integrity)
- Optional metadata checksums (mkfs.ext4 -O metadata_csum)
- No data checksums
- Relies on RAID for data protection
- e2fsck for offline repair
When to use ext4:
- Compatibility requirements
- Hardware RAID providing protection
- Applications do their own checksumming
- Performance over safety trade-off
Btrfs Integrity Strategy
Btrfs provides comprehensive checksumming with self-healing:
Btrfs Approach:
- CRC32C checksums for all data and metadata
- Checksums stored in separate checksum tree
- COW ensures atomic updates
- Self-healing with RAID1/5/6
- Online scrubbing
Commands:
btrfs scrub start /mount/point # Start scrub
btrfs scrub status /mount/point # Check progress
btrfs device stats /mount/point # View error counters
# Example output
[/dev/sda].corruption_errs: 2
[/dev/sda].read_errs: 0
[/dev/sda].write_errs: 0
[/dev/sdb].corruption_errs: 0 # Mirror copy was good
ZFS Integrity Architecture
ZFS has the most comprehensive integrity system:
# ZFS Integrity Configuration and Monitoring

# Check checksum algorithm (fletcher4, sha256, sha512, skein, edonr)
zfs get checksum tank/dataset
# NAME          PROPERTY  VALUE   SOURCE
# tank/dataset  checksum  sha256  inherited

# Set stronger checksum for critical data
zfs set checksum=sha256 tank/critical_data

# Enable deduplication (uses checksum for block comparison)
zfs set dedup=on tank/dedupable_data
# WARNING: Dedup is RAM-hungry, needs ~320 bytes per block

# View data integrity statistics
zpool status tank
# pool: tank
# state: ONLINE
# status: One or more devices has experienced an unrecoverable error.
#         Applications are unaffected.
# action: Online the device using 'zpool online' or replace the device.
# scan: scrub repaired 128K in 02:34:56 with 0 errors
# config:
#   NAME        STATE   READ WRITE CKSUM
#   tank        ONLINE     0     0     0
#     mirror-0  ONLINE     0     0     0
#       sda     ONLINE     0     0     2   <- 2 checksum errors!
#       sdb     ONLINE     0     0     0   <- was used for repair

# Start a scrub
zpool scrub tank

# Schedule regular scrubs (cron)
# Weekly scrub is typical for production
0 2 * * 0 /sbin/zpool scrub tank

# View detailed error log
zpool status -v tank
# Shows files affected by errors

# Clear error counters after replacing bad disk
zpool clear tank

If your filesystem supports scrubbing (ZFS, Btrfs), run it regularly (weekly or monthly). Scrubbing finds latent corruption before it's needed—when repair from redundancy is still possible. Without scrubbing, corruption may not be detected until all copies are bad.
Even with checksumming filesystems, applications may need additional integrity measures of their own—protection that travels with the data regardless of which filesystem, transport, or backup medium it crosses. Three common patterns follow.
Pattern 1: Record-Level Checksums
import hashlib
import struct

class IntegrityError(Exception):
    """Raised when a stored checksum does not match the recomputed one."""

class ChecksummedRecord:
    HEADER_SIZE = 36  # 4 bytes length + 32 bytes SHA-256

    def __init__(self, data):
        self.data = data
        self.checksum = hashlib.sha256(data).digest()

    def serialize(self):
        length = len(self.data)
        return struct.pack(">I", length) + self.checksum + self.data

    @classmethod
    def deserialize(cls, raw):
        length = struct.unpack(">I", raw[:4])[0]
        stored_checksum = raw[4:36]
        data = raw[36:36 + length]

        computed_checksum = hashlib.sha256(data).digest()
        if stored_checksum != computed_checksum:
            raise IntegrityError(
                f"Checksum mismatch: expected {stored_checksum.hex()}, "
                f"got {computed_checksum.hex()}"
            )

        record = cls.__new__(cls)
        record.data = data
        record.checksum = stored_checksum
        return record
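A short round-trip shows the failure mode this guards against; flipping one byte of the serialized record makes deserialization fail loudly:
raw = ChecksummedRecord(b"important payload").serialize()
assert ChecksummedRecord.deserialize(raw).data == b"important payload"

damaged = bytearray(raw)
damaged[-1] ^= 0x01          # corrupt one byte of the payload
try:
    ChecksummedRecord.deserialize(bytes(damaged))
except IntegrityError as exc:
    print("corruption detected:", exc)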
Pattern 2: Content-Addressed Storage
Deduplicate and verify in one step:
import hashlib
import os

class ContentAddressedStore:
    """
    Store objects by their content hash.

    Advantages:
    - Automatic deduplication
    - Built-in integrity verification
    - Immutable objects (content defines identity)

    Used by: Git, Docker images, IPFS, many backup systems
    """

    def __init__(self, base_path):
        self.base_path = base_path
        os.makedirs(base_path, exist_ok=True)

    def _hash(self, data):
        """SHA-256 content hash."""
        return hashlib.sha256(data).hexdigest()

    def _object_path(self, content_hash):
        """Two-level directory structure like Git."""
        return os.path.join(
            self.base_path,
            content_hash[:2],
            content_hash[2:]
        )

    def put(self, data):
        """
        Store data, return content hash.
        Integrity is automatic: if hash matches, data is correct.
        """
        content_hash = self._hash(data)
        path = self._object_path(content_hash)

        # Already exists = already verified (by previous write)
        if os.path.exists(path):
            return content_hash

        # Write to temp, rename atomically
        dir_path = os.path.dirname(path)
        os.makedirs(dir_path, exist_ok=True)

        temp_path = path + '.tmp'
        with open(temp_path, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(temp_path, path)

        return content_hash

    def get(self, content_hash):
        """
        Retrieve data by hash.
        Automatically verifies integrity on read.
        """
        path = self._object_path(content_hash)

        if not os.path.exists(path):
            raise KeyError(f"Object not found: {content_hash}")

        with open(path, 'rb') as f:
            data = f.read()

        # Verify integrity
        actual_hash = self._hash(data)
        if actual_hash != content_hash:
            raise IntegrityError(
                f"Corruption detected: expected {content_hash}, "
                f"got {actual_hash}"
            )

        return data

    def verify_all(self):
        """Scrub all objects, report corrupted ones."""
        corrupted = []
        for root, dirs, files in os.walk(self.base_path):
            for filename in files:
                path = os.path.join(root, filename)

                # Reconstruct expected hash from path
                rel_path = os.path.relpath(path, self.base_path)
                expected_hash = rel_path.replace(os.sep, '')

                try:
                    self.get(expected_hash)
                except IntegrityError:
                    corrupted.append(expected_hash)
        return corrupted

Pattern 3: Merkle Trees for Efficient Verification
For large datasets, verify incrementally:
Full verification: hashing 1TB takes minutes
Must read the entire dataset
Merkle verification of the same 1TB:
Level 0: 256MB chunks, hash each → 4096 hashes
Level 1: groups of 16 chunk hashes → 256 hashes
Level 2: groups of 16 → 16 hashes
Level 3: root hash → 1 hash
Total: ~4,400 hashes ≈ 140KB of 32-byte hashes for 1TB of data
To verify ONE chunk changed:
- Hash the chunk (one 32-byte leaf hash to update)
- Recompute the parent path to the root (4 levels, a handful of hash computations)
- Compare root hash
Only the 256MB chunk and its parent path need to be read, not 1TB
This is exactly how Git verifies repository integrity—any change anywhere is detected by checking the root commit hash.
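A compact sketch of the same structure (SHA-256 with a binary fan-out for brevity, rather than the 16-way grouping described above):
import hashlib

def _h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaf_hashes: list) -> bytes:
    """Fold leaf hashes pairwise until a single root hash remains."""
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:                     # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

chunks = [b"chunk-%d" % i for i in range(8)]   # stand-ins for 256MB chunks
leaves = [_h(c) for c in chunks]
root = merkle_root(leaves)

# Modify one chunk: only its leaf hash needs recomputing, and the root changes
leaves[3] = _h(b"tampered chunk")
assert merkle_root(leaves) != root             # corruption detected at the root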
Databases provide their own page-level protection: PostgreSQL has page checksums (data_checksums, enabled via initdb --data-checksums), SQLite offers an optional checksum VFS shim, and InnoDB checksums every page. Enable these features! They catch corruption before it propagates to backups and replicas.
Detection is only half the battle. When corruption is found, systems need recovery strategies.
Recovery Hierarchy
Order of recovery attempts (most to least desirable):
1. Self-heal from redundancy
- Read from mirror/parity
- Repair corrupted copy
- User never knows
2. Replay from transaction log
- WAL contains original data
- Replay recreates correct state
- Applies to crash recovery too
3. Restore from backup
- Most recent backup
- Check backup isn't also corrupt!
- May lose recent changes
4. Reconstruct from replicas
- Distributed system: fetch from other nodes
- Eventually consistent: latest wins
5. Partial recovery
- Salvage what's readable
- Mark bad blocks
- User decides how to proceed
6. Accept data loss
- Log the loss
- Alert operations
- Move on
Self-Healing Implementation
# Assumes helpers not shown here: compute_checksum(), log_info()/log_error(),
# self._mark_suspect()/self._get_all_keys(), and UnrecoverableCorruptionError.
class SelfHealingStorage:
    """
    Storage with automatic corruption repair from replicas.
    """

    def __init__(self, replicas):
        """
        replicas: list of storage backends (local paths, network URLs, etc.)
        """
        self.replicas = replicas
        self.min_replicas_for_read = 1
        self.quorum_for_repair = len(replicas) // 2 + 1

    def read(self, key):
        """
        Read with automatic verification and repair.
        """
        results = []

        # Read from all replicas
        for replica in self.replicas:
            try:
                data, checksum = replica.read(key)
                actual_checksum = compute_checksum(data)
                results.append({
                    'replica': replica,
                    'data': data,
                    'checksum': checksum,
                    'valid': checksum == actual_checksum
                })
            except Exception as e:
                results.append({
                    'replica': replica,
                    'error': e,
                    'valid': False
                })

        # Count valid copies
        valid_results = [r for r in results if r.get('valid')]
        invalid_replicas = [r['replica'] for r in results if not r.get('valid')]

        if len(valid_results) == 0:
            # No valid copies - data is lost
            raise UnrecoverableCorruptionError(
                f"No valid copies of {key} found across {len(self.replicas)} replicas"
            )

        # Use first valid result
        correct_data = valid_results[0]['data']
        correct_checksum = valid_results[0]['checksum']

        # Repair invalid replicas asynchronously
        if invalid_replicas:
            self._schedule_repair(key, correct_data, correct_checksum, invalid_replicas)

        return correct_data

    def _schedule_repair(self, key, data, checksum, bad_replicas):
        """
        Repair corrupted replicas in background.
        """
        for replica in bad_replicas:
            try:
                replica.write(key, data, checksum)
                log_info(f"Repaired {key} on {replica.name}")
            except Exception as e:
                log_error(f"Failed to repair {key} on {replica.name}: {e}")
                # Mark replica as suspect for ops review
                self._mark_suspect(replica)

    def scrub(self):
        """
        Background verification of all data.
        Should run regularly (daily/weekly).
        """
        all_keys = self._get_all_keys()
        issues_found = []

        for key in all_keys:
            try:
                self.read(key)  # Read triggers verification and repair
            except UnrecoverableCorruptionError as e:
                issues_found.append({'key': key, 'error': str(e)})

        return issues_found

Backup Verification
Backups themselves can be corrupted. Always verify:
# PostgreSQL: Verify backup integrity
pg_verifybackup /path/to/backup
# tar: Read through the archive to detect structural corruption
tar -tf backup.tar > /dev/null
# (or `tar --compare -f backup.tar` to diff the archive against the filesystem)
# restic: Verify backup integrity
restic -r /backup/repo check --read-data
# ZFS: Send with checksum verification
zfs send -c pool/dataset | zfs receive -F backup_pool/dataset
# -c sends compressed, preserves checksums
Keep 3 copies of critical data, on 2 different media types, with 1 offsite. This protects against: device failure (multiple copies), media-specific issues (different types), and site disasters (offsite). And verify checksums on ALL copies regularly.
Production systems need continuous monitoring of data integrity metrics. Problems caught early are much easier to resolve.
Key Metrics to Monitor
#!/bin/bash
# Integrity monitoring script for Linux systems

# 1. ZFS pool health
zpool_errors() {
    zpool list -H -o name,health
    # Flag pools whose device lines show a non-zero CKSUM count
    for pool in $(zpool list -H -o name); do
        cksum_errors=$(zpool status "$pool" | awk '$5 ~ /^[0-9]+$/ && $5 > 0' | wc -l)
        if [ "$cksum_errors" -gt 0 ]; then
            echo "WARNING: $pool has checksum errors"
            zpool status "$pool"
        fi
    done
}

# 2. Btrfs device stats
btrfs_errors() {
    for mount in $(grep btrfs /proc/mounts | cut -d' ' -f2); do
        errors=$(btrfs device stats "$mount" 2>/dev/null | grep -v " 0$")
        if [ -n "$errors" ]; then
            echo "WARNING: Btrfs errors on $mount"
            echo "$errors"
        fi
    done
}

# 3. SMART health
smart_health() {
    for disk in /dev/sd?; do
        health=$(smartctl -H "$disk" 2>/dev/null | grep "SMART overall-health")
        if [[ $health != *"PASSED"* ]]; then
            echo "WARNING: SMART failure on $disk"
            smartctl -A "$disk" | grep -E "(Reallocated|Pending|Uncorrectable)"
        fi
    done
}

# 4. Memory errors (ECC)
memory_errors() {
    # Requires the EDAC driver
    if [ -d /sys/devices/system/edac/mc ]; then
        for mc in /sys/devices/system/edac/mc/mc*; do
            ce=$(cat "$mc/ce_count" 2>/dev/null || echo 0)
            ue=$(cat "$mc/ue_count" 2>/dev/null || echo 0)
            if [ "$ce" -gt 0 ] || [ "$ue" -gt 0 ]; then
                echo "WARNING: Memory errors - CE: $ce, UE: $ue"
            fi
        done
    fi
}

# 5. Kernel message analysis
kernel_errors() {
    # Look for I/O errors in recent logs (-E so the alternation is honored)
    dmesg | grep -iE "error|fail|corrupt" | tail -20
}

# Run all checks
echo "=== Integrity Check $(date) ==="
zpool_errors
btrfs_errors
smart_health
memory_errors
kernel_errors

Alerting Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Checksum errors (total) | >0 | >10/day | Investigate disk |
| Scrub errors | >0 | Any uncorrectable | Replace failed device |
| SMART reallocated sectors | >0 | >100 or growing | Plan replacement |
| ECC correctable errors | >100/month | >1000/month | Monitor closely |
| ECC uncorrectable | >0 | >0 | Immediate DIMM replacement |
Prometheus Metrics Example
# Prometheus metrics for integrity monitoring
- name: storage_checksum_errors_total
help: Total number of checksum verification failures
type: counter
labels:
- device
- pool
- name: storage_silent_repairs_total
help: Corruptions repaired silently from redundancy
type: counter
labels:
- device
- pool
- name: storage_last_scrub_timestamp
help: Unix timestamp of last completed scrub
type: gauge
labels:
- pool
- name: storage_last_scrub_errors
help: Errors found in most recent scrub
type: gauge
labels:
- pool
- error_type
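If a custom agent exports these metrics, the Python prometheus_client library is one straightforward option. A sketch assuming that library is installed; the collection logic is a stub to be replaced with real zpool/btrfs parsing:
import time
from prometheus_client import Counter, Gauge, start_http_server

checksum_errors = Counter(
    "storage_checksum_errors_total",
    "Total number of checksum verification failures",
    ["device", "pool"],
)
last_scrub_ts = Gauge(
    "storage_last_scrub_timestamp",
    "Unix timestamp of last completed scrub",
    ["pool"],
)

def collect_once():
    # Stub: in practice, parse `zpool status` / `btrfs device stats` output here
    checksum_errors.labels(device="sda", pool="tank").inc(0)
    last_scrub_ts.labels(pool="tank").set(time.time())

if __name__ == "__main__":
    start_http_server(9101)   # metrics exposed at http://localhost:9101/metrics
    while True:
        collect_once()
        time.sleep(60)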
For critical data: weekly scrubs. For general data: monthly. For archival: before any restore operation. Scrub duration increases with data size—a 100TB array might take 24-48 hours. Schedule during low-usage periods.
Data integrity is the foundation upon which all other storage guarantees rest. Without confidence that stored data remains correct, durability and performance optimizations are meaningless. This page has covered the complete spectrum of integrity concerns, from corruption taxonomy to production monitoring.
Module Complete: Write Strategies
We've now covered the full spectrum of write strategies: write-through for durability, write-back for performance, ordered writes and journaling for crash consistency, and checksums with redundancy for long-term data integrity.
Together, these strategies form the toolkit for building storage systems that meet any combination of performance, durability, and reliability requirements. The art of systems engineering is choosing the right combination for each use case—and now you have the deep understanding needed to make those choices correctly.
Congratulations! You have mastered the fundamental write strategies that govern how data flows from applications to persistent storage. You understand the trade-offs between performance and safety, the mechanisms that enable crash-consistent data structures, and the techniques for ensuring long-term data integrity. This knowledge is essential for designing and operating reliable storage systems.