A troubling reality undermines most file systems: your data can silently corrupt, and you might never know.
Studies at CERN found silent, undetected errors occurring at a rate on the order of one per 10^7 bits stored on disk. Across a petabyte-scale storage system, that adds up to a steady stream of corrupted data. NetApp reported similar findings: silent data corruption affects enterprise storage at rates that would alarm most administrators.
The insidious aspect is silence. Traditional file systems trust that the data read from disk is what was written. They have no mechanism to verify. When corruption occurs—from cosmic rays, firmware bugs, failing drives, controller errors, or cable issues—the corrupted data is simply returned to the application. The application incorporates the corruption. Backups dutifully replicate corrupted data to backup media.
By the time anyone notices, the corruption has propagated everywhere.
By the end of this page, you will understand how COW file systems achieve end-to-end data integrity, how checksums are used to detect corruption at every level, how self-healing repairs corruption automatically using redundancy, and why these guarantees are fundamentally impossible in traditional file systems.
Traditional file systems operate on a fundamental assumption: the storage layer is trustworthy. When ext4 writes a block and later reads it back, it assumes the data returned is identical to what was written. This assumption is broken regularly.
Sources of silent corruption:
| Source | Mechanism | Detection by Traditional FS |
|---|---|---|
| Bit rot | Random bit flips from cosmic rays, media degradation | ❌ None |
| Firmware bugs | Incorrect data returned by drive firmware | ❌ None |
| Phantom writes | Drive reports write complete but didn't persist | ❌ None |
| Misdirected writes | Data written to wrong location | ❌ None |
| Misdirected reads | Data read from wrong location | ❌ None |
| Cable errors | Signal degradation or interference | ❌ None |
| Controller bugs | RAID controller returns wrong data | ❌ None |
| Memory errors | Corruption during DMA transfers | ❌ None |
The RAID fallacy:
Many administrators believe RAID protects against these issues. It doesn't. RAID provides redundancy against complete drive failure, but it never verifies that the data a drive returns is actually what was written.
The famous "RAID-5 write hole" occurs when power fails during a stripe write. The parity becomes inconsistent with data—and RAID cannot detect this. Future reads return wrong data confidently.
When silent corruption occurs, traditional backup systems replicate it everywhere. Your backup from last night contains corrupted data. Your off-site replica has corrupted data. By the time someone notices incorrect values in a database or garbled sections of a video file, the good data may have aged out of all backup retention windows.
COW file systems solve the trust problem by trusting no one. Every block—data and metadata—is protected by a checksum (a cryptographic hash, if configured). When data is read, the checksum is recomputed and verified. Any mismatch indicates corruption.
The checksum architecture:
In systems like ZFS, the checksum for a block is stored in its parent block pointer, not alongside the block itself. This is crucial: a block can never vouch for itself. A misdirected or phantom write cannot corrupt a block and its checksum together, because the expected checksum lives in a parent that was already verified one level up. The chain of parent checksums forms a Merkle tree rooted in the uberblock, so a single trusted root validates the entire on-disk structure.
Checksum algorithms:
COW file systems offer multiple checksum algorithms, balancing speed and security:
| Algorithm | Bits | Speed | Collision Resistance | Use Case |
|---|---|---|---|---|
| Fletcher-4 | 256 | Very fast | Weak (non-cryptographic) | Default data checksum |
| SHA-256 | 256 | Moderate | Cryptographic | Deduplication, security-critical |
| SHA-512 | 512 | Slower | Highest | Paranoid security |
| Skein | Variable | Fast | Cryptographic | High-performance + security |
| Edon-R | 256 | Very fast | Strong | Performance-sensitive |
| BLAKE3 | 256 | Fastest | Cryptographic | Modern systems |
Cryptographic hashes detect not only accidental corruption but also intentional tampering: an attacker cannot modify data and still produce a matching checksum without breaking the hash function.
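To give a feel for the fast end of that table, here is a simplified sketch of a Fletcher-4-style checksum—four chained 64-bit accumulators over 32-bit words. Byte-order handling, trailing bytes, and the vectorized paths a real implementation uses are omitted; the function name and struct are illustrative only:

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified Fletcher-4-style checksum: four 64-bit running sums over
 * 32-bit words. Real implementations handle endianness, odd-length
 * buffers, and SIMD acceleration; this shows only the core loop. */
typedef struct { uint64_t a, b, c, d; } fletcher4_t;

void fletcher4(const uint32_t *words, size_t nwords, fletcher4_t *out)
{
    uint64_t a = 0, b = 0, c = 0, d = 0;

    for (size_t i = 0; i < nwords; i++) {
        a += words[i];   /* plain sum */
        b += a;          /* sum of sums */
        c += b;          /* higher-order sums make the result */
        d += c;          /* sensitive to word order, not just content */
    }
    out->a = a; out->b = b; out->c = c; out->d = d;   /* 256 bits total */
}
```

The entire cost is one pass of additions over the data, which is why checksumming every block adds negligible overhead even on modest hardware.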
```c
/* ZFS-style checksum verification during block read */
typedef struct block_pointer {
    uint64_t dva[3];          /* Data Virtual Address - up to 3 copies */
    uint64_t physical_size;   /* Compressed size on disk */
    uint64_t logical_size;    /* Size after decompression */
    uint8_t  checksum_type;   /* SHA256, fletcher4, etc. */
    uint8_t  compression;     /* lz4, zstd, etc. */
    uint8_t  copies;          /* Number of redundant copies */
    uint8_t  checksum[32];    /* Checksum of the block this points to */
} block_pointer_t;

/* Read block with verification */
int read_block_verified(zfs_pool_t *pool, block_pointer_t *bp, void *buf)
{
    int error;

    for (int copy = 0; copy < bp->copies; copy++) {
        /* Read raw data from disk */
        error = vdev_read(pool, bp->dva[copy], buf, bp->physical_size);
        if (error)
            continue;   /* Try next copy */

        /* Decompress if needed */
        if (bp->compression != COMPRESS_NONE) {
            error = decompress(buf, bp->compression);
            if (error)
                continue;   /* Decompression failed, try next copy */
        }

        /* CRITICAL: Verify checksum */
        uint8_t computed_checksum[32];
        compute_checksum(buf, bp->logical_size, bp->checksum_type,
                         computed_checksum);

        if (memcmp(computed_checksum, bp->checksum, 32) == 0) {
            /* Checksum matches - data is valid */
            return 0;
        }

        /* Checksum mismatch! Log and try next copy */
        log_corruption_event(pool, bp, copy, "Checksum mismatch detected");
    }

    /* All copies failed verification */
    return EIO;   /* Irrecoverable corruption */
}

/* Write block - compute and store checksum in parent */
void write_block_with_checksum(zfs_pool_t *pool, void *data, size_t size,
                               block_pointer_t *parent_bp)
{
    /* Compress data */
    void *compressed;
    size_t compressed_size;
    compress(data, size, &compressed, &compressed_size);

    /* Compute checksum BEFORE writing */
    compute_checksum(data, size, pool->checksum_algo, parent_bp->checksum);

    /* Write to disk */
    parent_bp->dva[0] = allocate_block(pool);
    vdev_write(pool, parent_bp->dva[0], compressed, compressed_size);

    /* Checksum is stored in parent, not with data */
}
```

Checksums are verified when data reaches memory, before it's returned to the application. This catches corruption at every level: drive, controller, cable, RAM (during DMA). The application receives verified data or an error—never silently corrupted data.
Detection is only half the solution. When corruption is detected, what happens next? COW file systems with redundancy can automatically repair corrupted data.
Self-healing mechanics:
When a checksum mismatch occurs on a read:
1. The file system fetches another copy of the block—from the other side of a mirror, a ditto copy, or by reconstructing it from RAID-Z parity.
2. The alternate copy's checksum is verified in the same way.
3. If it matches, the verified data is returned to the application.
4. The good data is then written over the corrupted copy, restoring full redundancy.
To the application, the read completes normally—it never sees the corruption. The administrator receives notification that self-healing occurred and can investigate the failing drive.
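A rough sketch of that repair path, reusing the hypothetical block_pointer_t and helper functions from the verification example above (compression handling omitted; this is not the actual ZFS code):

```c
/* Sketch of self-healing on a checksum mismatch: find a copy that verifies,
 * return it, and overwrite every copy that failed. Compression is omitted
 * for brevity; with compression off, physical and logical sizes coincide. */
int read_block_self_healing(zfs_pool_t *pool, block_pointer_t *bp, void *buf)
{
    int good_copy = -1;
    int bad_copies[3];
    int nbad = 0;

    for (int copy = 0; copy < bp->copies; copy++) {
        if (vdev_read(pool, bp->dva[copy], buf, bp->physical_size) != 0) {
            bad_copies[nbad++] = copy;          /* hard I/O error */
            continue;
        }

        uint8_t sum[32];
        compute_checksum(buf, bp->logical_size, bp->checksum_type, sum);
        if (memcmp(sum, bp->checksum, 32) == 0) {
            good_copy = copy;                   /* verified data now in buf */
            break;
        }
        bad_copies[nbad++] = copy;              /* silent corruption caught */
    }

    if (good_copy < 0)
        return EIO;                             /* nothing verifiable survived */

    /* Self-healing: rewrite each bad copy with the verified data so that
     * full redundancy is restored behind the application's back. */
    for (int i = 0; i < nbad; i++) {
        vdev_write(pool, bp->dva[bad_copies[i]], buf, bp->physical_size);
        log_corruption_event(pool, bp, bad_copies[i], "Repaired from good copy");
    }

    return 0;   /* the caller only ever sees verified data */
}
```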
Scrubbing: Proactive corruption detection
Self-healing is reactive—it repairs corruption when data is accessed. But what about data that's rarely read? Corruption could silently accumulate until both copies are affected.
Scrub operations solve this:
```bash
# Initiate scrub - reads and verifies EVERY block in the pool
zpool scrub tank

# Check scrub status
zpool status tank
```
A scrub reads every block, verifies its checksum, and repairs any corruption found. This should run regularly—weekly for active systems, monthly for archival storage.
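Conceptually, a scrub is just the verified, self-healing read applied to every block pointer in the pool. A minimal sketch, again built on the hypothetical structures and the read_block_self_healing() helper sketched earlier (assuming uncompressed blocks):

```c
#include <stdlib.h>

/* Sketch: walk every block pointer in the tree and force a verified read.
 * read_block_self_healing() repairs what it can; anything unrepairable is
 * logged as a permanent error (what a later status check would report). */
void scrub_tree(zfs_pool_t *pool, block_pointer_t *bp, int level)
{
    void *buf = malloc(bp->logical_size);
    if (buf == NULL)
        return;

    if (read_block_self_healing(pool, bp, buf) != 0) {
        log_corruption_event(pool, bp, -1, "Unrecoverable block found by scrub");
        free(buf);
        return;
    }

    if (level > 0) {
        /* Indirect block: its payload is an array of child block pointers,
         * each protected by the checksum we just verified. */
        block_pointer_t *children = (block_pointer_t *)buf;
        size_t nchildren = bp->logical_size / sizeof(block_pointer_t);

        for (size_t i = 0; i < nchildren; i++)
            scrub_tree(pool, &children[i], level - 1);
    }

    free(buf);
}
```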
| Configuration | Tolerates | Self-Healing | Space Efficiency |
|---|---|---|---|
| Single disk (no redundancy) | No failures | Detection only (no repair) | 100% |
| Mirror (2 disks) | 1 disk failure | Full self-healing | 50% |
| Mirror (3 disks) | 2 disk failures | Full self-healing | 33% |
| RAID-Z1 (3+ disks) | 1 disk failure | Full self-healing | 67-93% |
| RAID-Z2 (4+ disks) | 2 disk failures | Full self-healing | 50-88% |
| RAID-Z3 (5+ disks) | 3 disk failures | Full self-healing | 40-83% |
| Copies=2 (ditto blocks) | 1 block corruption | Block-level healing | 50% |
Beyond disk failure:
Traditional RAID only protects against complete disk failure—if a disk returns bad data, RAID blindly uses it. ZFS RAID-Z and mirrors verify checksums, which means: when one side of a mirror or one column of a RAID-Z stripe returns bad data, the mismatch is caught, the block is rebuilt from the other copy or from parity, and the rebuilt result is verified again before it is used or written back.
This is data integrity, not just availability.
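A simplified sketch of how the block checksum turns parity into self-healing for a single-parity stripe; assemble(), xor_reconstruct(), and heal_column() are hypothetical helpers standing in for the real reconstruction code:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: single-parity stripe repair guided by the block checksum.
 * cols[0..ncols-2] are data columns as read from disk; cols[ncols-1] is the
 * XOR parity column. */
int raidz1_read_verified(zfs_pool_t *pool, block_pointer_t *bp,
                         uint8_t **cols, size_t ncols, size_t col_len,
                         uint8_t *out)
{
    uint8_t sum[32];

    /* Fast path: the data columns as read usually verify on the first try. */
    assemble(out, cols, ncols - 1, col_len);
    compute_checksum(out, bp->logical_size, bp->checksum_type, sum);
    if (memcmp(sum, bp->checksum, 32) == 0)
        return 0;

    /* Slow path: some column is silently wrong. Hypothesise each data column
     * in turn as the bad one, rebuild it from the XOR of all the others plus
     * parity, and accept the first combination whose checksum verifies. */
    uint8_t *rebuilt = malloc(col_len);
    if (rebuilt == NULL)
        return ENOMEM;

    for (size_t bad = 0; bad + 1 < ncols; bad++) {
        uint8_t *as_read = cols[bad];

        xor_reconstruct(rebuilt, cols, ncols, col_len, bad); /* XOR of the rest */
        cols[bad] = rebuilt;
        assemble(out, cols, ncols - 1, col_len);
        cols[bad] = as_read;                                 /* leave input untouched */

        compute_checksum(out, bp->logical_size, bp->checksum_type, sum);
        if (memcmp(sum, bp->checksum, 32) == 0) {
            heal_column(pool, bp, bad, rebuilt);  /* rewrite the silently bad column */
            free(rebuilt);
            return 0;
        }
    }

    free(rebuilt);
    return EIO;   /* more damage than single parity can repair */
}
```

The checksum is what makes this possible: parity alone can rebuild data, but only the checksum can say which reconstruction is the right one.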
Large ZFS deployments routinely report self-healing events—blocks silently corrupted by hardware issues, repaired automatically without administrator intervention or service disruption. Without checksums, these would have been silently corrupted data delivered to applications.
Beyond bit-level integrity, COW file systems provide structural integrity—the guarantee that the file system is always in a consistent, valid state.
The traditional consistency problem:
Modifying a file in traditional file systems involves multiple steps: writing the new data blocks, updating the inode (size, timestamps, block pointers), updating the free-space bitmap, and possibly updating directory entries.
If power fails between any of these steps, the file system is inconsistent. This is why ext2 required fsck after every unclean shutdown—potentially hours of scanning.
COW transactional model:
In a COW file system, all modifications within a transaction group are atomic: new data and metadata are written to free space, never over live blocks, and only when everything is safely on disk does a single update to the root (the uberblock) make the new state visible. The pool is therefore always either entirely in the old state or entirely in the new one—there is no in-between to repair.
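A schematic of that commit sequence, with hypothetical helper names standing in for the real transaction-group machinery:

```c
/* Sketch of a copy-on-write transaction group commit. Nothing that the
 * current on-disk state depends on is overwritten until the very last step. */
void txg_commit(zfs_pool_t *pool, txg_t *txg)
{
    /* 1. Write all new data blocks to previously free space. */
    write_dirty_data_blocks(pool, txg);

    /* 2. Write new metadata (indirect blocks, dnodes, ...) that point at the
     *    new data. Parents embed the checksums of the children just written. */
    write_dirty_metadata(pool, txg);

    /* 3. Make sure everything above is durable before exposing it. */
    flush_write_caches(pool);

    /* 4. The commit point: one atomic uberblock update switches the root of
     *    the tree to the new state. Before this write the old tree is valid;
     *    after it the new tree is valid. A crash at any moment leaves one
     *    consistent tree—never a half-applied mixture. */
    write_new_uberblock(pool, txg);
    flush_write_caches(pool);
}
```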
Synchronous writes and the ZIL:
ZFS maintains a small ZFS Intent Log (ZIL) for synchronous operations. When an application requests a synchronous write (fsync, O_SYNC), ZFS writes a log record to the ZIL, waits for it to reach stable storage, and only then acknowledges the call; the actual change is folded into the next transaction group as usual.
On crash, ZFS replays the ZIL to recover synchronous operations that hadn't yet been committed in a transaction group. This provides both: the durability that fsync promises, and low-latency acknowledgement for applications—without weakening the transactional model.
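A sketch of the synchronous-write path under those assumptions—txg_stage_write(), zil_append(), and zil_flush() are illustrative names, not the real ZFS interfaces:

```c
/* Sketch of the synchronous-write path: log first, acknowledge, commit later. */
int handle_fsync_write(zfs_pool_t *pool, object_t *obj,
                       const void *data, size_t size, uint64_t offset)
{
    /* 1. Stage the change in memory as part of the open transaction group. */
    txg_stage_write(pool, obj, data, size, offset);

    /* 2. Append an intent record to the ZIL and force it to stable storage.
     *    This is a small sequential write, so it is fast. */
    zil_append(pool, obj, data, size, offset);
    zil_flush(pool);

    /* 3. Safe to acknowledge: if we crash now, replaying the ZIL record on
     *    the next import redoes this write, even though the transaction
     *    group holding it never committed. */
    return 0;
}
```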
```bash
# Check pool status and consistency
zpool status tank

# ZFS never needs fsck - this doesn't exist:
# fsck.zfs   <- Not a thing!
```
Journaling (ext4, NTFS) provides crash consistency for metadata. COW provides crash consistency for EVERYTHING—metadata and data, atomically together. There's no data=ordered vs data=journal tradeoff; all data is protected by the same transactional model.
Let's compare COW file system integrity with other approaches to understand why it's fundamentally superior:
1. ECC RAM and disk sector checksums
These provide protection at specific layers: ECC RAM guards data while it sits in memory, and per-sector ECC/CRC on the drive guards data while it sits on the platter or flash.
But corruption can occur in transfer between layers—DMA operations, cable transmission, controller processing. These point solutions leave gaps.
| Protection Type | Scope | Detects | Repairs | Coverage Gap |
|---|---|---|---|---|
| ECC RAM | Memory only | Memory bit flips | Single-bit errors | Transfer, storage, controller |
| Disk 4K sector CRC | Physical media | Media errors | Via spare sectors | Controller, cable, firmware |
| T10 DIF/DIX | Storage path | Transfer errors | No | Application, host memory |
| RAID parity | Disk failure | Complete failure only | From parity | Silent corruption passthrough |
| md RAID scrub | md arrays | Parity mismatch | No (can't know which is good) | Doesn't know correct data |
| ZFS checksums | End-to-end | Any corruption | From redundant copy | None - complete coverage |
Key insight: Where is the checksum, and who verifies it?
In disk sector checksums, the drive calculates and verifies—but the drive's firmware may be buggy. In RAID, the controller assembles data—but has no way to verify correctness.
In ZFS, the checksum is:
- computed by the host, before the data ever leaves main memory,
- stored in the already-verified parent block, far away from the data it protects, and
- re-verified by the host after the data arrives back in memory—independent of anything the drive, cable, or controller claims.
The entire storage path is untrusted. Only the initial write and final read verification matter.
```bash
#!/bin/bash
# Demonstrating ZFS corruption detection vs traditional FS

# === ZFS Corruption Detection ===

# Create a test file
echo "Known good data content" > /tank/testfile

# Capture the checksum as stored in ZFS
zdb -ddddd tank/testfile   # Shows block pointer with checksum

# Trigger a scrub to verify all data
zpool scrub tank
zpool status tank          # Shows checksum error count

# After intentional corruption (simulated), scrub detects:
#   pool: tank
#  state: ONLINE
#   scan: scrub repaired 4K in 0h0m with 0 errors on Thu Jan 16 15:00
# errors: 1 data errors on /tank/testfile

# === What Traditional FS Shows (or doesn't) ===

# ext4 has no block-level verification
# A silently corrupted file simply returns wrong data:
cat /ext4mount/corrupted_file   # Returns garbage, no error!
echo $?                         # Returns 0 (success) - corruption undetected

# The application must detect corruption itself:
# - Video players show artifacts
# - Databases fail with corruption errors
# - Archive extraction shows CRC errors
# - But raw file access shows nothing wrong

# === ZFS Corruption Stats ===

# View pool-wide error counters
zpool status tank

# View per-device error counts
# NAME        STATE     READ WRITE CKSUM
# tank        ONLINE       0     0     0
#   mirror-0  ONLINE       0     0     0
#     sda     ONLINE       0     0     0
#     sdb     ONLINE       0     0     5   <- 5 checksum errors on sdb!

# Event log showing self-healing
zpool events tank | grep -i checksum
```

Every system without end-to-end checksums implicitly trusts the storage stack. This trust is violated more often than most realize. Enterprise storage administrators regularly encounter silent corruption—but without ZFS-style checksums, they often don't realize until data is irrecoverably damaged.
Even with COW file system integrity features, proper configuration and operation maximize protection.
1. Choosing redundancy level:
Data integrity requires redundancy—without it, checksums detect corruption but can't repair it. Choose based on criticality and budget:
| Use Case | Recommended Config | Rationale |
|---|---|---|
| Personal workstation | Mirror (2 disks) or single + backups | Balance cost vs protection |
| Department file server | RAID-Z2 (5+ disks) | Survive 2 failures during rebuild |
| Database server | Mirror (3 disks) or special+mirror | Maximum read IOPS + redundancy |
| Critical production | RAID-Z3 (6+ disks) | Survive 3 failures, time for replacements |
| Cold archive | RAID-Z2 + copies=2 | Multiple layers for rarely-verified data |
2. Metadata protection:
ZFS automatically keeps multiple ditto copies of critical metadata—two for most metadata, three for pool-wide structures—even on a single disk. The copies property extends the same idea to user data:

```bash
# Keep extra copies (ditto blocks) of everything in a dataset
zfs set copies=3 tank/metadata

# Or set it on the pool's root dataset so child datasets inherit it
zfs set copies=2 tank
```
Even on non-redundant single-disk pools, copies=2 provides some protection against media errors.
3. Regular scrubbing:
Schedule scrubs to detect corruption before it affects redundancy:
```bash
#!/bin/bash
# Production scrub configuration

# Weekly scrub - crontab entry
# 0 2 * * 0 /usr/sbin/zpool scrub tank

# Smart scrub script - respects I/O load
#!/bin/bash
POOL="tank"
MAX_LOAD=5.0   # Don't start if load average > 5

# Check system load
load=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1)
if (( $(echo "$load > $MAX_LOAD" | bc -l) )); then
    echo "System load too high ($load), skipping scrub"
    exit 0
fi

# Check if scrub already running
if zpool status $POOL | grep -q "scrub in progress"; then
    echo "Scrub already in progress"
    exit 0
fi

# Start scrub
echo "Starting scrub on $POOL at $(date)"
zpool scrub $POOL

# Monitor completion (optional - for logging)
while zpool status $POOL | grep -q "scrub in progress"; do
    progress=$(zpool status $POOL | grep "scanned" | head -1)
    echo "Progress: $progress"
    sleep 300
done

echo "Scrub completed at $(date)"
zpool status $POOL | grep -A4 "scan:"
```

Many ZFS advocates insist on ECC RAM—and for good reason. Without ECC, a memory bit flip could corrupt data before the checksum is calculated. ZFS would then store corrupted data with a valid checksum. ECC RAM completes the end-to-end integrity chain.
Understanding how data integrity failures occur in practice reinforces why COW protections matter:
Case study: Silent corruption in enterprise storage
A major financial institution discovered database inconsistencies traced to storage corruption. Their enterprise SAN (Storage Area Network) had been silently corrupting data for months before application-level inconsistencies finally exposed the problem.
With ZFS, this scenario plays out differently: the first read of a corrupted block fails checksum verification, the data is served and repaired from a redundant copy, and the per-device error counters point straight at the misbehaving hardware.
| Incident | Traditional FS Outcome | COW FS Outcome |
|---|---|---|
| Cosmic ray bit flip | Silent corruption, propagates forever | Detected on read, repaired from copy |
| Drive firmware returns wrong sector | Application gets wrong data | Checksum fails, correct data from mirror |
| RAID controller parity error | Wrong data reconstructed from bad parity | Block-level checksums detect, use good copy |
| Power failure mid-write | Torn write, potential corruption | Atomic TXG, rollback to last good state |
| Administrator accidentally dd's disk | Catastrophic data loss | Redundancy + snapshots enable recovery |
| Ransomware encrypts files | Data encrypted, lost without backup | Rollback to pre-encryption snapshot |
The importance of defense in depth:
No single protection suffices. COW integrity is part of a defense-in-depth strategy:
1. Sound hardware: ECC RAM, quality drives, stable power.
2. End-to-end checksums to detect corruption wherever it arises.
3. Redundancy (mirrors, RAID-Z, ditto blocks) so detected corruption can be repaired.
4. Snapshots and off-system backups for mistakes and disasters that redundancy cannot absorb.
5. Monitoring and alerting so failing hardware is replaced before redundancy runs out.
COW file systems excel at layers 2-4, but they don't eliminate the need for proper hardware (layer 1) or monitoring (layer 5).
Running ZFS on poor hardware—consumer SSDs without power-loss protection, systems without ECC RAM, or networks with unreliable connections—can create a false sense of security. The checksums verify what was stored, but if garbage was stored (due to upstream corruption), garbage is what you'll detect.
Data integrity in COW file systems isn't an afterthought—it's a fundamental design principle. Let's consolidate the key concepts:
The integrity revolution:
Before COW file systems, data integrity was an application concern—databases had checksums, archive formats had CRCs, but the storage layer itself provided no guarantees. Applications either implemented their own integrity checking or hoped for the best.
COW file systems moved integrity into infrastructure. Every file, every block, every metadata structure is now protected by the same cryptographic verification. Applications can finally trust their storage layer.
Looking ahead:
btrfs and ZFS are the two dominant COW file systems. In the next pages, we'll examine each in detail—their architectures, unique features, and when to choose one over the other.
You now understand how COW file systems achieve unprecedented data integrity through end-to-end checksums, self-healing, and transactional guarantees. You can explain why these protections are impossible in traditional file systems and how to configure COW file systems for maximum protection. Next, we'll explore btrfs and ZFS in detail.