Imagine retrieving a backup you made five years ago—a precious family photo, a critical contract, or years of research data. You open the file... and it's corrupted. Not obviously missing, not displaying an error during copy—just quietly, irrecoverably wrong.
This isn't a hypothetical scenario. Silent data corruption is a statistical certainty for anyone storing significant amounts of data over time. Unlike disk failures that announce themselves with I/O errors, silent corruption masquerades as valid data while containing errors.
The fundamental problem: traditional storage systems have no mechanism to distinguish between correct data and corrupted data. They trust that whatever bytes come back from a disk are the bytes that were written. This trust is misplaced.
By the end of this page, you will understand ZFS's end-to-end checksum architecture, how it detects corruption at read time, why parent-stored checksums create a Merkle tree of verification, the self-healing process that repairs corruption automatically, and the checksum algorithms available in ZFS.
Silent data corruption occurs when data changes without the storage system detecting or reporting the change. Understanding the sources of corruption clarifies why ZFS's approach is necessary.
The Data Path:
┌───────────────────────────────────────────────────────────────────┐
│                             DATA PATH                             │
│                                                                   │
│  Application → OS Buffer → File System → Volume Mgr → RAID → Disk │
│       ▲            ▲            ▲            ▲         ▲      ▲   │
│       │            │            │            │         │      │   │
│   App bugs    RAM errors     FS bugs     RAID bugs Firmware Media │
│                                         Write hole   bugs   decay │
└───────────────────────────────────────────────────────────────────┘
Every component in this path can introduce corruption.
| Source | Mechanism | Detection by Traditional FS |
|---|---|---|
| Bit Rot | Physical media degradation over time due to cosmic rays, magnetic decay, electrical noise | None—data appears valid |
| Firmware Bugs | Disk firmware misdirects writes, returns stale cached data, or performs incorrect operations | None—disk reports success |
| Memory Errors | RAM bit flips corrupt data before writing to disk or after reading from disk | None—data written as-is |
| RAID Write Hole | Power failure during RAID write leaves parity inconsistent with data | None—parity check passes incorrectly |
| Controller Bugs | Hardware RAID or HBA controllers corrupt data silently | None—appears as valid I/O |
| Phantom Writes | Disk reports write success but data never reaches platters | None—success was reported |
| Misdirected Writes | Data written to wrong location on disk | None—original location unchanged |
| Driver Bugs | I/O stack software corrupts data during transfer | None—OS layer issue |
The Research:
Large-scale field studies of deployed disks have repeatedly measured small but non-zero rates of silent corruption that the drives and controllers never reported. At scale, silent corruption is not a possibility—it's a certainty. The only question is whether you detect it or whether it propagates unnoticed through your backups and archives.
Undetected corruption is backed up as if valid. Over time, every copy of the data becomes corrupted. By the time you notice (perhaps years later), no valid copy exists anywhere. ZFS's innovation is making corruption visible immediately—when you still have the chance to recover from redundancy or recent backups.
ZFS computes a checksum for every block of data written to the pool (strong cryptographic algorithms are available; the fast fletcher4 is the default). This checksum is stored in the parent block's pointer to the data—a critical design decision that creates a self-validating data structure.
The Parent-Stored Checksum Model:
              UBERBLOCK
                  │
                  │  checksum of MOS stored here
                  ▼
         ┌─────────────────┐
         │   META-OBJECT   │
         │       SET       │
         └────────┬────────┘
                  │  checksum of child blocks stored here
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Indirect │ │ Indirect │ │ Indirect │
│  Block   │ │  Block   │ │  Block   │
└────┬─────┘ └────┬─────┘ └────┬─────┘
     │            │            │   checksums of data blocks here
   ┌─┴───┐     ┌──┴──┐     ┌───┴─┬─────┐
   ▼     ▼     ▼     ▼     ▼     ▼     ▼
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│Data │Data │Data │Data │Data │Data │Data │
│Blk 1│Blk 2│Blk 3│Blk 4│Blk 5│Blk 6│Blk 7│
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘

   EVERY block's checksum is verified against the
   checksum stored in its PARENT, not in itself.
Why parent-stored checksums?
If a block stored its own checksum, corrupt data could include a corrupt checksum that incorrectly matches. The corruption would be self-consistent and undetectable.
By storing the checksum in the parent, the checksum and the data live in separate blocks, typically at different locations on the disk or even on different disks. Corruption of the data block cannot touch the checksum that describes it. The parent block is the source of truth for what the child should contain.
```c
/*
 * ZFS Block Read with Checksum Verification
 * Simplified illustration of the verification flow.
 */
typedef enum zio_checksum {
    ZIO_CHECKSUM_OFF,          /* No checksum (not recommended) */
    ZIO_CHECKSUM_ON,           /* Default: fletcher4 */
    ZIO_CHECKSUM_LABEL,        /* Label block checksums */
    ZIO_CHECKSUM_GANG_HEADER,  /* Gang block headers */
    ZIO_CHECKSUM_FLETCHER_2,   /* Fletcher2 - faster, weaker */
    ZIO_CHECKSUM_FLETCHER_4,   /* Fletcher4 - default, good balance */
    ZIO_CHECKSUM_SHA256,       /* SHA-256 - cryptographic */
    ZIO_CHECKSUM_SHA512,       /* SHA-512 - longer hash */
    ZIO_CHECKSUM_SKEIN,        /* Skein - cryptographic, faster */
    ZIO_CHECKSUM_EDONR,        /* Edon-R - very fast cryptographic */
    ZIO_CHECKSUM_BLAKE3,       /* Blake3 - newest, fastest secure */
} zio_checksum_t;

/*
 * Checksum computation using specified algorithm
 */
void
checksum_compute(void *data, size_t size, zio_checksum_t algorithm,
    zio_cksum_t *result)
{
    switch (algorithm) {
    case ZIO_CHECKSUM_FLETCHER_4:
        fletcher_4_compute(data, size, result);
        break;
    case ZIO_CHECKSUM_SHA256:
        sha256_compute(data, size, result);
        break;
    case ZIO_CHECKSUM_SHA512:
        sha512_compute(data, size, result);
        break;
    case ZIO_CHECKSUM_BLAKE3:
        blake3_compute(data, size, result);
        break;
    /* ... other algorithms ... */
    }
}

/*
 * Block read with verification - the core ZFS read path
 */
int
zfs_read_block_verified(blkptr_t *bp, void *buffer)
{
    zio_cksum_t computed_checksum;
    zio_cksum_t *expected_checksum = &bp->blk_cksum;
    zio_checksum_t algorithm = BP_GET_CHECKSUM(bp);

    /* Read the raw block from disk */
    int err = read_physical_block(bp, buffer);
    if (err != 0) {
        return (err);  /* I/O error - disk reported failure */
    }

    /* Decompress if necessary (checksum is of decompressed data) */
    if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF) {
        decompress_block(buffer, BP_GET_COMPRESS(bp));
    }

    /* Compute checksum of what we read */
    checksum_compute(buffer, BP_GET_LSIZE(bp), algorithm,
        &computed_checksum);

    /* Compare against expected checksum from parent */
    if (!checksum_equal(&computed_checksum, expected_checksum)) {
        /*
         * CHECKSUM MISMATCH!
         * This is the critical moment ZFS is designed for.
         * The data we read does not match what was written.
         *
         * Traditional file systems would return this corrupted
         * data to the application, unaware of the problem.
         *
         * ZFS knows corruption occurred and can:
         *   1. Try alternate copies (from mirrors/RAIDZ)
         *   2. Report the error accurately
         *   3. Potentially repair from redundancy
         */
        return (EIO_CHECKSUM);
    }

    /* Success - data verified correct */
    return (0);
}

/*
 * The self-healing read: try multiple copies
 */
int
zfs_read_self_healing(blkptr_t *bp, void *buffer)
{
    int copies = BP_GET_NDVAS(bp);  /* Number of data copies */

    for (int i = 0; i < copies; i++) {
        dva_t *dva = &bp->blk_dva[i];

        if (!DVA_IS_VALID(dva))
            continue;

        /* Try reading from this copy */
        int err = read_from_dva(dva, buffer);
        if (err != 0)
            continue;  /* I/O error, try next copy */

        /* Verify checksum */
        zio_cksum_t computed;
        checksum_compute(buffer, BP_GET_LSIZE(bp),
            BP_GET_CHECKSUM(bp), &computed);

        if (checksum_equal(&computed, &bp->blk_cksum)) {
            /* SUCCESS! This copy is valid. */
            if (i > 0) {
                /*
                 * We used a non-primary copy - schedule repair
                 * of the corrupted copy.
                 */
                schedule_repair(bp, i);
            }
            return (0);
        }

        /* This copy corrupted, log and try next */
        log_checksum_error(bp, i);
    }

    /* All copies corrupted - unrecoverable */
    return (EIO_CHECKSUM_FATAL);
}
```

From the uberblock to every data block, ZFS forms a Merkle tree—a cryptographically secured chain of verification. Corruption anywhere is detected because the checksum in the parent won't match.
Even if an attacker or bug modifies a block, they cannot forge the parent's checksum without also modifying the parent, which is itself protected by its grandparent's checksum, all the way up to the uberblock.
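If you want to see these parent-stored checksums on a real pool, the zdb debugging tool can dump a file's block pointers. The sketch below is illustrative rather than exact: it assumes a pool named tank with a dataset tank/data containing a file, and zdb's output format varies between OpenZFS releases.

```bash
#!/bin/bash
# Minimal sketch: inspect the checksums carried by a file's block pointers.
# Assumes a pool "tank", a dataset "tank/data", and an existing file;
# requires root, and output details differ between OpenZFS versions.

FILE=/tank/data/file.txt

# A file's ZFS object number is the same as its inode number
OBJ=$(stat -c %i "$FILE")

# Dump the object with its indirect and data block pointers. Each L0
# (data) pointer printed here lives in the parent indirect block and
# carries the checksum of the child it references, e.g.:
#   0 L0 DVA[0]=<0:4e000:20000> ... fletcher4 ... cksum=3a4f...:...
sudo zdb -ddddd tank/data "$OBJ"
```

The cksum field you see is stored one level above the data it describes, which is exactly the parent-stored design sketched above.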
ZFS offers multiple checksum algorithms, balancing computational cost against collision resistance. The choice affects both CPU usage and the probability of undetected corruption.
| Algorithm | Hash Size | Speed | Cryptographic | Best For |
|---|---|---|---|---|
| fletcher4 | 256 bits | Very fast | No | General use, default, low CPU overhead |
| fletcher2 | 256 bits | Fastest | No | Legacy, not recommended |
| sha256 | 256 bits | Slow | Yes | When dedup enabled, high-security needs |
| sha512 | 256 bits (truncated SHA-512/256) | Slow | Yes | Faster alternative to sha256 on most 64-bit CPUs |
| skein | 256 bits | Medium | Yes | Good balance of speed and security |
| edonr | 256 bits | Fast | Yes* | Fast cryptographic, high throughput |
| blake3 | 256 bits | Very fast | Yes | Modern best choice for security + speed |

*Edon-R is cryptographic, but when used with deduplication ZFS requires verification (dedup=edonr,verify).
Understanding the Trade-offs:
Fletcher4 (Default): A non-cryptographic checksum that's extremely fast—often hardware-accelerated. It detects essentially all random corruption (bit flips, media errors). It does NOT protect against intentional modification by an attacker, but for data integrity purposes, it's excellent.
SHA-256/SHA-512: Cryptographic hashes that provide protection against intentional tampering. Deduplication relies on a collision-resistant checksum (or an explicit verify step) so that two different blocks are never treated as identical. Significantly slower—can reduce throughput on CPU-bound workloads.
BLAKE3 (OpenZFS 2.2+): The newest option, offering cryptographic security at speeds rivaling fletcher4. If your ZFS version supports it, BLAKE3 is the best choice for new pools where security matters.
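Because blake3 arrives as a pool feature flag in OpenZFS 2.2, it is worth confirming that your pool and module actually support it before switching the property. A brief sketch, assuming a pool named tank; the benchmark kstat is only present on recent releases.

```bash
#!/bin/bash
# Check blake3 availability before using it (assumes a pool named "tank")

# Is the blake3 feature known to this pool?
zpool get feature@blake3 tank
# disabled       = supported by the software but not yet enabled on the pool
# enabled/active = ready to use

# Enable all features the running OpenZFS version supports
# (one-way: older software may no longer import the pool afterwards)
sudo zpool upgrade tank

# On recent OpenZFS the module benchmarks its checksum implementations
# at load time; if this kstat exists it shows per-algorithm throughput
cat /proc/spl/kstat/zfs/chksum_bench 2>/dev/null

# Switch new writes on a dataset to blake3 (existing blocks keep the
# checksum they were written with)
sudo zfs set checksum=blake3 tank/data
```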
```bash
#!/bin/bash
# Configuring ZFS Checksums

# Check current checksum algorithm
zfs get checksum tank/data
# NAME       PROPERTY  VALUE  SOURCE
# tank/data  checksum  on     default
# "on" means default (fletcher4)

# Set checksum algorithm for dataset
zfs set checksum=sha256 tank/secure

# Set checksum for new pool (inherited by all datasets)
zpool create -O checksum=blake3 tank /dev/sda

# Per-dataset checksum (overrides pool default)
zfs create -o checksum=sha512 tank/highly_secure

# Disable checksums (NEVER DO THIS in production!)
# Only for specific cases like swap volumes
zfs set checksum=off tank/swap
# This provides NO protection against corruption

# View checksum errors from pool status
zpool status -v tank
# Example output with checksum errors:
#   NAME        STATE     READ WRITE CKSUM
#   tank        ONLINE       0     0     0
#     sda       ONLINE       0     0     2   ← 2 checksum errors!
#     sdb       ONLINE       0     0     0

# Check detailed device error statistics
zpool status -x   # Show only pools with errors

# Benchmark checksum performance on your hardware
# (Not a ZFS command, but useful for comparison)
echo "fletcher4 speed (estimated similar to zfs default)"
openssl speed sha256 sha512
# Compare against your CPU's capabilities
```

Use cryptographic checksums (sha256, blake3) when: (1) using deduplication—required to prevent attacks, (2) data integrity against tampering matters, (3) you're storing data that could be targeted by sophisticated attackers. For most use cases, fletcher4 provides excellent corruption detection with minimal CPU overhead.
Detecting corruption is valuable; correcting it automatically is transformational. ZFS's self-healing capability combines checksum verification with redundancy to repair corruption without administrator intervention.
The Self-Healing Process:
SELF-HEALING DATA REPAIR
═══════════════════════════════════════════════════════════════════

SCENARIO: Reading a block from a mirrored pool

Step 1: Application requests file data
    ┌─────────────────────────────────────────────┐
    │ Application: read("/tank/data/file.txt")    │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 2: ZFS identifies block location and reads from primary
    ┌─────────────────────────────────────────────┐
    │ Read from disk /dev/sda (primary copy)      │
    │ Block contains: 0x3F 0x8A 0x2C 0x1D ...     │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 3: Verify checksum against parent's stored value
    ┌─────────────────────────────────────────────┐
    │ Computed: 0x1A2B3C4D5E6F7890                │
    │ Expected: 0x1A2B3C4D5E6F7890  ✓             │  (Match = Success)
    │                     OR                      │
    │ Computed: 0x9F8E7D6C5B4A3210                │
    │ Expected: 0x1A2B3C4D5E6F7890  ✗             │  (Mismatch!)
    └─────────────────────────────────────────────┘

═══════════════════════════════════════════════════════════════════
IF CHECKSUM MISMATCH DETECTED:
═══════════════════════════════════════════════════════════════════

Step 4: Log the error and try alternate copy
    ┌─────────────────────────────────────────────┐
    │ LOG: "checksum error on /dev/sda at         │
    │       offset 0x12345678, attempting         │
    │       recovery from mirror"                 │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 5: Read from mirror copy (/dev/sdb)
    ┌─────────────────────────────────────────────┐
    │ Read from disk /dev/sdb (mirror copy)       │
    │ Block contains: 0x7E 0x91 0x4F 0x2A ...     │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 6: Verify checksum of mirror copy
    ┌─────────────────────────────────────────────┐
    │ Computed: 0x1A2B3C4D5E6F7890                │
    │ Expected: 0x1A2B3C4D5E6F7890  ✓ MATCH!      │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 7: Self-healing - rewrite good data to corrupted disk
    ┌─────────────────────────────────────────────┐
    │ Write verified block back to /dev/sda       │
    │ Corruption REPAIRED automatically           │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 8: Return good data to application
    ┌─────────────────────────────────────────────┐
    │ Application receives correct data           │
    │ User unaware any error occurred             │
    └─────────────────────────────────────────────┘

═══════════════════════════════════════════════════════════════════
RESULT:
  - Corruption detected                        ✓
  - Good data retrieved from redundancy        ✓
  - Corrupted copy repaired                    ✓
  - Application receives valid data            ✓
  - No administrator intervention required     ✓
═══════════════════════════════════════════════════════════════════

Self-healing happens transparently. The application requesting the data has no idea corruption occurred—it simply receives correct data (slightly delayed while ZFS fetched the good copy). Pool administrators should monitor checksum error counts in 'zpool status' to identify failing disks before redundancy is exhausted.
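If you want to watch detection and self-healing happen end to end, you can reproduce the scenario above safely with a throwaway, file-backed mirror instead of real disks. The sketch below is illustrative: the pool name healtest and the /var/tmp paths are arbitrary, and deliberately corrupting a vdev must never be done on a pool you care about.

```bash
#!/bin/bash
# Demonstrate self-healing on a disposable, file-backed mirror.
# Uses scratch files only; do NOT run this against a real pool.

# 1. Create two 256 MB backing files and a mirrored test pool
truncate -s 256M /var/tmp/zfs-a.img /var/tmp/zfs-b.img
sudo zpool create healtest mirror /var/tmp/zfs-a.img /var/tmp/zfs-b.img

# 2. Write some data, then export so nothing is cached
sudo dd if=/dev/urandom of=/healtest/testfile bs=1M count=64
sudo zpool export healtest

# 3. Corrupt a large region of ONE mirror side, skipping the vdev
#    labels at the start of the device
sudo dd if=/dev/urandom of=/var/tmp/zfs-a.img bs=1M seek=8 count=32 conv=notrunc

# 4. Import and scrub: ZFS detects the checksum mismatches on the
#    damaged side and rewrites those blocks from the intact copy
sudo zpool import -d /var/tmp healtest
sudo zpool scrub healtest
sleep 30   # small pool, the scrub finishes quickly
sudo zpool status healtest
#   scan: scrub repaired <N>M ... with 0 errors ...
#   NAME                    STATE  READ WRITE CKSUM
#   healtest                ONLINE    0     0     0
#     mirror-0              ONLINE    0     0     0
#       /var/tmp/zfs-a.img  ONLINE    0     0     <N>   <- healed errors
#       /var/tmp/zfs-b.img  ONLINE    0     0     0

# 5. Clean up
sudo zpool destroy healtest
rm /var/tmp/zfs-a.img /var/tmp/zfs-b.img
```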
Self-healing repairs corruption at read time—but what about blocks that are never read? Archival data, old backups, and rarely-accessed files could harbor corruption for years, spreading through backup copies before detection.
ZFS Scrubbing addresses this by proactively reading and verifying every block in the pool, repairing corruption before it becomes a problem.
```bash
#!/bin/bash
# ZFS Scrub Operations

# ============================================
# STARTING A SCRUB
# ============================================

# Start a scrub on a pool
zpool scrub tank

# Scrubbing does:
#   1. Reads every block from disk
#   2. Verifies checksum against expected value
#   3. Repairs any corruption from redundant copies
#   4. Reports errors that cannot be repaired

# ============================================
# MONITORING SCRUB PROGRESS
# ============================================

# Check scrub status
zpool status tank

# Example output during scrub:
#   pool: tank
#  state: ONLINE
#   scan: scrub in progress since Mon Oct 16 10:00:00 2023
#         2.50T scanned at 500M/s, 1.25T issued at 250M/s
#         0 repaired, 0.00% done, 1d 8h to go
#
# "scanned"  = metadata processed
# "issued"   = actual data read and verified
# "repaired" = bytes fixed from redundancy

# Watch scrub progress in real-time
watch -n 5 'zpool status tank | grep scan'

# ============================================
# SCRUB RESULTS
# ============================================

# After completion:
#   scan: scrub repaired 16K in 48h30m with 0 errors on Wed Oct 18 10:30:00 2023

# If errors were found but not repaired:
#   scan: scrub repaired 0B in 48h30m with 3 errors on Wed Oct 18 10:30:00 2023

# View detailed error information
zpool status -v tank

# Example error output:
#   errors: Permanent errors have been detected:
#     FILE                                POOL
#     /tank/data/photos/IMG_1234.jpg      tank
#     /tank/data/documents/report.docx    tank
#
# These files have no remaining good copies - data is LOST

# ============================================
# MANAGING SCRUBS
# ============================================

# Pause a running scrub
zpool scrub -p tank

# Resume a paused scrub
zpool scrub tank

# Cancel a scrub entirely
zpool scrub -s tank

# ============================================
# SCRUB SCHEDULING (with cron or systemd)
# ============================================

# Add to crontab for weekly scrub (Sunday 2 AM)
# crontab -e
0 2 * * 0 /sbin/zpool scrub tank

# On systemd-based systems, enable the ZFS scrub timer
sudo systemctl enable zfs-scrub-weekly@tank.timer
sudo systemctl start zfs-scrub-weekly@tank.timer

# Or monthly scrub
sudo systemctl enable zfs-scrub-monthly@tank.timer

# ============================================
# SCRUB PERFORMANCE TUNING
# ============================================

# Limit scrub I/O to reduce impact on production
# (in /etc/modprobe.d/zfs.conf or via /sys/module/zfs/parameters)
# NOTE: tunable names vary by OpenZFS version; check which of these
#       exist on your system before relying on them.

# Reduce scrub priority (older ZFS on Linux releases)
echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay
# Values: 0-10, higher = slower scrub, less I/O impact

# Set scrub to low priority I/O class
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

# ============================================
# SCRUB FREQUENCY RECOMMENDATIONS
# ============================================

# Based on data importance and disk age:
#
#   Consumer SSDs/HDDs under warranty: Monthly
#   Consumer SSDs/HDDs over 3 years:   Weekly
#   Enterprise storage:                Weekly
#   Critical data / archival:          Weekly
#   ZFS on Root (boot disk):           Monthly
#
# General rule: More frequent is better,
# constrained by I/O impact on production
```

Scrubbing reads every block in the pool, potentially saturating disk I/O for hours or days on large pools. Schedule scrubs during low-usage periods. The I/O priority parameters above can throttle scrub I/O to maintain production performance, at the cost of longer scrub duration.
Beyond pool-level redundancy (mirrors, RAIDZ), ZFS offers dataset-level redundancy through the copies property. This causes ZFS to store multiple copies of each block, even on non-redundant pools.
```bash
#!/bin/bash
# Using the copies property for dataset-level redundancy

# Set copies on dataset creation
zfs create -o copies=2 tank/important

# Set copies on existing dataset (applies to new writes only)
zfs set copies=3 tank/critical

# Check current copies setting
zfs get copies tank/important
# NAME            PROPERTY  VALUE  SOURCE
# tank/important  copies    2      local

# BEHAVIOR OF copies PROPERTY:
#
#   copies=1: Single copy (default) - relies on pool redundancy
#   copies=2: Two copies of every block
#   copies=3: Three copies of every block

# ============================================
# USE CASES FOR copies > 1
# ============================================

# 1. Non-redundant pool (single disk or stripe)
#    - Pool has no redundancy, but critical data needs protection
#    - copies=2 provides self-healing capability on a single disk
zfs set copies=2 tank/irreplaceable_photos

# 2. Extra protection beyond RAIDZ
#    - RAIDZ1 pool with copies=2 = protection against 1 whole
#      disk failure PLUS 1 additional block corruption
#    - Not common, but useful for irreplaceable data

# 3. Testing self-healing
#    - Enable copies=2 on a single-disk pool to test ZFS behavior

# ============================================
# SPACE IMPACT
# ============================================

# copies=2 uses roughly 2x space
# copies=3 uses roughly 3x space

# Check actual space usage
zfs list -o name,used,refer,logicalused tank/important

# With copies=2 and no compression:
#   logicalused ≈ actual file sizes
#   used        ≈ 2 × logicalused
#
# With copies=2 and compression:
#   used ≈ 2 × (logicalused / compression_ratio)

# ============================================
# COPIES INTERACTION WITH POOL REDUNDANCY
# ============================================

# Pool type impact on copies placement:
#
# Mirror pool + copies=2:
#   Actually stores 4 copies total (2 per mirror side)
#
# RAIDZ1 + copies=2:
#   Stores 2 independent blocks, each with RAIDZ1 protection
#
# Single disk + copies=2:
#   Stores 2 copies at different disk locations
#   Provides protection against localized corruption,
#   NOT full disk failure
```

The copies property is most valuable for: (1) critical data on pools without redundancy, (2) data that's irreplaceable and worth the space cost, (3) situations where pool redundancy isn't sufficient. For most uses, pool-level redundancy (mirrors/RAIDZ) is more space-efficient than copies=2.
File data is important, but metadata—the structures that describe where files are, their permissions, and the pool configuration—is even more critical. Losing a data block loses one file; losing a metadata block can make entire directory trees inaccessible.
ZFS automatically provides extra protection for metadata through ditto blocks—additional copies of metadata written to different disks in the pool.
Ditto Block Placement:
When ZFS has multiple top-level vdevs, it places metadata ditto blocks on different vdevs. This means a single vdev failure (even total loss) cannot corrupt pool-wide metadata.
┌─────────────────────────────────────────────────────────────┐
│                         POOL: tank                          │
├──────────────────────────┬──────────────────────────────────┤
│         VDEV 1           │            VDEV 2                │
│    (raidz1: 4 disks)     │       (raidz1: 4 disks)          │
├──────────────────────────┼──────────────────────────────────┤
│                          │                                  │
│  Metadata Block M1       │  Metadata Block M1 (ditto)       │
│  Data Block D1           │  Data Block D2                   │
│  Metadata Block M2       │  Metadata Block M2 (ditto)       │
│  Data Block D3           │  Data Block D4                   │
│                          │                                  │
└──────────────────────────┴──────────────────────────────────┘
If VDEV 1 fails completely:
- Data blocks D1, D3 are lost (within RAIDZ protection)
- Metadata blocks M1, M2 survive (ditto copies on VDEV 2)
- Pool remains mountable
A pool with a single vdev (even a large RAIDZ) cannot distribute ditto blocks across vdevs. Consider using multiple smaller vdevs instead of one large vdev when possible. This improves not just metadata protection but also rebuild performance and I/O parallelism.
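How aggressively ZFS makes these extra metadata copies is visible (and tunable) per dataset via the redundant_metadata property, and the ditto copies themselves show up as multiple DVAs on metadata block pointers. A short sketch, assuming a pool named tank with a dataset tank/data; zdb output formatting varies by version.

```bash
#!/bin/bash
# Inspect metadata ditto behaviour (assumes pool "tank", dataset "tank/data")

# How much metadata gets extra ditto copies for this dataset
zfs get redundant_metadata tank/data
# "all"  (default) = extra copies for all metadata
# "most"           = extra copies only for the most critical metadata

# In zdb output, metadata block pointers typically list two or three
# DVAs (ditto copies), while ordinary data pointers list one:
#   L1  DVA[0]=<0:...> DVA[1]=<0:...>   <- indirect block, two copies
#   L0  DVA[0]=<0:...>                  <- data block, pool redundancy only
OBJ=$(stat -c %i /tank/data/somefile)
sudo zdb -ddddd tank/data "$OBJ"
```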
We've explored ZFS's comprehensive data integrity system—the foundation that makes ZFS trusted for mission-critical storage. The key insights:

- Every block is checksummed, and the checksum lives in the parent block pointer, forming a Merkle tree that reaches up to the uberblock.
- Corruption is detected at read time, before bad data reaches the application or your backups.
- Given redundancy (mirrors, RAIDZ, or copies > 1), ZFS self-heals by rewriting the damaged copy from a verified one.
- Scrubs proactively verify every block, catching corruption in data that is rarely read.
- Metadata automatically receives extra protection through ditto blocks spread across vdevs.
What's Next:
With checksums and data integrity understood, you now know how ZFS detects silent corruption, heals data automatically from redundancy, and protects metadata with extra copies. Next, we'll explore RAID-Z—ZFS's advanced software RAID implementation that eliminates the write hole, integrates with checksums for intelligent repair, and provides space-efficient redundancy for large storage pools.