Imagine retrieving a backup you made five years ago—a precious family photo, a critical contract, or years of research data. You open the file... and it's corrupted. Not obviously missing, not displaying an error during copy—just quietly, irrecoverably wrong.
This isn't a hypothetical scenario. Silent data corruption is a statistical certainty for anyone storing significant amounts of data over time. Unlike disk failures that announce themselves with I/O errors, silent corruption masquerades as valid data while containing errors.
The fundamental problem: traditional storage systems have no mechanism to distinguish between correct data and corrupted data. They trust that whatever bytes come back from a disk are the bytes that were written. This trust is misplaced.
By the end of this page, you will understand ZFS's end-to-end checksum architecture, how it detects corruption at read time, why parent-stored checksums create a Merkle tree of verification, the self-healing process that repairs corruption automatically, and the checksum algorithms available in ZFS.
Silent data corruption occurs when data changes without the storage system detecting or reporting the change. Understanding the sources of corruption clarifies why ZFS's approach is necessary.
The Data Path:
┌───────────────────────────────────────────────────────────────────┐
│                             DATA PATH                             │
│                                                                   │
│  Application → OS Buffer → File System → Volume Mgr → RAID → Disk │
│       ▲            ▲            ▲            ▲         ▲      ▲   │
│       │            │            │            │         │      │   │
│   App bugs    RAM errors     FS bugs     RAID bugs Firmware Media │
│                                         Write hole   bugs   decay │
└───────────────────────────────────────────────────────────────────┘
Every component in this path can introduce corruption.
| Source | Mechanism | Detection by Traditional FS |
|---|---|---|
| Bit Rot | Physical media degradation over time due to cosmic rays, magnetic decay, electrical noise | None—data appears valid |
| Firmware Bugs | Disk firmware misdirects writes, returns stale cached data, or performs incorrect operations | None—disk reports success |
| Memory Errors | RAM bit flips corrupt data before writing to disk or after reading from disk | None—data written as-is |
| RAID Write Hole | Power failure during RAID write leaves parity inconsistent with data | None—parity check passes incorrectly |
| Controller Bugs | Hardware RAID or HBA controllers corrupt data silently | None—appears as valid I/O |
| Phantom Writes | Disk reports write success but data never reaches platters | None—success was reported |
| Misdirected Writes | Data written to wrong location on disk | None—original location unchanged |
| Driver Bugs | I/O stack software corrupts data during transfer | None—OS layer issue |
The Research:
Large-scale field studies of deployed disks have repeatedly measured small but non-zero rates of silent corruption that the drives and controllers never reported. At scale, silent corruption is not a possibility—it's a certainty. The only question is whether you detect it or whether it propagates unnoticed through your backups and archives.
Undetected corruption is backed up as if valid. Over time, every copy of the data becomes corrupted. By the time you notice (perhaps years later), no valid copy exists anywhere. ZFS's innovation is making corruption visible immediately—when you still have the chance to recover from redundancy or recent backups.
ZFS computes a checksum for every block of data written to the pool (strong cryptographic algorithms are available; the fast fletcher4 is the default). This checksum is stored in the parent block's pointer to the data—a critical design decision that creates a self-validating data structure.
The Parent-Stored Checksum Model:
              UBERBLOCK
                  │
                  │  checksum of MOS stored here
                  ▼
         ┌─────────────────┐
         │   META-OBJECT   │
         │       SET       │
         └────────┬────────┘
                  │  checksum of child blocks stored here
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Indirect │ │ Indirect │ │ Indirect │
│  Block   │ │  Block   │ │  Block   │
└────┬─────┘ └────┬─────┘ └────┬─────┘
     │            │            │   checksums of data blocks here
   ┌─┴───┐     ┌──┴──┐     ┌───┴─┬─────┐
   ▼     ▼     ▼     ▼     ▼     ▼     ▼
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│Data │Data │Data │Data │Data │Data │Data │
│Blk 1│Blk 2│Blk 3│Blk 4│Blk 5│Blk 6│Blk 7│
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘

   EVERY block's checksum is verified against the
   checksum stored in its PARENT, not in itself.
Why parent-stored checksums?
If a block stored its own checksum, corrupt data could include a corrupt checksum that incorrectly matches. The corruption would be self-consistent and undetectable.
By storing the checksum in the parent, the checksum and the data live in separate blocks, typically at different locations on the disk or even on different disks. Corruption of the data block cannot touch the checksum that describes it. The parent block is the source of truth for what the child should contain.
```c
/*
 * ZFS Block Read with Checksum Verification
 * Simplified illustration of the verification flow.
 */
typedef enum zio_checksum {
    ZIO_CHECKSUM_OFF,          /* No checksum (not recommended) */
    ZIO_CHECKSUM_ON,           /* Default: fletcher4 */
    ZIO_CHECKSUM_LABEL,        /* Label block checksums */
    ZIO_CHECKSUM_GANG_HEADER,  /* Gang block headers */
    ZIO_CHECKSUM_FLETCHER_2,   /* Fletcher2 - faster, weaker */
    ZIO_CHECKSUM_FLETCHER_4,   /* Fletcher4 - default, good balance */
    ZIO_CHECKSUM_SHA256,       /* SHA-256 - cryptographic */
    ZIO_CHECKSUM_SHA512,       /* SHA-512 - longer hash */
    ZIO_CHECKSUM_SKEIN,        /* Skein - cryptographic, faster */
    ZIO_CHECKSUM_EDONR,        /* Edon-R - very fast cryptographic */
    ZIO_CHECKSUM_BLAKE3,       /* Blake3 - newest, fastest secure */
} zio_checksum_t;

/*
 * Checksum computation using specified algorithm
 */
void
checksum_compute(void *data, size_t size, zio_checksum_t algorithm,
    zio_cksum_t *result)
{
    switch (algorithm) {
    case ZIO_CHECKSUM_FLETCHER_4:
        fletcher_4_compute(data, size, result);
        break;
    case ZIO_CHECKSUM_SHA256:
        sha256_compute(data, size, result);
        break;
    case ZIO_CHECKSUM_SHA512:
        sha512_compute(data, size, result);
        break;
    case ZIO_CHECKSUM_BLAKE3:
        blake3_compute(data, size, result);
        break;
    /* ... other algorithms ... */
    }
}

/*
 * Block read with verification - the core ZFS read path
 */
int
zfs_read_block_verified(blkptr_t *bp, void *buffer)
{
    zio_cksum_t computed_checksum;
    zio_cksum_t *expected_checksum = &bp->blk_cksum;
    zio_checksum_t algorithm = BP_GET_CHECKSUM(bp);

    /* Read the raw block from disk */
    int err = read_physical_block(bp, buffer);
    if (err != 0) {
        return (err);  /* I/O error - disk reported failure */
    }

    /* Decompress if necessary (checksum is of decompressed data) */
    if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF) {
        decompress_block(buffer, BP_GET_COMPRESS(bp));
    }

    /* Compute checksum of what we read */
    checksum_compute(buffer, BP_GET_LSIZE(bp), algorithm,
        &computed_checksum);

    /* Compare against expected checksum from parent */
    if (!checksum_equal(&computed_checksum, expected_checksum)) {
        /*
         * CHECKSUM MISMATCH!
         * This is the critical moment ZFS is designed for.
         * The data we read does not match what was written.
         *
         * Traditional file systems would return this corrupted
         * data to the application, unaware of the problem.
         *
         * ZFS knows corruption occurred and can:
         *   1. Try alternate copies (from mirrors/RAIDZ)
         *   2. Report the error accurately
         *   3. Potentially repair from redundancy
         */
        return (EIO_CHECKSUM);
    }

    /* Success - data verified correct */
    return (0);
}

/*
 * The self-healing read: try multiple copies
 */
int
zfs_read_self_healing(blkptr_t *bp, void *buffer)
{
    int copies = BP_GET_NDVAS(bp);  /* Number of data copies */

    for (int i = 0; i < copies; i++) {
        dva_t *dva = &bp->blk_dva[i];

        if (!DVA_IS_VALID(dva))
            continue;

        /* Try reading from this copy */
        int err = read_from_dva(dva, buffer);
        if (err != 0)
            continue;  /* I/O error, try next copy */

        /* Verify checksum */
        zio_cksum_t computed;
        checksum_compute(buffer, BP_GET_LSIZE(bp),
            BP_GET_CHECKSUM(bp), &computed);

        if (checksum_equal(&computed, &bp->blk_cksum)) {
            /* SUCCESS! This copy is valid. */
            if (i > 0) {
                /*
                 * We used a non-primary copy - schedule repair
                 * of the corrupted copy.
                 */
                schedule_repair(bp, i);
            }
            return (0);
        }

        /* This copy corrupted, log and try next */
        log_checksum_error(bp, i);
    }

    /* All copies corrupted - unrecoverable */
    return (EIO_CHECKSUM_FATAL);
}
```

From the uberblock to every data block, ZFS forms a Merkle tree—a cryptographically secured chain of verification. Corruption anywhere is detected because the checksum in the parent won't match.
Even if an attacker or bug modifies a block, they cannot forge the parent's checksum without also modifying the parent, which is itself protected by its grandparent's checksum, all the way up to the uberblock.
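If you want to see these parent-stored checksums on a real pool, the zdb debugging tool can dump a file's block pointers. The sketch below is illustrative rather than exact: it assumes a pool named tank with a dataset tank/data containing a file, and zdb's output format varies between OpenZFS releases.

```bash
#!/bin/bash
# Minimal sketch: inspect the checksums carried by a file's block pointers.
# Assumes a pool "tank", a dataset "tank/data", and an existing file;
# requires root, and output details differ between OpenZFS versions.

FILE=/tank/data/file.txt

# A file's ZFS object number is the same as its inode number
OBJ=$(stat -c %i "$FILE")

# Dump the object with its indirect and data block pointers. Each L0
# (data) pointer printed here lives in the parent indirect block and
# carries the checksum of the child it references, e.g.:
#   0 L0 DVA[0]=<0:4e000:20000> ... fletcher4 ... cksum=3a4f...:...
sudo zdb -ddddd tank/data "$OBJ"
```

The cksum field you see is stored one level above the data it describes, which is exactly the parent-stored design sketched above.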
ZFS offers multiple checksum algorithms, balancing computational cost against collision resistance. The choice affects both CPU usage and the probability of undetected corruption.
| Algorithm | Hash Size | Speed | Cryptographic | Best For |
|---|---|---|---|---|
| fletcher4 | 256 bits | Very fast | No | General use, default, low CPU overhead |
| fletcher2 | 256 bits | Fastest | No | Legacy, not recommended |
| sha256 | 256 bits | Slow | Yes | When dedup enabled, high-security needs |
| sha512 | 256 bits (truncated SHA-512/256) | Slow | Yes | Faster alternative to sha256 on most 64-bit CPUs |
| skein | 256 bits | Medium | Yes | Good balance of speed and security |
| edonr | 256 bits | Fast | Yes* | Fast cryptographic, high throughput |
| blake3 | 256 bits | Very fast | Yes | Modern best choice for security + speed |

*Edon-R is cryptographic, but when used with deduplication ZFS requires verification (dedup=edonr,verify).
Understanding the Trade-offs:
Fletcher4 (Default): A non-cryptographic checksum that's extremely fast—often hardware-accelerated. It detects essentially all random corruption (bit flips, media errors). It does NOT protect against intentional modification by an attacker, but for data integrity purposes, it's excellent.
SHA-256/SHA-512: Cryptographic hashes that provide protection against intentional tampering. Deduplication relies on a collision-resistant checksum (or an explicit verify step) so that two different blocks are never treated as identical. Significantly slower—can reduce throughput on CPU-bound workloads.
BLAKE3 (OpenZFS 2.2+): The newest option, offering cryptographic security at speeds rivaling fletcher4. If your ZFS version supports it, BLAKE3 is the best choice for new pools where security matters.
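Because blake3 arrives as a pool feature flag in OpenZFS 2.2, it is worth confirming that your pool and module actually support it before switching the property. A brief sketch, assuming a pool named tank; the benchmark kstat is only present on recent releases.

```bash
#!/bin/bash
# Check blake3 availability before using it (assumes a pool named "tank")

# Is the blake3 feature known to this pool?
zpool get feature@blake3 tank
# disabled       = supported by the software but not yet enabled on the pool
# enabled/active = ready to use

# Enable all features the running OpenZFS version supports
# (one-way: older software may no longer import the pool afterwards)
sudo zpool upgrade tank

# On recent OpenZFS the module benchmarks its checksum implementations
# at load time; if this kstat exists it shows per-algorithm throughput
cat /proc/spl/kstat/zfs/chksum_bench 2>/dev/null

# Switch new writes on a dataset to blake3 (existing blocks keep the
# checksum they were written with)
sudo zfs set checksum=blake3 tank/data
```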
```bash
#!/bin/bash
# Configuring ZFS Checksums

# Check current checksum algorithm
zfs get checksum tank/data
# NAME       PROPERTY  VALUE  SOURCE
# tank/data  checksum  on     default
# "on" means default (fletcher4)

# Set checksum algorithm for dataset
zfs set checksum=sha256 tank/secure

# Set checksum for new pool (inherited by all datasets)
zpool create -O checksum=blake3 tank /dev/sda

# Per-dataset checksum (overrides pool default)
zfs create -o checksum=sha512 tank/highly_secure

# Disable checksums (NEVER DO THIS in production!)
# Only for specific cases like swap volumes
zfs set checksum=off tank/swap
# This provides NO protection against corruption

# View checksum errors from pool status
zpool status -v tank
# Example output with checksum errors:
#   NAME        STATE     READ WRITE CKSUM
#   tank        ONLINE       0     0     0
#     sda       ONLINE       0     0     2   ← 2 checksum errors!
#     sdb       ONLINE       0     0     0

# Check detailed device error statistics
zpool status -x   # Show only pools with errors

# Benchmark checksum performance on your hardware
# (Not a ZFS command, but useful for comparison)
echo "fletcher4 speed (estimated similar to zfs default)"
openssl speed sha256 sha512
# Compare against your CPU's capabilities
```

Use cryptographic checksums (sha256, blake3) when: (1) using deduplication—required to prevent attacks, (2) data integrity against tampering matters, (3) you're storing data that could be targeted by sophisticated attackers. For most use cases, fletcher4 provides excellent corruption detection with minimal CPU overhead.
Detecting corruption is valuable; correcting it automatically is transformational. ZFS's self-healing capability combines checksum verification with redundancy to repair corruption without administrator intervention.
The Self-Healing Process:
SELF-HEALING DATA REPAIR
═══════════════════════════════════════════════════════════════════

SCENARIO: Reading a block from a mirrored pool

Step 1: Application requests file data
    ┌─────────────────────────────────────────────┐
    │ Application: read("/tank/data/file.txt")    │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 2: ZFS identifies block location and reads from primary
    ┌─────────────────────────────────────────────┐
    │ Read from disk /dev/sda (primary copy)      │
    │ Block contains: 0x3F 0x8A 0x2C 0x1D ...     │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 3: Verify checksum against parent's stored value
    ┌─────────────────────────────────────────────┐
    │ Computed: 0x1A2B3C4D5E6F7890                │
    │ Expected: 0x1A2B3C4D5E6F7890  ✓             │  (Match = Success)
    │                     OR                      │
    │ Computed: 0x9F8E7D6C5B4A3210                │
    │ Expected: 0x1A2B3C4D5E6F7890  ✗             │  (Mismatch!)
    └─────────────────────────────────────────────┘

═══════════════════════════════════════════════════════════════════
IF CHECKSUM MISMATCH DETECTED:
═══════════════════════════════════════════════════════════════════

Step 4: Log the error and try alternate copy
    ┌─────────────────────────────────────────────┐
    │ LOG: "checksum error on /dev/sda at         │
    │       offset 0x12345678, attempting         │
    │       recovery from mirror"                 │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 5: Read from mirror copy (/dev/sdb)
    ┌─────────────────────────────────────────────┐
    │ Read from disk /dev/sdb (mirror copy)       │
    │ Block contains: 0x7E 0x91 0x4F 0x2A ...     │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 6: Verify checksum of mirror copy
    ┌─────────────────────────────────────────────┐
    │ Computed: 0x1A2B3C4D5E6F7890                │
    │ Expected: 0x1A2B3C4D5E6F7890  ✓ MATCH!      │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 7: Self-healing - rewrite good data to corrupted disk
    ┌─────────────────────────────────────────────┐
    │ Write verified block back to /dev/sda       │
    │ Corruption REPAIRED automatically           │
    └─────────────────────────────────────────────┘
                           │
                           ▼
Step 8: Return good data to application
    ┌─────────────────────────────────────────────┐
    │ Application receives correct data           │
    │ User unaware any error occurred             │
    └─────────────────────────────────────────────┘

═══════════════════════════════════════════════════════════════════
RESULT:
  - Corruption detected                        ✓
  - Good data retrieved from redundancy        ✓
  - Corrupted copy repaired                    ✓
  - Application receives valid data            ✓
  - No administrator intervention required     ✓
═══════════════════════════════════════════════════════════════════

Self-healing happens transparently. The application requesting the data has no idea corruption occurred—it simply receives correct data (slightly delayed while ZFS fetched the good copy). Pool administrators should monitor checksum error counts in 'zpool status' to identify failing disks before redundancy is exhausted.
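If you want to watch detection and self-healing happen end to end, you can reproduce the scenario above safely with a throwaway, file-backed mirror instead of real disks. The sketch below is illustrative: the pool name healtest and the /var/tmp paths are arbitrary, and deliberately corrupting a vdev must never be done on a pool you care about.

```bash
#!/bin/bash
# Demonstrate self-healing on a disposable, file-backed mirror.
# Uses scratch files only; do NOT run this against a real pool.

# 1. Create two 256 MB backing files and a mirrored test pool
truncate -s 256M /var/tmp/zfs-a.img /var/tmp/zfs-b.img
sudo zpool create healtest mirror /var/tmp/zfs-a.img /var/tmp/zfs-b.img

# 2. Write some data, then export so nothing is cached
sudo dd if=/dev/urandom of=/healtest/testfile bs=1M count=64
sudo zpool export healtest

# 3. Corrupt a large region of ONE mirror side, skipping the vdev
#    labels at the start of the device
sudo dd if=/dev/urandom of=/var/tmp/zfs-a.img bs=1M seek=8 count=32 conv=notrunc

# 4. Import and scrub: ZFS detects the checksum mismatches on the
#    damaged side and rewrites those blocks from the intact copy
sudo zpool import -d /var/tmp healtest
sudo zpool scrub healtest
sleep 30   # small pool, the scrub finishes quickly
sudo zpool status healtest
#   scan: scrub repaired <N>M ... with 0 errors ...
#   NAME                    STATE  READ WRITE CKSUM
#   healtest                ONLINE    0     0     0
#     mirror-0              ONLINE    0     0     0
#       /var/tmp/zfs-a.img  ONLINE    0     0     <N>   <- healed errors
#       /var/tmp/zfs-b.img  ONLINE    0     0     0

# 5. Clean up
sudo zpool destroy healtest
rm /var/tmp/zfs-a.img /var/tmp/zfs-b.img
```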
Self-healing repairs corruption at read time—but what about blocks that are never read? Archival data, old backups, and rarely-accessed files could harbor corruption for years, spreading through backup copies before detection.
ZFS Scrubbing addresses this by proactively reading and verifying every block in the pool, repairing corruption before it becomes a problem.
```bash
#!/bin/bash
# ZFS Scrub Operations

# ============================================
# STARTING A SCRUB
# ============================================

# Start a scrub on a pool
zpool scrub tank

# Scrubbing does:
#   1. Reads every block from disk
#   2. Verifies checksum against expected value
#   3. Repairs any corruption from redundant copies
#   4. Reports errors that cannot be repaired

# ============================================
# MONITORING SCRUB PROGRESS
# ============================================

# Check scrub status
zpool status tank

# Example output during scrub:
#   pool: tank
#  state: ONLINE
#   scan: scrub in progress since Mon Oct 16 10:00:00 2023
#         2.50T scanned at 500M/s, 1.25T issued at 250M/s
#         0 repaired, 0.00% done, 1d 8h to go
#
# "scanned"  = metadata processed
# "issued"   = actual data read and verified
# "repaired" = bytes fixed from redundancy

# Watch scrub progress in real-time
watch -n 5 'zpool status tank | grep scan'

# ============================================
# SCRUB RESULTS
# ============================================

# After completion:
#   scan: scrub repaired 16K in 48h30m with 0 errors on Wed Oct 18 10:30:00 2023

# If errors were found but not repaired:
#   scan: scrub repaired 0B in 48h30m with 3 errors on Wed Oct 18 10:30:00 2023

# View detailed error information
zpool status -v tank

# Example error output:
#   errors: Permanent errors have been detected:
#     FILE                                POOL
#     /tank/data/photos/IMG_1234.jpg      tank
#     /tank/data/documents/report.docx    tank
#
# These files have no remaining good copies - data is LOST

# ============================================
# MANAGING SCRUBS
# ============================================

# Pause a running scrub
zpool scrub -p tank

# Resume a paused scrub
zpool scrub tank

# Cancel a scrub entirely
zpool scrub -s tank

# ============================================
# SCRUB SCHEDULING (with cron or systemd)
# ============================================

# Add to crontab for weekly scrub (Sunday 2 AM)
# crontab -e
0 2 * * 0 /sbin/zpool scrub tank

# On systemd-based systems, enable the ZFS scrub timer
sudo systemctl enable zfs-scrub-weekly@tank.timer
sudo systemctl start zfs-scrub-weekly@tank.timer

# Or monthly scrub
sudo systemctl enable zfs-scrub-monthly@tank.timer

# ============================================
# SCRUB PERFORMANCE TUNING
# ============================================

# Limit scrub I/O to reduce impact on production
# (in /etc/modprobe.d/zfs.conf or via /sys/module/zfs/parameters)
# NOTE: tunable names vary by OpenZFS version; check which of these
#       exist on your system before relying on them.

# Reduce scrub priority (older ZFS on Linux releases)
echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay
# Values: 0-10, higher = slower scrub, less I/O impact

# Set scrub to low priority I/O class
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

# ============================================
# SCRUB FREQUENCY RECOMMENDATIONS
# ============================================

# Based on data importance and disk age:
#
#   Consumer SSDs/HDDs under warranty: Monthly
#   Consumer SSDs/HDDs over 3 years:   Weekly
#   Enterprise storage:                Weekly
#   Critical data / archival:          Weekly
#   ZFS on Root (boot disk):           Monthly
#
# General rule: More frequent is better,
# constrained by I/O impact on production
```

Scrubbing reads every block in the pool, potentially saturating disk I/O for hours or days on large pools. Schedule scrubs during low-usage periods. The I/O priority parameters above can throttle scrub I/O to maintain production performance, at the cost of longer scrub duration.
Beyond pool-level redundancy (mirrors, RAIDZ), ZFS offers dataset-level redundancy through the copies property. This causes ZFS to store multiple copies of each block, even on non-redundant pools.
```bash
#!/bin/bash
# Using the copies property for dataset-level redundancy

# Set copies on dataset creation
zfs create -o copies=2 tank/important

# Set copies on existing dataset (applies to new writes only)
zfs set copies=3 tank/critical

# Check current copies setting
zfs get copies tank/important
# NAME            PROPERTY  VALUE  SOURCE
# tank/important  copies    2      local

# BEHAVIOR OF copies PROPERTY:
#
#   copies=1: Single copy (default) - relies on pool redundancy
#   copies=2: Two copies of every block
#   copies=3: Three copies of every block

# ============================================
# USE CASES FOR copies > 1
# ============================================

# 1. Non-redundant pool (single disk or stripe)
#    - Pool has no redundancy, but critical data needs protection
#    - copies=2 provides self-healing capability on a single disk
zfs set copies=2 tank/irreplaceable_photos

# 2. Extra protection beyond RAIDZ
#    - RAIDZ1 pool with copies=2 = protection against 1 whole
#      disk failure PLUS 1 additional block corruption
#    - Not common, but useful for irreplaceable data

# 3. Testing self-healing
#    - Enable copies=2 on a single-disk pool to test ZFS behavior

# ============================================
# SPACE IMPACT
# ============================================

# copies=2 uses roughly 2x space
# copies=3 uses roughly 3x space

# Check actual space usage
zfs list -o name,used,refer,logicalused tank/important

# With copies=2 and no compression:
#   logicalused ≈ actual file sizes
#   used        ≈ 2 × logicalused
#
# With copies=2 and compression:
#   used ≈ 2 × (logicalused / compression_ratio)

# ============================================
# COPIES INTERACTION WITH POOL REDUNDANCY
# ============================================

# Pool type impact on copies placement:
#
# Mirror pool + copies=2:
#   Actually stores 4 copies total (2 per mirror side)
#
# RAIDZ1 + copies=2:
#   Stores 2 independent blocks, each with RAIDZ1 protection
#
# Single disk + copies=2:
#   Stores 2 copies at different disk locations
#   Provides protection against localized corruption,
#   NOT full disk failure
```

The copies property is most valuable for: (1) critical data on pools without redundancy, (2) data that's irreplaceable and worth the space cost, (3) situations where pool redundancy isn't sufficient. For most uses, pool-level redundancy (mirrors/RAIDZ) is more space-efficient than copies=2.
File data is important, but metadata—the structures that describe where files are, their permissions, and the pool configuration—is even more critical. Losing a data block loses one file; losing a metadata block can make entire directory trees inaccessible.
ZFS automatically provides extra protection for metadata through ditto blocks—additional copies of metadata written to different disks in the pool.
Ditto Block Placement:
When ZFS has multiple top-level vdevs, it places metadata ditto blocks on different vdevs. This means a single vdev failure (even total loss) cannot corrupt pool-wide metadata.
┌─────────────────────────────────────────────────────────────┐
│                         POOL: tank                          │
├──────────────────────────┬──────────────────────────────────┤
│         VDEV 1           │            VDEV 2                │
│    (raidz1: 4 disks)     │       (raidz1: 4 disks)          │
├──────────────────────────┼──────────────────────────────────┤
│                          │                                  │
│  Metadata Block M1       │  Metadata Block M1 (ditto)       │
│  Data Block D1           │  Data Block D2                   │
│  Metadata Block M2       │  Metadata Block M2 (ditto)       │
│  Data Block D3           │  Data Block D4                   │
│                          │                                  │
└──────────────────────────┴──────────────────────────────────┘
If VDEV 1 fails completely:
- Data blocks D1, D3 are lost (within RAIDZ protection)
- Metadata blocks M1, M2 survive (ditto copies on VDEV 2)
- Pool remains mountable
A pool with a single vdev (even a large RAIDZ) cannot distribute ditto blocks across vdevs. Consider using multiple smaller vdevs instead of one large vdev when possible. This improves not just metadata protection but also rebuild performance and I/O parallelism.
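How aggressively ZFS makes these extra metadata copies is visible (and tunable) per dataset via the redundant_metadata property, and the ditto copies themselves show up as multiple DVAs on metadata block pointers. A short sketch, assuming a pool named tank with a dataset tank/data; zdb output formatting varies by version.

```bash
#!/bin/bash
# Inspect metadata ditto behaviour (assumes pool "tank", dataset "tank/data")

# How much metadata gets extra ditto copies for this dataset
zfs get redundant_metadata tank/data
# "all"  (default) = extra copies for all metadata
# "most"           = extra copies only for the most critical metadata

# In zdb output, metadata block pointers typically list two or three
# DVAs (ditto copies), while ordinary data pointers list one:
#   L1  DVA[0]=<0:...> DVA[1]=<0:...>   <- indirect block, two copies
#   L0  DVA[0]=<0:...>                  <- data block, pool redundancy only
OBJ=$(stat -c %i /tank/data/somefile)
sudo zdb -ddddd tank/data "$OBJ"
```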
We've explored ZFS's comprehensive data integrity system—the foundation that makes ZFS trusted for mission-critical storage. The key insights:

- Every block is checksummed, and the checksum lives in the parent block pointer, forming a Merkle tree that reaches up to the uberblock.
- Corruption is detected at read time, before bad data reaches the application or your backups.
- Given redundancy (mirrors, RAIDZ, or copies > 1), ZFS self-heals by rewriting the damaged copy from a verified one.
- Scrubs proactively verify every block, catching corruption in data that is rarely read.
- Metadata automatically receives extra protection through ditto blocks spread across vdevs.
What's Next:
With checksums and data integrity understood, you now know how ZFS detects silent corruption, heals data automatically from redundancy, and protects metadata with extra copies. Next, we'll explore RAID-Z—ZFS's advanced software RAID implementation that eliminates the write hole, integrates with checksums for intelligent repair, and provides space-efficient redundancy for large storage pools.