RAID—Redundant Array of Independent Disks—has protected data for decades. By spreading data and parity across multiple disks, RAID systems survive disk failures that would otherwise cause data loss. Yet traditional RAID implementations carry fundamental flaws that ZFS's RAID-Z was designed to solve.
The most insidious is the write hole—a window during writes where a power failure leaves the array in an inconsistent state that cannot be detected or repaired. Hardware RAID controllers use battery-backed caches to mitigate this; software RAID on most systems simply accepts the risk.
RAID-Z eliminates the write hole through Copy-on-Write, integrates with ZFS checksums for intelligent repair, and provides flexible redundancy levels—all without expensive hardware controllers.
By the end of this page, you will understand:

- How RAID-Z differs from traditional RAID
- Why the write hole is dangerous and how Copy-on-Write eliminates it
- The three RAID-Z levels and when to use each
- How RAID-Z interacts with checksums for self-healing
- Practical guidance for RAID-Z pool design
Before understanding RAID-Z's innovations, let's review traditional RAID concepts and their limitations.
| Level | Technique | Capacity | Failures Survived | Weakness |
|---|---|---|---|---|
| RAID 0 | Striping only | 100% | 0 | Any disk failure = total data loss |
| RAID 1 | Mirroring | 50% | N-1 of N | Space inefficient for large arrays |
| RAID 5 | Striping + single parity | (N-1)/N | 1 | Write hole; slow rebuild; URE risk |
| RAID 6 | Striping + double parity | (N-2)/N | 2 | Write hole; very slow rebuild |
| RAID 10 | Mirrored stripes | 50% | 1 per mirror | Expensive; best for random I/O |
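To make the capacity column concrete, here is a quick arithmetic sketch for a hypothetical array of six 4 TB disks (the disk count and size are illustrative, not from the text above):

```bash
#!/bin/bash
# Usable capacity for a hypothetical array: N disks of SIZE_TB each.
N=6
SIZE_TB=4

echo "RAID 0   : $(( N * SIZE_TB )) TB usable (no redundancy)"
echo "RAID 1/10: $(( N * SIZE_TB / 2 )) TB usable (mirrored)"
echo "RAID 5   : $(( (N - 1) * SIZE_TB )) TB usable (one disk of parity)"
echo "RAID 6   : $(( (N - 2) * SIZE_TB )) TB usable (two disks of parity)"
```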
The Parity Problem:
RAID 5/6 use XOR-based parity to reconstruct data from any single (RAID 5) or double (RAID 6) disk failure. The parity calculation is simple:
Parity: D1 ⊕ D2 ⊕ D3 ⊕ D4 = P
If D2 fails:
D1 ⊕ ? ⊕ D3 ⊕ D4 = P
D2 = D1 ⊕ D3 ⊕ D4 ⊕ P
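To see the XOR algebra in action, here is a small shell sketch; the byte values are arbitrary placeholders, not real disk contents:

```bash
#!/bin/bash
# XOR parity over four placeholder data bytes.
D1=0xA1; D2=0xB2; D3=0xC3; D4=0xD4

# Parity is the XOR of all data blocks
P=$(( D1 ^ D2 ^ D3 ^ D4 ))
printf 'Parity P    = 0x%02X\n' "$P"

# Simulate losing D2, then rebuild it from the survivors plus parity
D2_rebuilt=$(( D1 ^ D3 ^ D4 ^ P ))
printf 'Rebuilt D2  = 0x%02X\n' "$D2_rebuilt"
printf 'Original D2 = 0x%02X\n' "$D2"
```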
This works beautifully—when the parity is consistent with the data. But what if power fails mid-write?
RAID protects against disk failure, not data loss. Accidental deletion, ransomware, software bugs, and controller failures can destroy data on RAID arrays. RAID-Z (like all RAID) is one layer of protection—snapshots and off-site backups remain essential.
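To make that layering concrete, here is a minimal sketch; the dataset name tank/data and the host backuphost are hypothetical:

```bash
#!/bin/bash
# Snapshots guard against deletion, ransomware, and bad software;
# replication to another machine guards against controller or site failure.
SNAP="tank/data@$(date +%Y-%m-%d)"

zfs snapshot "$SNAP"                                        # instant point-in-time copy
zfs send "$SNAP" | ssh backuphost zfs receive backup/data   # initial full replica
# Later runs would use incremental sends (zfs send -i) against the previous snapshot.
```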
The write hole is a fundamental flaw in traditional parity RAID (RAID 5/6). It occurs when a stripe write is interrupted, leaving data and parity inconsistent. The array cannot detect this inconsistency—it believes the stripe is healthy when it's actually corrupt.
```
THE WRITE HOLE IN TRADITIONAL RAID-5
═══════════════════════════════════════════════════════════════════

INITIAL STATE (Consistent stripe):
┌────────┬────────┬────────┬────────┬─────────────────────┐
│ Disk 1 │ Disk 2 │ Disk 3 │ Disk 4 │ Disk 5 (Parity)     │
├────────┼────────┼────────┼────────┼─────────────────────┤
│ D1=A   │ D2=B   │ D3=C   │ D4=D   │ P = A⊕B⊕C⊕D         │
└────────┴────────┴────────┴────────┴─────────────────────┘
✓ Parity is correct: if any disk fails, data can be rebuilt

WRITE OPERATION (Updating D2 from B to B'):
Step 1: Write new data to Disk 2
Step 2: Calculate new parity = A⊕B'⊕C⊕D
Step 3: Write new parity to Disk 5

─────────────────────────────────────────────────────────────────
⚡ POWER FAILURE AFTER STEP 1, BEFORE STEP 3 ⚡
─────────────────────────────────────────────────────────────────

RESULTING STATE (Inconsistent - THE WRITE HOLE):
┌────────┬────────┬────────┬────────┬─────────────────────┐
│ Disk 1 │ Disk 2 │ Disk 3 │ Disk 4 │ Disk 5 (Parity)     │
├────────┼────────┼────────┼────────┼─────────────────────┤
│ D1=A   │ D2=B'  │ D3=C   │ D4=D   │ P = A⊕B⊕C⊕D (OLD!)  │
│        │ (NEW)  │        │        │ Should be A⊕B'⊕C⊕D  │
└────────┴────────┴────────┴────────┴─────────────────────┘
✗ Parity is WRONG, but the array doesn't know it!

NOW DISK 3 FAILS (normal hardware failure):
─────────────────────────────────────────────────────────────────

REBUILD ATTEMPT:
  Rebuild D3 = D1 ⊕ D2 ⊕ D4 ⊕ P
             = A ⊕ B' ⊕ D ⊕ (A⊕B⊕C⊕D)
             = B' ⊕ B ⊕ C
             ≠ C   (WRONG!)

RESULT: D3 is rebuilt INCORRECTLY.
        Data corruption goes undetected.
        User believes data is safe.
        Backups contain corrupted data.

═══════════════════════════════════════════════════════════════════
WHY CAN'T THE ARRAY DETECT THIS?

Traditional RAID has no checksums. When Disk 3 fails and is rebuilt,
the array XORs the data and parity it finds. It has no way to know
the resulting value is wrong—it just does the math.

Only ZFS, with parent-stored checksums, would detect that the
rebuilt D3 doesn't match its expected checksum.
═══════════════════════════════════════════════════════════════════
```

The write hole is particularly dangerous because the corruption is invisible. The array reports healthy. Scrubs on traditional RAID don't detect it. Only when you actually need to rebuild—perhaps years after the inconsistency was created—do you discover that your "protected" data is silently corrupted.
RAID-Z eliminates the write hole through a fundamental architectural change: Copy-on-Write with full-stripe writes.
In traditional RAID, updating a single sector requires:

1. Read the old data block
2. Read the old parity block
3. Compute new parity = old parity ⊕ old data ⊕ new data
4. Write the new data block
5. Write the new parity block
This read-modify-write cycle creates the window where data and parity can become inconsistent.
RAID-Z instead uses full-stripe writes with variable width:
```
RAID-Z WRITE PATTERN (Copy-on-Write + Full-Stripe Writes)
═══════════════════════════════════════════════════════════════════

TRADITIONAL RAID-5: Fixed stripe width, partial writes
─────────────────────────────────────────────────────────────────
All stripes have the same width. Updating one block requires
read-modify-write of data and parity.

┌────────┬────────┬────────┬────────┬────────┐
│Stripe 1│ D1     │ D2     │ D3     │ Parity │  Fixed width = 4
├────────┼────────┼────────┼────────┼────────┤
│Stripe 2│ D4     │ D5     │ D6     │ Parity │  Fixed width = 4
├────────┼────────┼────────┼────────┼────────┤
│Stripe 3│ D7     │ D8     │ D9     │ Parity │  Fixed width = 4
└────────┴────────┴────────┴────────┴────────┘

RAID-Z: Variable stripe width, full-stripe writes
─────────────────────────────────────────────────────────────────
Stripe width varies based on data size. Every write is a complete,
new stripe. Existing data is never overwritten.

┌────────┬────────┬────────┬────────┬────────┬────────┐
│Stripe 1│ D1     │ D2     │ D3     │ D4     │ Parity │  Width = 5
├────────┼────────┼────────┼────────┼────────┼────────┤
│Stripe 2│ D5     │ D6     │ Parity │        │        │  Width = 3
├────────┼────────┼────────┼────────┼────────┼────────┤
│Stripe 3│ D7     │ Parity │        │        │        │  Width = 2
├────────┼────────┼────────┼────────┼────────┼────────┤
│Stripe 4│ D8     │ D9     │ D10    │ D11    │ Parity │  Width = 5
└────────┴────────┴────────┴────────┴────────┴────────┘
Empty space is simply where data hasn't been written yet.

WHY THIS ELIMINATES THE WRITE HOLE:
═══════════════════════════════════════════════════════════════════

1. Existing stripes are never overwritten
   → Old stripe remains valid until the new stripe is complete

2. Data and parity are written as a single atomic operation
   → Either the entire stripe commits successfully, or it doesn't

3. Pointers are updated only after the stripe is fully written
   → Crash before pointer update = new stripe is orphaned, old one valid
   → Crash after pointer update  = new stripe is complete and valid

4. No read-modify-write cycle
   → Nothing to become inconsistent

═══════════════════════════════════════════════════════════════════

WRITE OPERATION IN RAID-Z:
─────────────────────────────────────────────────────────────────

Time T1: Current state
  Old stripe: [D1_old][D2_old][D3_old][Parity_old]
  Pointer:    → Old stripe location

Time T2: Write new data (to NEW locations)
  New stripe: [D1_new][D2_new][D3_new][Parity_new]
  Pointer still: → Old stripe location (NOT UPDATED YET)
  ⚡ Power failure here? Old stripe is still valid!

Time T3: Atomic pointer update
  Pointer: → New stripe location

Time T4: Old stripe space added to free list
  (May be retained for snapshots)

NO WINDOW FOR INCONSISTENCY EXISTS.
```

RAID-Z doesn't mitigate the write hole with battery-backed caches or journaling. It eliminates the write hole structurally: the problem cannot occur because the architecture doesn't permit partial stripe updates. This is why RAID-Z doesn't need the hardware that traditional RAID controllers use for write hole protection.
ZFS offers three RAID-Z levels, differing in the number of parity blocks per stripe. More parity means more failures survived, at the cost of usable capacity.
| Level | Parity Disks | Min Disks | Usable Capacity | Survives |
|---|---|---|---|---|
| RAIDZ1 | 1 | 3 | (N-1)/N | 1 disk failure |
| RAIDZ2 | 2 | 4 | (N-2)/N | 2 disk failures |
| RAIDZ3 | 3 | 5 | (N-3)/N | 3 disk failures |
```bash
#!/bin/bash
# Creating RAID-Z Pools

# ============================================
# RAIDZ1 (Single Parity - Similar to RAID-5)
# ============================================

# Basic RAIDZ1 with 4 disks
# Usable capacity: 3/4 = 75%
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# RAIDZ1 with 6 disks
# Usable capacity: 5/6 = 83%
zpool create tank raidz1 /dev/sd{a,b,c,d,e,f}

# ============================================
# RAIDZ2 (Double Parity - Similar to RAID-6)
# ============================================

# Basic RAIDZ2 with 6 disks (recommended minimum for RAIDZ2)
# Usable capacity: 4/6 = 67%
zpool create tank raidz2 /dev/sd{a,b,c,d,e,f}

# RAIDZ2 with 8 disks
# Usable capacity: 6/8 = 75%
zpool create tank raidz2 /dev/sd{a,b,c,d,e,f,g,h}

# ============================================
# RAIDZ3 (Triple Parity)
# ============================================

# RAIDZ3 with 8 disks
# Usable capacity: 5/8 = 62.5%
zpool create tank raidz3 /dev/sd{a,b,c,d,e,f,g,h}

# RAIDZ3 with 10 disks
# Usable capacity: 7/10 = 70%
zpool create tank raidz3 /dev/sd{a,b,c,d,e,f,g,h,i,j}

# ============================================
# MULTIPLE VDEV POOLS (Striped RAID-Z)
# ============================================

# Two RAIDZ2 vdevs (12 disks total)
# Data stripes across vdevs; each vdev has double parity
# Usable capacity: (4+4)/(6+6) = 67%
# Performance: ~2x single vdev (parallel reads/writes)
zpool create tank \
    raidz2 /dev/sd{a,b,c,d,e,f} \
    raidz2 /dev/sd{g,h,i,j,k,l}

# Three RAIDZ2 vdevs (18 disks total)
zpool create tank \
    raidz2 /dev/sd{a,b,c,d,e,f} \
    raidz2 /dev/sd{g,h,i,j,k,l} \
    raidz2 /dev/sd{m,n,o,p,q,r}

# Mixing vdev types (NOT RECOMMENDED - unbalanced!)
# zpool create tank \
#     raidz2 /dev/sd{a,b,c,d,e,f} \
#     mirror /dev/sdg /dev/sdh      # Different redundancy level!

# ============================================
# OPTIMAL VDEV WIDTHS
# ============================================

# ZFS allocates in multiples of the sector size (ashift-based);
# optimal widths depend on recordsize and disk count.

# For RAIDZ1 (1 parity): 3, 5, 9, 17 disks optimal
# For RAIDZ2 (2 parity): 4, 6, 10, 18 disks optimal
# For RAIDZ3 (3 parity): 5, 7, 11, 19 disks optimal
# These "power of 2 + parity" widths minimize wasted space.

# Example: 9-disk RAIDZ1
# 8 data + 1 parity = 8 data disks (a power of 2)
zpool create tank raidz1 /dev/sd{a,b,c,d,e,f,g,h,i}
```

With modern large disks (8TB+), RAIDZ1 is increasingly risky. Rebuild times can exceed 24 hours, during which a second failure loses the pool. The probability of a second failure or unrecoverable read error (URE) during rebuild is non-trivial. RAIDZ2 or RAIDZ3 is strongly recommended for disks over 4TB.
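After creating a pool, it is worth confirming the layout and seeing how much space datasets can actually use. A short check, assuming the pool is named tank as in the examples above:

```bash
#!/bin/bash
# Verify vdev layout and capacity after pool creation.
zpool status tank      # shows the raidz1/raidz2/raidz3 vdev structure
zpool list -v tank     # raw size and allocation per vdev (parity included)
zfs list tank          # space actually available to datasets (parity excluded)
```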
Traditional RAID uses parity for redundancy: when a disk fails outright or reports an unrecoverable I/O error, the array knows which device is missing and uses parity to reconstruct its data.
But what if a disk returns bad data with no I/O error? Traditional RAID cannot handle this: a parity check would reveal that the stripe is inconsistent, but not which disk is wrong. The array might even "repair" the stripe by overwriting good data with values reconstructed from the corrupt disk.
ZFS RAID-Z solves this with checksum-based repair:
```
RAID-Z INTELLIGENT REPAIR PROCESS
═══════════════════════════════════════════════════════════════════

SCENARIO: 4-disk RAIDZ1, reading block X
          Block pointer contains the expected checksum

STEP 1: Normal Read
─────────────────────────────────────────────────────────────────
Read data from the disks holding this block

         ┌───────┬────────────┬───────┬───────────┐
 Stripe: │ D1_ok │ D2_corrupt │ D3_ok │ Parity_ok │
         └───────┴────────────┴───────┴───────────┘
                        ↑
          Silent corruption on Disk 2
          (returned wrong data, no I/O error)

Assembled block: D1 + D2_corrupt + D3 = WRONG DATA

STEP 2: Checksum Verification
─────────────────────────────────────────────────────────────────
Compute the checksum of the assembled block, compare to expected

  Expected (from parent block): 0xABCD1234
  Computed:                     0xFF001122   ← MISMATCH!

ZFS knows the block is corrupt. Traditional RAID would NOT.

STEP 3: Identify the Corrupt Component
─────────────────────────────────────────────────────────────────
ZFS uses a combinatorial approach:

  Attempt 1: Reconstruct assuming D1 is bad
    D1_rebuilt = D2_corrupt ⊕ D3 ⊕ Parity
    Block = D1_rebuilt + D2_corrupt + D3
    Checksum? No match. D1 was not the problem.

  Attempt 2: Reconstruct assuming D2 is bad
    D2_rebuilt = D1 ⊕ D3 ⊕ Parity
    Block = D1 + D2_rebuilt + D3
    Checksum? MATCH! D2 was corrupt.

  Attempt 3: (Would try D3 if D2 didn't match)

STEP 4: Return Correct Data
─────────────────────────────────────────────────────────────────
The block reconstructed with D2_rebuilt is correct.
Return it to the application. Schedule repair of D2's block.

STEP 5: Self-Healing Write
─────────────────────────────────────────────────────────────────
Write the correct D2 value back to Disk 2, repairing the corruption.

═══════════════════════════════════════════════════════════════════

TRADITIONAL RAID FAILURE SCENARIO:
─────────────────────────────────────────────────────────────────
Same silent corruption on Disk 2.

1. Read stripe: D1_ok, D2_corrupt, D3_ok, Parity_ok
2. No checksums → can't detect corruption
3. Return corrupt data to the application
4. Application processes/stores/displays wrong data
5. User never knows

OR (if parity verification is enabled):

1. Read stripe and verify parity
2. Parity doesn't match! Stripe is inconsistent.
3. Which disk is wrong? Unknown!
4. RAID might "repair" by recalculating D1 from D2_corrupt + D3 + Parity
5. Now D1 is ALSO wrong!
6. Corruption spreads instead of being fixed.
```

The combination of checksums and parity gives RAID-Z capabilities impossible in traditional RAID. Checksums detect all corruption, not just disk failures. Parity provides the redundancy to reconstruct. Together, they enable intelligent repair that never makes wrong decisions about which data is good.
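You can exercise this repair path on demand with a scrub, which reads every allocated block and verifies it against its checksum. A brief sketch, assuming a pool named tank:

```bash
#!/bin/bash
# Trigger and monitor checksum-based self-healing.
zpool scrub tank         # read all data, verify checksums, repair from parity
zpool status -v tank     # CKSUM column counts checksum errors found and repaired
zpool clear tank         # reset error counters once the cause is understood
```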
When a disk fails and is replaced, ZFS must rebuild the data that was on the failed disk. This process is called resilvering (a term from Sun's mirroring heritage). RAID-Z resilvering is smarter than traditional RAID rebuilds.
```bash
#!/bin/bash
# RAID-Z Disk Replacement and Resilvering

# ============================================
# DISK FAILURE DETECTION
# ============================================

# Check pool status for failures
zpool status tank

# Example output showing a degraded pool:
#   pool: tank
#  state: DEGRADED
# status: One or more devices has been removed by the administrator.
# action: Online the device using 'zpool online' or replace the
#         device with 'zpool replace'.
#  scrub: none requested
# config:
#
#         NAME        STATE     READ WRITE CKSUM
#         tank        DEGRADED     0     0     0
#           raidz2-0  DEGRADED     0     0     0
#             sda     ONLINE       0     0     0
#             sdb     REMOVED      0     0     0   ← FAILED
#             sdc     ONLINE       0     0     0
#             sdd     ONLINE       0     0     0
#             sde     ONLINE       0     0     0
#             sdf     ONLINE       0     0     0

# ============================================
# DISK REPLACEMENT
# ============================================

# Replace the failed disk with a new disk
zpool replace tank /dev/sdb /dev/sdz

# If the new disk occupies the same slot (after a physical swap):
zpool replace tank /dev/sdb

# Using disk by-id (recommended for reliability):
zpool replace tank \
    /dev/disk/by-id/wwn-0x50014ee2b47e9ae5 \
    /dev/disk/by-id/wwn-0x50014ee2b47e9bf6

# ============================================
# MONITORING RESILVER PROGRESS
# ============================================

# Basic status shows resilver progress
zpool status tank

# Example output during resilver:
#   scan: resilver in progress since Mon Oct 16 14:00:00 2023
#         2.50T scanned at 500M/s, 1.25T resilvered at 250M/s
#         22% done, 4h30m to go

# Watch progress in real time
watch -n 10 'zpool status tank | grep scan'

# ============================================
# RESILVER PERFORMANCE TUNING
# ============================================

# zfs_resilver_min_time_ms: minimum milliseconds spent on resilver I/O per
# transaction group (default 3000). Raising it speeds up the resilver at the
# cost of more impact on production I/O.
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

# View current settings
cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
cat /sys/module/zfs/parameters/zfs_resilver_delay   # older ZFS on Linux releases only

# Temporarily increase for a faster resilver (be careful!)
# In /etc/modprobe.d/zfs.conf:
# options zfs zfs_vdev_resilver_max_active=10

# ============================================
# HOT SPARES FOR AUTOMATIC REPLACEMENT
# ============================================

# Add a hot spare to the pool
zpool add tank spare /dev/sdz

# Enable autoreplace (disk fails → spare activates)
zpool set autoreplace=on tank

# When a disk fails:
# 1. ZFS automatically detaches the failed disk
# 2. Attaches the spare in its place
# 3. Begins resilvering
# 4. Administrator is alerted via ZED (ZFS Event Daemon)

# After replacing the original disk:
zpool detach tank /dev/sdz    # Detach the spare
# The spare returns to the spare pool for the next failure

# ============================================
# NORMAL VS HEALING RESILVER
# ============================================

# Normal resilver (after replace):
#   Copies data from surviving disks to the new disk,
#   using parity to compute the missing data.

# Healing resilver (after a temporarily offline disk returns):
#   The disk may still hold valid old data.
#   ZFS checksums each existing block and only copies blocks
#   that are actually missing or corrupt.
#   Much faster when the disk was only briefly offline.
```

While resilvering, the pool has reduced redundancy. In RAIDZ1, the pool has NO redundancy during resilver—another failure loses data. In RAIDZ2 with one disk failed, you still survive one more failure but not two. Prioritize resilver completion and avoid starting additional scrubs or heavy I/O during this window.
Designing a RAID-Z pool requires balancing performance, capacity, and reliability. Here's practical guidance based on industry experience.
| Disk Size | Recommended Level | Reasoning |
|---|---|---|
| ≤2TB | RAIDZ1 | Smaller disks resilver quickly; single-failure protection usually sufficient |
| 2TB-8TB | RAIDZ2 | Resilver times lengthen; double-failure protection warranted |
| 8TB-16TB | RAIDZ2 or RAIDZ3 | Long resilver windows; high URE exposure; triple parity for critical data |
| ≥16TB | RAIDZ3 or Mirrors | Very long resilvers; consider mirrors for performance despite space cost |
Mirrors sacrifice 50% capacity but offer: (1) faster resilvering (a straight disk-to-disk copy rather than parity reconstruction), (2) better random-read IOPS (reads can be served from any side of the mirror), and (3) a simpler failure mode (up to N-1 disks of an N-way mirror can fail). For databases, VMs, or other random-I/O-heavy workloads, mirrors often outperform similarly sized RAID-Z pools despite the lower capacity efficiency.
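For comparison, here is what the mirror alternative looks like at pool-creation time; the device names are placeholders:

```bash
#!/bin/bash
# Two striped 2-way mirrors: 50% usable capacity, fast resilver,
# and random-read IOPS that scale with the number of mirror vdevs.
zpool create tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd

# Mirror pools also grow easily: add another mirror vdev later.
zpool add tank mirror /dev/sde /dev/sdf
```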
We've explored RAID-Z—ZFS's revolutionary approach to software RAID that integrates redundancy with data integrity verification. Let's consolidate the key insights:

- Traditional parity RAID suffers from the write hole: an interrupted stripe write leaves data and parity silently inconsistent.
- RAID-Z eliminates the write hole structurally through Copy-on-Write and variable-width, full-stripe writes, with no battery-backed cache required.
- RAIDZ1, RAIDZ2, and RAIDZ3 trade usable capacity for one, two, or three parity blocks per stripe; larger disks justify more parity.
- Checksums plus parity let RAID-Z detect silent corruption, identify the bad device combinatorially, and self-heal.
- Resilvering restores redundancy after a disk replacement, but the pool runs with reduced protection until it completes.
- RAID-Z is one layer of protection; snapshots and off-site backups remain essential.
What's Next:
With RAID-Z's redundancy understood, we'll explore ZFS's most user-visible advanced features: snapshots and clones. These Copy-on-Write powered capabilities enable instant point-in-time copies, space-efficient backups, and powerful workflows impossible with traditional file systems.
You now understand RAID-Z: how it eliminates the write hole through Copy-on-Write, how it integrates with checksums for intelligent repair, and practical guidance for designing RAID-Z pools. Next, we'll explore snapshots and clones—ZFS's powerful data management capabilities.