RAID—Redundant Array of Independent Disks—has protected data for decades. By spreading data and parity across multiple disks, RAID systems survive disk failures that would otherwise cause data loss. Yet traditional RAID implementations carry fundamental flaws that ZFS's RAID-Z was designed to solve.
The most insidious is the write hole—a window during writes where a power failure leaves the array in an inconsistent state that cannot be detected or repaired. Hardware RAID controllers use battery-backed caches to mitigate this; software RAID on most systems simply accepts the risk.
RAID-Z eliminates the write hole through Copy-on-Write, integrates with ZFS checksums for intelligent repair, and provides flexible redundancy levels—all without expensive hardware controllers.
By the end of this page, you will understand:

- How RAID-Z differs from traditional RAID
- Why the write hole is dangerous and how Copy-on-Write eliminates it
- The three RAID-Z levels and when to use each
- How RAID-Z interacts with checksums for self-healing
- Practical guidance for RAID-Z pool design
Before understanding RAID-Z's innovations, let's review traditional RAID concepts and their limitations.
| Level | Technique | Capacity | Failures Survived | Weakness |
|---|---|---|---|---|
| RAID 0 | Striping only | 100% | 0 | Any disk failure = total data loss |
| RAID 1 | Mirroring | 50% | N-1 of N | Space inefficient for large arrays |
| RAID 5 | Striping + single parity | (N-1)/N | 1 | Write hole; slow rebuild; URE risk |
| RAID 6 | Striping + double parity | (N-2)/N | 2 | Write hole; very slow rebuild |
| RAID 10 | Mirrored stripes | 50% | 1 per mirror | Expensive; best for random I/O |
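To make the capacity column concrete, here is a quick arithmetic sketch for a hypothetical array of six 4 TB disks (the disk count and size are illustrative, not from the text above):

```bash
#!/bin/bash
# Usable capacity for a hypothetical array: N disks of SIZE_TB each.
N=6
SIZE_TB=4

echo "RAID 0   : $(( N * SIZE_TB )) TB usable (no redundancy)"
echo "RAID 1/10: $(( N * SIZE_TB / 2 )) TB usable (mirrored)"
echo "RAID 5   : $(( (N - 1) * SIZE_TB )) TB usable (one disk of parity)"
echo "RAID 6   : $(( (N - 2) * SIZE_TB )) TB usable (two disks of parity)"
```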
The Parity Problem:
RAID 5/6 use XOR-based parity to reconstruct data from any single (RAID 5) or double (RAID 6) disk failure. The parity calculation is simple:
Parity: D1 ⊕ D2 ⊕ D3 ⊕ D4 = P
If D2 fails:
D1 ⊕ ? ⊕ D3 ⊕ D4 = P
D2 = D1 ⊕ D3 ⊕ D4 ⊕ P
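To see the XOR algebra in action, here is a small shell sketch; the byte values are arbitrary placeholders, not real disk contents:

```bash
#!/bin/bash
# XOR parity over four placeholder data bytes.
D1=0xA1; D2=0xB2; D3=0xC3; D4=0xD4

# Parity is the XOR of all data blocks
P=$(( D1 ^ D2 ^ D3 ^ D4 ))
printf 'Parity P    = 0x%02X\n' "$P"

# Simulate losing D2, then rebuild it from the survivors plus parity
D2_rebuilt=$(( D1 ^ D3 ^ D4 ^ P ))
printf 'Rebuilt D2  = 0x%02X\n' "$D2_rebuilt"
printf 'Original D2 = 0x%02X\n' "$D2"
```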
This works beautifully—when the parity is consistent with the data. But what if power fails mid-write?
RAID protects against disk failure, not data loss. Accidental deletion, ransomware, software bugs, and controller failures can destroy data on RAID arrays. RAID-Z (like all RAID) is one layer of protection—snapshots and off-site backups remain essential.
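To make that layering concrete, here is a minimal sketch; the dataset name tank/data and the host backuphost are hypothetical:

```bash
#!/bin/bash
# Snapshots guard against deletion, ransomware, and bad software;
# replication to another machine guards against controller or site failure.
SNAP="tank/data@$(date +%Y-%m-%d)"

zfs snapshot "$SNAP"                                        # instant point-in-time copy
zfs send "$SNAP" | ssh backuphost zfs receive backup/data   # initial full replica
# Later runs would use incremental sends (zfs send -i) against the previous snapshot.
```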
The write hole is a fundamental flaw in traditional parity RAID (RAID 5/6). It occurs when a stripe write is interrupted, leaving data and parity inconsistent. The array cannot detect this inconsistency—it believes the stripe is healthy when it's actually corrupt.
```
THE WRITE HOLE IN TRADITIONAL RAID-5
═══════════════════════════════════════════════════════════════════

INITIAL STATE (Consistent stripe):
┌────────┬────────┬────────┬────────┬─────────────────────┐
│ Disk 1 │ Disk 2 │ Disk 3 │ Disk 4 │ Disk 5 (Parity)     │
├────────┼────────┼────────┼────────┼─────────────────────┤
│ D1=A   │ D2=B   │ D3=C   │ D4=D   │ P = A⊕B⊕C⊕D         │
└────────┴────────┴────────┴────────┴─────────────────────┘
✓ Parity is correct: if any disk fails, data can be rebuilt

WRITE OPERATION (Updating D2 from B to B'):
Step 1: Write new data to Disk 2
Step 2: Calculate new parity = A⊕B'⊕C⊕D
Step 3: Write new parity to Disk 5

─────────────────────────────────────────────────────────────────
⚡ POWER FAILURE AFTER STEP 1, BEFORE STEP 3 ⚡
─────────────────────────────────────────────────────────────────

RESULTING STATE (Inconsistent - THE WRITE HOLE):
┌────────┬────────┬────────┬────────┬─────────────────────┐
│ Disk 1 │ Disk 2 │ Disk 3 │ Disk 4 │ Disk 5 (Parity)     │
├────────┼────────┼────────┼────────┼─────────────────────┤
│ D1=A   │ D2=B'  │ D3=C   │ D4=D   │ P = A⊕B⊕C⊕D (OLD!)  │
│        │ (NEW)  │        │        │ Should be A⊕B'⊕C⊕D  │
└────────┴────────┴────────┴────────┴─────────────────────┘
✗ Parity is WRONG, but the array doesn't know it!

NOW DISK 3 FAILS (normal hardware failure):
─────────────────────────────────────────────────────────────────

REBUILD ATTEMPT:
  Rebuild D3 = D1 ⊕ D2 ⊕ D4 ⊕ P
             = A ⊕ B' ⊕ D ⊕ (A⊕B⊕C⊕D)
             = B' ⊕ B ⊕ C
             ≠ C   (WRONG!)

RESULT: D3 is rebuilt INCORRECTLY.
        Data corruption goes undetected.
        User believes data is safe.
        Backups contain corrupted data.

═══════════════════════════════════════════════════════════════════
WHY CAN'T THE ARRAY DETECT THIS?

Traditional RAID has no checksums. When Disk 3 fails and is rebuilt,
the array XORs the data and parity it finds. It has no way to know
the resulting value is wrong—it just does the math.

Only ZFS, with parent-stored checksums, would detect that the
rebuilt D3 doesn't match its expected checksum.
═══════════════════════════════════════════════════════════════════
```

The write hole is particularly dangerous because the corruption is invisible. The array reports healthy. Scrubs on traditional RAID don't detect it. Only when you actually need to rebuild—perhaps years after the inconsistency was created—do you discover that your "protected" data is silently corrupted.
RAID-Z eliminates the write hole through a fundamental architectural change: Copy-on-Write with full-stripe writes.
In traditional RAID, updating a single sector requires:

1. Read the old data block
2. Read the old parity block
3. Compute new parity = old parity ⊕ old data ⊕ new data
4. Write the new data block
5. Write the new parity block
This read-modify-write cycle creates the window where data and parity can become inconsistent.
RAID-Z instead uses full-stripe writes with variable width:
```
RAID-Z WRITE PATTERN (Copy-on-Write + Full-Stripe Writes)
═══════════════════════════════════════════════════════════════════

TRADITIONAL RAID-5: Fixed stripe width, partial writes
─────────────────────────────────────────────────────────────────
All stripes have the same width. Updating one block requires
read-modify-write of data and parity.

┌────────┬────────┬────────┬────────┬────────┐
│Stripe 1│ D1     │ D2     │ D3     │ Parity │  Fixed width = 4
├────────┼────────┼────────┼────────┼────────┤
│Stripe 2│ D4     │ D5     │ D6     │ Parity │  Fixed width = 4
├────────┼────────┼────────┼────────┼────────┤
│Stripe 3│ D7     │ D8     │ D9     │ Parity │  Fixed width = 4
└────────┴────────┴────────┴────────┴────────┘

RAID-Z: Variable stripe width, full-stripe writes
─────────────────────────────────────────────────────────────────
Stripe width varies based on data size. Every write is a complete,
new stripe. Existing data is never overwritten.

┌────────┬────────┬────────┬────────┬────────┬────────┐
│Stripe 1│ D1     │ D2     │ D3     │ D4     │ Parity │  Width = 5
├────────┼────────┼────────┼────────┼────────┼────────┤
│Stripe 2│ D5     │ D6     │ Parity │        │        │  Width = 3
├────────┼────────┼────────┼────────┼────────┼────────┤
│Stripe 3│ D7     │ Parity │        │        │        │  Width = 2
├────────┼────────┼────────┼────────┼────────┼────────┤
│Stripe 4│ D8     │ D9     │ D10    │ D11    │ Parity │  Width = 5
└────────┴────────┴────────┴────────┴────────┴────────┘
Empty space is simply where data hasn't been written yet.

WHY THIS ELIMINATES THE WRITE HOLE:
═══════════════════════════════════════════════════════════════════

1. Existing stripes are never overwritten
   → Old stripe remains valid until the new stripe is complete

2. Data and parity are written as a single atomic operation
   → Either the entire stripe commits successfully, or it doesn't

3. Pointers are updated only after the stripe is fully written
   → Crash before pointer update = new stripe is orphaned, old one valid
   → Crash after pointer update  = new stripe is complete and valid

4. No read-modify-write cycle
   → Nothing to become inconsistent

═══════════════════════════════════════════════════════════════════

WRITE OPERATION IN RAID-Z:
─────────────────────────────────────────────────────────────────

Time T1: Current state
  Old stripe: [D1_old][D2_old][D3_old][Parity_old]
  Pointer:    → Old stripe location

Time T2: Write new data (to NEW locations)
  New stripe: [D1_new][D2_new][D3_new][Parity_new]
  Pointer still: → Old stripe location (NOT UPDATED YET)
  ⚡ Power failure here? Old stripe is still valid!

Time T3: Atomic pointer update
  Pointer: → New stripe location

Time T4: Old stripe space added to free list
  (May be retained for snapshots)

NO WINDOW FOR INCONSISTENCY EXISTS.
```

RAID-Z doesn't mitigate the write hole with battery-backed caches or journaling. It eliminates the write hole structurally: the problem cannot occur because the architecture doesn't permit partial stripe updates. This is why RAID-Z doesn't need the hardware that traditional RAID controllers use for write hole protection.
ZFS offers three RAID-Z levels, differing in the number of parity blocks per stripe. More parity means more failures survived, at the cost of usable capacity.
| Level | Parity Disks | Min Disks | Usable Capacity | Survives |
|---|---|---|---|---|
| RAIDZ1 | 1 | 3 | (N-1)/N | 1 disk failure |
| RAIDZ2 | 2 | 4 | (N-2)/N | 2 disk failures |
| RAIDZ3 | 3 | 5 | (N-3)/N | 3 disk failures |
```bash
#!/bin/bash
# Creating RAID-Z Pools

# ============================================
# RAIDZ1 (Single Parity - Similar to RAID-5)
# ============================================

# Basic RAIDZ1 with 4 disks
# Usable capacity: 3/4 = 75%
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# RAIDZ1 with 6 disks
# Usable capacity: 5/6 = 83%
zpool create tank raidz1 /dev/sd{a,b,c,d,e,f}

# ============================================
# RAIDZ2 (Double Parity - Similar to RAID-6)
# ============================================

# Basic RAIDZ2 with 6 disks (recommended minimum for RAIDZ2)
# Usable capacity: 4/6 = 67%
zpool create tank raidz2 /dev/sd{a,b,c,d,e,f}

# RAIDZ2 with 8 disks
# Usable capacity: 6/8 = 75%
zpool create tank raidz2 /dev/sd{a,b,c,d,e,f,g,h}

# ============================================
# RAIDZ3 (Triple Parity)
# ============================================

# RAIDZ3 with 8 disks
# Usable capacity: 5/8 = 62.5%
zpool create tank raidz3 /dev/sd{a,b,c,d,e,f,g,h}

# RAIDZ3 with 10 disks
# Usable capacity: 7/10 = 70%
zpool create tank raidz3 /dev/sd{a,b,c,d,e,f,g,h,i,j}

# ============================================
# MULTIPLE VDEV POOLS (Striped RAID-Z)
# ============================================

# Two RAIDZ2 vdevs (12 disks total)
# Data stripes across vdevs; each vdev has double parity
# Usable capacity: (4+4)/(6+6) = 67%
# Performance: ~2x single vdev (parallel reads/writes)
zpool create tank \
    raidz2 /dev/sd{a,b,c,d,e,f} \
    raidz2 /dev/sd{g,h,i,j,k,l}

# Three RAIDZ2 vdevs (18 disks total)
zpool create tank \
    raidz2 /dev/sd{a,b,c,d,e,f} \
    raidz2 /dev/sd{g,h,i,j,k,l} \
    raidz2 /dev/sd{m,n,o,p,q,r}

# Mixing vdev types (NOT RECOMMENDED - unbalanced!)
# zpool create tank \
#     raidz2 /dev/sd{a,b,c,d,e,f} \
#     mirror /dev/sdg /dev/sdh      # Different redundancy level!

# ============================================
# OPTIMAL VDEV WIDTHS
# ============================================

# ZFS allocates in multiples of the sector size (ashift-based);
# optimal widths depend on recordsize and disk count.

# For RAIDZ1 (1 parity): 3, 5, 9, 17 disks optimal
# For RAIDZ2 (2 parity): 4, 6, 10, 18 disks optimal
# For RAIDZ3 (3 parity): 5, 7, 11, 19 disks optimal
# These "power of 2 + parity" widths minimize wasted space.

# Example: 9-disk RAIDZ1
# 8 data + 1 parity = 8 data disks (a power of 2)
zpool create tank raidz1 /dev/sd{a,b,c,d,e,f,g,h,i}
```

With modern large disks (8TB+), RAIDZ1 is increasingly risky. Rebuild times can exceed 24 hours, during which a second failure loses the pool. The probability of a second failure or unrecoverable read error (URE) during rebuild is non-trivial. RAIDZ2 or RAIDZ3 is strongly recommended for disks over 4TB.
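After creating a pool, it is worth confirming the layout and seeing how much space datasets can actually use. A short check, assuming the pool is named tank as in the examples above:

```bash
#!/bin/bash
# Verify vdev layout and capacity after pool creation.
zpool status tank      # shows the raidz1/raidz2/raidz3 vdev structure
zpool list -v tank     # raw size and allocation per vdev (parity included)
zfs list tank          # space actually available to datasets (parity excluded)
```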
Traditional RAID uses parity for redundancy: when a disk fails outright or reports an unrecoverable I/O error, the array knows which device is missing and uses parity to reconstruct its data.
But what if a disk returns bad data with no I/O error? Traditional RAID cannot handle this: a parity check would reveal that the stripe is inconsistent, but not which disk is wrong. The array might even "repair" the stripe by overwriting good data with values reconstructed from the corrupt disk.
ZFS RAID-Z solves this with checksum-based repair:
```
RAID-Z INTELLIGENT REPAIR PROCESS
═══════════════════════════════════════════════════════════════════

SCENARIO: 4-disk RAIDZ1, reading block X
          Block pointer contains the expected checksum

STEP 1: Normal Read
─────────────────────────────────────────────────────────────────
Read data from the disks holding this block

         ┌───────┬────────────┬───────┬───────────┐
 Stripe: │ D1_ok │ D2_corrupt │ D3_ok │ Parity_ok │
         └───────┴────────────┴───────┴───────────┘
                        ↑
          Silent corruption on Disk 2
          (returned wrong data, no I/O error)

Assembled block: D1 + D2_corrupt + D3 = WRONG DATA

STEP 2: Checksum Verification
─────────────────────────────────────────────────────────────────
Compute the checksum of the assembled block, compare to expected

  Expected (from parent block): 0xABCD1234
  Computed:                     0xFF001122   ← MISMATCH!

ZFS knows the block is corrupt. Traditional RAID would NOT.

STEP 3: Identify the Corrupt Component
─────────────────────────────────────────────────────────────────
ZFS uses a combinatorial approach:

  Attempt 1: Reconstruct assuming D1 is bad
    D1_rebuilt = D2_corrupt ⊕ D3 ⊕ Parity
    Block = D1_rebuilt + D2_corrupt + D3
    Checksum? No match. D1 was not the problem.

  Attempt 2: Reconstruct assuming D2 is bad
    D2_rebuilt = D1 ⊕ D3 ⊕ Parity
    Block = D1 + D2_rebuilt + D3
    Checksum? MATCH! D2 was corrupt.

  Attempt 3: (Would try D3 if D2 didn't match)

STEP 4: Return Correct Data
─────────────────────────────────────────────────────────────────
The block reconstructed with D2_rebuilt is correct.
Return it to the application. Schedule repair of D2's block.

STEP 5: Self-Healing Write
─────────────────────────────────────────────────────────────────
Write the correct D2 value back to Disk 2, repairing the corruption.

═══════════════════════════════════════════════════════════════════

TRADITIONAL RAID FAILURE SCENARIO:
─────────────────────────────────────────────────────────────────
Same silent corruption on Disk 2.

1. Read stripe: D1_ok, D2_corrupt, D3_ok, Parity_ok
2. No checksums → can't detect corruption
3. Return corrupt data to the application
4. Application processes/stores/displays wrong data
5. User never knows

OR (if parity verification is enabled):

1. Read stripe and verify parity
2. Parity doesn't match! Stripe is inconsistent.
3. Which disk is wrong? Unknown!
4. RAID might "repair" by recalculating D1 from D2_corrupt + D3 + Parity
5. Now D1 is ALSO wrong!
6. Corruption spreads instead of being fixed.
```

The combination of checksums and parity gives RAID-Z capabilities impossible in traditional RAID. Checksums detect all corruption, not just disk failures. Parity provides the redundancy to reconstruct. Together, they enable intelligent repair that never makes wrong decisions about which data is good.
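You can exercise this repair path on demand with a scrub, which reads every allocated block and verifies it against its checksum. A brief sketch, assuming a pool named tank:

```bash
#!/bin/bash
# Trigger and monitor checksum-based self-healing.
zpool scrub tank         # read all data, verify checksums, repair from parity
zpool status -v tank     # CKSUM column counts checksum errors found and repaired
zpool clear tank         # reset error counters once the cause is understood
```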
When a disk fails and is replaced, ZFS must rebuild the data that was on the failed disk. This process is called resilvering (a term from Sun's mirroring heritage). RAID-Z resilvering is smarter than traditional RAID rebuilds.
```bash
#!/bin/bash
# RAID-Z Disk Replacement and Resilvering

# ============================================
# DISK FAILURE DETECTION
# ============================================

# Check pool status for failures
zpool status tank

# Example output showing a degraded pool:
#   pool: tank
#  state: DEGRADED
# status: One or more devices has been removed by the administrator.
# action: Online the device using 'zpool online' or replace the
#         device with 'zpool replace'.
#  scrub: none requested
# config:
#
#         NAME        STATE     READ WRITE CKSUM
#         tank        DEGRADED     0     0     0
#           raidz2-0  DEGRADED     0     0     0
#             sda     ONLINE       0     0     0
#             sdb     REMOVED      0     0     0   ← FAILED
#             sdc     ONLINE       0     0     0
#             sdd     ONLINE       0     0     0
#             sde     ONLINE       0     0     0
#             sdf     ONLINE       0     0     0

# ============================================
# DISK REPLACEMENT
# ============================================

# Replace the failed disk with a new disk
zpool replace tank /dev/sdb /dev/sdz

# If the new disk occupies the same slot (after a physical swap):
zpool replace tank /dev/sdb

# Using disk by-id (recommended for reliability):
zpool replace tank \
    /dev/disk/by-id/wwn-0x50014ee2b47e9ae5 \
    /dev/disk/by-id/wwn-0x50014ee2b47e9bf6

# ============================================
# MONITORING RESILVER PROGRESS
# ============================================

# Basic status shows resilver progress
zpool status tank

# Example output during resilver:
#   scan: resilver in progress since Mon Oct 16 14:00:00 2023
#         2.50T scanned at 500M/s, 1.25T resilvered at 250M/s
#         22% done, 4h30m to go

# Watch progress in real time
watch -n 10 'zpool status tank | grep scan'

# ============================================
# RESILVER PERFORMANCE TUNING
# ============================================

# zfs_resilver_min_time_ms: minimum milliseconds spent on resilver I/O per
# transaction group (default 3000). Raising it speeds up the resilver at the
# cost of more impact on production I/O.
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

# View current settings
cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
cat /sys/module/zfs/parameters/zfs_resilver_delay   # older ZFS on Linux releases only

# Temporarily increase for a faster resilver (be careful!)
# In /etc/modprobe.d/zfs.conf:
# options zfs zfs_vdev_resilver_max_active=10

# ============================================
# HOT SPARES FOR AUTOMATIC REPLACEMENT
# ============================================

# Add a hot spare to the pool
zpool add tank spare /dev/sdz

# Enable autoreplace (disk fails → spare activates)
zpool set autoreplace=on tank

# When a disk fails:
# 1. ZFS automatically detaches the failed disk
# 2. Attaches the spare in its place
# 3. Begins resilvering
# 4. Administrator is alerted via ZED (ZFS Event Daemon)

# After replacing the original disk:
zpool detach tank /dev/sdz    # Detach the spare
# The spare returns to the spare pool for the next failure

# ============================================
# NORMAL VS HEALING RESILVER
# ============================================

# Normal resilver (after replace):
#   Copies data from surviving disks to the new disk,
#   using parity to compute the missing data.

# Healing resilver (after a temporarily offline disk returns):
#   The disk may still hold valid old data.
#   ZFS checksums each existing block and only copies blocks
#   that are actually missing or corrupt.
#   Much faster when the disk was only briefly offline.
```

While resilvering, the pool has reduced redundancy. In RAIDZ1, the pool has NO redundancy during resilver—another failure loses data. In RAIDZ2 with one disk failed, you still survive one more failure but not two. Prioritize resilver completion and avoid starting additional scrubs or heavy I/O during this window.
Designing a RAID-Z pool requires balancing performance, capacity, and reliability. Here's practical guidance based on industry experience.
| Disk Size | Recommended Level | Reasoning |
|---|---|---|
| ≤2TB | RAIDZ1 | Smaller disks resilver quickly; single-failure protection usually sufficient |
| 2TB-8TB | RAIDZ2 | Resilver times lengthen; double-failure protection warranted |
| 8TB-16TB | RAIDZ2 or RAIDZ3 | Long resilver windows; high URE exposure; triple parity for critical data |
| ≥16TB | RAIDZ3 or Mirrors | Very long resilvers; consider mirrors for performance despite space cost |
Mirrors sacrifice 50% capacity but offer: (1) faster resilvering (a straight disk-to-disk copy rather than parity reconstruction), (2) better random-read IOPS (reads can be served from any side of the mirror), and (3) a simpler failure mode (up to N-1 disks of an N-way mirror can fail). For databases, VMs, or other random-I/O-heavy workloads, mirrors often outperform similarly sized RAID-Z pools despite the lower capacity efficiency.
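For comparison, here is what the mirror alternative looks like at pool-creation time; the device names are placeholders:

```bash
#!/bin/bash
# Two striped 2-way mirrors: 50% usable capacity, fast resilver,
# and random-read IOPS that scale with the number of mirror vdevs.
zpool create tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd

# Mirror pools also grow easily: add another mirror vdev later.
zpool add tank mirror /dev/sde /dev/sdf
```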
We've explored RAID-Z—ZFS's revolutionary approach to software RAID that integrates redundancy with data integrity verification. Let's consolidate the key insights:

- Traditional parity RAID suffers from the write hole: an interrupted stripe write leaves data and parity silently inconsistent.
- RAID-Z eliminates the write hole structurally through Copy-on-Write and variable-width, full-stripe writes, with no battery-backed cache required.
- RAIDZ1, RAIDZ2, and RAIDZ3 trade usable capacity for one, two, or three parity blocks per stripe; larger disks justify more parity.
- Checksums plus parity let RAID-Z detect silent corruption, identify the bad device combinatorially, and self-heal.
- Resilvering restores redundancy after a disk replacement, but the pool runs with reduced protection until it completes.
- RAID-Z is one layer of protection; snapshots and off-site backups remain essential.
What's Next:
With RAID-Z's redundancy understood, we'll explore ZFS's most user-visible advanced features: snapshots and clones. These Copy-on-Write powered capabilities enable instant point-in-time copies, space-efficient backups, and powerful workflows impossible with traditional file systems.
You now understand RAID-Z: how it eliminates the write hole through Copy-on-Write, how it integrates with checksums for intelligent repair, and practical guidance for designing RAID-Z pools. Next, we'll explore snapshots and clones—ZFS's powerful data management capabilities.