Traditional storage administration is an exercise in prediction and compromise. Before creating a file system, administrators must answer questions that are fundamentally unanswerable:
How much space should go to /home and how much to /var? What happens when /data fills up but /backup still has 500GB free? The answers require predicting the future, and administrators are invariably wrong. The result: some volumes fill up and require emergency expansion while others sit half-empty for years. The wasted space across an organization can add up to dozens of terabytes.
ZFS's radical solution: abolish the volume. Instead, create a storage pool from which all file systems dynamically draw space as needed.
By the end of this page, you will understand ZFS's pooled storage architecture, how virtual devices (vdevs) abstract physical storage, how datasets dynamically share pool space, and how to design robust pool topologies for different reliability and performance requirements.
A ZFS storage pool (or simply 'pool') is a collection of physical storage devices aggregated into a single logical unit. Unlike traditional volume managers that carve fixed partitions, ZFS pools provide dynamic, shared storage for all datasets.
Traditional Model vs. ZFS Model:
TRADITIONAL (Fixed Volumes):
┌─────────────────────────────────────────────────────────┐
│ Physical Disks │
├─────────────────────────────────────────────────────────┤
│ Partition 1 │ Partition 2 │ Partition 3 │
│ (100GB) │ (200GB) │ (200GB) │
├─────────────────────────────────────────────────────────┤
│ /home │ /var │ /data │
│ (Fixed Size) │ (Fixed Size) │ (Fixed Size) │
└─────────────────────────────────────────────────────────┘
Problem: /home is full but /data has 150GB free → STUCK
ZFS (Pooled Storage):
┌─────────────────────────────────────────────────────────┐
│ Physical Disks │
├─────────────────────────────────────────────────────────┤
│ Storage Pool (500GB) │
├─────────────────────────────────────────────────────────┤
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ /home │ │ /var │ │ /data │ Unused │
│ │ (80GB) │ │ (50GB) │ │ (120GB) │ (250GB) │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
Solution: All datasets share the pool → Grow as needed
Traditional volumes are like cash in separate envelopes—$100 for rent, $50 for food, $200 for savings. If you overspend on food, you can't easily move money from the rent envelope. A ZFS pool is like a single bank account—all expenses draw from the same balance, and you only need to ensure the total balance covers total needs.
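To make the shared-balance idea concrete, here is a minimal sketch (device names, the mirror layout, and the dataset names are placeholders, and the AVAIL figures are purely illustrative): every file system in the pool reports the same shared free space because nothing is pre-allocated.

```bash
# Build a small mirrored pool from two spare disks (placeholders)
zpool create tank mirror /dev/sdb /dev/sdc

# Three file systems, with no sizes specified anywhere
zfs create tank/home
zfs create tank/var
zfs create tank/data

# Each dataset shows the same shared pool free space in AVAIL
zfs list -o name,used,available,mountpoint -r tank
# NAME        USED  AVAIL  MOUNTPOINT      <- illustrative output
# tank         1M   450G   /tank
# tank/home    96K  450G   /tank/home
# tank/var     96K  450G   /tank/var
# tank/data    96K  450G   /tank/data
```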
The building blocks of ZFS pools are Virtual Devices (vdevs). A vdev is an abstraction that can represent a single disk, a group of disks with redundancy, or special-purpose devices for caching and logging.
Pool Structure:
POOL: "tank"
├── VDEV 1: raidz2 (6 disks)
│ ├── /dev/sda
│ ├── /dev/sdb
│ ├── /dev/sdc
│ ├── /dev/sdd
│ ├── /dev/sde
│ └── /dev/sdf
├── VDEV 2: raidz2 (6 disks)
│ ├── /dev/sdg
│ ├── /dev/sdh
│ ├── /dev/sdi
│ ├── /dev/sdj
│ ├── /dev/sdk
│ └── /dev/sdl
├── LOG: mirror (2 SSDs)
│ ├── /dev/nvme0n1p1
│ └── /dev/nvme1n1p1
├── CACHE: single (1 SSD)
│ └── /dev/nvme2n1
└── SPARE: disk
└── /dev/sdm
Data is striped across vdevs (VDEV 1 and VDEV 2), while LOG and CACHE serve special purposes. This hierarchy provides both performance (striping) and reliability (redundancy within vdevs).
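For reference, a single zpool create invocation could build the exact topology sketched above. This is a sketch using the diagram's device names, not a prescription for real hardware:

```bash
# One command building the "tank" layout shown above:
# two RAIDZ2 data vdevs, a mirrored log, an L2ARC cache, and a hot spare
zpool create tank \
    raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
    raidz2 /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl \
    log mirror /dev/nvme0n1p1 /dev/nvme1n1p1 \
    cache /dev/nvme2n1 \
    spare /dev/sdm
```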
| VDEV Type | Configuration | Purpose | Failure Impact |
|---|---|---|---|
| Disk | Single disk | Simplest unit, no redundancy | Disk failure = pool failure (a lost top-level vdev loses the pool) |
| Mirror | 2+ identical copies | Full redundancy, excellent read IOPS | Survives N-1 disk failures in N-way mirror |
| RAIDZ1 | N disks, 1 parity | Space-efficient single-parity protection | Survives 1 disk failure per vdev |
| RAIDZ2 | N disks, 2 parity | Double-parity for large disks | Survives 2 disk failures per vdev |
| RAIDZ3 | N disks, 3 parity | Triple-parity for critical data | Survives 3 disk failures per vdev |
| Log (SLOG) | Mirror recommended | Accelerates synchronous writes | Log loss = loss of in-flight sync writes |
| Cache (L2ARC) | Single disk OK | Extends ARC with SSD caching | Cache loss = temporary performance drop only |
| Spare | Idle disk in pool | Auto-replaces failed disk | No impact—spare is backup capacity |
| Special | Mirror recommended | Stores metadata and small blocks on fast storage | Special vdev loss can destroy entire pool |
The pool is only as reliable as its least redundant vdev. If any vdev in the pool fails completely (exceeds its redundancy), the entire pool becomes inaccessible. This is why mixing single-disk vdevs with RAIDZ vdevs is dangerous—the single disk becomes the weak link.
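zpool itself tries to protect you from this mistake: adding a vdev whose redundancy level does not match the rest of the pool is refused unless you force it. A hedged illustration (device names are placeholders):

```bash
# Pool built from a single RAIDZ2 vdev
zpool create tank raidz2 /dev/sd{a,b,c,d,e,f}

# Attempting to add a lone, unredundant disk is refused with an error
# about a mismatched replication level
zpool add tank /dev/sdg

# Forcing it creates exactly the weak link described above -- avoid this
# zpool add -f tank /dev/sdg
```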
ZFS provides the zpool command for all pool operations. Pool creation specifies the vdev topology; once created, vdevs generally cannot be removed or restructured (though individual disks can be replaced with larger ones and additional vdevs can be added).
```bash
#!/bin/bash
# ZFS Pool Management Commands

# ============================================
# CREATING POOLS
# ============================================

# Simple pool with single disk (NO REDUNDANCY - testing only!)
zpool create tank /dev/sda

# Mirror pool (2-way mirror for redundancy)
zpool create tank mirror /dev/sda /dev/sdb

# 3-way mirror for high reliability
zpool create tank mirror /dev/sda /dev/sdb /dev/sdc

# RAIDZ1 pool (single parity, like RAID-5)
# Requires minimum 3 disks; 3-7 disks recommended
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# RAIDZ2 pool (double parity, like RAID-6)
# Requires minimum 4 disks; 4-10 disks recommended
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# RAIDZ3 pool (triple parity)
# For very large disks and critical data
zpool create tank raidz3 /dev/sd{a,b,c,d,e,f,g,h}

# ============================================
# MULTI-VDEV POOLS (Striping + Redundancy)
# ============================================

# Multiple mirrors (striped mirrors, like RAID-10)
# Best balance of performance and redundancy
zpool create tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd \
    mirror /dev/sde /dev/sdf

# Multiple RAIDZ2 vdevs (common enterprise config)
zpool create tank \
    raidz2 /dev/sd{a,b,c,d,e,f} \
    raidz2 /dev/sd{g,h,i,j,k,l}

# ============================================
# SPECIAL VDEVS
# ============================================

# Add SLOG (Separate Intent Log) for sync writes
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Add L2ARC cache for read acceleration
zpool add tank cache /dev/nvme2n1

# Add hot spare for automatic replacement
zpool add tank spare /dev/sdz

# Add special vdev for metadata (MUST be mirrored!)
zpool add tank special mirror /dev/nvme3n1 /dev/nvme4n1

# ============================================
# POOL STATUS AND HEALTH
# ============================================

# Show pool status
zpool status tank

# Show pool status with verbose error details
zpool status -v tank

# Show pool I/O statistics
zpool iostat tank 5    # Every 5 seconds

# Show pool space usage
zpool list tank

# Show all pool properties
zpool get all tank

# ============================================
# POOL MAINTENANCE
# ============================================

# Start a data integrity verification (scrub)
# CRITICAL: Run regularly (weekly/monthly recommended)
zpool scrub tank

# Check scrub progress
zpool status tank | grep scan

# Cancel a running scrub
zpool scrub -s tank

# Export pool (before moving disks or unmounting)
zpool export tank

# Import pool (after moving disks to new system)
zpool import tank

# Import pool by ID (when names conflict)
zpool import 1234567890 tank

# ============================================
# ADDING AND REPLACING DISKS
# ============================================

# Replace a failed disk (in RAIDZ or mirror)
zpool replace tank /dev/sda /dev/sdz

# Online a disk that was taken offline
zpool online tank /dev/sda

# Offline a disk for maintenance
zpool offline tank /dev/sda

# Expand a vdev with larger disks
# (Replace each disk, wait for resilver, repeat)
# After all disks are replaced, expand each device to its new capacity:
zpool online -e tank /dev/sda    # Repeat per disk, or set autoexpand=on beforehand

# ============================================
# POOL UPGRADE
# ============================================

# Check which pools can be upgraded to newer features
zpool upgrade

# Upgrade a specific pool to the latest feature flags
zpool upgrade tank

# Upgrade all pools
zpool upgrade -a

# Show available feature flags
zpool upgrade -v
```

Once a pool is created, you cannot change the vdev topology. You cannot convert a RAIDZ1 to RAIDZ2, cannot remove a vdev (in most ZFS versions), and cannot change a mirror to RAIDZ. Plan your topology carefully before creation. The only way to restructure is to create a new pool and copy the data across.
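The usual restructuring path is therefore snapshot-and-copy: build the new pool alongside the old one and replicate with zfs send/receive. A minimal sketch, assuming spare disks are available and using placeholder pool, device, and snapshot names:

```bash
# New pool with the desired topology (devices are placeholders)
zpool create newtank raidz2 /dev/sd{m,n,o,p,q,r}

# Recursive snapshot of everything in the old pool
zfs snapshot -r tank@migrate

# Replicate the full hierarchy, properties included; -u avoids
# mounting the copies over the live file systems
zfs send -R tank@migrate | zfs receive -u newtank/migrated

# After verifying the copy, retire the old pool and adjust
# mountpoints or pool names as needed
```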
Within a pool, ZFS supports two types of storage consumers: datasets (POSIX file systems mounted into the directory tree) and volumes (zvols, block devices exposed under /dev/zvol for virtual machines, iSCSI targets, or swap).
Datasets are hierarchical and inherit properties from their parents, enabling elegant configuration management.
```bash
#!/bin/bash
# ZFS Dataset and Volume Management

# ============================================
# CREATING DATASETS (File Systems)
# ============================================

# Create a simple dataset (automatically mounted at /tank/data)
zfs create tank/data

# Create nested datasets (inherit parent properties)
zfs create tank/data/projects
zfs create tank/data/backups

# Create with specific mountpoint
zfs create -o mountpoint=/home tank/home

# Create with custom properties
zfs create -o compression=lz4 \
    -o atime=off \
    -o recordsize=1M \
    tank/media

# Create but don't mount automatically
zfs create -o canmount=noauto tank/archive

# ============================================
# CREATING VOLUMES (Block Devices)
# ============================================

# Create a 100GB volume for VM use
zfs create -V 100G tank/vm/windows

# Create sparse volume (thin provisioned)
zfs create -s -V 1T tank/vm/large_vm

# Create volume with specific block size
# (Should match guest filesystem or database block size)
zfs create -V 50G -o volblocksize=64K tank/db/postgres

# Volume appears as block device at:
# /dev/zvol/tank/vm/windows

# ============================================
# DATASET HIERARCHY AND INHERITANCE
# ============================================

# Example hierarchy:
#
# tank                        (pool root dataset)
# ├── tank/home               (user homes)
# │   ├── tank/home/alice     (inherits from tank/home)
# │   └── tank/home/bob       (inherits from tank/home)
# ├── tank/data               (application data)
# │   ├── tank/data/projects  (work files)
# │   └── tank/data/media     (large files, custom recordsize)
# └── tank/vm                 (virtual machines)
#     ├── tank/vm/windows     (ZVOL for Windows VM)
#     └── tank/vm/linux       (ZVOL for Linux VM)

# Set property on parent - children inherit
zfs set compression=lz4 tank/home
# Now tank/home/alice and tank/home/bob also use lz4

# Override inherited property on child
zfs set compression=zstd tank/home/bob

# View inherited vs local properties
zfs get -s local,inherited compression tank/home/bob

# ============================================
# QUOTAS AND RESERVATIONS
# ============================================

# Set quota (maximum space dataset can use)
zfs set quota=100G tank/home/alice

# Set reference quota (space for this dataset only, excludes snapshots)
zfs set refquota=50G tank/home/alice

# Set reservation (guaranteed minimum space)
zfs set reservation=10G tank/data/critical

# Set reference reservation (guaranteed for data only)
zfs set refreservation=10G tank/data/critical

# ============================================
# LISTING AND INSPECTING
# ============================================

# List all datasets
zfs list

# List datasets with specific properties
zfs list -o name,used,available,compression,mountpoint

# List only a specific dataset hierarchy
zfs list -r tank/home

# Show all properties of a dataset
zfs get all tank/data

# Show specific properties
zfs get compression,recordsize,atime tank/data

# Show space usage breakdown
zfs list -o name,used,usedbydataset,usedbychildren,usedbysnapshots tank

# ============================================
# MODIFYING DATASETS
# ============================================

# Rename dataset
zfs rename tank/data/old tank/data/new

# Move dataset to different parent (and change mountpoint)
zfs rename tank/data/archive tank/cold/archive

# Change mountpoint
zfs set mountpoint=/new/path tank/data

# Unmount dataset
zfs unmount tank/data

# Mount dataset
zfs mount tank/data

# ============================================
# DESTROYING DATASETS
# ============================================

# Destroy dataset (fails if it has children or snapshots)
zfs destroy tank/data/old

# Destroy recursively (includes all children)
zfs destroy -r tank/data/old

# Destroy including dependent clones
zfs destroy -R tank/data/old

# Dry run - see what would be destroyed
zfs destroy -nv tank/data/old
```

Design your dataset hierarchy thoughtfully. Group by usage pattern (VMs, databases, home directories) because properties inherit from parents. Don't create a flat list of datasets; use hierarchy to express relationships and simplify configuration. Each child can override parent settings when needed.
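As an illustration of that advice, the fragment below sets broad defaults high in the tree and narrow overrides only where a workload needs them. Dataset names and tuning values are examples only, not recommendations for every system:

```bash
# Pool-wide defaults: everything below inherits these
zfs set compression=lz4 tank
zfs set atime=off tank

# Databases: small random I/O, so a small recordsize on the parent
zfs create -o recordsize=8K tank/db
zfs create tank/db/postgres          # inherits 8K from tank/db

# Media: large sequential files, so a large recordsize
zfs create -o recordsize=1M tank/media

# Home directories: defaults are fine; add per-user quotas
zfs create tank/home
zfs create -o quota=100G tank/home/alice
```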
Understanding ZFS space reporting requires grasping several interconnected concepts. Space is shared across all datasets, snapshots consume space, compression affects accounting, and reservations interact with quotas.
Key Space Metrics:
| Property | Meaning | Includes |
|---|---|---|
| used | Total space consumed by this dataset and descendants | Data + children + snapshots + clones |
| usedbydataset | Space used by this dataset's data only | Active data only, not snapshots or children |
| usedbysnapshots | Space uniquely held by snapshots | Data that would be freed if all snapshots destroyed |
| usedbychildren | Space used by child datasets | Recursive sum of all descendant 'used' |
| available | Space available to this dataset | Pool free space minus reservations, respecting quotas |
| referenced | Space this dataset would consume without sharing | What would be freed if dataset destroyed (no clones) |
| compressratio | Compression effectiveness | Logical size / physical size |
| logicalused | Space before compression | How much data was written (pre-compression) |
```
ZFS SPACE ACCOUNTING EXAMPLE
═══════════════════════════════════════════════════════════

$ zfs list -o name,used,usedbydataset,usedbychildren,usedbysnapshots,available,referenced tank/data
NAME       USED  DATASET  CHILDREN  SNAPSHOTS  AVAILABLE  REFERENCED
tank/data  250G  100G     80G       70G        450G       100G

INTERPRETATION:
─────────────────────────────────────────────────────────────
tank/data uses 250G total:
  • 100G = active data in this dataset (usedbydataset)
  •  80G = child datasets (usedbychildren)
  •  70G = space held ONLY by snapshots (usedbysnapshots)

If you destroy tank/data recursively (no clones exist):
  → Its data, children, and snapshots all go away, freeing ~250G
  → The 100G "referenced" figure is what this dataset's own active data pins

If you destroy only the snapshots:
  → You would FREE: 70G (usedbysnapshots)
  → Active data untouched

COMPRESSION IMPACT:
─────────────────────────────────────────────────────────────
$ zfs get compressratio,used,logicalused tank/logs
NAME       PROPERTY       VALUE
tank/logs  compressratio  5.00x
tank/logs  used           20G
tank/logs  logicalused    100G

INTERPRETATION:
  • You wrote 100G of log data
  • ZFS stored it in 20G of physical space
  • 5:1 compression ratio (typical for text logs)
  • Pool accounting uses physical size (20G)

QUOTA ENFORCEMENT:
─────────────────────────────────────────────────────────────
$ zfs get quota,refquota,used,referenced tank/home/alice
NAME             PROPERTY    VALUE
tank/home/alice  quota       50G   ← Max for dataset + snapshots
tank/home/alice  refquota    40G   ← Max for dataset data only
tank/home/alice  used        35G   ← Current total usage
tank/home/alice  referenced  30G   ← Current data only

INTERPRETATION:
  • Alice has 30G of active data
  • 5G is held by snapshots (35G - 30G)
  • She can add 10G more data (40G refquota - 30G)
  • She can accumulate 15G more in snapshots (50G quota - 35G)
```

When you modify data, ZFS writes new blocks and keeps the old blocks for snapshots. The 'usedbysnapshots' metric shows blocks that exist ONLY in snapshots: the delta between current and historical states. Heavily modified datasets accumulate snapshot overhead; consider snapshot retention policies.
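Before deciding on a retention policy, it helps to see where snapshot space is actually going. The commands below use placeholder dataset and snapshot names; the percent syntax destroys a range of snapshots, and -nv makes it a dry run that only reports what would be reclaimed:

```bash
# List snapshots of one dataset, largest space-holders last
zfs list -t snapshot -r -o name,used,referenced,creation -s used tank/data

# Dry run: how much space a range of old snapshots would free
zfs destroy -nv tank/data@daily-2024-01-01%daily-2024-01-31
```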
Designing a ZFS pool topology involves balancing capacity, performance, and reliability. There's no universal best configuration—each workload has different requirements.
Key Considerations:
| Topology | Capacity Efficiency | Read IOPS | Write IOPS | Rebuild Speed | Use Case |
|---|---|---|---|---|---|
| Striped Mirrors | 50% | Excellent (N×single-disk) | Excellent | Fast (single disk copy) | Databases, random I/O heavy, VMs |
| RAIDZ1 (3-5 disks) | 67-80% | Fair (read across stripe) | Fair (full-stripe writes) | Medium | Home NAS, non-critical data |
| RAIDZ2 (5-8 disks) | 60-75% | Fair | Fair | Slower (more parity math) | General enterprise storage |
| RAIDZ3 (8+ disks) | 55-70% | Fair | Fair | Slowest | Archival, large disks, critical data |
| Striped RAIDZ2 | 60-75% | Good (stripe across vdevs) | Good | Parallel resilver | Enterprise storage, balanced |
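To make the efficiency column concrete, here is a rough worked example for a hypothetical shelf of 12 x 10TB disks (raw 120TB), ignoring ZFS metadata, padding, and the TB/TiB gap:

```
Striped mirrors (6 x 2-way mirror):  6 x 10TB          ≈ 60TB usable (50%)
3 x 4-disk RAIDZ1:                   3 x (4-1) x 10TB  ≈ 90TB usable (75%)
2 x 6-disk RAIDZ2:                   2 x (6-2) x 10TB  ≈ 80TB usable (~67%)
```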
RAIDZ width (the number of disks per vdev) is effectively permanent. Adding disks to an existing RAIDZ vdev has traditionally been impossible (RAIDZ expansion only appeared in recent OpenZFS releases and may not be available on your system); normally you can only add NEW vdevs. If you start with a 4-disk RAIDZ2 and want to expand, the usual path is to add another RAIDZ2 vdev of similar width. Plan your width around your expected expansion pattern.
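Expansion therefore happens a whole vdev at a time. As a sketch (placeholder device names), growing a pool that already contains one RAIDZ2 vdev means adding a second RAIDZ2 vdev, after which writes stripe across both:

```bash
# Add a second RAIDZ2 vdev to an existing pool
zpool add tank raidz2 /dev/sd{g,h,i,j,k,l}

# Confirm the new vdev appears alongside the old one
zpool status tank
```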
ZFS supports special-purpose vdevs that accelerate specific operations without affecting the main data path. Understanding when these help—and when they waste money—is essential.
The ZFS Intent Log (ZIL) and SLOG:
Synchronous writes (database commits, NFS exports served with sync semantics, O_SYNC writes) may be acknowledged only after the data is safely on persistent storage. ZFS writes them to the ZIL before acknowledging, then commits them to main storage with a later transaction group.
By default, the ZIL lives on the main pool—fast enough for many workloads. A Separate Log Device (SLOG) moves the ZIL to dedicated fast storage.
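A quick way to feel the difference is to compare buffered and synchronous writes into the pool. This is only a rough sketch: the target path is a placeholder, the dd flags are GNU coreutils options, and real results depend heavily on hardware:

```bash
# Buffered (asynchronous) writes: coalesced by ZFS, the ZIL is not involved
dd if=/dev/zero of=/tank/test/async.bin bs=8k count=10000

# Synchronous writes: each block must reach stable storage via the ZIL
dd if=/dev/zero of=/tank/test/sync.bin bs=8k count=10000 oflag=dsync

# On a pool without a SLOG the second run is typically far slower;
# a fast, power-loss-protected SLOG narrows that gap.
```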
When SLOG helps: synchronous-write-heavy workloads such as databases committing frequently, NFS exports served with sync semantics, and VM hosts that flush on every guest write.
When SLOG doesn't help: asynchronous workloads (ordinary file copies, media streaming, most bulk writes) never touch the ZIL, and a SLOG is not a general-purpose write cache.
Requirements: very low write latency, power-loss protection (enterprise SSD or NVMe), only modest capacity (a few seconds' worth of synchronous writes), and ideally a mirror, since the device briefly holds data not yet committed to the main pool.
Example configuration:
# Add mirrored SLOG using enterprise SSDs
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
SLOG, L2ARC, and special vdevs add operational complexity. Before adding them, verify your workload actually benefits. Use tools like 'arc_summary' and I/O tracing to identify bottlenecks. Many systems perform excellently with just a well-designed main pool and sufficient RAM.
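A few read-only commands are usually enough for that sanity check (arc_summary ships with most OpenZFS installations; the latency-histogram option is an OpenZFS feature):

```bash
# ARC hit rates and sizing: if the ARC already hits the vast majority
# of reads, an L2ARC device will add little
arc_summary | less

# Per-vdev I/O over time: are the data vdevs or the log device busy?
zpool iostat -v tank 5

# Latency histograms: where is I/O time actually being spent?
zpool iostat -w tank 5
```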
We've explored ZFS's revolutionary pooled storage architecture, the foundation that enables its powerful features. Let's consolidate the key insights:
- Pools replace fixed volumes: all datasets draw space dynamically from one shared pool.
- Vdevs are the unit of redundancy and performance; data stripes across vdevs, and the pool is only as reliable as its least redundant vdev.
- Pool topology is essentially permanent, so choose the mirror/RAIDZ layout and vdev width before creation.
- Datasets and zvols inherit properties hierarchically, with quotas and reservations governing shared space.
- Space accounting (used, referenced, usedbysnapshots, compressratio) explains where pool capacity actually goes.
- Special-purpose vdevs (SLOG, L2ARC, special) help specific workloads; add them only when measurement shows a need.
What's Next:
With the pooled storage foundation understood, we'll explore ZFS's checksum and data integrity mechanisms—how ZFS detects and corrects silent data corruption that other file systems miss entirely. This self-healing capability is central to ZFS's reputation for reliability.
You now understand ZFS's pooled storage architecture: how vdevs abstract physical storage, how datasets dynamically share pool space, and how to design pool topologies for different requirements. Next, we'll explore the checksum system that makes ZFS data trustworthy.