Traditional storage administration is an exercise in prediction and compromise. Before creating a file system, administrators must answer questions that are fundamentally unanswerable:
How much space should go to /home and how much to /var? What happens when /data fills up but /backup still has 500GB free? The answers require predicting the future, and administrators are invariably wrong. The result: some volumes fill up and require emergency expansion while others sit half-empty for years. The wasted space across an organization can add up to dozens of terabytes.
ZFS's radical solution: abolish the volume. Instead, create a storage pool from which all file systems dynamically draw space as needed.
By the end of this page, you will understand ZFS's pooled storage architecture, how virtual devices (vdevs) abstract physical storage, how datasets dynamically share pool space, and how to design robust pool topologies for different reliability and performance requirements.
A ZFS storage pool (or simply 'pool') is a collection of physical storage devices aggregated into a single logical unit. Unlike traditional volume managers that carve fixed partitions, ZFS pools provide dynamic, shared storage for all datasets.
Traditional Model vs. ZFS Model:
TRADITIONAL (Fixed Volumes):
┌─────────────────────────────────────────────────────────┐
│ Physical Disks │
├─────────────────────────────────────────────────────────┤
│ Partition 1 │ Partition 2 │ Partition 3 │
│ (100GB) │ (200GB) │ (200GB) │
├─────────────────────────────────────────────────────────┤
│ /home │ /var │ /data │
│ (Fixed Size) │ (Fixed Size) │ (Fixed Size) │
└─────────────────────────────────────────────────────────┘
Problem: /home is full but /data has 150GB free → STUCK
ZFS (Pooled Storage):
┌─────────────────────────────────────────────────────────┐
│ Physical Disks │
├─────────────────────────────────────────────────────────┤
│ Storage Pool (500GB) │
├─────────────────────────────────────────────────────────┤
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ /home │ │ /var │ │ /data │ Unused │
│ │ (80GB) │ │ (50GB) │ │ (120GB) │ (250GB) │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
Solution: All datasets share the pool → Grow as needed
Traditional volumes are like cash in separate envelopes—$100 for rent, $50 for food, $200 for savings. If you overspend on food, you can't easily move money from the rent envelope. A ZFS pool is like a single bank account—all expenses draw from the same balance, and you only need to ensure the total balance covers total needs.
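To make the shared-balance idea concrete, here is a minimal sketch (device names, the mirror layout, and the dataset names are placeholders, and the AVAIL figures are purely illustrative): every file system in the pool reports the same shared free space because nothing is pre-allocated.

```bash
# Build a small mirrored pool from two spare disks (placeholders)
zpool create tank mirror /dev/sdb /dev/sdc

# Three file systems, with no sizes specified anywhere
zfs create tank/home
zfs create tank/var
zfs create tank/data

# Each dataset shows the same shared pool free space in AVAIL
zfs list -o name,used,available,mountpoint -r tank
# NAME        USED  AVAIL  MOUNTPOINT      <- illustrative output
# tank         1M   450G   /tank
# tank/home    96K  450G   /tank/home
# tank/var     96K  450G   /tank/var
# tank/data    96K  450G   /tank/data
```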
The building blocks of ZFS pools are Virtual Devices (vdevs). A vdev is an abstraction that can represent a single disk, a group of disks with redundancy, or special-purpose devices for caching and logging.
Pool Structure:
POOL: "tank"
├── VDEV 1: raidz2 (6 disks)
│ ├── /dev/sda
│ ├── /dev/sdb
│ ├── /dev/sdc
│ ├── /dev/sdd
│ ├── /dev/sde
│ └── /dev/sdf
├── VDEV 2: raidz2 (6 disks)
│ ├── /dev/sdg
│ ├── /dev/sdh
│ ├── /dev/sdi
│ ├── /dev/sdj
│ ├── /dev/sdk
│ └── /dev/sdl
├── LOG: mirror (2 SSDs)
│ ├── /dev/nvme0n1p1
│ └── /dev/nvme1n1p1
├── CACHE: single (1 SSD)
│ └── /dev/nvme2n1
└── SPARE: disk
└── /dev/sdm
Data is striped across vdevs (VDEV 1 and VDEV 2), while LOG and CACHE serve special purposes. This hierarchy provides both performance (striping) and reliability (redundancy within vdevs).
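For reference, a single zpool create invocation could build the exact topology sketched above. This is a sketch using the diagram's device names, not a prescription for real hardware:

```bash
# One command building the "tank" layout shown above:
# two RAIDZ2 data vdevs, a mirrored log, an L2ARC cache, and a hot spare
zpool create tank \
    raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
    raidz2 /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl \
    log mirror /dev/nvme0n1p1 /dev/nvme1n1p1 \
    cache /dev/nvme2n1 \
    spare /dev/sdm
```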
| VDEV Type | Configuration | Purpose | Failure Impact |
|---|---|---|---|
| Disk | Single disk | Simplest unit, no redundancy | Disk failure = pool failure (a lost top-level vdev loses the pool) |
| Mirror | 2+ identical copies | Full redundancy, excellent read IOPS | Survives N-1 disk failures in N-way mirror |
| RAIDZ1 | N disks, 1 parity | Space-efficient single-parity protection | Survives 1 disk failure per vdev |
| RAIDZ2 | N disks, 2 parity | Double-parity for large disks | Survives 2 disk failures per vdev |
| RAIDZ3 | N disks, 3 parity | Triple-parity for critical data | Survives 3 disk failures per vdev |
| Log (SLOG) | Mirror recommended | Accelerates synchronous writes | Log loss = loss of in-flight sync writes |
| Cache (L2ARC) | Single disk OK | Extends ARC with SSD caching | Cache loss = temporary performance drop only |
| Spare | Idle disk in pool | Auto-replaces failed disk | No impact—spare is backup capacity |
| Special | Mirror recommended | Stores metadata and small blocks on fast storage | Special vdev loss can destroy entire pool |
The pool is only as reliable as its least redundant vdev. If any vdev in the pool fails completely (exceeds its redundancy), the entire pool becomes inaccessible. This is why mixing single-disk vdevs with RAIDZ vdevs is dangerous—the single disk becomes the weak link.
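zpool itself tries to protect you from this mistake: adding a vdev whose redundancy level does not match the rest of the pool is refused unless you force it. A hedged illustration (device names are placeholders):

```bash
# Pool built from a single RAIDZ2 vdev
zpool create tank raidz2 /dev/sd{a,b,c,d,e,f}

# Attempting to add a lone, unredundant disk is refused with an error
# about a mismatched replication level
zpool add tank /dev/sdg

# Forcing it creates exactly the weak link described above -- avoid this
# zpool add -f tank /dev/sdg
```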
ZFS provides the zpool command for all pool operations. Pool creation specifies the vdev topology; once created, vdevs generally cannot be removed or restructured (though individual disks can be replaced with larger ones and additional vdevs can be added).
```bash
#!/bin/bash
# ZFS Pool Management Commands

# ============================================
# CREATING POOLS
# ============================================

# Simple pool with single disk (NO REDUNDANCY - testing only!)
zpool create tank /dev/sda

# Mirror pool (2-way mirror for redundancy)
zpool create tank mirror /dev/sda /dev/sdb

# 3-way mirror for high reliability
zpool create tank mirror /dev/sda /dev/sdb /dev/sdc

# RAIDZ1 pool (single parity, like RAID-5)
# Requires minimum 3 disks; 3-7 disks recommended
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# RAIDZ2 pool (double parity, like RAID-6)
# Requires minimum 4 disks; 4-10 disks recommended
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# RAIDZ3 pool (triple parity)
# For very large disks and critical data
zpool create tank raidz3 /dev/sd{a,b,c,d,e,f,g,h}

# ============================================
# MULTI-VDEV POOLS (Striping + Redundancy)
# ============================================

# Multiple mirrors (striped mirrors, like RAID-10)
# Best balance of performance and redundancy
zpool create tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd \
    mirror /dev/sde /dev/sdf

# Multiple RAIDZ2 vdevs (common enterprise config)
zpool create tank \
    raidz2 /dev/sd{a,b,c,d,e,f} \
    raidz2 /dev/sd{g,h,i,j,k,l}

# ============================================
# SPECIAL VDEVS
# ============================================

# Add SLOG (Separate Intent Log) for sync writes
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Add L2ARC cache for read acceleration
zpool add tank cache /dev/nvme2n1

# Add hot spare for automatic replacement
zpool add tank spare /dev/sdz

# Add special vdev for metadata (MUST be mirrored!)
zpool add tank special mirror /dev/nvme3n1 /dev/nvme4n1

# ============================================
# POOL STATUS AND HEALTH
# ============================================

# Show pool status
zpool status tank

# Show pool status with verbose error details
zpool status -v tank

# Show pool I/O statistics
zpool iostat tank 5    # Every 5 seconds

# Show pool space usage
zpool list tank

# Show all pool properties
zpool get all tank

# ============================================
# POOL MAINTENANCE
# ============================================

# Start a data integrity verification (scrub)
# CRITICAL: Run regularly (weekly/monthly recommended)
zpool scrub tank

# Check scrub progress
zpool status tank | grep scan

# Cancel a running scrub
zpool scrub -s tank

# Export pool (before moving disks or unmounting)
zpool export tank

# Import pool (after moving disks to new system)
zpool import tank

# Import pool by ID (when names conflict)
zpool import 1234567890 tank

# ============================================
# ADDING AND REPLACING DISKS
# ============================================

# Replace a failed disk (in RAIDZ or mirror)
zpool replace tank /dev/sda /dev/sdz

# Online a disk that was taken offline
zpool online tank /dev/sda

# Offline a disk for maintenance
zpool offline tank /dev/sda

# Expand a vdev with larger disks
# (Replace each disk, wait for resilver, repeat)
# After all disks are replaced, expand each device to its new capacity:
zpool online -e tank /dev/sda    # Repeat per disk, or set autoexpand=on beforehand

# ============================================
# POOL UPGRADE
# ============================================

# Check which pools can be upgraded to newer features
zpool upgrade

# Upgrade a specific pool to the latest feature flags
zpool upgrade tank

# Upgrade all pools
zpool upgrade -a

# Show available feature flags
zpool upgrade -v
```

Once a pool is created, you cannot change the vdev topology. You cannot convert a RAIDZ1 to RAIDZ2, cannot remove a vdev (in most ZFS versions), and cannot change a mirror to RAIDZ. Plan your topology carefully before creation. The only way to restructure is to create a new pool and copy the data across.
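The usual restructuring path is therefore snapshot-and-copy: build the new pool alongside the old one and replicate with zfs send/receive. A minimal sketch, assuming spare disks are available and using placeholder pool, device, and snapshot names:

```bash
# New pool with the desired topology (devices are placeholders)
zpool create newtank raidz2 /dev/sd{m,n,o,p,q,r}

# Recursive snapshot of everything in the old pool
zfs snapshot -r tank@migrate

# Replicate the full hierarchy, properties included; -u avoids
# mounting the copies over the live file systems
zfs send -R tank@migrate | zfs receive -u newtank/migrated

# After verifying the copy, retire the old pool and adjust
# mountpoints or pool names as needed
```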
Within a pool, ZFS supports two types of storage consumers: datasets (POSIX file systems mounted into the directory tree) and volumes (zvols, block devices exposed under /dev/zvol for virtual machines, iSCSI targets, or swap).
Datasets are hierarchical and inherit properties from their parents, enabling elegant configuration management.
```bash
#!/bin/bash
# ZFS Dataset and Volume Management

# ============================================
# CREATING DATASETS (File Systems)
# ============================================

# Create a simple dataset (automatically mounted at /tank/data)
zfs create tank/data

# Create nested datasets (inherit parent properties)
zfs create tank/data/projects
zfs create tank/data/backups

# Create with specific mountpoint
zfs create -o mountpoint=/home tank/home

# Create with custom properties
zfs create -o compression=lz4 \
    -o atime=off \
    -o recordsize=1M \
    tank/media

# Create but don't mount automatically
zfs create -o canmount=noauto tank/archive

# ============================================
# CREATING VOLUMES (Block Devices)
# ============================================

# Create a 100GB volume for VM use
zfs create -V 100G tank/vm/windows

# Create sparse volume (thin provisioned)
zfs create -s -V 1T tank/vm/large_vm

# Create volume with specific block size
# (Should match guest filesystem or database block size)
zfs create -V 50G -o volblocksize=64K tank/db/postgres

# Volume appears as block device at:
# /dev/zvol/tank/vm/windows

# ============================================
# DATASET HIERARCHY AND INHERITANCE
# ============================================

# Example hierarchy:
#
# tank                        (pool root dataset)
# ├── tank/home               (user homes)
# │   ├── tank/home/alice     (inherits from tank/home)
# │   └── tank/home/bob       (inherits from tank/home)
# ├── tank/data               (application data)
# │   ├── tank/data/projects  (work files)
# │   └── tank/data/media     (large files, custom recordsize)
# └── tank/vm                 (virtual machines)
#     ├── tank/vm/windows     (ZVOL for Windows VM)
#     └── tank/vm/linux       (ZVOL for Linux VM)

# Set property on parent - children inherit
zfs set compression=lz4 tank/home
# Now tank/home/alice and tank/home/bob also use lz4

# Override inherited property on child
zfs set compression=zstd tank/home/bob

# View inherited vs local properties
zfs get -s local,inherited compression tank/home/bob

# ============================================
# QUOTAS AND RESERVATIONS
# ============================================

# Set quota (maximum space dataset can use)
zfs set quota=100G tank/home/alice

# Set reference quota (space for this dataset only, excludes snapshots)
zfs set refquota=50G tank/home/alice

# Set reservation (guaranteed minimum space)
zfs set reservation=10G tank/data/critical

# Set reference reservation (guaranteed for data only)
zfs set refreservation=10G tank/data/critical

# ============================================
# LISTING AND INSPECTING
# ============================================

# List all datasets
zfs list

# List datasets with specific properties
zfs list -o name,used,available,compression,mountpoint

# List only a specific dataset hierarchy
zfs list -r tank/home

# Show all properties of a dataset
zfs get all tank/data

# Show specific properties
zfs get compression,recordsize,atime tank/data

# Show space usage breakdown
zfs list -o name,used,usedbydataset,usedbychildren,usedbysnapshots tank

# ============================================
# MODIFYING DATASETS
# ============================================

# Rename dataset
zfs rename tank/data/old tank/data/new

# Move dataset to different parent (and change mountpoint)
zfs rename tank/data/archive tank/cold/archive

# Change mountpoint
zfs set mountpoint=/new/path tank/data

# Unmount dataset
zfs unmount tank/data

# Mount dataset
zfs mount tank/data

# ============================================
# DESTROYING DATASETS
# ============================================

# Destroy dataset (fails if it has children or snapshots)
zfs destroy tank/data/old

# Destroy recursively (includes all children)
zfs destroy -r tank/data/old

# Destroy including dependent clones
zfs destroy -R tank/data/old

# Dry run - see what would be destroyed
zfs destroy -nv tank/data/old
```

Design your dataset hierarchy thoughtfully. Group by usage pattern (VMs, databases, home directories) because properties inherit from parents. Don't create a flat list of datasets; use hierarchy to express relationships and simplify configuration. Each child can override parent settings when needed.
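As an illustration of that advice, the fragment below sets broad defaults high in the tree and narrow overrides only where a workload needs them. Dataset names and tuning values are examples only, not recommendations for every system:

```bash
# Pool-wide defaults: everything below inherits these
zfs set compression=lz4 tank
zfs set atime=off tank

# Databases: small random I/O, so a small recordsize on the parent
zfs create -o recordsize=8K tank/db
zfs create tank/db/postgres          # inherits 8K from tank/db

# Media: large sequential files, so a large recordsize
zfs create -o recordsize=1M tank/media

# Home directories: defaults are fine; add per-user quotas
zfs create tank/home
zfs create -o quota=100G tank/home/alice
```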
Understanding ZFS space reporting requires grasping several interconnected concepts. Space is shared across all datasets, snapshots consume space, compression affects accounting, and reservations interact with quotas.
Key Space Metrics:
| Property | Meaning | Includes |
|---|---|---|
| used | Total space consumed by this dataset and descendants | Data + children + snapshots + clones |
| usedbydataset | Space used by this dataset's data only | Active data only, not snapshots or children |
| usedbysnapshots | Space uniquely held by snapshots | Data that would be freed if all snapshots destroyed |
| usedbychildren | Space used by child datasets | Recursive sum of all descendant 'used' |
| available | Space available to this dataset | Pool free space minus reservations, respecting quotas |
| referenced | Space this dataset would consume without sharing | What would be freed if dataset destroyed (no clones) |
| compressratio | Compression effectiveness | Logical size / physical size |
| logicalused | Space before compression | How much data was written (pre-compression) |
```
ZFS SPACE ACCOUNTING EXAMPLE
═══════════════════════════════════════════════════════════

$ zfs list -o name,used,usedbydataset,usedbychildren,usedbysnapshots,available,referenced tank/data
NAME       USED  DATASET  CHILDREN  SNAPSHOTS  AVAILABLE  REFERENCED
tank/data  250G  100G     80G       70G        450G       100G

INTERPRETATION:
─────────────────────────────────────────────────────────────
tank/data uses 250G total:
  • 100G = active data in this dataset (usedbydataset)
  •  80G = child datasets (usedbychildren)
  •  70G = space held ONLY by snapshots (usedbysnapshots)

If you destroy tank/data recursively (no clones exist):
  → Its data, children, and snapshots all go away, freeing ~250G
  → The 100G "referenced" figure is what this dataset's own active data pins

If you destroy only the snapshots:
  → You would FREE: 70G (usedbysnapshots)
  → Active data untouched

COMPRESSION IMPACT:
─────────────────────────────────────────────────────────────
$ zfs get compressratio,used,logicalused tank/logs
NAME       PROPERTY       VALUE
tank/logs  compressratio  5.00x
tank/logs  used           20G
tank/logs  logicalused    100G

INTERPRETATION:
  • You wrote 100G of log data
  • ZFS stored it in 20G of physical space
  • 5:1 compression ratio (typical for text logs)
  • Pool accounting uses physical size (20G)

QUOTA ENFORCEMENT:
─────────────────────────────────────────────────────────────
$ zfs get quota,refquota,used,referenced tank/home/alice
NAME             PROPERTY    VALUE
tank/home/alice  quota       50G   ← Max for dataset + snapshots
tank/home/alice  refquota    40G   ← Max for dataset data only
tank/home/alice  used        35G   ← Current total usage
tank/home/alice  referenced  30G   ← Current data only

INTERPRETATION:
  • Alice has 30G of active data
  • 5G is held by snapshots (35G - 30G)
  • She can add 10G more data (40G refquota - 30G)
  • She can accumulate 15G more in snapshots (50G quota - 35G)
```

When you modify data, ZFS writes new blocks and keeps the old blocks for snapshots. The 'usedbysnapshots' metric shows blocks that exist ONLY in snapshots: the delta between current and historical states. Heavily modified datasets accumulate snapshot overhead; consider snapshot retention policies.
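Before deciding on a retention policy, it helps to see where snapshot space is actually going. The commands below use placeholder dataset and snapshot names; the percent syntax destroys a range of snapshots, and -nv makes it a dry run that only reports what would be reclaimed:

```bash
# List snapshots of one dataset, largest space-holders last
zfs list -t snapshot -r -o name,used,referenced,creation -s used tank/data

# Dry run: how much space a range of old snapshots would free
zfs destroy -nv tank/data@daily-2024-01-01%daily-2024-01-31
```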
Designing a ZFS pool topology involves balancing capacity, performance, and reliability. There's no universal best configuration—each workload has different requirements.
Key Considerations:
| Topology | Capacity Efficiency | Read IOPS | Write IOPS | Rebuild Speed | Use Case |
|---|---|---|---|---|---|
| Striped Mirrors | 50% | Excellent (N×single-disk) | Excellent | Fast (single disk copy) | Databases, random I/O heavy, VMs |
| RAIDZ1 (3-5 disks) | 67-80% | Fair (read across stripe) | Fair (full-stripe writes) | Medium | Home NAS, non-critical data |
| RAIDZ2 (5-8 disks) | 60-75% | Fair | Fair | Slower (more parity math) | General enterprise storage |
| RAIDZ3 (8+ disks) | 55-70% | Fair | Fair | Slowest | Archival, large disks, critical data |
| Striped RAIDZ2 | 60-75% | Good (stripe across vdevs) | Good | Parallel resilver | Enterprise storage, balanced |
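To make the efficiency column concrete, here is a rough worked example for a hypothetical shelf of 12 x 10TB disks (raw 120TB), ignoring ZFS metadata, padding, and the TB/TiB gap:

```
Striped mirrors (6 x 2-way mirror):  6 x 10TB          ≈ 60TB usable (50%)
3 x 4-disk RAIDZ1:                   3 x (4-1) x 10TB  ≈ 90TB usable (75%)
2 x 6-disk RAIDZ2:                   2 x (6-2) x 10TB  ≈ 80TB usable (~67%)
```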
RAIDZ width (the number of disks per vdev) is effectively permanent. Adding disks to an existing RAIDZ vdev has traditionally been impossible (RAIDZ expansion only appeared in recent OpenZFS releases and may not be available on your system); normally you can only add NEW vdevs. If you start with a 4-disk RAIDZ2 and want to expand, the usual path is to add another RAIDZ2 vdev of similar width. Plan your width around your expected expansion pattern.
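Expansion therefore happens a whole vdev at a time. As a sketch (placeholder device names), growing a pool that already contains one RAIDZ2 vdev means adding a second RAIDZ2 vdev, after which writes stripe across both:

```bash
# Add a second RAIDZ2 vdev to an existing pool
zpool add tank raidz2 /dev/sd{g,h,i,j,k,l}

# Confirm the new vdev appears alongside the old one
zpool status tank
```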
ZFS supports special-purpose vdevs that accelerate specific operations without affecting the main data path. Understanding when these help—and when they waste money—is essential.
The ZFS Intent Log (ZIL) and SLOG:
Synchronous writes (database commits, NFS exports served with sync semantics, O_SYNC writes) may be acknowledged only after the data is safely on persistent storage. ZFS writes them to the ZIL before acknowledging, then commits them to main storage with a later transaction group.
By default, the ZIL lives on the main pool—fast enough for many workloads. A Separate Log Device (SLOG) moves the ZIL to dedicated fast storage.
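A quick way to feel the difference is to compare buffered and synchronous writes into the pool. This is only a rough sketch: the target path is a placeholder, the dd flags are GNU coreutils options, and real results depend heavily on hardware:

```bash
# Buffered (asynchronous) writes: coalesced by ZFS, the ZIL is not involved
dd if=/dev/zero of=/tank/test/async.bin bs=8k count=10000

# Synchronous writes: each block must reach stable storage via the ZIL
dd if=/dev/zero of=/tank/test/sync.bin bs=8k count=10000 oflag=dsync

# On a pool without a SLOG the second run is typically far slower;
# a fast, power-loss-protected SLOG narrows that gap.
```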
When SLOG helps: synchronous-write-heavy workloads such as databases committing frequently, NFS exports served with sync semantics, and VM hosts that flush on every guest write.
When SLOG doesn't help: asynchronous workloads (ordinary file copies, media streaming, most bulk writes) never touch the ZIL, and a SLOG is not a general-purpose write cache.
Requirements: very low write latency, power-loss protection (enterprise SSD or NVMe), only modest capacity (a few seconds' worth of synchronous writes), and ideally a mirror, since the device briefly holds data not yet committed to the main pool.
Example configuration:
# Add mirrored SLOG using enterprise SSDs
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
SLOG, L2ARC, and special vdevs add operational complexity. Before adding them, verify your workload actually benefits. Use tools like 'arc_summary' and I/O tracing to identify bottlenecks. Many systems perform excellently with just a well-designed main pool and sufficient RAM.
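A few read-only commands are usually enough for that sanity check (arc_summary ships with most OpenZFS installations; the latency-histogram option is an OpenZFS feature):

```bash
# ARC hit rates and sizing: if the ARC already hits the vast majority
# of reads, an L2ARC device will add little
arc_summary | less

# Per-vdev I/O over time: are the data vdevs or the log device busy?
zpool iostat -v tank 5

# Latency histograms: where is I/O time actually being spent?
zpool iostat -w tank 5
```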
We've explored ZFS's revolutionary pooled storage architecture, the foundation that enables its powerful features. Let's consolidate the key insights:
- Pools replace fixed volumes: all datasets draw space dynamically from one shared pool.
- Vdevs are the unit of redundancy and performance; data stripes across vdevs, and the pool is only as reliable as its least redundant vdev.
- Pool topology is essentially permanent, so choose the mirror/RAIDZ layout and vdev width before creation.
- Datasets and zvols inherit properties hierarchically, with quotas and reservations governing shared space.
- Space accounting (used, referenced, usedbysnapshots, compressratio) explains where pool capacity actually goes.
- Special-purpose vdevs (SLOG, L2ARC, special) help specific workloads; add them only when measurement shows a need.
What's Next:
With the pooled storage foundation understood, we'll explore ZFS's checksum and data integrity mechanisms—how ZFS detects and corrects silent data corruption that other file systems miss entirely. This self-healing capability is central to ZFS's reputation for reliability.
You now understand ZFS's pooled storage architecture: how vdevs abstract physical storage, how datasets dynamically share pool space, and how to design pool topologies for different requirements. Next, we'll explore the checksum system that makes ZFS data trustworthy.