In the world of Copy-on-Write file systems, two names dominate: ZFS and btrfs. Both deliver the COW promise—snapshots, checksums, and data integrity—but they emerged from different origins, serve different communities, and make different architectural choices.
ZFS, born at Sun Microsystems in 2005, was designed from the ground up as a complete storage solution: file system, volume manager, and RAID implementation combined. It prioritizes data integrity above all else and has earned a reputation for bulletproof reliability in enterprise deployments.
btrfs, initiated by Oracle in 2007, aimed to bring ZFS-like features to the Linux kernel with native integration. It emphasizes flexibility—subvolumes, snapshots, and online operations—while maintaining Linux's modular philosophy of separate tools working together.
Choosing between them isn't about which is "better"—it's about which fits your requirements, constraints, and operational model.
By the end of this page, you will understand the architectural differences between btrfs and ZFS, their unique features and capabilities, the licensing and integration implications, and practical guidance for choosing between them based on your specific needs.
ZFS was designed with an ambitious goal: eliminate all known data corruption vectors. Its creators at Sun Microsystems approached storage as a single, integrated problem rather than a stack of independent layers.
Core architectural principles:
Pooled storage: Disks are combined into pools; file systems (datasets) draw from shared pool space. No fixed partition sizes.
Transactional object model: Everything is an object with a checksum. Transactions are atomic across the entire pool.
End-to-end integrity: Data is checksummed at write, verified at read, with checksums stored in parent blocks.
Integrated RAID: RAID-Z levels are checksum-aware; can repair silent corruption, not just failed disks.
Immutable data: Copy-on-write means data is never overwritten; older versions are preserved until freed.
Key ZFS features:
| Feature | Description |
|---|---|
| Storage pools | Combine multiple devices; datasets share capacity automatically |
| RAID-Z/Z2/Z3 | Single/double/triple parity with per-block checksum verification |
| Snapshots & clones | Instant, space-efficient point-in-time copies |
| Send/receive | Efficient serialization for backup and replication |
| Compression | LZ4, ZSTD, gzip, lzjb - transparent, per-dataset |
| Deduplication | Optional block-level deduplication (RAM-intensive) |
| Encryption | Native encryption (OpenZFS 2.0+) with key management |
| ARC/L2ARC | Adaptive Replacement Cache with optional SSD extension |
| Special vdev | Separate device class for metadata/small blocks |
| Quotas/reservations | Per-dataset space management |
```bash
# === ZFS Pool Management ===

# Create a mirrored pool
zpool create tank mirror /dev/sda /dev/sdb

# Create a RAID-Z2 pool (double parity)
zpool create tank raidz2 /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Add a SLOG for sync write acceleration
zpool add tank log /dev/nvme0n1

# Add L2ARC for read cache
zpool add tank cache /dev/nvme1n1

# View pool status
zpool status tank
zpool list tank

# === Dataset Management ===

# Create nested datasets (automatic mounting)
zfs create tank/home
zfs create tank/home/alice
zfs create tank/databases

# Set properties
zfs set compression=lz4 tank          # Enable compression pool-wide
zfs set quota=100G tank/home/alice    # Limit space usage
zfs set mountpoint=/mnt/data tank/data

# List datasets with space usage
zfs list -r tank

# === Snapshots and Clones ===

# Create snapshot
zfs snapshot tank/databases@before-migration

# Create recursive snapshot
zfs snapshot -r tank/home@daily-$(date +%Y%m%d)

# Clone from snapshot (writeable copy)
zfs clone tank/databases@before-migration tank/databases-test

# Rollback to snapshot
zfs rollback tank/databases@before-migration

# === Send/Receive Replication ===

# Initial full send
zfs send tank/databases@snap1 | ssh backup zfs receive backuppool/databases

# Incremental send (much faster)
zfs send -i tank/databases@snap1 tank/databases@snap2 | \
  ssh backup zfs receive backuppool/databases

# Encrypted, compressed send
zfs send -wc tank/encrypted@snap | gzip | \
  ssh backup "gunzip | zfs receive backuppool/encrypted"
```

ZFS uses the CDDL license, incompatible with the Linux kernel's GPL. This means ZFS cannot be included in the mainline kernel. Users install via OpenZFS (zfs-dkms or pre-built modules). While legally complex, it's widely used in production without practical issues.
btrfs (pronounced "butter-FS" or "B-tree-FS") was designed as a native Linux file system to bring ZFS-like features with GPL licensing and tight kernel integration.
Core architectural principles:
B-tree everything: All metadata is stored in B-tree structures, enabling efficient lookups and modifications.
Subvolumes: Lightweight, independent namespace trees within a single file system. More flexible than ZFS datasets.
Extents: Large contiguous file allocations (vs. fixed block pointers) for efficiency with large files.
Inline data: Small files stored directly in metadata, eliminating separate data blocks.
Native Linux integration: Standard VFS interface, works with all Linux tools, kernel module included.
Key btrfs features:
| Feature | Description |
|---|---|
| Subvolumes | Independent directory trees, each snapshot-able |
| Snapshots | Instant, space-efficient, including read-write snapshots |
| Send/receive | File-level streaming for backup/replication |
| Compression | zlib, lzo, zstd - transparent, can be per-file |
| Deduplication | Offline dedup via duperemove, no RAM overhead |
| Built-in RAID | RAID 0/1/10/5/6 (5/6 have write hole issues) |
| Checksums | CRC32C default, xxhash, sha256, blake2b available |
| Online operations | Resize, balance, convert, defrag while mounted |
| Reflinks | Instant copy-on-write file copies (cp --reflink) |
| Quotas | Per-subvolume quota groups (qgroups) |
```bash
# === btrfs File System Creation ===

# Create btrfs on a single device
mkfs.btrfs /dev/sda

# Create btrfs with RAID1 metadata, RAID0 data
mkfs.btrfs -m raid1 -d raid0 /dev/sda /dev/sdb

# Create with specific label and checksums
mkfs.btrfs -L mydata --checksum sha256 /dev/sda

# Mount the file system
mount /dev/sda /mnt/mydata

# === Subvolume Management ===

# Create subvolumes
btrfs subvolume create /mnt/mydata/@root
btrfs subvolume create /mnt/mydata/@home
btrfs subvolume create /mnt/mydata/@snapshots

# List subvolumes
btrfs subvolume list /mnt/mydata

# Mount a specific subvolume
mount -o subvol=@home /dev/sda /home

# Set default subvolume (for boot without subvol= option)
btrfs subvolume set-default 256 /mnt/mydata

# === Snapshots ===

# Create read-only snapshot
btrfs subvolume snapshot -r /mnt/mydata/@root \
  /mnt/mydata/@snapshots/root-$(date +%Y%m%d)

# Create read-write snapshot (for testing)
btrfs subvolume snapshot /mnt/mydata/@home \
  /mnt/mydata/@home-test

# Delete snapshot
btrfs subvolume delete /mnt/mydata/@snapshots/root-old

# === Send/Receive ===

# Send snapshot to file
btrfs send /mnt/mydata/@snapshots/root-20250116 > /backup/root.btrfs

# Incremental send (requires parent snapshot at destination)
btrfs send -p /mnt/mydata/@snapshots/root-20250115 \
  /mnt/mydata/@snapshots/root-20250116 | \
  ssh backup btrfs receive /backup/snapshots/

# === Maintenance ===

# Check file system (must be unmounted or read-only)
btrfs check /dev/sda

# Scrub for corruption detection
btrfs scrub start /mnt/mydata
btrfs scrub status /mnt/mydata

# Balance to redistribute data (after adding drives)
btrfs balance start /mnt/mydata

# Online resize
btrfs filesystem resize +10G /mnt/mydata
btrfs filesystem resize max /mnt/mydata   # Use all available

# === Reflinks (instant copy) ===

# Copy file instantly with copy-on-write
cp --reflink=always large_file.iso large_file_copy.iso

# Check if files share blocks
filefrag -v large_file.iso large_file_copy.iso
```

btrfs RAID5 and RAID6 have a known write hole issue and are not recommended for production use. Use RAID1 or RAID10 for redundancy, or use btrfs on top of mdadm or LVM for RAID5/6 semantics. The development community continues work on this, but treat RAID5/6 as experimental.
Understanding the architectural differences helps explain why each file system behaves as it does:
Block allocation strategies:
| Aspect | ZFS | btrfs |
|---|---|---|
| Allocation unit | Variable block size (512B - 16MB) | Fixed 4KB blocks, variable extents |
| Default record size | 128KB (tunable per dataset) | 4KB blocks in contiguous extents |
| Small file handling | Stored in one block (up to recordsize) | Inline in metadata (< ~2KB) |
| Large file handling | One block pointer per recordsize chunk | Extents describe contiguous ranges |
| Fragmentation tendency | Moderate (sequential allocation hints) | Lower (extent-based allocation) |
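A quick way to observe these allocation behaviors on a live system is to query the ZFS record size and inspect how btrfs lays a file out as extents. A minimal sketch; the pool, dataset, and file names are placeholders:

```bash
# ZFS: show the per-dataset record size (tunable, 128K by default)
zfs get recordsize tank/data

# Use a larger record size for a dataset holding big sequential files
zfs set recordsize=1M tank/media

# btrfs: show how a file is laid out as extents
filefrag -v /mnt/mydata/large_file.iso

# btrfs: report per-file space usage, including shared extents
btrfs filesystem du /mnt/mydata/large_file.iso
```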
Checksum and integrity:
Both file systems use parent-pointer checksums (Merkle tree style), but differ in implementation:
| Aspect | ZFS | btrfs |
|---|---|---|
| Default algorithm | Fletcher4 | CRC32C |
| Algorithm options | Fletcher, SHA-256/512, Skein, Edon-R, BLAKE3 | CRC32C, xxhash, SHA-256, BLAKE2b |
| Metadata copies | 2-3 copies of critical metadata | Configurable DUP for metadata |
| Checksum location | In block pointer (parent block) | In extent item (parent block) |
| Self-healing | Yes, with mirror/RAID-Z | Yes, with RAID1/10 |
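The checksum algorithm is selectable on both sides, but it is configured in different places: a per-dataset property on ZFS, an mkfs-time option on btrfs. The sketch below also shows how to trigger a verification pass; pool, dataset, and device names are placeholders:

```bash
# ZFS: checksums are a per-dataset property and can be changed at any time
# (existing blocks keep their old checksum until they are rewritten)
zfs set checksum=sha256 tank/important
zfs get checksum tank/important

# Verify every block in the pool against its checksum
zpool scrub tank
zpool status tank            # shows scrub progress and any repairs

# btrfs: the checksum algorithm is chosen at mkfs time for the whole file system
mkfs.btrfs --checksum xxhash /dev/sdb

# Verify all data and metadata checksums
btrfs scrub start /mnt/mydata
btrfs scrub status /mnt/mydata
```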
Memory and resource usage:
ZFS and btrfs have different resource profiles:
| Resource | ZFS | btrfs |
|---|---|---|
| Minimum RAM | 2-4GB (more for dedup) | 512MB-1GB |
| Recommended RAM | 1GB per TB (more for production) | 1GB typical |
| Deduplication RAM | 5GB per TB of deduped data | Offline tool, no RAM overhead |
| Metadata caching | ARC, configurable size | Page cache (standard Linux) |
| CPU overhead | Moderate (checksums, compression) | Similar |
ZFS's "1GB RAM per TB of storage" rule is for optimal ARC performance, not a hard requirement. ZFS will run with less RAM but with reduced caching effectiveness. For deduplication, the rule becomes much more aggressive—plan for 5GB per TB of deduplicated data.
Let's compare specific features in detail:
Snapshots:
Both excel at snapshots, but with different characteristics:
| Aspect | ZFS | btrfs |
|---|---|---|
| Creation time | O(1), instantaneous | O(1), instantaneous |
| Space accounting | Per-dataset, clear breakdown | Quota groups (complex) |
| Writeable snapshots | Via clones | Native read-write snapshots |
| Snapshot visibility | Hidden .zfs directory | Normal directory (subvolume) |
| Cross-dataset | Independent per dataset | Subvolumes are independent |
| Max snapshots | Thousands practical, no hard limit | Thousands practical |
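The "writeable snapshots" row is the main workflow difference: ZFS snapshots are read-only and become writeable through a clone, while btrfs snapshots can be created read-write directly. A minimal sketch with placeholder dataset and subvolume names:

```bash
# ZFS: snapshot is read-only; clone it to get a writeable copy
zfs snapshot tank/app@v1
zfs clone tank/app@v1 tank/app-experiment

# Snapshots are also browsable under the hidden .zfs directory
ls /tank/app/.zfs/snapshot/v1/

# btrfs: a snapshot is itself a subvolume; omit -r for a read-write copy
btrfs subvolume snapshot /mnt/mydata/@app /mnt/mydata/@app-experiment

# The snapshot appears as an ordinary directory tree
ls /mnt/mydata/@app-experiment/
```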
Broader feature comparison:
| Feature | ZFS | btrfs |
|---|---|---|
| Compression algorithms | lz4, zstd, gzip, lzjb, zle | lzo, zlib, zstd |
| Deduplication | Inline (RAM-intensive) | Offline (duperemove tool) |
| Encryption | Native (OpenZFS 2.0+) | Via dm-crypt/LUKS layer |
| RAID levels | Mirror, RAID-Z/Z2/Z3, dRAID | 0, 1, 10, 5*, 6* (*unstable) |
| Device replacement | Resilver to new device | Replace command, balance |
| Growing pool | Add vdevs, expand vdev | Add devices, balance |
| Shrinking pool | Not supported | Supported with balance |
| Quotas | Per-dataset, simple | Quota groups, complex |
| ACL support | NFSv4 ACLs, POSIX ACLs | POSIX ACLs |
| Special characters | UTF-8, case sensitivity options | UTF-8 only |
| Boot support | Well-supported with GRUB | Well-supported with GRUB |
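The growing and shrinking rows are worth a concrete illustration: both file systems grow by adding devices, but only btrfs can later remove one. A minimal sketch with placeholder device names:

```bash
# ZFS: grow a pool by adding another mirror vdev
# (top-level vdev removal is limited, so plan the layout up front)
zpool add tank mirror /dev/sdg /dev/sdh

# btrfs: grow by adding a device, then rebalance data across all devices
btrfs device add /dev/sdg /mnt/mydata
btrfs balance start /mnt/mydata

# btrfs: shrink by removing a device; data is migrated off it automatically
btrfs device remove /dev/sdg /mnt/mydata
```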
Send/receive comparison:
Both support efficient replication, but the mechanisms differ:
ZFS send streams are block-level:
- Raw sends (`-w`) preserve encryption without access to keys
- Compressed sends (`-c`) preserve compression decisions
- Transfers are resumable (receive `-s` flag)

Advantages: exact block-level replication, encrypted send without decryption.
Limitations: Destination must be ZFS; block structure determines size.
btrfs's reflink feature (cp --reflink) enables instant file copies with COW semantics at the file level. This is powerful for container layers, package caching, and backup tools. ZFS achieves similar results via clones but at the dataset level, not individual files.
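The granularity difference is easiest to see side by side: btrfs reflinks operate on individual files, while ZFS cloning operates on whole datasets. Names below are placeholders:

```bash
# btrfs: instant COW copy of a single file; blocks are shared until modified
cp --reflink=always image-base.qcow2 vm42.qcow2

# ZFS: the equivalent granularity is a whole dataset, via snapshot + clone
zfs snapshot tank/images@base
zfs clone tank/images@base tank/vm42
```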
ZFS is the right choice when your priorities align with its strengths:
1. Enterprise storage servers:
ZFS shines in file servers, NAS appliances, and storage arrays, where pooled storage, RAID-Z redundancy, and send/receive replication map directly onto typical requirements.
2. Databases with large storage:
For database servers requiring robust storage, tune recordsize and compression per dataset. Common starting points:
| Use Case | Recordsize | Compression | Special Settings |
|---|---|---|---|
| General file server | 128K (default) | lz4 | atime=off, xattr=sa |
| MySQL/MariaDB | 16K | off or lz4 | primarycache=metadata |
| PostgreSQL | 8K or 16K | lz4 | logbias=throughput |
| MongoDB | 8K or 16K | lz4 | recordsize matching page size |
| Virtualization | 64K or 128K | lz4 or off | sync=disabled (for non-critical) |
| Backup storage | 1M | zstd | copies=2 for extra safety |
123456789101112131415161718192021222324252627282930313233343536
```bash
# Example: Configure ZFS for PostgreSQL production database

# Create pool with redundancy
zpool create -o ashift=12 dbpool mirror /dev/sda /dev/sdb

# Create dataset optimized for PostgreSQL
zfs create dbpool/postgres

# PostgreSQL uses 8K blocks
zfs set recordsize=16K dbpool/postgres

# Light compression - lz4 is fast
zfs set compression=lz4 dbpool/postgres

# Disable access time updates
zfs set atime=off dbpool/postgres

# Prioritize metadata in ARC for random I/O
zfs set primarycache=metadata dbpool/postgres

# Optimize for throughput (PostgreSQL handles its own logging)
zfs set logbias=throughput dbpool/postgres

# Set mount point
zfs set mountpoint=/var/lib/postgresql dbpool/postgres

# Create dataset for WAL logs (different characteristics)
zfs create dbpool/postgres/wal
zfs set recordsize=128K dbpool/postgres/wal
zfs set logbias=latency dbpool/postgres/wal   # WAL needs low latency

# Add SLOG for sync write performance if using sync=standard
# zpool add dbpool log mirror /dev/nvme0n1 /dev/nvme1n1

# Verify configuration
zfs get recordsize,compression,atime,primarycache dbpool/postgres
```

ZFS prioritizes data safety over features. New features are added slowly after extensive testing. If you need the most reliable, proven COW file system and can accept the CDDL licensing constraints, ZFS is the safe choice.
btrfs is the right choice when your priorities favor native Linux integration and flexibility:
1. Desktop and laptop systems:
btrfs has become the default root file system for several major distributions, including Fedora Workstation and openSUSE.
2. Container and development environments:
Reflinks and subvolumes excel for container image layers, package caches, and instant working copies of large projects during development.
3. System snapshot and rollback:
btrfs's integration with tools like Snapper and Timeshift provides automatic snapshots around system changes and straightforward rollback:
```bash
# Snapper configuration for root subvolume
snapper -c root create-config /

# Automatic snapshots before/after package operations
# (via pacman hooks or zypper plugins)

# Boot into previous snapshot if update breaks system
# GRUB menu shows snapshot entries

# Rollback to previous snapshot
snapper rollback 42
```
This workflow is particularly powerful for rolling-release distributions where updates occasionally cause issues.
```bash
# Example: Configure btrfs desktop with Timeshift integration

# Create btrfs with optimal settings
mkfs.btrfs -L system /dev/sda2

# Mount temporarily to create subvolume layout
mount /dev/sda2 /mnt

# Create subvolume layout suitable for snapshots
btrfs subvolume create /mnt/@
btrfs subvolume create /mnt/@home
btrfs subvolume create /mnt/@snapshots
btrfs subvolume create /mnt/@log
btrfs subvolume create /mnt/@cache

# Unmount and remount with subvolumes
umount /mnt

# /etc/fstab entries:
# UUID=xxx /           btrfs subvol=@,compress=zstd,noatime          0 0
# UUID=xxx /home       btrfs subvol=@home,compress=zstd,noatime      0 0
# UUID=xxx /.snapshots btrfs subvol=@snapshots,noatime               0 0
# UUID=xxx /var/log    btrfs subvol=@log,compress=zstd,noatime       0 0
# UUID=xxx /var/cache  btrfs subvol=@cache,compress=zstd,noatime     0 0

# Install and configure Timeshift (after system installation)
# apt install timeshift   # Debian/Ubuntu
# dnf install timeshift   # Fedora

# Configure Timeshift to snapshot @ and @home
# Set schedule: daily, keep 7

# Manual snapshot before risky operation
timeshift --create --comments "Before kernel update"

# List snapshots
timeshift --list

# Restore if needed (from live USB for root)
timeshift --restore

# === Reflink usage for development ===

# Clone a large project instantly
cp -r --reflink=always myproject myproject-experiment

# Both share data until divergence
du -sh myproject myproject-experiment               # Same apparent size
btrfs filesystem du myproject myproject-experiment  # Shows shared data
```

If you need RAID5/6 reliability with btrfs features, layer btrfs on top of mdadm. mdadm provides stable RAID5/6, and btrfs adds COW benefits. You lose btrfs self-healing (mdadm doesn't know about checksums), but gain stable parity RAID.
Both file systems have sharp edges that can surprise administrators, and maturity varies by feature. The comparison below summarizes where each is production-ready and where extra caution is warranted:
| Feature | ZFS Maturity | btrfs Maturity |
|---|---|---|
| Basic operations | ⭐⭐⭐⭐⭐ Production-ready | ⭐⭐⭐⭐⭐ Production-ready |
| RAID redundancy | ⭐⭐⭐⭐⭐ RAID-Z is rock-solid | ⭐⭐⭐⭐ RAID1/10 solid; 5/6 unstable |
| Snapshots | ⭐⭐⭐⭐⭐ Mature | ⭐⭐⭐⭐⭐ Mature |
| Send/receive | ⭐⭐⭐⭐⭐ Mature with resume | ⭐⭐⭐⭐ Good, no resume |
| Compression | ⭐⭐⭐⭐⭐ Multiple algorithms | ⭐⭐⭐⭐⭐ Good algorithm support |
| Encryption | ⭐⭐⭐⭐ Native (newer) | ⭐⭐⭐ Via LUKS layer |
| Self-healing | ⭐⭐⭐⭐⭐ Proven | ⭐⭐⭐⭐ Works with RAID1/10 |
| Tooling | ⭐⭐⭐⭐⭐ Comprehensive | ⭐⭐⭐⭐ Good, some gaps |
Both file systems have complex recovery scenarios. Before relying on either in production, practice: recovering from drive failures, rolling back snapshots, and restoring from send/receive backups. Knowing the recovery process before you need it is essential.
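A sketch of the drive-replacement drill on each system, worth rehearsing on scratch hardware or loop devices before you need it in production (device names are placeholders):

```bash
# ZFS: replace a failing disk in a redundant pool and watch the resilver
zpool status tank                       # identify the degraded device
zpool replace tank /dev/sdc /dev/sdx    # swap in the new disk
zpool status tank                       # monitor resilver progress

# btrfs: replace a device in a RAID1/10 file system while mounted
btrfs replace start /dev/sdc /dev/sdx /mnt/mydata
btrfs replace status /mnt/mydata

# Verify integrity afterwards on both systems
zpool scrub tank
btrfs scrub start /mnt/mydata
```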
We've explored both major COW file systems in depth. Here's a consolidated decision framework:
| If you need... | Choose | Because |
|---|---|---|
| Rock-solid RAID5/6 equivalent | ZFS | RAID-Z is proven and reliable |
| Native Linux kernel integration | btrfs | GPL licensed, in mainline kernel |
| Enterprise storage appliances | ZFS | More mature enterprise tooling |
| Desktop snapshots (Timeshift) | btrfs | Better tool integration |
| Container storage (reflinks) | btrfs | Reflinks are powerful for containers |
| FreeBSD/illumos | ZFS | Native, first-class support |
| Lower memory systems | btrfs | More RAM-efficient |
| Block-level encrypted backup | ZFS | Native encrypted send support |
Neither choice is wrong
Organizations successfully deploy both file systems in production. The "right" choice depends on your specific requirements, existing infrastructure, and operational expertise. Many teams use both—ZFS for storage servers and btrfs for desktop systems.
Looking ahead:
With a solid understanding of COW file systems—concept, snapshots, integrity, and implementations—our final page examines the performance tradeoffs. COW isn't free; understanding the costs helps you optimize for your workload.
You now understand both btrfs and ZFS at an architectural level, their strengths and weaknesses, and when to choose each. You can explain the key differences to stakeholders and make informed decisions for your storage needs. Next, we'll explore COW performance tradeoffs and optimization strategies.