In the landscape of modern file systems, Btrfs (B-tree File System, pronounced "Butter FS" or "Better FS") represents one of the most ambitious attempts to reimagine how Linux manages storage. Born from the recognition that traditional file systems like ext4 were reaching their architectural limits, Btrfs was designed from the ground up to address the storage challenges of the 21st century.
Btrfs isn't merely an incremental improvement over its predecessors—it's a fundamental rethinking of file system architecture. Where ext4 added features like extents and journaling to the venerable ext2 foundation, Btrfs started with a clean slate, incorporating lessons learned from decades of file system research and the revolutionary ideas pioneered by Sun's ZFS.
By the end of this page, you will understand the architectural foundations of Btrfs: why B-trees were chosen as the core data structure, how Btrfs organizes data on disk, its key design principles, and how it compares to traditional file systems. You'll gain the foundational knowledge necessary to understand Btrfs's advanced features like COW, snapshots, and self-healing.
Btrfs development began at Oracle Corporation in 2007, initiated by Chris Mason, a veteran file system developer who had previously worked on ReiserFS. The project emerged from a recognition that existing Linux file systems, despite their maturity and reliability, could not meet the evolving demands of modern storage:
The Limitations of Traditional File Systems:
Static volume management: Traditional file systems required separate volume managers (like LVM) to pool storage, add devices, or resize partitions—operations that were complex and sometimes dangerous.
No native checksumming: Data corruption could go undetected. Silent bit rot, controller errors, and firmware bugs could corrupt data without the file system's knowledge.
Limited snapshot capabilities: While LVM offered snapshots, they were inefficient (requiring copy-on-write at the block level) and couldn't leverage file system semantics.
Fixed metadata structures: File systems like ext4 used fixed-size metadata structures that couldn't adapt to varying workload characteristics.
Single-device thinking: Most file systems assumed a single underlying block device, requiring external RAID controllers or LVM for multi-device configurations.
Btrfs was heavily influenced by Sun Microsystems' ZFS, which demonstrated that a file system could integrate volume management, checksumming, snapshots, and self-healing into a unified architecture. However, ZFS's CDDL license was incompatible with the Linux GPL, necessitating a new implementation. Btrfs aimed to bring ZFS-like capabilities to Linux with a GPL-compatible license.
The Design Goals:
From its inception, Btrfs was designed with several ambitious goals:
Copy-on-Write Architecture: Never overwrite data in place. Instead, write modifications to new locations, enabling atomic updates and cheap snapshots.
Integrated Volume Management: Pool multiple devices, add/remove storage dynamically, and handle RAID without external tools.
Data and Metadata Checksumming: Every block of data and metadata carries a checksum, enabling detection (and with redundancy, correction) of corruption.
Efficient Snapshots and Clones: Create instant point-in-time copies without duplicating data.
Online Operations: Grow, shrink, defragment, scrub, and check the file system while it remains mounted and in use.
Scalability: Support exabyte-scale storage with efficient handling of billions of files.
These goals required a completely new architecture—one built around a data structure capable of efficiently handling all these operations: the B-tree.
The choice of B-trees as Btrfs's fundamental data structure wasn't arbitrary—it was a carefully considered engineering decision based on decades of database and file system research.
Understanding B-trees:
A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. Unlike binary trees where each node has at most two children, B-tree nodes can have many children—making them particularly well-suited for storage systems that read and write data in large blocks.
Key B-tree Properties:
High fan-out: each node holds many sorted keys, so the tree stays shallow even for very large data sets.
Self-balancing: all leaves sit at the same depth; inserts and deletes split or merge nodes to keep it that way.
Logarithmic operations: search, insertion, and deletion touch only O(log n) nodes.
Block-friendly layout: wide nodes map naturally onto storage devices, which read and write in large blocks anyway.
The B-tree Variant in Btrfs:
Btrfs uses a variant called B+ trees with copy-on-write semantics. In a traditional B+ tree, interior nodes hold only keys and pointers, all data lives in the leaves, and the leaves are chained together for fast sequential scans.
Btrfs extends this with copy-on-write (COW): nodes are never modified in place. A changed node is written to a new location, its parent is updated (also via COW) to point at the new copy, and the change propagates up to the root. Leaf-to-leaf links are dropped, because they would force neighboring leaves to be rewritten every time one of them moves.
This COW approach enables Btrfs's snapshot functionality and ensures atomicity—either a complete, consistent update is visible, or none of it is.
Consider a Btrfs tree with 4KB nodes, each holding approximately 200 items. Three levels can address 200³ = 8 million items; four levels reach 200⁴ = 1.6 billion. This means any lookup requires at most 4 node reads, regardless of file system size: lookups cost O(log n), and the large fan-out keeps that logarithm small.
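To sanity-check that arithmetic, here is a tiny C program; the 200-items-per-node figure is just the illustrative assumption from the example above, not a fixed Btrfs constant:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative assumption: each B-tree node holds ~200 items. */
#define ITEMS_PER_NODE 200ULL

int main(void)
{
    uint64_t reachable = 1;

    for (int level = 1; level <= 4; level++) {
        reachable *= ITEMS_PER_NODE;   /* fan-out multiplies at each level */
        printf("%d level(s): ~%llu items reachable\n",
               level, (unsigned long long)reachable);
    }
    /* Prints 200, 40000, 8000000, 1600000000: even a four-level tree
       needs at most four node reads per lookup. */
    return 0;
}
```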
Unlike traditional file systems that use separate structures for different purposes (inodes, block bitmaps, extent trees), Btrfs unifies almost everything into multiple B-trees, all sharing the same fundamental structure but storing different types of items.
The Forest of Trees:
Btrfs maintains several distinct trees, each serving a specific purpose:
| Tree | Purpose | Key Contents |
|---|---|---|
| Root Tree | The tree of trees—locates all other trees | Root items pointing to other tree roots |
| FS Tree (Subvolume Tree) | Contains file system objects for a subvolume | Inodes, directory items, extent data references |
| Extent Tree | Tracks allocation of disk space | Extent items, back-references |
| Chunk Tree | Maps logical addresses to physical devices | Chunk items, device extent items |
| Device Tree | Per-device allocation information | Device items and extent allocation |
| Checksum Tree | Stores checksums for data blocks | Checksum items keyed by extent offset |
| UUID Tree | Maps subvolume UUIDs to tree IDs | UUID items for subvolume lookup |
| Free Space Tree | Efficiently tracks free space (newer feature) | Free space info and bitmap items |
The Unified Key Structure:
All Btrfs items are located using a 136-bit key consisting of three components:
[ objectid: 64 bits | type: 8 bits | offset: 64 bits ]
This unified key format means all trees use the same lookup algorithms, simplifying code and enabling optimizations to benefit all operations.
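For illustration, the key can be written as a packed C structure. The sketch below follows the field layout of the kernel's struct btrfs_disk_key, but it is a user-space approximation rather than a copy of the kernel header:

```c
#include <stdint.h>
#include <stdio.h>

/* On disk the integers are little-endian and the struct is packed,
 * so the key occupies exactly 17 bytes (136 bits). */
struct btrfs_disk_key {
    uint64_t objectid;  /* owning object, e.g. inode number or tree id     */
    uint8_t  type;      /* item type, e.g. INODE_ITEM or EXTENT_DATA       */
    uint64_t offset;    /* meaning depends on type, e.g. a file offset     */
} __attribute__((packed));

int main(void)
{
    printf("key size: %zu bytes\n", sizeof(struct btrfs_disk_key)); /* 17 */
    return 0;
}
```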
Example Key Types:
INODE_ITEM: the core metadata for a file or directory (ownership, size, timestamps).
DIR_ITEM / DIR_INDEX: directory entries mapping names to inodes.
EXTENT_DATA: maps a range of a file to an on-disk extent, or holds inline data.
ROOT_ITEM: describes the root of another tree, such as a subvolume.
CHUNK_ITEM: a logical-to-physical chunk mapping in the chunk tree.
Btrfs stores variable-sized items in its tree nodes, not fixed-size structures like ext4 inodes. This allows items to grow (e.g., inline data for tiny files) or shrink as needed, improving space efficiency and flexibility.
Understanding how Btrfs organizes data on physical storage is essential for comprehending its capabilities and performance characteristics.
The Superblock:
Like all file systems, Btrfs begins with a superblock—the fixed location where the file system's root information is stored. Btrfs maintains multiple superblock copies for redundancy: the primary at 64 KiB from the start of each device, with mirror copies at 64 MiB and 256 GiB on devices large enough to hold them.
The superblock contains pointers to the roots of the root tree and chunk tree, the current generation number, the file system UUID, feature flags, and a small embedded array of system chunk mappings used to bootstrap address translation at mount time.
Chunks and Block Groups:
Btrfs abstracts physical devices through a two-level address mapping:
Logical Addresses → Chunks → Physical Addresses
Logical Address Space: A single unified address space spanning potentially multiple devices. The file system internally uses logical addresses everywhere.
Chunks: Contiguous regions of logical address space. Each chunk maps to one or more physical device locations.
Block Groups: Logical groupings of chunks with the same allocation purpose (data, metadata, system) and redundancy profile (single, DUP, RAID0, RAID1, etc.).
This abstraction enables Btrfs to add or remove devices online, place data and metadata with different redundancy profiles, convert between profiles on the fly, and relocate chunks during a balance, all without changing the logical addresses stored in the trees. A simplified lookup sketch and an on-disk layout diagram follow.
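Here is a minimal illustration of the translation idea, using simplified, hypothetical structures (real chunks can carry multiple stripes and richer RAID profiles than this single-stripe sketch):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified view of one chunk-tree entry: a contiguous run
 * of logical address space mapped onto one device extent. */
struct chunk_map {
    uint64_t logical;     /* start of chunk in the logical address space */
    uint64_t length;      /* chunk size in bytes                         */
    int      devid;       /* backing device                              */
    uint64_t dev_offset;  /* physical offset of the stripe on that device */
};

/* Translate a logical address to (devid, physical offset). */
static int map_logical(const struct chunk_map *chunks, size_t n,
                       uint64_t logical, int *devid, uint64_t *physical)
{
    for (size_t i = 0; i < n; i++) {
        if (logical >= chunks[i].logical &&
            logical <  chunks[i].logical + chunks[i].length) {
            *devid = chunks[i].devid;
            *physical = chunks[i].dev_offset + (logical - chunks[i].logical);
            return 0;
        }
    }
    return -1;  /* not mapped */
}

int main(void)
{
    struct chunk_map chunks[] = {
        { .logical = 0,          .length = 1ULL << 30, .devid = 1, .dev_offset = 1ULL << 20 },
        { .logical = 1ULL << 30, .length = 1ULL << 30, .devid = 2, .dev_offset = 1ULL << 20 },
    };
    int dev;
    uint64_t phys;

    if (map_logical(chunks, 2, (1ULL << 30) + 4096, &dev, &phys) == 0)
        printf("devid %d, physical offset %llu\n", dev, (unsigned long long)phys);
    return 0;
}
```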
```
Logical Address Space (Unified View):
┌────────────────────────────────────────────────────────┐
│ 0GB        32GB       64GB       96GB       128GB      │
│ │          │          │          │          │          │
│ └──Chunk 1─┴──Chunk 2─┴──Chunk 3─┴──Chunk 4─┘          │
│   (Data)     (Meta)     (Data)     (Data)              │
└────────────────────────────────────────────────────────┘
                        │
                        ▼  Chunk Mapping
Physical Devices:
┌─────────────┐   ┌─────────────┐
│  /dev/sda   │   │  /dev/sdb   │
├─────────────┤   ├─────────────┤
│  Chunk 1    │   │  Chunk 2-m1 │ ← RAID1 mirror
│  Chunk 2-m0 │   │  Chunk 3-p0 │ ← RAID0 stripe
│  Chunk 3-p1 │   │  Chunk 4    │
└─────────────┘   └─────────────┘

Keys:
- Chunk 2: Metadata with RAID1 (mirrored across both devices)
- Chunk 3: Data with RAID0 (striped across both devices)
```

Node Size and Sector Size:
Btrfs uses a configurable node size (default: 16 KiB) for its tree nodes and a sector size (typically 4 KiB) for data allocation. The node size determines how many items fit in each node, and with it the tree's fan-out and depth, as well as how much metadata must be rewritten when a node is copied on write.
The separation of node size from sector size allows metadata to use larger, more efficient blocks while data can be allocated at sector granularity.
The node size is set when the file system is created (mkfs.btrfs -n <size>) and cannot be changed afterward. For most workloads, the default 16 KiB is appropriate, but metadata-heavy workloads (many small files) may benefit from larger nodes.
Btrfs reimagines the traditional Unix inode concept, storing inode data as variable-sized items within its B-trees rather than fixed structures in a dedicated inode table.
The INODE_ITEM:
Each file or directory has an inode item containing:
struct btrfs_inode_item {
__le64 generation; // Creation transaction
__le64 transid; // Last modification transaction
__le64 size; // File size in bytes
__le64 nbytes; // Actual bytes used
__le64 block_group; // Preferred allocation group
__le32 nlink; // Hard link count
__le32 uid; // Owner user ID
__le32 gid; // Owner group ID
__le32 mode; // File mode/permissions
__le64 rdev; // Device ID (for device files)
__le64 flags; // Inode flags
__le64 sequence; // Sequence for fsync
struct btrfs_timespec atime; // Access time
struct btrfs_timespec ctime; // Change time
struct btrfs_timespec mtime; // Modification time
struct btrfs_timespec otime; // Creation time
};
Key Differences from Traditional Inodes:
No preallocated inode table: inodes are created on demand as B-tree items, so Btrfs never runs out of inodes while free space remains.
Variable size and location: inode items live wherever the B-tree places them, alongside the related directory and extent items.
No embedded block pointers: file data is described by separate EXTENT_DATA items keyed by file offset, rather than by block arrays inside the inode.
Inline Data:
For very small files, Btrfs can store file data inline within the EXTENT_DATA item itself, eliminating the need for a separate data extent allocation. The data bytes simply follow the item header inside the B-tree leaf, bounded by the max_inline mount option and the leaf size.
This optimization is particularly beneficial for workloads with many tiny files (configuration files, Git object stores, etc.).
The Inode Namespace:
Unlike ext4's global inode number space, Btrfs inodes are scoped to their subvolume. Each subvolume has its own inode number space starting from 256 (the first non-reserved inode). This means inode numbers are unique only within a subvolume: tools that treat st_ino as globally unique must also compare st_dev (each subvolume exposes its own device number), and hard links cannot span subvolume boundaries.
Unlike many Unix file systems, Btrfs stores the file creation time (otime/birth time) natively in the inode item. This allows applications to determine when a file was originally created, not just when its metadata was last changed.
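One way to read that birth time from user space is the statx() system call (available with glibc 2.28 or newer); a minimal sketch:

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>  /* statx(), STATX_BTIME */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    struct statx stx;

    /* Ask the kernel specifically for the birth (creation) time. */
    if (statx(AT_FDCWD, path, 0, STATX_BTIME, &stx) != 0) {
        perror("statx");
        return 1;
    }
    if (stx.stx_mask & STATX_BTIME)   /* the filesystem actually provided it */
        printf("%s created at %lld (epoch seconds)\n",
               path, (long long)stx.stx_btime.tv_sec);
    else
        printf("%s: birth time not available on this filesystem\n", path);
    return 0;
}
```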
Btrfs uses extents to map file data to disk locations—contiguous ranges of disk blocks that store file contents. This extent-based approach provides significant advantages over traditional block mapping.
EXTENT_DATA Items:
For each contiguous range of file data, Btrfs creates an EXTENT_DATA item in the file's inode tree:
Key: [inode, EXTENT_DATA, file_offset]
Value: {
generation,
ram_bytes, // Uncompressed size
compression, // Compression type (none/zlib/lzo/zstd)
encryption, // Encryption type (reserved)
type, // Regular, inline, or prealloc
disk_bytenr, // Disk location of extent
disk_num_bytes, // Size on disk (may differ due to compression)
offset, // Offset within the extent
num_bytes // Bytes used from this extent
}
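The resulting extent layout can be inspected from user space with the generic FIEMAP ioctl (the same interface tools like filefrag use); a minimal sketch that prints each extent and notes whether it is inline or shared:

```c
#include <fcntl.h>
#include <linux/fiemap.h>   /* struct fiemap, FIEMAP_EXTENT_* */
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MAX_EXTENTS 32      /* enough for a small demo file */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Allocate the fiemap header plus room for the extent records. */
    size_t sz = sizeof(struct fiemap) + MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    if (!fm) { perror("calloc"); return 1; }

    fm->fm_start = 0;
    fm->fm_length = ~0ULL;            /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush pending writes first */
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) != 0) { perror("FS_IOC_FIEMAP"); return 1; }

    for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
        struct fiemap_extent *e = &fm->fm_extents[i];
        printf("logical %llu -> physical %llu, %llu bytes%s%s\n",
               (unsigned long long)e->fe_logical,
               (unsigned long long)e->fe_physical,
               (unsigned long long)e->fe_length,
               (e->fe_flags & FIEMAP_EXTENT_DATA_INLINE) ? " [inline]" : "",
               (e->fe_flags & FIEMAP_EXTENT_SHARED) ? " [shared]" : "");
    }
    free(fm);
    close(fd);
    return 0;
}
```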
Extent Sharing and Reflinks:
One of Btrfs's most powerful features is extent sharing. Multiple files (or multiple regions of the same file) can reference the same physical extent; this is how reflink copies, snapshots, and deduplication avoid duplicating data. A minimal user-space example follows.
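From user space, such sharing is typically created with the FICLONE ioctl, the mechanism behind cp --reflink; a minimal sketch that clones one file's extents into another:

```c
#include <fcntl.h>
#include <linux/fs.h>   /* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]); return 1; }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* Ask the filesystem to make dst share src's extents (no data copied).
     * Fails with EOPNOTSUPP on filesystems without reflink support. */
    if (ioctl(dst, FICLONE, src) != 0) { perror("FICLONE"); return 1; }

    close(src);
    close(dst);
    return 0;
}
```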
To track sharing, Btrfs maintains back-references in the extent tree:
```
File A (original):
┌─────────────────────────────────────────────┐
│ EXTENT_DATA: offset 0,   extent X           │
│ EXTENT_DATA: offset 1MB, extent Y           │
│ EXTENT_DATA: offset 2MB, extent Z           │
└─────────────────────────────────────────────┘
               ↓           ↓           ↓
Physical Extents:          X           Y           Z
               ↑           ↑           ↑
File B (reflink copy of A):
┌─────────────────────────────────────────────┐
│ EXTENT_DATA: offset 0,   extent X  (shared) │
│ EXTENT_DATA: offset 1MB, extent Y  (shared) │
│ EXTENT_DATA: offset 2MB, extent Z  (shared) │
└─────────────────────────────────────────────┘

After modifying File B's first megabyte:
┌─────────────────────────────────────────────┐
│ EXTENT_DATA: offset 0,   extent X' (NEW)    │ ← COW created new extent
│ EXTENT_DATA: offset 1MB, extent Y  (shared) │
│ EXTENT_DATA: offset 2MB, extent Z  (shared) │
└─────────────────────────────────────────────┘

Extent Tree Back-References:
Extent X:  ref_count=1 (only File A now)
Extent X': ref_count=1 (only File B)
Extent Y:  ref_count=2 (File A + File B)
Extent Z:  ref_count=2 (File A + File B)
```

The Extent Tree:
The extent tree is Btrfs's allocation manager. It records every allocated extent, its reference count, and back-references identifying which tree, inode, and offset use it.
Back-references are crucial for safely relocating data during balance and device removal, for reporting which files are affected when scrub finds corruption, and for computing shared versus exclusive space in quota groups.
Free Space Tracking:
Historically, Btrfs tracked free space by walking the extent tree—an expensive operation that caused mount delays on large file systems. Modern Btrfs includes a dedicated Free Space Tree that explicitly tracks free regions, dramatically improving mount time and allocation speed.
Btrfs extents can be up to 128 MiB for data (configurable, default is typically 128 MiB). Larger files are split into multiple extents. The extent size affects fragmentation characteristics—larger extents mean better sequential performance but potentially more internal fragmentation.
Understanding Btrfs's position relative to other file systems helps clarify when it's the right choice.
Btrfs vs. ext4:
| Feature | Btrfs | ext4 |
|---|---|---|
| Architecture | COW B-trees | Journal + extent trees |
| Data checksumming | ✅ Yes, with verification | ❌ No |
| Metadata checksumming | ✅ Yes | ✅ Limited (journal) |
| Snapshots | ✅ Instant, space-efficient | ❌ No (requires LVM) |
| Multi-device support | ✅ Native, integrated | ❌ Requires LVM/MD |
| RAID | ✅ Built-in (0/1/10/5/6) | ❌ Requires MD/hardware |
| Online resize | ✅ Grow and shrink | ✅ Grow only |
| Compression | ✅ Transparent (zstd/lzo/zlib) | ❌ No |
| Deduplication | ✅ Offline/online tools | ❌ No |
| Max file size | 16 EiB | 16 TiB |
| Stability | Mature, some features stable | Very mature, stable |
| Performance (writes) | Variable, COW overhead | Generally faster |
| Performance (reads) | Comparable | Slightly better |
Btrfs vs. ZFS:
| Feature | Btrfs | ZFS |
|---|---|---|
| License | GPL v2 | CDDL (licensing issues) |
| Integration | Mainline Linux kernel | External module (OpenZFS) |
| Memory requirements | Moderate | High (ARC cache) |
| RAID-Z equivalent | RAID5/6 (less mature) | RAID-Z1/2/3 (very mature) |
| Deduplication | Offline tools | Inline (memory-intensive) |
| Send/Receive | ✅ Supported | ✅ Supported |
| Quotas | ✅ qgroups | ✅ Dataset quotas |
| Encryption | ❌ Limited/external | ✅ Native |
| Special devices | ❌ No L2ARC/SLOG | ✅ L2ARC, SLOG |
| Stability | Improving | Very mature |
When to Choose Btrfs:
Btrfs is a strong fit when you want snapshots and rollback, checksummed data with self-healing on redundant profiles, transparent compression, send/receive replication, or flexible multi-device pooling without separate volume-management layers. Traditional file systems remain the pragmatic choice when raw write performance on heavy random-overwrite workloads matters most, or when parity RAID (RAID5/6) is a hard requirement (see the warning below).
Btrfs RAID5 and RAID6 profiles have had write-hole issues and are not recommended for production use as of kernel 6.x. Use RAID1 (mirroring) or RAID10 for redundancy, or rely on external MD RAID arrays.
We've established the foundational understanding of Btrfs architecture. To consolidate the essential concepts: Btrfs is built as a forest of copy-on-write B-trees located through a single root tree; a unified [objectid, type, offset] key addresses every item; a two-level chunk mapping separates the logical address space from physical devices; inodes are variable-sized tree items rather than entries in a fixed table; and file data is described by extents that can be shared and are tracked through back-references.
What's Next:
With the B-tree architecture understood, we'll explore Btrfs's Copy-on-Write (COW) semantics in depth—the mechanism that makes atomic updates, snapshots, and data integrity guarantees possible.
You now understand the architectural foundations of Btrfs: its B-tree design, tree hierarchy, on-disk layout, inode structure, and extent management. This foundation is essential for understanding the advanced features we'll explore in subsequent pages.