In the landscape of modern file systems, Btrfs (B-tree File System, pronounced "Butter FS" or "Better FS") represents one of the most ambitious attempts to reimagine how Linux manages storage. Born from the recognition that traditional file systems like ext4 were reaching their architectural limits, Btrfs was designed from the ground up to address the storage challenges of the 21st century.
Btrfs isn't merely an incremental improvement over its predecessors—it's a fundamental rethinking of file system architecture. Where ext4 added features like extents and journaling to the venerable ext2 foundation, Btrfs started with a clean slate, incorporating lessons learned from decades of file system research and the revolutionary ideas pioneered by Sun's ZFS.
By the end of this page, you will understand the architectural foundations of Btrfs: why B-trees were chosen as the core data structure, how Btrfs organizes data on disk, its key design principles, and how it compares to traditional file systems. You'll gain the foundational knowledge necessary to understand Btrfs's advanced features like COW, snapshots, and self-healing.
Btrfs development began at Oracle Corporation in 2007, initiated by Chris Mason, a veteran file system developer who had previously worked on ReiserFS. The project emerged from a recognition that existing Linux file systems, despite their maturity and reliability, could not meet the evolving demands of modern storage:
The Limitations of Traditional File Systems:
Static volume management: Traditional file systems required separate volume managers (like LVM) to pool storage, add devices, or resize partitions—operations that were complex and sometimes dangerous.
No native checksumming: Data corruption could go undetected. Silent bit rot, controller errors, and firmware bugs could corrupt data without the file system's knowledge.
Limited snapshot capabilities: While LVM offered snapshots, they were inefficient (requiring copy-on-write at the block level) and couldn't leverage file system semantics.
Fixed metadata structures: File systems like ext4 used fixed-size metadata structures that couldn't adapt to varying workload characteristics.
Single-device thinking: Most file systems assumed a single underlying block device, requiring external RAID controllers or LVM for multi-device configurations.
Btrfs was heavily influenced by Sun Microsystems' ZFS, which demonstrated that a file system could integrate volume management, checksumming, snapshots, and self-healing into a unified architecture. However, ZFS's CDDL license was incompatible with the Linux GPL, necessitating a new implementation. Btrfs aimed to bring ZFS-like capabilities to Linux with a GPL-compatible license.
The Design Goals:
From its inception, Btrfs was designed with several ambitious goals:
Copy-on-Write Architecture: Never overwrite data in place. Instead, write modifications to new locations, enabling atomic updates and cheap snapshots.
Integrated Volume Management: Pool multiple devices, add/remove storage dynamically, and handle RAID without external tools.
Data and Metadata Checksumming: Every block of data and metadata carries a checksum, enabling detection (and with redundancy, correction) of corruption.
Efficient Snapshots and Clones: Create instant point-in-time copies without duplicating data.
Online Operations: Grow, shrink, defragment, scrub, and check the file system while it remains mounted and in use.
Scalability: Support exabyte-scale storage with efficient handling of billions of files.
These goals required a completely new architecture—one built around a data structure capable of efficiently handling all these operations: the B-tree.
The choice of B-trees as Btrfs's fundamental data structure wasn't arbitrary—it was a carefully considered engineering decision based on decades of database and file system research.
Understanding B-trees:
A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. Unlike binary trees where each node has at most two children, B-tree nodes can have many children—making them particularly well-suited for storage systems that read and write data in large blocks.
Key B-tree Properties:
High fan-out: each node holds many sorted keys, so the tree stays shallow even for very large data sets.
Self-balancing: all leaves sit at the same depth; inserts and deletes split or merge nodes to keep it that way.
Logarithmic operations: search, insertion, and deletion touch only O(log n) nodes.
Block-friendly layout: wide nodes map naturally onto storage devices, which read and write in large blocks anyway.
The B-tree Variant in Btrfs:
Btrfs uses a variant called B+ trees with copy-on-write semantics. In a traditional B+ tree, interior nodes hold only keys and pointers, all data lives in the leaves, and the leaves are chained together for fast sequential scans.
Btrfs extends this with copy-on-write (COW): nodes are never modified in place. A changed node is written to a new location, its parent is updated (also via COW) to point at the new copy, and the change propagates up to the root. Leaf-to-leaf links are dropped, because they would force neighboring leaves to be rewritten every time one of them moves.
This COW approach enables Btrfs's snapshot functionality and ensures atomicity—either a complete, consistent update is visible, or none of it is.
Consider a Btrfs tree with 4KB nodes, each holding approximately 200 items. Three levels can address 200³ = 8 million items; four levels reach 200⁴ = 1.6 billion. This means any lookup requires at most 4 node reads, regardless of file system size: lookups cost O(log n), and the large fan-out keeps that logarithm small.
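To sanity-check that arithmetic, here is a tiny C program; the 200-items-per-node figure is just the illustrative assumption from the example above, not a fixed Btrfs constant:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative assumption: each B-tree node holds ~200 items. */
#define ITEMS_PER_NODE 200ULL

int main(void)
{
    uint64_t reachable = 1;

    for (int level = 1; level <= 4; level++) {
        reachable *= ITEMS_PER_NODE;   /* fan-out multiplies at each level */
        printf("%d level(s): ~%llu items reachable\n",
               level, (unsigned long long)reachable);
    }
    /* Prints 200, 40000, 8000000, 1600000000: even a four-level tree
       needs at most four node reads per lookup. */
    return 0;
}
```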
Unlike traditional file systems that use separate structures for different purposes (inodes, block bitmaps, extent trees), Btrfs unifies almost everything into multiple B-trees, all sharing the same fundamental structure but storing different types of items.
The Forest of Trees:
Btrfs maintains several distinct trees, each serving a specific purpose:
| Tree | Purpose | Key Contents |
|---|---|---|
| Root Tree | The tree of trees—locates all other trees | Root items pointing to other tree roots |
| FS Tree (Subvolume Tree) | Contains file system objects for a subvolume | Inodes, directory items, extent data references |
| Extent Tree | Tracks allocation of disk space | Extent items, back-references |
| Chunk Tree | Maps logical addresses to physical devices | Chunk items, device extent items |
| Device Tree | Per-device allocation information | Device items and extent allocation |
| Checksum Tree | Stores checksums for data blocks | Checksum items keyed by extent offset |
| UUID Tree | Maps subvolume UUIDs to tree IDs | UUID items for subvolume lookup |
| Free Space Tree | Efficiently tracks free space (newer feature) | Free space info and bitmap items |
The Unified Key Structure:
All Btrfs items are located using a 136-bit key consisting of three components:
[ objectid: 64 bits | type: 8 bits | offset: 64 bits ]
This unified key format means all trees use the same lookup algorithms, simplifying code and enabling optimizations to benefit all operations.
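For illustration, the key can be written as a packed C structure. The sketch below follows the field layout of the kernel's struct btrfs_disk_key, but it is a user-space approximation rather than a copy of the kernel header:

```c
#include <stdint.h>
#include <stdio.h>

/* On disk the integers are little-endian and the struct is packed,
 * so the key occupies exactly 17 bytes (136 bits). */
struct btrfs_disk_key {
    uint64_t objectid;  /* owning object, e.g. inode number or tree id     */
    uint8_t  type;      /* item type, e.g. INODE_ITEM or EXTENT_DATA       */
    uint64_t offset;    /* meaning depends on type, e.g. a file offset     */
} __attribute__((packed));

int main(void)
{
    printf("key size: %zu bytes\n", sizeof(struct btrfs_disk_key)); /* 17 */
    return 0;
}
```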
Example Key Types:
INODE_ITEM: the core metadata for a file or directory (ownership, size, timestamps).
DIR_ITEM / DIR_INDEX: directory entries mapping names to inodes.
EXTENT_DATA: maps a range of a file to an on-disk extent, or holds inline data.
ROOT_ITEM: describes the root of another tree, such as a subvolume.
CHUNK_ITEM: a logical-to-physical chunk mapping in the chunk tree.
Btrfs stores variable-sized items in its tree nodes, not fixed-size structures like ext4 inodes. This allows items to grow (e.g., inline data for tiny files) or shrink as needed, improving space efficiency and flexibility.
Understanding how Btrfs organizes data on physical storage is essential for comprehending its capabilities and performance characteristics.
The Superblock:
Like all file systems, Btrfs begins with a superblock—the fixed location where the file system's root information is stored. Btrfs maintains multiple superblock copies for redundancy: the primary at 64 KiB from the start of each device, with mirror copies at 64 MiB and 256 GiB on devices large enough to hold them.
The superblock contains pointers to the roots of the root tree and chunk tree, the current generation number, the file system UUID, feature flags, and a small embedded array of system chunk mappings used to bootstrap address translation at mount time.
Chunks and Block Groups:
Btrfs abstracts physical devices through a two-level address mapping:
Logical Addresses → Chunks → Physical Addresses
Logical Address Space: A single unified address space spanning potentially multiple devices. The file system internally uses logical addresses everywhere.
Chunks: Contiguous regions of logical address space. Each chunk maps to one or more physical device locations.
Block Groups: Logical groupings of chunks with the same allocation purpose (data, metadata, system) and redundancy profile (single, DUP, RAID0, RAID1, etc.).
This abstraction enables Btrfs to add or remove devices online, place data and metadata with different redundancy profiles, convert between profiles on the fly, and relocate chunks during a balance, all without changing the logical addresses stored in the trees. A simplified lookup sketch and an on-disk layout diagram follow.
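Here is a minimal illustration of the translation idea, using simplified, hypothetical structures (real chunks can carry multiple stripes and richer RAID profiles than this single-stripe sketch):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified view of one chunk-tree entry: a contiguous run
 * of logical address space mapped onto one device extent. */
struct chunk_map {
    uint64_t logical;     /* start of chunk in the logical address space */
    uint64_t length;      /* chunk size in bytes                         */
    int      devid;       /* backing device                              */
    uint64_t dev_offset;  /* physical offset of the stripe on that device */
};

/* Translate a logical address to (devid, physical offset). */
static int map_logical(const struct chunk_map *chunks, size_t n,
                       uint64_t logical, int *devid, uint64_t *physical)
{
    for (size_t i = 0; i < n; i++) {
        if (logical >= chunks[i].logical &&
            logical <  chunks[i].logical + chunks[i].length) {
            *devid = chunks[i].devid;
            *physical = chunks[i].dev_offset + (logical - chunks[i].logical);
            return 0;
        }
    }
    return -1;  /* not mapped */
}

int main(void)
{
    struct chunk_map chunks[] = {
        { .logical = 0,          .length = 1ULL << 30, .devid = 1, .dev_offset = 1ULL << 20 },
        { .logical = 1ULL << 30, .length = 1ULL << 30, .devid = 2, .dev_offset = 1ULL << 20 },
    };
    int dev;
    uint64_t phys;

    if (map_logical(chunks, 2, (1ULL << 30) + 4096, &dev, &phys) == 0)
        printf("devid %d, physical offset %llu\n", dev, (unsigned long long)phys);
    return 0;
}
```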
```
Logical Address Space (Unified View):
┌────────────────────────────────────────────────────────┐
│ 0GB        32GB       64GB       96GB       128GB      │
│ │          │          │          │          │          │
│ └──Chunk 1─┴──Chunk 2─┴──Chunk 3─┴──Chunk 4─┘          │
│   (Data)     (Meta)     (Data)     (Data)              │
└────────────────────────────────────────────────────────┘
                        │
                        ▼  Chunk Mapping
Physical Devices:
┌─────────────┐   ┌─────────────┐
│  /dev/sda   │   │  /dev/sdb   │
├─────────────┤   ├─────────────┤
│  Chunk 1    │   │  Chunk 2-m1 │ ← RAID1 mirror
│  Chunk 2-m0 │   │  Chunk 3-p0 │ ← RAID0 stripe
│  Chunk 3-p1 │   │  Chunk 4    │
└─────────────┘   └─────────────┘

Keys:
- Chunk 2: Metadata with RAID1 (mirrored across both devices)
- Chunk 3: Data with RAID0 (striped across both devices)
```

Node Size and Sector Size:
Btrfs uses a configurable node size (default: 16 KiB) for its tree nodes and a sector size (typically 4 KiB) for data allocation. The node size determines how many items fit in each node, and with it the tree's fan-out and depth, as well as how much metadata must be rewritten when a node is copied on write.
The separation of node size from sector size allows metadata to use larger, more efficient blocks while data can be allocated at sector granularity.
The node size is set when the file system is created (mkfs.btrfs -n <size>) and cannot be changed afterward. For most workloads, the default 16 KiB is appropriate, but metadata-heavy workloads (many small files) may benefit from larger nodes.
Btrfs reimagines the traditional Unix inode concept, storing inode data as variable-sized items within its B-trees rather than fixed structures in a dedicated inode table.
The INODE_ITEM:
Each file or directory has an inode item containing:
struct btrfs_inode_item {
__le64 generation; // Creation transaction
__le64 transid; // Last modification transaction
__le64 size; // File size in bytes
__le64 nbytes; // Actual bytes used
__le64 block_group; // Preferred allocation group
__le32 nlink; // Hard link count
__le32 uid; // Owner user ID
__le32 gid; // Owner group ID
__le32 mode; // File mode/permissions
__le64 rdev; // Device ID (for device files)
__le64 flags; // Inode flags
__le64 sequence; // Sequence for fsync
struct btrfs_timespec atime; // Access time
struct btrfs_timespec ctime; // Change time
struct btrfs_timespec mtime; // Modification time
struct btrfs_timespec otime; // Creation time
};
Key Differences from Traditional Inodes:
No preallocated inode table: inodes are created on demand as B-tree items, so Btrfs never runs out of inodes while free space remains.
Variable size and location: inode items live wherever the B-tree places them, alongside the related directory and extent items.
No embedded block pointers: file data is described by separate EXTENT_DATA items keyed by file offset, rather than by block arrays inside the inode.
Inline Data:
For very small files, Btrfs can store file data inline within the EXTENT_DATA item itself, eliminating the need for a separate data extent allocation. The data bytes simply follow the item header inside the B-tree leaf, bounded by the max_inline mount option and the leaf size.
This optimization is particularly beneficial for workloads with many tiny files (configuration files, Git object stores, etc.).
The Inode Namespace:
Unlike ext4's global inode number space, Btrfs inodes are scoped to their subvolume. Each subvolume has its own inode number space starting from 256 (the first non-reserved inode). This means inode numbers are unique only within a subvolume: tools that treat st_ino as globally unique must also compare st_dev (each subvolume exposes its own device number), and hard links cannot span subvolume boundaries.
Unlike many Unix file systems, Btrfs stores the file creation time (otime/birth time) natively in the inode item. This allows applications to determine when a file was originally created, not just when its metadata was last changed.
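One way to read that birth time from user space is the statx() system call (available with glibc 2.28 or newer); a minimal sketch:

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>  /* statx(), STATX_BTIME */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    struct statx stx;

    /* Ask the kernel specifically for the birth (creation) time. */
    if (statx(AT_FDCWD, path, 0, STATX_BTIME, &stx) != 0) {
        perror("statx");
        return 1;
    }
    if (stx.stx_mask & STATX_BTIME)   /* the filesystem actually provided it */
        printf("%s created at %lld (epoch seconds)\n",
               path, (long long)stx.stx_btime.tv_sec);
    else
        printf("%s: birth time not available on this filesystem\n", path);
    return 0;
}
```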
Btrfs uses extents to map file data to disk locations—contiguous ranges of disk blocks that store file contents. This extent-based approach provides significant advantages over traditional block mapping.
EXTENT_DATA Items:
For each contiguous range of file data, Btrfs creates an EXTENT_DATA item in the file's inode tree:
Key: [inode, EXTENT_DATA, file_offset]
Value: {
generation,
ram_bytes, // Uncompressed size
compression, // Compression type (none/zlib/lzo/zstd)
encryption, // Encryption type (reserved)
type, // Regular, inline, or prealloc
disk_bytenr, // Disk location of extent
disk_num_bytes, // Size on disk (may differ due to compression)
offset, // Offset within the extent
num_bytes // Bytes used from this extent
}
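The resulting extent layout can be inspected from user space with the generic FIEMAP ioctl (the same interface tools like filefrag use); a minimal sketch that prints each extent and notes whether it is inline or shared:

```c
#include <fcntl.h>
#include <linux/fiemap.h>   /* struct fiemap, FIEMAP_EXTENT_* */
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MAX_EXTENTS 32      /* enough for a small demo file */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Allocate the fiemap header plus room for the extent records. */
    size_t sz = sizeof(struct fiemap) + MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    if (!fm) { perror("calloc"); return 1; }

    fm->fm_start = 0;
    fm->fm_length = ~0ULL;            /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush pending writes first */
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) != 0) { perror("FS_IOC_FIEMAP"); return 1; }

    for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
        struct fiemap_extent *e = &fm->fm_extents[i];
        printf("logical %llu -> physical %llu, %llu bytes%s%s\n",
               (unsigned long long)e->fe_logical,
               (unsigned long long)e->fe_physical,
               (unsigned long long)e->fe_length,
               (e->fe_flags & FIEMAP_EXTENT_DATA_INLINE) ? " [inline]" : "",
               (e->fe_flags & FIEMAP_EXTENT_SHARED) ? " [shared]" : "");
    }
    free(fm);
    close(fd);
    return 0;
}
```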
Extent Sharing and Reflinks:
One of Btrfs's most powerful features is extent sharing. Multiple files (or multiple regions of the same file) can reference the same physical extent; this is how reflink copies, snapshots, and deduplication avoid duplicating data. A minimal user-space example follows.
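From user space, such sharing is typically created with the FICLONE ioctl, the mechanism behind cp --reflink; a minimal sketch that clones one file's extents into another:

```c
#include <fcntl.h>
#include <linux/fs.h>   /* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]); return 1; }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* Ask the filesystem to make dst share src's extents (no data copied).
     * Fails with EOPNOTSUPP on filesystems without reflink support. */
    if (ioctl(dst, FICLONE, src) != 0) { perror("FICLONE"); return 1; }

    close(src);
    close(dst);
    return 0;
}
```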
To track sharing, Btrfs maintains back-references in the extent tree:
```
File A (original):
┌─────────────────────────────────────────────┐
│ EXTENT_DATA: offset 0,   extent X           │
│ EXTENT_DATA: offset 1MB, extent Y           │
│ EXTENT_DATA: offset 2MB, extent Z           │
└─────────────────────────────────────────────┘
               ↓           ↓           ↓
Physical Extents:          X           Y           Z
               ↑           ↑           ↑
File B (reflink copy of A):
┌─────────────────────────────────────────────┐
│ EXTENT_DATA: offset 0,   extent X  (shared) │
│ EXTENT_DATA: offset 1MB, extent Y  (shared) │
│ EXTENT_DATA: offset 2MB, extent Z  (shared) │
└─────────────────────────────────────────────┘

After modifying File B's first megabyte:
┌─────────────────────────────────────────────┐
│ EXTENT_DATA: offset 0,   extent X' (NEW)    │ ← COW created new extent
│ EXTENT_DATA: offset 1MB, extent Y  (shared) │
│ EXTENT_DATA: offset 2MB, extent Z  (shared) │
└─────────────────────────────────────────────┘

Extent Tree Back-References:
Extent X:  ref_count=1 (only File A now)
Extent X': ref_count=1 (only File B)
Extent Y:  ref_count=2 (File A + File B)
Extent Z:  ref_count=2 (File A + File B)
```

The Extent Tree:
The extent tree is Btrfs's allocation manager. It records every allocated extent, its reference count, and back-references identifying which tree, inode, and offset use it.
Back-references are crucial for safely relocating data during balance and device removal, for reporting which files are affected when scrub finds corruption, and for computing shared versus exclusive space in quota groups.
Free Space Tracking:
Historically, Btrfs tracked free space by walking the extent tree—an expensive operation that caused mount delays on large file systems. Modern Btrfs includes a dedicated Free Space Tree that explicitly tracks free regions, dramatically improving mount time and allocation speed.
Btrfs extents can be up to 128 MiB for data (configurable, default is typically 128 MiB). Larger files are split into multiple extents. The extent size affects fragmentation characteristics—larger extents mean better sequential performance but potentially more internal fragmentation.
Understanding Btrfs's position relative to other file systems helps clarify when it's the right choice.
Btrfs vs. ext4:
| Feature | Btrfs | ext4 |
|---|---|---|
| Architecture | COW B-trees | Journal + extent trees |
| Data checksumming | ✅ Yes, with verification | ❌ No |
| Metadata checksumming | ✅ Yes | ✅ Limited (journal) |
| Snapshots | ✅ Instant, space-efficient | ❌ No (requires LVM) |
| Multi-device support | ✅ Native, integrated | ❌ Requires LVM/MD |
| RAID | ✅ Built-in (0/1/10/5/6) | ❌ Requires MD/hardware |
| Online resize | ✅ Grow and shrink | ✅ Grow only |
| Compression | ✅ Transparent (zstd/lzo/zlib) | ❌ No |
| Deduplication | ✅ Offline/online tools | ❌ No |
| Max file size | 16 EiB | 16 TiB |
| Stability | Mature, some features stable | Very mature, stable |
| Performance (writes) | Variable, COW overhead | Generally faster |
| Performance (reads) | Comparable | Slightly better |
Btrfs vs. ZFS:
| Feature | Btrfs | ZFS |
|---|---|---|
| License | GPL v2 | CDDL (licensing issues) |
| Integration | Mainline Linux kernel | External module (OpenZFS) |
| Memory requirements | Moderate | High (ARC cache) |
| RAID-Z equivalent | RAID5/6 (less mature) | RAID-Z1/2/3 (very mature) |
| Deduplication | Offline tools | Inline (memory-intensive) |
| Send/Receive | ✅ Supported | ✅ Supported |
| Quotas | ✅ qgroups | ✅ Dataset quotas |
| Encryption | ❌ Limited/external | ✅ Native |
| Special devices | ❌ No L2ARC/SLOG | ✅ L2ARC, SLOG |
| Stability | Improving | Very mature |
When to Choose Btrfs:
Btrfs is a strong fit when you want snapshots and rollback, checksummed data with self-healing on redundant profiles, transparent compression, send/receive replication, or flexible multi-device pooling without separate volume-management layers. Traditional file systems remain the pragmatic choice when raw write performance on heavy random-overwrite workloads matters most, or when parity RAID (RAID5/6) is a hard requirement (see the warning below).
Btrfs RAID5 and RAID6 profiles have had write-hole issues and are not recommended for production use as of kernel 6.x. Use RAID1 (mirroring) or RAID10 for redundancy, or rely on external MD RAID arrays.
We've established the foundational understanding of Btrfs architecture. To consolidate the essential concepts: Btrfs is built as a forest of copy-on-write B-trees located through a single root tree; a unified [objectid, type, offset] key addresses every item; a two-level chunk mapping separates the logical address space from physical devices; inodes are variable-sized tree items rather than entries in a fixed table; and file data is described by extents that can be shared and are tracked through back-references.
What's Next:
With the B-tree architecture understood, we'll explore Btrfs's Copy-on-Write (COW) semantics in depth—the mechanism that makes atomic updates, snapshots, and data integrity guarantees possible.
You now understand the architectural foundations of Btrfs: its B-tree design, tree hierarchy, on-disk layout, inode structure, and extent management. This foundation is essential for understanding the advanced features we'll explore in subsequent pages.