The Copy-on-Write (COW) principle is the philosophical and technical cornerstone of Btrfs. Unlike traditional file systems that modify data in place—risking corruption if a write is interrupted—Btrfs never overwrites existing data. Instead, every modification creates a new copy at a different location, and only after the write completes successfully does the file system atomically update its metadata to point to the new data.
This seemingly simple concept has profound implications. COW enables instant snapshots without duplicating data, guarantees that file system structures are always consistent even after crashes, and provides the foundation for features that would be impossible or impractical in traditional file systems.
By the end of this page, you will understand COW at a deep level: how Btrfs implements COW for both data and metadata, how transaction semantics work, the role of COW in crash consistency, the mechanics of reference counting and extent sharing, and the performance tradeoffs involved.
To appreciate COW, we must first understand what traditional file systems do—and why it's problematic.
The Problem with In-Place Updates:
Consider a traditional file system like ext4 (without journaling, for simplicity). When you modify an existing file:

1. The file system locates the disk blocks holding the data.
2. The blocks are read into memory.
3. The data is modified in memory.
4. The modified blocks are written back to their original locations on disk.
This "update-in-place" approach creates a critical vulnerability: if the system crashes during step 4, the data on disk is in an inconsistent state—partially old, partially new, potentially corrupted.
Traditional file systems mitigate this with journaling: writing changes first to a dedicated log, then applying them to their final locations. If a crash occurs, the journal is replayed to restore consistency. But journaling has costs:
The COW Alternative:
Btrfs takes a fundamentally different approach:

1. Write the modified data to a newly allocated location, leaving the original untouched.
2. Update the metadata (itself via COW) to point to the new location.
3. Atomically publish the new state by updating the superblock.
4. Mark the old blocks as free once nothing references them.
The key insight: the old data remains intact until the new data is fully written and referenced. There's never a moment when the on-disk structure is inconsistent—either the old version is complete and valid, or the new version is complete and valid.
```
Traditional In-Place Update:
════════════════════════════════════════════════════════
Before:  [Block A: old data]
             │
             ▼ Write new data (overwrite)
During:  [Block A: partial old/new]  ← DANGER ZONE
             │
             ▼ Complete write
After:   [Block A: new data]

If crash during "During": CORRUPTION

════════════════════════════════════════════════════════
Copy-on-Write:
════════════════════════════════════════════════════════
Before:  Pointer → [Block A: old data]

             Allocate new block

During:  Pointer → [Block A: old data]  ← Still valid!
                   [Block B: writing new data...]

             Complete write, update pointer atomically

After:   Pointer → [Block B: new data]
                   [Block A: old data]  ← Marked free

If crash "During": Old data still valid, new data ignored
```

The secret to COW's crash safety is that changing a metadata pointer from A to B is an atomic operation at the disk level. Either the pointer still points to A (the old version) or it points to B (the new version). There is no intermediate state in which the pointer is invalid.
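The write-then-swap pattern above can be sketched in a few lines. This is a toy model, not Btrfs internals: the `Disk` class, `cow_write` function, and the single-pointer "metadata" are all illustrative assumptions.

```python
# Toy model of a COW update: write new data elsewhere, then atomically
# swap the pointer. The old block stays valid until the swap completes.

class Disk:
    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.pointer = None   # "metadata" pointer to the live block

def cow_write(disk, new_data):
    # 1. Write new data to a freshly allocated block; old block untouched.
    new_id = max(disk.blocks, default=0) + 1
    disk.blocks[new_id] = new_data
    # 2. Atomically switch the pointer. A crash before this line leaves
    #    the old version intact; after it, the new version is live.
    old_id, disk.pointer = disk.pointer, new_id
    # 3. The old block is now unreferenced and reclaimable.
    if old_id is not None:
        del disk.blocks[old_id]

disk = Disk()
cow_write(disk, b"v1")
cow_write(disk, b"v2")
print(disk.blocks[disk.pointer])  # b'v2'
```

The single assignment standing in for "update the pointer" is the whole trick: there is no state in which the pointer refers to a half-written block.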
Btrfs applies COW not just to file data but to all metadata—every tree node, every inode item, every extent reference. This is what makes Btrfs trees truly "copy-on-write trees."
The Cascade Effect:
When you modify a leaf node in a Btrfs tree:

1. A modified copy of the leaf is written to a new location.
2. The parent must now point to the new leaf, so the parent is itself COW'd to a new location.
3. This cascades upward until a new root node is written.
This is called "path copying"—every modification creates new copies along the entire path from leaf to root.
```
Original Tree:                  After Modifying Leaf C:

       [Root]                    [Root]───────→[Root']
       /    \                    /    \              \
 [Node1]  [Node2]          [Node1]  [Node2]      [Node2']
  /  \     /  \             /  \     /  \              \
[A]  [B] [C]  [D]         [A]  [B] [C]  [D]          [C']
                                    ↑                  │
                          Old leaf, still valid     New leaf

Transaction Commit:
1. Write C' (new leaf with modifications)
2. Write Node2' (parent pointing to C')
3. Write Root' (new root pointing to Node2')
4. Atomically update superblock to point to Root'
5. Old path (Root→Node2→C) becomes reclaimable
```

The Cost of Path Copying:
Path copying means modifying a single item requires writing O(log n) tree nodes—the height of the tree. For a tree with height 4, that's 4 node writes instead of 1.
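Path copying is easy to demonstrate with a small functional tree. The sketch below is illustrative (the `Node` class and `cow_path` helper are invented for this example, not Btrfs code): only the nodes on the modified path are copied, and untouched subtrees are shared between the old and new versions.

```python
# Sketch of "path copying": modifying one leaf produces new copies of
# every node on the path to the root, sharing all untouched subtrees.

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def cow_path(root, path, new_value):
    """Return a new root; `path` is a list of child indexes to the leaf.
    Only O(height) nodes are copied."""
    if not path:
        return Node(new_value)                      # new leaf
    new_children = list(root.children)              # shallow copy of child list
    i = path[0]
    new_children[i] = cow_path(root.children[i], path[1:], new_value)
    return Node(root.value, new_children)           # new interior node

leaf_c = Node("C")
node2 = Node("N2", [leaf_c, Node("D")])
root = Node("R", [Node("N1"), node2])

new_root = cow_path(root, [1, 0], "C'")
print(new_root.children[1].children[0].value)    # C'
print(root.children[1].children[0].value)        # C  (old tree intact)
print(new_root.children[0] is root.children[0])  # True: Node1 is shared
```

Note how the old root still describes the complete pre-modification tree, which is exactly why COW snapshots are free: keeping an old root alive keeps the old version alive.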
However, Btrfs mitigates this cost through delayed allocation and batching:

- Modifications accumulate in memory, so many changes along the same path are merged into a single set of node writes at commit time.
- One transaction commit amortizes the O(log n) write cost across hundreds or thousands of individual operations.
The Benefits of Metadata COW:

- Any tree can be snapshotted instantly by preserving a copy of its root.
- Metadata is never in a half-written state on disk, even across crashes.
- The old tree version stays intact and readable until the new version is fully committed.
The superblock is the one structure that IS updated in place (at fixed disk locations). However, Btrfs keeps multiple superblock copies and uses generation numbers to identify the most recent valid superblock, providing reliability even for this special case.
By default, Btrfs applies COW to file data as well as metadata. When you overwrite existing file contents:

1. A new extent is allocated and the new data is written there.
2. The file's extent mapping is updated (via metadata COW) to reference the new extent.
3. The old extent's reference count is decremented; it is freed when the count reaches zero.
Data COW Implications:

- Every overwrite relocates data, so files that are modified in place fragment over time.
- Checksums can be written together with the new data, so data and checksum never disagree.
- Snapshots share data extents at zero cost until files actually diverge.
Disabling Data COW:
For certain workloads, data COW can be problematic. Database files and virtual machine images are frequently overwritten in small chunks, causing severe fragmentation and space amplification.
Btrfs allows disabling data COW per-file using the nodatacow attribute:
```bash
# Set nodatacow on a directory; files created inside inherit the attribute.
# (On an existing file, +C only takes reliable effect while the file is empty.)
chattr +C /path/to/vm-images/

# Check COW status
lsattr /path/to/file
```
Important: When nodatacow is set:

- Data checksumming is disabled for the file (metadata is still checksummed).
- Compression is disabled for the file.
- The attribute should be applied to empty files or, better, to the parent directory before files are created.
- Taking a snapshot still forces a one-time COW for writes to the extents the snapshot shares.
The NOCOW tradeoff:
| Aspect | COW (Default) | NOCOW (+C attribute) |
|---|---|---|
| Write behavior | Always new location | In-place overwrite |
| Checksumming | ✅ Full checksums | ❌ Disabled for data |
| Snapshot sharing | ✅ Shares extents | ❌ Creates copies |
| Fragmentation | High for overwrites | Low/none |
| Write amplification | Higher (COW + metadata) | Lower |
| Ideal for | Documents, source code, logs | Databases, VM images |
| Data integrity | Protected | Risk of partial writes |
On Btrfs RAID5/6 (which is already not recommended), NOCOW files are particularly dangerous because they can experience the write hole problem—a crash during a partial stripe write can leave parity inconsistent with data.
Btrfs organizes changes into transactions—groups of modifications that are committed atomically to disk. This transaction model is the mechanism by which COW's theoretical benefits become practical reality.
Transaction Lifecycle:
```
Transaction States:
═══════════════════════════════════════════════════════════════

 ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
 │   RUNNING    │────▶│  COMMITTING  │────▶│  COMMITTED   │
 │              │     │              │     │              │
 │ - Accept new │     │ - No new ops │     │ - On disk    │
 │   operations │     │ - Write data │     │ - Visible to │
 │ - In-memory  │     │ - Write meta │     │   new reads  │
 │   changes    │     │ - Update SB  │     │              │
 └──────────────┘     └──────────────┘     └──────────────┘
        │                                         │
        │ (Crash during RUNNING)                  │
        ▼                                         ▼
 Transaction lost,                      Transaction survives,
 file system rolls back                 fully visible after
 to previous commit point               recovery

Generation Numbers:
═══════════════════════════════════════════════════════════════
Gen 99 (committed) ──▶ Gen 100 (running) ──▶ Gen 100 (committed)
         │                     │
         │                   Crash
         ▼                     ▼
   Recovery sees Gen 99 as last valid state
```

How Commits Work:
Running Phase: A transaction is open, accepting file operations. Modifications are accumulated in memory within COW'd tree nodes.
Commit Trigger: Commits are triggered by:

- The periodic commit interval (30 seconds by default)
- sync() or fsync() calls
- Memory pressure from accumulated dirty metadata
- Operations that require durability, such as subvolume and snapshot creation

Commit Sequence:

1. Stop admitting new operations into the transaction.
2. Write all dirty data extents to their newly allocated locations.
3. Write all COW'd metadata nodes, from leaves up to the tree roots.
4. Issue a flush/barrier so everything written so far is durable on media.
5. Write the superblock pointing at the new roots, stamped with the new generation number.
Generation Numbers:
Each transaction has a generation number—a monotonically increasing counter. Every tree node, extent, and metadata item records which generation created or modified it.
Generation numbers enable:

- Selecting the newest valid superblock among its mirrors after a crash
- Detecting stale or out-of-date metadata copies during scrub and repair
- Incremental send/receive, which locates everything changed since a given generation
The commit interval affects both safety and performance. Shorter intervals (e.g., 5 seconds) mean less data loss on crash but more I/O overhead. Longer intervals (e.g., 60 seconds) improve performance by batching more writes but risk more data loss. The default of 30 seconds is a reasonable balance. Tune with 'commit=N' mount option.
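Superblock selection by generation number can be sketched as follows. This is a simplified model with invented field names (`make_super`, `pick_superblock`, a CRC32 stand-in for the real checksum), not the actual on-disk format:

```python
# Sketch of post-crash superblock recovery: among all mirrors, choose the
# one with the highest generation whose checksum still verifies.
import zlib

def make_super(generation, root):
    payload = f"{generation}:{root}".encode()
    return {"generation": generation, "root": root,
            "csum": zlib.crc32(payload)}

def pick_superblock(mirrors):
    valid = [sb for sb in mirrors
             if sb["csum"] == zlib.crc32(
                 f"{sb['generation']}:{sb['root']}".encode())]
    # Highest valid generation wins; torn writes simply lose the race.
    return max(valid, key=lambda sb: sb["generation"])

mirrors = [make_super(99, "rootA"), make_super(100, "rootB")]
mirrors[1]["csum"] ^= 1   # simulate a torn write of the gen-100 mirror
print(pick_superblock(mirrors)["generation"])  # 99: last fully valid state
```

A crash mid-way through writing the newest superblock therefore degrades gracefully: recovery falls back to the previous committed generation rather than failing.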
COW creates a fundamental challenge: when multiple snapshots or files reference the same extent, how do we know when the extent can be freed? Btrfs solves this with reference counting and back-references.
The Reference Counting Challenge:
Consider a file with 100 extents that has 10 snapshots. Each snapshot might share some extents and have diverged on others. We need to track:

- How many references each extent has, so it can be freed exactly when the count reaches zero
- Who holds each reference, so extents can be relocated, repaired, and accounted for
Back-References:
Btrfs stores back-references alongside extent items in the extent tree. A back-reference indicates "who is pointing to this extent":
```
Extent Item: [disk_bytenr, num_bytes]
├── RefCount: 3
├── BackRef: { tree_id: 257, objectid: 123, offset: 0 }
├── BackRef: { tree_id: 258, objectid: 456, offset: 0 }
└── BackRef: { tree_id: 259, objectid: 789, offset: 4096 }
```
Types of Back-References:
Btrfs uses different back-reference types depending on shared status:
| Type | When Used | Content |
|---|---|---|
| Inline Back-Ref | Few references, fit in extent item | Directly in extent item data |
| Keyed Back-Ref | Many references, overflow | Separate items in extent tree |
| Full Back-Ref | Shared extents (snapshots) | Complete tree path information |
| Shared Back-Ref | Trees share metadata nodes | Points to parent tree node |
Extent Sharing Mechanics:
When you create a snapshot:

1. The root node of the source subvolume's tree is COW'd into a new tree root.
2. Reference counts on the nodes and extents it points to are incremented.
3. No file data is copied, so creation time is independent of subvolume size.
When you modify a file in the snapshot:

1. The affected path is COW'd within the snapshot's own tree.
2. New data goes to a newly allocated extent with a reference count of 1.
3. The previously shared extent's reference count is decremented for this tree; the original subvolume still references it.
```
Initial State (1 subvolume, 1 file):
═══════════════════════════════════════════
Subvol A
└── file.txt → [Extent X: data]  (refcount: 1)

After Creating Snapshot B:
═══════════════════════════════════════════
Subvol A                     Subvol B (snapshot of A)
└── file.txt → [Extent X] ←── file.txt
                    ↑
              (refcount: 2)

After Modifying file.txt in Subvol B:
═══════════════════════════════════════════
Subvol A                     Subvol B
└── file.txt → [Extent X]    └── file.txt → [Extent Y: new data]
                    ↑                             ↑
              (refcount: 1)                 (refcount: 1)

The original file in Subvol A is unchanged!
Only Subvol B has the new data.
```

Because extents are shared, calculating "space used" by a subvolume is non-trivial. Btrfs provides quota groups (qgroups) for precise space accounting, tracking both exclusive space (referenced only by this subvolume) and shared space (referenced by others as well).
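The refcount bookkeeping in the diagram above can be modeled in a few lines. This is a toy simulation (the `snapshot` and `overwrite` helpers and a plain dict of refcounts are invented for illustration; real Btrfs tracks this in the extent tree with back-references):

```python
# Toy model of extent reference counting across a snapshot.
refcount = {}

def snapshot(extents):
    for ext in extents:                # a snapshot shares every extent:
        refcount[ext] += 1             # bump the refcount, copy no data
    return list(extents)

def overwrite(extents, index, new_extent):
    old = extents[index]
    refcount[old] -= 1                 # this tree drops its reference
    if refcount[old] == 0:
        del refcount[old]              # nobody references it: reclaim
    extents[index] = new_extent
    refcount[new_extent] = refcount.get(new_extent, 0) + 1

subvol_a = ["X"]
refcount["X"] = 1
subvol_b = snapshot(subvol_a)          # refcount of X becomes 2
overwrite(subvol_b, 0, "Y")            # B diverges with new data
print(refcount)  # {'X': 1, 'Y': 1}: A still references X, B references Y
```

Deleting Subvol A at this point would drop X's count to zero and free it, while Y remains untouched, which is exactly the property that makes snapshot deletion safe but potentially slow.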
One of COW's most valuable properties is crash consistency—the guarantee that after any crash, the file system will be in a valid, consistent state without requiring expensive fsck operations.
The Consistency Guarantee:
At any point in time, the file system has exactly one committed state, the most recently completed transaction. If a crash occurs:

- Any transaction still in the RUNNING or COMMITTING state is simply lost; its writes are unreferenced and harmless.
- The last committed transaction remains fully intact and is the state seen after reboot.
Why This Works:

- Committed data and metadata are never overwritten; new versions always go elsewhere.
- The superblock update that publishes a new state is atomic, and multiple superblock mirrors guard against a torn write of even that one structure.
Recovery Process:
When Btrfs mounts after a crash:

1. All superblock mirrors are read, and the one with the highest generation and a valid checksum is chosen.
2. The tree roots it references are guaranteed complete, because the superblock is only ever written after they are.
3. If a log tree exists, it is replayed to recover fsync'd changes.
There's no journal to replay, no structure to check—the file system is immediately consistent.
Log Tree for fsync():
While transactions enable consistency, the default 30-second commit interval could lose significant data on crash. For applications that call fsync() to ensure durability, Btrfs uses a log tree:
1. fsync() writes the affected file data and minimal metadata to a dedicated log tree.
2. The log tree is committed on its own, which is much cheaper than a full transaction commit.
3. After a crash, the log tree is replayed at mount time; after a normal commit, it is discarded.

This provides the durability guarantee of fsync without forcing a full commit.
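The interplay between the log tree and full commits can be sketched as a tiny state machine. Everything here is illustrative (the dicts and the `fsync` / `full_commit` / `recover_after_crash` helpers are invented names, not kernel interfaces):

```python
# Sketch of the fsync fast path: fsync'd changes land in a small durable
# log, which recovery replays on top of the last full commit.

committed = {}    # state as of the last full transaction commit
log_tree = {}     # durable fsync log accumulated since that commit

def fsync(path, data):
    log_tree[path] = data          # durable immediately, no full commit

def full_commit(pending):
    committed.update(pending)
    log_tree.clear()               # log is superseded by the commit

def recover_after_crash():
    state = dict(committed)
    state.update(log_tree)         # replay the log at mount time
    return state

fsync("db.sqlite", b"page-42")
# -- crash here, before the periodic 30-second commit --
print(recover_after_crash())  # {'db.sqlite': b'page-42'}
```

The fsync'd write survives the crash even though no full transaction commit happened, which is the whole point of the log tree.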
```
Scenario 1: Crash during data write (before commit)
═══════════════════════════════════════════════════════════════
State:    New data blocks written to disk, but superblock
          still points to old tree roots.
Recovery: Superblock still references old state.
          New data blocks are orphaned, will be reclaimed.
          NO DATA CORRUPTION. Old files intact.

Scenario 2: Crash during metadata write (before commit)
═══════════════════════════════════════════════════════════════
State:    Some new tree nodes written, but superblock
          still points to old tree roots.
Recovery: Superblock still references old state.
          New tree nodes are orphaned, will be reclaimed.
          NO CORRUPTION. Old metadata intact.

Scenario 3: Crash during superblock write
═══════════════════════════════════════════════════════════════
State:    Superblock partially written or only some
          mirrors updated.
Recovery: Btrfs checks all superblock mirrors.
          Uses the highest-generation mirror with a valid checksum.
          Even if one mirror is corrupted, others provide recovery.

Scenario 4: Crash after superblock commit
═══════════════════════════════════════════════════════════════
State:    New superblock successfully written.
Recovery: File system uses new state.
          All committed changes preserved.
          Old data blocks now reclaimable.
```

Traditional file systems require fsck after crashes to verify and repair consistency. Btrfs's COW design guarantees structural consistency, so fsck is only needed to check for hardware-induced corruption (via scrub), not structural inconsistencies. This saves potentially hours of boot-time checks on large file systems.
Copy-on-Write is not without costs. Understanding the performance implications helps in proper workload selection and tuning.
The Fragmentation Problem:
In traditional file systems, overwriting file data keeps it contiguous. In a COW file system, each overwrite allocates new space wherever it is available, so frequently overwritten files fragment over time. The table below summarizes COW's performance characteristics:
| Operation | Performance Impact | Mitigation |
|---|---|---|
| Sequential write (new file) | ✅ Excellent | Allocator prefers contiguous space |
| Sequential read | ✅ Excellent initially, degrades with COW | Periodic defragmentation |
| Random write (existing file) | ⚠️ Allocates new extents, fragments | NOCOW for databases, defrag |
| Random read | ⚠️ Depends on fragmentation level | Defragmentation, compression |
| Small writes | ⚠️ High metadata overhead per write | Batching, larger writes |
| Snapshot creation | ✅ Instant (O(1)) | COW benefit! |
| Snapshot deletion | ⚠️ Can be slow (extent cleanup) | Background processing |
| With many snapshots | ⚠️ More metadata, more back-refs | Limit snapshot count |
Write Amplification:
Btrfs COW creates write amplification from multiple sources:

- Path copying: each modification rewrites O(log n) metadata nodes, not just the changed item.
- Checksums: every rewritten block gets a fresh checksum entry.
- Extent bookkeeping: the extent tree (itself COW'd) must be updated for every allocation and free.
For small, random writes, this amplification can be significant—a 4KB write might require 64KB+ of metadata writes.
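A back-of-the-envelope calculation makes this concrete. The node size matches the Btrfs default, but the tree height and number of trees touched are assumed round numbers for illustration, not measured values:

```python
# Rough write amplification for one small, unbatched COW write.
data_write = 4 * 1024      # one 4 KiB data write
node_size = 16 * 1024      # default Btrfs metadata node size
tree_height = 3            # assumed: path copying rewrites this many nodes
trees_touched = 2          # assumed: e.g. file tree + extent tree

metadata = node_size * tree_height * trees_touched
total = data_write + metadata
print(metadata // 1024, "KiB of metadata")    # 96 KiB of metadata
print(total / data_write, "x amplification")  # 25.0 x amplification
```

In practice transaction batching spreads those metadata writes across many data writes, which is why sustained workloads see far less amplification than this worst case for a single isolated write.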
Optimizing for COW:
- Use chattr +C on database directories before creating files
- Run btrfs filesystem defragment on fragmented files

Mounting with autodefrag causes Btrfs to automatically defragment files that receive small random writes. This helps maintain performance for general-purpose workloads but adds background I/O. Don't use it with NOCOW files or databases.
Copy-on-Write is the foundation upon which all of Btrfs's advanced features are built. Let's consolidate the key concepts:

- Btrfs never overwrites live data or metadata; every change goes to a new location and is published by an atomic pointer update.
- Path copying extends COW to whole trees, at a cost of O(log n) node writes per change, amortized by batched transaction commits.
- Generation numbers and multiple superblock mirrors make the committed state unambiguous after a crash, with no journal replay and no fsck.
- Reference counting and back-references let snapshots share extents safely and determine exactly when space can be reclaimed.
- The tradeoffs are fragmentation and write amplification, managed with NOCOW attributes, defragmentation, and workload-appropriate tuning.
What's Next:
With COW understood, we'll explore one of its most powerful applications: Subvolumes. Subvolumes provide lightweight, flexible containers within a Btrfs file system—enabling independent mount points, quota control, and the foundation for snapshot management.
You now understand how Copy-on-Write works in Btrfs—from the mechanics of path copying and transaction commits to reference counting, crash consistency, and performance optimization. This knowledge is essential for understanding snapshots, subvolumes, and Btrfs's self-healing capabilities.