The Copy-on-Write (COW) principle is the philosophical and technical cornerstone of Btrfs. Unlike traditional file systems that modify data in place—risking corruption if a write is interrupted—Btrfs never overwrites existing data. Instead, every modification creates a new copy at a different location, and only after the write completes successfully does the file system atomically update its metadata to point to the new data.
This seemingly simple concept has profound implications. COW enables instant snapshots without duplicating data, guarantees that file system structures are always consistent even after crashes, and provides the foundation for features that would be impossible or impractical in traditional file systems.
By the end of this page, you will understand COW at a deep level: how Btrfs implements COW for both data and metadata, how transaction semantics work, the role of COW in crash consistency, the mechanics of reference counting and extent sharing, and the performance tradeoffs involved.
To appreciate COW, we must first understand what traditional file systems do—and why it's problematic.
The Problem with In-Place Updates:
Consider a traditional file system like ext4 (without journaling, for simplicity). When you modify an existing file:

1. The file system locates the disk blocks holding the data.
2. The blocks are read into memory.
3. The data is modified in memory.
4. The modified blocks are written back to their original locations on disk.
This "update-in-place" approach creates a critical vulnerability: if the system crashes during step 4, the data on disk is in an inconsistent state—partially old, partially new, potentially corrupted.
Traditional file systems mitigate this with journaling: writing changes first to a dedicated log, then applying them to their final locations. If a crash occurs, the journal is replayed to restore consistency. But journaling has costs:
The COW Alternative:
Btrfs takes a fundamentally different approach:

1. Write the modified data to a newly allocated location, leaving the original untouched.
2. Update the metadata (itself via COW) to point to the new location.
3. Atomically publish the new state by updating the superblock.
4. Mark the old blocks as free once nothing references them.
The key insight: the old data remains intact until the new data is fully written and referenced. There's never a moment when the on-disk structure is inconsistent—either the old version is complete and valid, or the new version is complete and valid.
```
Traditional In-Place Update:
════════════════════════════════════════════════════════
Before:  [Block A: old data]
             │
             ▼ Write new data (overwrite)
During:  [Block A: partial old/new]  ← DANGER ZONE
             │
             ▼ Complete write
After:   [Block A: new data]

If crash during "During": CORRUPTION

════════════════════════════════════════════════════════
Copy-on-Write:
════════════════════════════════════════════════════════
Before:  Pointer → [Block A: old data]

             Allocate new block

During:  Pointer → [Block A: old data]  ← Still valid!
                   [Block B: writing new data...]

             Complete write, update pointer atomically

After:   Pointer → [Block B: new data]
                   [Block A: old data]  ← Marked free

If crash "During": Old data still valid, new data ignored
```

The secret to COW's crash safety is that changing a metadata pointer from A to B is an atomic operation at the disk level. Either the pointer still points to A (the old version) or it points to B (the new version). There is no intermediate state in which the pointer is invalid.
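The write-then-swap pattern above can be sketched in a few lines. This is a toy model, not Btrfs internals: the `Disk` class, `cow_write` function, and the single-pointer "metadata" are all illustrative assumptions.

```python
# Toy model of a COW update: write new data elsewhere, then atomically
# swap the pointer. The old block stays valid until the swap completes.

class Disk:
    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.pointer = None   # "metadata" pointer to the live block

def cow_write(disk, new_data):
    # 1. Write new data to a freshly allocated block; old block untouched.
    new_id = max(disk.blocks, default=0) + 1
    disk.blocks[new_id] = new_data
    # 2. Atomically switch the pointer. A crash before this line leaves
    #    the old version intact; after it, the new version is live.
    old_id, disk.pointer = disk.pointer, new_id
    # 3. The old block is now unreferenced and reclaimable.
    if old_id is not None:
        del disk.blocks[old_id]

disk = Disk()
cow_write(disk, b"v1")
cow_write(disk, b"v2")
print(disk.blocks[disk.pointer])  # b'v2'
```

The single assignment standing in for "update the pointer" is the whole trick: there is no state in which the pointer refers to a half-written block.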
Btrfs applies COW not just to file data but to all metadata—every tree node, every inode item, every extent reference. This is what makes Btrfs trees truly "copy-on-write trees."
The Cascade Effect:
When you modify a leaf node in a Btrfs tree:

1. A modified copy of the leaf is written to a new location.
2. The parent must now point to the new leaf, so the parent is itself COW'd to a new location.
3. This cascades upward until a new root node is written.
This is called "path copying"—every modification creates new copies along the entire path from leaf to root.
```
Original Tree:                  After Modifying Leaf C:

       [Root]                    [Root]───────→[Root']
       /    \                    /    \              \
 [Node1]  [Node2]          [Node1]  [Node2]      [Node2']
  /  \     /  \             /  \     /  \              \
[A]  [B] [C]  [D]         [A]  [B] [C]  [D]          [C']
                                    ↑                  │
                          Old leaf, still valid     New leaf

Transaction Commit:
1. Write C' (new leaf with modifications)
2. Write Node2' (parent pointing to C')
3. Write Root' (new root pointing to Node2')
4. Atomically update superblock to point to Root'
5. Old path (Root→Node2→C) becomes reclaimable
```

The Cost of Path Copying:
Path copying means modifying a single item requires writing O(log n) tree nodes—the height of the tree. For a tree with height 4, that's 4 node writes instead of 1.
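Path copying is easy to demonstrate with a small functional tree. The sketch below is illustrative (the `Node` class and `cow_path` helper are invented for this example, not Btrfs code): only the nodes on the modified path are copied, and untouched subtrees are shared between the old and new versions.

```python
# Sketch of "path copying": modifying one leaf produces new copies of
# every node on the path to the root, sharing all untouched subtrees.

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def cow_path(root, path, new_value):
    """Return a new root; `path` is a list of child indexes to the leaf.
    Only O(height) nodes are copied."""
    if not path:
        return Node(new_value)                      # new leaf
    new_children = list(root.children)              # shallow copy of child list
    i = path[0]
    new_children[i] = cow_path(root.children[i], path[1:], new_value)
    return Node(root.value, new_children)           # new interior node

leaf_c = Node("C")
node2 = Node("N2", [leaf_c, Node("D")])
root = Node("R", [Node("N1"), node2])

new_root = cow_path(root, [1, 0], "C'")
print(new_root.children[1].children[0].value)    # C'
print(root.children[1].children[0].value)        # C  (old tree intact)
print(new_root.children[0] is root.children[0])  # True: Node1 is shared
```

Note how the old root still describes the complete pre-modification tree, which is exactly why COW snapshots are free: keeping an old root alive keeps the old version alive.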
However, Btrfs mitigates this cost through delayed allocation and batching:

- Modifications accumulate in memory, so many changes along the same path are merged into a single set of node writes at commit time.
- One transaction commit amortizes the O(log n) write cost across hundreds or thousands of individual operations.
The Benefits of Metadata COW:

- Any tree can be snapshotted instantly by preserving a copy of its root.
- Metadata is never in a half-written state on disk, even across crashes.
- The old tree version stays intact and readable until the new version is fully committed.
The superblock is the one structure that IS updated in place (at fixed disk locations). However, Btrfs keeps multiple superblock copies and uses generation numbers to identify the most recent valid superblock, providing reliability even for this special case.
By default, Btrfs applies COW to file data as well as metadata. When you overwrite existing file contents:

1. A new extent is allocated and the new data is written there.
2. The file's extent mapping is updated (via metadata COW) to reference the new extent.
3. The old extent's reference count is decremented; it is freed when the count reaches zero.
Data COW Implications:

- Every overwrite relocates data, so files that are modified in place fragment over time.
- Checksums can be written together with the new data, so data and checksum never disagree.
- Snapshots share data extents at zero cost until files actually diverge.
Disabling Data COW:
For certain workloads, data COW can be problematic. Database files and virtual machine images are frequently overwritten in small chunks, causing severe fragmentation and space amplification.
Btrfs allows disabling data COW per-file using the nodatacow attribute:
```bash
# Set nodatacow on a directory; files created inside inherit the attribute.
# (On an existing file, +C only takes reliable effect while the file is empty.)
chattr +C /path/to/vm-images/

# Check COW status
lsattr /path/to/file
```
Important: When nodatacow is set:

- Data checksumming is disabled for the file (metadata is still checksummed).
- Compression is disabled for the file.
- The attribute should be applied to empty files or, better, to the parent directory before files are created.
- Taking a snapshot still forces a one-time COW for writes to the extents the snapshot shares.
The NOCOW tradeoff:
| Aspect | COW (Default) | NOCOW (+C attribute) |
|---|---|---|
| Write behavior | Always new location | In-place overwrite |
| Checksumming | ✅ Full checksums | ❌ Disabled for data |
| Snapshot sharing | ✅ Shares extents | ❌ Creates copies |
| Fragmentation | High for overwrites | Low/none |
| Write amplification | Higher (COW + metadata) | Lower |
| Ideal for | Documents, source code, logs | Databases, VM images |
| Data integrity | Protected | Risk of partial writes |
On Btrfs RAID5/6 (which is already not recommended), NOCOW files are particularly dangerous because they can experience the write hole problem—a crash during a partial stripe write can leave parity inconsistent with data.
Btrfs organizes changes into transactions—groups of modifications that are committed atomically to disk. This transaction model is the mechanism by which COW's theoretical benefits become practical reality.
Transaction Lifecycle:
```
Transaction States:
═══════════════════════════════════════════════════════════════

 ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
 │   RUNNING    │────▶│  COMMITTING  │────▶│  COMMITTED   │
 │              │     │              │     │              │
 │ - Accept new │     │ - No new ops │     │ - On disk    │
 │   operations │     │ - Write data │     │ - Visible to │
 │ - In-memory  │     │ - Write meta │     │   new reads  │
 │   changes    │     │ - Update SB  │     │              │
 └──────────────┘     └──────────────┘     └──────────────┘
        │                                         │
        │ (Crash during RUNNING)                  │
        ▼                                         ▼
 Transaction lost,                      Transaction survives,
 file system rolls back                 fully visible after
 to previous commit point               recovery

Generation Numbers:
═══════════════════════════════════════════════════════════════
Gen 99 (committed) ──▶ Gen 100 (running) ──▶ Gen 100 (committed)
         │                     │
         │                   Crash
         ▼                     ▼
   Recovery sees Gen 99 as last valid state
```

How Commits Work:
Running Phase: A transaction is open, accepting file operations. Modifications are accumulated in memory within COW'd tree nodes.
Commit Trigger: Commits are triggered by:

- The periodic commit interval (30 seconds by default)
- sync() or fsync() calls
- Memory pressure from accumulated dirty metadata
- Operations that require durability, such as subvolume and snapshot creation

Commit Sequence:

1. Stop admitting new operations into the transaction.
2. Write all dirty data extents to their newly allocated locations.
3. Write all COW'd metadata nodes, from leaves up to the tree roots.
4. Issue a flush/barrier so everything written so far is durable on media.
5. Write the superblock pointing at the new roots, stamped with the new generation number.
Generation Numbers:
Each transaction has a generation number—a monotonically increasing counter. Every tree node, extent, and metadata item records which generation created or modified it.
Generation numbers enable:

- Selecting the newest valid superblock among its mirrors after a crash
- Detecting stale or out-of-date metadata copies during scrub and repair
- Incremental send/receive, which locates everything changed since a given generation
The commit interval affects both safety and performance. Shorter intervals (e.g., 5 seconds) mean less data loss on crash but more I/O overhead. Longer intervals (e.g., 60 seconds) improve performance by batching more writes but risk more data loss. The default of 30 seconds is a reasonable balance. Tune with 'commit=N' mount option.
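Superblock selection by generation number can be sketched as follows. This is a simplified model with invented field names (`make_super`, `pick_superblock`, a CRC32 stand-in for the real checksum), not the actual on-disk format:

```python
# Sketch of post-crash superblock recovery: among all mirrors, choose the
# one with the highest generation whose checksum still verifies.
import zlib

def make_super(generation, root):
    payload = f"{generation}:{root}".encode()
    return {"generation": generation, "root": root,
            "csum": zlib.crc32(payload)}

def pick_superblock(mirrors):
    valid = [sb for sb in mirrors
             if sb["csum"] == zlib.crc32(
                 f"{sb['generation']}:{sb['root']}".encode())]
    # Highest valid generation wins; torn writes simply lose the race.
    return max(valid, key=lambda sb: sb["generation"])

mirrors = [make_super(99, "rootA"), make_super(100, "rootB")]
mirrors[1]["csum"] ^= 1   # simulate a torn write of the gen-100 mirror
print(pick_superblock(mirrors)["generation"])  # 99: last fully valid state
```

A crash mid-way through writing the newest superblock therefore degrades gracefully: recovery falls back to the previous committed generation rather than failing.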
COW creates a fundamental challenge: when multiple snapshots or files reference the same extent, how do we know when the extent can be freed? Btrfs solves this with reference counting and back-references.
The Reference Counting Challenge:
Consider a file with 100 extents that has 10 snapshots. Each snapshot might share some extents and have diverged on others. We need to track:

- How many references each extent has, so it can be freed exactly when the count reaches zero
- Who holds each reference, so extents can be relocated, repaired, and accounted for
Back-References:
Btrfs stores back-references alongside extent items in the extent tree. A back-reference indicates "who is pointing to this extent":
```
Extent Item: [disk_bytenr, num_bytes]
├── RefCount: 3
├── BackRef: { tree_id: 257, objectid: 123, offset: 0 }
├── BackRef: { tree_id: 258, objectid: 456, offset: 0 }
└── BackRef: { tree_id: 259, objectid: 789, offset: 4096 }
```
Types of Back-References:
Btrfs uses different back-reference types depending on shared status:
| Type | When Used | Content |
|---|---|---|
| Inline Back-Ref | Few references, fit in extent item | Directly in extent item data |
| Keyed Back-Ref | Many references, overflow | Separate items in extent tree |
| Full Back-Ref | Shared extents (snapshots) | Complete tree path information |
| Shared Back-Ref | Trees share metadata nodes | Points to parent tree node |
Extent Sharing Mechanics:
When you create a snapshot:

1. The root node of the source subvolume's tree is COW'd into a new tree root.
2. Reference counts on the nodes and extents it points to are incremented.
3. No file data is copied, so creation time is independent of subvolume size.
When you modify a file in the snapshot:

1. The affected path is COW'd within the snapshot's own tree.
2. New data goes to a newly allocated extent with a reference count of 1.
3. The previously shared extent's reference count is decremented for this tree; the original subvolume still references it.
```
Initial State (1 subvolume, 1 file):
═══════════════════════════════════════════
Subvol A
└── file.txt → [Extent X: data]  (refcount: 1)

After Creating Snapshot B:
═══════════════════════════════════════════
Subvol A                     Subvol B (snapshot of A)
└── file.txt → [Extent X] ←── file.txt
                    ↑
              (refcount: 2)

After Modifying file.txt in Subvol B:
═══════════════════════════════════════════
Subvol A                     Subvol B
└── file.txt → [Extent X]    └── file.txt → [Extent Y: new data]
                    ↑                             ↑
              (refcount: 1)                 (refcount: 1)

The original file in Subvol A is unchanged!
Only Subvol B has the new data.
```

Because extents are shared, calculating "space used" by a subvolume is non-trivial. Btrfs provides quota groups (qgroups) for precise space accounting, tracking both exclusive space (referenced only by this subvolume) and shared space (referenced by others as well).
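The refcount bookkeeping in the diagram above can be modeled in a few lines. This is a toy simulation (the `snapshot` and `overwrite` helpers and a plain dict of refcounts are invented for illustration; real Btrfs tracks this in the extent tree with back-references):

```python
# Toy model of extent reference counting across a snapshot.
refcount = {}

def snapshot(extents):
    for ext in extents:                # a snapshot shares every extent:
        refcount[ext] += 1             # bump the refcount, copy no data
    return list(extents)

def overwrite(extents, index, new_extent):
    old = extents[index]
    refcount[old] -= 1                 # this tree drops its reference
    if refcount[old] == 0:
        del refcount[old]              # nobody references it: reclaim
    extents[index] = new_extent
    refcount[new_extent] = refcount.get(new_extent, 0) + 1

subvol_a = ["X"]
refcount["X"] = 1
subvol_b = snapshot(subvol_a)          # refcount of X becomes 2
overwrite(subvol_b, 0, "Y")            # B diverges with new data
print(refcount)  # {'X': 1, 'Y': 1}: A still references X, B references Y
```

Deleting Subvol A at this point would drop X's count to zero and free it, while Y remains untouched, which is exactly the property that makes snapshot deletion safe but potentially slow.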
One of COW's most valuable properties is crash consistency—the guarantee that after any crash, the file system will be in a valid, consistent state without requiring expensive fsck operations.
The Consistency Guarantee:
At any point in time, the file system has exactly one committed state, the most recently completed transaction. If a crash occurs:

- Any transaction still in the RUNNING or COMMITTING state is simply lost; its writes are unreferenced and harmless.
- The last committed transaction remains fully intact and is the state seen after reboot.
Why This Works:

- Committed data and metadata are never overwritten; new versions always go elsewhere.
- The superblock update that publishes a new state is atomic, and multiple superblock mirrors guard against a torn write of even that one structure.
Recovery Process:
When Btrfs mounts after a crash:

1. All superblock mirrors are read, and the one with the highest generation and a valid checksum is chosen.
2. The tree roots it references are guaranteed complete, because the superblock is only ever written after they are.
3. If a log tree exists, it is replayed to recover fsync'd changes.
There's no journal to replay, no structure to check—the file system is immediately consistent.
Log Tree for fsync():
While transactions enable consistency, the default 30-second commit interval could lose significant data on crash. For applications that call fsync() to ensure durability, Btrfs uses a log tree:
1. fsync() writes the affected file data and minimal metadata to a dedicated log tree.
2. The log tree is committed on its own, which is much cheaper than a full transaction commit.
3. After a crash, the log tree is replayed at mount time; after a normal commit, it is discarded.

This provides the durability guarantee of fsync without forcing a full commit.
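The interplay between the log tree and full commits can be sketched as a tiny state machine. Everything here is illustrative (the dicts and the `fsync` / `full_commit` / `recover_after_crash` helpers are invented names, not kernel interfaces):

```python
# Sketch of the fsync fast path: fsync'd changes land in a small durable
# log, which recovery replays on top of the last full commit.

committed = {}    # state as of the last full transaction commit
log_tree = {}     # durable fsync log accumulated since that commit

def fsync(path, data):
    log_tree[path] = data          # durable immediately, no full commit

def full_commit(pending):
    committed.update(pending)
    log_tree.clear()               # log is superseded by the commit

def recover_after_crash():
    state = dict(committed)
    state.update(log_tree)         # replay the log at mount time
    return state

fsync("db.sqlite", b"page-42")
# -- crash here, before the periodic 30-second commit --
print(recover_after_crash())  # {'db.sqlite': b'page-42'}
```

The fsync'd write survives the crash even though no full transaction commit happened, which is the whole point of the log tree.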
```
Scenario 1: Crash during data write (before commit)
═══════════════════════════════════════════════════════════════
State:    New data blocks written to disk, but superblock
          still points to old tree roots.
Recovery: Superblock still references old state.
          New data blocks are orphaned, will be reclaimed.
          NO DATA CORRUPTION. Old files intact.

Scenario 2: Crash during metadata write (before commit)
═══════════════════════════════════════════════════════════════
State:    Some new tree nodes written, but superblock
          still points to old tree roots.
Recovery: Superblock still references old state.
          New tree nodes are orphaned, will be reclaimed.
          NO CORRUPTION. Old metadata intact.

Scenario 3: Crash during superblock write
═══════════════════════════════════════════════════════════════
State:    Superblock partially written or only some
          mirrors updated.
Recovery: Btrfs checks all superblock mirrors.
          Uses the highest-generation mirror with a valid checksum.
          Even if one mirror is corrupted, others provide recovery.

Scenario 4: Crash after superblock commit
═══════════════════════════════════════════════════════════════
State:    New superblock successfully written.
Recovery: File system uses new state.
          All committed changes preserved.
          Old data blocks now reclaimable.
```

Traditional file systems require fsck after crashes to verify and repair consistency. Btrfs's COW design guarantees structural consistency, so fsck is only needed to check for hardware-induced corruption (via scrub), not structural inconsistencies. This saves potentially hours of boot-time checks on large file systems.
Copy-on-Write is not without costs. Understanding the performance implications helps in proper workload selection and tuning.
The Fragmentation Problem:
In traditional file systems, overwriting file data keeps it contiguous. In a COW file system, each overwrite allocates new space wherever it is available, so frequently overwritten files fragment over time. The table below summarizes COW's performance characteristics:
| Operation | Performance Impact | Mitigation |
|---|---|---|
| Sequential write (new file) | ✅ Excellent | Allocator prefers contiguous space |
| Sequential read | ✅ Excellent initially, degrades with COW | Periodic defragmentation |
| Random write (existing file) | ⚠️ Allocates new extents, fragments | NOCOW for databases, defrag |
| Random read | ⚠️ Depends on fragmentation level | Defragmentation, compression |
| Small writes | ⚠️ High metadata overhead per write | Batching, larger writes |
| Snapshot creation | ✅ Instant (O(1)) | COW benefit! |
| Snapshot deletion | ⚠️ Can be slow (extent cleanup) | Background processing |
| With many snapshots | ⚠️ More metadata, more back-refs | Limit snapshot count |
Write Amplification:
Btrfs COW creates write amplification from multiple sources:

- Path copying: each modification rewrites O(log n) metadata nodes, not just the changed item.
- Checksums: every rewritten block gets a fresh checksum entry.
- Extent bookkeeping: the extent tree (itself COW'd) must be updated for every allocation and free.
For small, random writes, this amplification can be significant—a 4KB write might require 64KB+ of metadata writes.
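A back-of-the-envelope calculation makes this concrete. The node size matches the Btrfs default, but the tree height and number of trees touched are assumed round numbers for illustration, not measured values:

```python
# Rough write amplification for one small, unbatched COW write.
data_write = 4 * 1024      # one 4 KiB data write
node_size = 16 * 1024      # default Btrfs metadata node size
tree_height = 3            # assumed: path copying rewrites this many nodes
trees_touched = 2          # assumed: e.g. file tree + extent tree

metadata = node_size * tree_height * trees_touched
total = data_write + metadata
print(metadata // 1024, "KiB of metadata")    # 96 KiB of metadata
print(total / data_write, "x amplification")  # 25.0 x amplification
```

In practice transaction batching spreads those metadata writes across many data writes, which is why sustained workloads see far less amplification than this worst case for a single isolated write.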
Optimizing for COW:
- Use chattr +C on database directories before creating files
- Run btrfs filesystem defragment on fragmented files

Mounting with autodefrag causes Btrfs to automatically defragment files that receive small random writes. This helps maintain performance for general-purpose workloads but adds background I/O. Don't use it with NOCOW files or databases.
Copy-on-Write is the foundation upon which all of Btrfs's advanced features are built. Let's consolidate the key concepts:

- Btrfs never overwrites live data or metadata; every change goes to a new location and is published by an atomic pointer update.
- Path copying extends COW to whole trees, at a cost of O(log n) node writes per change, amortized by batched transaction commits.
- Generation numbers and multiple superblock mirrors make the committed state unambiguous after a crash, with no journal replay and no fsck.
- Reference counting and back-references let snapshots share extents safely and determine exactly when space can be reclaimed.
- The tradeoffs are fragmentation and write amplification, managed with NOCOW attributes, defragmentation, and workload-appropriate tuning.
What's Next:
With COW understood, we'll explore one of its most powerful applications: Subvolumes. Subvolumes provide lightweight, flexible containers within a Btrfs file system—enabling independent mount points, quota control, and the foundation for snapshot management.
You now understand how Copy-on-Write works in Btrfs—from the mechanics of path copying and transaction commits to reference counting, crash consistency, and performance optimization. This knowledge is essential for understanding snapshots, subvolumes, and Btrfs's self-healing capabilities.