Every file system faces a fundamental challenge: how do you safely modify data on disk?
Consider what happens when you save a document. The file system must update the file's data blocks, modify the inode's timestamp and size, potentially update directory entries, and adjust free space tracking metadata. If power fails mid-operation—or the system crashes—you risk leaving the disk in an inconsistent state: partial writes, orphaned blocks, corrupted metadata, or worse.
Traditional file systems address this through various strategies:

- **Offline repair (fsck):** scan the entire disk after a crash and try to reconstruct consistent metadata (ext2, classic FFS)
- **Journaling:** record intended changes in a log before applying them, so the log can be replayed after a crash (ext4, NTFS, XFS)
- **Soft updates:** carefully order in-place writes so that any inconsistency left by a crash is benign (FFS)
Each approach attempts to solve the same underlying problem: in-place updates are inherently dangerous. When you overwrite existing data, there's always a window where both the old and new states are incomplete.
By the end of this page, you will understand the Copy-on-Write paradigm at a fundamental level: why it eliminates entire categories of file system corruption, how it enables capabilities impossible with in-place update systems, and why it represents the architectural foundation of the most advanced modern file systems.
To truly appreciate Copy-on-Write, we must first understand why in-place updates are problematic. Traditional file systems—ext4, NTFS, HFS+—follow a fundamental pattern: when data changes, they modify the existing disk blocks directly.
The anatomy of an in-place update:
Imagine modifying a 4KB block in a file. The traditional approach:

1. Read the existing block from disk into memory
2. Modify the block's contents in memory
3. Write the modified block back to the same disk location, overwriting the original

This seems straightforward, but consider what happens if the system fails at step 3. The disk block might contain:

- The old data, if the write never started
- The new data, if the write completed just in time
- An arbitrary mix of old and new bytes, if the write was torn partway through

In the torn-write case, the original data is gone and the new data is incomplete. There's no way to recover either version.
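A minimal sketch of this pattern in C makes the vulnerability window concrete (the `update_block_in_place` helper and its callback are hypothetical; error handling is trimmed):

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* Naive in-place update: read-modify-write to the SAME location.
 * If power fails during pwrite(), the block on disk may hold an
 * arbitrary mix of old and new bytes, and the old version is gone. */
int update_block_in_place(int fd, off_t block_offset,
                          void (*modify)(uint8_t *buf)) {
    uint8_t buf[BLOCK_SIZE];

    if (pread(fd, buf, BLOCK_SIZE, block_offset) != BLOCK_SIZE)
        return -1;

    modify(buf);                      /* change the block in memory */

    /* Vulnerability window: this call overwrites the only copy. */
    if (pwrite(fd, buf, BLOCK_SIZE, block_offset) != BLOCK_SIZE)
        return -1;

    return fsync(fd);                 /* even fsync cannot undo a torn write */
}
```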
The corruption cascade:
The problem extends beyond individual blocks. File systems maintain complex relationships:

- Inodes point to data blocks and indirect blocks
- Directory entries point to inodes
- Free-space bitmaps must agree with every allocation and deallocation
- Link counts must match the number of directory entries referencing each inode

When any of these structures is partially updated, the ripple effects can be catastrophic:

- A stale bitmap can leak blocks forever, or hand the same block to two different files
- A dangling directory entry can point at a deleted or recycled inode
- An incorrect link count can cause a live file to be reclaimed, or a dead one to linger
This is why fsck exists—and why it can take hours to run on large file systems. It must examine every structure, detect inconsistencies, and attempt repairs that often result in data loss.
In-place updates create a time window of vulnerability. During this window, a crash leaves data in an indeterminate state. No amount of ordering, journaling, or careful programming can eliminate this window—it's inherent to the in-place update model.
Copy-on-Write (COW) takes a radically different approach: never overwrite existing data. Instead of modifying blocks in-place, COW file systems always write to new, unused locations.
The COW write operation:

1. Allocate a new, unused block
2. Write the modified data to the new block
3. Update the metadata that references the block, itself copied to new locations
4. Atomically update the root pointer to reference the new state

The crucial difference: until step 4 completes atomically, the old data remains intact. If the system crashes at any point before the final atomic pointer update, the old root still references the complete previous state; the new blocks are simply unreferenced scratch space that will be reclaimed.
There is no window of vulnerability. At every instant, the file system is in a consistent state.
The atomic pointer update:
The magic of COW lies in how it commits changes. Instead of updating multiple structures individually, COW file systems use a single atomic operation to switch between states.
In most COW file systems, this is accomplished through a root pointer (called the überblock in ZFS and the superblock in Btrfs): a single well-known block from which every other structure in the file system is reachable. Switching this one pointer from the old tree to the new tree switches the entire visible state.

This atomic transition is typically achieved by:

1. Writing the new root to a fixed, well-known location (ZFS rotates through an array of überblock slots)
2. Sizing the critical update so it fits within a single disk sector
3. Keeping the previous root intact, so a crash mid-update simply falls back to it
The disk hardware guarantees that a single sector write either completes fully or not at all—there's no partial sector write. By reducing all changes to a single pointer update, COW eliminates partial states entirely.
COW file systems are always consistent by construction. After a crash, no repair is needed—the file system simply uses whichever root pointer represents the last valid state. Mount time is constant, regardless of file system size. No more waiting hours for fsck to complete.
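As a rough illustration, here is a minimal sketch of that mount-time logic, assuming a small ring of root slots and a `checksum_of` helper (both hypothetical, loosely modeled on ZFS's überblock array):

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_ROOT_SLOTS 4    /* ZFS, for comparison, keeps 128 slots */

typedef struct root_block {
    uint64_t generation;    /* transaction number when this root was written */
    uint64_t tree_addr;     /* physical address of the tree root */
    uint64_t checksum;      /* checksum over the rest of this root block */
} root_block_t;

extern uint64_t checksum_of(const root_block_t *rb);   /* assumed helper */

/* Mount-time "recovery" is just a scan: pick the valid root with the
 * highest generation. A torn root write fails its checksum and is
 * skipped, falling back to the previous committed state. O(1) work,
 * independent of file system size. */
const root_block_t *select_root(const root_block_t slots[NUM_ROOT_SLOTS]) {
    const root_block_t *best = NULL;
    for (size_t i = 0; i < NUM_ROOT_SLOTS; i++) {
        if (checksum_of(&slots[i]) != slots[i].checksum)
            continue;                  /* partial write: ignore this slot */
        if (best == NULL || slots[i].generation > best->generation)
            best = &slots[i];
    }
    return best;    /* the last fully committed state, whatever it is */
}
```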
COW file systems typically organize data using tree structures, often resembling Merkle trees (hash trees). This design is not coincidental—it's essential for efficient COW operations.
Why trees work for COW:
Consider a file system organized as a tree: a root node points to directory nodes Dir A and Dir B; Dir A contains File 1, and Dir B contains File 2.

When you modify a leaf node in a COW tree:

1. Write the new version of the leaf to a fresh location
2. Write a new version of its parent, pointing at the new leaf
3. Repeat up each level until a new root is written
This is called path copying—you copy the path from the modified leaf to the root. Importantly, unchanged subtrees are shared between the old and new states.
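A compact sketch of path copying in C (hypothetical in-memory node type; `find_child_slot` is an assumed search helper):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FANOUT 16

typedef struct node {
    int          is_leaf;
    uint64_t     keys[FANOUT];
    struct node *children[FANOUT];   /* NULL slots unused */
    /* leaf payload omitted */
} node_t;

extern int find_child_slot(const node_t *n, uint64_t key);  /* assumed helper */

/* Path copying: returns a NEW root. Only nodes on the root-to-leaf
 * path are duplicated; every untouched subtree is shared by pointer
 * between the old and new versions. The old root stays fully valid. */
node_t *cow_update(node_t *n, uint64_t key, node_t *new_leaf) {
    if (n->is_leaf)
        return new_leaf;                 /* the replacement leaf IS the copy */

    node_t *copy = malloc(sizeof *copy);
    memcpy(copy, n, sizeof *copy);       /* shallow copy: children shared */

    int slot = find_child_slot(n, key);
    copy->children[slot] = cow_update(n->children[slot], key, new_leaf);
    return copy;                         /* a new node on the copied path */
}
```

The old root remains a complete, valid tree throughout; keeping a pointer to it is, in essence, all a snapshot is.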
Block sharing and efficiency:
In this structure, modifying File 1 requires writing:

- A new block for File 1's data
- A new Dir A node pointing at the new File 1
- A new root pointing at the new Dir A
But Dir B and File 2 are unchanged and shared. The new state simply points to the same disk blocks. This sharing is fundamental to COW efficiency:
| Operation | Blocks Written | Blocks Shared |
|---|---|---|
| Modify 1 byte in 1 file | O(log n) path blocks | All other blocks |
| Modify 100 files | O(100 × log n) | All unmodified files |
| Create snapshot | O(1) - just root pointer | Entire file system |
The logarithmic write amplification is the price of consistency. For typical file systems where tree depth is 3-5 levels, this means 3-5 blocks written per modification—a very reasonable overhead.
```c
/* Simplified COW tree node structure */
typedef struct cow_node {
    uint64_t checksum;          /* Self-validating checksum */
    uint64_t generation;        /* Transaction ID when written */
    uint64_t logical_address;   /* Logical block address */
    uint64_t physical_address;  /* Actual disk location */
    uint32_t flags;             /* Node type and state flags */
    uint32_t num_pointers;      /* Number of child pointers */

    /* Variable-length array of child pointers */
    struct cow_pointer {
        uint64_t key;           /* Search key for this subtree */
        uint64_t physical_addr; /* Physical address of child */
        uint64_t generation;    /* Expected generation of child */
        uint64_t checksum;      /* Expected checksum of child */
    } pointers[];
} cow_node_t;

/* COW modification pseudocode */
void cow_modify_block(cow_tree_t *tree, uint64_t block_id, void *new_data) {
    /* 1. Allocate new block for modified data */
    uint64_t new_phys = allocate_block(tree->pool);

    /* 2. Write new data with checksum */
    cow_node_t *new_node = create_node(new_data);
    new_node->checksum = calculate_checksum(new_node);
    new_node->generation = tree->current_txn;
    write_block(new_phys, new_node);

    /* 3. COW up the tree path */
    cow_update_path_to_root(tree, block_id, new_phys);

    /* 4. Old block now orphaned - will be freed by garbage collection */
}
```

COW file systems typically embed checksums and generation numbers directly in each block. This makes blocks self-validating—if a block's checksum doesn't match its contents, or its generation doesn't match what the parent expects, the corruption is immediately detected. This is impossible with in-place update systems, where you can't distinguish old data from corrupted data.
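To complement the write path above, here is a sketch of how a self-validating read might work, checking a child block against the checksum and generation its parent recorded (same hypothetical types and helpers as the pseudocode above):

```c
/* Read a child block and validate it against what the parent recorded.
 * Detection works because the parent stores the EXPECTED checksum and
 * generation of each child, written in the same transaction. */
cow_node_t *cow_read_child(cow_tree_t *tree, const struct cow_pointer *ptr) {
    cow_node_t *node = read_block(ptr->physical_addr);

    if (calculate_checksum(node) != ptr->checksum)
        return handle_corruption(tree, ptr);   /* bit rot or torn write */

    if (node->generation != ptr->generation)
        return handle_corruption(tree, ptr);   /* stale or phantom block */

    return node;   /* provably the block the parent meant to reference */
}
```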
One of COW's most powerful properties is transaction atomicity. Because all changes are staged in new locations before committing via a root pointer update, COW naturally supports transactions spanning multiple files and metadata operations.
Transaction groups:
Modern COW file systems batch multiple operations into transaction groups (TXG in ZFS terminology):

1. **Open:** incoming writes accumulate in memory and are assigned to the current group
2. **Quiescing:** the group stops accepting new operations and waits for in-flight ones to finish
3. **Syncing:** all of the group's new blocks are written to disk, ending with the atomic root pointer update
Until the commit phase completes, none of the modifications in the transaction group are visible. Either all modifications commit together, or none do. This provides all-or-nothing semantics that are incredibly valuable:
| Scenario | Traditional FS Outcome | COW FS Outcome |
|---|---|---|
| Crash during file creation | Possible orphaned inode, leaked blocks | File either exists completely or not at all |
| Crash during file move | File in neither directory, or in both | File in source OR destination, never limbo |
| Crash during concurrent writes | Files may have mixed old/new data | Each file consistent (possibly old version) |
| Crash during metadata update | Incorrect sizes, timestamps, permissions | Metadata matches data state exactly |
| Crash during large write | File truncated or corrupted mid-content | Previous complete version preserved |
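To make the transaction-group batching concrete, here is a minimal sketch of one commit pass (all types and helpers are hypothetical; real implementations such as ZFS's sync task are far more involved):

```c
#include <stdint.h>
#include <stddef.h>

struct dirty { struct block *blk; struct dirty *next; };  /* modified blocks */

typedef struct txg {
    uint64_t      id;           /* monotonically increasing group number */
    struct dirty *dirty_list;   /* everything modified in this group */
} txg_t;

typedef struct fs_state {
    txg_t *open_txg;            /* group currently accepting writes */
} fs_state_t;

extern txg_t *txg_create(uint64_t id);
extern void   txg_quiesce(txg_t *g);
extern void   write_new_location(fs_state_t *fs, struct dirty *d);
extern void   flush_disk_caches(fs_state_t *fs);
extern void   write_root_pointer(fs_state_t *fs, uint64_t txg_id);

/* One pass of the commit loop: runs every few seconds (ZFS defaults
 * to 5 via zfs_txg_timeout) or when enough dirty data accumulates. */
void txg_sync_one(fs_state_t *fs) {
    txg_t *group = fs->open_txg;

    fs->open_txg = txg_create(group->id + 1);  /* new writes go here now */
    txg_quiesce(group);                        /* wait for in-flight ops */

    /* Write every dirty block of this group to NEW disk locations.
     * Nothing written here is reachable from the current root yet. */
    for (struct dirty *d = group->dirty_list; d != NULL; d = d->next)
        write_new_location(fs, d);

    flush_disk_caches(fs);

    /* The single atomic step: publish the new root. Before this write
     * lands, a crash loses the whole group cleanly; after it, the whole
     * group is durable. There is no in-between. */
    write_root_pointer(fs, group->id);
}
```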
The copy-on-write contract:
COW file systems provide a fundamental guarantee that can be stated simply:
At any instant, the on-disk state represents a valid, complete point-in-time view of the file system.
There are no intermediate states. The transition from one valid state to the next is atomic. This guarantee holds regardless of:

- When the crash occurs, down to the granularity of a single sector write
- How many operations were in flight at the time
- Whether the failure is a power loss, a kernel panic, or a yanked drive cable
This is a dramatically stronger guarantee than journaling provides. Journaling ensures you can recover to a consistent state; COW ensures you never leave a consistent state.
ZFS commits transaction groups every 5 seconds by default (configurable via zfs_txg_timeout). This means at most 5 seconds of work is lost in a crash—but that work is lost cleanly, without corruption. The tradeoff between sync frequency and performance is a key tuning parameter in COW file systems.
Copy-on-Write's "never overwrite" philosophy creates a unique challenge: dead data accumulates. When you modify a block, the old version isn't immediately freed—it remains on disk until the system confirms it's no longer needed.
Why immediate freeing is dangerous:
In a COW system, consider this sequence:

1. Block A holds the current version of some data; the committed root references it
2. A modification writes the new version to a fresh block A'
3. The system crashes before the root pointer is updated
4. On reboot, the root still points to block A, so the old version must still be intact
If we had freed block A immediately after writing A', the file system would be corrupted after the crash. We'd have a root pointing to freed space.
The solution: Reference counting and garbage collection
COW file systems track references to each block:

- A reference count: how many trees (the live file system, snapshots, clones) point to the block
- A birth transaction: when the block was allocated
- A death transaction: when the block was superseded, if ever
Garbage collection runs periodically or continuously, identifying and reclaiming dead blocks.
```c
/* Block reference tracking in COW file systems */
struct block_reference {
    uint64_t block_addr;  /* Physical block address */
    uint32_t ref_count;   /* Number of references to this block */
    uint64_t birth_txg;   /* Transaction when block was allocated */
    uint64_t death_txg;   /* Transaction when block was obsoleted (0 if live) */
};

/* When a block is modified (COW triggered) */
void on_block_modify(cow_fs_t *fs, uint64_t old_block, uint64_t new_block) {
    /* New block starts with refcount of 1 */
    set_refcount(fs, new_block, 1);
    set_birth_txg(fs, new_block, fs->current_txg);

    /* Old block's death is recorded, but refcount unchanged */
    set_death_txg(fs, old_block, fs->current_txg);

    /* Note: old_block is NOT freed yet!
     * It may still be referenced by snapshots or in-flight transactions */
}

/* When a snapshot is deleted */
void on_snapshot_delete(cow_fs_t *fs, snapshot_t *snap) {
    /* Walk the snapshot's tree */
    for_each_block(snap, block) {
        /* Decrement reference count */
        if (--block->ref_count == 0) {
            /* Block is now garbage - add to free list */
            add_to_free_list(fs, block);
        }
    }
}

/* Garbage collection pass */
void gc_collect(cow_fs_t *fs) {
    /* Find blocks where:
     *  - death_txg > 0 (block was superseded)
     *  - death_txg < oldest_active_txg (no transaction can reference it)
     *  - ref_count == 0 (no snapshots reference it)
     */
    for_each_dead_block(fs, block) {
        if (block->death_txg < fs->oldest_txg && block->ref_count == 0) {
            reclaim_block(fs, block);
        }
    }
}
```

Space amplification considerations:
COW file systems experience write amplification (more data written than changed) and potentially space amplification (more space used than traditional file systems for the same data). Understanding these effects is crucial:
| Factor | Impact | Mitigation |
|---|---|---|
| Pending garbage | Old blocks not yet freed | Aggressive GC scheduling |
| Snapshot retention | Old versions retained for snapshots | Careful snapshot policies |
| Fragmentation | Sequential writes scatter across disk | Periodic rebalancing |
| Metadata overhead | Checksums, tree nodes | Larger block sizes |
| Write amplification | O(log n) blocks per write | Batching writes in TXGs |
Most COW file systems recommend keeping 10-20% free space for efficient operation. As the file system fills, garbage collection becomes more expensive and performance degrades.
A COW file system that reaches 100% capacity can become completely stuck. Every write requires allocating new space, but there's no space to allocate. Unlike traditional file systems where you can modify existing files on a full disk, COW requires space for new blocks. Always maintain adequate free space reserves.
The Copy-on-Write concept didn't emerge in isolation. It evolved from decades of research in databases, virtual memory systems, and version control. Understanding this history illuminates why COW works so well.
Origins in virtual memory:
The term "copy-on-write" first appeared in virtual memory systems during the 1970s. When a Unix process calls fork(), the child process traditionally needed a complete copy of the parent's address space. Early systems duplicated every page—expensive for large processes.
COW optimization changed this:

- fork() gives the child page tables that point at the parent's physical pages, with all of them marked read-only
- When either process writes to a shared page, the hardware raises a page fault
- Only then does the kernel copy that single page and make the copy writable
This same principle—defer copying until modification—forms the foundation of COW file systems.
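A tiny demonstration of the principle (Unix-specific; the page-level sharing happens invisibly inside the kernel):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define SIZE (256 * 1024 * 1024)    /* 256 MB */

int main(void) {
    char *big = malloc(SIZE);
    if (!big) return 1;
    memset(big, 'x', SIZE);          /* touch every page in the parent */

    pid_t pid = fork();              /* child "copies" 256 MB instantly: */
    if (pid == 0) {                  /* every page is shared, read-only  */
        big[0] = 'y';                /* page fault: kernel copies ONE 4 KB
                                        page; the rest stay shared */
        printf("child sees: %c\n", big[0]);      /* prints 'y' */
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees: %c\n", big[0]);   /* prints 'x' */
    return 0;
}
```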
Database influence:
Database systems pioneered many techniques later adopted by COW file systems:

- Shadow paging: update a copy of a page, then atomically swap in a new page table; the most direct ancestor of COW file system design
- Write-ahead logging: the inspiration for file system journaling, COW's main alternative
- Multi-version concurrency control (MVCC): readers keep seeing a consistent old version while writers construct a new one
COW file systems essentially bring database-grade transactional semantics to general-purpose storage.
| Year | System | Key Innovation |
|---|---|---|
| 1988 | Episode/Alliance (IBM) | First COW file system concepts for distributed storage |
| 1991 | WAFL (NetApp) | Production COW for network-attached storage |
| 2001 | NILFS (NTT) | Log-structured + COW for continuous snapshotting |
| 2005 | ZFS (Sun Microsystems) | Full COW with integrated volume management, checksums |
| 2007 | Btrfs (Oracle) | Linux-native COW with subvolumes and snapshots |
| 2009 | HAMMER (DragonFlyBSD) | COW with historical data access |
| 2017 | APFS (Apple) | COW for macOS/iOS with space sharing |
| 2017 | Bcachefs (Linux) | COW with integrated caching layer |
The ZFS watershed moment:
ZFS, developed at Sun Microsystems and released in 2005, represented a paradigm shift. It wasn't just a file system—it was a complete storage stack:

- Pooled storage that replaces separate volume managers and partitions
- End-to-end checksums on every block, data and metadata alike
- Built-in RAID (RAID-Z) without the traditional write hole
- Native snapshots, clones, and send/receive replication
ZFS proved that COW could power enterprise storage at scale, handling petabytes of data while maintaining integrity guarantees that traditional file systems couldn't match.
The modern landscape:
Today, COW file systems are standard in many environments:

- APFS is the default on every modern Apple device
- Btrfs is the default root file system in openSUSE and Fedora
- ZFS is a mainstay of FreeBSD and of enterprise storage appliances
The question is no longer "should we use COW?" but "which COW file system best fits our needs?"
COW principles appear throughout computing: B-tree databases use COW for consistency (LMDB), container runtimes use COW for image layers (OverlayFS), and version control systems use COW-like techniques for branches (Git's immutable objects). Understanding COW in file systems prepares you to recognize these patterns everywhere.
To cement our understanding, let's directly compare COW file systems with the alternatives, examining specific technical characteristics:
Consistency mechanisms compared:
| Aspect | In-Place (ext2) | Journaling (ext4) | Soft Updates (FFS) | Copy-on-Write (ZFS) |
|---|---|---|---|---|
| Crash consistency | Requires fsck | Journal replay | Careful ordering | Always consistent |
| Recovery time | O(filesystem size) | O(journal size) | Moderate | O(1) - constant |
| Data integrity | Silent corruption possible | Metadata only usually | No checksums | End-to-end checksums |
| Atomicity scope | Single operation | Single transaction | Ordered operations | Full transaction groups |
| Implementation complexity | Simple | Moderate | Very high | High but contained |
| Snapshot support | Requires LVM | Requires LVM | Not native | Native, efficient |
| Space overhead | Minimal | Journal space | Moderate | ~10% recommended |
Write path comparison:
Let's trace what happens when writing 4KB to the middle of a file in each system (simplified; exact ordering varies with mount options):

ext2 (in-place):

1. Overwrite the data block in place
2. Overwrite the inode (size, timestamps) in place

A crash between or during these writes leaves an inconsistency for fsck to find.

ext4 (metadata journaling, data=ordered):

1. Read the affected block and modify it in memory
2. Begin a journal transaction for the metadata updates
3. Write the new data block in place, overwriting the old contents
4. Write the updated metadata (inode, bitmaps) to the journal
5. Write the journal commit record
6. Later, checkpoint the journaled metadata to its home locations

ZFS (copy-on-write):

1. Write the new data to a freshly allocated block
2. Copy the indirect and metadata blocks on the path to it, also to fresh locations
3. Atomically update the überblock

Key limitation of the journaling path: the data itself is still written in place at step 3. A failure there can corrupt file contents even though the metadata is protected. Most ext4 configurations journal only metadata (data=ordered), leaving data vulnerable. ZFS, by contrast, never overwrites live data at any step.
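Applications on traditional file systems approximate COW semantics in user space with the classic write-then-rename pattern; a minimal sketch (the `atomic_replace` helper is hypothetical):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* User-space COW analogy: never overwrite the old file. Write a complete
 * new copy, flush it, then atomically swap names. rename(2) is atomic, so
 * readers see either the whole old file or the whole new one, never a mix. */
int atomic_replace(const char *path, const char *tmp_path,
                   const void *data, size_t len) {
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    return rename(tmp_path, path);   /* the single atomic "root update" */
}
```

The rename plays the same role as the COW root pointer update: all new state is staged off to the side, and one atomic operation publishes it.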
We've explored the Copy-on-Write concept from first principles. Let's consolidate the essential understanding:
The transformative insight:
Copy-on-Write transforms file system consistency from a problem to be solved into a property guaranteed by construction. Instead of asking "how do we recover from inconsistent states?", COW asks "how do we ensure we never enter an inconsistent state?"
The answer—never overwrite, always copy, commit atomically—is elegant in principle and powerful in practice.
Looking ahead:
With the fundamental COW concept established, we're ready to explore its most revolutionary capability: snapshots. In the next page, we'll see how COW enables instant, space-efficient point-in-time copies that would be impossible with in-place update systems.
You now understand the Copy-on-Write paradigm at a fundamental level. You can explain why COW eliminates crash-consistency concerns, how atomic pointer updates enable transaction semantics, and why modern file systems increasingly adopt this approach. Next, we'll explore COW's killer feature: instant snapshots.