Traditional free space management techniques—bitmaps, linked lists, counting—all share a fundamental vulnerability: they're modified in place. Every allocation or deallocation changes the existing data structure on disk. This creates a window where a crash can leave the free space metadata in an inconsistent state, potentially causing data loss or corruption.
Space maps, pioneered by the ZFS file system (Sun Microsystems, 2005), take a radically different approach: they treat free space changes as a log of operations rather than immediate updates to a bitmap or tree. Combined with ZFS's copy-on-write (COW) architecture, space maps achieve constant-time recording of allocations and frees, inherent crash consistency, and scalable management of very large pools.
Space maps represent the state-of-the-art in free space management, synthesizing lessons from decades of file system development into an elegant, unified design. Understanding them reveals both the pinnacle of traditional techniques and the paradigm shift enabled by copy-on-write file systems.
By the end of this page, you will understand the copy-on-write foundation that enables space maps, how space maps log allocations and frees rather than updating data structures, the condensing process that converts logs into sorted trees, metaslab allocation for scalable management, crash consistency guarantees, and how space maps compare to traditional approaches.
Before diving into space maps specifically, we must understand the copy-on-write (COW) architecture that enables them.
Traditional (In-Place) Updates:
Before write:
Block 100: [old data]
During write:
Block 100: [partial new data] ← CRASH HERE = CORRUPTION!
After write:
Block 100: [new data]
If the system crashes mid-write, block 100 contains neither old data nor valid new data—it's corrupted.
Copy-on-Write Updates:
Before write:
Block 100: [old data]
During write:
Block 100: [old data] (unchanged)
Block 200: [new data] (write to NEW location)
Commit (atomic pointer update):
Parent now points to 200 instead of 100
After commit:
Block 100: [old data] (now free)
Block 200: [new data]
Key insight: The old data is never modified. We write new data to a fresh location, then atomically update the pointer. A crash at any point either leaves the old state intact (if before atomic commit) or the new state intact (if after commit).
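The pattern can be sketched in a few lines of C. This is illustration only: the allocate_new_block, write_block, flush, and schedule_free helpers are hypothetical stand-ins, and in real ZFS the parent block is itself rewritten copy-on-write (all the way up to the überblock) rather than updated in place.

/**
 * Illustrative COW update (not ZFS code). Helpers are hypothetical;
 * in ZFS the "pointer update" is itself a COW rewrite of the parent,
 * made visible only by the atomic überblock update at TXG commit.
 */
#include <stdint.h>
#include <stddef.h>

uint64_t allocate_new_block(void);
void     write_block(uint64_t blkno, const void *data, size_t len);
void     flush(void);
void     schedule_free(uint64_t blkno);

typedef struct {
    uint64_t child_blkno;   // the only pointer to the data being updated
} Parent;

void cow_update(Parent *parent, const void *new_data, size_t len) {
    // 1. Write new data to a fresh block; the old block is never touched
    uint64_t new_blkno = allocate_new_block();
    write_block(new_blkno, new_data, len);
    flush();                          // new data is durable before we commit

    // 2. Commit: repoint the parent (an atomic 8-byte update in this sketch)
    uint64_t old_blkno = parent->child_blkno;
    parent->child_blkno = new_blkno;
    flush();

    // 3. Only after the commit is the old block a candidate for reuse
    schedule_free(old_blkno);
}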
How This Applies to Free Space:
With COW, we can't update a bitmap in place—that would violate the COW principle. Instead, ZFS logs free space changes as part of the transaction, and those changes become durable only when the entire transaction commits.
COW trades in-place updates for additional writes (writing to new locations plus updating parent pointers). However, this enables perfect crash consistency, instant snapshots (old blocks aren't freed while referenced by snapshots), and checksummed self-healing data structures. For ZFS, these benefits far outweigh the extra writes.
Transaction Group Model:
ZFS batches operations into transaction groups (TXGs). Every 5-30 seconds (configurable), a TXG commits, atomically writing all accumulated changes—data, metadata, and free space updates—to disk.
Space maps exist within this transaction model—changes to free space are recorded as part of the TXG commit process, ensuring they're always consistent with the rest of the pool state.
A space map is a log-structured data structure that records free space changes over time. Rather than maintaining a current snapshot of free/allocated status, it records the history of allocations and frees.
Space Map Entry Format:
Each entry describes an operation (allocate or free) on a range of blocks:
┌────────────────────────────────────────────────────────────┐
│ SPACE MAP ENTRY (64 bits) │
├────────┬────────────────────────────────────────┬──────────┤
│ Type │ Offset (Block Number) │ Run │
│ A/F │ (variable bits) │ Length │
│ 1 bit │ (typically 47 bits) │ (16 bits)│
└────────┴────────────────────────────────────────┴──────────┘
Type: 0 = ALLOCATE, 1 = FREE
Offset: Starting block number of the range
Run Length: Number of contiguous blocks in this operation
Example Space Map Log:
Transaction 1: Create file, allocate blocks 1000-1099
Entry: [ALLOC, 1000, 100]
Transaction 2: Delete file, free blocks 1000-1099
Entry: [FREE, 1000, 100]
Transaction 3: Create new file, allocate blocks 500-549
Entry: [ALLOC, 500, 50]
Space Map after 3 transactions:
[ALLOC, 1000, 100]
[FREE, 1000, 100]
[ALLOC, 500, 50]
Key Observation: The log contains redundant information! The first two entries cancel out. Processing the log produces the actual free space state, but accumulating entries forever would be wasteful.
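To see the cancellation concretely, here is a tiny self-contained replay of the three-entry log above. It is illustration only: it uses a plain allocated-flag array rather than ZFS's range trees.

/**
 * Replay the three-entry example log over a small region.
 * The first two entries cancel; only blocks 500-549 end up in use.
 */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 2000

static bool allocated[NBLOCKS];   // false = free

static void apply(bool is_free, int offset, int run) {
    for (int b = offset; b < offset + run; b++)
        allocated[b] = !is_free;
}

int main(void) {
    memset(allocated, 0, sizeof(allocated));   // metaslab starts fully free

    apply(false, 1000, 100);   // [ALLOC, 1000, 100]
    apply(true,  1000, 100);   // [FREE,  1000, 100]  cancels the first entry
    apply(false,  500,  50);   // [ALLOC,  500,  50]

    int in_use = 0;
    for (int b = 0; b < NBLOCKS; b++)
        in_use += allocated[b];

    printf("blocks in use after replay: %d (blocks 500-549)\n", in_use);   // prints 50
    return 0;
}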
#include <stdint.h>
#include <stdbool.h>

/**
 * Space Map Entry encoding
 *
 * ZFS uses variable-width encoding for efficiency.
 * Simplified here as fixed 64-bit entries.
 */

#define SM_ALLOC 0
#define SM_FREE  1

typedef struct __attribute__((packed)) {
    uint64_t type   : 1;    // ALLOC (0) or FREE (1)
    uint64_t offset : 47;   // Starting block (supports 128 PB at 1K blocks)
    uint64_t run    : 16;   // Length (up to 65535 blocks per entry)
} SpaceMapEntry;

/**
 * Encode an allocation/free operation
 */
SpaceMapEntry sm_encode(bool is_free, uint64_t start, uint16_t count) {
    SpaceMapEntry entry;
    entry.type = is_free ? SM_FREE : SM_ALLOC;
    entry.offset = start;
    entry.run = count;
    return entry;
}

/**
 * Space Map on-disk structure (simplified)
 * In reality, ZFS uses a more complex block pointer structure.
 */
#define SM_BLOCK_ENTRIES (4096 / sizeof(SpaceMapEntry))   // 512 entries per 4KB block

typedef struct {
    SpaceMapEntry entries[SM_BLOCK_ENTRIES];
} SpaceMapBlock;

typedef struct {
    uint64_t object_id;         // ZFS object identifier
    uint64_t block_count;       // Number of blocks in this space map
    uint64_t entry_count;       // Total entries across all blocks
    uint64_t allocated_space;   // Computed: total allocated bytes
    uint64_t free_space;        // Computed: total free bytes
    uint64_t *block_pointers;   // Block pointers follow the header on disk
} SpaceMapHeader;

/**
 * Space Map in-memory representation
 * After loading and processing, entries are converted to range trees
 */
typedef struct {
    // In-memory range tree of free segments
    void *free_tree;            // AVL tree of (offset, length) pairs

    // Pending operations not yet written to disk
    SpaceMapEntry *pending;
    uint32_t pending_count;
    uint32_t pending_capacity;

    // Statistics
    uint64_t alloc_count;
    uint64_t free_count;
    uint64_t histogram[64];     // Distribution of free segment sizes
} SpaceMapLoaded;

Real ZFS space maps use variable-width encoding: small runs (the common case) use fewer bits, while large runs use more. This achieves better compression than the fixed 64-bit format shown. Additionally, debug entries can annotate runs of operations with the transaction group and sync pass that produced them.
Space maps support three primary operations: recording allocations, recording frees, and loading (processing the log to determine current state).
Recording Allocations/Frees (O(1)):
When blocks are allocated or freed, the operation is simply appended to the in-memory pending list:
void sm_record(SpaceMapLoaded *sm, bool is_free,
               uint64_t offset, uint32_t size) {
    // Grow the pending array if needed (error handling omitted)
    if (sm->pending_count == sm->pending_capacity) {
        sm->pending_capacity = sm->pending_capacity ? sm->pending_capacity * 2 : 64;
        sm->pending = realloc(sm->pending,
                              sm->pending_capacity * sizeof(SpaceMapEntry));
    }

    // Append to the pending list - amortized O(1)
    sm->pending[sm->pending_count++] = sm_encode(is_free, offset, size);

    // Update statistics
    if (is_free) {
        sm->free_count++;
    } else {
        sm->alloc_count++;
    }
}
No tree traversal, no bitmap updates—just an append. This is why allocations are so fast in ZFS.
Syncing to Disk:
At TXG commit time, pending entries are written as a new block appended to the space map object:
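A minimal sketch of what this sync might look like, building on the simplified types above. The exact signature and the allocate_new_block/disk_write helpers are assumptions for illustration, not ZFS's real interface.

/**
 * Sketch: flush pending entries at TXG commit.
 * Partially filled blocks stay zero-filled past the last entry,
 * which is the end marker that sm_load() checks for.
 */
#include <string.h>

uint64_t allocate_new_block(void *disk_handle);
void     disk_write(void *disk_handle, uint64_t blkno, const void *buf, size_t len);

void sm_sync_to_disk(SpaceMapLoaded *sm, SpaceMapHeader *header,
                     void *disk_handle) {
    uint32_t written = 0;

    while (written < sm->pending_count) {
        SpaceMapBlock block;
        memset(&block, 0, sizeof(block));

        // Copy up to one block's worth of pending entries
        uint32_t n = sm->pending_count - written;
        if (n > SM_BLOCK_ENTRIES) {
            n = SM_BLOCK_ENTRIES;
        }
        memcpy(block.entries, &sm->pending[written], n * sizeof(SpaceMapEntry));

        // COW: entries always go to a freshly allocated block, never in place
        uint64_t blkno = allocate_new_block(disk_handle);
        disk_write(disk_handle, blkno, &block, sizeof(block));

        // Record the new block (growth of the in-memory pointer array omitted)
        header->block_pointers[header->block_count++] = blkno;
        header->entry_count += n;
        written += n;
    }

    // The pending list can be reset once the TXG carrying these blocks commits
    sm->pending_count = 0;
}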
Loading (Processing the Log):
When mounting a pool or accessing a space map for the first time, the log must be "replayed" to determine current state:
for each entry in space_map_log:
if entry.type == ALLOC:
remove (entry.offset, entry.run) from free_tree
else: // FREE
add (entry.offset, entry.run) to free_tree
After processing, free_tree contains the current free space as a sorted tree of (offset, length) pairs—essentially the "counting" representation from the previous page.
/**
 * Space Map Operations
 */

#include <stdlib.h>      // malloc/free
#include "avl_tree.h"    // ZFS uses AVL trees for range management

typedef struct {
    avl_node_t node;     // AVL tree linkage
    uint64_t offset;     // Starting offset
    uint64_t size;       // Size in bytes/blocks
} range_seg_t;

// Forward declarations for the range tree helpers defined below
void range_tree_add(avl_tree_t *tree, uint64_t offset, uint64_t size);
void range_tree_remove(avl_tree_t *tree, uint64_t offset, uint64_t size);

/**
 * Load space map from disk and build in-memory free tree
 * Called once when metaslab is first accessed
 */
int sm_load(SpaceMapLoaded *sm, SpaceMapHeader *header, void *disk_handle) {
    // Initialize empty AVL tree
    avl_tree_t *free_tree = avl_create(range_compare, NULL, NULL);

    // Process each block of the space map log
    for (uint64_t blk = 0; blk < header->block_count; blk++) {
        SpaceMapBlock block;
        disk_read(disk_handle, header->block_pointers[blk], &block, sizeof(block));

        for (uint32_t i = 0; i < SM_BLOCK_ENTRIES; i++) {
            SpaceMapEntry *e = &block.entries[i];

            // Check for end marker
            if (e->offset == 0 && e->run == 0 && e->type == 0) {
                break;
            }

            if (e->type == SM_ALLOC) {
                // Remove range from free tree
                range_tree_remove(free_tree, e->offset, e->run);
            } else {
                // Add range to free tree (with coalescing)
                range_tree_add(free_tree, e->offset, e->run);
            }
        }
    }

    sm->free_tree = free_tree;
    return 0;
}

/**
 * Range tree operations with automatic coalescing
 */
void range_tree_add(avl_tree_t *tree, uint64_t offset, uint64_t size) {
    uint64_t end = offset + size;

    // Find adjacent ranges
    range_seg_t *left  = avl_find_predecessor(tree, offset);
    range_seg_t *right = avl_find_successor(tree, offset);

    bool merge_left  = (left && left->offset + left->size == offset);
    bool merge_right = (right && right->offset == end);

    if (merge_left && merge_right) {
        // Three-way merge
        left->size += size + right->size;
        avl_remove(tree, right);
        free(right);
    } else if (merge_left) {
        left->size += size;
    } else if (merge_right) {
        right->offset = offset;
        right->size += size;
    } else {
        // Insert new range
        range_seg_t *new_seg = malloc(sizeof(range_seg_t));
        new_seg->offset = offset;
        new_seg->size = size;
        avl_insert(tree, new_seg);
    }
}

void range_tree_remove(avl_tree_t *tree, uint64_t offset, uint64_t size) {
    uint64_t end = offset + size;

    range_seg_t *seg = avl_find_containing(tree, offset);
    if (!seg) {
        // Range not in tree - might be partial overlap
        // Handle edge cases...
        return;
    }

    uint64_t seg_end = seg->offset + seg->size;

    if (seg->offset == offset && seg->size == size) {
        // Exact match - remove entirely
        avl_remove(tree, seg);
        free(seg);
    } else if (seg->offset == offset) {
        // Remove from start
        seg->offset = end;
        seg->size -= size;
    } else if (seg_end == end) {
        // Remove from end
        seg->size -= size;
    } else {
        // Split in middle
        range_seg_t *right = malloc(sizeof(range_seg_t));
        right->offset = end;
        right->size = seg_end - end;
        avl_insert(tree, right);
        seg->size = offset - seg->offset;
    }
}

/**
 * Allocate from space map (using loaded free tree)
 */
uint64_t sm_alloc(SpaceMapLoaded *sm, uint64_t size) {
    // Find suitable range in free tree (first-fit, best-fit, etc.)
    range_seg_t *seg = find_suitable_range(sm->free_tree, size);
    if (!seg) {
        return 0;   // ENOSPC
    }

    uint64_t offset = seg->offset;

    // Remove allocated portion from tree
    range_tree_remove(sm->free_tree, offset, size);

    // Record in pending log - O(1)
    sm_record(sm, false, offset, size);

    return offset;
}

If we only appended entries to the space map log, it would grow without bound. Condensing is the process of periodically rewriting the space map to eliminate redundant entries.
When Condensing is Needed:
Condensing is triggered when the log holds far more entries than there are distinct free ranges, when the space map object grows past a size threshold, or when many ALLOC/FREE pairs cancel each other out—exactly the conditions checked by sm_should_condense below.
Condensing Algorithm:
1. Load current space map (process log to build range tree)
2. Walk range tree to enumerate all free ranges
3. Write new space map with single FREE entry per range
4. Update space map header to point to new blocks
5. Free old space map blocks
Before Condensing (10,000 entries):
[ALLOC 100, 50]
[FREE 100, 50]
[ALLOC 200, 100]
[ALLOC 500, 25]
[FREE 500, 25]
[ALLOC 300, 50]
[FREE 200, 100]
... (thousands more entries)
After Condensing (50 entries):
[FREE 0, 100] // Blocks 0-99 free
[FREE 250, 50] // Blocks 250-299 free
[FREE 350, 150] // Blocks 350-499 free
... (other free ranges)
The condensed form is equivalent to the "counting" representation—just free extents, no history.
/**
 * Space Map Condensing
 *
 * Rewrites space map as minimal set of FREE entries,
 * eliminating all historical ALLOC/FREE pairs.
 */

typedef struct {
    uint64_t entries_before;
    uint64_t entries_after;
    uint64_t blocks_before;
    uint64_t blocks_after;
    uint64_t time_ms;
} CondenseStats;

/**
 * Condense a space map to minimal representation
 */
CondenseStats sm_condense(SpaceMapLoaded *sm, SpaceMapHeader *header,
                          void *disk_handle) {
    CondenseStats stats = {0};
    uint64_t start_time = get_timestamp_ms();

    stats.entries_before = header->entry_count;
    stats.blocks_before  = header->block_count;

    // Step 1: Ensure space map is loaded
    if (!sm->free_tree) {
        sm_load(sm, header, disk_handle);
    }

    // Step 2: Count free ranges
    uint64_t range_count = avl_numnodes(sm->free_tree);

    // Step 3: Allocate new space map blocks
    uint64_t blocks_needed = (range_count + SM_BLOCK_ENTRIES - 1) / SM_BLOCK_ENTRIES;
    uint64_t *new_block_ptrs = malloc(blocks_needed * sizeof(uint64_t));

    // Step 4: Write condensed entries
    avl_iterator_t iter;
    avl_iter_init(&iter, sm->free_tree);

    uint64_t entry_idx = 0;
    SpaceMapBlock current_block;
    memset(&current_block, 0, sizeof(current_block));

    range_seg_t *seg;
    while ((seg = avl_iter_next(&iter)) != NULL) {
        uint64_t block_idx = entry_idx / SM_BLOCK_ENTRIES;
        uint64_t local_idx = entry_idx % SM_BLOCK_ENTRIES;

        // Write full block if needed
        if (local_idx == 0 && entry_idx > 0) {
            new_block_ptrs[block_idx - 1] = allocate_new_block(disk_handle);
            disk_write(disk_handle, new_block_ptrs[block_idx - 1],
                       &current_block, sizeof(current_block));
            memset(&current_block, 0, sizeof(current_block));
        }

        // Create FREE entry for this range
        // (assumes each range fits in a 16-bit run; larger ranges would be split)
        current_block.entries[local_idx] = sm_encode(true, seg->offset, seg->size);
        entry_idx++;
    }

    // Write final partial block
    if (entry_idx > 0) {
        uint64_t last_block = (entry_idx - 1) / SM_BLOCK_ENTRIES;
        new_block_ptrs[last_block] = allocate_new_block(disk_handle);
        disk_write(disk_handle, new_block_ptrs[last_block],
                   &current_block, sizeof(current_block));
    }

    // Step 5: Update header (atomically via TXG)
    uint64_t *old_blocks = header->block_pointers;
    uint64_t old_count = header->block_count;

    header->block_pointers = new_block_ptrs;
    header->block_count = blocks_needed;
    header->entry_count = range_count;

    // Step 6: Mark old blocks for freeing (after TXG commit)
    for (uint64_t i = 0; i < old_count; i++) {
        schedule_free(old_blocks[i]);
    }

    stats.entries_after = range_count;
    stats.blocks_after  = blocks_needed;
    stats.time_ms = get_timestamp_ms() - start_time;

    return stats;
}

/**
 * Check if condensing is worthwhile
 */
bool sm_should_condense(SpaceMapLoaded *sm, SpaceMapHeader *header) {
    // Condense if:

    // 1. Entry count is much larger than range count
    uint64_t expected_ranges = avl_numnodes(sm->free_tree);
    if (header->entry_count > expected_ranges * 10) {
        return true;
    }

    // 2. Space map exceeds size threshold (e.g., 1MB)
    if (header->block_count * 4096 > 1024 * 1024) {
        return true;
    }

    // 3. Alloc/free counts indicate many canceling pairs
    uint64_t canceling = (sm->alloc_count < sm->free_count)
                             ? sm->alloc_count : sm->free_count;
    if (canceling > header->entry_count / 2) {
        return true;
    }

    return false;
}

Condensing is essentially converting from log form to counting form, then back to log form (with only FREE entries). It's similar to LSM-tree compaction in databases: accept write amplification periodically to maintain read performance. ZFS triggers condensing based on space map size and fragmentation metrics, balancing immediate write cost against future load time.
Space maps don't manage an entire vdev (virtual device) directly. Instead, ZFS divides each vdev into metaslabs—self-contained allocation units, each with its own space map.
Metaslab Structure:
┌─────────────────────────────────────────────────────────────┐
│ VDEV (e.g., 1TB disk) │
├───────────────┬───────────────┬───────────────┬─────────────┤
│ Metaslab 0 │ Metaslab 1 │ Metaslab 2 │ ... │
│ (128 MB) │ (128 MB) │ (128 MB) │ │
├───────────────┼───────────────┼───────────────┼─────────────┤
│ ┌───────────┐ │ ┌───────────┐ │ ┌───────────┐ │ │
│ │Space Map 0│ │ │Space Map 1│ │ │Space Map 2│ │ │
│ └───────────┘ │ └───────────┘ │ └───────────┘ │ │
│ ┌───────────┐ │ ┌───────────┐ │ ┌───────────┐ │ │
│ │Data Blocks│ │ │Data Blocks│ │ │Data Blocks│ │ │
│ └───────────┘ │ └───────────┘ │ └───────────┘ │ │
└───────────────┴───────────────┴───────────────┴─────────────┘
Why Metaslabs?
Dividing a vdev into metaslabs keeps each space map small enough to load, condense, and sync quickly; lets ZFS load space maps lazily, only for the metaslabs actually in use; and allows concurrent allocations to proceed in different metaslabs in parallel.
Metaslab Selection:
When allocating, ZFS must choose which metaslab to use:
/**
 * Metaslab Allocator
 *
 * ZFS allocates from metaslabs, not directly from vdevs.
 * Each metaslab has its own space map for tracking.
 */

typedef struct {
    uint64_t ms_id;                  // Metaslab identifier
    uint64_t ms_start;               // Starting offset in vdev
    uint64_t ms_size;                // Total size
    uint64_t ms_free;                // Current free space
    SpaceMapLoaded *ms_sm;           // Space map (NULL if not loaded)
    bool ms_loaded;                  // Whether space map is loaded
    bool ms_loading;                 // Loading in progress
    void *vdev_handle;               // I/O handle for the owning vdev

    // Allocation state
    uint64_t ms_weight;              // Selection priority (based on free space)
    uint64_t ms_activation_count;    // Times selected for allocation

    // Statistics
    uint64_t ms_histogram[64];       // Free segment size distribution
} Metaslab;

typedef struct {
    uint64_t vdev_id;
    uint64_t vdev_size;
    Metaslab **vd_metaslabs;
    uint32_t vd_ms_count;

    // Allocation groups (for parallelism)
    uint32_t vd_mg_count;
} Vdev;

void metaslab_load(Metaslab *ms);    // defined below

/**
 * Select metaslab for allocation
 * Uses weighted selection based on free space
 */
Metaslab* vdev_select_metaslab(Vdev *vd, uint64_t size) {
    Metaslab *best = NULL;
    uint64_t best_weight = 0;

    for (uint32_t i = 0; i < vd->vd_ms_count; i++) {
        Metaslab *ms = vd->vd_metaslabs[i];

        // Skip if not enough free space
        if (ms->ms_free < size) {
            continue;
        }

        // Calculate weight (free space minus overhead to load)
        uint64_t weight = ms->ms_free;
        if (!ms->ms_loaded) {
            // Penalty for needing to load space map
            weight = weight * 90 / 100;
        }

        if (weight > best_weight) {
            best_weight = weight;
            best = ms;
        }
    }

    return best;
}

/**
 * Allocate from a specific metaslab
 */
uint64_t metaslab_alloc(Metaslab *ms, uint64_t size) {
    // Ensure space map is loaded
    if (!ms->ms_loaded) {
        metaslab_load(ms);
    }

    // Allocate from space map's free tree
    uint64_t offset = sm_alloc(ms->ms_sm, size);

    if (offset != 0) {
        ms->ms_free -= size;
        ms->ms_activation_count++;
        recalculate_weight(ms);
    }

    return offset;   // Returns absolute vdev offset
}

/**
 * Free space within a metaslab
 */
void metaslab_free(Metaslab *ms, uint64_t offset, uint64_t size) {
    // Load if necessary
    if (!ms->ms_loaded) {
        // For frees, we can defer to space map sync
        // Just record in a deferred free list
        defer_free(ms, offset, size);
        return;
    }

    // Add to free tree
    range_tree_add(ms->ms_sm->free_tree, offset, size);
    sm_record(ms->ms_sm, true, offset, size);

    ms->ms_free += size;
    recalculate_weight(ms);
}

/**
 * Lazy loading: load space map only when metaslab is accessed
 */
void metaslab_load(Metaslab *ms) {
    if (ms->ms_loaded || ms->ms_loading) {
        return;
    }

    ms->ms_loading = true;

    // Allocate space map structure
    ms->ms_sm = malloc(sizeof(SpaceMapLoaded));
    memset(ms->ms_sm, 0, sizeof(SpaceMapLoaded));

    // Read space map header
    SpaceMapHeader header;
    read_space_map_header(ms, &header);

    // Load and process the space map log
    sm_load(ms->ms_sm, &header, ms->vdev_handle);

    // Build histogram for allocation strategy
    build_free_histogram(ms);

    ms->ms_loading = false;
    ms->ms_loaded = true;
}

| Vdev Size | Metaslabs | Metaslab Size | Space Maps in Memory (if all loaded) |
|---|---|---|---|
| 100 GB | ~780 | 128 MB | ~50 MB |
| 1 TB | ~8,000 | 128 MB | ~500 MB |
| 10 TB | ~80,000 | 128 MB | ~5 GB |
| 100 TB | ~780,000 | 128 MB | ~50 GB |
Each loaded metaslab space map consumes memory for its AVL tree of free ranges. With very large pools, ZFS can unload inactive metaslab space maps to conserve memory, reloading them when needed. This is why lazy loading is crucial—only active metaslabs consume memory resources.
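As a rough sanity check on the table above, here is a small back-of-the-envelope calculation. The ~1,000 free ranges per metaslab and ~64 bytes per in-memory range node are assumed figures, chosen only to illustrate the scale.

/**
 * Back-of-the-envelope estimate for the 1 TB row of the table
 * (assumed fragmentation and per-node overhead; not measured values).
 */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t vdev_bytes     = 1ULL << 40;        // 1 TB vdev
    uint64_t metaslab_bytes = 128ULL << 20;      // 128 MB metaslabs
    uint64_t metaslabs      = vdev_bytes / metaslab_bytes;   // 8,192

    uint64_t ranges_per_ms  = 1000;              // assumed free ranges per metaslab
    uint64_t bytes_per_seg  = 64;                // assumed AVL node + bookkeeping
    uint64_t mem_if_loaded  = metaslabs * ranges_per_ms * bytes_per_seg;

    printf("%llu metaslabs, ~%llu MB if every space map is loaded\n",
           (unsigned long long)metaslabs,
           (unsigned long long)(mem_if_loaded >> 20));   // ~8192, ~500 MB
    return 0;
}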
Space maps inherit ZFS's copy-on-write crash consistency without requiring separate journaling. The key insight is that space map changes are part of the atomic TXG commit.
Why Space Maps are Inherently Consistent:
No in-place updates: Space map blocks are never modified; new entries go to new blocks
Atomic überblock update: The überblock (root of the pool tree) is updated atomically using a two-phase commit—first all new blocks for the TXG (data, metadata, and space map blocks) are written and flushed, then the new überblock, protected by its own checksum, is written to a rotating label slot and flushed; on open, ZFS uses the newest überblock whose checksum verifies
Transaction group semantics: All changes in a TXG (data, metadata, space maps) become visible atomically or not at all
Self-healing: Checksums on every block detect corruption; mirrored/RAID-Z pools can reconstruct from redundancy
Crash Scenarios:
Scenario 1: Crash during TXG write (before commit)
State before crash:
- New data blocks written to disk
- New space map entries written
- Überblock NOT yet updated
After reboot:
- Überblock points to previous TXG
- Space maps reflect previous TXG state
- Blocks written by the interrupted TXG were never recorded as allocated, so they remain free—nothing is leaked and no repair pass is needed
Scenario 2: Crash after TXG commit
State:
- All new blocks written
- Überblock points to new TXG
After reboot:
- Everything is consistent
- No recovery needed
/**
 * ZFS Transaction Group (TXG) Commit with Space Maps
 *
 * Demonstrates how space map updates are atomic with pool state.
 */

typedef struct {
    uint64_t ub_txg;           // Transaction group number
    uint64_t ub_timestamp;     // Commit time
    uint64_t ub_rootbp;        // Block pointer to MOS (Meta-Object Set)
    uint64_t ub_checksum[4];   // 256-bit checksum
} Uberblock;

/**
 * Atomic TXG commit process
 */
void txg_commit(Pool *pool, TxgState *txg) {
    // Phase 1: Write all data blocks
    sync_data_blocks(txg);

    // Phase 2: Write updated metadata (includes space maps)
    // (Pool is assumed to hold a vdevs[] array and vdev_count)
    for (uint32_t v = 0; v < pool->vdev_count; v++) {
        Vdev *vd = pool->vdevs[v];
        for (uint32_t m = 0; m < vd->vd_ms_count; m++) {
            Metaslab *ms = vd->vd_metaslabs[m];
            if (ms->ms_sm && ms->ms_sm->pending_count > 0) {
                // Append pending entries to space map object
                sm_sync_to_disk(ms->ms_sm);
            }
        }
    }

    // Phase 3: Write new MOS root block
    uint64_t new_mos_bp = write_mos(pool);

    // Phase 4: Flush so every block of this TXG is durable
    //          before the new überblock can appear on disk
    disk_flush(pool->all_vdevs);

    // Phase 5: Calculate new überblock
    Uberblock new_ub = {
        .ub_txg = txg->txg_number,
        .ub_timestamp = current_time(),
        .ub_rootbp = new_mos_bp,
    };
    calculate_checksum(&new_ub);

    // Phase 6: Write new überblock to next slot (rotating through slots),
    //          then flush again to make the commit durable
    uint64_t ub_slot = (pool->current_ub_slot + 1) % NUM_UB_SLOTS;
    write_uberblock(pool, ub_slot, &new_ub);
    disk_flush(pool->all_vdevs);

    // Phase 7: Commit complete - update in-memory state
    pool->current_ub_slot = ub_slot;
    pool->current_txg = new_ub.ub_txg;

    // Phase 8: Previous TXG's space is now available for reuse
    process_deferred_frees(pool, txg->txg_number - 1);
}

/**
 * Pool open - find most recent valid überblock
 */
int spa_load(Pool *pool) {
    Uberblock best_ub = {0};

    // Check all überblock slots, choose most recent valid one
    for (int i = 0; i < NUM_UB_SLOTS; i++) {
        Uberblock ub;
        read_uberblock(pool, i, &ub);

        if (verify_checksum(&ub) && ub.ub_txg > best_ub.ub_txg) {
            best_ub = ub;
            pool->current_ub_slot = i;
        }
    }

    if (best_ub.ub_txg == 0) {
        return ECORRUPT;   // No valid überblock found
    }

    pool->current_txg = best_ub.ub_txg;

    // Load MOS from überblock's root pointer
    load_mos(pool, best_ub.ub_rootbp);

    // Space maps are loaded lazily when metaslabs are accessed
    // They will reflect state as of best_ub.ub_txg

    printf("Pool loaded at TXG %lu\n", best_ub.ub_txg);
    return 0;
}

Because ZFS is always consistent (thanks to COW plus the atomic überblock update), there is no need for a file system check (fsck) after crashes. On mount, ZFS simply finds the most recent valid überblock and loads from there. Any blocks written by the interrupted TXG sit in space the committed space maps still record as free, so they are simply reused by later allocations—no leak detection or cleanup pass is required.
Space maps represent a sophisticated synthesis of ideas from earlier techniques. Let's compare how they stack up against the traditional approaches we've studied.
| Aspect | Bitmap | Counting | Space Maps |
|---|---|---|---|
| Allocation time | O(N) scan | O(log N) tree | O(1) append |
| Free time | O(1) | O(log N) tree | O(1) append |
| Space overhead | Fixed (1 bit/block) | Variable (extents) | Log + tree (variable) |
| Crash consistency | Needs journal | Needs journal | Inherent (COW) |
| Load time | O(N) scan | O(extents) tree build | O(entries) replay |
| Parallelism | Bitmap sections | Separate trees | Metaslabs |
| Best for | Small-medium FS | Large sequential | All workloads |
Space maps are optimal for ZFS's design goals (enterprise storage, data integrity, pooled storage) but may be overkill for embedded systems or simple use cases. Bitmaps remain excellent for their simplicity, and counting/extent tracking works well in ext4 and XFS where journaling handles consistency. Each technique has its place.
Space maps represent the culmination of decades of file system evolution, combining the best aspects of logging, extent tracking, and copy-on-write semantics. They demonstrate that free space management isn't just about tracking which blocks are available—it's deeply intertwined with crash consistency, performance, and scalability.
With this page, we complete our exploration of free space management techniques—from the elegant simplicity of bitmaps to the sophisticated log-structured approach of space maps.
You've mastered the fundamental techniques for tracking free space in file systems—from the simplest (bitmap) to the most sophisticated (space maps). These techniques form the foundation for understanding how any file system manages its storage resources. You're now equipped to analyze file system design decisions and understand the trade-offs that shape modern storage systems.