Consider this seemingly simple operation: moving a file from one directory to another. At the filesystem level, this requires two sub-operations:
1. Add the file's entry to the destination directory
2. Remove the file's entry from the source directory
What happens if we crash after step 1 but before step 2? The file appears in both directories. What if we perform the steps in the reverse order and crash in between? The file is orphaned: it exists on disk, but no directory entry points to it.
Neither outcome is acceptable. We need these operations to either all complete or none complete—atomic operation semantics. But we also want performance, which means not syncing to disk after every sub-operation.
The solution is ordered writes: controlling which writes must reach disk before others, without necessarily waiting for each write individually. This enables filesystems to maintain consistency guarantees while preserving write-back performance characteristics.
This page explores write ordering in depth—the mechanisms, the guarantees they provide, and how they enable crash-consistent data structures.
By the end of this page, you will understand write barriers—what they guarantee and what they don't. You'll learn how filesystems use ordering to maintain consistency, how databases implement atomic transactions, and how to reason about the ordering requirements of your own data structures.
Without explicit ordering control, the path from write() to persistent storage involves multiple opportunities for reordering:
Sources of Write Reordering
Application writes: W1, W2, W3 (in that order)
┌─────────────────────────────────────────────────────────────┐
│ Kernel Page Cache │
│ - Pages may be marked dirty in any order │
│ - Write scheduler may batch/sort for efficiency │
│ Possible order to block layer: W2, W3, W1 │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Block I/O Scheduler │
│ - Reorders for seek optimization (HDDs) │
│ - Merges adjacent blocks │
│ - May prioritize reads over writes │
│ Possible order to device: W3, W1, W2 │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Disk/SSD Controller │
│ - Native Command Queuing (NCQ) on SATA │
│ - Deep command queues on NVMe │
│ - Internal reordering for performance │
│ Actual execution order: W1, W3, W2 │
└─────────────────────────────────────────────────────────────┘
The result: writes issued in the order W1→W2→W3 might reach disk as W1→W3→W2, or in any other permutation.
Why Reordering Happens
Each layer optimizes for different goals:
| Layer | Optimization Goal | Reordering Behavior |
|---|---|---|
| Page Cache | Memory efficiency | Writes to same page coalesced |
| I/O Scheduler | Seek time (HDD) | Sorted by physical location |
| Device Controller | Firmware efficiency | Parallel execution, out-of-order completion |
For performance, this reordering is desirable—it can improve throughput by 10-100×. But for consistency, it's catastrophic if not controlled.
The Consistency Invariant
Most data structures have invariants that must hold after any crash:
Example: Directory with file count
- Count field says 5 files exist
- Directory has 5 file entries
Invariant: count must equal actual entries
If we add a file:
1. Write new file entry
2. Increment count
If crash after 1, before 2:
- Count says 5, but 6 files exist
- INCONSISTENT
If we reverse the order:
1. Increment count
2. Write new file entry
If crash after 1, before 2:
- Count says 6, but 5 files exist
- STILL INCONSISTENT
The ONLY safe approaches:
a) Do both atomically (single write, requires careful layout)
b) Write in order that crash always leaves recoverable state
c) Log the operation so it can be replayed/rolled-back
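Approach (b), writing in a crash-recoverable order, can be sketched in a few lines of Python. This is a toy in-memory model (the class and method names are ours, not filesystem code): the entry list is the source of truth, the count is written second, and recovery never trusts the count, it recomputes it.

```python
class ToyDirectory:
    def __init__(self):
        self.entries = {}   # name -> inode: written first, authoritative
        self.count = 0      # derived count: written second, rebuildable

    def add_file(self, name, inode):
        self.entries[name] = inode   # write 1: the entry itself
        # --- crash window: count is stale, but entries are complete ---
        self.count += 1              # write 2: the derived count

    def recover(self):
        # Recovery rule for approach (b): recompute the count from the
        # entries; a crash between the two writes is then harmless.
        self.count = len(self.entries)

d = ToyDirectory()
d.add_file("a.txt", 42)
d.entries["b.txt"] = 7    # simulate a crash after write 1 of a second add
d.recover()
assert d.count == len(d.entries) == 2
```

The ordering matters: if the count were written first, no amount of rescanning could tell whether the missing entry was ever written.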
Ordering and durability are separate concerns. Ordering says: 'Write A must hit disk before Write B.' Durability says: 'Write A must be on disk before we continue.' Ordering can be achieved without immediate durability—the writes happen when convenient, but the specified order is maintained.
A write barrier is a synchronization point that guarantees all writes before the barrier reach persistent storage before any write after the barrier.
Barrier Semantics
Timeline:
write(block A)
write(block B)
─── BARRIER ───
write(block C)
write(block D)
Guarantees:
1. A and B complete before C or D start
2. A and B may complete in either order relative to each other
3. C and D may complete in either order relative to each other
Visualization of possible completion orders:
Legal: A, B, C, D
Legal: B, A, D, C
Legal: B, A, C, D
ILLEGAL: A, C, B, D (C before B violates barrier)
ILLEGAL: C, A, B, D (C before A,B violates barrier)
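These rules are mechanical enough to check in code. The hypothetical helper below (Python, names are ours) tests a completion order against a barrier: the order is illegal exactly when some post-barrier write completes before some pre-barrier write.

```python
def violates_barrier(completion_order, before, after):
    """True if any write in `after` completed before any write in
    `before` -- i.e., the completion order crosses the barrier."""
    last_before = max(completion_order.index(w) for w in before)
    first_after = min(completion_order.index(w) for w in after)
    return first_after < last_before

before, after = {"A", "B"}, {"C", "D"}
assert not violates_barrier(["A", "B", "C", "D"], before, after)  # legal
assert not violates_barrier(["B", "A", "D", "C"], before, after)  # legal
assert violates_barrier(["A", "C", "B", "D"], before, after)      # illegal
assert violates_barrier(["C", "A", "B", "D"], before, after)      # illegal
```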
Hardware Barrier Implementation
Storage devices implement barriers through several mechanisms:
| Mechanism | Device Type | Semantics | Performance Impact |
|---|---|---|---|
| Cache Flush | SATA/NVMe | Flush write cache before continuing | 50µs - 10ms |
| FUA (Force Unit Access) | SATA/NVMe | Bypass cache for this write only | Minimal per-write overhead |
| Write Cache Disable | All | All writes uncached | Significant throughput loss |
| Barrier Command | NVMe | Explicit ordering primitive | Minimal |
Linux Block Layer Barriers
Linux implements barriers through the block layer request flags:
// Block layer request flags relevant to ordering
// REQ_PREFLUSH: Issue a cache flush BEFORE this write
// Ensures all previous writes are on media before this one
// REQ_FUA: Force Unit Access - bypass volatile cache for THIS write
// Ensures this specific write goes directly to persistent media
// Combined: REQ_PREFLUSH | REQ_FUA
// Flush cache, then write directly to media
// This is a "full barrier" - nothing crosses this point
// Example construction of barrier request:
struct bio *bio = bio_alloc(GFP_NOIO, 1);
bio_set_dev(bio, bdev);
bio->bi_iter.bi_sector = sector;
bio_add_page(bio, page, PAGE_SIZE, 0);
bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA;
submit_bio(bio);
The Evolution of Linux Barriers
Linux's barrier implementation has evolved significantly:
Pre-2.6.37 (Explicit Barriers)
// Old API
blkdev_issue_flush(bdev, GFP_KERNEL, NULL); // Explicit flush
// OR
bio->bi_rw |= WRITE_BARRIER; // Request includes barrier
2.6.37+ (PREFLUSH/FUA Model)
// Modern API - more granular control
bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; // Flush before
bio->bi_opf = REQ_OP_WRITE | REQ_FUA; // Direct write
bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA; // Both
The modern API is more flexible: flush-before (REQ_PREFLUSH) and direct-to-media (REQ_FUA) semantics can be requested independently and combined per request, rather than as a single all-or-nothing barrier.
Barrier Cost Analysis
Barriers are expensive because they serialize I/O:
Without barriers (write-back, full parallelism):
4 writes × parallel = 1 round trip ≈ 20µs total (NVMe)
With barrier after every write:
4 writes × sequential = 4 round trips ≈ 80µs total (NVMe)
With strategic barriers (2 groups):
2 writes, barrier, 2 writes = 2 round trips ≈ 40µs total
Optimization: Minimize barriers while maintaining required ordering
When you only need to ensure a single write is durable (not all prior writes), FUA is much cheaper than a full cache flush. Use PREFLUSH when you need to order against prior writes; use FUA when you only need durability for this specific write. Use both for a full barrier.
Filesystems use write ordering extensively to maintain consistency. Let's examine the strategies used by major filesystems.
Strategy 1: Soft Updates (BSD FFS/UFS)
Soft updates track dependencies between metadata updates and enforce ordering through those dependencies:
Creating a new file:
1. Allocate inode (modifies inode bitmap)
2. Initialize inode (modifies inode block)
3. Allocate data blocks (modifies block bitmap)
4. Update directory entry (modifies directory block)
Dependencies:
1 must complete before 2 (inode must exist before init)
2 must complete before 4 (inode initialized before linked)
3 must complete before 2 (blocks allocated before inode references them)
The soft updates mechanism tracks these dependencies and enforces the order during writeback.
If crash occurs:
- May have allocated but unused inodes/blocks (fsck fix: free them)
- NEVER has directory pointing to unallocated inode (always safe to follow links)
Strategy 2: Ordered Mode Journaling (ext3/ext4)
Ordered mode ensures data is written before the journal commit:
Transaction lifecycle:
1. Write data blocks to final location (no barrier yet)
2. Write metadata to journal (describes the operation)
3. BARRIER: Wait for data writes to complete
4. Write commit block to journal (with FUA or barrier)
5. Write metadata to final location (can overlap with new transactions)
6. Mark journal space reusable
Crash scenarios:
- Before step 4: Transaction not committed, don't replay
- After step 4: Replay journal, metadata matches data
KEY: Data is always written before journal commits
Metadata in journal describes already-written data
No barrier within data writes (parallel, fast)
Barrier between data and commit (correctness)
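The same discipline can be sketched at the application level with ordinary files, using fsync as a stand-in for the kernel's internal barriers. This is an assumed, simplified model of ordered-mode commit, not ext4's actual implementation; the function name is ours.

```python
import json
import os
import tempfile

def ordered_commit(data_fd, journal_fd, offset, payload, meta):
    """Sketch of ordered-mode discipline: data first, barrier,
    then the commit record."""
    # Step 1: data blocks to their final location (no ordering yet)
    os.pwrite(data_fd, payload, offset)
    # Step 2: metadata record to the journal (still only in cache)
    os.write(journal_fd, json.dumps(meta).encode() + b"\n")
    # Step 3: BARRIER -- data must reach media before the commit record
    os.fsync(data_fd)
    # Step 4: commit record, forced out before success is reported
    os.write(journal_fd, b"COMMIT\n")
    os.fsync(journal_fd)

# A crash before step 4's fsync leaves no COMMIT record, so recovery
# ignores the transaction; after it, the data is already in place.
with tempfile.TemporaryDirectory() as d:
    dfd = os.open(os.path.join(d, "data"), os.O_RDWR | os.O_CREAT)
    jfd = os.open(os.path.join(d, "journal"), os.O_RDWR | os.O_CREAT)
    ordered_commit(dfd, jfd, 0, b"hello", {"inode": 12, "len": 5})
    os.lseek(jfd, 0, os.SEEK_SET)
    assert os.read(jfd, 4096).endswith(b"COMMIT\n")
    os.close(dfd)
    os.close(jfd)
```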
Strategy 3: Full Data Journaling (ext3/ext4 data=journal)
Both data and metadata go to journal:
Transaction:
1. Write data to journal
2. Write metadata to journal
3. Commit block (with barrier)
4. Write data to final location
5. Write metadata to final location
Advantage: Simple, consistent
Disadvantage: Data is written twice (journal + final location)
Throughput roughly halved
| Filesystem | Strategy | Crash Consistency | Write Overhead | Modern Use |
|---|---|---|---|---|
| BSD UFS | Soft Updates | Consistent (needs background fsck) | Low | FreeBSD, NetBSD |
| ext4 data=ordered | Ordered Journaling | Fully consistent | Medium (1 barrier per TX) | Linux default |
| ext4 data=journal | Full Journaling | Fully consistent | High (2× writes) | Paranoid setups |
| ext4 data=writeback | Metadata-only Journal | Metadata consistent only | Low | Speed-critical |
| XFS | Metadata Journal | Metadata consistent | Low | Enterprise Linux |
| ZFS/Btrfs | Copy-on-Write | Always consistent | Variable (COW overhead) | Modern systems |
The Copy-on-Write Alternative
Modern filesystems like ZFS and Btrfs avoid many ordering concerns through copy-on-write (COW):
Traditional update (in-place):
[Block A version 1] → overwrite → [Block A version 2]
Risks:
- Crash during write leaves block corrupted
- Must order dependent writes carefully
Copy-on-Write update:
[Block A v1] remains
Write [Block A v2] to NEW location
Update pointer from A_v1 to A_v2 (single write)
Ordering requirement: Only the final pointer update
Everything else can be written in any order
Crash leaves either old version (pointer not updated)
or new version (pointer updated)
NEVER corrupted version
COW transforms most ordering problems into simple pointer atomicity problems.
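A toy model makes the guarantee concrete. Here the "disk" is a dict of block slots and the root is a single pointer variable; this is an assumed illustration, not ZFS or Btrfs code. Only the final pointer assignment needs ordering.

```python
# Flat "disk": slot name -> block contents. The live version is
# whichever slot the root pointer names.
disk = {"blk0": b"v1"}
root = "blk0"

def cow_update(disk, new_data):
    """Write the new version to a fresh slot; the old version is
    never touched. Returns the new root pointer value."""
    new_slot = f"blk{len(disk)}"    # any free location will do
    disk[new_slot] = new_data       # may land on media in any order
    # --- barrier here in a real system, then one pointer write ---
    return new_slot

new_root = cow_update(disk, b"v2")
# Crash before the pointer update: readers still see intact v1.
assert disk[root] == b"v1"
root = new_root                     # the single ordered write
assert disk[root] == b"v2"
```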
ext4's write ordering is controlled by mount options: data=ordered (safe default), data=writeback (fastest, metadata-only consistency), data=journal (safest, slowest). Check with mount | grep data=. For databases that handle their own ordering, data=writeback may be appropriate.
Databases face the same ordering challenges as filesystems but with additional requirements: user-defined transaction boundaries and ACID properties.
The WAL Protocol and Ordering
Write-Ahead Logging (WAL) relies critically on write ordering:
ACID Requirement: Transaction is durable if commit succeeds
Implementation: Log record is durable before commit returns
Ordering requirement:
WAL log record → durable → BEFORE → commit acknowledgment
NOT required:
Data page modifications (can stay in cache indefinitely)
Index updates (can stay in cache)
Only required:
Log records for committed transactions must survive crash
Write order:
1. Generate log record describing change
2. Apply changes to in-memory buffer pool
3. When commit: write log records to WAL file
4. BARRIER (fsync WAL file)
5. Return success to client
6. Eventually: write dirty data pages (no ordering constraint)
Log Record Ordering
Within the log, records must also be ordered:
Transaction T1: writes A, B, commits
Transaction T2: writes C, commits
Transaction T3: writes A (same item as T1), commits
Log order must reflect commit order:
[T1: A=1] [T1: B=2] [T1: commit]
[T2: C=3] [T2: commit]
[T3: A=5] [T3: commit]
On recovery:
- Replay in order
- T1's A=1 is overwritten by T3's A=5
- Final state correct
If log were reordered (T3 commit before T1):
- Apply T3: A=5
- Apply T1: A=1
- WRONG! Lost T3's update
import os
import struct

class WriteAheadLog:
    """
    Simple WAL demonstrating the ordering requirements.
    In production, this would need proper record formats,
    checksums, and group commit optimization.
    """

    def __init__(self, path):
        self.path = path
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_APPEND, 0o644)
        self.next_lsn = 0  # Log Sequence Number

    def append_record(self, tx_id, operation, data):
        """
        Append a log record. Does NOT make it durable yet.
        Returns LSN for this record.
        """
        lsn = self.next_lsn
        self.next_lsn += 1

        # Format: [LSN (8 bytes)][TX_ID (8)][OP (4)][LEN (4)][DATA]
        record = struct.pack(">QQI", lsn, tx_id, operation)
        record += struct.pack(">I", len(data))
        record += data

        # Write to file (goes to cache)
        os.write(self.fd, record)
        return lsn

    def commit(self, tx_id, data_lsn):
        """
        Commit a transaction. MUST ensure all log records
        up to and including the commit record are durable.
        """
        # Append commit record
        commit_record = struct.pack(">QQI", self.next_lsn, tx_id,
                                    0xFFFFFFFF)  # Commit marker
        self.next_lsn += 1
        os.write(self.fd, commit_record)

        # CRITICAL: fsync BEFORE returning
        # This is the ordering guarantee:
        #   All log records, including commit, are durable
        #   Only THEN is the transaction committed
        os.fsync(self.fd)
        return True

    def recover(self):
        """
        On startup, scan log and return committed transactions.
        Ordering guarantee means: if we see commit record,
        all prior records for that TX are durable.
        """
        committed_txs = set()
        records_by_tx = {}

        # Seek to start
        os.lseek(self.fd, 0, os.SEEK_SET)

        while True:
            header = os.read(self.fd, 20)
            if len(header) < 20:
                break
            lsn, tx_id, operation = struct.unpack(">QQI", header)

            if operation == 0xFFFFFFFF:  # Commit record
                committed_txs.add(tx_id)
            else:  # Data record
                len_data = struct.unpack(">I", os.read(self.fd, 4))[0]
                data = os.read(self.fd, len_data)
                if tx_id not in records_by_tx:
                    records_by_tx[tx_id] = []
                records_by_tx[tx_id].append((lsn, operation, data))

        # Return only committed transaction records
        return {tx: records for tx, records in records_by_tx.items()
                if tx in committed_txs}
Group Commit Optimization
Individual fsync per transaction is slow. Group commit amortizes the cost:
Without group commit:
T1 appends records, fsync, returns (10ms)
T2 appends records, fsync, returns (10ms)
T3 appends records, fsync, returns (10ms)
Total: 30ms for 3 transactions
With group commit:
T1 appends records
T2 appends records
T3 appends records
Single fsync covers all three (10ms)
All return
Total: 10ms for 3 transactions
The key insight: fsync cost is dominated by device round-trip,
not by the amount of data. Bigger batches = better amortization.
But group commit requires careful ordering: all transactions in the group must have their records written before the shared fsync.
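A minimal sketch of that batching logic (Python; a toy model with names of our choosing, not any real database's implementation): records are appended eagerly, but acknowledgment is deferred until one shared fsync covers the whole group.

```python
import os
import tempfile

class GroupCommitLog:
    """Toy group commit: transactions append immediately; one fsync
    then acknowledges every transaction that appended before it."""

    def __init__(self, fd):
        self.fd = fd
        self.pending = []          # tx ids awaiting the shared fsync

    def append(self, tx_id, record):
        os.write(self.fd, record)  # reaches the page cache only
        self.pending.append(tx_id)

    def flush(self):
        os.fsync(self.fd)          # one device round trip...
        acked, self.pending = self.pending, []
        return acked               # ...acknowledges the whole batch

with tempfile.NamedTemporaryFile() as f:
    log = GroupCommitLog(f.fileno())
    for tx in (1, 2, 3):
        log.append(tx, b"record\n")
    assert log.flush() == [1, 2, 3]   # three commits, one fsync
```

The ordering requirement from the text is visible here: `append` must complete for every member of the group before the shared `flush` runs.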
PostgreSQL uses group commit with commit_delay and commit_siblings settings. When a transaction commits, it waits briefly for siblings to accumulate, then one fsync covers all. This can increase throughput by 10x on synchronous commit workloads.
Let's apply ordering principles to implement crash-safe data structures. These patterns appear throughout systems software.
Pattern 1: The Atomic Update File
Safely update a file's contents:
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int atomic_file_update(const char *path, const void *data, size_t len) {
    char temp_path[PATH_MAX];
    snprintf(temp_path, sizeof(temp_path), "%s.tmp", path);

    // 1. Write new content to temp file
    int fd = open(temp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(temp_path);
        return -1;
    }

    // 2. Ensure temp file content is durable
    if (fsync(fd) < 0) {
        close(fd);
        unlink(temp_path);
        return -1;
    }
    close(fd);

    // 3. Rename temp to final (atomic on POSIX)
    if (rename(temp_path, path) < 0) {
        unlink(temp_path);
        return -1;
    }

    // 4. Ensure directory entry is durable
    char *path_copy = strdup(path);  // dirname() may modify its argument
    char *dir = dirname(path_copy);
    int dir_fd = open(dir, O_RDONLY);
    if (dir_fd >= 0) {
        fsync(dir_fd);  // Make rename durable
        close(dir_fd);
    }
    free(path_copy);
    return 0;
}

/*
 * Ordering guarantees:
 * - Step 2 ensures temp file content is durable before rename
 * - rename() is atomic: either old version or new version exists
 * - Step 4 ensures rename is durable
 *
 * Crash at any point:
 * - Before step 3: original file intact, temp file may/may not exist
 * - After step 3: new file is complete and durable
 */
Pattern 2: The Append-Only Log
Create a log that survives crashes:
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Log entry format:
 *   [LENGTH: 4 bytes][DATA: LENGTH bytes][CRC32: 4 bytes]
 *
 * The CRC allows detection of torn/incomplete writes.
 */

uint32_t compute_crc32(const void *data, size_t len);  /* provided elsewhere */

typedef struct {
    int fd;
    off_t write_offset;
} AppendLog;

int log_append(AppendLog *log, const void *data, size_t len) {
    uint32_t header = (uint32_t)len;
    uint32_t crc = compute_crc32(data, len);

    // Use writev for atomic multi-part write
    struct iovec iov[3] = {
        { .iov_base = &header,       .iov_len = sizeof(header) },
        { .iov_base = (void *)data,  .iov_len = len },
        { .iov_base = &crc,          .iov_len = sizeof(crc) }
    };

    // Single writev is more likely to be atomic
    // But still not guaranteed - see note below
    ssize_t written = writev(log->fd, iov, 3);
    if (written != (ssize_t)(sizeof(header) + len + sizeof(crc))) {
        return -1;  // Partial write or error
    }
    return 0;
}

int log_sync(AppendLog *log) {
    // Make all appended entries durable
    return fdatasync(log->fd);
}

/*
 * Recovery: Read entries, verify CRCs
 *
 * - Valid CRC: entry is complete and correct
 * - Invalid/missing CRC: entry is torn, truncate log here
 *
 * This provides "prefix semantics": some prefix of
 * appended entries survives, possibly not all.
 */
int log_recover(AppendLog *log) {
    off_t valid_end = 0;

    while (1) {
        uint32_t header;
        if (read(log->fd, &header, sizeof(header)) != sizeof(header))
            break;

        void *data = malloc(header);
        if (read(log->fd, data, header) != (ssize_t)header) {
            free(data);
            break;
        }

        uint32_t stored_crc;
        if (read(log->fd, &stored_crc, sizeof(stored_crc)) != sizeof(stored_crc)) {
            free(data);
            break;
        }

        uint32_t computed_crc = compute_crc32(data, header);
        free(data);

        if (computed_crc != stored_crc) {
            // Torn write or corruption
            break;
        }

        valid_end = lseek(log->fd, 0, SEEK_CUR);
    }

    // Truncate to last valid entry
    ftruncate(log->fd, valid_end);
    log->write_offset = valid_end;
    return 0;
}
Pattern 3: The B-Tree with Ordering
B-trees require careful ordering for crash safety:
Inserting into B-tree (may cause split):
Unsafe order:
1. Update parent to point to new split node
2. Write split node content
Crash after 1: Parent points to garbage!
Safe order:
1. Write split node content
2. BARRIER
3. Update parent to point to split node
4. BARRIER
Crash after 1: Split node is orphan (space leak, not corruption)
Crash after 3: Parent points to valid split node ✓
Even better (COW B-tree):
1. Copy-on-write from root to modified leaf
2. Write ALL new nodes (any order, no barriers needed)
3. BARRIER
4. Update root pointer (single atomic write)
5. BARRIER
Result: Either old tree or new tree, never inconsistent mix
The COW approach is why LMDB (Lightning Memory-Mapped Database) achieves such high performance with full crash safety—most writes need no ordering constraints.
LMDB uses a clever COW B-tree design where only one 8-byte root pointer needs to be durable for a commit. Combined with 2 root pointers (ping-pong), it achieves atomic commits with a single msync/fdatasync, regardless of how much data changed. This is write ordering at its most elegant.
Understanding hardware-level ordering guarantees is essential for correct system design. Different device types provide different guarantees.
SATA and SAS Drives
SATA Native Command Queue (NCQ):
- Up to 32 outstanding commands
- Drive may execute in any order
- No inter-command ordering by default
Ordering primitives:
- FLUSH CACHE command: Wait for all prior commands
- FUA bit on write: This write bypasses cache
Example correct sequence:
WRITE A
WRITE B
FLUSH CACHE (barrier)
WRITE C
Guarantee: A and B on media before C starts
NVMe Drives
NVMe provides more sophisticated ordering:
Submission Queue ordering:
- Commands within same queue maintain order (if not reordered by drive)
- Commands across queues have no ordering
Ordering primitives:
- FLUSH command: Persist all previous writes
- FUA bit in write command
- Write Zeroes with Deallocate flag
NVMe 1.4+ Streams:
- Hint to drive about related writes
- May improve ordering within stream
NVMe Atomic Write Unit:
- AWUN: size of guaranteed atomic write
- Typically 4KB-16KB
- Writes up to this size won't be torn
| Property | SATA HDD | SATA SSD | NVMe SSD | Optane |
|---|---|---|---|---|
| Sector atomic (512B) | Yes | Yes | Yes | Yes |
| 4KB atomic | Sometimes* | Usually | Yes (AWUN) | Yes |
| 16KB atomic | No | Rare | Some models | Yes |
| NCQ/Queue depth | 32 | 32 | 64K | 64K |
| Per-queue order | No | No | No** | No |
| Flush latency | 10-30ms | 50-200µs | 10-100µs | <10µs |
| FUA support | Yes | Yes | Yes | Yes |
*Some modern HDDs have 4KB physical sectors with 4KB atomic write. **NVMe spec doesn't guarantee queue ordering; drives may reorder.
The Power-Fail Protection Consideration
Consumer vs. enterprise drives differ significantly:
Consumer SSD (no PLP):
- Volatile DRAM cache for write staging
- Power loss = cache content lost
- May also corrupt data being programmed
Enterprise SSD (with PLP):
- Capacitor or battery backup
- 10-100ms of power for emergency flush
- All in-flight data written to NAND
Critical for ordering:
- Without PLP, even FUA writes may be lost if power fails during NAND programming
- Must assume worst case for consumer hardware
- Or use UPS + clean shutdown guarantees
Testing Hardware Guarantees
Never trust vendor claims—test:
# diskchecker.pl - classic tool for verifying barrier behavior
# Writes pattern, crashes, checks pattern survived
# dm-log-writes - record and replay I/O
dmsetup create log-test --table \
"0 $(blockdev --getsz /dev/sda) \
log-writes /dev/sda /dev/sdb"
# Write workload, crash, check /dev/sdb log for ordering violations
VMs add layers that may violate ordering. A VM's flush may not reach physical media if the hypervisor has its own cache. AWS EBS, for example, maintains ordering within a volume but the timing of persistence is opaque. Always test your specific deployment environment.
Ordering bugs are notoriously difficult to diagnose because they only manifest after crashes, and crashes are (hopefully) rare. Here are techniques for finding and fixing these bugs.
Symptom Recognition
Symptoms suggesting ordering problems:
1. "Corruption" that fsck fixes
- Usually metadata/data mismatch
- Indicates metadata written before/without data
2. Lost transactions after crash
- Commit was acknowledged
- After crash, transaction is gone
- Commit notification sent before log durable
3. Partial records
- Half-written data structures
- CRCs failing
- No barrier between multi-block writes
4. "Impossible" states
- Child object exists without parent
- Reference count negative
- Usually missing dependency ordering
Tracing I/O Order
#!/bin/bash
# Trace block I/O to identify ordering issues

# 1. Enable block tracing
echo 1 > /sys/kernel/debug/tracing/events/block/enable

# 2. Run workload
./my_application &

# 3. Capture trace (shows order of block writes)
cat /sys/kernel/debug/tracing/trace_pipe | while read line; do
    echo "$line"
    # Look for:
    #   block_rq_issue: request submitted to device
    #   block_rq_complete: request completed
    #   block_bio_queue: bio queued
done

# 4. Look for ordering:
# Are flushes (REQ_PREFLUSH) appearing where expected?
# Are FUA writes (REQ_FUA) on critical data?
# Is there gap between data write and commit?

# Example trace line:
#   block_rq_issue: 8,0 WFS 123456 + 8 [my_app] 0
#                       ^^^
#   W = Write
#   F = FUA
#   S = Sync
#
# Look for pattern: [data writes] [WFS] [more data]
#                                 ^barrier

# 5. Use blktrace for more detailed analysis
blktrace -d /dev/sda -o trace &
./my_application
kill %1
blkparse -i trace > trace.txt
Static Analysis for Ordering Bugs
Some ordering bugs can be found through code review:
Checklist:
□ After creating new file: fsync file AND parent directory?
□ After rename: fsync destination directory?
□ After log append: fsync before commit acknowledgment?
□ Writing multi-block structure: barrier before pointer update?
□ In-place update of structure: using COW or WAL?
□ Error paths: do they also maintain ordering?
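The first two checklist items can be wrapped in a small helper. This is a sketch with a name of our choosing; note that fsyncing a directory file descriptor to persist its entries is a Linux convention.

```python
import os
import tempfile

def fsync_file_and_dir(path):
    """After creating or renaming `path`, make both the file content
    and the directory entry that names it durable."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)            # the file's content
    finally:
        os.close(fd)
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)           # the directory entry (the name itself)
    finally:
        os.close(dfd)

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "state.json")
    with open(p, "w") as f:
        f.write("{}")
    fsync_file_and_dir(p)       # both levels synced, no exception
```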
Crash Testing Infrastructure
For systematic testing, simulate crashes at various points:
# Pseudo-code for crash testing framework
class CrashTestHarness:
    def __init__(self, app):
        self.app = app
        self.write_log = []  # All writes captured

    def intercept_write(self, fd, data, barrier=False):
        # Record write
        self.write_log.append({
            'fd': fd,
            'data': data.copy(),
            'barrier': barrier
        })
        # Actually write
        os.write(fd, data)

    def simulate_crash_at(self, n):
        """Simulate crash after write n completes."""
        # Create fresh test device
        setup_test_device()

        # Replay first n writes
        for i, write in enumerate(self.write_log[:n]):
            test_write(write['fd'], write['data'])
            if write['barrier']:
                test_sync(write['fd'])

        # "Crash" - writes after n are lost
        # Run recovery
        self.app.recover(test_device)

        # Check consistency
        return self.app.check_consistency()

    def test_all_crash_points(self):
        """Test crash at every possible point."""
        for i in range(len(self.write_log)):
            if not self.simulate_crash_at(i):
                print(f"Consistency failure at crash point {i}")
                print(f"Last successful write: {self.write_log[i-1]}")
                print(f"Lost write: {self.write_log[i]}")
This approach is used by SQLite, PostgreSQL, and other databases to verify crash safety.
For simpler testing, use two disk images: 'Alice' (mid-transaction) and 'Bob' (post-commit). Crash at transaction boundary, replay to either state, verify consistency. If either Alice or Bob is corrupt, you have an ordering bug.
Write ordering is the foundation of crash consistency. By explicitly controlling which writes complete before others, we can build data structures that survive system failures without requiring expensive synchronous I/O for every operation.
What's Next
We've covered the mechanisms for ordering writes. Our final topic brings everything together: data integrity—the comprehensive strategies for ensuring that stored data remains correct, encompassing checksums, error detection, end-to-end verification, and the policies that determine acceptable risk levels for different types of data.
You now understand write ordering at a deep level—from barrier primitives to filesystem strategies to crash-safe data structure patterns. This knowledge is essential for building systems that maintain consistency despite inevitable failures.