Consider this seemingly simple operation: moving a file from one directory to another. At the filesystem level, this requires two sub-operations:
1. Add the file's entry to the destination directory
2. Remove the file's entry from the source directory
What happens if we crash after step 1 but before step 2? The file appears in both directories. What if we perform the steps in the reverse order and crash in between? The file is orphaned: it exists on disk, but no directory entry points to it.
Neither outcome is acceptable. We need these operations to either all complete or none complete—atomic operation semantics. But we also want performance, which means not syncing to disk after every sub-operation.
The solution is ordered writes: controlling which writes must reach disk before others, without necessarily waiting for each write individually. This enables filesystems to maintain consistency guarantees while preserving write-back performance characteristics.
This page explores write ordering in depth—the mechanisms, the guarantees they provide, and how they enable crash-consistent data structures.
By the end of this page, you will understand write barriers—what they guarantee and what they don't. You'll learn how filesystems use ordering to maintain consistency, how databases implement atomic transactions, and how to reason about the ordering requirements of your own data structures.
Without explicit ordering control, the path from write() to persistent storage involves multiple opportunities for reordering:
Sources of Write Reordering
Application writes: W1, W2, W3 (in that order)
┌─────────────────────────────────────────────────────────────┐
│ Kernel Page Cache │
│ - Pages may be marked dirty in any order │
│ - Write scheduler may batch/sort for efficiency │
│ Possible order to block layer: W2, W3, W1 │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Block I/O Scheduler │
│ - Reorders for seek optimization (HDDs) │
│ - Merges adjacent blocks │
│ - May prioritize reads over writes │
│ Possible order to device: W3, W1, W2 │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Disk/SSD Controller │
│ - Native Command Queuing (NCQ) on SATA │
│ - Deep command queues on NVMe │
│ - Internal reordering for performance │
│ Actual execution order: W1, W3, W2 │
└─────────────────────────────────────────────────────────────┘
The result: writes issued in the order W1→W2→W3 might reach disk as W1→W3→W2, or in any other permutation.
Why Reordering Happens
Each layer optimizes for different goals:
| Layer | Optimization Goal | Reordering Behavior |
|---|---|---|
| Page Cache | Memory efficiency | Writes to same page coalesced |
| I/O Scheduler | Seek time (HDD) | Sorted by physical location |
| Device Controller | Firmware efficiency | Parallel execution, out-of-order completion |
For performance, this reordering is desirable—it can improve throughput by 10-100×. But for consistency, it's catastrophic if not controlled.
The Consistency Invariant
Most data structures have invariants that must hold after any crash:
Example: Directory with file count
- Count field says 5 files exist
- Directory has 5 file entries
Invariant: count must equal actual entries
If we add a file:
1. Write new file entry
2. Increment count
If crash after 1, before 2:
- Count says 5, but 6 files exist
- INCONSISTENT
If we reverse the order:
1. Increment count
2. Write new file entry
If crash after 1, before 2:
- Count says 6, but 5 files exist
- STILL INCONSISTENT
The ONLY safe approaches:
a) Do both atomically (single write, requires careful layout)
b) Write in order that crash always leaves recoverable state
c) Log the operation so it can be replayed/rolled-back
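Approach (b), writing in a crash-recoverable order, can be sketched in a few lines of Python. This is a toy in-memory model (the class and method names are ours, not filesystem code): the entry list is the source of truth, the count is written second, and recovery never trusts the count, it recomputes it.

```python
class ToyDirectory:
    def __init__(self):
        self.entries = {}   # name -> inode: written first, authoritative
        self.count = 0      # derived count: written second, rebuildable

    def add_file(self, name, inode):
        self.entries[name] = inode   # write 1: the entry itself
        # --- crash window: count is stale, but entries are complete ---
        self.count += 1              # write 2: the derived count

    def recover(self):
        # Recovery rule for approach (b): recompute the count from the
        # entries; a crash between the two writes is then harmless.
        self.count = len(self.entries)

d = ToyDirectory()
d.add_file("a.txt", 42)
d.entries["b.txt"] = 7    # simulate a crash after write 1 of a second add
d.recover()
assert d.count == len(d.entries) == 2
```

The ordering matters: if the count were written first, no amount of rescanning could tell whether the missing entry was ever written.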
Ordering and durability are separate concerns. Ordering says: 'Write A must hit disk before Write B.' Durability says: 'Write A must be on disk before we continue.' Ordering can be achieved without immediate durability—the writes happen when convenient, but the specified order is maintained.
A write barrier is a synchronization point that guarantees all writes before the barrier reach persistent storage before any write after the barrier.
Barrier Semantics
Timeline:
write(block A)
write(block B)
─── BARRIER ───
write(block C)
write(block D)
Guarantees:
1. A and B complete before C or D start
2. A and B may complete in either order relative to each other
3. C and D may complete in either order relative to each other
Visualization of possible completion orders:
Legal: A, B, C, D
Legal: B, A, D, C
Legal: B, A, C, D
ILLEGAL: A, C, B, D (C before B violates barrier)
ILLEGAL: C, A, B, D (C before A,B violates barrier)
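These rules are mechanical enough to check in code. The hypothetical helper below (Python, names are ours) tests a completion order against a barrier: the order is illegal exactly when some post-barrier write completes before some pre-barrier write.

```python
def violates_barrier(completion_order, before, after):
    """True if any write in `after` completed before any write in
    `before` -- i.e., the completion order crosses the barrier."""
    last_before = max(completion_order.index(w) for w in before)
    first_after = min(completion_order.index(w) for w in after)
    return first_after < last_before

before, after = {"A", "B"}, {"C", "D"}
assert not violates_barrier(["A", "B", "C", "D"], before, after)  # legal
assert not violates_barrier(["B", "A", "D", "C"], before, after)  # legal
assert violates_barrier(["A", "C", "B", "D"], before, after)      # illegal
assert violates_barrier(["C", "A", "B", "D"], before, after)      # illegal
```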
Hardware Barrier Implementation
Storage devices implement barriers through several mechanisms:
| Mechanism | Device Type | Semantics | Performance Impact |
|---|---|---|---|
| Cache Flush | SATA/NVMe | Flush write cache before continuing | 50µs - 10ms |
| FUA (Force Unit Access) | SATA/NVMe | Bypass cache for this write only | Minimal per-write overhead |
| Write Cache Disable | All | All writes uncached | Significant throughput loss |
| Barrier Command | NVMe | Explicit ordering primitive | Minimal |
Linux Block Layer Barriers
Linux implements barriers through the block layer request flags:
// Block layer request flags relevant to ordering
// REQ_PREFLUSH: Issue a cache flush BEFORE this write
// Ensures all previous writes are on media before this one
// REQ_FUA: Force Unit Access - bypass volatile cache for THIS write
// Ensures this specific write goes directly to persistent media
// Combined: REQ_PREFLUSH | REQ_FUA
// Flush cache, then write directly to media
// This is a "full barrier" - nothing crosses this point
// Example construction of barrier request:
struct bio *bio = bio_alloc(GFP_NOIO, 1);
bio_set_dev(bio, bdev);
bio->bi_iter.bi_sector = sector;
bio_add_page(bio, page, PAGE_SIZE, 0);
bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA;
submit_bio(bio);
The Evolution of Linux Barriers
Linux's barrier implementation has evolved significantly:
Pre-2.6.37 (Explicit Barriers)
// Old API
blkdev_issue_flush(bdev, GFP_KERNEL, NULL); // Explicit flush
// OR
bio->bi_rw |= WRITE_BARRIER; // Request includes barrier
2.6.37+ (PREFLUSH/FUA Model)
// Modern API - more granular control
bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; // Flush before
bio->bi_opf = REQ_OP_WRITE | REQ_FUA; // Direct write
bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA; // Both
The modern API is more flexible: flush-before (REQ_PREFLUSH) and direct-to-media (REQ_FUA) semantics can be requested independently and combined per request, rather than as a single all-or-nothing barrier.
Barrier Cost Analysis
Barriers are expensive because they serialize I/O:
Without barriers (write-back, full parallelism):
4 writes × parallel = 1 round trip ≈ 20µs total (NVMe)
With barrier after every write:
4 writes × sequential = 4 round trips ≈ 80µs total (NVMe)
With strategic barriers (2 groups):
2 writes, barrier, 2 writes = 2 round trips ≈ 40µs total
Optimization: Minimize barriers while maintaining required ordering
When you only need to ensure a single write is durable (not all prior writes), FUA is much cheaper than a full cache flush. Use PREFLUSH when you need to order against prior writes; use FUA when you only need durability for this specific write. Use both for a full barrier.
Filesystems use write ordering extensively to maintain consistency. Let's examine the strategies used by major filesystems.
Strategy 1: Soft Updates (BSD FFS/UFS)
Soft updates track dependencies between metadata updates and enforce ordering through those dependencies:
Creating a new file:
1. Allocate inode (modifies inode bitmap)
2. Initialize inode (modifies inode block)
3. Allocate data blocks (modifies block bitmap)
4. Update directory entry (modifies directory block)
Dependencies:
1 must complete before 2 (inode must exist before init)
2 must complete before 4 (inode initialized before linked)
3 must complete before 2 (blocks allocated before inode references them)
The soft updates mechanism tracks these dependencies and enforces the order during writeback.
If crash occurs:
- May have allocated but unused inodes/blocks (fsck fix: free them)
- NEVER has directory pointing to unallocated inode (always safe to follow links)
Strategy 2: Ordered Mode Journaling (ext3/ext4)
Ordered mode ensures data is written before the journal commit:
Transaction lifecycle:
1. Write data blocks to final location (no barrier yet)
2. Write metadata to journal (describes the operation)
3. BARRIER: Wait for data writes to complete
4. Write commit block to journal (with FUA or barrier)
5. Write metadata to final location (can overlap with new transactions)
6. Mark journal space reusable
Crash scenarios:
- Before step 4: Transaction not committed, don't replay
- After step 4: Replay journal, metadata matches data
KEY: Data is always written before journal commits
Metadata in journal describes already-written data
No barrier within data writes (parallel, fast)
Barrier between data and commit (correctness)
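The same discipline can be sketched at the application level with ordinary files, using fsync as a stand-in for the kernel's internal barriers. This is an assumed, simplified model of ordered-mode commit, not ext4's actual implementation; the function name is ours.

```python
import json
import os
import tempfile

def ordered_commit(data_fd, journal_fd, offset, payload, meta):
    """Sketch of ordered-mode discipline: data first, barrier,
    then the commit record."""
    # Step 1: data blocks to their final location (no ordering yet)
    os.pwrite(data_fd, payload, offset)
    # Step 2: metadata record to the journal (still only in cache)
    os.write(journal_fd, json.dumps(meta).encode() + b"\n")
    # Step 3: BARRIER -- data must reach media before the commit record
    os.fsync(data_fd)
    # Step 4: commit record, forced out before success is reported
    os.write(journal_fd, b"COMMIT\n")
    os.fsync(journal_fd)

# A crash before step 4's fsync leaves no COMMIT record, so recovery
# ignores the transaction; after it, the data is already in place.
with tempfile.TemporaryDirectory() as d:
    dfd = os.open(os.path.join(d, "data"), os.O_RDWR | os.O_CREAT)
    jfd = os.open(os.path.join(d, "journal"), os.O_RDWR | os.O_CREAT)
    ordered_commit(dfd, jfd, 0, b"hello", {"inode": 12, "len": 5})
    os.lseek(jfd, 0, os.SEEK_SET)
    assert os.read(jfd, 4096).endswith(b"COMMIT\n")
    os.close(dfd)
    os.close(jfd)
```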
Strategy 3: Full Data Journaling (ext3/ext4 data=journal)
Both data and metadata go to journal:
Transaction:
1. Write data to journal
2. Write metadata to journal
3. Commit block (with barrier)
4. Write data to final location
5. Write metadata to final location
Advantage: Simple, consistent
Disadvantage: Data is written twice (journal + final location)
Throughput roughly halved
| Filesystem | Strategy | Crash Consistency | Write Overhead | Modern Use |
|---|---|---|---|---|
| BSD UFS | Soft Updates | Consistent (needs background fsck) | Low | FreeBSD, NetBSD |
| ext4 data=ordered | Ordered Journaling | Fully consistent | Medium (1 barrier per TX) | Linux default |
| ext4 data=journal | Full Journaling | Fully consistent | High (2× writes) | Paranoid setups |
| ext4 data=writeback | Metadata-only Journal | Metadata consistent only | Low | Speed-critical |
| XFS | Metadata Journal | Metadata consistent | Low | Enterprise Linux |
| ZFS/Btrfs | Copy-on-Write | Always consistent | Variable (COW overhead) | Modern systems |
The Copy-on-Write Alternative
Modern filesystems like ZFS and Btrfs avoid many ordering concerns through copy-on-write (COW):
Traditional update (in-place):
[Block A version 1] → overwrite → [Block A version 2]
Risks:
- Crash during write leaves block corrupted
- Must order dependent writes carefully
Copy-on-Write update:
[Block A v1] remains
Write [Block A v2] to NEW location
Update pointer from A_v1 to A_v2 (single write)
Ordering requirement: Only the final pointer update
Everything else can be written in any order
Crash leaves either old version (pointer not updated)
or new version (pointer updated)
NEVER corrupted version
COW transforms most ordering problems into simple pointer atomicity problems.
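A toy model makes the guarantee concrete. Here the "disk" is a dict of block slots and the root is a single pointer variable; this is an assumed illustration, not ZFS or Btrfs code. Only the final pointer assignment needs ordering.

```python
# Flat "disk": slot name -> block contents. The live version is
# whichever slot the root pointer names.
disk = {"blk0": b"v1"}
root = "blk0"

def cow_update(disk, new_data):
    """Write the new version to a fresh slot; the old version is
    never touched. Returns the new root pointer value."""
    new_slot = f"blk{len(disk)}"    # any free location will do
    disk[new_slot] = new_data       # may land on media in any order
    # --- barrier here in a real system, then one pointer write ---
    return new_slot

new_root = cow_update(disk, b"v2")
# Crash before the pointer update: readers still see intact v1.
assert disk[root] == b"v1"
root = new_root                     # the single ordered write
assert disk[root] == b"v2"
```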
ext4's write ordering is controlled by mount options: data=ordered (safe default), data=writeback (fastest, metadata-only consistency), data=journal (safest, slowest). Check with mount | grep data=. For databases that handle their own ordering, data=writeback may be appropriate.
Databases face the same ordering challenges as filesystems but with additional requirements: user-defined transaction boundaries and ACID properties.
The WAL Protocol and Ordering
Write-Ahead Logging (WAL) relies critically on write ordering:
ACID Requirement: Transaction is durable if commit succeeds
Implementation: Log record is durable before commit returns
Ordering requirement:
WAL log record → durable → BEFORE → commit acknowledgment
NOT required:
Data page modifications (can stay in cache indefinitely)
Index updates (can stay in cache)
Only required:
Log records for committed transactions must survive crash
Write order:
1. Generate log record describing change
2. Apply changes to in-memory buffer pool
3. When commit: write log records to WAL file
4. BARRIER (fsync WAL file)
5. Return success to client
6. Eventually: write dirty data pages (no ordering constraint)
Log Record Ordering
Within the log, records must also be ordered:
Transaction T1: writes A, B, commits
Transaction T2: writes C, commits
Transaction T3: writes A (same item as T1), commits
Log order must reflect commit order:
[T1: A=1] [T1: B=2] [T1: commit]
[T2: C=3] [T2: commit]
[T3: A=5] [T3: commit]
On recovery:
- Replay in order
- T1's A=1 is overwritten by T3's A=5
- Final state correct
If log were reordered (T3 commit before T1):
- Apply T3: A=5
- Apply T1: A=1
- WRONG! Lost T3's update
import os
import struct

class WriteAheadLog:
    """
    Simple WAL demonstrating the ordering requirements.
    In production, this would need proper record formats,
    checksums, and group commit optimization.
    """

    def __init__(self, path):
        self.path = path
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_APPEND, 0o644)
        self.next_lsn = 0  # Log Sequence Number

    def append_record(self, tx_id, operation, data):
        """
        Append a log record. Does NOT make it durable yet.
        Returns LSN for this record.
        """
        lsn = self.next_lsn
        self.next_lsn += 1

        # Format: [LSN (8 bytes)][TX_ID (8)][OP (4)][LEN (4)][DATA]
        record = struct.pack(">QQI", lsn, tx_id, operation)
        record += struct.pack(">I", len(data))
        record += data

        # Write to file (goes to cache)
        os.write(self.fd, record)
        return lsn

    def commit(self, tx_id, data_lsn):
        """
        Commit a transaction. MUST ensure all log records
        up to and including the commit record are durable.
        """
        # Append commit record
        commit_record = struct.pack(">QQI", self.next_lsn, tx_id,
                                    0xFFFFFFFF)  # Commit marker
        self.next_lsn += 1
        os.write(self.fd, commit_record)

        # CRITICAL: fsync BEFORE returning
        # This is the ordering guarantee:
        #   All log records, including commit, are durable
        #   Only THEN is the transaction committed
        os.fsync(self.fd)
        return True

    def recover(self):
        """
        On startup, scan log and return committed transactions.
        Ordering guarantee means: if we see commit record,
        all prior records for that TX are durable.
        """
        committed_txs = set()
        records_by_tx = {}

        # Seek to start
        os.lseek(self.fd, 0, os.SEEK_SET)

        while True:
            header = os.read(self.fd, 20)
            if len(header) < 20:
                break
            lsn, tx_id, operation = struct.unpack(">QQI", header)

            if operation == 0xFFFFFFFF:  # Commit record
                committed_txs.add(tx_id)
            else:  # Data record
                len_data = struct.unpack(">I", os.read(self.fd, 4))[0]
                data = os.read(self.fd, len_data)
                if tx_id not in records_by_tx:
                    records_by_tx[tx_id] = []
                records_by_tx[tx_id].append((lsn, operation, data))

        # Return only committed transaction records
        return {tx: records for tx, records in records_by_tx.items()
                if tx in committed_txs}
Group Commit Optimization
Individual fsync per transaction is slow. Group commit amortizes the cost:
Without group commit:
T1 appends records, fsync, returns (10ms)
T2 appends records, fsync, returns (10ms)
T3 appends records, fsync, returns (10ms)
Total: 30ms for 3 transactions
With group commit:
T1 appends records
T2 appends records
T3 appends records
Single fsync covers all three (10ms)
All return
Total: 10ms for 3 transactions
The key insight: fsync cost is dominated by device round-trip,
not by the amount of data. Bigger batches = better amortization.
But group commit requires careful ordering: all transactions in the group must have their records written before the shared fsync.
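A minimal sketch of that batching logic (Python; a toy model with names of our choosing, not any real database's implementation): records are appended eagerly, but acknowledgment is deferred until one shared fsync covers the whole group.

```python
import os
import tempfile

class GroupCommitLog:
    """Toy group commit: transactions append immediately; one fsync
    then acknowledges every transaction that appended before it."""

    def __init__(self, fd):
        self.fd = fd
        self.pending = []          # tx ids awaiting the shared fsync

    def append(self, tx_id, record):
        os.write(self.fd, record)  # reaches the page cache only
        self.pending.append(tx_id)

    def flush(self):
        os.fsync(self.fd)          # one device round trip...
        acked, self.pending = self.pending, []
        return acked               # ...acknowledges the whole batch

with tempfile.NamedTemporaryFile() as f:
    log = GroupCommitLog(f.fileno())
    for tx in (1, 2, 3):
        log.append(tx, b"record\n")
    assert log.flush() == [1, 2, 3]   # three commits, one fsync
```

The ordering requirement from the text is visible here: `append` must complete for every member of the group before the shared `flush` runs.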
PostgreSQL uses group commit with commit_delay and commit_siblings settings. When a transaction commits, it waits briefly for siblings to accumulate, then one fsync covers all. This can increase throughput by 10x on synchronous commit workloads.
Let's apply ordering principles to implement crash-safe data structures. These patterns appear throughout systems software.
Pattern 1: The Atomic Update File
Safely update a file's contents:
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int atomic_file_update(const char *path, const void *data, size_t len) {
    char temp_path[PATH_MAX];
    snprintf(temp_path, sizeof(temp_path), "%s.tmp", path);

    // 1. Write new content to temp file
    int fd = open(temp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(temp_path);
        return -1;
    }

    // 2. Ensure temp file content is durable
    if (fsync(fd) < 0) {
        close(fd);
        unlink(temp_path);
        return -1;
    }
    close(fd);

    // 3. Rename temp to final (atomic on POSIX)
    if (rename(temp_path, path) < 0) {
        unlink(temp_path);
        return -1;
    }

    // 4. Ensure directory entry is durable
    char *path_copy = strdup(path);  // dirname() may modify its argument
    char *dir = dirname(path_copy);
    int dir_fd = open(dir, O_RDONLY);
    if (dir_fd >= 0) {
        fsync(dir_fd);  // Make rename durable
        close(dir_fd);
    }
    free(path_copy);
    return 0;
}

/*
 * Ordering guarantees:
 * - Step 2 ensures temp file content is durable before rename
 * - rename() is atomic: either old version or new version exists
 * - Step 4 ensures rename is durable
 *
 * Crash at any point:
 * - Before step 3: original file intact, temp file may/may not exist
 * - After step 3: new file is complete and durable
 */
Pattern 2: The Append-Only Log
Create a log that survives crashes:
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Log entry format:
 *   [LENGTH: 4 bytes][DATA: LENGTH bytes][CRC32: 4 bytes]
 *
 * The CRC allows detection of torn/incomplete writes.
 */

uint32_t compute_crc32(const void *data, size_t len);  /* provided elsewhere */

typedef struct {
    int fd;
    off_t write_offset;
} AppendLog;

int log_append(AppendLog *log, const void *data, size_t len) {
    uint32_t header = (uint32_t)len;
    uint32_t crc = compute_crc32(data, len);

    // Use writev for atomic multi-part write
    struct iovec iov[3] = {
        { .iov_base = &header,       .iov_len = sizeof(header) },
        { .iov_base = (void *)data,  .iov_len = len },
        { .iov_base = &crc,          .iov_len = sizeof(crc) }
    };

    // Single writev is more likely to be atomic
    // But still not guaranteed - see note below
    ssize_t written = writev(log->fd, iov, 3);
    if (written != (ssize_t)(sizeof(header) + len + sizeof(crc))) {
        return -1;  // Partial write or error
    }
    return 0;
}

int log_sync(AppendLog *log) {
    // Make all appended entries durable
    return fdatasync(log->fd);
}

/*
 * Recovery: Read entries, verify CRCs
 *
 * - Valid CRC: entry is complete and correct
 * - Invalid/missing CRC: entry is torn, truncate log here
 *
 * This provides "prefix semantics": some prefix of
 * appended entries survives, possibly not all.
 */
int log_recover(AppendLog *log) {
    off_t valid_end = 0;

    while (1) {
        uint32_t header;
        if (read(log->fd, &header, sizeof(header)) != sizeof(header))
            break;

        void *data = malloc(header);
        if (read(log->fd, data, header) != (ssize_t)header) {
            free(data);
            break;
        }

        uint32_t stored_crc;
        if (read(log->fd, &stored_crc, sizeof(stored_crc)) != sizeof(stored_crc)) {
            free(data);
            break;
        }

        uint32_t computed_crc = compute_crc32(data, header);
        free(data);

        if (computed_crc != stored_crc) {
            // Torn write or corruption
            break;
        }

        valid_end = lseek(log->fd, 0, SEEK_CUR);
    }

    // Truncate to last valid entry
    ftruncate(log->fd, valid_end);
    log->write_offset = valid_end;
    return 0;
}
Pattern 3: The B-Tree with Ordering
B-trees require careful ordering for crash safety:
Inserting into B-tree (may cause split):
Unsafe order:
1. Update parent to point to new split node
2. Write split node content
Crash after 1: Parent points to garbage!
Safe order:
1. Write split node content
2. BARRIER
3. Update parent to point to split node
4. BARRIER
Crash after 1: Split node is orphan (space leak, not corruption)
Crash after 3: Parent points to valid split node ✓
Even better (COW B-tree):
1. Copy-on-write from root to modified leaf
2. Write ALL new nodes (any order, no barriers needed)
3. BARRIER
4. Update root pointer (single atomic write)
5. BARRIER
Result: Either old tree or new tree, never inconsistent mix
The COW approach is why LMDB (Lightning Memory-Mapped Database) achieves such high performance with full crash safety—most writes need no ordering constraints.
LMDB uses a clever COW B-tree design where only one 8-byte root pointer needs to be durable for a commit. Combined with 2 root pointers (ping-pong), it achieves atomic commits with a single msync/fdatasync, regardless of how much data changed. This is write ordering at its most elegant.
Understanding hardware-level ordering guarantees is essential for correct system design. Different device types provide different guarantees.
SATA and SAS Drives
SATA Native Command Queue (NCQ):
- Up to 32 outstanding commands
- Drive may execute in any order
- No inter-command ordering by default
Ordering primitives:
- FLUSH CACHE command: Wait for all prior commands
- FUA bit on write: This write bypasses cache
Example correct sequence:
WRITE A
WRITE B
FLUSH CACHE (barrier)
WRITE C
Guarantee: A and B on media before C starts
NVMe Drives
NVMe provides more sophisticated ordering:
Submission Queue ordering:
- Commands within same queue maintain order (if not reordered by drive)
- Commands across queues have no ordering
Ordering primitives:
- FLUSH command: Persist all previous writes
- FUA bit in write command
- Write Zeroes with Deallocate flag
NVMe 1.4+ Streams:
- Hint to drive about related writes
- May improve ordering within stream
NVMe Atomic Write Unit:
- AWUN: size of guaranteed atomic write
- Typically 4KB-16KB
- Writes up to this size won't be torn
| Property | SATA HDD | SATA SSD | NVMe SSD | Optane |
|---|---|---|---|---|
| Sector atomic (512B) | Yes | Yes | Yes | Yes |
| 4KB atomic | Sometimes* | Usually | Yes (AWUN) | Yes |
| 16KB atomic | No | Rare | Some models | Yes |
| NCQ/Queue depth | 32 | 32 | 64K | 64K |
| Per-queue order | No | No | No** | No |
| Flush latency | 10-30ms | 50-200µs | 10-100µs | <10µs |
| FUA support | Yes | Yes | Yes | Yes |
*Some modern HDDs have 4KB physical sectors with 4KB atomic write. **NVMe spec doesn't guarantee queue ordering; drives may reorder.
The Power-Fail Protection Consideration
Consumer vs. enterprise drives differ significantly:
Consumer SSD (no PLP):
- Volatile DRAM cache for write staging
- Power loss = cache content lost
- May also corrupt data being programmed
Enterprise SSD (with PLP):
- Capacitor or battery backup
- 10-100ms of power for emergency flush
- All in-flight data written to NAND
Critical for ordering:
- Without PLP, even FUA writes may be lost if power fails during NAND programming
- Must assume worst case for consumer hardware
- Or use UPS + clean shutdown guarantees
Testing Hardware Guarantees
Never trust vendor claims—test:
# diskchecker.pl - classic tool for verifying barrier behavior
# Writes pattern, crashes, checks pattern survived
# dm-log-writes - record and replay I/O
dmsetup create log-test --table \
"0 $(blockdev --getsz /dev/sda) \
log-writes /dev/sda /dev/sdb"
# Write workload, crash, check /dev/sdb log for ordering violations
VMs add layers that may violate ordering. A VM's flush may not reach physical media if the hypervisor has its own cache. AWS EBS, for example, maintains ordering within a volume but the timing of persistence is opaque. Always test your specific deployment environment.
Ordering bugs are notoriously difficult to diagnose because they only manifest after crashes, and crashes are (hopefully) rare. Here are techniques for finding and fixing these bugs.
Symptom Recognition
Symptoms suggesting ordering problems:
1. "Corruption" that fsck fixes
- Usually metadata/data mismatch
- Indicates metadata written before/without data
2. Lost transactions after crash
- Commit was acknowledged
- After crash, transaction is gone
- Commit notification sent before log durable
3. Partial records
- Half-written data structures
- CRCs failing
- No barrier between multi-block writes
4. "Impossible" states
- Child object exists without parent
- Reference count negative
- Usually missing dependency ordering
Tracing I/O Order
#!/bin/bash
# Trace block I/O to identify ordering issues

# 1. Enable block tracing
echo 1 > /sys/kernel/debug/tracing/events/block/enable

# 2. Run workload
./my_application &

# 3. Capture trace (shows order of block writes)
cat /sys/kernel/debug/tracing/trace_pipe | while read line; do
    echo "$line"
    # Look for:
    #   block_rq_issue: request submitted to device
    #   block_rq_complete: request completed
    #   block_bio_queue: bio queued
done

# 4. Look for ordering:
# Are flushes (REQ_PREFLUSH) appearing where expected?
# Are FUA writes (REQ_FUA) on critical data?
# Is there gap between data write and commit?

# Example trace line:
#   block_rq_issue: 8,0 WFS 123456 + 8 [my_app] 0
#                       ^^^
#   W = Write
#   F = FUA
#   S = Sync
#
# Look for pattern: [data writes] [WFS] [more data]
#                                 ^barrier

# 5. Use blktrace for more detailed analysis
blktrace -d /dev/sda -o trace &
./my_application
kill %1
blkparse -i trace > trace.txt
Static Analysis for Ordering Bugs
Some ordering bugs can be found through code review:
Checklist:
□ After creating new file: fsync file AND parent directory?
□ After rename: fsync destination directory?
□ After log append: fsync before commit acknowledgment?
□ Writing multi-block structure: barrier before pointer update?
□ In-place update of structure: using COW or WAL?
□ Error paths: do they also maintain ordering?
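The first two checklist items can be wrapped in a small helper. This is a sketch with a name of our choosing; note that fsyncing a directory file descriptor to persist its entries is a Linux convention.

```python
import os
import tempfile

def fsync_file_and_dir(path):
    """After creating or renaming `path`, make both the file content
    and the directory entry that names it durable."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)            # the file's content
    finally:
        os.close(fd)
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)           # the directory entry (the name itself)
    finally:
        os.close(dfd)

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "state.json")
    with open(p, "w") as f:
        f.write("{}")
    fsync_file_and_dir(p)       # both levels synced, no exception
```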
Crash Testing Infrastructure
For systematic testing, simulate crashes at various points:
# Pseudo-code for crash testing framework
class CrashTestHarness:
    def __init__(self, app):
        self.app = app
        self.write_log = []  # All writes captured

    def intercept_write(self, fd, data, barrier=False):
        # Record write
        self.write_log.append({
            'fd': fd,
            'data': data.copy(),
            'barrier': barrier
        })
        # Actually write
        os.write(fd, data)

    def simulate_crash_at(self, n):
        """Simulate crash after write n completes."""
        # Create fresh test device
        setup_test_device()

        # Replay first n writes
        for i, write in enumerate(self.write_log[:n]):
            test_write(write['fd'], write['data'])
            if write['barrier']:
                test_sync(write['fd'])

        # "Crash" - writes after n are lost
        # Run recovery
        self.app.recover(test_device)

        # Check consistency
        return self.app.check_consistency()

    def test_all_crash_points(self):
        """Test crash at every possible point."""
        for i in range(len(self.write_log)):
            if not self.simulate_crash_at(i):
                print(f"Consistency failure at crash point {i}")
                print(f"Last successful write: {self.write_log[i-1]}")
                print(f"Lost write: {self.write_log[i]}")
This approach is used by SQLite, PostgreSQL, and other databases to verify crash safety.
For simpler testing, use two disk images: 'Alice' (mid-transaction) and 'Bob' (post-commit). Crash at transaction boundary, replay to either state, verify consistency. If either Alice or Bob is corrupt, you have an ordering bug.
Write ordering is the foundation of crash consistency. By explicitly controlling which writes complete before others, we can build data structures that survive system failures without requiring expensive synchronous I/O for every operation.
What's Next
We've covered the mechanisms for ordering writes. Our final topic brings everything together: data integrity—the comprehensive strategies for ensuring that stored data remains correct, encompassing checksums, error detection, end-to-end verification, and the policies that determine acceptable risk levels for different types of data.
You now understand write ordering at a deep level—from barrier primitives to filesystem strategies to crash-safe data structure patterns. This knowledge is essential for building systems that maintain consistency despite inevitable failures.