Operating SystemsBuffer Cache

Buffer Cache

LevelIntermediate

Duration90 mins

TopicBuffer Cache

5 / 5

Sync Operations

From Cache to Permanence

Throughout this module, we've examined how the buffer cache accelerates file system performance by keeping data in fast memory. But performance without durability is hollow—data that exists only in volatile memory vanishes on power failure. Sync operations bridge this gap: they are the mechanisms that force data from the buffer cache onto stable, persistent storage.

Understanding sync operations is essential for any engineer building reliable software:

Application developers need to know when to call fsync() and what it guarantees
Database engineers must understand the performance-durability tradeoffs of different sync strategies
System administrators need to tune sync behavior for their workloads
Operating system developers must implement these operations correctly across diverse storage hardware

In this final page of the buffer cache module, we'll examine every aspect of synchronization: from high-level system calls to kernel implementation to storage device behavior.

What You Will Learn

By the end of this page, you will understand: (1) The complete sync() family of system calls and their guarantees, (2) How the kernel implements these operations internally, (3) The difference between data sync and metadata sync, (4) Storage device flush commands and their interaction with software sync, (5) Performance implications and optimization strategies, and (6) Real-world sync patterns for databases and critical applications.

The Sync System Call Family

POSIX defines several system calls for synchronizing cached data to storage. Each has different scope, guarantees, and performance characteristics.

The Complete Family:

Sync System Calls Comparison
Call	Scope	Waits?	Syncs Data?	Syncs Metadata?
`sync()`	All filesystems	Usually no*	Yes	Yes
`syncfs(fd)`	Single filesystem	Yes	Yes	Yes
`fsync(fd)`	Single file	Yes	Yes	Yes
`fdatasync(fd)`	Single file	Yes	Yes	Partial**
`sync_file_range()`	Byte range	Configurable	Yes	No
`msync(addr, len, flags)`	Mapped region	Configurable	Yes	Yes

*sync() historically returned immediately after initiating writeback; modern Linux waits for completion.

**fdatasync() skips metadata that doesn't affect data retrieval (e.g., atime, mtime) but includes metadata that does (e.g., file size).

sync_calls.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
 
/*
 * sync() - Synchronize all filesystems
 *
 * Schedules or performs writeback of all dirty data and metadata
 * across all mounted filesystems.
 *
 * Returns: void (cannot fail in traditional UNIX semantics)
 */
void sync(void);  /* Flushes entire system */
 
/*
 * syncfs(fd) - Synchronize single filesystem
 *
 * Forces writes for all dirty data on the filesystem containing fd.
 * More targeted than sync() for systems with many filesystems.
 *
 * Returns: 0 on success, -1 on error
 */
int backup_filesystem_sync(const char *mountpoint) {
    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    
    /* Sync only this filesystem */
    if (syncfs(fd) != 0) {
        perror("syncfs");
        close(fd);
        return -1;
    }
    
    close(fd);
    return 0;
}
 
/*
 * fsync(fd) - Synchronize single file
 *
 * Forces all data and metadata for fd to stable storage.
 * Returns only after data is confirmed durable.
 *
 * This is the PRIMARY durability mechanism for applications.
 */
int durable_write(int fd, const void *data, size_t len) {
    /* Write data to kernel buffer */
    ssize_t written = write(fd, data, len);
    if (written != len) {
        return -1;
    }
    
    /* Force to stable storage */
    if (fsync(fd) != 0) {
        perror("fsync");
        return -1;  /* Data may not be durable! */
    }
    
    return 0;  /* Data is now on stable storage */
}
 
/*
 * fdatasync(fd) - Synchronize file data only
 *
 * Like fsync() but omits metadata that doesn't affect data retrieval.
 * Faster when you don't need timestamp updates to be durable.
 */
int durable_data_write(int fd, const void *data, size_t len) {
    write(fd, data, len);
    
    /* Only sync data and essential metadata (like file size) */
    if (fdatasync(fd) != 0) {
        return -1;
    }
    
    return 0;
    /* File content is durable; mtime may not be */
}
 
/*
 * sync_file_range() - Fine-grained sync control (Linux-specific)
 *
 * Offers surgical control over which portions of a file to sync
 * and whether to wait for completion.
 */
#define SYNC_FILE_RANGE_WAIT_BEFORE  1
#define SYNC_FILE_RANGE_WRITE        2
#define SYNC_FILE_RANGE_WAIT_AFTER   4
 
void streaming_write_example(int fd) {
    char buffer[1024 * 1024];  /* 1MB chunks */
    off_t offset = 0;
    
    while (/* have more data */) {
        /* Fill buffer with data */
        // ...
        
        /* Write to kernel buffer */
        pwrite(fd, buffer, sizeof(buffer), offset);
        
        /* Initiate async writeback for this chunk */
        /* Don't wait - let it happen in background */
        sync_file_range(fd, offset, sizeof(buffer), SYNC_FILE_RANGE_WRITE);
        
        offset += sizeof(buffer);
    }
    
    /* At end, ensure everything is synced */
    sync_file_range(fd, 0, offset,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);
}
 
/*
 * msync() - Synchronize memory-mapped file region
 */
void mmap_sync_example(void *addr, size_t len) {
    /* MS_SYNC: Wait for writeback to complete */
    if (msync(addr, len, MS_SYNC) != 0) {
        perror("msync");
    }
    
    /* MS_ASYNC: Schedule writeback, don't wait */
    msync(addr, len, MS_ASYNC);
    
    /* MS_INVALIDATE: Invalidate cached pages (force re-read) */
    msync(addr, len, MS_INVALIDATE);
}

fsync Error Handling Critical

If fsync() fails, your data may not be durable! Ignoring fsync() errors has caused data loss in real applications. Always check the return value, and if fsync fails, consider the data at risk. Some file systems (historically ext3/4) had issues reporting errors correctly—test your specific setup.

Kernel Implementation of Sync Operations

Understanding how the kernel implements sync operations helps explain their behavior and performance. Let's trace through the implementation path.

fsync() Implementation Flow:

Converting Mermaid diagram...

fsync_implementation.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
/*
 * Linux fsync() implementation (simplified)
 * Real code in fs/sync.c and specific FS implementations
 */
 
/* System call entry point */
SYSCALL_DEFINE1(fsync, unsigned int, fd) {
    struct fd f = fdget(fd);
    int ret;
    
    if (!f.file)
        return -EBADF;
    
    ret = vfs_fsync(f.file, 0);  /* 0 = sync data AND metadata */
    fdput(f);
    return ret;
}
 
/* VFS layer fsync implementation */
int vfs_fsync(struct file *file, int datasync) {
    return vfs_fsync_range(file, 0, LLONG_MAX, datasync);
}
 
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, 
                    int datasync) {
    struct inode *inode = file_inode(file);
    int ret;
    
    /* Call filesystem-specific fsync if provided */
    if (file->f_op->fsync) {
        return file->f_op->fsync(file, start, end, datasync);
    }
    
    /* Generic implementation */
    ret = sync_inode_metadata(inode, 1);  /* 1 = wait */
    if (!ret)
        ret = filemap_fdatawrite_range(file->f_mapping, start, end);
    if (!ret)
        ret = filemap_fdatawait_range(file->f_mapping, start, end);
    return ret;
}
 
/* Write all dirty pages in the specified range */
int filemap_fdatawrite_range(struct address_space *mapping,
                              loff_t start, loff_t end) {
    struct writeback_control wbc = {
        .sync_mode = WB_SYNC_ALL,       /* Wait for completion */
        .nr_to_write = LONG_MAX,        /* No page limit */
        .range_start = start,
        .range_end = end,
    };
    
    return mapping->a_ops->writepages(mapping, &wbc);
}
 
/* Wait for all I/O on pages in range to complete */
int filemap_fdatawait_range(struct address_space *mapping,
                            loff_t start, loff_t end) {
    pgoff_t start_idx = start >> PAGE_SHIFT;
    pgoff_t end_idx = end >> PAGE_SHIFT;
    struct page *page;
    int ret = 0;
    
    /* Iterate through all pages in range */
    for (pgoff_t idx = start_idx; idx <= end_idx; idx++) {
        page = find_get_page(mapping, idx);
        if (!page)
            continue;
        
        /* Wait if page has writeback in progress */
        if (PageWriteback(page))
            wait_on_page_writeback(page);
        
        /* Check for I/O error */
        if (PageError(page))
            ret = -EIO;
        
        put_page(page);
    }
    
    return ret;
}
 
/*
 * Ext4-specific fsync (more sophisticated)
 */
int ext4_sync_file(struct file *file, loff_t start, loff_t end, 
                   int datasync) {
    struct inode *inode = file_inode(file);
    int ret;
    
    /* Handle journaled data case */
    if (EXT4_JOURNAL(inode)) {
        /* Wait for journal commit covering our data */
        ret = ext4_jbd2_inode_add_wait(inode, start, end);
        if (ret)
            return ret;
    }
    
    /* Write file data */
    ret = file_write_and_wait_range(file, start, end);
    if (ret)
        return ret;
    
    /* Sync metadata if needed */
    if (!datasync || ext4_inode_data_dirty(inode)) {
        ret = ext4_write_inode(inode, WB_SYNC_ALL);
        if (ret)
            return ret;
    }
    
    /* Issue storage device flush */
    if (needs_barrier(inode->i_sb)) {
        ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
    }
    
    return ret;
}

The Critical Device Flush

Note the blkdev_issue_flush() at the end. Even after data is written to the device, it may sit in the device's volatile cache. The flush command forces data to stable media. Without this, fsync() could return success while data is still in volatile device memory—leading to data loss on power failure.

Data vs. Metadata Synchronization

Understanding the distinction between data and metadata sync is crucial for performance optimization. File operations affect both data and metadata, but not all metadata is equally important for durability.

What is File Metadata?

File Metadata Categories
Metadata	Synced by fsync?	Synced by fdatasync?	Why it matters
File size	Yes	Yes*	Must know how much data to read
Block pointers	Yes	Yes	Must find where data is stored
mtime (modify time)	Yes	No	Nice-to-have, not for data retrieval
atime (access time)	Yes	No	Often disabled entirely
ctime (change time)	Yes	No	Nice-to-have
Permissions	Yes	No	Doesn't affect data retrieval
Owner/group	Yes	No	Doesn't affect data retrieval

*fdatasync syncs file size only if it changed, because knowing the correct size is essential for reading the file correctly.

fsync_vs_fdatasync.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
/*
 * When to use fsync() vs fdatasync()
 */
 
#include <unistd.h>
#include <fcntl.h>
#include <time.h>
 
/*
 * SCENARIO 1: Appending to a log file
 *
 * Each append extends the file, so size changes.
 * fdatasync() syncs both data and the new size.
 * Timestamps don't matter for log integrity.
 */
void append_log_entry(int fd, const char *entry, size_t len) {
    write(fd, entry, len);
    fdatasync(fd);  /* Syncs data + size */
    /* Faster than fsync() because skips mtime/ctime */
}
 
/*
 * SCENARIO 2: Overwriting existing data (no size change)
 *
 * If file size doesn't change, fdatasync() only syncs data.
 * This is even faster - no metadata I/O at all.
 */
void update_record(int fd, off_t offset, const void *data, size_t len) {
    pwrite(fd, data, len, offset);
    fdatasync(fd);  /* Only data, no metadata (size unchanged) */
    /* This can be 2x faster than fsync() */
}
 
/*
 * SCENARIO 3: Creating a new file (fsync required)
 *
 * New file needs directory entry synced too!
 * fsync() on file PLUS fsync() on directory
 */
void create_durable_file(const char *path, const void *data, size_t len) {
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    
    write(fd, data, len);
    fsync(fd);  /* Use fsync for new files */
    close(fd);
    
    /* CRITICAL: Sync the directory to make filename durable */
    int dir_fd = open(dirname_path(path), O_RDONLY);
    fsync(dir_fd);
    close(dir_fd);
}
 
/*
 * SCENARIO 4: Preserving timestamps matters
 *
 * Some applications rely on mtime for synchronization
 * (e.g., rsync, make). Use fsync() in these cases.
 */
void timestamp_critical_write(int fd, const void *data, size_t len) {
    write(fd, data, len);
    fsync(fd);  /* Ensure mtime is also persisted */
}
 
/*
 * Performance comparison: fsync vs fdatasync
 * 
 * On a test system with ext4:
 * - fsync() on overwrite: ~8ms
 * - fdatasync() on overwrite: ~4ms (2x faster!)
 * 
 * The difference comes from:
 * 1. Writing inode block for timestamp update
 * 2. Possibly journal transaction for metadata
 * 
 * For high-frequency sync (like database commits),
 * fdatasync() can significantly improve throughput.
 */

PostgreSQL's Choice

PostgreSQL uses fdatasync() by default for WAL (Write-Ahead Log) syncing because WAL entries don't need accurate timestamps. This can provide 30-50% better transaction throughput compared to fsync(). The choice can be configured via wal_sync_method.

Storage Device Flush Commands

The operating system's sync operations ultimately depend on storage devices honoring flush commands. Understanding these commands and their behavior across device types is essential for true durability guarantees.

The Storage Stack:

Application: fsync(fd)
    ↓
VFS Layer: Writes dirty pages
    ↓
Block Layer: Submits I/O requests
    ↓
Device Driver: Sends commands to device
    ↓
Storage Device: Writes to volatile cache
    ↓  (flush command)
Storage Media: Data on stable storage

Storage Flush Commands by Interface
Interface	Flush Command	Description
SATA/ATA	FLUSH CACHE / FLUSH CACHE EXT	Flushes volatile write cache to platters/cells
SCSI/SAS	SYNCHRONIZE CACHE	Ensures data in cache reaches medium
NVMe	Flush	Commits data in volatile write cache
MMC/eMMC	CACHE_FLUSH	Flushes internal cache to flash

device_flush.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
/*
 * How the kernel issues device flush commands
 */
 
/*
 * Issue a flush/sync to a block device
 * Called at end of fsync() path
 */
int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
                       sector_t *error_sector) {
    struct bio *bio;
    int ret = 0;
    
    /* Create a bio (block I/O request) with no data */
    bio = bio_alloc(gfp_mask, 0);
    bio_set_dev(bio, bdev);
    bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
    
    /* Submit and wait for completion */
    ret = submit_bio_wait(bio);
    
    bio_put(bio);
    return ret;
}
 
/*
 * NVMe flush command handling (in driver)
 */
void nvme_execute_flush(struct nvme_ns *ns, struct request *req) {
    struct nvme_command cmd = {
        .common = {
            .opcode = nvme_cmd_flush,
            .nsid = cpu_to_le32(ns->head->ns_id),
        },
    };
    
    /* Send flush command to controller */
    nvme_submit_sync_cmd(ns->ctrl, &cmd, NULL, 0);
    /* Returns when flush is complete */
}
 
/*
 * Force Unit Access (FUA) - Alternative to flush
 *
 * FUA writes bypass the volatile cache entirely.
 * Used for individual critical writes without draining entire cache.
 */
void write_with_fua(struct block_device *bdev, void *data, 
                     sector_t sector, size_t len) {
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);
    
    bio_set_dev(bio, bdev);
    bio->bi_iter.bi_sector = sector;
    bio_add_page(bio, virt_to_page(data), len, offset_in_page(data));
    
    /* REQ_FUA: Force Unit Access - bypass volatile cache */
    bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
    
    submit_bio_wait(bio);
    bio_put(bio);
    /* This write is directly on stable media */
}

Volatile Cache Dangers

Some storage devices have non-battery-backed volatile write caches that ignore flush commands (for performance). Data 'synced' to such devices is NOT durable! Enterprise SSDs typically have power-loss protection (capacitors). Check your device specs. For critical data, disable volatile write caches or use devices with PLP (Power Loss Protection).

disable_write_cache.sh
Bash
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#!/bin/bash
# Managing storage device write cache
 
# Check current write cache status (SATA/ATA)
hdparm -W /dev/sda
# /dev/sda:
#  write-caching =  1 (on)
 
# Disable write cache (sacrifices performance for safety)
hdparm -W0 /dev/sda
 
# For SCSI devices
sdparm --get=WCE /dev/sdb  # Get Write Cache Enable
sdparm --set=WCE=0 /dev/sdb  # Disable
 
# Check NVMe volatile write cache
nvme id-ctrl /dev/nvme0 | grep vwc
#  vwc   : 1  (volatile write cache present)
 
# Enterprise SSDs with Power Loss Protection (PLP)
# These are SAFE to have write cache enabled because
# capacitors flush data on power loss

Performance Implications of Sync

Sync operations have profound performance implications. Understanding these helps you make informed tradeoffs between durability and speed.

The Cost of Sync:

Each fsync() involves multiple expensive operations:

fsync() Component Costs (Typical)
Component	HDD Time	SSD Time	NVMe Time
Dirty page writeback	5-10ms	0.5-2ms	0.1-0.5ms
Metadata/Journal write	2-5ms	0.5-1ms	0.1-0.3ms
Device flush command	5-15ms	1-5ms	0.5-2ms
Total fsync()	12-30ms	2-8ms	0.7-3ms
Maximum TPS*	~30-80	~120-500	~300-1400

*TPS = Transactions Per Second if each transaction requires one fsync.

Why Sync is Slow:

Sequential bottleneck: Only one sync can be truly in-flight per device
Cache drain: Flush commands drain entire device cache
Rotational latency: HDDs must complete a full rotation
Flash write latency: SSDs must program cells (slower than reads)

sync_performance.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
/*
 * Performance optimization strategies for sync-heavy workloads
 */
 
/*
 * STRATEGY 1: Group Commit
 *
 * Instead of syncing each write, batch multiple writes
 * and sync once for the entire batch.
 */
struct batch {
    int count;
    int max_count;
    int fd;
};
 
void batch_write(struct batch *b, const void *data, size_t len) {
    write(b->fd, data, len);
    b->count++;
    
    if (b->count >= b->max_count) {
        /* Batch is full, sync now */
        fdatasync(b->fd);
        b->count = 0;
    }
}
 
/* Result: 
 * 100 individual writes with sync: 100 × 5ms = 500ms
 * 100 writes batched, 1 sync:      100 × 0.001ms + 5ms ≈ 5ms
 * 100x improvement!
 */
 
 
/*
 * STRATEGY 2: Parallel Sync
 *
 * Multiple devices can be synced in parallel.
 * Distribute data across devices for higher throughput.
 */
void parallel_sync(int *fds, int count) {
    pthread_t threads[count];
    
    for (int i = 0; i < count; i++) {
        pthread_create(&threads[i], NULL, 
                       (void*)fsync_thread, &fds[i]);
    }
    
    for (int i = 0; i < count; i++) {
        pthread_join(threads[i], NULL);
    }
    
    /* With 4 SSDs: 4 × 200 TPS = 800 TPS aggregate */
}
 
 
/*
 * STRATEGY 3: Asynchronous Writeback + Barrier
 *
 * Use sync_file_range() to start writeback early,
 * then sync at commit points.
 */
void pipelined_write(int fd, void *data, size_t chunk_size, int chunks) {
    for (int i = 0; i < chunks; i++) {
        off_t offset = i * chunk_size;
        
        /* Write to page cache */
        pwrite(fd, data + offset, chunk_size, offset);
        
        /* Start async writeback (don't wait) */
        sync_file_range(fd, offset, chunk_size, 
                        SYNC_FILE_RANGE_WRITE);
    }
    
    /* Final sync - wait for all writeback */
    sync_file_range(fd, 0, chunks * chunk_size,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);
    fdatasync(fd);
}
 
 
/*
 * STRATEGY 4: Use O_DSYNC or O_SYNC for implicit sync
 *
 * Opens file with automatic sync on every write.
 * Good for small, critical writes.
 */
void always_sync_writes() {
    /* O_DSYNC: Every write is like write + fdatasync */
    int fd = open("critical.dat", O_WRONLY | O_DSYNC);
    write(fd, data, len);  /* Returns after data durable */
    
    /* O_SYNC: Every write is like write + fsync */
    int fd2 = open("more_critical.dat", O_WRONLY | O_SYNC);
    write(fd2, data, len);  /* Returns after data+metadata durable */
}

Database Group Commit

Production databases like PostgreSQL, MySQL, and MongoDB use group commit extensively. PostgreSQL's commit_delay parameter introduces a brief wait (0-100ms) after the first transaction to accumulate more transactions for a single WAL sync. This can improve throughput by 10x or more under concurrent load.

Sync Patterns for Databases

Databases have the most demanding sync requirements: they must provide ACID durability while maintaining high transaction throughput. Let's examine the sync patterns used by real database systems.

The WAL Sync Challenge:

Relational databases use Write-Ahead Logging (WAL): every transaction's changes are written to a log and synced before the transaction commits. This creates a sync bottleneck—every commit requires at least one fsync.

database_sync_patterns.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
/*
 * Database Sync Patterns
 */
 
/*
 * PATTERN 1: Synchronous Commit (Default PostgreSQL)
 *
 * Every commit waits for WAL fsync.
 * Maximum durability, lowest throughput.
 */
void synchronous_commit(struct transaction *txn) {
    /* Write transaction to WAL buffer */
    write_wal_record(txn);
    
    /* Force WAL to disk */
    fdatasync(wal_fd);  /* ~4ms */
    
    /* Acknowledge commit */
    txn->status = COMMITTED;
    /* Client can proceed - data is durable */
}
/* Throughput: ~250 TPS per disk */
 
 
/*
 * PATTERN 2: Group Commit (PostgreSQL, MySQL)
 *
 * Batch multiple commits into one fsync.
 * Slight latency increase, huge throughput gain.
 */
void group_commit() {
    /* Collect commits for a short window */
    collect_pending_commits(COMMIT_DELAY_MS);
    
    /* Write all pending to WAL */
    for_each_pending(txn) {
        write_wal_record(txn);
    }
    
    /* Single fsync for all */
    fdatasync(wal_fd);
    
    /* Acknowledge all commits */
    for_each_pending(txn) {
        txn->status = COMMITTED;
        wake_client(txn);
    }
}
/* Throughput: 10,000+ TPS with concurrent connections */
 
 
/*
 * PATTERN 3: Asynchronous Commit (PostgreSQL)
 *
 * Don't wait for fsync; accept data loss risk.
 * Maximum throughput, potential loss window.
 */
void async_commit(struct transaction *txn) {
    /* Write to WAL buffer (not yet on disk) */
    write_wal_record(txn);
    
    /* Immediately acknowledge - data NOT durable! */
    txn->status = COMMITTED;
    
    /* Background thread eventually syncs */
    /* Risk: last ~10ms of transactions lost on crash */
}
/* Throughput: 50,000+ TPS (memory speed) */
 
 
/*
 * PATTERN 4: Commit=No_wait + Group Sync (MySQL InnoDB)
 *
 * innodb_flush_log_at_trx_commit settings:
 * 0: Log written & synced once/second (may lose 1s data)
 * 1: Log written & synced on each commit (full durability)
 * 2: Log written on commit, synced once/second (lose 1s)
 */
 
/*
 * PATTERN 5: Parallel WAL (PostgreSQL 9.4+)
 *
 * Multiple backends can populate WAL concurrently.
 * Single sync still needed, but write phase parallelized.
 */
 
 
/*
 * PATTERN 6: Double-Write Buffer (MySQL InnoDB)
 *
 * Problem: Partial page writes on crash (torn pages).
 * Solution: Write pages to doublewrite buffer first.
 */
void doublewrite_flush(struct page **pages, int count) {
    /* Write all pages to sequential doublewrite space */
    for (int i = 0; i < count; i++) {
        pwrite(dblwr_fd, pages[i], PAGE_SIZE, dblwr_offset);
        dblwr_offset += PAGE_SIZE;
    }
    
    /* Single fsync for doublewrite */
    fdatasync(dblwr_fd);
    
    /* Now safe to write to actual locations */
    for (int i = 0; i < count; i++) {
        pwrite(datafile_fd, pages[i], PAGE_SIZE, pages[i]->offset);
    }
    
    /* On crash: recover torn pages from doublewrite */
}

Database Sync Configuration Examples
Database	Setting	Effect
PostgreSQL	synchronous_commit = on	Full durability, lower TPS
PostgreSQL	synchronous_commit = off	Higher TPS, risk window
PostgreSQL	commit_delay = 10	10 µs wait for group commit
MySQL InnoDB	innodb_flush_log_at_trx_commit = 1	Full durability
MySQL InnoDB	innodb_flush_log_at_trx_commit = 2	OS-level sync timing
MongoDB	j: true (write concern)	Wait for journal fsync
MongoDB	w: majority, j: false	Replicated but not synced

The Replication Alternative

Modern databases often use synchronous replication instead of (or alongside) fsync for durability. Writing to two nodes before acknowledging protects against single-node failure and can be faster than fsync if replicas are memory-acknowledged. This is why cloud databases often offer 'durability' without waiting for disk sync.

Testing and Verifying Sync Behavior

Sync behavior is notoriously difficult to test—you can't easily simulate power failures in software. However, several techniques help verify that your application's durability guarantees hold.

Testing Strategies:

sync_testing.sh
Bash
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
#!/bin/bash
# Testing sync behavior and durability
 
# 1. STRACE: Verify sync calls are happening
strace -e fsync,fdatasync,sync,sync_file_range \
    ./my_database_app 2>&1 | head -50
 
# Expected output:
# fsync(5) = 0
# fdatasync(5) = 0
# ...
 
# 2. Track device I/O during sync
# In terminal 1:
iostat -x 1 /dev/sda
 
# In terminal 2:
fsync_test_program
 
# Look for:
# - Writes during sync
# - Device utilization spikes
# - await (average wait time)
 
# 3. BLKTRACE: Detailed block I/O tracing
blktrace -d /dev/sda -o trace &
./my_app --run-test
kill %1
blkparse trace.* | grep -i flush
# Look for FLUSH operations
 
# 4. Use dm-flakey to simulate failures
# Create a device that randomly fails
dmsetup create flakey --table "0 $(blockdev --getsz /dev/loop0) \
    flakey /dev/loop0 0 30 5 1 drop_writes"
# 30 seconds good, 5 seconds dropping writes
 
# Run your application against /dev/mapper/flakey
# Check data integrity after simulated failure
 
# 5. Crash testing with VMs
# - Run workload in VM
# - Abruptly terminate VM process
# - Boot VM and check data integrity
 
# 6. SystemTap probes for sync tracking
stap -e 'probe kernel.function("blkdev_issue_flush") { 
    printf("Flush on %s
", kernel_string($bdev->bd_disk->disk_name)) 
}'

Power-Loss Testing:

For critical applications, simulate actual power loss:

Power Loss Test Methods

•Virtual machine kill: Abruptly terminate QEMU/VMware process (kill -9)
•Remote power control: Use IPMI/iLO to hard-power-off servers
•Smart PDU: Controllable power strips that can cut power programmatically
•Battery disconnect: For laptops/embedded systems
•dm-flakey/dm-delay: Linux device mapper targets simulating failures

The Lie of Virtualized Storage

Virtual machines often virtualize fsync—the hypervisor may acknowledge fsync before data reaches host storage. For durability testing, ensure your VM disk is configured for write-through caching, or test on bare metal. Many cloud VMs provide NO durability guarantee for local storage!

Summary: Sync Operations Mastery

We've completed our deep exploration of sync operations—the critical mechanisms that transform cached data into durable, persistent storage. This knowledge completes our understanding of the buffer cache and its role in file system performance and reliability.

Key Takeaways

•The sync family offers different scope/guarantees — sync(), syncfs(), fsync(), fdatasync(), sync_file_range(), and msync() each serve different use cases.
•fsync waits for stable storage confirmation — It doesn't return until data is on non-volatile media (or the device lies about it).
•fdatasync skips non-essential metadata — Often 2x faster than fsync for overwrites, syncs size changes but not timestamps.
•Device flush commands are the final step — Without them, data may remain in volatile device cache.
•Sync is expensive but necessary — Each fsync can take 3-30ms depending on device, limiting transactions per second.
•Group commit amortizes sync cost — Batching multiple operations per sync dramatically improves throughput.
•Testing sync is hard but essential — Use strace, blktrace, VM kill testing, and power-loss tests to verify durability.

Module Complete:

Congratulations! You've completed the Buffer Cache module—the foundational layer of file system performance. You now understand:

Block caching fundamentals: Why caching exists, how blocks are organized
Cache management: LRU, LFU, ARC, and practical replacement policies
Write-back policies: Delayed writes, dirty tracking, and writeback threads
Cache consistency: Coherence across processes, systems, and crashes
Sync operations: The machinery that provides durability guarantees

This knowledge forms the foundation for understanding file system performance optimization, database internals, and building reliable software systems.

Module Complete

You have mastered the Buffer Cache module. You understand the complete lifecycle of data from application write through cache residence to stable storage, including all the performance and durability tradeoffs involved. This knowledge will serve you in database development, system administration, performance tuning, and building any software that cares about data persistence.

5 / 5

Loading learning content...

Operating SystemsBuffer Cache

Buffer Cache

LevelIntermediate

Duration90 mins

TopicBuffer Cache

5 / 5

Sync Operations

From Cache to Permanence

Understanding sync operations is essential for any engineer building reliable software:

Application developers need to know when to call fsync() and what it guarantees
Database engineers must understand the performance-durability tradeoffs of different sync strategies
System administrators need to tune sync behavior for their workloads
Operating system developers must implement these operations correctly across diverse storage hardware

In this final page of the buffer cache module, we'll examine every aspect of synchronization: from high-level system calls to kernel implementation to storage device behavior.

What You Will Learn

The Sync System Call Family

POSIX defines several system calls for synchronizing cached data to storage. Each has different scope, guarantees, and performance characteristics.

The Complete Family:

Sync System Calls Comparison
Call	Scope	Waits?	Syncs Data?	Syncs Metadata?
`sync()`	All filesystems	Usually no*	Yes	Yes
`syncfs(fd)`	Single filesystem	Yes	Yes	Yes
`fsync(fd)`	Single file	Yes	Yes	Yes
`fdatasync(fd)`	Single file	Yes	Yes	Partial**
`sync_file_range()`	Byte range	Configurable	Yes	No
`msync(addr, len, flags)`	Mapped region	Configurable	Yes	Yes

*sync() historically returned immediately after initiating writeback; modern Linux waits for completion.

**fdatasync() skips metadata that doesn't affect data retrieval (e.g., atime, mtime) but includes metadata that does (e.g., file size).

sync_calls.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
 
/*
 * sync() - Synchronize all filesystems
 *
 * Schedules or performs writeback of all dirty data and metadata
 * across all mounted filesystems.
 *
 * Returns: void (cannot fail in traditional UNIX semantics)
 */
void sync(void);  /* Flushes entire system */
 
/*
 * syncfs(fd) - Synchronize single filesystem
 *
 * Forces writes for all dirty data on the filesystem containing fd.
 * More targeted than sync() for systems with many filesystems.
 *
 * Returns: 0 on success, -1 on error
 */
int backup_filesystem_sync(const char *mountpoint) {
    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    
    /* Sync only this filesystem */
    if (syncfs(fd) != 0) {
        perror("syncfs");
        close(fd);
        return -1;
    }
    
    close(fd);
    return 0;
}
 
/*
 * fsync(fd) - Synchronize single file
 *
 * Forces all data and metadata for fd to stable storage.
 * Returns only after data is confirmed durable.
 *
 * This is the PRIMARY durability mechanism for applications.
 */
int durable_write(int fd, const void *data, size_t len) {
    /* Write data to kernel buffer */
    ssize_t written = write(fd, data, len);
    if (written != len) {
        return -1;
    }
    
    /* Force to stable storage */
    if (fsync(fd) != 0) {
        perror("fsync");
        return -1;  /* Data may not be durable! */
    }
    
    return 0;  /* Data is now on stable storage */
}
 
/*
 * fdatasync(fd) - Synchronize file data only
 *
 * Like fsync() but omits metadata that doesn't affect data retrieval.
 * Faster when you don't need timestamp updates to be durable.
 */
int durable_data_write(int fd, const void *data, size_t len) {
    write(fd, data, len);
    
    /* Only sync data and essential metadata (like file size) */
    if (fdatasync(fd) != 0) {
        return -1;
    }
    
    return 0;
    /* File content is durable; mtime may not be */
}
 
/*
 * sync_file_range() - Fine-grained sync control (Linux-specific)
 *
 * Offers surgical control over which portions of a file to sync
 * and whether to wait for completion.
 */
#define SYNC_FILE_RANGE_WAIT_BEFORE  1
#define SYNC_FILE_RANGE_WRITE        2
#define SYNC_FILE_RANGE_WAIT_AFTER   4
 
void streaming_write_example(int fd) {
    char buffer[1024 * 1024];  /* 1MB chunks */
    off_t offset = 0;
    
    while (/* have more data */) {
        /* Fill buffer with data */
        // ...
        
        /* Write to kernel buffer */
        pwrite(fd, buffer, sizeof(buffer), offset);
        
        /* Initiate async writeback for this chunk */
        /* Don't wait - let it happen in background */
        sync_file_range(fd, offset, sizeof(buffer), SYNC_FILE_RANGE_WRITE);
        
        offset += sizeof(buffer);
    }
    
    /* At end, ensure everything is synced */
    sync_file_range(fd, 0, offset,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);
}
 
/*
 * msync() - Synchronize memory-mapped file region
 */
void mmap_sync_example(void *addr, size_t len) {
    /* MS_SYNC: Wait for writeback to complete */
    if (msync(addr, len, MS_SYNC) != 0) {
        perror("msync");
    }
    
    /* MS_ASYNC: Schedule writeback, don't wait */
    msync(addr, len, MS_ASYNC);
    
    /* MS_INVALIDATE: Invalidate cached pages (force re-read) */
    msync(addr, len, MS_INVALIDATE);
}

fsync Error Handling Critical

Kernel Implementation of Sync Operations

Understanding how the kernel implements sync operations helps explain their behavior and performance. Let's trace through the implementation path.

fsync() Implementation Flow:

Converting Mermaid diagram...

fsync_implementation.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
/*
 * Linux fsync() implementation (simplified)
 * Real code in fs/sync.c and specific FS implementations
 */
 
/* System call entry point */
SYSCALL_DEFINE1(fsync, unsigned int, fd) {
    struct fd f = fdget(fd);
    int ret;
    
    if (!f.file)
        return -EBADF;
    
    ret = vfs_fsync(f.file, 0);  /* 0 = sync data AND metadata */
    fdput(f);
    return ret;
}
 
/* VFS layer fsync implementation */
int vfs_fsync(struct file *file, int datasync) {
    return vfs_fsync_range(file, 0, LLONG_MAX, datasync);
}
 
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, 
                    int datasync) {
    struct inode *inode = file_inode(file);
    int ret;
    
    /* Call filesystem-specific fsync if provided */
    if (file->f_op->fsync) {
        return file->f_op->fsync(file, start, end, datasync);
    }
    
    /* Generic implementation */
    ret = sync_inode_metadata(inode, 1);  /* 1 = wait */
    if (!ret)
        ret = filemap_fdatawrite_range(file->f_mapping, start, end);
    if (!ret)
        ret = filemap_fdatawait_range(file->f_mapping, start, end);
    return ret;
}
 
/* Write all dirty pages in the specified range */
int filemap_fdatawrite_range(struct address_space *mapping,
                              loff_t start, loff_t end) {
    struct writeback_control wbc = {
        .sync_mode = WB_SYNC_ALL,       /* Wait for completion */
        .nr_to_write = LONG_MAX,        /* No page limit */
        .range_start = start,
        .range_end = end,
    };
    
    return mapping->a_ops->writepages(mapping, &wbc);
}
 
/* Wait for all I/O on pages in range to complete */
int filemap_fdatawait_range(struct address_space *mapping,
                            loff_t start, loff_t end) {
    pgoff_t start_idx = start >> PAGE_SHIFT;
    pgoff_t end_idx = end >> PAGE_SHIFT;
    struct page *page;
    int ret = 0;
    
    /* Iterate through all pages in range */
    for (pgoff_t idx = start_idx; idx <= end_idx; idx++) {
        page = find_get_page(mapping, idx);
        if (!page)
            continue;
        
        /* Wait if page has writeback in progress */
        if (PageWriteback(page))
            wait_on_page_writeback(page);
        
        /* Check for I/O error */
        if (PageError(page))
            ret = -EIO;
        
        put_page(page);
    }
    
    return ret;
}
 
/*
 * Ext4-specific fsync (more sophisticated)
 */
int ext4_sync_file(struct file *file, loff_t start, loff_t end, 
                   int datasync) {
    struct inode *inode = file_inode(file);
    int ret;
    
    /* Handle journaled data case */
    if (EXT4_JOURNAL(inode)) {
        /* Wait for journal commit covering our data */
        ret = ext4_jbd2_inode_add_wait(inode, start, end);
        if (ret)
            return ret;
    }
    
    /* Write file data */
    ret = file_write_and_wait_range(file, start, end);
    if (ret)
        return ret;
    
    /* Sync metadata if needed */
    if (!datasync || ext4_inode_data_dirty(inode)) {
        ret = ext4_write_inode(inode, WB_SYNC_ALL);
        if (ret)
            return ret;
    }
    
    /* Issue storage device flush */
    if (needs_barrier(inode->i_sb)) {
        ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
    }
    
    return ret;
}

The Critical Device Flush

Data vs. Metadata Synchronization

What is File Metadata?

File Metadata Categories
Metadata	Synced by fsync?	Synced by fdatasync?	Why it matters
File size	Yes	Yes*	Must know how much data to read
Block pointers	Yes	Yes	Must find where data is stored
mtime (modify time)	Yes	No	Nice-to-have, not for data retrieval
atime (access time)	Yes	No	Often disabled entirely
ctime (change time)	Yes	No	Nice-to-have
Permissions	Yes	No	Doesn't affect data retrieval
Owner/group	Yes	No	Doesn't affect data retrieval

*fdatasync syncs file size only if it changed, because knowing the correct size is essential for reading the file correctly.

fsync_vs_fdatasync.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
/*
 * When to use fsync() vs fdatasync()
 */
 
#include <unistd.h>
#include <fcntl.h>
#include <time.h>
 
/*
 * SCENARIO 1: Appending to a log file
 *
 * Each append extends the file, so size changes.
 * fdatasync() syncs both data and the new size.
 * Timestamps don't matter for log integrity.
 */
void append_log_entry(int fd, const char *entry, size_t len) {
    write(fd, entry, len);
    fdatasync(fd);  /* Syncs data + size */
    /* Faster than fsync() because skips mtime/ctime */
}
 
/*
 * SCENARIO 2: Overwriting existing data (no size change)
 *
 * If file size doesn't change, fdatasync() only syncs data.
 * This is even faster - no metadata I/O at all.
 */
void update_record(int fd, off_t offset, const void *data, size_t len) {
    pwrite(fd, data, len, offset);
    fdatasync(fd);  /* Only data, no metadata (size unchanged) */
    /* This can be 2x faster than fsync() */
}
 
/*
 * SCENARIO 3: Creating a new file (fsync required)
 *
 * New file needs directory entry synced too!
 * fsync() on file PLUS fsync() on directory
 */
void create_durable_file(const char *path, const void *data, size_t len) {
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    
    write(fd, data, len);
    fsync(fd);  /* Use fsync for new files */
    close(fd);
    
    /* CRITICAL: Sync the directory to make filename durable */
    int dir_fd = open(dirname_path(path), O_RDONLY);
    fsync(dir_fd);
    close(dir_fd);
}
 
/*
 * SCENARIO 4: Preserving timestamps matters
 *
 * Some applications rely on mtime for synchronization
 * (e.g., rsync, make). Use fsync() in these cases.
 */
void timestamp_critical_write(int fd, const void *data, size_t len) {
    write(fd, data, len);
    fsync(fd);  /* Ensure mtime is also persisted */
}
 
/*
 * Performance comparison: fsync vs fdatasync
 * 
 * On a test system with ext4:
 * - fsync() on overwrite: ~8ms
 * - fdatasync() on overwrite: ~4ms (2x faster!)
 * 
 * The difference comes from:
 * 1. Writing inode block for timestamp update
 * 2. Possibly journal transaction for metadata
 * 
 * For high-frequency sync (like database commits),
 * fdatasync() can significantly improve throughput.
 */

PostgreSQL's Choice

Storage Device Flush Commands

The Storage Stack:

Application: fsync(fd)
    ↓
VFS Layer: Writes dirty pages
    ↓
Block Layer: Submits I/O requests
    ↓
Device Driver: Sends commands to device
    ↓
Storage Device: Writes to volatile cache
    ↓  (flush command)
Storage Media: Data on stable storage

Storage Flush Commands by Interface
Interface	Flush Command	Description
SATA/ATA	FLUSH CACHE / FLUSH CACHE EXT	Flushes volatile write cache to platters/cells
SCSI/SAS	SYNCHRONIZE CACHE	Ensures data in cache reaches medium
NVMe	Flush	Commits data in volatile write cache
MMC/eMMC	CACHE_FLUSH	Flushes internal cache to flash

device_flush.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
/*
 * How the kernel issues device flush commands
 */
 
/*
 * Issue a flush/sync to a block device
 * Called at end of fsync() path
 */
int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
                       sector_t *error_sector) {
    struct bio *bio;
    int ret = 0;
    
    /* Create a bio (block I/O request) with no data */
    bio = bio_alloc(gfp_mask, 0);
    bio_set_dev(bio, bdev);
    bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
    
    /* Submit and wait for completion */
    ret = submit_bio_wait(bio);
    
    bio_put(bio);
    return ret;
}
 
/*
 * NVMe flush command handling (in driver)
 */
void nvme_execute_flush(struct nvme_ns *ns, struct request *req) {
    struct nvme_command cmd = {
        .common = {
            .opcode = nvme_cmd_flush,
            .nsid = cpu_to_le32(ns->head->ns_id),
        },
    };
    
    /* Send flush command to controller */
    nvme_submit_sync_cmd(ns->ctrl, &cmd, NULL, 0);
    /* Returns when flush is complete */
}
 
/*
 * Force Unit Access (FUA) - Alternative to flush
 *
 * FUA writes bypass the volatile cache entirely.
 * Used for individual critical writes without draining entire cache.
 */
void write_with_fua(struct block_device *bdev, void *data, 
                     sector_t sector, size_t len) {
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);
    
    bio_set_dev(bio, bdev);
    bio->bi_iter.bi_sector = sector;
    bio_add_page(bio, virt_to_page(data), len, offset_in_page(data));
    
    /* REQ_FUA: Force Unit Access - bypass volatile cache */
    bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
    
    submit_bio_wait(bio);
    bio_put(bio);
    /* This write is directly on stable media */
}

Volatile Cache Dangers

disable_write_cache.sh
Bash
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#!/bin/bash
# Managing storage device write cache
 
# Check current write cache status (SATA/ATA)
hdparm -W /dev/sda
# /dev/sda:
#  write-caching =  1 (on)
 
# Disable write cache (sacrifices performance for safety)
hdparm -W0 /dev/sda
 
# For SCSI devices
sdparm --get=WCE /dev/sdb  # Get Write Cache Enable
sdparm --set=WCE=0 /dev/sdb  # Disable
 
# Check NVMe volatile write cache
nvme id-ctrl /dev/nvme0 | grep vwc
#  vwc   : 1  (volatile write cache present)
 
# Enterprise SSDs with Power Loss Protection (PLP)
# These are SAFE to have write cache enabled because
# capacitors flush data on power loss

Performance Implications of Sync

Sync operations have profound performance implications. Understanding these helps you make informed tradeoffs between durability and speed.

The Cost of Sync:

Each fsync() involves multiple expensive operations:

fsync() Component Costs (Typical)
Component	HDD Time	SSD Time	NVMe Time
Dirty page writeback	5-10ms	0.5-2ms	0.1-0.5ms
Metadata/Journal write	2-5ms	0.5-1ms	0.1-0.3ms
Device flush command	5-15ms	1-5ms	0.5-2ms
Total fsync()	12-30ms	2-8ms	0.7-3ms
Maximum TPS*	~30-80	~120-500	~300-1400

*TPS = Transactions Per Second if each transaction requires one fsync.

Why Sync is Slow:

Sequential bottleneck: Only one sync can be truly in-flight per device
Cache drain: Flush commands drain entire device cache
Rotational latency: HDDs must complete a full rotation
Flash write latency: SSDs must program cells (slower than reads)

sync_performance.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
/*
 * Performance optimization strategies for sync-heavy workloads
 */
 
/*
 * STRATEGY 1: Group Commit
 *
 * Instead of syncing each write, batch multiple writes
 * and sync once for the entire batch.
 */
struct batch {
    int count;
    int max_count;
    int fd;
};
 
void batch_write(struct batch *b, const void *data, size_t len) {
    write(b->fd, data, len);
    b->count++;
    
    if (b->count >= b->max_count) {
        /* Batch is full, sync now */
        fdatasync(b->fd);
        b->count = 0;
    }
}
 
/* Result: 
 * 100 individual writes with sync: 100 × 5ms = 500ms
 * 100 writes batched, 1 sync:      100 × 0.001ms + 5ms ≈ 5ms
 * 100x improvement!
 */
 
 
/*
 * STRATEGY 2: Parallel Sync
 *
 * Multiple devices can be synced in parallel.
 * Distribute data across devices for higher throughput.
 */
void parallel_sync(int *fds, int count) {
    pthread_t threads[count];
    
    for (int i = 0; i < count; i++) {
        pthread_create(&threads[i], NULL, 
                       (void*)fsync_thread, &fds[i]);
    }
    
    for (int i = 0; i < count; i++) {
        pthread_join(threads[i], NULL);
    }
    
    /* With 4 SSDs: 4 × 200 TPS = 800 TPS aggregate */
}
 
 
/*
 * STRATEGY 3: Asynchronous Writeback + Barrier
 *
 * Use sync_file_range() to start writeback early,
 * then sync at commit points.
 */
void pipelined_write(int fd, void *data, size_t chunk_size, int chunks) {
    for (int i = 0; i < chunks; i++) {
        off_t offset = i * chunk_size;
        
        /* Write to page cache */
        pwrite(fd, data + offset, chunk_size, offset);
        
        /* Start async writeback (don't wait) */
        sync_file_range(fd, offset, chunk_size, 
                        SYNC_FILE_RANGE_WRITE);
    }
    
    /* Final sync - wait for all writeback */
    sync_file_range(fd, 0, chunks * chunk_size,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);
    fdatasync(fd);
}
 
 
/*
 * STRATEGY 4: Use O_DSYNC or O_SYNC for implicit sync
 *
 * Opens file with automatic sync on every write.
 * Good for small, critical writes.
 */
void always_sync_writes() {
    /* O_DSYNC: Every write is like write + fdatasync */
    int fd = open("critical.dat", O_WRONLY | O_DSYNC);
    write(fd, data, len);  /* Returns after data durable */
    
    /* O_SYNC: Every write is like write + fsync */
    int fd2 = open("more_critical.dat", O_WRONLY | O_SYNC);
    write(fd2, data, len);  /* Returns after data+metadata durable */
}

Database Group Commit

Sync Patterns for Databases

Databases have the most demanding sync requirements: they must provide ACID durability while maintaining high transaction throughput. Let's examine the sync patterns used by real database systems.

The WAL Sync Challenge:

database_sync_patterns.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
/*
 * Database Sync Patterns
 */
 
/*
 * PATTERN 1: Synchronous Commit (Default PostgreSQL)
 *
 * Every commit waits for WAL fsync.
 * Maximum durability, lowest throughput.
 */
void synchronous_commit(struct transaction *txn) {
    /* Write transaction to WAL buffer */
    write_wal_record(txn);
    
    /* Force WAL to disk */
    fdatasync(wal_fd);  /* ~4ms */
    
    /* Acknowledge commit */
    txn->status = COMMITTED;
    /* Client can proceed - data is durable */
}
/* Throughput: ~250 TPS per disk */
 
 
/*
 * PATTERN 2: Group Commit (PostgreSQL, MySQL)
 *
 * Batch multiple commits into one fsync.
 * Slight latency increase, huge throughput gain.
 */
void group_commit() {
    /* Collect commits for a short window */
    collect_pending_commits(COMMIT_DELAY_MS);
    
    /* Write all pending to WAL */
    for_each_pending(txn) {
        write_wal_record(txn);
    }
    
    /* Single fsync for all */
    fdatasync(wal_fd);
    
    /* Acknowledge all commits */
    for_each_pending(txn) {
        txn->status = COMMITTED;
        wake_client(txn);
    }
}
/* Throughput: 10,000+ TPS with concurrent connections */
 
 
/*
 * PATTERN 3: Asynchronous Commit (PostgreSQL)
 *
 * Don't wait for fsync; accept data loss risk.
 * Maximum throughput, potential loss window.
 */
void async_commit(struct transaction *txn) {
    /* Write to WAL buffer (not yet on disk) */
    write_wal_record(txn);
    
    /* Immediately acknowledge - data NOT durable! */
    txn->status = COMMITTED;
    
    /* Background thread eventually syncs */
    /* Risk: last ~10ms of transactions lost on crash */
}
/* Throughput: 50,000+ TPS (memory speed) */
 
 
/*
 * PATTERN 4: Commit=No_wait + Group Sync (MySQL InnoDB)
 *
 * innodb_flush_log_at_trx_commit settings:
 * 0: Log written & synced once/second (may lose 1s data)
 * 1: Log written & synced on each commit (full durability)
 * 2: Log written on commit, synced once/second (lose 1s)
 */
 
/*
 * PATTERN 5: Parallel WAL (PostgreSQL 9.4+)
 *
 * Multiple backends can populate WAL concurrently.
 * Single sync still needed, but write phase parallelized.
 */
 
 
/*
 * PATTERN 6: Double-Write Buffer (MySQL InnoDB)
 *
 * Problem: Partial page writes on crash (torn pages).
 * Solution: Write pages to doublewrite buffer first.
 */
void doublewrite_flush(struct page **pages, int count) {
    /* Write all pages to sequential doublewrite space */
    for (int i = 0; i < count; i++) {
        pwrite(dblwr_fd, pages[i], PAGE_SIZE, dblwr_offset);
        dblwr_offset += PAGE_SIZE;
    }
    
    /* Single fsync for doublewrite */
    fdatasync(dblwr_fd);
    
    /* Now safe to write to actual locations */
    for (int i = 0; i < count; i++) {
        pwrite(datafile_fd, pages[i], PAGE_SIZE, pages[i]->offset);
    }
    
    /* On crash: recover torn pages from doublewrite */
}

Database Sync Configuration Examples
Database	Setting	Effect
PostgreSQL	synchronous_commit = on	Full durability, lower TPS
PostgreSQL	synchronous_commit = off	Higher TPS, risk window
PostgreSQL	commit_delay = 10	10 µs wait for group commit
MySQL InnoDB	innodb_flush_log_at_trx_commit = 1	Full durability
MySQL InnoDB	innodb_flush_log_at_trx_commit = 2	OS-level sync timing
MongoDB	j: true (write concern)	Wait for journal fsync
MongoDB	w: majority, j: false	Replicated but not synced

The Replication Alternative

Testing and Verifying Sync Behavior

Sync behavior is notoriously difficult to test—you can't easily simulate power failures in software. However, several techniques help verify that your application's durability guarantees hold.

Testing Strategies:

sync_testing.sh
Bash
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
#!/bin/bash
# Testing sync behavior and durability
 
# 1. STRACE: Verify sync calls are happening
strace -e fsync,fdatasync,sync,sync_file_range \
    ./my_database_app 2>&1 | head -50
 
# Expected output:
# fsync(5) = 0
# fdatasync(5) = 0
# ...
 
# 2. Track device I/O during sync
# In terminal 1:
iostat -x 1 /dev/sda
 
# In terminal 2:
fsync_test_program
 
# Look for:
# - Writes during sync
# - Device utilization spikes
# - await (average wait time)
 
# 3. BLKTRACE: Detailed block I/O tracing
blktrace -d /dev/sda -o trace &
./my_app --run-test
kill %1
blkparse trace.* | grep -i flush
# Look for FLUSH operations
 
# 4. Use dm-flakey to simulate failures
# Create a device that randomly fails
dmsetup create flakey --table "0 $(blockdev --getsz /dev/loop0) \
    flakey /dev/loop0 0 30 5 1 drop_writes"
# 30 seconds good, 5 seconds dropping writes
 
# Run your application against /dev/mapper/flakey
# Check data integrity after simulated failure
 
# 5. Crash testing with VMs
# - Run workload in VM
# - Abruptly terminate VM process
# - Boot VM and check data integrity
 
# 6. SystemTap probes for sync tracking
stap -e 'probe kernel.function("blkdev_issue_flush") { 
    printf("Flush on %s
", kernel_string($bdev->bd_disk->disk_name)) 
}'

Power-Loss Testing:

For critical applications, simulate actual power loss:

Power Loss Test Methods

•Virtual machine kill: Abruptly terminate QEMU/VMware process (kill -9)
•Remote power control: Use IPMI/iLO to hard-power-off servers
•Smart PDU: Controllable power strips that can cut power programmatically
•Battery disconnect: For laptops/embedded systems
•dm-flakey/dm-delay: Linux device mapper targets simulating failures

The Lie of Virtualized Storage

Summary: Sync Operations Mastery

Key Takeaways

•The sync family offers different scope/guarantees — sync(), syncfs(), fsync(), fdatasync(), sync_file_range(), and msync() each serve different use cases.
•fsync waits for stable storage confirmation — It doesn't return until data is on non-volatile media (or the device lies about it).
•fdatasync skips non-essential metadata — Often 2x faster than fsync for overwrites, syncs size changes but not timestamps.
•Device flush commands are the final step — Without them, data may remain in volatile device cache.
•Sync is expensive but necessary — Each fsync can take 3-30ms depending on device, limiting transactions per second.
•Group commit amortizes sync cost — Batching multiple operations per sync dramatically improves throughput.
•Testing sync is hard but essential — Use strace, blktrace, VM kill testing, and power-loss tests to verify durability.

Module Complete:

Congratulations! You've completed the Buffer Cache module—the foundational layer of file system performance. You now understand:

Block caching fundamentals: Why caching exists, how blocks are organized
Cache management: LRU, LFU, ARC, and practical replacement policies
Write-back policies: Delayed writes, dirty tracking, and writeback threads
Cache consistency: Coherence across processes, systems, and crashes
Sync operations: The machinery that provides durability guarantees

This knowledge forms the foundation for understanding file system performance optimization, database internals, and building reliable software systems.

Module Complete

5 / 5