Linux File Systems

Level: Advanced

Duration: 120 mins

Topic: Linux File Systems

File Operations

The Journey of Every File Access

Every open(), read(), and write() you've ever called in user space triggers a carefully orchestrated journey through the kernel. From the system call entry point through VFS dispatch, file system operations, page cache interactions, and finally to disk I/O—each step involves precise data structure manipulation and locking protocols.

Understanding this journey is essential for anyone who needs to debug file system issues, write kernel code, or simply understand why some I/O patterns perform better than others.

What You Will Learn

By the end of this page, you will understand the complete implementation of file operations in Linux: file descriptor tables and the open file table, the open() system call flow from pathname to file descriptor, read and write paths including buffered and direct I/O, file locking mechanisms (flock, fcntl, lockf), and advanced operations like splice, sendfile, and copy_file_range.

File Descriptors and the Open File Table

Before examining individual operations, we must understand the data structures that represent open files. There are three distinct levels:

Level 1: Per-Process File Descriptor Table

Each process has a file descriptor table (fd table)—an array mapping integer file descriptors to struct file * pointers. This is stored in the process's files_struct:

struct files_struct
C
struct files_struct {
    atomic_t count;                      /* Reference count */
    
    /* Fast path: inline array for first 64 fds */
    struct fdtable __rcu *fdt;           /* Current fd table */
    struct fdtable fdtab;                /* Initial inline table */
    
    spinlock_t file_lock;
    unsigned int next_fd;                /* Next fd to allocate */
    unsigned long close_on_exec_init[1]; /* Close-on-exec bitmap */
    unsigned long open_fds_init[1];      /* Open fds bitmap */
    unsigned long full_fds_bits_init[1];
    struct file __rcu *fd_array[NR_OPEN_DEFAULT]; /* NR_OPEN_DEFAULT is BITS_PER_LONG: 64 on 64-bit */
};
 
/* The fd table proper; a larger one is allocated when NR_OPEN_DEFAULT fds are not enough */
struct fdtable {
    unsigned int max_fds;                /* Current size */
    struct file __rcu **fd;              /* Array of file pointers */
    unsigned long *close_on_exec;        /* Close-on-exec bitmap */
    unsigned long *open_fds;             /* Open fds bitmap */
    unsigned long *full_fds_bits;        /* Full word bitmap */
    struct rcu_head rcu;
};

Level 2: System-Wide Open File Table

Each struct file represents an open instance of a file. Multiple file descriptors (in the same or different processes) can point to the same struct file. Key fields we saw earlier:

f_pos: Current file position (shared among all fds pointing here)
f_flags: Open flags (O_RDONLY, O_APPEND, etc.)
f_op: File operations table
f_path: Path (dentry + mount)
f_inode: Pointer to the inode

Level 3: Inode

The inode represents the actual file. Multiple struct file entries can reference the same inode (for multiply-opened files). The inode contains file metadata and, through its i_op and i_fop tables, defines how operations work.

Why Three Levels?

This three-level design enables powerful sharing patterns. fork() gives the child its own copy of the fd table, but each entry points to the same struct file as in the parent, so parent and child share file positions (only clone() with CLONE_FILES shares the table itself). dup() creates a new fd pointing to the same struct file. Multiple open() calls to the same file create separate struct file entries (with independent positions) pointing to the same inode. This flexibility enables everything from shell pipelines to shared memory mappings.
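
The difference between dup() and a second open() can be observed from user space. The following is a minimal sketch, not kernel code: the function name demo_shared_offset and the scratch path are made up for illustration. It writes through one fd and then checks the file position seen through a dup()'d fd and through an independently opened fd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 0 when the offsets behave as the three-level model predicts. */
int demo_shared_offset(void)
{
    char path[] = "/tmp/fdlevels-XXXXXX";   /* scratch file; the name is arbitrary */
    int fd1 = mkstemp(path);                /* struct file A */
    if (fd1 < 0)
        return -1;

    int fd_dup = dup(fd1);                  /* second fd, still struct file A */
    int fd2 = open(path, O_RDONLY);         /* struct file B, same inode */
    unlink(path);                           /* the inode lives on while it is open */

    int result = -1;
    if (fd_dup >= 0 && fd2 >= 0 && write(fd1, "hello", 5) == 5) {
        off_t pos_dup = lseek(fd_dup, 0, SEEK_CUR);  /* shares f_pos with fd1 */
        off_t pos_new = lseek(fd2, 0, SEEK_CUR);     /* has its own f_pos */
        if (pos_dup == 5 && pos_new == 0)
            result = 0;
    }

    if (fd_dup >= 0)
        close(fd_dup);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    return result;
}
```

Calling demo_shared_offset() from a main() should return 0: the write through fd1 moves the position seen through the dup()'d fd, but not the one belonging to the second open().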

The open() System Call

The open() system call is the beginning of all file access. It translates a pathname into a file descriptor through a multi-step process.

System Call Entry

open() implementation (simplified)
C
/* User-space call */
int fd = open("/path/to/file", O_RDWR | O_CREAT, 0644);
 
/* System call handler (fs/open.c) */
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
    if (force_o_largefile())
        flags |= O_LARGEFILE;
    return do_sys_open(AT_FDCWD, filename, flags, mode);
}
 
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    struct open_how how = {
        .flags = flags,
        .mode = mode,
    };
    return do_sys_openat2(dfd, filename, &how);
}
 
static long do_sys_openat2(int dfd, const char __user *filename,
                           struct open_how *how)
{
    struct open_flags op;
    int fd = build_open_flags(how, &op);
    struct filename *tmp;
    
    if (fd)
        return fd;
    
    /* Copy filename from user space */
    tmp = getname(filename);
    if (IS_ERR(tmp))
        return PTR_ERR(tmp);
    
    /* Allocate a file descriptor */
    fd = get_unused_fd_flags(how->flags);
    if (fd >= 0) {
        /* Perform the actual open */
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            /* Install file in fd table */
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}

Path Resolution (do_filp_open)

The core work happens in do_filp_open(), which performs pathname resolution and file setup:

Key Steps in open()

Flag validation: Verify flag combinations are legal
FD allocation: Reserve a file descriptor number from the process's table
Pathname resolution: Walk the directory tree to find the target
Existence check: Handle O_CREAT, O_EXCL based on whether file exists
Permission check: Verify access rights for the requested operation
File creation: Create struct file, set up operations table
Driver/FS open: Call file system's open method (may trigger hardware access)
FD installation: Link the fd to the struct file
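
The existence check in the list above is what makes O_CREAT | O_EXCL usable as an atomic "create only if absent" primitive. A small sketch of that behaviour follows; the helper names create_exclusive and demo_create_exclusive and the scratch path are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create path only if it does not exist yet.
 * Returns an fd, or -1 with errno set (EEXIST if the file is already there). */
int create_exclusive(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0600);
}

/* Returns 0 if the second exclusive create is refused with EEXIST. */
int demo_create_exclusive(void)
{
    char path[64];
    snprintf(path, sizeof path, "/tmp/excl-demo-%ld", (long)getpid());
    unlink(path);                           /* start from a clean state */

    int first = create_exclusive(path);     /* file is absent: succeeds */
    if (first < 0)
        return -1;

    errno = 0;
    int second = create_exclusive(path);    /* file exists: must fail */
    int saved = errno;

    close(first);
    if (second >= 0)
        close(second);
    unlink(path);
    return (second == -1 && saved == EEXIST) ? 0 : -1;
}
```

Because the kernel performs the lookup and the creation as one step, two racing callers cannot both succeed, which is why lock files and temporary files are usually created this way.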

TOCTOU Vulnerabilities

The gap between checking a file's status and operating on it creates time-of-check to time-of-use (TOCTOU) vulnerabilities. An attacker might replace a file with a symlink between access() and open(). Use openat() with O_NOFOLLOW (which rejects a symlink only in the final path component), or better yet, open first and then call fstat() or fchmod() on the fd, so that the check and the operation refer to the same open file.
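
As an illustration of the "open first, then check through the fd" advice, here is a hedged sketch. The helper open_regular_nofollow and the demo function are hypothetical names, and the policy (accept regular files only) is just one possible choice.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open path read-only, refusing a symlink in the final component, then
 * validate the object through the fd so the check cannot be raced. */
int open_regular_nofollow(const char *path)
{
    int fd = open(path, O_RDONLY | O_NOFOLLOW | O_CLOEXEC);
    if (fd < 0)
        return -1;                      /* ELOOP if the last component is a symlink */

    struct stat st;
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;                      /* not the kind of file we expected */
    }
    return fd;
}

/* Returns 0 if a real file opens and a symlink to it is rejected. */
int demo_nofollow(void)
{
    char target[] = "/tmp/toctou-demo-XXXXXX";
    char link_path[sizeof target + 8];
    int tfd = mkstemp(target);
    if (tfd < 0)
        return -1;
    close(tfd);

    snprintf(link_path, sizeof link_path, "%s.lnk", target);
    if (symlink(target, link_path) < 0) {
        unlink(target);
        return -1;
    }

    int ok_fd = open_regular_nofollow(target);
    errno = 0;
    int bad_fd = open_regular_nofollow(link_path);
    int saved = errno;

    if (ok_fd >= 0)
        close(ok_fd);
    if (bad_fd >= 0)
        close(bad_fd);
    unlink(link_path);
    unlink(target);
    return (ok_fd >= 0 && bad_fd == -1 && saved == ELOOP) ? 0 : -1;
}
```

Note that O_NOFOLLOW says nothing about symlinks in earlier path components; code that must defend against those typically walks the path with openat() one component at a time.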

read() and write() Paths

Once a file is open, reading and writing follow well-defined paths through VFS and the file system.

The read() System Call

read() implementation path
C
/* User space */
ssize_t n = read(fd, buffer, count);
 
/* System call */
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    return ksys_read(fd, buf, count);
}
 
ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
    struct fd f = fdget_pos(fd);        /* Get file + position lock */
    ssize_t ret = -EBADF;
    
    if (f.file) {
        loff_t pos, *ppos = file_ppos(f.file);
        if (ppos) {
            pos = *ppos;
            ppos = &pos;
        }
        ret = vfs_read(f.file, buf, count, ppos);
        if (ret >= 0 && ppos)
            f.file->f_pos = pos;
        fdput_pos(f);
    }
    return ret;
}
 
ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
    ssize_t ret;
    
    /* Permission and limit checks */
    if (!(file->f_mode & FMODE_READ))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_READ))
        return -EINVAL;
    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;
    
    ret = rw_verify_area(READ, file, pos, count);
    if (ret)
        return ret;
    
    if (count > MAX_RW_COUNT)
        count = MAX_RW_COUNT;
    
    /* Call the file's read method */
    if (file->f_op->read)
        ret = file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        ret = new_sync_read(file, buf, count, pos);
    else
        ret = -EINVAL;
    
    if (ret > 0) {
        fsnotify_access(file);          /* Notify file access */
        add_rchar(current, ret);        /* Accounting */
    }
    
    return ret;
}

Buffered vs Direct I/O

The path diverges based on whether the file was opened with O_DIRECT:

Buffered I/O Path

• generic_file_read_iter()
• Check page cache for pages
• On miss: trigger readpage/readahead
• Wait for pages to be uptodate
• copy_to_iter() to user buffer
• May satisfy entirely from memory

Direct I/O Path

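• generic_file_read_iter() with IOCB_DIRECT
• Bypass page cache entirely
• a_ops->direct_IO() called (iomap-based file systems such as XFS and ext4 use iomap_dio_rw() instead)
• BIO submitted directly to block layer
• Data transferred by DMA directly into the user buffer
• Always involves device I/O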
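
From user space, the direct path is selected with O_DIRECT and requires aligned buffers, offsets, and lengths. The sketch below assumes a 4096-byte alignment (the constant DIO_ALIGN is an assumption; the real requirement depends on the device and file system) and falls back to a buffered read when the file system rejects O_DIRECT with EINVAL, as tmpfs does on many kernels. The function names are invented for this example.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096          /* assumed alignment; the real requirement varies */

/* Read the first DIO_ALIGN bytes of path, preferring O_DIRECT.
 * *used_direct is set to 1 if the direct path served the read, else 0. */
ssize_t read_first_block(const char *path, void *aligned_buf, int *used_direct)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    *used_direct = 0;
    if (fd >= 0) {
        ssize_t got = pread(fd, aligned_buf, DIO_ALIGN, 0);
        int saved = errno;
        close(fd);
        if (got >= 0) {
            *used_direct = 1;
            return got;
        }
        errno = saved;
    }
    if (errno != EINVAL)
        return -1;              /* a real error, not "O_DIRECT unsupported" */

    /* Fallback: ordinary buffered read through the page cache. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, aligned_buf, DIO_ALIGN, 0);
    close(fd);
    return n;
}

/* Returns 0 if one block written to a scratch file reads back intact. */
int demo_direct_read(void)
{
    char path[] = "/tmp/dio-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    void *buf = NULL;
    int result = -1, used_direct = 0;
    if (posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN) == 0) {
        memset(buf, 'A', DIO_ALIGN);
        if (write(fd, buf, DIO_ALIGN) == DIO_ALIGN && fsync(fd) == 0) {
            memset(buf, 0, DIO_ALIGN);
            ssize_t n = read_first_block(path, buf, &used_direct);
            if (n == DIO_ALIGN && ((char *)buf)[0] == 'A' &&
                ((char *)buf)[DIO_ALIGN - 1] == 'A')
                result = 0;
        }
        free(buf);
    }
    close(fd);
    unlink(path);
    return result;
}
```

The used_direct flag lets a caller log which path was actually taken, which is useful when benchmarking, since a silent fallback would otherwise measure the page cache instead of the device.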

The write() System Call

Writes follow a similar pattern but with additional complexity for durability:

write() key steps
C
ssize_t vfs_write(struct file *file, const char __user *buf,
                  size_t count, loff_t *pos)
{
    ssize_t ret;
    
    /* Permission checks */
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_WRITE))
        return -EINVAL;
    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;
    
    ret = rw_verify_area(WRITE, file, pos, count);
    if (ret)
        return ret;
    
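    /* Remove suid/sgid bits on write. Shown here for brevity: in mainline
     * this is done inside the file system's ->write_iter path
     * (e.g. __generic_file_write_iter()), not in vfs_write() itself. */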
    ret = file_remove_privs(file);
    if (ret)
        return ret;
    
    /* Call file's write method */
    if (file->f_op->write)
        ret = file->f_op->write(file, buf, count, pos);
    else if (file->f_op->write_iter)
        ret = new_sync_write(file, buf, count, pos);
    else
        ret = -EINVAL;
    
    if (ret > 0) {
        fsnotify_modify(file);          /* Notify file modification */
        add_wchar(current, ret);        /* Accounting */
    }
    
    return ret;
}
 
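/* Buffered write path (simplified from mm/filemap.c; signal checks and
 * short-copy retry omitted) */
ssize_t generic_perform_write(struct file *file, struct iov_iter *i,
                               loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;
    ssize_t written = 0;
    long status = 0;
    
    do {
        struct page *page;
        void *fsdata;
        size_t copied;
        unsigned long offset = pos & (PAGE_SIZE - 1);   /* Offset within page */
        unsigned long bytes = min_t(unsigned long, PAGE_SIZE - offset,
                                    iov_iter_count(i)); /* Bytes for this page */
        
        /* Prepare page for writing (may allocate, read for partial) */
        status = a_ops->write_begin(file, mapping, pos, bytes,
                                     &page, &fsdata);
        if (unlikely(status < 0))
            break;
        
        /* Copy data from user (advances the iterator) */
        copied = copy_page_from_iter_atomic(page, offset, bytes, i);
        flush_dcache_page(page);
        
        /* Finalize write, mark page dirty */
        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                   page, fsdata);
        if (unlikely(status < 0))
            break;
        
        pos += copied;
        written += copied;
        
        /* Throttle writers that dirty pages faster than writeback */
        balance_dirty_pages_ratelimited(mapping);
    } while (iov_iter_count(i));
    
    return written ? written : status;
}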

O_APPEND and Atomicity

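When a file is opened with O_APPEND, each write() atomically seeks to end-of-file before writing. On local file systems this happens under the inode's i_rwsem lock, so the data of concurrent appends does not interleave within a single write() (though the order of the writes is not determined). This is how log files handle concurrent writers. NFS is an exception: the client has to simulate appending, so concurrent O_APPEND writers there can corrupt the file.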
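
A short user-space sketch makes the effect visible. Two independent open() calls each start with their own file position of 0; without O_APPEND the second write would overwrite the first. The function name demo_append and the scratch path are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 if two O_APPEND writers end up with both records in the file. */
int demo_append(void)
{
    char path[] = "/tmp/append-demo-XXXXXX";
    int rd = mkstemp(path);                     /* used only to read back */
    if (rd < 0)
        return -1;

    int fd_a = open(path, O_WRONLY | O_APPEND); /* own struct file, f_pos = 0 */
    int fd_b = open(path, O_WRONLY | O_APPEND); /* own struct file, f_pos = 0 */
    char buf[16] = {0};
    int result = -1;

    if (fd_a >= 0 && fd_b >= 0 &&
        write(fd_a, "aaaa", 4) == 4 &&
        write(fd_b, "bbbb", 4) == 4 &&          /* lands at EOF, not offset 0 */
        pread(rd, buf, sizeof buf - 1, 0) == 8 &&
        memcmp(buf, "aaaabbbb", 8) == 0)
        result = 0;

    if (fd_a >= 0)
        close(fd_a);
    if (fd_b >= 0)
        close(fd_b);
    close(rd);
    unlink(path);
    return result;
}
```

Dropping O_APPEND from both open() calls would leave only "bbbb" in the file, because fd_b would write at its own position 0.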

close() and File Release

Closing a file descriptor involves cleanup at multiple levels. The process is more nuanced than simply "freeing resources."

close() implementation
C
SYSCALL_DEFINE1(close, unsigned int, fd)
{
    int retval = close_fd(fd);
    
    /* Can't restart close() after interrupt */
    if (unlikely(retval == -ERESTARTSYS ||
                 retval == -ERESTARTNOINTR ||
                 retval == -ERESTARTNOHAND ||
                 retval == -ERESTART_RESTARTBLOCK))
        retval = -EINTR;
    
    return retval;
}
 
int close_fd(unsigned fd)
{
    struct files_struct *files = current->files;
    struct file *file;
    
    spin_lock(&files->file_lock);
    /* Remove fd from table */
    file = pick_file(files, fd);
    spin_unlock(&files->file_lock);
    
    if (!file)
        return -EBADF;
    
    return filp_close(file, files);
}
 
int filp_close(struct file *filp, fl_owner_t id)
{
    int retval = 0;
    
    if (!file_count(filp)) {
        printk(KERN_ERR "VFS: Close: file count is 0\n");
        return 0;
    }
    
    /* Call flush handler (for special files) */
    if (filp->f_op->flush)
        retval = filp->f_op->flush(filp, id);
    
    /* Remove any POSIX locks */
    if (likely(!(filp->f_mode & FMODE_PATH))) {
        dnotify_flush(filp, id);
        locks_remove_posix(filp, id);
    }
    
    /* Drop reference (may trigger release) */
    fput(filp);
    return retval;
}

When is release() Called?

The file's release() method is only called when the last reference to the struct file is dropped. This means:

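dup() creates a new fd pointing to the same struct file; closing one fd doesn't trigger release
fork() copies the fd table, so parent and child reference the same struct file; a close() in the child doesn't trigger release while the parent still holds its fd
Memory-mapped files hold a reference; release waits until the mapping is removed with munmap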

The actual release happens in __fput(), which may be deferred to a workqueue.

close() Error Handling

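Always check close()'s return value. For regular files, close() may report an error from earlier buffered writes that could not be completed (for example, a full disk or a network file system error); this is most commonly seen on NFS and with disk quotas. Ignoring the return value means you cannot know whether your data was saved. A successful close() still does not guarantee that the data is on disk, so call fsync() before close() for critical data. On Linux, do not retry close() after an error: the descriptor is released either way, and the number may already have been reused.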
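
One way to follow this advice is to funnel the final write, the fsync(), and the close() through a single helper that reports the first failure. This is a sketch under that assumption; write_all, save_and_close, and the demo function are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and EINTR. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write, flush, and close; returns 0, or -1 with errno of the first failure. */
int save_and_close(int fd, const void *buf, size_t len)
{
    int rc = 0, saved = 0;

    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        rc = -1;
        saved = errno;
    }
    /* close() is called exactly once and never retried. */
    if (close(fd) < 0 && rc == 0) {
        rc = -1;
        saved = errno;
    }
    if (rc < 0)
        errno = saved;
    return rc;
}

/* Returns 0 if a good fd succeeds and a bad fd is reported as EBADF. */
int demo_save_and_close(void)
{
    char path[] = "/tmp/close-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int ok = save_and_close(fd, "data", 4);
    int bad = save_and_close(-1, "data", 4);
    int bad_errno = errno;

    unlink(path);
    return (ok == 0 && bad == -1 && bad_errno == EBADF) ? 0 : -1;
}
```

Keeping the first error rather than the last matters here: if write() fails, the later close() usually succeeds and would otherwise mask the real cause.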

Seeking and File Position

The lseek() system call changes the file offset for subsequent read/write operations. It's simpler than open/read/write but reveals important file system capabilities.

lseek() implementation
C
SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
{
    return ksys_lseek(fd, offset, whence);
}
 
off_t ksys_lseek(unsigned int fd, off_t offset, unsigned int whence)
{
    off_t retval;
    struct fd f = fdget_pos(fd);
    
    if (!f.file)
        return -EBADF;
    
    retval = -EINVAL;
    if (whence <= SEEK_MAX) {
        loff_t res = vfs_llseek(f.file, offset, whence);
        retval = res;
        if (res != (loff_t)retval)
            retval = -EOVERFLOW;  /* Offset too large */
    }
    
    fdput_pos(f);
    return retval;
}
 
loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
{
    loff_t (*fn)(struct file *, loff_t, int);
    
    fn = no_llseek;
    if (file->f_op->llseek)
        fn = file->f_op->llseek;
    else if (file->f_mode & FMODE_LSEEK)
        fn = default_llseek;
    
    return fn(file, offset, whence);
}

Seek Modes
Mode	Description	New Position
`SEEK_SET`	Absolute position	offset
`SEEK_CUR`	Relative to current	current + offset
`SEEK_END`	Relative to file end	file_size + offset
`SEEK_DATA`	Next data after offset	First non-hole >= offset
`SEEK_HOLE`	Next hole after offset	First hole >= offset
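
The first three modes in the table can be checked with a few calls on a scratch file. This is a small illustrative sketch (the function name demo_seek_modes and the path are invented); it also shows that seeking past end-of-file is allowed while a negative resulting offset is rejected.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 0 if every lseek() call lands where the table says it should. */
int demo_seek_modes(void)
{
    char path[] = "/tmp/seek-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    int ok = write(fd, "0123456789", 10) == 10  /* file size is now 10 */
        && lseek(fd, 2, SEEK_SET) == 2          /* absolute: offset */
        && lseek(fd, 3, SEEK_CUR) == 5          /* relative: 2 + 3 */
        && lseek(fd, -1, SEEK_END) == 9         /* from end: 10 - 1 */
        && lseek(fd, 100, SEEK_SET) == 100      /* past EOF is legal */
        && lseek(fd, -1, SEEK_SET) == -1;       /* negative offset: EINVAL */

    close(fd);
    return ok ? 0 : -1;
}
```

Seeking past end-of-file does not change the file size by itself; a later write at that position is what creates a hole.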

SEEK_DATA and SEEK_HOLE

These modes (added in Linux 3.1) enable efficient traversal of sparse files:

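// Find all data regions in a sparse file
// (define _GNU_SOURCE before the includes to get SEEK_DATA and SEEK_HOLE)
off_t pos = 0;
while ((pos = lseek(fd, pos, SEEK_DATA)) != -1) {   // ENXIO once no data remains
    off_t data_start = pos;
    off_t data_end = lseek(fd, pos, SEEK_HOLE);     // end of file counts as a hole
    if (data_end == -1)
        break;
    printf("Data region: %lld - %lld\n",
           (long long)data_start, (long long)data_end);
    pos = data_end;
}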

This is far more efficient than reading the entire file to find non-zero regions.

File Locking Mechanisms

Linux provides multiple file locking mechanisms for coordinating access between processes.

File Locking Types
Mechanism	Scope	Mandatory?	Interface
flock()	Whole file	Advisory only	flock(fd, operation)
POSIX locks	Byte ranges	Advisory (mandatory mode removed in Linux 5.15)	fcntl(fd, F_SETLK, &flock)
OFD locks	Byte ranges, file-based	Advisory only	fcntl(fd, F_OFD_SETLK, &flock)
lockf()	Byte ranges	Advisory only	lockf(fd, cmd, len)

flock() — BSD-style Locking

flock() usage
C
#include <sys/file.h>
 
int fd = open("/tmp/lockfile", O_RDWR);
 
/* Acquire shared (read) lock */
flock(fd, LOCK_SH);  /* Blocks if exclusive lock held */
 
/* Acquire exclusive (write) lock */
flock(fd, LOCK_EX);  /* Blocks if any lock held */
 
/* Try without blocking */
if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
    if (errno == EWOULDBLOCK)
        printf("Could not acquire lock\n");
}
 
/* Release lock */
flock(fd, LOCK_UN);
 
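/* The lock belongs to the struct file (open file description): it is
 * inherited across fork() and shared with dup()'d fds, and it is released
 * by LOCK_UN or when the last fd referring to that struct file is closed. */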
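
The association with the struct file can be demonstrated inside a single process. In this sketch (function name and path are illustrative assumptions), a dup()'d fd is treated as the lock owner, while a second open() of the same file is refused.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Returns 0 if flock() ownership follows the struct file, not the process. */
int demo_flock(void)
{
    char path[] = "/tmp/flock-demo-XXXXXX";
    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;

    int fd_dup = dup(fd1);              /* same struct file as fd1 */
    int fd2 = open(path, O_RDWR);       /* different struct file, same inode */
    unlink(path);

    int result = -1;
    if (fd_dup >= 0 && fd2 >= 0 && flock(fd1, LOCK_EX) == 0) {
        /* Same struct file: the lock is already "ours". */
        int via_dup = flock(fd_dup, LOCK_EX | LOCK_NB);

        /* Different struct file: conflicts, even in the same process. */
        errno = 0;
        int via_fd2 = flock(fd2, LOCK_EX | LOCK_NB);

        if (via_dup == 0 && via_fd2 == -1 && errno == EWOULDBLOCK)
            result = 0;
    }

    if (fd_dup >= 0)
        close(fd_dup);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    return result;
}
```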

POSIX Locks (fcntl) — Byte-Range Locking

POSIX locking
C
#include <fcntl.h>
 
struct flock fl = {
    .l_type = F_WRLCK,    /* F_RDLCK, F_WRLCK, F_UNLCK */
    .l_whence = SEEK_SET,
    .l_start = 0,         /* Starting offset */
    .l_len = 100,         /* Number of bytes (0 = entire file) */
    .l_pid = 0,           /* PID of blocking process (for F_GETLK) */
};
 
/* Set lock (blocking) */
fcntl(fd, F_SETLKW, &fl);
 
/* Set lock (non-blocking) */
if (fcntl(fd, F_SETLK, &fl) == -1) {
    if (errno == EAGAIN || errno == EACCES)
        printf("Lock held by another process\n");
}
 
/* Query lock status */
fl.l_type = F_WRLCK;
fcntl(fd, F_GETLK, &fl);
if (fl.l_type != F_UNLCK)
    printf("Lock held by PID %d\n", fl.l_pid);
 
/* Unlock */
fl.l_type = F_UNLCK;
fcntl(fd, F_SETLK, &fl);
 
/* WARNING: POSIX locks are associated with (PID, inode) pair!
 * Closing ANY fd to the file releases ALL locks.
 * This is often surprising and problematic. */

POSIX Lock Pitfall

POSIX locks have surprising semantics: they're tied to the process, not the file descriptor. If you have fd1 and fd2 both pointing to the same file, locking via fd1 then closing fd2 releases the lock! This breaks many use cases, for example a library that opens and closes a file the application has locked. OFD locks (F_OFD_SETLK, added in Linux 3.15) fix this: they're associated with the struct file, not the process.
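
The contrast can be reproduced in one process. The sketch below uses hypothetical helper names (try_wrlock, demo_ofd_vs_posix) and assumes a kernel and C library that provide F_OFD_SETLK. It relies on the fact that an OFD lock conflicts with a POSIX lock held by the same process, so the OFD lock on fd1 can only succeed if the earlier close() really dropped the POSIX locks.

```c
#define _GNU_SOURCE             /* exposes F_OFD_SETLK */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Try to write-lock the whole file without blocking, using cmd
 * (F_SETLK for a POSIX lock, F_OFD_SETLK for an OFD lock). */
static int try_wrlock(int fd, int cmd)
{
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,             /* 0 = through end of file */
        .l_pid = 0,             /* must be 0 for the OFD commands */
    };
    return fcntl(fd, cmd, &fl);
}

/* Returns 0 if POSIX and OFD locks behave as described above. */
int demo_ofd_vs_posix(void)
{
    char path[] = "/tmp/ofd-demo-XXXXXX";
    int fd1 = mkstemp(path);
    int fd2 = -1, fd3 = -1, result = -1;
    if (fd1 < 0)
        return -1;
    fd2 = open(path, O_RDWR);
    if (fd2 < 0)
        goto out;

    /* POSIX locks belong to the process: both requests succeed. */
    if (try_wrlock(fd1, F_SETLK) != 0 || try_wrlock(fd2, F_SETLK) != 0)
        goto out;

    /* Pitfall: closing any fd for the file drops every POSIX lock on it. */
    fd3 = open(path, O_RDONLY);
    if (fd3 < 0)
        goto out;
    close(fd3);

    /* Succeeds only because the POSIX locks are gone. */
    if (try_wrlock(fd1, F_OFD_SETLK) != 0)
        goto out;

    /* OFD locks belong to the struct file: fd2 now conflicts. */
    errno = 0;
    if (try_wrlock(fd2, F_OFD_SETLK) != -1 ||
        (errno != EAGAIN && errno != EACCES))
        goto out;

    /* Closing an unrelated fd does not release the OFD lock. */
    fd3 = open(path, O_RDONLY);
    if (fd3 < 0)
        goto out;
    close(fd3);
    if (try_wrlock(fd2, F_OFD_SETLK) == -1)
        result = 0;
out:
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return result;
}
```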

Advanced File Operations

Beyond basic read/write, Linux provides specialized operations for high-performance and zero-copy I/O.

sendfile() — Zero-Copy File to Socket

sendfile() transfers data directly between file descriptors in kernel space, avoiding user-space copies:

sendfile() usage
C
#include <sys/sendfile.h>
 
/* Common use: serving static files over network */
int file_fd = open("large_file.bin", O_RDONLY);
int socket_fd = accept(listen_fd, ...);
 
/* Send entire file */
struct stat st;
fstat(file_fd, &st);
off_t offset = 0;
 
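/* May transfer fewer bytes than requested: check the result and
 * repeat until offset reaches st.st_size */
ssize_t sent = sendfile(socket_fd, file_fd, &offset, st.st_size);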
 
/* Data flow: disk → page cache → socket buffer
 * Never touches user-space memory */
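
Because sendfile() may stop short, production code normally wraps it in a loop. The following sketch shows one such loop; sendfile_all and the demo function are hypothetical names. The demo copies between two regular files, which assumes Linux 2.6.33 or later, where the output fd no longer has to be a socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Send count bytes from in_fd, starting at offset 0, to out_fd.
 * Returns the number of bytes sent (less than count at EOF) or -1. */
ssize_t sendfile_all(int out_fd, int in_fd, size_t count)
{
    off_t offset = 0;           /* advanced by the kernel on each call */

    while ((size_t)offset < count) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, count - (size_t)offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;              /* input ended early */
    }
    return (ssize_t)offset;
}

/* Returns 0 if a scratch file is copied intact via sendfile(). */
int demo_sendfile(void)
{
    enum { SIZE = 200000 };
    char src_path[] = "/tmp/sf-src-XXXXXX";
    char dst_path[] = "/tmp/sf-dst-XXXXXX";
    int src = mkstemp(src_path);
    int dst = mkstemp(dst_path);
    char *a = malloc(SIZE), *b = malloc(SIZE);
    int result = -1;

    if (src >= 0 && dst >= 0 && a && b) {
        memset(a, 'x', SIZE);
        if (write(src, a, SIZE) == SIZE &&
            sendfile_all(dst, src, SIZE) == SIZE &&
            pread(dst, b, SIZE, 0) == SIZE &&
            memcmp(a, b, SIZE) == 0)
            result = 0;
    }

    free(a);
    free(b);
    if (src >= 0) {
        close(src);
        unlink(src_path);
    }
    if (dst >= 0) {
        close(dst);
        unlink(dst_path);
    }
    return result;
}
```

With a non-blocking socket as out_fd, the loop would additionally need to wait for writability on EAGAIN (for example with poll()) instead of returning an error.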

splice() — Generic Zero-Copy Between FDs

splice() moves data between a pipe and a file descriptor without copying to user space:

splice() for zero-copy
C
#define _GNU_SOURCE   /* splice() is Linux-specific */
#include <fcntl.h>
 
int pipefd[2];
pipe(pipefd);
 
/* Copy from file to socket via pipe (zero-copy) */
ssize_t n;
 
/* File → pipe */
n = splice(file_fd, &file_offset, pipefd[1], NULL, 
           chunk_size, SPLICE_F_MOVE | SPLICE_F_MORE);
 
/* Pipe → socket */
n = splice(pipefd[0], NULL, socket_fd, NULL,
           n, SPLICE_F_MOVE);
 
/* Also useful: vmsplice() maps user pages into pipe
 * tee() copies pipe contents to another pipe without consuming */
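
The two splice() calls above each move at most one pipe's worth of data, so a complete copy needs a loop that fills the pipe and then drains exactly what went in. The sketch below is one way to write it; splice_copy and demo_splice are hypothetical names, and the demo copies file to file rather than file to socket so that the result can be checked.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Copy up to len bytes from the start of in_fd to out_fd through a pipe.
 * Returns the number of bytes copied, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    loff_t off_in = 0;
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;

    while ((size_t)total < len) {
        /* File -> pipe: returns at most what the pipe can hold. */
        ssize_t in = splice(in_fd, &off_in, p[1], NULL,
                            len - (size_t)total, SPLICE_F_MOVE);
        if (in < 0) {
            total = -1;
            break;
        }
        if (in == 0)
            break;                          /* end of input */

        /* Pipe -> destination: drain everything that just went in. */
        for (ssize_t left = in; left > 0; ) {
            ssize_t out = splice(p[0], NULL, out_fd, NULL,
                                 (size_t)left, SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                goto done;
            }
            left -= out;
        }
        total += in;
    }
done:
    close(p[0]);
    close(p[1]);
    return total;
}

/* Returns 0 if a scratch file is copied intact via splice(). */
int demo_splice(void)
{
    enum { SIZE = 200000 };
    char src_path[] = "/tmp/sp-src-XXXXXX";
    char dst_path[] = "/tmp/sp-dst-XXXXXX";
    int src = mkstemp(src_path);
    int dst = mkstemp(dst_path);
    char *a = malloc(SIZE), *b = malloc(SIZE);
    int result = -1;

    if (src >= 0 && dst >= 0 && a && b) {
        memset(a, 'y', SIZE);
        if (write(src, a, SIZE) == SIZE &&
            splice_copy(src, dst, SIZE) == SIZE &&
            pread(dst, b, SIZE, 0) == SIZE &&
            memcmp(a, b, SIZE) == 0)
            result = 0;
    }

    free(a);
    free(b);
    if (src >= 0) {
        close(src);
        unlink(src_path);
    }
    if (dst >= 0) {
        close(dst);
        unlink(dst_path);
    }
    return result;
}
```

Draining the pipe completely before refilling it keeps the loop from blocking on a full pipe, at the cost of not overlapping the two halves of the transfer.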

copy_file_range() — Optimized File Copy

copy_file_range() enables server-side copy for network file systems and reflink for copy-on-write file systems:

copy_file_range()
C
#define _GNU_SOURCE
#include <unistd.h>
 
/* Efficient file copy within same filesystem */
loff_t off_in = 0, off_out = 0;
ssize_t copied = copy_file_range(src_fd, &off_in,
                                  dst_fd, &off_out,
                                  size, 0);
 
/* Benefits:
 * - Btrfs/XFS: Creates reflink (instant, shares blocks)
 * - NFS: Server-side copy (no data over network)
 * - Otherwise: Kernel-space copy (still faster than read+write)
 */
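
Like read() and write(), copy_file_range() may copy less than requested, and it can be unavailable for a given pair of files. A hedged sketch of a wrapper that loops and falls back to a plain copy follows. The names copy_range_all, copy_fallback, and demo_copy_range are invented, and the set of errno values treated as "not supported" is an assumption that may need adjusting for a particular kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Plain read()/write() loop, used when copy_file_range() is unavailable. */
static ssize_t copy_fallback(int src, int dst, size_t len)
{
    char buf[65536];
    size_t done = 0;

    while (done < len) {
        size_t want = len - done < sizeof buf ? len - done : sizeof buf;
        ssize_t r = read(src, buf, want);
        if (r < 0)
            return -1;
        if (r == 0)
            break;
        for (ssize_t w = 0; w < r; ) {
            ssize_t n = write(dst, buf + w, (size_t)(r - w));
            if (n < 0)
                return -1;
            w += n;
        }
        done += (size_t)r;
    }
    return (ssize_t)done;
}

/* Copy len bytes from src to dst, both at their current file offsets. */
ssize_t copy_range_all(int src, int dst, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = copy_file_range(src, NULL, dst, NULL, len - done, 0);
        if (n < 0) {
            /* Assumed "unsupported here" errors: fall back if nothing copied yet. */
            if (done == 0 && (errno == ENOSYS || errno == EXDEV ||
                              errno == EINVAL || errno == EOPNOTSUPP))
                return copy_fallback(src, dst, len);
            return -1;
        }
        if (n == 0)
            break;                  /* source ended early */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Returns 0 if a scratch file is copied intact. */
int demo_copy_range(void)
{
    enum { SIZE = 100000 };
    char src_path[] = "/tmp/cfr-src-XXXXXX";
    char dst_path[] = "/tmp/cfr-dst-XXXXXX";
    int src = mkstemp(src_path);
    int dst = mkstemp(dst_path);
    char *a = malloc(SIZE), *b = malloc(SIZE);
    int result = -1;

    if (src >= 0 && dst >= 0 && a && b) {
        memset(a, 'z', SIZE);
        if (write(src, a, SIZE) == SIZE &&
            lseek(src, 0, SEEK_SET) == 0 &&     /* rewind before copying */
            copy_range_all(src, dst, SIZE) == SIZE &&
            pread(dst, b, SIZE, 0) == SIZE &&
            memcmp(a, b, SIZE) == 0)
            result = 0;
    }

    free(a);
    free(b);
    if (src >= 0) {
        close(src);
        unlink(src_path);
    }
    if (dst >= 0) {
        close(dst);
        unlink(dst_path);
    }
    return result;
}
```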

Zero-Copy Operations Comparison
Operation	Source	Destination	Use Case
`sendfile()`	Regular file	Socket (any fd since Linux 2.6.33)	Static file serving
`splice()`	Any fd (one side must be a pipe)	Any fd (one side must be a pipe)	Generic zero-copy with pipe
`copy_file_range()`	File	File	File duplication, backup
`io_uring` + registered buffers	File/socket	File/socket	High-throughput async I/O

fsync() and Data Integrity

For critical data, applications must ensure writes reach durable storage. Linux provides several mechanisms with different guarantees and performance characteristics.

Data Durability Operations
Operation	Syncs Data?	Syncs Metadata?	Performance
`fsync(fd)`	Yes	Yes (all metadata)	Slowest
`fdatasync(fd)`	Yes	Yes (essential only)	Faster
`sync()`	All files	All metadata	Very slow
`syncfs(fd)`	All files on FS	All metadata on FS	Slow
`O_SYNC`	Each write	Each write	Very slow writes
`O_DSYNC`	Each write	Essential only	Slow writes

Durability patterns
C
/* Pattern 1: fsync after batch of writes */
for (int i = 0; i < 1000; i++) {
    write(fd, data[i], size);
}
fsync(fd);  /* One sync for all writes */
 
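/* Pattern 2: sync_file_range to pace writeback (Linux-specific, needs
 * _GNU_SOURCE). It does not write metadata or flush the device cache,
 * so it is not a substitute for fsync()/fdatasync(). */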
write(fd, data, large_size);
 
/* Start async writeback */
sync_file_range(fd, offset, length,
                SYNC_FILE_RANGE_WRITE);
 
/* ... do other work ... */
 
/* Wait for completion */
sync_file_range(fd, offset, length,
                SYNC_FILE_RANGE_WAIT_BEFORE |
                SYNC_FILE_RANGE_WRITE |
                SYNC_FILE_RANGE_WAIT_AFTER);
 
/* Pattern 3: Write-ahead logging */
write(log_fd, log_record, log_size);
fdatasync(log_fd);  /* Log must be durable first */
write(data_fd, data, data_size);
/* Data sync can be delayed */
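
Pattern 3 can be fleshed out into a small helper. The sketch below is only an illustration of the ordering rule (log first, made durable, then data): the record layout struct wal_header and the names write_full, wal_update, and demo_wal are invented for this example, and recovery from the log is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical log record header: where the payload belongs and its size. */
struct wal_header {
    uint64_t offset;
    uint64_t len;
};

/* Write all of buf at the current offset, retrying short writes and EINTR. */
static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Log the intended change durably, then apply it to the data file. */
int wal_update(int log_fd, int data_fd, uint64_t offset,
               const void *buf, uint64_t len)
{
    struct wal_header hdr = { .offset = offset, .len = len };

    if (write_full(log_fd, &hdr, sizeof hdr) < 0 ||
        write_full(log_fd, buf, (size_t)len) < 0)
        return -1;
    if (fdatasync(log_fd) < 0)      /* the log must be durable first */
        return -1;

    /* Only now touch the data file; its sync can be deferred. */
    if (lseek(data_fd, (off_t)offset, SEEK_SET) < 0)
        return -1;
    return write_full(data_fd, buf, (size_t)len);
}

/* Returns 0 if both the log record and the data update are in place. */
int demo_wal(void)
{
    char log_path[] = "/tmp/wal-log-XXXXXX";
    char data_path[] = "/tmp/wal-data-XXXXXX";
    int log_fd = mkstemp(log_path);
    int data_fd = mkstemp(data_path);
    char buf[8] = {0};
    int result = -1;

    if (log_fd >= 0 && data_fd >= 0 &&
        wal_update(log_fd, data_fd, 4, "world", 5) == 0 &&
        pread(data_fd, buf, 5, 4) == 5 &&
        memcmp(buf, "world", 5) == 0 &&
        lseek(log_fd, 0, SEEK_END) == (off_t)(sizeof(struct wal_header) + 5))
        result = 0;

    if (log_fd >= 0) {
        close(log_fd);
        unlink(log_path);
    }
    if (data_fd >= 0) {
        close(data_fd);
        unlink(data_path);
    }
    return result;
}
```

If the process crashes after the fdatasync() but before the data write completes, the log still holds everything needed to redo the update, which is the point of syncing the log first.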

Storage Stack Caveats

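On Linux, fsync() writes back the file's dirty data and metadata and then asks the device to flush its volatile write cache, so a successful return normally means the data is on stable media. That guarantee depends on every layer honoring the flush: devices or controllers that acknowledge a flush without performing it, or file systems mounted with write barriers disabled, can still lose data on power loss. Many SSDs have volatile write caches; for strict durability, use devices with power-loss protection (battery- or capacitor-backed caches) or disable write caching at a performance penalty. Enterprise storage typically handles this correctly.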

Summary: File Operations in Linux

We've explored the complete implementation of file operations in Linux. Let's consolidate the key concepts:

Key Takeaways

• Three-level file table structure (fd table → struct file → inode) enables flexible sharing and resource management across processes.
• open() performs pathname resolution, permission checks, file/inode allocation, and driver initialization before returning a file descriptor.
• read() and write() dispatch through VFS to file-system-specific implementations, with buffered I/O using the page cache and direct I/O bypassing it.
• close() may not immediately release resources: release() executes only when the last reference to struct file is dropped.
• File locking (flock, POSIX, OFD) provides advisory coordination, with important semantic differences that affect multi-process code.
• Zero-copy operations (sendfile, splice, copy_file_range) enable efficient data movement without user-space copies.
• fsync() and related calls ensure data durability, but true guarantees depend on the entire storage stack.

Module Complete:

This concludes our deep dive into Linux File Systems. You've now explored:

VFS Implementation — The abstraction layer enabling file system diversity
ext4 Internals — The most widely deployed file system's design
Block I/O Layer — How file system requests become disk operations
Page Cache — Memory caching for file data
File Operations — The complete lifecycle of file I/O

Together, these components form the foundation of how Linux manages persistent storage—knowledge essential for kernel development, system administration, and building high-performance storage applications.

Module Complete

You now possess Principal Engineer-level understanding of Linux file system internals. This knowledge enables you to debug complex I/O issues, optimize storage-intensive applications, make informed file system configuration decisions, and understand the tradeoffs inherent in any storage architecture.

5 / 5

Loading learning content...

Operating SystemsLinux File Systems

Linux File Systems

LevelAdvanced

Duration120 mins

TopicLinux File Systems

5 / 5

File Operations

The Journey of Every File Access

Understanding this journey is essential for anyone who needs to debug file system issues, write kernel code, or simply understand why some I/O patterns perform better than others.

What You Will Learn

File Descriptors and the Open File Table

Before examining individual operations, we must understand the data structures that represent open files. There are three distinct levels:

Converting Mermaid diagram...

Level 1: Per-Process File Descriptor Table

Each process has a file descriptor table (fd table)—an array mapping integer file descriptors to struct file * pointers. This is stored in the process's files_struct:

struct files_struct
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
struct files_struct {
    atomic_t count;                      /* Reference count */
    
    /* Fast path: inline array for first 64 fds */
    struct fdtable __rcu *fdt;           /* Current fd table */
    struct fdtable fdtab;                /* Initial inline table */
    
    spinlock_t file_lock;
    unsigned int next_fd;                /* Next fd to allocate */
    unsigned long close_on_exec_init[1]; /* Close-on-exec bitmap */
    unsigned long open_fds_init[1];      /* Open fds bitmap */
    unsigned long full_fds_bits_init[1];
    struct file __rcu *fd_array[NR_OPEN_DEFAULT]; /* Default: 64 */
};
 
/* Dynamically grown table for >64 fds */
struct fdtable {
    unsigned int max_fds;                /* Current size */
    struct file __rcu **fd;              /* Array of file pointers */
    unsigned long *close_on_exec;        /* Close-on-exec bitmap */
    unsigned long *open_fds;             /* Open fds bitmap */
    unsigned long *full_fds_bits;        /* Full word bitmap */
    struct rcu_head rcu;
};

Level 2: System-Wide Open File Table

Each struct file represents an open instance of a file. Multiple file descriptors (in the same or different processes) can point to the same struct file. Key fields we saw earlier:

f_pos: Current file position (shared among all fds pointing here)
f_flags: Open flags (O_RDONLY, O_APPEND, etc.)
f_op: File operations table
f_path: Path (dentry + mount)
f_inode: Pointer to the inode

Level 3: Inode

Why Three Levels?

The open() System Call

The open() system call is the beginning of all file access. It translates a pathname into a file descriptor through a multi-step process.

System Call Entry

open() implementation (simplified)
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
/* User-space call */
int fd = open("/path/to/file", O_RDWR | O_CREAT, 0644);
 
/* System call handler (fs/open.c) */
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
    if (force_o_largefile())
        flags |= O_LARGEFILE;
    return do_sys_open(AT_FDCWD, filename, flags, mode);
}
 
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    struct open_how how = {
        .flags = flags,
        .mode = mode,
    };
    return do_sys_openat2(dfd, filename, &how);
}
 
static long do_sys_openat2(int dfd, const char __user *filename,
                           struct open_how *how)
{
    struct open_flags op;
    int fd = build_open_flags(how, &op);
    struct filename *tmp;
    
    if (fd)
        return fd;
    
    /* Copy filename from user space */
    tmp = getname(filename);
    if (IS_ERR(tmp))
        return PTR_ERR(tmp);
    
    /* Allocate a file descriptor */
    fd = get_unused_fd_flags(how->flags);
    if (fd >= 0) {
        /* Perform the actual open */
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            /* Install file in fd table */
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}

Path Resolution (do_filp_open)

The core work happens in do_filp_open(), which performs pathname resolution and file setup:

Converting Mermaid diagram...

Key Steps in open()

Flag validation: Verify flag combinations are legal
FD allocation: Reserve a file descriptor number from the process's table
Pathname resolution: Walk the directory tree to find the target
Existence check: Handle O_CREAT, O_EXCL based on whether file exists
Permission check: Verify access rights for the requested operation
File creation: Create struct file, set up operations table
Driver/FS open: Call file system's open method (may trigger hardware access)
FD installation: Link the fd to the struct file

TOCTOU Vulnerabilities

read() and write() Paths

Once a file is open, reading and writing follow well-defined paths through VFS and the file system.

The read() System Call

read() implementation path
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
/* User space */
ssize_t n = read(fd, buffer, count);
 
/* System call */
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    return ksys_read(fd, buf, count);
}
 
ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
    struct fd f = fdget_pos(fd);        /* Get file + position lock */
    ssize_t ret = -EBADF;
    
    if (f.file) {
        loff_t pos, *ppos = file_ppos(f.file);
        if (ppos) {
            pos = *ppos;
            ppos = &pos;
        }
        ret = vfs_read(f.file, buf, count, ppos);
        if (ret >= 0 && ppos)
            f.file->f_pos = pos;
        fdput_pos(f);
    }
    return ret;
}
 
ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
    ssize_t ret;
    
    /* Permission and limit checks */
    if (!(file->f_mode & FMODE_READ))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_READ))
        return -EINVAL;
    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;
    
    ret = rw_verify_area(READ, file, pos, count);
    if (ret)
        return ret;
    
    if (count > MAX_RW_COUNT)
        count = MAX_RW_COUNT;
    
    /* Call the file's read method */
    if (file->f_op->read)
        ret = file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        ret = new_sync_read(file, buf, count, pos);
    else
        ret = -EINVAL;
    
    if (ret > 0) {
        fsnotify_access(file);          /* Notify file access */
        add_rchar(current, ret);        /* Accounting */
    }
    
    return ret;
}

Buffered vs Direct I/O

The path diverges based on whether the file was opened with O_DIRECT:

Buffered I/O Path

•generic_file_read_iter()
•Check page cache for pages
•On miss: trigger readpage/readahead
•Wait for pages to be uptodate
•copy_to_iter() to user buffer
•May satisfy entirely from memory

Direct I/O Path

•generic_file_read_iter() with IOCB_DIRECT
•Bypass page cache entirely
•a_ops->direct_IO() called
•BIO submitted directly to block layer
•Data transferred DMA to user buffer
•Always involves disk I/O

The write() System Call

Writes follow a similar pattern but with additional complexity for durability:

write() key steps
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
ssize_t vfs_write(struct file *file, const char __user *buf,
                  size_t count, loff_t *pos)
{
    ssize_t ret;
    
    /* Permission checks */
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_WRITE))
        return -EINVAL;
    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;
    
    ret = rw_verify_area(WRITE, file, pos, count);
    if (ret)
        return ret;
    
    /* Remove suid/sgid bits on write */
    ret = file_remove_privs(file);
    if (ret)
        return ret;
    
    /* Call file's write method */
    if (file->f_op->write)
        ret = file->f_op->write(file, buf, count, pos);
    else if (file->f_op->write_iter)
        ret = new_sync_write(file, buf, count, pos);
    else
        ret = -EINVAL;
    
    if (ret > 0) {
        fsnotify_modify(file);          /* Notify file modification */
        add_wchar(current, ret);        /* Accounting */
    }
    
    return ret;
}
 
/* Buffered write path (simplified: fault-in and short-copy retry omitted) */
ssize_t generic_perform_write(struct file *file, struct iov_iter *i,
                               loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;
    size_t remaining = iov_iter_count(i);   /* bytes still to write */
    ssize_t written = 0;
    long status = 0;
    
    while (remaining) {
        struct page *page;
        void *fsdata = NULL;
        unsigned long offset = pos & (PAGE_SIZE - 1);
        unsigned long bytes = min_t(unsigned long, PAGE_SIZE - offset, remaining);
        size_t copied;
        
        /* Prepare page for writing (may allocate, read for partial) */
        status = a_ops->write_begin(file, mapping, pos, bytes,
                                     &page, &fsdata);
        if (status < 0)
            break;
        
        /* Copy data from user */
        copied = copy_page_from_iter_atomic(page, offset, bytes, i);
        flush_dcache_page(page);
        
        /* Finalize write, mark page dirty */
        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                   page, fsdata);
        if (status < 0)
            break;
        
        pos += copied;
        written += copied;
        remaining -= copied;
    }
    
    /* Report progress if any bytes were written, otherwise the error */
    return written ? written : status;
}

O_APPEND and Atomicity
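
With O_APPEND, the kernel moves the offset to end-of-file as part of each write(), so writers that hold separate descriptors do not overwrite one another. The sketch below writes three 5-byte records through two independent opens of the same file, once with O_APPEND and once without; `interleaved_writes` is a hypothetical helper used only for illustration.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write three 5-byte records through two independent descriptors.
 * Returns the final file size, or -1 on error. */
off_t interleaved_writes(const char *path, int extra_flags)
{
    int a = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0644);
    int b = open(path, O_WRONLY | extra_flags);
    off_t size = -1;

    if (a >= 0 && b >= 0 &&
        write(a, "aaaa\n", 5) == 5 &&
        write(b, "bbbb\n", 5) == 5 &&   /* b's own offset is still 0 here */
        write(a, "cccc\n", 5) == 5)
        size = lseek(a, 0, SEEK_END);

    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return size;
}
```

Called with O_APPEND the file ends up 15 bytes long; called with 0 the second descriptor overwrites the first record and the file is only 10 bytes long.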

close() and File Release

Closing a file descriptor involves cleanup at multiple levels. The process is more nuanced than simply "freeing resources."

close() implementation
C
SYSCALL_DEFINE1(close, unsigned int, fd)
{
    int retval = close_fd(fd);
    
    /* Can't restart close() after interrupt */
    if (unlikely(retval == -ERESTARTSYS ||
                 retval == -ERESTARTNOINTR ||
                 retval == -ERESTARTNOHAND ||
                 retval == -ERESTART_RESTARTBLOCK))
        retval = -EINTR;
    
    return retval;
}
 
int close_fd(unsigned fd)
{
    struct files_struct *files = current->files;
    struct file *file;
    
    spin_lock(&files->file_lock);
    /* Remove fd from table */
    file = pick_file(files, fd);
    spin_unlock(&files->file_lock);
    
    if (!file)
        return -EBADF;
    
    return filp_close(file, files);
}
 
int filp_close(struct file *filp, fl_owner_t id)
{
    int retval = 0;
    
    if (!file_count(filp)) {
        printk(KERN_ERR "VFS: Close: file count is 0\n");
        return 0;
    }
    
    /* Call flush handler (for special files) */
    if (filp->f_op->flush)
        retval = filp->f_op->flush(filp, id);
    
    /* Remove any POSIX locks */
    if (likely(!(filp->f_mode & FMODE_PATH))) {
        dnotify_flush(filp, id);
        locks_remove_posix(filp, id);
    }
    
    /* Drop reference (may trigger release) */
    fput(filp);
    return retval;
}

When is release() Called?

The file's release() method is only called when the last reference to the struct file is dropped. This means:

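•dup() creates a new fd pointing to the same struct file, so closing one of them doesn't trigger release
•fork() copies the fd table, so parent and child hold references to the same struct file; the child's close() only drops its own reference
•Memory-mapped files hold a reference, so release waits until the mapping is removed with munmap()

The actual release happens in __fput(), which fput() usually defers: to task work that runs before the task returns to user space, or to a workqueue when called from a kernel thread or interrupt context.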
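
The reference counting is observable from user space. In this sketch (the helper name `offset_after_closing_original` is hypothetical), the descriptor returned by dup() stays usable after the original is closed, and it continues from the shared file offset because both descriptors refer to one struct file.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the offset seen through the duplicate after the original fd
 * has been closed, or -1 on error. */
off_t offset_after_closing_original(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    int copy = -1;
    off_t off = -1;

    if (fd < 0)
        return -1;

    copy = dup(fd);                     /* second fd, same struct file */
    if (copy >= 0 && write(fd, "abc", 3) == 3) {
        close(fd);                      /* drops one reference only */
        fd = -1;
        if (write(copy, "def", 3) == 3) /* the file is still open */
            off = lseek(copy, 0, SEEK_CUR);
    }

    if (fd >= 0)
        close(fd);
    if (copy >= 0)
        close(copy);                    /* last reference goes away here */
    return off;
}
```

The function returns 6: three bytes written through each descriptor, both advancing the same offset.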

close() Error Handling
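
The close() syscall shown above converts restart codes into EINTR because the descriptor slot has already been cleared by the time an error can be reported. On Linux this means a failed close() should not be retried: the number may already belong to a file that another thread has just opened. A minimal sketch, with `close_once` as a hypothetical helper name:

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Close fd exactly once. Returns 0 on success, or -1 with errno set. */
int close_once(int fd)
{
    if (close(fd) == 0)
        return 0;

    /* On Linux the descriptor is released even when close() reports
     * EINTR or EIO, so the call must not be repeated. The error still
     * matters: it can be the only report of a failed deferred write. */
    if (errno != EBADF)
        fprintf(stderr, "close(%d) failed: errno=%d\n", fd, errno);
    return -1;
}
```

Applications that need to know whether data reached storage should call fsync() before close() rather than rely on the close() result alone.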

Seeking and File Position

The lseek() system call changes the file offset for subsequent read/write operations. It's simpler than open/read/write but reveals important file system capabilities.

lseek() implementation
C
SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
{
    return ksys_lseek(fd, offset, whence);
}
 
off_t ksys_lseek(unsigned int fd, off_t offset, unsigned int whence)
{
    off_t retval;
    struct fd f = fdget_pos(fd);
    
    if (!f.file)
        return -EBADF;
    
    retval = -EINVAL;
    if (whence <= SEEK_MAX) {
        loff_t res = vfs_llseek(f.file, offset, whence);
        retval = res;
        if (res != (loff_t)retval)
            retval = -EOVERFLOW;  /* Offset too large */
    }
    
    fdput_pos(f);
    return retval;
}
 
loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
{
    loff_t (*fn)(struct file *, loff_t, int);
    
    /* Default: not seekable (pipes, sockets); no_llseek returns -ESPIPE */
    fn = no_llseek;
    if (file->f_mode & FMODE_LSEEK) {
        if (file->f_op->llseek)
            fn = file->f_op->llseek;
    }
    
    return fn(file, offset, whence);
}

Seek Modes
Mode	Description	New Position
`SEEK_SET`	Absolute position	offset
`SEEK_CUR`	Relative to current	current + offset
`SEEK_END`	Relative to file end	file_size + offset
`SEEK_DATA`	Next data after offset	First non-hole >= offset
`SEEK_HOLE`	Next hole after offset	First hole >= offset
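
Two of the basic modes cover common user-space idioms: SEEK_CUR with offset 0 queries the position without moving it, and SEEK_END with offset 0 yields the file size. The sketch below uses both; `is_seekable` and `file_size` are hypothetical helper names.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* 1 if fd has a seekable offset, 0 if lseek() fails with ESPIPE
 * (pipe, FIFO, socket), -1 on any other error. */
int is_seekable(int fd)
{
    if (lseek(fd, 0, SEEK_CUR) != (off_t)-1)    /* pure query, no movement */
        return 1;
    return errno == ESPIPE ? 0 : -1;
}

/* Size of a regular file, found by seeking to its end. Returns -1 on error. */
off_t file_size(const char *path)
{
    int fd = open(path, O_RDONLY);
    off_t end;

    if (fd < 0)
        return -1;
    end = lseek(fd, 0, SEEK_END);               /* file_size + 0 */
    close(fd);
    return end;
}
```

The ESPIPE case corresponds to the no_llseek default in vfs_llseek(): descriptors without a meaningful offset refuse every seek mode.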

SEEK_DATA and SEEK_HOLE

These modes (added in Linux 3.1) enable efficient traversal of sparse files:

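// Find all data regions in a sparse file
// (needs #define _GNU_SOURCE before the includes for SEEK_DATA/SEEK_HOLE)
off_t pos = 0;
while ((pos = lseek(fd, pos, SEEK_DATA)) != -1) {
    off_t data_start = pos;
    off_t data_end = lseek(fd, pos, SEEK_HOLE);
    if (data_end == -1)
        break;
    printf("Data region: %lld - %lld\n",
           (long long)data_start, (long long)data_end);
    pos = data_end;
}
// The loop ends when SEEK_DATA fails with ENXIO: no data at or after pos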

This is far more efficient than reading the entire file to find non-zero regions.

File Locking Mechanisms

Linux provides multiple file locking mechanisms for coordinating access between processes.

File Locking Types
Mechanism	Scope	Mandatory?	Interface
flock()	Whole file	Advisory only	flock(fd, operation)
POSIX locks	Byte ranges	Advisory (mandatory mode removed in Linux 5.15)	fcntl(fd, F_SETLK, &flock)
OFD locks	Byte ranges, file-based	Advisory only	fcntl(fd, F_OFD_SETLK, &flock)
lockf()	Byte ranges	Advisory only	lockf(fd, cmd, len)

flock() — BSD-style Locking

flock() usage
C
#include <sys/file.h>
 
int fd = open("/tmp/lockfile", O_RDWR);
 
/* Acquire shared (read) lock */
flock(fd, LOCK_SH);  /* Blocks if exclusive lock held */
 
/* Acquire exclusive (write) lock */
flock(fd, LOCK_EX);  /* Blocks if any lock held */
 
/* Try without blocking */
if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
    if (errno == EWOULDBLOCK)
        printf("Could not acquire lock\n");
}
 
/* Release lock */
flock(fd, LOCK_UN);
 
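/* The lock belongs to the open file description (struct file): it is shared
 * with dup()'d fds and fork() children, and is released by LOCK_UN or when
 * the last fd referring to that struct file is closed */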
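
Because a flock() lock is tied to the struct file rather than to the process, two separate open() calls on the same path are treated as independent lockers, even within one process. A small sketch under that assumption, for a local filesystem; `second_open_is_blocked` is a hypothetical helper name.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Returns 1 if a second, independent open() of the same file is refused
 * an exclusive lock while the first holds one, 0 if it was granted,
 * -1 on setup error. */
int second_open_is_blocked(const char *path)
{
    int a = open(path, O_RDWR | O_CREAT, 0644);
    int b = open(path, O_RDWR);         /* separate struct file */
    int blocked = -1;

    if (a >= 0 && b >= 0 && flock(a, LOCK_EX) == 0) {
        if (flock(b, LOCK_EX | LOCK_NB) == 0)
            blocked = 0;
        else if (errno == EWOULDBLOCK)
            blocked = 1;
    }

    if (b >= 0)
        close(b);
    if (a >= 0)
        close(a);                       /* last fd for a's struct file: lock dropped */
    return blocked;
}
```

A descriptor obtained with dup(a) would not be blocked, since it shares the struct file that already owns the lock.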

POSIX Locks (fcntl) — Byte-Range Locking

POSIX locking
C
#include <fcntl.h>
 
struct flock fl = {
    .l_type = F_WRLCK,    /* F_RDLCK, F_WRLCK, F_UNLCK */
    .l_whence = SEEK_SET,
    .l_start = 0,         /* Starting offset */
    .l_len = 100,         /* Number of bytes (0 = from l_start to end of file) */
    .l_pid = 0,           /* PID of blocking process (for F_GETLK) */
};
 
/* Set lock (blocking) */
fcntl(fd, F_SETLKW, &fl);
 
/* Set lock (non-blocking) */
if (fcntl(fd, F_SETLK, &fl) == -1) {
    if (errno == EAGAIN || errno == EACCES)
        printf("Lock held by another process\n");
}
 
/* Query lock status */
fl.l_type = F_WRLCK;
fcntl(fd, F_GETLK, &fl);
if (fl.l_type != F_UNLCK)
    printf("Lock held by PID %d\n", fl.l_pid);
 
/* Unlock */
fl.l_type = F_UNLCK;
fcntl(fd, F_SETLK, &fl);
 
/* WARNING: POSIX locks are associated with (PID, inode) pair!
 * Closing ANY fd to the file releases ALL locks.
 * This is often surprising and problematic. */

POSIX Lock Pitfall
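
The warning in the listing above can be reproduced in a few lines. The sketch takes a POSIX write lock through one descriptor, closes a different descriptor for the same file, and then probes whether the lock still exists. The probe uses an OFD lock on a third descriptor, which conflicts with a POSIX lock even inside the same process; `try_wrlock` and `posix_lock_lost_on_close` are hypothetical helper names, and F_OFD_SETLK assumes Linux 3.15 or later.

```c
#define _GNU_SOURCE                 /* F_OFD_SETLK */
#include <fcntl.h>
#include <unistd.h>

/* Try to take a whole-file write lock with the given fcntl command. */
static int try_wrlock(int fd, int cmd)
{
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,                 /* 0 = through end of file */
        .l_pid = 0,                 /* must be 0 for OFD commands */
    };
    return fcntl(fd, cmd, &fl);
}

/* Returns 1 if closing an unrelated descriptor dropped the POSIX lock,
 * 0 if the lock survived, -1 on setup error. */
int posix_lock_lost_on_close(const char *path)
{
    int locked_fd = open(path, O_RDWR | O_CREAT, 0644);
    int other_fd = open(path, O_RDONLY);
    int probe_fd = open(path, O_RDWR);
    int lost = -1;

    if (locked_fd < 0 || other_fd < 0 || probe_fd < 0)
        goto out;
    if (try_wrlock(locked_fd, F_SETLK) == -1)
        goto out;
    /* While the POSIX lock is held, the OFD probe must be refused */
    if (try_wrlock(probe_fd, F_OFD_SETLK) == 0)
        goto out;

    close(other_fd);                /* releases ALL POSIX locks on the file */
    other_fd = -1;

    lost = (try_wrlock(probe_fd, F_OFD_SETLK) == 0);
out:
    if (other_fd >= 0)
        close(other_fd);
    if (probe_fd >= 0)
        close(probe_fd);
    if (locked_fd >= 0)
        close(locked_fd);
    return lost;
}
```

This is the usual argument for OFD locks in libraries and multi-threaded programs: an OFD lock goes away only when its own open file description is closed.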

Advanced File Operations

Beyond basic read/write, Linux provides specialized operations for high-performance and zero-copy I/O.

sendfile() — Zero-Copy File to Socket

sendfile() transfers data directly between file descriptors in kernel space, avoiding user-space copies:

sendfile() usage
C
#include <sys/sendfile.h>
 
/* Common use: serving static files over network */
int file_fd = open("large_file.bin", O_RDONLY);
int socket_fd = accept(listen_fd, ...);
 
/* Send entire file */
struct stat st;
fstat(file_fd, &st);
off_t offset = 0;
 
ssize_t sent = sendfile(socket_fd, file_fd, &offset, st.st_size);
 
/* Data flow: disk → page cache → socket buffer
 * Never touches user-space memory */
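
A single sendfile() call may transfer fewer bytes than requested, so production code normally loops. The following is a sketch under the assumption of a blocking out_fd; `send_whole_file` is a hypothetical helper name.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole file at `path` to out_fd (assumed to be blocking).
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_whole_file(int out_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int in_fd = open(path, O_RDONLY);

    if (in_fd < 0)
        return -1;
    if (fstat(in_fd, &st) == -1) {
        close(in_fd);
        return -1;
    }

    while (offset < st.st_size) {
        /* sendfile() advances `offset` by the number of bytes it sent */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n == 0)
            break;                  /* file shrank underneath us */
        if (n < 0 && errno != EINTR) {
            close(in_fd);
            return -1;
        }
    }

    close(in_fd);
    return (ssize_t)offset;
}
```

With a non-blocking socket the loop would also have to wait for writability on EAGAIN, which is left out here.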

splice() — Generic Zero-Copy Between FDs

splice() moves data between a pipe and a file descriptor without copying to user space:

splice() for zero-copy
C
#define _GNU_SOURCE       /* required for splice() */
#include <fcntl.h>
 
int pipefd[2];
pipe(pipefd);
 
/* Copy from file to socket via pipe (zero-copy) */
ssize_t n;
 
/* File → pipe */
n = splice(file_fd, &file_offset, pipefd[1], NULL, 
           chunk_size, SPLICE_F_MOVE | SPLICE_F_MORE);
 
/* Pipe → socket */
n = splice(pipefd[0], NULL, socket_fd, NULL,
           n, SPLICE_F_MOVE);
 
/* Also useful: vmsplice() maps user pages into pipe
 * tee() copies pipe contents to another pipe without consuming */
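
The two splice() calls above each move at most one pipe's worth of data and may return short counts, so a complete transfer needs two nested loops. A sketch of that structure, with `splice_copy` as a hypothetical helper name:

```c
#define _GNU_SOURCE                 /* splice() */
#include <fcntl.h>
#include <unistd.h>

/* Copy up to len bytes from in_fd (at its current offset) to out_fd,
 * staging the data in a pipe. Returns bytes copied, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    size_t total = 0;
    ssize_t ret = -1;

    if (pipe(p) == -1)
        return -1;

    while (total < len) {
        /* Stage 1: source -> pipe, bounded by the pipe's capacity */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len - total,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n == 0)
            break;                  /* source EOF */
        if (n < 0)
            goto out;

        /* Stage 2: drain everything just queued, pipe -> destination */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m <= 0)
                goto out;
            n -= m;
            total += (size_t)m;
        }
    }
    ret = (ssize_t)total;
out:
    close(p[0]);
    close(p[1]);
    return ret;
}
```

Draining the pipe completely before refilling it keeps the pipe from filling up and blocking the first stage.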

copy_file_range() — Optimized File Copy

copy_file_range() enables server-side copy for network file systems and reflink for copy-on-write file systems:

copy_file_range()
C
#define _GNU_SOURCE
#include <unistd.h>
 
/* Efficient file copy within same filesystem */
loff_t off_in = 0, off_out = 0;
ssize_t copied = copy_file_range(src_fd, &off_in,
                                  dst_fd, &off_out,
                                  size, 0);
 
/* Benefits:
 * - Btrfs/XFS: Creates reflink (instant, shares blocks)
 * - NFS: Server-side copy (no data over network)
 * - Otherwise: Kernel-space copy (still faster than read+write)
 */
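
Like read() and write(), copy_file_range() can return a short count, and it can fail outright on kernels or filesystem combinations that do not support it. The sketch below loops until the file is copied and falls back to read()/write() if the first call is refused; `copy_file` and `copy_by_read_write` are hypothetical helper names, and the set of errno values treated as "unsupported" is an assumption.

```c
#define _GNU_SOURCE                 /* copy_file_range(), off64_t */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Plain read()/write() copy, used when copy_file_range() is unavailable. */
static ssize_t copy_by_read_write(int src_fd, int dst_fd)
{
    char buf[65536];
    ssize_t total = 0, n;

    while ((n = read(src_fd, buf, sizeof buf)) != 0) {
        ssize_t off = 0;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        while (off < n) {
            ssize_t w = write(dst_fd, buf + off, (size_t)(n - off));
            if (w < 0 && errno == EINTR)
                continue;
            if (w <= 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return total;
}

/* Copy src to dst (created or truncated). Returns bytes copied, or -1. */
ssize_t copy_file(const char *src, const char *dst)
{
    struct stat st;
    ssize_t total = -1;
    off64_t left;
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (in < 0 || out < 0 || fstat(in, &st) == -1)
        goto done;

    total = 0;
    left = st.st_size;
    while (left > 0) {
        /* NULL offsets: use and advance each fd's own file position */
        ssize_t n = copy_file_range(in, NULL, out, NULL, (size_t)left, 0);
        if (n > 0) {
            total += n;
            left -= n;
        } else if (n == 0) {
            break;                  /* source shorter than fstat() reported */
        } else if (errno == EINTR) {
            continue;
        } else if (total == 0 && (errno == ENOSYS || errno == EXDEV ||
                                  errno == EINVAL || errno == EOPNOTSUPP)) {
            total = copy_by_read_write(in, out);    /* nothing copied yet */
            break;
        } else {
            total = -1;
            break;
        }
    }
done:
    if (in >= 0)
        close(in);
    if (out >= 0)
        close(out);
    return total;
}
```

The fallback is only taken before any bytes have been copied, so the two descriptors' file positions are still at the start when it runs.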

Zero-Copy Operations Comparison
Operation	Source	Destination	Use Case
`sendfile()`	Regular file	Socket (any file since Linux 2.6.33)	Static file serving
`splice()`	Any + pipe	Any + pipe	Generic zero-copy with pipe
`copy_file_range()`	File	File	File duplication, backup
`io_uring` + registered buffers	File/socket	File/socket	High-throughput async I/O

fsync() and Data Integrity

For critical data, applications must ensure writes reach durable storage. Linux provides several mechanisms with different guarantees and performance characteristics.

Data Durability Operations
Operation	Syncs Data?	Syncs Metadata?	Performance
`fsync(fd)`	Yes	Yes (all metadata)	Slowest
`fdatasync(fd)`	Yes	Yes (essential only)	Faster
`sync()`	All files	All metadata	Very slow
`syncfs(fd)`	All files on FS	All metadata on FS	Slow
`O_SYNC`	Each write	Each write	Very slow writes
`O_DSYNC`	Each write	Essential only	Slow writes

Durability patterns
C
/* Pattern 1: fsync after batch of writes */
for (int i = 0; i < 1000; i++) {
    write(fd, data[i], size);
}
fsync(fd);  /* One sync for all writes */
 
/* Pattern 2: sync_file_range for control */
write(fd, data, large_size);
 
/* Start async writeback */
sync_file_range(fd, offset, length,
                SYNC_FILE_RANGE_WRITE);
 
/* ... do other work ... */
 
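/* Wait for writeback to complete (data pages only: metadata and the
 * device write cache are not flushed, so this is not a durability guarantee) */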
sync_file_range(fd, offset, length,
                SYNC_FILE_RANGE_WAIT_BEFORE |
                SYNC_FILE_RANGE_WRITE |
                SYNC_FILE_RANGE_WAIT_AFTER);
 
/* Pattern 3: Write-ahead logging */
write(log_fd, log_record, log_size);
fdatasync(log_fd);  /* Log must be durable first */
write(data_fd, data, data_size);
/* Data sync can be delayed */
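
A related pattern, shown here as a sketch rather than as part of the listing above, replaces a file's contents so that a crash leaves either the old or the new version: write a temporary file, fsync() it, rename() it over the target, then fsync() the directory so the rename itself is durable. The helper name `replace_file_durably` and the ".name.tmp" naming scheme are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L     /* O_DIRECTORY, fsync() */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace dir/name with `len` bytes from `data`. Returns 0 or -1. */
int replace_file_durably(const char *dir, const char *name,
                         const void *data, size_t len)
{
    char tmp[4096], dst[4096];
    int fd, dfd, rc;

    if (snprintf(tmp, sizeof tmp, "%s/.%s.tmp", dir, name) >= (int)sizeof tmp ||
        snprintf(dst, sizeof dst, "%s/%s", dir, name) >= (int)sizeof dst)
        return -1;                  /* path too long */

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* 1. New contents must be durable before they become visible */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* 2. Atomically switch the name to the new file */
    if (rename(tmp, dst) != 0) {
        unlink(tmp);
        return -1;
    }

    /* 3. Persist the directory entry change */
    dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

A short write is treated as a failure here to keep the sketch small; a fuller version would loop until all bytes are written.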

Storage Stack Caveats

Summary: File Operations in Linux

We've explored the complete implementation of file operations in Linux. Let's consolidate the key concepts:

Key Takeaways

•Three-level file table structure (fd table → struct file → inode) enables flexible sharing and resource management across processes.
•open() performs pathname resolution, permission checks, file/inode allocation, and driver initialization before returning a file descriptor.
•read() and write() dispatch through VFS to file-system-specific implementations, with buffered I/O using the page cache and direct I/O bypassing it.
•close() may not immediately release resources—only when the last reference to struct file is dropped does release() execute.
•File locking (flock, POSIX, OFD) provides advisory coordination, with important semantic differences that affect multi-process code.
•Zero-copy operations (sendfile, splice, copy_file_range) enable efficient data movement without user-space copies.
•fsync() and related calls ensure data durability, but true guarantees depend on the entire storage stack.

Module Complete:

This concludes our deep dive into Linux File Systems. You've now explored:

VFS Implementation — The abstraction layer enabling file system diversity
ext4 Internals — The most widely deployed file system's design
Block I/O Layer — How file system requests become disk operations
Page Cache — Memory caching for file data
File Operations — The complete lifecycle of file I/O
