Every open(), read(), and write() you've ever called in user space triggers a carefully orchestrated journey through the kernel. From the system call entry point through VFS dispatch, file system operations, page cache interactions, and finally to disk I/O—each step involves precise data structure manipulation and locking protocols.
Understanding this journey is essential for anyone who needs to debug file system issues, write kernel code, or simply understand why some I/O patterns perform better than others.
By the end of this page, you will understand the complete implementation of file operations in Linux: file descriptor tables and the open file table, the open() system call flow from pathname to file descriptor, read and write paths including buffered and direct I/O, file locking mechanisms (flock, fcntl, lockf), and advanced operations like splice, sendfile, and copy_file_range.
Before examining individual operations, we must understand the data structures that represent open files. There are three distinct levels:
Each process has a file descriptor table (fd table)—an array mapping integer file descriptors to struct file * pointers. This is stored in the process's files_struct:
```c
struct files_struct {
    atomic_t count;                          /* Reference count */

    /* Fast path: inline array for first 64 fds */
    struct fdtable __rcu *fdt;               /* Current fd table */
    struct fdtable fdtab;                    /* Initial inline table */

    spinlock_t file_lock;
    unsigned int next_fd;                    /* Next fd to allocate */
    unsigned long close_on_exec_init[1];     /* Close-on-exec bitmap */
    unsigned long open_fds_init[1];          /* Open fds bitmap */
    unsigned long full_fds_bits_init[1];
    struct file __rcu *fd_array[NR_OPEN_DEFAULT];  /* Default: 64 */
};

/* Dynamically grown table for >64 fds */
struct fdtable {
    unsigned int max_fds;                    /* Current size */
    struct file __rcu **fd;                  /* Array of file pointers */
    unsigned long *close_on_exec;            /* Close-on-exec bitmap */
    unsigned long *open_fds;                 /* Open fds bitmap */
    unsigned long *full_fds_bits;            /* Full word bitmap */
    struct rcu_head rcu;
};
```

Each struct file represents an open instance of a file. Multiple file descriptors (in the same or different processes) can point to the same struct file. Key fields we saw earlier:
- f_pos: Current file position (shared among all fds pointing here)
- f_flags: Open flags (O_RDONLY, O_APPEND, etc.)
- f_op: File operations table
- f_path: Path (dentry + mount)
- f_inode: Pointer to the inode

The inode represents the actual file. Multiple struct file entries can reference the same inode (for multiply-opened files). The inode contains file metadata and, through its i_op and i_fop tables, defines how operations work.
This three-level design enables powerful sharing patterns. fork() gives the child its own copy of the fd table, but every entry still points to the same struct file as in the parent, so parent and child share file offsets and status flags. dup() creates a new fd pointing to the same struct file. Multiple open() calls on the same file create separate struct file entries (with independent positions) pointing to the same inode. This flexibility enables everything from shell pipelines to shared memory mappings.
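The difference between a dup()ed descriptor and a second open() can be observed from user space. The following is a minimal sketch rather than a definitive test: the function name `check_offset_sharing` and the scratch path are hypothetical, chosen only for illustration.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 0 if a dup()ed fd shares its offset with the original while a
 * separately opened fd keeps its own offset, -1 otherwise.
 */
int check_offset_sharing(void)
{
    char path[] = "/tmp/fdshareXXXXXX";   /* hypothetical scratch file */
    int ok = -1;

    int fd1 = mkstemp(path);              /* first struct file */
    if (fd1 < 0)
        return -1;
    int fd2 = dup(fd1);                   /* same struct file as fd1 */
    int fd3 = open(path, O_RDONLY);       /* new struct file, same inode */

    if (fd2 >= 0 && fd3 >= 0 && write(fd1, "0123456789", 10) == 10) {
        lseek(fd1, 4, SEEK_SET);                   /* move the shared offset */
        off_t via_dup  = lseek(fd2, 0, SEEK_CUR);  /* follows fd1 */
        off_t via_open = lseek(fd3, 0, SEEK_CUR);  /* unaffected by fd1 */
        if (via_dup == 4 && via_open == 0)
            ok = 0;
    }

    if (fd2 >= 0) close(fd2);
    if (fd3 >= 0) close(fd3);
    close(fd1);
    unlink(path);
    return ok;
}
```

Calling `check_offset_sharing()` from a small main() should return 0 on Linux: fd1 and fd2 share one f_pos, while fd3 has its own.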
The open() system call is the beginning of all file access. It translates a pathname into a file descriptor through a multi-step process.
```c
/* User-space call */
int fd = open("/path/to/file", O_RDWR | O_CREAT, 0644);

/* System call handler (fs/open.c) */
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
    if (force_o_largefile())
        flags |= O_LARGEFILE;
    return do_sys_open(AT_FDCWD, filename, flags, mode);
}

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    struct open_how how = {
        .flags = flags,
        .mode = mode,
    };
    return do_sys_openat2(dfd, filename, &how);
}

static long do_sys_openat2(int dfd, const char __user *filename,
                           struct open_how *how)
{
    struct open_flags op;
    int fd = build_open_flags(how, &op);
    struct filename *tmp;

    if (fd)
        return fd;

    /* Copy filename from user space */
    tmp = getname(filename);
    if (IS_ERR(tmp))
        return PTR_ERR(tmp);

    /* Allocate a file descriptor */
    fd = get_unused_fd_flags(how->flags);
    if (fd >= 0) {
        /* Perform the actual open */
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            /* Install file in fd table */
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}
```

The core work happens in do_filp_open(), which performs pathname resolution and file setup:
- Allocate the struct file and set up its operations table

The gap between checking a file's status and operating on it creates Time-of-Check-Time-of-Use (TOCTOU) vulnerabilities. An attacker might replace a file with a symlink between access() and open(). Use openat() with O_NOFOLLOW, or better yet, open first and then fstat/fchmod on the fd.
Once a file is open, reading and writing follow well-defined paths through VFS and the file system.
```c
/* User space */
ssize_t n = read(fd, buffer, count);

/* System call */
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    return ksys_read(fd, buf, count);
}

ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
    struct fd f = fdget_pos(fd);   /* Get file + position lock */
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos, *ppos = file_ppos(f.file);
        if (ppos) {
            pos = *ppos;
            ppos = &pos;
        }
        ret = vfs_read(f.file, buf, count, ppos);
        if (ret >= 0 && ppos)
            f.file->f_pos = pos;
        fdput_pos(f);
    }
    return ret;
}

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
    ssize_t ret;

    /* Permission and limit checks */
    if (!(file->f_mode & FMODE_READ))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_READ))
        return -EINVAL;
    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;

    ret = rw_verify_area(READ, file, pos, count);
    if (ret)
        return ret;
    if (count > MAX_RW_COUNT)
        count = MAX_RW_COUNT;

    /* Call the file's read method */
    if (file->f_op->read)
        ret = file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        ret = new_sync_read(file, buf, count, pos);
    else
        ret = -EINVAL;

    if (ret > 0) {
        fsnotify_access(file);     /* Notify file access */
        add_rchar(current, ret);   /* Accounting */
    }
    return ret;
}
```

The path diverges based on whether the file was opened with O_DIRECT: buffered reads are satisfied from the page cache (reading from disk on a miss), while O_DIRECT reads bypass the page cache and transfer data straight between the device and the user buffer.
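The position handling in ksys_read() is visible from user space: read() advances the shared f_pos, whereas pread() takes an explicit offset and leaves f_pos alone. The sketch below illustrates this; the function name `check_read_vs_pread` and the scratch path are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 0 if read() moved the file offset and pread() did not,
 * -1 otherwise.
 */
int check_read_vs_pread(void)
{
    char path[] = "/tmp/readposXXXXXX";   /* hypothetical scratch file */
    char buf[4];
    int ok = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                          /* file lives until fd is closed */

    if (write(fd, "abcdefgh", 8) == 8 && lseek(fd, 0, SEEK_SET) == 0) {
        ssize_t n1 = read(fd, buf, 4);             /* f_pos: 0 -> 4 */
        off_t after_read = lseek(fd, 0, SEEK_CUR);
        ssize_t n2 = pread(fd, buf, 4, 0);         /* explicit offset 0 */
        off_t after_pread = lseek(fd, 0, SEEK_CUR);
        if (n1 == 4 && n2 == 4 && after_read == 4 && after_pread == 4)
            ok = 0;
    }
    close(fd);
    return ok;
}
```

This is why pread() and pwrite() are the usual choice when several threads share one descriptor: they avoid contention over a single file position.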
Writes follow a similar pattern but with additional complexity for durability:
```c
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
    ssize_t ret;

    /* Permission checks */
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_WRITE))
        return -EINVAL;
    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;

    ret = rw_verify_area(WRITE, file, pos, count);
    if (ret)
        return ret;

    /* Remove suid/sgid bits on write (simplified: mainline does this
     * inside the file system's write_iter path, not in vfs_write()) */
    ret = file_remove_privs(file);
    if (ret)
        return ret;

    /* Call file's write method */
    if (file->f_op->write)
        ret = file->f_op->write(file, buf, count, pos);
    else if (file->f_op->write_iter)
        ret = new_sync_write(file, buf, count, pos);
    else
        ret = -EINVAL;

    if (ret > 0) {
        fsnotify_modify(file);     /* Notify file modification */
        add_wchar(current, ret);   /* Accounting */
    }
    return ret;
}

/* Buffered write path (simplified: retry and short-copy handling omitted) */
ssize_t generic_perform_write(struct file *file, struct iov_iter *i, loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;
    size_t remaining = iov_iter_count(i);
    ssize_t written = 0;
    long status = 0;

    do {
        struct page *page;
        void *fsdata;
        size_t copied;
        unsigned long offset = pos & (PAGE_SIZE - 1);
        unsigned long bytes = min_t(unsigned long, PAGE_SIZE - offset, remaining);

        /* Prepare page for writing (may allocate, read for partial) */
        status = a_ops->write_begin(file, mapping, pos, bytes, &page, &fsdata);
        if (status < 0)
            break;

        /* Copy data from user */
        copied = copy_page_from_iter_atomic(page, offset, bytes, i);
        flush_dcache_page(page);

        /* Finalize write, mark page dirty */
        status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata);
        if (status < 0)
            break;

        pos += copied;
        written += copied;
        remaining -= copied;
    } while (remaining);

    return written ? written : status;
}
```

When a file is opened with O_APPEND, each write() atomically seeks to end-of-file before writing. On local file systems this happens under the inode's i_rwsem lock, guaranteeing that concurrent appends don't interleave (though they may reorder). This is how log files handle concurrent writers.
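The effect of O_APPEND is easy to see with two independent descriptors on the same file. In the sketch below (the helper name `two_writer_size` and the scratch path are hypothetical), each descriptor starts with its own offset of 0, so without O_APPEND the second write overwrites the first.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Writes 3 bytes through each of two separately opened fds and returns
 * the resulting file size, or -1 on error. extra_flags is OR-ed into the
 * open() flags, e.g. O_APPEND or 0.
 */
long two_writer_size(int extra_flags)
{
    char path[] = "/tmp/appendXXXXXX";    /* hypothetical scratch file */
    struct stat st;
    long size = -1;

    int tmp = mkstemp(path);
    if (tmp < 0)
        return -1;
    close(tmp);

    int a = open(path, O_WRONLY | extra_flags);
    int b = open(path, O_WRONLY | extra_flags);
    if (a >= 0 && b >= 0 &&
        write(a, "AAA", 3) == 3 &&
        write(b, "BBB", 3) == 3 &&        /* b's own offset is still 0 */
        fstat(a, &st) == 0)
        size = (long)st.st_size;

    if (a >= 0) close(a);
    if (b >= 0) close(b);
    unlink(path);
    return size;
}
```

`two_writer_size(O_APPEND)` yields 6 because both writes land at end-of-file, while `two_writer_size(0)` yields 3 because the second write lands at offset 0.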
Closing a file descriptor involves cleanup at multiple levels. The process is more nuanced than simply "freeing resources."
```c
SYSCALL_DEFINE1(close, unsigned int, fd)
{
    int retval = close_fd(fd);

    /* Can't restart close() after interrupt */
    if (unlikely(retval == -ERESTARTSYS ||
                 retval == -ERESTARTNOINTR ||
                 retval == -ERESTARTNOHAND ||
                 retval == -ERESTART_RESTARTBLOCK))
        retval = -EINTR;

    return retval;
}

int close_fd(unsigned fd)
{
    struct files_struct *files = current->files;
    struct file *file;

    spin_lock(&files->file_lock);
    /* Remove fd from table */
    file = pick_file(files, fd);
    spin_unlock(&files->file_lock);

    if (!file)
        return -EBADF;

    return filp_close(file, files);
}

int filp_close(struct file *filp, fl_owner_t id)
{
    int retval = 0;

    if (!file_count(filp)) {
        printk(KERN_ERR "VFS: Close: file count is 0\n");
        return 0;
    }

    /* Call flush handler (for special files) */
    if (filp->f_op->flush)
        retval = filp->f_op->flush(filp, id);

    /* Remove any POSIX locks */
    if (likely(!(filp->f_mode & FMODE_PATH))) {
        dnotify_flush(filp, id);
        locks_remove_posix(filp, id);
    }

    /* Drop reference (may trigger release) */
    fput(filp);
    return retval;
}
```

The file's release() method is only called when the last reference to the struct file is dropped. This means:
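- dup() creates a new fd pointing to the same struct file; closing one of them doesn't trigger release
- fork() copies the fd table, so parent and child hold references to the same struct file; the child's close doesn't affect the parent

The actual release happens in __fput(), which is deferred: it normally runs as task work just before the closing task returns to user space, and falls back to a workqueue for kernel threads and interrupt context.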
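One way to observe "release only on the last reference" from user space is with a flock() lock, which belongs to the struct file rather than to any one descriptor. The sketch below is an illustration under that assumption; the function name `check_last_close_releases` and the scratch path are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Returns 0 if a flock() lock survives closing one of two fds that share
 * a struct file and disappears when the last one is closed, -1 otherwise.
 */
int check_last_close_releases(void)
{
    char path[] = "/tmp/lastrefXXXXXX";   /* hypothetical scratch file */
    int ok = -1;

    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;
    int fd2 = dup(fd1);                   /* second reference, same struct file */
    int other = open(path, O_RDWR);       /* independent struct file */

    if (fd2 >= 0 && other >= 0 && flock(fd1, LOCK_EX) == 0) {
        close(fd1);                       /* struct file still referenced by fd2 */
        fd1 = -1;
        int still_held = flock(other, LOCK_EX | LOCK_NB) == -1 &&
                         errno == EWOULDBLOCK;

        close(fd2);                       /* last reference: file is released */
        fd2 = -1;
        int now_free = flock(other, LOCK_EX | LOCK_NB) == 0;

        if (still_held && now_free)
            ok = 0;
    }

    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    if (other >= 0) close(other);
    unlink(path);
    return ok;
}
```

The first close() only drops a reference; the lock vanishes after the second close(), when the struct file itself is torn down.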
Always check close()'s return value! For regular files, close() may report that earlier buffered writes could not be completed (e.g., disk full, network file system error). Ignoring the return value means you don't know whether your data was saved. For critical data, call fsync() and check its result before close().
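A minimal sketch of that advice follows. The helper name `save_buffer` is hypothetical; it handles short writes and EINTR, checks fsync(), and returns close()'s result instead of discarding it.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Writes len bytes from buf to path, then fsync()s and close()s the file.
 * Returns 0 only if every step succeeded, -1 with errno set otherwise.
 */
int save_buffer(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;             /* interrupted before writing: retry */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;                       /* short write: continue with the rest */
        len -= (size_t)n;
    }

    if (fsync(fd) < 0) {              /* surfaces deferred write-back errors */
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* Do not retry close() on failure: on Linux the fd is already released. */
    return close(fd);
}
```

A caller would treat any non-zero return as "data may not be on disk" and report or retry at a higher level.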
The lseek() system call changes the file offset for subsequent read/write operations. It's simpler than open/read/write but reveals important file system capabilities.
```c
SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
{
    return ksys_lseek(fd, offset, whence);
}

off_t ksys_lseek(unsigned int fd, off_t offset, unsigned int whence)
{
    off_t retval;
    struct fd f = fdget_pos(fd);

    if (!f.file)
        return -EBADF;

    retval = -EINVAL;
    if (whence <= SEEK_MAX) {
        loff_t res = vfs_llseek(f.file, offset, whence);
        retval = res;
        if (res != (loff_t)retval)
            retval = -EOVERFLOW;   /* Offset too large */
    }
    fdput_pos(f);
    return retval;
}

loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
{
    loff_t (*fn)(struct file *, loff_t, int);

    /* Unseekable files fall through to no_llseek(), which returns -ESPIPE */
    fn = no_llseek;
    if (file->f_mode & FMODE_LSEEK) {
        if (file->f_op->llseek)
            fn = file->f_op->llseek;
    }
    return fn(file, offset, whence);
}
```

| Mode | Description | New Position |
|---|---|---|
| SEEK_SET | Absolute position | offset |
| SEEK_CUR | Relative to current | current + offset |
| SEEK_END | Relative to file end | file_size + offset |
| SEEK_DATA | Next data at or after offset | First non-hole >= offset |
| SEEK_HOLE | Next hole at or after offset | First hole >= offset |
SEEK_DATA and SEEK_HOLE (added in Linux 3.1) enable efficient traversal of sparse files:
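```c
// Find all data regions in a sparse file
// (SEEK_DATA and SEEK_HOLE need _GNU_SOURCE defined before <unistd.h>)
off_t pos = 0;
while ((pos = lseek(fd, pos, SEEK_DATA)) != -1) {
    off_t data_start = pos;
    off_t data_end = lseek(fd, pos, SEEK_HOLE);
    if (data_end == -1)
        break;
    printf("Data region: %lld - %lld\n",
           (long long)data_start, (long long)data_end);
    pos = data_end;
}
```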
This is far more efficient than reading the entire file to find non-zero regions.
Linux provides multiple file locking mechanisms for coordinating access between processes.
| Mechanism | Scope | Mandatory? | Interface |
|---|---|---|---|
| flock() | Whole file | Advisory only | flock(fd, operation) |
| POSIX locks | Byte ranges | Advisory (mandatory mode removed in Linux 5.15) | fcntl(fd, F_SETLK, &flock) |
| OFD locks | Byte ranges, file-based | Advisory only | fcntl(fd, F_OFD_SETLK, &flock) |
| lockf() | Byte ranges | Advisory only | lockf(fd, cmd, len) |
```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>

int fd = open("/tmp/lockfile", O_RDWR);

/* Acquire shared (read) lock */
flock(fd, LOCK_SH);   /* Blocks if exclusive lock held */

/* Acquire exclusive (write) lock */
flock(fd, LOCK_EX);   /* Blocks if any lock held */

/* Try without blocking */
if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
    if (errno == EWOULDBLOCK)
        printf("Could not acquire lock\n");
}

/* Release lock */
flock(fd, LOCK_UN);

/* The lock is associated with the struct file: it is shared across fork()
 * and dup(), and released by LOCK_UN or when the last fd referring to
 * that struct file is closed */
```
```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

struct flock fl = {
    .l_type = F_WRLCK,     /* F_RDLCK, F_WRLCK, F_UNLCK */
    .l_whence = SEEK_SET,
    .l_start = 0,          /* Starting offset */
    .l_len = 100,          /* Number of bytes (0 = through end of file) */
    .l_pid = 0,            /* PID of blocking process (for F_GETLK) */
};

/* Set lock (blocking) */
fcntl(fd, F_SETLKW, &fl);

/* Set lock (non-blocking) */
if (fcntl(fd, F_SETLK, &fl) == -1) {
    if (errno == EAGAIN || errno == EACCES)
        printf("Lock held by another process\n");
}

/* Query lock status */
fl.l_type = F_WRLCK;
fcntl(fd, F_GETLK, &fl);
if (fl.l_type != F_UNLCK)
    printf("Lock held by PID %d\n", fl.l_pid);

/* Unlock */
fl.l_type = F_UNLCK;
fcntl(fd, F_SETLK, &fl);

/* WARNING: POSIX locks are associated with (PID, inode) pair!
 * Closing ANY fd to the file releases ALL locks.
 * This is often surprising and problematic. */
```

POSIX locks have surprising semantics: they're tied to the process, not the file descriptor. If you have fd1 and fd2 both referring to the same file, locking via fd1 and then closing fd2 releases the lock! This breaks many use cases, for example when a library opens and closes a file that the application has locked. OFD locks (F_OFD_SETLK, added in Linux 3.15) fix this: they're associated with the struct file, not the process.
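The lock-loss behaviour can be reproduced with a short experiment. The sketch below uses hypothetical helper names (`probe_from_child`, `posix_lock_held_after`); a forked child probes the file with F_GETLK because a process never conflicts with its own POSIX locks.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that reports whether another process holds a lock on path:
 * returns 1 if locked, 0 if free, -1 on error. */
static int probe_from_child(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);
        if (fd < 0 || fcntl(fd, F_GETLK, &probe) < 0)
            _exit(2);
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/*
 * Takes a POSIX write lock via fd1, optionally closes an unrelated fd2 on
 * the same file, and returns 1 if the lock is still held, 0 if it is gone.
 */
int posix_lock_held_after(int close_second_fd)
{
    char path[] = "/tmp/posixlockXXXXXX";   /* hypothetical scratch file */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET }; /* whole file */
    int held = -1;

    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY);

    if (fd2 >= 0 && fcntl(fd1, F_SETLK, &fl) == 0) {
        if (close_second_fd) {
            close(fd2);                     /* drops the lock taken via fd1 */
            fd2 = -1;
        }
        held = probe_from_child(path);
    }

    if (fd2 >= 0) close(fd2);
    close(fd1);
    unlink(path);
    return held;
}
```

`posix_lock_held_after(0)` reports the lock as held, while `posix_lock_held_after(1)` reports it gone even though fd1, the descriptor used to take the lock, is still open.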
Beyond basic read/write, Linux provides specialized operations for high-performance and zero-copy I/O.
sendfile() transfers data directly between file descriptors in kernel space, avoiding user-space copies:
```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>

/* Common use: serving static files over network */
int file_fd = open("large_file.bin", O_RDONLY);
int socket_fd = accept(listen_fd, NULL, NULL);

/* Send entire file */
struct stat st;
fstat(file_fd, &st);

off_t offset = 0;
/* May send fewer bytes than requested; production code loops until done */
ssize_t sent = sendfile(socket_fd, file_fd, &offset, st.st_size);

/* Data flow: disk → page cache → socket buffer
 * Never touches user-space memory */
```

splice() moves data between a pipe and a file descriptor without copying to user space:
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int pipefd[2];
pipe(pipefd);

/* Copy from file to socket via pipe (zero-copy) */
loff_t file_offset = 0;
size_t chunk_size = 65536;   /* example value: default pipe capacity */
ssize_t n;

/* File → pipe */
n = splice(file_fd, &file_offset, pipefd[1], NULL,
           chunk_size, SPLICE_F_MOVE | SPLICE_F_MORE);

/* Pipe → socket */
n = splice(pipefd[0], NULL, socket_fd, NULL,
           n, SPLICE_F_MOVE);

/* Also useful: vmsplice() maps user pages into pipe
 * tee() copies pipe contents to another pipe without consuming */
```

copy_file_range() enables server-side copy for network file systems and reflink for copy-on-write file systems:
```c
#define _GNU_SOURCE
#include <unistd.h>

/* Efficient file copy within same filesystem
 * (src_fd, dst_fd, and size are set up by the caller) */
loff_t off_in = 0, off_out = 0;
ssize_t copied = copy_file_range(src_fd, &off_in,
                                 dst_fd, &off_out,
                                 size, 0);

/* Benefits:
 * - Btrfs/XFS: Creates reflink (instant, shares blocks)
 * - NFS: Server-side copy (no data over network)
 * - Otherwise: Kernel-space copy (still faster than read+write)
 */
```

| Operation | Source | Destination | Use Case |
|---|---|---|---|
| sendfile() | Regular file | Socket (any file since Linux 2.6.33) | Static file serving |
| splice() | Any fd (one side must be a pipe) | Any fd (one side must be a pipe) | Generic zero-copy with pipe |
| copy_file_range() | File | File | File duplication, backup |
| io_uring + registered buffers | File/socket | File/socket | High-throughput async I/O |
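Like read() and write(), sendfile() may transfer fewer bytes than requested, so callers usually wrap it in a loop. The following sketch shows one way to do that; the helper name `send_whole_file` is hypothetical, and it assumes a blocking out_fd.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sends the entire contents of in_fd to out_fd using sendfile().
 * Returns the number of bytes sent, or -1 on error.
 */
long long send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) < 0)
        return -1;

    off_t offset = 0;                 /* advanced by sendfile() itself */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR)
                continue;             /* interrupted: retry */
            return -1;
        }
        if (n == 0)
            break;                    /* file shrank while sending */
    }
    return (long long)offset;
}
```

Passing an explicit offset pointer keeps in_fd's own file position untouched, which lets several transfers share one open file.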
For critical data, applications must ensure writes reach durable storage. Linux provides several mechanisms with different guarantees and performance characteristics.
| Operation | Syncs Data? | Syncs Metadata? | Performance |
|---|---|---|---|
| fsync(fd) | Yes | Yes (all metadata) | Slowest |
| fdatasync(fd) | Yes | Yes (essential only) | Faster |
| sync() | All files | All metadata | Very slow |
| syncfs(fd) | All files on FS | All metadata on FS | Slow |
| O_SYNC | Each write | Each write | Very slow writes |
| O_DSYNC | Each write | Essential only | Slow writes |
```c
/* Pattern 1: fsync after batch of writes */
for (int i = 0; i < 1000; i++) {
    write(fd, data[i], size);
}
fsync(fd);   /* One sync for all writes */

/* Pattern 2: sync_file_range for control
 * (controls write-back only: it does not flush metadata or the device
 * cache, so it is not a substitute for fsync() when durability matters) */
write(fd, data, large_size);

/* Start async writeback */
sync_file_range(fd, offset, length, SYNC_FILE_RANGE_WRITE);

/* ... do other work ... */

/* Wait for completion */
sync_file_range(fd, offset, length,
                SYNC_FILE_RANGE_WAIT_BEFORE |
                SYNC_FILE_RANGE_WRITE |
                SYNC_FILE_RANGE_WAIT_AFTER);

/* Pattern 3: Write-ahead logging */
write(log_fd, log_record, log_size);
fdatasync(log_fd);   /* Log must be durable first */
write(data_fd, data, data_size);
/* Data sync can be delayed */
```

fsync() is only as durable as the storage stack beneath it. On Linux it writes the file's dirty pages and then asks the device to flush its write cache, but a device that acknowledges the flush without honoring it, or a file system mounted with flushes (barriers) disabled, can still lose data on power loss. Many SSDs have volatile write caches; for true durability, devices need power-loss protection such as battery-backed caches, or write caching must be disabled (at a performance penalty). Enterprise storage typically handles this correctly.
We've explored the complete implementation of file operations in Linux, from file descriptor tables and the open() path through read, write, close, seeking, locking, zero-copy transfers, and durability.
This concludes our deep dive into Linux File Systems. Together, the components covered in this module form the foundation of how Linux manages persistent storage: knowledge essential for kernel development, system administration, and building high-performance storage applications.
You now possess Principal Engineer-level understanding of Linux file system internals. This knowledge enables you to debug complex I/O issues, optimize storage-intensive applications, make informed file system configuration decisions, and understand the tradeoffs inherent in any storage architecture.