Linux supports an extraordinary diversity of file systems: ext4, XFS, Btrfs, NFS, FUSE, tmpfs, procfs, sysfs, NTFS, FAT, and dozens more. Yet from a user-space perspective, every file system looks identical—the same open(), read(), write(), and close() system calls work uniformly across all of them.
How is this possible? How can fundamentally different storage technologies—from local SSDs to network shares to in-memory pseudo-filesystems—present the same interface to applications?
The answer lies in one of the most elegant abstractions in the Linux kernel: the Virtual File System (VFS).
By the end of this page, you will understand the complete architecture of the Linux VFS layer: its design philosophy, the four core objects (superblock, inode, dentry, file), the operation tables that enable polymorphism, and how VFS routes file system operations to concrete implementations. You'll gain the deep understanding that distinguishes kernel developers and systems architects from application programmers.
Before understanding how the VFS works, we must understand why it exists. The problem VFS solves is fundamental to operating system design: how do you support multiple, incompatible file system implementations while providing a uniform API to user space?
In early Unix systems, file system support was hardcoded. If you wanted to add a new file system type, you had to modify core kernel code in dozens of places. Each file system had its own data structures, its own pathname resolution logic, its own caching strategies. The result was duplicated logic, inconsistent semantics between file systems, and a kernel that was difficult to extend or maintain.
Sun Microsystems introduced the VFS concept in 1985 with SunOS 2.0 to support NFS alongside UFS. This architectural innovation proved so successful that it became the standard approach across all Unix-like systems, including Linux, which reimplemented the concept with its own design decisions.
VFS provides a common abstraction layer that sits between user-space system calls and concrete file system implementations. It defines a common file model—the superblock, inode, dentry, and file objects—together with the operation tables each file system must implement and the generic code (pathname resolution, caching, mount handling) shared by all of them.
The VFS embodies object-oriented programming principles implemented in C. While C lacks classes and inheritance, VFS achieves polymorphism through a technique called function pointer tables (or operation vectors). Each VFS object contains a pointer to a table of functions that implement operations for that object. Different file systems provide different function implementations while conforming to the same interface.
This pattern—sometimes called "poor man's OOP"—is ubiquitous in the Linux kernel and represents a masterclass in API design.
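As a sketch of the idea (pure user-space C, with made-up names like fs_ops and vfs_read_block), the "caller" dispatches through whatever table it is handed and never knows which implementation sits behind it:

```c
/*
 * Minimal user-space sketch of the "operation table" pattern VFS uses.
 * The names (fs_ops, extfs_*, memfs_*) are illustrative, not kernel APIs.
 */
#include <stdio.h>

struct fs_ops {
    int (*read_block)(int block_nr);    /* each "file system" supplies its own */
};

static int extfs_read_block(int block_nr)
{
    printf("extfs: reading block %d from disk\n", block_nr);
    return 0;
}

static int memfs_read_block(int block_nr)
{
    printf("memfs: copying block %d from RAM\n", block_nr);
    return 0;
}

static const struct fs_ops extfs_ops = { .read_block = extfs_read_block };
static const struct fs_ops memfs_ops = { .read_block = memfs_read_block };

/* The "VFS" only sees the table, never the concrete implementation. */
static int vfs_read_block(const struct fs_ops *ops, int block_nr)
{
    return ops->read_block(block_nr);
}

int main(void)
{
    vfs_read_block(&extfs_ops, 7);
    vfs_read_block(&memfs_ops, 7);
    return 0;
}
```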
VFS defines four fundamental objects that represent file system entities. Understanding these objects—their roles, lifetimes, and relationships—is essential for comprehending Linux file system internals.
| Object | Represents | Kernel Structure | Lifetime |
|---|---|---|---|
| Superblock | A mounted file system instance | struct super_block | Mount to unmount |
| Inode | A specific file or directory | struct inode | Cached; evicted under memory pressure |
| Dentry | A directory entry (path component) | struct dentry | Cached; forms the directory cache |
| File | An open file instance | struct file | open() to close() |
The superblock represents a mounted file system instance. When you mount a device (e.g., mount /dev/sda1 /mnt), the kernel creates a superblock object to hold information about that specific mount.
The superblock contains mount-wide state: the device identifier and block size, the maximum file size, the file system type, mount flags and magic number, the root dentry, reference counts, and a pointer to the superblock operations table:
```c
struct super_block {
    struct list_head        s_list;            /* Link in global list */
    dev_t                   s_dev;             /* Device identifier */
    unsigned char           s_blocksize_bits;
    unsigned long           s_blocksize;
    loff_t                  s_maxbytes;        /* Max file size */
    struct file_system_type *s_type;           /* File system type */
    const struct super_operations *s_op;       /* Superblock operations */
    unsigned long           s_flags;           /* Mount flags */
    unsigned long           s_magic;           /* File system magic number */
    struct dentry           *s_root;           /* Root dentry */
    struct rw_semaphore     s_umount;          /* Unmount semaphore */
    int                     s_count;           /* Reference count */
    atomic_t                s_active;          /* Active reference count */
    void                    *s_fs_info;        /* File system private data */

    /* Inode and dentry caches */
    struct list_lru         s_dentry_lru;
    struct list_lru         s_inode_lru;

    /* ... many more fields ... */
};
```

The s_op field points to the superblock operations table, which defines how the file system handles mount-wide operations:
```c
struct super_operations {
    /* Inode lifecycle management */
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*free_inode)(struct inode *);

    /* Inode operations */
    void (*dirty_inode)(struct inode *, int flags);
    int (*write_inode)(struct inode *, struct writeback_control *wbc);
    int (*drop_inode)(struct inode *);
    void (*evict_inode)(struct inode *);

    /* Superblock operations */
    void (*put_super)(struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_super)(struct super_block *, enum freeze_holder who);
    int (*freeze_fs)(struct super_block *);
    int (*thaw_super)(struct super_block *, enum freeze_holder who);
    int (*unfreeze_fs)(struct super_block *);
    int (*statfs)(struct dentry *, struct kstatfs *);
    int (*remount_fs)(struct super_block *, int *, char *);

    /* Introspection */
    int (*show_options)(struct seq_file *, struct dentry *);

    /* ... additional operations ... */
};
```

When the VFS needs to perform an operation on a superblock, it calls through the s_op pointer. For ext4, s_op points to ext4_sops. For XFS, it points to xfs_super_operations. This is how VFS achieves file system independence—it never calls ext4 functions directly; it calls through the operation table provided by whichever file system created the superblock.
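Part of this superblock state is visible from user space: statfs(2) reports the magic number (s_magic) and block size of whatever file system backs a path. A small sketch, assuming / and /proc are mounted as usual:

```c
/*
 * User-space view of the superblock: statfs(2) exposes the mounted file
 * system's magic number and block size. The paths are just examples.
 */
#include <stdio.h>
#include <sys/vfs.h>

int main(void)
{
    struct statfs st;

    if (statfs("/", &st) == 0)
        printf("/     : magic=0x%lx block size=%ld\n",
               (unsigned long)st.f_type, (long)st.f_bsize);
    if (statfs("/proc", &st) == 0)
        printf("/proc : magic=0x%lx block size=%ld\n",
               (unsigned long)st.f_type, (long)st.f_bsize);
    return 0;
}
```

Running it on a typical system prints the ext4 magic (0xef53) for / and the procfs magic for /proc—two very different file systems answering the same call.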
The inode (index node) represents a specific file or directory in the file system. It contains all metadata about the file except its name (which lives in the dentry). In VFS, the inode is a bridge between the generic VFS layer and file-system-specific data.
Key inode attributes include:
```c
struct inode {
    umode_t                 i_mode;        /* File type and permissions */
    unsigned short          i_opflags;
    kuid_t                  i_uid;         /* Owner user ID */
    kgid_t                  i_gid;         /* Owner group ID */
    unsigned int            i_flags;

    const struct inode_operations *i_op;   /* Inode operations */
    const struct file_operations  *i_fop;  /* Default file operations */
    struct super_block      *i_sb;         /* Superblock pointer */
    struct address_space    *i_mapping;    /* Page cache mapping */

    unsigned long           i_ino;         /* Inode number */
    atomic_t                i_count;       /* Reference count */
    unsigned int            i_nlink;       /* Hard link count */
    dev_t                   i_rdev;        /* Device ID (if device file) */
    loff_t                  i_size;        /* File size in bytes */

    struct timespec64       __i_atime;     /* Access time */
    struct timespec64       __i_mtime;     /* Modification time */
    struct timespec64       __i_ctime;     /* Change time (metadata) */

    spinlock_t              i_lock;
    unsigned long           i_state;       /* State flags */
    struct rw_semaphore     i_rwsem;       /* Serializes writes and dir ops */
    atomic64_t              i_version;     /* Inode version */

    struct hlist_node       i_hash;        /* Hash list entry */
    struct list_head        i_io_list;
    struct list_head        i_lru;         /* LRU list for eviction */

    union {
        struct pipe_inode_info *i_pipe;    /* If pipe */
        struct cdev            *i_cdev;    /* If character device */
        char                   *i_link;    /* If symlink (short) */
    };

    void                    *i_private;    /* File system private data */
};
```

The inode number (i_ino) uniquely identifies a file within a file system. Combined with the device ID, it provides a system-wide unique identifier. This is why hard links work—multiple directory entries can point to the same inode. It's also why renaming a file within a file system is cheap—you're rewriting directory entries, not moving the file's data.
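A quick user-space illustration: the sketch below (file names chosen arbitrarily) creates a hard link and shows both names reporting the same device/inode pair and a link count of 2:

```c
/*
 * Hard links share an inode: both names report the same (st_dev, st_ino)
 * pair and an st_nlink of 2. The files are created by the program itself.
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(void)
{
    struct stat a, b;

    int fd = open("original.txt", O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return 1;
    close(fd);

    link("original.txt", "hardlink.txt");   /* second name, same inode */

    stat("original.txt", &a);
    stat("hardlink.txt", &b);
    printf("original: dev=%lu ino=%lu nlink=%lu\n",
           (unsigned long)a.st_dev, (unsigned long)a.st_ino,
           (unsigned long)a.st_nlink);
    printf("hardlink: dev=%lu ino=%lu nlink=%lu\n",
           (unsigned long)b.st_dev, (unsigned long)b.st_ino,
           (unsigned long)b.st_nlink);

    unlink("hardlink.txt");
    unlink("original.txt");
    return 0;
}
```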
The dentry (directory entry) represents a component in a pathname. For the path /home/user/document.txt, there are four dentries: /, home, user, and document.txt. Dentries link names to inodes and form the pathname resolution cache.
Dentries exist in three states: used (referenced by the VFS and pointing to a valid inode), unused (pointing to a valid inode but not currently referenced, kept around in the cache), and negative (not associated with any inode, recording that a name does not exist).
```c
struct dentry {
    unsigned int            d_flags;       /* Dentry flags */
    seqcount_spinlock_t     d_seq;
    struct hlist_bl_node    d_hash;        /* Hash table entry */
    struct dentry           *d_parent;     /* Parent dentry */
    struct qstr             d_name;        /* Dentry name */
    struct inode            *d_inode;      /* Associated inode (NULL if negative) */
    unsigned char           d_iname[DNAME_INLINE_LEN]; /* Inline storage for short names */

    struct lockref          d_lockref;     /* Lock and reference count */
    const struct dentry_operations *d_op;  /* Dentry operations */
    struct super_block      *d_sb;         /* Superblock of this dentry */
    unsigned long           d_time;        /* Revalidation time */
    void                    *d_fsdata;     /* File system specific data */

    union {
        struct list_head        d_lru;     /* LRU list */
        wait_queue_head_t       *d_wait;
    };
    struct list_head        d_child;       /* Child list entry */
    struct list_head        d_subdirs;     /* Subdirectory list */

    union {
        struct hlist_node       d_alias;   /* Inode alias list */
        struct hlist_bl_node    d_in_lookup_hash;
        struct rcu_head         d_rcu;
    } d_u;
};
```

The file object represents an open instance of a file. When a process calls open(), the kernel creates a file object. Multiple processes can have separate file objects pointing to the same inode, each with its own file position, flags, and mode.
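A minimal user-space sketch of that distinction, using /etc/hostname as an arbitrary readable file: two open() calls yield independent positions, while dup() shares one file object:

```c
/*
 * Each open() creates a distinct struct file with its own position, while
 * dup() makes two descriptors share one struct file (and one position).
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    int a = open("/etc/hostname", O_RDONLY);
    int b = open("/etc/hostname", O_RDONLY);  /* separate file object */
    int c = dup(a);                           /* shares a's file object */
    char buf[4];

    read(a, buf, sizeof(buf));                /* advances a (and therefore c) */

    printf("a=%lld b=%lld c=%lld\n",
           (long long)lseek(a, 0, SEEK_CUR),
           (long long)lseek(b, 0, SEEK_CUR),
           (long long)lseek(c, 0, SEEK_CUR));

    close(a);
    close(b);
    close(c);
    return 0;
}
```

Descriptors a and c report the same offset because they reference one struct file; b stays at zero because it has its own.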
The file object contains the current file position, the open flags and access mode, a pointer to the file operations table, the path (dentry and mount) it was opened through, and per-open private data:
```c
struct file {
    struct file             *f_next;
    struct file             *f_prev;
    struct inode            *f_inode;      /* Associated inode */
    const struct file_operations *f_op;    /* File operations */

    spinlock_t              f_lock;
    atomic_long_t           f_count;       /* Reference count */
    unsigned int            f_flags;       /* Open flags */
    fmode_t                 f_mode;        /* File mode */
    struct mutex            f_pos_lock;
    loff_t                  f_pos;         /* Current file position */
    struct fown_struct      f_owner;       /* For async I/O */
    const struct cred       *f_cred;       /* File credentials */
    struct file_ra_state    f_ra;          /* Read-ahead state */

    errseq_t                f_wb_err;
    errseq_t                f_sb_err;

    struct path             f_path;        /* File path info */
#define f_dentry            f_path.dentry
#define f_vfsmnt            f_path.mnt

    struct address_space    *f_mapping;    /* Page cache mapping */
    void                    *private_data; /* File system private data */
};
```

Each VFS object has an associated operations table—a structure of function pointers that defines how to perform various operations on that object. This is the heart of VFS polymorphism.
Inode operations (inode_operations) handle manipulation of file system objects themselves—creating, deleting, renaming, and looking up files and directories.
```c
struct inode_operations {
    /* Pathname resolution */
    struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);

    /* File creation and deletion */
    int (*create)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, bool);
    int (*mkdir)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t);
    int (*rmdir)(struct inode *, struct dentry *);
    int (*unlink)(struct inode *, struct dentry *);
    int (*mknod)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, dev_t);

    /* Renaming and linking */
    int (*rename)(struct mnt_idmap *, struct inode *, struct dentry *,
                  struct inode *, struct dentry *, unsigned int);
    int (*link)(struct dentry *, struct inode *, struct dentry *);
    int (*symlink)(struct mnt_idmap *, struct inode *, struct dentry *, const char *);

    /* Symlink handling */
    const char *(*get_link)(struct dentry *, struct inode *, struct delayed_call *);

    /* Permissions and attributes */
    int (*permission)(struct mnt_idmap *, struct inode *, int);
    int (*setattr)(struct mnt_idmap *, struct dentry *, struct iattr *);
    int (*getattr)(struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);

    /* Extended attributes */
    ssize_t (*listxattr)(struct dentry *, char *, size_t);

    /* Truncation */
    void (*truncate)(struct inode *);

    /* ... */
};
```

File operations (file_operations) are the most commonly used—they handle reading, writing, seeking, and everything else you can do with an open file.
```c
struct file_operations {
    struct module *owner;

    /* Position management */
    loff_t (*llseek)(struct file *, loff_t, int);

    /* Reading */
    ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);

    /* Writing */
    ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);

    /* Directory reading */
    int (*iterate)(struct file *, struct dir_context *);
    int (*iterate_shared)(struct file *, struct dir_context *);

    /* Event notification */
    __poll_t (*poll)(struct file *, struct poll_table_struct *);

    /* ioctl */
    long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
    long (*compat_ioctl)(struct file *, unsigned int, unsigned long);

    /* Memory mapping */
    int (*mmap)(struct file *, struct vm_area_struct *);

    /* Open and release */
    int (*open)(struct inode *, struct file *);
    int (*flush)(struct file *, fl_owner_t id);
    int (*release)(struct inode *, struct file *);

    /* Sync */
    int (*fsync)(struct file *, loff_t, loff_t, int datasync);
    int (*fasync)(int, struct file *, int);

    /* Locking */
    int (*lock)(struct file *, int, struct file_lock *);
    int (*flock)(struct file *, int, struct file_lock *);

    /* Splice (zero-copy) operations */
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);

    /* ... */
};
```

Modern file systems implement read_iter and write_iter (using struct kiocb and struct iov_iter) rather than the older read/write methods. The iterator interfaces support vectored and asynchronous I/O and are more efficient for scatter-gather operations. The legacy methods remain for drivers that have not been converted; when a file system provides only the iterator variants, plain read() and write() system calls are wrapped into kiocb/iov_iter pairs internally.
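As a hedged sketch (not taken from any real driver; the myfs_ names are illustrative), a read_iter implementation for a pseudo-file that serves a fixed in-kernel buffer might look like this:

```c
/*
 * Illustrative read_iter for a pseudo-file: copies a static buffer into
 * the caller's iov_iter and advances the position kept in the kiocb.
 */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/uio.h>

static const char myfs_msg[] = "hello from myfs\n";

static ssize_t myfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
    loff_t pos = iocb->ki_pos;
    size_t copied;

    if (pos >= sizeof(myfs_msg) - 1)
        return 0;                               /* EOF */

    copied = copy_to_iter(myfs_msg + pos, sizeof(myfs_msg) - 1 - pos, to);
    iocb->ki_pos += copied;                     /* advance the file position */
    return copied;
}

static const struct file_operations myfs_file_ops = {
    .owner     = THIS_MODULE,
    .read_iter = myfs_read_iter,
    .llseek    = default_llseek,
};
```

Because the same entry point serves read(), readv(), and aio/io_uring reads, one implementation covers every user-space read path.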
The address_space_operations table handles the interface between the page cache and the file system—reading pages from disk, writing dirty pages back, and managing memory-mapped files.
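This is the same page cache user space sees through mmap(): mapping a file and touching the mapping reads pages populated by these operations. A small user-space sketch, using /etc/os-release as an arbitrary readable file:

```c
/*
 * read() and mmap() are served from the same page cache: faulting on the
 * mapping pulls pages in through the file system's address_space_operations.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/etc/os-release", O_RDONLY);
    struct stat st;
    char *p;

    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    /* Pages enter the page cache on first access (page faults). */
    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    fwrite(p, 1, st.st_size, stdout);   /* file contents via the mapping */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```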
```c
struct address_space_operations {
    /* Writing dirty data */
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    int (*writepages)(struct address_space *, struct writeback_control *);

    /* Reading data */
    int (*readpage)(struct file *, struct page *);
    void (*readahead)(struct readahead_control *);

    /* Write preparation (allocate blocks, etc.) */
    int (*write_begin)(struct file *, struct address_space *mapping,
                       loff_t pos, unsigned len,
                       struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned copied,
                     struct page *page, void *fsdata);

    /* Block mapping */
    sector_t (*bmap)(struct address_space *, sector_t);

    /* Direct I/O */
    ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);

    /* Page migration (NUMA) */
    int (*migratepage)(struct address_space *, struct page *, struct page *,
                       enum migrate_mode);

    /* Memory pressure handling */
    int (*launder_page)(struct page *);
    bool (*release_folio)(struct folio *, gfp_t);
    void (*free_folio)(struct folio *);

    int (*swap_activate)(struct swap_info_struct *, struct file *, sector_t *);
    void (*swap_deactivate)(struct file *);
};
```

One of VFS's most critical responsibilities is pathname resolution—converting a string like /home/user/document.txt into the corresponding inode. This process, called a "path walk" or "namei" (name-to-inode), is highly optimized and involves intricate interactions between the dentry cache, inodes, and mount points.
For the path /home/user/document.txt:
Start at the root: Begin with the root dentry of the process's root file system (or current directory for relative paths)
Lookup 'home': Call lookup() on the root inode's operations table, requesting the child named "home"
Cross mount points: If "home" is a mount point, switch to the root dentry of the mounted file system
Lookup 'user': Repeat for the next component
Lookup 'document.txt': Repeat for the final component
Return the final inode
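The sketch below mirrors this walk from user space (with the same hard-coded example components), resolving one name at a time with openat() and O_PATH so each step is relative to the previous directory:

```c
/*
 * Component-by-component path walk from user space: each openat() step
 * resolves exactly one name relative to the previous directory, roughly
 * mirroring what the kernel's lookup does per component.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    const char *components[] = { "home", "user", "document.txt" };
    int dirfd = open("/", O_PATH | O_DIRECTORY);   /* start at the root */
    struct stat st;

    for (int i = 0; i < 3 && dirfd >= 0; i++) {
        int next = openat(dirfd, components[i], O_PATH);
        close(dirfd);
        dirfd = next;                              /* descend one level */
    }

    if (dirfd >= 0 && fstat(dirfd, &st) == 0)
        printf("resolved to inode %lu\n", (unsigned long)st.st_ino);
    else
        perror("lookup failed");

    if (dirfd >= 0)
        close(dirfd);
    return 0;
}
```

Unlike this sketch, the kernel also checks permissions, follows symlinks, and crosses mount points at every step—but the per-component structure is the same.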
Pathname resolution is so frequent that Linux employs a highly optimized "RCU-walk" mode that performs the entire traversal without acquiring any locks or incrementing reference counts.
RCU (Read-Copy-Update) allows readers to traverse data structures concurrently with writers, using careful memory ordering and deferred reclamation. In RCU-walk mode, the entire walk proceeds under rcu_read_lock(): dentries are found in the cache without taking any locks or bumping any reference counts, and sequence counters are checked afterward to detect concurrent modification.
If RCU-walk encounters a condition it cannot handle (e.g., a dentry being simultaneously deleted), it "falls back" to the slower REF-walk mode, which uses proper locking and reference counting.
RCU-walk makes a dramatic difference for workloads with high path resolution rates. A web server serving static files, for example, might resolve the same paths millions of times. With RCU-walk, these resolutions require no atomic operations, no cache line bouncing, and no lock contention—scaling linearly with CPU count.
The dentry cache is one of the most important caches in the Linux kernel. It caches the results of pathname lookups, making repeated access to the same files extremely fast.
Dcache characteristics:
- Lookups go through a global hash table keyed by the parent dentry and the component name
- Unused dentries are kept on per-superblock LRU lists and reclaimed under memory pressure
- Hot-path lookups are RCU-protected, so cache hits take no locks
- Both positive and negative lookup results are cached
Why cache negative dentries?
Caching non-existent entries might seem wasteful, but it's invaluable for performance. Consider a shell checking for a command:
$ foo
bash: foo: command not found
Bash searches PATH: /usr/local/bin/foo, /usr/bin/foo, /bin/foo, etc. Each lookup returns ENOENT. Without negative caching, every failed command would require re-reading multiple directories from disk.
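One way to observe this is sketched below (the paths under /tmp are made up and never exist; the nr_negative counter in /proc/sys/fs/dentry-state is only reported on newer kernels):

```c
/*
 * Rough demonstration of negative dentries: repeatedly failing lookups
 * populate the dentry cache, which /proc/sys/fs/dentry-state reflects.
 */
#include <stdio.h>
#include <sys/stat.h>

static void show_dentry_state(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

    if (f && fgets(line, sizeof(line), f))
        printf("%s: %s", label, line);
    if (f)
        fclose(f);
}

int main(void)
{
    char path[64];
    struct stat st;

    show_dentry_state("before");
    for (int i = 0; i < 10000; i++) {
        snprintf(path, sizeof(path), "/tmp/no-such-file-%d", i);
        stat(path, &st);            /* each ENOENT leaves a negative dentry */
    }
    show_dentry_state("after");
    return 0;
}
```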
VFS manages the mount tree—the hierarchical structure that connects multiple file systems into a single directory namespace. Understanding how mounts work is essential for comprehending containers, chroot jails, and the overall Linux file system landscape.
When you mount a file system, the kernel creates a struct mount object that links the mounted file system into the directory tree:
```c
struct mount {
    struct hlist_node       mnt_hash;        /* Hash table entry */
    struct mount            *mnt_parent;     /* Parent mount */
    struct dentry           *mnt_mountpoint; /* Dentry of mount point */
    struct vfsmount         mnt;             /* Visible in VFS */
    union {
        struct rcu_head     mnt_rcu;
        struct llist_node   mnt_llist;
    };
    struct list_head        mnt_mounts;      /* Child mounts */
    struct list_head        mnt_child;       /* Sibling link */
    struct list_head        mnt_instance;    /* Superblock's mount list */
    const char              *mnt_devname;    /* Device name */
    struct list_head        mnt_list;        /* Mount namespace list */
    struct list_head        mnt_fsnotify_marks;
    struct mnt_namespace    *mnt_ns;         /* Mount namespace */
    int                     mnt_id;          /* Unique mount ID */
    int                     mnt_group_id;    /* Peer group ID */
    int                     mnt_expiry_mark; /* For expirable mounts */
    int                     mnt_writers;     /* Active writer count */
};

struct vfsmount {
    struct dentry           *mnt_root;       /* Root dentry of mounted fs */
    struct super_block      *mnt_sb;         /* Superblock pointer */
    int                     mnt_flags;       /* Mount flags */
    struct mnt_idmap        *mnt_idmap;      /* ID mapping */
};
```

Linux supports mount namespaces—isolated views of the mount tree. Each mount namespace has its own root and its own set of mounted file systems. Processes in different mount namespaces can see completely different directory hierarchies.
Mount namespaces are fundamental to container technology. Each mount also carries a propagation type that determines whether mount and unmount events spread to related mounts:
| Type | Behavior | Use Case |
|---|---|---|
| Private | Mounts visible only within namespace | Default isolation |
| Shared | Mounts propagate to peer groups | Live system updates |
| Slave | Receives mounts from master, doesn't propagate | Chroot environments |
| Unbindable | Cannot be bind-mounted | Preventing mount escape |
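A minimal sketch of how a container runtime might use this (requires root; the tmpfs mount on /mnt is an arbitrary example): unshare(CLONE_NEWNS) gives the process a private copy of the mount tree, and marking it MS_PRIVATE stops propagation back to the host:

```c
/*
 * Enter a private mount namespace and mount a tmpfs that only this
 * namespace can see. Requires CAP_SYS_ADMIN.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <sys/mount.h>

int main(void)
{
    if (unshare(CLONE_NEWNS) < 0) {            /* new mount namespace */
        perror("unshare");
        return 1;
    }

    /* Make all mounts private so changes do not propagate to the parent. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
        perror("mount MS_PRIVATE");
        return 1;
    }

    /* This tmpfs exists only in this namespace's view of the tree. */
    if (mount("tmpfs", "/mnt", "tmpfs", 0, NULL) < 0) {
        perror("mount tmpfs");
        return 1;
    }

    printf("tmpfs mounted on /mnt, invisible to other namespaces\n");
    return 0;
}
```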
Bind mounts make a directory tree visible at another location in the namespace. Unlike symbolic links, bind mounts operate at the VFS level and are transparent to applications.
mount --bind /existing/path /new/location
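The same operation through the mount(2) system call uses the MS_BIND flag (the paths below are placeholders, and the call needs root):

```c
/* Programmatic equivalent of `mount --bind`; paths are placeholders. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    if (mount("/existing/path", "/new/location", NULL, MS_BIND, NULL) < 0) {
        perror("bind mount");
        return 1;
    }
    return 0;
}
```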
Overlay mounts (overlayfs) combine multiple directory trees into a single unified view. This is the basis for container image layering:
mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
To understand the VFS architecture fully, let's examine how a file system registers itself with the kernel. This reveals the contract that file systems must fulfill.
Every file system driver defines a file_system_type structure that describes the file system and provides the functions needed to create superblocks:
```c
struct file_system_type {
    const char *name;                    /* File system name */
    int fs_flags;                        /* Flags (e.g., FS_REQUIRES_DEV) */

    /* Superblock initialization */
    int (*init_fs_context)(struct fs_context *);
    const struct fs_parameter_spec *parameters;

    /* Legacy mount interface */
    struct dentry *(*mount)(struct file_system_type *, int, const char *, void *);

    /* Cleanup */
    void (*kill_sb)(struct super_block *);

    struct module *owner;                /* Module reference */
    struct file_system_type *next;       /* Linked list */
    struct hlist_head fs_supers;         /* Superblock list */

    struct lock_class_key s_lock_key;
    struct lock_class_key s_umount_key;
    struct lock_class_key s_vfs_rename_key;
    struct lock_class_key i_lock_key;
    struct lock_class_key i_mutex_key;
};
```

Here's a skeleton for a simple pseudo-file system (like procfs or debugfs):
```c
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/fs_context.h>

#define MYFS_MAGIC 0x12345678

/* Superblock operations */
static const struct super_operations myfs_sops = {
    .statfs     = simple_statfs,
    .drop_inode = generic_drop_inode,
};

/* Fill superblock during mount */
static int myfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
    struct inode *root;

    sb->s_blocksize      = PAGE_SIZE;
    sb->s_blocksize_bits = PAGE_SHIFT;
    sb->s_magic          = MYFS_MAGIC;
    sb->s_op             = &myfs_sops;
    sb->s_time_gran      = 1;

    /* Create root inode */
    root = new_inode(sb);
    if (!root)
        return -ENOMEM;

    root->i_ino   = 1;
    root->i_mode  = S_IFDIR | 0755;
    root->i_atime = root->i_mtime = root->i_ctime = current_time(root);
    root->i_op    = &simple_dir_inode_operations;
    root->i_fop   = &simple_dir_operations;

    /* Create root dentry */
    sb->s_root = d_make_root(root);
    if (!sb->s_root)
        return -ENOMEM;

    return 0;
}

/* Filesystem context operations */
static int myfs_get_tree(struct fs_context *fc)
{
    return get_tree_nodev(fc, myfs_fill_super);
}

static const struct fs_context_operations myfs_context_ops = {
    .get_tree = myfs_get_tree,
};

static int myfs_init_fs_context(struct fs_context *fc)
{
    fc->ops = &myfs_context_ops;
    return 0;
}

/* File system type definition */
static struct file_system_type myfs_type = {
    .owner           = THIS_MODULE,
    .name            = "myfs",
    .init_fs_context = myfs_init_fs_context,
    .kill_sb         = kill_litter_super,
};

/* Module init/exit */
static int __init myfs_init(void)
{
    return register_filesystem(&myfs_type);
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
```

VFS provides helper functions like simple_dir_inode_operations, simple_dir_operations, and simple_statfs for pseudo-filesystems that don't need custom behavior. Real file systems (ext4, XFS) provide their own implementations for all these operation tables.
Let's trace a complete I/O operation to see how all the VFS components work together. We'll follow a read() system call from user space to disk.
1. System Call Entry
   - The application calls read(fd, buffer, count)
   - sys_read() looks up the file descriptor in the process's fd table and obtains the struct file pointer

2. File Operations Dispatch
   - VFS invokes file->f_op->read_iter() (or the legacy read() method)
   - Most disk file systems delegate to generic_file_read_iter()

3. Page Cache Check
   - The generic code consults the inode's address_space to locate cached pages

4. Cache Hit Path
   - If the requested pages are present and up to date, no device I/O is needed—the data is served straight from memory

5. Cache Miss Path
   - VFS calls address_space_operations->readpage() (or readahead()) to bring the missing pages from the backing store into the page cache

6. Copy to User Space
   - copy_to_user() transfers data from kernel pages to the user buffer

VFS includes a sophisticated read-ahead mechanism that predicts sequential access patterns and pre-fetches upcoming pages before they're requested. This hides disk latency for sequential reads and is automatically tuned based on observed access patterns.
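Applications can cooperate with this machinery through posix_fadvise(2); the sketch below (using an arbitrary example path) hints at sequential access, which typically enlarges the read-ahead window, and asks for asynchronous prefetch:

```c
/*
 * Hinting the read-ahead machinery: POSIX_FADV_SEQUENTIAL widens the
 * read-ahead window, POSIX_FADV_WILLNEED starts populating the page
 * cache asynchronously before the reads arrive.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/var/log/syslog", O_RDONLY);
    char buf[4096];

    if (fd < 0)
        return 1;

    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  /* whole file, sequential */
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);    /* prefetch into page cache */

    while (read(fd, buf, sizeof(buf)) > 0)
        ;                                            /* reads now mostly hit cache */

    close(fd);
    return 0;
}
```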
The VFS represents decades of accumulated wisdom in OS design. Here are the key architectural insights that make it successful:
- Polymorphism through operation tables, so generic code never needs to know which file system it is calling
- A strict split between the common file model (superblock, inode, dentry, file) and file-system-specific state hung off private pointers
- Aggressive caching of both metadata (dentries, inodes) and data (the page cache), with lock-free fast paths such as RCU-walk
- Shared helper libraries (generic_file_*, simple_*) that let new file systems reuse proven code
File system implementation is notoriously tricky. Here are common issues the VFS design helps address:
| Problem | VFS Solution |
|---|---|
| Cache coherency between processes | Single page cache per inode, shared across all openers |
| Race conditions during unmount | Reference counting on superblocks, mounting semaphore |
| Stale dentries after file deletion | Negative dentry caching with invalidation on change |
| Mount point traversal correctness | Explicit mount point checking during path walk |
| Concurrent inode modification | Per-inode rw_semaphore (i_rwsem) for data, spinlock (i_lock) for metadata |
We've explored the Linux Virtual File System in depth. Let's consolidate the key concepts:
- VFS provides a uniform file API by interposing a common object model between system calls and concrete file systems
- The four core objects—superblock, inode, dentry, and file—each carry an operations table that the owning file system fills in
- Pathname resolution walks dentries component by component, accelerated by the dentry cache and lock-free RCU-walk
- The mount tree, mount namespaces, bind mounts, and overlays stitch many file systems into one (or many) directory hierarchies
- A file system joins the VFS by registering a file_system_type and supplying its operation tables
What's next:
With the VFS layer understood, we'll descend into a specific file system implementation: ext4. The next page examines ext4 internals—how it implements the VFS interfaces, its on-disk layout, journaling mechanism, and the design decisions that make it the most widely deployed Linux file system.
You now understand the Virtual File System layer—the architectural backbone that enables Linux to support an extraordinary diversity of file systems through a single, unified API. This knowledge is essential for anyone working on kernel development, file systems, or storage systems at scale.