Linux supports an extraordinary diversity of file systems: ext4, XFS, Btrfs, NFS, FUSE, tmpfs, procfs, sysfs, NTFS, FAT, and dozens more. Yet from a user-space perspective, every file system looks identical—the same open(), read(), write(), and close() system calls work uniformly across all of them.
How is this possible? How can fundamentally different storage technologies—from local SSDs to network shares to in-memory pseudo-filesystems—present the same interface to applications?
The answer lies in one of the most elegant abstractions in the Linux kernel: the Virtual File System (VFS).
By the end of this page, you will understand the complete architecture of the Linux VFS layer: its design philosophy, the four core objects (superblock, inode, dentry, file), the operation tables that enable polymorphism, and how VFS routes file system operations to concrete implementations. You'll gain the deep understanding that distinguishes kernel developers and systems architects from application programmers.
Before understanding how the VFS works, we must understand why it exists. The problem VFS solves is fundamental to operating system design: how do you support multiple, incompatible file system implementations while providing a uniform API to user space?
In early Unix systems, file system support was hardcoded. If you wanted to add a new file system type, you had to modify core kernel code in dozens of places. Each file system had its own data structures, its own pathname resolution logic, its own caching strategies. The result was duplicated logic, inconsistent semantics between file systems, and a kernel that was difficult to extend or maintain.
Sun Microsystems introduced the VFS concept in 1985 with SunOS 2.0 to support NFS alongside UFS. This architectural innovation proved so successful that it became the standard approach across all Unix-like systems, including Linux, which reimplemented the concept with its own design decisions.
VFS provides a common abstraction layer that sits between user-space system calls and concrete file system implementations. It defines a common file model—the superblock, inode, dentry, and file objects—together with the operation tables each file system must implement and the generic code (pathname resolution, caching, mount handling) shared by all of them.
The VFS embodies object-oriented programming principles implemented in C. While C lacks classes and inheritance, VFS achieves polymorphism through a technique called function pointer tables (or operation vectors). Each VFS object contains a pointer to a table of functions that implement operations for that object. Different file systems provide different function implementations while conforming to the same interface.
This pattern—sometimes called "poor man's OOP"—is ubiquitous in the Linux kernel and represents a masterclass in API design.
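As a sketch of the idea (pure user-space C, with made-up names like fs_ops and vfs_read_block), the "caller" dispatches through whatever table it is handed and never knows which implementation sits behind it:

```c
/*
 * Minimal user-space sketch of the "operation table" pattern VFS uses.
 * The names (fs_ops, extfs_*, memfs_*) are illustrative, not kernel APIs.
 */
#include <stdio.h>

struct fs_ops {
    int (*read_block)(int block_nr);    /* each "file system" supplies its own */
};

static int extfs_read_block(int block_nr)
{
    printf("extfs: reading block %d from disk\n", block_nr);
    return 0;
}

static int memfs_read_block(int block_nr)
{
    printf("memfs: copying block %d from RAM\n", block_nr);
    return 0;
}

static const struct fs_ops extfs_ops = { .read_block = extfs_read_block };
static const struct fs_ops memfs_ops = { .read_block = memfs_read_block };

/* The "VFS" only sees the table, never the concrete implementation. */
static int vfs_read_block(const struct fs_ops *ops, int block_nr)
{
    return ops->read_block(block_nr);
}

int main(void)
{
    vfs_read_block(&extfs_ops, 7);
    vfs_read_block(&memfs_ops, 7);
    return 0;
}
```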
VFS defines four fundamental objects that represent file system entities. Understanding these objects—their roles, lifetimes, and relationships—is essential for comprehending Linux file system internals.
| Object | Represents | Kernel Structure | Lifetime |
|---|---|---|---|
| Superblock | A mounted file system instance | struct super_block | Mount to unmount |
| Inode | A specific file or directory | struct inode | Cached; evicted under memory pressure |
| Dentry | A directory entry (path component) | struct dentry | Cached; forms the directory cache |
| File | An open file instance | struct file | open() to close() |
The superblock represents a mounted file system instance. When you mount a device (e.g., mount /dev/sda1 /mnt), the kernel creates a superblock object to hold information about that specific mount.
The superblock contains mount-wide state: the device identifier and block size, the maximum file size, the file system type, mount flags and magic number, the root dentry, reference counts, and a pointer to the superblock operations table:
```c
struct super_block {
    struct list_head        s_list;            /* Link in global list */
    dev_t                   s_dev;             /* Device identifier */
    unsigned char           s_blocksize_bits;
    unsigned long           s_blocksize;
    loff_t                  s_maxbytes;        /* Max file size */
    struct file_system_type *s_type;           /* File system type */
    const struct super_operations *s_op;       /* Superblock operations */
    unsigned long           s_flags;           /* Mount flags */
    unsigned long           s_magic;           /* File system magic number */
    struct dentry           *s_root;           /* Root dentry */
    struct rw_semaphore     s_umount;          /* Unmount semaphore */
    int                     s_count;           /* Reference count */
    atomic_t                s_active;          /* Active reference count */
    void                    *s_fs_info;        /* File system private data */

    /* Inode and dentry caches */
    struct list_lru         s_dentry_lru;
    struct list_lru         s_inode_lru;

    /* ... many more fields ... */
};
```

The s_op field points to the superblock operations table, which defines how the file system handles mount-wide operations:
```c
struct super_operations {
    /* Inode lifecycle management */
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*free_inode)(struct inode *);

    /* Inode operations */
    void (*dirty_inode)(struct inode *, int flags);
    int (*write_inode)(struct inode *, struct writeback_control *wbc);
    int (*drop_inode)(struct inode *);
    void (*evict_inode)(struct inode *);

    /* Superblock operations */
    void (*put_super)(struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_super)(struct super_block *, enum freeze_holder who);
    int (*freeze_fs)(struct super_block *);
    int (*thaw_super)(struct super_block *, enum freeze_holder who);
    int (*unfreeze_fs)(struct super_block *);
    int (*statfs)(struct dentry *, struct kstatfs *);
    int (*remount_fs)(struct super_block *, int *, char *);

    /* Introspection */
    int (*show_options)(struct seq_file *, struct dentry *);

    /* ... additional operations ... */
};
```

When the VFS needs to perform an operation on a superblock, it calls through the s_op pointer. For ext4, s_op points to ext4_sops. For XFS, it points to xfs_super_operations. This is how VFS achieves file system independence—it never calls ext4 functions directly; it calls through the operation table provided by whichever file system created the superblock.
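Part of this superblock state is visible from user space: statfs(2) reports the magic number (s_magic) and block size of whatever file system backs a path. A small sketch, assuming / and /proc are mounted as usual:

```c
/*
 * User-space view of the superblock: statfs(2) exposes the mounted file
 * system's magic number and block size. The paths are just examples.
 */
#include <stdio.h>
#include <sys/vfs.h>

int main(void)
{
    struct statfs st;

    if (statfs("/", &st) == 0)
        printf("/     : magic=0x%lx block size=%ld\n",
               (unsigned long)st.f_type, (long)st.f_bsize);
    if (statfs("/proc", &st) == 0)
        printf("/proc : magic=0x%lx block size=%ld\n",
               (unsigned long)st.f_type, (long)st.f_bsize);
    return 0;
}
```

Running it on a typical system prints the ext4 magic (0xef53) for / and the procfs magic for /proc—two very different file systems answering the same call.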
The inode (index node) represents a specific file or directory in the file system. It contains all metadata about the file except its name (which lives in the dentry). In VFS, the inode is a bridge between the generic VFS layer and file-system-specific data.
Key inode attributes include:
```c
struct inode {
    umode_t                 i_mode;        /* File type and permissions */
    unsigned short          i_opflags;
    kuid_t                  i_uid;         /* Owner user ID */
    kgid_t                  i_gid;         /* Owner group ID */
    unsigned int            i_flags;

    const struct inode_operations *i_op;   /* Inode operations */
    const struct file_operations  *i_fop;  /* Default file operations */
    struct super_block      *i_sb;         /* Superblock pointer */
    struct address_space    *i_mapping;    /* Page cache mapping */

    unsigned long           i_ino;         /* Inode number */
    atomic_t                i_count;       /* Reference count */
    unsigned int            i_nlink;       /* Hard link count */
    dev_t                   i_rdev;        /* Device ID (if device file) */
    loff_t                  i_size;        /* File size in bytes */

    struct timespec64       __i_atime;     /* Access time */
    struct timespec64       __i_mtime;     /* Modification time */
    struct timespec64       __i_ctime;     /* Change time (metadata) */

    spinlock_t              i_lock;
    unsigned long           i_state;       /* State flags */
    struct rw_semaphore     i_rwsem;       /* Serializes writes and dir ops */
    atomic64_t              i_version;     /* Inode version */

    struct hlist_node       i_hash;        /* Hash list entry */
    struct list_head        i_io_list;
    struct list_head        i_lru;         /* LRU list for eviction */

    union {
        struct pipe_inode_info *i_pipe;    /* If pipe */
        struct cdev            *i_cdev;    /* If character device */
        char                   *i_link;    /* If symlink (short) */
    };

    void                    *i_private;    /* File system private data */
};
```

The inode number (i_ino) uniquely identifies a file within a file system. Combined with the device ID, it provides a system-wide unique identifier. This is why hard links work—multiple directory entries can point to the same inode. It's also why renaming a file within a file system is cheap—you're rewriting directory entries, not moving the file's data.
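A quick user-space illustration: the sketch below (file names chosen arbitrarily) creates a hard link and shows both names reporting the same device/inode pair and a link count of 2:

```c
/*
 * Hard links share an inode: both names report the same (st_dev, st_ino)
 * pair and an st_nlink of 2. The files are created by the program itself.
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(void)
{
    struct stat a, b;

    int fd = open("original.txt", O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return 1;
    close(fd);

    link("original.txt", "hardlink.txt");   /* second name, same inode */

    stat("original.txt", &a);
    stat("hardlink.txt", &b);
    printf("original: dev=%lu ino=%lu nlink=%lu\n",
           (unsigned long)a.st_dev, (unsigned long)a.st_ino,
           (unsigned long)a.st_nlink);
    printf("hardlink: dev=%lu ino=%lu nlink=%lu\n",
           (unsigned long)b.st_dev, (unsigned long)b.st_ino,
           (unsigned long)b.st_nlink);

    unlink("hardlink.txt");
    unlink("original.txt");
    return 0;
}
```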
The dentry (directory entry) represents a component in a pathname. For the path /home/user/document.txt, there are four dentries: /, home, user, and document.txt. Dentries link names to inodes and form the pathname resolution cache.
Dentries exist in three states: used (referenced by the VFS and pointing to a valid inode), unused (pointing to a valid inode but not currently referenced, kept around in the cache), and negative (not associated with any inode, recording that a name does not exist).
```c
struct dentry {
    unsigned int            d_flags;       /* Dentry flags */
    seqcount_spinlock_t     d_seq;
    struct hlist_bl_node    d_hash;        /* Hash table entry */
    struct dentry           *d_parent;     /* Parent dentry */
    struct qstr             d_name;        /* Dentry name */
    struct inode            *d_inode;      /* Associated inode (NULL if negative) */
    unsigned char           d_iname[DNAME_INLINE_LEN]; /* Inline storage for short names */

    struct lockref          d_lockref;     /* Lock and reference count */
    const struct dentry_operations *d_op;  /* Dentry operations */
    struct super_block      *d_sb;         /* Superblock of this dentry */
    unsigned long           d_time;        /* Revalidation time */
    void                    *d_fsdata;     /* File system specific data */

    union {
        struct list_head        d_lru;     /* LRU list */
        wait_queue_head_t       *d_wait;
    };
    struct list_head        d_child;       /* Child list entry */
    struct list_head        d_subdirs;     /* Subdirectory list */

    union {
        struct hlist_node       d_alias;   /* Inode alias list */
        struct hlist_bl_node    d_in_lookup_hash;
        struct rcu_head         d_rcu;
    } d_u;
};
```

The file object represents an open instance of a file. When a process calls open(), the kernel creates a file object. Multiple processes can have separate file objects pointing to the same inode, each with its own file position, flags, and mode.
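A minimal user-space sketch of that distinction, using /etc/hostname as an arbitrary readable file: two open() calls yield independent positions, while dup() shares one file object:

```c
/*
 * Each open() creates a distinct struct file with its own position, while
 * dup() makes two descriptors share one struct file (and one position).
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    int a = open("/etc/hostname", O_RDONLY);
    int b = open("/etc/hostname", O_RDONLY);  /* separate file object */
    int c = dup(a);                           /* shares a's file object */
    char buf[4];

    read(a, buf, sizeof(buf));                /* advances a (and therefore c) */

    printf("a=%lld b=%lld c=%lld\n",
           (long long)lseek(a, 0, SEEK_CUR),
           (long long)lseek(b, 0, SEEK_CUR),
           (long long)lseek(c, 0, SEEK_CUR));

    close(a);
    close(b);
    close(c);
    return 0;
}
```

Descriptors a and c report the same offset because they reference one struct file; b stays at zero because it has its own.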
The file object contains the current file position, the open flags and access mode, a pointer to the file operations table, the path (dentry and mount) it was opened through, and per-open private data:
```c
struct file {
    struct file             *f_next;
    struct file             *f_prev;
    struct inode            *f_inode;      /* Associated inode */
    const struct file_operations *f_op;    /* File operations */

    spinlock_t              f_lock;
    atomic_long_t           f_count;       /* Reference count */
    unsigned int            f_flags;       /* Open flags */
    fmode_t                 f_mode;        /* File mode */
    struct mutex            f_pos_lock;
    loff_t                  f_pos;         /* Current file position */
    struct fown_struct      f_owner;       /* For async I/O */
    const struct cred       *f_cred;       /* File credentials */
    struct file_ra_state    f_ra;          /* Read-ahead state */

    errseq_t                f_wb_err;
    errseq_t                f_sb_err;

    struct path             f_path;        /* File path info */
#define f_dentry            f_path.dentry
#define f_vfsmnt            f_path.mnt

    struct address_space    *f_mapping;    /* Page cache mapping */
    void                    *private_data; /* File system private data */
};
```

Each VFS object has an associated operations table—a structure of function pointers that defines how to perform various operations on that object. This is the heart of VFS polymorphism.
Inode operations (inode_operations) handle manipulation of file system objects themselves—creating, deleting, renaming, and looking up files and directories.
```c
struct inode_operations {
    /* Pathname resolution */
    struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);

    /* File creation and deletion */
    int (*create)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, bool);
    int (*mkdir)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t);
    int (*rmdir)(struct inode *, struct dentry *);
    int (*unlink)(struct inode *, struct dentry *);
    int (*mknod)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, dev_t);

    /* Renaming and linking */
    int (*rename)(struct mnt_idmap *, struct inode *, struct dentry *,
                  struct inode *, struct dentry *, unsigned int);
    int (*link)(struct dentry *, struct inode *, struct dentry *);
    int (*symlink)(struct mnt_idmap *, struct inode *, struct dentry *, const char *);

    /* Symlink handling */
    const char *(*get_link)(struct dentry *, struct inode *, struct delayed_call *);

    /* Permissions and attributes */
    int (*permission)(struct mnt_idmap *, struct inode *, int);
    int (*setattr)(struct mnt_idmap *, struct dentry *, struct iattr *);
    int (*getattr)(struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);

    /* Extended attributes */
    ssize_t (*listxattr)(struct dentry *, char *, size_t);

    /* Truncation */
    void (*truncate)(struct inode *);

    /* ... */
};
```

File operations (file_operations) are the most commonly used—they handle reading, writing, seeking, and everything else you can do with an open file.
```c
struct file_operations {
    struct module *owner;

    /* Position management */
    loff_t (*llseek)(struct file *, loff_t, int);

    /* Reading */
    ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);

    /* Writing */
    ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);

    /* Directory reading */
    int (*iterate)(struct file *, struct dir_context *);
    int (*iterate_shared)(struct file *, struct dir_context *);

    /* Event notification */
    __poll_t (*poll)(struct file *, struct poll_table_struct *);

    /* ioctl */
    long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
    long (*compat_ioctl)(struct file *, unsigned int, unsigned long);

    /* Memory mapping */
    int (*mmap)(struct file *, struct vm_area_struct *);

    /* Open and release */
    int (*open)(struct inode *, struct file *);
    int (*flush)(struct file *, fl_owner_t id);
    int (*release)(struct inode *, struct file *);

    /* Sync */
    int (*fsync)(struct file *, loff_t, loff_t, int datasync);
    int (*fasync)(int, struct file *, int);

    /* Locking */
    int (*lock)(struct file *, int, struct file_lock *);
    int (*flock)(struct file *, int, struct file_lock *);

    /* Splice (zero-copy) operations */
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);

    /* ... */
};
```

Modern file systems implement read_iter and write_iter (using struct kiocb and struct iov_iter) rather than the older read/write methods. The iterator interfaces support vectored and asynchronous I/O and are more efficient for scatter-gather operations. The legacy methods remain for drivers that have not been converted; when a file system provides only the iterator variants, plain read() and write() system calls are wrapped into kiocb/iov_iter pairs internally.
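As a hedged sketch (not taken from any real driver; the myfs_ names are illustrative), a read_iter implementation for a pseudo-file that serves a fixed in-kernel buffer might look like this:

```c
/*
 * Illustrative read_iter for a pseudo-file: copies a static buffer into
 * the caller's iov_iter and advances the position kept in the kiocb.
 */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/uio.h>

static const char myfs_msg[] = "hello from myfs\n";

static ssize_t myfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
    loff_t pos = iocb->ki_pos;
    size_t copied;

    if (pos >= sizeof(myfs_msg) - 1)
        return 0;                               /* EOF */

    copied = copy_to_iter(myfs_msg + pos, sizeof(myfs_msg) - 1 - pos, to);
    iocb->ki_pos += copied;                     /* advance the file position */
    return copied;
}

static const struct file_operations myfs_file_ops = {
    .owner     = THIS_MODULE,
    .read_iter = myfs_read_iter,
    .llseek    = default_llseek,
};
```

Because the same entry point serves read(), readv(), and aio/io_uring reads, one implementation covers every user-space read path.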
The address_space_operations table handles the interface between the page cache and the file system—reading pages from disk, writing dirty pages back, and managing memory-mapped files.
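This is the same page cache user space sees through mmap(): mapping a file and touching the mapping reads pages populated by these operations. A small user-space sketch, using /etc/os-release as an arbitrary readable file:

```c
/*
 * read() and mmap() are served from the same page cache: faulting on the
 * mapping pulls pages in through the file system's address_space_operations.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/etc/os-release", O_RDONLY);
    struct stat st;
    char *p;

    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    /* Pages enter the page cache on first access (page faults). */
    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    fwrite(p, 1, st.st_size, stdout);   /* file contents via the mapping */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```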
```c
struct address_space_operations {
    /* Writing dirty data */
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    int (*writepages)(struct address_space *, struct writeback_control *);

    /* Reading data */
    int (*readpage)(struct file *, struct page *);
    void (*readahead)(struct readahead_control *);

    /* Write preparation (allocate blocks, etc.) */
    int (*write_begin)(struct file *, struct address_space *mapping,
                       loff_t pos, unsigned len,
                       struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned copied,
                     struct page *page, void *fsdata);

    /* Block mapping */
    sector_t (*bmap)(struct address_space *, sector_t);

    /* Direct I/O */
    ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);

    /* Page migration (NUMA) */
    int (*migratepage)(struct address_space *, struct page *, struct page *,
                       enum migrate_mode);

    /* Memory pressure handling */
    int (*launder_page)(struct page *);
    bool (*release_folio)(struct folio *, gfp_t);
    void (*free_folio)(struct folio *);

    int (*swap_activate)(struct swap_info_struct *, struct file *, sector_t *);
    void (*swap_deactivate)(struct file *);
};
```

One of VFS's most critical responsibilities is pathname resolution—converting a string like /home/user/document.txt into the corresponding inode. This process, called a "path walk" or "namei" (name-to-inode), is highly optimized and involves intricate interactions between the dentry cache, inodes, and mount points.
For the path /home/user/document.txt:
Start at the root: Begin with the root dentry of the process's root file system (or current directory for relative paths)
Lookup 'home': Call lookup() on the root inode's operations table, requesting the child named "home"
Cross mount points: If "home" is a mount point, switch to the root dentry of the mounted file system
Lookup 'user': Repeat for the next component
Lookup 'document.txt': Repeat for the final component
Return the final inode
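The sketch below mirrors this walk from user space (with the same hard-coded example components), resolving one name at a time with openat() and O_PATH so each step is relative to the previous directory:

```c
/*
 * Component-by-component path walk from user space: each openat() step
 * resolves exactly one name relative to the previous directory, roughly
 * mirroring what the kernel's lookup does per component.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    const char *components[] = { "home", "user", "document.txt" };
    int dirfd = open("/", O_PATH | O_DIRECTORY);   /* start at the root */
    struct stat st;

    for (int i = 0; i < 3 && dirfd >= 0; i++) {
        int next = openat(dirfd, components[i], O_PATH);
        close(dirfd);
        dirfd = next;                              /* descend one level */
    }

    if (dirfd >= 0 && fstat(dirfd, &st) == 0)
        printf("resolved to inode %lu\n", (unsigned long)st.st_ino);
    else
        perror("lookup failed");

    if (dirfd >= 0)
        close(dirfd);
    return 0;
}
```

Unlike this sketch, the kernel also checks permissions, follows symlinks, and crosses mount points at every step—but the per-component structure is the same.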
Pathname resolution is so frequent that Linux employs a highly optimized "RCU-walk" mode that performs the entire traversal without acquiring any locks or incrementing reference counts.
RCU (Read-Copy-Update) allows readers to traverse data structures concurrently with writers, using careful memory ordering and deferred reclamation. In RCU-walk mode, the entire walk proceeds under rcu_read_lock(): dentries are found in the cache without taking any locks or bumping any reference counts, and sequence counters are checked afterward to detect concurrent modification.
If RCU-walk encounters a condition it cannot handle (e.g., a dentry being simultaneously deleted), it "falls back" to the slower REF-walk mode, which uses proper locking and reference counting.
RCU-walk makes a dramatic difference for workloads with high path resolution rates. A web server serving static files, for example, might resolve the same paths millions of times. With RCU-walk, these resolutions require no atomic operations, no cache line bouncing, and no lock contention—scaling linearly with CPU count.
The dentry cache is one of the most important caches in the Linux kernel. It caches the results of pathname lookups, making repeated access to the same files extremely fast.
Dcache characteristics:
- Lookups go through a global hash table keyed by the parent dentry and the component name
- Unused dentries are kept on per-superblock LRU lists and reclaimed under memory pressure
- Hot-path lookups are RCU-protected, so cache hits take no locks
- Both positive and negative lookup results are cached
Why cache negative dentries?
Caching non-existent entries might seem wasteful, but it's invaluable for performance. Consider a shell checking for a command:
$ foo
bash: foo: command not found
Bash searches PATH: /usr/local/bin/foo, /usr/bin/foo, /bin/foo, etc. Each lookup returns ENOENT. Without negative caching, every failed command would require re-reading multiple directories from disk.
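One way to observe this is sketched below (the paths under /tmp are made up and never exist; the nr_negative counter in /proc/sys/fs/dentry-state is only reported on newer kernels):

```c
/*
 * Rough demonstration of negative dentries: repeatedly failing lookups
 * populate the dentry cache, which /proc/sys/fs/dentry-state reflects.
 */
#include <stdio.h>
#include <sys/stat.h>

static void show_dentry_state(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

    if (f && fgets(line, sizeof(line), f))
        printf("%s: %s", label, line);
    if (f)
        fclose(f);
}

int main(void)
{
    char path[64];
    struct stat st;

    show_dentry_state("before");
    for (int i = 0; i < 10000; i++) {
        snprintf(path, sizeof(path), "/tmp/no-such-file-%d", i);
        stat(path, &st);            /* each ENOENT leaves a negative dentry */
    }
    show_dentry_state("after");
    return 0;
}
```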
VFS manages the mount tree—the hierarchical structure that connects multiple file systems into a single directory namespace. Understanding how mounts work is essential for comprehending containers, chroot jails, and the overall Linux file system landscape.
When you mount a file system, the kernel creates a struct mount object that links the mounted file system into the directory tree:
```c
struct mount {
    struct hlist_node       mnt_hash;        /* Hash table entry */
    struct mount            *mnt_parent;     /* Parent mount */
    struct dentry           *mnt_mountpoint; /* Dentry of mount point */
    struct vfsmount         mnt;             /* Visible in VFS */
    union {
        struct rcu_head     mnt_rcu;
        struct llist_node   mnt_llist;
    };
    struct list_head        mnt_mounts;      /* Child mounts */
    struct list_head        mnt_child;       /* Sibling link */
    struct list_head        mnt_instance;    /* Superblock's mount list */
    const char              *mnt_devname;    /* Device name */
    struct list_head        mnt_list;        /* Mount namespace list */
    struct list_head        mnt_fsnotify_marks;
    struct mnt_namespace    *mnt_ns;         /* Mount namespace */
    int                     mnt_id;          /* Unique mount ID */
    int                     mnt_group_id;    /* Peer group ID */
    int                     mnt_expiry_mark; /* For expirable mounts */
    int                     mnt_writers;     /* Active writer count */
};

struct vfsmount {
    struct dentry           *mnt_root;       /* Root dentry of mounted fs */
    struct super_block      *mnt_sb;         /* Superblock pointer */
    int                     mnt_flags;       /* Mount flags */
    struct mnt_idmap        *mnt_idmap;      /* ID mapping */
};
```

Linux supports mount namespaces—isolated views of the mount tree. Each mount namespace has its own root and its own set of mounted file systems. Processes in different mount namespaces can see completely different directory hierarchies.
Mount namespaces are fundamental to container technology. Each mount also carries a propagation type that determines whether mount and unmount events spread to related mounts:
| Type | Behavior | Use Case |
|---|---|---|
| Private | Mounts visible only within namespace | Default isolation |
| Shared | Mounts propagate to peer groups | Live system updates |
| Slave | Receives mounts from master, doesn't propagate | Chroot environments |
| Unbindable | Cannot be bind-mounted | Preventing mount escape |
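A minimal sketch of how a container runtime might use this (requires root; the tmpfs mount on /mnt is an arbitrary example): unshare(CLONE_NEWNS) gives the process a private copy of the mount tree, and marking it MS_PRIVATE stops propagation back to the host:

```c
/*
 * Enter a private mount namespace and mount a tmpfs that only this
 * namespace can see. Requires CAP_SYS_ADMIN.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <sys/mount.h>

int main(void)
{
    if (unshare(CLONE_NEWNS) < 0) {            /* new mount namespace */
        perror("unshare");
        return 1;
    }

    /* Make all mounts private so changes do not propagate to the parent. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
        perror("mount MS_PRIVATE");
        return 1;
    }

    /* This tmpfs exists only in this namespace's view of the tree. */
    if (mount("tmpfs", "/mnt", "tmpfs", 0, NULL) < 0) {
        perror("mount tmpfs");
        return 1;
    }

    printf("tmpfs mounted on /mnt, invisible to other namespaces\n");
    return 0;
}
```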
Bind mounts make a directory tree visible at another location in the namespace. Unlike symbolic links, bind mounts operate at the VFS level and are transparent to applications.
mount --bind /existing/path /new/location
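The same operation through the mount(2) system call uses the MS_BIND flag (the paths below are placeholders, and the call needs root):

```c
/* Programmatic equivalent of `mount --bind`; paths are placeholders. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    if (mount("/existing/path", "/new/location", NULL, MS_BIND, NULL) < 0) {
        perror("bind mount");
        return 1;
    }
    return 0;
}
```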
Overlay mounts (overlayfs) combine multiple directory trees into a single unified view. This is the basis for container image layering:
mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
To understand the VFS architecture fully, let's examine how a file system registers itself with the kernel. This reveals the contract that file systems must fulfill.
Every file system driver defines a file_system_type structure that describes the file system and provides the functions needed to create superblocks:
```c
struct file_system_type {
    const char *name;                    /* File system name */
    int fs_flags;                        /* Flags (e.g., FS_REQUIRES_DEV) */

    /* Superblock initialization */
    int (*init_fs_context)(struct fs_context *);
    const struct fs_parameter_spec *parameters;

    /* Legacy mount interface */
    struct dentry *(*mount)(struct file_system_type *, int, const char *, void *);

    /* Cleanup */
    void (*kill_sb)(struct super_block *);

    struct module *owner;                /* Module reference */
    struct file_system_type *next;       /* Linked list */
    struct hlist_head fs_supers;         /* Superblock list */

    struct lock_class_key s_lock_key;
    struct lock_class_key s_umount_key;
    struct lock_class_key s_vfs_rename_key;
    struct lock_class_key i_lock_key;
    struct lock_class_key i_mutex_key;
};
```

Here's a skeleton for a simple pseudo-file system (like procfs or debugfs):
```c
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/fs_context.h>

#define MYFS_MAGIC 0x12345678

/* Superblock operations */
static const struct super_operations myfs_sops = {
    .statfs     = simple_statfs,
    .drop_inode = generic_drop_inode,
};

/* Fill superblock during mount */
static int myfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
    struct inode *root;

    sb->s_blocksize      = PAGE_SIZE;
    sb->s_blocksize_bits = PAGE_SHIFT;
    sb->s_magic          = MYFS_MAGIC;
    sb->s_op             = &myfs_sops;
    sb->s_time_gran      = 1;

    /* Create root inode */
    root = new_inode(sb);
    if (!root)
        return -ENOMEM;

    root->i_ino   = 1;
    root->i_mode  = S_IFDIR | 0755;
    root->i_atime = root->i_mtime = root->i_ctime = current_time(root);
    root->i_op    = &simple_dir_inode_operations;
    root->i_fop   = &simple_dir_operations;

    /* Create root dentry */
    sb->s_root = d_make_root(root);
    if (!sb->s_root)
        return -ENOMEM;

    return 0;
}

/* Filesystem context operations */
static int myfs_get_tree(struct fs_context *fc)
{
    return get_tree_nodev(fc, myfs_fill_super);
}

static const struct fs_context_operations myfs_context_ops = {
    .get_tree = myfs_get_tree,
};

static int myfs_init_fs_context(struct fs_context *fc)
{
    fc->ops = &myfs_context_ops;
    return 0;
}

/* File system type definition */
static struct file_system_type myfs_type = {
    .owner           = THIS_MODULE,
    .name            = "myfs",
    .init_fs_context = myfs_init_fs_context,
    .kill_sb         = kill_litter_super,
};

/* Module init/exit */
static int __init myfs_init(void)
{
    return register_filesystem(&myfs_type);
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
```

VFS provides helper functions like simple_dir_inode_operations, simple_dir_operations, and simple_statfs for pseudo-filesystems that don't need custom behavior. Real file systems (ext4, XFS) provide their own implementations for all these operation tables.
Let's trace a complete I/O operation to see how all the VFS components work together. We'll follow a read() system call from user space to disk.
1. System Call Entry
   - The application calls read(fd, buffer, count)
   - sys_read() looks up the file descriptor in the process's fd table and obtains the struct file pointer

2. File Operations Dispatch
   - VFS invokes file->f_op->read_iter() (or the legacy read() method)
   - Most disk file systems delegate to generic_file_read_iter()

3. Page Cache Check
   - The generic code consults the inode's address_space to locate cached pages

4. Cache Hit Path
   - If the requested pages are present and up to date, no device I/O is needed—the data is served straight from memory

5. Cache Miss Path
   - VFS calls address_space_operations->readpage() (or readahead()) to bring the missing pages from the backing store into the page cache

6. Copy to User Space
   - copy_to_user() transfers data from kernel pages to the user buffer

VFS includes a sophisticated read-ahead mechanism that predicts sequential access patterns and pre-fetches upcoming pages before they're requested. This hides disk latency for sequential reads and is automatically tuned based on observed access patterns.
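Applications can cooperate with this machinery through posix_fadvise(2); the sketch below (using an arbitrary example path) hints at sequential access, which typically enlarges the read-ahead window, and asks for asynchronous prefetch:

```c
/*
 * Hinting the read-ahead machinery: POSIX_FADV_SEQUENTIAL widens the
 * read-ahead window, POSIX_FADV_WILLNEED starts populating the page
 * cache asynchronously before the reads arrive.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/var/log/syslog", O_RDONLY);
    char buf[4096];

    if (fd < 0)
        return 1;

    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  /* whole file, sequential */
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);    /* prefetch into page cache */

    while (read(fd, buf, sizeof(buf)) > 0)
        ;                                            /* reads now mostly hit cache */

    close(fd);
    return 0;
}
```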
The VFS represents decades of accumulated wisdom in OS design. Here are the key architectural insights that make it successful:
- Polymorphism through operation tables, so generic code never needs to know which file system it is calling
- A strict split between the common file model (superblock, inode, dentry, file) and file-system-specific state hung off private pointers
- Aggressive caching of both metadata (dentries, inodes) and data (the page cache), with lock-free fast paths such as RCU-walk
- Shared helper libraries (generic_file_*, simple_*) that let new file systems reuse proven code
File system implementation is notoriously tricky. Here are common issues the VFS design helps address:
| Problem | VFS Solution |
|---|---|
| Cache coherency between processes | Single page cache per inode, shared across all openers |
| Race conditions during unmount | Reference counting on superblocks, mounting semaphore |
| Stale dentries after file deletion | Negative dentry caching with invalidation on change |
| Mount point traversal correctness | Explicit mount point checking during path walk |
| Concurrent inode modification | Per-inode rw_semaphore (i_rwsem) for data, spinlock (i_lock) for metadata |
We've explored the Linux Virtual File System in depth. Let's consolidate the key concepts:
- VFS provides a uniform file API by interposing a common object model between system calls and concrete file systems
- The four core objects—superblock, inode, dentry, and file—each carry an operations table that the owning file system fills in
- Pathname resolution walks dentries component by component, accelerated by the dentry cache and lock-free RCU-walk
- The mount tree, mount namespaces, bind mounts, and overlays stitch many file systems into one (or many) directory hierarchies
- A file system joins the VFS by registering a file_system_type and supplying its operation tables
What's next:
With the VFS layer understood, we'll descend into a specific file system implementation: ext4. The next page examines ext4 internals—how it implements the VFS interfaces, its on-disk layout, journaling mechanism, and the design decisions that make it the most widely deployed Linux file system.
You now understand the Virtual File System layer—the architectural backbone that enables Linux to support an extraordinary diversity of file systems through a single, unified API. This knowledge is essential for anyone working on kernel development, file systems, or storage systems at scale.