Operating Systems: File System Structures

Virtual File System (VFS)

Level: Advanced

Duration: 90 mins

Topic: File System Structures

VFS Objects

The Four Pillars of VFS

The Virtual File System hides the complexity of diverse file systems (ext4 with its inodes, FAT with its clusters, NFS with its file handles) behind just four fundamental object types. These objects form a unified model that every file system must map onto:

Superblock (super_block) — Represents a mounted file system instance
Inode (inode) — Represents a file's metadata (but not its name)
Dentry (dentry) — Represents a directory entry (the name-to-inode mapping)
File (file) — Represents an open file in a process

Understanding these objects is essential for kernel developers, file system implementers, and anyone who wants to truly understand how operating systems manage files. This page dissects each object in depth, exploring its purpose, structure, lifecycle, and caching strategies.

What You Will Learn

By the end of this page, you will understand the purpose and structure of each VFS object, how they relate to each other, their lifecycle (creation, caching, destruction), and the optimization strategies that make VFS performant.

VFS Object Relationships

Before diving into individual objects, let's understand how they relate to each other. These relationships are fundamental to VFS operation.

Key Observations:

Multiple processes → One inode: When P1 and P2 both open /home/alice/document.txt, they get separate struct file objects (with independent file positions), but both reference the same dentry and inode.
Dentry as name: The dentry connects the name "document.txt" to inode 12345. A file (inode) can have multiple dentries pointing to it (hard links).
File vs Inode: The file object represents an open file in a specific process with a specific position. The inode represents the file itself, independent of who has it open.
Superblock as container: All inodes for a mounted file system reference back to their superblock.
Layered abstraction: Processes see file descriptors → descriptors map to file objects → file objects reference dentries (through f_path) → dentries point to inodes → inodes belong to superblocks.

VFS Objects Summary
Object	Represents	Unique Identifier	Cached?	Multiple per File?
super_block	Mounted file system	device (s_dev)	While mounted	No (one per mounted file system instance; bind mounts share it)
inode	File metadata	(superblock, inode number)	Yes (inode cache)	No (one per file)
dentry	Name in directory	(parent, name) pair	Yes (dcache)	Yes (hard links)
file	Open file instance	fd per process	No	Yes (each open())

The Superblock Object (struct super_block)

The superblock represents an entire mounted file system. When you mount a file system, the kernel creates a struct super_block that holds all file-system-wide information.

Purpose:

Contains file system metadata (type, block size, mount options)
Provides operations structure for file system-wide functions
Tracks all inodes belonging to this file system
Manages file system state (mounted, read-only, dirty, etc.)

Key Fields of struct super_block:

include/linux/fs.h (simplified)
struct super_block {
    /* List membership */
    struct list_head    s_list;         /* All superblocks in system */
    
    /* Device and type identification */
    dev_t               s_dev;          /* Device identifier */
    unsigned char       s_blocksize_bits; /* log2 of the block size */
    unsigned long       s_blocksize;    /* Block size in bytes */
    loff_t              s_maxbytes;     /* Max file size */
    struct file_system_type *s_type;    /* File system type (ext4, xfs, etc.) */
    
    /* Operations - file system provides these */
    const struct super_operations *s_op; /* Superblock operations */
    const struct dquot_operations *dq_op; /* Quota operations */
    const struct export_operations *s_export_op; /* NFS export ops */
    
    /* Mount options and flags */
    unsigned long       s_flags;        /* Mount flags (SB_RDONLY etc.; MS_RDONLY in older kernels) */
    unsigned long       s_iflags;       /* Internal flags */
    unsigned long       s_magic;        /* Magic number (FS identifier) */
    
    /* Root of this file system */
    struct dentry       *s_root;        /* Root dentry */
    
    /* Unmounting support */
    struct rw_semaphore s_umount;       /* Unmount semaphore */
    int                 s_count;        /* Reference count */
    atomic_t            s_active;       /* Active reference count */
    
    /* Inode management */
    struct list_head    s_inodes;       /* All inodes for this FS */
    spinlock_t          s_inode_list_lock;
    struct list_head    s_inodes_wb;    /* Writeback inodes */
    
    /* Block device (if any) */
    struct block_device *s_bdev;        /* Associated block device */
    struct backing_dev_info *s_bdi;     /* Backing device info */
    struct mtd_info     *s_mtd;         /* MTD for flash-based FS */
    
    /* Filesystem-specific data */
    void                *s_fs_info;     /* FS-private superblock info */
    
    /* Timestamps and limits */
    time64_t            s_time_min;     /* Minimum timestamp */
    time64_t            s_time_max;     /* Maximum timestamp */
    u32                 s_time_gran;    /* Timestamp granularity (ns) */
    
    /* Freezing for snapshots */
    struct sb_writers   s_writers;      /* Freeze state and writer tracking (replaced the older s_frozen field) */
    
    /* ... more fields ... */
};

Superblock Lifecycle:

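Creation: When a file system is mounted, VFS calls into the file system through its file_system_type (the mount() method in older kernels, the fs_context operations in newer ones). The file system reads its on-disk superblock, fills in the struct super_block, and sets s_op to its operations table.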
Active Use: While mounted, the superblock exists in memory. All inodes, dentries, and files trace back to this superblock. The s_active count tracks active references.
Sync: Periodically, or on sync command, VFS calls s_op->sync_fs() to flush metadata.
Unmount: When unmounting, VFS calls s_op->put_super(). The file system writes any dirty data and frees resources.

File System Private Data:

The s_fs_info pointer is critical. Each file system stores its own superblock data here:

/* ext4 stores its private superblock info here */
struct ext4_sb_info *sbi = EXT4_SB(sb);  // Macro accessing s_fs_info

For ext4, this includes the journal handle, block group descriptors, bitmap caches, and countless other ext4-specific fields.

On-Disk vs In-Memory Superblock

Don't confuse the VFS 'struct super_block' (in-memory, kernel structure) with the on-disk superblock (file system specific, stored at a fixed location on the disk). ext4 has an 'ext4_super_block' on disk at byte 1024 (or other locations for backup copies). At mount time, ext4 reads this on-disk structure and uses it to populate both its private 's_fs_info' and the VFS 'super_block'.

The Inode Object (struct inode)

The inode (index node) represents a file's metadata—everything about a file except its name and its data content. When you run ls -l or call stat(), you're reading inode information.

Key Insight: A file's name is NOT part of the inode. Names live in directory entries (dentries). This separation enables hard links—multiple names pointing to one inode.

What Inode Contains:

File type (regular file, directory, symlink, device, socket, FIFO)
Permissions and ownership (mode, uid, gid)
Timestamps (access, modification, change; creation time where the file system records it)
Size and block count
Link count (number of hard links)
Pointers to data blocks (or extent information)
Device info (for device files)
Locks and extended attributes

include/linux/fs.h (simplified)
struct inode {
    /* Mode and file type */
    umode_t             i_mode;         /* File type and permissions */
    unsigned short      i_opflags;      /* Optimization flags */
    kuid_t              i_uid;          /* Owner user ID */
    kgid_t              i_gid;          /* Owner group ID */
    unsigned int        i_flags;        /* Inode flags (immutable, append, etc.) */
    
    /* Operations and owning objects */
    const struct inode_operations *i_op; /* Inode operations */
    struct super_block  *i_sb;          /* Superblock (file system) */
    struct address_space *i_mapping;    /* Page cache mapping */
    
    /* Identification */
    unsigned long       i_ino;          /* Inode number (unique per FS) */
    dev_t               i_rdev;         /* Device number (for dev files) */
    
    /* Size and blocks */
    loff_t              i_size;         /* File size in bytes */
    blkcnt_t            i_blocks;       /* Blocks allocated */
    unsigned int        i_blkbits;      /* Block size bits */
    
    /* Timestamps */
    struct timespec64   i_atime;        /* Last access time */
    struct timespec64   i_mtime;        /* Last modification time */
    struct timespec64   i_ctime;        /* Last metadata change time */
    
    /* Links and references */
    unsigned int        i_nlink;        /* Hard link count */
    atomic_t            i_count;        /* Reference count */
    atomic_t            i_writecount;   /* Writers count */
    
    /* State flags */
    unsigned long       i_state;        /* State flags (dirty, etc.) */
    
    /* File operations (set by inode type) */
    const struct file_operations *i_fop; /* Default file operations */
    
    /* Locking */
    struct rw_semaphore i_rwsem;        /* Read-write semaphore */
    struct mutex        i_mutex;        /* Older kernels only; replaced by i_rwsem in 4.7 */
    
    /* Dentry aliases */
    struct hlist_head   i_dentry;       /* Dentries referencing this inode */
    
    /* File system private data */
    void                *i_private;     /* FS-specific private data */
    
    /* ... more fields ... */
};

Inode Lifecycle:

Allocation: File systems implement super_operations->alloc_inode() to allocate inodes. They typically allocate a larger structure with VFS inode embedded:

struct ext4_inode_info {
    __le32 i_data[15];          /* Block pointers */
    __u32 i_dtime;              /* Deletion time */
    /* ... ext4-specific fields ... */
    struct inode vfs_inode;     /* VFS inode embedded */
};

Creation: When a new file is created, the file system assigns an inode number, initializes fields, and marks it dirty.
Reading from Disk: When an existing file is accessed, iget() family functions look up the inode in cache or read from disk.
Modification: Changes to metadata (chmod, chown, touch) modify inode fields and mark it dirty (mark_inode_dirty()).
Writeback: Periodically, dirty inodes are written to disk via super_operations->write_inode().
Eviction: When the last reference is dropped and i_nlink is 0, or when an unused inode is reclaimed under memory pressure, super_operations->evict_inode() is called to clean up.

The Inode Cache

Inodes are heavily cached. Reading an inode from disk costs milliseconds; reading from the cache costs nanoseconds. The kernel maintains an inode hash table indexed by (superblock, inode_number). On a busy server, millions of inodes may be cached, with cache hits dramatically reducing disk I/O.

State Flags (i_state):

Flag	Meaning
I_DIRTY_SYNC	Inode metadata is dirty, but need not be written by fdatasync() (e.g. atime)
I_DIRTY_DATASYNC	Data-related inode changes are pending that fdatasync() must write
I_DIRTY_PAGES	Has dirty pages in page cache
I_NEW	Inode is new, being initialized
I_WILL_FREE	Inode scheduled for freeing
I_FREEING	Inode being freed
I_CLEAR	Inode cleared, awaiting destruction
I_SYNC	Sync in progress
I_REFERENCED	Recently referenced (for eviction)

The Dentry Object (struct dentry)

The dentry (directory entry) represents a component in a pathname—a name like "alice", "documents", or "file.txt". Dentries form a tree that mirrors the directory structure, caching the results of pathname lookups.

Key Insight: Dentries are primarily a caching mechanism. They cache the name→inode mapping so that repeated path lookups don't require disk reads.

Why Dentries Are Separate from Inodes:

Hard Links: One inode can have multiple names (dentries pointing to it).
Caching Names: Caching path lookups requires caching names, not just inodes.
Negative Dentries: VFS can cache "this name doesn't exist" for failed lookups.
Parent Tracking: Dentries track their parent, enabling efficient .. traversal.

include/linux/dcache.h (simplified)
struct dentry {
    /* Dentry flags and state */
    unsigned int d_flags;          /* Dentry flags */
    seqcount_spinlock_t d_seq;     /* Sequence count for RCU */
    
    /* Hash chain */
    struct hlist_bl_node d_hash;   /* Hash table bucket list */
    
    /* Parent dentry */
    struct dentry *d_parent;       /* Parent dentry */
    
    /* Name of this entry */
    struct qstr d_name;            /* Name (quick string) */
    /*
     * struct qstr contains:
     *   unsigned int hash;        // Hash value
     *   unsigned int len;         // Length
     *   const unsigned char *name; // The name bytes
     */
    
    /* The inode this dentry refers to */
    struct inode *d_inode;         /* Associated inode (NULL for negative) */
    
    /* Short name storage */
    unsigned char d_iname[DNAME_INLINE_LEN]; /* Inline name storage */
    
    /* Reference counting */
    struct lockref d_lockref;      /* Lock and refcount */
    
    /* Operations */
    const struct dentry_operations *d_op; /* Dentry operations */
    
    /* Superblock */
    struct super_block *d_sb;      /* Superblock of this dentry */
    
    /* FS-specific data */
    void *d_fsdata;                /* FS-private data */
    
    /* Child and sibling management */
    struct list_head d_child;      /* Link in parent's d_subdirs */
    struct list_head d_subdirs;    /* Children of this dentry */
    
    /* Alias list (for inodes with multiple dentries) */
    union {
        struct hlist_node d_alias; /* Link in inode's i_dentry list */
        struct hlist_bl_node d_in_lookup_hash; /* For lookup */
        struct rcu_head d_rcu;     /* For RCU freeing */
    } d_u;
    
    /* For time-based expiration (autofs, NFS) */
    unsigned long d_time;          /* Revalidation time */
};

Dentry States:

State	d_inode	d_lockref.count	Description
Used	valid	> 0	Actively referenced by VFS or processes
Unused	valid	= 0	Valid but not currently referenced; cached for reuse
Negative	NULL	≥ 0	Represents "name not found"; caches failed lookups

Negative Dentries:

Negative dentries are powerful for performance. Consider:

# stat() returns "file not found"
$ stat /etc/nonexistent

The first such call may require reading the /etc directory from disk. A second call is different: VFS finds a negative dentry cached for "nonexistent" under /etc and returns ENOENT immediately, with no disk I/O.

This is crucial for compilation: build systems frequently check if files exist (hundreds of header file searches per compile). Negative dentry caching prevents repeated disk reads for missing files.

The Dentry Cache (dcache)

The dcache is one of the most performance-critical caches in the kernel. It's a hash table indexed by (parent dentry, name hash). On production systems, the dcache can contain millions of entries and satisfy the vast majority of pathname lookups without any disk I/O. The 'd_lru' list provides LRU ordering for eviction under memory pressure.

Dentry Lifecycle:

Lookup/Creation: During pathname resolution, VFS handles each component in turn. On a dcache miss it allocates a dentry with d_alloc() and calls inode_operations->lookup(); the file system attaches the inode it finds, or leaves the dentry negative if the name does not exist.
Caching: After successful lookup, the dentry is inserted into the dcache hash table. Subsequent lookups find it there.
Reference Counting: Each dget() increments the count; dput() decrements it. When count reaches 0, the dentry becomes "unused" but remains cached.
LRU Management: Unused dentries live on an LRU list. Under memory pressure, the oldest unused dentries are evicted.
Invalidation: File systems can mark dentries invalid (e.g., NFS after a timeout). d_invalidate() drops cached children.
Deletion: When d_delete() is called (e.g., file unlinked), the dentry transitions to negative state or is removed entirely.

The File Object (struct file)

The file object represents an open file—not the file itself, but a specific instance of opening it. When a process calls open(), a new struct file is created. This object tracks the current read/write position and access mode.

Key Distinction:

Inode: Represents the file (one per file on disk)
File: Represents an opening of the file (one per open() call)

Two processes opening the same file get separate struct file objects (with independent positions) but share the same inode.

include/linux/fs.h (simplified)
struct file {
    /* Path information */
    struct path             f_path;         /* dentry and vfsmount */
    /*
     * struct path {
     *     struct vfsmount *mnt;
     *     struct dentry *dentry;
     * };
     */
    
    /* Inode (shortcut) */
    struct inode            *f_inode;       /* Cached inode pointer */
    
    /* Operations */
    const struct file_operations *f_op;     /* File operations */
    
    /* Reference counting */
    atomic_long_t           f_count;        /* Reference count */
    
    /* Flags and mode */
    unsigned int            f_flags;        /* O_RDONLY, O_WRONLY, O_RDWR, etc. */
    fmode_t                 f_mode;         /* Access mode (FMODE_READ, etc.) */
    
    /* Position */
    loff_t                  f_pos;          /* Current file position (offset) */
    struct mutex            f_pos_lock;     /* Protects f_pos */
    
    /* Security/credentials */
    const struct cred       *f_cred;        /* Credentials at open time */
    
    /* Signal delivery (SIGIO) */
    struct fown_struct      f_owner;        /* Owner for signals */
    
    /* Read-ahead state */
    struct file_ra_state    f_ra;           /* Read-ahead state */
    
    /* File-specific data */
    void                    *private_data;  /* File-specific private data */
    
    /* Epoll and async notification */
    struct list_head        f_ep_links;     /* Epoll link list */
    struct list_head        f_tfile_llink;  /* Epoll temporary file list link */
    
    /* Address space operations (for mmap) */
    struct address_space    *f_mapping;     /* Page cache mapping */
    
    /* Write hints for block layer */
    enum rw_hint            f_write_hint;   /* Write lifetime hint */
    
    /* ... more fields ... */
};

File Object Lifecycle:

Creation via open():
- open() system call triggers pathname resolution → dentry → inode
- VFS allocates new struct file (from a slab cache)
- Sets f_path, f_inode, f_op, f_mode, f_flags
- Calls file_operations->open() for file system notification
- Assigns a file descriptor in the process's fd table
- Returns the fd to userspace
Usage (read/write/etc.):
- Each read/write uses and potentially updates f_pos
- lseek() directly modifies f_pos
- Multiple threads can share a file object (dup/fork), requiring f_pos_lock
Duplication (dup, fork):
- dup(fd) creates new fd pointing to same struct file (refcount++)
- fork() copies fd table; child's fds point to same struct file objects
- Parent and child share file positions after fork!
Closing:
- close(fd) removes fd from process table, calls fput() to decrement refcount
- When refcount reaches 0, file_operations->release() is called
- struct file is returned to slab cache

Access Mode Flags (f_mode):

Flag	Meaning
FMODE_READ	Open for reading
FMODE_WRITE	Open for writing
FMODE_EXEC	Open for execution
FMODE_PREAD	pread() allowed
FMODE_PWRITE	pwrite() allowed
FMODE_LSEEK	lseek() allowed

Open Flags (f_flags):

Flag	Meaning
O_RDONLY	Read only
O_WRONLY	Write only
O_RDWR	Read and write
O_APPEND	Append mode
O_NONBLOCK	Non-blocking I/O
O_DIRECT	Direct I/O (bypass cache)

Shared File Positions After Fork

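After fork(), parent and child share struct file objects. If the parent reads 100 bytes, the child's file position also advances! This is often surprising. To get an independent position, the child must open the file itself, which creates a new struct file; alternatively, either side can use pread()/pwrite(), which take an explicit offset and leave f_pos untouched. O_CLOEXEC does not help here: it only closes the descriptor on exec(), not on fork().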

Caching Strategies

VFS performance depends heavily on caching. The dentry cache is the most heavily exercised of the VFS caches, so it is examined in detail here.

Dentry Cache (dcache):

The dcache is implemented as a hash table where the key is (parent_dentry, name_hash). This enables lookup in O(1) time on average during pathname resolution.

Structure:

/* Global hash table */
static struct hlist_bl_head *dentry_hashtable;

/* Hash function */
static inline struct hlist_bl_head *
d_hash(unsigned int hash) {
    return dentry_hashtable + (hash >> d_hash_shift);
}

LRU Lists:

Unused dentries (refcount = 0) are placed on an LRU list. Under memory pressure, the slab shrinker scans this list and frees the oldest entries.

RCU for Lock-free Lookup:

The dcache uses Read-Copy-Update (RCU) for most lookups. Readers don't need locks—they read consistent snapshots while writers carefully update pointers. This enables massive concurrency.

Sizing:

The dcache grows dynamically based on available memory. On a 64GB system, it might cache 10+ million dentries. Cache hit rates of 99%+ are normal on stable workloads.

Cache Coherency

All these caches work together. When a file is modified, the page cache page is dirtied, the inode's mtime is updated (marking it dirty), and the dentry remains valid. When a file is deleted, the dentry goes negative, the inode's nlink decrements, and cached pages are eventually freed.

Objects in Action: A Complete Example

Let's trace what happens when a process opens and reads a file, seeing how all VFS objects participate:

Simple File Read Operation
// User program
int fd = open("/home/alice/data.txt", O_RDONLY);
char buf[1024];
ssize_t n = read(fd, buf, sizeof(buf));
close(fd);

Step 1: open() - Pathname Resolution:

VFS starts at root dentry (/) from current mount namespace
Lookup "home" in root directory:
- Check dcache for (root_dentry, "home") → cache hit → get home's dentry
- Or cache miss → call root_inode->i_op->lookup() → get dentry from filesystem
- Check if mount point → switch if needed
Lookup "alice" in home directory (repeat process)
Lookup "data.txt" in alice directory → get final dentry and inode

Step 2: open() - File Object Creation:

Allocate struct file from slab cache
Set f_path = {.mnt = current_mount, .dentry = data_txt_dentry}
Set f_inode = data_txt_dentry->d_inode
Set f_op = inode->i_fop (file operations for this file type)
Set f_mode = FMODE_READ (based on O_RDONLY)
Set f_pos = 0 (start of file)
Call f_op->open() if defined (ext4_file_open for ext4)
Allocate file descriptor in process's fd table
Return fd to userspace

Step 3: read() - Data Retrieval:

Look up fd in process's file descriptor table → get struct file
Verify FMODE_READ is set
Get current position from f_pos (0 in this case)
Calculate which page(s) contain bytes 0-1023
For each needed page:
- Check page cache for (inode, page_index) → hit or miss
- On miss: allocate page, call a_ops->readpage() (read_folio() since kernel 5.19) → block I/O
- Return page from cache
Copy data from page(s) to user buffer
Update f_pos += bytes_read (now 1024 if full read)
Return bytes read to userspace

Step 4: close() - Cleanup:

Remove fd from process's file descriptor table
Call fput() → decrement f_count
If f_count reaches 0:
- Call f_op->release() if defined
- Release any locks held by this file object
- Free struct file back to slab
Dentry and inode remain cached (for future opens)
Return 0 to userspace

The Power of Caching

On a warm system, most of these steps complete without any disk I/O: dentries are in dcache, inode is in icache, data pages are in page cache. An open-read-close cycle that seems like it would require multiple disk accesses often completes entirely from memory in microseconds.

Summary: VFS Objects

We've explored the four fundamental VFS objects in depth. Here are the key takeaways:

Key Takeaways

• Superblock represents a mounted file system: it holds file-system-wide information and operations, and tracks all inodes. One per mounted file system instance.
• Inode represents file metadata: type, permissions, size, timestamps, and data location. One per file or directory.
• Dentry represents a name in the directory hierarchy: it links names to inodes and enables path caching. Multiple dentries can point to one inode (hard links).
• File represents an open file instance: it holds the position, access mode, and a reference to the dentry and inode. One per open() call.
• Objects form a layered structure: Process → fd table → file → dentry → inode → superblock.
• Aggressive caching is essential: the dcache, icache, and page cache prevent most disk I/O on warm systems.
• File systems provide implementations: VFS defines the interface; file systems like ext4, XFS, and NFS provide the operations that work on their specific storage format.
• Reference counting manages lifecycle: each object is reference counted. A struct file is freed when its count reaches zero, while unused dentries and inodes stay cached until they are evicted.

What's Next:

With VFS objects understood, we'll explore file system registration—how new file systems make themselves known to the kernel, and how the kernel discovers and invokes them during mount operations.

You now understand the four fundamental VFS objects: superblock, inode, dentry, and file. These structures form the backbone of the VFS layer, enabling uniform file access across diverse file systems while maintaining high performance through sophisticated caching.

4 / 5

Loading learning content...

Operating SystemsFile System Structures

Virtual File System (VFS)

LevelAdvanced

Duration90 mins

TopicFile System Structures

4 / 5

VFS Objects

The Four Pillars of VFS

Superblock (super_block) — Represents a mounted file system instance
Inode (inode) — Represents a file's metadata (but not its name)
Dentry (dentry) — Represents a directory entry (the name-to-inode mapping)
File (file) — Represents an open file in a process

What You Will Learn

VFS Object Relationships

Before diving into individual objects, let's understand how they relate to each other. These relationships are fundamental to VFS operation.

Converting Mermaid diagram...

Key Observations:

Multiple processes → One inode: When P1 and P2 both open /home/alice/document.txt, they get separate struct file objects (with independent file positions), but both reference the same dentry and inode.
Dentry as name: The dentry connects the name "document.txt" to inode 12345. A file (inode) can have multiple dentries pointing to it (hard links).
File vs Inode: The file object represents an open file in a specific process with a specific position. The inode represents the file itself, independent of who has it open.
Superblock as container: All inodes for a mounted file system reference back to their superblock.
Layered abstraction: Processes see file descriptors → descriptors map to file objects → file objects contain dentries → dentries point to inodes → inodes belong to superblocks.

VFS Objects Summary
Object	Represents	Unique Identifier	Cached?	Multiple per File?
super_block	Mounted file system	(device, mount point)	While mounted	No (one per mount)
inode	File metadata	inode number	Yes (inode cache)	No (one per file)
dentry	Name in directory	(parent, name) pair	Yes (dcache)	Yes (hard links)
file	Open file instance	fd per process	No	Yes (each open())

The Superblock Object (struct super_block)

The superblock represents an entire mounted file system. When you mount a file system, the kernel creates a struct super_block that holds all file-system-wide information.

Purpose:

Contains file system metadata (type, block size, mount options)
Provides operations structure for file system-wide functions
Tracks all inodes belonging to this file system
Manages file system state (mounted, read-only, dirty, etc.)

Key Fields of struct super_block:

include/linux/fs.h (simplified)
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
struct super_block {
    /* List membership */
    struct list_head    s_list;         /* All superblocks in system */
    
    /* Device and type identification */
    dev_t               s_dev;          /* Device identifier */
    unsigned char       s_blocksize_bits; /* Block size in bits */
    unsigned long       s_blocksize;    /* Block size in bytes */
    loff_t              s_maxbytes;     /* Max file size */
    struct file_system_type *s_type;    /* File system type (ext4, xfs, etc.) */
    
    /* Operations - file system provides these */
    const struct super_operations *s_op; /* Superblock operations */
    const struct dquot_operations *dq_op; /* Quota operations */
    const struct export_operations *s_export_op; /* NFS export ops */
    
    /* Mount options and flags */
    unsigned long       s_flags;        /* Mount flags (MS_RDONLY etc.) */
    unsigned long       s_iflags;       /* Internal flags */
    unsigned long       s_magic;        /* Magic number (FS identifier) */
    
    /* Root of this file system */
    struct dentry       *s_root;        /* Root dentry */
    
    /* Unmounting support */
    struct rw_semaphore s_umount;       /* Unmount semaphore */
    int                 s_count;        /* Reference count */
    atomic_t            s_active;       /* Active reference count */
    
    /* Inode management */
    struct list_head    s_inodes;       /* All inodes for this FS */
    spinlock_t          s_inode_list_lock;
    struct list_head    s_inodes_wb;    /* Writeback inodes */
    
    /* Block device (if any) */
    struct block_device *s_bdev;        /* Associated block device */
    struct backing_dev_info *s_bdi;     /* Backing device info */
    struct mtd_info     *s_mtd;         /* MTD for flash-based FS */
    
    /* Filesystem-specific data */
    void                *s_fs_info;     /* FS-private superblock info */
    
    /* Timestamps and limits */
    time64_t            s_time_min;     /* Minimum timestamp */
    time64_t            s_time_max;     /* Maximum timestamp */
    u32                 s_time_gran;    /* Timestamp granularity (ns) */
    
    /* Freezing for snapshots */
    int                 s_frozen;       /* Freeze state */
    struct percpu_rw_semaphore s_writers; /* Writers handling */
    
    /* ... more fields ... */
};

Superblock Lifecycle:

Creation: When a file system is mounted, VFS calls the file system's mount() method. The file system reads its on-disk superblock, creates struct super_block, and populates s_op with its operations.
Active Use: While mounted, the superblock exists in memory. All inodes, dentries, and files trace back to this superblock. The s_active count tracks active references.
Sync: Periodically, or on sync command, VFS calls s_op->sync_fs() to flush metadata.
Unmount: When unmounting, VFS calls s_op->put_super(). The file system writes any dirty data and frees resources.

File System Private Data:

The s_fs_info pointer is critical. Each file system stores its own superblock data here:

/* ext4 stores its private superblock info here */
struct ext4_sb_info *sbi = EXT4_SB(sb);  // Macro accessing s_fs_info

For ext4, this includes the journal handle, block group descriptors, bitmap caches, and countless other ext4-specific fields.

On-Disk vs In-Memory Superblock

The Inode Object (struct inode)

The inode (index node) represents a file's metadata—everything about a file except its name and its data content. When you run ls -l or call stat(), you're reading inode information.

Key Insight: A file's name is NOT part of the inode. Names live in directory entries (dentries). This separation enables hard links—multiple names pointing to one inode.

What Inode Contains:

File type (regular file, directory, symlink, device, socket, FIFO)
Permissions and ownership (mode, uid, gid)
Timestamps (access, modification, change; creation time is kept by some file systems outside the generic inode)
Size and block count
Link count (number of hard links)
Pointers to data blocks (or extent information)
Device info (for device files)
Locks and extended attributes
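The separation of names from inodes can be observed from user space. The sketch below is a minimal illustration, assuming a POSIX system with a writable /tmp; the function name and the scratch file names are hypothetical. It creates a file, adds a second name with link(), and uses stat() to confirm that both names report the same inode number and a link count of 2.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a file and a hard link to it inside a fresh temporary directory,
 * then checks that both names resolve to the same inode.
 * Returns 1 if they share an inode with link count 2, 0 if not, -1 on error. */
int hard_link_shares_inode(void)
{
    char dir[] = "/tmp/vfs_demo_XXXXXX";   /* hypothetical scratch location */
    char a[64], b[64];
    struct stat sa, sb;
    int result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof(a), "%s/original", dir);
    snprintf(b, sizeof(b), "%s/alias", dir);

    FILE *f = fopen(a, "w");
    if (f != NULL) {
        fclose(f);
        /* link() adds a second directory entry for the same inode */
        if (link(a, b) == 0 && stat(a, &sa) == 0 && stat(b, &sb) == 0)
            result = (sa.st_ino == sb.st_ino &&
                      sa.st_dev == sb.st_dev &&
                      sa.st_nlink == 2);
    }
    unlink(b);
    unlink(a);
    rmdir(dir);
    return result;
}
```

The fields read here (st_ino, st_dev, st_nlink) are the user-visible counterparts of i_ino, the owning superblock's device, and i_nlink.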

include/linux/fs.h (simplified)
C
struct inode {
    /* Mode and file type */
    umode_t             i_mode;         /* File type and permissions */
    unsigned short      i_opflags;      /* Optimization flags */
    kuid_t              i_uid;          /* Owner user ID */
    kgid_t              i_gid;          /* Owner group ID */
    unsigned int        i_flags;        /* Inode flags (immutable, append, etc.) */
    
    /* Operations and owning objects */
    const struct inode_operations *i_op; /* Inode operations */
    struct super_block  *i_sb;          /* Superblock (file system) */
    struct address_space *i_mapping;    /* Page cache mapping */
    
    /* Identification */
    unsigned long       i_ino;          /* Inode number (unique per FS) */
    dev_t               i_rdev;         /* Device number (for dev files) */
    
    /* Size and blocks */
    loff_t              i_size;         /* File size in bytes */
    blkcnt_t            i_blocks;       /* Blocks allocated */
    unsigned int        i_blkbits;      /* Block size bits */
    
    /* Timestamps */
    struct timespec64   i_atime;        /* Last access time */
    struct timespec64   i_mtime;        /* Last modification time */
    struct timespec64   i_ctime;        /* Last metadata change time */
    
    /* Links and references */
    unsigned int        i_nlink;        /* Hard link count */
    atomic_t            i_count;        /* Reference count */
    atomic_t            i_writecount;   /* Writers count */
    
    /* State flags */
    unsigned long       i_state;        /* State flags (dirty, etc.) */
    
    /* File operations (set by inode type) */
    const struct file_operations *i_fop; /* Default file operations */
    
    /* Locking */
    struct rw_semaphore i_rwsem;        /* Read-write semaphore */
    struct mutex        i_mutex;        /* Pre-4.7 inode mutex, since replaced by i_rwsem */
    
    /* Dentry aliases (names that refer to this inode) */
    struct hlist_head   i_dentry;       /* Dentries referencing this inode */
    
    /* File system private data */
    void                *i_private;     /* FS-specific private data */
    
    /* ... more fields ... */
};

Inode Lifecycle:

Allocation: File systems implement super_operations->alloc_inode() to allocate inodes. They typically allocate a larger structure with VFS inode embedded:

struct ext4_inode_info {
    __le32 i_data[15];          /* Block pointers */
    __u32 i_dtime;              /* Deletion time */
    /* ... ext4-specific fields ... */
    struct inode vfs_inode;     /* VFS inode embedded */
};

Creation: When a new file is created, the file system assigns an inode number, initializes fields, and marks it dirty.
Reading from Disk: When an existing file is accessed, iget() family functions look up the inode in cache or read from disk.
Modification: Changes to metadata (chmod, chown, touch) modify inode fields and mark it dirty (mark_inode_dirty()).
Writeback: Periodically, dirty inodes are written to disk via super_operations->write_inode().
Eviction: When the last reference is dropped and i_nlink is 0 (the file was deleted), or when an unused inode is reclaimed under memory pressure, super_operations->evict_inode() is called to clean up.
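The allocation step relies on embedding the generic inode inside a larger file-system structure and recovering the outer structure from a pointer to the inner one. The following user-space sketch shows that pattern in isolation. All toy_ and TOY_ names are hypothetical stand-ins, not kernel identifiers; the kernel uses its own container_of() macro and helpers such as EXT4_I() for the same purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the generic VFS inode (heavily reduced). */
struct toy_inode {
    unsigned long i_ino;
};

/* Stand-in for a file system's private inode wrapper. */
struct toy_fs_inode_info {
    unsigned int     fs_private_flags;  /* hypothetical FS-specific field */
    struct toy_inode vfs_inode;         /* generic inode embedded in the wrapper */
};

/* Recover the wrapper from a pointer to the embedded member. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* In the spirit of alloc_inode(): allocate the large structure,
 * hand only the embedded generic inode back to the caller. */
struct toy_inode *toy_alloc_inode(void)
{
    struct toy_fs_inode_info *info = calloc(1, sizeof(*info));
    return info ? &info->vfs_inode : NULL;
}

/* In the spirit of EXT4_I(): map the generic inode back to its wrapper. */
struct toy_fs_inode_info *TOY_I(struct toy_inode *inode)
{
    return toy_container_of(inode, struct toy_fs_inode_info, vfs_inode);
}

/* Free the whole wrapper, not just the embedded member. */
void toy_free_inode(struct toy_inode *inode)
{
    free(TOY_I(inode));
}
```

Because the generic code only ever sees struct toy_inode pointers, one allocation serves both layers and no extra pointer is needed to reach the private fields.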

The Inode Cache

State Flags (i_state):

Flag	Meaning
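`I_DIRTY_SYNC`	Inode metadata is dirty, but the change need not be written by fdatasync() (e.g., timestamps)
`I_DIRTY_DATASYNC`	Inode metadata needed to retrieve data is dirty (e.g., size); fdatasync() must write it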
`I_DIRTY_PAGES`	Has dirty pages in page cache
`I_NEW`	Inode is new, being initialized
`I_WILL_FREE`	Inode scheduled for freeing
`I_FREEING`	Inode being freed
`I_CLEAR`	Inode cleared, awaiting destruction
`I_SYNC`	Sync in progress
`I_REFERENCED`	Recently referenced (for eviction)

The Dentry Object (struct dentry)

Key Insight: Dentries are primarily a caching mechanism. They cache the name→inode mapping so that repeated path lookups don't require disk reads.

Why Dentries Are Separate from Inodes:

Hard Links: One inode can have multiple names (dentries pointing to it).
Caching Names: Caching path lookups requires caching names, not just inodes.
Negative Dentries: VFS can cache "this name doesn't exist" for failed lookups.
Parent Tracking: Dentries track their parent, enabling efficient .. traversal.

include/linux/dcache.h (simplified)
C
struct dentry {
    /* Dentry flags and state */
    unsigned int d_flags;          /* Dentry flags */
    seqcount_spinlock_t d_seq;     /* Sequence count for RCU */
    
    /* Hash chain */
    struct hlist_bl_node d_hash;   /* Hash table bucket list */
    
    /* Parent dentry */
    struct dentry *d_parent;       /* Parent dentry */
    
    /* Name of this entry */
    struct qstr d_name;            /* Name (quick string) */
    /*
     * struct qstr contains:
     *   unsigned int hash;        // Hash value
     *   unsigned int len;         // Length
     *   const unsigned char *name; // The name bytes
     */
    
    /* The inode this dentry refers to */
    struct inode *d_inode;         /* Associated inode (NULL for negative) */
    
    /* Short name storage */
    unsigned char d_iname[DNAME_INLINE_LEN]; /* Inline name storage */
    
    /* Reference counting */
    struct lockref d_lockref;      /* Lock and refcount */
    
    /* Operations */
    const struct dentry_operations *d_op; /* Dentry operations */
    
    /* Superblock */
    struct super_block *d_sb;      /* Superblock of this dentry */
    
    /* FS-specific data */
    void *d_fsdata;                /* FS-private data */
    
    /* Child and sibling management */
    struct list_head d_child;      /* Link in parent's d_subdirs */
    struct list_head d_subdirs;    /* Children of this dentry */
    
    /* Alias list (for inodes with multiple dentries) */
    union {
        struct hlist_node d_alias; /* Link in inode's i_dentry list */
        struct hlist_bl_node d_in_lookup_hash; /* For lookup */
        struct rcu_head d_rcu;     /* For RCU freeing */
    } d_u;
    
    /* For time-based expiration (autofs, NFS) */
    unsigned long d_time;          /* Revalidation time */
};

Dentry States:

State	d_inode	d_lockref.count	Description
Used	valid	> 0	Actively referenced by VFS or processes
Unused	valid	= 0	Valid but not currently referenced; cached for reuse
Negative	NULL	≥ 0	Represents "name not found"; caches failed lookups

Negative Dentries:

Negative dentries are powerful for performance. Consider:

# stat() returns "file not found"
$ stat /etc/nonexistent

The first time this command runs, VFS must ask the file system to search /etc, which may require reading the directory from disk. If the same command runs again, VFS finds a negative dentry cached for "nonexistent" under /etc and returns ENOENT immediately, with no disk I/O.

This is crucial for compilation: build systems frequently check if files exist (hundreds of header file searches per compile). Negative dentry caching prevents repeated disk reads for missing files.
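The effect of negative caching can be modeled in a few lines. The sketch below is a toy, not the kernel's dcache: it assumes a small direct-mapped table keyed by (parent, name), a simulated file system in which only "passwd" exists, and a counter standing in for directory reads. Every toy_ name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_CACHE_SLOTS 64

struct toy_dentry {
    bool        in_use;
    const void *parent;      /* identity of the parent directory */
    char        name[32];
    long        ino;         /* 0 means negative: "name does not exist" */
};

static struct toy_dentry toy_cache[TOY_CACHE_SLOTS];
static int toy_backing_lookups;   /* counts simulated directory reads */

/* Simulated file system lookup: only "passwd" exists. */
static long toy_fs_lookup(const char *name)
{
    toy_backing_lookups++;
    return strcmp(name, "passwd") == 0 ? 42 : 0;
}

/* Simple stand-in hash over the parent identity and the name bytes. */
static unsigned toy_hash(const void *parent, const char *name)
{
    unsigned h = (unsigned)(uintptr_t)parent;
    while (*name)
        h = h * 31u + (unsigned char)*name++;
    return h % TOY_CACHE_SLOTS;
}

/* Returns the inode number, or 0 for "not found". */
long toy_lookup(const void *parent, const char *name)
{
    struct toy_dentry *d = &toy_cache[toy_hash(parent, name)];

    if (d->in_use && d->parent == parent && strcmp(d->name, name) == 0)
        return d->ino;                   /* hit, positive or negative */

    d->in_use = true;                    /* miss: ask the FS, then cache */
    d->parent = parent;
    strncpy(d->name, name, sizeof(d->name) - 1);
    d->name[sizeof(d->name) - 1] = '\0';
    d->ino = toy_fs_lookup(name);        /* a result of 0 is cached too */
    return d->ino;
}

int toy_backing_lookup_count(void)
{
    return toy_backing_lookups;
}
```

Looking up the same missing name twice under the same parent increments the counter only once, which is the saving a build system gets on each repeated header probe.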

The Dentry Cache (dcache)

Dentry Lifecycle:

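Lookup/Creation: During pathname resolution, VFS handles one component at a time. On a dcache miss it allocates a new dentry (d_alloc()) and calls inode_operations->lookup() on the parent directory's inode. The file system attaches the matching inode to the dentry, or leaves it negative if the name does not exist.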
Caching: After successful lookup, the dentry is inserted into the dcache hash table. Subsequent lookups find it there.
Reference Counting: Each dget() increments the count; dput() decrements it. When count reaches 0, the dentry becomes "unused" but remains cached.
LRU Management: Unused dentries live on an LRU list. Under memory pressure, the oldest unused dentries are evicted.
Invalidation: File systems can reject stale dentries through d_op->d_revalidate() (e.g., NFS after a timeout). d_invalidate() then unhashes the dentry and drops its cached children.
Deletion: When d_delete() is called (e.g., file unlinked), the dentry transitions to negative state or is removed entirely.
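The reference-counting and LRU steps above can be sketched as a small user-space model. This is a simplification under stated assumptions: the toy_ names are hypothetical, there is no locking, and entries leave the LRU list eagerly when re-referenced, whereas the kernel's bookkeeping is more involved.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dref {
    int count;                      /* plays the role of d_lockref.count */
    int cached;                     /* 1 until evicted */
    struct toy_dref *prev, *next;   /* LRU links, non-NULL only while unused */
};

/* Sentinel node: lru.next is the oldest unused entry. */
static struct toy_dref lru = { 0, 0, &lru, &lru };

static void lru_del(struct toy_dref *d)
{
    d->prev->next = d->next;
    d->next->prev = d->prev;
    d->prev = d->next = NULL;
}

static void lru_add_tail(struct toy_dref *d)
{
    d->prev = lru.prev;
    d->next = &lru;
    lru.prev->next = d;
    lru.prev = d;
}

/* Take a reference; an unused entry leaves the LRU list. */
void toy_dget(struct toy_dref *d)
{
    if (d->count++ == 0 && d->next != NULL)
        lru_del(d);
}

/* Drop a reference; at zero the entry is unused but stays cached. */
void toy_dput(struct toy_dref *d)
{
    if (--d->count == 0)
        lru_add_tail(d);
}

/* Memory pressure: evict the oldest unused entry, or return NULL. */
struct toy_dref *toy_shrink_one(void)
{
    struct toy_dref *victim = lru.next;
    if (victim == &lru)
        return NULL;
    lru_del(victim);
    victim->cached = 0;
    return victim;
}
```

The point to notice is that a count of zero does not free anything by itself; only the shrinker removes entries, and it can only take entries that nobody holds.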

The File Object (struct file)

Key Distinction:

Inode: Represents the file (one per file on disk)
File: Represents an opening of the file (one per open() call)

Two processes opening the same file get separate struct file objects (with independent positions) but share the same inode.

include/linux/fs.h (simplified)
C
struct file {
    /* Path information */
    struct path             f_path;         /* dentry and vfsmount */
    /*
     * struct path {
     *     struct vfsmount *mnt;
     *     struct dentry *dentry;
     * };
     */
    
    /* Inode (shortcut) */
    struct inode            *f_inode;       /* Cached inode pointer */
    
    /* Operations */
    const struct file_operations *f_op;     /* File operations */
    
    /* Reference counting */
    atomic_long_t           f_count;        /* Reference count */
    
    /* Flags and mode */
    unsigned int            f_flags;        /* O_RDONLY, O_WRONLY, O_RDWR, etc. */
    fmode_t                 f_mode;         /* Access mode (FMODE_READ, etc.) */
    
    /* Position */
    loff_t                  f_pos;          /* Current file position (offset) */
    struct mutex            f_pos_lock;     /* Protects f_pos */
    
    /* Security/credentials */
    const struct cred       *f_cred;        /* Credentials at open time */
    
    /* Asynchronous I/O notification */
    struct fown_struct      f_owner;        /* Owner for signals (SIGIO delivery) */
    
    /* Read-ahead state */
    struct file_ra_state    f_ra;           /* Read-ahead state */
    
    /* File-specific data */
    void                    *private_data;  /* File-specific private data */
    
    /* Epoll bookkeeping */
    struct list_head        f_ep_links;     /* Epoll items watching this file */
    struct list_head        f_tfile_llink;  /* Linkage used by epoll's loop checks */
    
    /* Address space operations (for mmap) */
    struct address_space    *f_mapping;     /* Page cache mapping */
    
    /* Write hints for block layer */
    enum rw_hint            f_write_hint;   /* Write lifetime hint */
    
    /* ... more fields ... */
};

File Object Lifecycle:

Creation via open():
- open() system call triggers pathname resolution → dentry → inode
- VFS allocates new struct file (from a slab cache)
- Sets f_path, f_inode, f_op, f_mode, f_flags
- Calls file_operations->open() for file system notification
- Assigns a file descriptor in the process's fd table
- Returns the fd to userspace
Usage (read/write/etc.):
- Each read/write uses and potentially updates f_pos
- lseek() directly modifies f_pos
- Threads, dup()'d descriptors, and forked children can all share one file object, so f_pos_lock serializes updates to f_pos
Duplication (dup, fork):
- dup(fd) creates new fd pointing to same struct file (refcount++)
- fork() copies fd table; child's fds point to same struct file objects
- Parent and child share file positions after fork!
Closing:
- close(fd) removes fd from process table, calls fput() to decrement refcount
- When refcount reaches 0, file_operations->release() is called
- struct file is returned to slab cache
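The difference between duplicating a descriptor and opening a file a second time is easy to verify from user space. The sketch below assumes a POSIX system with a writable /tmp; the function name and scratch path are hypothetical. It reads 4 bytes through one descriptor and then reports the offset seen through a dup()'d descriptor (same struct file) and through a second open() (a different struct file).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Stores the offset seen through a dup()'d descriptor and through an
 * independently opened descriptor after reading 4 bytes via the original.
 * Returns 0 on success, -1 on error. */
int offset_sharing_demo(off_t *dup_offset, off_t *reopen_offset)
{
    char path[] = "/tmp/vfs_offset_XXXXXX";  /* hypothetical scratch file */
    char buf[4];
    int rc = -1;

    int fd = mkstemp(path);                  /* open #1: first struct file */
    if (fd < 0)
        return -1;
    int fd_dup = dup(fd);                    /* new descriptor, same struct file */
    int fd_again = open(path, O_RDONLY);     /* open #2: second struct file */

    if (fd_dup >= 0 && fd_again >= 0 &&
        write(fd, "0123456789", 10) == 10 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf)) {
        *dup_offset = lseek(fd_dup, 0, SEEK_CUR);       /* follows fd */
        *reopen_offset = lseek(fd_again, 0, SEEK_CUR);  /* unaffected by fd */
        rc = 0;
    }
    if (fd_again >= 0)
        close(fd_again);
    if (fd_dup >= 0)
        close(fd_dup);
    close(fd);
    unlink(path);
    return rc;
}
```

The dup()'d descriptor reports offset 4 because it shares f_pos with the original, while the separately opened descriptor still reports 0. Descriptors inherited across fork() behave like the dup() case.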

Access Mode Flags (f_mode):

Flag	Meaning
FMODE_READ	Open for reading
FMODE_WRITE	Open for writing
FMODE_EXEC	Open for execution
FMODE_PREAD	pread() allowed
FMODE_PWRITE	pwrite() allowed
FMODE_LSEEK	lseek() allowed

Open Flags (f_flags):

Flag	Meaning
O_RDONLY	Read only
O_WRONLY	Write only
O_RDWR	Read and write
O_APPEND	Append mode
O_NONBLOCK	Non-blocking I/O
O_DIRECT	Direct I/O (bypass cache)
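User space can inspect and partly change these flags on an already open file with fcntl(). The helpers below are a minimal sketch with hypothetical function names; they rely only on the standard F_GETFL and F_SETFL commands and the O_ACCMODE mask.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns the access mode (O_RDONLY, O_WRONLY or O_RDWR) of fd, or -1. */
int access_mode_of(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    return flags < 0 ? -1 : (flags & O_ACCMODE);
}

/* Returns 1 if the given status flag is set on fd, 0 if not, -1 on error. */
int has_flag(int fd, int flag)
{
    int flags = fcntl(fd, F_GETFL);
    return flags < 0 ? -1 : ((flags & flag) != 0);
}

/* Turns on O_NONBLOCK while preserving the other status flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

The access mode is fixed at open() time, which is why it is extracted with a mask rather than tested bit by bit; status flags such as O_APPEND and O_NONBLOCK can be changed later through F_SETFL.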

Shared File Positions After Fork

Caching Strategies

VFS performance depends heavily on caching. Let's examine the caching strategies for each object type.

Dentry Cache (dcache):

The dcache is implemented as a hash table where the key is (parent_dentry, name_hash). This gives expected O(1) lookup for each path component during pathname resolution.

Structure:

/* Global hash table */
static struct hlist_bl_head *dentry_hashtable;

/* Hash function */
static inline struct hlist_bl_head *
d_hash(unsigned int hash) {
    return dentry_hashtable + (hash >> d_hash_shift);
}

LRU Lists:

Unused dentries (refcount = 0) are placed on an LRU list. Under memory pressure, the slab shrinker scans this list and frees the oldest entries.

RCU for Lock-free Lookup:

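The dcache uses Read-Copy-Update (RCU) for most lookups. Readers take no locks: they walk the hash chains under RCU and use each dentry's d_seq sequence count to detect a concurrent rename or removal, falling back to a locked, reference-counted walk if validation fails. This lets many CPUs resolve paths at the same time without contending on shared locks.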

Sizing:

The dcache grows dynamically based on available memory. On a 64GB system, it might cache 10+ million dentries. Cache hit rates of 99%+ are normal on stable workloads.

Cache Coherency

Objects in Action: A Complete Example

Let's trace what happens when a process opens and reads a file, seeing how all VFS objects participate:

Simple File Read Operation
C
// User program
int fd = open("/home/alice/data.txt", O_RDONLY);
char buf[1024];
ssize_t n = read(fd, buf, sizeof(buf));
close(fd);

Step 1: open() - Pathname Resolution:

VFS starts at the process's root directory dentry (normally /) for an absolute path, or at the current working directory for a relative one
Lookup "home" in root directory:
- Check dcache for (root_dentry, "home") → cache hit → get home's dentry
- Or cache miss → call root_inode->i_op->lookup() → get dentry from filesystem
- Check if mount point → switch if needed
Lookup "alice" in home directory (repeat process)
Lookup "data.txt" in alice directory → get final dentry and inode
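The component-by-component walk can be imitated from user space by resolving one name at a time relative to a directory descriptor. The sketch below is only an analogy for what the kernel does internally: walk_path is a hypothetical helper, it handles absolute paths only, and each openat() call still runs the kernel's own lookup for that single component.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Walks an absolute path one component at a time.
 * Returns a read-only descriptor for the final component, or -1 on failure. */
int walk_path(const char *abs_path)
{
    char copy[4096];
    if (abs_path[0] != '/' || strlen(abs_path) >= sizeof(copy))
        return -1;
    strcpy(copy, abs_path);                     /* strtok_r modifies its input */

    int cur = open("/", O_RDONLY | O_DIRECTORY);   /* start at the root */
    char *save = NULL;
    for (char *name = strtok_r(copy, "/", &save);
         name != NULL && cur >= 0;
         name = strtok_r(NULL, "/", &save)) {
        int next = openat(cur, name, O_RDONLY); /* look up one name in cur */
        close(cur);
        cur = next;                             /* -1 ends the walk */
    }
    return cur;
}
```

Calling walk_path("/home/alice/data.txt") would issue three single-component lookups, matching the "home", "alice", "data.txt" sequence above. A component that does not exist, or an intermediate component that is not a directory, stops the walk with -1.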

Step 2: open() - File Object Creation:

Allocate struct file from slab cache
Set f_path = {.mnt = current_mount, .dentry = data_txt_dentry}
Set f_inode = data_txt_dentry->d_inode
Set f_op = inode->i_fop (file operations for this file type)
Set f_mode = FMODE_READ (based on O_RDONLY)
Set f_pos = 0 (start of file)
Call f_op->open() if defined (ext4_file_open for ext4)
Allocate file descriptor in process's fd table
Return fd to userspace

Step 3: read() - Data Retrieval:

Look up fd in process's file descriptor table → get struct file
Verify FMODE_READ is set
Get current position from f_pos (0 in this case)
Calculate which page(s) contain bytes 0-1023
For each needed page:
- Check page cache for (inode, page_index) → hit or miss
- On miss: allocate page, call a_ops->readpage() (named read_folio() in Linux 5.19 and later) → block I/O
- Return page from cache
Copy data from page(s) to user buffer
Update f_pos += bytes_read (now 1024 if full read)
Return bytes read to userspace

Step 4: close() - Cleanup:

Remove fd from process's file descriptor table
Call fput() → decrement f_count
If f_count reaches 0:
- Call f_op->release() if defined
- Release any locks held by this file object
- Free struct file back to slab
Dentry and inode remain cached (for future opens)
Return 0 to userspace

The Power of Caching

Summary: VFS Objects

We've explored the four fundamental VFS objects in depth. Here are the key takeaways:

Key Takeaways

•Superblock represents a mounted filesystem — Contains filesystem-wide info, operations, and tracks all inodes. One per mounted file system instance; bind mounts of the same file system share it.
•Inode represents file metadata — Contains type, permissions, size, timestamps, and data location. One per file/directory.
•Dentry represents a name in the directory hierarchy — Links names to inodes, enables path caching. Multiple dentries can point to one inode (hard links).
•File represents an open file instance — Contains position, access mode, and reference to dentry/inode. One per open() call.
•Objects form a layered structure — Process → fd table → file → dentry → inode → superblock
•Aggressive caching is essential — Dcache, icache, and page cache prevent most disk I/O on warm systems.
•File systems provide implementations — VFS defines the interface; file systems like ext4, XFS, NFS provide the operations that work on their specific storage format.
•Reference counting manages lifecycle — Each object has reference counts. A file object is freed when its count reaches zero; dentries and inodes at zero stay cached until they are reclaimed.
