The Virtual File System hides the complexity of diverse file systems (ext4 with its inodes, FAT with its clusters, NFS with its file handles) behind just four fundamental object types: the superblock, the inode, the dentry, and the file. These objects form a unified model that every file system must map onto.
Understanding these objects is essential for kernel developers, file system implementers, and anyone who wants to truly understand how operating systems manage files. This page dissects each object in depth, exploring its purpose, structure, lifecycle, and caching strategies.
By the end of this page, you will understand the purpose and structure of each VFS object, how they relate to each other, their lifecycle (creation, caching, destruction), and the optimization strategies that make VFS performant.
Before diving into individual objects, let's understand how they relate to each other. These relationships are fundamental to VFS operation.
Key Observations:
Multiple processes → One inode: When P1 and P2 both open /home/alice/document.txt, they get separate struct file objects (with independent file positions), but both reference the same dentry and inode.
Dentry as name: The dentry connects the name "document.txt" to inode 12345. A file (inode) can have multiple dentries pointing to it (hard links).
File vs Inode: The file object represents an open file in a specific process with a specific position. The inode represents the file itself, independent of who has it open.
Superblock as container: All inodes for a mounted file system reference back to their superblock.
Layered abstraction: Processes see file descriptors → descriptors map to file objects → file objects reference dentries (via f_path) → dentries point to inodes → inodes belong to superblocks.
| Object | Represents | Unique Identifier | Cached? | Multiple per File? |
|---|---|---|---|---|
| super_block | Mounted file system | (device, mount point) | While mounted | No (one per mount) |
| inode | File metadata | inode number | Yes (inode cache) | No (one per file) |
| dentry | Name in directory | (parent, name) pair | Yes (dcache) | Yes (hard links) |
| file | Open file instance | fd per process | No | Yes (each open()) |
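The split between file objects and inodes can be observed from user space. The sketch below is an illustration, not kernel code: it opens the same path twice and checks that both descriptors report the same inode while keeping separate positions. The function name `two_opens_share_inode` is made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open `path` twice and report what the two descriptors share.
 * Returns 1 if both opens refer to the same inode (same st_dev/st_ino)
 * while keeping independent file positions, 0 if not, -1 on error.
 */
int two_opens_share_inode(const char *path)
{
    int fd1 = open(path, O_RDONLY);
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    struct stat st1, st2;
    int result = -1;
    if (fstat(fd1, &st1) == 0 && fstat(fd2, &st2) == 0) {
        /* Move only the first descriptor's position. */
        off_t pos1 = lseek(fd1, 1, SEEK_SET);
        off_t pos2 = lseek(fd2, 0, SEEK_CUR);
        int same_inode = st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
        int independent = (pos1 == 1 && pos2 == 0);
        result = same_inode && independent;
    }
    close(fd1);
    close(fd2);
    return result;
}
```

Each open() produced its own struct file, so seeking through one descriptor leaves the other at offset 0, yet fstat() on either one reports the same inode.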
The superblock represents an entire mounted file system. When you mount a file system, the kernel creates a struct super_block that holds all file-system-wide information.
Purpose:
Key Fields of struct super_block:
```c
struct super_block {
    /* List membership */
    struct list_head s_list;                      /* All superblocks in system */

    /* Device and type identification */
    dev_t s_dev;                                  /* Device identifier */
    unsigned char s_blocksize_bits;               /* Block size in bits */
    unsigned long s_blocksize;                    /* Block size in bytes */
    loff_t s_maxbytes;                            /* Max file size */
    struct file_system_type *s_type;              /* File system type (ext4, xfs, etc.) */

    /* Operations - file system provides these */
    const struct super_operations *s_op;          /* Superblock operations */
    const struct dquot_operations *dq_op;         /* Quota operations */
    const struct export_operations *s_export_op;  /* NFS export ops */

    /* Mount options and flags */
    unsigned long s_flags;                        /* Mount flags (MS_RDONLY etc.) */
    unsigned long s_iflags;                       /* Internal flags */
    unsigned long s_magic;                        /* Magic number (FS identifier) */

    /* Root of this file system */
    struct dentry *s_root;                        /* Root dentry */

    /* Unmounting support */
    struct rw_semaphore s_umount;                 /* Unmount semaphore */
    int s_count;                                  /* Reference count */
    atomic_t s_active;                            /* Active reference count */

    /* Inode management */
    struct list_head s_inodes;                    /* All inodes for this FS */
    spinlock_t s_inode_list_lock;
    struct list_head s_inodes_wb;                 /* Writeback inodes */

    /* Block device (if any) */
    struct block_device *s_bdev;                  /* Associated block device */
    struct backing_dev_info *s_bdi;               /* Backing device info */
    struct mtd_info *s_mtd;                       /* MTD for flash-based FS */

    /* Filesystem-specific data */
    void *s_fs_info;                              /* FS-private superblock info */

    /* Timestamps and limits */
    time64_t s_time_min;                          /* Minimum timestamp */
    time64_t s_time_max;                          /* Maximum timestamp */
    u32 s_time_gran;                              /* Timestamp granularity (ns) */

    /* Freezing for snapshots */
    int s_frozen;                                 /* Freeze state */
    struct percpu_rw_semaphore s_writers;         /* Writers handling */

    /* ... more fields ... */
};
```
Superblock Lifecycle:
Creation: When a file system is mounted, VFS calls the file system's mount() method. The file system reads its on-disk superblock, creates struct super_block, and populates s_op with its operations.
Active Use: While mounted, the superblock exists in memory. All inodes, dentries, and files trace back to this superblock. The s_active count tracks active references.
Sync: Periodically, or on sync command, VFS calls s_op->sync_fs() to flush metadata.
Unmount: When unmounting, VFS calls s_op->put_super(). The file system writes any dirty data and frees resources.
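Some file-system-wide properties are visible from user space through the POSIX statvfs() call. The following is a minimal sketch, assuming the reported block size and read-only flag correspond roughly to what the mounted file system keeps in its superblock; the helper name `fs_summary` is made up for this example.

```c
#include <assert.h>
#include <sys/statvfs.h>

/*
 * Query file-system-wide properties for the file system containing `path`.
 * Stores the block size and read-only status in the output parameters.
 * Returns 0 on success, -1 on error.
 */
int fs_summary(const char *path, unsigned long *block_size, int *read_only)
{
    struct statvfs sv;
    if (statvfs(path, &sv) != 0)
        return -1;
    *block_size = sv.f_bsize;                    /* block size in bytes */
    *read_only = (sv.f_flag & ST_RDONLY) != 0;   /* mounted read-only? */
    return 0;
}
```

Calling `fs_summary("/", &bs, &ro)` reports on whatever file system holds the root directory; any path on the same mount gives the same answer, because the values belong to the file system as a whole rather than to one file.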
File System Private Data:
The s_fs_info pointer is critical. Each file system stores its own superblock data here:
```c
/* ext4 stores its private superblock info here */
struct ext4_sb_info *sbi = EXT4_SB(sb);  // Macro accessing s_fs_info
```
For ext4, this includes the journal handle, block group descriptors, bitmap caches, and countless other ext4-specific fields.
Don't confuse the VFS 'struct super_block' (in-memory, kernel structure) with the on-disk superblock (file system specific, stored at a fixed location on the disk). ext4 has an 'ext4_super_block' on disk at byte 1024 (or other locations for backup copies). At mount time, ext4 reads this on-disk structure and uses it to populate both its private 's_fs_info' and the VFS 'super_block'.
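To make the on-disk side concrete, here is a small user-space sketch that checks a raw image buffer for the ext2/3/4 magic number. It assumes the primary superblock at byte 1024 and the 16-bit little-endian magic field 56 bytes into it, with value 0xEF53. The function and macro names are made up for this example, and a real mount performs far more validation than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT4_SB_OFFSET    1024u    /* primary superblock starts at byte 1024 */
#define EXT4_MAGIC_OFFSET 56u      /* magic field offset within the superblock */
#define EXT4_SUPER_MAGIC  0xEF53u

/* Return 1 if the image carries the ext2/3/4 magic number, else 0. */
int has_ext4_magic(const unsigned char *image, size_t len)
{
    size_t off = EXT4_SB_OFFSET + EXT4_MAGIC_OFFSET;
    if (len < off + 2)
        return 0;   /* image too short to contain the field */

    /* On-disk fields are little-endian; assemble the value bytewise. */
    uint16_t magic = (uint16_t)(image[off] | (image[off + 1] << 8));
    return magic == EXT4_SUPER_MAGIC;
}
```

Assembling the value byte by byte keeps the check independent of the host's endianness, which matters because the on-disk format is fixed while the in-memory super_block uses native types.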
The inode (index node) represents a file's metadata—everything about a file except its name and its data content. When you run ls -l or call stat(), you're reading inode information.
Key Insight: A file's name is NOT part of the inode. Names live in directory entries (dentries). This separation enables hard links—multiple names pointing to one inode.
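The name/inode separation is easy to observe with link() and stat(). The sketch below creates a file, gives it a second name, and confirms that both names resolve to one inode whose link count is 2. The function name `hard_link_count` is made up for this example, and it assumes both paths live on the same file system.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file as `first_name`, add `second_name` as a hard link, and
 * verify that both names resolve to the same inode.
 * Returns the link count seen through the second name, or -1 on error.
 */
long hard_link_count(const char *first_name, const char *second_name)
{
    /* Create the file under its first name (fails if it already exists). */
    int fd = open(first_name, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    /* Add a second directory entry pointing at the same inode. */
    if (link(first_name, second_name) != 0)
        return -1;

    struct stat a, b;
    if (stat(first_name, &a) != 0 || stat(second_name, &b) != 0)
        return -1;
    if (a.st_dev != b.st_dev || a.st_ino != b.st_ino)
        return -1;   /* would mean the names refer to different inodes */
    return (long)b.st_nlink;
}
```

Neither name is the "real" one: removing either with unlink() only drops the link count to 1, and the inode survives until the last name and the last open reference are gone.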
What Inode Contains:
```c
struct inode {
    /* Mode and file type */
    umode_t i_mode;                          /* File type and permissions */
    unsigned short i_opflags;                /* Optimization flags */
    kuid_t i_uid;                            /* Owner user ID */
    kgid_t i_gid;                            /* Owner group ID */
    unsigned int i_flags;                    /* Inode flags (immutable, append, etc.) */

    /* Access control */
    const struct inode_operations *i_op;     /* Inode operations */
    struct super_block *i_sb;                /* Superblock (file system) */
    struct address_space *i_mapping;         /* Page cache mapping */

    /* Identification */
    unsigned long i_ino;                     /* Inode number (unique per FS) */
    dev_t i_rdev;                            /* Device number (for dev files) */

    /* Size and blocks */
    loff_t i_size;                           /* File size in bytes */
    blkcnt_t i_blocks;                       /* Blocks allocated */
    unsigned int i_blkbits;                  /* Block size bits */

    /* Timestamps */
    struct timespec64 i_atime;               /* Last access time */
    struct timespec64 i_mtime;               /* Last modification time */
    struct timespec64 i_ctime;               /* Last metadata change time */

    /* Links and references */
    unsigned int i_nlink;                    /* Hard link count */
    atomic_t i_count;                        /* Reference count */
    atomic_t i_writecount;                   /* Writers count */

    /* State flags */
    unsigned long i_state;                   /* State flags (dirty, etc.) */

    /* File operations (set by inode type) */
    const struct file_operations *i_fop;     /* Default file operations */

    /* Locking */
    struct rw_semaphore i_rwsem;             /* Read-write semaphore */
    struct mutex i_mutex;                    /* Inode mutex (older code) */

    /* Directory-specific */
    struct hlist_head i_dentry;              /* Dentries referencing this inode */

    /* File system private data */
    void *i_private;                         /* FS-specific private data */

    /* ... more fields ... */
};
```
Inode Lifecycle:
Allocation: File systems implement super_operations->alloc_inode() to allocate inodes. They typically allocate a larger structure with the VFS inode embedded:
```c
struct ext4_inode_info {
    __le32 i_data[15];       /* Block pointers */
    __u32 i_dtime;           /* Deletion time */
    /* ... ext4-specific fields ... */
    struct inode vfs_inode;  /* VFS inode embedded */
};
```
Creation: When a new file is created, the file system assigns an inode number, initializes fields, and marks it dirty.
Reading from Disk: When an existing file is accessed, iget() family functions look up the inode in cache or read from disk.
Modification: Changes to metadata (chmod, chown, touch) modify inode fields and mark it dirty (mark_inode_dirty()).
Writeback: Periodically, dirty inodes are written to disk via super_operations->write_inode().
Eviction: When the inode cache is under memory pressure, or when the last reference to an inode whose i_nlink has reached 0 is dropped, super_operations->evict_inode() is called to clean up.
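The embedding described in the allocation step works because a pointer to the inner structure can be converted back to the enclosing one with pointer arithmetic, which is the idea behind the kernel's container_of() macro. The following user-space sketch uses toy stand-in types; `toy_inode`, `toy_fs_inode_info`, and `TOY_I` are made up for this example and are not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the generic VFS inode. */
struct toy_inode {
    unsigned long i_ino;
};

/* Toy stand-in for a file-system-specific inode wrapper. */
struct toy_fs_inode_info {
    unsigned int fs_flags;        /* hypothetical FS-private field */
    struct toy_inode vfs_inode;   /* generic part, embedded by value */
};

/* Recover the wrapper from a pointer to its embedded generic inode. */
static inline struct toy_fs_inode_info *TOY_I(struct toy_inode *inode)
{
    return (struct toy_fs_inode_info *)
        ((char *)inode - offsetof(struct toy_fs_inode_info, vfs_inode));
}
```

Generic code only ever holds the `struct toy_inode *`, while file-system code calls `TOY_I()` to reach its private fields. One allocation serves both layers, and no extra pointer has to be stored or dereferenced.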
Inodes are heavily cached. Reading an inode from disk costs milliseconds; reading from the cache costs nanoseconds. The kernel maintains an inode hash table indexed by (superblock, inode_number). On a busy server, millions of inodes may be cached, with cache hits dramatically reducing disk I/O.
State Flags (i_state):
| Flag | Meaning |
|---|---|
| I_DIRTY_SYNC | Inode metadata needs sync |
| I_DIRTY_DATASYNC | Data pages need sync |
| I_DIRTY_PAGES | Has dirty pages in page cache |
| I_NEW | Inode is new, being initialized |
| I_WILL_FREE | Inode scheduled for freeing |
| I_FREEING | Inode being freed |
| I_CLEAR | Inode cleared, awaiting destruction |
| I_SYNC | Sync in progress |
| I_REFERENCED | Recently referenced (for eviction) |
The dentry (directory entry) represents a component in a pathname—a name like "alice", "documents", or "file.txt". Dentries form a tree that mirrors the directory structure, caching the results of pathname lookups.
Key Insight: Dentries are primarily a caching mechanism. They cache the name→inode mapping so that repeated path lookups don't require disk reads.
Why Dentries Are Separate from Inodes:
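Keeping names in dentries rather than in inodes lets one inode have several names (hard links), lets failed lookups be cached as negative dentries, and gives every name a parent pointer (d_parent) that supports .. traversal.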
```c
struct dentry {
    /* Dentry flags and state */
    unsigned int d_flags;                    /* Dentry flags */
    seqcount_spinlock_t d_seq;               /* Sequence count for RCU */

    /* Hash chain */
    struct hlist_bl_node d_hash;             /* Hash table bucket list */

    /* Parent dentry */
    struct dentry *d_parent;                 /* Parent dentry */

    /* Name of this entry */
    struct qstr d_name;                      /* Name (quick string) */
    /*
     * struct qstr contains:
     *   unsigned int hash;           // Hash value
     *   unsigned int len;            // Length
     *   const unsigned char *name;   // The name bytes
     */

    /* The inode this dentry refers to */
    struct inode *d_inode;                   /* Associated inode (NULL for negative) */

    /* Short name storage */
    unsigned char d_iname[DNAME_INLINE_LEN]; /* Inline name storage */

    /* Reference counting */
    struct lockref d_lockref;                /* Lock and refcount */

    /* Operations */
    const struct dentry_operations *d_op;    /* Dentry operations */

    /* Superblock */
    struct super_block *d_sb;                /* Superblock of this dentry */

    /* FS-specific data */
    void *d_fsdata;                          /* FS-private data */

    /* Child and sibling management */
    struct list_head d_child;                /* Link in parent's d_subdirs */
    struct list_head d_subdirs;              /* Children of this dentry */

    /* Alias list (for inodes with multiple dentries) */
    union {
        struct hlist_node d_alias;             /* Link in inode's i_dentry list */
        struct hlist_bl_node d_in_lookup_hash; /* For lookup */
        struct rcu_head d_rcu;                 /* For RCU freeing */
    } d_u;

    /* For time-based expiration (autofs, NFS) */
    unsigned long d_time;                    /* Revalidation time */
};
```
Dentry States:
| State | d_inode | d_lockref.count | Description |
|---|---|---|---|
| Used | valid | > 0 | Actively referenced by VFS or processes |
| Unused | valid | = 0 | Valid but not currently referenced; cached for reuse |
| Negative | NULL | ≥ 0 | Represents "name not found"; caches failed lookups |
Negative Dentries:
Negative dentries are powerful for performance. Consider:
```shell
# stat() returns "file not found"
$ stat /etc/nonexistent
```
The first time this runs, VFS must read the /etc directory (from disk if it is not already cached). On a second call, VFS finds a negative dentry cached for "nonexistent" under /etc and returns ENOENT immediately, with no disk I/O.
This is crucial for compilation: build systems frequently check if files exist (hundreds of header file searches per compile). Negative dentry caching prevents repeated disk reads for missing files.
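The effect of negative caching can be sketched with a toy lookup cache in user space. This is an illustration only: the types, the fixed-size table, and the pretend directory in which only "passwd" exists are all assumptions made for this example. The real dcache is a global, RCU-protected hash table.

```c
#include <assert.h>
#include <string.h>

#define TOY_CACHE_SLOTS 8
#define TOY_NAME_LEN    32   /* names are assumed shorter than this */

struct toy_dentry {
    int  in_use;
    char name[TOY_NAME_LEN];
    long ino;                /* 0 means negative: name known not to exist */
};

struct toy_dir {
    struct toy_dentry cache[TOY_CACHE_SLOTS];
    int slow_lookups;        /* how often the "disk" had to be consulted */
};

/* Hypothetical stand-in for reading the directory from disk. */
static long slow_disk_lookup(struct toy_dir *dir, const char *name)
{
    dir->slow_lookups++;
    return strcmp(name, "passwd") == 0 ? 42 : 0;   /* only "passwd" exists */
}

/* Return the inode number for `name`, or 0 if it does not exist. */
long toy_lookup(struct toy_dir *dir, const char *name)
{
    for (int i = 0; i < TOY_CACHE_SLOTS; i++)
        if (dir->cache[i].in_use && strcmp(dir->cache[i].name, name) == 0)
            return dir->cache[i].ino;   /* cache hit, positive or negative */

    long ino = slow_disk_lookup(dir, name);

    /* Cache the result in the first free slot, even when it is a miss. */
    for (int i = 0; i < TOY_CACHE_SLOTS; i++) {
        if (!dir->cache[i].in_use) {
            dir->cache[i].in_use = 1;
            strncpy(dir->cache[i].name, name, TOY_NAME_LEN - 1);
            dir->cache[i].name[TOY_NAME_LEN - 1] = '\0';
            dir->cache[i].ino = ino;
            break;
        }
    }
    return ino;
}
```

Looking up a missing name twice increments `slow_lookups` only once: the second call is answered by the cached negative entry, which mirrors how a repeated stat() on a missing file avoids a second directory read.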
The dcache is one of the most performance-critical caches in the kernel. It's a hash table indexed by (parent dentry, name hash). On production systems, the dcache can contain millions of entries and satisfy the vast majority of pathname lookups without any disk I/O. The 'd_lru' list provides LRU ordering for eviction under memory pressure.
Dentry Lifecycle:
Lookup/Creation: During pathname resolution, VFS calls inode_operations->lookup() for each component. The file system returns a dentry (possibly newly allocated via d_alloc()) with its inode set.
Caching: After successful lookup, the dentry is inserted into the dcache hash table. Subsequent lookups find it there.
Reference Counting: Each dget() increments the count; dput() decrements it. When count reaches 0, the dentry becomes "unused" but remains cached.
LRU Management: Unused dentries live on an LRU list. Under memory pressure, the oldest unused dentries are evicted.
Invalidation: File systems can mark dentries invalid (e.g., NFS after a timeout). d_invalidate() drops cached children.
Deletion: When d_delete() is called (e.g., file unlinked), the dentry transitions to negative state or is removed entirely.
The file object represents an open file—not the file itself, but a specific instance of opening it. When a process calls open(), a new struct file is created. This object tracks the current read/write position and access mode.
Key Distinction:
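Inode: the file itself (one per file, however many times it is opened)
File object: one open instance of the file (one per open() call)
Two processes opening the same file get separate struct file objects (with independent positions) but share the same inode.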
```c
struct file {
    /* Path information */
    struct path f_path;                      /* dentry and vfsmount */
    /*
     * struct path {
     *     struct vfsmount *mnt;
     *     struct dentry *dentry;
     * };
     */

    /* Inode (shortcut) */
    struct inode *f_inode;                   /* Cached inode pointer */

    /* Operations */
    const struct file_operations *f_op;      /* File operations */

    /* Reference counting */
    atomic_long_t f_count;                   /* Reference count */

    /* Flags and mode */
    unsigned int f_flags;                    /* O_RDONLY, O_WRONLY, O_RDWR, etc. */
    fmode_t f_mode;                          /* Access mode (FMODE_READ, etc.) */

    /* Position */
    loff_t f_pos;                            /* Current file position (offset) */
    struct mutex f_pos_lock;                 /* Protects f_pos */

    /* Security/credentials */
    const struct cred *f_cred;               /* Credentials at open time */

    /* Locking */
    struct fown_struct f_owner;              /* Owner for signals */

    /* Read-ahead state */
    struct file_ra_state f_ra;               /* Read-ahead state */

    /* File-specific data */
    void *private_data;                      /* File-specific private data */

    /* Epoll and async notification */
    struct list_head f_ep_links;             /* Epoll link list */
    struct list_head f_tfile_llink;          /* Thread-local list */

    /* Address space operations (for mmap) */
    struct address_space *f_mapping;         /* Page cache mapping */

    /* Write hints for block layer */
    enum rw_hint f_write_hint;               /* Write lifetime hint */

    /* ... more fields ... */
};
```
File Object Lifecycle:
Creation via open():
- open() system call triggers pathname resolution → dentry → inode
- Allocate struct file (from a slab cache)
- Initialize f_path, f_inode, f_op, f_mode, f_flags
- Call file_operations->open() for file system notification

Usage (read/write/etc.):
- read() and write() advance f_pos
- lseek() directly modifies f_pos

Duplication (dup, fork):
- dup(fd) creates new fd pointing to same struct file (refcount++)
- fork() copies fd table; child's fds point to same struct file objects

Closing:
- close(fd) removes fd from process table, calls fput() to decrement refcount
- When the refcount reaches 0, file_operations->release() is called

Access Mode Flags (f_mode):
| Flag | Meaning |
|---|---|
| FMODE_READ | Open for reading |
| FMODE_WRITE | Open for writing |
| FMODE_EXEC | Open for execution |
| FMODE_PREAD | pread() allowed |
| FMODE_PWRITE | pwrite() allowed |
| FMODE_LSEEK | lseek() allowed |
Open Flags (f_flags):
| Flag | Meaning |
|---|---|
| O_RDONLY | Read only |
| O_WRONLY | Write only |
| O_RDWR | Read and write |
| O_APPEND | Append mode |
| O_NONBLOCK | Non-blocking I/O |
| O_DIRECT | Direct I/O (bypass cache) |
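From user space, the open flags of a descriptor can be read back with fcntl(fd, F_GETFL). The sketch below shows one way to query them; the helper names `access_mode` and `is_append` are made up for this example. Because O_RDONLY is defined as 0, the access mode has to be extracted with the O_ACCMODE mask rather than tested as a single bit.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Return the access mode (O_RDONLY, O_WRONLY or O_RDWR) of fd, or -1 on error. */
int access_mode(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    return flags < 0 ? -1 : (flags & O_ACCMODE);
}

/* Return 1 if fd is in append mode, 0 if not, -1 on error. */
int is_append(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    return flags < 0 ? -1 : (flags & O_APPEND) != 0;
}
```

A descriptor obtained with dup() reports the same flags as the original, since both refer to the same open file object and the status flags are stored there rather than in the descriptor.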
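After fork(), parent and child share struct file objects. If the parent reads 100 bytes, the child's file position also advances, which is often surprising. To get independent positions, the child must open the file again itself (each open() creates a new struct file), or both processes can use pread() and pwrite(), which take an explicit offset and leave the shared position untouched. O_CLOEXEC does not help here: it only closes the descriptor across exec(), not across fork().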
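The shared position can be demonstrated with a short user-space program. This is a sketch under simple assumptions (a regular file at least as long as the requested read); the function name `parent_offset_after_child_read` is made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Open `path`, fork, let the child read up to n bytes (at most 256), and
 * return the file position the parent observes afterwards (-1 on error).
 */
off_t parent_offset_after_child_read(const char *path, size_t n)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: advance the position through its inherited descriptor. */
        char buf[256];
        size_t want = n < sizeof(buf) ? n : sizeof(buf);
        ssize_t got = read(fd, buf, want);
        _exit(got < 0 ? 1 : 0);
    }

    /* Parent: wait for the child, then ask where the position now is. */
    int status;
    off_t pos = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
        WEXITSTATUS(status) == 0)
        pos = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return pos;
}
```

The parent never calls read(), yet the offset it observes has moved by the number of bytes the child consumed, because both descriptors refer to the same open file object.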
VFS performance depends heavily on caching. Let's examine the caching strategies for each object type.
Dentry Cache (dcache):
The dcache is implemented as a hash table where the key is (parent_dentry, name_hash). This enables O(1) lookup during pathname resolution.
Structure:
```c
/* Global hash table */
static struct hlist_bl_head *dentry_hashtable;

/* Hash function */
static inline struct hlist_bl_head *
d_hash(unsigned int hash)
{
    return dentry_hashtable + (hash >> d_hash_shift);
}
```
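The right shift in d_hash() picks a bucket from the high-order bits of the hash. The arithmetic can be sketched as follows, assuming a 32-bit hash and a table with a power-of-two number of buckets; the function name `bucket_index` and the table sizes are illustrative, and the kernel derives its actual shift value at boot.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a bucket from the top `bits` bits of a 32-bit hash.
 * With 2^bits buckets the shift is 32 - bits, so the result is always
 * in the range [0, 2^bits). Requires 1 <= bits <= 31.
 */
static inline uint32_t bucket_index(uint32_t hash, unsigned int bits)
{
    unsigned int shift = 32u - bits;
    return hash >> shift;
}
```

With a 1024-bucket table (`bits` = 10), the shift is 22, so a hash of 0x80000000 lands in bucket 512 and 0xFFFFFFFF in bucket 1023. No modulo or bounds check is needed, because the shift alone keeps the index inside the table.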
LRU Lists:
Unused dentries (refcount = 0) are placed on an LRU list. Under memory pressure, the slab shrinker scans this list and frees the oldest entries.
RCU for Lock-free Lookup:
The dcache uses Read-Copy-Update (RCU) for most lookups. Readers don't need locks—they read consistent snapshots while writers carefully update pointers. This enables massive concurrency.
Sizing:
The dcache grows dynamically based on available memory. On a 64GB system, it might cache 10+ million dentries. Cache hit rates of 99%+ are normal on stable workloads.
All these caches work together. When a file is modified, the page cache page is dirtied, the inode's mtime is updated (marking it dirty), and the dentry remains valid. When a file is deleted, the dentry goes negative, the inode's nlink decrements, and cached pages are eventually freed.
Let's trace what happens when a process opens and reads a file, seeing how all VFS objects participate:
```c
// User program
int fd = open("/home/alice/data.txt", O_RDONLY);
char buf[1024];
ssize_t n = read(fd, buf, sizeof(buf));
close(fd);
```
Step 1: open() - Pathname Resolution:
- Start at the root dentry (/) from current mount namespace
- For each component (home, alice, data.txt), look it up in the dcache; on a miss, call the parent inode's lookup method, beginning with root_inode->i_op->lookup() → get dentry from filesystem

Step 2: open() - File Object Creation:
- Allocate struct file from slab cache
- Set f_path = {.mnt = current_mount, .dentry = data_txt_dentry}
- Set f_inode = data_txt_dentry->d_inode
- Set f_op = inode->i_fop (file operations for this file type)
- Set f_mode = FMODE_READ (based on O_RDONLY)
- Set f_pos = 0 (start of file)
- Call f_op->open() if defined (ext4_file_open for ext4)

Step 3: read() - Data Retrieval:
- Map fd to its struct file
- Read the current position from f_pos (0 in this case)
- Look for the data in the page cache; on a miss, a_ops->readpage() → block I/O
- Update f_pos += bytes_read (now 1024 if full read)

Step 4: close() - Cleanup:
- fput() → decrement f_count
- If f_count reaches 0:
  - Call f_op->release() if defined
  - Return struct file back to slab

On a warm system, most of these steps complete without any disk I/O: dentries are in dcache, inode is in icache, data pages are in page cache. An open-read-close cycle that seems like it would require multiple disk accesses often completes entirely from memory in microseconds.
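The user-side half of this trace can be written with error handling and a check of the position update. The sketch below is one possible version; the function name `read_prefix` is made up for this example, and lseek(fd, 0, SEEK_CUR) is used only to read back the current position.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes from the start of `path` into buf.
 * On success returns the byte count and stores the file position after
 * the read in *pos_after; returns -1 on error.
 */
ssize_t read_prefix(const char *path, char *buf, size_t len, off_t *pos_after)
{
    int fd = open(path, O_RDONLY);          /* path lookup + new open file */
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, len);         /* advances the file position */
    if (n >= 0)
        *pos_after = lseek(fd, 0, SEEK_CUR);

    close(fd);                              /* drops the last reference */
    return n;
}
```

For a file holding at least `len` bytes, the returned count and `*pos_after` are equal, which is the "f_pos += bytes_read" step seen from user space.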
We've explored the four fundamental VFS objects in depth. Here are the key takeaways:
What's Next:
With VFS objects understood, we'll explore file system registration—how new file systems make themselves known to the kernel, and how the kernel discovers and invokes them during mount operations.
You now understand the four fundamental VFS objects: superblock, inode, dentry, and file. These structures form the backbone of the VFS layer, enabling uniform file access across diverse file systems while maintaining high performance through sophisticated caching.