When you type ls in a Unix terminal or open a folder in Windows Explorer, the operating system must answer a fundamental question: How do we find files by name? Behind every directory listing lies a data structure that maps human-readable names to the metadata needed to access file contents.
The linear list is the most intuitive solution to this problem. Imagine a directory as a notebook where each page contains one file's name and information. To find a file, you flip through pages one by one until you find the right name. Simple, straightforward, and surprisingly effective—until your notebook has thousands of pages.
This page provides an exhaustive examination of linear list directory implementation: how it works at the byte level, why it was the original choice for many file systems, where it excels, and why modern systems have largely moved beyond it for high-performance scenarios.
By the end of this page, you will understand: (1) The exact structure of linear list directories, (2) Algorithms for insertion, deletion, and lookup operations, (3) Time and space complexity analysis, (4) Real-world implementations in FAT and early Unix, (5) Optimization techniques including sorted lists and caching, and (6) When linear lists remain appropriate in modern systems.
A linear list directory stores directory entries as a sequential collection—essentially an array or linked list of entries. Each entry contains a filename and either the file's metadata directly or a pointer to where metadata is stored (like an inode number).
Core Principle:
A directory is a special file whose content is a sequence of (name, locator) pairs. The locator identifies where to find the file's actual data and metadata.
Let's dissect this concept with precision:
Visual Representation:
Consider a directory containing three files: readme.txt, data.bin, and config.yaml. In a linear list structure:
```
Directory File Content (on disk):

┌───────────────────────────────────┐
│ Entry 0                           │
│   ├── Name:   "readme.txt"        │
│   ├── Inode:  42                  │
│   └── Status: VALID               │
├───────────────────────────────────┤
│ Entry 1                           │
│   ├── Name:   "data.bin"          │
│   ├── Inode:  87                  │
│   └── Status: VALID               │
├───────────────────────────────────┤
│ Entry 2                           │
│   ├── Name:   "config.yaml"       │
│   ├── Inode:  103                 │
│   └── Status: VALID               │
├───────────────────────────────────┤
│ Entry 3                           │
│   ├── Name:   (empty or garbage)  │
│   ├── Inode:  0                   │
│   └── Status: END_OF_DIRECTORY    │
└───────────────────────────────────┘

To find "data.bin":
  1. Read Entry 0, compare "readme.txt" ≠ "data.bin" → continue
  2. Read Entry 1, compare "data.bin" = "data.bin" → FOUND! Return inode 87
```

The term 'linear' refers to the O(n) time complexity of search operations. To find a file, you must potentially examine every entry. This linear relationship between directory size and search time defines the approach—and its limitations.
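To make the mechanism concrete, here is a minimal in-memory sketch of a linear list directory and its O(n) lookup. The `dir_entry` layout, the 28-byte name field, and the use of locator 0 for "unused" are illustrative choices for this sketch, not taken from any particular file system.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Illustrative entry: a name plus a locator (for example, an inode number).
struct dir_entry {
    char     name[28];   // fixed-size name field, chosen arbitrarily for this sketch
    uint32_t locator;    // 0 means "unused/deleted" in this sketch
};

// Linear lookup: scan entries in order until the name matches or the list ends.
uint32_t lookup(const struct dir_entry *entries, size_t count, const char *target) {
    for (size_t i = 0; i < count; i++) {             // O(n) scan
        if (entries[i].locator != 0 &&
            strcmp(entries[i].name, target) == 0) {
            return entries[i].locator;               // found
        }
    }
    return 0;                                        // not found
}

int main(void) {
    struct dir_entry dir[] = {
        { "readme.txt",  42  },
        { "data.bin",    87  },
        { "config.yaml", 103 },
    };
    printf("data.bin -> inode %u\n", lookup(dir, 3, "data.bin"));   // prints 87
    return 0;
}
```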
The exact structure of directory entries varies by file system, but understanding common patterns reveals the engineering trade-offs involved. Let's examine real implementations.
Fixed-Length Entries (Classic Approach):
Early file systems used fixed-size entries for simplicity:
```c
// Original Unix V7 directory entry (16 bytes)
struct direct {
    unsigned short d_ino;      // 2 bytes: inode number (0 = deleted)
    char           d_name[14]; // 14 bytes: filename (null-padded)
};
// Total: 16 bytes per entry
// Max filename: 14 characters
// Max files per directory: limited by the directory file's size

// DOS FAT 8.3 directory entry (32 bytes)
struct fat_dirent {
    char     name[8];       // 8 bytes: filename (space-padded)
    char     ext[3];        // 3 bytes: extension (space-padded)
    uint8_t  attr;          // 1 byte: attributes (hidden, system, etc.)
    uint8_t  reserved[10];  // 10 bytes: reserved/extended attributes
    uint16_t time;          // 2 bytes: modification time
    uint16_t date;          // 2 bytes: modification date
    uint16_t cluster;       // 2 bytes: first cluster number
    uint32_t size;          // 4 bytes: file size in bytes
};
// Total: 32 bytes per entry
// Note: FAT stores metadata IN the directory entry, not separately

// Example: Reading a Unix V7 directory
void list_directory(int dir_fd) {
    struct direct entry;
    while (read(dir_fd, &entry, sizeof(entry)) == sizeof(entry)) {
        if (entry.d_ino != 0) {  // 0 means deleted/empty
            // d_name may not be null-terminated if it uses all 14 chars
            printf("%.14s -> inode %u\n", entry.d_name, entry.d_ino);
        }
    }
}
```

Variable-Length Entries (Modern Approach):
Fixed-length entries waste space for short names and limit long names. Modern file systems use variable-length entries:
```c
// Linux ext2/ext3/ext4 directory entry
struct ext4_dir_entry_2 {
    __le32 inode;      // 4 bytes: inode number
    __le16 rec_len;    // 2 bytes: total entry length (for alignment)
    __u8   name_len;   // 1 byte: actual name length
    __u8   file_type;  // 1 byte: file type (regular, directory, etc.)
    char   name[];     // variable: filename (NOT null-terminated)
};
// Minimum size: 8 bytes + name
// rec_len includes padding to a 4-byte boundary

// Example directory block layout:
//
// Offset  0: [inode=2,   rec_len=12,   name_len=1,  type=DIR, name="."]
// Offset 12: [inode=2,   rec_len=12,   name_len=2,  type=DIR, name=".."]
// Offset 24: [inode=42,  rec_len=20,   name_len=10, type=REG, name="readme.txt"]
// Offset 44: [inode=87,  rec_len=16,   name_len=8,  type=REG, name="data.bin"]
// Offset 60: [inode=103, rec_len=4036, name_len=11, type=REG, name="config.yaml"]
//                        ^ rec_len extends to end of block (absorbs free space)

// Deletion with variable-length entries: merge rec_len into the previous entry
// Before deletion of "data.bin":
//   readme.txt: rec_len=20, data.bin: rec_len=16, config.yaml: rec_len=4036
// After deletion:
//   readme.txt: rec_len=36, config.yaml: rec_len=4036
//   (data.bin's space absorbed into readme.txt's rec_len)
```

| Aspect | Fixed-Length | Variable-Length |
|---|---|---|
| Space efficiency | Poor (wastes space for short names) | Good (only uses needed bytes + padding) |
| Name length limit | Strict (8.3 or 14 chars) | Flexible (255+ chars typical) |
| Entry location | Direct calculation: offset of entry n = n × entry_size | Must scan from start |
| Deletion | Mark inode as 0 | Merge rec_len with neighbor |
| Insertion | Find slot with inode=0 | Find gap with sufficient rec_len |
| Fragmentation | None (fixed slots) | Internal fragmentation in rec_len gaps |
In ext4, rec_len serves dual purposes: it enables variable-length names AND handles deletion without compaction. When an entry is deleted, its space is absorbed by the previous entry's rec_len. This avoids expensive block rewrites but creates internal fragmentation that's only reclaimed when new entries fit in the gaps.
Every directory supports fundamental operations: lookup, insertion, deletion, and enumeration. With linear lists, each operation has specific algorithms and complexity characteristics.
Operation 1: Lookup (Finding a File)
The most frequent operation—every file access begins with resolving a path, which requires looking up each component:
```c
// Linear search lookup for ext4-style directories
// Returns inode number, or 0 if not found

uint32_t directory_lookup(
    int         dir_fd,      // File descriptor for directory
    const char *target,      // Name to find
    size_t      target_len   // Length of target name
) {
    char    block[4096];     // Directory block buffer
    ssize_t bytes_read;
    size_t  offset;

    // Read directory blocks sequentially
    while ((bytes_read = read(dir_fd, block, sizeof(block))) > 0) {
        offset = 0;
        while (offset < (size_t)bytes_read) {
            struct ext4_dir_entry_2 *entry =
                (struct ext4_dir_entry_2 *)(block + offset);

            // Check for valid entry (inode != 0)
            if (entry->inode != 0) {
                // Compare names (not null-terminated!)
                if (entry->name_len == target_len &&
                    memcmp(entry->name, target, target_len) == 0) {
                    return entry->inode;  // FOUND
                }
            }

            // Move to next entry
            offset += entry->rec_len;

            // Sanity check: prevent infinite loop on corruption
            if (entry->rec_len == 0)
                break;
        }
    }

    return 0;  // NOT FOUND
}

// Time Complexity Analysis:
// - Best case:    O(1) - target is first entry
// - Worst case:   O(n) - target is last entry or doesn't exist
// - Average case: O(n/2) = O(n) - target is in the middle
//
// Space Complexity: O(1) - only one block buffer is needed
//
// I/O Complexity: O(n/b) disk reads, where b = entries per block
```

Operation 2: Insertion (Creating a File)
Insertion requires finding space for a new entry. With fixed-length entries, find a deleted slot. With variable-length entries, find a gap large enough:
```c
// Insertion for variable-length directory entries
// Returns 0 on success, -1 on failure

int directory_insert(
    int         dir_fd,
    const char *name,
    size_t      name_len,
    uint32_t    inode,
    uint8_t     file_type
) {
    // Required space: 8-byte header + name, rounded up to a 4-byte boundary
    size_t required = 8 + name_len;
    required = (required + 3) & ~(size_t)3;

    char    block[4096];
    ssize_t bytes_read;
    off_t   block_offset = 0;

    while ((bytes_read = read(dir_fd, block, sizeof(block))) > 0) {
        size_t offset = 0;

        while (offset < (size_t)bytes_read) {
            struct ext4_dir_entry_2 *entry =
                (struct ext4_dir_entry_2 *)(block + offset);

            // Calculate the space this entry actually uses
            size_t actual_len = 8 + entry->name_len;
            actual_len = (actual_len + 3) & ~(size_t)3;

            // Check for unused space hiding in this entry's rec_len
            size_t unused = entry->rec_len - actual_len;

            if (unused >= required) {
                // Split this entry: shrink the current one, add the new one after it
                size_t new_rec_len = entry->rec_len - actual_len;
                entry->rec_len = actual_len;

                // Create the new entry in the gap
                struct ext4_dir_entry_2 *new_entry =
                    (struct ext4_dir_entry_2 *)(block + offset + actual_len);
                new_entry->inode     = inode;
                new_entry->rec_len   = new_rec_len;
                new_entry->name_len  = name_len;
                new_entry->file_type = file_type;
                memcpy(new_entry->name, name, name_len);

                // Write the modified block back
                lseek(dir_fd, block_offset, SEEK_SET);
                write(dir_fd, block, bytes_read);
                return 0;  // SUCCESS
            }

            offset += entry->rec_len;
            if (entry->rec_len == 0)
                break;
        }

        block_offset += bytes_read;
    }

    // No space found - must extend the directory (allocate a new block)
    return extend_directory_and_insert(dir_fd, name, name_len, inode, file_type);
}

// Time Complexity: O(n) worst case to find space
// Note: first-fit strategy - stops at the first sufficient gap
```

Operation 3: Deletion (Removing a File)
Deletion must mark the entry as unused and reclaim space if possible:
```c
// Deletion in a variable-length directory
// Strategy: merge the deleted entry's space into the previous entry

int directory_delete(int dir_fd, const char *name, size_t name_len) {
    char    block[4096];
    ssize_t bytes_read;
    off_t   block_offset = 0;

    while ((bytes_read = read(dir_fd, block, sizeof(block))) > 0) {
        size_t offset = 0;
        struct ext4_dir_entry_2 *prev_entry = NULL;

        while (offset < (size_t)bytes_read) {
            struct ext4_dir_entry_2 *entry =
                (struct ext4_dir_entry_2 *)(block + offset);

            if (entry->inode != 0 &&
                entry->name_len == name_len &&
                memcmp(entry->name, name, name_len) == 0) {
                // FOUND - now delete it

                if (prev_entry != NULL) {
                    // Merge with the previous entry
                    prev_entry->rec_len += entry->rec_len;
                } else {
                    // First entry in the block - just mark it as deleted
                    entry->inode = 0;
                }

                // Write the modified block back
                lseek(dir_fd, block_offset, SEEK_SET);
                write(dir_fd, block, bytes_read);
                return 0;  // SUCCESS
            }

            prev_entry = entry;
            offset += entry->rec_len;
            if (entry->rec_len == 0)
                break;
        }

        block_offset += bytes_read;
    }

    return -1;  // NOT FOUND
}

// Time Complexity: O(n) to find the entry
// Note: no compaction - fragmentation accumulates
```

| Operation | Best Case | Average Case | Worst Case | I/O Operations |
|---|---|---|---|---|
| Lookup | O(1) | O(n) | O(n) | 1 to n/b reads |
| Insert | O(1) | O(n) | O(n) | 1 read + 1 write minimum |
| Delete | O(1) | O(n) | O(n) | 1 read + 1 write minimum |
| List all | O(n) | O(n) | O(n) | n/b reads exactly |
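The fourth fundamental operation, enumeration (what `ls` and `readdir` do), is a straightforward sequential walk over the same layout. Below is a sketch in the same ext4-entry style, assuming the `ext4_dir_entry_2` structure and block layout shown earlier.

```c
// Enumeration ("list all"): visit every valid entry exactly once.
// Uses the same ext4-style layout as directory_lookup() above.
void directory_list(int dir_fd) {
    char    block[4096];
    ssize_t bytes_read;

    while ((bytes_read = read(dir_fd, block, sizeof(block))) > 0) {
        size_t offset = 0;
        while (offset < (size_t)bytes_read) {
            struct ext4_dir_entry_2 *entry =
                (struct ext4_dir_entry_2 *)(block + offset);

            if (entry->inode != 0) {
                // Names are not null-terminated: print exactly name_len bytes
                printf("%.*s -> inode %u\n",
                       (int)entry->name_len, entry->name, (unsigned)entry->inode);
            }

            if (entry->rec_len == 0)   // corruption guard
                break;
            offset += entry->rec_len;
        }
    }
}

// Cost: exactly n/b block reads, sequential and prefetch-friendly,
// which is why listing a huge directory is slow but not pathological.
```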
Understanding when linear lists become problematic requires analyzing real performance characteristics. Let's examine the math and see concrete examples.
Quantifying the Lookup Problem:
Consider the I/O cost of resolving a path like /home/alice/projects/myapp/src/main.c:
```
Path: /home/alice/projects/myapp/src/main.c
Components: [home, alice, projects, myapp, src, main.c]

Assumptions:
- Block size: 4096 bytes
- Average entry size: 16 bytes (8 header + 8 name)
- Entries per block: ~256

Directory sizes (typical system):
  /            :   20 entries →  1 block  →  1 read
  /home        :  100 entries →  1 block  →  1 read  (avg 50 comparisons)
  /home/alice  :  500 entries →  2 blocks →  1 read  (avg 250 comparisons)
  projects     : 1000 entries →  4 blocks →  2 reads (avg 500 comparisons)
  myapp        :  200 entries →  1 block  →  1 read  (avg 100 comparisons)
  src          : 5000 entries → 20 blocks → 10 reads (avg 2500 comparisons)

Total for ONE file open:
  - Disk reads: 1 + 1 + 1 + 2 + 1 + 10 = 16 reads minimum
  - String comparisons: 50 + 250 + 500 + 100 + 2500 = 3400 comparisons

With HDD (5ms seek + 5ms latency):
  - 16 reads × 10ms = 160ms just for path resolution!

With SSD (0.1ms access):
  - 16 reads × 0.1ms = 1.6ms (much better, but still adds up)

Now imagine this multiplied by:
  - Compilation: 10,000 source files
  - Web server: 1000 requests/second
  - Package manager: 50,000 files to verify
```
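The per-component cost shows up directly in code. Below is a sketch of path resolution built on the `directory_lookup` routine from earlier; `ROOT_INODE` and `open_by_inode()` are hypothetical placeholders (a real kernel walks inodes internally rather than reopening file descriptors).

```c
#define ROOT_INODE 2   // assumed root inode number; 2 is the convention on ext-style systems

// Sketch: resolve an absolute path one component at a time.
// Builds on directory_lookup() above; open_by_inode() is a hypothetical
// helper that opens a directory given its inode number.
uint32_t resolve_path(const char *path) {
    uint32_t inode = ROOT_INODE;
    char buf[4096];
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *comp = strtok(buf, "/"); comp != NULL; comp = strtok(NULL, "/")) {
        int dir_fd = open_by_inode(inode);            // hypothetical helper
        if (dir_fd < 0)
            return 0;
        inode = directory_lookup(dir_fd, comp, strlen(comp));   // O(n) scan per component
        close(dir_fd);
        if (inode == 0)
            return 0;                                 // component not found
    }
    return inode;   // inode of the final path component
}
```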
The Large Directory Problem:

Linear lists become catastrophically slow for directories with many files:
| Directory Size | Avg Comparisons | Blocks to Read | Time (HDD) | Time (SSD) |
|---|---|---|---|---|
| 100 files | 50 | 1 | 10ms | 0.1ms |
| 1,000 files | 500 | 4 | 40ms | 0.4ms |
| 10,000 files | 5,000 | 40 | 400ms | 4ms |
| 100,000 files | 50,000 | 400 | 4 seconds | 40ms |
| 1,000,000 files | 500,000 | 4,000 | 40 seconds | 400ms |
Real-World Case Study: Mail Spool Directories
Email servers historically stored one file per message in spool directories. A busy mail server might accumulate hundreds of thousands of messages:
```
Scenario: Mail server with a linear list directory (/var/mail/spool/)
Growth rate: 10,000 new messages per day
Retention: 30 days before cleanup

Day 1:  10,000 files  →  0.4s to find a message → acceptable
Day 7:  70,000 files  →  2.8s to find a message → users complain
Day 14: 140,000 files →  5.6s to find a message → timeouts start
Day 30: 300,000 files → 12.0s to find a message → system unusable

The Problem:
- Each message delivery:  O(n) to check if the filename exists
- Each message retrieval: O(n) to find the message
- Listing new mail:       O(n) to enumerate the directory

Solution attempts:
1. Subdirectories by date: /var/mail/spool/2024/01/15/message_id
   - Reduces individual directory size
   - But now path resolution has more components

2. Subdirectories by hash: /var/mail/spool/a/b/c/message_abc123
   - First 3 chars of the hash create a directory tree
   - Each leaf directory has ~1000 files max

3. Use a file system with hash/tree directories (ext4 dir_index, XFS)
```

The Maildir format—storing each email as a separate file—exposed linear list limitations so severely that it drove widespread adoption of directory indexing features. Many system administrators learned about O(n) directory lookup the hard way when their mail servers ground to a halt.
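Solution attempt 2 can be sketched in a few lines: hash the message identifier and use leading characters of the hash to choose a subdirectory, so no single directory grows without bound. The path layout and the FNV-1a hash are illustrative choices, not what any particular mail server actually does.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// FNV-1a hash: any reasonably uniform string hash works here.
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

// Build a sharded path like /var/mail/spool/a/b/c/<message_id>
// so each leaf directory stays small enough for linear search.
void shard_path(const char *message_id, char *out, size_t out_len) {
    uint32_t h = fnv1a(message_id);
    const char *hex = "0123456789abcdef";
    snprintf(out, out_len, "/var/mail/spool/%c/%c/%c/%s",
             hex[(h >> 8) & 0xF], hex[(h >> 4) & 0xF], hex[h & 0xF],
             message_id);
}

int main(void) {
    char path[256];
    shard_path("message_abc123", path, sizeof(path));
    printf("%s\n", path);   // e.g. /var/mail/spool/7/e/2/message_abc123
    return 0;
}
```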
While the fundamental O(n) complexity cannot be eliminated without changing the data structure, several optimizations improve practical performance significantly.
Optimization 1: Directory Entry Caching (dcache)
The Linux kernel maintains a cache of recently resolved (name → inode) mappings:
```c
// Simplified view of the Linux dentry cache
struct dentry {
    struct qstr       d_name;    // Filename with hash
    struct inode     *d_inode;   // Associated inode
    struct dentry    *d_parent;  // Parent directory
    struct hlist_node d_hash;    // Hash table linkage
    struct list_head  d_lru;     // LRU list for reclaim
    // ... more fields
};

// Lookup flow with the cache:
//
// 1. Compute hash of (parent_dentry, filename)
// 2. Search the dcache hash table - O(1) average
// 3. If HIT: return the cached dentry
// 4. If MISS:
//    a. Read the directory from disk - O(n)
//    b. Create a dentry, add it to the cache
//    c. Return the new dentry

// Cache effectiveness depends on:
// - Temporal locality: same files accessed repeatedly
// - Spatial locality: files in the same directory accessed together
// - Memory pressure: the cache is evicted under memory pressure

// Check cache statistics:
// $ cat /proc/sys/fs/dentry-state
// Output: nr_dentry nr_unused age_limit want_pages dummy dummy
```
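As a rough userspace analogue of this flow, the sketch below caches (directory, name) → inode results in a small hash table and falls back to the on-disk `directory_lookup` from earlier only on a miss. The table size, hash function, and overwrite-on-collision eviction are illustrative simplifications of what the kernel actually does.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 1024   // illustrative fixed-size table

struct cache_slot {
    int      dir_fd;       // 0 = empty slot (illustrative convention)
    char     name[64];
    uint32_t inode;
};

static struct cache_slot cache[CACHE_SLOTS];

// Hash the (directory, name) pair into a slot index.
static size_t slot_for(int dir_fd, const char *name) {
    uint32_t h = (uint32_t)dir_fd * 31u;
    for (const char *p = name; *p; p++)
        h = h * 131u + (uint8_t)*p;
    return h % CACHE_SLOTS;
}

// Cached lookup: O(1) on a hit, falls back to the O(n) disk scan on a miss.
uint32_t cached_lookup(int dir_fd, const char *name) {
    size_t i = slot_for(dir_fd, name);

    if (cache[i].dir_fd == dir_fd && strcmp(cache[i].name, name) == 0)
        return cache[i].inode;                                        // HIT

    uint32_t inode = directory_lookup(dir_fd, name, strlen(name));    // MISS: O(n) scan
    if (inode != 0 && strlen(name) < sizeof(cache[i].name)) {
        cache[i].dir_fd = dir_fd;          // overwrite-on-collision "eviction"
        strcpy(cache[i].name, name);
        cache[i].inode = inode;
    }
    return inode;
}
```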
Optimization 2: Sorted Linear List

Maintaining entries in sorted order enables binary search:
```c
// Binary search on a sorted fixed-length directory
// Only works with fixed-length entries (needed for direct addressing)

uint32_t sorted_directory_lookup(
    int         dir_fd,
    const char *target,
    size_t      entry_size,   // Must be fixed
    size_t      num_entries
) {
    char    entry[entry_size];
    ssize_t low  = 0;
    ssize_t high = (ssize_t)num_entries - 1;   // signed so high can go below 0

    while (low <= high) {
        ssize_t mid = (low + high) / 2;

        // Direct seek to entry - O(1) with fixed-size entries
        lseek(dir_fd, mid * entry_size, SEEK_SET);
        read(dir_fd, entry, entry_size);

        char *entry_name = entry + 4;   // Skip the inode field
        int cmp = strcmp(entry_name, target);

        if (cmp == 0) {
            return *(uint32_t *)entry;  // Return inode
        } else if (cmp < 0) {
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }

    return 0;  // Not found
}

// Complexity: O(log n) lookups
// Tradeoff: insertion becomes O(n) - entries must shift to maintain order
// Historical use: some CD-ROM file systems (ISO 9660) use sorted directories
```

Optimization 3: Read-Ahead for Sequential Scans
When scanning directories, reading multiple blocks at once amortizes I/O overhead:
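A minimal sketch of the idea, assuming the entry-walking logic from `directory_lookup`: hint the kernel that the directory file will be read sequentially (`posix_fadvise` with `POSIX_FADV_SEQUENTIAL` is a standard POSIX call), then pull in many blocks per `read()` so per-I/O overhead is paid once per chunk. The 64 KB chunk size and the `process_entries()` helper are illustrative assumptions.

```c
#include <fcntl.h>
#include <unistd.h>

#define READAHEAD_CHUNK (16 * 4096)   // 64 KB: read 16 directory blocks per syscall (assumed)

// Sequential directory scan with read-ahead: one large read replaces many
// small ones, so per-I/O overhead (syscall, seek, rotational latency)
// is paid once per chunk instead of once per block.
void scan_directory_with_readahead(int dir_fd) {
    // Hint the kernel that access is sequential so it prefetches aggressively.
    posix_fadvise(dir_fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char    buf[READAHEAD_CHUNK];
    ssize_t bytes_read;

    while ((bytes_read = read(dir_fd, buf, sizeof(buf))) > 0) {
        // Walk the entries in this chunk exactly as directory_lookup() does,
        // but with the I/O cost amortized over 16 blocks.
        process_entries(buf, (size_t)bytes_read);   // hypothetical helper
    }
}
```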
Caching helps repeated accesses but does nothing for the first access or for cache misses. Under memory pressure, directory caches are reclaimed. For workloads with poor locality (random access across many directories), caching provides minimal benefit. This is why structural improvements (hash tables, B-trees) are essential for large directories.
Despite their limitations, linear lists remain viable—even optimal—in specific scenarios. Understanding when simplicity trumps asymptotic efficiency is crucial for systems design.
Scenario 1: Small Directories
Most directories are small. Studies of real file systems consistently show:
| Directory Size | Percentage of Directories | Entries Per Lookup |
|---|---|---|
| 1-10 files | 60-70% | 5 average |
| 11-50 files | 20-25% | 30 average |
| 51-100 files | 5-10% | 75 average |
| 101-1000 files | 3-5% | 500 average |
| >1000 files | <1% | 5000+ average |
For the 85-95% of directories with fewer than 50 files, linear search of cached data is extremely fast—often faster than hash computation overhead.
Scenario 2: Read-Only or Archival File Systems

When a directory is written once and rarely modified, as on ISO 9660 CD-ROMs or archival media, insertion and deletion costs disappear entirely, and a compact (often sorted) linear list is simple, dense, and fast enough.
Scenario 3: Embedded and Resource-Constrained Systems
In memory-limited environments, the simplicity of linear lists has real value:
```
Resource Constraints:
- RAM: 256KB total (no room for hash tables or tree caches)
- CPU: 8-bit microcontroller (complex algorithms are costly)
- Storage: SD card with FAT32 (linear list directories)

Linear List Advantages:
1. Code size: ~100 lines vs ~1000+ for a B-tree
2. RAM usage: a single sector buffer (512 bytes)
3. Implementation: hard to get wrong
4. Debugging: easy to inspect the raw bytes

Acceptable Because:
- Directories typically hold <100 files
- Access patterns are predictable
- Performance requirements are modest
- Reliability/simplicity is critical

Example: IoT sensor logging
- One directory with daily log files
- Max 365 files per directory (one year)
- Linear search of 365 × 32 bytes = 11KB
- Entire directory fits in ~23 SD card sectors
- Acceptable latency for this application
```

Many modern file systems use adaptive approaches: linear lists for small directories (below a threshold like 1-2 blocks), automatically converting to hash tables or B-trees when directories grow. This combines simplicity for common cases with scalability for edge cases. ext4's dir_index feature is exactly this hybrid approach.
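A sketch of that adaptive idea (not ext4's actual implementation) might dispatch on directory size: keep the plain linear scan while the directory fits in a block or two, and build an index only once it grows past a threshold. The `struct directory`, threshold value, and helper functions here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LINEAR_THRESHOLD_BYTES (2 * 4096)   // illustrative: stay linear up to ~2 blocks

// Hypothetical in-memory view of an open directory.
struct directory {
    int    fd;            // directory file descriptor
    size_t size_bytes;    // total size of the directory file
    int    index_built;   // has an index been built for this directory yet?
};

// Hypothetical dispatcher: small directories use the simple linear scan,
// large ones build and use an indexed (hash table or B-tree) lookup.
uint32_t adaptive_lookup(struct directory *dir, const char *name, size_t name_len) {
    if (dir->size_bytes <= LINEAR_THRESHOLD_BYTES) {
        // Small directory: a linear scan of one or two cached blocks is cheapest.
        return directory_lookup(dir->fd, name, name_len);
    }

    if (!dir->index_built) {
        build_directory_index(dir);   // hypothetical: index all entries once
        dir->index_built = 1;
    }
    return indexed_lookup(dir, name, name_len);   // hypothetical O(1)/O(log n) path
}
```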
We've comprehensively explored linear list directory implementation—the foundational approach that every file system designer must understand before appreciating more sophisticated alternatives.
What's Next:
Linear lists showed us the problem; hash tables show us the first solution. The next page examines hash table directory implementation—how we achieve O(1) lookup at the cost of additional complexity and space overhead.
You now understand linear list directory implementation in full depth: the data structures, algorithms, performance characteristics, and appropriate use cases. This foundation is essential for appreciating why more complex directory structures (hash tables, B-trees) were developed and when their overhead is justified.