When you type ls in a Unix terminal or open a folder in Windows Explorer, the operating system must answer a fundamental question: How do we find files by name? Behind every directory listing lies a data structure that maps human-readable names to the metadata needed to access file contents.
The linear list is the most intuitive solution to this problem. Imagine a directory as a notebook where each page contains one file's name and information. To find a file, you flip through pages one by one until you find the right name. Simple, straightforward, and surprisingly effective—until your notebook has thousands of pages.
This page provides an exhaustive examination of linear list directory implementation: how it works at the byte level, why it was the original choice for many file systems, where it excels, and why modern systems have largely moved beyond it for high-performance scenarios.
By the end of this page, you will understand: (1) The exact structure of linear list directories, (2) Algorithms for insertion, deletion, and lookup operations, (3) Time and space complexity analysis, (4) Real-world implementations in FAT and early Unix, (5) Optimization techniques including sorted lists and caching, and (6) When linear lists remain appropriate in modern systems.
A linear list directory stores directory entries as a sequential collection—essentially an array or linked list of entries. Each entry contains a filename and either the file's metadata directly or a pointer to where metadata is stored (like an inode number).
Core Principle:
A directory is a special file whose content is a sequence of (name, locator) pairs. The locator identifies where to find the file's actual data and metadata.
Let's dissect this concept with precision:
Visual Representation:
Consider a directory containing three files: readme.txt, data.bin, and config.yaml. In a linear list structure:
```
Directory File Content (on disk):

┌───────────────────────────────────┐
│ Entry 0                           │
│   ├── Name:   "readme.txt"        │
│   ├── Inode:  42                  │
│   └── Status: VALID               │
├───────────────────────────────────┤
│ Entry 1                           │
│   ├── Name:   "data.bin"          │
│   ├── Inode:  87                  │
│   └── Status: VALID               │
├───────────────────────────────────┤
│ Entry 2                           │
│   ├── Name:   "config.yaml"       │
│   ├── Inode:  103                 │
│   └── Status: VALID               │
├───────────────────────────────────┤
│ Entry 3                           │
│   ├── Name:   (empty or garbage)  │
│   ├── Inode:  0                   │
│   └── Status: END_OF_DIRECTORY    │
└───────────────────────────────────┘

To find "data.bin":
  1. Read Entry 0, compare "readme.txt" ≠ "data.bin" → continue
  2. Read Entry 1, compare "data.bin" = "data.bin" → FOUND! Return inode 87
```

The term 'linear' refers to the O(n) time complexity of search operations. To find a file, you must potentially examine every entry. This linear relationship between directory size and search time defines the approach—and its limitations.
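To make the mechanism concrete, here is a minimal in-memory sketch of a linear list directory and its O(n) lookup. The `dir_entry` layout, the 28-byte name field, and the use of locator 0 for "unused" are illustrative choices for this sketch, not taken from any particular file system.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Illustrative entry: a name plus a locator (for example, an inode number).
struct dir_entry {
    char     name[28];   // fixed-size name field, chosen arbitrarily for this sketch
    uint32_t locator;    // 0 means "unused/deleted" in this sketch
};

// Linear lookup: scan entries in order until the name matches or the list ends.
uint32_t lookup(const struct dir_entry *entries, size_t count, const char *target) {
    for (size_t i = 0; i < count; i++) {             // O(n) scan
        if (entries[i].locator != 0 &&
            strcmp(entries[i].name, target) == 0) {
            return entries[i].locator;               // found
        }
    }
    return 0;                                        // not found
}

int main(void) {
    struct dir_entry dir[] = {
        { "readme.txt",  42  },
        { "data.bin",    87  },
        { "config.yaml", 103 },
    };
    printf("data.bin -> inode %u\n", lookup(dir, 3, "data.bin"));   // prints 87
    return 0;
}
```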
The exact structure of directory entries varies by file system, but understanding common patterns reveals the engineering trade-offs involved. Let's examine real implementations.
Fixed-Length Entries (Classic Approach):
Early file systems used fixed-size entries for simplicity:
```c
// Original Unix V7 directory entry (16 bytes)
struct direct {
    unsigned short d_ino;      // 2 bytes: inode number (0 = deleted)
    char           d_name[14]; // 14 bytes: filename (null-padded)
};
// Total: 16 bytes per entry
// Max filename: 14 characters
// Max files per directory: limited by the directory file's size

// DOS FAT 8.3 directory entry (32 bytes)
struct fat_dirent {
    char     name[8];       // 8 bytes: filename (space-padded)
    char     ext[3];        // 3 bytes: extension (space-padded)
    uint8_t  attr;          // 1 byte: attributes (hidden, system, etc.)
    uint8_t  reserved[10];  // 10 bytes: reserved/extended attributes
    uint16_t time;          // 2 bytes: modification time
    uint16_t date;          // 2 bytes: modification date
    uint16_t cluster;       // 2 bytes: first cluster number
    uint32_t size;          // 4 bytes: file size in bytes
};
// Total: 32 bytes per entry
// Note: FAT stores metadata IN the directory entry, not separately

// Example: Reading a Unix V7 directory
void list_directory(int dir_fd) {
    struct direct entry;
    while (read(dir_fd, &entry, sizeof(entry)) == sizeof(entry)) {
        if (entry.d_ino != 0) {  // 0 means deleted/empty
            // d_name may not be null-terminated if it uses all 14 chars
            printf("%.14s -> inode %u\n", entry.d_name, entry.d_ino);
        }
    }
}
```

Variable-Length Entries (Modern Approach):
Fixed-length entries waste space for short names and limit long names. Modern file systems use variable-length entries:
```c
// Linux ext2/ext3/ext4 directory entry
struct ext4_dir_entry_2 {
    __le32 inode;      // 4 bytes: inode number
    __le16 rec_len;    // 2 bytes: total entry length (for alignment)
    __u8   name_len;   // 1 byte: actual name length
    __u8   file_type;  // 1 byte: file type (regular, directory, etc.)
    char   name[];     // variable: filename (NOT null-terminated)
};
// Minimum size: 8 bytes + name
// rec_len includes padding to a 4-byte boundary

// Example directory block layout:
//
// Offset  0: [inode=2,   rec_len=12,   name_len=1,  type=DIR, name="."]
// Offset 12: [inode=2,   rec_len=12,   name_len=2,  type=DIR, name=".."]
// Offset 24: [inode=42,  rec_len=20,   name_len=10, type=REG, name="readme.txt"]
// Offset 44: [inode=87,  rec_len=16,   name_len=8,  type=REG, name="data.bin"]
// Offset 60: [inode=103, rec_len=4036, name_len=11, type=REG, name="config.yaml"]
//                        ^ rec_len extends to end of block (absorbs free space)

// Deletion with variable-length entries: merge rec_len into the previous entry
// Before deletion of "data.bin":
//   readme.txt: rec_len=20, data.bin: rec_len=16, config.yaml: rec_len=4036
// After deletion:
//   readme.txt: rec_len=36, config.yaml: rec_len=4036
//   (data.bin's space absorbed into readme.txt's rec_len)
```

| Aspect | Fixed-Length | Variable-Length |
|---|---|---|
| Space efficiency | Poor (wastes space for short names) | Good (only uses needed bytes + padding) |
| Name length limit | Strict (8.3 or 14 chars) | Flexible (255+ chars typical) |
| Entry location | Direct calculation: offset of entry n = n × entry_size | Must scan from start |
| Deletion | Mark inode as 0 | Merge rec_len with neighbor |
| Insertion | Find slot with inode=0 | Find gap with sufficient rec_len |
| Fragmentation | None (fixed slots) | Internal fragmentation in rec_len gaps |
In ext4, rec_len serves dual purposes: it enables variable-length names AND handles deletion without compaction. When an entry is deleted, its space is absorbed by the previous entry's rec_len. This avoids expensive block rewrites but creates internal fragmentation that's only reclaimed when new entries fit in the gaps.
Every directory supports fundamental operations: lookup, insertion, deletion, and enumeration. With linear lists, each operation has specific algorithms and complexity characteristics.
Operation 1: Lookup (Finding a File)
The most frequent operation—every file access begins with resolving a path, which requires looking up each component:
```c
// Linear search lookup for ext4-style directories
// Returns inode number, or 0 if not found

uint32_t directory_lookup(
    int         dir_fd,      // File descriptor for directory
    const char *target,      // Name to find
    size_t      target_len   // Length of target name
) {
    char    block[4096];     // Directory block buffer
    ssize_t bytes_read;
    size_t  offset;

    // Read directory blocks sequentially
    while ((bytes_read = read(dir_fd, block, sizeof(block))) > 0) {
        offset = 0;
        while (offset < (size_t)bytes_read) {
            struct ext4_dir_entry_2 *entry =
                (struct ext4_dir_entry_2 *)(block + offset);

            // Check for valid entry (inode != 0)
            if (entry->inode != 0) {
                // Compare names (not null-terminated!)
                if (entry->name_len == target_len &&
                    memcmp(entry->name, target, target_len) == 0) {
                    return entry->inode;  // FOUND
                }
            }

            // Move to next entry
            offset += entry->rec_len;

            // Sanity check: prevent infinite loop on corruption
            if (entry->rec_len == 0)
                break;
        }
    }

    return 0;  // NOT FOUND
}

// Time Complexity Analysis:
// - Best case:    O(1) - target is first entry
// - Worst case:   O(n) - target is last entry or doesn't exist
// - Average case: O(n/2) = O(n) - target is in the middle
//
// Space Complexity: O(1) - only one block buffer is needed
//
// I/O Complexity: O(n/b) disk reads, where b = entries per block
```

Operation 2: Insertion (Creating a File)
Insertion requires finding space for a new entry. With fixed-length entries, find a deleted slot. With variable-length entries, find a gap large enough:
```c
// Insertion for variable-length directory entries
// Returns 0 on success, -1 on failure

int directory_insert(
    int         dir_fd,
    const char *name,
    size_t      name_len,
    uint32_t    inode,
    uint8_t     file_type
) {
    // Required space: 8-byte header + name, rounded up to a 4-byte boundary
    size_t required = 8 + name_len;
    required = (required + 3) & ~(size_t)3;

    char    block[4096];
    ssize_t bytes_read;
    off_t   block_offset = 0;

    while ((bytes_read = read(dir_fd, block, sizeof(block))) > 0) {
        size_t offset = 0;

        while (offset < (size_t)bytes_read) {
            struct ext4_dir_entry_2 *entry =
                (struct ext4_dir_entry_2 *)(block + offset);

            // Calculate the space this entry actually uses
            size_t actual_len = 8 + entry->name_len;
            actual_len = (actual_len + 3) & ~(size_t)3;

            // Check for unused space hiding in this entry's rec_len
            size_t unused = entry->rec_len - actual_len;

            if (unused >= required) {
                // Split this entry: shrink the current one, add the new one after it
                size_t new_rec_len = entry->rec_len - actual_len;
                entry->rec_len = actual_len;

                // Create the new entry in the gap
                struct ext4_dir_entry_2 *new_entry =
                    (struct ext4_dir_entry_2 *)(block + offset + actual_len);
                new_entry->inode     = inode;
                new_entry->rec_len   = new_rec_len;
                new_entry->name_len  = name_len;
                new_entry->file_type = file_type;
                memcpy(new_entry->name, name, name_len);

                // Write the modified block back
                lseek(dir_fd, block_offset, SEEK_SET);
                write(dir_fd, block, bytes_read);
                return 0;  // SUCCESS
            }

            offset += entry->rec_len;
            if (entry->rec_len == 0)
                break;
        }

        block_offset += bytes_read;
    }

    // No space found - must extend the directory (allocate a new block)
    return extend_directory_and_insert(dir_fd, name, name_len, inode, file_type);
}

// Time Complexity: O(n) worst case to find space
// Note: first-fit strategy - stops at the first sufficient gap
```

Operation 3: Deletion (Removing a File)
Deletion must mark the entry as unused and reclaim space if possible:
```c
// Deletion in a variable-length directory
// Strategy: merge the deleted entry's space into the previous entry

int directory_delete(int dir_fd, const char *name, size_t name_len) {
    char    block[4096];
    ssize_t bytes_read;
    off_t   block_offset = 0;

    while ((bytes_read = read(dir_fd, block, sizeof(block))) > 0) {
        size_t offset = 0;
        struct ext4_dir_entry_2 *prev_entry = NULL;

        while (offset < (size_t)bytes_read) {
            struct ext4_dir_entry_2 *entry =
                (struct ext4_dir_entry_2 *)(block + offset);

            if (entry->inode != 0 &&
                entry->name_len == name_len &&
                memcmp(entry->name, name, name_len) == 0) {
                // FOUND - now delete it

                if (prev_entry != NULL) {
                    // Merge with the previous entry
                    prev_entry->rec_len += entry->rec_len;
                } else {
                    // First entry in the block - just mark it as deleted
                    entry->inode = 0;
                }

                // Write the modified block back
                lseek(dir_fd, block_offset, SEEK_SET);
                write(dir_fd, block, bytes_read);
                return 0;  // SUCCESS
            }

            prev_entry = entry;
            offset += entry->rec_len;
            if (entry->rec_len == 0)
                break;
        }

        block_offset += bytes_read;
    }

    return -1;  // NOT FOUND
}

// Time Complexity: O(n) to find the entry
// Note: no compaction - fragmentation accumulates
```

| Operation | Best Case | Average Case | Worst Case | I/O Operations |
|---|---|---|---|---|
| Lookup | O(1) | O(n) | O(n) | 1 to n/b reads |
| Insert | O(1) | O(n) | O(n) | 1 read + 1 write minimum |
| Delete | O(1) | O(n) | O(n) | 1 read + 1 write minimum |
| List all | O(n) | O(n) | O(n) | n/b reads exactly |
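The fourth fundamental operation, enumeration (what `ls` and `readdir` do), is a straightforward sequential walk over the same layout. Below is a sketch in the same ext4-entry style, assuming the `ext4_dir_entry_2` structure and block layout shown earlier.

```c
// Enumeration ("list all"): visit every valid entry exactly once.
// Uses the same ext4-style layout as directory_lookup() above.
void directory_list(int dir_fd) {
    char    block[4096];
    ssize_t bytes_read;

    while ((bytes_read = read(dir_fd, block, sizeof(block))) > 0) {
        size_t offset = 0;
        while (offset < (size_t)bytes_read) {
            struct ext4_dir_entry_2 *entry =
                (struct ext4_dir_entry_2 *)(block + offset);

            if (entry->inode != 0) {
                // Names are not null-terminated: print exactly name_len bytes
                printf("%.*s -> inode %u\n",
                       (int)entry->name_len, entry->name, (unsigned)entry->inode);
            }

            if (entry->rec_len == 0)   // corruption guard
                break;
            offset += entry->rec_len;
        }
    }
}

// Cost: exactly n/b block reads, sequential and prefetch-friendly,
// which is why listing a huge directory is slow but not pathological.
```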
Understanding when linear lists become problematic requires analyzing real performance characteristics. Let's examine the math and see concrete examples.
Quantifying the Lookup Problem:
Consider the I/O cost of resolving a path like /home/alice/projects/myapp/src/main.c:
```
Path: /home/alice/projects/myapp/src/main.c
Components: [home, alice, projects, myapp, src, main.c]

Assumptions:
- Block size: 4096 bytes
- Average entry size: 16 bytes (8 header + 8 name)
- Entries per block: ~256

Directory sizes (typical system):
  /            :   20 entries →  1 block  →  1 read
  /home        :  100 entries →  1 block  →  1 read  (avg 50 comparisons)
  /home/alice  :  500 entries →  2 blocks →  1 read  (avg 250 comparisons)
  projects     : 1000 entries →  4 blocks →  2 reads (avg 500 comparisons)
  myapp        :  200 entries →  1 block  →  1 read  (avg 100 comparisons)
  src          : 5000 entries → 20 blocks → 10 reads (avg 2500 comparisons)

Total for ONE file open:
  - Disk reads: 1 + 1 + 1 + 2 + 1 + 10 = 16 reads minimum
  - String comparisons: 50 + 250 + 500 + 100 + 2500 = 3400 comparisons

With HDD (5ms seek + 5ms latency):
  - 16 reads × 10ms = 160ms just for path resolution!

With SSD (0.1ms access):
  - 16 reads × 0.1ms = 1.6ms (much better, but still adds up)

Now imagine this multiplied by:
  - Compilation: 10,000 source files
  - Web server: 1000 requests/second
  - Package manager: 50,000 files to verify
```
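The per-component cost shows up directly in code. Below is a sketch of path resolution built on the `directory_lookup` routine from earlier; `ROOT_INODE` and `open_by_inode()` are hypothetical placeholders (a real kernel walks inodes internally rather than reopening file descriptors).

```c
#define ROOT_INODE 2   // assumed root inode number; 2 is the convention on ext-style systems

// Sketch: resolve an absolute path one component at a time.
// Builds on directory_lookup() above; open_by_inode() is a hypothetical
// helper that opens a directory given its inode number.
uint32_t resolve_path(const char *path) {
    uint32_t inode = ROOT_INODE;
    char buf[4096];
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *comp = strtok(buf, "/"); comp != NULL; comp = strtok(NULL, "/")) {
        int dir_fd = open_by_inode(inode);            // hypothetical helper
        if (dir_fd < 0)
            return 0;
        inode = directory_lookup(dir_fd, comp, strlen(comp));   // O(n) scan per component
        close(dir_fd);
        if (inode == 0)
            return 0;                                 // component not found
    }
    return inode;   // inode of the final path component
}
```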
The Large Directory Problem:

Linear lists become catastrophically slow for directories with many files:
| Directory Size | Avg Comparisons | Blocks to Read | Time (HDD) | Time (SSD) |
|---|---|---|---|---|
| 100 files | 50 | 1 | 10ms | 0.1ms |
| 1,000 files | 500 | 4 | 40ms | 0.4ms |
| 10,000 files | 5,000 | 40 | 400ms | 4ms |
| 100,000 files | 50,000 | 400 | 4 seconds | 40ms |
| 1,000,000 files | 500,000 | 4,000 | 40 seconds | 400ms |
Real-World Case Study: Mail Spool Directories
Email servers historically stored one file per message in spool directories. A busy mail server might accumulate hundreds of thousands of messages:
```
Scenario: Mail server with a linear list directory (/var/mail/spool/)
Growth rate: 10,000 new messages per day
Retention: 30 days before cleanup

Day 1:  10,000 files  →  0.4s to find a message → acceptable
Day 7:  70,000 files  →  2.8s to find a message → users complain
Day 14: 140,000 files →  5.6s to find a message → timeouts start
Day 30: 300,000 files → 12.0s to find a message → system unusable

The Problem:
- Each message delivery:  O(n) to check if the filename exists
- Each message retrieval: O(n) to find the message
- Listing new mail:       O(n) to enumerate the directory

Solution attempts:
1. Subdirectories by date: /var/mail/spool/2024/01/15/message_id
   - Reduces individual directory size
   - But now path resolution has more components

2. Subdirectories by hash: /var/mail/spool/a/b/c/message_abc123
   - First 3 chars of the hash create a directory tree
   - Each leaf directory has ~1000 files max

3. Use a file system with hash/tree directories (ext4 dir_index, XFS)
```

The Maildir format—storing each email as a separate file—exposed linear list limitations so severely that it drove widespread adoption of directory indexing features. Many system administrators learned about O(n) directory lookup the hard way when their mail servers ground to a halt.
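Solution attempt 2 can be sketched in a few lines: hash the message identifier and use leading characters of the hash to choose a subdirectory, so no single directory grows without bound. The path layout and the FNV-1a hash are illustrative choices, not what any particular mail server actually does.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// FNV-1a hash: any reasonably uniform string hash works here.
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

// Build a sharded path like /var/mail/spool/a/b/c/<message_id>
// so each leaf directory stays small enough for linear search.
void shard_path(const char *message_id, char *out, size_t out_len) {
    uint32_t h = fnv1a(message_id);
    const char *hex = "0123456789abcdef";
    snprintf(out, out_len, "/var/mail/spool/%c/%c/%c/%s",
             hex[(h >> 8) & 0xF], hex[(h >> 4) & 0xF], hex[h & 0xF],
             message_id);
}

int main(void) {
    char path[256];
    shard_path("message_abc123", path, sizeof(path));
    printf("%s\n", path);   // e.g. /var/mail/spool/7/e/2/message_abc123
    return 0;
}
```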
While the fundamental O(n) complexity cannot be eliminated without changing the data structure, several optimizations improve practical performance significantly.
Optimization 1: Directory Entry Caching (dcache)
The Linux kernel maintains a cache of recently resolved (name → inode) mappings:
```c
// Simplified view of the Linux dentry cache
struct dentry {
    struct qstr       d_name;    // Filename with hash
    struct inode     *d_inode;   // Associated inode
    struct dentry    *d_parent;  // Parent directory
    struct hlist_node d_hash;    // Hash table linkage
    struct list_head  d_lru;     // LRU list for reclaim
    // ... more fields
};

// Lookup flow with the cache:
//
// 1. Compute hash of (parent_dentry, filename)
// 2. Search the dcache hash table - O(1) average
// 3. If HIT: return the cached dentry
// 4. If MISS:
//    a. Read the directory from disk - O(n)
//    b. Create a dentry, add it to the cache
//    c. Return the new dentry

// Cache effectiveness depends on:
// - Temporal locality: same files accessed repeatedly
// - Spatial locality: files in the same directory accessed together
// - Memory pressure: the cache is evicted under memory pressure

// Check cache statistics:
// $ cat /proc/sys/fs/dentry-state
// Output: nr_dentry nr_unused age_limit want_pages dummy dummy
```
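As a rough userspace analogue of this flow, the sketch below caches (directory, name) → inode results in a small hash table and falls back to the on-disk `directory_lookup` from earlier only on a miss. The table size, hash function, and overwrite-on-collision eviction are illustrative simplifications of what the kernel actually does.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 1024   // illustrative fixed-size table

struct cache_slot {
    int      dir_fd;       // 0 = empty slot (illustrative convention)
    char     name[64];
    uint32_t inode;
};

static struct cache_slot cache[CACHE_SLOTS];

// Hash the (directory, name) pair into a slot index.
static size_t slot_for(int dir_fd, const char *name) {
    uint32_t h = (uint32_t)dir_fd * 31u;
    for (const char *p = name; *p; p++)
        h = h * 131u + (uint8_t)*p;
    return h % CACHE_SLOTS;
}

// Cached lookup: O(1) on a hit, falls back to the O(n) disk scan on a miss.
uint32_t cached_lookup(int dir_fd, const char *name) {
    size_t i = slot_for(dir_fd, name);

    if (cache[i].dir_fd == dir_fd && strcmp(cache[i].name, name) == 0)
        return cache[i].inode;                                        // HIT

    uint32_t inode = directory_lookup(dir_fd, name, strlen(name));    // MISS: O(n) scan
    if (inode != 0 && strlen(name) < sizeof(cache[i].name)) {
        cache[i].dir_fd = dir_fd;          // overwrite-on-collision "eviction"
        strcpy(cache[i].name, name);
        cache[i].inode = inode;
    }
    return inode;
}
```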
Optimization 2: Sorted Linear List

Maintaining entries in sorted order enables binary search:
```c
// Binary search on a sorted fixed-length directory
// Only works with fixed-length entries (needed for direct addressing)

uint32_t sorted_directory_lookup(
    int         dir_fd,
    const char *target,
    size_t      entry_size,   // Must be fixed
    size_t      num_entries
) {
    char    entry[entry_size];
    ssize_t low  = 0;
    ssize_t high = (ssize_t)num_entries - 1;   // signed so high can go below 0

    while (low <= high) {
        ssize_t mid = (low + high) / 2;

        // Direct seek to entry - O(1) with fixed-size entries
        lseek(dir_fd, mid * entry_size, SEEK_SET);
        read(dir_fd, entry, entry_size);

        char *entry_name = entry + 4;   // Skip the inode field
        int cmp = strcmp(entry_name, target);

        if (cmp == 0) {
            return *(uint32_t *)entry;  // Return inode
        } else if (cmp < 0) {
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }

    return 0;  // Not found
}

// Complexity: O(log n) lookups
// Tradeoff: insertion becomes O(n) - entries must shift to maintain order
// Historical use: some CD-ROM file systems (ISO 9660) use sorted directories
```

Optimization 3: Read-Ahead for Sequential Scans
When scanning directories, reading multiple blocks at once amortizes I/O overhead:
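A minimal sketch of the idea, assuming the entry-walking logic from `directory_lookup`: hint the kernel that the directory file will be read sequentially (`posix_fadvise` with `POSIX_FADV_SEQUENTIAL` is a standard POSIX call), then pull in many blocks per `read()` so per-I/O overhead is paid once per chunk. The 64 KB chunk size and the `process_entries()` helper are illustrative assumptions.

```c
#include <fcntl.h>
#include <unistd.h>

#define READAHEAD_CHUNK (16 * 4096)   // 64 KB: read 16 directory blocks per syscall (assumed)

// Sequential directory scan with read-ahead: one large read replaces many
// small ones, so per-I/O overhead (syscall, seek, rotational latency)
// is paid once per chunk instead of once per block.
void scan_directory_with_readahead(int dir_fd) {
    // Hint the kernel that access is sequential so it prefetches aggressively.
    posix_fadvise(dir_fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char    buf[READAHEAD_CHUNK];
    ssize_t bytes_read;

    while ((bytes_read = read(dir_fd, buf, sizeof(buf))) > 0) {
        // Walk the entries in this chunk exactly as directory_lookup() does,
        // but with the I/O cost amortized over 16 blocks.
        process_entries(buf, (size_t)bytes_read);   // hypothetical helper
    }
}
```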
Caching helps repeated accesses but does nothing for the first access or for cache misses. Under memory pressure, directory caches are reclaimed. For workloads with poor locality (random access across many directories), caching provides minimal benefit. This is why structural improvements (hash tables, B-trees) are essential for large directories.
Despite their limitations, linear lists remain viable—even optimal—in specific scenarios. Understanding when simplicity trumps asymptotic efficiency is crucial for systems design.
Scenario 1: Small Directories
Most directories are small. Studies of real file systems consistently show:
| Directory Size | Percentage of Directories | Entries Per Lookup |
|---|---|---|
| 1-10 files | 60-70% | 5 average |
| 11-50 files | 20-25% | 30 average |
| 51-100 files | 5-10% | 75 average |
| 101-1000 files | 3-5% | 500 average |
| >1000 files | <1% | 5000+ average |
For the 85-95% of directories with fewer than 50 files, linear search of cached data is extremely fast—often faster than hash computation overhead.
Scenario 2: Read-Only or Archival File Systems

When a directory is written once and rarely modified, as on ISO 9660 CD-ROMs or archival media, insertion and deletion costs disappear entirely, and a compact (often sorted) linear list is simple, dense, and fast enough.
Scenario 3: Embedded and Resource-Constrained Systems
In memory-limited environments, the simplicity of linear lists has real value:
```
Resource Constraints:
- RAM: 256KB total (no room for hash tables or tree caches)
- CPU: 8-bit microcontroller (complex algorithms are costly)
- Storage: SD card with FAT32 (linear list directories)

Linear List Advantages:
1. Code size: ~100 lines vs ~1000+ for a B-tree
2. RAM usage: a single sector buffer (512 bytes)
3. Implementation: hard to get wrong
4. Debugging: easy to inspect the raw bytes

Acceptable Because:
- Directories typically hold <100 files
- Access patterns are predictable
- Performance requirements are modest
- Reliability/simplicity is critical

Example: IoT sensor logging
- One directory with daily log files
- Max 365 files per directory (one year)
- Linear search of 365 × 32 bytes = 11KB
- Entire directory fits in ~23 SD card sectors
- Acceptable latency for this application
```

Many modern file systems use adaptive approaches: linear lists for small directories (below a threshold like 1-2 blocks), automatically converting to hash tables or B-trees when directories grow. This combines simplicity for common cases with scalability for edge cases. ext4's dir_index feature is exactly this hybrid approach.
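A sketch of that adaptive idea (not ext4's actual implementation) might dispatch on directory size: keep the plain linear scan while the directory fits in a block or two, and build an index only once it grows past a threshold. The `struct directory`, threshold value, and helper functions here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LINEAR_THRESHOLD_BYTES (2 * 4096)   // illustrative: stay linear up to ~2 blocks

// Hypothetical in-memory view of an open directory.
struct directory {
    int    fd;            // directory file descriptor
    size_t size_bytes;    // total size of the directory file
    int    index_built;   // has an index been built for this directory yet?
};

// Hypothetical dispatcher: small directories use the simple linear scan,
// large ones build and use an indexed (hash table or B-tree) lookup.
uint32_t adaptive_lookup(struct directory *dir, const char *name, size_t name_len) {
    if (dir->size_bytes <= LINEAR_THRESHOLD_BYTES) {
        // Small directory: a linear scan of one or two cached blocks is cheapest.
        return directory_lookup(dir->fd, name, name_len);
    }

    if (!dir->index_built) {
        build_directory_index(dir);   // hypothetical: index all entries once
        dir->index_built = 1;
    }
    return indexed_lookup(dir, name, name_len);   // hypothetical O(1)/O(log n) path
}
```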
We've comprehensively explored linear list directory implementation—the foundational approach that every file system designer must understand before appreciating more sophisticated alternatives.
What's Next:
Linear lists showed us the problem; hash tables show us the first solution. The next page examines hash table directory implementation—how we achieve O(1) lookup at the cost of additional complexity and space overhead.
You now understand linear list directory implementation in full depth: the data structures, algorithms, performance characteristics, and appropriate use cases. This foundation is essential for appreciating why more complex directory structures (hash tables, B-trees) were developed and when their overhead is justified.