Every design choice involves tradeoffs. Linked allocation's elimination of external fragmentation comes at a severe cost: random access performance degrades linearly with file position. To read byte 1,000,000 of a file, you must first traverse every block that comes before it.
This isn't a minor inconvenience—it's a fundamental algorithmic limitation that makes pure linked allocation unsuitable for databases, media players, spreadsheets, and virtually any application that needs to jump to arbitrary positions within files. Understanding this limitation deeply is essential for appreciating both why linked allocation found its niche and why more sophisticated schemes eventually replaced it.
By the end of this page, you will understand the mathematical analysis behind O(n) random access complexity, see concrete performance calculations demonstrating the problem, analyze which workloads are affected most severely, and appreciate why this limitation drove the development of alternatives like indexed allocation and the FAT optimization.
To understand linked allocation's random access problem, we must compare it with the ideal case.
Contiguous Allocation: O(1) Random Access
With contiguous allocation, accessing any byte is direct arithmetic:
Position of byte B in file starting at block S:
block_number = S + (B / block_size)
offset_in_block = B % block_size
disk_address = block_number × block_size + offset_in_block
No matter how large the file, computing the disk location of byte 1,000,000 or byte 1 takes the same constant time—a few arithmetic operations.
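As a minimal sketch of that arithmetic (the function and parameter names are illustrative, not taken from any particular filesystem):

```c
#include <stdint.h>

/* Contiguous allocation: map a byte offset to a disk address in O(1). */
uint64_t contiguous_disk_address(uint64_t start_block,
                                 uint64_t byte_offset,
                                 uint32_t block_size) {
    uint64_t block_number    = start_block + byte_offset / block_size;
    uint64_t offset_in_block = byte_offset % block_size;
    return block_number * block_size + offset_in_block;
}
```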
Linked Allocation: O(n) Random Access
With linked allocation, the blocks are scattered. The only way to find block N is to follow the chain from block 0:
To read byte B:
1. Calculate target_block_index = B / data_per_block
2. Start at first_block (from directory)
3. Read the current block and extract its next_pointer
4. Repeat step 3 until target_block_index pointers have been followed
5. Finally read the target block
Disk reads required: target_block_index + 1 = O(n)
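A minimal sketch of this traversal, using the same illustrative constants (BLOCK_SIZE, EOF_MARKER) and hypothetical disk_read() helper that appear in the database example later on this page:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE   4096u
#define POINTER_SIZE 4u
#define EOF_MARKER   0xFFFFFFFFu

/* Hypothetical device primitive: read one physical block into buf. */
void disk_read(uint32_t block, uint8_t buf[BLOCK_SIZE]);

/* Find the physical block holding byte_offset by walking the chain.
 * Costs (target_index + 1) disk reads in total: O(n) in file position. */
uint32_t linked_block_for_offset(uint32_t first_block, uint64_t byte_offset) {
    uint32_t data_per_block = BLOCK_SIZE - POINTER_SIZE;
    uint64_t target_index   = byte_offset / data_per_block;

    uint32_t current = first_block;
    uint8_t  buf[BLOCK_SIZE];

    for (uint64_t i = 0; i < target_index && current != EOF_MARKER; i++) {
        disk_read(current, buf);                  /* one disk I/O per hop    */
        memcpy(&current, buf + data_per_block,    /* pointer sits after data */
               POINTER_SIZE);
    }
    return current;   /* caller still needs one more read for the data itself */
}
```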
The Critical Difference:
| Access Pattern | Contiguous | Linked |
|---|---|---|
| Read block 0 | 1 disk I/O | 1 disk I/O |
| Read block 10 | 1 disk I/O | 11 disk I/Os |
| Read block 100 | 1 disk I/O | 101 disk I/Os |
| Read block 1000 | 1 disk I/O | 1001 disk I/Os |
| Read block 10000 | 1 disk I/O | 10001 disk I/Os |
The disparity grows without bound.
Each disk I/O on an HDD typically involves a seek operation (5-15ms). For block 1000 in a linked file, that's potentially 1001 seeks = 5-15 SECONDS just to reach the data. Compare this to ~5ms for contiguous access.
Let's rigorously analyze the complexity of various operations in linked allocation.
Definitions:
- n: number of blocks in the file
- t_read: time to transfer one block (≈ 0.1 ms)
- t_seek: time for one HDD seek (≈ 8 ms)
- scatter_factor: fraction of block-to-block transitions that require a seek (≈ 0 for a contiguous layout, ≈ 1 for a fully scattered one)
Sequential Read (Complete File):
Time to read entire file:
T_seq = n × t_read + (scatter_factor × n × t_seek)
Where scatter_factor ≈ 0 for contiguous, ≈ 1 for fully scattered
For linked blocks that happen to be laid out contiguously (scatter_factor ≈ 0): T_seq ≈ n × 0.1 ms
For fully scattered linked blocks (scatter_factor ≈ 1): T_seq ≈ n × 8.1 ms (81× slower)
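The same model expressed in code, with the 0.1 ms read and 8 ms seek figures used above (illustrative HDD numbers):

```c
#include <stdint.h>

/* T_seq = n * t_read + scatter_factor * n * t_seek */
double sequential_read_ms(uint64_t n_blocks, double read_ms,
                          double seek_ms, double scatter_factor) {
    return (double)n_blocks * read_ms
         + scatter_factor * (double)n_blocks * seek_ms;
}

/* Example: a 10,000-block file with 0.1 ms reads and 8 ms seeks
 *   sequential_read_ms(10000, 0.1, 8.0, 0.0) ≈  1,000 ms  (contiguous layout)
 *   sequential_read_ms(10000, 0.1, 8.0, 1.0) ≈ 81,000 ms  (fully scattered, 81x slower) */
```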
Random Access (Single Read at Position p):
```c
#include <stdint.h>

// Analysis of linked allocation access complexity
// Time to access byte at position 'offset' in a linked file
double time_to_access(uint64_t offset, uint32_t block_size, uint32_t pointer_size,
                      double seek_time_ms, double read_time_ms) {
    uint32_t data_per_block = block_size - pointer_size;
    uint64_t target_block = offset / data_per_block;

    // Must traverse all blocks from 0 to target_block
    // Each traversal step requires:
    //   - Reading the current block (to get data and next pointer)
    //   - Seeking to the next block (if scattered)

    // Best case: blocks are contiguous
    // Traversal: (target_block + 1) reads, no seeks
    double best_case = (target_block + 1) * read_time_ms;

    // Average case: blocks have some locality
    // Assume 50% of transitions require seeks
    double avg_seeks = target_block * 0.5;
    double avg_case = (target_block + 1) * read_time_ms + avg_seeks * seek_time_ms;

    // Worst case: every block requires a seek
    double worst_case = (target_block + 1) * (read_time_ms + seek_time_ms);

    (void)best_case;       // kept for illustration of the bounds
    (void)worst_case;
    return avg_case;       // Return the average-case estimate
}

// Example calculations:
// 4KB blocks, 4-byte pointers, 8ms seek, 0.1ms read

// Access byte 1,000,000:
//   target_block = 1,000,000 / 4092 = 244
//   avg_case = 245 * 0.1 + 122 * 8 = 24.5 + 976 = 1000.5ms ≈ 1 second

// Access byte 100,000,000 (100MB into file):
//   target_block = 100,000,000 / 4092 = 24,437
//   avg_case = 24,438 * 0.1 + 12,219 * 8 = 2444 + 97,752 = 100,196ms ≈ 100 seconds

// This is completely impractical!
```

Complexity Summary:
| Operation | Contiguous | Linked (Pure) | Linked (FAT cached) |
|---|---|---|---|
| Read byte at random offset | O(1) | O(n) | O(1)* |
| Read entire file sequentially | O(n) | O(n) | O(n) |
| Append data | O(1) amortized | O(n) or O(1)** | O(1)** |
| Truncate to size k | O(1) | O(k) | O(k) |
| Delete file | O(1) | O(n) | O(n) |
* With the FAT in memory, each pointer lookup is O(1), but the initial FAT load is O(n)
** With a cached pointer to the last block
The Mounting Cost of Deep Access:
The O(n) complexity means access time grows linearly with position:
| File Offset | Block Number | Disk I/Os | Time (8ms seek) |
|---|---|---|---|
| 40 KB | 10 | 11 | ~88ms |
| 400 KB | 100 | 101 | ~0.8s |
| 4 MB | 1,000 | 1,001 | ~8s |
| 40 MB | 10,000 | 10,001 | ~80s |
| 400 MB | 100,000 | 100,001 | ~13 min |
| 4 GB | 1,000,000 | 1,000,001 | ~2.2 hours |
Seeking to the end of a 4GB file could take over 2 HOURS with pure linked allocation. Even with SSD speeds (0.1ms per access instead of 8ms), it would still take ~100 seconds. No production system could tolerate this.
Different applications have different access patterns. Let's analyze how linked allocation affects various workloads.
Workload Categories:
Sequential Access Workloads:
Applications that read or write files from beginning to end, such as log appenders, backup tools, and streaming capture (see the table below).
Linked Allocation Performance: ✅ ACCEPTABLE
Sequential access only requires following pointers in order—no wasted traversals. The penalty is the seek between scattered blocks, but each block is visited exactly once.
Sequential read time = blocks × (read_time + scatter_penalty)
≈ Same as contiguous if blocks are near each other
| Application Type | Access Pattern | Linked Allocation Verdict |
|---|---|---|
| Log files | Append-only | ✅ Excellent |
| Streaming video capture | Sequential write | ✅ Good |
| File backup | Sequential read | ✅ Good |
| Email (read new) | Sequential | ✅ Acceptable |
| Text editor (small files) | Mixed | ⚠️ Tolerable |
| Text editor (large files) | Mixed | ❌ Poor |
| Database | Random | ❌ Unusable |
| Video player (seek) | Random | ❌ Unusable |
| Virtual machines | Random | ❌ Unusable |
Let's examine a concrete scenario that illustrates why linked allocation is fundamentally incompatible with database workloads.
Scenario:
A simple customer database: roughly one million 1,024-byte customer records (about four per 4 KB block), stored in a single file of roughly 262,144 blocks laid out with pure linked allocation on an HDD (≈8 ms per seek).
```c
// Database query: Find customer by ID
// Customer IDs are not sorted - random distribution
// (Assumes BLOCK_SIZE, EOF_MARKER, disk_read(), and the open 'file' handle
//  are defined elsewhere; this is an illustration of cost, not a full program.)

typedef struct {
    uint32_t id;
    char name[100];
    char email[200];
    // ... other fields
    uint8_t padding[724];   // Total: 1024 bytes per record
} CustomerRecord;

// QUERY 1: Lookup customer ID 500,000
// This customer is approximately in the middle of the file

CustomerRecord* find_customer(uint32_t target_id) {
    /*
     * With linked allocation, we must:
     *   1. Start at block 0
     *   2. Traverse through blocks until we find the record
     *   3. On average, the record is at block ~125,000
     *
     * Time calculation:
     *   - 125,000 blocks to traverse
     *   - Each block: 8ms seek + 0.1ms read = 8.1ms
     *   - Total: 125,000 × 8.1ms = 1,012,500 ms = 16.9 MINUTES
     *
     * That's 17 MINUTES to look up ONE customer!
     */
    uint32_t current_block = file->first_block;
    uint8_t buffer[BLOCK_SIZE];

    while (current_block != EOF_MARKER) {
        // Read block (triggers seek + read)
        disk_read(current_block, buffer);

        // Check each record in block
        for (int i = 0; i < 4; i++) {
            CustomerRecord *rec = (CustomerRecord*)(buffer + i * 1024);
            if (rec->id == target_id) {
                // NOTE: real code would copy the record out of this stack buffer
                return rec;   // Found!
            }
        }

        // Follow pointer to next block
        current_block = *(uint32_t*)(buffer + 4092);
    }
    return NULL;   // Not found
}

// QUERY 2: Find all customers in zip code 94102
// Must scan entire file - guaranteed worst case

int count_by_zipcode(const char *zipcode) {
    /*
     * Full table scan required: 262,144 blocks
     * Time: 262,144 × 8.1ms = 2,123,366 ms = 35 MINUTES
     *
     * Every single query against non-indexed fields
     * takes 35 minutes!
     */
    (void)zipcode;
    return 0;   // Scan logic omitted; only the cost matters here
}

// QUERY 3: 100 random customer lookups
// Perhaps for a customer list page

void batch_lookup(uint32_t ids[], int count) {
    /*
     * 100 lookups × 17 minutes average = 1,700 minutes = 28 HOURS
     *
     * A single page load could take over a DAY!
     */
    (void)ids;
    (void)count;
}
```

With pure linked allocation, a 1-million record database would take 17 minutes per lookup, 35 minutes for a table scan, and 28 hours for 100 random lookups. This is not merely slow—it's completely non-functional. This is why no database storage engine uses pure linked allocation.
Comparison with Indexed/Contiguous:
| Operation | Linked | Contiguous | B-tree Indexed |
|---|---|---|---|
| Random lookup | 17 min | 8ms | 24ms (3 levels) |
| Table scan | 35 min | 35 min | 35 min |
| 100 lookups | 28 hours | 800ms | 2.4s |
| Range query (via index) | N/A | N/A | ~100ms |
The 100,000× to 1,000,000× performance difference makes linked allocation effectively unusable for random-access workloads.
Over the years, various approaches have been tried to improve linked allocation's random access performance. Let's examine them:
1. Position Caching:
Cache recently accessed block positions:
```c
// Maintain a sparse cache of block positions
typedef struct {
    uint32_t file_block_index;   // Logical block number
    uint32_t disk_block;         // Physical block address
} CacheEntry;

#define CACHE_SIZE 128

CacheEntry position_cache[CACHE_SIZE];
uint32_t current_position;   // Last accessed block

uint32_t seek_with_cache(FileHandle *file, uint32_t target_block) {
    // Check cache for exact hit or nearest lower entry
    uint32_t best_start = 0;
    uint32_t best_block = file->first_block;

    for (int i = 0; i < CACHE_SIZE; i++) {
        if (position_cache[i].file_block_index <= target_block &&
            position_cache[i].file_block_index > best_start) {
            best_start = position_cache[i].file_block_index;
            best_block = position_cache[i].disk_block;
        }
    }

    // Continue from best cached position
    // Still O(n) worst case, but often faster in practice
    uint32_t current = best_block;
    for (uint32_t i = best_start; i < target_block; i++) {
        current = follow_pointer(current);
    }

    // Update cache
    add_to_cache(target_block, current);
    return current;
}

// Limitation: Cache helps repeated access patterns
// but can't help first access to any position
```

2. Sequential Detection and Prefetching:
If sequential access is detected, prefetch upcoming blocks:
```c
// Detect sequential pattern and prefetch block pointers
typedef struct {
    uint32_t last_block;
    uint32_t sequential_count;
    uint32_t prefetch_buffer[PREFETCH_SIZE];
    int prefetch_valid;
} PrefetchState;

void read_with_prefetch(FileHandle *file, uint32_t block) {
    if (block == file->prefetch.last_block + 1) {
        // Sequential access detected
        file->prefetch.sequential_count++;

        // After 3 sequential reads, start prefetching
        if (file->prefetch.sequential_count > 3) {
            // Follow pointers ahead and cache them
            uint32_t current = block;
            for (int i = 0; i < PREFETCH_SIZE; i++) {
                current = follow_pointer(current);
                file->prefetch.prefetch_buffer[i] = current;
            }
            file->prefetch.prefetch_valid = 1;   // Prefetched pointers are now usable
        }
    } else {
        // Random access - reset prefetch
        file->prefetch.sequential_count = 0;
        file->prefetch.prefetch_valid = 0;
    }
    file->prefetch.last_block = block;
}

// Limitation: Only helps sequential access
// Random access still requires full traversal
```

3. Skip Pointers (Multi-Level Links):
Add forward pointers that skip multiple blocks:
Regular linked list:
B0 → B1 → B2 → B3 → B4 → B5 → B6 → B7
With skip pointers (every 4 blocks):
B0 ──────────────→ B4 ──────────────→ B8 ──→ ...
│                  │                  │
└→ B1 → B2 → B3    └→ B5 → B6 → B7    └→ B9 → ...
This reduces traversal from O(n) to O(n/k), where k is the skip distance.
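A sketch of the traversal under an assumed block layout in which every k-th block also carries a skip pointer; the struct and the read_pointers() helper are illustrative:

```c
#include <stdint.h>

#define SKIP_DISTANCE 4u   /* k: one skip pointer every 4 blocks, as in the diagram */

/* Illustrative pointer area of a block: a next pointer, plus a skip
 * pointer that is only valid on every k-th block of the file. */
typedef struct {
    uint32_t next;   /* block immediately after this one    */
    uint32_t skip;   /* block SKIP_DISTANCE positions ahead */
} BlockPointers;

/* Hypothetical helper: read just the pointer area of one block. */
BlockPointers read_pointers(uint32_t block);

/* Reach logical block target_index in roughly target_index / k skip hops
 * plus at most k - 1 regular hops: O(n/k) instead of O(n). */
uint32_t seek_with_skips(uint32_t first_block, uint32_t target_index) {
    uint32_t current = first_block;   /* logical index 0 */
    uint32_t index   = 0;

    while (index + SKIP_DISTANCE <= target_index) {   /* stride while possible */
        current = read_pointers(current).skip;
        index  += SKIP_DISTANCE;
    }
    while (index < target_index) {                    /* finish block by block */
        current = read_pointers(current).next;
        index++;
    }
    return current;
}
```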
| Technique | Complexity (worst) | Improvement | Overhead |
|---|---|---|---|
| None (baseline) | O(n) | 1× | None |
| Position cache | O(n) | ~2-10× for hot paths | Memory for cache |
| Sequential prefetch | O(n) | ~2× for sequential access | Prefetch I/O |
| Skip pointers | O(n/k) | k× improvement | Extra skip pointer every k blocks |
| Full pointer table (FAT) | O(1) | n× improvement | Table in memory |
Notice that none of the mitigations achieve true O(1) random access while staying pure linked allocation. The only way to achieve O(1) is to cache all pointers in memory (like FAT) or use a fundamentally different structure (like indexed allocation with B-trees).
The obvious solution to linked allocation's random access problem is to cache all the pointers in memory. This is exactly what the File Allocation Table (FAT) does. But this approach has significant implications.
FAT Solution:
Instead of storing pointers in each block, store ALL pointers in a centralized table:
Block 0: Data only (no embedded pointer)
Block 1: Data only
...
Block N: Data only
FAT table: FAT[0]=next_of_0, FAT[1]=next_of_1, ..., FAT[N]=next_of_N
With the FAT loaded in memory, following pointers is O(1)—just an array lookup.
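In code, resolving a file position against an in-memory FAT could look like the sketch below (the array name and end-of-chain marker are assumptions). The chain is still walked entry by entry, but each hop is now a RAM array lookup rather than a disk read:

```c
#include <stdint.h>

#define FAT_EOF 0xFFFFFFFFu   /* illustrative end-of-chain marker */

/* In-memory copy of the table: fat[b] is the block that follows block b. */
extern uint32_t *fat;

/* Resolve logical block 'index' of a file to its physical block number.
 * Still one step per block, but every step is a memory access, not an I/O. */
uint32_t fat_block_for_index(uint32_t first_block, uint32_t index) {
    uint32_t current = first_block;
    while (index-- > 0 && current != FAT_EOF) {
        current = fat[current];   /* O(1) array lookup per hop */
    }
    return current;
}
```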
Memory Requirements:
| Disk Size | Block Size | Total Blocks | 4-byte FAT Size |
|---|---|---|---|
| 1 GB | 4 KB | 262,144 | 1 MB |
| 10 GB | 4 KB | 2,621,440 | 10 MB |
| 100 GB | 4 KB | 26,214,400 | 100 MB |
| 1 TB | 4 KB | 268,435,456 | 1 GB |
| 10 TB | 4 KB | 2,684,354,560 | 10 GB |
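The table values follow from a one-line calculation, sketched here under the assumption of one table entry per block:

```c
#include <stdint.h>

/* FAT size = (number of blocks on the volume) × (bytes per FAT entry) */
uint64_t fat_size_bytes(uint64_t disk_bytes, uint32_t block_size, uint32_t entry_size) {
    uint64_t total_blocks = disk_bytes / block_size;
    return total_blocks * entry_size;
}

/* Example: a 1 TB volume with 4 KB blocks and 4-byte entries
 *   total_blocks = 268,435,456  ->  FAT occupies ~1 GB of memory */
```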
The Memory Tradeoff:
For filesystems designed in the 1980s and 1990s, this was a fine trade: disks held tens of megabytes to a few gigabytes, so the entire FAT was only a few kilobytes to a few megabytes and fit comfortably in memory.
But as disk sizes grew into the hundreds of gigabytes and then terabytes, the table itself swelled to hundreds of megabytes or more, far too much memory to dedicate to block pointers.
This is why FAT32 has a maximum volume size of ~2TB (with 512-byte sectors) or ~16TB (with 4KB sectors)—the FAT itself becomes unwieldy.
Modern systems can load FAT portions on demand rather than the entire table. This reduces memory requirements but reintroduces some disk I/O for traversal. The kernel maintains a cache of frequently-accessed FAT regions.
Alternative: Indexed Allocation
For truly large files, a better solution is indexed allocation: keep all of a file's block pointers in a dedicated per-file index structure (an index block or inode) rather than scattering them through a chain, so any logical block can be located with a bounded number of lookups.
This is the approach used by modern file systems: ext4, XFS, and NTFS all describe a file's blocks through per-file index structures (inodes with extent trees or run lists) instead of linked chains.
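A deliberately simplified sketch of the idea, using a single-level index of direct block pointers (real inodes use multiple levels or extent trees, but the key property, locating any block without walking a chain, is the same):

```c
#include <stdint.h>

#define DIRECT_POINTERS 1024u   /* illustrative index capacity */

/* Per-file index node: one entry per logical block of the file. */
typedef struct {
    uint64_t file_size;
    uint32_t block[DIRECT_POINTERS];   /* block[i] = physical block of logical block i */
} IndexNode;

/* Any byte's block is found with one array lookup: the cost does not
 * depend on how deep into the file the byte lies. */
uint32_t indexed_block_for_offset(const IndexNode *inode,
                                  uint64_t byte_offset,
                                  uint32_t block_size) {
    uint64_t logical_block = byte_offset / block_size;
    return inode->block[logical_block];
}
```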
SSDs have dramatically different performance characteristics than HDDs. Does this change linked allocation's viability for random access?
SSD vs HDD Access Times:
| Metric | HDD | SSD (SATA) | SSD (NVMe) |
|---|---|---|---|
| Random read latency | 8-15ms | 0.1-0.2ms | 0.02-0.1ms |
| Sequential read | 150 MB/s | 500 MB/s | 3500 MB/s |
| Random read (IOPS) | 100-200 | 50,000-100,000 | 500,000+ |
Recalculating the Database Scenario:
With an NVMe SSD (0.05 ms per access), the average lookup becomes roughly 125,000 blocks × 0.05 ms ≈ 6 seconds.
Improvement: 17 minutes → 6 seconds (163× faster)
But a six-second customer lookup is still far too slow for any interactive application, and the improvement comes entirely from faster hardware, not from a better algorithm.
SSDs make linked allocation tolerable for small-to-medium files, but fundamental O(n) complexity remains. A 100GB file with 25 million blocks still requires 25 million lookups. Even at 0.02ms each, that's still 500 seconds (8+ minutes) to seek to the end.
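To make that concrete, here is a small usage sketch that reuses the time_to_access() estimator from earlier on this page. The 0.05 ms per hop and the zero seek penalty for the SSD case are assumptions, not measurements:

```c
#include <stdio.h>
#include <stdint.h>

/* Defined in the earlier complexity-analysis sketch. */
double time_to_access(uint64_t offset, uint32_t block_size, uint32_t pointer_size,
                      double seek_time_ms, double read_time_ms);

int main(void) {
    uint64_t offset = 100ULL * 1000 * 1000 * 1000;   /* ~100 GB into the file */

    double hdd = time_to_access(offset, 4096, 4, 8.0, 0.1);    /* HDD: 8 ms seeks, 0.1 ms reads */
    double ssd = time_to_access(offset, 4096, 4, 0.0, 0.05);   /* NVMe: ~0.05 ms per hop        */

    printf("HDD estimate: %.0f s (~28 hours)\n", hdd / 1000.0);
    printf("SSD estimate: %.0f s (~20 minutes)\n", ssd / 1000.0);
    return 0;
}
```

Faster hardware shrinks the constant, but the number of hops still scales with file position.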
We've thoroughly examined linked allocation's most significant weakness—O(n) random access complexity. Let's consolidate our understanding:
- Contiguous allocation locates any byte with constant-time arithmetic; pure linked allocation must follow the chain, costing one disk read per block before the target.
- The cost grows linearly with file position: reaching deep into a multi-gigabyte file can take minutes to hours on an HDD.
- Sequential workloads (logs, backups, streaming capture) remain acceptable; random-access workloads (databases, media seeking, virtual machines) are effectively unusable.
- Position caches, prefetching, and skip pointers reduce the constant factor but never eliminate the O(n) traversal.
- Caching every pointer in memory (the FAT approach) restores fast lookups at the cost of table memory; indexed allocation avoids the chain entirely.
- SSDs shrink the per-hop cost dramatically but leave the linear complexity intact.
What's next:
The final page explores the FAT optimization—the genius modification that made linked allocation practical for decades. We'll examine how moving pointers to a centralized table transformed random access performance while preserving linked allocation's fragmentation benefits.
You now understand why pure linked allocation is fundamentally unsuitable for random-access workloads. The O(n) traversal requirement makes it impractical for databases, media, and most modern applications. This understanding is crucial for appreciating both the FAT optimization and the move toward indexed allocation in modern file systems.