We've explored three fundamental file access methods: sequential access (reading linearly from start to end), direct access (jumping to arbitrary positions), and indexed access (using auxiliary structures to map keys to locations). Each has distinct strengths and weaknesses.
But knowing how each method works isn't enough. The critical skill is knowing when to use each one. This decision profoundly impacts application performance—sometimes by orders of magnitude.
A common mistake is applying the same access pattern everywhere. A developer comfortable with databases might over-index everything, incurring write overhead on workloads that are naturally sequential. Conversely, someone used to streaming data might perform linear scans where a simple index would reduce runtime from hours to milliseconds.
This page brings together everything we've learned to provide a decision framework. We'll compare the methods head-to-head across multiple dimensions, examine hybrid strategies, and work through real-world scenarios to build intuition for access method selection.
By the end of this page, you will be able to: (1) Compare access methods across key performance metrics, (2) Identify workload characteristics that favor each method, (3) Recognize hybrid strategies that combine multiple approaches, (4) Select the optimal access method for any given scenario with confidence.
Let's systematically compare the three access methods across critical performance dimensions. These comparisons assume typical hardware (HDD or SSD) and reasonable implementation quality.
| Metric | Sequential | Direct (Random) | Indexed |
|---|---|---|---|
| Read entire file | ⭐⭐⭐⭐⭐ Optimal | ⭐⭐⭐⭐⭐ Same as sequential | ➖ N/A |
| Find one record by position | ⭐⭐ O(n) scan | ⭐⭐⭐⭐⭐ O(1) seek | ⭐⭐⭐⭐ O(log n) |
| Find one record by key | ⭐⭐ O(n) scan | ⭐⭐ O(n) scan | ⭐⭐⭐⭐⭐ O(log n) or O(1) |
| Range query | ⭐⭐⭐ Full scan | ⭐⭐⭐ Scattered reads | ⭐⭐⭐⭐⭐ Efficient (B-tree) |
| Append new record | ⭐⭐⭐⭐⭐ O(1) | ⭐⭐⭐⭐ O(1) | ⭐⭐⭐ O(log n) + index update |
| Update existing record | ⭐⭐ Rewrite file | ⭐⭐⭐⭐⭐ O(1) in place | ⭐⭐⭐⭐ O(log n) seek + write |
| Delete record | ⭐⭐ Compact file | ⭐⭐⭐⭐ Mark deleted | ⭐⭐⭐ Update index(es) |
| Storage overhead | ⭐⭐⭐⭐⭐ None | ⭐⭐⭐⭐⭐ None | ⭐⭐⭐ Index space (1-10%) |
| Implementation complexity | ⭐⭐⭐⭐⭐ Simple | ⭐⭐⭐⭐ Simple | ⭐⭐ Complex |
| Predictable performance | ⭐⭐⭐⭐⭐ Very | ⭐⭐⭐⭐ Very | ⭐⭐⭐ Variable (cache, tree depth) |
Interpretation:
Sequential access excels when you process data in order (logs, streams, batch ETL). It achieves maximum throughput because it aligns with physical storage characteristics.
Direct access excels when you know the exact position of what you need (record-based files, memory-mapped editing, fixed-size record arrays).
Indexed access excels when you need to find records by key (databases, search, any key-value lookup). It trades space and write complexity for dramatically faster reads.
The key insight: None of these methods is universally superior. Each optimizes for different access patterns. The right choice depends entirely on your workload.
Ask yourself: 'How will data be accessed?' (1) Process everything once → Sequential. (2) Access by position → Direct. (3) Access by key → Indexed. Real applications often combine these—e.g., an indexed primary lookup followed by sequential streaming of results.
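To make the three idioms concrete, here's a minimal Python sketch; the file name, record size, and key format are illustrative, not from any particular system:

```python
RECORD_SIZE = 100
FILENAME = "people.dat"

# Build a tiny demo file: 1,000 fixed-size records.
with open(FILENAME, "wb") as f:
    for i in range(1000):
        f.write(f"person {i:05d}".encode().ljust(RECORD_SIZE, b"\0"))

# 1. Sequential: process everything once, in file order.
with open(FILENAME, "rb") as f:
    while (rec := f.read(RECORD_SIZE)):
        pass  # handle each record here

# 2. Direct: jump straight to record 42 via a computed offset.
with open(FILENAME, "rb") as f:
    f.seek(42 * RECORD_SIZE)
    rec_42 = f.read(RECORD_SIZE)

# 3. Indexed: map a key to its offset, then seek to it.
index = {f"person {i:05d}": i * RECORD_SIZE for i in range(1000)}
with open(FILENAME, "rb") as f:
    f.seek(index["person 00123"])
    rec_by_key = f.read(RECORD_SIZE)
```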
Understanding the quantitative performance differences is crucial for making informed decisions. Let's examine realistic throughput numbers across storage media.
Scenario: Reading 1 million 100-byte records (100MB total)
| Operation | HDD (7200 RPM) | SATA SSD | NVMe SSD |
|---|---|---|---|
| Sequential read all | ~0.6s (150 MB/s) | ~0.2s (500 MB/s) | ~0.03s (3 GB/s) |
| Random read all (1M seeks) | ~2.8 hours (100 IOPS) | ~20s (50K IOPS) | ~2s (500K IOPS) |
| Indexed read 1 record | ~40ms (4 seeks) | ~0.3ms (4 reads) | ~0.08ms (4 reads) |
| Sequential write all | ~0.8s (120 MB/s) | ~0.3s (350 MB/s) | ~0.05s (2 GB/s) |
| Random write all | ~5.6 hours (50 IOPS) | ~40s (25K IOPS) | ~4s (250K IOPS) |
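To see where these numbers come from: 1,000,000 random reads at 100 IOPS take 1,000,000 / 100 = 10,000 seconds, roughly 2.8 hours, while one sequential pass over the same 100 MB at 150 MB/s finishes in under a second. The cost is dominated by per-operation seek latency, not by the amount of data moved.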
Key observations:
HDD random vs. sequential: roughly a 10,000x difference. The same 100 MB takes under a second sequentially but hours as one-seek-per-record reads.
SSDs narrow the gap but don't eliminate it: on the numbers above, random access still costs one to two orders of magnitude more than sequential.
Indexed access is worthwhile even on SSD: four reads to find one record beats touching all million.
Batching matters: grouping random accesses and sorting them by file offset recovers much of the sequential advantage.
"""Demonstrates the throughput difference between access patterns.Run this on your system to see actual numbers for your storage.""" import osimport timeimport random RECORD_SIZE = 100NUM_RECORDS = 100_000 # 10MB total (adjust for your system)FILENAME = "test_data.bin" def create_test_file(): """Create a file with NUM_RECORDS fixed-size records""" with open(FILENAME, 'wb') as f: for i in range(NUM_RECORDS): record = f"Record {i:08d}".encode().ljust(RECORD_SIZE, b'\0') f.write(record) def sequential_read_all(): """Read entire file sequentially""" start = time.perf_counter() with open(FILENAME, 'rb') as f: while f.read(RECORD_SIZE): pass elapsed = time.perf_counter() - start throughput = (NUM_RECORDS * RECORD_SIZE) / elapsed / 1e6 print(f"Sequential read: {elapsed:.3f}s ({throughput:.1f} MB/s)") def random_read_all(): """Read all records in random order""" positions = list(range(NUM_RECORDS)) random.shuffle(positions) start = time.perf_counter() with open(FILENAME, 'rb') as f: for pos in positions: f.seek(pos * RECORD_SIZE) f.read(RECORD_SIZE) elapsed = time.perf_counter() - start iops = NUM_RECORDS / elapsed print(f"Random read all: {elapsed:.3f}s ({iops:.0f} IOPS)") def indexed_single_lookup(): """Simulate indexed lookup (4 seeks for B-tree depth 4)""" target = random.randint(0, NUM_RECORDS - 1) start = time.perf_counter() with open(FILENAME, 'rb') as f: # Simulate 3 internal node reads + 1 leaf read for _ in range(3): f.seek(random.randint(0, NUM_RECORDS - 1) * RECORD_SIZE) f.read(RECORD_SIZE) # Final read of actual record f.seek(target * RECORD_SIZE) record = f.read(RECORD_SIZE) elapsed = time.perf_counter() - start print(f"Indexed single lookup: {elapsed*1000:.2f}ms") if __name__ == "__main__": print(f"Test file: {NUM_RECORDS:,} records ({NUM_RECORDS * RECORD_SIZE / 1e6:.1f} MB)") if not os.path.exists(FILENAME): print("Creating test file...") create_test_file() # Warm up filesystem cache sequential_read_all() # Actual measurements print("\n--- Measurements ---") sequential_read_all() random_read_all() for _ in range(3): indexed_single_lookup()These measurements are significantly affected by OS caching. Warm cache makes reads appear faster (data comes from RAM, not disk). For realistic storage-bound numbers, either (1) use files larger than RAM, (2) clear caches between tests (sudo sync; echo 3 > /proc/sys/vm/drop_caches on Linux), or (3) use O_DIRECT to bypass caching.
To select the right access method, analyze your workload along these dimensions:
1. Read vs. Write Ratio: read-heavy workloads can afford index maintenance overhead; write-heavy workloads favor append-friendly layouts.
2. Access Pattern: full scans, point lookups by position or key, range queries, or a mix.
3. Data Size: data that fits in RAM changes the calculus entirely; on-disk layout matters more as data grows.
4. Latency Requirements: batch jobs tolerate seconds or hours; interactive lookups need milliseconds.
5. Consistency Requirements: how quickly readers must see new writes constrains how far indexes can lag behind the data.
| Workload Type | Characteristics | Recommended Approach |
|---|---|---|
| Log aggregation | Append writes, full scans | Sequential files, time-partitioned |
| OLTP database | Many point lookups, updates | B-tree indexes on key columns |
| Analytics warehouse | Batch scans, aggregations | Columnar format, sequential scan |
| Document store | Key-value lookups | Hash or B-tree primary index |
| Time series | Append writes, time-range queries | Time-partitioned + B-tree per partition |
| Search engine | Full-text queries | Inverted index + sequential posting lists |
| Media streaming | Sequential playback, occasional seek | Sequential with offset index for seeking |
| Configuration files | Read on startup, rarely write | Sequential (small enough to load fully) |
These are guidelines, not rules. Real workloads have nuances. A 'read-heavy' workload with reads that scan 90% of data behaves differently from one with 90% point lookups. Profile your actual access patterns before committing to an architecture.
Real-world systems rarely use a single access method in isolation. The most effective designs combine methods strategically, using each where it excels.
Pattern 1: Index + Sequential (Common in Databases)
Use an index to find the starting position, then read sequentially:
```sql
SELECT * FROM orders WHERE date > '2024-01-01' ORDER BY date;
```
1. Use B-tree index to find first matching row (indexed access)
2. Scan forward through sorted data (sequential access)
This combines O(log n) initial lookup with O(k) streaming of k results.
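A rough sketch of the pattern, assuming a file of fixed-size records sorted by key, with a hypothetical in-memory list of (key, offset) pairs standing in for the B-tree:

```python
import bisect

RECORD_SIZE = 100

def range_scan(filename, index, start_key, end_key):
    """index: sorted list of (key, offset) pairs, one per record.
    Binary-search for the first key >= start_key (the index lookup),
    then stream records sequentially until the key range ends."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, start_key)
    if i == len(index):
        return
    with open(filename, "rb") as f:
        f.seek(index[i][1])            # one seek to the starting position
        for key, _ in index[i:]:       # then a purely sequential read
            if key > end_key:
                break
            yield key, f.read(RECORD_SIZE)
```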
Pattern 2: Append-Only with Index (Log-Structured)
Write sequentially for performance; build index for lookup:
1. All writes append to log file (sequential)
2. Background process builds/updates index
3. Reads use index, then follow pointers
Used by: Log-structured merge trees (LSM), Kafka, many modern databases.
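A toy sketch of the idea (a real system would also persist the index and compact the log; the names here are illustrative):

```python
class AppendOnlyStore:
    """Log-structured store: sequential appends, dict index for reads."""

    def __init__(self, path):
        self.index = {}                   # key -> (offset, length)
        self.f = open(path, "ab+")

    def put(self, key: str, value: bytes):
        self.f.seek(0, 2)                 # append at end (sequential write)
        offset = self.f.tell()
        self.f.write(value)
        self.index[key] = (offset, len(value))

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]  # index lookup...
        self.f.seek(offset)               # ...then a single direct read
        return self.f.read(length)
```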
Pattern 3: Time-Partitioned with Per-Partition Index
Partition data by time; each partition has its own index:
```
/data/2024/01/data.db   ← index for January 2024
/data/2024/02/data.db   ← index for February 2024
/data/2024/03/data.db   ← index for March 2024
```
Query for February data only touches February index/files
Benefits: Old partitions are immutable (simple), queries skip irrelevant partitions.
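A sketch of the routing logic under the layout above (the helper names are hypothetical):

```python
import os

def partition_path(year: int, month: int) -> str:
    # Mirrors the directory layout shown above; paths are illustrative.
    return f"/data/{year}/{month:02d}/data.db"

def partitions_for_range(start, end):
    """Yield only the partition files a (year, month) range query touches;
    every other partition is never even opened."""
    year, month = start
    while (year, month) <= end:
        path = partition_path(year, month)
        if os.path.exists(path):
            yield path
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

# A February-only query touches exactly one partition:
# list(partitions_for_range((2024, 2), (2024, 2)))
```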
Pattern 4: Tiered Storage (Hot/Warm/Cold)
Different access methods at different tiers:
```
Tiered Storage Architecture:

┌────────────────────────────────────────────────────┐
│ HOT TIER (RAM)                                     │
│ - In-memory hash index for recent data             │
│ - O(1) access, limited capacity (e.g., 1 hour)     │
│ - Access: In-memory hash/tree lookup               │
└────────────────────────────────────────────────────┘
                    ↓ (age-out)
┌────────────────────────────────────────────────────┐
│ WARM TIER (SSD)                                    │
│ - B-tree indexed recent historical data            │
│ - O(log n) access, medium capacity (e.g., 30 days) │
│ - Access: B-tree index + direct read               │
└────────────────────────────────────────────────────┘
                    ↓ (age-out)
┌────────────────────────────────────────────────────┐
│ COLD TIER (HDD/Object Storage)                     │
│ - Compressed, sequential-access archives           │
│ - Sparse index or date-based partitioning          │
│ - Access: Sequential scan within time range        │
└────────────────────────────────────────────────────┘

Query routing:
1. Check hot tier (most recent, in-memory)
2. If not found, check warm tier (indexed on SSD)
3. If not found, identify cold partition and scan
```

Pattern 5: Materialized Views with Different Access
Maintain multiple representations of the same data, each optimized for a different query shape: for example, a raw event log kept for replay alongside pre-aggregated summaries kept for dashboards.
Trade-off: Storage multiplication for query flexibility.
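As a toy illustration, assuming a simple JSON event log and an in-memory per-user counter as the materialized view:

```python
import json
from collections import defaultdict

class EventStore:
    """Every event is appended to a raw log (for replay/audit queries)
    AND folded into a per-user count (for dashboard queries): two
    representations of the same data, each matching one access pattern."""

    def __init__(self, log_path):
        self.log = open(log_path, "a")
        self.counts_by_user = defaultdict(int)    # the materialized view

    def record(self, event: dict):
        self.log.write(json.dumps(event) + "\n")  # sequential append
        self.counts_by_user[event["user"]] += 1   # O(1) view update

    def user_count(self, user: str) -> int:
        return self.counts_by_user[user]          # no log scan needed
```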
Pattern 6: Index-Organized Tables
Store data directly in the index structure (clustered index): the B-tree's leaf pages hold the full rows rather than pointers to a separate heap, so a primary-key lookup reaches the data with no extra seek.
Used by: MySQL InnoDB primary indexes, Oracle IOTs.
Begin with the simplest approach that might work. Add complexity only when measurements show it's needed. A sequential scan over 10,000 records completes in milliseconds—often faster than the index lookup due to cache efficiency. Premature optimization with indexes can actually slow down small datasets.
Here's a systematic decision process for selecting access methods:
Step 1: Characterize the Workload
Quantify the five dimensions above: read/write ratio, dominant access pattern, data size (current and projected), latency targets, and consistency needs.
Step 2: Identify the Dominant Access Pattern
For each operation type, estimate frequency and criticality:
| Operation | Frequency | Latency Requirement | Priority |
|---|---|---|---|
| Full data export | Daily | Hours OK | Low |
| Lookup by user_id | 10K/sec | <50ms | Critical |
| Recent records scan | 100/sec | <200ms | High |
| New record append | 1K/sec | <100ms | High |
Step 3: Match Patterns to Methods
Using the dominant pattern, apply this decision tree:
```
Access Method Decision Tree:

Is the data small enough to fit in RAM?
├─ YES → Load into memory, use appropriate in-memory data structure
│        (HashMap for key lookup, ArrayList for sequential, TreeMap for range)
│
└─ NO → Continue...

What is the dominant access pattern?

├─ Process ALL records (logs, ETL, batch)
│   └─ Use SEQUENTIAL access
│       └─ Consider: partitioning by time, parallel processing
│
├─ Access by known POSITION (fixed-size records)
│   └─ Use DIRECT access with calculated offsets
│       └─ Consider: record alignment, sparse file support
│
├─ Access by KEY (lookups, point queries)
│   ├─ Only equality lookups needed?
│   │   └─ Consider HASH index (O(1) lookup)
│   └─ Range queries or ordering needed?
│       └─ Use B-TREE index (O(log n) lookup + range)
│
├─ Access by TIME RANGE
│   └─ TIME-PARTITIONED files + per-partition index
│       └─ Query routing skips irrelevant partitions
│
└─ MIXED patterns
    └─ Consider HYBRID approach:
        ├─ Primary access path (most frequent) → optimize heavily
        ├─ Secondary paths → acceptable trade-offs
        └─ Rare paths → can scan/be slow
```

Step 4: Validate with Prototyping
Before committing to an architecture, prototype the top candidate with realistic data volumes, measure the critical operations identified in Step 2, and verify they meet their latency requirements.
Step 5: Plan for Evolution
Workloads change. Design for flexibility: keep access-method choices behind clean interfaces, monitor actual access patterns in production, and be prepared to add or drop indexes, or repartition data, as patterns drift.
There's rarely a single 'correct' access method. Multiple approaches may work, each with different trade-offs. The goal is to find an approach that meets requirements with acceptable complexity. Perfect is the enemy of good—a working system beats an optimal design that's never built.
Let's examine how real systems combine access methods to meet their requirements.
Case Study 1: Apache Kafka
Workload: High-throughput message streaming. Producers append; consumers read sequentially from their last position.
Solution: Messages are appended to per-partition segment files (purely sequential writes). Consumers read sequentially from their saved offset, and a sparse offset index maps logical offsets to file positions for the occasional seek.
Why it works: The workload is inherently sequential. Indexes are minimal (sparse) because consumers almost never need random access.
Case Study 2: SQLite
Workload: Embedded database for applications. Mixed OLTP-style queries.
Solution: The entire database lives in a single file organized as B-tree pages. Tables and indexes are each B-trees; queries use index lookups where possible and fall back to table scans.
Why it works: B-trees handle the diverse query patterns. Single-file design simplifies embedding.
Case Study 3: Elasticsearch
Workload: Full-text search. Write once, query many times.
Solution: Documents are batch-indexed into immutable segments. An inverted index maps each term to the documents containing it, and segments are periodically merged in the background.
Why it works: Inverted indexes are optimal for text search. Immutability simplifies concurrency and enables aggressive caching.
Case Study 4: RocksDB/LevelDB
Workload: High-write-volume key-value store.
Solution: Writes go to an in-memory memtable (with a sequential write-ahead log for durability). Full memtables flush to sorted, immutable SSTable files; reads check the memtable, then SSTables level by level, using bloom filters to skip files that cannot contain the key.
Why it works: LSM design converts random writes to sequential, trading read amplification for write performance.
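A greatly simplified sketch of the LSM write path (real LSM trees add a write-ahead log, bloom filters, binary search within SSTables, and compaction):

```python
class MiniLSM:
    """Toy LSM: writes hit an in-memory memtable; full memtables are
    frozen into sorted, immutable runs (stand-ins for on-disk SSTables)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.limit = memtable_limit
        self.sstables = []                   # newest run last

    def put(self, key, value):
        self.memtable[key] = value           # random write absorbed in RAM
        if len(self.memtable) >= self.limit:
            run = sorted(self.memtable.items())
            self.sstables.append(run)        # one sequential flush to "disk"
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):  # then runs, newest to oldest
            for k, v in run:                 # (real SSTables binary-search)
                if k == key:
                    return v
        return None
```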
| System | Write Pattern | Read Pattern | Key Structure |
|---|---|---|---|
| Kafka | Sequential append | Sequential from offset | Sparse offset index |
| SQLite | B-tree update | B-tree + table scan | B-tree primary + secondary |
| Elasticsearch | Batch indexing | Inverted index lookup | Inverted + forward index |
| RocksDB | Memtable → SSTable | Multi-level search | LSM tree + bloom filters |
| PostgreSQL | Heap + WAL | Index + heap fetch | B-tree, hash, GiST, GIN |
| MongoDB | In-place or append | B-tree index | B-tree on _id + secondaries |
The best way to build intuition for access method selection is to study how successful systems solve similar problems. Read the architecture docs for databases and storage systems. Understand why they made their choices—and when those choices cause problems.
Understanding common mistakes helps you avoid them:
Pitfall 1: Over-Indexing
Symptom: Slow writes, excessive disk usage, minimal read improvement.
Cause: Adding indexes 'just in case' without profiling actual queries.
Solution: Only index columns that are actually used in WHERE clauses of frequent queries. Remove unused indexes. Measure before and after.
Pitfall 2: Under-Indexing
Symptom: Queries that should be fast take seconds or minutes.
Cause: Assuming small data will stay small; not adding indexes as data grows.
Solution: Monitor query latencies. Add indexes when full scans become unacceptable. Plan indexing strategy early.
Pitfall 3: Sequential Access on Random Workload
Symptom: Log processing takes hours; each lookup scans entire file.
Cause: Using append-only logs for key-based lookup without indexing.
Solution: If you need key-based lookup, add an index. Consider a database instead of raw files.
Pitfall 4: Random Access on Sequential Workload
Symptom: Batch job orders of magnitude slower than expected.
Cause: Processing records in random order instead of disk order.
Solution: Sort work by file offset before processing. Batch related operations.
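A minimal sketch of the fix, with illustrative names:

```python
# Turn a random-order batch job into (mostly) sequential I/O by
# sorting the work items by file offset before processing them.
def process_records(filename, record_ids, record_size=100):
    offsets = sorted(rid * record_size for rid in record_ids)  # disk order
    with open(filename, "rb") as f:
        for off in offsets:
            f.seek(off)                # seeks now move strictly forward,
            rec = f.read(record_size)  # letting the OS read ahead
            # ... handle rec ...
```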
Pitfall 5: Ignoring Cache Effects
Symptom: Performance that contradicts predictions; index lookups slower than expected, or 'random' reads that run at sequential speed because they are served from cache.
Cause: Not accounting for OS page cache behavior.
Solution: Understand your working set size vs. available RAM. Hot data stays cached; cold data requires disk I/O. Design for cache friendliness.
Never guess where the bottleneck is. Use profilers, tracing, and metrics to identify actual problems. The bottleneck is rarely where you expect. A surprising number of 'I/O' problems turn out to be CPU, memory, or network issues.
We've conducted a comprehensive comparison of file access methods and developed a framework for selection. The key insights to consolidate: no access method is universally superior; each optimizes for a different pattern. Match the method to your workload's dominant access pattern, expect real systems to combine methods in hybrid designs, start simple, and measure actual behavior before adding complexity.
What's next:
We've covered sequential, direct, and indexed access using traditional read/write system calls. But there's another powerful technique: memory-mapped file access. The next page explores how mapping files directly into virtual memory provides an elegant alternative that combines the simplicity of memory access with the persistence of files.
You now have a comprehensive framework for comparing and selecting file access methods. You understand the quantitative performance differences, the workload characteristics that favor each approach, hybrid strategies used in production systems, and common pitfalls to avoid. This knowledge is essential for designing efficient file I/O in any application.