We've explored three fundamental file access methods: sequential access (reading linearly from start to end), direct access (jumping to arbitrary positions), and indexed access (using auxiliary structures to map keys to locations). Each has distinct strengths and weaknesses.
But knowing how each method works isn't enough. The critical skill is knowing when to use each one. This decision profoundly impacts application performance—sometimes by orders of magnitude.
A common mistake is applying the same access pattern everywhere. A developer comfortable with databases might over-index everything, incurring write overhead on workloads that are naturally sequential. Conversely, someone used to streaming data might perform linear scans where a simple index would reduce runtime from hours to milliseconds.
This page brings together everything we've learned to provide a decision framework. We'll compare the methods head-to-head across multiple dimensions, examine hybrid strategies, and work through real-world scenarios to build intuition for access method selection.
By the end of this page, you will be able to: (1) Compare access methods across key performance metrics, (2) Identify workload characteristics that favor each method, (3) Recognize hybrid strategies that combine multiple approaches, (4) Select the optimal access method for any given scenario with confidence.
Let's systematically compare the three access methods across critical performance dimensions. These comparisons assume typical hardware (HDD or SSD) and reasonable implementation quality.
| Metric | Sequential | Direct (Random) | Indexed |
|---|---|---|---|
| Read entire file | ⭐⭐⭐⭐⭐ Optimal | ⭐⭐⭐⭐⭐ Same as sequential | ➖ N/A |
| Find one record by position | ⭐⭐ O(n) scan | ⭐⭐⭐⭐⭐ O(1) seek | ⭐⭐⭐⭐ O(log n) |
| Find one record by key | ⭐⭐ O(n) scan | ⭐⭐ O(n) scan | ⭐⭐⭐⭐⭐ O(log n) or O(1) |
| Range query | ⭐⭐⭐ Full scan | ⭐⭐⭐ Scattered reads | ⭐⭐⭐⭐⭐ Efficient (B-tree) |
| Append new record | ⭐⭐⭐⭐⭐ O(1) | ⭐⭐⭐⭐ O(1) | ⭐⭐⭐ O(log n) + index update |
| Update existing record | ⭐⭐ Rewrite file | ⭐⭐⭐⭐⭐ O(1) in place | ⭐⭐⭐⭐ O(log n) seek + write |
| Delete record | ⭐⭐ Compact file | ⭐⭐⭐⭐ Mark deleted | ⭐⭐⭐ Update index(es) |
| Storage overhead | ⭐⭐⭐⭐⭐ None | ⭐⭐⭐⭐⭐ None | ⭐⭐⭐ Index space (1-10%) |
| Implementation complexity | ⭐⭐⭐⭐⭐ Simple | ⭐⭐⭐⭐ Simple | ⭐⭐ Complex |
| Predictable performance | ⭐⭐⭐⭐⭐ Very | ⭐⭐⭐⭐ Very | ⭐⭐⭐ Variable (cache, tree depth) |
Interpretation:
Sequential access excels when you process data in order (logs, streams, batch ETL). It achieves maximum throughput because it aligns with physical storage characteristics.
Direct access excels when you know the exact position of what you need (record-based files, memory-mapped editing, fixed-size record arrays).
Indexed access excels when you need to find records by key (databases, search, any key-value lookup). It trades space and write complexity for dramatically faster reads.
The key insight: None of these methods is universally superior. Each optimizes for different access patterns. The right choice depends entirely on your workload.
Ask yourself: 'How will data be accessed?' (1) Process everything once → Sequential. (2) Access by position → Direct. (3) Access by key → Indexed. Real applications often combine these—e.g., an indexed primary lookup followed by sequential streaming of results.
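To make the three idioms concrete, here's a minimal Python sketch; the file name, record size, and key format are illustrative, not from any particular system:

```python
RECORD_SIZE = 100
FILENAME = "people.dat"

# Build a tiny demo file: 1,000 fixed-size records.
with open(FILENAME, "wb") as f:
    for i in range(1000):
        f.write(f"person {i:05d}".encode().ljust(RECORD_SIZE, b"\0"))

# 1. Sequential: process everything once, in file order.
with open(FILENAME, "rb") as f:
    while (rec := f.read(RECORD_SIZE)):
        pass  # handle each record here

# 2. Direct: jump straight to record 42 via a computed offset.
with open(FILENAME, "rb") as f:
    f.seek(42 * RECORD_SIZE)
    rec_42 = f.read(RECORD_SIZE)

# 3. Indexed: map a key to its offset, then seek to it.
index = {f"person {i:05d}": i * RECORD_SIZE for i in range(1000)}
with open(FILENAME, "rb") as f:
    f.seek(index["person 00123"])
    rec_by_key = f.read(RECORD_SIZE)
```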
Understanding the quantitative performance differences is crucial for making informed decisions. Let's examine realistic throughput numbers across storage media.
Scenario: Reading 1 million 100-byte records (100MB total)
| Operation | HDD (7200 RPM) | SATA SSD | NVMe SSD |
|---|---|---|---|
| Sequential read all | ~0.6s (150 MB/s) | ~0.2s (500 MB/s) | ~0.03s (3 GB/s) |
| Random read all (1M seeks) | ~2.8 hours (100 IOPS) | ~20s (50K IOPS) | ~2s (500K IOPS) |
| Indexed read 1 record | ~40ms (4 seeks) | ~0.3ms (4 reads) | ~0.08ms (4 reads) |
| Sequential write all | ~0.8s (120 MB/s) | ~0.3s (350 MB/s) | ~0.05s (2 GB/s) |
| Random write all | ~5.6 hours (50 IOPS) | ~40s (25K IOPS) | ~4s (250K IOPS) |
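To see where these numbers come from: 1,000,000 random reads at 100 IOPS take 1,000,000 / 100 = 10,000 seconds, roughly 2.8 hours, while one sequential pass over the same 100 MB at 150 MB/s finishes in under a second. The cost is dominated by per-operation seek latency, not by the amount of data moved.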
Key observations:
HDD random vs. sequential: roughly a 10,000x difference. The same 100 MB takes under a second sequentially but hours as one-seek-per-record reads.
SSDs narrow the gap but don't eliminate it: on the numbers above, random access still costs one to two orders of magnitude more than sequential.
Indexed access is worthwhile even on SSD: four reads to find one record beats touching all million.
Batching matters: grouping random accesses and sorting them by file offset recovers much of the sequential advantage.
"""Demonstrates the throughput difference between access patterns.Run this on your system to see actual numbers for your storage.""" import osimport timeimport random RECORD_SIZE = 100NUM_RECORDS = 100_000 # 10MB total (adjust for your system)FILENAME = "test_data.bin" def create_test_file(): """Create a file with NUM_RECORDS fixed-size records""" with open(FILENAME, 'wb') as f: for i in range(NUM_RECORDS): record = f"Record {i:08d}".encode().ljust(RECORD_SIZE, b'\0') f.write(record) def sequential_read_all(): """Read entire file sequentially""" start = time.perf_counter() with open(FILENAME, 'rb') as f: while f.read(RECORD_SIZE): pass elapsed = time.perf_counter() - start throughput = (NUM_RECORDS * RECORD_SIZE) / elapsed / 1e6 print(f"Sequential read: {elapsed:.3f}s ({throughput:.1f} MB/s)") def random_read_all(): """Read all records in random order""" positions = list(range(NUM_RECORDS)) random.shuffle(positions) start = time.perf_counter() with open(FILENAME, 'rb') as f: for pos in positions: f.seek(pos * RECORD_SIZE) f.read(RECORD_SIZE) elapsed = time.perf_counter() - start iops = NUM_RECORDS / elapsed print(f"Random read all: {elapsed:.3f}s ({iops:.0f} IOPS)") def indexed_single_lookup(): """Simulate indexed lookup (4 seeks for B-tree depth 4)""" target = random.randint(0, NUM_RECORDS - 1) start = time.perf_counter() with open(FILENAME, 'rb') as f: # Simulate 3 internal node reads + 1 leaf read for _ in range(3): f.seek(random.randint(0, NUM_RECORDS - 1) * RECORD_SIZE) f.read(RECORD_SIZE) # Final read of actual record f.seek(target * RECORD_SIZE) record = f.read(RECORD_SIZE) elapsed = time.perf_counter() - start print(f"Indexed single lookup: {elapsed*1000:.2f}ms") if __name__ == "__main__": print(f"Test file: {NUM_RECORDS:,} records ({NUM_RECORDS * RECORD_SIZE / 1e6:.1f} MB)") if not os.path.exists(FILENAME): print("Creating test file...") create_test_file() # Warm up filesystem cache sequential_read_all() # Actual measurements print("\n--- Measurements ---") sequential_read_all() random_read_all() for _ in range(3): indexed_single_lookup()These measurements are significantly affected by OS caching. Warm cache makes reads appear faster (data comes from RAM, not disk). For realistic storage-bound numbers, either (1) use files larger than RAM, (2) clear caches between tests (sudo sync; echo 3 > /proc/sys/vm/drop_caches on Linux), or (3) use O_DIRECT to bypass caching.
To select the right access method, analyze your workload along these dimensions:
1. Read vs. Write Ratio: read-heavy workloads can afford index maintenance overhead; write-heavy workloads favor append-friendly layouts.
2. Access Pattern: full scans, point lookups by position or key, range queries, or a mix.
3. Data Size: data that fits in RAM changes the calculus entirely; on-disk layout matters more as data grows.
4. Latency Requirements: batch jobs tolerate seconds or hours; interactive lookups need milliseconds.
5. Consistency Requirements: how quickly readers must see new writes constrains how far indexes can lag behind the data.
| Workload Type | Characteristics | Recommended Approach |
|---|---|---|
| Log aggregation | Append writes, full scans | Sequential files, time-partitioned |
| OLTP database | Many point lookups, updates | B-tree indexes on key columns |
| Analytics warehouse | Batch scans, aggregations | Columnar format, sequential scan |
| Document store | Key-value lookups | Hash or B-tree primary index |
| Time series | Append writes, time-range queries | Time-partitioned + B-tree per partition |
| Search engine | Full-text queries | Inverted index + sequential posting lists |
| Media streaming | Sequential playback, occasional seek | Sequential with offset index for seeking |
| Configuration files | Read on startup, rarely write | Sequential (small enough to load fully) |
These are guidelines, not rules. Real workloads have nuances. A 'read-heavy' workload with reads that scan 90% of data behaves differently from one with 90% point lookups. Profile your actual access patterns before committing to an architecture.
Real-world systems rarely use a single access method in isolation. The most effective designs combine methods strategically, using each where it excels.
Pattern 1: Index + Sequential (Common in Databases)
Use an index to find the starting position, then read sequentially:
```sql
SELECT * FROM orders WHERE date > '2024-01-01' ORDER BY date;
```
1. Use B-tree index to find first matching row (indexed access)
2. Scan forward through sorted data (sequential access)
This combines O(log n) initial lookup with O(k) streaming of k results.
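A rough sketch of the pattern, assuming a file of fixed-size records sorted by key, with a hypothetical in-memory list of (key, offset) pairs standing in for the B-tree:

```python
import bisect

RECORD_SIZE = 100

def range_scan(filename, index, start_key, end_key):
    """index: sorted list of (key, offset) pairs, one per record.
    Binary-search for the first key >= start_key (the index lookup),
    then stream records sequentially until the key range ends."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, start_key)
    if i == len(index):
        return
    with open(filename, "rb") as f:
        f.seek(index[i][1])            # one seek to the starting position
        for key, _ in index[i:]:       # then a purely sequential read
            if key > end_key:
                break
            yield key, f.read(RECORD_SIZE)
```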
Pattern 2: Append-Only with Index (Log-Structured)
Write sequentially for performance; build index for lookup:
1. All writes append to log file (sequential)
2. Background process builds/updates index
3. Reads use index, then follow pointers
Used by: Log-structured merge trees (LSM), Kafka, many modern databases.
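A toy sketch of the idea (a real system would also persist the index and compact the log; the names here are illustrative):

```python
class AppendOnlyStore:
    """Log-structured store: sequential appends, dict index for reads."""

    def __init__(self, path):
        self.index = {}                   # key -> (offset, length)
        self.f = open(path, "ab+")

    def put(self, key: str, value: bytes):
        self.f.seek(0, 2)                 # append at end (sequential write)
        offset = self.f.tell()
        self.f.write(value)
        self.index[key] = (offset, len(value))

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]  # index lookup...
        self.f.seek(offset)               # ...then a single direct read
        return self.f.read(length)
```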
Pattern 3: Time-Partitioned with Per-Partition Index
Partition data by time; each partition has its own index:
```
/data/2024/01/data.db   ← index for January 2024
/data/2024/02/data.db   ← index for February 2024
/data/2024/03/data.db   ← index for March 2024
```
Query for February data only touches February index/files
Benefits: Old partitions are immutable (simple), queries skip irrelevant partitions.
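A sketch of the routing logic under the layout above (the helper names are hypothetical):

```python
import os

def partition_path(year: int, month: int) -> str:
    # Mirrors the directory layout shown above; paths are illustrative.
    return f"/data/{year}/{month:02d}/data.db"

def partitions_for_range(start, end):
    """Yield only the partition files a (year, month) range query touches;
    every other partition is never even opened."""
    year, month = start
    while (year, month) <= end:
        path = partition_path(year, month)
        if os.path.exists(path):
            yield path
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

# A February-only query touches exactly one partition:
# list(partitions_for_range((2024, 2), (2024, 2)))
```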
Pattern 4: Tiered Storage (Hot/Warm/Cold)
Different access methods at different tiers:
```
Tiered Storage Architecture:

┌────────────────────────────────────────────────────┐
│ HOT TIER (RAM)                                     │
│ - In-memory hash index for recent data             │
│ - O(1) access, limited capacity (e.g., 1 hour)     │
│ - Access: In-memory hash/tree lookup               │
└────────────────────────────────────────────────────┘
                    ↓ (age-out)
┌────────────────────────────────────────────────────┐
│ WARM TIER (SSD)                                    │
│ - B-tree indexed recent historical data            │
│ - O(log n) access, medium capacity (e.g., 30 days) │
│ - Access: B-tree index + direct read               │
└────────────────────────────────────────────────────┘
                    ↓ (age-out)
┌────────────────────────────────────────────────────┐
│ COLD TIER (HDD/Object Storage)                     │
│ - Compressed, sequential-access archives           │
│ - Sparse index or date-based partitioning          │
│ - Access: Sequential scan within time range        │
└────────────────────────────────────────────────────┘

Query routing:
1. Check hot tier (most recent, in-memory)
2. If not found, check warm tier (indexed on SSD)
3. If not found, identify cold partition and scan
```

Pattern 5: Materialized Views with Different Access
Maintain multiple representations of the same data, each optimized for a different query shape: for example, a raw event log kept for replay alongside pre-aggregated summaries kept for dashboards.
Trade-off: Storage multiplication for query flexibility.
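As a toy illustration, assuming a simple JSON event log and an in-memory per-user counter as the materialized view:

```python
import json
from collections import defaultdict

class EventStore:
    """Every event is appended to a raw log (for replay/audit queries)
    AND folded into a per-user count (for dashboard queries): two
    representations of the same data, each matching one access pattern."""

    def __init__(self, log_path):
        self.log = open(log_path, "a")
        self.counts_by_user = defaultdict(int)    # the materialized view

    def record(self, event: dict):
        self.log.write(json.dumps(event) + "\n")  # sequential append
        self.counts_by_user[event["user"]] += 1   # O(1) view update

    def user_count(self, user: str) -> int:
        return self.counts_by_user[user]          # no log scan needed
```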
Pattern 6: Index-Organized Tables
Store data directly in the index structure (clustered index): the B-tree's leaf pages hold the full rows rather than pointers to a separate heap, so a primary-key lookup reaches the data with no extra seek.
Used by: MySQL InnoDB primary indexes, Oracle IOTs.
Begin with the simplest approach that might work. Add complexity only when measurements show it's needed. A sequential scan over 10,000 records completes in milliseconds—often faster than the index lookup due to cache efficiency. Premature optimization with indexes can actually slow down small datasets.
Here's a systematic decision process for selecting access methods:
Step 1: Characterize the Workload
Quantify the five dimensions above: read/write ratio, dominant access pattern, data size (current and projected), latency targets, and consistency needs.
Step 2: Identify the Dominant Access Pattern
For each operation type, estimate frequency and criticality:
| Operation | Frequency | Latency Requirement | Priority |
|---|---|---|---|
| Full data export | Daily | Hours OK | Low |
| Lookup by user_id | 10K/sec | <50ms | Critical |
| Recent records scan | 100/sec | <200ms | High |
| New record append | 1K/sec | <100ms | High |
Step 3: Match Patterns to Methods
Using the dominant pattern, apply this decision tree:
```
Access Method Decision Tree:

Is the data small enough to fit in RAM?
├─ YES → Load into memory, use appropriate in-memory data structure
│        (HashMap for key lookup, ArrayList for sequential, TreeMap for range)
│
└─ NO → Continue...

What is the dominant access pattern?

├─ Process ALL records (logs, ETL, batch)
│   └─ Use SEQUENTIAL access
│       └─ Consider: partitioning by time, parallel processing
│
├─ Access by known POSITION (fixed-size records)
│   └─ Use DIRECT access with calculated offsets
│       └─ Consider: record alignment, sparse file support
│
├─ Access by KEY (lookups, point queries)
│   ├─ Only equality lookups needed?
│   │   └─ Consider HASH index (O(1) lookup)
│   └─ Range queries or ordering needed?
│       └─ Use B-TREE index (O(log n) lookup + range)
│
├─ Access by TIME RANGE
│   └─ TIME-PARTITIONED files + per-partition index
│       └─ Query routing skips irrelevant partitions
│
└─ MIXED patterns
    └─ Consider HYBRID approach:
        ├─ Primary access path (most frequent) → optimize heavily
        ├─ Secondary paths → acceptable trade-offs
        └─ Rare paths → can scan/be slow
```

Step 4: Validate with Prototyping
Before committing to an architecture, prototype the top candidate with realistic data volumes, measure the critical operations identified in Step 2, and verify they meet their latency requirements.
Step 5: Plan for Evolution
Workloads change. Design for flexibility: keep access-method choices behind clean interfaces, monitor actual access patterns in production, and be prepared to add or drop indexes, or repartition data, as patterns drift.
There's rarely a single 'correct' access method. Multiple approaches may work, each with different trade-offs. The goal is to find an approach that meets requirements with acceptable complexity. Perfect is the enemy of good—a working system beats an optimal design that's never built.
Let's examine how real systems combine access methods to meet their requirements.
Case Study 1: Apache Kafka
Workload: High-throughput message streaming. Producers append; consumers read sequentially from their last position.
Solution: Messages are appended to per-partition segment files (purely sequential writes). Consumers read sequentially from their saved offset, and a sparse offset index maps logical offsets to file positions for the occasional seek.
Why it works: The workload is inherently sequential. Indexes are minimal (sparse) because consumers almost never need random access.
Case Study 2: SQLite
Workload: Embedded database for applications. Mixed OLTP-style queries.
Solution: The entire database lives in a single file organized as B-tree pages. Tables and indexes are each B-trees; queries use index lookups where possible and fall back to table scans.
Why it works: B-trees handle the diverse query patterns. Single-file design simplifies embedding.
Case Study 3: Elasticsearch
Workload: Full-text search. Write once, query many times.
Solution: Documents are batch-indexed into immutable segments. An inverted index maps each term to the documents containing it, and segments are periodically merged in the background.
Why it works: Inverted indexes are optimal for text search. Immutability simplifies concurrency and enables aggressive caching.
Case Study 4: RocksDB/LevelDB
Workload: High-write-volume key-value store.
Solution: Writes go to an in-memory memtable (with a sequential write-ahead log for durability). Full memtables flush to sorted, immutable SSTable files; reads check the memtable, then SSTables level by level, using bloom filters to skip files that cannot contain the key.
Why it works: LSM design converts random writes to sequential, trading read amplification for write performance.
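A greatly simplified sketch of the LSM write path (real LSM trees add a write-ahead log, bloom filters, binary search within SSTables, and compaction):

```python
class MiniLSM:
    """Toy LSM: writes hit an in-memory memtable; full memtables are
    frozen into sorted, immutable runs (stand-ins for on-disk SSTables)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.limit = memtable_limit
        self.sstables = []                   # newest run last

    def put(self, key, value):
        self.memtable[key] = value           # random write absorbed in RAM
        if len(self.memtable) >= self.limit:
            run = sorted(self.memtable.items())
            self.sstables.append(run)        # one sequential flush to "disk"
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):  # then runs, newest to oldest
            for k, v in run:                 # (real SSTables binary-search)
                if k == key:
                    return v
        return None
```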
| System | Write Pattern | Read Pattern | Key Structure |
|---|---|---|---|
| Kafka | Sequential append | Sequential from offset | Sparse offset index |
| SQLite | B-tree update | B-tree + table scan | B-tree primary + secondary |
| Elasticsearch | Batch indexing | Inverted index lookup | Inverted + forward index |
| RocksDB | Memtable → SSTable | Multi-level search | LSM tree + bloom filters |
| PostgreSQL | Heap + WAL | Index + heap fetch | B-tree, hash, GiST, GIN |
| MongoDB | In-place or append | B-tree index | B-tree on _id + secondaries |
The best way to build intuition for access method selection is to study how successful systems solve similar problems. Read the architecture docs for databases and storage systems. Understand why they made their choices—and when those choices cause problems.
Understanding common mistakes helps you avoid them:
Pitfall 1: Over-Indexing
Symptom: Slow writes, excessive disk usage, minimal read improvement.
Cause: Adding indexes 'just in case' without profiling actual queries.
Solution: Only index columns that are actually used in WHERE clauses of frequent queries. Remove unused indexes. Measure before and after.
Pitfall 2: Under-Indexing
Symptom: Queries that should be fast take seconds or minutes.
Cause: Assuming small data will stay small; not adding indexes as data grows.
Solution: Monitor query latencies. Add indexes when full scans become unacceptable. Plan indexing strategy early.
Pitfall 3: Sequential Access on Random Workload
Symptom: Log processing takes hours; each lookup scans entire file.
Cause: Using append-only logs for key-based lookup without indexing.
Solution: If you need key-based lookup, add an index. Consider a database instead of raw files.
Pitfall 4: Random Access on Sequential Workload
Symptom: Batch job orders of magnitude slower than expected.
Cause: Processing records in random order instead of disk order.
Solution: Sort work by file offset before processing. Batch related operations.
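A minimal sketch of the fix, with illustrative names:

```python
# Turn a random-order batch job into (mostly) sequential I/O by
# sorting the work items by file offset before processing them.
def process_records(filename, record_ids, record_size=100):
    offsets = sorted(rid * record_size for rid in record_ids)  # disk order
    with open(filename, "rb") as f:
        for off in offsets:
            f.seek(off)                # seeks now move strictly forward,
            rec = f.read(record_size)  # letting the OS read ahead
            # ... handle rec ...
```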
Pitfall 5: Ignoring Cache Effects
Symptom: Performance that contradicts predictions; index lookups slower than expected, or 'random' reads that run at sequential speed because they are served from cache.
Cause: Not accounting for OS page cache behavior.
Solution: Understand your working set size vs. available RAM. Hot data stays cached; cold data requires disk I/O. Design for cache friendliness.
Never guess where the bottleneck is. Use profilers, tracing, and metrics to identify actual problems. The bottleneck is rarely where you expect. A surprising number of 'I/O' problems turn out to be CPU, memory, or network issues.
We've conducted a comprehensive comparison of file access methods and developed a framework for selection. The key insights to consolidate: no access method is universally superior; each optimizes for a different pattern. Match the method to your workload's dominant access pattern, expect real systems to combine methods in hybrid designs, start simple, and measure actual behavior before adding complexity.
What's next:
We've covered sequential, direct, and indexed access using traditional read/write system calls. But there's another powerful technique: memory-mapped file access. The next page explores how mapping files directly into virtual memory provides an elegant alternative that combines the simplicity of memory access with the persistence of files.
You now have a comprehensive framework for comparing and selecting file access methods. You understand the quantitative performance differences, the workload characteristics that favor each approach, hybrid strategies used in production systems, and common pitfalls to avoid. This knowledge is essential for designing efficient file I/O in any application.