Direct access lets you seek to byte 1,000,000—but how do you know that's where the record you want is stored? How does a database find account #12345 among millions of records? How does an email client locate message ID 'abc123def' without scanning the entire mailbox?
The answer is indexed access: an approach that maintains auxiliary data structures (indexes) mapping logical identifiers (keys) to physical locations (file offsets or block numbers). The index transforms human-meaningful lookups into efficient random access operations.
The fundamental insight:
An index is a map from search keys to storage locations, usually kept sorted, that trades space and write overhead for dramatically faster reads.
This transformation—from linear search to logarithmic or constant-time lookup—is arguably the most important optimization technique in all of computer science. Every database, search engine, file system, and key-value store depends fundamentally on indexed access.
By the end of this page, you will understand indexed file access at a deep level—the conceptual model, common index structures (hash, B-tree, LSM), how indexes interact with file organization, the space/time trade-offs involved, and practical implementation patterns used in production systems.
Indexed access introduces a layer of indirection between logical data identity and physical storage location:
```
Application              Index                       Data File
-----------              -----                       ---------
"Find user 'alice'"  →   index['alice'] = 47520  →   seek(47520) → read
```
Components of indexed access:
Data File — Contains the actual records, stored sequentially, by insertion order, or in some other primary organization.
Index — A separate data structure mapping search keys to record locations. May be stored in a dedicated index file, in a reserved region of the data file, or entirely in memory (rebuilt or reloaded when the program starts).
Search Key — The attribute(s) used for lookup. Can be a single attribute (a user ID, an email address) or a composite of several attributes (for example, last name plus first name).
Index entry structure:
Each index entry typically contains the search key value and a pointer to the record: a file offset, a block number, or the record's primary key.
Types of indexes:
| Classification | Types | Characteristics |
|---|---|---|
| By uniqueness | Primary (unique), Secondary (non-unique) | Primary: exactly one record per key; Secondary: multiple records possible |
| By organization | Clustered, Non-clustered | Clustered: data file sorted by index key; Non-clustered: data in different order |
| By structure | Hash, Tree (B-tree), Bitmap, etc. | Each structure optimizes for different query patterns |
| By density | Dense, Sparse | Dense: entry for every record; Sparse: entry for every block |
Dense vs. Sparse indexes:
Dense index: Contains an entry for every record in the data file. Allows direct lookup but requires more space.
Sparse index: Contains entries only for some records (e.g., one per disk block). Requires data file to be sorted by index key. Smaller but requires block search after index lookup.
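As a sketch of how a sparse index is consulted, assuming the data file is sorted by key and an in-memory array holds one entry per block (names and sizes are illustrative):

```c
#include <string.h>

/* One sparse entry per data block: the first key stored in that block. */
typedef struct {
    char first_key[32];
    long block_no;
} SparseEntry;

/* Binary-search for the last block whose first_key <= key; the caller
 * then reads just that block and scans its records for the exact key. */
long sparse_find_block(const SparseEntry *idx, int n, const char *key) {
    int lo = 0, hi = n - 1;
    long block = -1;                      /* -1: key precedes every block */
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (strcmp(idx[mid].first_key, key) <= 0) {
            block = idx[mid].block_no;    /* candidate; keep looking right */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return block;
}
```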
Clustered vs. Non-clustered:
Clustered index: The data file is physically sorted according to the index key. Only one clustered index possible per table. Range queries are extremely efficient.
Non-clustered index: The index order differs from data file order. Multiple non-clustered indexes possible. Each lookup may require random access to scattered locations.
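The I/O difference shows up most clearly in a range scan. A rough sketch, assuming fixed-size records and illustrative function names:

```c
#include <unistd.h>

typedef struct { int key; char payload[60]; } Record;

/* Clustered: matching records are physically contiguous, so a range
 * scan is one seek plus a single sequential read. */
int scan_clustered(int fd, off_t first_match, Record *out, int n) {
    ssize_t got = pread(fd, out, (size_t)n * sizeof(Record), first_match);
    return got < 0 ? -1 : (int)(got / (ssize_t)sizeof(Record));
}

/* Non-clustered: the index yields scattered offsets, so every matching
 * record costs its own random read. */
int scan_nonclustered(int fd, const off_t *offsets, int n, Record *out) {
    for (int i = 0; i < n; i++) {
        if (pread(fd, &out[i], sizeof(Record), offsets[i]) != (ssize_t)sizeof(Record))
            return -1;
    }
    return n;
}
```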
Indexes occupy storage space and slow down writes (every insert/update must maintain the index). In exchange, they dramatically accelerate reads. The decision to create an index is always a trade-off: Is the read speedup worth the space overhead and write slowdown for this particular access pattern?
The simplest form of indexing uses a hash table: compute a hash function on the search key, and use the result to determine where the record pointer is stored.
How hash indexes work:
Hash function: h(key) → bucket number (integer in range 0 to B-1, where B = number of buckets)
Bucket: Each bucket stores entries for keys that hash to that bucket number. Buckets may be fixed-size slots in the index file, in-memory arrays, or chains that spill into overflow buckets when they fill up.
Lookup: Compute h(key), go to that bucket, find the entry, follow pointer to data
Insert: Compute h(key), add entry to appropriate bucket (handle overflow if needed)
Strengths of hash indexes: O(1) average-case lookups, a simple structure, and low per-lookup cost when keys hash uniformly across buckets.
Weaknesses of hash indexes: no support for range queries, prefix matching, or ordered iteration; lookups degrade toward O(n) when many keys collide in one bucket; and growing the bucket array generally requires expensive rehashing (unless extendible or linear hashing is used).
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>     /* pread, pwrite, off_t */

/**
 * Simple on-disk hash index implementation
 * Demonstrates the core concepts of hash-based file indexing
 */

#define BUCKET_COUNT 1024
#define ENTRIES_PER_BUCKET 16

typedef struct {
    char key[32];           // Search key
    off_t offset;           // File offset of data record
} HashEntry;

typedef struct {
    HashEntry entries[ENTRIES_PER_BUCKET];
    int count;
    off_t overflow_ptr;     // Pointer to overflow bucket (0 if none)
} HashBucket;

typedef struct {
    int index_fd;           // File descriptor for index file
    int data_fd;            // File descriptor for data file
    int bucket_count;
} HashIndex;

// Simple djb2 hash function
unsigned long hash_key(const char *key) {
    unsigned long hash = 5381;
    int c;
    while ((c = *key++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash;
}

// Lookup: Find record offset by key
off_t hash_index_lookup(HashIndex *idx, const char *key) {
    // 1. Compute hash and determine bucket number
    unsigned long bucket_num = hash_key(key) % idx->bucket_count;

    // 2. Read the bucket from index file
    HashBucket bucket;
    off_t bucket_offset = bucket_num * sizeof(HashBucket);
    pread(idx->index_fd, &bucket, sizeof(bucket), bucket_offset);

    // 3. Search within the bucket
    for (int i = 0; i < bucket.count; i++) {
        if (strcmp(bucket.entries[i].key, key) == 0) {
            return bucket.entries[i].offset;  // Found!
        }
    }

    // 4. Check overflow bucket if it exists
    if (bucket.overflow_ptr != 0) {
        // Follow the overflow chain here
        // (simplified: in production, iterate through overflow buckets)
        return -1;  // Not found (simplified)
    }

    return -1;  // Key not found
}

// Insert: Add a key-offset pair to the index
int hash_index_insert(HashIndex *idx, const char *key, off_t offset) {
    unsigned long bucket_num = hash_key(key) % idx->bucket_count;

    HashBucket bucket;
    off_t bucket_offset = bucket_num * sizeof(HashBucket);
    pread(idx->index_fd, &bucket, sizeof(bucket), bucket_offset);

    // Check if bucket has room
    if (bucket.count < ENTRIES_PER_BUCKET) {
        // Add entry to bucket (null-terminate the copied key)
        strncpy(bucket.entries[bucket.count].key, key, 31);
        bucket.entries[bucket.count].key[31] = '\0';
        bucket.entries[bucket.count].offset = offset;
        bucket.count++;

        // Write bucket back
        pwrite(idx->index_fd, &bucket, sizeof(bucket), bucket_offset);
        return 0;
    }

    // Bucket full: would need overflow handling
    // (allocate overflow bucket, link from parent)
    return -1;  // Overflow not implemented in simplified example
}

/*
 * Performance characteristics:
 *
 * - Lookup: O(1) average, O(n) worst case (all keys in one bucket)
 * - Insert: O(1) average, O(1) amortized with rehashing
 * - Range query: O(n) - must scan all buckets
 * - Space: O(n) entries + O(B) buckets overhead
 *
 * Real implementations use:
 * - Extendible hashing (grows without full rehash)
 * - Linear hashing (incremental growth)
 * - Cuckoo hashing (guaranteed O(1) lookup)
 */
```

Use hash indexes when: (1) you only need equality lookups (key = value), (2) keys are uniformly distributed, (3) the number of keys is relatively stable. Avoid them for range queries, prefix matching, ordered iteration, or highly dynamic key sets.
The B-tree (and its variant, the B+tree) is the dominant index structure in databases, file systems, and storage engines worldwide. It provides logarithmic lookup time while supporting range queries and maintaining sorted order.
Why B-trees dominate: they keep keys in sorted order (so one structure serves both point lookups and range scans), they remain shallow even for billions of keys, and their node size can be matched to the disk page size so each I/O retrieves many keys at once.
B-tree structure:
A B-tree of order m has these properties: each node holds at most m children (m - 1 keys), every non-root node stays at least half full, all leaves sit at the same depth, and the keys within each node are kept in sorted order.
B+tree enhancement:
Databases typically use B+trees, where all record pointers live in the leaf level, internal nodes hold only keys for routing, and the leaves are chained together so a range scan can walk sequentially from one leaf to the next.
Example structure with m=4:
```
B+Tree of order 4 (max 3 keys per node):

                   [30 | 60 |   ]                    ← Root (internal)
                  /       |       \
                 /        |        \
      [10|20|  ]     [40|50|  ]     [70|80|90]       ← Internal nodes
      /  |   \       /   |   \      /  |  |  \
     ↓   ↓    ↓     ↓    ↓    ↓    ↓   ↓  ↓   ↓
  [L1] [L2] [L3]  [L4] [L5] [L6]  [L7][L8][L9][L10]  ← Leaf nodes

Leaf nodes contain:
- Actual key values
- Pointers to data records
- Link to next leaf (for range scans)

Example leaf node L4:
+-------------------------+
| 31 → offset 4500        |
| 35 → offset 2200        |
| 38 → offset 9100        |
| next_leaf → L5          |
+-------------------------+

To find key 35:
1. Start at root [30|60]
2. 35 > 30 and 35 < 60, go to middle child [40|50]
3. 35 < 40, go to leftmost child (L4)
4. Linear search in L4, find 35 → offset 2200
5. Seek to offset 2200 in data file

Total I/Os: 3 (root page + internal page + leaf page)
For 1 billion records, tree height ≈ 5 → only 5 I/Os per lookup!
```

B-tree performance analysis:
For n records with nodes holding m keys each, the tree height is roughly log_m(n), so a lookup reads O(log_m n) pages; inserts and deletes touch the same number of pages plus the occasional node split or merge.
Practical numbers:
| Records | Keys/Node | Height | Lookups to Find Any Key |
|---|---|---|---|
| 1,000 | 100 | 2 | 2 page reads |
| 1,000,000 | 100 | 4 | 4 page reads |
| 1,000,000,000 | 100 | 5 | 5 page reads |
With 100+ keys per node (typical for 4KB pages with small keys), a B-tree covering a billion records is only 5 levels deep. The root and upper levels are usually cached in RAM, so practical lookups often require just 1-2 disk reads.
Binary search trees have O(log₂ n) height; B-trees have O(log_m n) where m might be 100+. This dramatically reduces tree height and thus disk reads. A BST with 1 billion nodes has height ≈30, requiring 30 disk reads per lookup. A B-tree needs only 5. Since disk I/O is ~10,000x slower than RAM access, this difference is existential.
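A quick back-of-the-envelope check of those heights (a standalone sketch; real heights also depend on node fill factors):

```c
/* Compile with -lm */
#include <math.h>
#include <stdio.h>

/* Smallest height h such that m^h >= n, i.e. ceil(log_m(n)). */
static int tree_height(double n, double m) {
    return (int)ceil(log(n) / log(m));
}

int main(void) {
    printf("Binary tree (m=2):   1e9 keys -> %d levels\n", tree_height(1e9, 2));    /* ~30 */
    printf("B-tree      (m=100): 1e9 keys -> %d levels\n", tree_height(1e9, 100));  /* ~5  */
    return 0;
}
```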
The Log-Structured Merge-tree (LSM tree) is an alternative index structure optimized for write-heavy workloads. It's the foundation of modern storage engines like LevelDB, RocksDB, Cassandra, and many others.
The problem with B-trees for writes:
B-tree inserts cause random I/O: each insert must locate the leaf page that owns the key and rewrite that page in place, and successive inserts typically touch pages scattered all over the file.
On HDDs, random writes are catastrophically slow. Even on SSDs, random small writes cause write amplification.
The LSM approach:
Instead of updating in place, LSM trees buffer recent writes in an in-memory table (the memtable), flush it to disk as a sorted, immutable file once it fills, and merge (compact) those files in the background.
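A minimal sketch of that write path, assuming a small sorted array as the memtable and a plain binary dump as the flushed run; real engines use skiplists, proper SSTable formats, and a write-ahead log:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MEMTABLE_MAX 4   /* tiny on purpose; real memtables hold megabytes */

typedef struct { char key[32]; char value[32]; } KV;
typedef struct { KV entries[MEMTABLE_MAX]; int count; int run_no; } Memtable;

/* Flush: the memtable is already sorted, so one sequential write produces
 * a sorted run on disk. Compaction would later merge runs together. */
static void flush_memtable(Memtable *mt) {
    char path[64];
    snprintf(path, sizeof(path), "run-%04d.sst", mt->run_no++);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, mt->entries, mt->count * sizeof(KV));   /* sequential I/O */
    close(fd);
    mt->count = 0;
}

/* Write path: insert into the sorted in-memory table; flush when full. */
void lsm_put(Memtable *mt, const char *key, const char *value) {
    if (mt->count == MEMTABLE_MAX) flush_memtable(mt);
    int pos = mt->count;
    while (pos > 0 && strcmp(mt->entries[pos - 1].key, key) > 0) {
        mt->entries[pos] = mt->entries[pos - 1];      /* keep keys sorted */
        pos--;
    }
    strncpy(mt->entries[pos].key, key, sizeof(mt->entries[pos].key) - 1);
    mt->entries[pos].key[sizeof(mt->entries[pos].key) - 1] = '\0';
    strncpy(mt->entries[pos].value, value, sizeof(mt->entries[pos].value) - 1);
    mt->entries[pos].value[sizeof(mt->entries[pos].value) - 1] = '\0';
    mt->count++;
}
```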
LSM structure:
```
LSM Tree Structure:

┌─────────────────────────────────────────────────┐
│ MEMORY                                          │
│  ┌─────────────────────────────────────────┐    │
│  │ Memtable (red-black tree or skiplist)   │    │
│  │ Holds recent writes, sorted by key      │    │
│  │ Write: O(log n)   Read: O(log n)        │    │
│  └─────────────────────────────────────────┘    │
│               ↓ (when full, flush to L0)        │
├─────────────────────────────────────────────────┤
│ DISK                                            │
│                                                 │
│  Level 0:  [SST1] [SST2] [SST3]                 │
│            (recently flushed, may overlap)      │
│               ↓ (compact to L1)                 │
│                                                 │
│  Level 1:  [ SST-A ] [ SST-B ] [ SST-C ]        │
│            (non-overlapping, 10x size of L0)    │
│               ↓ (compact to L2)                 │
│                                                 │
│  Level 2:  [ SST-X ] [ SST-Y ] [ SST-Z ]        │
│            (10x size of L1)                     │
│               ↓                                 │
│  Level N:  [...largest files...]                │
└─────────────────────────────────────────────────┘

SSTable (Sorted String Table) format:
┌─────────────────────────────────────────────────┐
│ [Data Block 0] [Data Block 1] ... [Data Block N]│
│ [Index Block: key → block offset]               │
│ [Bloom Filter: fast negative lookups]           │
│ [Footer: metadata and offsets]                  │
└─────────────────────────────────────────────────┘

Read path for key K:
1. Check memtable (RAM) — O(log n)
2. Check bloom filters for each L0 SSTable
3. Binary search in L0 SSTables that might contain K
4. Check bloom filter for L1, binary search, etc.
5. Continue down levels until found or confirmed absent

Write path for key K:
1. Write to memtable — O(log n)
2. Done! (Flush and compaction are background)

This is why LSM is fast for writes!
```

LSM vs. B-tree trade-offs:
| Aspect | B-Tree | LSM Tree |
|---|---|---|
| Write pattern | Random I/O | Sequential I/O |
| Write throughput | Lower (seeks) | Higher (batched) |
| Write amplification | 1-2x | 10-30x |
| Read amplification | 1 (single path) | Multiple levels |
| Point reads | Faster | Slower (check levels) |
| Range scans | Excellent | Good (merge iterators) |
| Space amplification | Lower (~1.5x) | Higher (~2x+) |
| Predictable latency | Yes | No (compaction spikes) |
B-trees excel for read-heavy workloads, predictable latency requirements, and frequently updated data. LSM trees excel for write-heavy workloads, append-mostly data (logs, time series), and flash storage, where their large sequential writes are gentler than many small in-place updates. Many modern databases offer both options (e.g., MySQL InnoDB vs. MyRocks).
So far we've focused on the primary index—usually organized around a unique identifier. But real applications need to search by multiple attributes:
```sql
SELECT * FROM users    WHERE email = 'alice@example.com';
SELECT * FROM products WHERE category = 'electronics' AND price < 500;
SELECT * FROM orders   WHERE customer_id = 42 ORDER BY date DESC;
```
Secondary indexes provide this capability: additional indexes on non-primary columns that point back to the primary record.
Structure of secondary indexes:
A secondary index maps:
secondary_key_value → primary_key (or record pointer)
The indirection through the primary key is common because the data file (or clustered index) can then move records around freely; only the primary index has to track physical locations, and secondary indexes never need rewriting when records move.
Types of secondary index entries:
Storing record pointers directly:
email='alice@example.com' → {offset: 47520, page: 12}

Storing primary keys:

email='alice@example.com' → {user_id: 1001}

Handling non-unique keys:
Unlike primary indexes, secondary indexes often have duplicate keys:
category = 'electronics' → [product_101, product_205, product_892, ...]
Storage options: repeat the key once per matching record, store a posting list (the key followed by all matching record IDs, as shown above), or point to a separate overflow area that holds the list.
```c
/**
 * Secondary index example: Index users by email
 * Primary index:   user_id → user record
 * Secondary index: email → user_id
 */

typedef struct {
    int user_id;
    char email[128];
    char name[64];
    // ... other fields
} User;

// Primary index:       user_id → file offset (B-tree)
// secondary_email_idx: email → user_id (B-tree)

// Find user by email (using secondary index)
User* find_user_by_email(DB *db, const char *email) {
    // Step 1: Look up email in secondary index
    //         Returns the user_id for this email
    int user_id = btree_lookup(db->email_index, email);
    if (user_id < 0) {
        return NULL;  // Email not found
    }

    // Step 2: Look up user_id in primary index
    //         Returns the file offset for this user
    off_t offset = btree_lookup(db->primary_index, user_id);

    // Step 3: Read the actual record
    User *user = malloc(sizeof(User));
    pread(db->data_fd, user, sizeof(User), offset);
    return user;
}

// Insert new user (must update both indexes)
int insert_user(DB *db, User *user) {
    // 1. Write record to data file
    off_t offset = append_record(db->data_fd, user, sizeof(User));

    // 2. Add to primary index: user_id → offset
    btree_insert(db->primary_index, user->user_id, offset);

    // 3. Add to secondary index: email → user_id
    btree_insert(db->email_index, user->email, user->user_id);

    return 0;
}

// Find all users in a city (non-unique secondary index)
// city_index stores: city → list of user_ids
UserList* find_users_by_city(DB *db, const char *city) {
    // Get posting list: all user_ids in this city
    int *user_ids;
    int count = btree_get_posting_list(db->city_index, city, &user_ids);

    UserList *result = create_user_list(count);

    // Fetch each user via primary index
    for (int i = 0; i < count; i++) {
        off_t offset = btree_lookup(db->primary_index, user_ids[i]);
        User user;
        pread(db->data_fd, &user, sizeof(User), offset);
        add_to_list(result, &user);
    }

    free(user_ids);
    return result;
}
```

Every secondary index multiplies write cost. Inserting one user record requires: 1 data write + 1 primary index update + N secondary index updates. With 5 secondary indexes, a single insert becomes 7 writes. This is why database tuning often involves carefully selecting which columns to index.
Let's implement a complete, practical index system for a file of records. We'll use a simplified B-tree-like structure that demonstrates core principles without full B-tree complexity.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

/**
 * Practical file index implementation
 * Uses a sorted index file with binary search
 * Suitable for read-heavy workloads with batch updates
 */

#define INDEX_ENTRY_SIZE 40
#define KEY_SIZE 32

typedef struct {
    char key[KEY_SIZE];
    off_t offset;
} IndexEntry;

typedef struct {
    int index_fd;          // Index file descriptor
    int data_fd;           // Data file descriptor
    IndexEntry *cache;     // In-memory index cache
    int entry_count;       // Number of entries
    int cache_dirty;       // Needs write-back
} FileIndex;

// Load index into memory
FileIndex* index_open(const char *index_path, const char *data_path) {
    FileIndex *idx = calloc(1, sizeof(FileIndex));
    idx->index_fd = open(index_path, O_RDWR | O_CREAT, 0644);
    idx->data_fd = open(data_path, O_RDWR | O_CREAT, 0644);

    // Determine index size
    off_t size = lseek(idx->index_fd, 0, SEEK_END);
    idx->entry_count = size / sizeof(IndexEntry);

    // Load entire index into RAM
    if (idx->entry_count > 0) {
        idx->cache = malloc(idx->entry_count * sizeof(IndexEntry));
        pread(idx->index_fd, idx->cache, size, 0);
    }

    return idx;
}

// Binary search for key (O(log n))
off_t index_lookup(FileIndex *idx, const char *key) {
    if (idx->entry_count == 0) return -1;

    int left = 0, right = idx->entry_count - 1;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        int cmp = strcmp(idx->cache[mid].key, key);

        if (cmp == 0) {
            return idx->cache[mid].offset;  // Found
        } else if (cmp < 0) {
            left = mid + 1;
        } else {
            right = mid - 1;
        }
    }
    return -1;  // Not found
}

// Range query: find all keys in [start_key, end_key]
int index_range_query(FileIndex *idx, const char *start_key,
                      const char *end_key, off_t *results, int max_results) {
    // Find starting position with binary search
    int left = 0, right = idx->entry_count - 1;
    int start_pos = idx->entry_count;  // Default: past end

    while (left <= right) {
        int mid = left + (right - left) / 2;
        int cmp = strcmp(idx->cache[mid].key, start_key);
        if (cmp >= 0) {
            start_pos = mid;
            right = mid - 1;
        } else {
            left = mid + 1;
        }
    }

    // Scan forward collecting matching entries
    int count = 0;
    for (int i = start_pos; i < idx->entry_count && count < max_results; i++) {
        if (strcmp(idx->cache[i].key, end_key) > 0) {
            break;  // Past end of range
        }
        results[count++] = idx->cache[i].offset;
    }
    return count;
}

// Insert new entry (maintains sorted order)
int index_insert(FileIndex *idx, const char *key, off_t offset) {
    // Find insertion point
    int pos = 0;
    while (pos < idx->entry_count && strcmp(idx->cache[pos].key, key) < 0) {
        pos++;
    }

    // Expand cache
    idx->cache = realloc(idx->cache,
                         (idx->entry_count + 1) * sizeof(IndexEntry));

    // Shift entries to make room
    memmove(&idx->cache[pos + 1], &idx->cache[pos],
            (idx->entry_count - pos) * sizeof(IndexEntry));

    // Insert new entry (zero first so the key is always null-terminated)
    memset(&idx->cache[pos], 0, sizeof(IndexEntry));
    strncpy(idx->cache[pos].key, key, KEY_SIZE - 1);
    idx->cache[pos].offset = offset;
    idx->entry_count++;
    idx->cache_dirty = 1;

    return 0;
}

// Flush index to disk
void index_sync(FileIndex *idx) {
    if (!idx->cache_dirty) return;

    // Rewrite entire index file (simple but not optimal)
    ftruncate(idx->index_fd, 0);
    pwrite(idx->index_fd, idx->cache,
           idx->entry_count * sizeof(IndexEntry), 0);
    fsync(idx->index_fd);
    idx->cache_dirty = 0;
}

// Close and cleanup
void index_close(FileIndex *idx) {
    index_sync(idx);
    close(idx->index_fd);
    close(idx->data_fd);
    free(idx->cache);
    free(idx);
}
```

This simplified implementation loads the entire index into RAM and rewrites the whole index file on every sync.
Production systems use: (1) Page-based B-trees written incrementally, (2) Write-ahead logging for crash recovery, (3) Memory-mapped files for efficient large indexes, (4) Bloom filters to avoid unnecessary disk reads for absent keys.
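Of those production techniques, the bloom filter is the easiest to sketch: a compact bit array that can answer "definitely absent" without touching the index or data file. A minimal version follows; the sizes and hash choice are illustrative assumptions.

```c
#include <stdint.h>

#define BLOOM_BITS   8192          /* 1 KiB of bits */
#define BLOOM_HASHES 4

/* Caller must zero-initialize the filter before use. */
typedef struct { uint8_t bits[BLOOM_BITS / 8]; } Bloom;

/* 64-bit FNV-1a, salted so we get several roughly independent hashes. */
static uint64_t bloom_hash(const char *key, uint64_t seed) {
    uint64_t h = 0xcbf29ce484222325ULL ^ seed;
    while (*key) {
        h ^= (uint8_t)*key++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

void bloom_add(Bloom *bf, const char *key) {
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t bit = bloom_hash(key, (uint64_t)i) % BLOOM_BITS;
        bf->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 only if the key was definitely never added; 1 means "maybe",
 * and the caller must still consult the real index (false positives). */
int bloom_maybe_contains(const Bloom *bf, const char *key) {
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t bit = bloom_hash(key, (uint64_t)i) % BLOOM_BITS;
        if (!(bf->bits[bit / 8] & (uint8_t)(1u << (bit % 8)))) return 0;
    }
    return 1;
}
```

Checked before every on-disk probe, a negative answer skips the I/O entirely, which is how SSTables avoid reading disk for keys that are absent.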
Designing indexes involves balancing competing concerns. Understanding these trade-offs is essential for building efficient storage systems.
1. Space Overhead vs. Read Performance
Indexes consume disk space and memory: each index adds roughly one entry per record (or per block, for sparse indexes), and hot index pages compete with data pages for cache. As a rough example, a 40-byte entry per record over 100 million records is about 4 GB of index before any node or page overhead.
2. Write Performance vs. Read Performance

Every additional index speeds up the queries it serves but adds work to every insert, update, and delete that touches the indexed columns.

3. Consistency vs. Performance

Updating indexes synchronously with the data keeps them always accurate but lengthens every write path; deferring or batching index maintenance is faster but risks briefly stale indexes and more complex crash recovery.

4. Memory vs. Disk

Keeping an index (or at least its upper levels) in RAM makes lookups nearly free, while purely on-disk indexes conserve memory at the cost of extra I/O per lookup.
| Scenario | Recommended Structure | Reason |
|---|---|---|
| Point lookups only | Hash index | O(1) lookup, simple |
| Range queries needed | B-tree / B+tree | Sorted, efficient range scans |
| Write-heavy workload | LSM tree | Batched sequential writes |
| Read-heavy, rarely updated | Sorted array + binary search | Simple, minimal overhead |
| Full-text search | Inverted index | Maps terms to documents |
| Geospatial data | R-tree / Quadtree | 2D/3D range queries |
| Time-series data | Time-partitioned + B-tree | Partition pruning + indexing |
You can always make reads faster by adding more indexes—until you can't. At some point, the write overhead of maintaining indexes exceeds the read benefit. Finding the right balance requires understanding your actual query patterns and measuring performance empirically.
We've explored indexed file access—the crucial technique that transforms key-based lookups from linear scans into efficient targeted retrieval. Let's consolidate the key concepts:

- An index is an auxiliary map from search keys to record locations, trading space and write overhead for much faster reads.
- Indexes are classified by uniqueness (primary vs. secondary), organization (clustered vs. non-clustered), density (dense vs. sparse), and structure (hash, tree, bitmap).
- Hash indexes give O(1) equality lookups but cannot serve range queries; B-trees and B+trees give logarithmic lookups plus ordered range scans; LSM trees turn random writes into sequential I/O for write-heavy workloads.
- Secondary indexes enable lookups by non-primary attributes, but every extra index adds cost to every write.
- Index selection is always a trade-off that should be driven by the actual query and write patterns, measured empirically.
What's next:
We've covered three distinct access methods: sequential, direct, and indexed. But how do these compare in practice? When should you choose one over another? The next page provides a systematic Access Method Comparison, examining performance characteristics, use cases, and decision criteria for selecting the right approach.
You now possess deep knowledge of indexed file access—from fundamental concepts to B-trees, LSM trees, and secondary indexes. This understanding is essential for database internals, file system design, and building any application that must efficiently locate data by key.