Everything we've discussed about indexes—their definition, search keys, and entries—serves a single purpose: accelerating data lookup. An index that cannot find data faster than a table scan is worthless. Understanding exactly how indexes achieve lookup acceleration is fundamental to leveraging them effectively.
Lookup acceleration isn't magic. It follows from well-understood computer science principles: reducing search space, exploiting ordering, and trading space for time. By the end of this page, you'll understand precisely why an index lookup completes in milliseconds while a table scan takes minutes—not as a mysterious database optimization, but as the predictable outcome of algorithmic design.
By the end of this page, you will understand the complexity theory behind index acceleration, trace the step-by-step process of an index lookup, analyze how different index types achieve their performance characteristics, and calculate expected I/O costs for various lookup operations.
To appreciate what indexes provide, we must first understand what happens without them. Consider a table with N rows stored as a heap file (unsorted, unindexed). To find a specific record:
SELECT * FROM employees WHERE employee_id = 12345;
Without an index, the database must:
1. Read the first page of the heap file into memory
2. Examine every record on the page, checking whether employee_id = 12345
3. Repeat for every page until the match is found (or every page has been read)

This is fundamentally O(N): the time grows linearly with data size.
The Scale Problem:
Linear time sounds acceptable until you consider real-world scale:
| Row Count | Pages | Scan Time | Index Lookup (tree levels) |
|---|---|---|---|
| 1,000 | 10 | 10ms | ~2ms (4 levels) |
| 100,000 | 1,000 | 1 second | ~3ms (5 levels) |
| 10,000,000 | 100,000 | 100 seconds | ~4ms (7 levels) |
| 1,000,000,000 | 10,000,000 | 2.8 hours | ~5ms (10 levels) |
Notice how the gap between scan and index lookup widens as data grows: scan cost rises linearly while lookup cost rises only logarithmically. At 1 billion rows, the table scan takes 2.8 hours while the index lookup takes 5 milliseconds, a factor of roughly 2 million. This isn't a small optimization; it's the difference between a usable system and an unusable one.
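The scaling in the table can be reproduced with a few lines of arithmetic. A minimal sketch, where the page I/O cost, rows per page, and fanout are illustrative assumptions (with fanout 100, the tree is even shallower than the conservative level counts shown above):

```python
import math

PAGE_READ_MS = 1.0     # assumed cost of one sequential page read
ROWS_PER_PAGE = 100    # assumed rows per page
FANOUT = 100           # assumed B+-tree branching factor

for rows in (1_000, 100_000, 10_000_000, 1_000_000_000):
    # Table scan: read every page, cost grows linearly with row count.
    scan_s = (rows // ROWS_PER_PAGE) * PAGE_READ_MS / 1000
    # Index lookup: one page per tree level, cost grows logarithmically.
    levels = max(1, math.ceil(math.log(rows) / math.log(FANOUT)))
    print(f"{rows:>13,} rows: scan ~{scan_s:,.0f}s, index descends {levels} level(s)")
```

At a billion rows the scan cost has grown a millionfold while the tree has only gained a few levels.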
The Impossibility of Linear Scale:
For interactive applications, queries must complete in tens or hundreds of milliseconds. At 10 million rows, a 100-second table scan is unacceptable. At 1 billion rows, a 2.8-hour table scan is absurd.
Moreover, a linear scan gains nothing from selectivity: whether you're looking for 1 row or 1 million rows, every page must still be examined. Parallel page reads can hide some of this cost, but the total work remains O(N).
Indexes break this linear dependency completely.
The most common index structure—the B+-tree—achieves O(log N) lookup time by organizing entries in a balanced tree. This is the fundamental acceleration mechanism that makes modern databases possible.
The Key Insight: Divide and Conquer
Instead of examining all N entries, a tree structure allows us to:
1. Compare the search key against the separator keys in the current node
2. Follow the single child pointer whose key range contains the search key
3. Repeat until a leaf page is reached (or the key is proven absent)
With each step, we eliminate a fraction of the search space. The number of steps needed is the tree's height, which is log(N) for a balanced tree.
B+-Tree Height Calculation:
For a B+-tree with:
- N entries to index
- Fanout F (child pointers per internal node)
The height H is approximately:
H = ⌈log_F(N)⌉
Example: with N = 100,000,000 entries and a fanout of F = 100, H = ⌈log₁₀₀(100,000,000)⌉ = 4.
This means 4 page reads to locate any record among 100 million. And because root and high-level internal nodes are frequently accessed, they're typically cached in memory, reducing I/O further.
The Fanout Advantage:
B+-trees achieve high fanout (branching factor) because:
- Each node occupies a full disk page (typically 8-16 KB)
- Each entry is small: a key plus a child pointer, often just tens of bytes
- A single page can therefore hold hundreds of separator keys and pointers
A binary tree with 100 million nodes is 27 levels deep. A B+-tree with 100 million entries is 4 levels deep. This difference—enabled by fanout—is why B+-trees dominate in databases.
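The binary-versus-B+-tree comparison follows directly from the height formula. A small sketch using integer arithmetic (avoiding floating-point rounding at exact powers):

```python
def bplus_height(n_entries: int, fanout: int) -> int:
    """Smallest h such that fanout**h >= n_entries: the tree's level count."""
    height, reach = 0, 1
    while reach < n_entries:
        reach *= fanout
        height += 1
    return height

N = 100_000_000
print(bplus_height(N, 2))    # binary tree: 27 levels
print(bplus_height(N, 100))  # fanout 100: 4 levels
```

Raising the fanout from 2 to 100 divides the height by log₂(100) ≈ 6.6, which is exactly the 27-versus-4 gap in the text.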
Let's trace exactly what happens during an index lookup. Understanding this process helps diagnose performance issues and reason about expected costs.
Query:
SELECT name, salary FROM employees WHERE employee_id = 5042;
Assumptions:
- The table holds 10,000,000 rows across roughly 100,000 data pages
- A B+-tree on employee_id (primary clustered index), 4 levels deep
- The root and level-1 internal pages are already cached in memory

Steps:
1. Parse and plan: the optimizer sees the employee_id = 5042 equality predicate and selects an index seek on the primary index.
2. Traverse the tree: read the root page, then one internal page per level, following separator keys toward 5042.
3. Read the leaf page: locate the entry for 5042 (in a clustered index, the leaf holds the full row).
4. Project: extract the name and salary columns and return them to the client.

Cost Analysis:
| Step | Memory/Disk | Time |
|---|---|---|
| Root lookup | Memory (cached) | ~0.001 ms |
| Level-1 lookup | Memory (cached) | ~0.001 ms |
| Level-2 lookup | Disk I/O | ~1-10 ms |
| Leaf lookup | Disk I/O | ~1-10 ms |
| Total | | ~2-20 ms |
Compare this to a table scan that would read 100,000 pages at 1ms each: 100 seconds. The index reduced latency by a factor of 5,000-50,000.
Key Observations:
- Only the lower tree levels typically require disk I/O; the root and upper internal pages stay cached across queries
- Total latency is dominated by one or two random reads, regardless of table size
- The cost is predictable: roughly (uncached levels) × (random I/O time)
Indexes accelerate not just point lookups but also range queries. The mechanism differs slightly but remains fundamentally efficient.
Query:
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
Range Scan Algorithm:
Find Start Point: Use tree traversal to locate the first entry where order_date >= '2024-01-01'
Sequential Scan: From that point, scan forward through leaf pages (they're linked!) until order_date > '2024-01-31'
Terminate: Stop when the range is exhausted
The beauty of B+-trees: leaf pages are linked in a doubly-linked list. Once we find the range start, we scan sequentially—no more tree traversals needed.
Cost Analysis for Range Queries:
If the range contains K matching records spread across P pages:
| Component | Cost |
|---|---|
| Locate start | O(log N) page reads |
| Scan range | O(P) page reads |
| Total | O(log N + P) |
Why This Beats Table Scan:
A table scan reads ALL pages regardless of how few records match. A range scan reads:
- O(log N) pages to locate the start of the range
- Only the leaf/data pages that actually contain matching records
If your range is 1% of the table, you read ~1% of the pages (plus the initial lookup cost). This scales beautifully.
Example: in a 10-million-row table spanning 100,000 pages, a range matching 1% of the rows touches roughly 1,000 data pages plus a 4-level descent: about 1 second at 1 ms per page, versus 100 seconds for a full scan.
Range scan efficiency depends heavily on how well the data is clustered by the index key. If records are stored in index order (clustered index), consecutive index entries point to consecutive data pages—minimal I/O. If scattered (non-clustered index), each entry may require a separate random I/O—potentially worse than a table scan for large ranges.
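The clustering effect can be sketched with a back-of-the-envelope I/O model. This is an illustrative approximation, not a real optimizer cost function; the tree height and one-fetch-per-row worst case are assumptions:

```python
import math

def clustered_range_pages(total_pages: int, selectivity: float, height: int = 4) -> int:
    # Matching rows sit on consecutive pages: tree descent + contiguous pages.
    return height + math.ceil(total_pages * selectivity)

def nonclustered_range_pages(matching_rows: int, height: int = 4) -> int:
    # Worst case: every matching row needs its own random data-page read.
    return height + matching_rows

# 10M rows on 100,000 pages; a 1% range matches 100,000 rows
print(clustered_range_pages(100_000, 0.01))   # 1,004 page reads
print(nonclustered_range_pages(100_000))      # 100,004: as bad as a full scan
```

The same 1% range costs about 1,000 reads when clustered but about 100,000 when scattered, which is why a non-clustered range scan can lose to a sequential table scan.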
While B+-trees achieve O(log N) lookup, hash indexes can achieve O(1) average-case lookup—constant time regardless of data size. This seems miraculous but comes with significant trade-offs.
Hash Index Mechanism:
1. Compute bucket = hash(key) mod bucket_count
2. Read the bucket page(s) at that location
3. Search within the bucket for the matching entry

Why O(1)?
The hash function directly computes where to look—no traversal needed. Regardless of whether the table has 1,000 or 1 billion rows, the same number of steps (1-2 page reads) finds the target bucket.
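The mechanism can be illustrated with a toy in-memory hash "index". This is a sketch of the idea, not a real disk-based hash index; the bucket count and CRC32 hash are arbitrary choices:

```python
import zlib

BUCKET_COUNT = 1024
buckets = {i: [] for i in range(BUCKET_COUNT)}  # bucket number -> list of (key, rid)

def bucket_for(key: str) -> int:
    # One hash computation locates the bucket directly: no tree descent,
    # so the cost is the same for 1,000 rows or 1 billion.
    return zlib.crc32(key.encode()) % BUCKET_COUNT

def insert(key: str, rid: int) -> None:
    buckets[bucket_for(key)].append((key, rid))

def lookup(key: str) -> list:
    # Only the one bucket is searched, never the whole table.
    return [rid for k, rid in buckets[bucket_for(key)] if k == key]

insert("user@example.com", 42)
print(lookup("user@example.com"))  # [42]
```

Note that the bucket order has no relationship to key order, which is exactly why hash indexes cannot serve range predicates.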
When to Use Hash Indexes:
Hash indexes are ideal for:
- Equality predicates (WHERE key = value)
- Exact-match lookups on high-cardinality columns (IDs, emails, UUIDs)

They are poor choices for:
- Range predicates (<, >, BETWEEN)
- Prefix matching (LIKE 'abc%')
- Sorting or ORDER BY, since hash order is meaningless
PostgreSQL Example:
-- Create a hash index (PostgreSQL syntax)
CREATE INDEX idx_user_email_hash ON users USING hash (email);
-- This query uses the hash index
SELECT * FROM users WHERE email = 'user@example.com';
-- This query CANNOT use the hash index (range predicate)
SELECT * FROM users WHERE email > 'a' AND email < 'b';
The ultimate acceleration is avoiding table access entirely. If the index contains all columns the query needs, the database can answer the query from the index alone—an index-only scan (also called a covering index scenario).
Traditional Index Access:
1. Search index → Find RID(s)
2. Use RID to fetch record from table → One I/O per record potentially
3. Extract requested columns
Index-Only Scan:
1. Search index → Find matching entries
2. Extract columns directly from index entries
3. Done (no table access)
The savings can be enormous, especially when:
Creating Covering Indexes:
-- Query we want to optimize
SELECT order_id, total_amount
FROM orders
WHERE customer_id = 12345;
-- Index that enables index-only scan
CREATE INDEX idx_orders_covering
ON orders (customer_id)
INCLUDE (order_id, total_amount);
The INCLUDE clause (or equivalent) adds columns to the leaf entries without making them part of the search key. This means:
- Tree traversal searches on customer_id only, keeping the index narrow where it matters
- Leaf entries additionally carry order_id and total_amount, so the query never touches the table

Space-Performance Trade-off:
Covering indexes increase index size. Each leaf entry now includes additional columns. For a heavily-used query pattern, this trade-off is usually worthwhile. For rarely-used queries, the space cost may not justify the benefit.
When reviewing execution plans, look for indications like 'Index Only Scan' (PostgreSQL), 'Using index' (MySQL), or cost estimates that don't include table access. If your query qualifies for an index-only scan but isn't getting one, check: Does the index include all SELECT columns? Are the relevant heap pages marked all-visible in the visibility map (PostgreSQL)? Does the predicate use only indexed columns?
| Scenario | Matching Records | Index Scan I/O | Index-Only Scan I/O | Savings |
|---|---|---|---|---|
| Single record | 1 | 3-4 pages | 2-3 pages | 25-33% |
| Small range | 100 | 3 + ~100 pages | 3-10 pages | 90%+ |
| Large range | 10,000 | 3 + ~5,000 pages | 3 + ~50 pages | 99%+ |
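The savings in the table can be approximated with a simple I/O model. The entries-per-leaf figure and the one-table-fetch-per-match worst case are assumptions for illustration:

```python
import math

def regular_index_io(matches: int, height: int = 3, entries_per_leaf: int = 200) -> int:
    # Tree descent + leaf pages + (worst case) one scattered table fetch per match.
    return height + math.ceil(matches / entries_per_leaf) + matches

def index_only_io(matches: int, height: int = 3, entries_per_leaf: int = 200) -> int:
    # Tree descent + leaf pages only: the index itself answers the query.
    return height + math.ceil(matches / entries_per_leaf)

for matches in (1, 100, 10_000):
    r, io = regular_index_io(matches), index_only_io(matches)
    print(f"{matches:>6} matches: {r:>6} vs {io:>4} pages ({1 - io / r:.0%} saved)")
```

The table fetches dominate as the match count grows, which is why the savings climb from modest for a single record to well over 99% for large ranges.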
Indexes are powerful but not omnipotent. Understanding when indexes cannot help—or actively hurt—is as important as understanding when they help.
Low Selectivity Predicates:
When a predicate matches most rows, an index lookup is slower than a table scan: each matching row can cost a separate random I/O into the table, while a sequential scan reads every page in cheap, prefetched order.
Example:
-- Index probably helps (assuming 'suspended' is rare)
SELECT * FROM accounts WHERE status = 'suspended';
-- Index probably hurts (most accounts are active)
SELECT * FROM accounts WHERE status = 'active';
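Where the crossover between index and scan sits can be estimated with a crude cost model. The 5x random-versus-sequential read penalty is an assumption, and real optimizers use far richer statistics:

```python
def scan_ms(total_pages: int, seq_read_ms: float = 1.0) -> float:
    # Sequential scan: every page, but cheap prefetched reads.
    return total_pages * seq_read_ms

def index_ms(matching_rows: int, height: int = 3, rand_read_ms: float = 5.0) -> float:
    # Non-clustered index: tree descent plus one random table fetch per match.
    return (height + matching_rows) * rand_read_ms

TOTAL_ROWS, TOTAL_PAGES = 10_000_000, 100_000
for sel in (0.001, 0.01, 0.1):
    matches = int(TOTAL_ROWS * sel)
    winner = "index" if index_ms(matches) < scan_ms(TOTAL_PAGES) else "scan"
    print(f"selectivity {sel:>5.1%}: {winner} wins")
```

Under these assumptions the index stops paying off at around 1% selectivity, illustrating why the optimizer may refuse an index for `status = 'active'`.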
Non-Sargable Predicates:
Some predicates cannot use indexes regardless of selectivity. These are called non-sargable (not Search ARGument ABLE):
-- Non-sargable: Function applied to column
WHERE UPPER(email) = 'USER@EXAMPLE.COM'
-- Sargable alternative (if stored in uppercase)
WHERE email = 'USER@EXAMPLE.COM'
-- Non-sargable: Leading wildcard
WHERE name LIKE '%smith'
-- Sargable: Trailing wildcard
WHERE name LIKE 'smith%'
-- Non-sargable: Expression on column
WHERE salary * 1.1 > 100000
-- Sargable alternative
WHERE salary > 100000 / 1.1
Wrong Column Order (Composite Index):
A composite index on (A, B, C) cannot accelerate queries that filter only on B or C: entries are sorted by A first, so rows with a given B value are scattered throughout the index rather than clustered together.
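Why the prefix rule holds can be seen by modeling a composite index as a sorted list of tuples. A toy sketch, not a real B+-tree, with hypothetical columns (dept, team, emp_id):

```python
from bisect import bisect_left

# Composite "index" on (dept, team, emp_id), kept sorted like B+-tree leaves.
idx = sorted([("eng", "web", 3), ("hr", "ops", 9), ("eng", "db", 2), ("eng", "db", 7)])

def seek_dept(dept: str) -> list:
    # Leading column: binary search finds one contiguous run of entries.
    lo = bisect_left(idx, (dept,))
    hi = bisect_left(idx, (dept + "\x00",))  # sentinel just past all (dept, ...) keys
    return idx[lo:hi]

def find_team(team: str) -> list:
    # Non-leading column: sort order gives no contiguous run, so every
    # entry must be checked: equivalent to a full index scan.
    return [row for row in idx if row[1] == team]

print(seek_dept("eng"))  # contiguous slice, found via binary search
print(find_team("db"))   # correct answer, but only after a full pass
```

The leading-column seek narrows to a slice in O(log N) comparisons; the non-leading filter degenerates to O(N), mirroring how a real composite index behaves.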
Lookup acceleration is the raison d'être of indexing. Understanding the mechanisms behind this acceleration enables you to design effective indexes and diagnose performance issues. Let's consolidate the key concepts:
- Table scans are O(N); B+-tree lookups are O(log N), with high fanout keeping trees only a few levels deep
- Range scans cost O(log N + P) by exploiting linked leaf pages, but their efficiency depends on clustering
- Hash indexes offer O(1) equality lookups at the cost of range and ordering support
- Covering indexes eliminate table access entirely for qualifying queries
- Indexes cannot help low-selectivity or non-sargable predicates, and a composite index requires its leading column in the predicate
What's Next:
We now understand what indexes are, how they're organized (search keys and entries), and how they accelerate lookups. The final piece is understanding the trade-offs — the costs and limitations of indexing that must be balanced against benefits. The next page provides a comprehensive analysis of index trade-offs.
You now understand the mechanisms behind index acceleration: tree-based logarithmic lookup, range scan efficiency, hash-based constant time, and index-only optimization. This knowledge empowers you to predict and optimize query performance.