If table scans are the brute-force approach to data retrieval, index scans are the surgical precision alternative. An index scan uses a pre-built index structure (typically a B+ tree) to locate exactly which rows satisfy a query predicate, then retrieves only those rows from the base table.
Index scans enable databases to answer queries like 'find user with ID 12345 among 100 million users' by examining only 3-4 index pages and 1 data page, rather than scanning all 100 million rows. This is the fundamental technique that makes transactional databases practical at scale.
By the end of this page, you will understand the complete anatomy of index scans, including how the index is traversed, how matching data rows are retrieved, when index scans outperform table scans (and vice versa), and the critical role of selectivity in access method decisions.
An index scan is a two-phase operation: first, traverse the index to find pointers to matching rows; second, follow those pointers to retrieve the actual row data from the base table. Let's examine each phase in detail.
Phase 1: Index Traversal
For a B+ tree index (the most common type), traversal proceeds as follows:
Root Page Access: Start at the root page of the B+ tree. This page contains separator keys that direct the search to the appropriate child subtree.
Internal Node Navigation: At each internal node, perform binary search on the separator keys to find the correct child pointer. Follow the pointer to the next level.
Leaf Page Arrival: Upon reaching a leaf page, the actual index entries are found. Each leaf entry contains the indexed column value(s) plus a row identifier (RID) that points to the corresponding row in the base table.
Range Scan (if applicable): For range queries or multi-row results, the scan continues along the leaf level using sibling pointers (B+ tree leaves are linked as a doubly-linked list).
```
function IndexScan(index, search_key, predicate):
    results = []

    // Phase 1: Traverse index to find starting leaf
    current_page = index.root_page
    while current_page.is_internal():
        // Binary search within the page for correct child
        child_idx = binary_search(current_page.keys, search_key)
        current_page = buffer_pool.fetch(current_page.children[child_idx])

    // current_page is now a leaf page
    // Phase 1b: Find starting position within leaf
    entry_idx = binary_search(current_page.entries, search_key)

    // Phase 2: Collect matching RIDs and fetch rows
    while current_page != NULL:
        for each entry in current_page.entries[entry_idx:]:
            if not index_key_matches_predicate(entry.key, predicate):
                // Past the range, we're done
                return results

            // Fetch the actual row from the base table
            row_page = buffer_pool.fetch(entry.rid.page_id)
            row = row_page.get_row(entry.rid.slot)

            // Final predicate check (non-indexed columns)
            if full_predicate_matches(row, predicate):
                results.append(row)

        // Move to next leaf page (range scan)
        current_page = buffer_pool.fetch(current_page.next_leaf)
        entry_idx = 0

    return results
```

Phase 2: Row Retrieval (Heap Fetches)
The index provides RIDs, but we still need the actual row data. This involves:
Page Fetch: The RID contains a page ID and slot number. The database fetches the page from disk (or buffer pool) containing the row.
Slot Access: Within the page, a slot directory maps slot numbers to actual byte offsets where rows are stored.
Row Extraction: The complete row is read, including all columns (not just indexed columns).
This phase is where index scans can become expensive. If matching rows are scattered across many different pages, each row retrieval becomes a separate random I/O operation.
Index scans trade sequential I/O (reading consecutive pages) for random I/O (jumping to different pages for each row). While the index itself can be traversed with 3-4 I/Os regardless of table size, fetching matching rows may require one random I/O per row. This is why index scans are not always faster than table scans.
Accurate cost estimation for index scans requires modeling both index traversal and row retrieval phases.
Index Traversal Cost
For a B+ tree with N leaf pages and branching factor F, the height is:
Height = ⌈log_F(N)⌉
With F typically 100-500 (depending on key size and page size), even indexes with billions of entries have heights of 3-5. The traversal cost is:
Traversal Cost = Height × t_random_I/O
This is nearly constant regardless of table size—the logarithmic magic of B+ trees.
| Leaf Pages | Approximate Rows | Tree Height | Traversal I/Os |
|---|---|---|---|
| 200 | 20,000 | 1 | 2 (root + 1 leaf) |
| 40,000 | 4,000,000 | 2 | 3 (root + 1 internal + 1 leaf) |
| 8,000,000 | 800,000,000 | 3 | 4 (root + 2 internal + 1 leaf) |
| 1,600,000,000 | 160 billion | 4 | 5 (root + 3 internal + 1 leaf) |
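The heights in the table above can be reproduced in a few lines of Python. This is an illustrative sketch assuming a uniform fanout of F = 200 (which matches the table's figures); real B+ tree pages have variable fan-out depending on key sizes.

```python
def btree_height(leaf_pages: int, fanout: int = 200) -> int:
    """Number of internal levels above the leaves: ceil(log_F(N)).

    Computed by repeated multiplication rather than math.log to avoid
    floating-point rounding at exact powers of the fanout.
    """
    height, reach = 0, 1  # reach = leaf pages addressable at this height
    while reach < leaf_pages:
        reach *= fanout
        height += 1
    return height

for leaves in (200, 40_000, 8_000_000, 1_600_000_000):
    # Traversal reads one page per level, plus the leaf itself
    print(leaves, btree_height(leaves), btree_height(leaves) + 1)
```

Note how adding another level multiplies capacity by the fanout: that is why an extra factor of 200 in table size costs only one extra I/O.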
Row Retrieval Cost
After finding matching index entries, each row must be fetched. If the query matches M rows and they are scattered across P distinct pages:
Retrieval Cost = P × t_random_I/O
In the worst case (rows uniformly distributed across all pages), P approaches M (one page per row). In the best case (rows clustered together), P could be much smaller than M.
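The gap between the two cases is easy to see with a quick simulation (a sketch, not tied to any particular engine): the same 1,000 matching rows touch far fewer pages when they sit next to each other.

```python
import random

ROWS_PER_PAGE = 100  # assumed layout: 1,000,000 rows -> 10,000 pages

def pages_touched(matching_rows: list[int]) -> int:
    """Count distinct pages holding the matching rows (P in the text)."""
    return len({row // ROWS_PER_PAGE for row in matching_rows})

n_rows, m = 1_000_000, 1_000
clustered = list(range(500_000, 500_000 + m))   # contiguous row numbers
random.seed(42)
scattered = random.sample(range(n_rows), m)     # uniformly spread

print(pages_touched(clustered))   # 10 pages: P << M
print(pages_touched(scattered))   # nearly 1,000 pages: P ~ M
```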
The Clustering Factor
The clustering factor (or correlation) measures how well the physical order of rows matches the index order. It ranges from the number of table pages (consecutive index entries point to the same or adjacent pages—perfect clustering) up to the number of rows (every consecutive entry points to a different page—no clustering).
The clustering factor is critical for index scan cost estimation:
```
= Cost Components =

Index Traversal Cost:
  C_traverse = Height × t_random_IO
             ≈ log_F(N) × t_random_IO
             ≈ 3-5 random I/Os for any practical table size

Leaf Scan Cost (for range queries):
  C_leaf_scan = Matching_Leaf_Pages × t_sequential_IO

Row Retrieval Cost:
  Best case (perfect clustering):
    C_retrieval = Matching_Pages × t_random_IO
  Worst case (no clustering):
    C_retrieval = Matching_Rows × t_random_IO
  General formula (using clustering factor CF):
    C_retrieval ≈ Selectivity × CF × t_random_IO
    where CF is between NumLeafPages and NumRows

= Total Index Scan Cost =

C_total = C_traverse + C_leaf_scan + C_retrieval

Example: Finding 1,000 rows in a 1,000,000-row table
- Height = 3, t_random = 5ms, t_seq = 0.1ms
- Selectivity = 0.1%
- CF = 100,000 (moderately clustered)

C_total = 3 × 5ms + 10 × 0.1ms + 0.001 × 100,000 × 5ms
        = 15ms + 1ms + 500ms
        = 516ms

Same query via table scan:
C_table_scan = 10,000 pages × 0.1ms = 1,000ms

Index scan wins! (516ms vs 1,000ms)
```

A well-clustered index (CF close to number of pages) makes index scans extremely efficient because multiple matching rows share the same pages. A poorly-clustered index (CF close to number of rows) can make index scans more expensive than table scans even at low selectivity.
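The worked example can be checked numerically. Here is a sketch of the cost formula using the same assumed parameters (5 ms random reads, 0.1 ms sequential reads):

```python
def index_scan_cost_ms(height, leaf_pages_read, selectivity, clustering_factor,
                       t_rand_ms=5.0, t_seq_ms=0.1):
    """C_total = C_traverse + C_leaf_scan + C_retrieval (all in ms)."""
    c_traverse = height * t_rand_ms
    c_leaf_scan = leaf_pages_read * t_seq_ms
    c_retrieval = selectivity * clustering_factor * t_rand_ms
    return c_traverse + c_leaf_scan + c_retrieval

def table_scan_cost_ms(table_pages, t_seq_ms=0.1):
    """Full scan: every page read sequentially."""
    return table_pages * t_seq_ms

# Parameters from the example: 1,000 of 1,000,000 rows, CF = 100,000
index_cost = index_scan_cost_ms(height=3, leaf_pages_read=10,
                                selectivity=0.001, clustering_factor=100_000)
table_cost = table_scan_cost_ms(10_000)
print(index_cost, table_cost)
```

Varying `clustering_factor` between 10,000 (the page count) and 1,000,000 (the row count) shows how the same query swings from trivially cheap to slower than a full scan.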
Selectivity is the fraction of rows that match a query predicate. It's the single most important factor in choosing between index scan and table scan.
Let:
- N = number of rows in the table
- B = number of pages in the table
- s = selectivity (fraction of rows matching the predicate)
- Height = height of the index (3-5 in practice)
- t_seq = time for a sequential page read
- t_rand = time for a random page read
Table Scan Cost:
C_table = B × t_seq
Index Scan Cost (simplified, worst-case clustering):
C_index = Height × t_rand + s × N × t_rand
Crossover Point:
Index scan is cheaper when:
Height × t_rand + s × N × t_rand < B × t_seq
Solving for s:
s < (B × t_seq - Height × t_rand) / (N × t_rand)
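Plugging the inequality into code (a sketch using the worst-case clustering model, with Height × t_rand as the traversal term and the same assumed parameters as the earlier example: B = 10,000 pages, N = 1,000,000 rows, Height = 3):

```python
def crossover_selectivity(table_pages, n_rows, height, t_rand_ms, t_seq_ms):
    """Selectivity below which a worst-case index scan beats a table scan:
    s < (B * t_seq - Height * t_rand) / (N * t_rand)."""
    return (table_pages * t_seq_ms - height * t_rand_ms) / (n_rows * t_rand_ms)

# HDD-style costs: 5 ms random reads, 0.1 ms sequential reads
s = crossover_selectivity(10_000, 1_000_000, 3, t_rand_ms=5.0, t_seq_ms=0.1)
print(f"{s:.4%}")  # roughly 0.02%: with no clustering at all, the index
                   # pays off only for very selective predicates
```

The table below shows far higher crossover points in practice because clustering lets many matching rows share a page, cutting the s × N random reads substantially.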
| Scenario | t_rand/t_seq Ratio | Typical Crossover Selectivity |
|---|---|---|
| HDD, poor clustering | 50× (5ms vs 0.1ms) | ~2-5% |
| HDD, good clustering | 50× but pages shared | ~15-25% |
| SSD, poor clustering | 5× (0.2ms vs 0.04ms) | ~10-20% |
| SSD, good clustering | 5× but pages shared | ~30-40% |
| Fully cached (memory) | 1× (no I/O difference) | ~40-60% |
Practical Guidelines:
- At very low selectivity (below roughly 1%), prefer an index scan on any storage medium.
- At high selectivity (above roughly 25-50%, depending on clustering and medium), prefer a table scan.
- In between, the decision hinges on clustering, storage medium, and cache state—see the crossover table above.
These are rules of thumb. Real optimizers compute actual costs using statistics.
Query optimizers use collected statistics (histograms, distinct value counts, clustering factors) to estimate selectivity and compute expected costs. When an optimizer chooses a table scan over an available index, it often means the optimizer estimated high selectivity. Trust the optimizer—but verify with EXPLAIN when results seem wrong.
Database systems implement several variants of index scans to handle different query patterns.
1. Unique Index Scan
When scanning a unique index for an equality predicate, at most one row can match. The scan terminates immediately after finding (or not finding) the key:
SELECT * FROM users WHERE user_id = 12345; -- Unique index on user_id
Cost: Height + 1 (one row fetch), regardless of table size.
2. Range Scan
For range predicates, the index locates the starting point, then scans leaf pages until the range ends:
SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
Cost: Height + Leaf Pages in Range + Row Fetches.
```sql
-- Unique Index Scan (equality on unique column)
EXPLAIN SELECT * FROM users WHERE user_id = 12345;
/*
Index Scan using users_pkey on users (cost=0.42..8.44 rows=1 width=...)
  Index Cond: (user_id = 12345)
*/

-- Range Scan (range predicate)
EXPLAIN SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
/*
Index Scan using idx_orders_date on orders (cost=0.43..1523.45 rows=12000 width=...)
  Index Cond: ((order_date >= '2024-01-01') AND (order_date <= '2024-01-31'))
*/

-- Backward Scan (DESC order matches reverse index traversal)
EXPLAIN SELECT * FROM orders ORDER BY order_date DESC LIMIT 10;
/*
Index Scan Backward using idx_orders_date on orders (cost=0.43..50.23 rows=10 width=...)
*/

-- Index scan with filter (index used for partial predicate)
EXPLAIN SELECT * FROM orders
WHERE order_date = '2024-01-15' AND status = 'shipped';
/*
Index Scan using idx_orders_date on orders (cost=0.43..200.12 rows=50 width=...)
  Index Cond: (order_date = '2024-01-15')
  Filter: (status = 'shipped')  -- Applied after row fetch
  Rows Removed by Filter: 150
*/
```

3. Multi-Column Index Scan
Composite indexes on multiple columns enable efficient scans when the query matches a prefix of the index columns:
-- Index on (customer_id, order_date)
SELECT * FROM orders WHERE customer_id = 123 AND order_date > '2024-01-01'; -- Uses full index
SELECT * FROM orders WHERE customer_id = 123; -- Uses index (prefix)
SELECT * FROM orders WHERE order_date > '2024-01-01'; -- Cannot use index efficiently!
4. Index Skip Scan
Some databases (Oracle, recent PostgreSQL) can perform 'skip scans' when the first column of a composite index is not in the predicate. The scan jumps through distinct values of the first column:
-- Index on (gender, age)
SELECT * FROM people WHERE age = 25;
-- Skip scan: find age=25 for gender='M', then gender='F', etc.
Skip scans are only efficient when the skipped column has low cardinality.
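A toy model makes the mechanics concrete. In this sketch the composite index on (gender, age) is modeled as a sorted list of (gender, age, rid) tuples—nothing engine-specific—and each distinct leading value gets one probe:

```python
from bisect import bisect_left

def skip_scan(index, age):
    """Probe (leading_value, age) once per distinct leading value."""
    rids = []
    # Enumerating leading values is cheap only when cardinality is low
    leading_values = sorted({entry[0] for entry in index})
    for lv in leading_values:
        i = bisect_left(index, (lv, age))  # one "traversal" per value
        while i < len(index) and index[i][:2] == (lv, age):
            rids.append(index[i][2])
            i += 1
    return rids

# Index on (gender, age): sorted leaf entries with row identifiers
idx = sorted([("F", 22, 101), ("F", 25, 102), ("M", 25, 103), ("M", 30, 104)])
print(skip_scan(idx, 25))  # [102, 103]
```

With two distinct genders, the scan costs two index probes; with a million distinct leading values it would degenerate into a million probes, which is why low cardinality is the precondition.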
The efficiency of index scans depends heavily on the relationship between logical order (as defined by the index) and physical order (how rows are actually stored on disk).
Heap Tables vs Clustered Tables
In a heap table, rows are stored in insertion order (or wherever free space exists). The physical order has no relationship to any index's logical order. Index scans on heap tables typically exhibit poor locality—consecutive index entries point to rows scattered across the entire table.
In a clustered table (or an index-organized table in Oracle), rows are physically stored in the order of a specific index (usually the primary key). Index scans using the clustering index have excellent locality—consecutive entries often point to the same or adjacent pages.
Implications for Index Design
Primary key clustering: In MySQL InnoDB, tables are always clustered by primary key. This makes PK-based index scans extremely efficient.
Secondary index overhead: Secondary indexes in clustered tables contain the clustering key (not just a RID), which adds storage but enables efficient row lookup.
Index correlation: Even in heap tables, indexes on columns correlated with insertion order (like auto-increment IDs, timestamps) will have better clustering factors.
CLUSTER command: Some databases (PostgreSQL) allow explicitly reordering heap tables to match an index, improving that index's clustering factor (until new inserts disorder it again).
In PostgreSQL, check pg_stats.correlation for a column to see its clustering factor (-1 to 1, where ±1 means perfect clustering). In Oracle, DBA_INDEXES.CLUSTERING_FACTOR shows the absolute value. In SQL Server, analyze the page-level RIDs in sys.dm_db_index_physical_stats.
Modern databases employ several optimizations to accelerate index scans and mitigate their random I/O characteristics.
1. Index Prefetching
During index scans, databases can analyze upcoming leaf entries and initiate asynchronous reads for the pages they reference. This hides I/O latency by overlapping fetch operations:
Time without prefetch: [Index Entry 1] → [Wait for I/O 1] → [Process Row 1] → [Entry 2] → [Wait for I/O 2]...
Time with prefetch: [Index Entry 1] → [Wait I/O 1] → [Process Row 1] → [Process Row 2 (already prefetched)]...
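The overlap can be sketched with a thread pool standing in for asynchronous reads. Everything here is illustrative: `fetch_page` is a hypothetical page read that simulates 10 ms of I/O latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page_id):
    """Hypothetical page read: simulate 10 ms of I/O latency."""
    time.sleep(0.01)
    return f"page-{page_id}"

def scan_no_prefetch(page_ids):
    return [fetch_page(p) for p in page_ids]   # one full wait per page

def scan_with_prefetch(page_ids, depth=8):
    with ThreadPoolExecutor(max_workers=depth) as pool:
        futures = [pool.submit(fetch_page, p) for p in page_ids]  # reads in flight
        return [f.result() for f in futures]   # waits overlap

pages = list(range(16))
t0 = time.perf_counter(); scan_no_prefetch(pages); t1 = time.perf_counter()
scan_with_prefetch(pages); t2 = time.perf_counter()
print(f"serial {t1 - t0:.3f}s, prefetch {t2 - t1:.3f}s")
```

Real engines bound the prefetch depth (the `depth` parameter here) so a scan does not flood the I/O subsystem.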
2. Batched Key Lookup
Instead of fetching rows one at a time, collect a batch of RIDs, sort them by page ID, then fetch in page order. This converts random I/O into more sequential pattern and allows page reuse:
```
function BatchedIndexScan(index, predicate, batch_size=1000):
    results = []
    rid_batch = []

    // Collect RIDs from index
    for each entry in index.range_scan(predicate):
        rid_batch.append(entry.rid)

        if len(rid_batch) >= batch_size:
            // Sort RIDs by page ID to minimize random I/O
            rid_batch.sort(by=lambda rid: rid.page_id)

            // Fetch rows in page order
            for rid in rid_batch:
                row = fetch_row(rid)
                if full_predicate_matches(row, predicate):
                    results.append(row)
            rid_batch = []

    // Process remaining batch
    // ... same sorting and fetching logic

    return results
```

3. Bitmap Heap Scan (PostgreSQL)
A hybrid approach where the index is scanned to build an in-memory bitmap of qualifying page IDs, then pages are fetched in physical order:
EXPLAIN SELECT * FROM orders WHERE customer_id IN (1, 2, 3, 4, 5);
/*
Bitmap Heap Scan on orders
Recheck Cond: (customer_id = ANY ('{1,2,3,4,5}'))
-> Bitmap Index Scan on idx_orders_customer
Index Cond: (customer_id = ANY ('{1,2,3,4,5}'))
*/
This approach is covered in detail in the Bitmap Scan page.
4. Index-Only Scans
If all columns needed by the query are present in the index, the row fetch phase can be skipped entirely. This is covered in the Index-Only Scan page.
Understanding how index scans appear in query plans across different database systems is essential for performance analysis.
```sql
-- Create sample table and index
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    category_id INTEGER NOT NULL,
    name VARCHAR(100),
    price DECIMAL(10,2)
);
CREATE INDEX idx_products_category ON products(category_id);

-- Analyze query using index
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM products WHERE category_id = 42;

/*
Index Scan using idx_products_category on products
  (cost=0.42..248.56 rows=500 width=48)
  (actual time=0.023..1.234 rows=487 loops=1)
  Index Cond: (category_id = 42)
  Buffers: shared hit=156 read=12
Planning Time: 0.089 ms
Execution Time: 1.456 ms

Key observations:
- "Index Scan using idx_products_category" = index scan access method
- "Index Cond: (category_id = 42)" = condition pushed to index
- "Buffers: shared hit=156 read=12" = 156 pages from cache, 12 from disk
- "rows=500" estimated vs "rows=487" actual = good cardinality estimate
*/

-- Range scan example
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products WHERE category_id BETWEEN 10 AND 20;

/*
Index Scan using idx_products_category on products
  (cost=0.42..3452.78 rows=5500 width=48)
  (actual time=0.034..45.678 rows=5423 loops=1)
  Index Cond: ((category_id >= 10) AND (category_id <= 20))
  Buffers: shared hit=1234 read=567
*/
```
```sql
-- MySQL EXPLAIN output
EXPLAIN SELECT * FROM products WHERE category_id = 42;

/*
+----+-------------+----------+------+-----------------------+-----------------------+---------+-------+------+-------+
| id | select_type | table    | type | possible_keys         | key                   | key_len | ref   | rows | Extra |
+----+-------------+----------+------+-----------------------+-----------------------+---------+-------+------+-------+
|  1 | SIMPLE      | products | ref  | idx_products_category | idx_products_category | 4       | const | 500  | NULL  |
+----+-------------+----------+------+-----------------------+-----------------------+---------+-------+------+-------+

Key columns:
- type="ref": index lookup (eq_ref for unique, range for range scans)
- key="idx_products_category": index being used
- key_len="4": bytes of index being used (useful for composite indexes)
- rows="500": estimated rows to examine
*/

-- MySQL 8.0+ EXPLAIN ANALYZE for actual execution stats
EXPLAIN ANALYZE SELECT * FROM products WHERE category_id = 42;

/*
-> Index lookup on products using idx_products_category (category_id=42)
   (cost=50.25 rows=500) (actual time=0.045..1.234 rows=487 loops=1)
*/
```

In execution plans, 'Index Cond' (PostgreSQL) or the index key usage (MySQL) indicates conditions evaluated during index traversal. 'Filter' indicates conditions evaluated after fetching rows. Filters mean the index wasn't fully utilized, and extra rows were fetched then discarded.
You now have comprehensive knowledge of index scans—the primary alternative to table scans for selective queries. Here are the essential takeaways:
- An index scan is a two-phase operation: index traversal (3-5 I/Os, nearly constant regardless of table size) followed by row retrieval (potentially one random I/O per matching row).
- Selectivity is the dominant factor in the index-vs-table-scan decision; the crossover point depends on storage medium and clustering.
- The clustering factor determines how many distinct pages the row fetches touch, and can make even a low-selectivity index scan slower than a table scan.
- Engines mitigate random I/O with prefetching, batched key lookups, and bitmap heap scans.
What's Next:
The next page covers Index-Only Scans—a powerful optimization that eliminates the row fetch phase entirely by storing all needed columns within the index itself. This approach can dramatically accelerate queries when applicable.
You now understand index scans at a deep level—their mechanics, cost models, selectivity crossover points, and optimization strategies. This knowledge is essential for understanding when indexes help, when they hurt, and how to design effective indexing strategies.