Access Methods - Learning Module

Loading content...

0/252

Index-Only Scan: Eliminating Row Fetches

When the Index Is Enough

The most expensive part of an index scan is typically not the index traversal—it's fetching the actual rows from the base table. Each row fetch can require a random I/O operation, and when retrieving thousands of rows, this overhead dominates query execution time.

But what if you could skip the row fetch entirely? What if the index itself contained all the data needed to answer the query? This is the essence of an index-only scan (also called a covering index scan or index-only access)—one of the most powerful performance optimizations in relational databases.

What You Will Learn

By the end of this page, you will understand how index-only scans work, why they provide such dramatic performance improvements, how to design covering indexes, and the subtle requirements that must be met for an index-only scan to be possible.

The Index-Only Scan Concept

A standard B+ tree index stores index keys (the columns being indexed) and row identifiers (pointers to the actual rows in the base table). When a query needs columns beyond the index keys, the database must:

Traverse the index to find matching entries
Extract the RID from each entry
Fetch the corresponding row from the base table
Extract the required columns from the fetched row

Step 3 is the bottleneck—potentially one random I/O per row.

The Index-Only Alternative:

If the index includes all columns referenced by the query (in SELECT, WHERE, ORDER BY, GROUP BY, etc.), the database can:

Traverse the index to find matching entries
Extract all needed data directly from the index entry
Skip the table access entirely

Index-Only vs Standard Index Scan
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- Table: orders (order_id, customer_id, order_date, total_amount, status)
-- Index: idx_orders_customer_date ON orders(customer_id, order_date)
 
-- Query 1: Requires table access (needs 'status' column not in index)
SELECT customer_id, order_date, status 
FROM orders 
WHERE customer_id = 1234;
-- Execution: Index Scan → For each match, fetch row → Extract status
 
-- Query 2: Index-only possible (all columns in index)
SELECT customer_id, order_date 
FROM orders 
WHERE customer_id = 1234;
-- Execution: Index Scan → Return data directly from index leaf entries
-- NO TABLE ACCESS REQUIRED!
 
-- Query 3: Aggregation with index-only
SELECT customer_id, COUNT(*), MIN(order_date), MAX(order_date)
FROM orders 
GROUP BY customer_id;
-- If idx_orders_customer_date covers this, entire aggregation runs 
-- without touching the base table

10× to 100× Performance Gains

Index-only scans can be 10× to 100× faster than equivalent index scans with row fetches, especially when the table has wide rows (many columns) or poor clustering. You read only the compact index pages instead of full row pages.

Requirements for Index-Only Scans

Several conditions must be met for an index-only scan to be possible:

Requirement 1: All Query Columns Must Be In The Index

Every column referenced anywhere in the query must be present in the index. This includes:

SELECT clause columns
WHERE clause columns (including inequality checks)
ORDER BY columns
GROUP BY columns
HAVING clause columns
Columns used in expressions or function calls

Column Coverage Examples
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- Index: (customer_id, order_date, total_amount)
 
-- ✓ Index-only possible: all columns covered
SELECT customer_id, SUM(total_amount)
FROM orders
WHERE order_date > '2024-01-01'
GROUP BY customer_id;
 
-- ✗ NOT index-only: 'status' not in index
SELECT customer_id, status
FROM orders
WHERE order_date > '2024-01-01';
 
-- ✗ NOT index-only: 'shipping_address' in expression
SELECT customer_id
FROM orders
WHERE LENGTH(shipping_address) > 100;
 
-- ✓ Index-only possible: expression uses only indexed columns
SELECT customer_id
FROM orders
WHERE total_amount * 1.1 > 100;

Requirement 2: Visibility Information (PostgreSQL-specific)

In PostgreSQL's MVCC implementation, each row has visibility information (transaction IDs indicating when the row was created/deleted). This information is stored in the heap (table), not in indexes.

For an index-only scan to work, PostgreSQL must be able to verify that matching rows are visible to the current transaction without reading the heap. This is possible only when:

The visibility map indicates that all rows on a page are visible to all transactions (the page is 'all-visible')
If the visibility map bit is not set, PostgreSQL must fall back to reading the heap for those entries

This is why VACUUM is critical for PostgreSQL index-only scan performance—it sets visibility map bits.

Index-Only Scan Requirements by Database
Database	Column Coverage	Additional Requirements
PostgreSQL	All columns in index	Visibility map must confirm row visibility (run VACUUM regularly)
MySQL InnoDB	All columns in index (including PK for secondary indexes)	Secondary indexes include PK columns automatically; covering index must include all needed columns
Oracle	All columns in index	Null values require index to store NULLs (some index types don't)
SQL Server	All columns in index or INCLUDE clause	Columnstore indexes have different rules

Beware Hidden Columns

SELECT * defeats index-only scans because it references all columns, most of which aren't indexed. Even if you end up using only a few columns in application code, * forces the database to fetch full rows. Always specify explicit column lists for queries that could benefit from index-only scans.

Designing Covering Indexes

A covering index is an index designed specifically to enable index-only scans for particular queries. The term emphasizes that the index 'covers' all columns needed by the query.

Design Strategy: Query-Driven Indexing

To create a covering index:

Identify high-frequency or high-cost queries
List all columns referenced by each query
Order columns optimally (filter columns first, then SELECT-only columns)
Create an index with all required columns

Example: E-commerce Order Analytics

Covering Index Design Example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- Frequent analytics query:
SELECT 
    customer_id,
    DATE_TRUNC('month', order_date) as order_month,
    COUNT(*) as order_count,
    SUM(total_amount) as total_spent
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id, DATE_TRUNC('month', order_date);
 
-- Columns needed:
-- - WHERE: order_date (for filtering)
-- - SELECT/GROUP BY: customer_id, order_date (for grouping)
-- - SELECT: total_amount (for aggregation)
 
-- Covering index design:
CREATE INDEX idx_orders_covering ON orders(
    order_date,      -- Leading column for WHERE filter (range scan)
    customer_id,     -- GROUP BY column (secondary ordering helps grouping)
    total_amount     -- Aggregate source (no ordering needed)
);
 
-- Alternatively, in PostgreSQL 11+ or SQL Server, use INCLUDE:
CREATE INDEX idx_orders_covering_v2 ON orders(order_date, customer_id)
INCLUDE (total_amount);
 
-- INCLUDE columns are stored in leaf pages but not in non-leaf pages,
-- reducing index size and maintenance cost for columns only needed
-- in SELECT (not WHERE or ORDER BY)

The INCLUDE Clause (PostgreSQL 11+, SQL Server)

The INCLUDE clause allows adding columns to an index's leaf entries without including them in the index's sort order. Benefits:

Smaller non-leaf pages: Internal nodes contain only key columns, making the tree shallower
Supporting index-only scans: Included columns are available in leaf entries
No maintenance for ordering: Changes to included columns don't affect index order

Key vs Included Column Tradeoffs

Use as Key Column When

•Column is used in WHERE for equality/range filters
•Column is used in ORDER BY
•Column is used in GROUP BY
•Column ordering improves scan efficiency

Use as INCLUDE Column When

•Column is only in SELECT list
•Column is used for aggregate functions (SUM, AVG, etc.)
•Column doesn't benefit from ordering
•Column is wide (large strings, text)

Index Size Considerations

Covering indexes are larger than minimal indexes because they store additional columns. This increases storage requirements and write overhead (every INSERT/UPDATE must update more index data). Design covering indexes for your most critical queries, not for every possible query.

Index-Only Scan Cost Analysis

The cost advantage of index-only scans comes from eliminating the row fetch phase entirely. Let's quantify this benefit.

Standard Index Scan Cost:

C_standard = Height + LeafPages + RowFetches × t_random

Index-Only Scan Cost:

C_index_only = Height + LeafPages

The savings equal the eliminated row fetch cost:

Savings = RowFetches × t_random

Index-Only vs Standard Index Scan Cost Comparison
Scenario	Matching Rows	Standard Index Scan	Index-Only Scan	Speedup
Point query	1 row	4 pages + 1 fetch ≈ 5 I/Os	4 pages ≈ 4 I/Os	1.25×
Small range	100 rows, 80 pages	4 + 80 random ≈ 84 I/Os	4 pages ≈ 4 I/Os	21×
Medium range	10,000 rows, 5,000 pages	4 + 5,000 random ≈ 5,004 I/Os	50 leaf pages ≈ 50 I/Os	100×
Large range (aggregation)	1M rows, all pages	4 + ~50,000 random	1,000 leaf pages ≈ 1,000 I/Os	50×+

Why Such Dramatic Improvements?

The speedup comes from multiple factors:

Eliminating Random I/O: Row fetches are random; index leaf scans are sequential
Smaller Data Volume: Index entries are typically 10-100× smaller than full rows
Better Cache Efficiency: More index entries fit in buffer pool than row data
No Row Reconstruction: Reading from index requires no additional processing

Example Comparison:

Cost Comparison Example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
-- Table: transactions (1 million rows)
-- Row size: 500 bytes
-- Table pages: 62,500 pages (at 8KB/page, ~80 rows/page)
 
-- Index on (account_id, transaction_date, amount)
-- Index entry size: ~30 bytes
-- Index leaf pages: ~4,500 pages
 
-- Query: Monthly account summary
SELECT account_id, 
       DATE_TRUNC('month', transaction_date) as month,
       SUM(amount) as total
FROM transactions
WHERE transaction_date >= '2024-01-01'
GROUP BY account_id, DATE_TRUNC('month', transaction_date);
 
-- Assume 500,000 matching rows (6 months of data)
 
=== Standard Index Scan (non-covering index) ===
Index traversal:     3 pages
Index leaf scan:     2,250 pages (half the index)
Row fetches:         500,000 rows × potentially different pages
                     ≈ 15,000-40,000 random I/Os (depending on clustering)
 
Total I/O:           ~20,000+ I/Os
At 0.2ms/random I/O: ~4,000ms = 4 seconds
 
=== Index-Only Scan (covering index) ===
Index traversal:     3 pages
Index leaf scan:     2,250 pages (sequential)
Row fetches:         0 (not needed!)
 
Total I/O:           2,253 sequential I/Os
At 0.05ms/seq I/O:   ~113ms = 0.1 seconds
 
Speedup: ~35× faster!

Best Case: Aggregations on Large Result Sets

Index-only scans provide the greatest benefit for aggregation queries over large data sets. These queries touch many rows but only need specific columns. Without index-only access, they would require massive random I/O; with it, they zip through compact index pages.

PostgreSQL Visibility Map Deep Dive

PostgreSQL's MVCC implementation creates a unique challenge for index-only scans. Let's understand this in detail.

The Problem: Tuple Visibility

In PostgreSQL, each row (tuple) has visibility attributes (xmin, xmax) that determine which transactions can see it. This information is stored in the heap (table), not in indexes. When reading an index entry, PostgreSQL doesn't inherently know if the corresponding row is visible to the current transaction.

The Solution: Visibility Map

The visibility map is a companion structure for each table—a bitmap with one bit per table page. If the bit is set, all tuples on that page are visible to all transactions (the page is 'all-visible'). This means:

For pages marked all-visible: Index-only scans can proceed without heap access
For pages not marked: PostgreSQL must fetch the heap tuple to check visibility

PostgreSQL Index-Only Scan with Visibility
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
-- Examine visibility map statistics
SELECT 
    relname,
    n_tup_ins as inserts,
    n_tup_upd as updates,
    n_tup_del as deletes,
    n_live_tup as live_tuples,
    n_dead_tup as dead_tuples
FROM pg_stat_user_tables
WHERE relname = 'orders';
 
-- Check index-only scan usage in execution plan
EXPLAIN (ANALYZE, BUFFERS) 
SELECT customer_id, order_date 
FROM orders 
WHERE customer_id = 1234;
 
/*
Index Only Scan using idx_orders_customer_date on orders
    (cost=0.43..45.12 rows=200 width=12) 
    (actual time=0.023..0.456 rows=189 loops=1)
  Index Cond: (customer_id = 1234)
  Heap Fetches: 23                           -- <-- CRITICAL METRIC!
  Buffers: shared hit=15
*/
 
-- "Heap Fetches: 23" means 23 tuples required visibility check via heap
-- Ideal: Heap Fetches = 0 (100% index-only)
-- If Heap Fetches ≈ actual rows, visibility map is unhelpful
 
-- Force visibility map update
VACUUM orders;
 
-- Re-run query after VACUUM
EXPLAIN (ANALYZE, BUFFERS) 
SELECT customer_id, order_date 
FROM orders 
WHERE customer_id = 1234;
 
/*
Index Only Scan using idx_orders_customer_date on orders
    (cost=0.43..45.12 rows=200 width=12) 
    (actual time=0.019..0.234 rows=189 loops=1)
  Index Cond: (customer_id = 1234)
  Heap Fetches: 0                            -- <-- Perfect!
  Buffers: shared hit=8                       -- Fewer buffers needed!
*/

VACUUM and Index-Only Scans

The visibility map is updated by VACUUM:

VACUUM scans the table for dead tuples
It removes dead tuples and frees space
For pages with no dead tuples and no recent transactions, it sets the visibility map bit

If you skip VACUUM:

Visibility map bits remain unset
'Index-only' scans degrade to index scans with heap fetches
Performance is much worse than expected

This is why autovacuum is critical for PostgreSQL performance.

PostgreSQL Index-Only Scan Optimization

•Monitor Heap Fetches: In EXPLAIN ANALYZE output, 'Heap Fetches' should be near zero for optimal index-only performance
•Run VACUUM regularly: Autovacuum handles this automatically; tune its aggressiveness for high-update tables
•Avoid hot spots: Frequently updated rows prevent visibility map bits from being set
•Consider partitioning: Old partitions become read-only, allowing visibility maps to stabilize
•Check pg_stat_user_tables: High n_dead_tup indicates VACUUM is falling behind

Index-Only Scans Across Database Systems

Each major database implements index-only scans differently. Understanding these differences is essential for cross-platform development.

Index-Only Scan Implementation Comparison
Database	Feature Name	INCLUDE Clause	Special Considerations
PostgreSQL	Index Only Scan	Yes (v11+)	Visibility map required; check Heap Fetches metric
MySQL InnoDB	Using index (covering)	No	Secondary indexes include PK implicitly; covering index must include all columns
Oracle	INDEX FAST FULL SCAN	No (but function-based indexes possible)	Must handle NULLs correctly; IOT (index-organized tables) are inherently covering
SQL Server	Index Scan (non-clustered, covering)	Yes	INCLUDE clause for non-key columns; columnstore has different rules
SQLite	COVERING INDEX	No	Automatically recognized when index covers query

Cross-Database Index-Only Scan Examples
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
=== MySQL ===
-- Check if query uses covering index
EXPLAIN SELECT customer_id, order_date 
FROM orders 
WHERE customer_id = 1234;
 
-- Look for "Using index" in Extra column:
-- +----+...+-------+------+---------------+------------------------+
-- | id |...| type  | key  |   Extra       |
-- +----+...+-------+------+---------------+------------------------+
-- |  1 |...| ref   | idx  | Using index   |   <-- Covering index!
-- +----+...+-------+------+---------------+------------------------+
 
-- MySQL InnoDB: secondary indexes include primary key columns
-- If PK is (id), and index is (customer_id), the index actually stores (customer_id, id)
-- This means PK is always "covered" by secondary indexes
 
=== SQL Server ===
-- Create covering index with INCLUDE
CREATE NONCLUSTERED INDEX idx_orders_covering 
ON orders (customer_id, order_date)
INCLUDE (total_amount, status);
 
-- Check execution plan for "Index Scan" without "Key Lookup"
-- Key Lookup = heap/clustered fetch = NOT index-only
 
=== Oracle ===
-- Look for "INDEX FAST FULL SCAN" or absence of "TABLE ACCESS" steps
EXPLAIN PLAN FOR
SELECT customer_id, order_date FROM orders WHERE customer_id = 1234;
 
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
 
-- If plan shows only index operations, no TABLE ACCESS BY INDEX ROWID, 
-- it's an index-only execution

MySQL InnoDB Automatic Coverage

In InnoDB, secondary indexes automatically include primary key columns because they use the PK to locate rows in the clustered index. This means an index on (a, b) for a table with PK (id) actually stores (a, b, id). Queries needing only a, b, and id can use index-only access without explicitly including id in the index.

Covering Index Design Patterns and Trade-offs

Designing effective covering indexes requires balancing query performance against storage and maintenance costs.

Pattern 1: Query-Specific Covering Index

Create a single index that covers a specific critical query:

Query-Specific Covering Index
1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- Critical dashboard query runs every 5 seconds:
SELECT 
    status, 
    COUNT(*) as count,
    AVG(processing_time_ms) as avg_time
FROM jobs
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY status;
 
-- Covering index for this exact query:
CREATE INDEX idx_jobs_dashboard ON jobs(created_at, status)
INCLUDE (processing_time_ms);
 
-- Result: Query runs in <10ms instead of >500ms

Pattern 2: Multi-Query Composite Covering Index

Design an index that covers multiple related queries:

Multi-Query Composite Index
1
2
3
4
5
6
7
8
9
10
11
12
13
-- Multiple queries against orders table:
-- Q1: Orders by customer in date range
-- Q2: Order count by customer
-- Q3: Total spent by customer by month
 
-- Unified covering index:
CREATE INDEX idx_orders_multi ON orders(
    customer_id,      -- First: equality conditions in all queries
    order_date,       -- Second: range conditions, ORDER BY
    total_amount      -- Covered: for SUM/AVG aggregations
);
 
-- This single index enables index-only scans for all three queries

Trade-offs to Consider:

Benefits

•Dramatic read performance improvement
•Reduced buffer pool pressure
•Lower I/O latency for covered queries
•Enables efficient aggregations

Costs

•Increased storage requirements
•Higher write overhead (more index columns to update)
•More complex index maintenance
•Potential index bloat with wide columns

Don't Over-Index

Creating a covering index for every possible query leads to index explosion. Focus on: (1) queries that run frequently, (2) queries that are performance-critical, (3) queries over large result sets where row fetch overhead is substantial. Let non-critical queries use standard index scans.

Summary: Index-Only Scan Mastery

You now understand index-only scans—one of the most powerful query optimization techniques available. Here are the essential takeaways:

Key Takeaways

•Index-only scans eliminate row fetches — All query data comes from the index itself, skipping the expensive heap access phase entirely.
•All columns must be covered — Every column referenced in SELECT, WHERE, ORDER BY, GROUP BY must be present in the index.
•INCLUDE clause adds non-key columns — PostgreSQL and SQL Server allow including columns without affecting index sort order, optimizing for covering without bloating non-leaf pages.
•PostgreSQL requires visibility map — The visibility map must confirm row visibility; run VACUUM regularly and monitor 'Heap Fetches' in EXPLAIN output.
•Performance gains can be 10-100× — Especially for aggregation queries over large result sets where row fetch overhead would dominate.
•Balance coverage vs overhead — Covering indexes increase storage and write costs; focus on critical queries.

What's Next:

The next page covers Bitmap Scans—a hybrid access method that combines the selectivity benefits of indexes with the sequential access efficiency of table scans. Bitmap scans are particularly useful when multiple indexes can be combined or when index scans would otherwise have poor clustering.

Page Complete

You now have mastery of index-only scans—understanding their mechanics, requirements, cost benefits, and design patterns. This knowledge enables you to design covering indexes that can transform query performance from seconds to milliseconds.