Creating effective covering indexes requires more than simply adding columns to an index. It demands careful analysis of query patterns, understanding of index structure implications, and thoughtful balancing of competing concerns.
A poorly designed covering index can be worse than no index at all: it consumes storage, slows down writes, requires maintenance, and may not even be used by the optimizer. A well-designed covering index provides dramatic query acceleration while minimizing overhead.
This page provides a systematic framework for designing covering indexes that deliver on their promise.
By the end of this page, you will understand how to analyze query workloads for coverage opportunities, the principles governing key column ordering, when to use INCLUDE columns vs. key columns, strategies for covering multiple queries with a single index, and common design anti-patterns to avoid.
Effective covering index design begins with understanding your query workload. Not all queries are candidates for covering indexes, and design effort should focus on high-impact opportunities.
The Workload Analysis Process
```sql
-- Step 1: Enable pg_stat_statements for query analysis
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Step 2: Identify high-frequency queries with table access
SELECT
    substring(query, 1, 100) as query_sample,
    calls,
    total_exec_time / calls as avg_time_ms,
    rows / calls as avg_rows,
    shared_blks_read / calls as avg_disk_reads,
    (shared_blks_hit::float / NULLIF(shared_blks_hit + shared_blks_read, 0) * 100)::int as buffer_hit_pct
FROM pg_stat_statements
WHERE query ILIKE '%orders%'
  AND calls > 100
ORDER BY total_exec_time DESC
LIMIT 20;

-- Step 3: Examine individual queries for column access
-- For each candidate query, extract all referenced columns:
-- - SELECT list columns (output)
-- - WHERE clause columns (filtering)
-- - JOIN condition columns
-- - ORDER BY columns
-- - GROUP BY columns

-- Example: Analyze orders table query patterns
SELECT
    query,
    -- Extract column references (simplified - use a query parser for production)
    calls,
    total_exec_time
FROM pg_stat_statements
WHERE query ILIKE '%FROM orders%'
  AND query NOT ILIKE '%INSERT%'
  AND query NOT ILIKE '%UPDATE%'
ORDER BY total_exec_time DESC;

-- Step 4: Identify queries with poor index coverage
-- Look for queries with high avg_disk_reads indicating heap access
SELECT *
FROM pg_stat_statements
WHERE shared_blks_read / NULLIF(calls, 0) > 100  -- High disk reads
  AND calls > 1000                               -- Frequently executed
ORDER BY shared_blks_read DESC;
```

The order of key columns in a covering index critically affects which queries can use the index and how efficiently. This is one of the most misunderstood aspects of index design.
The Leftmost Prefix Rule
B+-tree indexes can only be used for queries that filter on a leftmost prefix of the key columns. An index on (A, B, C) can be used for:
- A
- A and B
- A, B, and C

But NOT for:

- B
- C
- B and C without A

| Query Filter | Index Usable? | Scan Type | Efficiency |
|---|---|---|---|
| `WHERE A = 1` | ✅ Yes | Index Seek | High |
| `WHERE A = 1 AND B = 2` | ✅ Yes | Index Seek | High |
| `WHERE A = 1 AND B = 2 AND C = 3` | ✅ Yes | Index Seek | Highest |
| `WHERE A = 1 AND C = 3` | ⚠️ Partial | Seek on A, Scan for C | Medium |
| `WHERE B = 2` | ❌ No | Full Index Scan or Table Scan | Low |
| `WHERE B = 2 AND C = 3` | ❌ No | Full Index Scan or Table Scan | Low |
| `WHERE C = 3` | ❌ No | Full Index Scan or Table Scan | Low |
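The rule is easy to confirm for yourself. Below is a minimal sketch, assuming a hypothetical PostgreSQL table `t` with integer columns `a`, `b`, and `c`; the plan comments describe what you would typically see, not guaranteed optimizer output.

```sql
-- Hypothetical table for demonstrating the leftmost prefix rule
CREATE TABLE t (a int, b int, c int, payload text);
INSERT INTO t
SELECT i % 100, i % 1000, i % 10000, repeat('x', 20)
FROM generate_series(1, 1000000) AS s(i);

CREATE INDEX idx_t_abc ON t (a, b, c);
ANALYZE t;

-- Leftmost prefix (a, b): the planner can seek into idx_t_abc
EXPLAIN SELECT c FROM t WHERE a = 1 AND b = 2;

-- No leading column: expect a sequential scan (or a full index scan at best)
EXPLAIN SELECT c FROM t WHERE b = 2 AND c = 3;
```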
Column Ordering Strategy
Given the leftmost prefix rule, order key columns by:
1. Equality columns: columns compared with `=` in WHERE clauses should come first.
2. Range columns: columns used with `>`, `<`, `BETWEEN`, or `LIKE 'prefix%'` go after the equality columns.

Important: Range conditions "break" the key for subsequent columns. After a range condition, the index cannot efficiently filter on later columns.
```sql
-- Example: Analyzing column order for a reporting query

-- Query pattern:
SELECT order_id, customer_id, total_amount, status
FROM orders
WHERE status = 'shipped'              -- Equality filter
  AND region = 'North America'        -- Equality filter
  AND order_date >= '2024-01-01'      -- Range filter
ORDER BY order_date DESC;

-- CORRECT column order:
CREATE INDEX idx_orders_covering
ON orders (status, region, order_date DESC)        -- Keys in correct order
INCLUDE (order_id, customer_id, total_amount);     -- Output-only columns

-- Why this order works:
-- 1. status = 'shipped' → Seek to 'shipped' entries
-- 2. region = 'North America' → Further narrow within 'shipped'
-- 3. order_date >= '2024-01-01' → Range scan (stops the key matching)
-- 4. DESC order matches query ORDER BY (no additional sort needed)

-- INCORRECT column order:
CREATE INDEX idx_orders_wrong
ON orders (order_date, status, region)             -- Range column first!
INCLUDE (order_id, customer_id, total_amount);

-- Why this fails:
-- 1. order_date >= '2024-01-01' is a RANGE, so it scans all matching dates
-- 2. status and region filters must be applied as post-scan filters
-- 3. Much less efficient - scans all dates, then filters
```

A common mistake is placing date ranges early in the index key because dates seem important. But date ranges break the B+-tree seek, forcing scans. Always place equality conditions before range conditions, even if the range column is commonly used.
One of the most important design decisions for covering indexes is choosing which columns belong in the key and which should be in the INCLUDE clause.
The Fundamental Distinction
Key columns define the index's sort order and appear at every level of the B+-tree, so they can drive seeks, range scans, and ordering. INCLUDE (non-key) columns are stored only in the leaf pages: they contribute nothing to navigation, but they let the index cover queries that output those columns while keeping internal pages narrow and fanout high.
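As a quick illustration of the distinction, here are two ways to make the same lookup covered, reusing the running orders example; the comments reflect the general storage behavior described above rather than any one engine's internals.

```sql
-- Target query: SELECT total_amount FROM orders WHERE customer_id = 123;

-- Option 1: everything as key columns. total_amount participates in the
-- index ordering and is carried into internal pages, even though no query
-- ever filters or sorts on it.
CREATE INDEX idx_orders_key_only ON orders (customer_id, total_amount);

-- Option 2: key + INCLUDE. customer_id alone drives the seek; total_amount
-- is stored only in the leaf pages, so internal pages stay narrow.
CREATE INDEX idx_orders_key_include ON orders (customer_id)
INCLUDE (total_amount);
```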
Decision Framework
Use the following matrix to decide where each column belongs:
| Column Usage | Put in KEY | Put in INCLUDE |
|---|---|---|
| Used in WHERE with equality | ✅ Yes | ❌ No |
| Used in WHERE with range | ✅ Yes (after equality cols) | ❌ No |
| Used in ORDER BY | ✅ Yes (to avoid sort) | ⚠️ Only if sort is acceptable |
| Used in GROUP BY | ✅ Usually | ⚠️ Depends |
| Used only in SELECT output | ❌ No | ✅ Yes |
| Used in JOIN conditions | ✅ Yes | ❌ No |
```sql
-- Example: Designing key vs. INCLUDE for a complex query

-- Query:
SELECT
    customer_id,        -- Used in WHERE (equality)
    order_date,         -- Used in WHERE (range) and ORDER BY
    total_amount,       -- Only in SELECT
    status,             -- Only in SELECT
    shipping_address    -- Only in SELECT
FROM orders
WHERE customer_id = @customerId
  AND order_date >= @startDate
ORDER BY order_date DESC;

-- Optimal index design:
CREATE NONCLUSTERED INDEX IX_orders_customer_date_covering
ON orders (customer_id, order_date DESC)            -- Keys: filter + sort columns
INCLUDE (total_amount, status, shipping_address);   -- Include: output-only

-- Why this works:
-- 1. customer_id is first: enables seek for the specific customer
-- 2. order_date DESC is second: enables range scan + provides sort order
-- 3. total_amount, status, shipping_address: only needed for output
--    → INCLUDE is perfect: no sort contribution, just data coverage

-- ANTI-PATTERN: All columns as keys
CREATE NONCLUSTERED INDEX IX_orders_overkill
ON orders (customer_id, order_date, total_amount, status, shipping_address);

-- Problems:
-- 1. All columns stored at internal nodes → lower fanout → taller tree
-- 2. More data to maintain on every INSERT/UPDATE
-- 3. No benefit: we don't filter/sort on amount, status, or address
```

In practice, you want each index to cover multiple related queries rather than creating a separate index for each query. This reduces storage and maintenance overhead while maximizing coverage.
The Column Superset Approach
Analyze related queries and design indexes that cover the union of their column requirements:
Example Query Family:
```sql
-- Query A: Customer dashboard
SELECT order_id, order_date, total_amount
FROM orders WHERE customer_id = ?;

-- Query B: Customer with status filter
SELECT order_id, order_date, total_amount, status
FROM orders WHERE customer_id = ? AND status = 'pending';

-- Query C: Customer order history
SELECT order_id, order_date, total_amount, status, shipping_date
FROM orders WHERE customer_id = ? ORDER BY order_date DESC;
```
Unified Covering Index:
```sql
CREATE INDEX idx_orders_unified
ON orders (customer_id, status, order_date DESC)
INCLUDE (order_id, total_amount, shipping_date);
```
This single index covers all three queries:

- Query A seeks on customer_id; order_id, order_date, and total_amount are all available from the key and INCLUDE columns.
- Query B seeks on customer_id and status, reading the status value straight from the key.
- Query C seeks on customer_id and returns every requested column, including shipping_date, without touching the heap.
Aim to cover 80% of query variations with 20% of possible indexes. A few well-designed covering indexes often outperform many narrowly targeted indexes. Focus on high-frequency query patterns and accept that some queries may not be perfectly covered.
Handling Incompatible Query Patterns
Some queries have fundamentally different access patterns that cannot share an index efficiently:
| Query A | Query B | Compatible? |
|---|---|---|
| `WHERE customer_id = ?` | `WHERE customer_id = ?` | ✅ Yes - same leading column |
| `WHERE customer_id = ?` | `WHERE order_date = ?` | ❌ No - different leading columns |
| `ORDER BY order_date ASC` | `ORDER BY order_date DESC` | ⚠️ Partial - one direction will scan backwards |
| `WHERE region = ?` | `WHERE region IN (?, ?, ?)` | ✅ Yes - IN uses index seeks |
| `WHERE status = 'active'` | `WHERE status != 'active'` | ⚠️ Depends - inequality may scan |
For incompatible patterns, you may need multiple indexes. Use query frequency and impact to prioritize.
```sql
-- Multi-query coverage analysis workflow

-- Step 1: Group queries by table and leading filter columns
-- Identify query families that can share indexes

-- Family 1: Customer-centric queries
-- Leading column: customer_id
-- Additional filters: status, date ranges
-- Output columns: order details

-- Design covering index for Family 1:
CREATE INDEX idx_orders_by_customer
ON orders (customer_id, status, order_date DESC)
INCLUDE (order_id, total_amount, shipping_date, item_count);

-- Verify coverage for all family queries:
EXPLAIN (ANALYZE, COSTS OFF)
SELECT order_id, total_amount FROM orders WHERE customer_id = 123;
-- Expected: Index Only Scan

EXPLAIN (ANALYZE, COSTS OFF)
SELECT order_id, total_amount, status
FROM orders WHERE customer_id = 123 AND status = 'pending';
-- Expected: Index Only Scan

EXPLAIN (ANALYZE, COSTS OFF)
SELECT order_id, order_date, total_amount, shipping_date
FROM orders WHERE customer_id = 123 ORDER BY order_date DESC LIMIT 10;
-- Expected: Index Only Scan with no sort operation

-- Family 2: Date-centric queries (different leading column)
-- These CANNOT share an index with Family 1 efficiently

CREATE INDEX idx_orders_by_date
ON orders (order_date, region)
INCLUDE (order_id, customer_id, total_amount);

-- This covers:
EXPLAIN (ANALYZE, COSTS OFF)
SELECT order_id, customer_id, total_amount
FROM orders WHERE order_date = '2024-01-15';

EXPLAIN (ANALYZE, COSTS OFF)
SELECT SUM(total_amount) FROM orders
WHERE order_date >= '2024-01-01' AND region = 'West';
```

Covering indexes add columns that increase storage requirements. Understanding and managing this overhead is essential for sustainable index design.
Estimating Covering Index Size
Index size depends on:

- The number of rows being indexed
- The combined width of the key and INCLUDE columns
- Per-entry overhead (tuple headers and item pointers)
- B+-tree structural overhead (internal pages and free space, roughly 20% in the estimate below)
```sql
-- Estimate covering index size before creation

-- Get average column widths from statistics
SELECT attname, avg_width, n_distinct
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('customer_id', 'order_date', 'status', 'total_amount', 'shipping_date');

-- Example output:
-- customer_id:   4 bytes (integer)
-- order_date:    4 bytes (date)
-- status:       12 bytes (varchar average)
-- total_amount:  8 bytes (numeric)
-- shipping_date: 4 bytes (date)

-- Calculate estimated index size:
SELECT
    (SELECT reltuples FROM pg_class WHERE relname = 'orders') as row_count,
    -- Per-entry size: columns + 6 bytes tuple header + 6 bytes ItemId
    (4 + 4 + 12 + 8 + 4) + 6 + 6 as bytes_per_entry,
    -- Estimated total (with 20% B-tree overhead)
    pg_size_pretty(
        ((SELECT reltuples FROM pg_class WHERE relname = 'orders')
         * ((4 + 4 + 12 + 8 + 4) + 12) * 1.2)::bigint
    ) as estimated_index_size;

-- After index creation, verify actual size:
SELECT indexname,
       pg_size_pretty(pg_relation_size(pg_class.oid)) as actual_size
FROM pg_indexes
JOIN pg_class ON pg_class.relname = indexname
WHERE tablename = 'orders';

-- Compare covering index size to table size:
SELECT
    pg_size_pretty(pg_relation_size('orders')) as table_size,
    pg_size_pretty(pg_relation_size('idx_orders_covering')) as index_size,
    round(100.0 * pg_relation_size('idx_orders_covering')
          / pg_relation_size('orders'), 1) as index_pct_of_table;
```

Storage Management Strategies
Three strategies keep covering-index storage in check, each illustrated in the code below:

- Partial indexes: index only the rows queries actually touch. A predicate such as `WHERE is_active = true` eliminates inactive rows from the index.
- Compression: page-level index compression (SQL Server's `DATA_COMPRESSION = PAGE`) typically reduces index size by 50-70%.
- Partitioned coverage: give recent, frequently queried partitions a full covering index and keep only a minimal index on older partitions.
```sql
-- Strategy 1: Partial Index for Active Records Only
-- Full table: 100M rows, but only 5M are status='active'

-- Instead of indexing all rows:
CREATE INDEX idx_orders_all ON orders (customer_id)
INCLUDE (order_date, total_amount);
-- Size: ~8GB

-- Use a partial index:
CREATE INDEX idx_orders_active ON orders (customer_id)
INCLUDE (order_date, total_amount)
WHERE status = 'active';
-- Size: ~400MB (5% of original!)

-- Query must include matching WHERE clause:
SELECT order_date, total_amount FROM orders
WHERE customer_id = 123 AND status = 'active';
-- Uses idx_orders_active

-- Strategy 2: Compression (SQL Server example)
CREATE NONCLUSTERED INDEX IX_orders_compressed
ON orders (customer_id, order_date)
INCLUDE (total_amount, status)
WITH (DATA_COMPRESSION = PAGE);
-- PAGE compression typically reduces size 50-70%

-- Strategy 3: Partitioned Coverage
-- Cover only recent partitions with full covering index
-- PostgreSQL example with declarative partitioning:

CREATE INDEX idx_orders_2024_covering
ON orders_2024 (customer_id, order_date)
INCLUDE (order_id, total_amount, status);

-- Older partitions: minimal index only
CREATE INDEX idx_orders_2023_minimal
ON orders_2023 (customer_id, order_date);
-- Queries on old data are less frequent, can tolerate heap access
```

Learning from common mistakes accelerates mastery. Here are anti-patterns frequently seen in covering index design:
```sql
-- ANTI-PATTERN 1: The "Cover Everything" Index
-- DON'T DO THIS
CREATE INDEX idx_orders_everything ON orders (
    order_id, customer_id, order_date, status, total_amount,
    shipping_address, billing_address, created_at, updated_at,
    product_count, discount_code, sales_rep_id, notes
);
-- Problems:
-- - Massive index size (possibly larger than table)
-- - Every INSERT/UPDATE touches all columns
-- - Internal nodes are huge, tree is tall

-- BETTER: Multiple focused covering indexes
CREATE INDEX idx_orders_customer_lookup
ON orders (customer_id, order_date)
INCLUDE (order_id, total_amount, status);

CREATE INDEX idx_orders_status_report
ON orders (status, order_date)
INCLUDE (order_id, customer_id, total_amount);

-- ANTI-PATTERN 2: Wrong Key Column Order
-- Query: WHERE region = 'West' AND order_date >= '2024-01-01'

-- DON'T DO THIS (range column first)
CREATE INDEX idx_wrong_order ON orders (order_date, region);
-- Can't seek to region='West', must scan all dates

-- CORRECT (equality column first)
CREATE INDEX idx_right_order ON orders (region, order_date);
-- Seeks to 'West', then range scan on date

-- ANTI-PATTERN 3: Duplicate Coverage
-- DON'T create both of these:
CREATE INDEX idx_orders_v1 ON orders (customer_id)
INCLUDE (order_date, total_amount);

CREATE INDEX idx_orders_v2 ON orders (customer_id)
INCLUDE (order_date, total_amount, status);

-- CONSOLIDATE into one:
CREATE INDEX idx_orders_unified ON orders (customer_id)
INCLUDE (order_date, total_amount, status);
-- The superset covers all queries the subsets would cover
```

We've covered the comprehensive framework for designing effective covering indexes. Here are the essential principles:

- Start from the workload: measure which queries run most often and which still pay for heap access.
- Order key columns by the leftmost prefix rule: equality filters first, range filters last.
- Put filter, join, and sort columns in the key; put output-only columns in INCLUDE.
- Design one index per query family, covering the union of the family's column requirements.
- Control storage with partial indexes, compression, and partition-aware coverage.
- Avoid the anti-patterns: cover-everything indexes, range columns leading the key, and duplicate coverage.
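As a practical follow-up, it is worth checking periodically that the covering indexes you create are actually used and not duplicated. Below is a minimal PostgreSQL sketch using the standard pg_stat_user_indexes and pg_indexes views; the scan-count threshold is illustrative, not a rule.

```sql
-- Indexes that are rarely scanned since statistics were last reset
SELECT relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan < 50                      -- illustrative threshold
ORDER BY pg_relation_size(indexrelid) DESC;

-- Review definitions per table to spot overlapping or duplicate coverage
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexdef;
```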
What's Next
Now that we understand how to design covering indexes, we'll examine the trade-offs involved—the costs of covering indexes in terms of write overhead, storage consumption, maintenance requirements, and situations where covering indexes may not be the optimal choice.
You now have a systematic framework for designing covering indexes that maximize query coverage while minimizing overhead. This knowledge enables you to create indexes that will stand the test of production workloads.