Creating effective covering indexes requires more than simply adding columns to an index. It demands careful analysis of query patterns, understanding of index structure implications, and thoughtful balancing of competing concerns.
A poorly designed covering index can be worse than no index at all: it consumes storage, slows down writes, requires maintenance, and may not even be used by the optimizer. A well-designed covering index provides dramatic query acceleration while minimizing overhead.
This page provides a systematic framework for designing covering indexes that deliver on their promise.
By the end of this page, you will understand how to analyze query workloads for coverage opportunities, the principles governing key column ordering, when to use INCLUDE columns vs. key columns, strategies for covering multiple queries with a single index, and common design anti-patterns to avoid.
Effective covering index design begins with understanding your query workload. Not all queries are candidates for covering indexes, and design effort should focus on high-impact opportunities.
The Workload Analysis Process
```sql
-- Step 1: Enable pg_stat_statements for query analysis
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Step 2: Identify high-frequency queries with table access
SELECT
    substring(query, 1, 100) as query_sample,
    calls,
    total_exec_time / calls as avg_time_ms,
    rows / calls as avg_rows,
    shared_blks_read / calls as avg_disk_reads,
    (shared_blks_hit::float / NULLIF(shared_blks_hit + shared_blks_read, 0) * 100)::int as buffer_hit_pct
FROM pg_stat_statements
WHERE query ILIKE '%orders%'
  AND calls > 100
ORDER BY total_exec_time DESC
LIMIT 20;

-- Step 3: Examine individual queries for column access
-- For each candidate query, extract all referenced columns:
-- - SELECT list columns (output)
-- - WHERE clause columns (filtering)
-- - JOIN condition columns
-- - ORDER BY columns
-- - GROUP BY columns

-- Example: Analyze orders table query patterns
SELECT
    query,
    -- Extract column references (simplified - use a query parser for production)
    calls,
    total_exec_time
FROM pg_stat_statements
WHERE query ILIKE '%FROM orders%'
  AND query NOT ILIKE '%INSERT%'
  AND query NOT ILIKE '%UPDATE%'
ORDER BY total_exec_time DESC;

-- Step 4: Identify queries with poor index coverage
-- Look for queries with high avg_disk_reads indicating heap access
SELECT *
FROM pg_stat_statements
WHERE shared_blks_read / NULLIF(calls, 0) > 100  -- High disk reads
  AND calls > 1000                               -- Frequently executed
ORDER BY shared_blks_read DESC;
```

The order of key columns in a covering index critically affects which queries can use the index and how efficiently. This is one of the most misunderstood aspects of index design.
The Leftmost Prefix Rule
B+-tree indexes can only be used for queries that filter on a leftmost prefix of the key columns. An index on (A, B, C) can be used for:
- A
- A and B
- A, B, and C

But NOT for:

- B
- C
- B and C without A

| Query Filter | Index Usable? | Scan Type | Efficiency |
|---|---|---|---|
| `WHERE A = 1` | ✅ Yes | Index Seek | High |
| `WHERE A = 1 AND B = 2` | ✅ Yes | Index Seek | High |
| `WHERE A = 1 AND B = 2 AND C = 3` | ✅ Yes | Index Seek | Highest |
| `WHERE A = 1 AND C = 3` | ⚠️ Partial | Seek on A, Scan for C | Medium |
| `WHERE B = 2` | ❌ No | Full Index Scan or Table Scan | Low |
| `WHERE B = 2 AND C = 3` | ❌ No | Full Index Scan or Table Scan | Low |
| `WHERE C = 3` | ❌ No | Full Index Scan or Table Scan | Low |
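The rule is easy to confirm for yourself. Below is a minimal sketch, assuming a hypothetical PostgreSQL table `t` with integer columns `a`, `b`, and `c`; the plan comments describe what you would typically see, not guaranteed optimizer output.

```sql
-- Hypothetical table for demonstrating the leftmost prefix rule
CREATE TABLE t (a int, b int, c int, payload text);
INSERT INTO t
SELECT i % 100, i % 1000, i % 10000, repeat('x', 20)
FROM generate_series(1, 1000000) AS s(i);

CREATE INDEX idx_t_abc ON t (a, b, c);
ANALYZE t;

-- Leftmost prefix (a, b): the planner can seek into idx_t_abc
EXPLAIN SELECT c FROM t WHERE a = 1 AND b = 2;

-- No leading column: expect a sequential scan (or a full index scan at best)
EXPLAIN SELECT c FROM t WHERE b = 2 AND c = 3;
```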
Column Ordering Strategy
Given the leftmost prefix rule, order key columns by:
1. Equality columns: columns compared with `=` in WHERE clauses should come first.
2. Range columns: columns used with `>`, `<`, `BETWEEN`, or `LIKE 'prefix%'` go after the equality columns.

Important: Range conditions "break" the key for subsequent columns. After a range condition, the index cannot efficiently filter on later columns.
```sql
-- Example: Analyzing column order for a reporting query

-- Query pattern:
SELECT order_id, customer_id, total_amount, status
FROM orders
WHERE status = 'shipped'              -- Equality filter
  AND region = 'North America'        -- Equality filter
  AND order_date >= '2024-01-01'      -- Range filter
ORDER BY order_date DESC;

-- CORRECT column order:
CREATE INDEX idx_orders_covering
ON orders (status, region, order_date DESC)        -- Keys in correct order
INCLUDE (order_id, customer_id, total_amount);     -- Output-only columns

-- Why this order works:
-- 1. status = 'shipped' → Seek to 'shipped' entries
-- 2. region = 'North America' → Further narrow within 'shipped'
-- 3. order_date >= '2024-01-01' → Range scan (stops the key matching)
-- 4. DESC order matches query ORDER BY (no additional sort needed)

-- INCORRECT column order:
CREATE INDEX idx_orders_wrong
ON orders (order_date, status, region)             -- Range column first!
INCLUDE (order_id, customer_id, total_amount);

-- Why this fails:
-- 1. order_date >= '2024-01-01' is a RANGE, so it scans all matching dates
-- 2. status and region filters must be applied as post-scan filters
-- 3. Much less efficient - scans all dates, then filters
```

A common mistake is placing date ranges early in the index key because dates seem important. But date ranges break the B+-tree seek, forcing scans. Always place equality conditions before range conditions, even if the range column is commonly used.
One of the most important design decisions for covering indexes is choosing which columns belong in the key and which should be in the INCLUDE clause.
The Fundamental Distinction
Key columns define the index's sort order and appear at every level of the B+-tree, so they can drive seeks, range scans, and ordering. INCLUDE (non-key) columns are stored only in the leaf pages: they contribute nothing to navigation, but they let the index cover queries that output those columns while keeping internal pages narrow and fanout high.
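As a quick illustration of the distinction, here are two ways to make the same lookup covered, reusing the running orders example; the comments reflect the general storage behavior described above rather than any one engine's internals.

```sql
-- Target query: SELECT total_amount FROM orders WHERE customer_id = 123;

-- Option 1: everything as key columns. total_amount participates in the
-- index ordering and is carried into internal pages, even though no query
-- ever filters or sorts on it.
CREATE INDEX idx_orders_key_only ON orders (customer_id, total_amount);

-- Option 2: key + INCLUDE. customer_id alone drives the seek; total_amount
-- is stored only in the leaf pages, so internal pages stay narrow.
CREATE INDEX idx_orders_key_include ON orders (customer_id)
INCLUDE (total_amount);
```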
Decision Framework
Use the following matrix to decide where each column belongs:
| Column Usage | Put in KEY | Put in INCLUDE |
|---|---|---|
| Used in WHERE with equality | ✅ Yes | ❌ No |
| Used in WHERE with range | ✅ Yes (after equality cols) | ❌ No |
| Used in ORDER BY | ✅ Yes (to avoid sort) | ⚠️ Only if sort is acceptable |
| Used in GROUP BY | ✅ Usually | ⚠️ Depends |
| Used only in SELECT output | ❌ No | ✅ Yes |
| Used in JOIN conditions | ✅ Yes | ❌ No |
```sql
-- Example: Designing key vs. INCLUDE for a complex query

-- Query:
SELECT
    customer_id,        -- Used in WHERE (equality)
    order_date,         -- Used in WHERE (range) and ORDER BY
    total_amount,       -- Only in SELECT
    status,             -- Only in SELECT
    shipping_address    -- Only in SELECT
FROM orders
WHERE customer_id = @customerId
  AND order_date >= @startDate
ORDER BY order_date DESC;

-- Optimal index design:
CREATE NONCLUSTERED INDEX IX_orders_customer_date_covering
ON orders (customer_id, order_date DESC)            -- Keys: filter + sort columns
INCLUDE (total_amount, status, shipping_address);   -- Include: output-only

-- Why this works:
-- 1. customer_id is first: enables seek for the specific customer
-- 2. order_date DESC is second: enables range scan + provides sort order
-- 3. total_amount, status, shipping_address: only needed for output
--    → INCLUDE is perfect: no sort contribution, just data coverage

-- ANTI-PATTERN: All columns as keys
CREATE NONCLUSTERED INDEX IX_orders_overkill
ON orders (customer_id, order_date, total_amount, status, shipping_address);

-- Problems:
-- 1. All columns stored at internal nodes → lower fanout → taller tree
-- 2. More data to maintain on every INSERT/UPDATE
-- 3. No benefit: we don't filter/sort on amount, status, or address
```

In practice, you want each index to cover multiple related queries rather than creating a separate index for each query. This reduces storage and maintenance overhead while maximizing coverage.
The Column Superset Approach
Analyze related queries and design indexes that cover the union of their column requirements:
Example Query Family:
```sql
-- Query A: Customer dashboard
SELECT order_id, order_date, total_amount
FROM orders WHERE customer_id = ?;

-- Query B: Customer with status filter
SELECT order_id, order_date, total_amount, status
FROM orders WHERE customer_id = ? AND status = 'pending';

-- Query C: Customer order history
SELECT order_id, order_date, total_amount, status, shipping_date
FROM orders WHERE customer_id = ? ORDER BY order_date DESC;
```
Unified Covering Index:
```sql
CREATE INDEX idx_orders_unified
ON orders (customer_id, status, order_date DESC)
INCLUDE (order_id, total_amount, shipping_date);
```
This single index covers all three queries:

- Query A seeks on customer_id; order_id, order_date, and total_amount are all available from the key and INCLUDE columns.
- Query B seeks on customer_id and status, reading the status value straight from the key.
- Query C seeks on customer_id and returns every requested column, including shipping_date, without touching the heap.
Aim to cover 80% of query variations with 20% of possible indexes. A few well-designed covering indexes often outperform many narrowly targeted indexes. Focus on high-frequency query patterns and accept that some queries may not be perfectly covered.
Handling Incompatible Query Patterns
Some queries have fundamentally different access patterns that cannot share an index efficiently:
| Query A | Query B | Compatible? |
|---|---|---|
| `WHERE customer_id = ?` | `WHERE customer_id = ?` | ✅ Yes - same leading column |
| `WHERE customer_id = ?` | `WHERE order_date = ?` | ❌ No - different leading columns |
| `ORDER BY order_date ASC` | `ORDER BY order_date DESC` | ⚠️ Partial - one direction will scan backwards |
| `WHERE region = ?` | `WHERE region IN (?, ?, ?)` | ✅ Yes - IN uses index seeks |
| `WHERE status = 'active'` | `WHERE status != 'active'` | ⚠️ Depends - inequality may scan |
For incompatible patterns, you may need multiple indexes. Use query frequency and impact to prioritize.
```sql
-- Multi-query coverage analysis workflow

-- Step 1: Group queries by table and leading filter columns
-- Identify query families that can share indexes

-- Family 1: Customer-centric queries
-- Leading column: customer_id
-- Additional filters: status, date ranges
-- Output columns: order details

-- Design covering index for Family 1:
CREATE INDEX idx_orders_by_customer
ON orders (customer_id, status, order_date DESC)
INCLUDE (order_id, total_amount, shipping_date, item_count);

-- Verify coverage for all family queries:
EXPLAIN (ANALYZE, COSTS OFF)
SELECT order_id, total_amount FROM orders WHERE customer_id = 123;
-- Expected: Index Only Scan

EXPLAIN (ANALYZE, COSTS OFF)
SELECT order_id, total_amount, status
FROM orders WHERE customer_id = 123 AND status = 'pending';
-- Expected: Index Only Scan

EXPLAIN (ANALYZE, COSTS OFF)
SELECT order_id, order_date, total_amount, shipping_date
FROM orders WHERE customer_id = 123 ORDER BY order_date DESC LIMIT 10;
-- Expected: Index Only Scan with no sort operation

-- Family 2: Date-centric queries (different leading column)
-- These CANNOT share an index with Family 1 efficiently

CREATE INDEX idx_orders_by_date
ON orders (order_date, region)
INCLUDE (order_id, customer_id, total_amount);

-- This covers:
EXPLAIN (ANALYZE, COSTS OFF)
SELECT order_id, customer_id, total_amount
FROM orders WHERE order_date = '2024-01-15';

EXPLAIN (ANALYZE, COSTS OFF)
SELECT SUM(total_amount) FROM orders
WHERE order_date >= '2024-01-01' AND region = 'West';
```

Covering indexes add columns that increase storage requirements. Understanding and managing this overhead is essential for sustainable index design.
Estimating Covering Index Size
Index size depends on:

- The number of rows being indexed
- The combined width of the key and INCLUDE columns
- Per-entry overhead (tuple headers and item pointers)
- B+-tree structural overhead (internal pages and free space, roughly 20% in the estimate below)
```sql
-- Estimate covering index size before creation

-- Get average column widths from statistics
SELECT attname, avg_width, n_distinct
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('customer_id', 'order_date', 'status', 'total_amount', 'shipping_date');

-- Example output:
-- customer_id:   4 bytes (integer)
-- order_date:    4 bytes (date)
-- status:       12 bytes (varchar average)
-- total_amount:  8 bytes (numeric)
-- shipping_date: 4 bytes (date)

-- Calculate estimated index size:
SELECT
    (SELECT reltuples FROM pg_class WHERE relname = 'orders') as row_count,
    -- Per-entry size: columns + 6 bytes tuple header + 6 bytes ItemId
    (4 + 4 + 12 + 8 + 4) + 6 + 6 as bytes_per_entry,
    -- Estimated total (with 20% B-tree overhead)
    pg_size_pretty(
        ((SELECT reltuples FROM pg_class WHERE relname = 'orders')
         * ((4 + 4 + 12 + 8 + 4) + 12) * 1.2)::bigint
    ) as estimated_index_size;

-- After index creation, verify actual size:
SELECT indexname,
       pg_size_pretty(pg_relation_size(pg_class.oid)) as actual_size
FROM pg_indexes
JOIN pg_class ON pg_class.relname = indexname
WHERE tablename = 'orders';

-- Compare covering index size to table size:
SELECT
    pg_size_pretty(pg_relation_size('orders')) as table_size,
    pg_size_pretty(pg_relation_size('idx_orders_covering')) as index_size,
    round(100.0 * pg_relation_size('idx_orders_covering')
          / pg_relation_size('orders'), 1) as index_pct_of_table;
```

Storage Management Strategies
Three strategies keep covering-index storage in check, each illustrated in the code below:

- Partial indexes: index only the rows queries actually touch. A predicate such as `WHERE is_active = true` eliminates inactive rows from the index.
- Compression: page-level index compression (SQL Server's `DATA_COMPRESSION = PAGE`) typically reduces index size by 50-70%.
- Partitioned coverage: give recent, frequently queried partitions a full covering index and keep only a minimal index on older partitions.
```sql
-- Strategy 1: Partial Index for Active Records Only
-- Full table: 100M rows, but only 5M are status='active'

-- Instead of indexing all rows:
CREATE INDEX idx_orders_all ON orders (customer_id)
INCLUDE (order_date, total_amount);
-- Size: ~8GB

-- Use a partial index:
CREATE INDEX idx_orders_active ON orders (customer_id)
INCLUDE (order_date, total_amount)
WHERE status = 'active';
-- Size: ~400MB (5% of original!)

-- Query must include matching WHERE clause:
SELECT order_date, total_amount FROM orders
WHERE customer_id = 123 AND status = 'active';
-- Uses idx_orders_active

-- Strategy 2: Compression (SQL Server example)
CREATE NONCLUSTERED INDEX IX_orders_compressed
ON orders (customer_id, order_date)
INCLUDE (total_amount, status)
WITH (DATA_COMPRESSION = PAGE);
-- PAGE compression typically reduces size 50-70%

-- Strategy 3: Partitioned Coverage
-- Cover only recent partitions with full covering index
-- PostgreSQL example with declarative partitioning:

CREATE INDEX idx_orders_2024_covering
ON orders_2024 (customer_id, order_date)
INCLUDE (order_id, total_amount, status);

-- Older partitions: minimal index only
CREATE INDEX idx_orders_2023_minimal
ON orders_2023 (customer_id, order_date);
-- Queries on old data are less frequent, can tolerate heap access
```

Learning from common mistakes accelerates mastery. Here are anti-patterns frequently seen in covering index design:
```sql
-- ANTI-PATTERN 1: The "Cover Everything" Index
-- DON'T DO THIS
CREATE INDEX idx_orders_everything ON orders (
    order_id, customer_id, order_date, status, total_amount,
    shipping_address, billing_address, created_at, updated_at,
    product_count, discount_code, sales_rep_id, notes
);
-- Problems:
-- - Massive index size (possibly larger than table)
-- - Every INSERT/UPDATE touches all columns
-- - Internal nodes are huge, tree is tall

-- BETTER: Multiple focused covering indexes
CREATE INDEX idx_orders_customer_lookup
ON orders (customer_id, order_date)
INCLUDE (order_id, total_amount, status);

CREATE INDEX idx_orders_status_report
ON orders (status, order_date)
INCLUDE (order_id, customer_id, total_amount);

-- ANTI-PATTERN 2: Wrong Key Column Order
-- Query: WHERE region = 'West' AND order_date >= '2024-01-01'

-- DON'T DO THIS (range column first)
CREATE INDEX idx_wrong_order ON orders (order_date, region);
-- Can't seek to region='West', must scan all dates

-- CORRECT (equality column first)
CREATE INDEX idx_right_order ON orders (region, order_date);
-- Seeks to 'West', then range scan on date

-- ANTI-PATTERN 3: Duplicate Coverage
-- DON'T create both of these:
CREATE INDEX idx_orders_v1 ON orders (customer_id)
INCLUDE (order_date, total_amount);

CREATE INDEX idx_orders_v2 ON orders (customer_id)
INCLUDE (order_date, total_amount, status);

-- CONSOLIDATE into one:
CREATE INDEX idx_orders_unified ON orders (customer_id)
INCLUDE (order_date, total_amount, status);
-- The superset covers all queries the subsets would cover
```

We've covered the comprehensive framework for designing effective covering indexes. Here are the essential principles:

- Start from the workload: measure which queries run most often and which still pay for heap access.
- Order key columns by the leftmost prefix rule: equality filters first, range filters last.
- Put filter, join, and sort columns in the key; put output-only columns in INCLUDE.
- Design one index per query family, covering the union of the family's column requirements.
- Control storage with partial indexes, compression, and partition-aware coverage.
- Avoid the anti-patterns: cover-everything indexes, range columns leading the key, and duplicate coverage.
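As a practical follow-up, it is worth checking periodically that the covering indexes you create are actually used and not duplicated. Below is a minimal PostgreSQL sketch using the standard pg_stat_user_indexes and pg_indexes views; the scan-count threshold is illustrative, not a rule.

```sql
-- Indexes that are rarely scanned since statistics were last reset
SELECT relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan < 50                      -- illustrative threshold
ORDER BY pg_relation_size(indexrelid) DESC;

-- Review definitions per table to spot overlapping or duplicate coverage
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexdef;
```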
What's Next
Now that we understand how to design covering indexes, we'll examine the trade-offs involved—the costs of covering indexes in terms of write overhead, storage consumption, maintenance requirements, and situations where covering indexes may not be the optimal choice.
You now have a systematic framework for designing covering indexes that maximize query coverage while minimizing overhead. This knowledge enables you to create indexes that will stand the test of production workloads.