Index Selection - Learning Module

Loading content...

0/252

Query Patterns

Understanding Query Patterns

Effective index design begins with understanding your queries. An index is only valuable if it serves actual query patterns—otherwise, it's pure overhead. Yet many database administrators create indexes based on intuition or table structure rather than careful query analysis.

The Query-First Principle:

Professional index design follows a fundamental principle: indexes exist to serve queries, not tables. The structure of a table tells you what could be indexed. The actual query workload tells you what should be indexed.

This page teaches you to read queries like an optimizer, identify indexable patterns, and design indexes that precisely match your workload's needs.

What You Will Learn

By the end of this page, you will understand how to classify queries by their indexing requirements, identify which parts of a query benefit from indexes, recognize patterns that prevent index usage, and design indexes that serve multiple query patterns efficiently.

Query Classification for Indexing

Before designing indexes, you must understand the types of queries in your workload. Different query patterns have different indexing requirements.

The Four Query Archetypes:

All queries can be classified into archetypes based on how they access data:

Query Archetypes and Index Strategy
Archetype	Description	Example	Index Strategy
Point Lookup	Retrieve single row by exact key	`WHERE id = 123`	Highly selective index essential. B+-tree optimal.
Range Scan	Retrieve rows within value range	`WHERE date BETWEEN '2024-01-01' AND '2024-01-31'`	Ordered index critical. Clustering beneficial.
Multi-Column Equality	Multiple AND conditions	`WHERE status = 'active' AND region = 'US'`	Composite index, most selective column first.
Mixed Equality and Range	Equality + range on different columns	`WHERE customer_id = 5 AND date > '2024-01-01'`	Composite: equality columns first, then range.
Pattern Search	Prefix or like patterns	`WHERE name LIKE 'Joh%'`	B+-tree for prefix patterns. Full-text for others.
Aggregation	COUNT, SUM, AVG over subsets	`SELECT COUNT(*) WHERE region = 'EU'`	Covering index eliminates table access.
Join Queries	Multi-table joins	`FROM orders o JOIN customers c ON o.customer_id = c.id`	Index on join columns for both tables.
Sorting/Top-N	ORDER BY with LIMIT	`ORDER BY created_at DESC LIMIT 10`	Index on sort column(s) eliminates sort step.

Priority Assessment:

Not all queries deserve equal indexing effort. Classify queries by:

Frequency: How often does this query run?
Latency sensitivity: Is this user-facing or background?
Current performance: How far from acceptable is it?
Data volume: How many rows does it process?

A user-facing query running 10,000 times per hour with 2-second latency is your highest priority. A nightly batch job with 30-second runtime might be acceptable as-is.

The Pareto Principle of Queries

In most systems, 20% of queries generate 80% of database load. Focus indexing efforts on this critical minority. Perfect indexes for rare queries at the expense of common ones is counterproductive optimization.

Anatomy of an Indexable Query

A query consists of multiple clauses, each with different indexing implications. Understanding which parts can be accelerated by indexes is essential for effective index design.

The Indexable Components:

SELECT [columns]              -- Affects: Covering index opportunity
FROM [table]                  -- The table to potentially index
WHERE [predicates]            -- Primary target for index filtering
AND [more predicates]         -- Additional filter opportunities
ORDER BY [columns]            -- Sort elimination opportunity
GROUP BY [columns]            -- Aggregation optimization
LIMIT [n]                     -- Early termination with sorted index

Clause-by-Clause Index Opportunities

•WHERE predicates — The primary indexing target. Indexes enable the optimizer to skip non-matching rows entirely rather than reading and filtering them.
•JOIN conditions — Join columns should be indexed on the inner table (the one being looked up for each outer row). Foreign key indexes serve this purpose.
•ORDER BY columns — If the index is sorted in the same order as ORDER BY, the database can avoid an expensive sort operation.
•GROUP BY columns — Some databases can use sorted indexes to perform streaming aggregation without materializing all groups.
•SELECT columns — If all selected columns are in the index, the table need never be accessed (covering index).
•DISTINCT / UNION — Duplicate elimination may leverage sorted indexes for efficient deduplication.

Query Analysis Example
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
-- Example query to analyze
SELECT
    o.order_id,           -- Column needed: order_id
    o.total,              -- Column needed: total
    c.company_name        -- From joined table
FROM orders o
JOIN customers c ON o.customer_id = c.id    -- Join: customer_id → id
WHERE
    o.status = 'shipped'                    -- Equality predicate
    AND o.order_date >= '2024-01-01'        -- Range predicate
    AND o.region = 'EMEA'                   -- Equality predicate
ORDER BY o.order_date DESC                  -- Sort requirement
LIMIT 50;                                   -- Top-N pattern
 
/*
ANALYSIS:
 
1. WHERE predicates:
   - status = 'shipped' (equality)
   - order_date >= '2024-01-01' (range)
   - region = 'EMEA' (equality)
 
2. Best composite index for orders:
   Index on (status, region, order_date DESC, order_id, total)
 
   Why this order?
   - status first: equality predicate, likely moderate selectivity
   - region second: equality predicate, combined with status for narrow search
   - order_date third: range predicate (must come after equalities)
   - order_date DESC: matches ORDER BY to eliminate sort
   - order_id, total: covering columns to avoid table access
 
3. Join optimization:
   - customers.id should be indexed (typically primary key)
 
4. Result: Index-only scan with no sort, limited to 50 rows
*/

Order Matters for Composite Indexes

In a composite index, column order is critical. The index (A, B, C) can efficiently serve WHERE A = ?, WHERE A = ? AND B = ?, but NOT efficiently serve WHERE B = ? alone. Place columns in order of: equality predicates, then range predicates, then sort columns, then covering columns.

Patterns That Prevent Index Usage

Certain query patterns prevent the optimizer from using indexes, even when seemingly appropriate indexes exist. Recognizing these patterns is essential for both query writing and index design.

Problem: Applying a function to an indexed column prevents index usage.

-- INDEX on email column exists, but CANNOT be used:
SELECT * FROM users WHERE UPPER(email) = 'JOHN@EXAMPLE.COM';
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
SELECT * FROM products WHERE price * 1.1 > 100;

Why: The index stores values of email, not UPPER(email). To use the index, the database would need to compute UPPER() for every row to compare—defeating the purpose.

Solutions:

Rewrite the query to avoid the function:

-- Instead of YEAR(order_date) = 2024
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'

Create an expression/functional index:

-- PostgreSQL
CREATE INDEX idx_users_email_upper ON users (UPPER(email));

-- MySQL 8.0+
CREATE INDEX idx_users_email_upper ON users ((UPPER(email)));

-- SQL Server (computed column + index)
ALTER TABLE users ADD email_upper AS UPPER(email);
CREATE INDEX idx_users_email_upper ON users (email_upper);

Store normalized values:

-- Add column for indexed lookups
ALTER TABLE users ADD email_normalized VARCHAR(255);
CREATE INDEX idx_users_email_norm ON users (email_normalized);

Reading Query Execution Plans

Query execution plans are your window into how the database actually processes queries. They reveal whether indexes are used, how data is accessed, and where bottlenecks occur. Mastering plan reading is essential for index design.

Generating Execution Plans:

Generating Query Plans
1
2
3
4
5
6
7
8
9
10
-- Basic explain
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
 
-- With execution statistics (actually runs query)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 123;
 
-- JSON format for programmatic analysis
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 123;

Key Plan Elements to Examine:

Understanding Query Plan Operators
Operator	Description	Performance Implication
Seq Scan / Table Scan	Reads every row in table	Potentially slow for large tables. May indicate missing index.
Index Scan	Uses index but also reads table	Good for selective queries. Check# of rows returned.
Index Only Scan	Answers query entirely from index	Excellent. No table access needed (covering index).
Bitmap Index Scan	Collects row locations, then fetches	Good for moderate selectivity or OR conditions.
Sort	Explicit sorting step	CPU-intensive. Index on ORDER BY columns can eliminate.
Hash Join	Builds hash table from one input	Efficient for many workloads. Memory-dependent.
Nested Loop	For each outer row, scan inner	Best with small outer + indexed inner.
Merge Join	Merges two sorted inputs	Efficient when inputs already sorted.
Filter	Applies WHERE after fetching rows	Post-fetch filtering. May indicate unusable predicate.

Interpreting a Query Plan
PostgreSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
-- Sample query
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.company_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'shipped' AND o.order_date > '2024-01-01'
ORDER BY o.order_date DESC
LIMIT 100;
 
/*
RESULT:
 
Limit  (cost=0.87..1250.43 rows=100 width=48) (actual time=0.089..2.156 rows=100)
  Buffers: shared hit=412
  ->  Nested Loop  (cost=0.87..45621.34 rows=3650 width=48) (actual time=0.088..2.134)
        Buffers: shared hit=412
        ->  Index Scan Backward using idx_orders_status_date on orders o
              (cost=0.43..12345.67 rows=3650 width=20) (actual time=0.052..0.845)
              Index Cond: ((status = 'shipped') AND (order_date > '2024-01-01'))
              Buffers: shared hit=156
        ->  Index Scan using customers_pkey on customers c
              (cost=0.43..8.89 rows=1 width=32) (actual time=0.008..0.008)
              Index Cond: (id = o.customer_id)
              Buffers: shared hit=256
 
ANALYSIS:
✓ Index Scan Backward: Using index for both filter AND sort (no Sort operator)
✓ Limit applied early: Only fetching 100 rows, not all 3650
✓ Index Cond shows predicates: Both status and order_date using index
✓ Nested Loop with index: Efficient join strategy
✓ shared hit=412: All pages from cache (no disk I/O needed)
✓ actual time=2.156ms: Excellent performance
 
NO IMPROVEMENTS NEEDED - Query is well-optimized
*/

Estimated vs Actual Rows

Compare 'estimated rows' to 'actual rows' in the plan. Large discrepancies (10× or more) indicate stale or missing statistics. The optimizer chose its plan based on wrong assumptions. Run ANALYZE/UPDATE STATISTICS to refresh statistics.

Designing Indexes for Multiple Queries

In production systems, you rarely design indexes for a single query. The challenge is creating indexes that serve multiple query patterns efficiently while minimizing total index count.

The Composite Index Advantage:

A well-designed composite index can serve multiple queries:

-- Composite index: (customer_id, status, order_date)

-- Query 1: All orders for a customer
WHERE customer_id = 123                              -- Uses index ✓

-- Query 2: Pending orders for a customer
WHERE customer_id = 123 AND status = 'pending'       -- Uses index ✓

-- Query 3: Recent pending orders for a customer
WHERE customer_id = 123 AND status = 'pending'
  AND order_date > '2024-01-01'                      -- Uses index ✓

-- Query 4: All pending orders (no customer filter)
WHERE status = 'pending'                             -- CANNOT use index efficiently ✗

The composite index supports any query that uses a prefix of its columns.

Composite Index Design Rules

•Place columns used with = first — Equality predicates can match exactly, enabling efficient subtree navigation.
•Range predicates come after equalities — A range on column 2 means columns 3+ in the index are unused for filtering.
•Order columns by frequency of use — If 90% of queries include customer_id and only 50% include status, put customer_id first.
•Order by selectivity (with caution) — More selective columns first narrow results faster, but frequency often trumps selectivity.
•Consider ORDER BY alignment — If you always sort by order_date DESC, include that in the index to eliminate sorts.
•Add covering columns last — Columns only in SELECT (not WHERE/ORDER BY) go at the end for covering index benefit.

Before Optimization:

-- 4 separate indexes
CREATE INDEX idx_cust ON orders (customer_id);
CREATE INDEX idx_status ON orders (status);
CREATE INDEX idx_date ON orders (order_date);
CREATE INDEX idx_cust_status ON orders (customer_id, status);

Problems:

4 indexes to maintain
Redundancy (idx_cust is subset of idx_cust_status)
No covering for common SELECT columns
No sort optimization

After Optimization:

-- 2 strategic indexes
CREATE INDEX idx_orders_main ON orders 
  (customer_id, status, order_date DESC)
  INCLUDE (total);

CREATE INDEX idx_orders_status ON orders 
  (status, order_date DESC)
  WHERE status NOT IN ('cancelled', 'archived');

Benefits:

2 indexes instead of 4 (50% less maintenance)
Covers all 4 query patterns plus more
Partial index for common status queries
Sort elimination for common ORDER BY

The Trade-off Triangle

Every index decision involves balancing: (1) Number of queries served, (2) Index maintenance overhead, and (3) Storage/memory footprint. You cannot optimize all three simultaneously. Your workload characteristics determine the optimal balance.

Detecting Query Patterns in Production

Before optimizing, you must know your actual query patterns. Production databases offer several ways to discover what queries are running and which need optimization.

Query Pattern Discovery
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
-- Enable pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
 
-- Top queries by total execution time
SELECT
    substring(query, 1, 100) AS query_preview,
    calls,
    total_exec_time::NUMERIC(10,2) AS total_ms,
    mean_exec_time::NUMERIC(10,2) AS avg_ms,
    rows AS total_rows_returned,
    ROUND(100.0 * shared_blks_hit /
        NULLIF(shared_blks_hit + shared_blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat%'  -- Exclude monitoring queries
ORDER BY total_exec_time DESC
LIMIT 20;
 
-- Queries with poor cache hit ratio (may need better indexes)
SELECT
    substring(query, 1, 100) AS query_preview,
    calls,
    mean_exec_time::NUMERIC(10,2) AS avg_ms,
    shared_blks_read AS disk_blocks,
    shared_blks_hit AS cache_blocks,
    ROUND(100.0 * shared_blks_hit /
        NULLIF(shared_blks_hit + shared_blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_statements
WHERE calls > 100
  AND shared_blks_hit + shared_blks_read > 0
ORDER BY cache_hit_pct ASC
LIMIT 20;
 
-- Index usage statistics
SELECT
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    idx_scan AS index_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

Collect Over Time

Query patterns vary by hour, day, and season. Don't design indexes based on a single snapshot. Collect query statistics over at least a full business cycle (typically a week) before making indexing decisions. Include peak hours, batch windows, and month-end processing.

Systematic Index Design Methodology

Bringing together everything we've learned, here is a systematic methodology for designing indexes based on query patterns.

Index Design Workflow

•Collect Query Inventory — Gather all queries hitting each table. Include frequency, latency requirements, and current performance.
•Rank Queries by Priority — Score = Frequency × Performance Gap × Business Impact. Focus on the top 10-20 queries.
•Analyze Each Priority Query — Identify WHERE predicates, JOIN columns, ORDER BY, and GROUP BY clauses. Note which columns are in SELECT.
•Group Queries by Access Pattern — Queries with similar predicate columns can share indexes. Look for natural groupings.
•Design Candidate Indexes — For each group, design a composite index following column ordering rules. Include covering columns where beneficial.
•Check for Redundancy — Ensure no index is a prefix of another. Eliminate indexes that serve only low-priority queries.
•Estimate Storage and Maintenance — Calculate expected size and write amplification. Ensure total is acceptable.
•Test in Staging — Create indexes in a staging environment with production-like data. Validate query plans and performance.
•Deploy and Monitor — Create indexes in production during low-activity windows. Monitor for unexpected impacts.
•Review Periodically — Query patterns change. Schedule quarterly reviews of index usage and effectiveness.

Page Complete

You now understand how to analyze query patterns for indexing decisions. You can identify indexable clauses, recognize anti-patterns, read execution plans, and design indexes that serve multiple queries efficiently. Next, we'll expand from individual queries to complete workload analysis.