Loading content...
Effective index design begins with understanding your queries. An index is only valuable if it serves actual query patterns—otherwise, it's pure overhead. Yet many database administrators create indexes based on intuition or table structure rather than careful query analysis.
The Query-First Principle:
Professional index design follows a fundamental principle: indexes exist to serve queries, not tables. The structure of a table tells you what could be indexed. The actual query workload tells you what should be indexed.
This page teaches you to read queries like an optimizer, identify indexable patterns, and design indexes that precisely match your workload's needs.
By the end of this page, you will understand how to classify queries by their indexing requirements, identify which parts of a query benefit from indexes, recognize patterns that prevent index usage, and design indexes that serve multiple query patterns efficiently.
Before designing indexes, you must understand the types of queries in your workload. Different query patterns have different indexing requirements.
The Four Query Archetypes:
All queries can be classified into archetypes based on how they access data:
| Archetype | Description | Example | Index Strategy |
|---|---|---|---|
| Point Lookup | Retrieve single row by exact key | WHERE id = 123 | Highly selective index essential. B+-tree optimal. |
| Range Scan | Retrieve rows within value range | WHERE date BETWEEN '2024-01-01' AND '2024-01-31' | Ordered index critical. Clustering beneficial. |
| Multi-Column Equality | Multiple AND conditions | WHERE status = 'active' AND region = 'US' | Composite index, most selective column first. |
| Mixed Equality and Range | Equality + range on different columns | WHERE customer_id = 5 AND date > '2024-01-01' | Composite: equality columns first, then range. |
| Pattern Search | Prefix or like patterns | WHERE name LIKE 'Joh%' | B+-tree for prefix patterns. Full-text for others. |
| Aggregation | COUNT, SUM, AVG over subsets | SELECT COUNT(*) WHERE region = 'EU' | Covering index eliminates table access. |
| Join Queries | Multi-table joins | FROM orders o JOIN customers c ON o.customer_id = c.id | Index on join columns for both tables. |
| Sorting/Top-N | ORDER BY with LIMIT | ORDER BY created_at DESC LIMIT 10 | Index on sort column(s) eliminates sort step. |
Priority Assessment:
Not all queries deserve equal indexing effort. Classify queries by:
A user-facing query running 10,000 times per hour with 2-second latency is your highest priority. A nightly batch job with 30-second runtime might be acceptable as-is.
In most systems, 20% of queries generate 80% of database load. Focus indexing efforts on this critical minority. Perfect indexes for rare queries at the expense of common ones is counterproductive optimization.
A query consists of multiple clauses, each with different indexing implications. Understanding which parts can be accelerated by indexes is essential for effective index design.
The Indexable Components:
SELECT [columns] -- Affects: Covering index opportunity
FROM [table] -- The table to potentially index
WHERE [predicates] -- Primary target for index filtering
AND [more predicates] -- Additional filter opportunities
ORDER BY [columns] -- Sort elimination opportunity
GROUP BY [columns] -- Aggregation optimization
LIMIT [n] -- Early termination with sorted index
12345678910111213141516171819202122232425262728293031323334353637
-- Example query to analyzeSELECT o.order_id, -- Column needed: order_id o.total, -- Column needed: total c.company_name -- From joined tableFROM orders oJOIN customers c ON o.customer_id = c.id -- Join: customer_id → idWHERE o.status = 'shipped' -- Equality predicate AND o.order_date >= '2024-01-01' -- Range predicate AND o.region = 'EMEA' -- Equality predicateORDER BY o.order_date DESC -- Sort requirementLIMIT 50; -- Top-N pattern /*ANALYSIS: 1. WHERE predicates: - status = 'shipped' (equality) - order_date >= '2024-01-01' (range) - region = 'EMEA' (equality) 2. Best composite index for orders: Index on (status, region, order_date DESC, order_id, total) Why this order? - status first: equality predicate, likely moderate selectivity - region second: equality predicate, combined with status for narrow search - order_date third: range predicate (must come after equalities) - order_date DESC: matches ORDER BY to eliminate sort - order_id, total: covering columns to avoid table access 3. Join optimization: - customers.id should be indexed (typically primary key) 4. Result: Index-only scan with no sort, limited to 50 rows*/In a composite index, column order is critical. The index (A, B, C) can efficiently serve WHERE A = ?, WHERE A = ? AND B = ?, but NOT efficiently serve WHERE B = ? alone. Place columns in order of: equality predicates, then range predicates, then sort columns, then covering columns.
Certain query patterns prevent the optimizer from using indexes, even when seemingly appropriate indexes exist. Recognizing these patterns is essential for both query writing and index design.
Problem: Applying a function to an indexed column prevents index usage.
-- INDEX on email column exists, but CANNOT be used:
SELECT * FROM users WHERE UPPER(email) = 'JOHN@EXAMPLE.COM';
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
SELECT * FROM products WHERE price * 1.1 > 100;
Why: The index stores values of email, not UPPER(email). To use the index, the database would need to compute UPPER() for every row to compare—defeating the purpose.
Solutions:
-- Instead of YEAR(order_date) = 2024
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'
-- PostgreSQL
CREATE INDEX idx_users_email_upper ON users (UPPER(email));
-- MySQL 8.0+
CREATE INDEX idx_users_email_upper ON users ((UPPER(email)));
-- SQL Server (computed column + index)
ALTER TABLE users ADD email_upper AS UPPER(email);
CREATE INDEX idx_users_email_upper ON users (email_upper);
-- Add column for indexed lookups
ALTER TABLE users ADD email_normalized VARCHAR(255);
CREATE INDEX idx_users_email_norm ON users (email_normalized);
Query execution plans are your window into how the database actually processes queries. They reveal whether indexes are used, how data is accessed, and where bottlenecks occur. Mastering plan reading is essential for index design.
Generating Execution Plans:
12345678910
-- Basic explainEXPLAIN SELECT * FROM orders WHERE customer_id = 123; -- With execution statistics (actually runs query)EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT * FROM orders WHERE customer_id = 123; -- JSON format for programmatic analysisEXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)SELECT * FROM orders WHERE customer_id = 123;Key Plan Elements to Examine:
| Operator | Description | Performance Implication |
|---|---|---|
| Seq Scan / Table Scan | Reads every row in table | Potentially slow for large tables. May indicate missing index. |
| Index Scan | Uses index but also reads table | Good for selective queries. Check# of rows returned. |
| Index Only Scan | Answers query entirely from index | Excellent. No table access needed (covering index). |
| Bitmap Index Scan | Collects row locations, then fetches | Good for moderate selectivity or OR conditions. |
| Sort | Explicit sorting step | CPU-intensive. Index on ORDER BY columns can eliminate. |
| Hash Join | Builds hash table from one input | Efficient for many workloads. Memory-dependent. |
| Nested Loop | For each outer row, scan inner | Best with small outer + indexed inner. |
| Merge Join | Merges two sorted inputs | Efficient when inputs already sorted. |
| Filter | Applies WHERE after fetching rows | Post-fetch filtering. May indicate unusable predicate. |
1234567891011121314151617181920212223242526272829303132333435
-- Sample queryEXPLAIN (ANALYZE, BUFFERS)SELECT o.order_id, c.company_nameFROM orders oJOIN customers c ON o.customer_id = c.idWHERE o.status = 'shipped' AND o.order_date > '2024-01-01'ORDER BY o.order_date DESCLIMIT 100; /*RESULT: Limit (cost=0.87..1250.43 rows=100 width=48) (actual time=0.089..2.156 rows=100) Buffers: shared hit=412 -> Nested Loop (cost=0.87..45621.34 rows=3650 width=48) (actual time=0.088..2.134) Buffers: shared hit=412 -> Index Scan Backward using idx_orders_status_date on orders o (cost=0.43..12345.67 rows=3650 width=20) (actual time=0.052..0.845) Index Cond: ((status = 'shipped') AND (order_date > '2024-01-01')) Buffers: shared hit=156 -> Index Scan using customers_pkey on customers c (cost=0.43..8.89 rows=1 width=32) (actual time=0.008..0.008) Index Cond: (id = o.customer_id) Buffers: shared hit=256 ANALYSIS:✓ Index Scan Backward: Using index for both filter AND sort (no Sort operator)✓ Limit applied early: Only fetching 100 rows, not all 3650✓ Index Cond shows predicates: Both status and order_date using index✓ Nested Loop with index: Efficient join strategy✓ shared hit=412: All pages from cache (no disk I/O needed)✓ actual time=2.156ms: Excellent performance NO IMPROVEMENTS NEEDED - Query is well-optimized*/Compare 'estimated rows' to 'actual rows' in the plan. Large discrepancies (10× or more) indicate stale or missing statistics. The optimizer chose its plan based on wrong assumptions. Run ANALYZE/UPDATE STATISTICS to refresh statistics.
In production systems, you rarely design indexes for a single query. The challenge is creating indexes that serve multiple query patterns efficiently while minimizing total index count.
The Composite Index Advantage:
A well-designed composite index can serve multiple queries:
-- Composite index: (customer_id, status, order_date)
-- Query 1: All orders for a customer
WHERE customer_id = 123 -- Uses index ✓
-- Query 2: Pending orders for a customer
WHERE customer_id = 123 AND status = 'pending' -- Uses index ✓
-- Query 3: Recent pending orders for a customer
WHERE customer_id = 123 AND status = 'pending'
AND order_date > '2024-01-01' -- Uses index ✓
-- Query 4: All pending orders (no customer filter)
WHERE status = 'pending' -- CANNOT use index efficiently ✗
The composite index supports any query that uses a prefix of its columns.
Before Optimization:
-- 4 separate indexes
CREATE INDEX idx_cust ON orders (customer_id);
CREATE INDEX idx_status ON orders (status);
CREATE INDEX idx_date ON orders (order_date);
CREATE INDEX idx_cust_status ON orders (customer_id, status);
Problems:
After Optimization:
-- 2 strategic indexes
CREATE INDEX idx_orders_main ON orders
(customer_id, status, order_date DESC)
INCLUDE (total);
CREATE INDEX idx_orders_status ON orders
(status, order_date DESC)
WHERE status NOT IN ('cancelled', 'archived');
Benefits:
Every index decision involves balancing: (1) Number of queries served, (2) Index maintenance overhead, and (3) Storage/memory footprint. You cannot optimize all three simultaneously. Your workload characteristics determine the optimal balance.
Before optimizing, you must know your actual query patterns. Production databases offer several ways to discover what queries are running and which need optimization.
1234567891011121314151617181920212223242526272829303132333435363738394041
-- Enable pg_stat_statements extensionCREATE EXTENSION IF NOT EXISTS pg_stat_statements; -- Top queries by total execution timeSELECT substring(query, 1, 100) AS query_preview, calls, total_exec_time::NUMERIC(10,2) AS total_ms, mean_exec_time::NUMERIC(10,2) AS avg_ms, rows AS total_rows_returned, ROUND(100.0 * shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0), 2) AS cache_hit_pctFROM pg_stat_statementsWHERE query NOT LIKE '%pg_stat%' -- Exclude monitoring queriesORDER BY total_exec_time DESCLIMIT 20; -- Queries with poor cache hit ratio (may need better indexes)SELECT substring(query, 1, 100) AS query_preview, calls, mean_exec_time::NUMERIC(10,2) AS avg_ms, shared_blks_read AS disk_blocks, shared_blks_hit AS cache_blocks, ROUND(100.0 * shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0), 2) AS cache_hit_pctFROM pg_stat_statementsWHERE calls > 100 AND shared_blks_hit + shared_blks_read > 0ORDER BY cache_hit_pct ASCLIMIT 20; -- Index usage statisticsSELECT schemaname || '.' || relname AS table, indexrelname AS index, idx_scan AS index_scans, idx_tup_read AS tuples_read, idx_tup_fetch AS tuples_fetchedFROM pg_stat_user_indexesORDER BY idx_scan DESC;Query patterns vary by hour, day, and season. Don't design indexes based on a single snapshot. Collect query statistics over at least a full business cycle (typically a week) before making indexing decisions. Include peak hours, batch windows, and month-end processing.
Bringing together everything we've learned, here is a systematic methodology for designing indexes based on query patterns.
You now understand how to analyze query patterns for indexing decisions. You can identify indexable clauses, recognize anti-patterns, read execution plans, and design indexes that serve multiple queries efficiently. Next, we'll expand from individual queries to complete workload analysis.