Loading content...
If query tuning is about writing efficient SQL, index tuning is about building the data highways that make efficient SQL possible. A perfectly written query against an unindexed table is like a Formula 1 car on a dirt road—all that engineering potential, wasted.
Indexes are the most impactful performance tool in the DBA's arsenal. A single well-designed index can reduce query execution time from minutes to milliseconds. But indexes are not free: they consume storage, slow down writes, and require ongoing maintenance. The art of index tuning is finding the optimal balance—enough indexes to accelerate reads without crippling writes.
This page teaches you to think systematically about index design, moving beyond intuition to evidence-based index strategies.
By the end of this page, you will understand index internals deeply enough to design optimal indexes for any query pattern. You'll learn to identify missing indexes, evaluate existing indexes for effectiveness, and avoid over-indexing. These skills are essential for building databases that perform well at any scale.
Before diving into tuning strategies, let's establish a solid understanding of how indexes work internally. This knowledge is essential for making informed design decisions.
The B-Tree Index Structure:
Most database indexes are B-tree (or B+-tree) structures. They maintain sorted data in a tree format that allows:
The key insight is that B-tree indexes store sorted pointers to data, not the data itself (except in clustered indexes or index-only scans).
| Index Type | Best For | Limitations |
|---|---|---|
| B-Tree | Equality, range queries, ORDER BY, PREFIX matching | Poor for low-cardinality; full scans still O(n) |
| Hash | Exact equality lookups only | No range queries; no ordering; size unpredictable |
| Bitmap | Low-cardinality columns in data warehouses | Poor for OLTP with frequent updates; lock contention |
| GiST | Geometric data, full-text search, custom types | More complex to tune; varies by use case |
| GIN | Arrays, JSONB, full-text search with many keys | Slower updates; larger size; excellent for containment |
| Covering/Included | Avoiding table lookups for SELECT columns | Increased index size; write overhead |
| Partial/Filtered | Indexing subset of rows meeting a condition | Only helps queries with matching predicates |
Clustered vs Non-Clustered Indexes:
Clustered Index:
Non-Clustered Index:
PostgreSQL Note: PostgreSQL doesn't have traditional clustered indexes. The CLUSTER command reorders table data once, but it's not maintained automatically. Heap tables remain the norm.
Index effectiveness depends on selectivity—the fraction of rows an index lookup returns. High selectivity (few rows returned) means index is valuable. Low selectivity (many rows) may make full table scan faster. Rule of thumb: indexes help when retrieving less than 5-15% of table rows.
Index design is not guesswork—it follows predictable principles based on how queries access data. The key is understanding which columns appear in predicates, joins, and ordering clauses, and in what combinations.
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758
-- PRINCIPLE 1: Index columns used in WHERE clauses -- Query patternSELECT * FROM orders WHERE customer_id = 12345 AND status = 'pending' AND order_date > '2024-01-01'; -- Single-column indexes (least effective)CREATE INDEX idx_customer ON orders(customer_id);CREATE INDEX idx_status ON orders(status);CREATE INDEX idx_date ON orders(order_date);-- Database can use ONE index, then filter remaining conditions -- Composite index (most effective)CREATE INDEX idx_orders_lookup ON orders(customer_id, status, order_date);-- Single index satisfies all WHERE predicates -- PRINCIPLE 2: Column order in composite indexes MATTERS -- Most selective FIRST for equality predicatesCREATE INDEX idx_good ON orders(customer_id, status, order_date);-- customer_id is most selective (few orders per customer) -- Range predicates should be LAST CREATE INDEX idx_range ON orders(customer_id, status, order_date);-- order_date is a range (>), so columns after it can't use index -- PRINCIPLE 3: Index columns used in ORDER BY -- Query requiring sorted resultsSELECT order_id, order_date, totalFROM ordersWHERE customer_id = 12345ORDER BY order_date DESC; -- Index supporting both WHERE and ORDER BYCREATE INDEX idx_cust_date ON orders(customer_id, order_date DESC);-- Avoids separate sort operation -- PRINCIPLE 4: Index columns used in JOIN conditions SELECT o.order_id, c.customer_name, oi.product_idFROM orders oJOIN customers c ON o.customer_id = c.customer_idJOIN order_items oi ON o.order_id = oi.order_idWHERE o.order_date > '2024-01-01'; -- Indexes for join columnsCREATE INDEX idx_orders_customer ON orders(customer_id);CREATE INDEX idx_items_order ON order_items(order_id); -- Covering index for frequently joined columnsCREATE INDEX idx_orders_join ON orders(customer_id) INCLUDE (order_id, order_date);A composite index (A, B, C) can efficiently support queries on A alone, or A+B, or A+B+C. It cannot efficiently support queries on B alone or C alone. Always design composite indexes with the most commonly filtered column as the leading column.
Databases provide multiple mechanisms to identify queries that could benefit from additional indexes. Combining automated recommendations with manual analysis yields the best results.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263
-- POSTGRESQL: Missing Index Detection -- Install pg_stat_statements extension (if not already)CREATE EXTENSION IF NOT EXISTS pg_stat_statements; -- Find slowest queries (candidates for indexing)SELECT queryid, calls, round(total_exec_time::numeric, 2) as total_time_ms, round(mean_exec_time::numeric, 2) as avg_time_ms, round((shared_blks_read + shared_blks_hit)::numeric / calls, 2) as avg_blocks, queryFROM pg_stat_statementsORDER BY total_exec_time DESCLIMIT 20; -- Find tables with sequential scans (may need indexes)SELECT schemaname, relname as table_name, seq_scan, seq_tup_read, idx_scan, CASE WHEN seq_scan > 0 THEN round(seq_tup_read::numeric / seq_scan, 0) ELSE 0 END as avg_rows_per_seq_scanFROM pg_stat_user_tablesWHERE seq_scan > 100 -- Tables with many sequential scansORDER BY seq_tup_read DESCLIMIT 20; -- Check index usage on existing indexesSELECT schemaname, tablename, indexname, idx_scan as times_used, pg_size_pretty(pg_relation_size(indexrelid)) as index_sizeFROM pg_stat_user_indexesORDER BY idx_scan ASC -- Least used indexes firstLIMIT 20; -- Using EXPLAIN to identify missing indexesEXPLAIN (ANALYZE, BUFFERS)SELECT * FROM orders WHERE customer_id = 12345 AND order_date > '2024-01-01';-- Look for: Seq Scan, high rows removed by filter, -- or high buffer reads relative to rows returned -- Hypothetical indexes with HypoPG extensionCREATE EXTENSION hypopg; -- Create hypothetical index (not actually built)SELECT hypopg_create_index('CREATE INDEX ON orders(customer_id, order_date)'); -- Test query plan with hypothetical indexEXPLAIN SELECT * FROM orders WHERE customer_id = 12345 AND order_date > '2024-01-01'; -- Drop hypothetical indexesSELECT hypopg_reset();Automated missing index suggestions are starting points, not final answers. They may recommend redundant indexes overlapping with existing ones, suggest indexes for rarely-run ad-hoc queries, or miss composite index opportunities. Always validate recommendations against actual workload patterns and existing indexes.
Over time, databases accumulate indexes added by different developers solving immediate problems. Many become redundant (covered by other indexes) or unused (no longer matching query patterns). Index consolidation reclaims storage and improves write performance.
idx_a (customer_id, order_date) and idx_b (customer_id, order_date) → Remove oneidx_x (customer_id) and idx_y (customer_id, order_date) → idx_y covers idx_x; remove idx_x123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263
-- POSTGRESQL: Find redundant indexes (prefix covered)WITH index_cols AS ( SELECT i.indrelid::regclass as table_name, i.indexrelid::regclass as index_name, array_agg(a.attname ORDER BY array_position(i.indkey, a.attnum)) as columns, pg_relation_size(i.indexrelid) as index_size FROM pg_index i JOIN pg_attribute a ON a.attrelid = i.indrelid AND a.attnum = ANY(i.indkey) WHERE i.indrelid::regclass::text NOT LIKE 'pg_%' GROUP BY i.indrelid, i.indexrelid)SELECT ic1.table_name, ic1.index_name as redundant_index, ic1.columns as redundant_columns, pg_size_pretty(ic1.index_size) as redundant_size, ic2.index_name as covering_index, ic2.columns as covering_columnsFROM index_cols ic1JOIN index_cols ic2 ON ic1.table_name = ic2.table_name AND ic1.index_name != ic2.index_name AND ic1.columns <@ ic2.columns -- ic1 is subset of ic2ORDER BY ic1.index_size DESC; -- SQL SERVER: Find duplicate and overlapping indexesWITH IndexColumns AS ( SELECT i.object_id, i.index_id, i.name as index_name, (SELECT STUFF(( SELECT ',' + c.name FROM sys.index_columns ic JOIN sys.columns c ON c.object_id = ic.object_id AND c.column_id = ic.column_id WHERE ic.object_id = i.object_id AND ic.index_id = i.index_id AND ic.is_included_column = 0 ORDER BY ic.key_ordinal FOR XML PATH('')), 1, 1, '')) as key_columns, (SELECT SUM(ps.used_page_count) * 8 / 1024.0 FROM sys.dm_db_partition_stats ps WHERE ps.object_id = i.object_id AND ps.index_id = i.index_id) as size_mb FROM sys.indexes i WHERE i.type IN (1, 2) -- Clustered and nonclustered)SELECT OBJECT_NAME(ic1.object_id) as table_name, ic1.index_name as index1, ic1.key_columns as columns1, ic1.size_mb as size1_mb, ic2.index_name as index2, ic2.key_columns as columns2FROM IndexColumns ic1JOIN IndexColumns ic2 ON ic1.object_id = ic2.object_id AND ic1.index_id < ic2.index_id AND (ic1.key_columns = ic2.key_columns OR ic2.key_columns LIKE ic1.key_columns + ',%')ORDER BY ic1.size_mb DESC;Before dropping an index: (1) Mark it invisible/disabled if your database supports it (SQL Server, Oracle, MySQL 8.0+), (2) Monitor for 1-2 weeks covering all workload patterns, (3) Check for queries that may only run monthly/quarterly, (4) Drop only when confident no critical queries depend on it. Keep DROP statements ready for quick undo.
Indexes degrade over time. Fragmentation from inserts, updates, and deletes reduces efficiency. Bloat from dead tuples wastes storage. Regular maintenance keeps indexes performing optimally.
| Operation | Purpose | When to Use |
|---|---|---|
| REINDEX / REBUILD | Completely recreates index structure | Heavy fragmentation (>30%); after bulk deletes; corrupted index |
| REORGANIZE | Defragments leaf level without full rebuild | Moderate fragmentation (10-30%); less locking than rebuild |
| UPDATE STATISTICS | Refreshes cardinality estimations | After significant data changes; when plans become suboptimal |
| VACUUM (PostgreSQL) | Reclaims dead tuple space | Regular schedule; after bulk updates/deletes |
| OPTIMIZE TABLE (MySQL) | Defragments and reclaims space | After many updates/deletes; InnoDB rebuilds table + indexes |
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
-- POSTGRESQL Index Maintenance -- Check index bloatCREATE EXTENSION IF NOT EXISTS pgstattuple; SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) as index_size, ROUND(100 * pg_stat_user_indexes.idx_scan / NULLIF(pg_stat_user_tables.seq_scan + pg_stat_user_tables.idx_scan, 0), 2) as index_usage_pct, stat.leaf_fragmentationFROM pg_stat_user_indexesJOIN pg_stat_user_tables USING (relid)CROSS JOIN LATERAL pgstatindex(indexrelid) AS statORDER BY pg_relation_size(indexrelid) DESCLIMIT 20; -- Check table bloat (affects clustered data)SELECT schemaname, tablename, n_dead_tup as dead_tuples, n_live_tup as live_tuples, ROUND(100 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) as dead_pct, last_vacuum, last_autovacuumFROM pg_stat_user_tablesWHERE n_dead_tup > 10000ORDER BY dead_pct DESC; -- Manual VACUUM for specific tablesVACUUM (VERBOSE) orders; -- VACUUM FULL to reclaim space (requires exclusive lock)VACUUM FULL orders; -- REINDEX to rebuild specific index (online in PG 12+)REINDEX INDEX CONCURRENTLY idx_orders_date; -- REINDEX entire table REINDEX TABLE CONCURRENTLY orders; -- Monitor autovacuum progressSELECT relid::regclass as table_name, phase, heap_blks_total, heap_blks_scanned, ROUND(100 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 2) as pct_completeFROM pg_stat_progress_vacuum; -- Tune autovacuum for heavy write tablesALTER TABLE orders SET ( autovacuum_vacuum_scale_factor = 0.05, -- Vacuum at 5% dead tuples autovacuum_analyze_scale_factor = 0.02, autovacuum_vacuum_cost_limit = 1000);Schedule intensive maintenance (REBUILD, VACUUM FULL, OPTIMIZE TABLE) during low-traffic periods. Use online rebuild options when available (SQL Server ONLINE, PostgreSQL CONCURRENTLY). For 24/7 systems, consider third-party tools that minimize locking (pt-online-schema-change for MySQL, pg_repack for PostgreSQL).
Beyond basic B-tree indexes, advanced strategies can dramatically improve performance for specific query patterns. These require deeper understanding of workload characteristics.
Partial indexes include only rows meeting a condition, reducing index size and maintenance overhead while accelerating specific queries.
12345678910111213141516171819202122232425262728
-- PostgreSQL: Partial indexes-- Only index active orders (90% of queries filter on active)CREATE INDEX idx_active_orders ON orders(customer_id, order_date)WHERE status = 'active'; -- Query that benefits:SELECT * FROM orders WHERE status = 'active' AND customer_id = 12345; -- Uses partial index -- Query that cannot use it:SELECT * FROM orders WHERE status = 'completed' AND customer_id = 12345; -- Full scan -- Index only non-null values (common pattern)CREATE INDEX idx_orders_shipped ON orders(shipped_date)WHERE shipped_date IS NOT NULL; -- SQL Server: Filtered indexesCREATE INDEX idx_pending_orders ON orders(customer_id, order_date)WHERE status = 'pending'; -- MySQL: No native partial indexes, but workarounds exist-- Use generated column with conditionALTER TABLE orders ADD COLUMN is_active_customer TINYINT GENERATED ALWAYS AS (CASE WHEN status = 'active' THEN customer_id END); CREATE INDEX idx_active ON orders(is_active_customer, order_date);Partial indexes shine when a small fraction of rows (5-30%) are queried frequently. An index on 'status = pending' for a table where 95% of orders are completed saves 95% storage and maintenance cost.
Index tuning transforms database performance more dramatically than any other single factor. Let's consolidate the key principles:
What's Next:
Query and index tuning address how individual statements perform. But overall database performance also depends on configuration tuning—memory allocation, parallelism settings, and other server-level parameters. The next page explores configuration tuning strategies.
You now understand index tuning from first principles through advanced strategies. These skills enable you to design indexes that accelerate reads without overburdening writes, a balance that separates performant databases from struggling ones. Next, we'll examine server-level configuration tuning.