Performance Tuning - Learning Module

Loading content...

0/241

Index Tuning

The Index Imperative

If query tuning is about writing efficient SQL, index tuning is about building the data highways that make efficient SQL possible. A perfectly written query against an unindexed table is like a Formula 1 car on a dirt road—all that engineering potential, wasted.

Indexes are the most impactful performance tool in the DBA's arsenal. A single well-designed index can reduce query execution time from minutes to milliseconds. But indexes are not free: they consume storage, slow down writes, and require ongoing maintenance. The art of index tuning is finding the optimal balance—enough indexes to accelerate reads without crippling writes.

This page teaches you to think systematically about index design, moving beyond intuition to evidence-based index strategies.

What You Will Learn

By the end of this page, you will understand index internals deeply enough to design optimal indexes for any query pattern. You'll learn to identify missing indexes, evaluate existing indexes for effectiveness, and avoid over-indexing. These skills are essential for building databases that perform well at any scale.

Index Fundamentals Review

Before diving into tuning strategies, let's establish a solid understanding of how indexes work internally. This knowledge is essential for making informed design decisions.

The B-Tree Index Structure:

Most database indexes are B-tree (or B+-tree) structures. They maintain sorted data in a tree format that allows:

O(log n) search — Find any key in logarithmic time
O(log n) range scans — Efficiently traverse ordered sequences
Ordered retrieval — Return data in sorted order without additional sorting

The key insight is that B-tree indexes store sorted pointers to data, not the data itself (except in clustered indexes or index-only scans).

Converting Mermaid diagram...

Index Types and Their Characteristics
Index Type	Best For	Limitations
B-Tree	Equality, range queries, ORDER BY, PREFIX matching	Poor for low-cardinality; full scans still O(n)
Hash	Exact equality lookups only	No range queries; no ordering; size unpredictable
Bitmap	Low-cardinality columns in data warehouses	Poor for OLTP with frequent updates; lock contention
GiST	Geometric data, full-text search, custom types	More complex to tune; varies by use case
GIN	Arrays, JSONB, full-text search with many keys	Slower updates; larger size; excellent for containment
Covering/Included	Avoiding table lookups for SELECT columns	Increased index size; write overhead
Partial/Filtered	Indexing subset of rows meeting a condition	Only helps queries with matching predicates

Clustered vs Non-Clustered Indexes:

Clustered Index:

The table data itself is stored in index order
Only one per table (because data can only be sorted one way)
Range scans are extremely efficient (data is contiguous)
Inserts into middle of range cause page splits

Non-Clustered Index:

Separate structure pointing to data rows
Multiple allowed per table
Requires "bookmark lookup" to fetch table data
Index-only scans possible if all columns present

PostgreSQL Note: PostgreSQL doesn't have traditional clustered indexes. The CLUSTER command reorders table data once, but it's not maintained automatically. Heap tables remain the norm.

The Selectivity Factor

Index effectiveness depends on selectivity—the fraction of rows an index lookup returns. High selectivity (few rows returned) means index is valuable. Low selectivity (many rows) may make full table scan faster. Rule of thumb: indexes help when retrieving less than 5-15% of table rows.

Designing Effective Indexes

Index design is not guesswork—it follows predictable principles based on how queries access data. The key is understanding which columns appear in predicates, joins, and ordering clauses, and in what combinations.

index_design_principles.sql

-- PRINCIPLE 1: Index columns used in WHERE clauses
 
-- Query pattern
SELECT * FROM orders 
WHERE customer_id = 12345 
  AND status = 'pending'
  AND order_date > '2024-01-01';
 
-- Single-column indexes (least effective)
CREATE INDEX idx_customer ON orders(customer_id);
CREATE INDEX idx_status ON orders(status);
CREATE INDEX idx_date ON orders(order_date);
-- Database can use ONE index, then filter remaining conditions
 
-- Composite index (most effective)
CREATE INDEX idx_orders_lookup ON orders(customer_id, status, order_date);
-- Single index satisfies all WHERE predicates
 
 
-- PRINCIPLE 2: Column order in composite indexes MATTERS
 
-- Most selective FIRST for equality predicates
CREATE INDEX idx_good ON orders(customer_id, status, order_date);
-- customer_id is most selective (few orders per customer)
 
-- Range predicates should be LAST  
CREATE INDEX idx_range ON orders(customer_id, status, order_date);
-- order_date is a range (>), so columns after it can't use index
 
 
-- PRINCIPLE 3: Index columns used in ORDER BY
 
-- Query requiring sorted results
SELECT order_id, order_date, total
FROM orders
WHERE customer_id = 12345
ORDER BY order_date DESC;
 
-- Index supporting both WHERE and ORDER BY
CREATE INDEX idx_cust_date ON orders(customer_id, order_date DESC);
-- Avoids separate sort operation
 
 
-- PRINCIPLE 4: Index columns used in JOIN conditions
 
SELECT o.order_id, c.customer_name, oi.product_id
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date > '2024-01-01';
 
-- Indexes for join columns
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_items_order ON order_items(order_id);
 
-- Covering index for frequently joined columns
CREATE INDEX idx_orders_join ON orders(customer_id) 
    INCLUDE (order_id, order_date);

The Leading Column Rule

A composite index (A, B, C) can efficiently support queries on A alone, or A+B, or A+B+C. It cannot efficiently support queries on B alone or C alone. Always design composite indexes with the most commonly filtered column as the leading column.

Identifying Missing Indexes

Databases provide multiple mechanisms to identify queries that could benefit from additional indexes. Combining automated recommendations with manual analysis yields the best results.

missing_index_detection.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
-- POSTGRESQL: Missing Index Detection
 
-- Install pg_stat_statements extension (if not already)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
 
-- Find slowest queries (candidates for indexing)
SELECT 
    queryid,
    calls,
    round(total_exec_time::numeric, 2) as total_time_ms,
    round(mean_exec_time::numeric, 2) as avg_time_ms,
    round((shared_blks_read + shared_blks_hit)::numeric / calls, 2) 
        as avg_blocks,
    query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
 
-- Find tables with sequential scans (may need indexes)
SELECT 
    schemaname,
    relname as table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    CASE WHEN seq_scan > 0 
         THEN round(seq_tup_read::numeric / seq_scan, 0) 
         ELSE 0 END as avg_rows_per_seq_scan
FROM pg_stat_user_tables
WHERE seq_scan > 100  -- Tables with many sequential scans
ORDER BY seq_tup_read DESC
LIMIT 20;
 
-- Check index usage on existing indexes
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan as times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC  -- Least used indexes first
LIMIT 20;
 
-- Using EXPLAIN to identify missing indexes
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders 
WHERE customer_id = 12345 AND order_date > '2024-01-01';
-- Look for: Seq Scan, high rows removed by filter, 
--           or high buffer reads relative to rows returned
 
-- Hypothetical indexes with HypoPG extension
CREATE EXTENSION hypopg;
 
-- Create hypothetical index (not actually built)
SELECT hypopg_create_index('CREATE INDEX ON orders(customer_id, order_date)');
 
-- Test query plan with hypothetical index
EXPLAIN SELECT * FROM orders 
WHERE customer_id = 12345 AND order_date > '2024-01-01';
 
-- Drop hypothetical indexes
SELECT hypopg_reset();

Don't Blindly Follow Recommendations

Automated missing index suggestions are starting points, not final answers. They may recommend redundant indexes overlapping with existing ones, suggest indexes for rarely-run ad-hoc queries, or miss composite index opportunities. Always validate recommendations against actual workload patterns and existing indexes.

Index Consolidation and Redundancy

Over time, databases accumulate indexes added by different developers solving immediate problems. Many become redundant (covered by other indexes) or unused (no longer matching query patterns). Index consolidation reclaims storage and improves write performance.

Index Redundancy Patterns

•
Duplicate Indexes — Identical column lists (often created by mistake or different developers)
- idx_a (customer_id, order_date) and idx_b (customer_id, order_date) → Remove one
•
Prefix Redundancy — Shorter index is prefix of longer index
- idx_x (customer_id) and idx_y (customer_id, order_date) → idx_y covers idx_x; remove idx_x
•
Unused Indexes — No queries use the index according to statistics
- Review and drop after production verification
•
Low-Usage Indexes — Rarely used but frequently updated
- Calculate read/write ratio; consider dropping if writes >> reads
•
Overlapping Indexes — Multiple indexes partially covering same queries
- Consolidate into fewer, well-designed composite indexes

index_analysis.sql

-- POSTGRESQL: Find redundant indexes (prefix covered)
WITH index_cols AS (
    SELECT
        i.indrelid::regclass as table_name,
        i.indexrelid::regclass as index_name,
        array_agg(a.attname ORDER BY array_position(i.indkey, a.attnum)) 
            as columns,
        pg_relation_size(i.indexrelid) as index_size
    FROM pg_index i
    JOIN pg_attribute a ON a.attrelid = i.indrelid 
        AND a.attnum = ANY(i.indkey)
    WHERE i.indrelid::regclass::text NOT LIKE 'pg_%'
    GROUP BY i.indrelid, i.indexrelid
)
SELECT 
    ic1.table_name,
    ic1.index_name as redundant_index,
    ic1.columns as redundant_columns,
    pg_size_pretty(ic1.index_size) as redundant_size,
    ic2.index_name as covering_index,
    ic2.columns as covering_columns
FROM index_cols ic1
JOIN index_cols ic2 ON ic1.table_name = ic2.table_name
    AND ic1.index_name != ic2.index_name
    AND ic1.columns <@ ic2.columns  -- ic1 is subset of ic2
ORDER BY ic1.index_size DESC;
 
-- SQL SERVER: Find duplicate and overlapping indexes
WITH IndexColumns AS (
    SELECT 
        i.object_id,
        i.index_id,
        i.name as index_name,
        (SELECT STUFF((
            SELECT ',' + c.name 
            FROM sys.index_columns ic
            JOIN sys.columns c ON c.object_id = ic.object_id 
                AND c.column_id = ic.column_id
            WHERE ic.object_id = i.object_id 
                AND ic.index_id = i.index_id
                AND ic.is_included_column = 0
            ORDER BY ic.key_ordinal
            FOR XML PATH('')), 1, 1, '')) as key_columns,
        (SELECT SUM(ps.used_page_count) * 8 / 1024.0
         FROM sys.dm_db_partition_stats ps
         WHERE ps.object_id = i.object_id 
           AND ps.index_id = i.index_id) as size_mb
    FROM sys.indexes i
    WHERE i.type IN (1, 2)  -- Clustered and nonclustered
)
SELECT 
    OBJECT_NAME(ic1.object_id) as table_name,
    ic1.index_name as index1,
    ic1.key_columns as columns1,
    ic1.size_mb as size1_mb,
    ic2.index_name as index2,
    ic2.key_columns as columns2
FROM IndexColumns ic1
JOIN IndexColumns ic2 ON ic1.object_id = ic2.object_id
    AND ic1.index_id < ic2.index_id
    AND (ic1.key_columns = ic2.key_columns 
         OR ic2.key_columns LIKE ic1.key_columns + ',%')
ORDER BY ic1.size_mb DESC;

Safe Index Removal Process

Before dropping an index: (1) Mark it invisible/disabled if your database supports it (SQL Server, Oracle, MySQL 8.0+), (2) Monitor for 1-2 weeks covering all workload patterns, (3) Check for queries that may only run monthly/quarterly, (4) Drop only when confident no critical queries depend on it. Keep DROP statements ready for quick undo.

Index Maintenance

Indexes degrade over time. Fragmentation from inserts, updates, and deletes reduces efficiency. Bloat from dead tuples wastes storage. Regular maintenance keeps indexes performing optimally.

Index Maintenance Operations
Operation	Purpose	When to Use
REINDEX / REBUILD	Completely recreates index structure	Heavy fragmentation (>30%); after bulk deletes; corrupted index
REORGANIZE	Defragments leaf level without full rebuild	Moderate fragmentation (10-30%); less locking than rebuild
UPDATE STATISTICS	Refreshes cardinality estimations	After significant data changes; when plans become suboptimal
VACUUM (PostgreSQL)	Reclaims dead tuple space	Regular schedule; after bulk updates/deletes
OPTIMIZE TABLE (MySQL)	Defragments and reclaims space	After many updates/deletes; InnoDB rebuilds table + indexes

index_maintenance.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
-- POSTGRESQL Index Maintenance
 
-- Check index bloat
CREATE EXTENSION IF NOT EXISTS pgstattuple;
 
SELECT 
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
    ROUND(100 * pg_stat_user_indexes.idx_scan / 
        NULLIF(pg_stat_user_tables.seq_scan + 
               pg_stat_user_tables.idx_scan, 0), 2) as index_usage_pct,
    stat.leaf_fragmentation
FROM pg_stat_user_indexes
JOIN pg_stat_user_tables USING (relid)
CROSS JOIN LATERAL pgstatindex(indexrelid) AS stat
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;
 
-- Check table bloat (affects clustered data)
SELECT
    schemaname,
    tablename,
    n_dead_tup as dead_tuples,
    n_live_tup as live_tuples,
    ROUND(100 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) 
        as dead_pct,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC;
 
-- Manual VACUUM for specific tables
VACUUM (VERBOSE) orders;
 
-- VACUUM FULL to reclaim space (requires exclusive lock)
VACUUM FULL orders;
 
-- REINDEX to rebuild specific index (online in PG 12+)
REINDEX INDEX CONCURRENTLY idx_orders_date;
 
-- REINDEX entire table  
REINDEX TABLE CONCURRENTLY orders;
 
-- Monitor autovacuum progress
SELECT 
    relid::regclass as table_name,
    phase,
    heap_blks_total,
    heap_blks_scanned,
    ROUND(100 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 2) 
        as pct_complete
FROM pg_stat_progress_vacuum;
 
-- Tune autovacuum for heavy write tables
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- Vacuum at 5% dead tuples
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_vacuum_cost_limit = 1000
);

Maintenance Window Considerations

Schedule intensive maintenance (REBUILD, VACUUM FULL, OPTIMIZE TABLE) during low-traffic periods. Use online rebuild options when available (SQL Server ONLINE, PostgreSQL CONCURRENTLY). For 24/7 systems, consider third-party tools that minimize locking (pt-online-schema-change for MySQL, pg_repack for PostgreSQL).

Advanced Index Strategies

Beyond basic B-tree indexes, advanced strategies can dramatically improve performance for specific query patterns. These require deeper understanding of workload characteristics.

Partial indexes include only rows meeting a condition, reducing index size and maintenance overhead while accelerating specific queries.

partial_indexes.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- PostgreSQL: Partial indexes
-- Only index active orders (90% of queries filter on active)
CREATE INDEX idx_active_orders ON orders(customer_id, order_date)
WHERE status = 'active';
 
-- Query that benefits:
SELECT * FROM orders 
WHERE status = 'active' AND customer_id = 12345;  -- Uses partial index
 
-- Query that cannot use it:
SELECT * FROM orders 
WHERE status = 'completed' AND customer_id = 12345;  -- Full scan
 
-- Index only non-null values (common pattern)
CREATE INDEX idx_orders_shipped ON orders(shipped_date)
WHERE shipped_date IS NOT NULL;
 
-- SQL Server: Filtered indexes
CREATE INDEX idx_pending_orders ON orders(customer_id, order_date)
WHERE status = 'pending';
 
-- MySQL: No native partial indexes, but workarounds exist
-- Use generated column with condition
ALTER TABLE orders 
ADD COLUMN is_active_customer TINYINT 
    GENERATED ALWAYS AS (CASE WHEN status = 'active' THEN customer_id END);
 
CREATE INDEX idx_active ON orders(is_active_customer, order_date);

Partial Index Sweet Spot

Partial indexes shine when a small fraction of rows (5-30%) are queried frequently. An index on 'status = pending' for a table where 95% of orders are completed saves 95% storage and maintenance cost.

Summary: Index Tuning Mastery

Index tuning transforms database performance more dramatically than any other single factor. Let's consolidate the key principles:

Key Takeaways

•Understand index internals — B-tree structure, selectivity, and covering indexes are foundational knowledge for effective tuning.
•Design indexes for queries — Analyze WHERE clauses, JOIN conditions, and ORDER BY to create purposeful composite indexes.
•Follow the leading column rule — Composite index (A, B, C) only efficiently supports queries starting with A.
•Use database tools — Missing index suggestions, usage statistics, and execution plans guide evidence-based decisions.
•Consolidate redundant indexes — Regularly audit for duplicate, unused, and overlapping indexes that waste resources.
•Maintain index health — Schedule rebuilds/reorganizes based on fragmentation levels and workload patterns.

What's Next:

Query and index tuning address how individual statements perform. But overall database performance also depends on configuration tuning—memory allocation, parallelism settings, and other server-level parameters. The next page explores configuration tuning strategies.

Page Complete

You now understand index tuning from first principles through advanced strategies. These skills enable you to design indexes that accelerate reads without overburdening writes, a balance that separates performant databases from struggling ones. Next, we'll examine server-level configuration tuning.