If there's one lever that has the most dramatic impact on SQL query performance, it's indexing. A well-designed index can transform a 10-minute query into a 10-millisecond query—a 60,000x improvement. Conversely, missing or poorly-designed indexes can bring production databases to their knees.
Yet indexing is widely misunderstood. Many developers either avoid indexes ("they slow down writes") or over-index ("more indexes = faster queries"). Both approaches are wrong. Effective indexing requires understanding how indexes work, when they help, and what makes them optimal for specific query patterns.
This page will give you that understanding—from the internals that explain index behavior to practical strategies for index design in production systems.
By the end of this page, you will understand: how B-tree indexes work internally; when indexes help vs. hurt; how to design composite indexes for complex queries; covering indexes and index-only scans; partial and functional indexes for advanced use cases; and a systematic approach to index design.
An index is an auxiliary data structure that accelerates data retrieval at the cost of additional storage and write overhead. Understanding index internals reveals why certain query patterns benefit from indexes while others don't.
The B-Tree: Foundation of Relational Indexes
The vast majority of database indexes use the B-tree (or B+tree) data structure. B-trees are balanced tree structures optimized for systems that read and write large blocks of data—exactly how databases access storage.
B-tree properties:

- Balanced: every leaf sits at the same depth, so every lookup costs the same number of page reads.
- High fanout: each node is a full disk page holding hundreds of keys, which keeps the tree shallow.
- Sorted: keys within each node are ordered, supporting equality lookups, range scans, and ordered traversal.

Implications:
A B-tree index on 10 million rows typically has only 3-4 levels. Finding any specific row requires reading 3-4 index pages, regardless of table size. For comparison, a binary search tree would require log₂(10M) ≈ 24 comparisons—and potentially 24 random I/O operations.
| Rows | Approximate Depth | Pages to Read | Time (SSD) |
|---|---|---|---|
| 10,000 | 2 | 2-3 | <0.1ms |
| 1,000,000 | 3 | 3-4 | 0.2ms |
| 100,000,000 | 4 | 4-5 | 0.3ms |
| 10,000,000,000 | 5 | 5-6 | 0.4ms |
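The depth figures in the table follow directly from the fanout. Here is a minimal sketch of that arithmetic, assuming roughly 200 keys per internal page (a conservative, illustrative figure—real fanout depends on key width and page size):

```python
import math

def btree_depth(rows: int, fanout: int = 200) -> int:
    """Approximate number of levels in a B-tree with the given fanout.

    Each level multiplies the number of addressable entries by the
    fanout, so depth is the ceiling of log base fanout of the row count.
    """
    return max(1, math.ceil(math.log(rows, fanout)))

for rows in (10_000, 1_000_000, 100_000_000, 10_000_000_000):
    print(f"{rows:>14,} rows -> depth {btree_depth(rows)}")
```

With a fanout of 200, the computed depths (2, 3, 4, 5) match the table above; a higher fanout makes the tree even shallower.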
The Leaf Level Contains Everything:
In a B+tree (the variant used by most databases), all actual data pointers are stored in leaf nodes. Internal nodes contain only keys for navigation. Leaf nodes are linked together, enabling efficient sequential scans through sorted data.
What the index stores:

Each leaf entry contains:

- The indexed key value(s).
- A pointer to the row: in PostgreSQL, a heap tuple ID (page number plus offset); in InnoDB secondary indexes, the primary key value.

This pointer is how the database locates the actual row after finding it in the index.
In a clustered index (SQL Server, InnoDB), the table data IS the leaf level—rows are physically stored in index order. Each table has at most one clustered index. Non-clustered indexes store pointers to the clustered index key or row location. In PostgreSQL, all indexes are non-clustered; the table (heap) is separate from indexes.
Indexes are not universally beneficial. Understanding when they help—and when they actually hurt performance—is crucial for effective index design.
The Selectivity Threshold:
The critical factor is selectivity—what percentage of rows match the filter. For typical tables:
| Selectivity | Rows Returned | Best Access Method |
|---|---|---|
| < 1% | Few rows | Index Scan |
| 1-5% | Some rows | Index Scan or Bitmap Scan |
| 5-15% | Many rows | Bitmap Scan |
| > 15% | Most rows | Sequential Scan |
These thresholds vary by table width, clustering, and hardware. On SSDs, the crossover point is higher because random I/O is less penalized.
Why sequential scans beat indexes for low selectivity:
An index scan involves:

1. Traversing the B-tree to find the first matching entry (3-4 page reads).
2. Scanning leaf entries for all matching keys.
3. Fetching each matching row from the table, typically one random page read per row.
For 40% of a million-row table, that's 400,000 random table page reads. A sequential scan reads every page once, in order—which disk and OS can prefetch efficiently. Sequential I/O at 500MB/s beats random I/O at 20MB/s.
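The arithmetic above can be sketched as a toy cost model. The page size, row density, and throughput figures are assumptions taken from this discussion; real planners use far richer statistics and account for caching and clustering:

```python
PAGE_SIZE = 8 * 1024     # 8 KB pages (PostgreSQL's default)
ROWS = 1_000_000
ROWS_PER_PAGE = 100      # assumed table density
SEQ_MBPS = 500           # sequential throughput from the text
RAND_MBPS = 20           # effective random throughput from the text

def seq_scan_seconds() -> float:
    # Sequential scan: read every table page exactly once, in order.
    pages = ROWS // ROWS_PER_PAGE
    return pages * PAGE_SIZE / (SEQ_MBPS * 1024 * 1024)

def index_scan_seconds(selectivity: float) -> float:
    # Index scan, worst case: one random table page read per matching
    # row (no clustering, cold cache).
    matching = int(ROWS * selectivity)
    return matching * PAGE_SIZE / (RAND_MBPS * 1024 * 1024)

print(f"seq scan:         {seq_scan_seconds():8.3f}s")
print(f"index scan 0.01%: {index_scan_seconds(0.0001):8.3f}s")
print(f"index scan 40%:   {index_scan_seconds(0.40):8.3f}s")
```

Even this crude model shows the crossover: at very high selectivity the index wins by orders of magnitude, while at 40% the random fetches take minutes against a sub-second sequential scan.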
Adding an index can make queries slower if the optimizer chooses it incorrectly. If statistics suggest high selectivity but actual data is low selectivity, the optimizer chooses an index scan that performs worse than a sequential scan. This is why accurate statistics are essential.
The Write Overhead Reality:
Every index imposes overhead on data modification:

- INSERT must add an entry to every index on the table.
- DELETE must mark the corresponding entry dead in every index.
- UPDATE of an indexed column must modify that index (and in PostgreSQL, unless the update qualifies as heap-only, every index).
- Page splits occur as index pages fill, causing extra writes and fragmentation.

For a table with 5 indexes, an INSERT does 6x the write work (1 table + 5 indexes). This overhead can be significant for write-heavy tables.

Rule of thumb: index the columns that appear in the WHERE, JOIN, and ORDER BY clauses of your most frequent or most critical queries, and stop there. On write-heavy tables, every additional index must justify its maintenance cost.
Real-world queries rarely filter on a single column. Composite indexes—indexes on multiple columns—are essential for optimizing complex queries. However, their design requires careful consideration of column order and usage patterns.
The Left-to-Right Rule:
A B-tree composite index on columns (A, B, C) sorts entries first by A, then by B within each A value, then by C within each (A, B) combination.
This index can efficiently satisfy:

- WHERE A = ?
- WHERE A = ? AND B = ?
- WHERE A = ? AND B = ? AND C = ?
- WHERE A = ? AND B > ? (equality on the prefix, range on the next column)

This index CANNOT efficiently satisfy:

- WHERE B = ? (no constraint on the leading column A)
- WHERE C = ?
- WHERE B = ? AND C = ?
The reason: B-tree lookup requires starting from the leftmost column. You can't jump to the middle of a sorted sequence without scanning from the start.
```sql
-- Table: orders (id, customer_id, status, created_at, total)

-- Query 1: Filter by customer, sort by date
SELECT * FROM orders
WHERE customer_id = 123
ORDER BY created_at DESC
LIMIT 10;
-- Optimal index: (customer_id, created_at DESC)
-- Why: Filter on customer_id, then sorted access for ORDER BY

-- Query 2: Filter by customer and status
SELECT * FROM orders
WHERE customer_id = 123 AND status = 'pending';
-- Optimal index: (customer_id, status)
-- OR: (status, customer_id) if status is more selective

-- Query 3: Filter by customer, status, sort by date
SELECT * FROM orders
WHERE customer_id = 123 AND status = 'pending'
ORDER BY created_at DESC;
-- Optimal index: (customer_id, status, created_at DESC)
-- Why: Equality filters first, then sort column

-- Anti-pattern: (status, created_at, customer_id)
-- This can't efficiently filter on customer_id
```

Within a composite index, order columns by how they are filtered:

- Equality (=) conditions should be leftmost; they provide exact navigation to the matching subset.
- Range conditions (<, >, BETWEEN, LIKE 'prefix%') should come after the equality columns; a range scan ends sorted access for the remaining columns.

A phone book is indexed by (Last Name, First Name). You can quickly find 'Smith, John' or all 'Smith' entries. But finding all 'John' entries requires scanning the entire book—you can't jump to 'John' without knowing the last name first. The same principle applies to composite indexes.
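The left-to-right rule falls out of how sorted data supports binary search. A composite index is conceptually a sorted list of tuples, as this small sketch shows (using Python's bisect as a stand-in for B-tree navigation):

```python
from bisect import bisect_left, bisect_right

# A composite index on (A, B, C) is conceptually a sorted list of tuples.
index = sorted((a, b, c) for a in range(3) for b in range(3) for c in range(3))

def prefix_range(prefix: tuple) -> list:
    """Entries matching a leading prefix, found with two binary searches."""
    lo = bisect_left(index, prefix)
    # Pad the prefix with +infinity so bisect_right lands after the
    # last entry sharing this prefix.
    hi = bisect_right(index, prefix + (float("inf"),) * (3 - len(prefix)))
    return index[lo:hi]

# Leading-column prefixes: efficient, binary search narrows the range.
print(len(prefix_range((1,))))    # -> 9   (all A=1 entries)
print(len(prefix_range((1, 2))))  # -> 3   (all A=1, B=2 entries)

# Filtering on B alone cannot exploit the sort order: full scan required.
print(len([t for t in index if t[1] == 2]))  # -> 9, but examines all 27
```

The B-only filter returns just as many rows, but has to look at every entry, which is exactly why a B-tree cannot serve it efficiently.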
A covering index includes all columns that a query needs, allowing the database to satisfy the query entirely from the index without accessing the table. This is the fastest possible access pattern.
Why Covering Indexes Are So Fast:

- No table access: every column the query needs is in the index itself.
- Less I/O: the index is usually far smaller than the table, so fewer pages are read.
- Sequential leaf scans: linked leaf pages are read in order rather than via one random row fetch per match.
Example scenario:
```sql
-- Query
SELECT customer_id, status, COUNT(*)
FROM orders
WHERE status = 'pending'
GROUP BY customer_id, status;

-- Non-covering index: (status)
-- Plan: Index Scan on idx_status → Fetch rows from table → Aggregate
-- 10,000 index entries → 10,000 random table reads

-- Covering index: (status, customer_id)
-- Plan: Index Only Scan → Aggregate
-- 10,000 index entries scanned sequentially, no table access
```
The covering index version can be 10-100x faster, especially on cold cache.
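You can watch a planner make this choice. The sketch below uses SQLite (whose plans are easy to inspect from Python) purely as a stand-in; exact plan wording varies by SQLite version, but the phrase COVERING INDEX appears whenever no table access is needed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, status TEXT)"
)
# Index contains every column the query references.
con.execute("CREATE INDEX idx_status_customer ON orders (status, customer_id)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, status, COUNT(*) FROM orders "
    "WHERE status = 'pending' GROUP BY customer_id, status"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # plan mentions COVERING INDEX: the table is never touched
```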
| Database | Syntax | Notes |
|---|---|---|
| PostgreSQL | CREATE INDEX idx ON tbl (key_cols) INCLUDE (other_cols) | INCLUDE columns are leaf-only; don't affect sort order |
| SQL Server | CREATE INDEX idx ON tbl (key_cols) INCLUDE (other_cols) | Same as PostgreSQL; INCLUDE columns in leaf |
| MySQL/InnoDB | CREATE INDEX idx ON tbl (key_cols, other_cols) | No INCLUDE syntax; add columns to index key |
| Oracle | CREATE INDEX idx ON tbl (key_cols, other_cols) | No INCLUDE syntax; add columns to index key |
The INCLUDE Advantage:
PostgreSQL and SQL Server's INCLUDE clause has a significant advantage: included columns are stored only in leaf nodes, not in internal nodes. This means:

- Internal nodes stay small, keeping fanout high and the tree shallow.
- Included columns don't participate in sort order, so wide columns don't constrain key design.
- In a unique index, uniqueness is enforced on the key columns only, not the included ones.

When to use INCLUDE:

- The column appears only in the SELECT list, never in WHERE or ORDER BY.
- The column is too wide to be a useful key, but fetching it from the table would cost an extra read per row.
Covering indexes can become very large if you include many columns or wide columns (VARCHAR, TEXT). A covering index that approaches the table size provides little benefit—you're essentially duplicating data. Balance coverage against size. For frequently-run queries on hot paths, the trade-off is usually worthwhile.
A partial index (PostgreSQL) or filtered index (SQL Server) indexes only rows that match a specified condition. This powerful technique reduces index size and maintenance overhead while still accelerating the targeted queries.
```sql
-- PostgreSQL: Partial Index
-- Only index active users (5% of table)
CREATE INDEX idx_users_active ON users (email)
WHERE is_active = true;

-- Only queries with WHERE is_active = true can use this index
SELECT * FROM users
WHERE email = 'test@example.com' AND is_active = true;  -- Uses index

SELECT * FROM users
WHERE email = 'test@example.com';                       -- Cannot use index

-- SQL Server: Filtered Index
CREATE NONCLUSTERED INDEX idx_orders_pending
ON orders (customer_id, created_at)
WHERE status = 'pending';

-- Oracle: Function-based approach (no true partial indexes)
CREATE INDEX idx_orders_pending ON orders (
    CASE WHEN status = 'pending' THEN customer_id END
);
```

Benefits of Partial Indexes:
| Benefit | Impact |
|---|---|
| Smaller size | Indexing 5% of rows = 5% of full index size |
| Faster writes | INSERT/UPDATE only touches index if row matches predicate |
| Better cache efficiency | More of the useful index fits in memory |
| Faster maintenance | VACUUM/REINDEX operates on smaller structure |
Limitations:
The query's WHERE clause must match or imply the index predicate. For an index WHERE status = 'pending', a query with WHERE status IN ('pending', 'active') cannot use it—the query might return 'active' rows not in the index. Design partial indexes around exact query patterns.
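SQLite also supports partial indexes, which makes the predicate-matching rule easy to demonstrate from Python (a sketch; plan wording varies by version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, is_active INT)"
)
# Partial index: only active users are indexed.
con.execute("CREATE INDEX idx_users_active ON users (email) WHERE is_active = 1")

def plan(sql: str) -> str:
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Query predicate implies the index predicate: index is usable.
p1 = plan("SELECT * FROM users WHERE email = 'a@b.c' AND is_active = 1")
# Predicate missing: the index might not cover all matches, so full scan.
p2 = plan("SELECT * FROM users WHERE email = 'a@b.c'")
print(p1)
print(p2)
```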
Sometimes queries filter on expressions or function results rather than raw column values. Functional indexes (also called expression indexes) index the computed result, enabling efficient lookups on transformed data.
```sql
-- PostgreSQL: Case-insensitive search
CREATE INDEX idx_users_email_lower ON users (LOWER(email));

-- Query that uses the index:
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';

-- PostgreSQL: JSON field extraction
CREATE INDEX idx_data_type ON events ((data->>'type'));

-- Query:
SELECT * FROM events WHERE data->>'type' = 'purchase';

-- PostgreSQL: Date truncation for time-based grouping
CREATE INDEX idx_orders_month ON orders (DATE_TRUNC('month', created_at));

-- Query grouping by month uses this:
SELECT DATE_TRUNC('month', created_at), COUNT(*)
FROM orders
GROUP BY DATE_TRUNC('month', created_at);

-- SQL Server: Computed column with index
ALTER TABLE users ADD email_lower AS LOWER(email) PERSISTED;
CREATE INDEX idx_email_lower ON users (email_lower);

-- MySQL: Generated column with index
ALTER TABLE users ADD email_lower VARCHAR(255)
    GENERATED ALWAYS AS (LOWER(email)) STORED;
CREATE INDEX idx_email_lower ON users (email_lower);
```

Critical Constraint: Exact Match Required
The query expression must exactly match the index expression. Consider:
```sql
CREATE INDEX idx ON users (LOWER(email));

-- Uses index:
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';

-- Does NOT use index (different function):
SELECT * FROM users WHERE UPPER(email) = 'TEST@EXAMPLE.COM';

-- Does NOT use index (different expression):
SELECT * FROM users WHERE LOWER(TRIM(email)) = 'test@example.com';
```
The optimizer doesn't reason about function equivalences. If your code uses different variations of the same logical operation, you'll need multiple indexes or standardized function usage.
Functions used in expression indexes must be IMMUTABLE—they must always return the same output for the same input. Functions that depend on locale, time zone, or other external state cannot be indexed. NOW(), RANDOM(), and locale-dependent functions are not allowed.
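The exact-match constraint is easy to observe. SQLite supports expression indexes too, so this sketch (a stand-in; plan wording varies by version) shows the optimizer matching the indexed expression syntactically rather than semantically:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
# Expression index on the lowercased email.
con.execute("CREATE INDEX idx_email_lower ON users (lower(email))")

def plan(sql: str) -> str:
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Exact expression match: the index is used.
p1 = plan("SELECT * FROM users WHERE lower(email) = 'test@example.com'")
# Different function: the optimizer does not equate upper() with lower().
p2 = plan("SELECT * FROM users WHERE upper(email) = 'TEST@EXAMPLE.COM'")
print(p1)
print(p2)
```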
Indexes require ongoing maintenance. Without it, they bloat, fragment, and provide diminishing returns. Effective index management is as important as initial index design.
Index Bloat:
As rows are updated and deleted, index entries are marked dead but not immediately removed. This creates bloat—wasted space in the index. Symptoms:

- The index is much larger than a freshly rebuilt copy would be.
- Index scans read more pages than the matching row count justifies.
- Cache hit rates drop because half-empty pages crowd out useful data.
Solutions by database:
| Database | Automatic Solution | Manual Solution |
|---|---|---|
| PostgreSQL | autovacuum (background) | VACUUM, REINDEX |
| MySQL | Background purge | OPTIMIZE TABLE, ALTER TABLE ENGINE=InnoDB |
| SQL Server | Ghost cleanup | ALTER INDEX REBUILD/REORGANIZE |
| Oracle | Automatic segment space management | ALTER INDEX REBUILD |
```sql
-- PostgreSQL: Check index size and usage
SELECT schemaname,
       relname AS tablename,
       indexrelname AS indexname,
       pg_relation_size(indexrelid) AS index_size,
       idx_scan AS index_scans,
       idx_tup_read AS tuples_read,
       idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

-- PostgreSQL: Find unused indexes
SELECT schemaname || '.' || indexrelname AS index,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0              -- Never used
  AND indexrelid NOT IN (
      SELECT conindid FROM pg_constraint  -- Not backing a constraint
  )
ORDER BY pg_relation_size(indexrelid) DESC;

-- MySQL: Index usage statistics
SELECT object_schema,
       object_name,
       index_name,
       count_star AS rows_examined,
       count_read,
       count_fetch
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NOT NULL
ORDER BY count_star DESC;
```

Query patterns change over time. An index critical during a feature launch may become unused after user behavior shifts. Schedule quarterly reviews of index usage statistics and remove unused indexes. This reduces write overhead and storage costs.
Index design should be systematic, not ad-hoc. Following a consistent methodology ensures you create indexes that serve your actual query patterns while minimizing overhead.
The Index Consolidation Pattern:
Multiple queries can often share a single well-designed index:
```sql
-- Query A: WHERE customer_id = 1 AND status = 'active'
-- Query B: WHERE customer_id = 1 ORDER BY created_at DESC
-- Query C: WHERE customer_id = 1 AND status = 'active' ORDER BY created_at DESC

-- Instead of three indexes, one composite index serves all:
CREATE INDEX idx_customer_orders ON orders (customer_id, status, created_at DESC);
This index efficiently supports all three patterns because:

- Query A navigates directly via equality on the (customer_id, status) prefix.
- Query B uses the customer_id prefix to find the customer's rows; note that a sort on created_at may still be needed, since rows within a customer are ordered by status first.
- Query C uses all three columns: equality on (customer_id, status), with created_at then delivering rows in the requested order, no sort required.
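A quick way to verify consolidation is to inspect the plans. This sketch uses SQLite as a stand-in (plan wording varies by version); the key signal is the absence of a temporary sort step for the filter-plus-order query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, "
    "status TEXT, created_at TEXT)"
)
con.execute(
    "CREATE INDEX idx_customer_orders "
    "ON orders (customer_id, status, created_at DESC)"
)

def plan(sql: str) -> str:
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Query A: equality on the two leading columns.
p_a = plan("SELECT * FROM orders WHERE customer_id = 1 AND status = 'active'")
# Query C: same filter, plus an ORDER BY the index can satisfy directly.
p_c = plan(
    "SELECT * FROM orders WHERE customer_id = 1 AND status = 'active' "
    "ORDER BY created_at DESC"
)
print(p_a)
print(p_c)  # no 'USE TEMP B-TREE FOR ORDER BY': the sort comes free
```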
It's easier to add an index later than to remove one that application logic depends on. Start with indexes for the most critical queries, monitor performance, and add more only when specific performance problems are identified. Over-indexing from day one creates maintenance burden and write overhead.
What's Next:
With a solid foundation in index design, the next page covers Query Rewriting—techniques for restructuring SQL queries to achieve better execution plans and performance.
You now have comprehensive knowledge of index design—from B-tree fundamentals through advanced techniques like covering, partial, and functional indexes. This knowledge enables you to make informed decisions about when, where, and how to index for optimal performance.