Having indexes is only half the battle. The real challenge is ensuring the database actually uses them. Every time you execute a query, an extraordinarily sophisticated component called the query optimizer analyzes your SQL, considers all available execution strategies, and chooses what it believes is the optimal plan.
This decision includes whether to use an index, which index to use when multiple are available, and whether a full table scan might actually be faster. Understanding how the optimizer makes these decisions transforms you from someone who creates indexes hoping they'll help into someone who designs indexes knowing they will be used effectively.
By the end of this page, you will understand the cost-based optimization model, the statistics that drive optimizer decisions, selectivity estimation, the factors that influence index selection, and common reasons why the optimizer might reject an available index.
Modern database optimizers use a cost-based approach to query planning. Rather than following fixed rules ('always use an index if one exists'), they estimate the computational cost of different execution strategies and choose the one with the lowest estimated cost.
The Optimization Process
The critical step is cost estimation. The optimizer models costs based on factors such as:

- I/O volume: how many pages must be read, and whether those reads are sequential or random
- CPU work: how many rows must be processed and how many comparisons evaluated
- Estimated intermediate result sizes at each step of the plan
```text
Simplified Cost Model Example
=============================

Query: SELECT * FROM orders WHERE customer_id = 12345;

Plan A: Full Table Scan
-----------------------
- Table size: 500,000 pages
- Sequential read cost: 1.0 per page
- Total cost: 500,000 × 1.0 = 500,000 units

Plan B: Index Scan on customer_id
---------------------------------
- Index traversal: 4 pages × 4.0 (random I/O) = 16 units
- Estimated matching rows: 50
- Data page fetches: 50 pages × 4.0 = 200 units
- Total cost: 216 units

Plan B (216) << Plan A (500,000) → Optimizer chooses index scan

Key insight: Random I/O (index access) costs more per page than
sequential I/O (table scan), but the total I/O volume matters more.
```

Cost units are an internal abstraction, not real time or I/O counts. They allow the optimizer to compare plans consistently. The actual values and formulas vary significantly between database systems (PostgreSQL, MySQL, Oracle, and SQL Server each use different models).
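The arithmetic above can be sketched in a few lines of Python. This is purely illustrative: the constants and formulas are assumptions mirroring the example, not any real engine's costing model.

```python
# Illustrative cost comparison. The per-page constants mirror the example
# above: sequential reads cost 1.0 per page, random reads cost 4.0.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0

def full_scan_cost(table_pages: int) -> float:
    """Sequential read of every page in the table."""
    return table_pages * SEQ_PAGE_COST

def index_scan_cost(index_depth: int, matching_rows: int) -> float:
    """Traverse the index (random I/O), then fetch one data page per
    matching row (pessimistic: every match on a different page)."""
    traversal = index_depth * RANDOM_PAGE_COST
    fetches = matching_rows * RANDOM_PAGE_COST
    return traversal + fetches

plan_a = full_scan_cost(500_000)   # 500000.0
plan_b = index_scan_cost(4, 50)    # 16 + 200 = 216.0
print("Full scan:", plan_a, "Index scan:", plan_b)
print("Chosen plan:", "index scan" if plan_b < plan_a else "full scan")
```

PostgreSQL exposes similar tuning knobs as `seq_page_cost` and `random_page_cost`, which default to 1.0 and 4.0 respectively.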
The optimizer's cost estimates depend entirely on statistics—metadata about table contents, data distributions, and index characteristics. Without accurate statistics, even the most sophisticated optimizer makes poor decisions.
Key Statistics Collected
| Statistic | Description | Used For |
|---|---|---|
| Row Count | Total number of rows in table | Estimating result set sizes |
| Page Count | Total data pages occupied by table | Full scan cost estimation |
| Column NDV | Number of Distinct Values in a column | Selectivity calculations |
| Histograms | Value distribution across buckets | Range and equality estimates |
| NULL Fraction | Percentage of NULL values | IS NULL/IS NOT NULL estimates |
| Average Width | Average bytes per column/row | Memory and I/O calculations |
| Correlation | Physical vs. logical ordering match | Range scan cost accuracy |
| Index Stats | Index height, leaf pages, clustering | Index access cost estimation |
Histogram Deep Dive
Histograms are particularly important for accurate estimation. They capture how values are distributed across buckets, enabling the optimizer to produce accurate row-count estimates even for skewed data, where a uniform-distribution assumption fails badly:
```sql
-- Example: Order status distribution
-- Value        | Count   | Percentage
-- -------------|---------|-----------
-- 'pending'    | 5,000   | 5%
-- 'processing' | 10,000  | 10%
-- 'shipped'    | 25,000  | 25%
-- 'delivered'  | 55,000  | 55%
-- 'cancelled'  | 5,000   | 5%

-- Without histogram (assuming uniform distribution):
SELECT * FROM orders WHERE status = 'delivered';
-- Optimizer estimates: 100,000 / 5 distinct = 20,000 rows (20%)
-- Actual result: 55,000 rows (55%)
-- Estimation error: 2.75x underestimate!

-- With accurate histogram:
-- Optimizer knows 'delivered' = 55% → estimates 55,000 rows
-- This changes whether the index is beneficial!
-- If only 5,000 rows (5%), the index is very efficient
-- If 55,000 rows (55%), a full scan might be better
```

Statistics must be refreshed as data changes. If your table grew from 100,000 to 10 million rows but statistics still reflect the smaller size, the optimizer makes wildly incorrect estimates. Most databases have auto-analyze features, but high-activity tables may require manual statistics updates.
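The estimation gap above can be sketched in a few lines. The row counts are copied from the example; the Python dict is a stand-in for a most-common-values histogram:

```python
# Uniform assumption vs. a frequency histogram, using the orders/status
# numbers from the example above.
TOTAL_ROWS = 100_000
histogram = {  # value -> row count (a "most common values" list)
    'pending': 5_000, 'processing': 10_000, 'shipped': 25_000,
    'delivered': 55_000, 'cancelled': 5_000,
}

def estimate_uniform(total_rows: int, ndv: int) -> float:
    # Without a histogram: assume every distinct value is equally common.
    return total_rows / ndv

def estimate_histogram(value: str) -> int:
    # With a histogram: look up the value's observed frequency.
    return histogram[value]

print(estimate_uniform(TOTAL_ROWS, len(histogram)))  # 20000.0
print(estimate_histogram('delivered'))               # 55000
```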
```sql
-- PostgreSQL: Analyze table to update statistics
ANALYZE orders;
ANALYZE orders(customer_id, status);  -- Specific columns

-- MySQL: Analyze table
ANALYZE TABLE orders;

-- SQL Server: Update statistics
UPDATE STATISTICS orders;
UPDATE STATISTICS orders WITH FULLSCAN;  -- Complete analysis

-- Oracle: Gather table statistics
EXEC DBMS_STATS.GATHER_TABLE_STATS('schema_name', 'orders');

-- Check when statistics were last updated (PostgreSQL)
SELECT schemaname, relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

Selectivity is the estimated fraction of rows that will match a condition. It's the single most important factor in index selection decisions.
$$\text{Selectivity} = \frac{\text{Matching Rows}}{\text{Total Rows}}$$
Why Selectivity Matters
An index is beneficial when selectivity is low (few rows match). The lower the selectivity, the greater the advantage of using an index.
Selectivity Estimation Methods
The optimizer estimates selectivity using different techniques based on available information:
| Condition Type | Estimation Method | Formula/Approach |
|---|---|---|
| `column = value` | Histogram or uniform assumption | 1 / NDV if no histogram |
| `column > value` | Histogram bucket analysis | Fraction of values above threshold |
| `column BETWEEN a AND b` | Range fraction calculation | (b - a) / (max - min) |
| `column LIKE 'prefix%'` | Treated as range condition | Similar to BETWEEN |
| `column IS NULL` | NULL fraction statistic | Direct statistic lookup |
| `column IN (v1, v2, v3)` | Sum of individual selectivities | min(sum, 1.0) |
| `cond1 AND cond2` | Independence assumption | sel1 × sel2 |
| `cond1 OR cond2` | Inclusion-exclusion | sel1 + sel2 - (sel1 × sel2) |
The independence assumption (multiplying selectivities) works poorly when columns are correlated. If all pending orders come from the WEST region, the actual selectivity is much higher than estimated. Modern databases have multi-column statistics and correlation detection to address this, but it remains a common source of optimizer errors.
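The AND/OR combinators from the table, and the correlated-column failure mode described above, can be sketched directly (the percentages here are hypothetical):

```python
# Selectivity combinators as the optimizer applies them.

def and_selectivity(sel1: float, sel2: float) -> float:
    # Independence assumption: P(A and B) = P(A) * P(B)
    return sel1 * sel2

def or_selectivity(sel1: float, sel2: float) -> float:
    # Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
    return sel1 + sel2 - sel1 * sel2

# Hypothetical: status = 'pending' matches 5% of rows, region = 'WEST' 25%.
est = and_selectivity(0.05, 0.25)  # optimizer estimate: 1.25%
# But if ALL pending orders come from WEST, the true selectivity of the
# conjunction is just the rarer predicate: 5%.
actual = 0.05
print(f"estimated {est:.4f} vs actual {actual:.4f}")  # a 4x underestimate
```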
When multiple indexes exist on a table, the optimizer must choose which one (if any) to use. This decision follows a structured evaluation process:
Step 1: Identify Candidate Indexes
The optimizer identifies indexes that could potentially be used based on:

- Columns referenced in WHERE, JOIN, ORDER BY, and GROUP BY clauses
- Whether those columns match a leftmost prefix of each index's column list
- Whether the conditions are searchable (e.g., no function wrapped around the indexed column)
Step 2: Evaluate Each Candidate
For each candidate index, the optimizer estimates:

- The selectivity of the conditions the index can satisfy
- The number of index pages to traverse and data pages to fetch
- Whether the index's ordering can satisfy ORDER BY and avoid a sort
- How well the index's order matches the table's physical layout (clustering)
Step 3: Compare Against Full Scan
The index access cost is compared against a sequential full table scan. The full scan has a known, predictable cost based on table size.
Step 4: Select Optimal Access Path
The access method with the lowest total estimated cost is chosen.
```sql
-- Table: products (2,000,000 rows)
-- Available indexes:
--   1. idx_category  (category_id)
--   2. idx_price     (price)
--   3. idx_cat_price (category_id, price)
--   4. idx_name      (product_name)

SELECT product_id, product_name, price
FROM products
WHERE category_id = 15
  AND price > 100
ORDER BY price;

-- Optimizer evaluation:

-- Candidate 1: idx_category
--   Pro: Can filter category_id = 15
--   Con: Must then filter price > 100 from results
--   Con: Requires sort for ORDER BY price
--   Estimated: 40,000 index entries, 35,000 after price filter, sort needed

-- Candidate 2: idx_price
--   Pro: Already sorted by price (no sort needed)
--   Con: Must scan all price > 100 entries and filter category
--   Estimated: 800,000 index entries scanned, 35,000 match both

-- Candidate 3: idx_cat_price (BEST CHOICE)
--   Pro: Filters category_id first (leftmost)
--   Pro: Then uses price for range scan (second column)
--   Pro: Already sorted by price within category!
--   Estimated: Navigate to category=15, range scan price>100
--   Result: 35,000 entries, no additional sort needed

-- Candidate 4: idx_name
--   Not usable: product_name not in WHERE clause with a searchable condition

-- Full scan comparison:
--   2,000,000 rows × 0.0175 selectivity = 35,000 rows expected
--   But requires reading all 2,000,000 rows + sort

-- Winner: idx_cat_price with estimated cost ~35,000 vs. others 100,000+
```

A composite index on (a, b, c) can be used for conditions on (a), (a, b), or (a, b, c). It CANNOT be used for conditions only on (b), (c), or (b, c). The index must be used from left to right without gaps, because a B-tree index sorts first by a, then by b within each a value, then by c within each (a, b) combination.
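The leftmost-prefix rule is mechanical enough to sketch. The helper below is hypothetical (not part of any database API); it reports how much of a composite index a given set of predicate columns can actually use:

```python
# Leftmost-prefix check for a composite B-tree index: predicates must
# cover a contiguous prefix of the index's column list.

def usable_prefix(index_cols: list, predicate_cols: set) -> list:
    """Return the leading index columns the predicates can use."""
    prefix = []
    for col in index_cols:
        if col not in predicate_cols:
            break  # gap: columns after this point are unusable
        prefix.append(col)
    return prefix

idx = ['a', 'b', 'c']
print(usable_prefix(idx, {'a', 'b'}))  # ['a', 'b']  -> index usable
print(usable_prefix(idx, {'b', 'c'}))  # []          -> index not usable
print(usable_prefix(idx, {'a', 'c'}))  # ['a']       -> only 'a' usable
```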
Beyond selectivity and cost estimates, several other factors influence whether the optimizer chooses to use an index:
Clustering Factor Deep Dive
The clustering factor measures how 'expensive' it is to fetch rows via an index. Consider an orders table with an index on order_date:
```sql
-- PostgreSQL: Check correlation (clustering)
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'orders';

-- correlation close to 1.0 or -1.0 = good clustering
-- correlation close to 0 = poor clustering

-- attname     | correlation
-- ------------|------------
-- order_id    | 0.98   -- Well clustered (likely inserted in order)
-- order_date  | 0.95   -- Well clustered (dates correlate with insert order)
-- customer_id | 0.02   -- Poorly clustered (customers are random)

-- Implication: Range scans on order_date are efficient;
-- range scans on customer_id cause lots of random I/O
```

Sometimes you expect an index to be used, but the optimizer chooses a full scan instead. Understanding why helps you either adjust your indexing strategy or accept that the optimizer made the right choice.
Common Reasons for Index Rejection
| Reason | Explanation | Solution |
|---|---|---|
| High Selectivity Fraction | A large share of rows match; full scan is cheaper | Accept full scan or rethink query pattern |
| Stale Statistics | Optimizer has wrong row count estimates | Run ANALYZE / UPDATE STATISTICS |
| Poor Clustering | Random I/O cost exceeds sequential scan benefit | Consider clustered/IOT tables or covering index |
| Small Table | Entire table fits in one or few pages | Accept full scan (it's fast anyway) |
| Function on Column | WHERE UPPER(name) = 'X' can't use index on name | Create functional index or change query |
| Type Mismatch | Implicit cast prevents index usage | Ensure query uses correct types |
| Leading Wildcard | LIKE '%pattern' can't use B-tree index | Use full-text search or trigram index |
| OR Conditions | Index may not help with complex OR logic | Consider UNION of indexed queries |
```sql
-- Example 1: High selectivity fraction (many rows match)
SELECT * FROM orders WHERE status = 'delivered';
-- If 60% of orders are 'delivered', a full scan is likely chosen
-- Index would read 60% of index + 60% of table = more work than full scan

-- Example 2: Function on column
SELECT * FROM customers WHERE UPPER(email) = 'USER@EXAMPLE.COM';
-- An index on the email column CANNOT be used:
-- UPPER() transforms values; the B-tree doesn't know the transformed order
-- Fix: CREATE INDEX idx_email_upper ON customers(UPPER(email));

-- Example 3: Implicit type conversion
-- Table has: customer_id INTEGER
SELECT * FROM orders WHERE customer_id = '12345';  -- String literal
-- Some databases convert the column to string, breaking index usage
-- Fix: Use the correct type: customer_id = 12345

-- Example 4: OR with different columns
SELECT * FROM products WHERE category_id = 5 OR brand_id = 10;
-- Neither single-column index covers both conditions efficiently
-- Options:
--   a) Accept a full scan if both conditions match many rows
--   b) Rewrite as a UNION of two indexed queries:
SELECT * FROM products WHERE category_id = 5
UNION
SELECT * FROM products WHERE brand_id = 10;
```

There's typically a "tipping point" where full scans become preferred, often around 5-30% of table rows depending on clustering. A query that selects 1% of rows with an index will be fast. The same query selecting 40% might use a full scan, and that's often the correct choice! Don't assume index usage is always better.
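Under simplified assumptions (one random page fetch per matching row, with the illustrative cost constants used earlier on this page), the tipping point can be computed directly. The exact percentage in a real system depends on clustering, caching, and the engine's cost model:

```python
# Where does an index scan's random I/O overtake a sequential full scan?
# Constants and table sizes are illustrative assumptions.
SEQ_PAGE_COST, RANDOM_PAGE_COST = 1.0, 4.0
TABLE_PAGES, TOTAL_ROWS = 500_000, 10_000_000

def full_scan_cost() -> float:
    return TABLE_PAGES * SEQ_PAGE_COST

def index_scan_cost(selectivity: float) -> float:
    # Pessimistic model: one random page fetch per matching row.
    matching_rows = TOTAL_ROWS * selectivity
    return matching_rows * RANDOM_PAGE_COST

for sel in (0.001, 0.01, 0.0125, 0.05, 0.30):
    choice = "index" if index_scan_cost(sel) < full_scan_cost() else "full scan"
    print(f"selectivity {sel:.2%}: {choice}")

# Break-even under this model: 500,000 / (10,000,000 × 4) = 1.25%.
# Better clustering (fewer random fetches per row) pushes it higher.
```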
The key tool for understanding optimizer decisions is the EXPLAIN command (covered in detail in Module 2). Here's a focused look at what to observe for index selection:
```sql
-- PostgreSQL: EXPLAIN ANALYZE with index information
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 12345;

-- Look for these indicators:

-- Index being used:
--   Index Scan using idx_customer_id on orders  (cost=0.43..12.50 rows=50 width=120)
--                    ^^^^^^^^^^^^^^^                   ^^^^^        ^^^^^
--                    Index name used                   Cost estimates

-- Index NOT used (full scan):
--   Seq Scan on orders  (cost=0.00..15000.00 rows=50 width=120)
--     Filter: (customer_id = 12345)
--     Rows Removed by Filter: 999950   <- Had to examine many rows

-- MySQL: EXPLAIN with type column
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
-- Key columns to check:
--   type: 'ref' or 'range' = index used, 'ALL' = full scan
--   possible_keys: indexes the optimizer considered
--   key: index actually chosen (NULL = no index)
--   rows: estimated rows to examine

-- SQL Server: Actual execution plan
SET STATISTICS IO ON;
SELECT * FROM orders WHERE customer_id = 12345;
-- Shows logical reads, index seeks vs. scans
```

Diagnosing Index Selection Problems
When investigating why an index isn't being used:

1. Run EXPLAIN to confirm which access path was chosen and what row counts the optimizer estimated.
2. Check whether statistics are stale and refresh them (ANALYZE / UPDATE STATISTICS).
3. Estimate the condition's actual selectivity; if a large fraction of rows match, a full scan may genuinely be cheaper.
4. Look for index-defeating patterns: functions on indexed columns, implicit type conversions, leading wildcards, and complex OR logic.
5. Consider clustering: a poorly clustered index can lose to a sequential scan even at moderate selectivity.
The optimizer is usually right. Before forcing an index (next page covers hints), verify with EXPLAIN ANALYZE that the optimizer's choice is actually slower. Sometimes what looks like a 'wrong' decision is correct for your data distribution.
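When verifying optimizer choices across many queries, EXPLAIN's text output can be scraped with a small script. The helper below is a hypothetical sketch; its regex covers only the simple plan-line shapes shown on this page, not real EXPLAIN output in general:

```python
import re

# Hypothetical helper: extract the access method, total cost, and row
# estimate from a PostgreSQL-style EXPLAIN text line.
PLAN_RE = re.compile(
    r"(?P<node>Index Only Scan|Index Scan|Bitmap Heap Scan|Seq Scan)"
    r"(?: using (?P<index>\S+))? on (?P<table>\S+)\s+"
    r"\(cost=[\d.]+\.\.(?P<total_cost>[\d.]+) rows=(?P<rows>\d+)"
)

def parse_plan_line(line: str) -> dict:
    m = PLAN_RE.search(line)
    if not m:
        raise ValueError("unrecognized plan line")
    d = m.groupdict()
    d["total_cost"] = float(d["total_cost"])
    d["rows"] = int(d["rows"])
    return d

print(parse_plan_line(
    "Index Scan using idx_customer_id on orders  (cost=0.43..12.50 rows=50 width=120)"))
print(parse_plan_line(
    "Seq Scan on orders  (cost=0.00..15000.00 rows=50 width=120)"))
```

A script like this makes it easy to flag queries whose chosen plan (`node`) or estimated `rows` drifts after a schema or statistics change.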
Understanding how the optimizer selects indexes transforms you from creating indexes blindly to designing them strategically. The key points:

- Optimizers are cost-based: they estimate and compare plan costs rather than follow fixed rules.
- Cost estimates are only as good as the statistics behind them; keep statistics fresh.
- Selectivity drives index choice: the smaller the matching fraction, the more an index helps.
- Composite indexes are usable only through a leftmost prefix, without gaps.
- Clustering determines how expensive index-driven row fetches are.
- EXPLAIN reveals which access path was chosen and why.
What's Next
Sometimes the optimizer makes suboptimal choices due to estimation errors or unusual data patterns. The next page covers index hints—mechanisms for guiding (or overriding) the optimizer's index selection when you know better than the statistics suggest.
You now understand how database optimizers evaluate and select indexes using cost-based analysis, statistics, and selectivity estimation. This knowledge enables you to design indexes that will actually be used and to diagnose cases where indexes are unexpectedly ignored.