Having indexes is only half the battle. The real challenge is ensuring the database actually uses them. Every time you execute a query, an extraordinarily sophisticated component called the query optimizer analyzes your SQL, considers all available execution strategies, and chooses what it believes is the optimal plan.
This decision includes whether to use an index, which index to use when multiple are available, and whether a full table scan might actually be faster. Understanding how the optimizer makes these decisions transforms you from someone who creates indexes hoping they'll help into someone who designs indexes knowing they will be used effectively.
By the end of this page, you will understand the cost-based optimization model, the statistics that drive optimizer decisions, selectivity estimation, the factors that influence index selection, and common reasons why the optimizer might reject an available index.
Modern database optimizers use a cost-based approach to query planning. Rather than following fixed rules ('always use an index if one exists'), they estimate the computational cost of different execution strategies and choose the one with the lowest estimated cost.
The Optimization Process
The critical step is cost estimation. The optimizer models costs based on factors such as:

- I/O volume: how many pages must be read, and whether those reads are sequential or random
- CPU work: how many rows must be processed and how many comparisons evaluated
- Estimated intermediate result sizes at each step of the plan
```text
Simplified Cost Model Example
=============================

Query: SELECT * FROM orders WHERE customer_id = 12345;

Plan A: Full Table Scan
-----------------------
- Table size: 500,000 pages
- Sequential read cost: 1.0 per page
- Total cost: 500,000 × 1.0 = 500,000 units

Plan B: Index Scan on customer_id
---------------------------------
- Index traversal: 4 pages × 4.0 (random I/O) = 16 units
- Estimated matching rows: 50
- Data page fetches: 50 pages × 4.0 = 200 units
- Total cost: 216 units

Plan B (216) << Plan A (500,000) → Optimizer chooses index scan

Key insight: Random I/O (index access) costs more per page than
sequential I/O (table scan), but the total I/O volume matters more.
```

Cost units are an internal abstraction, not real time or I/O counts. They allow the optimizer to compare plans consistently. The actual values and formulas vary significantly between database systems (PostgreSQL, MySQL, Oracle, and SQL Server each use different models).
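The arithmetic above can be sketched in a few lines of Python. This is purely illustrative: the constants and formulas are assumptions mirroring the example, not any real engine's costing model.

```python
# Illustrative cost comparison. The per-page constants mirror the example
# above: sequential reads cost 1.0 per page, random reads cost 4.0.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0

def full_scan_cost(table_pages: int) -> float:
    """Sequential read of every page in the table."""
    return table_pages * SEQ_PAGE_COST

def index_scan_cost(index_depth: int, matching_rows: int) -> float:
    """Traverse the index (random I/O), then fetch one data page per
    matching row (pessimistic: every match on a different page)."""
    traversal = index_depth * RANDOM_PAGE_COST
    fetches = matching_rows * RANDOM_PAGE_COST
    return traversal + fetches

plan_a = full_scan_cost(500_000)   # 500000.0
plan_b = index_scan_cost(4, 50)    # 16 + 200 = 216.0
print("Full scan:", plan_a, "Index scan:", plan_b)
print("Chosen plan:", "index scan" if plan_b < plan_a else "full scan")
```

PostgreSQL exposes similar tuning knobs as `seq_page_cost` and `random_page_cost`, which default to 1.0 and 4.0 respectively.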
The optimizer's cost estimates depend entirely on statistics—metadata about table contents, data distributions, and index characteristics. Without accurate statistics, even the most sophisticated optimizer makes poor decisions.
Key Statistics Collected
| Statistic | Description | Used For |
|---|---|---|
| Row Count | Total number of rows in table | Estimating result set sizes |
| Page Count | Total data pages occupied by table | Full scan cost estimation |
| Column NDV | Number of Distinct Values in a column | Selectivity calculations |
| Histograms | Value distribution across buckets | Range and equality estimates |
| NULL Fraction | Percentage of NULL values | IS NULL/IS NOT NULL estimates |
| Average Width | Average bytes per column/row | Memory and I/O calculations |
| Correlation | Physical vs. logical ordering match | Range scan cost accuracy |
| Index Stats | Index height, leaf pages, clustering | Index access cost estimation |
Histogram Deep Dive
Histograms are particularly important for accurate estimation. They capture how values are distributed across buckets, enabling the optimizer to produce accurate row-count estimates even for skewed data, where a uniform-distribution assumption fails badly:
```sql
-- Example: Order status distribution
-- Value        | Count   | Percentage
-- -------------|---------|-----------
-- 'pending'    | 5,000   | 5%
-- 'processing' | 10,000  | 10%
-- 'shipped'    | 25,000  | 25%
-- 'delivered'  | 55,000  | 55%
-- 'cancelled'  | 5,000   | 5%

-- Without histogram (assuming uniform distribution):
SELECT * FROM orders WHERE status = 'delivered';
-- Optimizer estimates: 100,000 / 5 distinct = 20,000 rows (20%)
-- Actual result: 55,000 rows (55%)
-- Estimation error: 2.75x underestimate!

-- With accurate histogram:
-- Optimizer knows 'delivered' = 55% → estimates 55,000 rows
-- This changes whether the index is beneficial!
-- If only 5,000 rows (5%), the index is very efficient
-- If 55,000 rows (55%), a full scan might be better
```

Statistics must be refreshed as data changes. If your table grew from 100,000 to 10 million rows but statistics still reflect the smaller size, the optimizer makes wildly incorrect estimates. Most databases have auto-analyze features, but high-activity tables may require manual statistics updates.
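The estimation gap above can be sketched in a few lines. The row counts are copied from the example; the Python dict is a stand-in for a most-common-values histogram:

```python
# Uniform assumption vs. a frequency histogram, using the orders/status
# numbers from the example above.
TOTAL_ROWS = 100_000
histogram = {  # value -> row count (a "most common values" list)
    'pending': 5_000, 'processing': 10_000, 'shipped': 25_000,
    'delivered': 55_000, 'cancelled': 5_000,
}

def estimate_uniform(total_rows: int, ndv: int) -> float:
    # Without a histogram: assume every distinct value is equally common.
    return total_rows / ndv

def estimate_histogram(value: str) -> int:
    # With a histogram: look up the value's observed frequency.
    return histogram[value]

print(estimate_uniform(TOTAL_ROWS, len(histogram)))  # 20000.0
print(estimate_histogram('delivered'))               # 55000
```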
```sql
-- PostgreSQL: Analyze table to update statistics
ANALYZE orders;
ANALYZE orders(customer_id, status);  -- Specific columns

-- MySQL: Analyze table
ANALYZE TABLE orders;

-- SQL Server: Update statistics
UPDATE STATISTICS orders;
UPDATE STATISTICS orders WITH FULLSCAN;  -- Complete analysis

-- Oracle: Gather table statistics
EXEC DBMS_STATS.GATHER_TABLE_STATS('schema_name', 'orders');

-- Check when statistics were last updated (PostgreSQL)
SELECT schemaname, relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

Selectivity is the estimated fraction of rows that will match a condition. It's the single most important factor in index selection decisions.
$$\text{Selectivity} = \frac{\text{Matching Rows}}{\text{Total Rows}}$$
Why Selectivity Matters
An index is beneficial when selectivity is low (few rows match). The lower the selectivity, the greater the advantage of using an index.
Selectivity Estimation Methods
The optimizer estimates selectivity using different techniques based on available information:
| Condition Type | Estimation Method | Formula/Approach |
|---|---|---|
| `column = value` | Histogram or uniform assumption | 1 / NDV if no histogram |
| `column > value` | Histogram bucket analysis | Fraction of values above threshold |
| `column BETWEEN a AND b` | Range fraction calculation | (b - a) / (max - min) |
| `column LIKE 'prefix%'` | Treated as range condition | Similar to BETWEEN |
| `column IS NULL` | NULL fraction statistic | Direct statistic lookup |
| `column IN (v1, v2, v3)` | Sum of individual selectivities | min(sum, 1.0) |
| `cond1 AND cond2` | Independence assumption | sel1 × sel2 |
| `cond1 OR cond2` | Inclusion-exclusion | sel1 + sel2 - (sel1 × sel2) |
The independence assumption (multiplying selectivities) works poorly when columns are correlated. If all pending orders come from the WEST region, the actual selectivity is much higher than estimated. Modern databases have multi-column statistics and correlation detection to address this, but it remains a common source of optimizer errors.
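The AND/OR combinators from the table, and the correlated-column failure mode described above, can be sketched directly (the percentages here are hypothetical):

```python
# Selectivity combinators as the optimizer applies them.

def and_selectivity(sel1: float, sel2: float) -> float:
    # Independence assumption: P(A and B) = P(A) * P(B)
    return sel1 * sel2

def or_selectivity(sel1: float, sel2: float) -> float:
    # Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
    return sel1 + sel2 - sel1 * sel2

# Hypothetical: status = 'pending' matches 5% of rows, region = 'WEST' 25%.
est = and_selectivity(0.05, 0.25)  # optimizer estimate: 1.25%
# But if ALL pending orders come from WEST, the true selectivity of the
# conjunction is just the rarer predicate: 5%.
actual = 0.05
print(f"estimated {est:.4f} vs actual {actual:.4f}")  # a 4x underestimate
```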
When multiple indexes exist on a table, the optimizer must choose which one (if any) to use. This decision follows a structured evaluation process:
Step 1: Identify Candidate Indexes
The optimizer identifies indexes that could potentially be used based on:

- Columns referenced in WHERE, JOIN, ORDER BY, and GROUP BY clauses
- Whether those columns match a leftmost prefix of each index's column list
- Whether the conditions are searchable (e.g., no function wrapped around the indexed column)
Step 2: Evaluate Each Candidate
For each candidate index, the optimizer estimates:

- The selectivity of the conditions the index can satisfy
- The number of index pages to traverse and data pages to fetch
- Whether the index's ordering can satisfy ORDER BY and avoid a sort
- How well the index's order matches the table's physical layout (clustering)
Step 3: Compare Against Full Scan
The index access cost is compared against a sequential full table scan. The full scan has a known, predictable cost based on table size.
Step 4: Select Optimal Access Path
The access method with the lowest total estimated cost is chosen.
```sql
-- Table: products (2,000,000 rows)
-- Available indexes:
--   1. idx_category  (category_id)
--   2. idx_price     (price)
--   3. idx_cat_price (category_id, price)
--   4. idx_name      (product_name)

SELECT product_id, product_name, price
FROM products
WHERE category_id = 15
  AND price > 100
ORDER BY price;

-- Optimizer evaluation:

-- Candidate 1: idx_category
--   Pro: Can filter category_id = 15
--   Con: Must then filter price > 100 from results
--   Con: Requires sort for ORDER BY price
--   Estimated: 40,000 index entries, 35,000 after price filter, sort needed

-- Candidate 2: idx_price
--   Pro: Already sorted by price (no sort needed)
--   Con: Must scan all price > 100 entries and filter category
--   Estimated: 800,000 index entries scanned, 35,000 match both

-- Candidate 3: idx_cat_price (BEST CHOICE)
--   Pro: Filters category_id first (leftmost)
--   Pro: Then uses price for range scan (second column)
--   Pro: Already sorted by price within category!
--   Estimated: Navigate to category=15, range scan price>100
--   Result: 35,000 entries, no additional sort needed

-- Candidate 4: idx_name
--   Not usable: product_name not in WHERE clause with a searchable condition

-- Full scan comparison:
--   2,000,000 rows × 0.0175 selectivity = 35,000 rows expected
--   But requires reading all 2,000,000 rows + sort

-- Winner: idx_cat_price with estimated cost ~35,000 vs. others 100,000+
```

A composite index on (a, b, c) can be used for conditions on (a), (a, b), or (a, b, c). It CANNOT be used for conditions only on (b), (c), or (b, c). The index must be used from left to right without gaps, because a B-tree index sorts first by a, then by b within each a value, then by c within each (a, b) combination.
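The leftmost-prefix rule is mechanical enough to sketch. The helper below is hypothetical (not part of any database API); it reports how much of a composite index a given set of predicate columns can actually use:

```python
# Leftmost-prefix check for a composite B-tree index: predicates must
# cover a contiguous prefix of the index's column list.

def usable_prefix(index_cols: list, predicate_cols: set) -> list:
    """Return the leading index columns the predicates can use."""
    prefix = []
    for col in index_cols:
        if col not in predicate_cols:
            break  # gap: columns after this point are unusable
        prefix.append(col)
    return prefix

idx = ['a', 'b', 'c']
print(usable_prefix(idx, {'a', 'b'}))  # ['a', 'b']  -> index usable
print(usable_prefix(idx, {'b', 'c'}))  # []          -> index not usable
print(usable_prefix(idx, {'a', 'c'}))  # ['a']       -> only 'a' usable
```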
Beyond selectivity and cost estimates, several other factors influence whether the optimizer chooses to use an index:
Clustering Factor Deep Dive
The clustering factor measures how 'expensive' it is to fetch rows via an index. Consider an orders table with an index on order_date:
```sql
-- PostgreSQL: Check correlation (clustering)
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'orders';

-- correlation close to 1.0 or -1.0 = good clustering
-- correlation close to 0 = poor clustering

-- attname     | correlation
-- ------------|------------
-- order_id    | 0.98   -- Well clustered (likely inserted in order)
-- order_date  | 0.95   -- Well clustered (dates correlate with insert order)
-- customer_id | 0.02   -- Poorly clustered (customers are random)

-- Implication: Range scans on order_date are efficient;
-- range scans on customer_id cause lots of random I/O
```

Sometimes you expect an index to be used, but the optimizer chooses a full scan instead. Understanding why helps you either adjust your indexing strategy or accept that the optimizer made the right choice.
Common Reasons for Index Rejection
| Reason | Explanation | Solution |
|---|---|---|
| High Selectivity Fraction | A large share of rows match; full scan is cheaper | Accept full scan or rethink query pattern |
| Stale Statistics | Optimizer has wrong row count estimates | Run ANALYZE / UPDATE STATISTICS |
| Poor Clustering | Random I/O cost exceeds sequential scan benefit | Consider clustered/IOT tables or covering index |
| Small Table | Entire table fits in one or few pages | Accept full scan (it's fast anyway) |
| Function on Column | WHERE UPPER(name) = 'X' can't use index on name | Create functional index or change query |
| Type Mismatch | Implicit cast prevents index usage | Ensure query uses correct types |
| Leading Wildcard | LIKE '%pattern' can't use B-tree index | Use full-text search or trigram index |
| OR Conditions | Index may not help with complex OR logic | Consider UNION of indexed queries |
```sql
-- Example 1: High selectivity fraction (many rows match)
SELECT * FROM orders WHERE status = 'delivered';
-- If 60% of orders are 'delivered', a full scan is likely chosen
-- Index would read 60% of index + 60% of table = more work than full scan

-- Example 2: Function on column
SELECT * FROM customers WHERE UPPER(email) = 'USER@EXAMPLE.COM';
-- An index on the email column CANNOT be used:
-- UPPER() transforms values; the B-tree doesn't know the transformed order
-- Fix: CREATE INDEX idx_email_upper ON customers(UPPER(email));

-- Example 3: Implicit type conversion
-- Table has: customer_id INTEGER
SELECT * FROM orders WHERE customer_id = '12345';  -- String literal
-- Some databases convert the column to string, breaking index usage
-- Fix: Use the correct type: customer_id = 12345

-- Example 4: OR with different columns
SELECT * FROM products WHERE category_id = 5 OR brand_id = 10;
-- Neither single-column index covers both conditions efficiently
-- Options:
--   a) Accept a full scan if both conditions match many rows
--   b) Rewrite as a UNION of two indexed queries:
SELECT * FROM products WHERE category_id = 5
UNION
SELECT * FROM products WHERE brand_id = 10;
```

There's typically a "tipping point" where full scans become preferred, often around 5-30% of table rows depending on clustering. A query that selects 1% of rows with an index will be fast. The same query selecting 40% might use a full scan, and that's often the correct choice! Don't assume index usage is always better.
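Under simplified assumptions (one random page fetch per matching row, with the illustrative cost constants used earlier on this page), the tipping point can be computed directly. The exact percentage in a real system depends on clustering, caching, and the engine's cost model:

```python
# Where does an index scan's random I/O overtake a sequential full scan?
# Constants and table sizes are illustrative assumptions.
SEQ_PAGE_COST, RANDOM_PAGE_COST = 1.0, 4.0
TABLE_PAGES, TOTAL_ROWS = 500_000, 10_000_000

def full_scan_cost() -> float:
    return TABLE_PAGES * SEQ_PAGE_COST

def index_scan_cost(selectivity: float) -> float:
    # Pessimistic model: one random page fetch per matching row.
    matching_rows = TOTAL_ROWS * selectivity
    return matching_rows * RANDOM_PAGE_COST

for sel in (0.001, 0.01, 0.0125, 0.05, 0.30):
    choice = "index" if index_scan_cost(sel) < full_scan_cost() else "full scan"
    print(f"selectivity {sel:.2%}: {choice}")

# Break-even under this model: 500,000 / (10,000,000 × 4) = 1.25%.
# Better clustering (fewer random fetches per row) pushes it higher.
```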
The key tool for understanding optimizer decisions is the EXPLAIN command (covered in detail in Module 2). Here's a focused look at what to observe for index selection:
```sql
-- PostgreSQL: EXPLAIN ANALYZE with index information
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 12345;

-- Look for these indicators:

-- Index being used:
--   Index Scan using idx_customer_id on orders  (cost=0.43..12.50 rows=50 width=120)
--                    ^^^^^^^^^^^^^^^                   ^^^^^        ^^^^^
--                    Index name used                   Cost estimates

-- Index NOT used (full scan):
--   Seq Scan on orders  (cost=0.00..15000.00 rows=50 width=120)
--     Filter: (customer_id = 12345)
--     Rows Removed by Filter: 999950   <- Had to examine many rows

-- MySQL: EXPLAIN with type column
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
-- Key columns to check:
--   type: 'ref' or 'range' = index used, 'ALL' = full scan
--   possible_keys: indexes the optimizer considered
--   key: index actually chosen (NULL = no index)
--   rows: estimated rows to examine

-- SQL Server: Actual execution plan
SET STATISTICS IO ON;
SELECT * FROM orders WHERE customer_id = 12345;
-- Shows logical reads, index seeks vs. scans
```

Diagnosing Index Selection Problems
When investigating why an index isn't being used:

1. Run EXPLAIN to confirm which access path was chosen and what row counts the optimizer estimated.
2. Check whether statistics are stale and refresh them (ANALYZE / UPDATE STATISTICS).
3. Estimate the condition's actual selectivity; if a large fraction of rows match, a full scan may genuinely be cheaper.
4. Look for index-defeating patterns: functions on indexed columns, implicit type conversions, leading wildcards, and complex OR logic.
5. Consider clustering: a poorly clustered index can lose to a sequential scan even at moderate selectivity.
The optimizer is usually right. Before forcing an index (next page covers hints), verify with EXPLAIN ANALYZE that the optimizer's choice is actually slower. Sometimes what looks like a 'wrong' decision is correct for your data distribution.
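When verifying optimizer choices across many queries, EXPLAIN's text output can be scraped with a small script. The helper below is a hypothetical sketch; its regex covers only the simple plan-line shapes shown on this page, not real EXPLAIN output in general:

```python
import re

# Hypothetical helper: extract the access method, total cost, and row
# estimate from a PostgreSQL-style EXPLAIN text line.
PLAN_RE = re.compile(
    r"(?P<node>Index Only Scan|Index Scan|Bitmap Heap Scan|Seq Scan)"
    r"(?: using (?P<index>\S+))? on (?P<table>\S+)\s+"
    r"\(cost=[\d.]+\.\.(?P<total_cost>[\d.]+) rows=(?P<rows>\d+)"
)

def parse_plan_line(line: str) -> dict:
    m = PLAN_RE.search(line)
    if not m:
        raise ValueError("unrecognized plan line")
    d = m.groupdict()
    d["total_cost"] = float(d["total_cost"])
    d["rows"] = int(d["rows"])
    return d

print(parse_plan_line(
    "Index Scan using idx_customer_id on orders  (cost=0.43..12.50 rows=50 width=120)"))
print(parse_plan_line(
    "Seq Scan on orders  (cost=0.00..15000.00 rows=50 width=120)"))
```

A script like this makes it easy to flag queries whose chosen plan (`node`) or estimated `rows` drifts after a schema or statistics change.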
Understanding how the optimizer selects indexes transforms you from creating indexes blindly to designing them strategically. The key points:

- Optimizers are cost-based: they estimate and compare plan costs rather than follow fixed rules.
- Cost estimates are only as good as the statistics behind them; keep statistics fresh.
- Selectivity drives index choice: the smaller the matching fraction, the more an index helps.
- Composite indexes are usable only through a leftmost prefix, without gaps.
- Clustering determines how expensive index-driven row fetches are.
- EXPLAIN reveals which access path was chosen and why.
What's Next
Sometimes the optimizer makes suboptimal choices due to estimation errors or unusual data patterns. The next page covers index hints—mechanisms for guiding (or overriding) the optimizer's index selection when you know better than the statistics suggest.
You now understand how database optimizers evaluate and select indexes using cost-based analysis, statistics, and selectivity estimation. This knowledge enables you to design indexes that will actually be used and to diagnose cases where indexes are unexpectedly ignored.