We've now explored the major access methods: table scans, index scans, index-only scans, and bitmap scans. Each has distinct characteristics, strengths, and costs. But how does the database actually decide which method to use for any given query?
The answer lies in the query optimizer—a sophisticated component that evaluates available access paths, estimates the cost of each, and selects the cheapest option. This decision process happens in milliseconds, yet it determines whether a query takes 10 milliseconds or 10 minutes.
Understanding the selection process is crucial because it reveals why the optimizer sometimes makes surprising choices—and how to guide it toward better decisions when necessary.
By the end of this page, you will understand the complete decision process for access method selection, including cost estimation, the role of statistics, common reasons for suboptimal choices, and techniques for influencing the optimizer's decisions.
When the query optimizer evaluates a query, it performs a systematic analysis to select the best access method for each table. This process involves several stages:
Stage 1: Identify Candidate Access Methods
The optimizer first enumerates every access method that could produce the required rows: a sequential table scan (always available), plus an index scan, index-only scan, or bitmap scan for each index whose columns match the query's predicates.
Stage 2: Estimate Costs for Each Candidate
For each candidate, the optimizer estimates the I/O cost, the CPU cost, and the expected number of matching rows:
```
Query: SELECT * FROM orders WHERE customer_id = 1234

Available Indexes:
- idx_orders_customer (customer_id)
- idx_orders_date (order_date)
- idx_orders_customer_date (customer_id, order_date)

Candidate Access Methods:

1. Sequential Scan (Table Scan)
   - I/O: all 50,000 pages × 0.1 ms = 5,000 ms
   - CPU: 4,000,000 rows × filter cost
   - Total estimated cost: 5,200 units

2. Index Scan on idx_orders_customer
   - I/O: 3 index pages + 500 row fetches × random_io_cost
   - Estimated selectivity: 0.0125% (500 rows)
   - Total estimated cost: 1,800 units

3. Index Scan on idx_orders_customer_date (using prefix)
   - Same as above, but the index is larger
   - Total estimated cost: 1,850 units

4. Index-Only Scan
   - Cannot satisfy SELECT * (needs all columns)
   - Not applicable

5. Bitmap Scan on idx_orders_customer
   - I/O: index scan + 450 heap pages (sequential)
   - Total estimated cost: 650 units ← WINNER!

Selected: Bitmap Heap Scan with Bitmap Index Scan on idx_orders_customer
```

Stage 3: Compare and Select
The optimizer compares estimated costs and selects the lowest-cost option. If multiple methods have similar costs (within a threshold), additional heuristics may influence the choice.
Optimizer 'cost' is an abstract metric combining I/O, CPU, and other factors. It correlates with execution time but isn't a direct time prediction. The optimizer's goal is to minimize cost, not to predict exact duration.
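To show how such an abstract cost is assembled from its components, here is a minimal Python sketch of a PostgreSQL-style cost formula. The parameter values match PostgreSQL's documented defaults, but the formulas are deliberately simplified and the table shape (50,000 pages, 4,000,000 rows, as in the example above) is illustrative; the results will not reproduce the planner's real estimates, which include many more terms.

```python
# Simplified sketch of a cost model in the PostgreSQL style.
# Parameter defaults match PostgreSQL's documented values; the
# formulas omit most of what a real planner accounts for.

SEQ_PAGE_COST = 1.0      # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0   # cost to read one page randomly
CPU_TUPLE_COST = 0.01    # cost to process one row

def seq_scan_cost(pages: int, rows: int) -> float:
    """Every page is read sequentially; every row is processed."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(index_pages: int, matching_rows: int) -> float:
    """Crude sketch: traverse the index (random I/O), then fetch each
    matching row with one random heap access plus per-row CPU cost."""
    return (index_pages * RANDOM_PAGE_COST
            + matching_rows * (RANDOM_PAGE_COST + CPU_TUPLE_COST))

# Hypothetical 50,000-page table with 4,000,000 rows
print(seq_scan_cost(50_000, 4_000_000))  # 90000.0
print(index_scan_cost(3, 500))           # 2017.0
```

Notice that the output is a unitless number, not a duration: the point of the model is that 2,017 is far cheaper than 90,000, not that either predicts milliseconds.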
Cost estimation depends critically on statistics—metadata about table contents that the optimizer uses to predict query selectivity and result sizes. Without accurate statistics, the optimizer is essentially guessing.
Key Statistics Used:
```sql
-- View table-level statistics
SELECT relname,
       reltuples::bigint AS row_estimate,
       relpages AS page_count,
       relallvisible AS visible_pages
FROM pg_class
WHERE relname = 'orders';

-- View column-level statistics
SELECT attname AS column_name,
       n_distinct,
       most_common_vals,
       most_common_freqs,
       histogram_bounds,
       correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';

/*
Example output:

 column_name | n_distinct | most_common_vals  | most_common_freqs | correlation
-------------+------------+-------------------+-------------------+-------------
 customer_id | 10523      | {101,205,302,...} | {0.015,0.012,...} | 0.23

Interpretation:
- 10,523 distinct customer_id values
- Customer 101 appears in 1.5% of rows (most frequent)
- Correlation 0.23 = weak clustering (rows not in customer_id order)
*/

-- View index statistics
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';
```

Selectivity Estimation
Using statistics, the optimizer estimates selectivity—the fraction of rows that match a predicate:
| Predicate Type | Selectivity Estimation |
|---|---|
| Equality (col = val) | 1 / n_distinct, or MCV frequency if value in MCV list |
| Range (col > val) | Histogram-based interpolation |
| LIKE 'prefix%' | Estimated from histogram and pattern |
| IS NULL | null_frac for that column |
| AND | selectivity_A × selectivity_B (independence assumption) |
| OR | sel_A + sel_B - (sel_A × sel_B) |
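The rules in the table above can be made concrete with a short Python sketch. The statistics here (n_distinct, the MCV list) are hypothetical, and real planners use histograms for range predicates rather than the flat rules shown; this only illustrates the equality, AND, and OR cases.

```python
# Sketch of the selectivity rules from the table above.
# All statistics (n_distinct, MCV list) are hypothetical values.

n_distinct = 10_000
mcv = {101: 0.015, 205: 0.012}  # most-common values and their frequencies

def eq_selectivity(value) -> float:
    """Equality: use the MCV frequency if the value is in the MCV list,
    otherwise assume a uniform 1 / n_distinct."""
    return mcv.get(value, 1.0 / n_distinct)

def and_selectivity(sel_a: float, sel_b: float) -> float:
    """AND under the independence assumption: multiply."""
    return sel_a * sel_b

def or_selectivity(sel_a: float, sel_b: float) -> float:
    """OR via inclusion-exclusion (still assuming independence)."""
    return sel_a + sel_b - sel_a * sel_b

print(eq_selectivity(101))            # 0.015   (found in MCV list)
print(eq_selectivity(9999))           # 0.0001  (1 / n_distinct)
print(and_selectivity(0.02, 0.0001))  # 2e-06   (multiplied)
print(or_selectivity(0.02, 0.0001))   # ≈ 0.020098
```

The AND rule is exactly where the "correlated columns" problem discussed later comes from: multiplying selectivities is only valid when the predicates are truly independent.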
Outdated statistics lead to poor cost estimates and wrong access method choices. After significant data changes (bulk loads, mass updates, deletes), run ANALYZE to refresh statistics. Most databases have autovacuum/auto-analyze, but heavily modified tables may need manual analysis.
The optimizer uses configurable parameters to weight different cost components. Understanding these parameters helps explain optimizer decisions and allows tuning for specific hardware.
PostgreSQL Cost Parameters:
| Parameter | Default | Meaning |
|---|---|---|
| seq_page_cost | 1.0 | Cost to read one page sequentially |
| random_page_cost | 4.0 | Cost to read one page randomly |
| cpu_tuple_cost | 0.01 | Cost to process one row |
| cpu_index_tuple_cost | 0.005 | Cost to process one index entry |
| cpu_operator_cost | 0.0025 | Cost to execute one operator/function |
| effective_cache_size | 4GB | Assumed available memory for caching |
How Parameters Affect Access Method Selection:
The ratio between random_page_cost and seq_page_cost strongly influences when index scans are preferred over table scans:
```sql
-- Default settings are tuned for HDD (high random I/O penalty)
SHOW random_page_cost;  -- 4.0
SHOW seq_page_cost;     -- 1.0

-- For SSD storage, random I/O is much cheaper
-- Reduce random_page_cost to reflect this
SET random_page_cost = 1.1;  -- Only 10% more expensive than sequential

-- This makes the optimizer more willing to use index scans
-- at higher selectivity levels

-- For fully cached databases (data fits in RAM)
SET random_page_cost = 1.0;  -- No disk I/O, all memory access
SET seq_page_cost = 1.0;

-- Effective cache size affects index vs table scan choice
-- Higher values make optimizer assume more data is cached
SHOW effective_cache_size;          -- Default varies
SET effective_cache_size = '64GB';  -- For a server with 96GB RAM

-- Verify the change affects plans
EXPLAIN SELECT * FROM orders WHERE customer_id = 1234;
-- May now choose index scan where it previously chose table scan
```

PostgreSQL's default random_page_cost=4.0 was set when spinning disks were dominant. With NVMe SSDs and large buffer pools, random I/O is far less expensive. Many production systems benefit from random_page_cost between 1.0 and 1.5.
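The effect of this ratio can be quantified with a toy break-even calculation in Python: the selectivity at which an index scan stops beating a sequential scan rises as random_page_cost falls. The cost model here is deliberately crude (one random heap fetch per matching row, nothing else), and the table shape is hypothetical.

```python
# Toy break-even calculation: below what fraction of matching rows is a
# (crude) index scan cheaper than a sequential scan?  Hypothetical table.

def break_even_selectivity(pages: int, rows: int,
                           seq_page_cost: float,
                           random_page_cost: float) -> float:
    # seq scan cost:    pages * seq_page_cost
    # index scan cost:  matching_rows * random_page_cost
    #                   (one random heap fetch per matching row)
    # Break even when: selectivity * rows * random_page_cost = seq cost
    return (pages * seq_page_cost) / (rows * random_page_cost)

pages, rows = 50_000, 4_000_000
print(break_even_selectivity(pages, rows, 1.0, 4.0))  # 0.003125 → ~0.3%
print(break_even_selectivity(pages, rows, 1.0, 1.1))  # ≈ 0.01136 → ~1.1%
```

Under this toy model, dropping random_page_cost from 4.0 to 1.1 roughly triples the selectivity range in which the optimizer favors the index, which matches the qualitative behavior described above.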
Despite sophisticated cost modeling, optimizers sometimes choose suboptimal access methods. Understanding these failure modes helps you diagnose and fix performance problems.
Problem 1: Stale Statistics
```sql
-- Table had 1,000 rows when statistics were collected
-- Now it has 1,000,000 rows after bulk load

-- Optimizer thinks table scan is cheap (based on old 1,000 rows)
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
/*
Seq Scan on orders  (cost=0.00..25.00 rows=100 width=...)
  Filter: (status = 'pending')
*/

-- Reality: Query takes 30 seconds because table is now huge!

-- Fix: Update statistics
ANALYZE orders;

-- Now optimizer sees true size and chooses index
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
/*
Index Scan using idx_orders_status on orders  (cost=0.42..523.45 rows=75000)
  Index Cond: (status = 'pending')
*/
```

Problem 2: Correlated Columns (Independence Assumption)
Optimizers assume predicates are independent. This fails when columns are correlated:
```sql
-- city and zip_code are highly correlated
-- (knowing zip_code determines city)

-- Individual selectivities:
-- city = 'New York'  → 2% of rows
-- zip_code = '10001' → 0.01% of rows

-- Optimizer's estimate (assuming independence):
-- 2% × 0.01% = 0.0002% → ~200 rows

-- Reality (columns correlated):
-- Every row with zip 10001 IS in New York
-- Actual result: 10,000 rows

EXPLAIN ANALYZE
SELECT * FROM addresses
WHERE city = 'New York' AND zip_code = '10001';
/*
Index Scan on idx_addresses_zip
  (rows=200)           -- ESTIMATED
  (actual rows=10000)  -- ACTUAL: 50x higher!
*/

-- Fix: Use extended statistics (PostgreSQL 10+)
CREATE STATISTICS stat_city_zip ON city, zip_code FROM addresses;
ANALYZE addresses;
-- Optimizer now understands the correlation
```

Problem 3: Parameterized Queries with Different Skew
Prepared statements use a generic plan that may not suit all parameter values:
```sql
-- 99% of orders belong to customer_id 1-10000
-- 1% of orders belong to customer_id 10001+ (VIP bulk buyers)

-- Prepared statement:
--   SELECT * FROM orders WHERE customer_id = $1

-- For customer_id = 5000 (normal customer): ~100 rows, index scan optimal
-- For customer_id = 10001 (VIP with 500,000 orders): table scan might be faster!

-- But a prepared statement may use one generic plan for all executions
-- A generic plan chosen may be wrong for edge cases

-- PostgreSQL mitigations:
-- - By default, the first five executions are planned with the actual
--   parameter values; a generic plan is adopted only if it is not
--   estimated to be more expensive than those custom plans
-- - Use plan_cache_mode = force_custom_plan for problematic queries
```

When the optimizer makes suboptimal choices, you have several techniques to guide it toward better decisions.
Technique 1: Update Statistics
Always the first step. Ensure the optimizer has accurate information:
```sql
-- Basic statistics update
ANALYZE orders;

-- Same collection with progress output
-- (VERBOSE shows what was analyzed; it does not improve accuracy)
ANALYZE VERBOSE orders;

-- Increase statistics target for high-cardinality columns
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
-- Default is 100; higher = more detailed histogram, better estimates
ANALYZE orders;

-- Check autovacuum is running
SELECT relname, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- If autovacuum is behind, trigger manually
VACUUM ANALYZE orders;
```

Technique 2: Create Better Indexes
Sometimes the optimizer chooses a table scan because no suitable index exists:
```sql
-- Query isn't using an index
EXPLAIN SELECT * FROM orders
WHERE status = 'pending' AND created_at > '2024-01-01';
-- Shows: Seq Scan with Filter on both conditions

-- No composite index exists; create one
CREATE INDEX idx_orders_status_date ON orders(status, created_at);

ANALYZE orders;

-- Now optimizer can use index
EXPLAIN SELECT * FROM orders
WHERE status = 'pending' AND created_at > '2024-01-01';
-- Shows: Index Scan using idx_orders_status_date

-- For index-only scans, include needed columns
CREATE INDEX idx_orders_covering ON orders(status, created_at)
INCLUDE (customer_id, total_amount);
```

Technique 3: Query Hints (Database-Specific)
Some databases allow explicit hints to override optimizer choices. Use sparingly—hints bypass cost-based optimization:
```sql
-- === PostgreSQL (no direct hints, but workarounds exist) ===

-- Disable sequential scans to force index usage (session level)
SET enable_seqscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 1234;
-- Forces index scan even if optimizer prefers seq scan
SET enable_seqscan = on;  -- Reset

-- Disable bitmap scans
SET enable_bitmapscan = off;

-- Adjust cost parameters to influence choice
SET random_page_cost = 1.0;  -- Makes index scans cheaper

-- === MySQL ===

-- Force index usage
SELECT * FROM orders FORCE INDEX (idx_orders_customer)
WHERE customer_id = 1234 OR status = 'pending';

-- Ignore specific index
SELECT * FROM orders IGNORE INDEX (idx_orders_status)
WHERE customer_id = 1234;

-- Hint for join order
SELECT /*+ JOIN_ORDER(o, c) */ * FROM orders o JOIN customers c ...

-- === Oracle ===

-- Hint for full table scan
SELECT /*+ FULL(orders) */ * FROM orders WHERE customer_id = 1234;

-- Hint for specific index
SELECT /*+ INDEX(orders idx_orders_customer) */ * FROM orders WHERE ...

-- === SQL Server ===

-- Table hint for index
SELECT * FROM orders WITH (INDEX(idx_orders_customer))
WHERE customer_id = 1234;

-- Force recompile for parameter-sensitive queries
SELECT * FROM orders WHERE customer_id = @cust_id OPTION (RECOMPILE);
```

Query hints lock the plan to specific choices, preventing the optimizer from adapting to data changes. Use hints only after exhausting other options (statistics, indexes, cost parameters). Document why the hint is necessary, and review hints periodically.
The EXPLAIN command is your window into optimizer decision-making. Learning to read execution plans reveals why specific access methods were chosen.
Key EXPLAIN Elements for Access Method Analysis:
```sql
-- PostgreSQL: EXPLAIN with all details
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT TEXT)
SELECT customer_id, order_date, total_amount
FROM orders
WHERE customer_id = 1234 AND order_date > '2024-01-01';

/*
Index Scan using idx_orders_customer_date on public.orders
  (cost=0.43..156.78 rows=142 width=24)
  (actual time=0.023..0.456 rows=138 loops=1)
  Output: customer_id, order_date, total_amount
  Index Cond: ((orders.customer_id = 1234) AND (orders.order_date > '2024-01-01'))
  Buffers: shared hit=45 read=3
Planning Time: 0.234 ms
Execution Time: 0.567 ms

Key observations:
1. "Index Scan using idx_orders_customer_date" = chosen access method
2. "cost=0.43..156.78" = optimizer's estimated cost (startup..total)
3. "rows=142" estimated vs "rows=138" actual = good estimation
4. "Index Cond" = conditions pushed to index (efficient)
5. "Buffers: shared hit=45 read=3" = mostly cached, 3 pages from disk
*/

-- Compare alternative access methods by disabling options
SET enable_indexscan = off;
SET enable_bitmapscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, order_date, total_amount
FROM orders
WHERE customer_id = 1234 AND order_date > '2024-01-01';

/*
Seq Scan on orders
  (cost=0.00..45000.00 rows=142 width=24)
  (actual time=0.023..890.456 rows=138 loops=1)
  Filter: ((customer_id = 1234) AND (order_date > '2024-01-01'))
  Rows Removed by Filter: 3999862
  Buffers: shared hit=41234 read=3766
*/

-- Cost comparison:
-- Index Scan: 156.78 cost, 0.567 ms actual
-- Seq Scan:   45000 cost, 890 ms actual
-- Optimizer correctly chose index scan (~290× lower cost estimate, ~1570× faster)

-- Reset settings
RESET enable_indexscan;
RESET enable_bitmapscan;
```

Identifying Why Optimizer Chose Differently Than Expected:
Let's examine realistic scenarios where access method selection significantly impacts performance.
```sql
-- Scenario: Admin dashboard showing recent order statistics
-- Problem: Query takes 15 seconds during peak hours

-- Original query
SELECT status, COUNT(*) AS count, SUM(total_amount) AS total
FROM orders
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY status;

-- EXPLAIN shows:
EXPLAIN ANALYZE ...
/*
HashAggregate (actual time=15234.567..15234.789)
  -> Seq Scan on orders (actual time=0.023..14567.890 rows=12456)
       Filter: (created_at > ...)
       Rows Removed by Filter: 9987544
*/

-- Problem: Scanning 10M rows to find 12K recent ones!

-- Solution 1: Create targeted index
CREATE INDEX idx_orders_recent ON orders(created_at)
WHERE created_at > '2024-01-01';  -- Partial index for recent data

-- Solution 2: Covering index for index-only scan
CREATE INDEX idx_orders_dashboard ON orders(created_at, status)
INCLUDE (total_amount);

ANALYZE orders;

-- New plan:
/*
HashAggregate (actual time=45.123..45.234)
  -> Index Only Scan using idx_orders_dashboard
       (actual time=0.034..23.456 rows=12456)
       Index Cond: (created_at > ...)
       Heap Fetches: 0
*/

-- Result: 15 seconds → 45 milliseconds (~300× improvement)
```

```sql
-- Scenario: Product search with multiple optional filters
-- Problem: Sometimes fast, sometimes slow depending on filters

SELECT product_id, name, price
FROM products
WHERE category = 'electronics'     -- 100K products
  AND brand = 'samsung'            -- 20K products
  AND price BETWEEN 100 AND 500    -- 500K products
  AND rating >= 4.0                -- 300K products
  AND in_stock = true;             -- 800K products

-- When all filters apply: ~500 products (bitmap AND works great)
-- When only category: 100K products (index scan on category)
-- When only price: 500K products (table scan might be better!)

-- Problem: Optimizer creates different plans, some suboptimal

-- Solution: Create composite indexes for common filter combinations
CREATE INDEX idx_products_cat_brand ON products(category, brand)
INCLUDE (price, rating, in_stock);

CREATE INDEX idx_products_cat_price ON products(category, price)
INCLUDE (brand, rating, in_stock);

-- Now optimizer has better options for different filter combos
```

```sql
-- Scenario: Optimizer chooses index scan but table scan would be faster

-- Query for active users (90% of users are active!)
SELECT * FROM users WHERE is_active = true;

-- Optimizer sees index on is_active, estimates 50% selectivity (wrong!)
-- Chooses index scan, but with 90% selectivity, table scan is better

EXPLAIN ANALYZE SELECT * FROM users WHERE is_active = true;
/*
Index Scan using idx_users_active
  (actual time=0.023..2345.678 rows=900000)
  -- Lots of random I/O for 900K rows!
*/

-- Solutions:

-- 1. Update statistics with more detail
ALTER TABLE users ALTER COLUMN is_active SET STATISTICS 1000;
ANALYZE users;

-- 2. If statistics still wrong, consider partial index for minority case
DROP INDEX idx_users_active;
CREATE INDEX idx_users_inactive ON users(is_active) WHERE is_active = false;
-- Now inactive queries use index, active queries use table scan

-- 3. Adjust optimizer assumptions
SET random_page_cost = 1.0;  -- Reduces preference against index at high selectivity
```

You now have comprehensive knowledge of how query optimizers select access methods and how to influence their decisions. Here are the essential takeaways:
Module Complete:
This concludes the Access Methods module. You now understand the complete landscape of how databases retrieve data—from simple table scans through sophisticated bitmap operations—and how query optimizers choose among these options.
This knowledge is fundamental for the remaining modules in this chapter, which cover implementing specific relational operators (selection, joins, aggregation) using these access methods as building blocks.
You have mastered the Access Methods module—understanding table scans, index scans, index-only scans, bitmap scans, and the optimizer's selection process. This knowledge enables you to diagnose access method problems, design effective indexing strategies, and optimize query performance at a fundamental level.