We've now explored the major access methods: table scans, index scans, index-only scans, and bitmap scans. Each has distinct characteristics, strengths, and costs. But how does the database actually decide which method to use for any given query?
The answer lies in the query optimizer—a sophisticated component that evaluates available access paths, estimates the cost of each, and selects the cheapest option. This decision process happens in milliseconds, yet it determines whether a query takes 10 milliseconds or 10 minutes.
Understanding the selection process is crucial because it reveals why the optimizer sometimes makes surprising choices—and how to guide it toward better decisions when necessary.
By the end of this page, you will understand the complete decision process for access method selection, including cost estimation, the role of statistics, common reasons for suboptimal choices, and techniques for influencing the optimizer's decisions.
When the query optimizer evaluates a query, it performs a systematic analysis to select the best access method for each table. This process involves several stages:
Stage 1: Identify Candidate Access Methods
The optimizer first enumerates every access method that could produce the required rows: a sequential table scan (always available), plus an index scan, index-only scan, or bitmap scan for each index whose columns match the query's predicates.
Stage 2: Estimate Costs for Each Candidate
For each candidate, the optimizer estimates the I/O cost, the CPU cost, and the expected number of matching rows:
```
Query: SELECT * FROM orders WHERE customer_id = 1234

Available Indexes:
- idx_orders_customer (customer_id)
- idx_orders_date (order_date)
- idx_orders_customer_date (customer_id, order_date)

Candidate Access Methods:

1. Sequential Scan (Table Scan)
   - I/O: all 50,000 pages × 0.1 ms = 5,000 ms
   - CPU: 4,000,000 rows × filter cost
   - Total estimated cost: 5,200 units

2. Index Scan on idx_orders_customer
   - I/O: 3 index pages + 500 row fetches × random_io_cost
   - Estimated selectivity: 0.0125% (500 rows)
   - Total estimated cost: 1,800 units

3. Index Scan on idx_orders_customer_date (using prefix)
   - Same as above, but the index is larger
   - Total estimated cost: 1,850 units

4. Index-Only Scan
   - Cannot satisfy SELECT * (needs all columns)
   - Not applicable

5. Bitmap Scan on idx_orders_customer
   - I/O: index scan + 450 heap pages (sequential)
   - Total estimated cost: 650 units ← WINNER!

Selected: Bitmap Heap Scan with Bitmap Index Scan on idx_orders_customer
```

Stage 3: Compare and Select
The optimizer compares estimated costs and selects the lowest-cost option. If multiple methods have similar costs (within a threshold), additional heuristics may influence the choice.
Optimizer 'cost' is an abstract metric combining I/O, CPU, and other factors. It correlates with execution time but isn't a direct time prediction. The optimizer's goal is to minimize cost, not to predict exact duration.
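To show how such an abstract cost is assembled from its components, here is a minimal Python sketch of a PostgreSQL-style cost formula. The parameter values match PostgreSQL's documented defaults, but the formulas are deliberately simplified and the table shape (50,000 pages, 4,000,000 rows, as in the example above) is illustrative; the results will not reproduce the planner's real estimates, which include many more terms.

```python
# Simplified sketch of a cost model in the PostgreSQL style.
# Parameter defaults match PostgreSQL's documented values; the
# formulas omit most of what a real planner accounts for.

SEQ_PAGE_COST = 1.0      # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0   # cost to read one page randomly
CPU_TUPLE_COST = 0.01    # cost to process one row

def seq_scan_cost(pages: int, rows: int) -> float:
    """Every page is read sequentially; every row is processed."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(index_pages: int, matching_rows: int) -> float:
    """Crude sketch: traverse the index (random I/O), then fetch each
    matching row with one random heap access plus per-row CPU cost."""
    return (index_pages * RANDOM_PAGE_COST
            + matching_rows * (RANDOM_PAGE_COST + CPU_TUPLE_COST))

# Hypothetical 50,000-page table with 4,000,000 rows
print(seq_scan_cost(50_000, 4_000_000))  # 90000.0
print(index_scan_cost(3, 500))           # 2017.0
```

Notice that the output is a unitless number, not a duration: the point of the model is that 2,017 is far cheaper than 90,000, not that either predicts milliseconds.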
Cost estimation depends critically on statistics—metadata about table contents that the optimizer uses to predict query selectivity and result sizes. Without accurate statistics, the optimizer is essentially guessing.
Key Statistics Used:
```sql
-- View table-level statistics
SELECT relname,
       reltuples::bigint AS row_estimate,
       relpages AS page_count,
       relallvisible AS visible_pages
FROM pg_class
WHERE relname = 'orders';

-- View column-level statistics
SELECT attname AS column_name,
       n_distinct,
       most_common_vals,
       most_common_freqs,
       histogram_bounds,
       correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';

/*
Example output:

 column_name | n_distinct | most_common_vals  | most_common_freqs | correlation
-------------+------------+-------------------+-------------------+-------------
 customer_id | 10523      | {101,205,302,...} | {0.015,0.012,...} | 0.23

Interpretation:
- 10,523 distinct customer_id values
- Customer 101 appears in 1.5% of rows (most frequent)
- Correlation 0.23 = weak clustering (rows not in customer_id order)
*/

-- View index statistics
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';
```

Selectivity Estimation
Using statistics, the optimizer estimates selectivity—the fraction of rows that match a predicate:
| Predicate Type | Selectivity Estimation |
|---|---|
| Equality (col = val) | 1 / n_distinct, or MCV frequency if value in MCV list |
| Range (col > val) | Histogram-based interpolation |
| LIKE 'prefix%' | Estimated from histogram and pattern |
| IS NULL | null_frac for that column |
| AND | selectivity_A × selectivity_B (independence assumption) |
| OR | sel_A + sel_B - (sel_A × sel_B) |
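The rules in the table above can be made concrete with a short Python sketch. The statistics here (n_distinct, the MCV list) are hypothetical, and real planners use histograms for range predicates rather than the flat rules shown; this only illustrates the equality, AND, and OR cases.

```python
# Sketch of the selectivity rules from the table above.
# All statistics (n_distinct, MCV list) are hypothetical values.

n_distinct = 10_000
mcv = {101: 0.015, 205: 0.012}  # most-common values and their frequencies

def eq_selectivity(value) -> float:
    """Equality: use the MCV frequency if the value is in the MCV list,
    otherwise assume a uniform 1 / n_distinct."""
    return mcv.get(value, 1.0 / n_distinct)

def and_selectivity(sel_a: float, sel_b: float) -> float:
    """AND under the independence assumption: multiply."""
    return sel_a * sel_b

def or_selectivity(sel_a: float, sel_b: float) -> float:
    """OR via inclusion-exclusion (still assuming independence)."""
    return sel_a + sel_b - sel_a * sel_b

print(eq_selectivity(101))            # 0.015   (found in MCV list)
print(eq_selectivity(9999))           # 0.0001  (1 / n_distinct)
print(and_selectivity(0.02, 0.0001))  # 2e-06   (multiplied)
print(or_selectivity(0.02, 0.0001))   # ≈ 0.020098
```

The AND rule is exactly where the "correlated columns" problem discussed later comes from: multiplying selectivities is only valid when the predicates are truly independent.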
Outdated statistics lead to poor cost estimates and wrong access method choices. After significant data changes (bulk loads, mass updates, deletes), run ANALYZE to refresh statistics. Most databases have autovacuum/auto-analyze, but heavily modified tables may need manual analysis.
The optimizer uses configurable parameters to weight different cost components. Understanding these parameters helps explain optimizer decisions and allows tuning for specific hardware.
PostgreSQL Cost Parameters:
| Parameter | Default | Meaning |
|---|---|---|
| seq_page_cost | 1.0 | Cost to read one page sequentially |
| random_page_cost | 4.0 | Cost to read one page randomly |
| cpu_tuple_cost | 0.01 | Cost to process one row |
| cpu_index_tuple_cost | 0.005 | Cost to process one index entry |
| cpu_operator_cost | 0.0025 | Cost to execute one operator/function |
| effective_cache_size | 4GB | Assumed available memory for caching |
How Parameters Affect Access Method Selection:
The ratio between random_page_cost and seq_page_cost strongly influences when index scans are preferred over table scans:
```sql
-- Default settings are tuned for HDD (high random I/O penalty)
SHOW random_page_cost;  -- 4.0
SHOW seq_page_cost;     -- 1.0

-- For SSD storage, random I/O is much cheaper
-- Reduce random_page_cost to reflect this
SET random_page_cost = 1.1;  -- Only 10% more expensive than sequential

-- This makes the optimizer more willing to use index scans
-- at higher selectivity levels

-- For fully cached databases (data fits in RAM)
SET random_page_cost = 1.0;  -- No disk I/O, all memory access
SET seq_page_cost = 1.0;

-- Effective cache size affects index vs table scan choice
-- Higher values make optimizer assume more data is cached
SHOW effective_cache_size;          -- Default varies
SET effective_cache_size = '64GB';  -- For a server with 96GB RAM

-- Verify the change affects plans
EXPLAIN SELECT * FROM orders WHERE customer_id = 1234;
-- May now choose index scan where it previously chose table scan
```

PostgreSQL's default random_page_cost=4.0 was set when spinning disks were dominant. With NVMe SSDs and large buffer pools, random I/O is far less expensive. Many production systems benefit from random_page_cost between 1.0 and 1.5.
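The effect of this ratio can be quantified with a toy break-even calculation in Python: the selectivity at which an index scan stops beating a sequential scan rises as random_page_cost falls. The cost model here is deliberately crude (one random heap fetch per matching row, nothing else), and the table shape is hypothetical.

```python
# Toy break-even calculation: below what fraction of matching rows is a
# (crude) index scan cheaper than a sequential scan?  Hypothetical table.

def break_even_selectivity(pages: int, rows: int,
                           seq_page_cost: float,
                           random_page_cost: float) -> float:
    # seq scan cost:    pages * seq_page_cost
    # index scan cost:  matching_rows * random_page_cost
    #                   (one random heap fetch per matching row)
    # Break even when: selectivity * rows * random_page_cost = seq cost
    return (pages * seq_page_cost) / (rows * random_page_cost)

pages, rows = 50_000, 4_000_000
print(break_even_selectivity(pages, rows, 1.0, 4.0))  # 0.003125 → ~0.3%
print(break_even_selectivity(pages, rows, 1.0, 1.1))  # ≈ 0.01136 → ~1.1%
```

Under this toy model, dropping random_page_cost from 4.0 to 1.1 roughly triples the selectivity range in which the optimizer favors the index, which matches the qualitative behavior described above.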
Despite sophisticated cost modeling, optimizers sometimes choose suboptimal access methods. Understanding these failure modes helps you diagnose and fix performance problems.
Problem 1: Stale Statistics
```sql
-- Table had 1,000 rows when statistics were collected
-- Now it has 1,000,000 rows after bulk load

-- Optimizer thinks table scan is cheap (based on old 1,000 rows)
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
/*
Seq Scan on orders  (cost=0.00..25.00 rows=100 width=...)
  Filter: (status = 'pending')
*/

-- Reality: Query takes 30 seconds because table is now huge!

-- Fix: Update statistics
ANALYZE orders;

-- Now optimizer sees true size and chooses index
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
/*
Index Scan using idx_orders_status on orders  (cost=0.42..523.45 rows=75000)
  Index Cond: (status = 'pending')
*/
```

Problem 2: Correlated Columns (Independence Assumption)
Optimizers assume predicates are independent. This fails when columns are correlated:
```sql
-- city and zip_code are highly correlated
-- (knowing zip_code determines city)

-- Individual selectivities:
-- city = 'New York'  → 2% of rows
-- zip_code = '10001' → 0.01% of rows

-- Optimizer's estimate (assuming independence):
-- 2% × 0.01% = 0.0002% → ~200 rows

-- Reality (columns correlated):
-- Every row with zip 10001 IS in New York
-- Actual result: 10,000 rows

EXPLAIN ANALYZE
SELECT * FROM addresses
WHERE city = 'New York' AND zip_code = '10001';
/*
Index Scan on idx_addresses_zip
  (rows=200)           -- ESTIMATED
  (actual rows=10000)  -- ACTUAL: 50x higher!
*/

-- Fix: Use extended statistics (PostgreSQL 10+)
CREATE STATISTICS stat_city_zip ON city, zip_code FROM addresses;
ANALYZE addresses;
-- Optimizer now understands the correlation
```

Problem 3: Parameterized Queries with Different Skew
Prepared statements use a generic plan that may not suit all parameter values:
```sql
-- 99% of orders belong to customer_id 1-10000
-- 1% of orders belong to customer_id 10001+ (VIP bulk buyers)

-- Prepared statement:
--   SELECT * FROM orders WHERE customer_id = $1

-- For customer_id = 5000 (normal customer): ~100 rows, index scan optimal
-- For customer_id = 10001 (VIP with 500,000 orders): table scan might be faster!

-- But a prepared statement may use one generic plan for all executions
-- A generic plan chosen may be wrong for edge cases

-- PostgreSQL mitigations:
-- - By default, the first five executions are planned with the actual
--   parameter values; a generic plan is adopted only if it is not
--   estimated to be more expensive than those custom plans
-- - Use plan_cache_mode = force_custom_plan for problematic queries
```

When the optimizer makes suboptimal choices, you have several techniques to guide it toward better decisions.
Technique 1: Update Statistics
Always the first step. Ensure the optimizer has accurate information:
```sql
-- Basic statistics update
ANALYZE orders;

-- Same collection with progress output
-- (VERBOSE shows what was analyzed; it does not improve accuracy)
ANALYZE VERBOSE orders;

-- Increase statistics target for high-cardinality columns
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
-- Default is 100; higher = more detailed histogram, better estimates
ANALYZE orders;

-- Check autovacuum is running
SELECT relname, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- If autovacuum is behind, trigger manually
VACUUM ANALYZE orders;
```

Technique 2: Create Better Indexes
Sometimes the optimizer chooses a table scan because no suitable index exists:
```sql
-- Query isn't using an index
EXPLAIN SELECT * FROM orders
WHERE status = 'pending' AND created_at > '2024-01-01';
-- Shows: Seq Scan with Filter on both conditions

-- No composite index exists; create one
CREATE INDEX idx_orders_status_date ON orders(status, created_at);

ANALYZE orders;

-- Now optimizer can use index
EXPLAIN SELECT * FROM orders
WHERE status = 'pending' AND created_at > '2024-01-01';
-- Shows: Index Scan using idx_orders_status_date

-- For index-only scans, include needed columns
CREATE INDEX idx_orders_covering ON orders(status, created_at)
INCLUDE (customer_id, total_amount);
```

Technique 3: Query Hints (Database-Specific)
Some databases allow explicit hints to override optimizer choices. Use sparingly—hints bypass cost-based optimization:
```sql
-- === PostgreSQL (no direct hints, but workarounds exist) ===

-- Disable sequential scans to force index usage (session level)
SET enable_seqscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 1234;
-- Forces index scan even if optimizer prefers seq scan
SET enable_seqscan = on;  -- Reset

-- Disable bitmap scans
SET enable_bitmapscan = off;

-- Adjust cost parameters to influence choice
SET random_page_cost = 1.0;  -- Makes index scans cheaper

-- === MySQL ===

-- Force index usage
SELECT * FROM orders FORCE INDEX (idx_orders_customer)
WHERE customer_id = 1234 OR status = 'pending';

-- Ignore specific index
SELECT * FROM orders IGNORE INDEX (idx_orders_status)
WHERE customer_id = 1234;

-- Hint for join order
SELECT /*+ JOIN_ORDER(o, c) */ * FROM orders o JOIN customers c ...

-- === Oracle ===

-- Hint for full table scan
SELECT /*+ FULL(orders) */ * FROM orders WHERE customer_id = 1234;

-- Hint for specific index
SELECT /*+ INDEX(orders idx_orders_customer) */ * FROM orders WHERE ...

-- === SQL Server ===

-- Table hint for index
SELECT * FROM orders WITH (INDEX(idx_orders_customer))
WHERE customer_id = 1234;

-- Force recompile for parameter-sensitive queries
SELECT * FROM orders WHERE customer_id = @cust_id OPTION (RECOMPILE);
```

Query hints lock the plan to specific choices, preventing the optimizer from adapting to data changes. Use hints only after exhausting other options (statistics, indexes, cost parameters). Document why the hint is necessary, and review hints periodically.
The EXPLAIN command is your window into optimizer decision-making. Learning to read execution plans reveals why specific access methods were chosen.
Key EXPLAIN Elements for Access Method Analysis:
```sql
-- PostgreSQL: EXPLAIN with all details
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT TEXT)
SELECT customer_id, order_date, total_amount
FROM orders
WHERE customer_id = 1234 AND order_date > '2024-01-01';

/*
Index Scan using idx_orders_customer_date on public.orders
  (cost=0.43..156.78 rows=142 width=24)
  (actual time=0.023..0.456 rows=138 loops=1)
  Output: customer_id, order_date, total_amount
  Index Cond: ((orders.customer_id = 1234) AND (orders.order_date > '2024-01-01'))
  Buffers: shared hit=45 read=3
Planning Time: 0.234 ms
Execution Time: 0.567 ms

Key observations:
1. "Index Scan using idx_orders_customer_date" = chosen access method
2. "cost=0.43..156.78" = optimizer's estimated cost (startup..total)
3. "rows=142" estimated vs "rows=138" actual = good estimation
4. "Index Cond" = conditions pushed to index (efficient)
5. "Buffers: shared hit=45 read=3" = mostly cached, 3 pages from disk
*/

-- Compare alternative access methods by disabling options
SET enable_indexscan = off;
SET enable_bitmapscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, order_date, total_amount
FROM orders
WHERE customer_id = 1234 AND order_date > '2024-01-01';

/*
Seq Scan on orders
  (cost=0.00..45000.00 rows=142 width=24)
  (actual time=0.023..890.456 rows=138 loops=1)
  Filter: ((customer_id = 1234) AND (order_date > '2024-01-01'))
  Rows Removed by Filter: 3999862
  Buffers: shared hit=41234 read=3766
*/

-- Cost comparison:
-- Index Scan: 156.78 cost, 0.567 ms actual
-- Seq Scan:   45000 cost, 890 ms actual
-- Optimizer correctly chose index scan (~290× lower cost estimate, ~1570× faster)

-- Reset settings
RESET enable_indexscan;
RESET enable_bitmapscan;
```

Identifying Why Optimizer Chose Differently Than Expected:
Let's examine realistic scenarios where access method selection significantly impacts performance.
```sql
-- Scenario: Admin dashboard showing recent order statistics
-- Problem: Query takes 15 seconds during peak hours

-- Original query
SELECT status, COUNT(*) AS count, SUM(total_amount) AS total
FROM orders
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY status;

-- EXPLAIN shows:
EXPLAIN ANALYZE ...
/*
HashAggregate (actual time=15234.567..15234.789)
  -> Seq Scan on orders (actual time=0.023..14567.890 rows=12456)
       Filter: (created_at > ...)
       Rows Removed by Filter: 9987544
*/

-- Problem: Scanning 10M rows to find 12K recent ones!

-- Solution 1: Create targeted index
CREATE INDEX idx_orders_recent ON orders(created_at)
WHERE created_at > '2024-01-01';  -- Partial index for recent data

-- Solution 2: Covering index for index-only scan
CREATE INDEX idx_orders_dashboard ON orders(created_at, status)
INCLUDE (total_amount);

ANALYZE orders;

-- New plan:
/*
HashAggregate (actual time=45.123..45.234)
  -> Index Only Scan using idx_orders_dashboard
       (actual time=0.034..23.456 rows=12456)
       Index Cond: (created_at > ...)
       Heap Fetches: 0
*/

-- Result: 15 seconds → 45 milliseconds (~300× improvement)
```

```sql
-- Scenario: Product search with multiple optional filters
-- Problem: Sometimes fast, sometimes slow depending on filters

SELECT product_id, name, price
FROM products
WHERE category = 'electronics'     -- 100K products
  AND brand = 'samsung'            -- 20K products
  AND price BETWEEN 100 AND 500    -- 500K products
  AND rating >= 4.0                -- 300K products
  AND in_stock = true;             -- 800K products

-- When all filters apply: ~500 products (bitmap AND works great)
-- When only category: 100K products (index scan on category)
-- When only price: 500K products (table scan might be better!)

-- Problem: Optimizer creates different plans, some suboptimal

-- Solution: Create composite indexes for common filter combinations
CREATE INDEX idx_products_cat_brand ON products(category, brand)
INCLUDE (price, rating, in_stock);

CREATE INDEX idx_products_cat_price ON products(category, price)
INCLUDE (brand, rating, in_stock);

-- Now optimizer has better options for different filter combos
```

```sql
-- Scenario: Optimizer chooses index scan but table scan would be faster

-- Query for active users (90% of users are active!)
SELECT * FROM users WHERE is_active = true;

-- Optimizer sees index on is_active, estimates 50% selectivity (wrong!)
-- Chooses index scan, but with 90% selectivity, table scan is better

EXPLAIN ANALYZE SELECT * FROM users WHERE is_active = true;
/*
Index Scan using idx_users_active
  (actual time=0.023..2345.678 rows=900000)
  -- Lots of random I/O for 900K rows!
*/

-- Solutions:

-- 1. Update statistics with more detail
ALTER TABLE users ALTER COLUMN is_active SET STATISTICS 1000;
ANALYZE users;

-- 2. If statistics still wrong, consider partial index for minority case
DROP INDEX idx_users_active;
CREATE INDEX idx_users_inactive ON users(is_active) WHERE is_active = false;
-- Now inactive queries use index, active queries use table scan

-- 3. Adjust optimizer assumptions
SET random_page_cost = 1.0;  -- Reduces preference against index at high selectivity
```

You now have comprehensive knowledge of how query optimizers select access methods and how to influence their decisions. Here are the essential takeaways:
Module Complete:
This concludes the Access Methods module. You now understand the complete landscape of how databases retrieve data—from simple table scans through sophisticated bitmap operations—and how query optimizers choose among these options.
This knowledge is fundamental for the remaining modules in this chapter, which cover implementing specific relational operators (selection, joins, aggregation) using these access methods as building blocks.
You have mastered the Access Methods module—understanding table scans, index scans, index-only scans, bitmap scans, and the optimizer's selection process. This knowledge enables you to diagnose access method problems, design effective indexing strategies, and optimize query performance at a fundamental level.