The query optimizer faces a paradox: it must select the fastest execution plan without actually executing any plans. Running every candidate plan to measure its real cost would defeat the purpose of optimization—by the time you knew which plan was best, you would have executed all of them.
The solution is cost-based selection: estimating the cost of each candidate plan using mathematical models and statistical information about the data. The optimizer computes predicted costs for all enumerated plans and selects the one with the lowest estimate.
This approach works remarkably well in practice, but it's not infallible. Cost estimation is inherently imprecise, and estimation errors can lead to suboptimal—sometimes catastrophically bad—plan choices. Understanding how cost estimation works, where it fails, and how to recognize and address failures is essential knowledge for database professionals.
By the end of this page, you will understand the components of cost models (I/O cost, CPU cost, memory cost), how cardinality estimation underlies all cost calculations, the catalog statistics that feed cost models, selectivity estimation for various predicate types, and the challenges that cause estimation to fail. This knowledge enables effective query tuning and explains optimizer behavior.
A cost model is a mathematical framework that assigns a numeric cost to each execution plan. The cost represents the estimated resources required to execute the plan—typically measured in time units or abstract cost units that correlate with execution time.
Execution cost is typically decomposed into several components:
| Component | Represents | Example Factors |
|---|---|---|
| I/O Cost | Disk access time | Pages read, sequential vs. random |
| CPU Cost | Processing time | Tuples processed, predicate complexity |
| Memory Cost | Memory pressure | Hash table sizes, sort buffers |
| Network Cost | Data transfer (distributed) | Bytes transmitted, latency |
| Startup Cost | Initialization overhead | Opening files, building hash tables |
Total cost is typically a weighted sum:
$$Cost = w_{io} \cdot C_{io} + w_{cpu} \cdot C_{cpu} + w_{mem} \cdot C_{mem} + w_{net} \cdot C_{net}$$
The weights reflect the relative importance of each resource in the specific system.
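As a small illustration of the weighted sum, here is a minimal sketch; the weight and component values are hypothetical, not taken from any particular system:

```python
# Hypothetical calibrated weights and per-plan component estimates.
weights = {"io": 1.0, "cpu": 0.25, "mem": 0.1, "net": 2.0}
components = {"io": 1200.0, "cpu": 450.0, "mem": 80.0, "net": 0.0}

# Cost = w_io*C_io + w_cpu*C_cpu + w_mem*C_mem + w_net*C_net
total_cost = sum(weights[k] * components[k] for k in weights)
print(total_cost)  # 1320.5
```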
Traditionally, disk I/O dominated cost—reading one page from a mechanical disk took 10-20 milliseconds, while processing a tuple took microseconds. A cost model focused purely on I/O was sufficient:
$$Cost \approx \text{Pages Read} + \alpha \cdot \text{Pages Written}, \quad \alpha > 1$$
(Writes are more expensive due to logging and durability requirements.)
With SSDs and large in-memory buffer pools, the balance among I/O, CPU, and memory costs has shifted: I/O is no longer overwhelmingly dominant, so modern cost models must weight all components appropriately. Some systems allow per-instance calibration of weights based on hardware profiling.
Cost models typically assume costs are additive: the cost of a plan is the sum of costs of its operators. This enables bottom-up cost computation during DP enumeration:
$$Cost(Plan) = Cost(Op_1) + Cost(Op_2) + ... + Cost(Op_n)$$
This assumption breaks down with parallelism (where overlapping execution reduces total time) and with memory contention (where simultaneous operators compete for buffer space).
Cost is not directly equal to execution time, though it should correlate with time. Cost is a relative metric: a plan with cost 1000 should not take 1000 seconds, but should take roughly twice as long as a plan with cost 500 for the same query. The actual execution time depends on system load, caching, and factors the optimizer can't predict.
Cardinality (the number of tuples in a result) is the single most important input to cost estimation; virtually every cost formula depends on it.
If cardinality estimation is wrong, cost estimation will be wrong—often by the same factor or worse.
For a query like:
SELECT * FROM employees WHERE dept = 'Engineering' AND salary > 100000;
We need to estimate how many rows satisfy the predicate. This requires knowing:

- What fraction of rows have dept = 'Engineering'?
- What fraction of rows have salary > 100000?

Selectivity is the fraction of rows that satisfy a predicate:
$$selectivity(predicate) = \frac{|\{\text{rows satisfying predicate}\}|}{|\text{all rows}|}$$
Estimated output cardinality = Input cardinality × Selectivity
Common Selectivity Estimates:
| Predicate Type | Selectivity Formula | Notes |
|---|---|---|
| col = constant | 1/NDV | NDV = number of distinct values |
| col > constant | (max - constant)/(max - min) | Uniform assumption |
| col BETWEEN a AND b | (b - a)/(max - min) | Uniform assumption |
| col LIKE 'abc%' | Varies (often ~10%) | Hard to estimate accurately |
| col IN (v1, v2, ...) | k/NDV | k = number of values |
| col IS NULL | null_frac from statistics | Tracked directly |
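These formulas translate directly into code. The sketch below assumes per-column statistics (NDV, min, max) are already available; the numbers in the usage example are made up:

```python
def eq_selectivity(ndv):
    """col = constant: assume every distinct value is equally likely."""
    return 1.0 / ndv

def gt_selectivity(constant, min_val, max_val):
    """col > constant: uniform-distribution assumption over [min, max]."""
    return (max_val - constant) / (max_val - min_val)

def between_selectivity(a, b, min_val, max_val):
    """col BETWEEN a AND b: uniform-distribution assumption."""
    return (b - a) / (max_val - min_val)

def in_list_selectivity(k, ndv):
    """col IN (v1, ..., vk): k equally likely values."""
    return min(1.0, k / ndv)

# Hypothetical: employees.salary with min=30000, max=500000, 100,000 rows
rows = 100_000
sel = gt_selectivity(100_000, 30_000, 500_000)
print(round(rows * sel))  # ~85106 estimated rows with salary > 100000
```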
For compound predicates, optimizers typically assume independence:
$$sel(A \wedge B) = sel(A) \times sel(B)$$ $$sel(A \vee B) = sel(A) + sel(B) - sel(A) \times sel(B)$$
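A minimal sketch of these combination rules; the numbers in the usage example anticipate the correlated dept/manager case discussed next:

```python
def sel_and(sel_a, sel_b):
    """Conjunction under the independence assumption."""
    return sel_a * sel_b

def sel_or(sel_a, sel_b):
    """Disjunction via inclusion-exclusion, still assuming independence."""
    return sel_a + sel_b - sel_a * sel_b

# dept = 'Engineering' (0.1) AND manager = 'Alice' (0.01)
print(round(sel_and(0.1, 0.01), 6))  # 0.001 -- far too low if the columns are correlated
```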
When Independence Fails:
-- dept and manager are correlated (each dept has one manager)
SELECT * FROM employees
WHERE dept = 'Engineering' AND manager = 'Alice';
If dept = 'Engineering' has selectivity 0.1 and manager = 'Alice' has selectivity 0.01, independence predicts selectivity 0.001. But if Alice is the Engineering manager, the actual selectivity is 0.1—100× higher than estimated!
Correlated predicates are a major source of estimation errors. Research shows that assuming independence can cause errors of 10-1000× for real-world queries. Some systems track multi-column statistics to capture correlations, but this doesn't scale to arbitrary column combinations.
The optimizer relies on catalog statistics—metadata about table and column characteristics—to make estimates. These statistics are stored in system catalogs and must be periodically refreshed as data changes.
| Statistic | Description | Usage |
|---|---|---|
| n_rows | Number of rows in table | Base for cardinality |
| n_pages | Number of data pages | I/O cost estimation |
| row_width | Average bytes per row | Pages computation |
| n_indices | Number of indexes | Access path options |
| last_analyze | When stats were computed | Freshness indicator |
| Statistic | Description | Usage |
|---|---|---|
| n_distinct (NDV) | Count of distinct values | Equality selectivity |
| null_frac | Fraction of NULL values | NULL predicate selectivity |
| avg_width | Average column width | Memory/I/O estimation |
| min_value | Minimum value | Range predicate bounds |
| max_value | Maximum value | Range predicate bounds |
| histogram | Value distribution | Non-uniform selectivity |
| most_common_values | Top k values with frequencies | Skew handling |
| correlation | Physical vs. logical ordering | Index scan efficiency |
Histograms capture value distributions for non-uniform data. Instead of assuming values are distributed uniformly between min and max, histograms divide the value range into buckets.
Types of Histograms:
| Type | Definition | Pros | Cons |
|---|---|---|---|
| Equi-width | Equal-size value ranges | Simple to build | Poor for skewed data |
| Equi-height | Equal number of rows per bucket | Handles skew | More complex |
| Hybrid | Mix of above | Balances trade-offs | Most common in practice |
| Compressed | Explicit buckets for common values | Best for extreme skew | Storage overhead |
Example Equi-Height Histogram:
Column: salary
Total rows: 100,000
Buckets: 10 (each ~10,000 rows)
Bucket 1: [30000, 45000) - 10,123 rows
Bucket 2: [45000, 52000) - 9,987 rows
Bucket 3: [52000, 58000) - 10,001 rows
...
Bucket 10: [120000, 500000) - 10,089 rows
Note: Bucket 10 spans a much wider range—fewer employees have high salaries.
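A sketch of how an equi-height histogram can answer a range predicate; rows are assumed uniformly distributed within each bucket, and the interior boundaries below (between buckets 3 and 10) are invented to complete the example:

```python
import bisect

def histogram_gt_selectivity(bounds, constant):
    """
    Estimate selectivity of 'col > constant' from an equi-height histogram.
    bounds: sorted bucket boundaries [b0, ..., bn]; bucket i covers [b_i, b_{i+1}),
    and each of the n buckets holds roughly 1/n of the rows.
    """
    n_buckets = len(bounds) - 1
    if constant < bounds[0]:
        return 1.0
    if constant >= bounds[-1]:
        return 0.0
    i = bisect.bisect_right(bounds, constant) - 1      # bucket containing the constant
    frac_above = (bounds[i + 1] - constant) / (bounds[i + 1] - bounds[i])
    full_buckets_above = n_buckets - i - 1
    return (frac_above + full_buckets_above) / n_buckets

salary_bounds = [30000, 45000, 52000, 58000, 64000, 70000,
                 77000, 85000, 95000, 120000, 500000]
print(round(histogram_gt_selectivity(salary_bounds, 100000), 4))  # 0.18
```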
-- PostgreSQL
ANALYZE employees;
ANALYZE employees(salary); -- Specific column
-- SQL Server
UPDATE STATISTICS employees;
CREATE STATISTICS stat_dept_salary ON employees(dept, salary); -- Multi-column
-- Oracle
EXEC DBMS_STATS.GATHER_TABLE_STATS('HR', 'EMPLOYEES');
When to Update:
Statistics gathered on a 1000-row table are useless when the table grows to 1 million rows. Stale statistics are a leading cause of query performance problems. Production systems should have automated statistics maintenance and monitoring for freshness.
Each operator type has characteristic cost formulas:
$$Cost_{seqscan} = n_{pages} + \text{cpu\_tuple\_cost} \times n_{rows}$$
Sequential scans read all pages in order. Since disk read-ahead makes sequential I/O efficient, the cost is simply proportional to table size.
$$Cost_{indexscan} = \text{index\_pages\_read} + \text{data\_pages\_read} + \text{cpu\_index\_tuple\_cost} \times n_{index\_tuples}$$
Index scan cost depends on the depth of the index traversal, the number of matching entries, and whether the matching rows are clustered on disk:
For a clustered index with selectivity s: $$Cost \approx \log(n) + s \times n_{pages}$$
For an unclustered index: $$Cost \approx \log(n) + s \times n_{rows}$$
(Worst case: one page fetch per row if totally unclustered)
$$Cost_{NL} = Cost_{outer} + n_{outer} \times Cost_{inner\_per\_row}$$
$$Cost_{hash} = Cost_{build} + Cost_{probe}$$
$$Cost_{build} = 3 \times n_{build\_pages}$$ (read input, write partitions, re-read partitions)
$$Cost_{probe} = n_{probe\_pages} + \text{cpu\_cost} \times n_{probe\_rows}$$
If the hash table fits in memory, the partition write and re-read are eliminated.
$$Cost_{merge} = Cost_{sort_R} + Cost_{sort_S} + n_{pages_R} + n_{pages_S}$$
For external sort: $$Cost_{sort}(T) = 2 \times n_{pages} \times \left(1 + \left\lceil \log_{M-1}(n_{pages}/M) \right\rceil\right)$$
where M = available memory pages
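The CostModel sketch below does not cover sorting, so here is a small stand-alone version of the external sort formula (treating the in-memory case as zero-cost for simplicity):

```python
import math

def external_sort_cost(n_pages, memory_pages):
    """
    2 * n_pages * (1 + ceil(log_{M-1}(n_pages / M))) page I/Os:
    every pass reads and writes each page once; run formation produces
    about n_pages / M runs, which are then merged (M-1)-way per pass.
    """
    if n_pages <= memory_pages:
        return 0  # fits in memory: no external passes (sketch simplification)
    initial_runs = n_pages / memory_pages
    merge_passes = math.ceil(math.log(initial_runs, memory_pages - 1))
    return 2 * n_pages * (1 + merge_passes)

print(external_sort_cost(10_000, 100))  # 60000: one run-formation pass + two merge passes
```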
$$Cost_{agg} = Cost_{input} + \text{cpu\_cost} \times n_{input\_rows}$$
For hash-based grouping, add memory for hash table. For sort-based grouping, add sort cost if not already sorted.
```python
class CostModel:
    """
    Simplified cost model for query operators.
    Costs are in abstract units (higher = worse).
    """

    # Cost factors (system-specific, can be calibrated)
    SEQ_PAGE_COST = 1.0         # Reading sequential page
    RANDOM_PAGE_COST = 4.0      # Reading random page (seek overhead)
    CPU_TUPLE_COST = 0.01       # Processing one tuple
    CPU_INDEX_COST = 0.005      # Index tuple processing
    CPU_OPERATOR_COST = 0.0025  # Per-comparison cost

    def seq_scan_cost(self, table_pages, table_rows, filter_cost=1):
        """Cost of full table scan with optional filter."""
        io_cost = self.SEQ_PAGE_COST * table_pages
        cpu_cost = self.CPU_TUPLE_COST * table_rows * filter_cost
        return io_cost + cpu_cost

    def index_scan_cost(self, index_height, index_selectivity,
                        table_pages, table_rows, is_clustered):
        """Cost of index scan."""
        # Index traversal
        index_io = index_height * self.RANDOM_PAGE_COST

        # Data page access
        matched_rows = table_rows * index_selectivity
        if is_clustered:
            # Clustered: pages roughly proportional to rows
            data_pages = table_pages * index_selectivity
            data_io = data_pages * self.SEQ_PAGE_COST
        else:
            # Unclustered: potentially one page per row
            data_io = matched_rows * self.RANDOM_PAGE_COST

        cpu_cost = self.CPU_INDEX_COST * matched_rows
        return index_io + data_io + cpu_cost

    def nested_loop_cost(self, outer_rows, inner_cost_per_probe):
        """Cost of nested loop join."""
        return outer_rows * inner_cost_per_probe

    def hash_join_cost(self, build_rows, build_pages,
                       probe_rows, probe_pages, fits_memory):
        """Cost of hash join."""
        if fits_memory:
            # In-memory hash join
            build_cost = build_pages * self.SEQ_PAGE_COST
            probe_cost = probe_pages * self.SEQ_PAGE_COST
            cpu_cost = (build_rows + probe_rows) * self.CPU_TUPLE_COST
        else:
            # Grace hash join (partition to disk)
            build_cost = 3 * build_pages * self.SEQ_PAGE_COST
            probe_cost = 3 * probe_pages * self.SEQ_PAGE_COST
            cpu_cost = (build_rows + probe_rows) * self.CPU_TUPLE_COST

        return build_cost + probe_cost + cpu_cost
```

Estimating the cardinality of join results is particularly challenging and critical. Join cardinality affects the cost of all subsequent operations.
The easiest case: joining a foreign key to its referenced primary key:
$$|R \bowtie S| = |R|$$
Each row in R matches exactly one row in S (assuming referential integrity).
For R.a = S.b where a and b are not keys:
$$|R \bowtie S| = \frac{|R| \times |S|}{max(NDV(R.a), NDV(S.b))}$$
Intuition: under the containment assumption, each row of R finds roughly |S|/NDV(S.b) matching rows in S, giving |R| × |S|/NDV(S.b) in total; running the same argument from S's side yields |R| × |S|/NDV(R.a), and dividing by the larger NDV picks the more conservative of the two estimates.
The Containment Assumption: We assume the smaller domain is contained in the larger. If NDV(R.a) = 100 and NDV(S.b) = 1000, we assume all 100 distinct values in R.a appear in S.b.
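A sketch of this estimate in code; the statistics are passed in directly, whereas a real optimizer would pull them from the catalog:

```python
def equijoin_cardinality(rows_r, rows_s, ndv_r_a, ndv_s_b):
    """|R JOIN S ON R.a = S.b|  ~=  |R| * |S| / max(NDV(R.a), NDV(S.b))."""
    return rows_r * rows_s / max(ndv_r_a, ndv_s_b)

# When S.b is a primary key, NDV(S.b) = |S| and the estimate collapses to |R|,
# matching the FK-PK case above.
print(equijoin_cardinality(1_000_000, 50_000, 48_000, 50_000))  # 1000000.0
```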
For joins with multiple conditions:
R JOIN S ON R.a = S.a AND R.b < S.c
The standard approach applies independence: $$sel_{join} = sel(R.a = S.a) \times sel(R.b < S.c)$$
| Case | Selectivity Formula | Notes |
|---|---|---|
| R.pk = S.fk | 1/\|R\| | PK-FK relationship |
| R.a = S.b (general) | 1/max(NDV(a), NDV(b)) | Containment assumption |
| R.a < S.b | ~1/3 (heuristic) | Range comparison |
| R.a = S.b AND R.c = S.d | Product of individual selectivities | Independence assumed |
| R.a = S.b = T.c (3-way) | 1/max(NDV(a), NDV(b), NDV(c)) | Mutual containment |
Join estimation is where errors compound most severely:
Error Propagation Example:
A 10× error in the estimate for R ⋈ S does not stay contained: it feeds into every estimate built on top of it, and further misestimates in R ⋈ S ⋈ T multiply with it rather than cancel.
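A minimal numeric illustration of the compounding, with invented row counts:

```python
# Suppose the optimizer underestimates each join's output by 10x.
true_rs = 1_000_000            # actual rows from R JOIN S
est_rs = true_rs / 10          # estimate is already 10x too low

fanout_t = 5                   # suppose each RS row matches 5 rows of T
true_rst = true_rs * fanout_t
est_rst = est_rs * fanout_t / 10   # the next join is also underestimated 10x

print(true_rst / est_rst)  # 100.0 -- the errors multiply rather than cancel
```

Several techniques attempt to reduce these errors: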
Join Histograms: Maintain histograms on join columns and match buckets between tables.
Sketches: Use probabilistic data structures (HyperLogLog, Count-Min Sketch) for NDV and frequency estimation.
Sampling: Sample actual join output during optimization for complex joins.
Feedback: Use actual runtime cardinalities to update estimates for future queries.
Research consistently shows join cardinality estimation errors of 100-1000× in production systems. This is arguably the most significant unsolved problem in query optimization. Modern approaches like cardinality feedback (learning from actual execution) and learned cardinality estimation (ML models) are active research areas.
With costs estimated for all enumerated plans, the final step is selection. While "choose the lowest cost" seems simple, several complexities arise:
The basic selection rule:
selected_plan = min(enumerated_plans, key=cost)
This assumes a single cost metric. When the goal is minimizing response time, this works directly.
Sometimes we care about more than one metric, such as latency, total resource consumption, and memory footprint.
Pareto Optimality: A plan is Pareto optimal if no other plan is better in all objectives. The optimizer might return the set of Pareto-optimal plans for the user to choose.
Typically, systems combine metrics into a single weighted score: $$Score = w_1 \times \text{latency} + w_2 \times \text{resources} + w_3 \times \text{memory}$$
Problem: The "optimal" plan based on estimates might be very sensitive to estimation errors.
Scenario:
Plan A is "optimal" but risky. Plan B is slightly worse but robust—its cost varies little with estimation error.
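One way to make the difference visible is to evaluate each plan's cost over a range of plausible cardinalities instead of only at the point estimate; the two cost functions below are hypothetical stand-ins for Plan A and Plan B:

```python
def plan_a_cost(n):
    # Hypothetical "risky" plan (e.g. nested loop picked for a tiny estimate):
    # cheap when n is small, grows steeply when the estimate is wrong.
    return 50 + 2.0 * n

def plan_b_cost(n):
    # Hypothetical "robust" plan (e.g. hash join): higher startup, nearly flat in n.
    return 500 + 0.05 * n

for n in (100, 1_000, 10_000, 100_000):
    print(n, plan_a_cost(n), plan_b_cost(n))
# At the estimate (n=100) Plan A wins (250 vs 505); if the true cardinality
# is 100,000 it loses badly (200050 vs 5500).
```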
Approaches include penalizing plans whose cost is highly sensitive to estimation errors, or choosing the plan with the best worst-case cost over a range of plausible cardinalities.
The selected plan may be cached for reuse with different parameter values:
PREPARE stmt AS SELECT * FROM orders WHERE customer_id = $1;
The optimal plan depends on the selectivity of customer_id = $1, which varies from one parameter value to the next: a customer with a handful of orders favors an index scan, while a customer with millions of orders favors a full scan.
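The effect can be made concrete with the CostModel sketch from earlier on this page (assumed to be in scope); the table size and order counts are illustrative:

```python
model = CostModel()                    # cost model sketched earlier
table_pages, table_rows = 50_000, 5_000_000
index_height = 3

for n_orders in (10, 500_000):         # a typical customer vs. an outlier
    selectivity = n_orders / table_rows
    idx = model.index_scan_cost(index_height, selectivity,
                                table_pages, table_rows, is_clustered=False)
    seq = model.seq_scan_cost(table_pages, table_rows)
    print(n_orders, round(idx), round(seq))
# 10 orders:      index scan ~52        vs. seq scan 100000 -> index wins
# 500000 orders:  index scan ~2002512   vs. seq scan 100000 -> seq scan wins,
# so a plan cached for one parameter value is wrong for the other.
```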
Solutions:
In SQL Server, 'parameter sniffing' means the plan is optimized for the first parameter value seen. If that value is atypical, all subsequent executions use a suboptimal plan. Solutions include OPTIMIZE FOR UNKNOWN, plan guides, or query hints specifying typical values.
Understanding cost estimation failures is crucial for query tuning. Here are common scenarios:
Symptom: Query plan was good last month, now it's terrible. Cause: Statistics haven't been updated after significant data changes. Solution: Update statistics, set up automatic statistics maintenance.
-- city and zip_code are perfectly correlated
WHERE city = 'New York' AND zip_code LIKE '10%';
Symptom: Estimates are orders of magnitude too low. Cause: Independence assumption fails for correlated columns. Solution: Create multi-column statistics, or rewrite query.
SELECT * FROM orders WHERE customer_id = 12345;
Symptom: One customer has 1M orders (outlier), but estimate assumes uniform distribution. Cause: Histogram doesn't capture extreme outliers, or NDV-based estimate ignores skew. Solution: Use histograms, create statistics on frequent values.
WHERE YEAR(order_date) = 2024 AND MONTH(order_date) = 12;
Symptom: The optimizer falls back to default guesses (often around 10% selectivity per expression).
Cause: Statistics on order_date don't help with function expressions.
Solution: Rewrite to order_date BETWEEN '2024-12-01' AND '2024-12-31'.
Symptom: 3+ table joins take forever; EXPLAIN shows huge intermediate results. Cause: Multiplicative error propagation through multi-join estimates. Solution: Hint the join order, or materialize a subquery.
Examine the execution plan and compare estimates to actuals:
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- Shows both estimated and actual row counts
-- Seq Scan on orders (cost=0.00..1234.00 rows=100 width=50)
-- ^^^^^ estimated
-- (actual time=0.020..45.678 rows=95000 loops=1)
-- ^^^^^^^ actual
A 1000× discrepancy between estimated and actual rows is a red flag.
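A handy sanity check when reading such plans is the ratio between estimated and actual rows (often called the q-error); a minimal sketch:

```python
def q_error(estimated_rows, actual_rows):
    """Symmetric estimate/actual ratio; 1.0 means a perfect estimate."""
    est = max(estimated_rows, 1)   # guard against zero-row counts
    act = max(actual_rows, 1)
    return max(est / act, act / est)

# From the Seq Scan example above: estimated 100 rows, actual 95,000
print(q_error(100, 95_000))  # 950.0 -- orders of magnitude off
```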
A single bad estimate can cascade through the entire plan. If intermediate result A is estimated at 100 rows but actually has 100,000, the optimizer might choose nested-loop join for the next stage (expecting 100 probes) when hash join would be 100× faster. Always check for estimate mismatches working from leaves to root of the plan tree.
Cost-based selection is the heart of query optimization—transforming candidate plans into a single selected plan by estimating and comparing costs. Despite its importance, cost estimation remains imprecise, making this both the strength and weakness of modern optimizers.
What's Next:
The final page of this module examines Time Constraints—how optimizers balance thorough optimization against time pressure, when to use quick heuristics versus careful search, and adaptive strategies that adjust optimization effort based on query characteristics.
You now understand how query optimizers estimate costs and select execution plans. This knowledge is essential for diagnosing query performance problems, understanding EXPLAIN output, and effectively tuning database queries.