The query optimizer faces a paradox: it must select the fastest execution plan without actually executing any plans. Running every candidate plan to measure its real cost would defeat the purpose of optimization—by the time you knew which plan was best, you would have executed all of them.
The solution is cost-based selection: estimating the cost of each candidate plan using mathematical models and statistical information about the data. The optimizer computes predicted costs for all enumerated plans and selects the one with the lowest estimate.
This approach works remarkably well in practice, but it's not infallible. Cost estimation is inherently imprecise, and estimation errors can lead to suboptimal—sometimes catastrophically bad—plan choices. Understanding how cost estimation works, where it fails, and how to recognize and address failures is essential knowledge for database professionals.
By the end of this page, you will understand the components of cost models (I/O cost, CPU cost, memory cost), how cardinality estimation underlies all cost calculations, the catalog statistics that feed cost models, selectivity estimation for various predicate types, and the challenges that cause estimation to fail. This knowledge enables effective query tuning and explains optimizer behavior.
A cost model is a mathematical framework that assigns a numeric cost to each execution plan. The cost represents the estimated resources required to execute the plan—typically measured in time units or abstract cost units that correlate with execution time.
Execution cost is typically decomposed into several components:
| Component | Represents | Example Factors |
|---|---|---|
| I/O Cost | Disk access time | Pages read, sequential vs. random |
| CPU Cost | Processing time | Tuples processed, predicate complexity |
| Memory Cost | Memory pressure | Hash table sizes, sort buffers |
| Network Cost | Data transfer (distributed) | Bytes transmitted, latency |
| Startup Cost | Initialization overhead | Opening files, building hash tables |
Total cost is typically a weighted sum:
$$Cost = w_{io} \cdot C_{io} + w_{cpu} \cdot C_{cpu} + w_{mem} \cdot C_{mem} + w_{net} \cdot C_{net}$$
The weights reflect the relative importance of each resource in the specific system.
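As a small illustration of the weighted sum, here is a minimal sketch; the weight and component values are hypothetical, not taken from any particular system:

```python
# Hypothetical calibrated weights and per-plan component estimates.
weights = {"io": 1.0, "cpu": 0.25, "mem": 0.1, "net": 2.0}
components = {"io": 1200.0, "cpu": 450.0, "mem": 80.0, "net": 0.0}

# Cost = w_io*C_io + w_cpu*C_cpu + w_mem*C_mem + w_net*C_net
total_cost = sum(weights[k] * components[k] for k in weights)
print(total_cost)  # 1320.5
```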
Traditionally, disk I/O dominated cost—reading one page from a mechanical disk took 10-20 milliseconds, while processing a tuple took microseconds. A cost model focused purely on I/O was sufficient:
$$Cost \approx \text{Pages Read} + \alpha \cdot \text{Pages Written}, \quad \alpha > 1$$
(Writes are more expensive due to logging and durability requirements.)
With SSDs and large in-memory buffer pools, the balance among I/O, CPU, and memory costs has shifted: I/O is no longer overwhelmingly dominant, so modern cost models must weight all components appropriately. Some systems allow per-instance calibration of weights based on hardware profiling.
Cost models typically assume costs are additive: the cost of a plan is the sum of costs of its operators. This enables bottom-up cost computation during DP enumeration:
$$Cost(Plan) = Cost(Op_1) + Cost(Op_2) + ... + Cost(Op_n)$$
This assumption breaks down with parallelism (where overlapping execution reduces total time) and with memory contention (where simultaneous operators compete for buffer space).
Cost is not directly equal to execution time, though it should correlate with time. Cost is a relative metric: a plan with cost 1000 should not take 1000 seconds, but should take roughly twice as long as a plan with cost 500 for the same query. The actual execution time depends on system load, caching, and factors the optimizer can't predict.
Cardinality (the number of tuples in a result) is the single most important input to cost estimation; virtually every cost formula depends on it.
If cardinality estimation is wrong, cost estimation will be wrong—often by the same factor or worse.
For a query like:
SELECT * FROM employees WHERE dept = 'Engineering' AND salary > 100000;
We need to estimate how many rows satisfy the predicate. This requires knowing:

- What fraction of rows have dept = 'Engineering'?
- What fraction of rows have salary > 100000?

Selectivity is the fraction of rows that satisfy a predicate:
$$selectivity(predicate) = \frac{|\{\text{rows satisfying predicate}\}|}{|\text{all rows}|}$$
Estimated output cardinality = Input cardinality × Selectivity
Common Selectivity Estimates:
| Predicate Type | Selectivity Formula | Notes |
|---|---|---|
| col = constant | 1/NDV | NDV = number of distinct values |
| col > constant | (max - constant)/(max - min) | Uniform assumption |
| col BETWEEN a AND b | (b - a)/(max - min) | Uniform assumption |
| col LIKE 'abc%' | Varies (often ~10%) | Hard to estimate accurately |
| col IN (v1, v2, ...) | k/NDV | k = number of values |
| col IS NULL | null_frac from statistics | Tracked directly |
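These formulas translate directly into code. The sketch below assumes per-column statistics (NDV, min, max) are already available; the numbers in the usage example are made up:

```python
def eq_selectivity(ndv):
    """col = constant: assume every distinct value is equally likely."""
    return 1.0 / ndv

def gt_selectivity(constant, min_val, max_val):
    """col > constant: uniform-distribution assumption over [min, max]."""
    return (max_val - constant) / (max_val - min_val)

def between_selectivity(a, b, min_val, max_val):
    """col BETWEEN a AND b: uniform-distribution assumption."""
    return (b - a) / (max_val - min_val)

def in_list_selectivity(k, ndv):
    """col IN (v1, ..., vk): k equally likely values."""
    return min(1.0, k / ndv)

# Hypothetical: employees.salary with min=30000, max=500000, 100,000 rows
rows = 100_000
sel = gt_selectivity(100_000, 30_000, 500_000)
print(round(rows * sel))  # ~85106 estimated rows with salary > 100000
```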
For compound predicates, optimizers typically assume independence:
$$sel(A \wedge B) = sel(A) \times sel(B)$$ $$sel(A \vee B) = sel(A) + sel(B) - sel(A) \times sel(B)$$
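A minimal sketch of these combination rules; the numbers in the usage example anticipate the correlated dept/manager case discussed next:

```python
def sel_and(sel_a, sel_b):
    """Conjunction under the independence assumption."""
    return sel_a * sel_b

def sel_or(sel_a, sel_b):
    """Disjunction via inclusion-exclusion, still assuming independence."""
    return sel_a + sel_b - sel_a * sel_b

# dept = 'Engineering' (0.1) AND manager = 'Alice' (0.01)
print(round(sel_and(0.1, 0.01), 6))  # 0.001 -- far too low if the columns are correlated
```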
When Independence Fails:
-- dept and manager are correlated (each dept has one manager)
SELECT * FROM employees
WHERE dept = 'Engineering' AND manager = 'Alice';
If dept = 'Engineering' has selectivity 0.1 and manager = 'Alice' has selectivity 0.01, independence predicts selectivity 0.001. But if Alice is the Engineering manager, the actual selectivity is 0.1—100× higher than estimated!
Correlated predicates are a major source of estimation errors. Research shows that assuming independence can cause errors of 10-1000× for real-world queries. Some systems track multi-column statistics to capture correlations, but this doesn't scale to arbitrary column combinations.
The optimizer relies on catalog statistics—metadata about table and column characteristics—to make estimates. These statistics are stored in system catalogs and must be periodically refreshed as data changes.
| Statistic | Description | Usage |
|---|---|---|
| n_rows | Number of rows in table | Base for cardinality |
| n_pages | Number of data pages | I/O cost estimation |
| row_width | Average bytes per row | Pages computation |
| n_indices | Number of indexes | Access path options |
| last_analyze | When stats were computed | Freshness indicator |
| Statistic | Description | Usage |
|---|---|---|
| n_distinct (NDV) | Count of distinct values | Equality selectivity |
| null_frac | Fraction of NULL values | NULL predicate selectivity |
| avg_width | Average column width | Memory/I/O estimation |
| min_value | Minimum value | Range predicate bounds |
| max_value | Maximum value | Range predicate bounds |
| histogram | Value distribution | Non-uniform selectivity |
| most_common_values | Top k values with frequencies | Skew handling |
| correlation | Physical vs. logical ordering | Index scan efficiency |
Histograms capture value distributions for non-uniform data. Instead of assuming values are distributed uniformly between min and max, histograms divide the value range into buckets.
Types of Histograms:
| Type | Definition | Pros | Cons |
|---|---|---|---|
| Equi-width | Equal-size value ranges | Simple to build | Poor for skewed data |
| Equi-height | Equal number of rows per bucket | Handles skew | More complex |
| Hybrid | Mix of above | Balances trade-offs | Most common in practice |
| Compressed | Explicit buckets for common values | Best for extreme skew | Storage overhead |
Example Equi-Height Histogram:
Column: salary
Total rows: 100,000
Buckets: 10 (each ~10,000 rows)
Bucket 1: [30000, 45000) - 10,123 rows
Bucket 2: [45000, 52000) - 9,987 rows
Bucket 3: [52000, 58000) - 10,001 rows
...
Bucket 10: [120000, 500000) - 10,089 rows
Note: Bucket 10 spans a much wider range—fewer employees have high salaries.
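A sketch of how an equi-height histogram can answer a range predicate; rows are assumed uniformly distributed within each bucket, and the interior boundaries below (between buckets 3 and 10) are invented to complete the example:

```python
import bisect

def histogram_gt_selectivity(bounds, constant):
    """
    Estimate selectivity of 'col > constant' from an equi-height histogram.
    bounds: sorted bucket boundaries [b0, ..., bn]; bucket i covers [b_i, b_{i+1}),
    and each of the n buckets holds roughly 1/n of the rows.
    """
    n_buckets = len(bounds) - 1
    if constant < bounds[0]:
        return 1.0
    if constant >= bounds[-1]:
        return 0.0
    i = bisect.bisect_right(bounds, constant) - 1      # bucket containing the constant
    frac_above = (bounds[i + 1] - constant) / (bounds[i + 1] - bounds[i])
    full_buckets_above = n_buckets - i - 1
    return (frac_above + full_buckets_above) / n_buckets

salary_bounds = [30000, 45000, 52000, 58000, 64000, 70000,
                 77000, 85000, 95000, 120000, 500000]
print(round(histogram_gt_selectivity(salary_bounds, 100000), 4))  # 0.18
```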
-- PostgreSQL
ANALYZE employees;
ANALYZE employees(salary); -- Specific column
-- SQL Server
UPDATE STATISTICS employees;
CREATE STATISTICS stat_dept_salary ON employees(dept, salary); -- Multi-column
-- Oracle
EXEC DBMS_STATS.GATHER_TABLE_STATS('HR', 'EMPLOYEES');
When to Update:
Statistics gathered on a 1000-row table are useless when the table grows to 1 million rows. Stale statistics are a leading cause of query performance problems. Production systems should have automated statistics maintenance and monitoring for freshness.
Each operator type has characteristic cost formulas:
$$Cost_{seqscan} = n_{pages} + \text{cpu\_tuple\_cost} \times n_{rows}$$
Sequential scans read all pages in order. Since disk read-ahead makes sequential I/O efficient, the cost is simply proportional to table size.
$$Cost_{indexscan} = \text{index\_pages\_read} + \text{data\_pages\_read} + \text{cpu\_index\_tuple\_cost} \times n_{index\_tuples}$$
Index scan cost depends on the depth of the index traversal, the number of matching entries, and whether the matching rows are clustered on disk:
For a clustered index with selectivity s: $$Cost \approx \log(n) + s \times n_{pages}$$
For an unclustered index: $$Cost \approx \log(n) + s \times n_{rows}$$
(Worst case: one page fetch per row if totally unclustered)
$$Cost_{NL} = Cost_{outer} + n_{outer} \times Cost_{inner\_per\_row}$$
$$Cost_{hash} = Cost_{build} + Cost_{probe}$$
$$Cost_{build} = 3 \times n_{build\_pages}$$ (read input, write partitions, re-read partitions)
$$Cost_{probe} = n_{probe\_pages} + \text{cpu\_cost} \times n_{probe\_rows}$$
If the hash table fits in memory, the partition write and re-read are eliminated.
$$Cost_{merge} = Cost_{sort_R} + Cost_{sort_S} + n_{pages_R} + n_{pages_S}$$
For external sort: $$Cost_{sort}(T) = 2 \times n_{pages} \times \left(1 + \left\lceil \log_{M-1}(n_{pages}/M) \right\rceil\right)$$
where M = available memory pages
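The CostModel sketch below does not cover sorting, so here is a small stand-alone version of the external sort formula (treating the in-memory case as zero-cost for simplicity):

```python
import math

def external_sort_cost(n_pages, memory_pages):
    """
    2 * n_pages * (1 + ceil(log_{M-1}(n_pages / M))) page I/Os:
    every pass reads and writes each page once; run formation produces
    about n_pages / M runs, which are then merged (M-1)-way per pass.
    """
    if n_pages <= memory_pages:
        return 0  # fits in memory: no external passes (sketch simplification)
    initial_runs = n_pages / memory_pages
    merge_passes = math.ceil(math.log(initial_runs, memory_pages - 1))
    return 2 * n_pages * (1 + merge_passes)

print(external_sort_cost(10_000, 100))  # 60000: one run-formation pass + two merge passes
```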
$$Cost_{agg} = Cost_{input} + \text{cpu\_cost} \times n_{input\_rows}$$
For hash-based grouping, add memory for hash table. For sort-based grouping, add sort cost if not already sorted.
```python
class CostModel:
    """
    Simplified cost model for query operators.
    Costs are in abstract units (higher = worse).
    """

    # Cost factors (system-specific, can be calibrated)
    SEQ_PAGE_COST = 1.0         # Reading sequential page
    RANDOM_PAGE_COST = 4.0      # Reading random page (seek overhead)
    CPU_TUPLE_COST = 0.01       # Processing one tuple
    CPU_INDEX_COST = 0.005      # Index tuple processing
    CPU_OPERATOR_COST = 0.0025  # Per-comparison cost

    def seq_scan_cost(self, table_pages, table_rows, filter_cost=1):
        """Cost of full table scan with optional filter."""
        io_cost = self.SEQ_PAGE_COST * table_pages
        cpu_cost = self.CPU_TUPLE_COST * table_rows * filter_cost
        return io_cost + cpu_cost

    def index_scan_cost(self, index_height, index_selectivity,
                        table_pages, table_rows, is_clustered):
        """Cost of index scan."""
        # Index traversal
        index_io = index_height * self.RANDOM_PAGE_COST

        # Data page access
        matched_rows = table_rows * index_selectivity
        if is_clustered:
            # Clustered: pages roughly proportional to rows
            data_pages = table_pages * index_selectivity
            data_io = data_pages * self.SEQ_PAGE_COST
        else:
            # Unclustered: potentially one page per row
            data_io = matched_rows * self.RANDOM_PAGE_COST

        cpu_cost = self.CPU_INDEX_COST * matched_rows
        return index_io + data_io + cpu_cost

    def nested_loop_cost(self, outer_rows, inner_cost_per_probe):
        """Cost of nested loop join."""
        return outer_rows * inner_cost_per_probe

    def hash_join_cost(self, build_rows, build_pages,
                       probe_rows, probe_pages, fits_memory):
        """Cost of hash join."""
        if fits_memory:
            # In-memory hash join
            build_cost = build_pages * self.SEQ_PAGE_COST
            probe_cost = probe_pages * self.SEQ_PAGE_COST
            cpu_cost = (build_rows + probe_rows) * self.CPU_TUPLE_COST
        else:
            # Grace hash join (partition to disk)
            build_cost = 3 * build_pages * self.SEQ_PAGE_COST
            probe_cost = 3 * probe_pages * self.SEQ_PAGE_COST
            cpu_cost = (build_rows + probe_rows) * self.CPU_TUPLE_COST

        return build_cost + probe_cost + cpu_cost
```

Estimating the cardinality of join results is particularly challenging and critical. Join cardinality affects the cost of all subsequent operations.
The easiest case: joining a foreign key to its referenced primary key:
$$|R \bowtie S| = |R|$$
Each row in R matches exactly one row in S (assuming referential integrity).
For R.a = S.b where a and b are not keys:
$$|R \bowtie S| = \frac{|R| \times |S|}{max(NDV(R.a), NDV(S.b))}$$
Intuition: under the containment assumption, each row of R finds roughly |S|/NDV(S.b) matching rows in S, giving |R| × |S|/NDV(S.b) in total; running the same argument from S's side yields |R| × |S|/NDV(R.a), and dividing by the larger NDV picks the more conservative of the two estimates.
The Containment Assumption: We assume the smaller domain is contained in the larger. If NDV(R.a) = 100 and NDV(S.b) = 1000, we assume all 100 distinct values in R.a appear in S.b.
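A sketch of this estimate in code; the statistics are passed in directly, whereas a real optimizer would pull them from the catalog:

```python
def equijoin_cardinality(rows_r, rows_s, ndv_r_a, ndv_s_b):
    """|R JOIN S ON R.a = S.b|  ~=  |R| * |S| / max(NDV(R.a), NDV(S.b))."""
    return rows_r * rows_s / max(ndv_r_a, ndv_s_b)

# When S.b is a primary key, NDV(S.b) = |S| and the estimate collapses to |R|,
# matching the FK-PK case above.
print(equijoin_cardinality(1_000_000, 50_000, 48_000, 50_000))  # 1000000.0
```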
For joins with multiple conditions:
R JOIN S ON R.a = S.a AND R.b < S.c
The standard approach applies independence: $$sel_{join} = sel(R.a = S.a) \times sel(R.b < S.c)$$
| Case | Selectivity Formula | Notes |
|---|---|---|
| R.pk = S.fk | 1/\|R\| | PK-FK relationship |
| R.a = S.b (general) | 1/max(NDV(a), NDV(b)) | Containment assumption |
| R.a < S.b | ~1/3 (heuristic) | Range comparison |
| R.a = S.b AND R.c = S.d | Product of individual selectivities | Independence assumed |
| R.a = S.b = T.c (3-way) | 1/max(NDV(a), NDV(b), NDV(c)) | Mutual containment |
Join estimation is where errors compound most severely:
Error Propagation Example:
A 10× error in the estimate for R ⋈ S does not stay contained: it feeds into every estimate built on top of it, and further misestimates in R ⋈ S ⋈ T multiply with it rather than cancel.
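A minimal numeric illustration of the compounding, with invented row counts:

```python
# Suppose the optimizer underestimates each join's output by 10x.
true_rs = 1_000_000            # actual rows from R JOIN S
est_rs = true_rs / 10          # estimate is already 10x too low

fanout_t = 5                   # suppose each RS row matches 5 rows of T
true_rst = true_rs * fanout_t
est_rst = est_rs * fanout_t / 10   # the next join is also underestimated 10x

print(true_rst / est_rst)  # 100.0 -- the errors multiply rather than cancel
```

Several techniques attempt to reduce these errors: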
Join Histograms: Maintain histograms on join columns and match buckets between tables.
Sketches: Use probabilistic data structures (HyperLogLog, Count-Min Sketch) for NDV and frequency estimation.
Sampling: Sample actual join output during optimization for complex joins.
Feedback: Use actual runtime cardinalities to update estimates for future queries.
Research consistently shows join cardinality estimation errors of 100-1000× in production systems. This is arguably the most significant unsolved problem in query optimization. Modern approaches like cardinality feedback (learning from actual execution) and learned cardinality estimation (ML models) are active research areas.
With costs estimated for all enumerated plans, the final step is selection. While "choose the lowest cost" seems simple, several complexities arise:
The basic selection rule:
selected_plan = min(enumerated_plans, key=cost)
This assumes a single cost metric. When the goal is minimizing response time, this works directly.
Sometimes we care about more than one metric, such as latency, total resource consumption, and memory footprint.
Pareto Optimality: A plan is Pareto optimal if no other plan is better in all objectives. The optimizer might return the set of Pareto-optimal plans for the user to choose.
Typically, systems combine metrics into a single weighted score: $$Score = w_1 \times \text{latency} + w_2 \times \text{resources} + w_3 \times \text{memory}$$
Problem: The "optimal" plan based on estimates might be very sensitive to estimation errors.
Scenario:
Plan A is "optimal" but risky. Plan B is slightly worse but robust—its cost varies little with estimation error.
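One way to make the difference visible is to evaluate each plan's cost over a range of plausible cardinalities instead of only at the point estimate; the two cost functions below are hypothetical stand-ins for Plan A and Plan B:

```python
def plan_a_cost(n):
    # Hypothetical "risky" plan (e.g. nested loop picked for a tiny estimate):
    # cheap when n is small, grows steeply when the estimate is wrong.
    return 50 + 2.0 * n

def plan_b_cost(n):
    # Hypothetical "robust" plan (e.g. hash join): higher startup, nearly flat in n.
    return 500 + 0.05 * n

for n in (100, 1_000, 10_000, 100_000):
    print(n, plan_a_cost(n), plan_b_cost(n))
# At the estimate (n=100) Plan A wins (250 vs 505); if the true cardinality
# is 100,000 it loses badly (200050 vs 5500).
```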
Approaches include penalizing plans whose cost is highly sensitive to estimation errors, or choosing the plan with the best worst-case cost over a range of plausible cardinalities.
The selected plan may be cached for reuse with different parameter values:
PREPARE stmt AS SELECT * FROM orders WHERE customer_id = $1;
The optimal plan depends on the selectivity of customer_id = $1, which varies from one parameter value to the next: a customer with a handful of orders favors an index scan, while a customer with millions of orders favors a full scan.
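The effect can be made concrete with the CostModel sketch from earlier on this page (assumed to be in scope); the table size and order counts are illustrative:

```python
model = CostModel()                    # cost model sketched earlier
table_pages, table_rows = 50_000, 5_000_000
index_height = 3

for n_orders in (10, 500_000):         # a typical customer vs. an outlier
    selectivity = n_orders / table_rows
    idx = model.index_scan_cost(index_height, selectivity,
                                table_pages, table_rows, is_clustered=False)
    seq = model.seq_scan_cost(table_pages, table_rows)
    print(n_orders, round(idx), round(seq))
# 10 orders:      index scan ~52        vs. seq scan 100000 -> index wins
# 500000 orders:  index scan ~2002512   vs. seq scan 100000 -> seq scan wins,
# so a plan cached for one parameter value is wrong for the other.
```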
Solutions:
In SQL Server, 'parameter sniffing' means the plan is optimized for the first parameter value seen. If that value is atypical, all subsequent executions use a suboptimal plan. Solutions include OPTIMIZE FOR UNKNOWN, plan guides, or query hints specifying typical values.
Understanding cost estimation failures is crucial for query tuning. Here are common scenarios:
Symptom: Query plan was good last month, now it's terrible. Cause: Statistics haven't been updated after significant data changes. Solution: Update statistics, set up automatic statistics maintenance.
-- city and zip_code are perfectly correlated
WHERE city = 'New York' AND zip_code LIKE '10%';
Symptom: Estimates are orders of magnitude too low. Cause: Independence assumption fails for correlated columns. Solution: Create multi-column statistics, or rewrite query.
SELECT * FROM orders WHERE customer_id = 12345;
Symptom: One customer has 1M orders (outlier), but estimate assumes uniform distribution. Cause: Histogram doesn't capture extreme outliers, or NDV-based estimate ignores skew. Solution: Use histograms, create statistics on frequent values.
WHERE YEAR(order_date) = 2024 AND MONTH(order_date) = 12;
Symptom: The optimizer falls back to default guesses (often around 10% selectivity per expression).
Cause: Statistics on order_date don't help with function expressions.
Solution: Rewrite to order_date BETWEEN '2024-12-01' AND '2024-12-31'.
Symptom: 3+ table joins take forever; EXPLAIN shows huge intermediate results. Cause: Multiplicative error propagation through multi-join estimates. Solution: Hint the join order, or materialize a subquery.
Examine the execution plan and compare estimates to actuals:
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- Shows both estimated and actual row counts
-- Seq Scan on orders (cost=0.00..1234.00 rows=100 width=50)
-- ^^^^^ estimated
-- (actual time=0.020..45.678 rows=95000 loops=1)
-- ^^^^^^^ actual
A 1000× discrepancy between estimated and actual rows is a red flag.
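A handy sanity check when reading such plans is the ratio between estimated and actual rows (often called the q-error); a minimal sketch:

```python
def q_error(estimated_rows, actual_rows):
    """Symmetric estimate/actual ratio; 1.0 means a perfect estimate."""
    est = max(estimated_rows, 1)   # guard against zero-row counts
    act = max(actual_rows, 1)
    return max(est / act, act / est)

# From the Seq Scan example above: estimated 100 rows, actual 95,000
print(q_error(100, 95_000))  # 950.0 -- orders of magnitude off
```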
A single bad estimate can cascade through the entire plan. If intermediate result A is estimated at 100 rows but actually has 100,000, the optimizer might choose nested-loop join for the next stage (expecting 100 probes) when hash join would be 100× faster. Always check for estimate mismatches working from leaves to root of the plan tree.
Cost-based selection is the heart of query optimization—transforming candidate plans into a single selected plan by estimating and comparing costs. Despite its importance, cost estimation remains imprecise, making this both the strength and weakness of modern optimizers.
What's Next:
The final page of this module examines Time Constraints—how optimizers balance thorough optimization against time pressure, when to use quick heuristics versus careful search, and adaptive strategies that adjust optimization effort based on query characteristics.
You now understand how query optimizers estimate costs and select execution plans. This knowledge is essential for diagnosing query performance problems, understanding EXPLAIN output, and effectively tuning database queries.