Every SQL query you write can be executed in dozens, hundreds, or even millions of different ways. The query optimizer's job is to find the best way—but what does "best" even mean? How can a database system predict which execution strategy will be fastest without actually running each one?
The answer lies in cost estimation: the mathematical art of predicting how expensive a query execution plan will be before executing it. Cost-based optimizers don't guess or rely solely on rules—they build sophisticated models that estimate the resources each plan will consume, then choose the plan with the lowest predicted cost.
This page explores the foundation of cost-based optimization: how optimizers assign numerical costs to plans, what those costs represent, and why accurate estimation is both critical and extraordinarily challenging.
By the end of this page, you will understand: (1) What cost estimation is and why it matters, (2) The components of cost models used by database optimizers, (3) How I/O costs, CPU costs, and memory costs are calculated, (4) The role of statistics and cardinality estimates in cost formulas, and (5) Why cost estimation is inherently imprecise and how optimizers cope with uncertainty.
Cost estimation is the process of assigning a numerical value to a query execution plan that represents the predicted resources required to execute it. This cost is not measured in real time or actual resource consumption—it's a prediction based on mathematical models and statistical information about the data.
The fundamental problem:
Given a SQL query, there are typically many logically equivalent ways to execute it. Consider a simple join of three tables A, B, and C: the optimizer can evaluate it as (A ⋈ B) ⋈ C, (A ⋈ C) ⋈ B, or (B ⋈ C) ⋈ A, and for each ordering it can choose different join algorithms and access paths.
With just three tables, there might be hundreds of possible execution plans. With ten tables, there can be millions. The optimizer cannot execute all of them to find the fastest—it would take longer to find the plan than to just pick one randomly!
The cost estimation solution:
Instead of executing plans, the optimizer estimates how expensive each plan would be if executed.
This works because estimating cost is vastly cheaper than actually executing plans. A cost estimate might take microseconds; executing the plan might take minutes or hours.
Cost units are deliberately abstract. A plan with cost 1000 is not necessarily twice as slow as a plan with cost 500. Cost values are relative measures for comparison, not absolute predictions of execution time. The goal is ordering plans correctly, not predicting elapsed time precisely.
The cost function abstraction:
Formally, a cost model defines a function:
Cost(P) = f(operators in P, cardinalities, physical characteristics)
Where the inputs are the physical operators chosen for plan P, the estimated cardinalities (row counts) flowing between them, and physical characteristics of the data such as table sizes, available indexes, and memory. The optimizer evaluates this function for each candidate plan and selects the one with minimum cost.
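As a concrete (if heavily simplified) illustration of this abstraction, the sketch below scores hypothetical plans as a weighted sum of estimated pages and rows. The operator names, fields, and weights are invented for illustration; they are not taken from any real optimizer.

```python
# A minimal sketch of the cost-function abstraction:
# Cost(P) = f(operators in P, cardinalities, physical characteristics).
from dataclasses import dataclass

@dataclass
class Operator:
    name: str            # e.g. "SeqScan", "HashJoin"
    est_rows: float      # estimated cardinality flowing out of this operator
    est_pages: float     # estimated pages this operator reads

def plan_cost(operators, io_weight=1.0, cpu_weight=0.01):
    """Sum a simple per-operator cost: pages weighted as I/O, rows as CPU."""
    return sum(io_weight * op.est_pages + cpu_weight * op.est_rows
               for op in operators)

# Two hypothetical plans for the same query: the optimizer keeps the cheaper one.
plan_a = [Operator("SeqScan", est_rows=1_000_000, est_pages=50_000)]
plan_b = [Operator("IndexScan", est_rows=100, est_pages=8)]
best = min((plan_a, plan_b), key=plan_cost)
print("chosen plan:", [op.name for op in best], "cost:", plan_cost(best))
```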
A comprehensive cost model considers multiple resource types. The total cost of a plan is typically a weighted combination of these components:
| Cost Component | What It Measures | Typical Importance |
|---|---|---|
| I/O Cost | Disk reads and writes (pages transferred) | Often dominant for disk-based systems |
| CPU Cost | Tuple processing, comparisons, computations | Significant for complex predicates and functions |
| Memory Cost | Buffer usage, hash table sizes, sort memory | Critical when memory is constrained |
| Network Cost | Data transfer in distributed systems | Dominant in distributed and cloud databases |
| Startup Cost | Initial overhead before first tuple produced | Important for pipelined execution |
The total cost formula:
Most optimizers express total cost as:
Total_Cost = w₁ × IO_Cost + w₂ × CPU_Cost + w₃ × Memory_Cost + ...
Where w₁, w₂, w₃ are weights that reflect the relative costs of each resource. These weights are often configurable parameters:
Example from PostgreSQL:
PostgreSQL uses parameters like:
- seq_page_cost = 1.0 (baseline for sequential I/O)
- random_page_cost = 4.0 (random I/O is 4× more expensive)
- cpu_tuple_cost = 0.01 (processing each tuple)
- cpu_operator_cost = 0.0025 (evaluating predicates)

These weights can be tuned based on actual hardware characteristics. If you move to SSDs, you might reduce random_page_cost, since random reads are nearly as fast as sequential reads on SSDs.
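To see how such weights combine, here is a rough sketch of PostgreSQL-style scan costing. It is deliberately simplified (the real planner adds startup costs, correlation adjustments, and more), and the index_scan_cost helper is a hypothetical approximation rather than PostgreSQL's actual formula.

```python
# Rough sketch of how PostgreSQL-style weights combine into a scan cost.
seq_page_cost = 1.0
random_page_cost = 4.0
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025

def seq_scan_cost(pages, rows, n_quals=1):
    # Read every page sequentially, process every tuple, evaluate each predicate.
    return (pages * seq_page_cost
            + rows * cpu_tuple_cost
            + rows * n_quals * cpu_operator_cost)

def index_scan_cost(index_pages, heap_pages, matching_rows):
    # Treat index and heap page fetches as random I/O (pessimistic assumption).
    return ((index_pages + heap_pages) * random_page_cost
            + matching_rows * cpu_tuple_cost)

# A 50,000-page, 1M-row table: full scan vs. an index touching ~8 pages.
print("seq scan:  ", seq_scan_cost(50_000, 1_000_000))
print("index scan:", index_scan_cost(3, 5, 100))
```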
Optimal cost weights depend on hardware. Cloud databases often auto-tune these parameters based on observed performance. On-premise systems may require manual calibration. Incorrect weights lead to suboptimal plan choices—the optimizer might prefer a disk-heavy plan when CPU is the actual bottleneck.
For disk-based database systems, I/O cost is often the dominant factor. Reading and writing pages to disk is orders of magnitude slower than in-memory operations. A well-designed cost model pays careful attention to I/O.
Key I/O concepts:
- b_R — the number of pages (blocks) occupied by relation R
- n_R — the number of tuples in relation R
- h — the height of an index (levels traversed per lookup)
- s — the selectivity of the predicate being applied
- M — the number of buffer pages available in memory
- Sequential page reads are much cheaper than random reads, so how pages are accessed matters as much as how many are read.

Basic I/O cost formulas:
| Operation | I/O Cost Formula | Variables |
|---|---|---|
| Full Table Scan | b_R | b_R = number of pages in relation R |
| Index Scan (Clustered) | h + ⌈s × b_R⌉ | h = index height, s = selectivity |
| Index Scan (Unclustered) | h + ⌈s × n_R⌉ | n_R = number of tuples (worst case) |
| Nested Loop Join | b_R + n_R × b_S | Scan R, for each tuple scan S |
| Block Nested Loop | b_R + ⌈b_R / M⌉ × b_S | M = memory pages available |
| Sort-Merge Join | 3 × (b_R + b_S) | Sort both, then merge (approx.) |
| Hash Join | 3 × (b_R + b_S) | Build hash, partition if needed |
Example: Comparing scan vs. index access
Consider a table Orders with:
- n_R = 1,000,000 rows
- b_R = 50,000 pages
- An index on customer_id with height 3

Query: SELECT * FROM Orders WHERE customer_id = 12345
Assume the selectivity is 0.01% (100 matching rows).
Full table scan cost:
Cost = b_R = 50,000 page reads
Clustered index scan cost:
Cost = h + ⌈s × b_R⌉
= 3 + ⌈0.0001 × 50,000⌉
= 3 + 5 = 8 page reads
Unclustered index scan cost (worst case):
Cost = h + ⌈s × n_R⌉
= 3 + ⌈0.0001 × 1,000,000⌉
= 3 + 100 = 103 page reads
The index is clearly better, but notice how a clustered index (8 pages) dramatically outperforms an unclustered index (103 pages) because matching rows are stored together.
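The arithmetic above is easy to check mechanically. Below is a small sketch, assuming the formulas from the table, that reproduces the three costs for this example:

```python
# Sketch of the I/O formulas above, reproducing the Orders example
# (b_R = 50,000 pages, n_R = 1,000,000 rows, index height h = 3, s = 0.0001).
import math

def full_table_scan(b_r):
    return b_r

def clustered_index_scan(h, s, b_r):
    return h + math.ceil(s * b_r)

def unclustered_index_scan(h, s, n_r):
    return h + math.ceil(s * n_r)   # worst case: one page per matching tuple

b_r, n_r, h, s = 50_000, 1_000_000, 3, 0.0001
print("full scan:        ", full_table_scan(b_r))              # 50000
print("clustered index:  ", clustered_index_scan(h, s, b_r))   # 8
print("unclustered index:", unclustered_index_scan(h, s, n_r)) # 103
```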
Unclustered index costs can vary wildly. If matching rows happen to be on the same pages (good correlation), actual I/O is much lower than the worst case. Many optimizers track "clustering factor" or "correlation" statistics to refine these estimates.
While I/O often dominates, CPU cost becomes increasingly important as more of the working set fits in memory, as faster storage (SSD/NVMe) shrinks I/O latency, and as queries involve complex predicates, expressions, and user-defined functions.

Components of CPU cost include per-tuple processing, predicate evaluation, hashing and probing, sort comparisons, aggregate updates, and producing output tuples.

CPU cost formulas:
CPU_Cost(Scan R with predicate P) = n_R × (tuple_cost + predicate_cost(P))
CPU_Cost(Hash Join) = n_R × hash_cost + n_S × probe_cost + matched × output_cost
CPU_Cost(Sort) = n_R × log₂(n_R) × comparison_cost
CPU_Cost(Aggregation) = n_R × (grouping_cost + aggregate_update_cost)
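The sketch below evaluates a few of these formulas with made-up per-operation constants (TUPLE_COST, CMP_COST, and HASH_COST are illustrative values, not taken from any real system), mainly to show how per-tuple and per-predicate costs add up:

```python
# Sketch of the CPU cost formulas above, using invented per-operation constants.
import math

TUPLE_COST = 0.01        # touching one tuple
CMP_COST = 0.0025        # one comparison
HASH_COST = 0.005        # hashing or probing one key (hypothetical constant)

def scan_cpu(n_r, predicate_cost):
    return n_r * (TUPLE_COST + predicate_cost)

def sort_cpu(n_r):
    return n_r * math.log2(n_r) * CMP_COST

def hash_join_cpu(n_r, n_s, matched):
    return n_r * HASH_COST + n_s * HASH_COST + matched * TUPLE_COST

n = 1_000_000
print("scan, cheap predicate:    ", scan_cpu(n, predicate_cost=0.0025))
print("scan, expensive predicate:", scan_cpu(n, predicate_cost=0.25))  # ~100x costlier check
print("sort:", round(sort_cpu(n), 1))
print("hash join (1M build, 10M probe, 2.5M matches):",
      hash_join_cpu(1_000_000, 10_000_000, 2_500_000))
```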
Predicate complexity matters:
Not all predicates are equal. Consider these WHERE clauses:
-- Simple: Just an integer comparison
WHERE order_id = 12345
-- Moderate: String comparison with potential collation
WHERE customer_name = 'Smith'
-- Expensive: Pattern matching (may scan entire string)
WHERE customer_name LIKE '%Smith%'
-- Very expensive: Regular expression
WHERE customer_name ~ '^[A-Z][a-z]+\s[A-Z][a-z]+$'
A sophisticated optimizer assigns different costs to each predicate type. The regex might be 100× more expensive than the integer comparison, significantly affecting which plan is optimal.
Modern in-memory databases like SAP HANA, MemSQL (SingleStore), and VoltDB make CPU cost the primary concern. When all data is in RAM, optimizers must accurately model CPU costs—traditional I/O-centric cost models produce poor plans.
Cost formulas depend critically on cardinality estimates—predictions of how many rows will flow through each operator. These estimates come from statistics collected about the data.
Without statistics, optimization is blind:
Consider estimating the cost of: SELECT * FROM Orders WHERE status = 'pending'
If 'pending' matches 0.1% of orders, an index lookup is ideal; if it matches 60%, a full table scan is cheaper. The optimizer must know the data distribution to make the right decision.
| Statistic Type | Description | Usage in Cost Estimation |
|---|---|---|
| Table Cardinality (n_R) | Total number of rows in table R | Base for all tuple-based costs |
| Table Size (b_R) | Number of pages/blocks in table R | Base for all I/O costs |
| Column Cardinality (V(A,R)) | Number of distinct values in column A | Selectivity under uniformity assumption |
| Min/Max Values | Range of values in a column | Range query selectivity |
| Histograms | Distribution of values in buckets | Non-uniform selectivity estimation |
| Null Fraction | Percentage of NULL values | IS NULL / IS NOT NULL selectivity |
| Correlation | How column order matches physical order | Index scan cost refinement |
| Most Common Values | Frequent values and their frequencies | Accurate estimation for skewed data |
Selectivity estimation:
The selectivity of a predicate is the fraction of rows that satisfy it:
Selectivity(predicate) = |rows satisfying predicate| / |total rows|
Common selectivity formulas (assuming uniform distribution):
Selectivity(A = value) = 1 / V(A,R)
Selectivity(A > value) = (max(A) - value) / (max(A) - min(A))
Selectivity(A BETWEEN low AND high) = (high - low) / (max(A) - min(A))
Selectivity(A IN (v1, v2, ..., vk)) = k / V(A,R)
Combining selectivities (independence assumption):
Selectivity(P1 AND P2) = Selectivity(P1) × Selectivity(P2)
Selectivity(P1 OR P2) = Selectivity(P1) + Selectivity(P2) - Selectivity(P1) × Selectivity(P2)
Selectivity(NOT P) = 1 - Selectivity(P)
These formulas make strong assumptions (uniformity, independence) that are often violated in real data, leading to estimation errors.
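Here is a minimal sketch of these estimators under the uniformity and independence assumptions just described; the example predicate on hypothetical status and amount columns is illustrative only:

```python
# Sketch of the textbook selectivity estimators (uniformity + independence).
def eq_sel(distinct_values):                 # A = value
    return 1.0 / distinct_values

def gt_sel(value, lo, hi):                   # A > value
    return max(0.0, min(1.0, (hi - value) / (hi - lo)))

def between_sel(low, high, lo, hi):          # A BETWEEN low AND high
    return max(0.0, min(1.0, (high - low) / (hi - lo)))

def and_sel(s1, s2):                         # independence assumption
    return s1 * s2

def or_sel(s1, s2):
    return s1 + s2 - s1 * s2

# status = 'pending' AND amount > 900, with 5 distinct statuses and amount in [0, 1000]
s = and_sel(eq_sel(5), gt_sel(900, 0, 1000))
print("combined selectivity:", s)            # 0.2 * 0.1 = 0.02
print("estimated rows out of 10M:", int(s * 10_000_000))
```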
Cardinality estimation is the Achilles' heel of query optimization. Studies show estimates can be off by orders of magnitude, especially for complex predicates, joins, and correlated columns. A 10× estimation error can lead to choosing a plan that is 1000× slower than optimal.
Estimating the size of join results is particularly challenging and crucial because join ordering decisions depend heavily on intermediate result sizes.
The fundamental join formula:
For a join R ⋈ S on attribute A:
|R ⋈ S| = |R| × |S| / max(V(A,R), V(A,S))
This assumes containment of value sets (every join value in the relation with fewer distinct values also appears in the other relation) and a uniform distribution of tuples across those values.
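A quick sketch of the formula, using illustrative table sizes (a 100,000-row Customers table joined to a 10,000,000-row Orders table on customer_id):

```python
# Sketch of the join-size formula: |R join S| = |R| * |S| / max(V(A,R), V(A,S)).
def join_size(n_r, n_s, v_a_r, v_a_s):
    return n_r * n_s / max(v_a_r, v_a_s)

# 100k customers (100k distinct customer_id) joined to 10M orders
# (~100k distinct customer_id): every order matches exactly one customer.
print(join_size(100_000, 10_000_000, 100_000, 100_000))  # 10,000,000.0
```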
The error propagation problem:
In multi-way joins, estimation errors compound multiplicatively:
Query: R ⋈ S ⋈ T ⋈ U
If each join estimate has 2× error:
- After R ⋈ S: 2× off
- After R ⋈ S ⋈ T: 4× off
- After R ⋈ S ⋈ T ⋈ U: 8× off
With 10 tables, even small per-join errors can produce estimates that are millions of times wrong!
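The compounding is purely multiplicative, which a two-line sketch makes concrete (the 5× per-join error is an arbitrary illustrative figure):

```python
# Sketch: multiplicative error growth across a chain of joins.
def compounded_error(per_join_error, num_joins):
    return per_join_error ** num_joins

print(compounded_error(2, 3))    # 3 joins (4 tables), 2x each  -> 8x
print(compounded_error(5, 9))    # 9 joins (10 tables), 5x each -> ~2 million x
```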
Advanced techniques for better join estimation include multi-column and join statistics, runtime sampling, and machine-learned cardinality models. Research systems like Naru (learned cardinality estimation) and production features like Oracle's SQL Plan Management use historical execution data to improve estimates. The trend is toward adaptive optimization that learns from actual performance.
Given that cost estimates are inherently imprecise, how do optimizers cope? Several strategies have emerged to handle uncertainty:
1. Robust Plan Selection
Instead of choosing the plan with minimum estimated cost, choose a plan that performs reasonably well across a range of possible cardinalities:
Robust_Plan = argmin_P max_c∈C Cost(P, c)
Where C is the set of possible cardinality scenarios. This "minimax" approach avoids plans that are optimal for the estimate but catastrophic if the estimate is wrong.
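A minimal sketch of this minimax selection, with invented cost functions standing in for two candidate plans and a hand-picked set of cardinality scenarios:

```python
# Sketch of minimax ("robust") plan selection: for each candidate plan, take its
# worst cost over a set of plausible cardinality scenarios, then pick the plan
# whose worst case is smallest. The cost functions below are made up.
plans = {
    "index_nested_loop": lambda rows: 8 + rows * 4,        # great for tiny inputs
    "hash_join":         lambda rows: 50_000 + rows * 0.1, # flat cost, scales well
}
scenarios = [100, 10_000, 1_000_000]   # possible cardinalities of the outer input

def robust_plan(plans, scenarios):
    return min(plans, key=lambda name: max(plans[name](c) for c in scenarios))

print(robust_plan(plans, scenarios))   # hash_join: worse at 100 rows, far safer at 1M
```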
2. Plan Bouquets
Pre-compute multiple plans for different cardinality regions. At runtime, take a sample to refine the estimate, then choose the appropriate plan:
If estimated_rows < 100: use index nested loop plan
If estimated_rows in [100, 10000]: use hash join plan
If estimated_rows > 10000: use sort-merge join plan
3. Adaptive Query Execution
Monitor actual cardinalities during execution and switch plans mid-query if estimates prove wrong:
Oracle's Adaptive Query Optimization, SQL Server's Adaptive Joins, and Spark SQL's Adaptive Query Execution all implement forms of runtime adaptation. The future of query optimization increasingly involves runtime feedback, not just static estimation.
Let's walk through a complete cost estimation example to solidify these concepts.
Scenario:
Two tables:
- Customers: 100,000 rows, 5,000 pages, index on customer_id
- Orders: 10,000,000 rows, 500,000 pages, index on customer_id

Query:
SELECT c.name, COUNT(*) as order_count
FROM Customers c
JOIN Orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA'
GROUP BY c.name;
Statistics:
- 25% of customers have country = 'USA' (about 25,000 customers)
- Each customer has 100 orders on average
- Index heights are 3 (Plan A additionally relies on an index on Customers.country)
Plan A: Hash Join with Customer filter first
1. Index scan on Customers.country = 'USA'
- Pages: 3 (index) + 0.25 × 5,000 = 1,253 pages
- Rows output: 25,000 customers
2. Build hash table on filtered customers
- CPU: 25,000 × hash_cost
- Memory: ~25,000 × tuple_size
3. Scan Orders and probe hash table
- Pages: 500,000
- CPU: 10,000,000 × probe_cost
- Matches: 25,000 × 100 = 2,500,000 rows
4. Aggregate by name
- CPU: 2,500,000 × aggregate_cost
Total I/O: 1,253 + 500,000 = 501,253 pages
Plan B: Index Nested Loop Join
1. Scan all Orders
- Pages: 500,000
- Rows: 10,000,000
2. For each order, index lookup on Customers
- Per lookup: 3 pages (index) + 1 page (data)
- Total lookups: 10,000,000
- But with caching, unique customers accessed: 100,000
- Estimated pages: 100,000 × 4 = 400,000 (optimistic with caching)
3. Filter on country = 'USA'
- 25% pass: 2,500,000 rows
4. Aggregate by name
- CPU: 2,500,000 × aggregate_cost
Total I/O: 500,000 + 400,000 = 900,000 pages
Comparison:
| Plan | I/O Cost | Notes |
|---|---|---|
| Plan A (Hash Join) | 501,253 pages | Filter early reduces hash table size |
| Plan B (Index NL) | 900,000 pages | Index lookups expensive despite caching |
Winner: Plan A — the hash join with early filtering is substantially cheaper because it avoids repeated index probes.
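As a sanity check, the sketch below reproduces the I/O arithmetic for both plans from the numbers above (the 4-pages-per-lookup and caching assumptions for Plan B are carried over from the plan description):

```python
# Sketch reproducing the I/O arithmetic for Plan A and Plan B above.
import math

customers_pages, customers_rows = 5_000, 100_000
orders_pages, orders_rows = 500_000, 10_000_000
usa_selectivity, index_height = 0.25, 3

def plan_a_io():
    # Index scan on country, then one full scan of Orders to probe the hash table.
    filtered_customer_pages = index_height + math.ceil(usa_selectivity * customers_pages)
    return filtered_customer_pages + orders_pages

def plan_b_io():
    # Full scan of Orders plus (optimistically cached) index lookups into Customers.
    lookups = customers_rows * 4           # 3 index pages + 1 data page per customer
    return orders_pages + lookups

print("Plan A:", plan_a_io())   # 501,253
print("Plan B:", plan_b_io())   # 900,000
```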
If only 0.1% of customers (about 100 customers) were in the USA, an index nested loop join driven from the filtered Customers side might win—only about a hundred index probes into Orders instead of a full 500,000-page scan. The optimal plan depends critically on selectivity estimates. This is why accurate statistics matter!
We've explored the foundation of cost-based query optimization. Let's consolidate the key concepts:
- Cost estimation assigns abstract, relative costs to plans so they can be compared without being executed.
- Total cost is typically a weighted combination of I/O, CPU, memory, network, and startup costs, with weights tuned to the hardware.
- Cost formulas depend on cardinality and selectivity estimates derived from statistics such as table sizes, distinct-value counts, and histograms.
- Uniformity and independence assumptions keep estimation tractable but introduce errors that compound multiplicatively across joins.
- Optimizers cope with this uncertainty through robust plan selection, plan bouquets, and adaptive execution.
What's next:
With cost estimation in place, the optimizer must efficiently compare alternative plans. The next page explores Plan Comparison—how optimizers enumerate candidate plans, prune the search space, and select winners using the cost estimates we've learned to compute.
You now understand the mathematical foundation of cost-based optimization. Cost estimation transforms the abstract problem of finding the best plan into a concrete optimization: minimize a cost function. Next, we'll see how optimizers efficiently search through the vast space of possible plans.