When you execute a SQL query, something remarkable happens behind the scenes. The database has mere milliseconds to make a critical decision: out of potentially thousands of different ways to execute your query, which execution plan will be fastest? This is the domain of cost estimation—the mathematical foundation that enables query optimizers to predict the future.
Cost estimation is the process by which a database management system predicts the resource consumption of executing a query using a particular execution plan before actually executing it. It's like asking: 'If I run this query using Plan A, how long will it take and how many resources will it consume?' The optimizer asks this question for multiple candidate plans and chooses the one with the lowest estimated cost.
This capability is what separates modern database systems from naive query processors. Without cost estimation, databases would either execute queries using fixed strategies (often catastrophically slow) or try every possible strategy (computationally infeasible). Cost estimation enables intelligent selection among potentially millions of execution alternatives.
By the end of this page, you will understand what cost estimation is and why it's essential, how cost functions are structured and what components they include, the relationship between cost estimation and query optimization, the fundamental challenges and sources of estimation error, and how cost models have evolved across different database systems.
To appreciate the importance of cost estimation, consider a deceptively simple query:
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 100000 AND d.location = 'New York';
Even for this straightforward two-table join with two filter conditions, the optimizer faces numerous choices:
Access Path Decisions:
- Should employees be accessed via a full table scan or an index on salary?
- Should departments be accessed via a full table scan or an index on location?

Join Order Decisions:
- Should we join employees ⋈ departments or departments ⋈ employees?

Join Algorithm Decisions:
- Should the join be executed as a nested loop, a hash join, or a sort-merge join?

Predicate Application:
- Should the salary and location filters be applied before, during, or after the join?
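A quick way to see which of these choices the planner actually makes (assuming the employees and departments tables above exist and have been analyzed in a PostgreSQL database) is to prefix the query with EXPLAIN:

```sql
-- Show the plan the optimizer chose: the output reveals the access paths,
-- join order, and join algorithm it settled on.
EXPLAIN
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 100000 AND d.location = 'New York';
```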
For just two tables, we might have 10-20 reasonable execution plans. For a 10-table join query—common in analytical workloads—the number of possible plans can exceed 10 billion. The optimizer cannot try them all. Instead, it must estimate which plans are likely to be fast and focus its search accordingly.
For n tables in a join query, the number of possible join orders alone is n! (n factorial). For 5 tables, that's 120 orderings. For 10 tables, it's 3,628,800. Add in choices for access methods, join algorithms, and predicate placement, and the search space grows exponentially. Cost estimation makes navigating this space tractable.
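You can reproduce these counts directly; the snippet below is purely illustrative and uses PostgreSQL's built-in factorial() function:

```sql
-- Number of possible join orders for n tables: n! orderings
SELECT n AS tables, factorial(n) AS join_orders
FROM generate_series(2, 10) AS t(n);
-- 5 tables  ->       120 orderings
-- 10 tables -> 3,628,800 orderings
```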
The Performance Stakes:
The difference between a good and bad execution plan isn't marginal—it can be orders of magnitude:
| Scenario | Bad Plan | Good Plan | Improvement |
|---|---|---|---|
| Simple join (1M × 1M rows) | 45 minutes | 0.3 seconds | 9,000× faster |
| Complex analytics (8-way join) | 12 hours | 2 minutes | 360× faster |
| OLTP transaction | 500ms | 2ms | 250× faster |
These aren't hypothetical examples—they reflect real production scenarios. A poorly chosen nested loop join where a hash join would be optimal, or a table scan where an index seek would be appropriate, can transform a sub-second query into a multi-hour ordeal.
Cost estimation is the mechanism that prevents these performance disasters. By predicting execution costs, the optimizer can steer toward efficient plans and away from catastrophic ones.
A cost model is a mathematical framework that assigns a numeric cost value to a query execution plan. This cost is a proxy for the actual execution time or resource consumption. The model consists of three fundamental components:
1. Cost Functions: Mathematical formulas that compute the cost of individual operations (table scans, index lookups, joins, sorts, etc.)
2. Statistical Inputs: Data about table sizes, column distributions, and correlations that feed into the cost functions
3. System Parameters: Hardware-specific constants like disk I/O time, CPU speed, memory capacity, and network latency
The general structure of a cost estimation formula is:
Total_Cost(Plan) = Σ Cost(operation_i) for all operations in Plan

Cost(operation) = w_io  × IO_Cost(operation)
               + w_cpu × CPU_Cost(operation)
               + w_mem × Memory_Cost(operation)
               + w_net × Network_Cost(operation)

Where:
- w_io, w_cpu, w_mem, w_net are weight factors
- Each component cost is computed from formulas specific to the operation type
- Formulas are parameterized by statistics and system constants
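As a concrete instance of this weighted-sum structure, PostgreSQL's documented estimate for a plain sequential scan is pages × seq_page_cost plus rows × cpu_tuple_cost. The sketch below recomputes that estimate from catalog statistics; the table name orders is a placeholder:

```sql
-- Recompute the planner's sequential-scan estimate by hand
-- (run ANALYZE first so relpages/reltuples are populated):
--   I/O term: pages on disk × seq_page_cost
--   CPU term: rows to scan  × cpu_tuple_cost
-- A WHERE clause would add roughly cpu_operator_cost per row on top.
SELECT relpages::numeric  * current_setting('seq_page_cost')::numeric
     + reltuples::numeric * current_setting('cpu_tuple_cost')::numeric
       AS estimated_seq_scan_cost
FROM pg_class
WHERE relname = 'orders';
```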
Cost as a Relative Measure:

It's crucial to understand that cost values are unit-less relative measures, not predictions of actual execution time. A plan with cost 1000 isn't predicted to take 1000 milliseconds; it's predicted to be about 10× more expensive than a plan with cost 100.
This relative nature has an important implication: cost values are only meaningful when comparing candidate plans for the same query on the same system and configuration.
Modern systems like PostgreSQL do attempt to calibrate costs to approximate milliseconds, but this remains a secondary goal. The primary purpose is plan comparison and selection.
How the major systems weight and express these components differs:

| Component | PostgreSQL | Oracle | SQL Server | MySQL |
|---|---|---|---|---|
| Primary Cost Unit | Estimated time (ms-like) | I/O operations | Resource units | Cost units |
| I/O Weighting | seq_page_cost = 1.0, random_page_cost = 4.0 | I/O cost dominates | Configurable via cost weights | Implicit in formulas |
| CPU Weighting | cpu_tuple_cost = 0.01 | Included in workarea sizing | Separate CPU component | Minimal explicit CPU cost |
| Memory Consideration | work_mem limits | PGA/SGA sizing | Memory grants | Buffer pool assumptions |
| Parallelism | parallel_tuple_cost | Parallel execution cost | Parallelism cost threshold | Limited support |
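If you are on PostgreSQL, you can read the current values of these knobs straight from the server; the names below are the standard planner cost parameters:

```sql
-- Current cost-model parameters on this server
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost',
               'cpu_tuple_cost', 'cpu_index_tuple_cost', 'cpu_operator_cost',
               'parallel_tuple_cost', 'parallel_setup_cost', 'work_mem');
```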
Cost estimation occurs within the query optimization phase, following a structured pipeline that transforms a logical query representation into a costed physical execution plan:
Pipeline Stages Explained:
Stage 1: Plan Enumeration
The optimizer generates candidate execution plans. For a query with multiple tables, this involves considering different join orders, access methods, and algorithm choices. Sophisticated optimizers use dynamic programming or heuristics to prune the search space.

Stage 2: Cardinality Estimation
For each operator in a candidate plan, the optimizer estimates how many rows will flow through it. This is critical because the cost of most operations scales with the number of rows processed. Cardinality estimates propagate bottom-up through the plan tree.

Stage 3: Cost Calculation
Using the cardinality estimates and system parameters, the optimizer applies cost functions to compute the total cost of each plan. This involves summing the costs of all operators: scans, filters, joins, sorts, and aggregations.

Stage 4: Plan Comparison
Candidate plans are compared by their total estimated cost. The plan with the lowest cost is selected (subject to memory and other constraints).
The Dependency Chain:
Note the critical dependency: cost estimation depends on cardinality estimation, which depends on statistics. If statistics are stale or missing, cardinality estimates will be wrong, leading to incorrect cost estimates and potentially disastrous plan choices. This dependency chain is both the power and the vulnerability of cost-based optimization.
More query optimization problems are caused by cardinality estimation errors than by any other factor. Even with perfect cost formulas, if you estimate that a query returns 100 rows when it actually returns 1 million, you'll choose the wrong plan. Cardinality estimation is the foundation upon which all cost-based optimization rests.
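In PostgreSQL, for example, the statistics behind these cardinality estimates are visible in the pg_stats view and are refreshed by ANALYZE. A quick inspection sketch (the table name orders is a placeholder):

```sql
-- Per-column statistics used for selectivity estimation
SELECT attname, null_frac, n_distinct, most_common_vals, correlation
FROM pg_stats
WHERE tablename = 'orders';

-- Refresh statistics after heavy data changes
ANALYZE orders;
```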
A comprehensive cost model accounts for all significant resource consumptions during query execution. These fall into several categories:
I/O Costs (Often Dominant):
- Reading data pages from disk via sequential scans and index accesses
- Writing intermediate results and spill files when memory is exhausted
- The penalty of random access patterns versus sequential ones

CPU Costs:
- Evaluating predicates and expressions for each tuple
- Computing hash values and comparisons for joins, sorts, and aggregations
- Copying and projecting tuples between operators

Memory Costs:
- Hash tables for hash joins and hash aggregation
- Sort buffers and other work areas
- Spilling to disk when memory limits (such as work_mem) are exceeded

Network Costs (Distributed Systems):
- Shipping tuples between nodes
- Broadcasting or repartitioning tables for distributed joins
| Operation | Relative Cost | Why This Cost |
|---|---|---|
| CPU (per tuple) | 1× | In-memory processing is extremely fast |
| Sequential I/O (per page) | 100× | Disk seeks are slow, but sequential access is predictable |
| Random I/O (per page) | 400× | Disk head movement adds significant latency |
| Network transfer (per page) | 200× | Network latency + bandwidth constraints |
| Flash/SSD random I/O (per page) | 25× | SSDs eliminate mechanical seek time |
The I/O Dominance Principle:
In traditional disk-based databases, I/O costs typically dominate. Reading a single random disk page might take 10 milliseconds, while a CPU can process thousands of tuples in that time. This historical reality shaped cost models to focus heavily on I/O minimization.
However, this landscape is shifting:
- SSD and NVMe storage shrink the gap between random and sequential I/O
- Servers with hundreds of gigabytes of RAM keep entire working sets cached
- Once I/O is largely removed from the picture, CPU and memory costs become the dominant factors
Modern cost models must balance these factors appropriately. An optimizer optimizing purely for I/O on a system with 512GB of RAM and all data cached will make suboptimal choices.
Advanced systems like Oracle and SQL Server allow DBAs to configure cost model parameters to reflect actual hardware characteristics. If your storage is SSD-based, you can reduce the random I/O cost weight. If you have abundant CPU cores, you can reduce parallelism startup costs. Proper calibration significantly improves plan quality.
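PostgreSQL exposes the same kind of calibration through its planner parameters. A common adjustment for SSD or NVMe storage is shown below; the value 1.1 is a frequently suggested starting point, not a universal constant:

```sql
-- Tell the planner that random page reads are barely more expensive than
-- sequential ones, which is typical for SSD/NVMe storage.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Or experiment per-session before committing the change:
SET random_page_cost = 1.1;
```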
Cost estimation is inherently imperfect. Understanding the sources of error helps in diagnosing query performance problems and maintaining database health.
The Error Amplification Effect:
Consider a simple three-way join: A ⋈ B ⋈ C, where each table holds 1M rows. If the selectivity of the join condition A ⋈ B is estimated as 10⁻⁸ (one match per 100 million row pairs) but is actually 10⁻⁷, the intermediate result is underestimated by 10×. When this intermediate result joins with C (selectivity 0.1), that 10× error carries forward:

Estimated flow: 1M × 1M × 10⁻⁸ = 10K rows
                10K × 1M × 0.1 = 1B rows

Actual flow:    1M × 1M × 10⁻⁷ = 100K rows
                100K × 1M × 0.1 = 10B rows (then filtered)
The optimizer chose hash join expecting to process 1B rows. It actually processed 10B rows. The query runs 10× slower than expected—not because of a wrong algorithm choice, but because cardinality estimation errors cascaded through the plan.
This is why database performance tuning so often comes down to statistics maintenance and understanding cardinality estimation.
Estimation errors often produce 'cliff' effects. A small cardinality overestimate might push the optimizer from a nested loop (optimal for small results) to a hash join (better for large results). If the actual cardinality is small, you've switched to a vastly worse algorithm. This discontinuity means small estimation errors can cause large performance impacts.
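A classic trigger for these cliffs is a pair of correlated predicates: the planner multiplies their individual selectivities and lands far below the true row count. The sketch below uses a hypothetical addresses table with correlated city and state columns, and the extended statistics feature available in PostgreSQL 10+:

```sql
-- city and state are strongly correlated; multiplying their individual
-- selectivities can badly underestimate the matching row count.
EXPLAIN SELECT * FROM addresses
WHERE city = 'New York' AND state = 'NY';

-- Extended statistics let the planner learn the functional dependency:
CREATE STATISTICS addr_city_state (dependencies) ON city, state FROM addresses;
ANALYZE addresses;
```

After ANALYZE, rerunning the EXPLAIN should show a row estimate much closer to reality for the combined predicate.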
Database systems implement cost models at various levels of sophistication. Understanding these categories helps in evaluating optimizer capabilities and limitations.
Evolutionary Progression:
Generation 1: Rule-Based (1970s-1980s)
Early systems used fixed rules: 'Apply selections early,' 'Use indexes when possible.' No cost estimation; decisions were deterministic based on query structure.

Generation 2: Basic Cost-Based (1980s-1990s)
System R pioneered cost-based optimization. Simple statistics (row counts, column cardinalities) fed into I/O-focused cost functions. Revolutionary for its time, but limited by single-column statistics.

Generation 3: Advanced Cost-Based (1990s-2010s)
Histograms, multi-column statistics, extended statistics on expressions. Sophisticated cardinality estimators. Plan caching and prepared statements. Oracle, SQL Server, and PostgreSQL represent this generation.

Generation 4: Adaptive and Learned (2010s-Present)
Adaptive query execution (runtime plan adjustment), machine learning for cardinality estimation, query result caching, and feedback-driven statistics. Systems learn from execution history to improve future estimates.
Modern databases like Oracle 19c, SQL Server 2022, and PostgreSQL 15+ incorporate Generation 4 features alongside traditional cost models.
Adaptive query execution (AQE) represents a paradigm shift. Instead of committing to a single plan before execution, the system monitors actual cardinalities during execution and can switch strategies mid-query. This mitigates the impact of estimation errors by correcting course when reality diverges from prediction.
Let's examine how cost estimation works in a real database system. PostgreSQL provides excellent visibility into its cost model through the EXPLAIN command.
```sql
-- Sample query on a hypothetical e-commerce database
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT c.customer_name, o.order_date,
       SUM(oi.quantity * oi.unit_price) AS order_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-03-31'
  AND c.country = 'USA'
GROUP BY c.customer_name, o.order_date
ORDER BY order_total DESC
LIMIT 100;

-- Example output with cost details:
/*
Limit  (cost=45678.23..45678.48 rows=100 width=52) (actual time=234.567..234.589 rows=100 loops=1)
  ->  Sort  (cost=45678.23..45923.45 rows=98089 width=52) (actual time=234.563..234.577 rows=100 loops=1)
        Sort Key: (sum((oi.quantity * oi.unit_price))) DESC
        Sort Method: top-N heapsort  Memory: 42kB
        ->  HashAggregate  (cost=39234.56..40215.45 rows=98089 width=52) (actual time=189.234..212.456 rows=72543 loops=1)
              Group Key: c.customer_name, o.order_date
              ->  Hash Join  (cost=8234.12..36478.23 rows=245678 width=36) (actual time=45.678..156.789 rows=198234 loops=1)
                    Hash Cond: (oi.order_id = o.order_id)
                    ->  Seq Scan on order_items oi  (cost=0.00..18234.56 rows=1023456 width=20) (actual time=0.012..67.890 rows=1023456 loops=1)
                    ->  Hash  (cost=7234.12..7234.12 rows=80000 width=24) (actual time=45.234..45.234 rows=78234 loops=1)
                          Buckets: 131072  Batches: 1  Memory Usage: 5234kB
                          ->  Hash Join  (cost=1234.56..7234.12 rows=80000 width=24) (actual time=12.345..34.567 rows=78234 loops=1)
                                Hash Cond: (o.customer_id = c.customer_id)
                                ->  Index Scan using idx_orders_date on orders o  (cost=0.43..5123.45 rows=95678 width=16) (actual time=0.034..18.234 rows=92345 loops=1)
                                      Index Cond: ((order_date >= '2024-01-01'::date) AND (order_date <= '2024-03-31'::date))
                                ->  Hash  (cost=987.65..987.65 rows=19745 width=22) (actual time=12.123..12.123 rows=18976 loops=1)
                                      ->  Seq Scan on customers c  (cost=0.00..987.65 rows=19745 width=22) (actual time=0.008..8.234 rows=18976 loops=1)
                                            Filter: (country = 'USA'::text)
                                            Rows Removed by Filter: 32156
*/
```

Interpreting the Cost Numbers:
Each node shows two cost numbers: (startup_cost..total_cost)
For the outer Hash Join, which combines order_items with the hashed orders/customers result (Hash Cond: oi.order_id = o.order_id):
- cost=8234.12..36478.23: startup cost is 8234.12 (mostly building the hash table); total cost is 36478.23
- rows=245678: the optimizer estimates 245,678 rows will be produced
- width=36: each row uses approximately 36 bytes

Comparing Estimated vs Actual:
The ANALYZE option reveals actual execution:
- rows=245678 (estimated) vs rows=198234 (actual): off by about 24%

When estimates differ dramatically (10× or more), that's diagnostic of stale statistics or correlation issues that need investigation.
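When you suspect stale statistics, PostgreSQL records when each table was last analyzed and how many rows have changed since; tables with large modification counts are prime suspects:

```sql
-- Tables with many modifications since their last ANALYZE are the most
-- likely sources of bad cardinality estimates.
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;
```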
PostgreSQL costs are calibrated such that cost=1.0 roughly corresponds to a sequential page read. The actual time depends on your hardware. A cost of 45678 might execute in 200ms on fast NVMe storage or 5 seconds on slow spinning disks. Use EXPLAIN ANALYZE to see actual timings on your system.
We've established the foundational understanding of cost estimation in database query processing. Let's consolidate the key concepts:

- Cost estimation predicts the relative resource consumption of candidate execution plans before running them, making an exponentially large search space tractable.
- A cost model combines cost functions, statistical inputs, and system parameters; its output is a unit-less, relative measure used to compare plans, not a runtime prediction.
- Cost estimates depend on cardinality estimates, which depend on statistics; errors in this chain cascade and amplify through the plan.
- Good and bad plans can differ by orders of magnitude, so small estimation errors near decision boundaries can cause large performance swings.
- Cost models have evolved from rule-based heuristics to adaptive, feedback-driven systems that can correct course at runtime.
What's Next:
With the conceptual foundation established, we'll dive deeper into the most impactful component of cost models: I/O Cost. The next page explores how databases model disk and memory access patterns, account for caching effects, and optimize for the realities of storage hierarchies.
You now understand the fundamental role and structure of cost estimation in query optimization. This knowledge is essential for understanding query performance, interpreting EXPLAIN output, and diagnosing optimization problems in production databases.