When you submit a SQL query to a database system, something remarkable happens before any data is touched. Your declarative request—a statement describing what data you want—must be transformed into an imperative sequence of operations that actually retrieves that data. This transformation is anything but mechanical: for any query of realistic complexity, there are potentially millions or even billions of different ways to execute it.
The query optimizer is the database component responsible for navigating this vast landscape of possibilities and selecting an execution plan that (ideally) minimizes resource consumption while maximizing performance. It is, without exaggeration, one of the most sophisticated pieces of software ever written—a component that must make near-optimal decisions in milliseconds, decisions that can mean the difference between a query completing in 10 milliseconds or 10 hours.
By the end of this page, you will understand the fundamental goal of query optimization: why it exists, what it aims to achieve, how 'optimal' is defined in the context of query execution, and the inherent challenges that make optimization one of the hardest problems in database systems. You'll gain insight into the optimizer's role as a critical bridge between declarative SQL and efficient physical execution.
To understand the goal of query optimization, we must first understand why optimization is necessary at all. The necessity stems from a fundamental property of SQL and other declarative query languages: they specify what data to retrieve, not how to retrieve it.
Consider a simple SQL query:
```sql
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 100000 AND d.location = 'New York';
```
This query requests the names of highly paid employees in New York departments. But how should the database execute this? The options are staggering:

- Which table to read first, and in what order to perform the join
- Which join algorithm to use (nested-loop, hash, or sort-merge)
- How to access each table (full scan, or an index if one exists)
- Where to apply each filter (before the join, during it, or after it)
Each combination of these choices yields a different execution plan. For even a modest query joining five tables, there are already 5! = 120 left-deep join orderings, and allowing bushy join trees raises that to 1,680. When you factor in algorithm choices, access method selection, and operator placement, a five-way join can have well over a hundred thousand valid execution plans.
The performance difference between a good plan and a bad plan isn't marginal—it's often orders of magnitude. A query that completes in 50 milliseconds with an optimal plan might take 50 minutes—or even 50 hours—with a poor plan. This enormous spread in execution times is what makes optimization indispensable.
Early database systems executed queries in the order written, without optimization. A query that joined tables A, B, C in that order would always join A with B first, then with C—even if joining B with C first would eliminate 99% of rows. These systems became unusable as data grew, forcing the invention of query optimization in IBM's System R project in the 1970s.
The query optimizer's goal appears simple in statement but profound in implication:
Find the execution plan that minimizes the cost of executing a query while producing correct results.
This definition contains three critical components that warrant deep examination:
Before any discussion of efficiency, the optimizer must guarantee semantic correctness—the chosen plan must produce exactly the same results as any other valid plan for the query. This is non-negotiable. The optimizer is free to transform the query in any way that preserves its meaning, but the final result set must be identical (or equivalent, in the case of unordered results) to what the user specified.
Correctness constraints include:

- The same set (or multiset, when duplicates are preserved) of rows must be returned
- NULL handling and three-valued-logic semantics must survive predicate reordering
- Any ORDER BY, LIMIT, grouping, or aggregation requested by the query must be honored
- Transformations may reorder work, but they must never change what the query means
The optimizer uses equivalence rules (covered in Module 2) to verify that plan transformations preserve query semantics. These rules form the mathematical foundation guaranteeing correctness.
Two plans are equivalent if they produce the same result set for all possible database states. The optimizer only considers equivalent plans, never sacrificing correctness for performance. This constraint dramatically limits the search space—not every possible plan is semantically valid.
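To make this concrete, here is a toy sketch (plain Python, not any engine's rewrite machinery) that evaluates the earlier example query two ways, filtering before the join and after it, and checks that both orderings return the same rows. The table contents are invented for the illustration.

```python
# Toy data standing in for the employees and departments tables.
employees = [
    {"name": "Ada",  "dept_id": 1, "salary": 150_000},
    {"name": "Bob",  "dept_id": 2, "salary": 90_000},
    {"name": "Cara", "dept_id": 1, "salary": 120_000},
]
departments = [
    {"id": 1, "department_name": "Research", "location": "New York"},
    {"id": 2, "department_name": "Sales",    "location": "Boston"},
]

def filter_then_join():
    # Push the predicates below the join: filter each table first.
    hi_paid = [e for e in employees if e["salary"] > 100_000]
    ny_depts = [d for d in departments if d["location"] == "New York"]
    return {(e["name"], d["department_name"])
            for e in hi_paid for d in ny_depts if e["dept_id"] == d["id"]}

def join_then_filter():
    # Join everything first, then apply both predicates to the joined rows.
    joined = [(e, d) for e in employees for d in departments
              if e["dept_id"] == d["id"]]
    return {(e["name"], d["department_name"]) for e, d in joined
            if e["salary"] > 100_000 and d["location"] == "New York"}

# Both plans are equivalent: the same result set for this database state.
assert filter_then_join() == join_then_filter()
```

The two functions do very different amounts of work, but an optimizer may only swap one for the other because an equivalence rule guarantees they produce identical results.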
Given correctness, the optimizer seeks to minimize cost. But what exactly is cost? Different systems define it differently:
| Cost Metric | Definition | When Prioritized |
|---|---|---|
| Response Time | Total wall-clock time to first/last result | OLTP, interactive queries |
| Resource Consumption | Total CPU + I/O + Memory used | Shared systems, cloud billing |
| Throughput | Queries completed per unit time | Batch processing |
| I/O Operations | Disk reads/writes required | Disk-bound systems |
| Network Traffic | Data transferred across nodes | Distributed databases |
Historically, disk I/O dominated cost models because mechanical disk access was 100,000× slower than memory access. Modern systems with SSDs and large memory pools increasingly weight CPU cost and memory bandwidth. Cloud databases often optimize for monetary cost, not just time.
The optimizer's cost model assigns an estimated cost to each operator and plan. This model, combined with statistics about data distribution, enables comparison of plans without executing them.
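As an illustration of the idea, the sketch below uses invented constants and formulas, not any real system's cost model, to show how per-page I/O and per-row CPU weights combine with table statistics to compare a sequential scan against an index scan without executing either.

```python
# Hypothetical cost-model constants (real systems tune these per hardware).
SEQ_PAGE_COST = 1.0      # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0   # cost to read one page at random (index probes)
CPU_ROW_COST = 0.01      # cost to process one row

def seq_scan_cost(pages, rows):
    """Read every page once, evaluate the predicate on every row."""
    return pages * SEQ_PAGE_COST + rows * CPU_ROW_COST

def index_scan_cost(matching_rows, index_depth=3):
    """Descend the index, then fetch each matching row with a random read."""
    return (index_depth + matching_rows) * RANDOM_PAGE_COST \
           + matching_rows * CPU_ROW_COST

# Statistics: employees has 10,000 pages and 1,000,000 rows;
# the optimizer estimates that salary > 100000 matches 5% of rows.
selectivity = 0.05
matching = int(1_000_000 * selectivity)

print(seq_scan_cost(10_000, 1_000_000))   # ~20,000 cost units
print(index_scan_cost(matching))          # ~200,500 cost units
```

With a 5% selectivity estimate the sequential scan wins; shrink the estimate to 0.1% and the index scan becomes cheaper. The decision hinges entirely on the statistics fed into the formulas.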
The optimizer's output is an execution plan (also called a query plan or query execution plan)—a detailed specification of how to execute the query. A complete execution plan specifies:

- The access method for each table (sequential scan, index scan, and so on)
- The order in which tables are joined and the algorithm used for each join
- Where filters, projections, aggregations, and sorts are applied
- How much memory operators receive and whether work is parallelized
The plan is typically represented as a tree of operators, where data flows from leaf nodes (data sources) up to the root (final result). This tree structure mirrors the nested evaluation of relational algebra expressions.
```
Hash Join  (cost=1250.00 rows=850)
├── Hash Condition: e.dept_id = d.id
├── Filter: d.location = 'New York'
│   └── Seq Scan on departments d  (cost=50.00 rows=15)
│       └── Filter applied during scan
└── Index Scan on employees e  (cost=450.00 rows=5000)
    ├── Index: idx_employees_salary
    ├── Index Condition: salary > 100000
    └── Returns: e.name, e.dept_id

Total Estimated Cost: 1750.00
Estimated Rows: 850
Estimated Time: 45ms
```

If the optimizer's goal is simply to minimize cost, why is optimization considered one of the hardest problems in database systems? The challenge stems from several fundamental difficulties:
The number of possible execution plans grows exponentially with query complexity. For a query joining n tables:

- There are n! possible left-deep join orderings
- Allowing bushy join trees raises the count to (2n-2)!/(n-1)!
- Each of the n-1 joins can use any of several algorithms (nested-loop, hash, merge)
- Each table can be read through several access paths (heap scan or one of its indexes)
For a 10-table join with 3 join algorithms per join and 2 access methods per table, the search space exceeds 10^17 possible plans. Exhaustively evaluating every plan is computationally infeasible: even at one nanosecond per evaluation, it would take more than a decade.
| Tables Joined | Left-Deep Join Orders (n!) | Bushy Join Trees | With 3 Join Algorithms | Exploration Time* |
|---|---|---|---|---|
| 3 | 6 | 12 | 108 | ~0.1 ms |
| 5 | 120 | 1,680 | ~1.4 × 10^5 | ~0.1 s |
| 7 | 5,040 | 665,280 | ~4.9 × 10^8 | ~8 minutes |
| 10 | ~3.6M | ~1.8 × 10^10 | ~3.5 × 10^14 | ~11 years (infeasible) |
| 15 | ~1.3 × 10^12 | ~3.5 × 10^18 | ~1.7 × 10^25 | ~10^11 years (impossible) |

*Assuming 1 microsecond per plan evaluation; the algorithm column multiplies the bushy-tree count by 3^(n-1), one choice per join.
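The counts in the table can be reproduced with a few lines of arithmetic. The sketch below computes the n! left-deep orderings, the (2n-2)!/(n-1)! bushy trees, and the blow-up from choosing one of three join algorithms at each of the n-1 joins; access-path choices would multiply the totals further.

```python
from math import factorial

def left_deep_orders(n):
    # One permutation of the n tables per left-deep tree.
    return factorial(n)

def bushy_trees(n):
    # Number of binary join trees over n labelled leaves: (2n-2)!/(n-1)!.
    return factorial(2 * n - 2) // factorial(n - 1)

def with_algorithms(n, algos=3):
    # Each of the n-1 joins can independently use any of `algos` algorithms.
    return bushy_trees(n) * algos ** (n - 1)

for n in (3, 5, 7, 10, 15):
    plans = with_algorithms(n)
    seconds = plans * 1e-6            # at 1 microsecond per plan evaluation
    print(f"{n:2d} tables: {plans:.2e} plans, ~{seconds:.2e} s to enumerate")
```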
The optimizer cannot execute plans to measure their true cost—that would defeat the purpose. Instead, it must estimate costs based on:

- Statistics gathered about tables and indexes (row counts, page counts, histograms, distinct-value counts)
- Selectivity estimates for filter and join predicates
- Simplifying assumptions such as uniform data distribution and independence between columns
- Cost formulas that model each operator's CPU, I/O, and memory behavior
These estimates are inherently imprecise. Statistics become stale as data changes. Correlations between columns are often ignored. Join selectivity estimation for complex predicates involves significant guesswork.
The optimizer is making critical decisions based on imperfect information. A 10× error in cardinality estimation can lead to a 1000× error in plan cost—and the optimizer has no way to know its estimates are wrong until the query executes.
Research shows that cardinality estimation errors frequently exceed 100× in real workloads. A join estimated to produce 1,000 rows might produce 1 million. The optimizer, believing the intermediate result to be small, might choose nested-loop join instead of hash join—turning a 1-second query into a 1-hour disaster.
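The sketch below illustrates how such errors arise under the common independence assumption: the optimizer multiplies single-column selectivities, which badly underestimates cardinality when the columns are correlated. The table, predicates, and selectivities are invented for the example.

```python
table_rows = 10_000_000

# Per-column selectivities taken from single-column statistics.
sel_city = 0.01      # city = 'New York'
sel_zip  = 0.01      # zip_code LIKE '100%'

# Independence assumption: multiply the individual selectivities.
estimated = table_rows * sel_city * sel_zip        # 1,000 rows

# In reality the two predicates are almost perfectly correlated
# (New York rows mostly have '100xx' zip codes), so the combined
# selectivity is close to that of the weaker predicate alone.
actual = table_rows * 0.01                         # ~100,000 rows

print(f"estimated: {estimated:,.0f}  actual: {actual:,.0f}  "
      f"error: {actual / estimated:.0f}x")         # 100x underestimate
```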
Optimization itself consumes time and resources. The optimizer faces a paradox: spending more time searching generally yields better plans, yet that search time is pure overhead added to every query before a single row is returned, and the optimizer cannot know in advance how much a better plan would actually save.
For OLTP queries that should complete in milliseconds, spending seconds on optimization is absurd. For long-running analytical queries, spending extra seconds optimizing can save hours of execution. The optimizer must dynamically balance optimization thoroughness against time pressure.
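One way this balancing act is often handled, sketched here with invented thresholds rather than any specific engine's policy, is to choose the search strategy based on the query's apparent size and cost: cheap queries get a quick heuristic plan, expensive ones earn a more exhaustive search.

```python
def choose_optimization_effort(num_tables, estimated_cost):
    """Pick a search strategy based on how much the query is likely to cost.

    The thresholds are illustrative; real systems expose comparable knobs,
    such as switching to randomized search beyond some join count.
    """
    if estimated_cost < 100:            # millisecond-class OLTP query
        return "rule-based: apply cheap heuristics, skip full enumeration"
    if num_tables <= 10:
        return "dynamic programming: exhaustive join-order search"
    return "randomized/greedy search: too many tables to enumerate fully"

print(choose_optimization_effort(2, 50))
print(choose_optimization_effort(6, 50_000))
print(choose_optimization_effort(15, 1_000_000))
```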
Different "optimal" plans exist depending on what you optimize for:
The optimizer must understand context—is this an interactive query where users want to see something immediately, or a batch job where total time matters most?
Given the challenges outlined above, a crucial question emerges: Does the optimizer actually find the optimal plan?
The honest answer is: rarely, and it doesn't matter as much as you'd think.
Finding the truly optimal plan would require:

- Enumerating every valid execution plan in the search space
- Knowing the exact cost of each plan without executing it
- Completing both of the above in a negligible fraction of the query's runtime
These requirements cannot all be satisfied at once; guaranteed optimality is computationally infeasible for non-trivial queries.
Practical optimizers operate on a different principle: find a plan that is good enough, quickly enough.
This approach accepts that:

- Cost estimates will sometimes be badly wrong
- The truly optimal plan will usually be missed
- What matters is reliably avoiding disastrously bad plans, not squeezing out the last few percent
Most optimizers use heuristics and pruning to quickly eliminate obviously bad plans while thoroughly exploring the most promising regions of the search space.
In practice, optimizers spend 80% of effort avoiding the worst 20% of plans. Identifying and eliminating disaster plans (those with Cartesian products, unnecessary full scans, or terrible join orders) matters more than fine-tuning among good alternatives. A plan that's 90% optimal is a success; a plan that's 0.1% optimal is a catastrophe.
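As a minimal sketch of the "avoid disasters" principle, the greedy heuristic below orders joins so that each step joins a table connected by a join predicate to what has already been joined, preferring small estimated cardinalities and never choosing a Cartesian product while a connected option exists. The tables, cardinalities, and join edges are invented.

```python
def greedy_join_order(cardinalities, join_edges):
    """Greedily order joins: start from the smallest table, then repeatedly
    join the connected table with the smallest estimated cardinality.
    cardinalities: {table: estimated_rows}; join_edges: set of frozensets."""
    remaining = dict(cardinalities)
    order = [min(remaining, key=remaining.get)]     # smallest table first
    del remaining[order[0]]
    while remaining:
        # Tables joined by a predicate to something already chosen
        # (taking one of these avoids a Cartesian product).
        connected = [t for t in remaining
                     if any(frozenset((t, j)) in join_edges for j in order)]
        candidates = connected or list(remaining)   # fall back only if forced
        nxt = min(candidates, key=lambda t: remaining[t])
        order.append(nxt)
        del remaining[nxt]
    return order

cards = {"orders": 10_000_000, "customers": 100_000,
         "nation": 25, "lineitem": 60_000_000}
edges = {frozenset(("orders", "customers")),
         frozenset(("customers", "nation")),
         frozenset(("orders", "lineitem"))}
print(greedy_join_order(cards, edges))
# ['nation', 'customers', 'orders', 'lineitem']
```

A heuristic like this will not find the best plan, but it reliably avoids the worst ones, which is exactly the trade-off the text describes.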
To fully appreciate the optimization goal, we must understand where the optimizer fits within the broader query processing pipeline:
```
                    SQL Query
                        │
                        ▼
                ┌───────────────┐
                │    Parser     │ ─── Lexical & Syntactic Analysis
                └───────────────┘
                        │
                        ▼
                ┌───────────────┐
                │   Analyzer    │ ─── Semantic Analysis, Name Resolution
                └───────────────┘
                        │
                        ▼
                ┌───────────────┐
                │   Rewriter    │ ─── View Expansion, Subquery Unnesting
                └───────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│                  OPTIMIZER                    │
│  ┌─────────────┐    ┌──────────────────────┐  │
│  │  Logical    │ →  │      Physical        │  │
│  │ Optimization│    │     Optimization     │  │
│  └─────────────┘    └──────────────────────┘  │
│        │                      │               │
│   Plan Generation  ←─→  Cost Estimation       │
└───────────────────────────────────────────────┘
                        │
                        ▼
                ┌───────────────┐
                │   Executor    │ ─── Plan Execution
                └───────────────┘
                        │
                        ▼
                     Results
```
The optimizer receives a logical query representation (typically a relational algebra expression tree) and produces a physical execution plan. This transformation involves:
| Aspect | Input (Logical Plan) | Output (Physical Plan) |
|---|---|---|
| Representation | Relational algebra tree | Execution operator tree |
| Join Specification | Logical join (condition) | Specific algorithm (hash, merge, nested-loop) |
| Data Access | Table reference | Access path (heap scan, index scan) |
| Abstractions | Logical operators | Physical operators with parameters |
| Parallelism | Not specified | Explicit parallel operators |
| Memory | Not specified | Memory allocations for operators |
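To make the distinction concrete, here is a hypothetical in-memory representation (the class and field names are invented): the same join appears once as a purely logical operator and once as a physical operator annotated with an algorithm, access paths, and a memory budget.

```python
from dataclasses import dataclass

# Logical side: only *what* to compute.
@dataclass
class LogicalJoin:
    left: str
    right: str
    condition: str

# Physical side: *how* to compute it.
@dataclass
class PhysicalJoin:
    algorithm: str          # 'hash', 'merge', or 'nested_loop'
    left_access: str        # e.g. 'seq_scan(departments)'
    right_access: str       # e.g. 'index_scan(employees, idx_employees_salary)'
    condition: str
    memory_budget_kb: int   # work memory granted to the operator

logical = LogicalJoin("employees", "departments", "e.dept_id = d.id")
physical = PhysicalJoin(
    algorithm="hash",
    left_access="seq_scan(departments)",
    right_access="index_scan(employees, idx_employees_salary)",
    condition="e.dept_id = d.id",
    memory_budget_kb=4096,
)
```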
Different query types and workloads require different optimization objectives. Understanding these variations is crucial for both database design and query tuning:
For interactive applications—web backends, user-facing tools, real-time dashboards—the goal is minimizing total response time from query submission to result delivery.
Characteristics:

- Favors plans that return the first rows quickly (pipelined operators, index access)
- Pushes LIMIT clauses and highly selective predicates as early as possible
- Keeps optimization time itself very short, often by reusing cached plans
For cloud databases billed by resource usage, or shared systems where resource contention matters, the goal shifts to minimizing CPU cycles, memory, and I/O.
Characteristics:

- Minimizes total CPU cycles, memory footprint, and I/O rather than elapsed time
- May prefer a slower plan that uses fewer resources over a faster but hungrier one
- Matters most when many queries share the same hardware or when usage is billed
For batch processing, ETL jobs, or report generation, the goal is maximizing queries per hour rather than minimizing individual query time.
Characteristics:

- Maximizes total work completed per hour rather than any single query's latency
- Favors sequential scans, batch-friendly operators, and sharing work across queries
- Tolerates longer individual response times in exchange for higher aggregate throughput
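The sketch below shows one way to express these differing objectives as different scoring functions over the same plan estimates; the plan names, fields, and numbers are invented, but the point is that the same candidate plans can rank differently depending on the workload's goal.

```python
# Per-plan estimates the optimizer already produces.
plan_estimates = {
    "plan_a": {"time_to_first_row": 0.01, "total_time": 12.0, "cpu_seconds": 30.0},
    "plan_b": {"time_to_first_row": 2.00, "total_time": 4.0,  "cpu_seconds": 6.0},
}

# Different workloads score the same estimates differently.
objectives = {
    "interactive": lambda p: p["time_to_first_row"],   # latency: first rows fast
    "batch":       lambda p: p["total_time"],          # throughput: finish soonest
    "cloud":       lambda p: p["cpu_seconds"],         # billing: burn fewer resources
}

for workload, score in objectives.items():
    best = min(plan_estimates, key=lambda name: score(plan_estimates[name]))
    print(f"{workload}: choose {best}")
# interactive: plan_a; batch: plan_b; cloud: plan_b
```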
Cutting-edge database systems employ adaptive optimization that adjusts goals based on context. A query might start with latency-focused optimization, switch to throughput mode when queue depth increases, and incorporate resource awareness when memory pressure rises. This represents the frontier of optimizer development.
Query optimization has evolved dramatically over five decades, with optimization goals shifting alongside hardware and workload changes:
IBM's System R introduced cost-based query optimization—the first system to systematically enumerate plans and estimate costs. Goals focused almost exclusively on minimizing disk I/O, as disk access was 100,000× slower than memory.
Key innovations:

- Cost-based plan selection driven by catalog statistics and selectivity estimation
- Dynamic programming enumeration of join orders, restricted to left-deep trees
- The notion of "interesting orders," letting sort order influence plan choice
As databases powered business transactions, response time became critical. Optimization goals shifted toward:

- Low, predictable latency for short, index-driven queries
- Keeping optimization overhead negligible, including caching and reusing plans for parameterized queries
- Reliable use of indexes for point lookups and small range scans
Data warehousing and OLAP brought new optimization challenges with massive table scans, complex aggregations, and multi-way joins. Goals expanded to:

- Minimizing I/O over very large scans, for example through partition pruning
- Exploiting parallelism across CPU cores and, later, across nodes
- Handling many-way joins and complex aggregations (star-join strategies, materialized views)
Cloud databases introduce economic optimization alongside performance:

- Weighing monetary cost (compute, storage, and data-transfer charges) alongside execution time
- Deciding how much elastic capacity to provision for a given query or workload
- Treating data movement between separated storage and compute as a first-class cost
Despite hardware revolutions—from spinning disks to SSDs to NVMe, from single-core to 128-core CPUs—the fundamental optimization goal remains unchanged: find a semantically correct plan that executes efficiently given current constraints. Only the definition of 'efficiently' evolves with technology and economics.
We have established the foundational understanding of query optimization's purpose and challenges. Let's consolidate the key insights:

- SQL is declarative, so every query must be translated into one of an enormous number of possible execution plans
- The optimizer's goal is to find a semantically correct plan that minimizes estimated cost
- The search space grows exponentially with query complexity, so exhaustive search is infeasible
- Costs are estimates built on imperfect statistics, and estimation errors can be large
- Practical optimizers aim for "good enough, quickly enough," focusing on avoiding disastrous plans
What's Next:
Now that we understand what the optimizer is trying to achieve, the next page examines where it searches for solutions. We'll explore the search space—the vast landscape of possible execution plans that the optimizer must navigate—and understand the mathematical structures that define which plans are valid and how the space grows with query complexity.
You now understand the fundamental goal of query optimization: transforming declarative queries into efficient execution plans while navigating massive search spaces under imperfect information. This conceptual foundation is essential for understanding the techniques covered in the remaining pages of this module.