Loading learning content...
A SQL query specifies what data to retrieve, not how to retrieve it. Yet the "how" matters enormously for performance. The same logical query can be executed in countless ways—some taking milliseconds, others taking hours.
This is where transformations come in. Query transformations are systematic rules that convert one query plan into another semantically equivalent plan. "Semantically equivalent" means both plans produce identical results for any possible database state—they differ only in how they compute that result.
Transformations are the core mechanism of query optimization. By applying transformations, optimizers explore a vast space of equivalent execution strategies, evaluating costs and selecting the most efficient approach. Understanding transformations is understanding the heart of how databases achieve high performance.
By the end of this page, you will understand the major categories of plan transformations, how they preserve semantics while improving efficiency, when each transformation is applicable, and how transformations compose to create dramatically better execution plans.
A transformation rule is a pattern-matching rewrite:
IF plan matches pattern P
AND validity conditions hold
THEN rewrite to pattern P'
Transformations are grounded in the algebraic properties of relational operations—commutativity, associativity, distributivity, and identity laws. These mathematical properties guarantee that the rewritten plan produces identical results.
| Property | Algebraic Form | Example | Implication |
|---|---|---|---|
| Commutativity | R ⋈ S = S ⋈ R | Inner joins can swap sides | Flexible join order |
| Associativity | (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T) | Join grouping is flexible | Enables all join orders |
| Distributivity | σ_p(R ⋈ S) = σ_p(R) ⋈ S | Selection distributes over join (if p references only R) | Enables pushdown |
| Cascade | σ_p(σ_q(R)) = σ_{p∧q}(R) | Conjunctive selections combine | Predicate consolidation |
| Identity | π_A(R) = R if A = schema(R) | Full projection is identity | Enables elimination |
Not all transformations apply unconditionally. Join commutativity holds for inner joins but not for outer joins. Selection pushdown through joins requires predicates to reference specific tables. Optimizers must verify conditions before applying rules.
Selection transformations move, combine, and simplify predicate evaluations. Since selections reduce data volume, pushing them early in the plan dramatically reduces intermediate result sizes.
Selection Pushdown
Move selections down the plan tree, closer to base tables.
Through Join:
σ_p(R ⋈ S) → σ_p(R) ⋈ S (if p references only R)
σ_p(R ⋈ S) → R ⋈ σ_p(S) (if p references only S)
σ_p(R ⋈ S) → σ_p₁(R) ⋈ σ_p₂(S) (if p = p₁ ∧ p₂ and p₁→R, p₂→S)
Through Projection:
σ_p(π_A(R)) → π_A(σ_p(R)) (if p uses only columns in A)
Through Union:
σ_p(R ∪ S) → σ_p(R) ∪ σ_p(S) (always valid)
Earlier filtering means smaller intermediate results: fewer rows to join, sort, aggregate. If a predicate eliminates 90% of rows, pushing it before a join turns a 10M-row join into a 1M-row join—potentially 10x faster.
1234567891011121314151617181920212223242526272829303132333435363738394041
-- Original querySELECT o.order_id, c.nameFROM orders oJOIN customers c ON o.customer_id = c.idWHERE c.region = 'West' AND o.status = 'completed'; -- Initial plan (canonical translation):-- -- Project [o.order_id, c.name]-- |-- Filter [c.region = 'West' AND o.status = 'completed']-- |-- Join [o.customer_id = c.id]-- / \-- Scan(o) Scan(c) -- After selection pushdown:---- Project [o.order_id, c.name]-- |-- Join [o.customer_id = c.id]-- / \-- Filter Filter-- [status= [region=-- 'completed'] 'West']-- | |-- Scan(o) Scan(c) -- After pushing to scans:---- Project [o.order_id, c.name]-- |-- Join [o.customer_id = c.id]-- / \-- Scan(o) Scan(c)-- with with-- filter: filter:-- status= region=-- 'completed' 'West' -- Result: Join operates on filtered subsets, not full tablesProjection transformations manage which columns flow through the plan. Eliminating unneeded columns early reduces tuple widths and memory usage throughout subsequent operations.
12345678910111213141516171819202122232425262728293031323334353637
-- Query needing only two columnsSELECT e.nameFROM employees eJOIN departments d ON e.dept_id = d.idWHERE d.location = 'NYC'; -- Without projection pushdown:-- -- Project [e.name]-- |-- Filter [d.location = 'NYC']-- |-- Join [e.dept_id = d.id]-- / \-- Scan(e) Scan(d)-- [ALL cols] [ALL cols] -- Full tuples flow through join, then most columns discarded.-- If employees has 50 columns @ 500 bytes/row, -- and we're joining 100K rows, that's 50MB of intermediate data. -- With projection pushdown:---- Project [e.name] -- final output-- |-- Join [e.dept_id = d.id]-- / \-- Project Filter [d.location = 'NYC']-- [e.name, |-- e.dept_id] Project [d.id, d.location]-- | |-- Scan(e) Scan(d) -- Now intermediate tuples are small:-- - employees: only name + dept_id (~50 bytes)-- - departments: only id + location (~30 bytes)-- Same 100K rows join uses ~5MB—10x less memory, faster to process.Projection pushdown requires tracking which columns are used where: by predicates, join conditions, output projections, ORDER BY, and computed expressions. Optimizers maintain 'required columns' sets that propagate through the plan.
Join transformations are among the most impactful because joins are often the most expensive operations. Small changes in join order can result in order-of-magnitude performance differences.
| Transformation | Rule | Validity |
|---|---|---|
| Commutativity | R ⋈ S → S ⋈ R | Inner joins only (outer joins are directional) |
| Associativity | (R ⋈ S) ⋈ T → R ⋈ (S ⋈ T) | Inner joins; outer joins have restrictions |
| Left-Right Swap | R LEFT JOIN S → S RIGHT JOIN R | Always (just changes perspective) |
| Join-Selection Pushdown | σ_p(R ⋈ S) → σ_p(R) ⋈ S | If p references only R |
| Cross Join Conversion | R × S → R ⋈_{true} S | Semantic equivalence |
| Semi-Join Introduction | π_R(R ⋈ S) → R ⋉ S | If output uses only R columns |
| Anti-Join Introduction | R - π_R(R ⋈ S) → R ▷ S | NOT EXISTS semantics |
Join Reordering:
For n inner-join tables, there are n! possible orderings. The optimizer must find the best order without trying all possibilities.
Example: Joining A(1000 rows), B(1M rows), C(10 rows):
Difference: 1000× performance gap!
Outer joins are NOT freely commutative or associative. LEFT JOIN preserves left rows—swapping sides changes semantics. Reordering outer joins requires careful analysis of null-rejection predicates and association rules.
Subqueries often create performance problems because they may be evaluated repeatedly (once per outer row). Subquery transformations convert nested queries into equivalent flat queries, typically using joins, enabling the optimizer to consider all tables together.
Subquery Decorrelation
Correlated subqueries reference outer query columns:
SELECT *
FROM orders o
WHERE o.amount > (
SELECT AVG(amount)
FROM orders o2
WHERE o2.customer_id = o.customer_id -- correlation!
);
Naive execution: For each outer row, execute inner query (N × M operations).
Decorrelated form (using join):
SELECT o.*
FROM orders o
JOIN (
SELECT customer_id, AVG(amount) as avg_amt
FROM orders
GROUP BY customer_id
) avgs ON o.customer_id = avgs.customer_id
WHERE o.amount > avgs.avg_amt;
Now the subquery runs once, then joins—often dramatically faster.
Decorrelation converts N×M to N+M operations. When outer result is large, this is huge. When outer result is tiny (1 row), decorrelation overhead may not be worth it—optimizers consider both approaches.
123456789101112131415161718192021222324252627282930
-- Complex nested querySELECT d.name, d.budget, (SELECT COUNT(*) FROM employees e WHERE e.dept_id = d.id) as emp_count, (SELECT SUM(salary) FROM employees e WHERE e.dept_id = d.id) as total_salaryFROM departments dWHERE EXISTS ( SELECT 1 FROM employees e WHERE e.dept_id = d.id AND e.hire_date > '2024-01-01'); -- Multiple transformations applied: -- 1. EXISTS → Semi-join-- 2. Scalar subqueries → Left join with aggregation SELECT d.name, d.budget, emp_agg.emp_count, emp_agg.total_salaryFROM departments dSEMI JOIN employees e2 ON e2.dept_id = d.id AND e2.hire_date > '2024-01-01'LEFT JOIN ( SELECT dept_id, COUNT(*) as emp_count, SUM(salary) as total_salary FROM employees GROUP BY dept_id) emp_agg ON emp_agg.dept_id = d.id; -- Or further simplified (combine aggregation with semi-join logic):-- The optimizer explores multiple equivalent forms and costs each.Aggregate transformations restructure GROUP BY and aggregate computations, often to enable parallelism or reduce data processed.
12345678910111213141516171819202122232425262728293031
-- Query with aggregateSELECT d.name, SUM(e.salary) as total_salaryFROM departments dJOIN employees e ON d.id = e.dept_idGROUP BY d.id, d.name; -- Standard plan:-- GroupBy [d.id, d.name; SUM(e.salary)]-- |-- Join [d.id = e.dept_id]-- / \-- Scan(d) Scan(e) -- Eager aggregation transformation (if valid):-- Pre-aggregate employees by dept_id before joining:---- GroupBy [d.id, d.name; SUM(pre_agg.sal_sum)]-- |-- Join [d.id = pre_agg.dept_id]-- / \-- Scan(d) GroupBy [e.dept_id; SUM(e.salary) as sal_sum]-- |-- Scan(e) -- Why this helps:-- - If 100 departments, 10M employees-- - Original: Join 10M rows, then aggregate to 100 groups-- - Eager: Pre-aggregate to 100 groups, then join 100 × 100-- - Massive reduction in join size! -- Validity: Requires one-to-many join, specific aggregate types.SUM(x) can split: compute partial sums on each partition, then sum the partial sums. COUNT splits similarly (sum the counts). AVG splits as SUM/COUNT combined. But MEDIAN, percentiles, and COUNT DISTINCT have limited splitting—they need all data or approximation.
Modern optimizers apply sophisticated transformations beyond basic algebraic rewrites.
View/CTE Merging:
Inline view definitions into the outer query:
-- Query with view/CTE
WITH recent_orders AS (
SELECT * FROM orders WHERE order_date > '2024-01-01'
)
SELECT c.name, r.order_id
FROM customers c
JOIN recent_orders r ON c.id = r.customer_id
WHERE c.region = 'West';
-- Merged form
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.region = 'West'
AND o.order_date > '2024-01-01';
Merging enables:
Views with DISTINCT, GROUP BY, LIMIT, or aggregate may block merging or make it incorrect. Some optimizers 'materialize' these views, computing them once and using the result—another valid strategy.
Transformations are the engine of query optimization. By systematically rewriting plans according to algebraic rules, optimizers explore vast spaces of equivalent execution strategies to find efficient plans.
Module 3 Complete:
You've now completed Module 3: Query Representation. You understand:
This knowledge forms a complete picture of how databases internally represent, analyze, and transform queries from declarative SQL into efficient, executable plans.
Congratulations! You've mastered Query Representation. You now understand the internal representations databases use to process queries, how optimizers transform these representations, and the principles that guide optimization decisions. This knowledge enables you to write better SQL, interpret execution plans, and understand database performance at a deep level.