Query Representation

Level: Intermediate

Duration: 60 mins

Topic: Query Representation

Transformation

The Art of Equivalent Rewriting

A SQL query specifies what data to retrieve, not how to retrieve it. Yet the "how" matters enormously for performance. The same logical query can be executed in countless ways—some taking milliseconds, others taking hours.

This is where transformations come in. Query transformations are systematic rules that convert one query plan into another semantically equivalent plan. "Semantically equivalent" means both plans produce identical results for any possible database state—they differ only in how they compute that result.

Transformations are the core mechanism of query optimization. By applying transformations, optimizers explore a vast space of equivalent execution strategies, evaluating costs and selecting the most efficient approach. Understanding transformations is understanding the heart of how databases achieve high performance.

What You Will Learn

By the end of this page, you will understand the major categories of plan transformations, how they preserve semantics while improving efficiency, when each transformation is applicable, and how transformations compose to create dramatically better execution plans.

Transformation Fundamentals

A transformation rule is a pattern-matching rewrite:

IF plan matches pattern P
   AND validity conditions hold
THEN rewrite to pattern P'

Transformations are grounded in the algebraic properties of relational operations—commutativity, associativity, distributivity, and identity laws. These mathematical properties guarantee that the rewritten plan produces identical results.
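The IF/THEN shape above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular optimizer is implemented: the `Scan` and `Select` classes, the tuple-of-conjuncts predicate encoding, and the `merge_selections` rule are hypothetical names invented here. The rule implements the cascade law σ_p(σ_q(R)) = σ_{p∧q}(R).

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Select:
    # A predicate is modeled as a tuple of conjuncts (strings) ANDed together.
    conjuncts: Tuple[str, ...]
    child: "Plan"


Plan = Union[Scan, Select]


def merge_selections(plan: Plan) -> Optional[Plan]:
    """Cascade rule: Select[p](Select[q](R)) -> Select[p AND q](R).

    Returns the rewritten plan, or None when the pattern does not match.
    """
    if isinstance(plan, Select) and isinstance(plan.child, Select):
        inner = plan.child
        return Select(plan.conjuncts + inner.conjuncts, inner.child)
    return None


def apply_to_fixpoint(plan: Plan) -> Plan:
    """Re-apply the rule at the root until it no longer matches."""
    while (rewritten := merge_selections(plan)) is not None:
        plan = rewritten
    return plan


stacked = Select(("a > 1",), Select(("b = 2",), Select(("c < 3",), Scan("R"))))
merged = apply_to_fixpoint(stacked)
print(merged)
```

Returning None for "no match" keeps the pattern test and the rewrite in one function, so a driver loop can try many such rules without knowing their internals.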

Transformation Categories

•Simplification: Eliminate redundant or no-op operations. Example: Remove projection that projects all columns.
•Predicate Transformations: Rewrite, simplify, or move predicates. Example: Push selection through join.
•Join Transformations: Reorder, reassociate, or factor joins. Example: Swap join operands.
•Subquery Transformations: Decorrelate or restructure subqueries. Example: Convert NOT EXISTS to anti-join (NOT IN additionally requires NULL-aware handling).
•Aggregate Transformations: Reorganize grouping operations. Example: Split aggregate for parallelism.
•View/CTE Merging: Inline view definitions. Example: Merge simple view into outer query.

Core Algebraic Properties
Property	Algebraic Form	Example	Implication
Commutativity	R ⋈ S = S ⋈ R	Inner joins can swap sides	Flexible join order
Associativity	(R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)	Join grouping is flexible	Enables all join orders
Distributivity	σ_p(R ⋈ S) = σ_p(R) ⋈ S	Selection distributes over join (if p references only R)	Enables pushdown
Cascade	σ_p(σ_q(R)) = σ_{p∧q}(R)	Conjunctive selections combine	Predicate consolidation
Identity	π_A(R) = R if A = schema(R)	Full projection is identity	Enables elimination

Validity Conditions Matter

Not all transformations apply unconditionally. Join commutativity holds for inner joins but not for left or right outer joins. Selection pushdown through a join requires the predicate to reference only the side it is pushed to. Optimizers must verify these conditions before applying a rule.
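A quick way to see why validity conditions matter is to run both operand orders against a real engine. The sketch below uses Python's built-in sqlite3 module with two made-up tables, `r` and `s` (hypothetical names and data, chosen so that some rows have no join partner). It compares the results as sets of rows with the same output columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (id INTEGER, k INTEGER);
    CREATE TABLE s (id INTEGER, k INTEGER);
    INSERT INTO r VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO s VALUES (7, 10), (8, 40);
""")


def rows(sql: str) -> list:
    """Run a query and return its rows sorted, so row order does not matter."""
    return sorted(conn.execute(sql).fetchall(), key=repr)


# Same output columns in every query; only the operand order changes.
cols = "r.id, s.id"
inner_rs = rows(f"SELECT {cols} FROM r JOIN s ON r.k = s.k")
inner_sr = rows(f"SELECT {cols} FROM s JOIN r ON r.k = s.k")

left_rs = rows(f"SELECT {cols} FROM r LEFT JOIN s ON r.k = s.k")
left_sr = rows(f"SELECT {cols} FROM s LEFT JOIN r ON r.k = s.k")

print("inner join commutes:", inner_rs == inner_sr)
print("left join commutes: ", left_rs == left_sr)
```

The left join results differ because each order preserves the unmatched rows of a different table.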

Selection (Predicate) Transformations

Selection transformations move, combine, and simplify predicate evaluations. Since selections reduce data volume, pushing them early in the plan dramatically reduces intermediate result sizes.

Selection Pushdown

Move selections down the plan tree, closer to base tables.

Through Join:

σ_p(R ⋈ S) → σ_p(R) ⋈ S      (if p references only R)
σ_p(R ⋈ S) → R ⋈ σ_p(S)      (if p references only S)
σ_p(R ⋈ S) → σ_p₁(R) ⋈ σ_p₂(S)  (if p = p₁ ∧ p₂ and p₁→R, p₂→S)

Through Projection:

σ_p(π_A(R)) → π_A(σ_p(R))    (if p uses only columns in A)

Through Union:

σ_p(R ∪ S) → σ_p(R) ∪ σ_p(S)  (always valid)

Why Pushdown Helps

Earlier filtering means smaller intermediate results: fewer rows to join, sort, aggregate. If a predicate eliminates 90% of rows, pushing it before a join turns a 10M-row join into a 1M-row join—potentially 10x faster.
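The three join rules above amount to routing each conjunct of the predicate to the lowest place where all of its columns are available. A minimal sketch of that routing step follows. The `Conjunct` type and `split_for_pushdown` function are hypothetical, and the third predicate (`o.total > c.credit_limit`) uses invented columns purely to show a conjunct that cannot be pushed.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Conjunct(NamedTuple):
    text: str                 # e.g. "c.region = 'West'"
    tables: FrozenSet[str]    # table aliases the conjunct references


def split_for_pushdown(
    conjuncts: List[Conjunct],
    left_tables: FrozenSet[str],
    right_tables: FrozenSet[str],
) -> Tuple[List[str], List[str], List[str]]:
    """Route each conjunct of p = p1 AND p2 AND ... to the lowest legal spot.

    Returns (pushed_left, pushed_right, kept_above_join).
    Assumes an inner join; outer joins need extra validity checks.
    """
    left, right, above = [], [], []
    for c in conjuncts:
        if c.tables <= left_tables:
            left.append(c.text)
        elif c.tables <= right_tables:
            right.append(c.text)
        else:
            above.append(c.text)   # references both sides: cannot be pushed
    return left, right, above


where = [
    Conjunct("c.region = 'West'", frozenset({"c"})),
    Conjunct("o.status = 'completed'", frozenset({"o"})),
    Conjunct("o.total > c.credit_limit", frozenset({"o", "c"})),  # invented
]
pushed_o, pushed_c, residual = split_for_pushdown(
    where, left_tables=frozenset({"o"}), right_tables=frozenset({"c"})
)
print(pushed_o, pushed_c, residual)
```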

selection-pushdown-example.sql
-- Original query
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'West' AND o.status = 'completed';
 
-- Initial plan (canonical translation):
-- 
-- Project [o.order_id, c.name]
--    |
-- Filter [c.region = 'West' AND o.status = 'completed']
--    |
--   Join [o.customer_id = c.id]
--   /    \
-- Scan(o) Scan(c)
 
-- After selection pushdown:
--
-- Project [o.order_id, c.name]
--    |
--   Join [o.customer_id = c.id]
--   /    \
-- Filter                     Filter
-- [o.status = 'completed']   [c.region = 'West']
--   |                          |
-- Scan(o)                    Scan(c)
 
-- After pushing to scans:
--
-- Project [o.order_id, c.name]
--    |
--   Join [o.customer_id = c.id]
--   /    \
-- Scan(o)                         Scan(c)
-- filter: status = 'completed'    filter: region = 'West'
 
-- Result: Join operates on filtered subsets, not full tables

Projection Transformations

Projection transformations manage which columns flow through the plan. Eliminating unneeded columns early reduces tuple widths and memory usage throughout subsequent operations.

Projection Rules

•Projection Pushdown: π_A(R ⋈ S) → π_A(π_{A_R}(R) ⋈ π_{A_S}(S)), where A_R = (A ∪ J) ∩ schema(R), A_S = (A ∪ J) ∩ schema(S), and J is the set of columns used in the join condition. Keep only columns needed for join + output on each side.
•Projection Cascade: π_A(π_B(R)) → π_A(R) if A ⊆ B. Inner projection is redundant.
•Projection Elimination: π_{schema(R)}(R) → R. Projecting all columns is identity.
•Projection-Selection Commutation: π_A(σ_p(R)) ↔ π_A(σ_p(π_{A∪cols(p)}(R))). The inner projection must preserve predicate columns; the outer one removes them again.
•Late Projection: Sometimes deferring projection is better. Narrower intermediate results must be weighed against the cost of extra projection operators.
•DISTINCT Pushdown: In some cases, DISTINCT can push through operators (complex validity conditions).

projection-pushdown.sql
-- Query needing only two columns from each table
SELECT e.name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.location = 'NYC';
 
-- Without projection pushdown:
-- 
-- Project [e.name]
--    |
-- Filter [d.location = 'NYC']
--    |
--   Join [e.dept_id = d.id]
--   /    \
-- Scan(e)   Scan(d)
-- [ALL cols] [ALL cols]
 
-- Full tuples flow through join, then most columns discarded.
-- If employees has 50 columns @ 500 bytes/row, 
-- and we're joining 100K rows, that's 50MB of intermediate data.
 
-- With projection pushdown:
--
-- Project [e.name]  -- final output
--    |
--   Join [e.dept_id = d.id]
--   /    \
-- Project   Filter [d.location = 'NYC']
-- [e.name,    |
-- e.dept_id] Project [d.id, d.location]
--   |           |
-- Scan(e)    Scan(d)
 
-- Now intermediate tuples are small:
-- - employees: only name + dept_id (~50 bytes)
-- - departments: only id + location (~30 bytes)
-- Same 100K rows join uses ~5MB—10x less memory, faster to process.

Column Tracking Complexity

Projection pushdown requires tracking which columns are used where: by predicates, join conditions, output projections, ORDER BY, and computed expressions. Optimizers maintain 'required columns' sets that propagate through the plan.
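As a rough sketch of the bookkeeping involved, the function below collects a per-table "required columns" set from the three places the employees/departments query references columns. The function name and the flat `alias.column` string encoding are assumptions made for illustration. A real optimizer propagates these sets operator by operator through the plan tree.

```python
from typing import Dict, Iterable, Set


def required_columns(
    output: Iterable[str],
    predicate_cols: Iterable[str],
    join_cols: Iterable[str],
) -> Dict[str, Set[str]]:
    """Collect, per table alias, every column some operator still needs.

    Columns are written 'alias.column'. Anything not in the result can be
    projected away at the scan.
    """
    needed: Dict[str, Set[str]] = {}
    for qualified in (*output, *predicate_cols, *join_cols):
        alias, column = qualified.split(".", 1)
        needed.setdefault(alias, set()).add(column)
    return needed


# Columns referenced by the employees/departments query above.
needed = required_columns(
    output=["e.name"],
    predicate_cols=["d.location"],
    join_cols=["e.dept_id", "d.id"],
)
print(needed)
```

Note that `e.dept_id` and `d.id` survive even though neither appears in the SELECT list: the join condition still needs them.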

Join Transformations

Join transformations are among the most impactful because joins are often the most expensive operations. Small changes in join order can result in order-of-magnitude performance differences.

Join Transformation Rules
Transformation	Rule	Validity
Commutativity	R ⋈ S → S ⋈ R	Inner and full outer joins (left and right outer joins are directional)
Associativity	(R ⋈ S) ⋈ T → R ⋈ (S ⋈ T)	Inner joins; outer joins have restrictions
Left-Right Swap	R LEFT JOIN S → S RIGHT JOIN R	Always (just changes perspective)
Join-Selection Pushdown	σ_p(R ⋈ S) → σ_p(R) ⋈ S	If p references only R
Cross Join Conversion	R × S → R ⋈_{true} S	Semantic equivalence
Semi-Join Introduction	π_R(R ⋈ S) → R ⋉ S	If output uses only R columns and duplicates do not matter (DISTINCT output, or each R row matches at most one S row)
Anti-Join Introduction	R - π_R(R ⋈ S) → R ▷ S	NOT EXISTS semantics

Join Reordering:

For n inner-join tables, there are n! possible left-deep orderings, and more once bushy trees are allowed. The optimizer must find a good order without trying all possibilities.

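Example: Joining A(1000 rows), B(1M rows), C(10 rows), using worst-case (cross-product) sizes:

Order (A × B) × C: 1000 × 1M = 1B intermediate rows
Order (A × C) × B: 1000 × 10 = 10K intermediate rows
Order (C × A) × B: 10 × 1000 = 10K intermediate rows

The final result has the same size in every order (the plans are equivalent), so the cost difference lies in the intermediate result that must be produced and fed into the second join: 1B rows versus 10K rows, a 100,000× gap.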
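For three tables the whole search space is small enough to enumerate. The sketch below lists every left-deep order and ranks it by the size of its first intermediate result, under the same worst-case assumption as the example (no join predicate filters anything). The cost function is deliberately simplistic and is an assumption of this sketch. Real optimizers use selectivity estimates and prune the search with dynamic programming or heuristics.

```python
from itertools import permutations

# Table cardinalities from the example (rows).
sizes = {"A": 1_000, "B": 1_000_000, "C": 10}


def first_intermediate(order: tuple) -> int:
    """Worst-case rows produced by joining the first two tables in a
    left-deep order, assuming no join predicate filters anything."""
    first, second = order[0], order[1]
    return sizes[first] * sizes[second]


# All n! left-deep orders, smallest intermediate result first.
ranked = sorted(permutations(sizes), key=first_intermediate)
for order in ranked:
    print(" x ".join(order), "->", f"{first_intermediate(order):,} intermediate rows")

best, worst = ranked[0], ranked[-1]
```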

Outer Join Restrictions

Outer joins are NOT freely commutative or associative. LEFT JOIN preserves left rows—swapping sides changes semantics. Reordering outer joins requires careful analysis of null-rejection predicates and association rules.

Subquery Transformations

Subqueries often create performance problems because they may be evaluated repeatedly (once per outer row). Subquery transformations convert nested queries into equivalent flat queries, typically using joins, enabling the optimizer to consider all tables together.

Subquery Decorrelation

Correlated subqueries reference outer query columns:

SELECT *
FROM orders o
WHERE o.amount > (
    SELECT AVG(amount) 
    FROM orders o2 
    WHERE o2.customer_id = o.customer_id  -- correlation!
);

Naive execution: for each of the N outer rows, execute the inner query over M rows (N × M operations).

Decorrelated form (using join):

SELECT o.*
FROM orders o
JOIN (
    SELECT customer_id, AVG(amount) as avg_amt
    FROM orders
    GROUP BY customer_id
) avgs ON o.customer_id = avgs.customer_id
WHERE o.amount > avgs.avg_amt;

Now the subquery runs once, then joins—often dramatically faster.

When Decorrelation Helps

Decorrelation converts N×M operations into roughly N+M. When the outer result is large, the savings are substantial. When the outer result is tiny (a single row, say), the overhead of decorrelation may not be worth it, so optimizers cost both approaches.
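The equivalence of the two forms can be checked on a small data set. The sketch below runs the correlated and decorrelated queries from this section in an in-memory SQLite database via Python's sqlite3 module. The six sample rows are invented for the test, and only `order_id` is selected to keep the comparison short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES
        (1, 100, 10.0), (2, 100, 30.0), (3, 100, 50.0),
        (4, 200, 5.0),  (5, 200, 5.0),
        (6, 300, 99.0);
""")

correlated = """
    SELECT o.order_id FROM orders o
    WHERE o.amount > (SELECT AVG(amount) FROM orders o2
                      WHERE o2.customer_id = o.customer_id)
"""

decorrelated = """
    SELECT o.order_id FROM orders o
    JOIN (SELECT customer_id, AVG(amount) AS avg_amt
          FROM orders GROUP BY customer_id) avgs
      ON o.customer_id = avgs.customer_id
    WHERE o.amount > avgs.avg_amt
"""

result_correlated = sorted(conn.execute(correlated).fetchall())
result_decorrelated = sorted(conn.execute(decorrelated).fetchall())
print(result_correlated == result_decorrelated, result_correlated)
```

Agreement on one data set is evidence, not proof. The rewrite is valid because of the algebraic argument, and a test like this mainly catches mistakes in hand-written rewrites.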

subquery-transformation.sql
-- Complex nested query
SELECT d.name, d.budget,
       (SELECT COUNT(*) FROM employees e WHERE e.dept_id = d.id) as emp_count,
       (SELECT SUM(salary) FROM employees e WHERE e.dept_id = d.id) as total_salary
FROM departments d
WHERE EXISTS (
    SELECT 1 FROM employees e 
    WHERE e.dept_id = d.id AND e.hire_date > '2024-01-01'
);
 
-- Multiple transformations applied:
 
-- 1. EXISTS → Semi-join (written as SEMI JOIN below; this is plan-level
--    pseudo-syntax, not standard SQL)
-- 2. Scalar subqueries → Left join with aggregation
 
SELECT d.name, d.budget, 
       emp_agg.emp_count,
       emp_agg.total_salary
FROM departments d
SEMI JOIN employees e2 ON e2.dept_id = d.id AND e2.hire_date > '2024-01-01'
LEFT JOIN (
    SELECT dept_id, 
           COUNT(*) as emp_count,
           SUM(salary) as total_salary
    FROM employees
    GROUP BY dept_id
) emp_agg ON emp_agg.dept_id = d.id;
 
-- Further simplification is possible (for example, combining the aggregation
-- with the semi-join logic). The optimizer explores multiple equivalent forms
-- and costs each.
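One subtlety of the scalar-subquery rewrite is worth checking by hand. A scalar `COUNT(*)` subquery returns 0 for a department with no employees, while a LEFT JOIN against a pre-aggregated table returns NULL for it. In the query above the EXISTS filter guarantees every surviving department has at least one employee, so the issue does not arise there. The sketch below drops that filter to expose the difference, using invented sample rows and Python's sqlite3 module. Wrapping the count in `COALESCE(..., 0)` is one common repair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'Eng'), (2, 'Empty');
    INSERT INTO employees VALUES (1, 1), (2, 1);
""")

scalar = """
    SELECT d.name,
           (SELECT COUNT(*) FROM employees e WHERE e.dept_id = d.id)
    FROM departments d ORDER BY d.id
"""

left_join_template = """
    SELECT d.name, {count_expr}
    FROM departments d
    LEFT JOIN (SELECT dept_id, COUNT(*) AS emp_count
               FROM employees GROUP BY dept_id) emp_agg
      ON emp_agg.dept_id = d.id
    ORDER BY d.id
"""

scalar_rows = conn.execute(scalar).fetchall()
naive_rows = conn.execute(
    left_join_template.format(count_expr="emp_agg.emp_count")).fetchall()
fixed_rows = conn.execute(
    left_join_template.format(count_expr="COALESCE(emp_agg.emp_count, 0)")).fetchall()

print(scalar_rows, naive_rows, fixed_rows)
```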

Aggregate Transformations

Aggregate transformations restructure GROUP BY and aggregate computations, often to enable parallelism or reduce data processed.

Aggregate Transformation Rules

•Aggregate Pushdown Through Join: γ_{g,agg}(R ⋈ S) → γ_{g,agg}(R) ⋈ S in limited cases. Requires functional dependencies ensuring grouping integrity.
•Aggregate Splitting: Split into partial + final aggregation for parallelism. Each partition computes partial; final combines. Works for SUM, COUNT, MIN, MAX (not MEDIAN).
•Distinct Elimination: If the GROUP BY is on a unique key, each group holds a single row, so DISTINCT inside an aggregate (COUNT(DISTINCT x)) is redundant and can be dropped.
•Aggregate Coalescing: Multiple aggregates on same grouping → single GroupBy operator.
•HAVING to WHERE: If HAVING predicate doesn't use aggregates, push to WHERE (before GROUP BY).
•Eager Aggregation: Aggregate before join to reduce rows, if semantics allow (complex validity conditions).

aggregate-transformation.sql
-- Query with aggregate
SELECT d.name, SUM(e.salary) as total_salary
FROM departments d
JOIN employees e ON d.id = e.dept_id
GROUP BY d.id, d.name;
 
-- Standard plan:
-- GroupBy [d.id, d.name; SUM(e.salary)]
--    |
--   Join [d.id = e.dept_id]
--   /    \
-- Scan(d) Scan(e)
 
-- Eager aggregation transformation (if valid):
-- Pre-aggregate employees by dept_id before joining:
--
-- GroupBy [d.id, d.name; SUM(pre_agg.sal_sum)]
--    |
--   Join [d.id = pre_agg.dept_id]
--   /    \
-- Scan(d) GroupBy [e.dept_id; SUM(e.salary) as sal_sum]
--            |
--         Scan(e)
 
-- Why this helps:
-- - If 100 departments, 10M employees
-- - Original: Join 10M rows, then aggregate to 100 groups
-- - Eager: Pre-aggregate to 100 groups, then join 100 × 100
-- - Massive reduction in join size!
 
-- Validity: Requires one-to-many join, specific aggregate types.
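The two plan shapes above correspond to two SQL formulations, which can be compared directly. The sketch below does so in an in-memory SQLite database through Python's sqlite3 module. The sample rows are invented, and `departments.id` is declared a primary key so that the join is one-to-many, matching the validity condition stated above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, dept_id INTEGER, salary INTEGER);
    INSERT INTO departments VALUES (1, 'Eng'), (2, 'Ops'), (3, 'Empty');
    INSERT INTO employees VALUES
        (1, 1, 100), (2, 1, 150), (3, 2, 80), (4, 2, 70), (5, 2, 50);
""")

standard = """
    SELECT d.name, SUM(e.salary) AS total_salary
    FROM departments d
    JOIN employees e ON d.id = e.dept_id
    GROUP BY d.id, d.name
"""

# Pre-aggregate employees, then join the (much smaller) result.
eager = """
    SELECT d.name, SUM(pre_agg.sal_sum) AS total_salary
    FROM departments d
    JOIN (SELECT dept_id, SUM(salary) AS sal_sum
          FROM employees GROUP BY dept_id) pre_agg
      ON d.id = pre_agg.dept_id
    GROUP BY d.id, d.name
"""

standard_rows = sorted(conn.execute(standard).fetchall())
eager_rows = sorted(conn.execute(eager).fetchall())
print(standard_rows == eager_rows, standard_rows)
```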

Aggregate Splitting for Parallelism

SUM(x) can split: compute partial sums on each partition, then sum the partial sums. COUNT splits similarly (sum the counts). AVG splits as SUM/COUNT combined. But MEDIAN, percentiles, and COUNT DISTINCT have limited splitting—they need all data or approximation.
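The partial/final split can be sketched as two small functions. This is an illustration with hypothetical names (`partial_aggregate`, `final_aggregate`) and in-memory lists standing in for partitions. It also shows why AVG must be carried as a (SUM, COUNT) pair rather than averaged per partition.

```python
from typing import Iterable, List, Tuple

Partial = Tuple[float, int]  # (partial SUM, partial COUNT)


def partial_aggregate(partition: Iterable[float]) -> Partial:
    """Phase 1: runs independently on each partition."""
    values = list(partition)
    return sum(values), len(values)


def final_aggregate(partials: List[Partial]) -> dict:
    """Phase 2: combine partial states into the final answers."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return {
        "sum": total,
        "count": count,
        # AVG is NOT the average of per-partition averages.
        "avg": total / count if count else None,
    }


partitions = [[10, 20, 30], [40], [50, 60]]
combined = final_aggregate([partial_aggregate(p) for p in partitions])

# Naive (wrong) approach for comparison: averaging the partition averages.
avg_of_avgs = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(combined, avg_of_avgs)
```

The partial state is fixed-size no matter how many rows a partition holds. MEDIAN has no such compact state, which is why it does not split the same way.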

Advanced Transformations

Modern optimizers apply sophisticated transformations beyond basic algebraic rewrites.

View/CTE Merging:

Inline view definitions into the outer query:

-- Query with view/CTE
WITH recent_orders AS (
    SELECT * FROM orders WHERE order_date > '2024-01-01'
)
SELECT c.name, r.order_id
FROM customers c
JOIN recent_orders r ON c.id = r.customer_id
WHERE c.region = 'West';

-- Merged form
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.region = 'West' 
  AND o.order_date > '2024-01-01';

Merging enables:

•Joint predicate pushdown
•Better join order options
•Combined optimization scope
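As a sanity check on the merged form, the sketch below runs both versions of the query in an in-memory SQLite database via Python's sqlite3 module and compares the results. The table contents are invented sample data, with dates stored as ISO-8601 text so that string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ada', 'West'), (2, 'Bob', 'East');
    INSERT INTO orders VALUES
        (10, 1, '2023-12-31'), (11, 1, '2024-02-01'), (12, 2, '2024-03-01');
""")

with_cte = """
    WITH recent_orders AS (
        SELECT * FROM orders WHERE order_date > '2024-01-01'
    )
    SELECT c.name, r.order_id
    FROM customers c
    JOIN recent_orders r ON c.id = r.customer_id
    WHERE c.region = 'West'
"""

merged = """
    SELECT c.name, o.order_id
    FROM customers c
    JOIN orders o ON c.id = o.customer_id
    WHERE c.region = 'West' AND o.order_date > '2024-01-01'
"""

cte_rows = sorted(conn.execute(with_cte).fetchall())
merged_rows = sorted(conn.execute(merged).fetchall())
print(cte_rows == merged_rows, cte_rows)
```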

When Not to Merge

Views with DISTINCT, GROUP BY, LIMIT, or aggregates may block merging or make it incorrect. Some optimizers instead 'materialize' these views, computing them once and reusing the result, which is another valid strategy.

Summary and Module Conclusion

Transformations are the engine of query optimization. By systematically rewriting plans according to algebraic rules, optimizers explore vast spaces of equivalent execution strategies to find efficient plans.

Key Takeaways

•Transformations preserve semantics while changing structure—same results, different efficiency.
•Selection pushdown moves predicates early, reducing data volume for subsequent operations.
•Projection pushdown eliminates unneeded columns early, reducing tuple widths.
•Join reordering exploits commutativity and associativity for massive performance gains.
•Subquery decorrelation converts N×M correlated execution to N+M join-based execution.
•Aggregate transformations enable parallelism and reduce data processed before expensive operations.
•Advanced transformations (view merging, partition pruning, window optimization) handle modern SQL complexity.

Module 3 Complete:

You've now completed Module 3: Query Representation. You understand:

•Relational algebra trees: The mathematical foundation of query representation
•Query graphs: Structural view for join ordering and predicate placement
•Logical plans: The comprehensive representation optimizers manipulate
•Operator nodes: The building blocks of query execution
•Transformations: The rules that rewrite plans for efficiency

This knowledge forms a complete picture of how databases internally represent, analyze, and transform queries from declarative SQL into efficient, executable plans.

Module Complete

Congratulations! You've mastered Query Representation. You now understand the internal representations databases use to process queries, how optimizers transform these representations, and the principles that guide optimization decisions. This knowledge enables you to write better SQL, interpret execution plans, and understand database performance at a deep level.
