Database Management SystemsRelational Algebra Overview

Relational Algebra Overview

LevelIntermediate

Duration60 mins

TopicRelational Algebra Overview

5 / 5

Query Equivalence

Different Paths, Same Destination

Consider two relational algebra expressions:

1: π_name(σ_{salary>50000}(Employees))
2: σ_{salary>50000}(π_{name,salary}(Employees))

Do they produce the same result? If so, which should we execute? These questions—of equivalence and optimization—are at the heart of query processing.

Query equivalence is the theoretical foundation that makes query optimization possible. When we know that two expressions are equivalent—they produce identical results for any database state—we can freely substitute one for the other. This freedom to substitute is what allows optimizers to explore vast spaces of alternative execution strategies while guaranteeing correctness.

In this page, we'll formally define query equivalence, explore the equivalence rules that govern relational algebra, understand the conditions under which transformations are valid, and see how these rules enable the optimizations that make databases fast.

What You Will Learn

By the end of this page, you will understand: the formal definition of query equivalence; the major equivalence rules for relational algebra operations; conditions that make transformations valid or invalid; how equivalence rules combine to enable complex optimizations; and the relationship between equivalence and query optimization.

Formal Definition of Equivalence

Definition: Query Equivalence

Two relational algebra expressions E₁ and E₂ are equivalent, written E₁ ≡ E₂, if and only if:

For every database instance D, the evaluation of E₁ on D produces the same result as the evaluation of E₂ on D.

Formally:

E₁ ≡ E₂  ⟺  ∀D. E₁(D) = E₂(D)

Where:

D ranges over all possible database instances
E(D) denotes the relation produced by evaluating expression E on database D
= denotes set equality (identical tuples)

Key Points:

Universal Quantification: Equivalence must hold for ALL possible database states, not just the current one or specific test cases.
Set Equality: Results must contain exactly the same tuples. Same tuples in different order are still equal (relations are sets, unordered).
Schema Compatibility: Equivalent expressions must produce results with the same schema (column names and types). If schemas differ, expressions aren't equivalent even if the data values match.

Examples:

Equivalent:

σ_{p∧q}(R) ≡ σ_p(σ_q(R)) ≡ σ_q(σ_p(R))

Conjunctive selections can be split and reordered.

NOT Equivalent:

π_A(σ_B(R)) ≢ σ_B(π_A(R))  when B ∉ A

If attribute B isn't in projection list A, we can't filter on it after projection.

Testing Is Not Proving

Just because two expressions produce the same result on test data doesn't mean they're equivalent. Equivalence requires proof across ALL possible database states. A transformation that works for test data but fails for boundary cases (empty tables, nulls, duplicate values) is incorrect, regardless of test success.

Structural vs Semantic Equivalence:

Structural Equivalence: Expressions match syntactically after normalization (essentially the same expression with possible renaming).

Semantic Equivalence: Expressions produce identical results, possibly through completely different computational paths.

Query optimization exploits semantic equivalence—finding syntactically different but semantically identical expressions that execute more efficiently.

Why Equivalence Matters:

Optimization Correctness: Transformations are only valid if they preserve equivalence
View Substitution: Views can be substituted for their definitions because they're equivalent
Constraint Enforcement: Some constraints guarantee equivalences (e.g., key constraints enable certain simplifications)
Query Normalization: Multiple user queries can be recognized as equivalent for caching

Equivalence vs Non-Equivalence Examples
Expression 1	Expression 2	Equivalent?	Reason
σ_p(σ_q(R))	σ_q(σ_p(R))	Yes	Selection is commutative
R ⋈ S	S ⋈ R	Yes	Natural join is commutative
R − S	S − R	No	Set difference is NOT commutative
π_A(π_B(R))	π_A(R) where A⊆B	Yes	Cascading projections
σ_p(R ⋈ S)	σ_p(R) ⋈ S (p refs only R)	Yes	Selection pushdown
R ⟕ S	S ⟖ R	Yes	Outer join direction swap

Selection Equivalences

Selection (σ) has several important equivalence rules that form the foundation of selection pushdown optimization.

Rule 1: Conjunction Splitting (Cascade Rule)

σ_{p₁ ∧ p₂ ∧ ... ∧ pₙ}(R) ≡ σ_{p₁}(σ_{p₂}(...σ_{pₙ}(R)...))

A selection with a conjunctive (AND) predicate can be split into a cascade of selections. This enables pushing individual conditions to different places in the expression tree.

Rule 2: Selection Commutativity

σ_{p₁}(σ_{p₂}(R)) ≡ σ_{p₂}(σ_{p₁}(R))

The order of cascaded selections doesn't matter. This flexibility allows the optimizer to arrange selections in the most beneficial order (e.g., most selective first).

Rule 3: Selection and Projection Commutativity (Conditional)

π_L(σ_p(R)) ≡ σ_p(π_L(R))  only if p references only attributes in L

If the selection predicate uses only attributes that survive projection, we can swap their order. This often enables projecting early to narrow tuples.

Rule 4: Selection and Cartesian Product

σ_p(R × S) ≡ R ⋈_p S  when p references both R and S
σ_p(R × S) ≡ σ_p(R) × S  when p references only R

Selection over product with a join condition IS a theta join. Selection with a predicate referencing only one relation can be pushed to that relation.

selection_equivalences.txt

Relational Algebra

-- RULE 1: Conjunction Splitting
-- Original (compound predicate)
σ_( dept='Eng' AND salary>70000 AND hired>2020-01-01 )(Employees)
 
-- Equivalent cascade (optimizable)
σ_(dept='Eng')(
  σ_(salary>70000)(
    σ_(hired>2020-01-01)(Employees)
  )
)
 
-- RULE 2: Selection Commutativity
-- The cascade can be reordered for efficiency
-- If dept='Eng' is most selective, do it first:
σ_(hired>2020-01-01)(
  σ_(salary>70000)(
    σ_(dept='Eng')(Employees)  -- Most selective first
  )
)
 
-- RULE 3: Selection/Projection Commutativity
-- Original
π_(name, dept)(σ_(dept='Eng')(Employees))
 
-- Equivalent (only if we don't need other attributes for predicate)
σ_(dept='Eng')(π_(name, dept)(Employees))
 
-- RULE 4: Selection Distribution over Join
-- Original
σ_(E.dept='Eng' AND D.location='NYC')(Employees × Departments)
 
-- Optimized (push conditions to appropriate relations)
σ_(E.dept='Eng')(Employees) ⋈ σ_(D.location='NYC')(Departments)

Rule 5: Selection and Join

σ_p(R ⋈ S) ≡ σ_{p1}(R) ⋈ σ_{p2}(S) ⋈ (conditions involving both)

Where predicate p can be decomposed into:

p1: conditions on R's attributes only
p2: conditions on S's attributes only
Remaining: conditions involving both R and S

Pushing selections past joins is the most impactful optimization rule, often reducing join input sizes dramatically.

Rule 6: Selection and Union/Intersection/Difference

σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
σ_p(R ∩ S) ≡ σ_p(R) ∩ σ_p(S)
σ_p(R − S) ≡ σ_p(R) − S    (note: only pushed to first operand)
           ≡ σ_p(R) − σ_p(S)  (alternative)

Selection distributes over set operations, enabling early filtering.

Selection Pushdown: The #1 Optimization

Selection pushdown is typically the single most effective optimization. Reducing input to joins from millions of rows to thousands can change execution time from hours to seconds. Optimizers aggressively push selections as early as possible, splitting compound predicates to maximize pushdown opportunities.

Projection Equivalences

Projection (π) equivalences enable removing unneeded columns early, reducing data volume and memory usage.

Rule 1: Cascading Projections

π_L₁(π_L₂(R)) ≡ π_L₁(R)  when L₁ ⊆ L₂

Nested projections simplify to the outermost (smallest) projection. Intermediate projections that keep extra columns are wasteful.

Rule 2: Projection and Selection

As noted earlier:

π_L(σ_p(R)) ≡ σ_p(π_{L∪attrs(p)}(R))  followed by π_L

We can project away columns not needed for the final result or the selection predicate. The minimal projection keeps L plus any attributes referenced in p.

Rule 3: Projection and Cartesian Product

π_L(R × S) ≡ π_{L∩attrs(R)}(R) × π_{L∩attrs(S)}(S)  when L = L_R ∪ L_S

We can push projection over product if the projected columns come from separate relations.

Rule 4: Projection and Join

π_L(R ⋈ S) ≡ π_L(π_{L ∪ join_attrs(R)}(R) ⋈ π_{L ∪ join_attrs(S)}(S))

Before joining, project away columns not needed for the join condition or the final result. This narrows tuples, reducing memory and potentially I/O.

Projection Pushdown Examples
Original	Optimized	Columns Eliminated
π_{name}(Employees ⋈ Departments)	π_{name}(π_{name,dept_id}(E) ⋈ π_{id}(D))	All except name, dept_id, id
π_{A,B}(π_{A,B,C,D}(R))	π_{A,B}(R)	Intermediate C, D
π_{name}(σ_{salary>50k}(Employees))	π_{name}(σ_{sal>50k}(π_{name,salary}(E)))	All except name, salary

Rule 5: Projection and Union

π_L(R ∪ S) ≡ π_L(R) ∪ π_L(S)

Projection distributes over union. This enables projecting before combining, reducing data volume throughout.

Rule 6: Projection and Set Difference

π_L(R − S) ≡ π_L(R) − π_L(S)

Similarly, projection distributes over difference.

Projection Pushdown Limitations:

Projection can't always be pushed:

Attributes needed for selection: Must keep columns used in predicates
Attributes needed for join: Must keep join columns
Grouping attributes: Aggregation requires grouping columns
Order by attributes: If ORDER BY is later, need those columns

The optimizer computes the minimum attribute set needed at each tree node and projects away everything else.

Column Pruning in Practice

Modern databases perform 'column pruning' automatically—tracking which columns are needed at each point and eliminating others. In columnar storage systems (like Parquet or column-oriented databases), this is especially impactful as unneeded columns are never even read from disk.

Join Equivalences

Join equivalences are crucial because joins are often the most expensive operations, and reordering can dramatically affect performance.

Rule 1: Join Commutativity

R ⋈ S ≡ S ⋈ R

For natural join and inner join, operand order doesn't affect the result. This enables choosing which relation to use as the "build" side in hash joins, for example.

Rule 2: Join Associativity

(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)

Join order can be changed without affecting results. This is the foundation of join reordering optimization—finding the order that minimizes intermediate result sizes.

Rule 3: Theta Join Equivalence

R ⋈_θ S ≡ σ_θ(R × S)

Theta join is equivalent to Cartesian product followed by selection. While we'd never execute this way, it helps prove other equivalences.

Rule 4: Join and Selection Commutativity

σ_p(R ⋈ S) ≡ R ⋈_p S  (pushing p into join condition)
σ_{p1∧p2}(R ⋈ S) ≡ σ_{p1}(R) ⋈ σ_{p2}(S) ⋈_{remaining}  (splitting p)

Selections can be pushed into or past joins when predicate attributes allow.

Converting Mermaid diagram...

Join Orders and Tree Shapes:

For n tables, the number of different join orders is:

Left-deep trees only: n!
All binary trees (bushy): Catalan number C(n-1) × n! (approximately 4^n / (n^1.5))

For 10 tables:

Left-deep: 3.6 million orderings
Bushy: billions of orderings

Optimizers use dynamic programming or greedy heuristics to search this space efficiently.

Rule 5: Outer Join Considerations

Outer joins are NOT fully commutative or associative:

R ⟕ S ≢ S ⟕ R  (LEFT vs RIGHT outer join differ)
R ⟕ (S ⟕ T) may ≢ (R ⟕ S) ⟕ T  (depends on join conditions)

Outer join reordering requires careful analysis. Some transformations are valid under specific conditions, but the general rules are more restrictive than for inner joins.

Rule 6: Semi-join and Anti-join

π_{attrs(R)}(R ⋈ S) ≡ R ⋉ S  (semi-join)
R − π_{attrs(R)}(R ⋈ S) ≡ R ▷ S  (anti-join)

Semi-joins keep R tuples that have a match in S; anti-joins keep R tuples that have NO match. These are often more efficient than full joins when only existence is checked.

Join Reordering Complexity

While join commutativity and associativity hold for inner joins, real-world queries often have outer joins, semi-joins, or predicates that change semantics. Optimizers must carefully verify that each reordering is semantically valid, not just syntactically permitted.

Set Operation Equivalences

Set operations (∪, ∩, −) follow classical set theory equivalences, with some unique considerations for relational databases.

Rule 1: Union Commutativity and Associativity

R ∪ S ≡ S ∪ R
(R ∪ S) ∪ T ≡ R ∪ (S ∪ T)

Union is both commutative and associative, like set union in mathematics.

Rule 2: Intersection Commutativity and Associativity

R ∩ S ≡ S ∩ R
(R ∩ S) ∩ T ≡ R ∩ (S ∩ T)

Intersection similarly satisfies both properties.

Rule 3: Set Difference is NOT Commutative

R − S ≢ S − R  (in general)

R − S removes S's tuples from R; S − R removes R's tuples from S. Completely different operations.

Rule 4: Distributive Laws

R ∩ (S ∪ T) ≡ (R ∩ S) ∪ (R ∩ T)
R ∪ (S ∩ T) ≡ (R ∪ S) ∩ (R ∪ T)

Intersection distributes over union and vice versa, following set theory.

Rule 5: De Morgan's Laws

U − (R ∪ S) ≡ (U − R) ∩ (U − S)
U − (R ∩ S) ≡ (U − R) ∪ (U − S)

Where U is a "universal" relation. These enable alternative formulations for complex set expressions.

Set Operation Property Summary
Property	Union (∪)	Intersection (∩)	Difference (−)
Commutative	Yes: R ∪ S = S ∪ R	Yes: R ∩ S = S ∩ R	No: R − S ≠ S − R
Associative	Yes: (R∪S)∪T = R∪(S∪T)	Yes: (R∩S)∩T = R∩(S∩T)	No
Idempotent	Yes: R ∪ R = R	Yes: R ∩ R = R	R − R = ∅ (empty)
Identity	R ∪ ∅ = R	R ∩ U = R (universal)	R − ∅ = R
Annihilator	R ∪ U = U	R ∩ ∅ = ∅	∅ − R = ∅

Rule 6: Selection Distribution (Revisited)

σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
σ_p(R ∩ S) ≡ σ_p(R) ∩ σ_p(S)
σ_p(R − S) ≡ σ_p(R) − σ_p(S)

Selection distributes over all set operations, enabling pushdown before set combination.

Rule 7: Intersection as Join

R ∩ S ≡ R ⋈ S  (when R and S have the same schema)

Intersection of union-compatible relations is equivalent to natural join on all attributes. This enables using join algorithms for intersection.

Rule 8: Difference through Anti-Join

R − S ≡ R ▷ S  (anti-join; tuples in R with no match in S)

For union-compatible relations, set difference equals anti-join. This may enable more efficient execution using anti-join algorithms.

NOT IN vs NOT EXISTS

SQL's NOT IN and NOT EXISTS both map to set difference or anti-join conceptually. Understanding these equivalences helps you choose the right SQL construct—and understand why the optimizer might transform one into another.

General Equivalence Rules

Beyond operation-specific rules, several general principles govern equivalence in relational algebra.

Rule 1: Rename Equivalence

ρ_S(ρ_R(T)) ≡ ρ_S(T)

Sequential renames collapse—only the final name matters.

Rule 2: Rename and Other Operators

Rename generally commutes with most operators, modulo adjusting predicate/column references:

ρ_S(σ_p(R)) ≡ σ_{p[renamed]}(ρ_S(R))
ρ_S(R ⋈ T) ≡ ρ_S(R) ⋈ T  (adjusting references)

Rule 3: Aggregation Decomposition

For certain aggregates, we can compute partial results then combine:

SUM(R ∪ S) ≡ SUM(R) + SUM(S)  (for disjoint R, S)
COUNT(R ∪ S) ≡ COUNT(R) + COUNT(S)  (for disjoint R, S)

This enables parallel aggregation with final merge.

Rule 4: Redundant Operation Elimination

σ_true(R) ≡ R
π_{all-attrs}(R) ≡ R
R ∪ ∅ ≡ R
R ⋈ R ≡ R  (for relations with keys)

Redundant or identity operations can be eliminated.

Complete Equivalence Rule Summary

•Selection: Cascade (split AND), commute (reorder), push past join/product/set-ops
•Projection: Eliminate cascades, push past join/product (preserving needed attrs)
•Join: Commute, associate (for inner joins); use with care for outer joins
•Set Operations: Follow set theory laws; selection distributes over all
•Rename: Collapse sequences; adjust references when commuting
•Aggregation: Decompose for parallelism when aggregates allow
•Identity/Redundancy: Eliminate no-op operations

Rule 5: Constant Folding and Simplification

σ_{false}(R) ≡ ∅
σ_{A=A}(R) ≡ R  (assuming no nulls)
R ⋈_{false} S ≡ ∅

Predicates with constant truth values can be simplified.

Rule 6: Constraint-Based Equivalences

With knowledge of constraints, additional equivalences hold:

-- If K is a key of R:
π_K(R ⋈ R) ≡ π_K(R)

-- If FK references PK:
R ⋈ S ≡ R  (if the join is automatically satisfied by FK)

-- If NOT NULL constraint exists:
σ_{A IS NOT NULL}(R) ≡ R

Optimizers use constraint knowledge for semantic optimization.

Sound vs Complete Rule Sets

The rule sets we've discussed are sound (transformations preserve equivalence) but not necessarily complete (not all equivalences can be derived from these rules). Research continues on finding complete rule sets for various query languages. Practical optimizers use heuristics alongside proven rules.

Applying Equivalences in Optimization

Understanding equivalence rules is one thing; applying them systematically to find optimal plans is another. Let's examine how optimizers use these rules.

The Search Space

Given a query's initial expression tree, equivalence rules define a search space:

Each rule application produces an equivalent tree
Multiple rules can be applied in various orders
The number of equivalent trees can be enormous

For a query joining 10 tables, there may be millions of equivalent join orderings alone.

Search Strategies

Exhaustive Enumeration:

Generate all equivalent expressions, cost each, return the cheapest.

Guaranteed optimal
Feasible only for small queries
Used: System R, early relational optimizers

Dynamic Programming:

Build optimal plans for subexpressions, combine into full plans.

Avoids recomputing equivalent subplans
Polynomial complexity for many query patterns
Used: Most modern optimizers

Heuristic/Rule-Based:

Apply transformation rules in a fixed order without exhaustive search.

Fast but may miss optimal plan
Useful for simple queries or as a first pass
Used: Often combined with cost-based search

optimization_example.txt

Optimization Trace

-- Original Query (SQL)
SELECT e.name, d.dept_name
FROM Employees e, Departments d
WHERE e.dept_id = d.id AND e.salary > 50000 AND d.location = 'NYC';
 
-- Initial Expression Tree (naive translation)
π_{name, dept_name}(
  σ_{dept_id=id AND salary>50000 AND location='NYC'}(
    Employees × Departments
  )
)
 
-- Apply Rule: Split conjunction
π(σ_{dept_id=id}(σ_{salary>50000}(σ_{location='NYC'}(E × D))))
 
-- Apply Rule: Push selection (location='NYC' references only D)
π(σ_{dept_id=id}(σ_{salary>50000}(E × σ_{location='NYC'}(D))))
 
-- Apply Rule: Push selection (salary>50000 references only E)
π(σ_{dept_id=id}(σ_{salary>50000}(E) × σ_{location='NYC'}(D)))
 
-- Apply Rule: Product + selection on both → Join
π(σ_{salary>50000}(E) ⋈_{dept_id=id} σ_{location='NYC'}(D))
 
-- Apply Rule: Projection pushdown
π_{name, dept_name}(
  π_{name, dept_id}(σ_{salary>50000}(E)) 
  ⋈_{dept_id=id} 
  π_{id, dept_name}(σ_{location='NYC'}(D))
)
 
-- Final optimized plan: much more efficient!

Cost-Based Selection

After generating equivalent plans, the optimizer estimates costs:

Cardinality Estimation: How many tuples does each operator produce?
I/O Cost: How many disk pages read/written?
CPU Cost: How many comparisons, computations?
Memory Usage: Does the plan fit in available memory?
Network Cost: (For distributed) How much data transferred?

The plan with lowest total cost wins.

Practical Considerations:

Time Limits: Optimization can't take longer than execution; budgets limit search
Statistics: Cost estimation depends on statistics (histograms, row counts)
Hints: Users can influence optimizer choices when it makes mistakes
Plan Caching: Prepared statements reuse optimized plans

Learning to Read Plans

When a query is slow, examine its execution plan. Look for missed pushdowns, unexpected join orders, or full scans where index scans are expected. Each issue maps to an equivalence rule that the optimizer either applied incorrectly or couldn't apply due to constraints or missing statistics.

Summary: Query Equivalence

We've explored query equivalence—the theoretical foundation of query optimization. Let's consolidate the key insights:

Key Takeaways

•Definition — Two expressions are equivalent if they produce identical results for every possible database state.
•Selection Rules — Selections split, commute, and push past joins/products/set-ops, enabling early filtering.
•Projection Rules — Projections cascade, push past operators (with attribute preservation), enabling early column elimination.
•Join Rules — Inner joins are commutative and associative; join reordering is the key optimization for multi-table queries.
•Set Operation Rules — Follow classical set theory; selection distributes over all set operations.
•General Rules — Rename collapse, aggregation decomposition, redundancy elimination, and constraint-based simplification.
•Optimization Application — Equivalent rule applications define a search space; optimizers use dynamic programming or heuristics to find low-cost plans.

Module Complete

With this page, we've completed our overview of relational algebra. You now understand:

Relational algebra as a procedural query language
The types of operations and their categorization
The closure property that enables composition
Expression trees as the query representation
Query equivalence as the foundation of optimization

In the following modules, we'll dive deeper into specific operators—selection, projection, joins, and more—exploring their semantics, implementation algorithms, and optimization techniques in exhaustive detail.

Module Complete

You now understand query equivalence—the theoretical foundation that makes query optimization possible. Every transformation the optimizer applies, every plan it generates, rests on the equivalence rules we've explored. This knowledge will serve you whenever you analyze query performance, understand optimizer behavior, or design database systems.

5 / 5

Loading learning content...

Database Management SystemsRelational Algebra Overview

Relational Algebra Overview

LevelIntermediate

Duration60 mins

TopicRelational Algebra Overview

5 / 5

Query Equivalence

Different Paths, Same Destination

Consider two relational algebra expressions:

1: π_name(σ_{salary>50000}(Employees))
2: σ_{salary>50000}(π_{name,salary}(Employees))

Do they produce the same result? If so, which should we execute? These questions—of equivalence and optimization—are at the heart of query processing.

What You Will Learn

Formal Definition of Equivalence

Definition: Query Equivalence

Two relational algebra expressions E₁ and E₂ are equivalent, written E₁ ≡ E₂, if and only if:

For every database instance D, the evaluation of E₁ on D produces the same result as the evaluation of E₂ on D.

Formally:

E₁ ≡ E₂  ⟺  ∀D. E₁(D) = E₂(D)

Where:

D ranges over all possible database instances
E(D) denotes the relation produced by evaluating expression E on database D
= denotes set equality (identical tuples)

Key Points:

Universal Quantification: Equivalence must hold for ALL possible database states, not just the current one or specific test cases.
Set Equality: Results must contain exactly the same tuples. Same tuples in different order are still equal (relations are sets, unordered).
Schema Compatibility: Equivalent expressions must produce results with the same schema (column names and types). If schemas differ, expressions aren't equivalent even if the data values match.

Examples:

Equivalent:

σ_{p∧q}(R) ≡ σ_p(σ_q(R)) ≡ σ_q(σ_p(R))

Conjunctive selections can be split and reordered.

NOT Equivalent:

π_A(σ_B(R)) ≢ σ_B(π_A(R))  when B ∉ A

If attribute B isn't in projection list A, we can't filter on it after projection.

Testing Is Not Proving

Structural vs Semantic Equivalence:

Structural Equivalence: Expressions match syntactically after normalization (essentially the same expression with possible renaming).

Semantic Equivalence: Expressions produce identical results, possibly through completely different computational paths.

Query optimization exploits semantic equivalence—finding syntactically different but semantically identical expressions that execute more efficiently.

Why Equivalence Matters:

Optimization Correctness: Transformations are only valid if they preserve equivalence
View Substitution: Views can be substituted for their definitions because they're equivalent
Constraint Enforcement: Some constraints guarantee equivalences (e.g., key constraints enable certain simplifications)
Query Normalization: Multiple user queries can be recognized as equivalent for caching

Equivalence vs Non-Equivalence Examples
Expression 1	Expression 2	Equivalent?	Reason
σ_p(σ_q(R))	σ_q(σ_p(R))	Yes	Selection is commutative
R ⋈ S	S ⋈ R	Yes	Natural join is commutative
R − S	S − R	No	Set difference is NOT commutative
π_A(π_B(R))	π_A(R) where A⊆B	Yes	Cascading projections
σ_p(R ⋈ S)	σ_p(R) ⋈ S (p refs only R)	Yes	Selection pushdown
R ⟕ S	S ⟖ R	Yes	Outer join direction swap

Selection Equivalences

Selection (σ) has several important equivalence rules that form the foundation of selection pushdown optimization.

Rule 1: Conjunction Splitting (Cascade Rule)

σ_{p₁ ∧ p₂ ∧ ... ∧ pₙ}(R) ≡ σ_{p₁}(σ_{p₂}(...σ_{pₙ}(R)...))

A selection with a conjunctive (AND) predicate can be split into a cascade of selections. This enables pushing individual conditions to different places in the expression tree.

Rule 2: Selection Commutativity

σ_{p₁}(σ_{p₂}(R)) ≡ σ_{p₂}(σ_{p₁}(R))

The order of cascaded selections doesn't matter. This flexibility allows the optimizer to arrange selections in the most beneficial order (e.g., most selective first).

Rule 3: Selection and Projection Commutativity (Conditional)

π_L(σ_p(R)) ≡ σ_p(π_L(R))  only if p references only attributes in L

If the selection predicate uses only attributes that survive projection, we can swap their order. This often enables projecting early to narrow tuples.

Rule 4: Selection and Cartesian Product

σ_p(R × S) ≡ R ⋈_p S  when p references both R and S
σ_p(R × S) ≡ σ_p(R) × S  when p references only R

Selection over product with a join condition IS a theta join. Selection with a predicate referencing only one relation can be pushed to that relation.

selection_equivalences.txt

Relational Algebra

-- RULE 1: Conjunction Splitting
-- Original (compound predicate)
σ_( dept='Eng' AND salary>70000 AND hired>2020-01-01 )(Employees)
 
-- Equivalent cascade (optimizable)
σ_(dept='Eng')(
  σ_(salary>70000)(
    σ_(hired>2020-01-01)(Employees)
  )
)
 
-- RULE 2: Selection Commutativity
-- The cascade can be reordered for efficiency
-- If dept='Eng' is most selective, do it first:
σ_(hired>2020-01-01)(
  σ_(salary>70000)(
    σ_(dept='Eng')(Employees)  -- Most selective first
  )
)
 
-- RULE 3: Selection/Projection Commutativity
-- Original
π_(name, dept)(σ_(dept='Eng')(Employees))
 
-- Equivalent (only if we don't need other attributes for predicate)
σ_(dept='Eng')(π_(name, dept)(Employees))
 
-- RULE 4: Selection Distribution over Join
-- Original
σ_(E.dept='Eng' AND D.location='NYC')(Employees × Departments)
 
-- Optimized (push conditions to appropriate relations)
σ_(E.dept='Eng')(Employees) ⋈ σ_(D.location='NYC')(Departments)

Rule 5: Selection and Join

σ_p(R ⋈ S) ≡ σ_{p1}(R) ⋈ σ_{p2}(S) ⋈ (conditions involving both)

Where predicate p can be decomposed into:

p1: conditions on R's attributes only
p2: conditions on S's attributes only
Remaining: conditions involving both R and S

Pushing selections past joins is the most impactful optimization rule, often reducing join input sizes dramatically.

Rule 6: Selection and Union/Intersection/Difference

σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
σ_p(R ∩ S) ≡ σ_p(R) ∩ σ_p(S)
σ_p(R − S) ≡ σ_p(R) − S    (note: only pushed to first operand)
           ≡ σ_p(R) − σ_p(S)  (alternative)

Selection distributes over set operations, enabling early filtering.

Selection Pushdown: The #1 Optimization

Projection Equivalences

Projection (π) equivalences enable removing unneeded columns early, reducing data volume and memory usage.

Rule 1: Cascading Projections

π_L₁(π_L₂(R)) ≡ π_L₁(R)  when L₁ ⊆ L₂

Nested projections simplify to the outermost (smallest) projection. Intermediate projections that keep extra columns are wasteful.

Rule 2: Projection and Selection

As noted earlier:

π_L(σ_p(R)) ≡ σ_p(π_{L∪attrs(p)}(R))  followed by π_L

We can project away columns not needed for the final result or the selection predicate. The minimal projection keeps L plus any attributes referenced in p.

Rule 3: Projection and Cartesian Product

π_L(R × S) ≡ π_{L∩attrs(R)}(R) × π_{L∩attrs(S)}(S)  when L = L_R ∪ L_S

We can push projection over product if the projected columns come from separate relations.

Rule 4: Projection and Join

π_L(R ⋈ S) ≡ π_L(π_{L ∪ join_attrs(R)}(R) ⋈ π_{L ∪ join_attrs(S)}(S))

Before joining, project away columns not needed for the join condition or the final result. This narrows tuples, reducing memory and potentially I/O.

Projection Pushdown Examples
Original	Optimized	Columns Eliminated
π_{name}(Employees ⋈ Departments)	π_{name}(π_{name,dept_id}(E) ⋈ π_{id}(D))	All except name, dept_id, id
π_{A,B}(π_{A,B,C,D}(R))	π_{A,B}(R)	Intermediate C, D
π_{name}(σ_{salary>50k}(Employees))	π_{name}(σ_{sal>50k}(π_{name,salary}(E)))	All except name, salary

Rule 5: Projection and Union

π_L(R ∪ S) ≡ π_L(R) ∪ π_L(S)

Projection distributes over union. This enables projecting before combining, reducing data volume throughout.

Rule 6: Projection and Set Difference

π_L(R − S) ≡ π_L(R) − π_L(S)

Similarly, projection distributes over difference.

Projection Pushdown Limitations:

Projection can't always be pushed:

Attributes needed for selection: Must keep columns used in predicates
Attributes needed for join: Must keep join columns
Grouping attributes: Aggregation requires grouping columns
Order by attributes: If ORDER BY is later, need those columns

The optimizer computes the minimum attribute set needed at each tree node and projects away everything else.

Column Pruning in Practice

Join Equivalences

Join equivalences are crucial because joins are often the most expensive operations, and reordering can dramatically affect performance.

Rule 1: Join Commutativity

R ⋈ S ≡ S ⋈ R

For natural join and inner join, operand order doesn't affect the result. This enables choosing which relation to use as the "build" side in hash joins, for example.

Rule 2: Join Associativity

(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)

Join order can be changed without affecting results. This is the foundation of join reordering optimization—finding the order that minimizes intermediate result sizes.

Rule 3: Theta Join Equivalence

R ⋈_θ S ≡ σ_θ(R × S)

Theta join is equivalent to Cartesian product followed by selection. While we'd never execute this way, it helps prove other equivalences.

Rule 4: Join and Selection Commutativity

σ_p(R ⋈ S) ≡ R ⋈_p S  (pushing p into join condition)
σ_{p1∧p2}(R ⋈ S) ≡ σ_{p1}(R) ⋈ σ_{p2}(S) ⋈_{remaining}  (splitting p)

Selections can be pushed into or past joins when predicate attributes allow.

Converting Mermaid diagram...

Join Orders and Tree Shapes:

For n tables, the number of different join orders is:

Left-deep trees only: n!
All binary trees (bushy): Catalan number C(n-1) × n! (approximately 4^n / (n^1.5))

For 10 tables:

Left-deep: 3.6 million orderings
Bushy: billions of orderings

Optimizers use dynamic programming or greedy heuristics to search this space efficiently.

Rule 5: Outer Join Considerations

Outer joins are NOT fully commutative or associative:

R ⟕ S ≢ S ⟕ R  (LEFT vs RIGHT outer join differ)
R ⟕ (S ⟕ T) may ≢ (R ⟕ S) ⟕ T  (depends on join conditions)

Outer join reordering requires careful analysis. Some transformations are valid under specific conditions, but the general rules are more restrictive than for inner joins.

Rule 6: Semi-join and Anti-join

π_{attrs(R)}(R ⋈ S) ≡ R ⋉ S  (semi-join)
R − π_{attrs(R)}(R ⋈ S) ≡ R ▷ S  (anti-join)

Semi-joins keep R tuples that have a match in S; anti-joins keep R tuples that have NO match. These are often more efficient than full joins when only existence is checked.

Join Reordering Complexity

Set Operation Equivalences

Set operations (∪, ∩, −) follow classical set theory equivalences, with some unique considerations for relational databases.

Rule 1: Union Commutativity and Associativity

R ∪ S ≡ S ∪ R
(R ∪ S) ∪ T ≡ R ∪ (S ∪ T)

Union is both commutative and associative, like set union in mathematics.

Rule 2: Intersection Commutativity and Associativity

R ∩ S ≡ S ∩ R
(R ∩ S) ∩ T ≡ R ∩ (S ∩ T)

Intersection similarly satisfies both properties.

Rule 3: Set Difference is NOT Commutative

R − S ≢ S − R  (in general)

R − S removes S's tuples from R; S − R removes R's tuples from S. Completely different operations.

Rule 4: Distributive Laws

R ∩ (S ∪ T) ≡ (R ∩ S) ∪ (R ∩ T)
R ∪ (S ∩ T) ≡ (R ∪ S) ∩ (R ∪ T)

Intersection distributes over union and vice versa, following set theory.

Rule 5: De Morgan's Laws

U − (R ∪ S) ≡ (U − R) ∩ (U − S)
U − (R ∩ S) ≡ (U − R) ∪ (U − S)

Where U is a "universal" relation. These enable alternative formulations for complex set expressions.

Set Operation Property Summary
Property	Union (∪)	Intersection (∩)	Difference (−)
Commutative	Yes: R ∪ S = S ∪ R	Yes: R ∩ S = S ∩ R	No: R − S ≠ S − R
Associative	Yes: (R∪S)∪T = R∪(S∪T)	Yes: (R∩S)∩T = R∩(S∩T)	No
Idempotent	Yes: R ∪ R = R	Yes: R ∩ R = R	R − R = ∅ (empty)
Identity	R ∪ ∅ = R	R ∩ U = R (universal)	R − ∅ = R
Annihilator	R ∪ U = U	R ∩ ∅ = ∅	∅ − R = ∅

Rule 6: Selection Distribution (Revisited)

σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
σ_p(R ∩ S) ≡ σ_p(R) ∩ σ_p(S)
σ_p(R − S) ≡ σ_p(R) − σ_p(S)

Selection distributes over all set operations, enabling pushdown before set combination.

Rule 7: Intersection as Join

R ∩ S ≡ R ⋈ S  (when R and S have the same schema)

Intersection of union-compatible relations is equivalent to natural join on all attributes. This enables using join algorithms for intersection.

Rule 8: Difference through Anti-Join

R − S ≡ R ▷ S  (anti-join; tuples in R with no match in S)

For union-compatible relations, set difference equals anti-join. This may enable more efficient execution using anti-join algorithms.

NOT IN vs NOT EXISTS

General Equivalence Rules

Beyond operation-specific rules, several general principles govern equivalence in relational algebra.

Rule 1: Rename Equivalence

ρ_S(ρ_R(T)) ≡ ρ_S(T)

Sequential renames collapse—only the final name matters.

Rule 2: Rename and Other Operators

Rename generally commutes with most operators, modulo adjusting predicate/column references:

ρ_S(σ_p(R)) ≡ σ_{p[renamed]}(ρ_S(R))
ρ_S(R ⋈ T) ≡ ρ_S(R) ⋈ T  (adjusting references)

Rule 3: Aggregation Decomposition

For certain aggregates, we can compute partial results then combine:

SUM(R ∪ S) ≡ SUM(R) + SUM(S)  (for disjoint R, S)
COUNT(R ∪ S) ≡ COUNT(R) + COUNT(S)  (for disjoint R, S)

This enables parallel aggregation with final merge.

Rule 4: Redundant Operation Elimination

σ_true(R) ≡ R
π_{all-attrs}(R) ≡ R
R ∪ ∅ ≡ R
R ⋈ R ≡ R  (for relations with keys)

Redundant or identity operations can be eliminated.

Complete Equivalence Rule Summary

•Selection: Cascade (split AND), commute (reorder), push past join/product/set-ops
•Projection: Eliminate cascades, push past join/product (preserving needed attrs)
•Join: Commute, associate (for inner joins); use with care for outer joins
•Set Operations: Follow set theory laws; selection distributes over all
•Rename: Collapse sequences; adjust references when commuting
•Aggregation: Decompose for parallelism when aggregates allow
•Identity/Redundancy: Eliminate no-op operations

Rule 5: Constant Folding and Simplification

σ_{false}(R) ≡ ∅
σ_{A=A}(R) ≡ R  (assuming no nulls)
R ⋈_{false} S ≡ ∅

Predicates with constant truth values can be simplified.

Rule 6: Constraint-Based Equivalences

With knowledge of constraints, additional equivalences hold:

-- If K is a key of R:
π_K(R ⋈ R) ≡ π_K(R)

-- If FK references PK:
R ⋈ S ≡ R  (if the join is automatically satisfied by FK)

-- If NOT NULL constraint exists:
σ_{A IS NOT NULL}(R) ≡ R

Optimizers use constraint knowledge for semantic optimization.

Sound vs Complete Rule Sets

Applying Equivalences in Optimization

Understanding equivalence rules is one thing; applying them systematically to find optimal plans is another. Let's examine how optimizers use these rules.

The Search Space

Given a query's initial expression tree, equivalence rules define a search space:

Each rule application produces an equivalent tree
Multiple rules can be applied in various orders
The number of equivalent trees can be enormous

For a query joining 10 tables, there may be millions of equivalent join orderings alone.

Search Strategies

Exhaustive Enumeration:

Generate all equivalent expressions, cost each, return the cheapest.

Guaranteed optimal
Feasible only for small queries
Used: System R, early relational optimizers

Dynamic Programming:

Build optimal plans for subexpressions, combine into full plans.

Avoids recomputing equivalent subplans
Polynomial complexity for many query patterns
Used: Most modern optimizers

Heuristic/Rule-Based:

Apply transformation rules in a fixed order without exhaustive search.

Fast but may miss optimal plan
Useful for simple queries or as a first pass
Used: Often combined with cost-based search

optimization_example.txt

Optimization Trace

-- Original Query (SQL)
SELECT e.name, d.dept_name
FROM Employees e, Departments d
WHERE e.dept_id = d.id AND e.salary > 50000 AND d.location = 'NYC';
 
-- Initial Expression Tree (naive translation)
π_{name, dept_name}(
  σ_{dept_id=id AND salary>50000 AND location='NYC'}(
    Employees × Departments
  )
)
 
-- Apply Rule: Split conjunction
π(σ_{dept_id=id}(σ_{salary>50000}(σ_{location='NYC'}(E × D))))
 
-- Apply Rule: Push selection (location='NYC' references only D)
π(σ_{dept_id=id}(σ_{salary>50000}(E × σ_{location='NYC'}(D))))
 
-- Apply Rule: Push selection (salary>50000 references only E)
π(σ_{dept_id=id}(σ_{salary>50000}(E) × σ_{location='NYC'}(D)))
 
-- Apply Rule: Product + selection on both → Join
π(σ_{salary>50000}(E) ⋈_{dept_id=id} σ_{location='NYC'}(D))
 
-- Apply Rule: Projection pushdown
π_{name, dept_name}(
  π_{name, dept_id}(σ_{salary>50000}(E)) 
  ⋈_{dept_id=id} 
  π_{id, dept_name}(σ_{location='NYC'}(D))
)
 
-- Final optimized plan: much more efficient!

Cost-Based Selection

After generating equivalent plans, the optimizer estimates costs:

Cardinality Estimation: How many tuples does each operator produce?
I/O Cost: How many disk pages read/written?
CPU Cost: How many comparisons, computations?
Memory Usage: Does the plan fit in available memory?
Network Cost: (For distributed) How much data transferred?

The plan with lowest total cost wins.

Practical Considerations:

Time Limits: Optimization can't take longer than execution; budgets limit search
Statistics: Cost estimation depends on statistics (histograms, row counts)
Hints: Users can influence optimizer choices when it makes mistakes
Plan Caching: Prepared statements reuse optimized plans

Learning to Read Plans

Summary: Query Equivalence

We've explored query equivalence—the theoretical foundation of query optimization. Let's consolidate the key insights:

Key Takeaways

•Definition — Two expressions are equivalent if they produce identical results for every possible database state.
•Selection Rules — Selections split, commute, and push past joins/products/set-ops, enabling early filtering.
•Projection Rules — Projections cascade, push past operators (with attribute preservation), enabling early column elimination.
•Join Rules — Inner joins are commutative and associative; join reordering is the key optimization for multi-table queries.
•Set Operation Rules — Follow classical set theory; selection distributes over all set operations.
•General Rules — Rename collapse, aggregation decomposition, redundancy elimination, and constraint-based simplification.
•Optimization Application — Equivalent rule applications define a search space; optimizers use dynamic programming or heuristics to find low-cost plans.

Module Complete

With this page, we've completed our overview of relational algebra. You now understand:

Relational algebra as a procedural query language
The types of operations and their categorization
The closure property that enables composition
Expression trees as the query representation
Query equivalence as the foundation of optimization

Module Complete

5 / 5