Loading learning content...
Consider two relational algebra expressions:
1: π_name(σ_{salary>50000}(Employees))
2: σ_{salary>50000}(π_{name,salary}(Employees))
Do they produce the same result? If so, which should we execute? These questions—of equivalence and optimization—are at the heart of query processing.
Query equivalence is the theoretical foundation that makes query optimization possible. When we know that two expressions are equivalent—they produce identical results for any database state—we can freely substitute one for the other. This freedom to substitute is what allows optimizers to explore vast spaces of alternative execution strategies while guaranteeing correctness.
In this page, we'll formally define query equivalence, explore the equivalence rules that govern relational algebra, understand the conditions under which transformations are valid, and see how these rules enable the optimizations that make databases fast.
By the end of this page, you will understand: the formal definition of query equivalence; the major equivalence rules for relational algebra operations; conditions that make transformations valid or invalid; how equivalence rules combine to enable complex optimizations; and the relationship between equivalence and query optimization.
Definition: Query Equivalence
Two relational algebra expressions E₁ and E₂ are equivalent, written E₁ ≡ E₂, if and only if:
For every database instance D, the evaluation of E₁ on D produces the same result as the evaluation of E₂ on D.
Formally:
E₁ ≡ E₂ ⟺ ∀D. E₁(D) = E₂(D)
Where:
Key Points:
Universal Quantification: Equivalence must hold for ALL possible database states, not just the current one or specific test cases.
Set Equality: Results must contain exactly the same tuples. Same tuples in different order are still equal (relations are sets, unordered).
Schema Compatibility: Equivalent expressions must produce results with the same schema (column names and types). If schemas differ, expressions aren't equivalent even if the data values match.
Examples:
Equivalent:
σ_{p∧q}(R) ≡ σ_p(σ_q(R)) ≡ σ_q(σ_p(R))
Conjunctive selections can be split and reordered.
NOT Equivalent:
π_A(σ_B(R)) ≢ σ_B(π_A(R)) when B ∉ A
If attribute B isn't in projection list A, we can't filter on it after projection.
Just because two expressions produce the same result on test data doesn't mean they're equivalent. Equivalence requires proof across ALL possible database states. A transformation that works for test data but fails for boundary cases (empty tables, nulls, duplicate values) is incorrect, regardless of test success.
Structural vs Semantic Equivalence:
Structural Equivalence: Expressions match syntactically after normalization (essentially the same expression with possible renaming).
Semantic Equivalence: Expressions produce identical results, possibly through completely different computational paths.
Query optimization exploits semantic equivalence—finding syntactically different but semantically identical expressions that execute more efficiently.
Why Equivalence Matters:
| Expression 1 | Expression 2 | Equivalent? | Reason |
|---|---|---|---|
| σ_p(σ_q(R)) | σ_q(σ_p(R)) | Yes | Selection is commutative |
| R ⋈ S | S ⋈ R | Yes | Natural join is commutative |
| R − S | S − R | No | Set difference is NOT commutative |
| π_A(π_B(R)) | π_A(R) where A⊆B | Yes | Cascading projections |
| σ_p(R ⋈ S) | σ_p(R) ⋈ S (p refs only R) | Yes | Selection pushdown |
| R ⟕ S | S ⟖ R | Yes | Outer join direction swap |
Selection (σ) has several important equivalence rules that form the foundation of selection pushdown optimization.
Rule 1: Conjunction Splitting (Cascade Rule)
σ_{p₁ ∧ p₂ ∧ ... ∧ pₙ}(R) ≡ σ_{p₁}(σ_{p₂}(...σ_{pₙ}(R)...))
A selection with a conjunctive (AND) predicate can be split into a cascade of selections. This enables pushing individual conditions to different places in the expression tree.
Rule 2: Selection Commutativity
σ_{p₁}(σ_{p₂}(R)) ≡ σ_{p₂}(σ_{p₁}(R))
The order of cascaded selections doesn't matter. This flexibility allows the optimizer to arrange selections in the most beneficial order (e.g., most selective first).
Rule 3: Selection and Projection Commutativity (Conditional)
π_L(σ_p(R)) ≡ σ_p(π_L(R)) only if p references only attributes in L
If the selection predicate uses only attributes that survive projection, we can swap their order. This often enables projecting early to narrow tuples.
Rule 4: Selection and Cartesian Product
σ_p(R × S) ≡ R ⋈_p S when p references both R and S
σ_p(R × S) ≡ σ_p(R) × S when p references only R
Selection over product with a join condition IS a theta join. Selection with a predicate referencing only one relation can be pushed to that relation.
123456789101112131415161718192021222324252627282930313233
-- RULE 1: Conjunction Splitting-- Original (compound predicate)σ_( dept='Eng' AND salary>70000 AND hired>2020-01-01 )(Employees) -- Equivalent cascade (optimizable)σ_(dept='Eng')( σ_(salary>70000)( σ_(hired>2020-01-01)(Employees) )) -- RULE 2: Selection Commutativity-- The cascade can be reordered for efficiency-- If dept='Eng' is most selective, do it first:σ_(hired>2020-01-01)( σ_(salary>70000)( σ_(dept='Eng')(Employees) -- Most selective first )) -- RULE 3: Selection/Projection Commutativity-- Originalπ_(name, dept)(σ_(dept='Eng')(Employees)) -- Equivalent (only if we don't need other attributes for predicate)σ_(dept='Eng')(π_(name, dept)(Employees)) -- RULE 4: Selection Distribution over Join-- Originalσ_(E.dept='Eng' AND D.location='NYC')(Employees × Departments) -- Optimized (push conditions to appropriate relations)σ_(E.dept='Eng')(Employees) ⋈ σ_(D.location='NYC')(Departments)Rule 5: Selection and Join
σ_p(R ⋈ S) ≡ σ_{p1}(R) ⋈ σ_{p2}(S) ⋈ (conditions involving both)
Where predicate p can be decomposed into:
Pushing selections past joins is the most impactful optimization rule, often reducing join input sizes dramatically.
Rule 6: Selection and Union/Intersection/Difference
σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
σ_p(R ∩ S) ≡ σ_p(R) ∩ σ_p(S)
σ_p(R − S) ≡ σ_p(R) − S (note: only pushed to first operand)
≡ σ_p(R) − σ_p(S) (alternative)
Selection distributes over set operations, enabling early filtering.
Selection pushdown is typically the single most effective optimization. Reducing input to joins from millions of rows to thousands can change execution time from hours to seconds. Optimizers aggressively push selections as early as possible, splitting compound predicates to maximize pushdown opportunities.
Projection (π) equivalences enable removing unneeded columns early, reducing data volume and memory usage.
Rule 1: Cascading Projections
π_L₁(π_L₂(R)) ≡ π_L₁(R) when L₁ ⊆ L₂
Nested projections simplify to the outermost (smallest) projection. Intermediate projections that keep extra columns are wasteful.
Rule 2: Projection and Selection
As noted earlier:
π_L(σ_p(R)) ≡ σ_p(π_{L∪attrs(p)}(R)) followed by π_L
We can project away columns not needed for the final result or the selection predicate. The minimal projection keeps L plus any attributes referenced in p.
Rule 3: Projection and Cartesian Product
π_L(R × S) ≡ π_{L∩attrs(R)}(R) × π_{L∩attrs(S)}(S) when L = L_R ∪ L_S
We can push projection over product if the projected columns come from separate relations.
Rule 4: Projection and Join
π_L(R ⋈ S) ≡ π_L(π_{L ∪ join_attrs(R)}(R) ⋈ π_{L ∪ join_attrs(S)}(S))
Before joining, project away columns not needed for the join condition or the final result. This narrows tuples, reducing memory and potentially I/O.
| Original | Optimized | Columns Eliminated |
|---|---|---|
| π_{name}(Employees ⋈ Departments) | π_{name}(π_{name,dept_id}(E) ⋈ π_{id}(D)) | All except name, dept_id, id |
| π_{A,B}(π_{A,B,C,D}(R)) | π_{A,B}(R) | Intermediate C, D |
| π_{name}(σ_{salary>50k}(Employees)) | π_{name}(σ_{sal>50k}(π_{name,salary}(E))) | All except name, salary |
Rule 5: Projection and Union
π_L(R ∪ S) ≡ π_L(R) ∪ π_L(S)
Projection distributes over union. This enables projecting before combining, reducing data volume throughout.
Rule 6: Projection and Set Difference
π_L(R − S) ≡ π_L(R) − π_L(S)
Similarly, projection distributes over difference.
Projection Pushdown Limitations:
Projection can't always be pushed:
The optimizer computes the minimum attribute set needed at each tree node and projects away everything else.
Modern databases perform 'column pruning' automatically—tracking which columns are needed at each point and eliminating others. In columnar storage systems (like Parquet or column-oriented databases), this is especially impactful as unneeded columns are never even read from disk.
Join equivalences are crucial because joins are often the most expensive operations, and reordering can dramatically affect performance.
Rule 1: Join Commutativity
R ⋈ S ≡ S ⋈ R
For natural join and inner join, operand order doesn't affect the result. This enables choosing which relation to use as the "build" side in hash joins, for example.
Rule 2: Join Associativity
(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
Join order can be changed without affecting results. This is the foundation of join reordering optimization—finding the order that minimizes intermediate result sizes.
Rule 3: Theta Join Equivalence
R ⋈_θ S ≡ σ_θ(R × S)
Theta join is equivalent to Cartesian product followed by selection. While we'd never execute this way, it helps prove other equivalences.
Rule 4: Join and Selection Commutativity
σ_p(R ⋈ S) ≡ R ⋈_p S (pushing p into join condition)
σ_{p1∧p2}(R ⋈ S) ≡ σ_{p1}(R) ⋈ σ_{p2}(S) ⋈_{remaining} (splitting p)
Selections can be pushed into or past joins when predicate attributes allow.
Join Orders and Tree Shapes:
For n tables, the number of different join orders is:
For 10 tables:
Optimizers use dynamic programming or greedy heuristics to search this space efficiently.
Rule 5: Outer Join Considerations
Outer joins are NOT fully commutative or associative:
R ⟕ S ≢ S ⟕ R (LEFT vs RIGHT outer join differ)
R ⟕ (S ⟕ T) may ≢ (R ⟕ S) ⟕ T (depends on join conditions)
Outer join reordering requires careful analysis. Some transformations are valid under specific conditions, but the general rules are more restrictive than for inner joins.
Rule 6: Semi-join and Anti-join
π_{attrs(R)}(R ⋈ S) ≡ R ⋉ S (semi-join)
R − π_{attrs(R)}(R ⋈ S) ≡ R ▷ S (anti-join)
Semi-joins keep R tuples that have a match in S; anti-joins keep R tuples that have NO match. These are often more efficient than full joins when only existence is checked.
While join commutativity and associativity hold for inner joins, real-world queries often have outer joins, semi-joins, or predicates that change semantics. Optimizers must carefully verify that each reordering is semantically valid, not just syntactically permitted.
Set operations (∪, ∩, −) follow classical set theory equivalences, with some unique considerations for relational databases.
Rule 1: Union Commutativity and Associativity
R ∪ S ≡ S ∪ R
(R ∪ S) ∪ T ≡ R ∪ (S ∪ T)
Union is both commutative and associative, like set union in mathematics.
Rule 2: Intersection Commutativity and Associativity
R ∩ S ≡ S ∩ R
(R ∩ S) ∩ T ≡ R ∩ (S ∩ T)
Intersection similarly satisfies both properties.
Rule 3: Set Difference is NOT Commutative
R − S ≢ S − R (in general)
R − S removes S's tuples from R; S − R removes R's tuples from S. Completely different operations.
Rule 4: Distributive Laws
R ∩ (S ∪ T) ≡ (R ∩ S) ∪ (R ∩ T)
R ∪ (S ∩ T) ≡ (R ∪ S) ∩ (R ∪ T)
Intersection distributes over union and vice versa, following set theory.
Rule 5: De Morgan's Laws
U − (R ∪ S) ≡ (U − R) ∩ (U − S)
U − (R ∩ S) ≡ (U − R) ∪ (U − S)
Where U is a "universal" relation. These enable alternative formulations for complex set expressions.
| Property | Union (∪) | Intersection (∩) | Difference (−) |
|---|---|---|---|
| Commutative | Yes: R ∪ S = S ∪ R | Yes: R ∩ S = S ∩ R | No: R − S ≠ S − R |
| Associative | Yes: (R∪S)∪T = R∪(S∪T) | Yes: (R∩S)∩T = R∩(S∩T) | No |
| Idempotent | Yes: R ∪ R = R | Yes: R ∩ R = R | R − R = ∅ (empty) |
| Identity | R ∪ ∅ = R | R ∩ U = R (universal) | R − ∅ = R |
| Annihilator | R ∪ U = U | R ∩ ∅ = ∅ | ∅ − R = ∅ |
Rule 6: Selection Distribution (Revisited)
σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
σ_p(R ∩ S) ≡ σ_p(R) ∩ σ_p(S)
σ_p(R − S) ≡ σ_p(R) − σ_p(S)
Selection distributes over all set operations, enabling pushdown before set combination.
Rule 7: Intersection as Join
R ∩ S ≡ R ⋈ S (when R and S have the same schema)
Intersection of union-compatible relations is equivalent to natural join on all attributes. This enables using join algorithms for intersection.
Rule 8: Difference through Anti-Join
R − S ≡ R ▷ S (anti-join; tuples in R with no match in S)
For union-compatible relations, set difference equals anti-join. This may enable more efficient execution using anti-join algorithms.
SQL's NOT IN and NOT EXISTS both map to set difference or anti-join conceptually. Understanding these equivalences helps you choose the right SQL construct—and understand why the optimizer might transform one into another.
Beyond operation-specific rules, several general principles govern equivalence in relational algebra.
Rule 1: Rename Equivalence
ρ_S(ρ_R(T)) ≡ ρ_S(T)
Sequential renames collapse—only the final name matters.
Rule 2: Rename and Other Operators
Rename generally commutes with most operators, modulo adjusting predicate/column references:
ρ_S(σ_p(R)) ≡ σ_{p[renamed]}(ρ_S(R))
ρ_S(R ⋈ T) ≡ ρ_S(R) ⋈ T (adjusting references)
Rule 3: Aggregation Decomposition
For certain aggregates, we can compute partial results then combine:
SUM(R ∪ S) ≡ SUM(R) + SUM(S) (for disjoint R, S)
COUNT(R ∪ S) ≡ COUNT(R) + COUNT(S) (for disjoint R, S)
This enables parallel aggregation with final merge.
Rule 4: Redundant Operation Elimination
σ_true(R) ≡ R
π_{all-attrs}(R) ≡ R
R ∪ ∅ ≡ R
R ⋈ R ≡ R (for relations with keys)
Redundant or identity operations can be eliminated.
Rule 5: Constant Folding and Simplification
σ_{false}(R) ≡ ∅
σ_{A=A}(R) ≡ R (assuming no nulls)
R ⋈_{false} S ≡ ∅
Predicates with constant truth values can be simplified.
Rule 6: Constraint-Based Equivalences
With knowledge of constraints, additional equivalences hold:
-- If K is a key of R:
π_K(R ⋈ R) ≡ π_K(R)
-- If FK references PK:
R ⋈ S ≡ R (if the join is automatically satisfied by FK)
-- If NOT NULL constraint exists:
σ_{A IS NOT NULL}(R) ≡ R
Optimizers use constraint knowledge for semantic optimization.
The rule sets we've discussed are sound (transformations preserve equivalence) but not necessarily complete (not all equivalences can be derived from these rules). Research continues on finding complete rule sets for various query languages. Practical optimizers use heuristics alongside proven rules.
Understanding equivalence rules is one thing; applying them systematically to find optimal plans is another. Let's examine how optimizers use these rules.
The Search Space
Given a query's initial expression tree, equivalence rules define a search space:
For a query joining 10 tables, there may be millions of equivalent join orderings alone.
Search Strategies
Exhaustive Enumeration:
Generate all equivalent expressions, cost each, return the cheapest.
Dynamic Programming:
Build optimal plans for subexpressions, combine into full plans.
Heuristic/Rule-Based:
Apply transformation rules in a fixed order without exhaustive search.
1234567891011121314151617181920212223242526272829303132
-- Original Query (SQL)SELECT e.name, d.dept_nameFROM Employees e, Departments dWHERE e.dept_id = d.id AND e.salary > 50000 AND d.location = 'NYC'; -- Initial Expression Tree (naive translation)π_{name, dept_name}( σ_{dept_id=id AND salary>50000 AND location='NYC'}( Employees × Departments )) -- Apply Rule: Split conjunctionπ(σ_{dept_id=id}(σ_{salary>50000}(σ_{location='NYC'}(E × D)))) -- Apply Rule: Push selection (location='NYC' references only D)π(σ_{dept_id=id}(σ_{salary>50000}(E × σ_{location='NYC'}(D)))) -- Apply Rule: Push selection (salary>50000 references only E)π(σ_{dept_id=id}(σ_{salary>50000}(E) × σ_{location='NYC'}(D))) -- Apply Rule: Product + selection on both → Joinπ(σ_{salary>50000}(E) ⋈_{dept_id=id} σ_{location='NYC'}(D)) -- Apply Rule: Projection pushdownπ_{name, dept_name}( π_{name, dept_id}(σ_{salary>50000}(E)) ⋈_{dept_id=id} π_{id, dept_name}(σ_{location='NYC'}(D))) -- Final optimized plan: much more efficient!Cost-Based Selection
After generating equivalent plans, the optimizer estimates costs:
The plan with lowest total cost wins.
Practical Considerations:
When a query is slow, examine its execution plan. Look for missed pushdowns, unexpected join orders, or full scans where index scans are expected. Each issue maps to an equivalence rule that the optimizer either applied incorrectly or couldn't apply due to constraints or missing statistics.
We've explored query equivalence—the theoretical foundation of query optimization. Let's consolidate the key insights:
Module Complete
With this page, we've completed our overview of relational algebra. You now understand:
In the following modules, we'll dive deeper into specific operators—selection, projection, joins, and more—exploring their semantics, implementation algorithms, and optimization techniques in exhaustive detail.
You now understand query equivalence—the theoretical foundation that makes query optimization possible. Every transformation the optimizer applies, every plan it generates, rests on the equivalence rules we've explored. This knowledge will serve you whenever you analyze query performance, understand optimizer behavior, or design database systems.