Loading learning content...
When you write a SQL query, you express what data you want—not how to retrieve it. The database's query optimizer must bridge this gap, transforming your declarative specification into an efficient execution plan. At the heart of this transformation process lies a profound mathematical insight: relational algebra expressions can be rewritten into equivalent forms that produce identical results but with vastly different performance characteristics.
This is not merely an optimization trick—it's a rigorous mathematical framework. Just as algebraic identities like a + b = b + a allow mathematicians to manipulate equations while preserving truth, relational algebra equivalences allow query optimizers to manipulate queries while preserving correctness. Understanding these equivalences is fundamental to understanding how databases achieve their remarkable performance.
By the end of this page, you will understand: (1) the formal definition of expression equivalence in relational algebra, (2) the major categories of equivalence rules, (3) how these rules preserve query semantics, and (4) why equivalence-based optimization is both sound and complete for relational query processing.
Before diving into specific rules, we must establish what it means for two relational algebra expressions to be equivalent. This definition is the bedrock upon which all query optimization rests.
Definition: Expression Equivalence
Two relational algebra expressions E₁ and E₂ are said to be equivalent, denoted E₁ ≡ E₂, if and only if:
For every legal database instance D, the expressions E₁ and E₂ produce exactly the same result relation when evaluated on D.
This definition has several critical implications:
1. Universal Quantification: Equivalence must hold for all possible database states, not just the current one. An optimizer cannot assume anything about the data beyond schema constraints.
2. Set Semantics: Under standard relational algebra, relations are sets (no duplicates). Two expressions are equivalent if they produce the same set of tuples. Under bag (multiset) semantics, multiplicities must also match.
3. Order Independence: Relational algebra operates on unordered sets. The physical order of tuples in the result is irrelevant for equivalence (though it matters for ORDER BY in SQL).
4. Attribute Preservation: Equivalent expressions must produce relations with the same schema (attribute names and types).
When a query optimizer applies an equivalence rule, it is mathematically guaranteed to preserve the query's semantics. This is not heuristic-based guessing—it's provably correct transformation. The optimizer can aggressively rewrite queries knowing the result will always be identical.
Why Equivalence Matters for Optimization
The power of equivalence lies in this fundamental observation: while equivalent expressions produce the same result, they may have dramatically different costs. Consider a simple example:
Suppose we have tables Employees (1 million rows) and Departments (100 rows), and we want employees in the 'Engineering' department.
Expression 1: σ_{dept='Engineering'}(Employees ⋈ Departments)
Expression 2: Employees ⋈ (σ_{dept='Engineering'}(Departments))
Both expressions produce identical results, but Expression 2 may be orders of magnitude faster. The optimizer's job is to find such efficient equivalents using a repertoire of transformation rules.
Relational algebra equivalence rules fall into several major categories, each addressing different aspects of query optimization. Understanding this taxonomy helps you reason about what transformations are possible and why they preserve correctness.
| Category | Purpose | Key Operations | Primary Benefit |
|---|---|---|---|
| Commutativity Rules | Reorder operands | Selection, Join | Enable optimal operation ordering |
| Associativity Rules | Regroup operations | Join, Union, Intersection | Enable join order optimization |
| Distributivity Rules | Push/pull operations | Selection, Projection, Join | Reduce intermediate result sizes |
| Idempotence Rules | Eliminate redundancy | Selection, Projection | Remove unnecessary operations |
| Decomposition Rules | Split/merge operations | Selection (conjunctions), Projection | Enable fine-grained optimization |
| Null/Empty Rules | Handle edge cases | All operations with empty sets | Short-circuit unnecessary computation |
Let's examine each category in detail, understanding both the mathematical basis and the practical implications for query processing.
Now let's formalize the core equivalence rules that every query optimizer employs. These rules form the mathematical toolkit that transforms naive query plans into efficient ones.
Selection is the workhorse of query optimization. These rules govern how selection predicates can be manipulated:
Rule S1: Cascade of Selections (Decomposition)
σ_{p₁∧p₂∧...∧pₙ}(R) ≡ σ_{p₁}(σ_{p₂}(...σ_{pₙ}(R)...))
A conjunction of conditions can be split into a cascade of individual selections. This enables predicates to be applied at different points in the query plan.
Rule S2: Commutativity of Selections
σ_{p₁}(σ_{p₂}(R)) ≡ σ_{p₂}(σ_{p₁}(R))
The order in which selections are applied doesn't matter (mathematically). However, performance differs—more selective predicates should be applied first.
Rule S3: Idempotence of Selection
σ_p(σ_p(R)) ≡ σ_p(R)
Applying the same selection twice is redundant. Optimizers use this to eliminate duplicate predicates.
**Rule S4: Selection with OR (Disjunction)
σ_{p₁∨p₂}(R) ≡ σ_{p₁}(R) ∪ σ_{p₂}(R) [if p₁ and p₂ are disjoint]
OR conditions can potentially be split into unions, enabling parallel or index-based processing.
The cascade rule works both ways. Sometimes splitting is beneficial (to push predicates closer to base tables), and sometimes combining is better (to enable index usage on compound conditions). Good optimizers consider both directions.
Projection removes attributes, reducing the width of tuples flowing through the query plan:
Rule P1: Cascade of Projections
π_{L₁}(π_{L₂}(R)) ≡ π_{L₁}(R) [if L₁ ⊆ L₂]
If we project on a subset of attributes, an intermediate projection on a superset is redundant.
Rule P2: Projection Commuting with Selection
π_L(σ_p(R)) ≡ σ_p(π_{L∪attrs(p)}(R)) [if attrs(p) ⊆ L]
Projection can move past selection if the selection's attributes are preserved. This is subtle—if selection needs attributes not in L, we must temporarily include them.
Union, intersection, and set difference have their own equivalence properties:
Rule U1: Commutativity of Union and Intersection
R ∪ S ≡ S ∪ R
R ∩ S ≡ S ∩ R
Rule U2: Associativity of Union and Intersection
(R ∪ S) ∪ T ≡ R ∪ (S ∪ T)
(R ∩ S) ∩ T ≡ R ∩ (S ∩ T)
Note: Set difference is neither commutative nor associative: R - S ≢ S - R.
Join operations are typically the most expensive components of query execution. Consequently, the equivalence rules governing joins are among the most impactful for optimization.
Rule J1: Commutativity of Natural Join
R ⋈ S ≡ S ⋈ R
The order of relations in a natural join doesn't affect the result. This is fundamental for choosing which relation to use as the inner vs. outer in nested-loop algorithms.
Rule J2: Commutativity of Theta Join
R ⋈_θ S ≡ S ⋈_θ R [with θ appropriately transformed]
Theta joins are also commutative, though the join condition may need to be rewritten if it's not symmetric.
Rule J3: Associativity of Natural Join
(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
This is the cornerstone of join ordering optimization. For n tables, there are n!/2 different join orders to consider.
Rule J4: Associativity of Theta Join
(R ⋈_{θ₁} S) ⋈_{θ₂} T ≡ R ⋈_{θ₁} (S ⋈_{θ₂∧θ₃} T)
[where θ₃ represents conditions involving attributes from R and T]
Theta join associativity is more complex because predicates may reference multiple relations.
Rule J5: Join and Cartesian Product
R ⋈_θ S ≡ σ_θ(R × S)
A theta join is equivalent to a Cartesian product followed by selection. This equivalence is typically traversed in the opposite direction—converting Cartesian products with subsequent selections into joins.
Rule J6: Natural Join via Projection
R ⋈ S ≡ π_{attrs(R)∪attrs(S)}(R ⋈_{R.A=S.A∧...} S)
A natural join is a theta join on common attributes with appropriate projection. This makes explicit what natural join implies.
These commutativity and associativity rules apply to inner joins. Outer joins (LEFT, RIGHT, FULL) have much more limited equivalences. For example, R LEFT JOIN S ≢ S LEFT JOIN R. Outer join optimization requires specialized rules and careful handling.
Distributivity rules specify how operations can be pushed through other operations. These rules are the mathematical foundation for "pushing" selections and projections toward leaf nodes in the query tree, which is typically the most effective optimization.
Rule D1: Selection Distributes over Join
σ_p(R ⋈ S) ≡ σ_p(R) ⋈ S [if p involves only attributes of R]
If a selection predicate only references one relation, it can be pushed to that relation before the join. This is the canonical "selection pushdown."
Rule D2: Selection Distributes over Join (Conjunctive)
σ_{p₁∧p₂}(R ⋈ S) ≡ σ_{p₁}(R) ⋈ σ_{p₂}(S)
[if p₁ involves only R, p₂ involves only S]
Conjunctive predicates can be split and pushed to their respective relations.
Rule D3: Selection Distributes over Union
σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
Selection can be pushed through union to both operands.
Rule D4: Selection Distributes over Set Difference
σ_p(R - S) ≡ σ_p(R) - S ≡ σ_p(R) - σ_p(S)
Selection pushes through set difference, but note the subtle point: we can apply it to just R, or to both.
Rule D5: Projection Distributes over Join
π_L(R ⋈ S) ≡ π_{L₁}(R) ⋈ π_{L₂}(S)
[where L₁ ⊆ attrs(R), L₂ ⊆ attrs(S), and join attributes are included]
Projection can be pushed down to remove unused attributes before joining, reducing tuple width.
123456789101112131415161718192021
-- Original QuerySELECT e.name, d.dept_nameFROM Employees eJOIN Departments d ON e.dept_id = d.idWHERE e.salary > 100000 AND d.location = 'NYC'; -- The optimizer transforms this using distributivity rules:-- Step 1: Decompose the selection (Rule S1)-- σ_{salary>100000 ∧ location='NYC'}(E ⋈ D)-- ≡ σ_{salary>100000}(σ_{location='NYC'}(E ⋈ D)) -- Step 2: Push down selections using D2-- ≡ σ_{salary>100000}(E) ⋈ σ_{location='NYC'}(D) -- This is equivalent to:SELECT e.name, d.dept_nameFROM (SELECT * FROM Employees WHERE salary > 100000) eJOIN (SELECT * FROM Departments WHERE location = 'NYC') dON e.dept_id = d.id; -- The second form filters BEFORE joining, dramatically reducing workDistributivity rules enable what's often called 'early selection' or 'early projection.' By reducing data volume as early as possible in the query plan, we minimize I/O, memory usage, and CPU time for all downstream operations. This is often the single most impactful class of optimizations.
The collection of equivalence rules forms a mathematical system with important formal properties that guarantee the correctness and completeness of equivalence-based optimization.
Property 1: Reflexivity
E ≡ E (every expression is equivalent to itself)
Property 2: Symmetry
If E₁ ≡ E₂, then E₂ ≡ E₁
Property 3: Transitivity
If E₁ ≡ E₂ and E₂ ≡ E₃, then E₁ ≡ E₃
These three properties establish that equivalence is an equivalence relation (in the formal mathematical sense), partitioning all possible expressions into equivalence classes. Each equivalence class contains all expressions that produce the same result.
The Optimization Paradigm
Given these properties, optimization can be viewed as:
In practice, enumerating the entire equivalence class is infeasible—there are typically infinitely many equivalent expressions (since rules can be applied repeatedly). Instead, optimizers use heuristics to explore promising portions of the search space.
| Operation | Commutative? | Associative? | Idempotent? | Notes |
|---|---|---|---|---|
| Selection (σ) | Yes (cascade) | Yes (cascade) | Yes | Can decompose conjunctions |
| Projection (π) | N/A | N/A | Yes | Nested projections collapse |
| Natural Join (⋈) | Yes | Yes | Yes (R⋈R≡R) | Full flexibility |
| Theta Join (⋈_θ) | Yes | Conditional | Conditional | Depends on condition |
| Union (∪) | Yes | Yes | Yes | Full flexibility |
| Intersection (∩) | Yes | Yes | Yes | Full flexibility |
| Set Difference (-) | No | No | No | Very limited transformations |
| Cartesian Product (×) | Yes | Yes | No | Avoid when possible |
Confluence and Termination
For rule-based optimization systems, two properties are critical:
Confluence: If we can reach expressions E₂ and E₃ from E₁, can we reach a common expression from both E₂ and E₃? Confluence ensures that the order of rule application doesn't matter for the final result.
Termination: Does repeated rule application eventually stop? Without termination guarantees, optimization could loop forever.
Most query optimizers don't have strict confluence or termination—they use cost-based cutoffs and don't explore all possibilities. However, for specific rule subsets (like selection pushdown), these properties can be proven.
Understanding equivalence rules changes how you think about writing queries. While optimizers try to find efficient plans regardless of how you write your SQL, there are practical implications.
WHERE conditions on joined tables, modern optimizers will push them down appropriately. You don't need to use subqueries to force early filtering.Not all SQL constructs participate fully in equivalence-based optimization: (1) User-defined functions may have side effects, preventing reordering. (2) Non-deterministic functions (RAND(), CURRENT_TIMESTAMP) can't be safely moved. (3) Queries with ORDER BY, LIMIT, or OFFSET have ordering semantics the optimizer must preserve.
The Declarative Promise
Equivalence rules make SQL's declarative nature powerful. When you specify:
SELECT ... FROM A, B, C WHERE <conditions>
You're not saying "scan A, cross with B, cross with C, then filter." You're saying "give me tuples satisfying these conditions from the Cartesian product of A, B, C." The optimizer is free to:
This separation of specification from implementation is what makes database systems so powerful and durable across decades of hardware evolution.
We've established the mathematical foundation that enables query optimizers to transform queries while guaranteeing correctness. Let's consolidate the key insights:
What's Next
Now that we understand the general framework of equivalence rules, we'll dive deep into selection pushdown—arguably the single most important optimization technique. We'll see exactly how and when selections can be pushed through various operations, and the dramatic performance improvements this enables.
You now understand the algebraic foundations of query optimization. These equivalence rules are what enable databases to execute your declarative SQL specifications efficiently, regardless of how you write them. Next, we'll explore selection pushdown in comprehensive detail.