If there is one optimization technique that every database professional must understand deeply, it is selection pushdown. This deceptively simple idea—applying filter conditions as early as possible in query execution—is responsible for transforming queries that would take hours into queries that complete in seconds.
The core insight is profound: every tuple that fails a filter condition is wasted work if processed before filtering. A join might produce millions of intermediate tuples, only for a subsequent selection to discard 99% of them. Push that selection before the join, and you've eliminated 99% of the join's work.
This page provides a comprehensive treatment of selection pushdown: when it applies, when it doesn't, the exact conditions required for correctness, and how modern optimizers implement it.
By the end of this page, you will understand: (1) the formal conditions for safe selection pushdown, (2) how to push selections through joins, unions, and other operators, (3) when pushdown is impossible or counterproductive, and (4) how real databases implement this optimization.
The Fundamental Idea
Selection pushdown is based on a simple observation about data flow in query execution:
If a tuple will eventually be filtered out by a selection condition, it's better to filter it out as early as possible, before expending resources processing it through expensive operations like joins.
Formal Statement
Given a query plan with a selection σ_p applied after some operation Op, we seek to "push" the selection below Op, applying it earlier in the execution tree. This is valid when the rewritten plan returns exactly the same result on every database instance—in particular, every attribute referenced by p must be available below Op, so that p evaluates identically on each tuple before and after the move.
Why Is This So Powerful?
Consider the cost implications with concrete numbers:
Table Orders: 10,000,000 rows
Table Customers: 100,000 rows
Query: Find orders from customers in 'California'
Selectivity of state='California': 5%
Without pushdown: the join matches all 10,000,000 orders against 100,000 customers, producing roughly 10,000,000 intermediate tuples, and the filter then discards 95% of them.
With pushdown: the filter first reduces Customers to 5,000 rows, so the join produces only the ~500,000 qualifying tuples.
The difference is roughly 20×—and this is a modest example. In real queries with multiple joins, the savings compound multiplicatively.
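A quick back-of-the-envelope check of these numbers (the cost model simply counts tuples flowing through the join, and assumes orders are spread evenly across customers):

```python
# Back-of-the-envelope cost model: count tuples flowing through the join.
orders = 10_000_000
customers = 100_000
selectivity = 0.05  # fraction of customers with state = 'California'

# Without pushdown: the join processes every order, then the filter
# discards 95% of the joined result.
joined_without = orders                        # every order finds its customer
surviving = int(joined_without * selectivity)  # rows left after the late filter

# With pushdown: only 5% of customers enter the join, so (assuming orders
# are spread evenly across customers) only 5% of orders find a match.
customers_pushed = int(customers * selectivity)  # 5,000 customers enter the join
joined_with = int(orders * selectivity)          # 500,000 join outputs

speedup = joined_without / joined_with
print(customers_pushed, joined_with, surviving, speedup)  # 5000 500000 500000 20.0
```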
Selection pushdown benefits compound through the query plan. If you filter 90% of rows before a 3-way join, you're not just doing 10% of the work—you might be doing 0.1% (10% × 10% × 10%) because the reduction applies at each join stage.
Join operations are typically the most expensive in query processing, making selection pushdown through joins particularly valuable. The rules differ based on the type of join.
Rule: σ_p(R ⋈ S) Transformation
For inner joins, selection can be pushed down based on which relation's attributes the predicate references:
Case 1: Predicate references only R's attributes
σ_p(R ⋈ S) ≡ σ_p(R) ⋈ S
The selection is pushed entirely to R.
Case 2: Predicate references only S's attributes
σ_p(R ⋈ S) ≡ R ⋈ σ_p(S)
The selection is pushed entirely to S.
Case 3: Predicate references both R and S (join-spanning predicate)
σ_p(R ⋈ S) [cannot be pushed below the join as-is]
If p references attributes from both relations, it cannot be pushed past the join. However, if p is a conjunction:
Case 4: Conjunctive predicate with separable components
σ_{p₁∧p₂∧p₃}(R ⋈ S)
≡ σ_{p₃}(σ_{p₁}(R) ⋈ σ_{p₂}(S))
Where p₁ references only R, p₂ references only S, and p₃ references both (and must remain after the join).
-- Original query with selection after join
SELECT o.order_id, c.name
FROM Orders o
JOIN Customers c ON o.customer_id = c.id
WHERE c.country = 'USA'               -- Only references Customers
  AND o.amount > 1000                 -- Only references Orders
  AND o.order_date > c.member_since;  -- References both

-- After selection pushdown (equivalent query, showing the conceptual transformation):
/* 1. c.country = 'USA' pushed to Customers
   2. o.amount > 1000 pushed to Orders
   3. o.order_date > c.member_since remains on the join */
SELECT o.order_id, c.name
FROM (SELECT * FROM Orders WHERE amount > 1000) o
JOIN (SELECT * FROM Customers WHERE country = 'USA') c
  ON o.customer_id = c.id
WHERE o.order_date > c.member_since;
-- The optimizer sees both queries as equivalent and chooses the pushed form

Outer joins require more careful handling because they preserve unmatched tuples. Pushing selections incorrectly can change null-padded results.
LEFT OUTER JOIN: R ⟕ S
On the preserved side (R):
σ_p(R ⟕ S) ≡ σ_p(R) ⟕ S [if p references only R]
This is safe: R tuples that fail p would be removed by the selection after the join anyway, and the surviving R tuples keep the "no match → NULL pad" semantics.
On the null-padded side (S):
σ_p(R ⟕ S) ≢ R ⟕ σ_p(S) [in general, NOT equivalent!]
Pushing selection to S can convert an outer join into something semantically different. However, there's a subtle case:
Safe pushdown into null-padded side:
σ_p(R ⟕ S) ≡ R ⟕ σ_p(S)
[ONLY if p is 'null-rejecting' on S, i.e., σ_p returns false for NULL]
A null-rejecting predicate (like S.column IS NOT NULL or S.value > 0) will reject all null-padded tuples anyway. In this case, the outer join effectively becomes an inner join, and pushdown is safe.
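A sketch of how null-rejection might be tested: evaluate the predicate on a tuple whose S-attributes are all NULL under SQL's three-valued logic. The helper names here are illustrative, not any real engine's API; None stands in for both NULL and UNKNOWN.

```python
# A predicate is null-rejecting on S if it cannot evaluate to TRUE
# when all of S's attributes are NULL.

def gt(a, b):
    """SQL-style '>': UNKNOWN (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a > b

def null_rejecting(pred):
    """Evaluate pred on a fully NULL-padded S tuple; null-rejecting
    means the result is not TRUE (i.e., FALSE or UNKNOWN)."""
    null_padded = {"value": None, "column": None}
    return pred(null_padded) is not True

# S.value > 0 rejects null-padded tuples (it evaluates to UNKNOWN):
print(null_rejecting(lambda s: gt(s["value"], 0)))    # True
# S.column IS NULL *keeps* null-padded tuples — not null-rejecting:
print(null_rejecting(lambda s: s["column"] is None))  # False
```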
Incorrectly pushing selections through outer joins is a common source of optimizer bugs. If your query results unexpectedly differ from expectations with outer joins, check whether selection pushdown is incorrectly transforming your outer join semantics.
Set operations (UNION, INTERSECT, EXCEPT) have their own rules for selection pushdown.
Full Pushdown:
σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
Selection distributes completely over union. Every tuple in the result was in R or S, and we apply the same predicate to both. This is always safe.
Partial or Full Pushdown:
σ_p(R ∩ S) ≡ σ_p(R) ∩ σ_p(S)
≡ σ_p(R) ∩ S
≡ R ∩ σ_p(S)
For intersection, pushing to even one side is sufficient (though pushing to both is also correct). If a tuple survives the intersection, it must be in both R and S. If it satisfies p in the result, it must satisfy p in whichever relation it came from.
Asymmetric Rules:
σ_p(R - S) ≡ σ_p(R) - S
≡ σ_p(R) - σ_p(S)
For set difference, we can always push to R (the left operand). Whether we push to S doesn't matter for correctness, but it might help performance if it reduces S before the difference computation.
σ_p(R - S) ≢ R - σ_p(S) [NOT equivalent!]
We cannot push selection only to S without also applying it to R. Doing so would incorrectly include tuples from R that don't satisfy p.
| Operation | Push to Left | Push to Right | Push to Both | Notes |
|---|---|---|---|---|
| R ∪ S | Must push to both | Must push to both | Required | Selection must apply to all sources |
| R ∩ S | Sufficient | Sufficient | Optional | Pushing to one side is enough |
| R − S | Required | Optional | Safe | Must push to R; S is optional |
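These rules are easy to verify on small concrete inputs; the following sketch uses Python sets as stand-ins for relations (set semantics, so duplicates are ignored):

```python
# Quick check of the set-operation pushdown rules on concrete data.
R = {1, 2, 3, 4, 5}
S = {4, 5, 6, 7}
p = lambda x: x % 2 == 0                 # the selection predicate
sel = lambda rel: {x for x in rel if p(x)}

# Union: the selection must be pushed to both sides.
assert sel(R | S) == sel(R) | sel(S)

# Intersection: pushing to either side (or both) is sufficient.
assert sel(R & S) == sel(R) & S == R & sel(S) == sel(R) & sel(S)

# Difference: must push to the left; pushing only to S is wrong.
assert sel(R - S) == sel(R) - S == sel(R) - sel(S)
assert sel(R - S) != R - sel(S)          # {2} vs {1, 2, 3, 5}
print("all equivalences hold")
```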
Aggregation with GROUP BY introduces complexity for selection pushdown. The key distinction is whether the predicate references only grouping columns (pushable below the aggregation) or aggregate results (not pushable).
σ_{p}(γ_{G,F}(R)) ≡ γ_{G,F}(σ_{p}(R)) [if p references only grouping attributes in G]
Where γ_{G,F} denotes grouping by G and applying aggregate functions F.
Example:
-- Original: filter after grouping on department (grouping column)
SELECT dept, SUM(salary)
FROM Employees
GROUP BY dept
HAVING dept = 'Engineering'; -- 'dept' is a grouping column
-- Equivalent: filter before grouping
SELECT dept, SUM(salary)
FROM Employees
WHERE dept = 'Engineering'
GROUP BY dept;
-- WHERE is applied before aggregation, HAVING after
This pushdown is beneficial because it reduces the number of tuples that must be grouped and aggregated.
σ_{p}(γ_{G,F}(R)) [cannot be pushed if p references aggregate results in F]
Predicates on aggregate results (HAVING conditions like SUM(salary) > 100000) cannot be pushed before the aggregation. The aggregate value doesn't exist until grouping completes.
The SQL distinction between WHERE (applied before grouping) and HAVING (applied after) directly maps to pushable vs. non-pushable predicates. Moving conditions from HAVING to WHERE when possible is a form of selection pushdown that every SQL optimizer performs.
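The WHERE-vs-HAVING equivalence can be checked on a tiny dataset; this sketch uses plain Python dicts, not any particular database:

```python
# Filtering on a grouping column before vs. after aggregation gives the
# same result; a filter on the aggregate can only run after grouping.
from collections import defaultdict

employees = [
    ("Engineering", 120), ("Engineering", 100),
    ("Sales", 80), ("Sales", 90),
]

def group_sum(rows):
    """GROUP BY dept, SUM(salary)."""
    totals = defaultdict(int)
    for dept, salary in rows:
        totals[dept] += salary
    return dict(totals)

# HAVING dept = 'Engineering'  (filter after grouping)
after = {d: s for d, s in group_sum(employees).items() if d == "Engineering"}
# WHERE dept = 'Engineering'   (filter before grouping — pushed down)
before = group_sum(r for r in employees if r[0] == "Engineering")
assert after == before == {"Engineering": 220}

# HAVING SUM(salary) > 200 references the aggregate — it must be
# evaluated after grouping completes, since the sum does not exist before.
having = {d: s for d, s in group_sum(employees).items() if s > 200}
assert having == {"Engineering": 220}
```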
The core technical challenge in selection pushdown is determining which attributes a predicate references and whether those attributes are available at a given point in the query tree. This requires attribute dependency analysis.
For each predicate p, we compute attrs(p): the set of attributes referenced by p.
Simple predicates:
salary > 50000: attrs = {salary}
R.x = S.y: attrs = {R.x, S.y}
a + b < c * 2: attrs = {a, b, c}
Complex predicates:
EXISTS (SELECT ...): attrs include correlated columns from the outer query
x IN (SELECT y FROM T): attrs = {x} for simple IN; more complex for correlated subqueries
At each node in the query tree, we track which attributes are available (produced by that subtree):
available(R) = schema(R) [for base table R]
available(R ⋈ S) = available(R) ∪ available(S)
available(σ_p(R)) = available(R)
available(π_L(R)) = L
A predicate p can be pushed to a subtree T if and only if:
attrs(p) ⊆ available(T)
If attrs(p) is not a subset of available(T), the predicate references attributes not yet computed, and pushdown is impossible.
-- Query: Find high-value orders from premium customers in 2024
SELECT c.name, o.amount
FROM Customers c
JOIN Orders o ON c.id = o.customer_id
WHERE c.tier = 'Premium'          -- attrs = {c.tier}
  AND o.amount > 10000            -- attrs = {o.amount}
  AND o.year = 2024               -- attrs = {o.year}
  AND o.amount > c.credit_limit;  -- attrs = {o.amount, c.credit_limit}

-- Attribute availability analysis:
-- At node "Customers": available = {c.id, c.name, c.tier, c.credit_limit, ...}
-- At node "Orders":    available = {o.order_id, o.customer_id, o.amount, o.year, ...}
-- At node "Join":      available = {c.*, o.*} -- all columns from both

-- Pushdown decisions:
-- c.tier = 'Premium':        attrs ⊆ available(Customers) → PUSH TO CUSTOMERS
-- o.amount > 10000:          attrs ⊆ available(Orders)    → PUSH TO ORDERS
-- o.year = 2024:             attrs ⊆ available(Orders)    → PUSH TO ORDERS
-- o.amount > c.credit_limit: attrs ⊆ available(Join)      → REMAIN ON JOIN

Projections complicate attribute availability. If a projection removes an attribute, predicates referencing that attribute cannot be pushed below the projection.
π_{name}(σ_{age>30}(R)) ≢ σ_{age>30}(π_{name}(R)) -- age is projected away before the selection can use it
The optimizer must ensure that projected-out attributes needed for selections are temporarily retained:
π_{name}(σ_{age>30}(R)) ≡ π_{name}(σ_{age>30}(π_{name,age}(R)))
Note how age is retained through intermediate projections until the selection is applied.
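The retained-attribute rule can be checked on a toy relation (plain Python lists of dicts standing in for tuples; the operator helpers are illustrative):

```python
# 'age' must survive the intermediate projection until the selection has
# used it; only then can it be projected away.
R = [
    {"name": "Ana", "age": 35, "dept": "Eng"},
    {"name": "Bo",  "age": 28, "dept": "Sales"},
]
project = lambda rows, cols: [{c: r[c] for c in cols} for r in rows]
select  = lambda rows, pred: [r for r in rows if pred(r)]
old = lambda r: r["age"] > 30

# π_name(σ_age>30(R))
direct = project(select(R, old), ["name"])
# π_name(σ_age>30(π_{name,age}(R))) — intermediate projection keeps 'age'
pushed = project(select(project(R, ["name", "age"]), old), ["name"])
assert direct == pushed == [{"name": "Ana"}]
```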
Modern query optimizers implement selection pushdown as a systematic transformation pass over the query tree. Here's a detailed look at how this works.
The canonical approach is a top-down traversal with predicate accumulation:
function pushdownSelection(node, pendingPredicates):
    // If this node is a selection, decompose its predicate into
    // conjuncts, accumulate them, and continue downward
    if node is Selection(predicate, child):
        predicates = pendingPredicates ∪ decompose(predicate)
        return pushdownSelection(child, predicates)

    // For joins, partition the accumulated predicates by side
    if node is Join(left, right, joinCond):
        leftOnly  = {p ∈ pendingPredicates | attrs(p) ⊆ available(left)}
        rightOnly = {p ∈ pendingPredicates | attrs(p) ⊆ available(right)}
        remaining = pendingPredicates - leftOnly - rightOnly

        newLeft  = pushdownSelection(left, leftOnly)
        newRight = pushdownSelection(right, rightOnly)
        newJoin  = Join(newLeft, newRight, joinCond)

        // Predicates that span both sides must stay above the join
        if remaining is not empty:
            return Selection(conjoin(remaining), newJoin)
        else:
            return newJoin

    // For base tables, apply all accumulated predicates
    if node is BaseTable(R):
        if pendingPredicates is not empty:
            return Selection(conjoin(pendingPredicates), R)
        else:
            return R

    // Handle other operators (Union, Aggregation, etc.) similarly
    ...

Conjunctive predicates (AND-connected conditions) are decomposed to enable maximum pushdown:
σ_{A∧B∧C}(R ⋈ S)
→ decompose to {A, B, C}
→ push each independently
This is why writing WHERE a AND b AND c enables better optimization than complex conditions that can't be decomposed.
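The traversal described above can be turned into a small runnable sketch. The node classes and the predicate representation (each predicate as a name plus its attribute set) are simplified stand-ins, not a real optimizer's data structures:

```python
# Top-down selection pushdown over a toy plan tree.
from dataclasses import dataclass

@dataclass
class BaseTable:
    name: str
    columns: frozenset

@dataclass
class Join:
    left: object
    right: object
    cond: str

@dataclass
class Selection:
    predicates: frozenset   # set of (pred_text, attrs) conjuncts
    child: object

def available(node):
    """Attributes produced by the subtree rooted at node."""
    if isinstance(node, BaseTable):
        return node.columns
    if isinstance(node, Join):
        return available(node.left) | available(node.right)
    return available(node.child)

def pushdown(node, pending=frozenset()):
    if isinstance(node, Selection):
        # Accumulate this node's conjuncts and continue downward.
        return pushdown(node.child, pending | node.predicates)
    if isinstance(node, Join):
        left_only  = {p for p in pending if p[1] <= available(node.left)}
        right_only = {p for p in pending if p[1] <= available(node.right)}
        new_join = Join(pushdown(node.left, frozenset(left_only)),
                        pushdown(node.right, frozenset(right_only)),
                        node.cond)
        remaining = pending - left_only - right_only
        # Join-spanning predicates stay above the join.
        return Selection(frozenset(remaining), new_join) if remaining else new_join
    # Base table: apply whatever accumulated.
    return Selection(pending, node) if pending else node

orders = BaseTable("Orders", frozenset({"o.amount", "o.customer_id"}))
custs  = BaseTable("Customers", frozenset({"c.id", "c.country"}))
plan = Selection(
    frozenset({("c.country='USA'", frozenset({"c.country"})),
               ("o.amount>1000", frozenset({"o.amount"}))}),
    Join(orders, custs, "o.customer_id = c.id"))

optimized = pushdown(plan)
assert isinstance(optimized, Join)            # the top selection dissolved
assert isinstance(optimized.left, Selection)  # pushed onto Orders
assert isinstance(optimized.right, Selection) # pushed onto Customers
```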
1. Predicate Cloning vs. Moving
Some optimizers move predicates (removing them from higher nodes), while others clone them (applying them both early and late). Cloning is safer but may add redundant checks.
2. Interaction with Indexes
When predicates are pushed to base tables, the optimizer can consider using indexes:
σ_{dept='Engineering'}(Employees) → Use index on 'dept' if available
3. Predicate Ordering
When multiple predicates are pushed to the same node, the order matters for performance. More selective predicates (rejecting more tuples) should typically be evaluated first.
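A simple cost model shows why the ordering matters. This sketch counts only expected predicate evaluations per tuple; real optimizers also weigh per-predicate evaluation cost:

```python
# Expected predicate evaluations per tuple for a given evaluation order:
# the first predicate always runs; each later one runs only for tuples
# that passed all earlier predicates.

def expected_evals(selectivities):
    total, passing = 0.0, 1.0
    for s in selectivities:
        total += passing
        passing *= s
    return total

preds = [0.9, 0.5, 0.01]   # pass rates (selectivities) of three predicates
worst = expected_evals(sorted(preds, reverse=True))  # least selective first
best  = expected_evals(sorted(preds))                # most selective first
print(round(worst, 3), round(best, 3))               # 2.35 1.015
assert best < worst
```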
Some optimizers use a bottom-up approach: annotate each node with its available attributes, then walk up collecting selections that can be pushed down. Both approaches achieve the same result; the choice is implementation-dependent.
While selection pushdown is generally beneficial, there are cases where it's impossible, counterproductive, or requires caution.
1. Predicates referencing computed values:
SELECT a + b AS sum, c
FROM R
WHERE sum > 100; -- 'sum' doesn't exist until computed
The predicate must wait until the projection/computation is complete.
2. Predicates with correlated subqueries:
SELECT * FROM R
WHERE x IN (SELECT y FROM S WHERE S.z = R.w);
The subquery references the outer table; complex dependency analysis required.
3. Predicates referencing window function results:
SELECT *, rank() OVER (ORDER BY salary) AS rnk
FROM Employees
WHERE rnk <= 10; -- Window function computed after base scan
Modern optimizers don't blindly push all predicates. They consider:
The optimizer may choose to not push a predicate if doing so results in a worse overall plan. This is why cost-based optimization complements rule-based transformation.
Some SQL constructs create semantic barriers that block pushdown: DISTINCT, LIMIT, certain uses of ORDER BY, and database-specific features. The optimizer must recognize these barriers and not push predicates across them incorrectly.
Selection pushdown is perhaps the most impactful single optimization in query processing. Let's consolidate what we've learned:
What's Next
We've thoroughly covered selection pushdown. Next, we'll examine projection pushdown—the analogous technique for reducing tuple width by eliminating unused attributes as early as possible. While not as dramatic as selection pushdown, projection optimization significantly reduces memory and I/O costs throughout query execution.
You now have comprehensive knowledge of selection pushdown—the most important filter optimization technique. Apply this knowledge when analyzing query plans: look for selections that could be pushed earlier, and understand why the optimizer made the choices it did.