If there is one optimization technique that every database professional must understand deeply, it is selection pushdown. This deceptively simple idea—applying filter conditions as early as possible in query execution—is responsible for transforming queries that would take hours into queries that complete in seconds.
The core insight is profound: every tuple that fails a filter condition is wasted work if processed before filtering. A join might produce millions of intermediate tuples, only for a subsequent selection to discard 99% of them. Push that selection before the join, and you've eliminated 99% of the join's work.
This page provides a comprehensive treatment of selection pushdown: when it applies, when it doesn't, the exact conditions required for correctness, and how modern optimizers implement it.
By the end of this page, you will understand: (1) the formal conditions for safe selection pushdown, (2) how to push selections through joins, unions, and other operators, (3) when pushdown is impossible or counterproductive, and (4) how real databases implement this optimization.
The Fundamental Idea
Selection pushdown is based on a simple observation about data flow in query execution:
If a tuple will eventually be filtered out by a selection condition, it's better to filter it out as early as possible, before expending resources processing it through expensive operations like joins.
Formal Statement
Given a query plan with a selection σ_p applied after some operation Op, we seek to "push" the selection below Op, applying it earlier in the execution tree. This is valid when the rewritten plan returns exactly the same result on every database instance—in particular, every attribute referenced by p must be available below Op, so that p evaluates identically on each tuple before and after the move.
Why Is This So Powerful?
Consider the cost implications with concrete numbers:
Table Orders: 10,000,000 rows
Table Customers: 100,000 rows
Query: Find orders from customers in 'California'
Selectivity of state='California': 5%
Without pushdown: the join matches all 10,000,000 orders against 100,000 customers, producing roughly 10,000,000 intermediate tuples, and the filter then discards 95% of them.
With pushdown: the filter first reduces Customers to 5,000 rows, so the join produces only the ~500,000 qualifying tuples.
The difference is roughly 20×—and this is a modest example. In real queries with multiple joins, the savings compound multiplicatively.
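A quick back-of-the-envelope check of these numbers (the cost model simply counts tuples flowing through the join, and assumes orders are spread evenly across customers):

```python
# Back-of-the-envelope cost model: count tuples flowing through the join.
orders = 10_000_000
customers = 100_000
selectivity = 0.05  # fraction of customers with state = 'California'

# Without pushdown: the join processes every order, then the filter
# discards 95% of the joined result.
joined_without = orders                        # every order finds its customer
surviving = int(joined_without * selectivity)  # rows left after the late filter

# With pushdown: only 5% of customers enter the join, so (assuming orders
# are spread evenly across customers) only 5% of orders find a match.
customers_pushed = int(customers * selectivity)  # 5,000 customers enter the join
joined_with = int(orders * selectivity)          # 500,000 join outputs

speedup = joined_without / joined_with
print(customers_pushed, joined_with, surviving, speedup)  # 5000 500000 500000 20.0
```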
Selection pushdown benefits compound through the query plan. If you filter 90% of rows before a 3-way join, you're not just doing 10% of the work—you might be doing 0.1% (10% × 10% × 10%) because the reduction applies at each join stage.
Join operations are typically the most expensive in query processing, making selection pushdown through joins particularly valuable. The rules differ based on the type of join.
Rule: σ_p(R ⋈ S) Transformation
For inner joins, selection can be pushed down based on which relation's attributes the predicate references:
Case 1: Predicate references only R's attributes
σ_p(R ⋈ S) ≡ σ_p(R) ⋈ S
The selection is pushed entirely to R.
Case 2: Predicate references only S's attributes
σ_p(R ⋈ S) ≡ R ⋈ σ_p(S)
The selection is pushed entirely to S.
Case 3: Predicate references both R and S (join-spanning predicate)
σ_p(R ⋈ S) [cannot be pushed below the join as-is]
If p references attributes from both relations, it cannot be pushed past the join. However, if p is a conjunction:
Case 4: Conjunctive predicate with separable components
σ_{p₁∧p₂∧p₃}(R ⋈ S)
≡ σ_{p₃}(σ_{p₁}(R) ⋈ σ_{p₂}(S))
Where p₁ references only R, p₂ references only S, and p₃ references both (and must remain after the join).
-- Original query with selection after join
SELECT o.order_id, c.name
FROM Orders o
JOIN Customers c ON o.customer_id = c.id
WHERE c.country = 'USA'               -- Only references Customers
  AND o.amount > 1000                 -- Only references Orders
  AND o.order_date > c.member_since;  -- References both

-- After selection pushdown (equivalent query, showing the conceptual transformation):
/* 1. c.country = 'USA' pushed to Customers
   2. o.amount > 1000 pushed to Orders
   3. o.order_date > c.member_since remains on the join */
SELECT o.order_id, c.name
FROM (SELECT * FROM Orders WHERE amount > 1000) o
JOIN (SELECT * FROM Customers WHERE country = 'USA') c
  ON o.customer_id = c.id
WHERE o.order_date > c.member_since;
-- The optimizer sees both queries as equivalent and chooses the pushed form

Outer joins require more careful handling because they preserve unmatched tuples. Pushing selections incorrectly can change null-padded results.
LEFT OUTER JOIN: R ⟕ S
On the preserved side (R):
σ_p(R ⟕ S) ≡ σ_p(R) ⟕ S [if p references only R]
This is safe: R tuples that fail p would be removed by the selection after the join anyway, and the surviving R tuples keep the "no match → NULL pad" semantics.
On the null-padded side (S):
σ_p(R ⟕ S) ≢ R ⟕ σ_p(S) [in general, NOT equivalent!]
Pushing selection to S can convert an outer join into something semantically different. However, there's a subtle case:
Safe pushdown into null-padded side:
σ_p(R ⟕ S) ≡ R ⟕ σ_p(S)
[ONLY if p is 'null-rejecting' on S, i.e., σ_p returns false for NULL]
A null-rejecting predicate (like S.column IS NOT NULL or S.value > 0) will reject all null-padded tuples anyway. In this case, the outer join effectively becomes an inner join, and pushdown is safe.
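A sketch of how null-rejection might be tested: evaluate the predicate on a tuple whose S-attributes are all NULL under SQL's three-valued logic. The helper names here are illustrative, not any real engine's API; None stands in for both NULL and UNKNOWN.

```python
# A predicate is null-rejecting on S if it cannot evaluate to TRUE
# when all of S's attributes are NULL.

def gt(a, b):
    """SQL-style '>': UNKNOWN (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a > b

def null_rejecting(pred):
    """Evaluate pred on a fully NULL-padded S tuple; null-rejecting
    means the result is not TRUE (i.e., FALSE or UNKNOWN)."""
    null_padded = {"value": None, "column": None}
    return pred(null_padded) is not True

# S.value > 0 rejects null-padded tuples (it evaluates to UNKNOWN):
print(null_rejecting(lambda s: gt(s["value"], 0)))    # True
# S.column IS NULL *keeps* null-padded tuples — not null-rejecting:
print(null_rejecting(lambda s: s["column"] is None))  # False
```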
Incorrectly pushing selections through outer joins is a common source of optimizer bugs. If your query results unexpectedly differ from expectations with outer joins, check whether selection pushdown is incorrectly transforming your outer join semantics.
Set operations (UNION, INTERSECT, EXCEPT) have their own rules for selection pushdown.
Full Pushdown:
σ_p(R ∪ S) ≡ σ_p(R) ∪ σ_p(S)
Selection distributes completely over union. Every tuple in the result was in R or S, and we apply the same predicate to both. This is always safe.
Partial or Full Pushdown:
σ_p(R ∩ S) ≡ σ_p(R) ∩ σ_p(S)
≡ σ_p(R) ∩ S
≡ R ∩ σ_p(S)
For intersection, pushing to even one side is sufficient (though pushing to both is also correct). If a tuple survives the intersection, it must be in both R and S. If it satisfies p in the result, it must satisfy p in whichever relation it came from.
Asymmetric Rules:
σ_p(R - S) ≡ σ_p(R) - S
≡ σ_p(R) - σ_p(S)
For set difference, we can always push to R (the left operand). Whether we push to S doesn't matter for correctness, but it might help performance if it reduces S before the difference computation.
σ_p(R - S) ≢ R - σ_p(S) [NOT equivalent!]
We cannot push selection only to S without also applying it to R. Doing so would incorrectly include tuples from R that don't satisfy p.
| Operation | Push to Left | Push to Right | Push to Both | Notes |
|---|---|---|---|---|
| R ∪ S | Must push to both | Must push to both | Required | Selection must apply to all sources |
| R ∩ S | Sufficient | Sufficient | Optional | Pushing to one side is enough |
| R − S | Required | Optional | Safe | Must push to R; S is optional |
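These rules are easy to verify on small concrete inputs; the following sketch uses Python sets as stand-ins for relations (set semantics, so duplicates are ignored):

```python
# Quick check of the set-operation pushdown rules on concrete data.
R = {1, 2, 3, 4, 5}
S = {4, 5, 6, 7}
p = lambda x: x % 2 == 0                 # the selection predicate
sel = lambda rel: {x for x in rel if p(x)}

# Union: the selection must be pushed to both sides.
assert sel(R | S) == sel(R) | sel(S)

# Intersection: pushing to either side (or both) is sufficient.
assert sel(R & S) == sel(R) & S == R & sel(S) == sel(R) & sel(S)

# Difference: must push to the left; pushing only to S is wrong.
assert sel(R - S) == sel(R) - S == sel(R) - sel(S)
assert sel(R - S) != R - sel(S)          # {2} vs {1, 2, 3, 5}
print("all equivalences hold")
```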
Aggregation with GROUP BY introduces complexity for selection pushdown. The key distinction is whether the predicate references only grouping columns (pushable below the aggregation) or aggregate results (not pushable).
σ_{p}(γ_{G,F}(R)) ≡ γ_{G,F}(σ_{p}(R)) [if p references only grouping attributes in G]
Where γ_{G,F} denotes grouping by G and applying aggregate functions F.
Example:
-- Original: filter after grouping on department (grouping column)
SELECT dept, SUM(salary)
FROM Employees
GROUP BY dept
HAVING dept = 'Engineering'; -- 'dept' is a grouping column
-- Equivalent: filter before grouping
SELECT dept, SUM(salary)
FROM Employees
WHERE dept = 'Engineering'
GROUP BY dept;
-- WHERE is applied before aggregation, HAVING after
This pushdown is beneficial because it reduces the number of tuples that must be grouped and aggregated.
σ_{p}(γ_{G,F}(R)) [cannot be pushed if p references aggregate results in F]
Predicates on aggregate results (HAVING conditions like SUM(salary) > 100000) cannot be pushed before the aggregation. The aggregate value doesn't exist until grouping completes.
The SQL distinction between WHERE (applied before grouping) and HAVING (applied after) directly maps to pushable vs. non-pushable predicates. Moving conditions from HAVING to WHERE when possible is a form of selection pushdown that every SQL optimizer performs.
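The WHERE-vs-HAVING equivalence can be checked on a tiny dataset; this sketch uses plain Python dicts, not any particular database:

```python
# Filtering on a grouping column before vs. after aggregation gives the
# same result; a filter on the aggregate can only run after grouping.
from collections import defaultdict

employees = [
    ("Engineering", 120), ("Engineering", 100),
    ("Sales", 80), ("Sales", 90),
]

def group_sum(rows):
    """GROUP BY dept, SUM(salary)."""
    totals = defaultdict(int)
    for dept, salary in rows:
        totals[dept] += salary
    return dict(totals)

# HAVING dept = 'Engineering'  (filter after grouping)
after = {d: s for d, s in group_sum(employees).items() if d == "Engineering"}
# WHERE dept = 'Engineering'   (filter before grouping — pushed down)
before = group_sum(r for r in employees if r[0] == "Engineering")
assert after == before == {"Engineering": 220}

# HAVING SUM(salary) > 200 references the aggregate — it must be
# evaluated after grouping completes, since the sum does not exist before.
having = {d: s for d, s in group_sum(employees).items() if s > 200}
assert having == {"Engineering": 220}
```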
The core technical challenge in selection pushdown is determining which attributes a predicate references and whether those attributes are available at a given point in the query tree. This requires attribute dependency analysis.
For each predicate p, we compute attrs(p): the set of attributes referenced by p.
Simple predicates:
salary > 50000: attrs = {salary}
R.x = S.y: attrs = {R.x, S.y}
a + b < c * 2: attrs = {a, b, c}
Complex predicates:
EXISTS (SELECT ...): attrs include correlated columns from the outer query
x IN (SELECT y FROM T): attrs = {x} for simple IN; more complex for correlated subqueries
At each node in the query tree, we track which attributes are available (produced by that subtree):
available(R) = schema(R) [for base table R]
available(R ⋈ S) = available(R) ∪ available(S)
available(σ_p(R)) = available(R)
available(π_L(R)) = L
A predicate p can be pushed to a subtree T if and only if:
attrs(p) ⊆ available(T)
If attrs(p) is not a subset of available(T), the predicate references attributes not yet computed, and pushdown is impossible.
-- Query: Find high-value orders from premium customers in 2024
SELECT c.name, o.amount
FROM Customers c
JOIN Orders o ON c.id = o.customer_id
WHERE c.tier = 'Premium'          -- attrs = {c.tier}
  AND o.amount > 10000            -- attrs = {o.amount}
  AND o.year = 2024               -- attrs = {o.year}
  AND o.amount > c.credit_limit;  -- attrs = {o.amount, c.credit_limit}

-- Attribute availability analysis:
-- At node "Customers": available = {c.id, c.name, c.tier, c.credit_limit, ...}
-- At node "Orders":    available = {o.order_id, o.customer_id, o.amount, o.year, ...}
-- At node "Join":      available = {c.*, o.*} -- all columns from both

-- Pushdown decisions:
-- c.tier = 'Premium':        attrs ⊆ available(Customers) → PUSH TO CUSTOMERS
-- o.amount > 10000:          attrs ⊆ available(Orders)    → PUSH TO ORDERS
-- o.year = 2024:             attrs ⊆ available(Orders)    → PUSH TO ORDERS
-- o.amount > c.credit_limit: attrs ⊆ available(Join)      → REMAIN ON JOIN

Projections complicate attribute availability. If a projection removes an attribute, predicates referencing that attribute cannot be pushed below the projection.
π_{name}(σ_{age>30}(R)) ≢ σ_{age>30}(π_{name}(R)) -- age is projected away before the selection can use it
The optimizer must ensure that projected-out attributes needed for selections are temporarily retained:
π_{name}(σ_{age>30}(R)) ≡ π_{name}(σ_{age>30}(π_{name,age}(R)))
Note how age is retained through intermediate projections until the selection is applied.
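The retained-attribute rule can be checked on a toy relation (plain Python lists of dicts standing in for tuples; the operator helpers are illustrative):

```python
# 'age' must survive the intermediate projection until the selection has
# used it; only then can it be projected away.
R = [
    {"name": "Ana", "age": 35, "dept": "Eng"},
    {"name": "Bo",  "age": 28, "dept": "Sales"},
]
project = lambda rows, cols: [{c: r[c] for c in cols} for r in rows]
select  = lambda rows, pred: [r for r in rows if pred(r)]
old = lambda r: r["age"] > 30

# π_name(σ_age>30(R))
direct = project(select(R, old), ["name"])
# π_name(σ_age>30(π_{name,age}(R))) — intermediate projection keeps 'age'
pushed = project(select(project(R, ["name", "age"]), old), ["name"])
assert direct == pushed == [{"name": "Ana"}]
```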
Modern query optimizers implement selection pushdown as a systematic transformation pass over the query tree. Here's a detailed look at how this works.
The canonical approach is a top-down traversal with predicate accumulation:
function pushdownSelection(node, pendingPredicates):
    // If this node is a selection, decompose its predicate into
    // conjuncts, accumulate them, and continue downward
    if node is Selection(predicate, child):
        predicates = pendingPredicates ∪ decompose(predicate)
        return pushdownSelection(child, predicates)

    // For joins, partition the accumulated predicates by side
    if node is Join(left, right, joinCond):
        leftOnly  = {p ∈ pendingPredicates | attrs(p) ⊆ available(left)}
        rightOnly = {p ∈ pendingPredicates | attrs(p) ⊆ available(right)}
        remaining = pendingPredicates - leftOnly - rightOnly

        newLeft  = pushdownSelection(left, leftOnly)
        newRight = pushdownSelection(right, rightOnly)
        newJoin  = Join(newLeft, newRight, joinCond)

        // Predicates that span both sides must stay above the join
        if remaining is not empty:
            return Selection(conjoin(remaining), newJoin)
        else:
            return newJoin

    // For base tables, apply all accumulated predicates
    if node is BaseTable(R):
        if pendingPredicates is not empty:
            return Selection(conjoin(pendingPredicates), R)
        else:
            return R

    // Handle other operators (Union, Aggregation, etc.) similarly
    ...

Conjunctive predicates (AND-connected conditions) are decomposed to enable maximum pushdown:
σ_{A∧B∧C}(R ⋈ S)
→ decompose to {A, B, C}
→ push each independently
This is why writing WHERE a AND b AND c enables better optimization than complex conditions that can't be decomposed.
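The traversal described above can be turned into a small runnable sketch. The node classes and the predicate representation (each predicate as a name plus its attribute set) are simplified stand-ins, not a real optimizer's data structures:

```python
# Top-down selection pushdown over a toy plan tree.
from dataclasses import dataclass

@dataclass
class BaseTable:
    name: str
    columns: frozenset

@dataclass
class Join:
    left: object
    right: object
    cond: str

@dataclass
class Selection:
    predicates: frozenset   # set of (pred_text, attrs) conjuncts
    child: object

def available(node):
    """Attributes produced by the subtree rooted at node."""
    if isinstance(node, BaseTable):
        return node.columns
    if isinstance(node, Join):
        return available(node.left) | available(node.right)
    return available(node.child)

def pushdown(node, pending=frozenset()):
    if isinstance(node, Selection):
        # Accumulate this node's conjuncts and continue downward.
        return pushdown(node.child, pending | node.predicates)
    if isinstance(node, Join):
        left_only  = {p for p in pending if p[1] <= available(node.left)}
        right_only = {p for p in pending if p[1] <= available(node.right)}
        new_join = Join(pushdown(node.left, frozenset(left_only)),
                        pushdown(node.right, frozenset(right_only)),
                        node.cond)
        remaining = pending - left_only - right_only
        # Join-spanning predicates stay above the join.
        return Selection(frozenset(remaining), new_join) if remaining else new_join
    # Base table: apply whatever accumulated.
    return Selection(pending, node) if pending else node

orders = BaseTable("Orders", frozenset({"o.amount", "o.customer_id"}))
custs  = BaseTable("Customers", frozenset({"c.id", "c.country"}))
plan = Selection(
    frozenset({("c.country='USA'", frozenset({"c.country"})),
               ("o.amount>1000", frozenset({"o.amount"}))}),
    Join(orders, custs, "o.customer_id = c.id"))

optimized = pushdown(plan)
assert isinstance(optimized, Join)            # the top selection dissolved
assert isinstance(optimized.left, Selection)  # pushed onto Orders
assert isinstance(optimized.right, Selection) # pushed onto Customers
```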
1. Predicate Cloning vs. Moving
Some optimizers move predicates (removing them from higher nodes), while others clone them (applying them both early and late). Cloning is safer but may add redundant checks.
2. Interaction with Indexes
When predicates are pushed to base tables, the optimizer can consider using indexes:
σ_{dept='Engineering'}(Employees) → Use index on 'dept' if available
3. Predicate Ordering
When multiple predicates are pushed to the same node, the order matters for performance. More selective predicates (rejecting more tuples) should typically be evaluated first.
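A simple cost model shows why the ordering matters. This sketch counts only expected predicate evaluations per tuple; real optimizers also weigh per-predicate evaluation cost:

```python
# Expected predicate evaluations per tuple for a given evaluation order:
# the first predicate always runs; each later one runs only for tuples
# that passed all earlier predicates.

def expected_evals(selectivities):
    total, passing = 0.0, 1.0
    for s in selectivities:
        total += passing
        passing *= s
    return total

preds = [0.9, 0.5, 0.01]   # pass rates (selectivities) of three predicates
worst = expected_evals(sorted(preds, reverse=True))  # least selective first
best  = expected_evals(sorted(preds))                # most selective first
print(round(worst, 3), round(best, 3))               # 2.35 1.015
assert best < worst
```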
Some optimizers use a bottom-up approach: annotate each node with its available attributes, then walk up collecting selections that can be pushed down. Both approaches achieve the same result; the choice is implementation-dependent.
While selection pushdown is generally beneficial, there are cases where it's impossible, counterproductive, or requires caution.
1. Predicates referencing computed values:
SELECT a + b AS sum, c
FROM R
WHERE sum > 100; -- 'sum' doesn't exist until computed
The predicate must wait until the projection/computation is complete.
2. Predicates with correlated subqueries:
SELECT * FROM R
WHERE x IN (SELECT y FROM S WHERE S.z = R.w);
The subquery references the outer table; complex dependency analysis required.
3. Predicates referencing window function results:
SELECT *, rank() OVER (ORDER BY salary) AS rnk
FROM Employees
WHERE rnk <= 10; -- Window function computed after base scan
Modern optimizers don't blindly push all predicates. They consider:
The optimizer may choose to not push a predicate if doing so results in a worse overall plan. This is why cost-based optimization complements rule-based transformation.
Some SQL constructs create semantic barriers that block pushdown: DISTINCT, LIMIT, certain uses of ORDER BY, and database-specific features. The optimizer must recognize these barriers and not push predicates across them incorrectly.
Selection pushdown is perhaps the most impactful single optimization in query processing. Let's consolidate what we've learned:
What's Next
We've thoroughly covered selection pushdown. Next, we'll examine projection pushdown—the analogous technique for reducing tuple width by eliminating unused attributes as early as possible. While not as dramatic as selection pushdown, projection optimization significantly reduces memory and I/O costs throughout query execution.
You now have comprehensive knowledge of selection pushdown—the most important filter optimization technique. Apply this knowledge when analyzing query plans: look for selections that could be pushed earlier, and understand why the optimizer made the choices it did.