Loading content...
If you could implement only one query optimization, selection pushdown would be it. This single technique—pushing filter predicates as close to the data source as possible—delivers the most consistent, dramatic performance improvements across virtually all workloads.
The principle is intuitive: why join a million rows if ninety percent of them will be filtered out afterward? Yet the mechanics of safely and maximally pushing selections through complex query plans involve subtle considerations that separate sophisticated optimizers from naive ones.
By the end of this page, you will deeply understand the algebraic foundation of selection pushdown, the mechanics of pushing selections through different operators, challenging edge cases involving outer joins and subqueries, predicate inference and derivation techniques, and practical strategies for writing queries that benefit maximally from selection pushdown.
Selection pushdown is grounded in relational algebra equivalences. These equivalences prove that certain transformations preserve query semantics—the transformed query returns exactly the same results as the original.
In relational algebra, the selection operator σ (sigma) filters rows based on a predicate:
σpredicate = Keep only rows from relation R where predicate evaluates to TRUE
For example, σage > 21 returns only customers older than 21.
The following equivalences form the mathematical basis for selection pushdown:
| Equivalence | Description | Conditions |
|---|---|---|
σ[p](R ⋈ S) ≡ σ[p](R) ⋈ S | Push selection to left operand | Predicate p uses only columns from R |
σ[p](R ⋈ S) ≡ R ⋈ σ[p](S) | Push selection to right operand | Predicate p uses only columns from S |
σ[p1 ∧ p2](R) ≡ σ[p1](σ[p2](R)) | Decompose conjunctive predicates | Always valid |
σ[p1](σ[p2](R)) ≡ σ[p2](σ[p1](R)) | Selections commute | Always valid |
σ[p](R ∪ S) ≡ σ[p](R) ∪ σ[p](S) | Push through union | Always valid for UNION ALL; for UNION requires care |
σ[p](R - S) ≡ σ[p](R) - S | Push through set difference | Safe to push to positive operand |
Consider how data flows through a plan with and without selection pushdown:
In the "Without Pushdown" plan, the join processes 10M × 1M potential pairs, then filters. In the "With Pushdown" plan, the filter reduces customers from 1M to 100K before the join even begins. The join now processes 10M × 100K potential pairs—10× less work.
Real-world WHERE clauses often combine conditions with AND and OR. Optimizers decompose these to maximize pushdown opportunities:
12345678910111213141516171819202122232425262728
-- Original query with compound predicateSELECT *FROM orders oJOIN customers c ON o.customer_id = c.idWHERE c.country = 'USA' -- About customers (push to c) AND o.order_date > '2024-01-01' -- About orders (push to o) AND o.total > 100; -- About orders (push to o) -- Decomposed predicates:-- Predicate 1: c.country = 'USA' → push to customers scan-- Predicate 2: o.order_date > '2024-01-01' → push to orders scan-- Predicate 3: o.total > 100 → push to orders scan -- Optimized plan structure:-- 1. Scan customers WHERE country = 'USA' (push predicate 1)-- 2. Scan orders WHERE order_date > '2024-01-01' AND total > 100 (push 2 & 3)-- 3. Join the filtered results -- OR predicates are trickier:SELECT *FROM orders oJOIN customers c ON o.customer_id = c.idWHERE c.country = 'USA' OR o.total > 10000; -- Cannot simply push either predicate!-- If we only pushed c.country = 'USA' to customers,-- we'd miss orders with total > 10000 for non-USA customers-- Must evaluate after join, OR convert to UNIONPredicates in Conjunctive Normal Form (ANDs of ORs) are easier to optimize than Disjunctive Normal Form (ORs of ANDs). Many optimizers convert predicates to CNF early in the optimization process to maximize pushdown opportunities.
Each relational operator presents unique considerations for selection pushdown. Mastering these rules enables maximum optimization.
Inner joins are fully transparent to selections. Any predicate can be pushed to whichever operand contains the referenced columns:
1234567891011121314151617181920
-- Rule: σ[p](R ⋈ S) ≡ σ[p](R) ⋈ S [when p references only R]-- Rule: σ[p](R ⋈ S) ≡ R ⋈ σ[p](S) [when p references only S]-- Rule: σ[p1 ∧ p2](R ⋈ S) ≡ σ[p1](R) ⋈ σ[p2](S) [when p1 uses R, p2 uses S] -- Example:SELECT *FROM employees eJOIN departments d ON e.dept_id = d.idJOIN projects p ON e.id = p.lead_idWHERE d.location = 'NYC' -- Pushes to departments AND e.hire_date > '2020-01-01' -- Pushes to employees AND p.budget > 100000; -- Pushes to projects -- Plan after pushdown:-- 1. Scan departments WHERE location = 'NYC'-- 2. Scan employees WHERE hire_date > '2020-01-01'-- 3. Scan projects WHERE budget > 100000-- 4. Join (1) and (2) on dept_id-- 5. Join result with (3) on lead_id-- All filters applied at leaf level!Outer joins are NOT transparent to selections. The rules depend on which side of the outer join the predicate applies to:
12345678910111213
-- WRONG: Pushing predicate to nullable sideSELECT *FROM customers cLEFT JOIN orders o ON c.id = o.customer_idWHERE o.total > 100; -- Filters NULLs! -- This WHERE clause eliminates customers-- with no orders (where o.total is NULL)-- Converting LEFT JOIN to INNER JOIN! -- Result: Only customers WITH orders > $100-- Lost: Customers with no orders at all12345678910111213
-- CORRECT: Keep predicate post-join or use ON-- Option 1: Accept NULL filtering (if intentional)SELECT *FROM customers cLEFT JOIN orders o ON c.id = o.customer_idAND o.total > 100; -- In ON clause! -- Now: Customers without high-value orders-- still appear (with NULL order columns) -- Option 2: Explicit NULL handlingWHERE o.total > 100 OR o.customer_id IS NULLSelection pushdown through GROUP BY depends on whether the predicate references aggregate results or base columns:
1234567891011121314151617181920212223242526
-- HAVING vs WHERE: Different pushdown behavior -- Predicate on grouping column: CAN push downSELECT department_id, SUM(salary)FROM employeesWHERE department_id IN (10, 20, 30) -- Push to scan!GROUP BY department_id; -- Predicate on aggregate result: CANNOT push downSELECT department_id, SUM(salary) as total_salaryFROM employeesGROUP BY department_idHAVING SUM(salary) > 500000; -- Must evaluate after GROUP BY -- Mixed predicates:SELECT department_id, AVG(salary)FROM employeesWHERE hire_date > '2020-01-01' -- Push to scanGROUP BY department_idHAVING COUNT(*) >= 5 -- Evaluate post-aggregation AND department_id != 99; -- Could theoretically push, but already in WHERE -- Optimizer separates:-- 1. Predicates on base columns → WHERE (pushed down)-- 2. Predicates on aggregate results → HAVING (post-aggregation)-- 3. Predicates on grouping columns in HAVING → may be pushed to WHEREPushdown into subqueries and Common Table Expressions (CTEs) requires careful analysis:
12345678910111213141516171819202122232425262728293031323334
-- Simple subquery: CAN push down (view merging)SELECT * FROM ( SELECT order_id, total, customer_id FROM orders) subqWHERE subq.total > 1000; -- Optimizer merges to:SELECT order_id, total, customer_idFROM ordersWHERE total > 1000; -- Subquery with DISTINCT: CANNOT push throughSELECT * FROM ( SELECT DISTINCT customer_id, region FROM customers) subqWHERE subq.region = 'West'; -- WRONG if pushed: Would apply filter before DISTINCT-- Semantically different! DISTINCT sees fewer input rows -- CTE pushdown (depends on database):WITH regional_sales AS ( SELECT region, SUM(total) as region_total FROM orders GROUP BY region)SELECT * FROM regional_salesWHERE region = 'West'; -- Some databases materialize CTEs (no pushdown)-- Others inline CTEs like subqueries (pushdown possible)-- PostgreSQL: Use MATERIALIZED / NOT MATERIALIZED hintsSome operators 'block' pushdown entirely: LIMIT (affects which rows survive), window functions (require full input), ORDER BY (may affect LIMIT semantics), and volatile functions (must execute exactly once). Predicates cannot be pushed through these operators.
Advanced optimizers don't just push existing predicates—they derive new predicates that weren't explicitly stated. This predicate inference can dramatically expand pushdown opportunities.
When columns are connected via equality conditions, predicates on one column can be inferred for the other:
123456789101112131415161718192021222324
-- Original querySELECT *FROM orders oJOIN customers c ON o.customer_id = c.idWHERE c.id = 12345; -- Optimizer infers: o.customer_id = 12345-- Because: o.customer_id = c.id AND c.id = 12345-- Therefore: o.customer_id = 12345 (transitivity) -- Enhanced plan:-- 1. Scan orders WHERE customer_id = 12345 (derived predicate!)-- 2. Scan customers WHERE id = 12345 (original predicate)-- 3. Join (both sides already filtered!) -- Complex transitivity chain:SELECT *FROM aJOIN b ON a.x = b.xJOIN c ON b.x = c.xWHERE a.x = 100; -- Inferred: b.x = 100, c.x = 100-- All three tables can be filtered at scan time!Similarly, range predicates can be propagated through equality joins:
12345678910111213141516171819202122232425
-- Range predicate with join equalitySELECT *FROM events eJOIN event_types t ON e.type_id = t.idWHERE t.id BETWEEN 1 AND 10; -- Derived predicate: e.type_id BETWEEN 1 AND 10-- Push to events table scan! -- More complex: inequality chainsSELECT *FROM aJOIN b ON a.date = b.dateWHERE a.date >= '2024-01-01' AND a.date < '2024-04-01'; -- Derived: b.date >= '2024-01-01' AND b.date < '2024-04-01' -- Careful with mixed predicates:SELECT *FROM orders oJOIN order_items oi ON o.id = oi.order_idWHERE o.id > 1000 AND oi.quantity > 5; -- o.id > 1000 implies oi.order_id > 1000 (useful!)-- oi.quantity > 5 cannot be propagated to orders (different column)Database constraints provide additional inference opportunities:
CHECK (status IN ('A','B','C')) exists, predicate status = 'D' implies empty result.column IS NULL returns empty result.12345678910111213141516171819202122232425
-- Constraint: orders.status IN ('pending', 'shipped', 'delivered', 'cancelled')SELECT *FROM ordersWHERE status = 'invalid_status';-- Optimizer knows CHECK constraint-- Detects contradiction → returns empty immediately-- No table scan needed! -- Partition elimination:-- Table: sales PARTITION BY RANGE (sale_date)-- Partition: sales_2024_q1 FOR VALUES FROM ('2024-01-01') TO ('2024-04-01') SELECT * FROM salesWHERE sale_date BETWEEN '2024-02-01' AND '2024-02-28'; -- Only scans partition sales_2024_q1-- Other partitions eliminated based on implied constraints -- Foreign key optimization:-- FK: orders.customer_id REFERENCES customers(id)SELECT * FROM orders oJOIN customers c ON o.customer_id = c.id; -- Optimizer knows every orders.customer_id matches exactly one customer-- Can use optimized join algorithm knowing 1:1 match ratioDefining proper constraints (CHECK, NOT NULL, FOREIGN KEY) doesn't just ensure data integrity—it provides the optimizer with valuable information for predicate inference. Many performance improvements come 'for free' when you properly constrain your schema.
Selection pushdown becomes especially powerful when pushed predicates enable index usage. A predicate that reaches the table scan level can leverage indexes for efficient access.
Consider the transformation that occurs when selection pushdown enables indexing:
Once predicates are pushed to the scan level, the optimizer evaluates index options:
123456789101112131415161718192021222324252627
-- Table: orders-- Indexes: idx_customer (customer_id), idx_date (order_date), -- idx_status (status), idx_customer_date (customer_id, order_date) SELECT *FROM orders oJOIN customers c ON o.customer_id = c.idWHERE c.region = 'West' AND o.order_date > '2024-01-01' AND o.status = 'shipped'; -- After pushdown to orders table:-- Predicates: order_date > '2024-01-01' AND status = 'shipped' -- Index options:-- 1. idx_date: Use for order_date > '2024-01-01', filter status afterward-- 2. idx_status: Use for status = 'shipped', filter date afterward -- 3. Full scan: Apply both filters during scan-- 4. Index intersection: Use both indexes, intersect results -- Best choice depends on:-- - How selective each predicate is-- - Index clustering (are matching rows physically close?)-- - Memory available for bitmap intersection -- If customer_id is also derived (via join transitivity):-- idx_customer_date composite index might be best!Maximal pushdown can enable index-only scans when pushed predicates and projected columns are all in the index:
123456789101112131415161718192021222324
-- Index: CREATE INDEX idx_orders_covering -- ON orders(customer_id, order_date, total) SELECT customer_id, order_date, totalFROM orders oJOIN customers c ON o.customer_id = c.idWHERE c.id = 12345 AND o.order_date > '2024-01-01'; -- After transitivity inference: o.customer_id = 12345-- Pushed predicates: customer_id = 12345 AND order_date > '2024-01-01'-- Required columns: customer_id, order_date, total -- All predicates and columns are in idx_orders_covering!-- Index-only scan possible:-- - Never reads table heap-- - Finds rows directly in index structure-- - Returns results from index leaf pages -- Performance gain: Often 10-100× faster than heap access-- Especially beneficial for:-- - Large tables with big rows-- - Remote/distributed storage-- - High-selectivity queriesIn distributed or federated databases, selection pushdown becomes even more critical—pushing predicates to remote systems reduces network transfer:
1234567891011121314151617181920212223242526
-- Foreign table on remote databaseCREATE FOREIGN TABLE remote_orders ( id INT, customer_id INT, total DECIMAL, order_date DATE) SERVER remote_server; SELECT * FROM remote_ordersWHERE order_date > '2024-01-01' AND total > 1000; -- Without pushdown:-- 1. Fetch ALL rows from remote (millions of rows over network)-- 2. Apply filter locally -- With pushdown:-- 1. Send query to remote with WHERE clause-- 2. Remote applies filter using local indexes-- 3. Only matching rows (thousands) sent over network -- Critical considerations:-- - Does the foreign data wrapper support predicate pushdown?-- - Are operators and data types compatible across systems?-- - Can remote indexes be leveraged?In distributed systems, 'pushdown' often means sending not just predicates but entire computation to data nodes. This includes aggregations, joins, and projections. The goal is always the same: minimize data movement by computing where data lives.
While optimizers work hard to push selections down, developers can write queries that facilitate or hinder pushdown. Understanding these patterns lets you write queries that optimize better.
SARGable (Search ARGument able) predicates can use indexes. Prefer these forms:
YEAR(order_date) = 2024UPPER(name) = 'SMITH'price + tax > 100quantity * 2 < 10COALESCE(date, NOW()) > Xstatus != 'active'name LIKE '%smith%'order_date >= '2024-01-01'name = 'smith' (case-insensitive collation)price > 100 - taxquantity < 5date > X OR date IS NULLstatus IN ('pending', 'shipped')name LIKE 'smith%'Explicit separation of predicates by table makes pushdown analysis easier:
1234567891011121314151617181920
-- Less optimal: Combined predicate harder to decomposeSELECT *FROM orders oJOIN customers c ON o.customer_id = c.idWHERE (c.region = 'West' AND o.total > 100) OR (c.region = 'East' AND o.total > 200);-- Optimizer must analyze OR branches carefully -- More optimal: Separate when possible-- Option 1: UNION approachSELECT * FROM orders o JOIN customers c ON o.customer_id = c.idWHERE c.region = 'West' AND o.total > 100UNION ALLSELECT * FROM orders o JOIN customers c ON o.customer_id = c.idWHERE c.region = 'East' AND o.total > 200;-- Each branch can push fully -- Option 2: Denormalize if query is common-- Add region to orders table, filter directlySELECT * FROM orders WHERE region = 'West' AND total > 100;EXISTS subqueries often optimize better than IN subqueries:
123456789101112131415161718192021
-- IN subquery: May process entire subquery firstSELECT * FROM productsWHERE category_id IN ( SELECT id FROM categories WHERE department = 'Electronics'); -- EXISTS: Can short-circuit, often correlates betterSELECT * FROM products pWHERE EXISTS ( SELECT 1 FROM categories c WHERE c.id = p.category_id AND c.department = 'Electronics'); -- EXISTS advantages:-- 1. Short-circuits: Stops when first match found-- 2. Better for large subquery results-- 3. Often converted to semi-join efficiently-- 4. Correlation enables index usage on p.category_id -- Modern optimizers often transform IN to EXISTS internally-- But explicit EXISTS makes optimization intent clearerViews and CTEs can either help or hinder optimization:
12345678910111213141516171819202122
-- Inlined view: Optimization friendlyCREATE VIEW active_customers ASSELECT * FROM customers WHERE status = 'active'; SELECT * FROM active_customers WHERE region = 'West';-- Merged to: SELECT * FROM customers WHERE status='active' AND region='West'-- Both predicates push down! -- Materialized CTE: Optimization barrier (PostgreSQL)WITH regional_data AS MATERIALIZED ( SELECT region, SUM(sales) FROM orders GROUP BY region)SELECT * FROM regional_data WHERE region = 'West';-- CTE computed first (all regions), then filtered-- 'West' filter cannot push into CTE -- Non-materialized: Optimization enabledWITH regional_data AS ( -- No MATERIALIZED keyword SELECT region, SUM(sales) FROM orders GROUP BY region)SELECT * FROM regional_data WHERE region = 'West';-- May be inlined, 'West' filter can push downAlways verify pushdown using EXPLAIN (or your database's equivalent). Look for filters applied at scan nodes rather than after joins. If predicates appear post-join, investigate why pushdown failed and rewrite the query accordingly.
Selection pushdown is the workhorse of query optimization—a simple principle with profound performance implications. By filtering rows early, every subsequent operation works on reduced data volumes.
What's Next:
Having explored selection pushdown in depth, we'll examine its complementary optimization: Projection Early. This technique reduces column counts early in the plan, minimizing memory usage and I/O at every stage.
You now have deep expertise in selection pushdown—its algebraic foundations, operator-specific rules, inference techniques, and practical application. This knowledge is fundamental to understanding query behavior and writing efficient SQL.