Heuristic Optimization - Learning Module


Selection Before Join

The Most Powerful Optimization: Filter First

If you could implement only one query optimization, selection pushdown would be it. This single technique—pushing filter predicates as close to the data source as possible—delivers the most consistent, dramatic performance improvements across virtually all workloads.

The principle is intuitive: why join a million rows if ninety percent of them will be filtered out afterward? Yet the mechanics of safely and maximally pushing selections through complex query plans involve subtle considerations that separate sophisticated optimizers from naive ones.

What You Will Learn

By the end of this page, you will deeply understand the algebraic foundation of selection pushdown, the mechanics of pushing selections through different operators, challenging edge cases involving outer joins and subqueries, predicate inference and derivation techniques, and practical strategies for writing queries that benefit maximally from selection pushdown.

The Algebraic Foundation

Selection pushdown is grounded in relational algebra equivalences. These equivalences prove that certain transformations preserve query semantics—the transformed query returns exactly the same results as the original.

The Selection Operator

In relational algebra, the selection operator σ (sigma) filters rows based on a predicate:

σ[predicate](R) = Keep only rows from relation R where predicate evaluates to TRUE

For example, σ[age > 21](customers) returns only customers older than 21.

Fundamental Selection Equivalences

The following equivalences form the mathematical basis for selection pushdown:

Selection Pushdown Equivalences
Equivalence	Description	Conditions
`σ[p](R ⋈ S) ≡ σ[p](R) ⋈ S`	Push selection to left operand	Predicate p uses only columns from R
`σ[p](R ⋈ S) ≡ R ⋈ σ[p](S)`	Push selection to right operand	Predicate p uses only columns from S
`σ[p1 ∧ p2](R) ≡ σ[p1](σ[p2](R))`	Decompose conjunctive predicates	Always valid
`σ[p1](σ[p2](R)) ≡ σ[p2](σ[p1](R))`	Selections commute	Always valid
`σ[p](R ∪ S) ≡ σ[p](R) ∪ σ[p](S)`	Push through union	Valid for both UNION ALL and UNION, because filtering commutes with duplicate elimination
`σ[p](R - S) ≡ σ[p](R) - S`	Push through set difference	Safe to push to positive operand

Visual Understanding: Push vs. No Push

Consider how data flows through a plan with and without selection pushdown:

[Diagram: "Without Pushdown" plan versus "With Pushdown" plan]

In the "Without Pushdown" plan, the join processes 10M × 1M potential pairs, then filters. In the "With Pushdown" plan, the filter reduces customers from 1M to 100K before the join even begins. The join now processes 10M × 100K potential pairs—10× less work.
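
One way to convince yourself of the join equivalence is to run both forms on a small dataset. The sketch below uses Python's built-in sqlite3 module; the schema and sample rows are illustrative assumptions, not data from this page. It checks that filtering after the join and filtering before it return the same rows.

```python
import sqlite3

def build_db():
    """Create a tiny in-memory database (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'USA'), (2, 'Canada'), (3, 'USA');
        INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 75.0), (12, 3, 20.0), (13, 2, 5.0);
    """)
    return conn

# sigma[p](orders JOIN customers): the filter runs after the join
FILTER_AFTER_JOIN = """
    SELECT o.id, c.id, c.country
    FROM orders o JOIN customers c ON o.customer_id = c.id
    WHERE c.country = 'USA'
    ORDER BY o.id
"""

# orders JOIN sigma[p](customers): the filter runs before the join
FILTER_BEFORE_JOIN = """
    SELECT o.id, c.id, c.country
    FROM orders o
    JOIN (SELECT * FROM customers WHERE country = 'USA') c ON o.customer_id = c.id
    ORDER BY o.id
"""

conn = build_db()
after_rows = conn.execute(FILTER_AFTER_JOIN).fetchall()
before_rows = conn.execute(FILTER_BEFORE_JOIN).fetchall()
print(after_rows == before_rows)  # True
```

The two queries differ only in where the predicate sits. A real optimizer performs this rewrite internally, so you would rarely write the second form by hand.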

Decomposing Complex Predicates

Real-world WHERE clauses often combine conditions with AND and OR. Optimizers decompose these to maximize pushdown opportunities:

predicate_decomposition.sql
-- Original query with compound predicate
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'           -- About customers (push to c)
  AND o.order_date > '2024-01-01' -- About orders (push to o)
  AND o.total > 100;              -- About orders (push to o)
 
-- Decomposed predicates:
-- Predicate 1: c.country = 'USA' → push to customers scan
-- Predicate 2: o.order_date > '2024-01-01' → push to orders scan
-- Predicate 3: o.total > 100 → push to orders scan
 
-- Optimized plan structure:
-- 1. Scan customers WHERE country = 'USA' (push predicate 1)
-- 2. Scan orders WHERE order_date > '2024-01-01' AND total > 100 (push 2 & 3)
-- 3. Join the filtered results
 
-- OR predicates are trickier:
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA' OR o.total > 10000;
 
-- Cannot simply push either predicate!
-- If we only pushed c.country = 'USA' to customers,
-- we'd miss orders with total > 10000 for non-USA customers
-- Must evaluate after join, OR convert to UNION

CNF vs DNF

Predicates in Conjunctive Normal Form (ANDs of ORs) are easier to optimize than those in Disjunctive Normal Form (ORs of ANDs), because each top-level AND term can be pushed down independently of the others. Many optimizers convert predicates to CNF early in the optimization process to maximize pushdown opportunities.
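
The decomposition step can be sketched as a small classifier over the top-level AND terms. This is a simplified illustration rather than how any particular optimizer is implemented: the `Conjunct` type and the `classify` function are hypothetical names, and the set of referenced table aliases is supplied by hand instead of being parsed from SQL.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Conjunct:
    text: str          # the predicate as written
    tables: frozenset  # table aliases the predicate references

def classify(conjuncts):
    """Split AND-ed terms into per-table pushable filters and post-join residue."""
    pushable = {}
    residual = []
    for conjunct in conjuncts:
        if len(conjunct.tables) == 1:
            (alias,) = conjunct.tables
            pushable.setdefault(alias, []).append(conjunct.text)
        else:
            # References several tables (or none): must stay above the join
            residual.append(conjunct.text)
    return pushable, residual

where_clause = [
    Conjunct("c.country = 'USA'", frozenset({"c"})),
    Conjunct("o.order_date > '2024-01-01'", frozenset({"o"})),
    Conjunct("o.total > 100", frozenset({"o"})),
    Conjunct("c.country = 'USA' OR o.total > 10000", frozenset({"c", "o"})),
]
pushable, residual = classify(where_clause)
print(pushable)
print(residual)
```

The OR term references both tables, so it lands in the residue and is evaluated after the join, matching the behavior described for OR predicates.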

Pushing Through Different Operators

Each relational operator presents unique considerations for selection pushdown. Mastering these rules enables maximum optimization.

Inner Joins: The Easy Case

Inner joins are fully transparent to selections. Any predicate that references a single operand can be pushed to that operand; a predicate that references both operands becomes part of the join condition:

inner_join_pushdown.sql
-- Rule: σ[p](R ⋈ S) ≡ σ[p](R) ⋈ S   [when p references only R]
-- Rule: σ[p](R ⋈ S) ≡ R ⋈ σ[p](S)   [when p references only S]
-- Rule: σ[p1 ∧ p2](R ⋈ S) ≡ σ[p1](R) ⋈ σ[p2](S)  [when p1 uses R, p2 uses S]
 
-- Example:
SELECT *
FROM employees e
JOIN departments d ON e.dept_id = d.id
JOIN projects p ON e.id = p.lead_id
WHERE d.location = 'NYC'       -- Pushes to departments
  AND e.hire_date > '2020-01-01'  -- Pushes to employees
  AND p.budget > 100000;       -- Pushes to projects
 
-- Plan after pushdown:
-- 1. Scan departments WHERE location = 'NYC'
-- 2. Scan employees WHERE hire_date > '2020-01-01'
-- 3. Scan projects WHERE budget > 100000
-- 4. Join (1) and (2) on dept_id
-- 5. Join result with (3) on lead_id
-- All filters applied at leaf level!

Outer Joins: The Tricky Case

Outer joins are NOT transparent to selections. The rules depend on which side of the outer join the predicate applies to:

Outer Join Pushdown Rules

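•LEFT JOIN, predicate on LEFT table: Safe to push. Filtering the preserved side before the join removes the same rows that filtering afterward would.
•LEFT JOIN, predicate on RIGHT table: Cannot push as-is. Pushed below the join, the predicate would turn filtered-out matches into NULL-extended rows instead of removing them. If the predicate rejects NULLs (for example o.total > 100), the optimizer can first convert the LEFT JOIN to an INNER JOIN and then push; if it accepts NULLs (for example o.total IS NULL), it must stay above the join.
•RIGHT JOIN, predicate on RIGHT table: Safe to push. (Symmetric to the LEFT JOIN case.)
•RIGHT JOIN, predicate on LEFT table: Cannot push as-is. (Symmetric to the LEFT JOIN case.)
•FULL OUTER JOIN, any predicate: Generally cannot push to either side directly, because both sides can be NULL-extended.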

wrong_outer_join_pushdown.sql
-- PITFALL: WHERE predicate on the nullable side
SELECT *
FROM customers c
LEFT JOIN orders o 
  ON c.id = o.customer_id
WHERE o.total > 100;  -- NULL > 100 is not TRUE, so NULL-extended rows are dropped

-- This WHERE clause eliminates customers
-- with no orders (where o.total is NULL),
-- effectively converting the LEFT JOIN to an INNER JOIN!

-- Result: Only customers WITH orders > $100
-- Lost: Customers with no orders at all, and customers whose orders are all <= $100

correct_outer_join_handling.sql
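-- CORRECT: Decide which rows should survive, then place the predicate accordingly
-- Option 1: Filter in the ON clause (keeps every customer)
SELECT *
FROM customers c
LEFT JOIN orders o 
  ON c.id = o.customer_id
 AND o.total > 100;  -- In ON clause!

-- Now: Customers without high-value orders
-- still appear (with NULL order columns)

-- Option 2: Explicit NULL handling in WHERE
SELECT *
FROM customers c
LEFT JOIN orders o
  ON c.id = o.customer_id
WHERE o.total > 100 OR o.customer_id IS NULL;
-- Keeps customers with no orders at all, but still drops customers
-- whose orders are all <= $100 (so it is not equivalent to Option 1)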
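
The effect of predicate placement on a LEFT JOIN is easiest to see on data. The following sketch, using Python's built-in sqlite3 module and a made-up three-customer dataset (an assumption for illustration), runs the WHERE form, the ON form, and the WHERE ... OR ... IS NULL form side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    -- Ann has a large order, Bob only a small one, Cy has none
    INSERT INTO orders VALUES (10, 1, 150), (11, 2, 50);
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

BASE = "SELECT c.id, o.id FROM customers c LEFT JOIN orders o ON c.id = o.customer_id"

# Predicate in WHERE: NULL-extended rows fail the test and disappear
where_rows = run(BASE + " WHERE o.total > 100 ORDER BY c.id")

# Predicate in ON: every customer is preserved
on_rows = run(BASE + " AND o.total > 100 ORDER BY c.id")

# WHERE with explicit NULL handling: keeps Cy (no orders) but drops Bob
or_null_rows = run(BASE + " WHERE o.total > 100 OR o.customer_id IS NULL ORDER BY c.id")

print(where_rows)    # [(1, 10)]
print(on_rows)       # [(1, 10), (2, None), (3, None)]
print(or_null_rows)  # [(1, 10), (3, None)]
```

The three result sets differ, which is why an optimizer cannot move a nullable-side predicate between WHERE and ON freely.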

Aggregations (GROUP BY)

Selection pushdown through GROUP BY depends on whether the predicate references aggregate results or base columns:

pushdown_through_aggregation.sql
-- HAVING vs WHERE: Different pushdown behavior
 
-- Predicate on grouping column: CAN push down
SELECT department_id, SUM(salary)
FROM employees
WHERE department_id IN (10, 20, 30)  -- Push to scan!
GROUP BY department_id;
 
-- Predicate on aggregate result: CANNOT push down
SELECT department_id, SUM(salary) as total_salary
FROM employees
GROUP BY department_id
HAVING SUM(salary) > 500000;  -- Must evaluate after GROUP BY
 
-- Mixed predicates:
SELECT department_id, AVG(salary)
FROM employees
WHERE hire_date > '2020-01-01'      -- Push to scan
GROUP BY department_id
HAVING COUNT(*) >= 5                 -- Evaluate post-aggregation
   AND department_id != 99;          -- On a grouping column: the optimizer may move this to WHERE
 
-- Optimizer separates:
-- 1. Predicates on base columns → WHERE (pushed down)
-- 2. Predicates on aggregate results → HAVING (post-aggregation)
-- 3. Predicates on grouping columns in HAVING → may be pushed to WHERE
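
To check that a HAVING predicate on a grouping column really is interchangeable with a WHERE predicate, here is a minimal sketch using Python's built-in sqlite3 module. The employees rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, department_id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 10, 100), (2, 10, 200), (3, 20, 300), (4, 99, 400), (5, 99, 500);
""")

# Predicate on the grouping column, written in HAVING
having_rows = conn.execute("""
    SELECT department_id, SUM(salary) FROM employees
    GROUP BY department_id
    HAVING department_id != 99
    ORDER BY department_id
""").fetchall()

# Same predicate moved to WHERE: rows are discarded before grouping
where_rows = conn.execute("""
    SELECT department_id, SUM(salary) FROM employees
    WHERE department_id != 99
    GROUP BY department_id
    ORDER BY department_id
""").fetchall()

# A predicate on the aggregate itself has no WHERE equivalent
aggregate_rows = conn.execute("""
    SELECT department_id, SUM(salary) FROM employees
    GROUP BY department_id
    HAVING SUM(salary) > 300
    ORDER BY department_id
""").fetchall()

print(having_rows == where_rows)  # True
print(aggregate_rows)             # [(99, 900)]
```

The WHERE form lets the scan skip department 99 entirely, while the aggregate predicate can only be checked once each group's sum is known.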

Subqueries and CTEs

Pushdown into subqueries and Common Table Expressions (CTEs) requires careful analysis:

pushdown_into_subqueries.sql
-- Simple subquery: CAN push down (view merging)
SELECT * FROM (
    SELECT order_id, total, customer_id
    FROM orders
) subq
WHERE subq.total > 1000;
 
-- Optimizer merges to:
SELECT order_id, total, customer_id
FROM orders
WHERE total > 1000;
 
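-- Subquery with DISTINCT: CAN push when the predicate uses only DISTINCT output columns
SELECT * FROM (
    SELECT DISTINCT customer_id, region
    FROM customers
) subq
WHERE subq.region = 'West';

-- Safe to push: all duplicates of a row share the same region, so filtering
-- before or after DISTINCT yields the same distinct rows.
-- (Whether a given optimizer performs this rewrite is engine-specific.)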
 
-- CTE pushdown (depends on database):
WITH regional_sales AS (
    SELECT region, SUM(total) as region_total
    FROM orders
    GROUP BY region
)
SELECT * FROM regional_sales
WHERE region = 'West';
 
-- Some databases materialize CTEs (no pushdown)
-- Others inline CTEs like subqueries (pushdown possible)
-- PostgreSQL 12+: a non-recursive, side-effect-free CTE referenced once is inlined by default;
--   override with AS MATERIALIZED / AS NOT MATERIALIZED

Blocking Operators

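Some operators 'block' pushdown: LIMIT (filtering first changes which rows survive the limit), window functions (a filter on anything other than the PARTITION BY columns changes the rows each window sees), and predicates containing volatile functions such as random() (moving them changes how many times, and on which rows, they are evaluated). ORDER BY on its own does not block pushdown, because filtering commutes with sorting; it matters only in combination with LIMIT. Predicates generally cannot be pushed through these blocking operators.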
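
The LIMIT case can be demonstrated directly. In this sketch (Python's built-in sqlite3 module, with four invented order rows), the same filter is applied above and below a LIMIT 2, and the results differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER);
    INSERT INTO orders VALUES (1, 50), (2, 500), (3, 20), (4, 900);
""")

# Filter applied AFTER the limit: first take orders 1 and 2, then filter
filter_above_limit = conn.execute("""
    SELECT id FROM (SELECT id, total FROM orders ORDER BY id LIMIT 2)
    WHERE total > 100
    ORDER BY id
""").fetchall()

# Filter pushed BELOW the limit: first filter, then take two survivors
filter_below_limit = conn.execute("""
    SELECT id FROM (SELECT id, total FROM orders WHERE total > 100 ORDER BY id LIMIT 2)
    ORDER BY id
""").fetchall()

print(filter_above_limit)  # [(2,)]
print(filter_below_limit)  # [(2,), (4,)]
```

Because the two placements return different rows, an optimizer must leave the filter above the LIMIT.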

Predicate Inference and Derivation

Advanced optimizers don't just push existing predicates—they derive new predicates that weren't explicitly stated. This predicate inference can dramatically expand pushdown opportunities.

Transitivity-Based Inference

When columns are connected via equality conditions, predicates on one column can be inferred for the other:

transitivity_inference.sql
-- Original query
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.id = 12345;
 
-- Optimizer infers: o.customer_id = 12345
-- Because: o.customer_id = c.id AND c.id = 12345
-- Therefore: o.customer_id = 12345 (transitivity)
 
-- Enhanced plan:
-- 1. Scan orders WHERE customer_id = 12345 (derived predicate!)
-- 2. Scan customers WHERE id = 12345 (original predicate)
-- 3. Join (both sides already filtered!)
 
-- Complex transitivity chain:
SELECT *
FROM a
JOIN b ON a.x = b.x
JOIN c ON b.x = c.x
WHERE a.x = 100;
 
-- Inferred: b.x = 100, c.x = 100
-- All three tables can be filtered at scan time!
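
Transitive inference can be sketched as grouping columns into equivalence classes and then copying any known constant to every member of the class. The `infer_constants` function below is a hypothetical illustration using a small union-find structure; it assumes all equalities come from inner joins and does not model outer joins, type coercion, or collations.

```python
def infer_constants(equalities, constants):
    """Propagate column = constant facts across column = column equalities.

    equalities: iterable of (column, column) pairs from join conditions
    constants:  dict mapping column -> constant from WHERE predicates
    Returns a dict of newly derived column -> constant predicates.
    """
    parent = {}

    def find(col):
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    def union(a, b):
        parent[find(a)] = find(b)

    for left, right in equalities:
        union(left, right)

    # The constant known for each equivalence class, keyed by class root
    class_value = {find(col): value for col, value in constants.items()}

    inferred = {}
    for col in list(parent):
        root = find(col)
        if root in class_value and col not in constants:
            inferred[col] = class_value[root]
    return inferred

derived = infer_constants(
    equalities=[("a.x", "b.x"), ("b.x", "c.x")],
    constants={"a.x": 100},
)
print(derived)
```

For the three-table chain above, the function derives b.x = 100 and c.x = 100 from the single stated predicate a.x = 100.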

Range Propagation

Similarly, range predicates can be propagated through equality joins:

range_propagation.sql
-- Range predicate with join equality
SELECT *
FROM events e
JOIN event_types t ON e.type_id = t.id
WHERE t.id BETWEEN 1 AND 10;
 
-- Derived predicate: e.type_id BETWEEN 1 AND 10
-- Push to events table scan!
 
-- More complex: inequality chains
SELECT *
FROM a
JOIN b ON a.date = b.date
WHERE a.date >= '2024-01-01' AND a.date < '2024-04-01';
 
-- Derived: b.date >= '2024-01-01' AND b.date < '2024-04-01'
 
-- Careful with mixed predicates:
SELECT *
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE o.id > 1000 AND oi.quantity > 5;
 
-- o.id > 1000 implies oi.order_id > 1000 (useful!)
-- oi.quantity > 5 cannot be propagated to orders (different column)

Constraint-Based Inference

Database constraints provide additional inference opportunities:

Constraint-Based Inferences

•CHECK constraints: If CHECK (status IN ('A','B','C')) exists, predicate status = 'D' implies empty result.
•NOT NULL constraints: If column is NOT NULL, predicate column IS NULL returns empty result.
•UNIQUE constraints: If unique constraint exists, equality lookup returns at most one row.
•Foreign key constraints: If the FK is enforced, every child row with a non-NULL key matches exactly one parent row in a join with the parent table.
•Partition elimination: Partition bounds act as implicit CHECK constraints for range predicates.

constraint_based_inference.sql
-- Constraint: orders.status IN ('pending', 'shipped', 'delivered', 'cancelled')
SELECT *
FROM orders
WHERE status = 'invalid_status';
-- An optimizer that uses CHECK constraints (support varies by engine and settings)
-- detects the contradiction → returns empty immediately
-- No table scan needed!
 
-- Partition elimination:
-- Table: sales PARTITION BY RANGE (sale_date)
-- Partition: sales_2024_q1 FOR VALUES FROM ('2024-01-01') TO ('2024-04-01')
 
SELECT * FROM sales
WHERE sale_date BETWEEN '2024-02-01' AND '2024-02-28';
 
-- Only scans partition sales_2024_q1
-- Other partitions eliminated based on implied constraints
 
-- Foreign key optimization:
-- FK: orders.customer_id REFERENCES customers(id)
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id;
 
-- Optimizer knows every non-NULL orders.customer_id matches exactly one customer
-- Many-to-one: the join neither duplicates nor drops such orders, which improves cardinality estimates

Enable Inference with Constraints

Defining proper constraints (CHECK, NOT NULL, FOREIGN KEY) doesn't just ensure data integrity—it provides the optimizer with valuable information for predicate inference. Many performance improvements come 'for free' when you properly constrain your schema.
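
Partition elimination is the easiest of these inferences to sketch: the optimizer compares the query's range with each partition's bounds and skips partitions that cannot overlap. In the example below, only sales_2024_q1 and its bounds come from the listing above; the other partitions and the `prune` function are hypothetical.

```python
from datetime import date

# Partition name -> [lower, upper) bounds; all but sales_2024_q1 are invented
PARTITIONS = {
    "sales_2023_q4": (date(2023, 10, 1), date(2024, 1, 1)),
    "sales_2024_q1": (date(2024, 1, 1), date(2024, 4, 1)),
    "sales_2024_q2": (date(2024, 4, 1), date(2024, 7, 1)),
}

def prune(partitions, low, high):
    """Return partitions whose [lower, upper) range overlaps BETWEEN low AND high."""
    return [
        name
        for name, (lower, upper) in partitions.items()
        if lower <= high and low < upper
    ]

# WHERE sale_date BETWEEN '2024-02-01' AND '2024-02-28'
february = prune(PARTITIONS, date(2024, 2, 1), date(2024, 2, 28))
print(february)  # ['sales_2024_q1']
```

Each partition bound acts like a CHECK constraint: if the query range and the bound contradict each other, the partition is never opened.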

Selection Pushdown and Index Usage

Selection pushdown becomes especially powerful when pushed predicates enable index usage. A predicate that reaches the table scan level can leverage indexes for efficient access.

From Full Scan to Index Seek

Consider the transformation that occurs when selection pushdown enables indexing:

[Diagram: full table scan versus index seek after pushdown]

Index Selection After Pushdown

Once predicates are pushed to the scan level, the optimizer evaluates index options:

index_selection_after_pushdown.sql
-- Table: orders
-- Indexes: idx_customer (customer_id), idx_date (order_date), 
--          idx_status (status), idx_customer_date (customer_id, order_date)
 
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'West'
  AND o.order_date > '2024-01-01'
  AND o.status = 'shipped';
 
-- After pushdown to orders table:
-- Predicates: order_date > '2024-01-01' AND status = 'shipped'
 
-- Index options:
-- 1. idx_date: Use for order_date > '2024-01-01', filter status afterward
-- 2. idx_status: Use for status = 'shipped', filter date afterward  
-- 3. Full scan: Apply both filters during scan
-- 4. Index intersection: Use both indexes, intersect results
 
-- Best choice depends on:
-- - How selective each predicate is
-- - Index clustering (are matching rows physically close?)
-- - Memory available for bitmap intersection
 
-- If customer_id is also derived (via join transitivity):
-- idx_customer_date composite index might be best!

Covering Index Opportunities

Maximal pushdown can enable index-only scans when pushed predicates and projected columns are all in the index:

covering_index_opportunity.sql
-- Index: CREATE INDEX idx_orders_covering 
--        ON orders(customer_id, order_date, total)
 
SELECT customer_id, order_date, total
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.id = 12345
  AND o.order_date > '2024-01-01';
 
-- After transitivity inference: o.customer_id = 12345
-- Pushed predicates: customer_id = 12345 AND order_date > '2024-01-01'
-- Required columns: customer_id, order_date, total
 
-- All predicates and columns are in idx_orders_covering!
-- Index-only scan possible:
-- - Can skip the table heap (in PostgreSQL, only for pages marked all-visible)
-- - Finds rows directly in index structure
-- - Returns results from index leaf pages

-- Performance gain: can be large because heap fetches are avoided; measure on your own workload
-- Especially beneficial for:
-- - Large tables with big rows
-- - Remote/distributed storage
-- - High-selectivity queries
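
SQLite reports covering-index use in its EXPLAIN QUERY PLAN output, which makes the effect easy to observe. The sketch below uses Python's built-in sqlite3 module; the extra status column is an assumption added so that one query needs a column outside the index. The exact plan wording varies between SQLite versions, so the code only looks for the "COVERING INDEX" marker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER,
        order_date TEXT, total REAL, status TEXT
    );
    CREATE INDEX idx_orders_covering ON orders(customer_id, order_date, total);
""")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

PREDICATE = " FROM orders WHERE customer_id = 12345 AND order_date > '2024-01-01'"

# Every referenced column is in the index: no table lookup needed
covered = plan("SELECT customer_id, order_date, total" + PREDICATE)

# status is not in the index: the index finds rows, the table supplies status
not_covered = plan("SELECT status" + PREDICATE)

print(covered)
print(not_covered)
```

Both queries use the index to locate rows, but only the first can be answered from the index alone.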

Pushing Predicates to Indexes on Foreign Tables

In distributed or federated databases, selection pushdown becomes even more critical—pushing predicates to remote systems reduces network transfer:

federated_pushdown.sql
-- Foreign table on remote database
CREATE FOREIGN TABLE remote_orders (
    id INT,
    customer_id INT,
    total DECIMAL,
    order_date DATE
) SERVER remote_server;
 
SELECT * 
FROM remote_orders
WHERE order_date > '2024-01-01'
  AND total > 1000;
 
-- Without pushdown:
-- 1. Fetch ALL rows from remote (millions of rows over network)
-- 2. Apply filter locally
 
-- With pushdown:
-- 1. Send query to remote with WHERE clause
-- 2. Remote applies filter using local indexes
-- 3. Only matching rows (thousands) sent over network
 
-- Critical considerations:
-- - Does the foreign data wrapper support predicate pushdown?
-- - Are operators and data types compatible across systems?
-- - Can remote indexes be leveraged?

Push Down vs. Compute Down

In distributed systems, 'pushdown' often means sending not just predicates but entire computation to data nodes. This includes aggregations, joins, and projections. The goal is always the same: minimize data movement by computing where data lives.

Practical Strategies for Maximizing Pushdown

While optimizers work hard to push selections down, developers can write queries that facilitate or hinder pushdown. Understanding these patterns lets you write queries that optimize better.

Write SARGable Predicates

SARGable (Search ARGument able) predicates can use indexes. Prefer these forms:

Non-SARGable (Avoid)

•YEAR(order_date) = 2024
•UPPER(name) = 'SMITH'
•price + tax > 100
•quantity * 2 < 10
•COALESCE(date, NOW()) > X
•status != 'active'
•name LIKE '%smith%'

SARGable (Prefer)

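•order_date >= '2024-01-01' AND order_date < '2025-01-01'
•name = 'smith' (with a case-insensitive collation)
•price > 100 - tax (SARGable on price only when tax is a constant or parameter, not a column)
•quantity < 5
•date > X OR date IS NULL (equivalent only when NOW() > X)
•status IN ('pending', 'shipped') (equivalent only when these are the only other non-NULL values)
•name LIKE 'smith%' (prefix match only; this is a different query from '%smith%')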
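
The difference between the two forms shows up in the query plan. As a rough illustration, the sketch below uses Python's built-in sqlite3 module and SQLite's strftime function as a stand-in for YEAR(); the table, index name, and ISO-formatted text dates are assumptions. Plan wording varies between SQLite versions, so the code only checks whether the index name appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, total REAL);
    CREATE INDEX idx_date ON orders(order_date);
""")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# SARGable: the bare column is compared against constants
sargable = plan(
    "SELECT * FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'"
)

# Non-SARGable: the column is wrapped in a function call
non_sargable = plan(
    "SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'"
)

uses_index_sargable = any("idx_date" in detail for detail in sargable)
uses_index_non_sargable = any("idx_date" in detail for detail in non_sargable)
print(uses_index_sargable, uses_index_non_sargable)
```

The range form lets the planner seek into idx_date, while the function-wrapped form forces it to evaluate the function against every row.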

Separate Predicates for Different Tables

Explicit separation of predicates by table makes pushdown analysis easier:

separated_predicates.sql
-- Less optimal: Combined predicate harder to decompose
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE (c.region = 'West' AND o.total > 100) 
   OR (c.region = 'East' AND o.total > 200);
-- Optimizer must analyze OR branches carefully
 
-- More optimal: Separate when possible
-- Option 1: UNION approach
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'West' AND o.total > 100
UNION ALL
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'East' AND o.total > 200;
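-- Each branch can push fully
-- UNION ALL is safe here because the branches cannot overlap (a customer is in West or East, not both)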
 
-- Option 2: Denormalize if query is common
-- Add region to orders table, filter directly
SELECT * FROM orders WHERE region = 'West' AND total > 100;
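
Before relying on a rewrite like the UNION ALL form, it is worth confirming on sample data that it returns the same rows as the original OR query. A minimal check using Python's built-in sqlite3 module follows; the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'West'), (2, 'East'), (3, 'North');
    INSERT INTO orders VALUES
        (10, 1, 150), (11, 1, 50), (12, 2, 250), (13, 2, 150), (14, 3, 999);
""")

JOIN = "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id"

# Original form: one query with an OR across both tables
or_rows = conn.execute(
    JOIN + " WHERE (c.region = 'West' AND o.total > 100)"
           "    OR (c.region = 'East' AND o.total > 200)"
).fetchall()

# Rewritten form: one branch per OR term, each fully pushable
union_rows = conn.execute(
    JOIN + " WHERE c.region = 'West' AND o.total > 100"
    " UNION ALL "
    + JOIN + " WHERE c.region = 'East' AND o.total > 200"
).fetchall()

print(sorted(or_rows) == sorted(union_rows))  # True
```

If the OR branches could overlap, UNION ALL would return the overlapping rows twice, and the rewrite would need UNION or an extra exclusion predicate instead.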

Use EXISTS Instead of IN for Correlation

EXISTS subqueries sometimes optimize better than IN subqueries, particularly on older or simpler optimizers:

exists_vs_in.sql
-- IN subquery: May process entire subquery first
SELECT * FROM products
WHERE category_id IN (
    SELECT id FROM categories WHERE department = 'Electronics'
);
 
-- EXISTS: Can short-circuit, often correlates better
SELECT * FROM products p
WHERE EXISTS (
    SELECT 1 FROM categories c 
    WHERE c.id = p.category_id AND c.department = 'Electronics'
);
 
-- EXISTS advantages:
-- 1. Short-circuits: Stops when first match found
-- 2. Better for large subquery results
-- 3. Often converted to semi-join efficiently
-- 4. Correlation enables index lookups on the join columns (c.id, p.category_id)
 
-- Modern optimizers often transform IN to EXISTS internally
-- But explicit EXISTS makes optimization intent clearer

Leverage Views and CTEs Carefully

Views and CTEs can either help or hinder optimization:

view_cte_optimization.sql
-- Inlined view: Optimization friendly
CREATE VIEW active_customers AS
SELECT * FROM customers WHERE status = 'active';
 
SELECT * FROM active_customers WHERE region = 'West';
-- Merged to: SELECT * FROM customers WHERE status='active' AND region='West'
-- Both predicates push down!
 
-- Materialized CTE: Optimization barrier (PostgreSQL)
WITH regional_data AS MATERIALIZED (
    SELECT region, SUM(sales) FROM orders GROUP BY region
)
SELECT * FROM regional_data WHERE region = 'West';
-- CTE computed first (all regions), then filtered
-- 'West' filter cannot push into CTE
 
-- Non-materialized: Optimization enabled
WITH regional_data AS (  -- No MATERIALIZED keyword
    SELECT region, SUM(sales) FROM orders GROUP BY region
)
SELECT * FROM regional_data WHERE region = 'West';
-- May be inlined, 'West' filter can push down

Use EXPLAIN to Verify Pushdown

Always verify pushdown using EXPLAIN (or your database's equivalent). Look for filters applied at scan nodes rather than after joins. If predicates appear post-join, investigate why pushdown failed and rewrite the query accordingly.

Summary: Selection Before Join

Selection pushdown is the workhorse of query optimization—a simple principle with profound performance implications. By filtering rows early, every subsequent operation works on reduced data volumes.

Key Takeaways

•Selection pushdown is grounded in relational algebra — Equivalences prove that pushing predicates preserves query semantics while reducing intermediate result sizes.
•Different operators have different pushdown rules — Inner joins are transparent; outer joins require careful analysis; aggregations block predicates on aggregate results.
•Predicate inference amplifies pushdown — Transitivity through join conditions and constraint knowledge enable derived predicates that weren't explicitly stated.
•Pushdown enables index usage — Predicates at the scan level can leverage indexes, transforming full scans into index seeks.
•Writing SARGable queries facilitates optimization — Avoid functions on columns, use explicit ranges, and structure predicates for clean decomposition.
•Verify with EXPLAIN — Don't assume pushdown occurred; confirm that filters apply at table scans rather than post-join.

What's Next:

Having explored selection pushdown in depth, we'll examine its complementary optimization: Projection Early. This technique reduces column counts early in the plan, minimizing memory usage and I/O at every stage.

Page Complete

You now have deep expertise in selection pushdown—its algebraic foundations, operator-specific rules, inference techniques, and practical application. This knowledge is fundamental to understanding query behavior and writing efficient SQL.

Selection Before Join

The Most Powerful Optimization: Filter First

What You Will Learn

The Algebraic Foundation

The Selection Operator

In relational algebra, the selection operator σ (sigma) filters rows based on a predicate:

σpredicate = Keep only rows from relation R where predicate evaluates to TRUE

For example, σage > 21 returns only customers older than 21.

Fundamental Selection Equivalences

The following equivalences form the mathematical basis for selection pushdown:

Selection Pushdown Equivalences
Equivalence	Description	Conditions
`σ[p](R ⋈ S) ≡ σ[p](R) ⋈ S`	Push selection to left operand	Predicate p uses only columns from R
`σ[p](R ⋈ S) ≡ R ⋈ σ[p](S)`	Push selection to right operand	Predicate p uses only columns from S
`σ[p1 ∧ p2](R) ≡ σ[p1](σ[p2](R))`	Decompose conjunctive predicates	Always valid
`σ[p1](σ[p2](R)) ≡ σ[p2](σ[p1](R))`	Selections commute	Always valid
`σ[p](R ∪ S) ≡ σ[p](R) ∪ σ[p](S)`	Push through union	Always valid for UNION ALL; for UNION requires care
`σ[p](R - S) ≡ σ[p](R) - S`	Push through set difference	Safe to push to positive operand

Visual Understanding: Push vs. No Push

Consider how data flows through a plan with and without selection pushdown:

Converting Mermaid diagram...

Decomposing Complex Predicates

Real-world WHERE clauses often combine conditions with AND and OR. Optimizers decompose these to maximize pushdown opportunities:

predicate_decomposition.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- Original query with compound predicate
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'           -- About customers (push to c)
  AND o.order_date > '2024-01-01' -- About orders (push to o)
  AND o.total > 100;              -- About orders (push to o)
 
-- Decomposed predicates:
-- Predicate 1: c.country = 'USA' → push to customers scan
-- Predicate 2: o.order_date > '2024-01-01' → push to orders scan
-- Predicate 3: o.total > 100 → push to orders scan
 
-- Optimized plan structure:
-- 1. Scan customers WHERE country = 'USA' (push predicate 1)
-- 2. Scan orders WHERE order_date > '2024-01-01' AND total > 100 (push 2 & 3)
-- 3. Join the filtered results
 
-- OR predicates are trickier:
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA' OR o.total > 10000;
 
-- Cannot simply push either predicate!
-- If we only pushed c.country = 'USA' to customers,
-- we'd miss orders with total > 10000 for non-USA customers
-- Must evaluate after join, OR convert to UNION

CNF vs DNF

Pushing Through Different Operators

Each relational operator presents unique considerations for selection pushdown. Mastering these rules enables maximum optimization.

Inner Joins: The Easy Case

Inner joins are fully transparent to selections. Any predicate can be pushed to whichever operand contains the referenced columns:

inner_join_pushdown.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- Rule: σ[p](R ⋈ S) ≡ σ[p](R) ⋈ S   [when p references only R]
-- Rule: σ[p](R ⋈ S) ≡ R ⋈ σ[p](S)   [when p references only S]
-- Rule: σ[p1 ∧ p2](R ⋈ S) ≡ σ[p1](R) ⋈ σ[p2](S)  [when p1 uses R, p2 uses S]
 
-- Example:
SELECT *
FROM employees e
JOIN departments d ON e.dept_id = d.id
JOIN projects p ON e.id = p.lead_id
WHERE d.location = 'NYC'       -- Pushes to departments
  AND e.hire_date > '2020-01-01'  -- Pushes to employees
  AND p.budget > 100000;       -- Pushes to projects
 
-- Plan after pushdown:
-- 1. Scan departments WHERE location = 'NYC'
-- 2. Scan employees WHERE hire_date > '2020-01-01'
-- 3. Scan projects WHERE budget > 100000
-- 4. Join (1) and (2) on dept_id
-- 5. Join result with (3) on lead_id
-- All filters applied at leaf level!

Outer Joins: The Tricky Case

Outer joins are NOT transparent to selections. The rules depend on which side of the outer join the predicate applies to:

Outer Join Pushdown Rules

•LEFT JOIN, predicate on LEFT table — Safe to push. Filtering the preserved side doesn't change outer join semantics.
•LEFT JOIN, predicate on RIGHT table — DANGEROUS! Cannot push. The predicate would filter out NULL-extended rows that should be in the result.
•RIGHT JOIN, predicate on RIGHT table — Safe to push. (Symmetric to LEFT JOIN case.)
•RIGHT JOIN, predicate on LEFT table — DANGEROUS! Cannot push.
•FULL OUTER JOIN, any predicate — Generally cannot push either side directly.

wrong_outer_join_pushdown.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
-- WRONG: Pushing predicate to nullable side
SELECT *
FROM customers c
LEFT JOIN orders o 
  ON c.id = o.customer_id
WHERE o.total > 100;  -- Filters NULLs!
 
-- This WHERE clause eliminates customers
-- with no orders (where o.total is NULL)
-- Converting LEFT JOIN to INNER JOIN!
 
-- Result: Only customers WITH orders > $100
-- Lost: Customers with no orders at all

correct_outer_join_handling.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
-- CORRECT: Keep predicate post-join or use ON
-- Option 1: Accept NULL filtering (if intentional)
SELECT *
FROM customers c
LEFT JOIN orders o 
  ON c.id = o.customer_id
AND o.total > 100;  -- In ON clause!
 
-- Now: Customers without high-value orders
-- still appear (with NULL order columns)
 
-- Option 2: Explicit NULL handling
WHERE o.total > 100 OR o.customer_id IS NULL

Aggregations (GROUP BY)

Selection pushdown through GROUP BY depends on whether the predicate references aggregate results or base columns:

pushdown_through_aggregation.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
-- HAVING vs WHERE: Different pushdown behavior
 
-- Predicate on grouping column: CAN push down
SELECT department_id, SUM(salary)
FROM employees
WHERE department_id IN (10, 20, 30)  -- Push to scan!
GROUP BY department_id;
 
-- Predicate on aggregate result: CANNOT push down
SELECT department_id, SUM(salary) as total_salary
FROM employees
GROUP BY department_id
HAVING SUM(salary) > 500000;  -- Must evaluate after GROUP BY
 
-- Mixed predicates:
SELECT department_id, AVG(salary)
FROM employees
WHERE hire_date > '2020-01-01'      -- Push to scan
GROUP BY department_id
HAVING COUNT(*) >= 5                 -- Evaluate post-aggregation
   AND department_id != 99;          -- Could theoretically push, but already in WHERE
 
-- Optimizer separates:
-- 1. Predicates on base columns → WHERE (pushed down)
-- 2. Predicates on aggregate results → HAVING (post-aggregation)
-- 3. Predicates on grouping columns in HAVING → may be pushed to WHERE

Subqueries and CTEs

Pushdown into subqueries and Common Table Expressions (CTEs) requires careful analysis:

pushdown_into_subqueries.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
-- Simple subquery: CAN push down (view merging)
SELECT * FROM (
    SELECT order_id, total, customer_id
    FROM orders
) subq
WHERE subq.total > 1000;
 
-- Optimizer merges to:
SELECT order_id, total, customer_id
FROM orders
WHERE total > 1000;
 
-- Subquery with DISTINCT: CANNOT push through
SELECT * FROM (
    SELECT DISTINCT customer_id, region
    FROM customers
) subq
WHERE subq.region = 'West';
 
-- WRONG if pushed: Would apply filter before DISTINCT
-- Semantically different! DISTINCT sees fewer input rows
 
-- CTE pushdown (depends on database):
WITH regional_sales AS (
    SELECT region, SUM(total) as region_total
    FROM orders
    GROUP BY region
)
SELECT * FROM regional_sales
WHERE region = 'West';
 
-- Some databases materialize CTEs (no pushdown)
-- Others inline CTEs like subqueries (pushdown possible)
-- PostgreSQL: Use MATERIALIZED / NOT MATERIALIZED hints

Blocking Operators

Predicate Inference and Derivation

Advanced optimizers don't just push existing predicates—they derive new predicates that weren't explicitly stated. This predicate inference can dramatically expand pushdown opportunities.

Transitivity-Based Inference

When columns are connected via equality conditions, predicates on one column can be inferred for the other:

transitivity_inference.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- Original query
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.id = 12345;
 
-- Optimizer infers: o.customer_id = 12345
-- Because: o.customer_id = c.id AND c.id = 12345
-- Therefore: o.customer_id = 12345 (transitivity)
 
-- Enhanced plan:
-- 1. Scan orders WHERE customer_id = 12345 (derived predicate!)
-- 2. Scan customers WHERE id = 12345 (original predicate)
-- 3. Join (both sides already filtered!)
 
-- Complex transitivity chain:
SELECT *
FROM a
JOIN b ON a.x = b.x
JOIN c ON b.x = c.x
WHERE a.x = 100;
 
-- Inferred: b.x = 100, c.x = 100
-- All three tables can be filtered at scan time!

Range Propagation

Similarly, range predicates can be propagated through equality joins:

range_propagation.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
-- Range predicate with join equality
SELECT *
FROM events e
JOIN event_types t ON e.type_id = t.id
WHERE t.id BETWEEN 1 AND 10;
 
-- Derived predicate: e.type_id BETWEEN 1 AND 10
-- Push to events table scan!
 
-- More complex: inequality chains
SELECT *
FROM a
JOIN b ON a.date = b.date
WHERE a.date >= '2024-01-01' AND a.date < '2024-04-01';
 
-- Derived: b.date >= '2024-01-01' AND b.date < '2024-04-01'
 
-- Careful with mixed predicates:
SELECT *
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE o.id > 1000 AND oi.quantity > 5;
 
-- o.id > 1000 implies oi.order_id > 1000 (useful!)
-- oi.quantity > 5 cannot be propagated to orders (different column)
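
A derived range predicate is only valid if adding it never changes the result. The sketch below checks this for the `events` / `event_types` query using Python's built-in `sqlite3` module; SQLite, the sample rows, and the variable names are assumptions chosen so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_types (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE events (id INTEGER PRIMARY KEY, type_id INTEGER);
""")
conn.executemany("INSERT INTO event_types VALUES (?, ?)",
                 [(i, f"type-{i}") for i in range(1, 21)])
# type_id cycles through 0..24, so some events match no event type at all.
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, i % 25) for i in range(1, 201)])

original = """
    SELECT e.id, t.id FROM events e
    JOIN event_types t ON e.type_id = t.id
    WHERE t.id BETWEEN 1 AND 10
    ORDER BY e.id
"""
with_derived = """
    SELECT e.id, t.id FROM events e
    JOIN event_types t ON e.type_id = t.id
    WHERE t.id BETWEEN 1 AND 10
      AND e.type_id BETWEEN 1 AND 10   -- derived from the join equality
    ORDER BY e.id
"""
rows_original = conn.execute(original).fetchall()
rows_derived = conn.execute(with_derived).fetchall()
print(rows_original == rows_derived)
```

The derived predicate is redundant for correctness, but it gives the scan of `events` a filter it can apply before the join.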

Constraint-Based Inference

Database constraints provide additional inference opportunities:

Constraint-Based Inferences

•CHECK constraints — If CHECK (status IN ('A','B','C')) exists, predicate status = 'D' implies empty result.
•NOT NULL constraints — If column is NOT NULL, predicate column IS NULL returns empty result.
•UNIQUE constraints — If unique constraint exists, equality lookup returns at most one row.
•Foreign key constraints — If the FK is enforced and the child column is NOT NULL, each child row joins to exactly one parent row.
•Partition elimination — Partition bounds act as implicit CHECK constraints for range predicates.

constraint_based_inference.sql
-- Constraint: orders.status IN ('pending', 'shipped', 'delivered', 'cancelled')
SELECT *
FROM orders
WHERE status = 'invalid_status';
-- An optimizer that consults CHECK constraints (not all do so by default)
-- can detect the contradiction and return an empty result
-- without scanning the table
 
-- Partition elimination:
-- Table: sales PARTITION BY RANGE (sale_date)
-- Partition: sales_2024_q1 FOR VALUES FROM ('2024-01-01') TO ('2024-04-01')
 
SELECT * FROM sales
WHERE sale_date BETWEEN '2024-02-01' AND '2024-02-28';
 
-- Only scans partition sales_2024_q1
-- Other partitions eliminated based on implied constraints
 
-- Foreign key optimization:
-- FK: orders.customer_id REFERENCES customers(id)
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id;
 
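-- Optimizer knows every non-NULL orders.customer_id matches exactly one customer
-- (many orders per customer, but one customer per order), so the join returns
-- one row per order with a non-NULL customer_id: an exact cardinality estimate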
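
The first two inferences can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how any specific engine implements it: `contradicts_check`, `prune_partitions`, and the `sales_2024_q2` / `sales_2024_q3` partitions are hypothetical names added for illustration, and partition bounds are taken to be half-open `[start, end)` ranges.

```python
from datetime import date

# Assumed CHECK constraint on orders.status, mirroring the SQL comment above.
ALLOWED_STATUSES = {"pending", "shipped", "delivered", "cancelled"}

def contradicts_check(allowed_values, requested_value):
    """True if `col = requested_value` can never hold under CHECK (col IN allowed_values)."""
    return requested_value not in allowed_values

# Hypothetical quarterly partitions, each with [start, end) bounds.
PARTITIONS = {
    "sales_2024_q1": (date(2024, 1, 1), date(2024, 4, 1)),
    "sales_2024_q2": (date(2024, 4, 1), date(2024, 7, 1)),
    "sales_2024_q3": (date(2024, 7, 1), date(2024, 10, 1)),
}

def prune_partitions(partitions, lo, hi):
    """Keep partitions whose [start, end) bounds overlap the closed range [lo, hi]."""
    return [name for name, (start, end) in partitions.items()
            if start <= hi and lo < end]

print(contradicts_check(ALLOWED_STATUSES, "invalid_status"))
print(prune_partitions(PARTITIONS, date(2024, 2, 1), date(2024, 2, 28)))
```

The overlap test treats the query range as closed because `BETWEEN` includes both endpoints, while the partition's upper bound is exclusive.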

Enable Inference with Constraints

Selection Pushdown and Index Usage

Selection pushdown becomes especially powerful when pushed predicates enable index usage. A predicate that reaches the table scan level can leverage indexes for efficient access.

From Full Scan to Index Seek

When a pushed-down predicate reaches the scan level, a full table scan followed by a filter can be replaced by an index seek that reads only the matching rows.
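
The change from scan to seek can be observed directly in a query plan. The sketch below uses SQLite through Python's `sqlite3` module purely because it is self-contained; the table, the index name `idx_customer`, and the helper `plan` are assumptions, and other engines word their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

query = "SELECT * FROM orders WHERE customer_id = 12345"

plan_without_index = plan(query)   # predicate can only be checked row by row
conn.execute("CREATE INDEX idx_customer ON orders (customer_id)")
plan_with_index = plan(query)      # same predicate now drives an index lookup

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than assuming a fixed string.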

Index Selection After Pushdown

Once predicates are pushed to the scan level, the optimizer evaluates index options:

index_selection_after_pushdown.sql
-- Table: orders
-- Indexes: idx_customer (customer_id), idx_date (order_date), 
--          idx_status (status), idx_customer_date (customer_id, order_date)
 
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'West'
  AND o.order_date > '2024-01-01'
  AND o.status = 'shipped';
 
-- After pushdown to orders table:
-- Predicates: order_date > '2024-01-01' AND status = 'shipped'
 
-- Access path options:
-- 1. idx_date: Use for order_date > '2024-01-01', filter status afterward
-- 2. idx_status: Use for status = 'shipped', filter date afterward  
-- 3. Full scan: Apply both filters during scan
-- 4. Index intersection: Use both indexes, intersect results
 
-- Best choice depends on:
-- - How selective each predicate is
-- - Index clustering (are matching rows physically close?)
-- - Memory available for bitmap intersection
 
-- If customer_id is also derived (via join transitivity):
-- idx_customer_date composite index might be best!

Covering Index Opportunities

Maximal pushdown can enable index-only scans when pushed predicates and projected columns are all in the index:

covering_index_opportunity.sql
-- Index: CREATE INDEX idx_orders_covering 
--        ON orders(customer_id, order_date, total)
 
SELECT customer_id, order_date, total
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.id = 12345
  AND o.order_date > '2024-01-01';
 
-- After transitivity inference: o.customer_id = 12345
-- Pushed predicates: customer_id = 12345 AND order_date > '2024-01-01'
-- Required columns: customer_id, order_date, total
 
-- All predicates and columns are in idx_orders_covering!
-- Index-only scan possible:
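-- - Avoids reading the table heap (PostgreSQL may still visit heap pages
--   that are not marked all-visible in the visibility map)
-- - Finds rows directly in index structure
-- - Returns results from index leaf pages
 
-- Performance gain: can be large, but depends on row width, caching,
-- and how many heap fetches are actually avoided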
-- Especially beneficial for:
-- - Large tables with big rows
-- - Remote/distributed storage
-- - High-selectivity queries
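
Whether an index covers a query depends on every column the query touches, not only on the predicates. The following sketch, again using SQLite via `sqlite3` as an assumed stand-in engine, applies the already-derived predicates to `orders` and compares a covered query with one that also needs a hypothetical `notes` column that is not in the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER,
        order_date TEXT, total REAL, notes TEXT
    );
    CREATE INDEX idx_orders_covering ON orders (customer_id, order_date, total);
""")

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

where = "WHERE customer_id = 12345 AND order_date > '2024-01-01'"

# Every referenced column is stored in the index.
covered_plan = plan(f"SELECT customer_id, order_date, total FROM orders {where}")
# `notes` is not in the index, so the base table must be visited as well.
uncovered_plan = plan(f"SELECT customer_id, order_date, total, notes FROM orders {where}")

print(covered_plan)
print(uncovered_plan)
```

Adding a single unindexed column to the select list is enough to lose the index-only access path, which is one reason `SELECT *` works against this optimization.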

Pushing Predicates to Indexes on Foreign Tables

In distributed or federated databases, selection pushdown becomes even more critical—pushing predicates to remote systems reduces network transfer:

federated_pushdown.sql
-- Foreign table on remote database
CREATE FOREIGN TABLE remote_orders (
    id INT,
    customer_id INT,
    total DECIMAL,
    order_date DATE
) SERVER remote_server;
 
SELECT * 
FROM remote_orders
WHERE order_date > '2024-01-01'
  AND total > 1000;
 
-- Without pushdown:
-- 1. Fetch ALL rows from remote (millions of rows over network)
-- 2. Apply filter locally
 
-- With pushdown:
-- 1. Send query to remote with WHERE clause
-- 2. Remote applies filter using local indexes
-- 3. Only matching rows (thousands) sent over network
 
-- Critical considerations:
-- - Does the foreign data wrapper support predicate pushdown?
-- - Are operators and data types compatible across systems?
-- - Can remote indexes be leveraged?
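
The first two considerations amount to splitting predicates into those that are safe to ship to the remote system and those that must be evaluated locally. Below is a minimal Python sketch of that split. It does not model any real foreign data wrapper: `Predicate`, `REMOTE_SAFE_OPS`, `split_predicates`, `build_remote_query`, and the `note` column are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    column: str
    op: str
    value: object

# Hypothetical capability list: operators the remote system is assumed
# to evaluate with the same semantics as the local one.
REMOTE_SAFE_OPS = {"=", "<", "<=", ">", ">="}

def split_predicates(predicates):
    """Separate predicates that can be shipped remotely from those kept local."""
    remote = [p for p in predicates if p.op in REMOTE_SAFE_OPS]
    local = [p for p in predicates if p.op not in REMOTE_SAFE_OPS]
    return remote, local

def build_remote_query(table, predicates):
    """Build a parameterized remote query; values travel separately from the SQL text."""
    sql = f"SELECT * FROM {table}"
    if predicates:
        sql += " WHERE " + " AND ".join(f"{p.column} {p.op} ?" for p in predicates)
    return sql, [p.value for p in predicates]

predicates = [
    Predicate("order_date", ">", "2024-01-01"),
    Predicate("total", ">", 1000),
    Predicate("note", "~*", "rush"),  # regex match: assumed unsupported remotely
]
remote, local = split_predicates(predicates)
sql, params = build_remote_query("remote_orders", remote)
print(sql, params)
print(local)
```

Predicates left in `local` still have to be applied after the rows arrive, so the remote filter only reduces network transfer as far as the shared operator set allows.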

Push Down vs. Compute Down

Practical Strategies for Maximizing Pushdown

While optimizers work hard to push selections down, developers can write queries that facilitate or hinder pushdown. Understanding these patterns lets you write queries that optimize better.

Write SARGable Predicates

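SARGable (Search ARGument-able) predicates compare a bare column with a constant or parameter, so an index on that column can be used. Each non-SARGable form in the first list has a SARGable counterpart at the same position in the second list: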

Non-SARGable (Avoid)

•YEAR(order_date) = 2024
•UPPER(name) = 'SMITH'
•price + tax > 100
•quantity * 2 < 10
•COALESCE(date, NOW()) > X
•status != 'active'
•name LIKE '%smith%'

SARGable (Prefer)

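•order_date >= '2024-01-01' AND order_date < '2025-01-01'
•name = 'smith' (case-insensitive collation)
•price > 100 - tax (only when tax is a constant or parameter, not a column of the same row)
•quantity < 5
•date > X OR (date IS NULL AND NOW() > X)
•status IN ('pending', 'shipped') (listing every other allowed value)
•name LIKE 'smith%' (prefix match only; not equivalent to a substring search)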
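
A rewrite is only worth making if it returns the same rows and changes the access path. The sketch below checks both for the year filter, using SQLite via Python's `sqlite3` module as an assumed test engine: SQLite has no `YEAR()` function, so `strftime('%Y', ...)` stands in for it, and dates are stored as ISO-8601 text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT);
    CREATE INDEX idx_date ON orders (order_date);
    INSERT INTO orders (order_date) VALUES
        ('2023-12-31'), ('2024-01-01'), ('2024-06-15'),
        ('2024-12-31'), ('2025-01-01');
""")

# Function applied to the column: the index cannot be used to seek.
function_form = "SELECT id FROM orders WHERE strftime('%Y', order_date) = '2024'"
# Bare column compared with constants: a half-open range covering the same year.
range_form = ("SELECT id FROM orders "
              "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")

def ids(sql):
    return sorted(row[0] for row in conn.execute(sql))

def plan(sql):
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

same_rows = ids(function_form) == ids(range_form)
function_plan = plan(function_form)
range_plan = plan(range_form)

print(same_rows)
print(function_plan)
print(range_plan)
```

The upper bound is written as `< '2025-01-01'` rather than `<= '2024-12-31'` so the rewrite stays correct if the column later stores timestamps.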

Separate Predicates for Different Tables

Explicit separation of predicates by table makes pushdown analysis easier:

separated_predicates.sql
-- Less optimal: Combined predicate harder to decompose
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE (c.region = 'West' AND o.total > 100) 
   OR (c.region = 'East' AND o.total > 200);
-- Optimizer must analyze OR branches carefully
 
-- More optimal: Separate when possible
-- Option 1: UNION approach
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'West' AND o.total > 100
UNION ALL
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'East' AND o.total > 200;
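-- Each branch can push fully; UNION ALL returns no duplicates here
-- because the branches are mutually exclusive (a customer has one region)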
 
-- Option 2: Denormalize if query is common
-- Add region to orders table, filter directly
SELECT * FROM orders WHERE region = 'West' AND total > 100;
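
Before replacing an OR with UNION ALL, it is worth confirming that the two forms return the same multiset of rows. The sketch below does this on a small assumed data set with SQLite via `sqlite3`; the sample rows are invented for illustration.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'West'), (2, 'East'), (3, 'North');
    INSERT INTO orders VALUES
        (1, 1, 150), (2, 1, 50), (3, 2, 250), (4, 2, 150), (5, 3, 500);
""")

or_form = """
    SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id
    WHERE (c.region = 'West' AND o.total > 100)
       OR (c.region = 'East' AND o.total > 200)
"""
union_form = """
    SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id
    WHERE c.region = 'West' AND o.total > 100
    UNION ALL
    SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id
    WHERE c.region = 'East' AND o.total > 200
"""

# Compare as multisets: UNION ALL would expose duplicates if the branches overlapped.
or_rows = Counter(conn.execute(or_form).fetchall())
union_rows = Counter(conn.execute(union_form).fetchall())
print(or_rows == union_rows)
```

If the two branches could match the same row, the UNION ALL form would return it twice, and UNION (with its deduplication cost) or an extra exclusion predicate would be needed instead.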

Use EXISTS Instead of IN for Correlation

EXISTS subqueries can optimize better than IN subqueries on optimizers that do not rewrite both forms into the same semi-join:

exists_vs_in.sql
-- IN subquery: May process entire subquery first
SELECT * FROM products
WHERE category_id IN (
    SELECT id FROM categories WHERE department = 'Electronics'
);
 
-- EXISTS: Can short-circuit, often correlates better
SELECT * FROM products p
WHERE EXISTS (
    SELECT 1 FROM categories c 
    WHERE c.id = p.category_id AND c.department = 'Electronics'
);
 
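-- EXISTS advantages (where the optimizer plans the two forms differently):
-- 1. Short-circuits: Stops when first match found
-- 2. Avoids building the full subquery result when it is large
-- 3. Often converted to semi-join efficiently
-- 4. Correlation predicate c.id = p.category_id can use the index on categories.id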
 
-- Modern optimizers often transform IN to EXISTS internally
-- But explicit EXISTS makes optimization intent clearer

Leverage Views and CTEs Carefully

Views and CTEs can either help or hinder optimization:

view_cte_optimization.sql
-- Inlined view: Optimization friendly
CREATE VIEW active_customers AS
SELECT * FROM customers WHERE status = 'active';
 
SELECT * FROM active_customers WHERE region = 'West';
-- Merged to: SELECT * FROM customers WHERE status='active' AND region='West'
-- Both predicates push down!
 
-- Materialized CTE: Optimization barrier (PostgreSQL)
WITH regional_data AS MATERIALIZED (
    SELECT region, SUM(sales) FROM orders GROUP BY region
)
SELECT * FROM regional_data WHERE region = 'West';
-- CTE computed first (all regions), then filtered
-- 'West' filter cannot push into CTE
 
-- Non-materialized: Optimization enabled
WITH regional_data AS (  -- No MATERIALIZED keyword
    SELECT region, SUM(sales) FROM orders GROUP BY region
)
SELECT * FROM regional_data WHERE region = 'West';
-- PostgreSQL 12+ inlines a non-recursive, side-effect-free CTE that is
-- referenced once, so the 'West' filter can push down below the GROUP BY

Use EXPLAIN to Verify Pushdown

Summary: Selection Before Join

Key Takeaways

•Selection pushdown is grounded in relational algebra — Equivalences prove that pushing predicates preserves query semantics while reducing intermediate result sizes.
•Different operators have different pushdown rules — Inner joins are transparent; outer joins require careful analysis; aggregations block predicates on aggregate results.
•Predicate inference amplifies pushdown — Transitivity through join conditions and constraint knowledge enable derived predicates that weren't explicitly stated.
•Pushdown enables index usage — Predicates at the scan level can leverage indexes, transforming full scans into index seeks.
•Writing SARGable queries facilitates optimization — Avoid functions on columns, use explicit ranges, and structure predicates for clean decomposition.
•Verify with EXPLAIN — Don't assume pushdown occurred; confirm that filters apply at table scans rather than post-join.
