If selection pushdown reduces the number of tuples flowing through a query plan, projection pushdown reduces their width. Together, these optimizations minimize the volume of data at every stage of query processing.
Consider a table with 50 columns and millions of rows. If your query only needs 3 columns, why carry the other 47 through joins, sorts, and aggregations? Each unnecessary column consumes memory, cache space, and I/O bandwidth. In column-oriented storage systems, the unneeded columns may never be read from disk at all.
Projection pushdown systematically eliminates unused attributes as early as possible in the query plan. While less dramatic than selection pushdown in terms of row reduction, it provides consistent, multiplicative benefits across all operations by reducing the per-tuple processing cost.
By the end of this page, you will understand: (1) how projection pushdown reduces query execution costs, (2) the rules governing projection through different operators, (3) the critical concept of 'extended projection' for joins, and (4) the interaction between projection pushdown and modern columnar storage.
The Width Problem
In database systems, tuple width directly affects:
Memory consumption: Every intermediate result consumes memory proportional to (rows × width). Wide tuples exhaust memory faster, causing spills to disk.
I/O bandwidth: Whether reading from disk or transferring between operators, wider tuples mean more bytes moved.
Cache efficiency: Modern CPUs rely heavily on cache. Narrower tuples mean more tuples fit in cache, dramatically improving performance.
Network transfer: For distributed databases, tuple width directly affects network costs.
Hash table efficiency: Hash joins and aggregations build in-memory structures. Narrower keys and values mean smaller hash tables.
Quantifying the Impact
Consider a concrete example:
Table: Orders (100 columns total, average 50 bytes per column)
Full row width: ~5,000 bytes
Query needs: order_id, customer_id, total (3 columns, ~150 bytes)
For 1 million rows:
Full rows: 1,000,000 × 5,000 bytes ≈ 5 GB of intermediate data
Projected rows: 1,000,000 × 150 bytes ≈ 150 MB
This reduction applies to every operation: every join, sort, and aggregation processes 33× less data.
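A quick back-of-the-envelope check of these figures in Python (the byte counts are the example's assumptions, not measured values):

```python
rows = 1_000_000
full_width = 5_000        # bytes per row: 100 columns x ~50 bytes each
projected_width = 150     # bytes per row: only the 3 needed columns

full_bytes = rows * full_width            # ~5 GB of intermediate data
projected_bytes = rows * projected_width  # ~150 MB

print(full_bytes // projected_bytes)  # 33 -> roughly 33x less data per operator
```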
| Operation | Width Impact | Why Width Matters |
|---|---|---|
| Table Scan | Proportional | Reading unnecessary columns from disk (esp. row stores) |
| Hash Join Build | Proportional, superlinear if spilling | Hash table memory scales with tuple width; overflow forces disk spills |
| Sort | Proportional | Sorting moves entire tuples; narrower = faster |
| Aggregation | Proportional | Group hashtables store keys; narrower keys help |
| Network Transfer | Proportional | Every byte crosses the network |
| Temp Materialization | Proportional | Spilling to disk amplifies width impact |
Core Objective
Projection pushdown aims to eliminate attributes as early as possible while preserving correctness. The fundamental constraint is:
An attribute can be projected out (removed) as soon as it's no longer needed by any subsequent operation in the query plan.
What Determines "Need"?
An attribute is needed downstream if it appears in the final output list, in a selection or join predicate evaluated above the current node, in grouping keys or aggregate function inputs, in sort keys, or in an expression that computes another needed attribute.
The Extended Attribute Set
For any node in the query tree, we compute its required attributes—the set of attributes that must flow upward from this node. This set includes:
required(node) = output_attrs(node) ∪ predicate_attrs(ancestors)
∪ join_attrs(ancestors) ∪ ...
While selection pushdown works top-down (pushing predicates from the root toward the leaves), projection pushdown reasons in both directions: the attributes each node produces are determined bottom-up from its children, while the attributes actually needed are propagated top-down from the root. Any produced attribute that is not needed can be dropped.
The Basic Rule
π_L(Op(R, ...)) → Op(π_{L∪A}(R), ...)
Where A is the set of additional attributes needed by operator Op that aren't in the final output L.
Example:
SELECT name FROM Employees WHERE dept_id = 5;
Only name (the output) and dept_id (the predicate attribute) are needed; assuming Employees has 50 columns, the other 48 can be projected out immediately at the base table scan.
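The rule can be checked on toy data. Here is a minimal sketch with invented rows, where dicts stand in for tuples and only a handful of the columns are shown:

```python
# Toy check of the basic rule: projecting early (keeping the output column plus
# the predicate attribute) gives the same answer as projecting at the end.
# Table contents and extra columns are invented for the example.
employees = [
    {"name": "Ada", "dept_id": 5, "salary": 120, "office": "B2"},
    {"name": "Lin", "dept_id": 3, "salary": 95,  "office": "A1"},
    {"name": "Sam", "dept_id": 5, "salary": 88,  "office": "C7"},
]

def project(rows, attrs):
    return [{a: r[a] for a in attrs} for r in rows]

# Project late: filter full-width rows, then keep only `name`.
late = project([r for r in employees if r["dept_id"] == 5], ["name"])

# Push down: keep L ∪ attrs(p) = {name, dept_id} at the scan, filter the
# narrow rows, then drop `dept_id`.
narrow = project(employees, ["name", "dept_id"])
early = project([r for r in narrow if r["dept_id"] == 5], ["name"])

print(early == late)  # True: same result, narrower intermediate tuples
```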
Joins are the most impactful location for projection pushdown because they often combine large tables with many columns.
The Extended Projection Rule for Joins
π_L(R ⋈_θ S) ≡ π_L(π_{L_R ∪ A_R}(R) ⋈_θ π_{L_S ∪ A_S}(S))
Where:
L_R, L_S: the attributes of the output list L that come from R and S, respectively
A_R, A_S: the attributes of R and S referenced by the join condition θ
Critical Insight: Before the join, each side must retain both its contribution to the output attributes (L_R or L_S) and its join condition attributes (A_R or A_S).
After the join, we can project down to just the output attributes.
Step-by-Step Example:
SELECT c.name, o.total
FROM Customers c (50 columns)
JOIN Orders o (30 columns) ON c.id = o.customer_id
WHERE o.status = 'completed';
Analysis:
Output attributes: c.name, o.total
Join attributes: c.id (from Customers), o.customer_id (from Orders)
Predicate attributes: o.status
Pushed projections:
Customers: {id, name} (2 of 50 columns)
Orders: {customer_id, total, status} (3 of 30 columns)
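These pushed projection lists can be derived mechanically from the three attribute sets. A small Python sketch, with attributes tagged by their side and schemas abbreviated:

```python
# Deriving the per-side pushed projections for the Customers/Orders query.
# Attributes are (side, name) pairs; "c" = Customers, "o" = Orders.
output_attrs    = {("c", "name"), ("o", "total")}
join_attrs      = {("c", "id"), ("o", "customer_id")}
predicate_attrs = {("o", "status")}

needed = output_attrs | join_attrs | predicate_attrs

customers_cols = {a for (side, a) in needed if side == "c"}
orders_cols    = {a for (side, a) in needed if side == "o"}

print(sorted(customers_cols))  # ['id', 'name']
print(sorted(orders_cols))     # ['customer_id', 'status', 'total']
```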
-- Original query (all columns flow through)
SELECT c.name, o.total
FROM Customers c
JOIN Orders o ON c.id = o.customer_id
WHERE o.status = 'completed';

-- Conceptually equivalent after projection pushdown:
SELECT c.name, o.total
FROM (
  SELECT id, name FROM Customers              -- Only needed columns
) c
JOIN (
  SELECT customer_id, total, status FROM Orders  -- Only needed columns
) o ON c.id = o.customer_id
WHERE o.status = 'completed';

-- In practice, the optimizer generates a plan that reads only needed columns
-- from base tables, never materializing excluded columns. The subqueries above
-- are just a conceptual representation.
Multi-Way Joins
For queries with multiple joins, projection pushdown propagates through the entire join tree:
π_L((R ⋈ S) ⋈ T)
→ π_L(π_{L_RS∪A_RS}(R ⋈ S) ⋈ T)
→ π_L(π_{L_RS∪A_RS}(π_{L_R∪A_R}(R) ⋈ π_{L_S∪A_S}(S)) ⋈ π_{L_T∪A_T}(T))
Each join stage projects out attributes no longer needed by subsequent operations. The key is computing the correct required attribute sets at each level.
For outer joins, be careful: attributes from the null-supplying side are NULL-padded for unmatched tuples of the preserved side. Projection can still be pushed down, but any output column from the null-supplying side must survive the pushdown so that NULLs are generated correctly for unmatched tuples.
Different query operators have different relationships with projection.
π_L(σ_p(R)) ≡ π_L(σ_p(π_{L∪attrs(p)}(R)))
Projection can be pushed through selection, but must include predicate attributes. This is bidirectional: projection can go before or after selection.
Preferred order: Typically, project first (to narrow tuples), then select. The selection runs faster on narrower tuples.
π_L(γ_{G,F}(R)) ≡ π_L(γ_{G,F}(π_{G∪inputs(F)}(R)))
For aggregation, project to only:
The grouping attributes G
The input attributes of the aggregate functions F
Example:
SELECT dept, SUM(salary)
FROM Employees (50 columns)
GROUP BY dept;
Only {dept, salary} need to flow into the aggregation—48 columns can be projected out.
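A toy sketch of the rule, with two invented extra columns standing in for the other 48:

```python
# Only G ∪ inputs(F) = {dept, salary} needs to reach the aggregate; the other
# columns (invented here) are projected out at the scan.
from collections import defaultdict

employees = [
    {"dept": "eng", "salary": 100, "office": "A1", "bio": "..."},
    {"dept": "eng", "salary": 120, "office": "B2", "bio": "..."},
    {"dept": "ops", "salary": 90,  "office": "C3", "bio": "..."},
]

# Pushed projection: keep only the grouping key and the aggregate input.
narrow = [(r["dept"], r["salary"]) for r in employees]

sums = defaultdict(int)
for dept, salary in narrow:
    sums[dept] += salary

print(dict(sums))  # {'eng': 220, 'ops': 90}
```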
π_L(R ∪ S) ≡ π_L(R) ∪ π_L(S)
Projection distributes over union; both operands must be projected to the same schema. Caution: the analogous rewrite is not valid for set difference or intersection in general, because projecting first can merge tuples that differ only on the dropped attributes.
π_L(sort_K(R)) ≡ π_L(sort_K(π_{L∪K}(R)))
Sort needs the ordering keys plus the output columns. Extra columns can be projected out before sorting, making the sort operate on narrower tuples.
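A minimal sketch of the sort rule on invented rows: the wide note column is dropped before sorting, and the sort key is dropped after:

```python
# pi_L(sort_K(R)) == pi_L(sort_K(pi_{L ∪ K}(R))): sort a narrow projection that
# retains the sort key, then project to the output. Columns are invented.
rows = [
    {"order_id": 3, "total": 50, "note": "x" * 100},
    {"order_id": 1, "total": 90, "note": "y" * 100},
    {"order_id": 2, "total": 70, "note": "z" * 100},
]

# Output L = {total}, sort key K = {order_id}: keep L ∪ K before sorting.
narrow = [{"order_id": r["order_id"], "total": r["total"]} for r in rows]
narrow.sort(key=lambda r: r["order_id"])
pushed = [{"total": r["total"]} for r in narrow]

# Reference plan: sort the full-width rows, project at the end.
reference = [{"total": r["total"]} for r in sorted(rows, key=lambda r: r["order_id"])]

print(pushed == reference)  # True, but the sort moved ~100 fewer bytes per tuple
```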
| Operator | Can Push? | Extended Attributes Needed | Benefit |
|---|---|---|---|
| Selection (σ) | Yes | Predicate attributes | Narrower tuples for predicate evaluation |
| Natural Join (⋈) | Yes | Join key attributes | Reduced memory for hash tables, smaller I/O |
| Aggregation (γ) | Yes | Grouping keys + aggregate inputs | Smaller group hashtables |
| Sort | Yes | Sort keys | Faster sorting on narrow tuples |
| Union (∪) | Yes | None extra | Both branches get same projection |
| Distinct | Yes | None extra | Deduplication on fewer columns |
| Limit | Partially | None extra | LIMIT is usually applied late in the plan; projection passes through it unchanged |
The core of projection pushdown is computing, for each node in the query tree, exactly which attributes must be produced. This is the required attributes computation.
The algorithm works in two passes:
Pass 1: Top-Down Requirement Propagation
Starting from the root (final output), propagate requirements downward:
required(root) = output columns
for each node, given required(node):
compute required(children) by adding:
- attributes needed by this node's operation
- minus attributes this node generates
Pass 2: Bottom-Up Projection Insertion
Insert projections at each node to produce only required attributes:
for each node bottom-up:
if produced(node) ⊃ required(node):
insert π_{required(node)} above node (or integrate into node)
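The two passes can be combined into a short runnable sketch. The node representation and helper names here are my own, not any particular optimizer's API:

```python
# Simplified projection pushdown over a dict-based plan tree. Predicates and
# join conditions carry their attribute sets explicitly to keep the sketch short.

def schema(node):
    """Attributes a subtree can produce (bottom-up)."""
    if node["op"] == "table":
        return set(node["columns"])
    if node["op"] == "project":
        return set(node["attrs"])
    if node["op"] == "select":
        return schema(node["child"])
    if node["op"] == "join":
        return schema(node["left"]) | schema(node["right"])
    raise ValueError(node["op"])

def pushdown(node, required):
    """Propagate required attributes top-down; insert projections at the leaves."""
    op = node["op"]
    if op == "project":
        node["child"] = pushdown(node["child"], set(node["attrs"]))
        return node
    if op == "select":
        node["child"] = pushdown(node["child"], required | node["pred_attrs"])
        return node
    if op == "join":
        need = required | node["join_attrs"]
        node["left"] = pushdown(node["left"], need & schema(node["left"]))
        node["right"] = pushdown(node["right"], need & schema(node["right"]))
        return node
    if op == "table":
        if required < set(node["columns"]):  # strictly fewer columns needed
            return {"op": "project", "attrs": sorted(required), "child": node}
        return node
    raise ValueError(op)

# Tiny plan for the running example (extra column names are invented):
# pi_{name,total}( sigma_{status='completed'}( Customers join Orders ) )
plan = {
    "op": "project", "attrs": ["name", "total"],
    "child": {
        "op": "select", "pred_attrs": {"status"},
        "child": {
            "op": "join", "join_attrs": {"id", "customer_id"},
            "left":  {"op": "table", "name": "Customers",
                      "columns": ["id", "name", "address", "phone"]},
            "right": {"op": "table", "name": "Orders",
                      "columns": ["customer_id", "total", "status", "shipping"]},
        },
    },
}

plan = pushdown(plan, {"name", "total"})
join = plan["child"]["child"]
print(join["left"]["attrs"])   # ['id', 'name']
print(join["right"]["attrs"])  # ['customer_id', 'status', 'total']
```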
function computeRequiredAttributes(node, requiredFromAbove):
    // What this node needs from its children depends on the operator

    if node is Project(outputList, child):
        // Project only needs what's in outputList (already a projection)
        childRequired = outputList
        return computeRequiredAttributes(child, childRequired)

    if node is Select(predicate, child):
        // Selection needs predicate attributes plus what's required above
        childRequired = requiredFromAbove ∪ attrs(predicate)
        return computeRequiredAttributes(child, childRequired)

    if node is Join(left, right, joinCond):
        // Partition required attributes to left and right children
        leftRequired  = (requiredFromAbove ∩ available(left))
                      ∪ (attrs(joinCond) ∩ available(left))
        rightRequired = (requiredFromAbove ∩ available(right))
                      ∪ (attrs(joinCond) ∩ available(right))
        computeRequiredAttributes(left, leftRequired)
        computeRequiredAttributes(right, rightRequired)
        // Insert projections if children produce more than required
        if produced(left) ⊃ leftRequired:
            replaceChild(node, left, Project(leftRequired, left))
        if produced(right) ⊃ rightRequired:
            replaceChild(node, right, Project(rightRequired, right))

    if node is Aggregate(groupBy, aggFuncs, child):
        // Need grouping columns plus aggregate inputs
        childRequired = groupBy ∪ aggregateInputs(aggFuncs)
        return computeRequiredAttributes(child, childRequired)

    if node is BaseTable(R):
        // Insert a projection to produce only required columns
        if requiredFromAbove ⊂ schema(R):
            return Project(requiredFromAbove, R)
        return R

Expressions and Computed Columns
When a query includes expressions like a + b AS sum, the required set must include both a and b until the expression is computed:
required = {sum} → after expression computation
required = {a, b} → before expression computation
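A tiny helper illustrating this requirement rewriting (the column names total_ab, a, and b are invented):

```python
# Rewrite a required-attribute set across an expression boundary: above the
# expression the computed name is required; below it, the inputs are.
def required_below_expression(required_above, computed):
    """computed maps each output name to the set of input attributes it reads."""
    below = set(required_above)
    for out, inputs in computed.items():
        if out in below:
            below.discard(out)   # the name doesn't exist below the expression
            below |= inputs      # ...but its inputs must flow up to it
    return below

print(sorted(required_below_expression({"total_ab"}, {"total_ab": {"a", "b"}})))
# ['a', 'b']
```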
Subqueries
Correlated subqueries reference outer columns. These correlating columns must be tracked through the scope hierarchy and included in required sets at appropriate levels.
Projection pushdown becomes dramatically more impactful with columnar storage systems.
Row-oriented storage (traditional RDBMS): entire rows are read from disk regardless of which columns the query needs, so projection pushdown mainly saves memory, cache, and CPU after the scan.
Column-oriented storage (modern analytics systems): each column is stored separately, so projection pushdown translates directly into disk I/O savings; columns that are projected out are simply never read.
The Impact Is Multiplicative
For a query that needs only a few of a 100-column table's columns, columnar storage amplifies every saving:
| Aspect | Row Storage | Column Storage | Improvement Factor |
|---|---|---|---|
| Disk I/O | Minimal savings | Major savings (only read needed columns) | 10–100× |
| Memory | Moderate savings | Major savings | Proportional to column reduction |
| CPU/Decompression | Moderate savings | Major savings (skip column decompression) | 10–50× |
| Cache efficiency | Moderate | Excellent (columns fit in cache) | 2–10× |
| Vectorized processing | Some benefit | Major benefit (full column batches) | 5–20× |
Systems like Snowflake, BigQuery, Redshift, and ClickHouse are columnar. When using these systems, projection pushdown is critical for performance. Always SELECT only the columns you need—never use SELECT * in analytics queries unless absolutely necessary.
In columnar systems, projection and selection pushdown work together powerfully:
Zone maps / min-max statistics: Columnar systems maintain per-column metadata. Pushing selections allows skipping entire column chunks.
Late materialization: Read only filtering columns first, evaluate predicates, then fetch remaining columns only for qualifying rows.
Dictionary encoding: Operate on encoded values without decompression until projection extracts final needed values.
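Late materialization can be sketched with plain Python lists standing in for column chunks (column names are from the running example; the values are invented):

```python
# Late materialization: read only the filtering column first, then fetch the
# projected columns only at qualifying row positions.
status   = ["completed", "pending", "completed", "failed"]
order_id = [101, 102, 103, 104]
total    = [50, 20, 75, 10]

# Pass 1: evaluate the predicate on the single filtering column.
matches = [i for i, s in enumerate(status) if s == "completed"]

# Pass 2: gather only the projected columns, only at matching positions.
result = [(order_id[i], total[i]) for i in matches]

print(result)  # [(101, 50), (103, 75)]
```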
This is why analytical queries on columnar systems can be 10–100× faster than equivalent queries on row stores—and why projection pushdown is essential.
While projection pushdown is generally beneficial, there are nuances to be aware of.
-- Avoid this in production queries:
SELECT * FROM Orders WHERE status = 'pending';
-- Use this instead:
SELECT order_id, customer_id, total FROM Orders WHERE status = 'pending';
SELECT * prevents the optimizer from knowing which columns are actually needed downstream (especially important in views, CTEs, and application layers). Always specify needed columns explicitly.
In row-oriented storage with covering indexes, including extra columns may be essentially free if they're retrieved together anyway. The optimizer considers this.
Some application patterns (ORMs, generic query builders) make it difficult to know columns upfront. Modern systems handle this, but explicit column lists are always preferable.
Some optimizers are better at projection pushdown than others. Complex views, certain subquery patterns, and database-specific features may limit pushdown effectiveness. Always verify with EXPLAIN/EXPLAIN ANALYZE that expected projections are happening.
Projection pushdown complements selection pushdown to minimize data volume throughout query execution. The key insights: eliminate attributes as early as correctness allows; at every operator, retain the extended attribute set (output attributes plus any predicate, join, grouping, and sort attributes needed above); compute required sets top-down from the root; and expect the largest wins on columnar storage, where unneeded columns are never read at all.
What's Next
We've covered the two pushdown optimizations that reduce data volume. Next, we'll explore join commutativity—the equivalence rule that allows the optimizer to choose which relation appears as the inner vs. outer operand in join implementations. This subtle flexibility enables significant performance improvements.
You now understand projection pushdown—the technique that minimizes tuple width throughout query execution. Combined with selection pushdown, you have a complete picture of how optimizers reduce data volume before expensive operations.