If selection pushdown reduces the number of tuples flowing through a query plan, projection pushdown reduces their width. Together, these optimizations minimize the volume of data at every stage of query processing.
Consider a table with 50 columns and millions of rows. If your query only needs 3 columns, why carry the other 47 through joins, sorts, and aggregations? Each unnecessary column consumes memory, cache space, and I/O bandwidth. In column-oriented storage systems, the unneeded columns may never be read from disk at all.
Projection pushdown systematically eliminates unused attributes as early as possible in the query plan. While less dramatic than selection pushdown in terms of row reduction, it provides consistent, multiplicative benefits across all operations by reducing the per-tuple processing cost.
By the end of this page, you will understand: (1) how projection pushdown reduces query execution costs, (2) the rules governing projection through different operators, (3) the critical concept of 'extended projection' for joins, and (4) the interaction between projection pushdown and modern columnar storage.
The Width Problem
In database systems, tuple width directly affects:
Memory consumption: Every intermediate result consumes memory proportional to (rows × width). Wide tuples exhaust memory faster, causing spills to disk.
I/O bandwidth: Whether reading from disk or transferring between operators, wider tuples mean more bytes moved.
Cache efficiency: Modern CPUs rely heavily on cache. Narrower tuples mean more tuples fit in cache, dramatically improving performance.
Network transfer: For distributed databases, tuple width directly affects network costs.
Hash table efficiency: Hash joins and aggregations build in-memory structures. Narrower keys and values mean smaller hash tables.
Quantifying the Impact
Consider a concrete example:
Table: Orders (100 columns total, average 50 bytes per column)
Full row width: ~5,000 bytes
Query needs: order_id, customer_id, total (3 columns, ~150 bytes)
For 1 million rows:
Full rows: 1,000,000 × 5,000 bytes ≈ 5 GB of intermediate data
Projected rows: 1,000,000 × 150 bytes ≈ 150 MB
This reduction applies to every operation: every join, sort, and aggregation processes 33× less data.
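A quick back-of-the-envelope check of these figures in Python (the byte counts are the example's assumptions, not measured values):

```python
rows = 1_000_000
full_width = 5_000        # bytes per row: 100 columns x ~50 bytes each
projected_width = 150     # bytes per row: only the 3 needed columns

full_bytes = rows * full_width            # ~5 GB of intermediate data
projected_bytes = rows * projected_width  # ~150 MB

print(full_bytes // projected_bytes)  # 33 -> roughly 33x less data per operator
```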
| Operation | Width Impact | Why Width Matters |
|---|---|---|
| Table Scan | Proportional | Reading unnecessary columns from disk (esp. row stores) |
| Hash Join Build | Proportional, superlinear if spilling | Hash table memory scales with tuple width; overflow forces disk spills |
| Sort | Proportional | Sorting moves entire tuples; narrower = faster |
| Aggregation | Proportional | Group hashtables store keys; narrower keys help |
| Network Transfer | Proportional | Every byte crosses the network |
| Temp Materialization | Proportional | Spilling to disk amplifies width impact |
Core Objective
Projection pushdown aims to eliminate attributes as early as possible while preserving correctness. The fundamental constraint is:
An attribute can be projected out (removed) as soon as it's no longer needed by any subsequent operation in the query plan.
What Determines "Need"?
An attribute is needed downstream if it appears in the final output list, in a selection or join predicate evaluated above the current node, in grouping keys or aggregate function inputs, in sort keys, or in an expression that computes another needed attribute.
The Extended Attribute Set
For any node in the query tree, we compute its required attributes—the set of attributes that must flow upward from this node. This set includes:
required(node) = output_attrs(node) ∪ predicate_attrs(ancestors)
∪ join_attrs(ancestors) ∪ ...
While selection pushdown works top-down (pushing predicates from the root toward the leaves), projection pushdown reasons in both directions: the attributes each node produces are determined bottom-up from its children, while the attributes actually needed are propagated top-down from the root. Any produced attribute that is not needed can be dropped.
The Basic Rule
π_L(Op(R, ...)) → Op(π_{L∪A}(R), ...)
Where A is the set of additional attributes needed by operator Op that aren't in the final output L.
Example:
SELECT name FROM Employees WHERE dept_id = 5;
Only name (the output) and dept_id (the predicate attribute) are needed; assuming Employees has 50 columns, the other 48 can be projected out immediately at the base table scan.
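The rule can be checked on toy data. Here is a minimal sketch with invented rows, where dicts stand in for tuples and only a handful of the columns are shown:

```python
# Toy check of the basic rule: projecting early (keeping the output column plus
# the predicate attribute) gives the same answer as projecting at the end.
# Table contents and extra columns are invented for the example.
employees = [
    {"name": "Ada", "dept_id": 5, "salary": 120, "office": "B2"},
    {"name": "Lin", "dept_id": 3, "salary": 95,  "office": "A1"},
    {"name": "Sam", "dept_id": 5, "salary": 88,  "office": "C7"},
]

def project(rows, attrs):
    return [{a: r[a] for a in attrs} for r in rows]

# Project late: filter full-width rows, then keep only `name`.
late = project([r for r in employees if r["dept_id"] == 5], ["name"])

# Push down: keep L ∪ attrs(p) = {name, dept_id} at the scan, filter the
# narrow rows, then drop `dept_id`.
narrow = project(employees, ["name", "dept_id"])
early = project([r for r in narrow if r["dept_id"] == 5], ["name"])

print(early == late)  # True: same result, narrower intermediate tuples
```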
Joins are the most impactful location for projection pushdown because they often combine large tables with many columns.
The Extended Projection Rule for Joins
π_L(R ⋈_θ S) ≡ π_L(π_{L_R ∪ A_R}(R) ⋈_θ π_{L_S ∪ A_S}(S))
Where:
L_R, L_S: the attributes of the output list L that come from R and S, respectively
A_R, A_S: the attributes of R and S referenced by the join condition θ
Critical Insight: Before the join, each side must retain both its contribution to the output attributes (L_R or L_S) and its join condition attributes (A_R or A_S).
After the join, we can project down to just the output attributes.
Step-by-Step Example:
SELECT c.name, o.total
FROM Customers c (50 columns)
JOIN Orders o (30 columns) ON c.id = o.customer_id
WHERE o.status = 'completed';
Analysis:
Output attributes: c.name, o.total
Join attributes: c.id (from Customers), o.customer_id (from Orders)
Predicate attributes: o.status
Pushed projections:
Customers: {id, name} (2 of 50 columns)
Orders: {customer_id, total, status} (3 of 30 columns)
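These pushed projection lists can be derived mechanically from the three attribute sets. A small Python sketch, with attributes tagged by their side and schemas abbreviated:

```python
# Deriving the per-side pushed projections for the Customers/Orders query.
# Attributes are (side, name) pairs; "c" = Customers, "o" = Orders.
output_attrs    = {("c", "name"), ("o", "total")}
join_attrs      = {("c", "id"), ("o", "customer_id")}
predicate_attrs = {("o", "status")}

needed = output_attrs | join_attrs | predicate_attrs

customers_cols = {a for (side, a) in needed if side == "c"}
orders_cols    = {a for (side, a) in needed if side == "o"}

print(sorted(customers_cols))  # ['id', 'name']
print(sorted(orders_cols))     # ['customer_id', 'status', 'total']
```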
-- Original query (all columns flow through)
SELECT c.name, o.total
FROM Customers c
JOIN Orders o ON c.id = o.customer_id
WHERE o.status = 'completed';

-- Conceptually equivalent after projection pushdown:
SELECT c.name, o.total
FROM (
  SELECT id, name FROM Customers              -- Only needed columns
) c
JOIN (
  SELECT customer_id, total, status FROM Orders  -- Only needed columns
) o ON c.id = o.customer_id
WHERE o.status = 'completed';

-- In practice, the optimizer generates a plan that reads only needed columns
-- from base tables, never materializing excluded columns. The subqueries above
-- are just a conceptual representation.
Multi-Way Joins
For queries with multiple joins, projection pushdown propagates through the entire join tree:
π_L((R ⋈ S) ⋈ T)
→ π_L(π_{L_RS∪A_RS}(R ⋈ S) ⋈ T)
→ π_L(π_{L_RS∪A_RS}(π_{L_R∪A_R}(R) ⋈ π_{L_S∪A_S}(S)) ⋈ π_{L_T∪A_T}(T))
Each join stage projects out attributes no longer needed by subsequent operations. The key is computing the correct required attribute sets at each level.
For outer joins, be careful: attributes from the null-supplying side are NULL-padded for unmatched tuples of the preserved side. Projection can still be pushed down, but any output column from the null-supplying side must survive the pushdown so that NULLs are generated correctly for unmatched tuples.
Different query operators have different relationships with projection.
π_L(σ_p(R)) ≡ π_L(σ_p(π_{L∪attrs(p)}(R)))
Projection can be pushed through selection, but must include predicate attributes. This is bidirectional: projection can go before or after selection.
Preferred order: Typically, project first (to narrow tuples), then select. The selection runs faster on narrower tuples.
π_L(γ_{G,F}(R)) ≡ π_L(γ_{G,F}(π_{G∪inputs(F)}(R)))
For aggregation, project to only:
The grouping attributes G
The input attributes of the aggregate functions F
Example:
SELECT dept, SUM(salary)
FROM Employees (50 columns)
GROUP BY dept;
Only {dept, salary} need to flow into the aggregation—48 columns can be projected out.
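A toy sketch of the rule, with two invented extra columns standing in for the other 48:

```python
# Only G ∪ inputs(F) = {dept, salary} needs to reach the aggregate; the other
# columns (invented here) are projected out at the scan.
from collections import defaultdict

employees = [
    {"dept": "eng", "salary": 100, "office": "A1", "bio": "..."},
    {"dept": "eng", "salary": 120, "office": "B2", "bio": "..."},
    {"dept": "ops", "salary": 90,  "office": "C3", "bio": "..."},
]

# Pushed projection: keep only the grouping key and the aggregate input.
narrow = [(r["dept"], r["salary"]) for r in employees]

sums = defaultdict(int)
for dept, salary in narrow:
    sums[dept] += salary

print(dict(sums))  # {'eng': 220, 'ops': 90}
```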
π_L(R ∪ S) ≡ π_L(R) ∪ π_L(S)
Projection distributes over union; both operands must be projected to the same schema. Caution: the analogous rewrite is not valid for set difference or intersection in general, because projecting first can merge tuples that differ only on the dropped attributes.
π_L(sort_K(R)) ≡ π_L(sort_K(π_{L∪K}(R)))
Sort needs the ordering keys plus the output columns. Extra columns can be projected out before sorting, making the sort operate on narrower tuples.
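A minimal sketch of the sort rule on invented rows: the wide note column is dropped before sorting, and the sort key is dropped after:

```python
# pi_L(sort_K(R)) == pi_L(sort_K(pi_{L ∪ K}(R))): sort a narrow projection that
# retains the sort key, then project to the output. Columns are invented.
rows = [
    {"order_id": 3, "total": 50, "note": "x" * 100},
    {"order_id": 1, "total": 90, "note": "y" * 100},
    {"order_id": 2, "total": 70, "note": "z" * 100},
]

# Output L = {total}, sort key K = {order_id}: keep L ∪ K before sorting.
narrow = [{"order_id": r["order_id"], "total": r["total"]} for r in rows]
narrow.sort(key=lambda r: r["order_id"])
pushed = [{"total": r["total"]} for r in narrow]

# Reference plan: sort the full-width rows, project at the end.
reference = [{"total": r["total"]} for r in sorted(rows, key=lambda r: r["order_id"])]

print(pushed == reference)  # True, but the sort moved ~100 fewer bytes per tuple
```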
| Operator | Can Push? | Extended Attributes Needed | Benefit |
|---|---|---|---|
| Selection (σ) | Yes | Predicate attributes | Narrower tuples for predicate evaluation |
| Natural Join (⋈) | Yes | Join key attributes | Reduced memory for hash tables, smaller I/O |
| Aggregation (γ) | Yes | Grouping keys + aggregate inputs | Smaller group hashtables |
| Sort | Yes | Sort keys | Faster sorting on narrow tuples |
| Union (∪) | Yes | None extra | Both branches get same projection |
| Distinct | Yes | None extra | Deduplication on fewer columns |
| Limit | Partially | None extra | LIMIT is usually applied late in the plan; projection passes through it unchanged |
The core of projection pushdown is computing, for each node in the query tree, exactly which attributes must be produced. This is the required attributes computation.
The algorithm works in two passes:
Pass 1: Top-Down Requirement Propagation
Starting from the root (final output), propagate requirements downward:
required(root) = output columns
for each node, given required(node):
compute required(children) by adding:
- attributes needed by this node's operation
- minus attributes this node generates
Pass 2: Bottom-Up Projection Insertion
Insert projections at each node to produce only required attributes:
for each node bottom-up:
if produced(node) ⊃ required(node):
insert π_{required(node)} above node (or integrate into node)
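The two passes can be combined into a short runnable sketch. The node representation and helper names here are my own, not any particular optimizer's API:

```python
# Simplified projection pushdown over a dict-based plan tree. Predicates and
# join conditions carry their attribute sets explicitly to keep the sketch short.

def schema(node):
    """Attributes a subtree can produce (bottom-up)."""
    if node["op"] == "table":
        return set(node["columns"])
    if node["op"] == "project":
        return set(node["attrs"])
    if node["op"] == "select":
        return schema(node["child"])
    if node["op"] == "join":
        return schema(node["left"]) | schema(node["right"])
    raise ValueError(node["op"])

def pushdown(node, required):
    """Propagate required attributes top-down; insert projections at the leaves."""
    op = node["op"]
    if op == "project":
        node["child"] = pushdown(node["child"], set(node["attrs"]))
        return node
    if op == "select":
        node["child"] = pushdown(node["child"], required | node["pred_attrs"])
        return node
    if op == "join":
        need = required | node["join_attrs"]
        node["left"] = pushdown(node["left"], need & schema(node["left"]))
        node["right"] = pushdown(node["right"], need & schema(node["right"]))
        return node
    if op == "table":
        if required < set(node["columns"]):  # strictly fewer columns needed
            return {"op": "project", "attrs": sorted(required), "child": node}
        return node
    raise ValueError(op)

# Tiny plan for the running example (extra column names are invented):
# pi_{name,total}( sigma_{status='completed'}( Customers join Orders ) )
plan = {
    "op": "project", "attrs": ["name", "total"],
    "child": {
        "op": "select", "pred_attrs": {"status"},
        "child": {
            "op": "join", "join_attrs": {"id", "customer_id"},
            "left":  {"op": "table", "name": "Customers",
                      "columns": ["id", "name", "address", "phone"]},
            "right": {"op": "table", "name": "Orders",
                      "columns": ["customer_id", "total", "status", "shipping"]},
        },
    },
}

plan = pushdown(plan, {"name", "total"})
join = plan["child"]["child"]
print(join["left"]["attrs"])   # ['id', 'name']
print(join["right"]["attrs"])  # ['customer_id', 'status', 'total']
```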
function computeRequiredAttributes(node, requiredFromAbove):
    // What this node needs from its children depends on the operator

    if node is Project(outputList, child):
        // Project only needs what's in outputList (already a projection)
        childRequired = outputList
        return computeRequiredAttributes(child, childRequired)

    if node is Select(predicate, child):
        // Selection needs predicate attributes plus what's required above
        childRequired = requiredFromAbove ∪ attrs(predicate)
        return computeRequiredAttributes(child, childRequired)

    if node is Join(left, right, joinCond):
        // Partition required attributes to left and right children
        leftRequired  = (requiredFromAbove ∩ available(left))
                      ∪ (attrs(joinCond) ∩ available(left))
        rightRequired = (requiredFromAbove ∩ available(right))
                      ∪ (attrs(joinCond) ∩ available(right))
        computeRequiredAttributes(left, leftRequired)
        computeRequiredAttributes(right, rightRequired)
        // Insert projections if children produce more than required
        if produced(left) ⊃ leftRequired:
            replaceChild(node, left, Project(leftRequired, left))
        if produced(right) ⊃ rightRequired:
            replaceChild(node, right, Project(rightRequired, right))

    if node is Aggregate(groupBy, aggFuncs, child):
        // Need grouping columns plus aggregate inputs
        childRequired = groupBy ∪ aggregateInputs(aggFuncs)
        return computeRequiredAttributes(child, childRequired)

    if node is BaseTable(R):
        // Insert a projection to produce only required columns
        if requiredFromAbove ⊂ schema(R):
            return Project(requiredFromAbove, R)
        return R

Expressions and Computed Columns
When a query includes expressions like a + b AS sum, the required set must include both a and b until the expression is computed:
required = {sum} → after expression computation
required = {a, b} → before expression computation
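A tiny helper illustrating this requirement rewriting (the column names total_ab, a, and b are invented):

```python
# Rewrite a required-attribute set across an expression boundary: above the
# expression the computed name is required; below it, the inputs are.
def required_below_expression(required_above, computed):
    """computed maps each output name to the set of input attributes it reads."""
    below = set(required_above)
    for out, inputs in computed.items():
        if out in below:
            below.discard(out)   # the name doesn't exist below the expression
            below |= inputs      # ...but its inputs must flow up to it
    return below

print(sorted(required_below_expression({"total_ab"}, {"total_ab": {"a", "b"}})))
# ['a', 'b']
```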
Subqueries
Correlated subqueries reference outer columns. These correlating columns must be tracked through the scope hierarchy and included in required sets at appropriate levels.
Projection pushdown becomes dramatically more impactful with columnar storage systems.
Row-oriented storage (traditional RDBMS): entire rows are read from disk regardless of which columns the query needs, so projection pushdown mainly saves memory, cache, and CPU after the scan.
Column-oriented storage (modern analytics systems): each column is stored separately, so projection pushdown translates directly into disk I/O savings; columns that are projected out are simply never read.
The Impact Is Multiplicative
For a query that needs only a few of a 100-column table's columns, columnar storage amplifies every saving:
| Aspect | Row Storage | Column Storage | Improvement Factor |
|---|---|---|---|
| Disk I/O | Minimal savings | Major savings (only read needed columns) | 10–100× |
| Memory | Moderate savings | Major savings | Proportional to column reduction |
| CPU/Decompression | Moderate savings | Major savings (skip column decompression) | 10–50× |
| Cache efficiency | Moderate | Excellent (columns fit in cache) | 2–10× |
| Vectorized processing | Some benefit | Major benefit (full column batches) | 5–20× |
Systems like Snowflake, BigQuery, Redshift, and ClickHouse are columnar. When using these systems, projection pushdown is critical for performance. Always SELECT only the columns you need—never use SELECT * in analytics queries unless absolutely necessary.
In columnar systems, projection and selection pushdown work together powerfully:
Zone maps / min-max statistics: Columnar systems maintain per-column metadata. Pushing selections allows skipping entire column chunks.
Late materialization: Read only filtering columns first, evaluate predicates, then fetch remaining columns only for qualifying rows.
Dictionary encoding: Operate on encoded values without decompression until projection extracts final needed values.
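Late materialization can be sketched with plain Python lists standing in for column chunks (column names are from the running example; the values are invented):

```python
# Late materialization: read only the filtering column first, then fetch the
# projected columns only at qualifying row positions.
status   = ["completed", "pending", "completed", "failed"]
order_id = [101, 102, 103, 104]
total    = [50, 20, 75, 10]

# Pass 1: evaluate the predicate on the single filtering column.
matches = [i for i, s in enumerate(status) if s == "completed"]

# Pass 2: gather only the projected columns, only at matching positions.
result = [(order_id[i], total[i]) for i in matches]

print(result)  # [(101, 50), (103, 75)]
```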
This is why analytical queries on columnar systems can be 10–100× faster than equivalent queries on row stores—and why projection pushdown is essential.
While projection pushdown is generally beneficial, there are nuances to be aware of.
-- Avoid this in production queries:
SELECT * FROM Orders WHERE status = 'pending';
-- Use this instead:
SELECT order_id, customer_id, total FROM Orders WHERE status = 'pending';
SELECT * prevents the optimizer from knowing which columns are actually needed downstream (especially important in views, CTEs, and application layers). Always specify needed columns explicitly.
In row-oriented storage with covering indexes, including extra columns may be essentially free if they're retrieved together anyway. The optimizer considers this.
Some application patterns (ORMs, generic query builders) make it difficult to know columns upfront. Modern systems handle this, but explicit column lists are always preferable.
Some optimizers are better at projection pushdown than others. Complex views, certain subquery patterns, and database-specific features may limit pushdown effectiveness. Always verify with EXPLAIN/EXPLAIN ANALYZE that expected projections are happening.
Projection pushdown complements selection pushdown to minimize data volume throughout query execution. The key insights: eliminate attributes as early as correctness allows; at every operator, retain the extended attribute set (output attributes plus any predicate, join, grouping, and sort attributes needed above); compute required sets top-down from the root; and expect the largest wins on columnar storage, where unneeded columns are never read at all.
What's Next
We've covered the two pushdown optimizations that reduce data volume. Next, we'll explore join commutativity—the equivalence rule that allows the optimizer to choose which relation appears as the inner vs. outer operand in join implementations. This subtle flexibility enables significant performance improvements.
You now understand projection pushdown—the technique that minimizes tuple width throughout query execution. Combined with selection pushdown, you have a complete picture of how optimizers reduce data volume before expensive operations.