Imagine carrying a 50-pound backpack through a mountain hike, only to discover at the summit that you never used 45 pounds of its contents. You expended enormous energy hauling weight you didn't need. Query execution works similarly—every column carried through the plan consumes memory, I/O bandwidth, and CPU cycles, even if the final result never uses it.
Projection pushdown (also called early projection or column pruning) ensures that only necessary columns flow through the query plan. This optimization reduces memory footprints, accelerates I/O, and enables other optimizations like index-only scans.
By the end of this page, you will understand how column requirements propagate through query plans, the mechanics of pushing projections through different operators, how projection interacts with storage formats and indexes, and practical patterns for writing queries that maximize projection optimization.
In relational algebra, the projection operator π (pi) selects specific columns from a relation:
π[column_list](R) = produce a relation containing only the specified columns of R
For example, π[name, email](R) returns a relation with only the name and email columns of R, discarding all others.
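In SQL terms, this projection corresponds directly to the column list of a SELECT, for example:

```sql
-- SQL equivalent of π[name, email](customers):
SELECT name, email FROM customers;
```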
Consider a query against a table with 50 columns, selecting only 3:
```sql
-- Table: customers
-- 50 columns, average row size: 2KB
-- 1 million rows = 2GB data

SELECT c.id, c.name, c.email    -- Only 3 columns needed (~100 bytes/row)
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.total > 1000;

-- WITHOUT early projection:
-- 1. Scan customers: Read 2GB (all columns)
-- 2. Build hash table for join: Store 2GB in memory
-- 3. Probe with orders: Output 2GB of customer data per match
-- Memory pressure: 2GB+ for hash table

-- WITH early projection (push π[id, name, email] to scan):
-- 1. Scan customers: Read only id, name, email (~100MB)
-- 2. Build hash table: Store 100MB in memory
-- 3. Probe with orders: Output 100MB of customer data
-- Memory pressure: 100MB for hash table

-- 20× reduction in memory usage!
```

Projection pushdown is grounded in relational algebra equivalences, but with more nuances than selection pushdown:
| Equivalence | Description | Conditions |
|---|---|---|
| π[L](R ⋈ S) ≡ π[L](π[L∪J](R) ⋈ π[L∪J](S)) | Push projection past join | L = final columns, J = join columns; keep columns needed for join |
| π[L](σ[p](R)) ≡ π[L](σ[p](π[L∪P](R))) | Push projection past selection | P = columns in predicate; keep columns needed for filter |
| π[L1](π[L2](R)) ≡ π[L1](R) when L1 ⊆ L2 | Eliminate nested projections | Inner projection includes all outer columns |
| π[L](R ∪ S) ≡ π[L](R) ∪ π[L](S) | Push through union | Always valid |
| π[L](R - S) ≡ π[L](R) - π[L](S) | Push through difference | When L contains a key that distinguishes rows |
Unlike selection pushdown, projection pushdown must preserve columns needed by downstream operators. You can't discard a column that a later filter, join, or expression needs. This requires careful analysis of column dependencies throughout the plan.
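As a small sketch of that constraint, here is the pushed-down form of the earlier customers/orders query written by hand; each derived table keeps exactly the columns the operators above it still need (the column lists are illustrative):

```sql
SELECT c.name
FROM (SELECT id, name FROM customers) c          -- id kept only for the join
JOIN (SELECT customer_id, total FROM orders) o   -- total kept only for the filter
  ON c.id = o.customer_id
WHERE o.total > 1000;

-- o.total cannot be pruned at the orders scan even though it never
-- appears in the output: the WHERE predicate still needs it.
-- Only the final projection, after the filter has run, may drop it.
```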
Before pushing projections down, the optimizer must determine which columns are actually required at each point in the plan. This column requirements analysis propagates from the root (what the query returns) to the leaves (what must be read from storage).
The analysis starts from the query's output and works downward. At each operator, the required column set is computed differently:
```
// Compute required columns for each operator
function computeRequiredColumns(node, requiredByParent):
    switch node.type:

        case PROJECTION:
            // Pass through: project's output columns
            requiredByChild = node.projectColumns

        case SELECTION:
            // Add columns used in predicate
            predicateCols = extractColumns(node.predicate)
            requiredByChild = requiredByParent ∪ predicateCols

        case JOIN:
            // Add join key columns
            joinKeyCols = extractColumns(node.joinCondition)
            leftRequired  = (requiredByParent ∩ leftSchema)  ∪ (joinKeyCols ∩ leftSchema)
            rightRequired = (requiredByParent ∩ rightSchema) ∪ (joinKeyCols ∩ rightSchema)

        case AGGREGATE:
            // Need grouping columns and columns in aggregate expressions
            groupCols = node.groupByColumns
            aggInputCols = extractColumns(node.aggregateExpressions)
            requiredByChild = groupCols ∪ aggInputCols

        case UNION / INTERSECT / EXCEPT:
            // All inputs need the same columns
            requiredByChild = requiredByParent   // for each child

        case TABLE_SCAN:
            // Terminal: read only required columns
            node.columnsToRead = requiredByParent

// Example iteration:
// Query: SELECT c.name FROM customers c
//        JOIN orders o ON c.id = o.customer_id
//        WHERE o.total > 100
//
// Result needs: {c.name}
// Join needs:   {c.name} ∪ {c.id, o.customer_id} = {c.name, c.id, o.customer_id}
// Filter adds:  {o.total}
// Final requirements:
//   - customers scan: {id, name}
//   - orders scan:    {customer_id, total}
```

When queries include computed expressions, the analysis must trace through to base columns:
```sql
-- Query with computed expressions
SELECT
    customer_id,
    full_name,        -- Computed: first_name || ' ' || last_name
    total_with_tax    -- Computed: total * 1.1
FROM (
    SELECT
        c.id as customer_id,
        c.first_name || ' ' || c.last_name as full_name,
        o.total * 1.1 as total_with_tax
    FROM customers c
    JOIN orders o ON c.id = o.customer_id
) derived;

-- Column requirements analysis:
-- final output needs: customer_id, full_name, total_with_tax
--
-- Tracing expressions:
-- customer_id    ← c.id
-- full_name      ← c.first_name || ' ' || c.last_name ← {c.first_name, c.last_name}
-- total_with_tax ← o.total * 1.1 ← {o.total}
--
-- Join needs: c.id, o.customer_id (for join condition)
--
-- Final requirements:
-- customers: {id, first_name, last_name}
-- orders: {customer_id, total}
--
-- Columns NOT needed (eliminated):
-- customers: email, phone, address, created_at, status, ...
-- orders: order_date, status, shipping_address, ...
```

SELECT * prevents projection pushdown because all columns from all tables are required. Even if you're debugging, consider specifying columns explicitly. In production code, always list required columns to enable optimization.
Each operator type has specific rules for projection pushdown. Understanding these rules reveals why some queries optimize better than others.
Projection through joins requires keeping join key columns in addition to output columns:
```sql
-- Before optimization
SELECT c.name
FROM customers c              -- 50 columns
JOIN orders o                 -- 30 columns
  ON c.id = o.customer_id
WHERE o.status = 'shipped';

-- Plan without projection pushdown:
-- Scan customers: ALL 50 columns
-- Scan orders: ALL 30 columns
-- Join: 80 columns per matched row
-- Filter: Still 80 columns
-- Final project: Extract name

-- Memory: 80-column rows in join hash table
```

```sql
-- After optimization
SELECT c.name
FROM (SELECT id, name FROM customers) c
JOIN (SELECT customer_id, status FROM orders) o
  ON c.id = o.customer_id
WHERE o.status = 'shipped';

-- Plan WITH projection pushdown:
-- Scan customers: ONLY id, name (2 columns)
-- Scan orders: ONLY customer_id, status (2 columns)
-- Join: 4 columns per matched row
-- Filter: 4 columns
-- Final project: Extract name

-- Memory: 20× less data in join hash table
```

Projection through GROUP BY requires keeping grouping columns and columns needed by aggregate functions:
```sql
-- Query with aggregation
SELECT department, AVG(salary) as avg_salary
FROM employees                 -- Has 40 columns
GROUP BY department;

-- Required columns:
-- Output: department, AVG(salary)
-- GROUP BY: department
-- AVG function: salary
--
-- Minimal scan: {department, salary} - only 2 of 40 columns!

-- More complex aggregation:
SELECT
    region,
    COUNT(*) as order_count,
    SUM(total) as revenue,
    AVG(discount) as avg_discount
FROM orders                    -- Has 30 columns
WHERE status = 'completed'
GROUP BY region;

-- Required columns:
-- GROUP BY: region
-- COUNT(*): no specific column (just row existence)
-- SUM(total): total
-- AVG(discount): discount
-- Filter: status
--
-- Minimal scan: {region, total, discount, status} - 4 of 30 columns!
```

Projection through subqueries follows similar principles but requires analyzing the subquery boundary:
```sql
-- Correlated subquery requiring careful projection
SELECT
    c.name,
    (SELECT MAX(o.order_date)
     FROM orders o
     WHERE o.customer_id = c.id) as last_order
FROM customers c;

-- Outer query needs: c.name, c.id (for correlation)
-- Subquery needs: o.order_date, o.customer_id

-- Optimized reads:
-- customers: {id, name}
-- orders: {customer_id, order_date}

-- Scalar subquery in SELECT requires:
-- - Correlation columns from outer (c.id)
-- - Expression columns (o.order_date for MAX)

-- EXISTS subquery - minimal projection needed:
SELECT c.name
FROM customers c
WHERE EXISTS (
    SELECT 1          -- No columns actually needed in output!
    FROM orders o
    WHERE o.customer_id = c.id
);

-- Subquery only needs: {customer_id} for the correlation check
-- The "SELECT 1" means no other columns are required
```

Set operations require the same columns from all inputs, limiting but not preventing projection:
```sql
-- UNION with projection opportunity
SELECT name, email FROM current_customers    -- 50 columns
UNION ALL
SELECT name, email FROM archived_customers;  -- 50 columns

-- Both branches project down to {name, email}
-- Each table scan reads only required columns

-- Set operation is the projection boundary:
-- Cannot project differently on each side
-- All inputs must produce the same schema

-- UNION DISTINCT adds complexity:
SELECT name FROM table1
UNION               -- Implicit DISTINCT
SELECT name FROM table2;

-- DISTINCT requires comparing all output columns
-- But only 'name' is in output, so that's all we need
-- Projection still applies!
```

Window functions like ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) require columns x and y even if they're not in the final output. The window must be computed before projecting away its input columns. This can prevent aggressive projection pushdown.
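For example, in a sketch like the following (assuming customers has hypothetical region and signup_date columns), the partitioning and ordering columns must survive until the window operator runs, even though the final output discards them:

```sql
-- First customer per region by signup date (illustrative schema)
SELECT name, rn
FROM (
    SELECT name,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY signup_date) AS rn
    FROM customers
) ranked
WHERE rn = 1;

-- Minimal scan of customers: {name, region, signup_date}
-- region and signup_date feed the window function, so they can be
-- projected away only after ROW_NUMBER() has been computed.
```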
The effectiveness of projection pushdown depends heavily on the underlying storage format. Some formats can skip unneeded columns efficiently; others must read entire rows regardless.
Traditional row-store databases (MySQL InnoDB, PostgreSQL heap) store all column values for a row together:
```
Row Store Physical Layout:
┌───────────────────────────────────────────────────────────────┐
│ Page 1                                                        │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Row 1: [id=1][name="Alice"][email="a@x"][addr="..."][...] │ │
│ │ Row 2: [id=2][name="Bob"][email="b@x"][addr="..."][...]   │ │
│ │ Row 3: [id=3][name="Carol"][email="c@x"][addr="..."][...] │ │
│ └───────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘

Query: SELECT name FROM customers;

Physical Reality:
- Must read entire pages from disk
- Each page contains full rows (all columns)
- Projection happens AFTER reading into memory
- I/O savings: minimal (all data read)
- Memory savings: significant (discard columns after read)
- CPU savings: moderate (skip parsing unused columns)
```

Column-store databases (ClickHouse, Amazon Redshift, Apache Parquet) store each column separately:
```
Column Store Physical Layout:
┌──────────────────────────────────────────────────────────┐
│ id_column_file: [1][2][3][4][5][6][7][8][9][10]...       │
│ name_column_file: ["Alice"]["Bob"]["Carol"]["Dave"]...   │
│ email_column_file: ["a@x"]["b@x"]["c@x"]["d@x"]...       │
│ addr_column_file: ["123 Main"]["456 Oak"]["789 Pine"]... │
│ ... 46 more column files ...                              │
└──────────────────────────────────────────────────────────┘

Query: SELECT name FROM customers;

Physical Reality:
- Read ONLY the name_column_file
- Skip all 49 other column files entirely!
- I/O savings: dramatic (read 1/50th of data)
- Memory savings: dramatic (only required columns)
- CPU savings: dramatic (no parsing of other columns)

Additional benefits:
- Column files compress better (similar values together)
- Vectorized processing on homogeneous column data
- Skip entire column files with column-level statistics
```

| Storage Format | I/O Reduction | Memory Reduction | Best For |
|---|---|---|---|
| Row Store (OLTP) | Minimal | Significant | Point lookups, row-level operations |
| Column Store (OLAP) | Dramatic | Dramatic | Analytical queries scanning few columns |
| Hybrid (HTAP) | Moderate | Significant | Mixed workloads |
| Parquet/ORC Files | Dramatic | Dramatic | Data lakes, batch processing |
Projection interacts powerfully with covering indexes. If all required columns are in an index, the table heap need never be accessed:
```sql
-- Covering index example
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date, total);

-- Query requiring only indexed columns:
SELECT customer_id, order_date, total
FROM orders
WHERE customer_id = 12345;

-- Index-only scan possible because:
-- 1. Filter: customer_id (in index)
-- 2. Output: customer_id, order_date, total (all in index)
--
-- Result: Never read the orders table heap!
-- Orders table might have 30 columns × 1KB each
-- Index entry: 3 columns × ~50 bytes
-- 20× less I/O

-- Query that defeats index-only scan:
SELECT customer_id, order_date, total, shipping_address
FROM orders
WHERE customer_id = 12345;

-- shipping_address NOT in index
-- Must fetch from heap (Index + Heap access)
-- Still uses index for filter, but no index-only scan
```

When a query pattern is frequent and reads only a subset of columns, consider a covering index that includes all needed columns. The trade-off is insert/update overhead for dramatic read speedup. This is especially effective for read-heavy OLTP workloads.
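If the second query above is frequent enough to matter, one option (sketched here with PostgreSQL/SQL Server INCLUDE syntax; other systems instead require adding the column as an extra key column) is to widen the covering index so the heap fetch disappears:

```sql
-- Hypothetical wider covering index: non-key columns are stored in the
-- index entries but do not affect ordering.
CREATE INDEX idx_orders_covering_wide
ON orders (customer_id)
INCLUDE (order_date, total, shipping_address);

-- Now this query can also be answered by an index-only scan:
SELECT customer_id, order_date, total, shipping_address
FROM orders
WHERE customer_id = 12345;

-- Trade-off: every INSERT/UPDATE on orders must maintain the larger index.
```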
The impact of projection pushdown grows with table width. Understanding this relationship helps prioritize optimization efforts.
Many real-world tables evolve to have dozens or hundreds of columns:
```sql
-- Typical wide table in enterprise systems
CREATE TABLE customer_profiles (
    id INT PRIMARY KEY,

    -- Core fields (10 columns)
    name VARCHAR(100),
    email VARCHAR(255),
    phone VARCHAR(20),
    ...

    -- Address fields (10 columns)
    street_address VARCHAR(255),
    city VARCHAR(100),
    state VARCHAR(50),
    ...

    -- Compliance fields (15 columns)
    gdpr_consent BOOLEAN,
    marketing_opt_in BOOLEAN,
    data_classification VARCHAR(50),
    ...

    -- Integration fields (20 columns, often large)
    salesforce_data JSON,
    segment_traits JSON,
    hubspot_sync JSON,
    ...

    -- Audit fields (5 columns)
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    created_by VARCHAR(100),
    ...

    -- Total: 60+ columns, average row 5KB
);

-- Common query patterns often need few columns:

-- Pattern 1: Contact lookup
SELECT name, email, phone FROM customer_profiles WHERE id = ?;
-- Needs: 3 columns out of 60

-- Pattern 2: Compliance report
SELECT id, gdpr_consent, marketing_opt_in FROM customer_profiles;
-- Needs: 3 columns out of 60

-- Pattern 3: Integration sync
SELECT id, salesforce_data FROM customer_profiles WHERE updated_at > ?;
-- Needs: 2 columns out of 60

-- Without projection: Every query moves ~5KB/row
-- With projection: Queries move 100-500 bytes/row
-- 10-50× performance improvement on wide tables!
```

Projection is especially critical when tables contain large objects like TEXT, BLOB, or JSON fields:
| Scenario | Without Projection | With Projection | Improvement |
|---|---|---|---|
| Table with 1MB JSON per row | Read 1MB/row | Read ~100 bytes/row | 10,000× |
| BLOB field (documents) | Full blob transfer | Skip blob entirely | Variable, often 100× |
| TEXT field (descriptions) | Read full text | Skip or read partial | 10-1000× |
| Array/nested columns | Deserialize entire array | Skip or extract element | Variable |
```sql
-- Table with large JSON documents
CREATE TABLE documents (
    id INT PRIMARY KEY,
    title VARCHAR(255),
    content JSON,           -- Average 500KB per document
    created_at TIMESTAMP,
    author_id INT
);

-- Query 1: List documents (WRONG way)
SELECT * FROM documents WHERE author_id = 42;
-- Reads 500KB content field for each document
-- 100 documents = 50MB transferred!

-- Query 2: List documents (RIGHT way)
SELECT id, title, created_at FROM documents WHERE author_id = 42;
-- Reads only metadata: ~500 bytes per document
-- 100 documents = 50KB transferred
-- 1000× less data!

-- Pattern for "lazy loading":
-- First: Get list with metadata only
SELECT id, title FROM documents WHERE author_id = 42;

-- Then: Fetch content for specific document
SELECT content FROM documents WHERE id = ?;
```

Many ORMs (Object-Relational Mappers) generate SELECT * by default. Combined with the N+1 query problem, this creates massive inefficiency: fetching all columns for hundreds of objects when only a few columns are needed. Explicitly specify fields in ORM queries to enable projection optimization.
While optimizers handle projection pushdown automatically, developers can write queries and design schemas that maximize optimization opportunities.
- `SELECT a.col AS col FROM a` locks in projection early, making the dependency clear.
- `SELECT * FROM subquery` defeats optimization if the subquery has many columns.
```sql
-- Pattern 1: Vertical Partitioning
-- Instead of one wide table:
-- CREATE TABLE customers (id, name, email, ..., audit_json, analytics_json);

-- Split into core + extension tables:
CREATE TABLE customers_core (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255),
    phone VARCHAR(20)
);

CREATE TABLE customers_audit (
    customer_id INT PRIMARY KEY REFERENCES customers_core(id),
    audit_json JSON,
    change_history JSON
);

-- Queries that don't need audit data skip it entirely

-- Pattern 2: Covering Index
CREATE INDEX idx_orders_summary
ON orders (customer_id, order_date)
INCLUDE (total, status);

-- Index-only scans for common dashboard query:
SELECT order_date, total, status
FROM orders
WHERE customer_id = 123;

-- Pattern 3: Materialized View
CREATE MATERIALIZED VIEW customer_summary AS
SELECT
    c.id,
    c.name,
    COUNT(o.id) as order_count,
    SUM(o.total) as total_spent
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name;

-- Queries hit pre-projected, pre-aggregated data
SELECT name, total_spent FROM customer_summary WHERE id = 42;
```

Use EXPLAIN (BUFFERS) or equivalent to measure actual I/O before and after projection changes. Sometimes the optimizer already optimizes effectively; other times, schema changes provide dramatic improvements. Let data guide optimization efforts.
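As a minimal sketch (PostgreSQL syntax; plan output and buffer counts depend entirely on your data and storage engine), a before/after comparison might look like this:

```sql
-- Baseline: every column, including the large JSON content field
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM documents WHERE author_id = 42;

-- Projected: metadata only
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title, created_at FROM documents WHERE author_id = 42;

-- Compare the "Buffers: shared hit/read" lines and execution times of
-- the two plans to see how much I/O the narrower column list saves.
```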
Projection pushdown—eliminating unnecessary columns early—is a fundamental optimization that reduces memory consumption, I/O overhead, and processing time throughout the query plan.
What's Next:
We've now explored the major heuristic optimizations in depth: selection pushdown and early projection. The final page of this module examines the limitations of heuristic optimization—cases where simple rules fail and cost-based methods become essential.
You now understand projection pushdown comprehensively—its mechanics, operator-specific rules, storage interactions, and practical optimization strategies. Combined with selection pushdown, you have the tools to analyze and improve query performance fundamentally.