Imagine carrying a 50-pound backpack through a mountain hike, only to discover at the summit that you never used 45 pounds of its contents. You expended enormous energy hauling weight you didn't need. Query execution works similarly—every column carried through the plan consumes memory, I/O bandwidth, and CPU cycles, even if the final result never uses it.
Projection pushdown (also called early projection or column pruning) ensures that only necessary columns flow through the query plan. This optimization reduces memory footprints, accelerates I/O, and enables other optimizations like index-only scans.
By the end of this page, you will understand how column requirements propagate through query plans, the mechanics of pushing projections through different operators, how projection interacts with storage formats and indexes, and practical patterns for writing queries that maximize projection optimization.
In relational algebra, the projection operator π (pi) selects specific columns from a relation:
π[column_list](R) = produce a relation containing only the specified columns of R
For example, π[name, email](R) returns a relation with only the name and email columns of R, discarding all others.
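In SQL terms, this projection corresponds directly to the column list of a SELECT, for example:

```sql
-- SQL equivalent of π[name, email](customers):
SELECT name, email FROM customers;
```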
Consider a query against a table with 50 columns, selecting only 3:
```sql
-- Table: customers
-- 50 columns, average row size: 2KB
-- 1 million rows = 2GB data

SELECT c.id, c.name, c.email    -- Only 3 columns needed (~100 bytes/row)
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.total > 1000;

-- WITHOUT early projection:
-- 1. Scan customers: Read 2GB (all columns)
-- 2. Build hash table for join: Store 2GB in memory
-- 3. Probe with orders: Output 2GB of customer data per match
-- Memory pressure: 2GB+ for hash table

-- WITH early projection (push π[id, name, email] to scan):
-- 1. Scan customers: Read only id, name, email (~100MB)
-- 2. Build hash table: Store 100MB in memory
-- 3. Probe with orders: Output 100MB of customer data
-- Memory pressure: 100MB for hash table

-- 20× reduction in memory usage!
```

Projection pushdown is grounded in relational algebra equivalences, but with more nuances than selection pushdown:
| Equivalence | Description | Conditions |
|---|---|---|
| π[L](R ⋈ S) ≡ π[L](π[L∪J](R) ⋈ π[L∪J](S)) | Push projection past join | L = final columns, J = join columns; keep columns needed for join |
| π[L](σ[p](R)) ≡ π[L](σ[p](π[L∪P](R))) | Push projection past selection | P = columns in predicate; keep columns needed for filter |
| π[L1](π[L2](R)) ≡ π[L1](R) when L1 ⊆ L2 | Eliminate nested projections | Inner projection includes all outer columns |
| π[L](R ∪ S) ≡ π[L](R) ∪ π[L](S) | Push through union | Always valid |
| π[L](R - S) ≡ π[L](R) - π[L](S) | Push through difference | When L contains a key that distinguishes rows |
Unlike selection pushdown, projection pushdown must preserve columns needed by downstream operators. You can't discard a column that a later filter, join, or expression needs. This requires careful analysis of column dependencies throughout the plan.
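As a small sketch of that constraint, here is the pushed-down form of the earlier customers/orders query written by hand; each derived table keeps exactly the columns the operators above it still need (the column lists are illustrative):

```sql
SELECT c.name
FROM (SELECT id, name FROM customers) c          -- id kept only for the join
JOIN (SELECT customer_id, total FROM orders) o   -- total kept only for the filter
  ON c.id = o.customer_id
WHERE o.total > 1000;

-- o.total cannot be pruned at the orders scan even though it never
-- appears in the output: the WHERE predicate still needs it.
-- Only the final projection, after the filter has run, may drop it.
```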
Before pushing projections down, the optimizer must determine which columns are actually required at each point in the plan. This column requirements analysis propagates from the root (what the query returns) to the leaves (what must be read from storage).
The analysis starts from the query's output and works downward. At each operator, the required column set is computed differently:
```
// Compute required columns for each operator
function computeRequiredColumns(node, requiredByParent):
    switch node.type:

        case PROJECTION:
            // Pass through: project's output columns
            requiredByChild = node.projectColumns

        case SELECTION:
            // Add columns used in predicate
            predicateCols = extractColumns(node.predicate)
            requiredByChild = requiredByParent ∪ predicateCols

        case JOIN:
            // Add join key columns
            joinKeyCols = extractColumns(node.joinCondition)
            leftRequired  = (requiredByParent ∩ leftSchema)  ∪ (joinKeyCols ∩ leftSchema)
            rightRequired = (requiredByParent ∩ rightSchema) ∪ (joinKeyCols ∩ rightSchema)

        case AGGREGATE:
            // Need grouping columns and columns in aggregate expressions
            groupCols = node.groupByColumns
            aggInputCols = extractColumns(node.aggregateExpressions)
            requiredByChild = groupCols ∪ aggInputCols

        case UNION / INTERSECT / EXCEPT:
            // All inputs need the same columns
            requiredByChild = requiredByParent   // for each child

        case TABLE_SCAN:
            // Terminal: read only required columns
            node.columnsToRead = requiredByParent

// Example iteration:
// Query: SELECT c.name FROM customers c
//        JOIN orders o ON c.id = o.customer_id
//        WHERE o.total > 100
//
// Result needs: {c.name}
// Join needs:   {c.name} ∪ {c.id, o.customer_id} = {c.name, c.id, o.customer_id}
// Filter adds:  {o.total}
// Final requirements:
//   - customers scan: {id, name}
//   - orders scan:    {customer_id, total}
```

When queries include computed expressions, the analysis must trace through to base columns:
```sql
-- Query with computed expressions
SELECT
    customer_id,
    full_name,        -- Computed: first_name || ' ' || last_name
    total_with_tax    -- Computed: total * 1.1
FROM (
    SELECT
        c.id as customer_id,
        c.first_name || ' ' || c.last_name as full_name,
        o.total * 1.1 as total_with_tax
    FROM customers c
    JOIN orders o ON c.id = o.customer_id
) derived;

-- Column requirements analysis:
-- final output needs: customer_id, full_name, total_with_tax
--
-- Tracing expressions:
-- customer_id    ← c.id
-- full_name      ← c.first_name || ' ' || c.last_name ← {c.first_name, c.last_name}
-- total_with_tax ← o.total * 1.1 ← {o.total}
--
-- Join needs: c.id, o.customer_id (for join condition)
--
-- Final requirements:
-- customers: {id, first_name, last_name}
-- orders: {customer_id, total}
--
-- Columns NOT needed (eliminated):
-- customers: email, phone, address, created_at, status, ...
-- orders: order_date, status, shipping_address, ...
```

SELECT * prevents projection pushdown because all columns from all tables are required. Even if you're debugging, consider specifying columns explicitly. In production code, always list required columns to enable optimization.
Each operator type has specific rules for projection pushdown. Understanding these rules reveals why some queries optimize better than others.
Projection through joins requires keeping join key columns in addition to output columns:
```sql
-- Before optimization
SELECT c.name
FROM customers c              -- 50 columns
JOIN orders o                 -- 30 columns
  ON c.id = o.customer_id
WHERE o.status = 'shipped';

-- Plan without projection pushdown:
-- Scan customers: ALL 50 columns
-- Scan orders: ALL 30 columns
-- Join: 80 columns per matched row
-- Filter: Still 80 columns
-- Final project: Extract name

-- Memory: 80-column rows in join hash table
```

```sql
-- After optimization
SELECT c.name
FROM (SELECT id, name FROM customers) c
JOIN (SELECT customer_id, status FROM orders) o
  ON c.id = o.customer_id
WHERE o.status = 'shipped';

-- Plan WITH projection pushdown:
-- Scan customers: ONLY id, name (2 columns)
-- Scan orders: ONLY customer_id, status (2 columns)
-- Join: 4 columns per matched row
-- Filter: 4 columns
-- Final project: Extract name

-- Memory: 20× less data in join hash table
```

Projection through GROUP BY requires keeping grouping columns and columns needed by aggregate functions:
```sql
-- Query with aggregation
SELECT department, AVG(salary) as avg_salary
FROM employees                 -- Has 40 columns
GROUP BY department;

-- Required columns:
-- Output: department, AVG(salary)
-- GROUP BY: department
-- AVG function: salary
--
-- Minimal scan: {department, salary} - only 2 of 40 columns!

-- More complex aggregation:
SELECT
    region,
    COUNT(*) as order_count,
    SUM(total) as revenue,
    AVG(discount) as avg_discount
FROM orders                    -- Has 30 columns
WHERE status = 'completed'
GROUP BY region;

-- Required columns:
-- GROUP BY: region
-- COUNT(*): no specific column (just row existence)
-- SUM(total): total
-- AVG(discount): discount
-- Filter: status
--
-- Minimal scan: {region, total, discount, status} - 4 of 30 columns!
```

Projection through subqueries follows similar principles but requires analyzing the subquery boundary:
```sql
-- Correlated subquery requiring careful projection
SELECT
    c.name,
    (SELECT MAX(o.order_date)
     FROM orders o
     WHERE o.customer_id = c.id) as last_order
FROM customers c;

-- Outer query needs: c.name, c.id (for correlation)
-- Subquery needs: o.order_date, o.customer_id

-- Optimized reads:
-- customers: {id, name}
-- orders: {customer_id, order_date}

-- Scalar subquery in SELECT requires:
-- - Correlation columns from outer (c.id)
-- - Expression columns (o.order_date for MAX)

-- EXISTS subquery - minimal projection needed:
SELECT c.name
FROM customers c
WHERE EXISTS (
    SELECT 1          -- No columns actually needed in output!
    FROM orders o
    WHERE o.customer_id = c.id
);

-- Subquery only needs: {customer_id} for the correlation check
-- The "SELECT 1" means no other columns are required
```

Set operations require the same columns from all inputs, limiting but not preventing projection:
```sql
-- UNION with projection opportunity
SELECT name, email FROM current_customers    -- 50 columns
UNION ALL
SELECT name, email FROM archived_customers;  -- 50 columns

-- Both branches project down to {name, email}
-- Each table scan reads only required columns

-- Set operation is the projection boundary:
-- Cannot project differently on each side
-- All inputs must produce the same schema

-- UNION DISTINCT adds complexity:
SELECT name FROM table1
UNION               -- Implicit DISTINCT
SELECT name FROM table2;

-- DISTINCT requires comparing all output columns
-- But only 'name' is in output, so that's all we need
-- Projection still applies!
```

Window functions like ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) require columns x and y even if they're not in the final output. The window must be computed before projecting away its input columns. This can prevent aggressive projection pushdown.
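For example, in a sketch like the following (assuming customers has hypothetical region and signup_date columns), the partitioning and ordering columns must survive until the window operator runs, even though the final output discards them:

```sql
-- First customer per region by signup date (illustrative schema)
SELECT name, rn
FROM (
    SELECT name,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY signup_date) AS rn
    FROM customers
) ranked
WHERE rn = 1;

-- Minimal scan of customers: {name, region, signup_date}
-- region and signup_date feed the window function, so they can be
-- projected away only after ROW_NUMBER() has been computed.
```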
The effectiveness of projection pushdown depends heavily on the underlying storage format. Some formats can skip unneeded columns efficiently; others must read entire rows regardless.
Traditional row-store databases (MySQL InnoDB, PostgreSQL heap) store all column values for a row together:
```
Row Store Physical Layout:
┌───────────────────────────────────────────────────────────────┐
│ Page 1                                                        │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Row 1: [id=1][name="Alice"][email="a@x"][addr="..."][...] │ │
│ │ Row 2: [id=2][name="Bob"][email="b@x"][addr="..."][...]   │ │
│ │ Row 3: [id=3][name="Carol"][email="c@x"][addr="..."][...] │ │
│ └───────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘

Query: SELECT name FROM customers;

Physical Reality:
- Must read entire pages from disk
- Each page contains full rows (all columns)
- Projection happens AFTER reading into memory
- I/O savings: minimal (all data read)
- Memory savings: significant (discard columns after read)
- CPU savings: moderate (skip parsing unused columns)
```

Column-store databases (ClickHouse, Amazon Redshift, Apache Parquet) store each column separately:
```
Column Store Physical Layout:
┌──────────────────────────────────────────────────────────┐
│ id_column_file: [1][2][3][4][5][6][7][8][9][10]...       │
│ name_column_file: ["Alice"]["Bob"]["Carol"]["Dave"]...   │
│ email_column_file: ["a@x"]["b@x"]["c@x"]["d@x"]...       │
│ addr_column_file: ["123 Main"]["456 Oak"]["789 Pine"]... │
│ ... 46 more column files ...                              │
└──────────────────────────────────────────────────────────┘

Query: SELECT name FROM customers;

Physical Reality:
- Read ONLY the name_column_file
- Skip all 49 other column files entirely!
- I/O savings: dramatic (read 1/50th of data)
- Memory savings: dramatic (only required columns)
- CPU savings: dramatic (no parsing of other columns)

Additional benefits:
- Column files compress better (similar values together)
- Vectorized processing on homogeneous column data
- Skip entire column files with column-level statistics
```

| Storage Format | I/O Reduction | Memory Reduction | Best For |
|---|---|---|---|
| Row Store (OLTP) | Minimal | Significant | Point lookups, row-level operations |
| Column Store (OLAP) | Dramatic | Dramatic | Analytical queries scanning few columns |
| Hybrid (HTAP) | Moderate | Significant | Mixed workloads |
| Parquet/ORC Files | Dramatic | Dramatic | Data lakes, batch processing |
Projection interacts powerfully with covering indexes. If all required columns are in an index, the table heap need never be accessed:
```sql
-- Covering index example
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date, total);

-- Query requiring only indexed columns:
SELECT customer_id, order_date, total
FROM orders
WHERE customer_id = 12345;

-- Index-only scan possible because:
-- 1. Filter: customer_id (in index)
-- 2. Output: customer_id, order_date, total (all in index)
--
-- Result: Never read the orders table heap!
-- Orders table might have 30 columns × 1KB each
-- Index entry: 3 columns × ~50 bytes
-- 20× less I/O

-- Query that defeats index-only scan:
SELECT customer_id, order_date, total, shipping_address
FROM orders
WHERE customer_id = 12345;

-- shipping_address NOT in index
-- Must fetch from heap (Index + Heap access)
-- Still uses index for filter, but no index-only scan
```

When a query pattern is frequent and reads only a subset of columns, consider a covering index that includes all needed columns. The trade-off is insert/update overhead for dramatic read speedup. This is especially effective for read-heavy OLTP workloads.
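If the second query above is frequent enough to matter, one option (sketched here with PostgreSQL/SQL Server INCLUDE syntax; other systems instead require adding the column as an extra key column) is to widen the covering index so the heap fetch disappears:

```sql
-- Hypothetical wider covering index: non-key columns are stored in the
-- index entries but do not affect ordering.
CREATE INDEX idx_orders_covering_wide
ON orders (customer_id)
INCLUDE (order_date, total, shipping_address);

-- Now this query can also be answered by an index-only scan:
SELECT customer_id, order_date, total, shipping_address
FROM orders
WHERE customer_id = 12345;

-- Trade-off: every INSERT/UPDATE on orders must maintain the larger index.
```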
The impact of projection pushdown grows with table width. Understanding this relationship helps prioritize optimization efforts.
Many real-world tables evolve to have dozens or hundreds of columns:
```sql
-- Typical wide table in enterprise systems
CREATE TABLE customer_profiles (
    id INT PRIMARY KEY,

    -- Core fields (10 columns)
    name VARCHAR(100),
    email VARCHAR(255),
    phone VARCHAR(20),
    ...

    -- Address fields (10 columns)
    street_address VARCHAR(255),
    city VARCHAR(100),
    state VARCHAR(50),
    ...

    -- Compliance fields (15 columns)
    gdpr_consent BOOLEAN,
    marketing_opt_in BOOLEAN,
    data_classification VARCHAR(50),
    ...

    -- Integration fields (20 columns, often large)
    salesforce_data JSON,
    segment_traits JSON,
    hubspot_sync JSON,
    ...

    -- Audit fields (5 columns)
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    created_by VARCHAR(100),
    ...

    -- Total: 60+ columns, average row 5KB
);

-- Common query patterns often need few columns:

-- Pattern 1: Contact lookup
SELECT name, email, phone FROM customer_profiles WHERE id = ?;
-- Needs: 3 columns out of 60

-- Pattern 2: Compliance report
SELECT id, gdpr_consent, marketing_opt_in FROM customer_profiles;
-- Needs: 3 columns out of 60

-- Pattern 3: Integration sync
SELECT id, salesforce_data FROM customer_profiles WHERE updated_at > ?;
-- Needs: 2 columns out of 60

-- Without projection: Every query moves ~5KB/row
-- With projection: Queries move 100-500 bytes/row
-- 10-50× performance improvement on wide tables!
```

Projection is especially critical when tables contain large objects like TEXT, BLOB, or JSON fields:
| Scenario | Without Projection | With Projection | Improvement |
|---|---|---|---|
| Table with 1MB JSON per row | Read 1MB/row | Read ~100 bytes/row | 10,000× |
| BLOB field (documents) | Full blob transfer | Skip blob entirely | Variable, often 100× |
| TEXT field (descriptions) | Read full text | Skip or read partial | 10-1000× |
| Array/nested columns | Deserialize entire array | Skip or extract element | Variable |
```sql
-- Table with large JSON documents
CREATE TABLE documents (
    id INT PRIMARY KEY,
    title VARCHAR(255),
    content JSON,           -- Average 500KB per document
    created_at TIMESTAMP,
    author_id INT
);

-- Query 1: List documents (WRONG way)
SELECT * FROM documents WHERE author_id = 42;
-- Reads 500KB content field for each document
-- 100 documents = 50MB transferred!

-- Query 2: List documents (RIGHT way)
SELECT id, title, created_at FROM documents WHERE author_id = 42;
-- Reads only metadata: ~500 bytes per document
-- 100 documents = 50KB transferred
-- 1000× less data!

-- Pattern for "lazy loading":
-- First: Get list with metadata only
SELECT id, title FROM documents WHERE author_id = 42;

-- Then: Fetch content for specific document
SELECT content FROM documents WHERE id = ?;
```

Many ORMs (Object-Relational Mappers) generate SELECT * by default. Combined with the N+1 query problem, this creates massive inefficiency: fetching all columns for hundreds of objects when only a few columns are needed. Explicitly specify fields in ORM queries to enable projection optimization.
While optimizers handle projection pushdown automatically, developers can write queries and design schemas that maximize optimization opportunities.
- `SELECT a.col AS col FROM a` locks in projection early, making the dependency clear.
- `SELECT * FROM subquery` defeats optimization if the subquery has many columns.
```sql
-- Pattern 1: Vertical Partitioning
-- Instead of one wide table:
-- CREATE TABLE customers (id, name, email, ..., audit_json, analytics_json);

-- Split into core + extension tables:
CREATE TABLE customers_core (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255),
    phone VARCHAR(20)
);

CREATE TABLE customers_audit (
    customer_id INT PRIMARY KEY REFERENCES customers_core(id),
    audit_json JSON,
    change_history JSON
);

-- Queries that don't need audit data skip it entirely

-- Pattern 2: Covering Index
CREATE INDEX idx_orders_summary
ON orders (customer_id, order_date)
INCLUDE (total, status);

-- Index-only scans for common dashboard query:
SELECT order_date, total, status
FROM orders
WHERE customer_id = 123;

-- Pattern 3: Materialized View
CREATE MATERIALIZED VIEW customer_summary AS
SELECT
    c.id,
    c.name,
    COUNT(o.id) as order_count,
    SUM(o.total) as total_spent
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name;

-- Queries hit pre-projected, pre-aggregated data
SELECT name, total_spent FROM customer_summary WHERE id = 42;
```

Use EXPLAIN (BUFFERS) or equivalent to measure actual I/O before and after projection changes. Sometimes the optimizer already optimizes effectively; other times, schema changes provide dramatic improvements. Let data guide optimization efforts.
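As a minimal sketch (PostgreSQL syntax; plan output and buffer counts depend entirely on your data and storage engine), a before/after comparison might look like this:

```sql
-- Baseline: every column, including the large JSON content field
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM documents WHERE author_id = 42;

-- Projected: metadata only
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title, created_at FROM documents WHERE author_id = 42;

-- Compare the "Buffers: shared hit/read" lines and execution times of
-- the two plans to see how much I/O the narrower column list saves.
```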
Projection pushdown—eliminating unnecessary columns early—is a fundamental optimization that reduces memory consumption, I/O overhead, and processing time throughout the query plan.
What's Next:
We've now explored the major heuristic optimizations in depth: selection pushdown and early projection. The final page of this module examines the limitations of heuristic optimization—cases where simple rules fail and cost-based methods become essential.
You now understand projection pushdown comprehensively—its mechanics, operator-specific rules, storage interactions, and practical optimization strategies. Combined with selection pushdown, you have the tools to analyze and improve query performance fundamentally.